Hi Hadoop!

I have heard of Hadoop many times but never got deep into it. Now I have a chance, so I started an experiment in Microsoft Azure.

First, you can find a number of Hadoop packages in Azure Marketplace. I chose Hortonworks Sandbox with HDP 2.5.  I tried Hadoop by Bitnami as well, but its usability is a bit tricky: I couldn’t find a way to make Bitnami work without creating a number of accounts and exposing more of my own information. I may try it again later (and enable boot diagnostics to find the password in the log when the image starts for the first time) when I have time. For now, I am sticking with Hortonworks.


Then just follow the standard Azure procedure –

  • Basics: fill in the VM name, username, SSH key or password, subscription, resource group, location, etc.
  • Size: choose the size of the VM.
  • Settings: choose the storage, network, etc. I suggest leaving boot diagnostics enabled.
  • Summary and Buy: confirm the settings and purchase.

Notice that the price page warns about charges in addition to the Azure VM itself. Since the HDP Sandbox showed just 0.0000 CAD/hr, I don’t think you need to worry too much about it. BTW, Bitnami’s Hadoop is also free, as explicitly mentioned.


Wait a few minutes until the deployment succeeds. You can then check the status of your new Hadoop VM.  Hortonworks suggests that you make the public IP static. You can find more detailed information on its tutorial page.

The next step is to configure your SSH client. I am using PuTTY on Windows, so it takes more mouse clicks than the config example given in the tutorial.  Basically, these settings forward various ports on localhost to the remote VM in the Azure cloud through the SSH tunnel you set up here.


Here is how to configure PuTTY:

  • Fill in the public IP of your Hadoop VM
  • Expand Connection – SSH
  • Click Tunnels
  • Fill in the source port and destination, then click the Add button


According to the tutorial, you need to add 8 forwarded ports.


In PuTTY, you add them one by one; the list should eventually look like this (scroll up and down to see all 8 lines/ports).
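For anyone using a command-line OpenSSH client instead of PuTTY, the same kind of tunnel list can be written as repeated -L options. Here is a minimal Python sketch that only builds such a command. The user name, IP address and port numbers are placeholders I made up for illustration, not the eight ports from the tutorial, and I am assuming each local port is forwarded to the same port on the VM itself.

```python
from typing import List


def build_tunnel_command(user: str, host: str, ports: List[int]) -> List[str]:
    """Build an OpenSSH command that forwards each local port to the same port on the VM."""
    command = ["ssh"]
    for port in ports:
        # -L local_port:destination_host:destination_port plays the same role
        # as PuTTY's "Source port" and "Destination" fields.
        command += ["-L", f"{port}:localhost:{port}"]
    command.append(f"{user}@{host}")
    return command


if __name__ == "__main__":
    # Placeholder values: substitute the ports listed in the tutorial
    # and the public IP of your own VM.
    example_ports = [8080, 8888]
    print(" ".join(build_tunnel_command("azureuser", "203.0.113.10", example_ports)))
```

Keeping the ports in one list makes it easy to check that all of them are covered, which is the part that is tedious to do by hand in the PuTTY dialog.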


You can then go back to the Session page, enter a name under “Saved Sessions” and save the configuration. Next time, you only need to load it from there.

One catch is that the VM needs some time to start and become stable. My first few login attempts failed; only after 20 or 30 minutes could I finally log in, so be patient. After logging in, you should be able to see the following directories.
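Instead of retrying the login by hand, you could let a small script poll the SSH port and tell you when the VM starts answering. This is only a sketch: the function names, the 30-second delay and the number of attempts are my own choices, and an open port 22 does not guarantee that the sandbox has finished starting, only that it is worth trying to log in.

```python
import socket
import time
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ssh(host: str, port: int = 22, attempts: int = 60, delay: float = 30.0,
                 probe: Callable[[str, int], bool] = port_is_open,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe the port up to `attempts` times; return True as soon as it answers."""
    for attempt in range(attempts):
        if probe(host, port):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next try
    return False


# Example call (placeholder IP address, replace with your VM's public IP):
# wait_for_ssh("203.0.113.10")
```

With the defaults, the script gives up after roughly half an hour, which matches how long my VM needed.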


Then, according to the tutorial, keep the SSH session active and use a browser to visit this page on your VM.


Click the left icon and you will see the dashboard.


Click the right icon and you can read about more advanced topics, including the default username and password and how to change them.

That is the first step into Hadoop.


Yahoo is so irresponsible – Yahoo China to end email services

Related news: Yahoo China to end email services – China News – SINA English.

Yahoo China detected my @yahoo.com account and showed me the reminder every time I logged in, but the page only mentioned that @yahoo.com.cn and @yahoo.cn will be stopped.



My email is @yahoo.com, so what happens to it?  The problem is that my Yahoo email was originally registered in China.  Here is another page that says the same thing and offers no solution.



I also see the “&source=alibaba&cnNoRedirect=1#mail” string in the address bar every time I access my Yahoo email.  There is no choice.  I have transferred most of my contacts to another service.

It is time to leave Yahoo! for good.

Power searching with Google – course notes

Class 1

  • Search/filter image results by colour (overall/background) and visual property (style, similarity).
  • Boldface words in results are terms Google associates with the query (associations, near-synonyms).
  • Pages rank higher in results when they contain the words or synonyms, have them in the title or URL, or are linked from high-quality pages.
  • Word order matters.  Capitalization does not matter.
  • Most special characters (¶, £, €, ©, ®, ÷, §, %, (), @, ?, !) are ignored in the query, except +, #, $, etc.
  • Find words in the page with Ctrl-F or Cmd-F; this is not a search.

Class 2

  • Upper-right-hand side panel for a well-known search entity; Search-as-you-type; Related searches at the bottom of the page.
  • Search “define keyword” for dictionary definition and translation; Search Tools in the left panel.
  • SERP – Search Engine Results Page; rollover preview; title/URL/snippet(abstract)/deeper-links.

Class 3

  • Operator site:domain, including top level domain where the dot can be omitted.
  • Operator filetype:pdf, doc, docx, ppt, txt, csv, etc.
  • Minus (-) operator to exclude certain keywords.
  • OR operator and double-quote.
  • Operator “intext:”.
  • Advanced search – gear button.
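To remind myself how the Class 3 operators combine, here is a small Python sketch that assembles a query string from them. The function and parameter names are my own invention, and the example query is made up; it only builds text to paste into the search box and does not call any Google API.

```python
from typing import Iterable, Optional


def build_query(terms: Iterable[str],
                exact_phrase: Optional[str] = None,
                any_of: Iterable[str] = (),
                exclude: Iterable[str] = (),
                site: Optional[str] = None,
                filetype: Optional[str] = None) -> str:
    """Combine plain terms with the quote, OR, minus, site: and filetype: operators."""
    parts = list(terms)
    if exact_phrase:
        parts.append(f'"{exact_phrase}"')      # double quotes: exact phrase
    alternatives = list(any_of)
    if alternatives:
        parts.append(" OR ".join(alternatives))  # OR must be upper case
    parts += [f"-{word}" for word in exclude]  # minus: exclude a keyword
    if site:
        parts.append(f"site:{site}")           # a domain, or a top-level domain such as edu
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["hadoop"], exact_phrase="getting started",
                      any_of=["tutorial", "guide"], exclude=["bitnami"],
                      site="edu", filetype="pdf"))
```

Since word order matters (Class 1), the sketch keeps the plain terms in the order given and appends the operators after them.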

Class 4

  • Search by image: drag-and-drop local file to image search.
  • Search features – describe your query – geo, measure, time, flight, weather, movie showtime, etc.
  • Conversions [number unit1 in unit2], also currency
  • Calculation
  • Left hand panel “show search tool” – date range limiting, custom range
  • Translation  – left hand panel

Class 5

  • Credibility … use time search
  • Use Books
  • Use WHOIS

[to be continued …]

Why do I see failed connections to sip*.example.com on the proxy or firewall?

From time to time, we saw attempts on the proxy or firewall to reach the following destinations:


Because the domain example.com is reserved in RFC 2606, those hosts don’t actually exist, so all the attempts failed.  Considering the number of users in the network, how many resources are wasted on this kind of nonsense traffic?  If a proxy is configured, the client periodically sends requests to the proxy, which then needs to authenticate and process each request.  If the user has a direct connection, DNS has to resolve this non-existent hostname every few minutes.  Imagine 1000 users in the same situation in your network.
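To spot this kind of traffic in a list of destination hostnames, a simple check against the names reserved by RFC 2606 is enough. Below is a minimal Python sketch; the function name and the sample hostnames are mine, chosen only for illustration, and you would have to extract the hostnames from your own proxy or firewall log format first.

```python
# Names reserved by RFC 2606: four top-level domains and three
# second-level example domains.
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_SECOND_LEVEL = {"example.com", "example.net", "example.org"}


def is_rfc2606_reserved(hostname: str) -> bool:
    """Return True if the hostname is, or falls under, an RFC 2606 reserved name."""
    labels = hostname.strip().rstrip(".").lower().split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    # Compare the last two labels, so sip.example.com matches example.com
    # but notexample.com does not.
    return ".".join(labels[-2:]) in RESERVED_SECOND_LEVEL


if __name__ == "__main__":
    # Illustrative hostnames only, not taken from a real log.
    for name in ["sip.example.com", "www.microsoft.com", "notexample.com"]:
        print(name, is_rfc2606_reserved(name))
```

Any hit is worth a look, because a correctly configured client has no reason to contact a reserved name.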

Here is the log shown on the proxy server for one client; the attempts repeat every few minutes:

The question then became where the traffic came from and how to stop it.

SIP is the keyword; the traffic must come from an instant messaging client.  On the client machine, we found that Office Communicator was installed but had never been configured.  The Sign-in address (URI) was still the default someone@example.com, and somehow the client kept trying to connect to all three hosts mentioned at the beginning of this article.

I searched the Internet for similar complaints. Most of the pages that mention all three hostnames are official Microsoft documents: OCS Deployment Guides and the Communicator Testing Guide.  In those documents, the domain example.com is used as a real example.

Only one thread in the Microsoft online community discussed the issue.  worb68_ocs brought up the same concern that I have.  The only answer that came close to the root cause was from Turgay Ongun at Microsoft:

When you install the Communicator client and run it as the very first time, the textbox where you enter your SIP address is someone@example.com

If by mistake, any user click the sign in button without entering his/her SIP address, then the communicator tries to find the edge server for example.com for someone@example.com SIP address.

None of the other answers was quite straightforward.

So the next step is to either remove Office Communicator if the user is not using it, configure it with a correct sign-in address, or disable automatic login if it’s not in use all the time.

Microsoft should also do something about this in the next release of Office Communicator or the Lync client: leave the Sign-in address blank, or use a wizard to lead the user to enter a meaningful address/URI, instead of just dropping in an example like someone@example.com.

Here is the thread address:

Some extended reading: