Archive for July, 2013

Mahout k-means Example

July 27, 2013

Here’s the previous example on Logistic Regression using Mahout.

Here is my recent tryout of Mahout k-means. There are some key points I think are necessary to clarify first. The Mahout k-means tooling and examples are mainly oriented toward text processing; if you need to process numerical data, you need to write some utility functions that write the numerical data into sequence-vector format. In the standard “Reuters” example, the first few Mahout steps are actually data preprocessing.

To be explicit, for the Reuters example, the originally downloaded files are in SGML format, which is similar to XML. So we first need to parse (preprocess) those files into document ids and document text. After that we can convert the files into SequenceFiles. A SequenceFile is a key-value format: the key is the document id and the value is the document content. This step is done with ‘seqdirectory’. Then ‘seq2sparse’ applies tf-idf weighting to convert the id-text data into vectors (Vector Space Model, VSM).
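To make the id-text to vector step more concrete, here is a minimal Python sketch of tf-idf weighting over (document id, text) pairs. It is only an illustration under simple assumptions: whitespace tokenization and the textbook weight tf * log(N / df). It is not the code or the exact formula that ‘seq2sparse’ uses, and the function name tfidf_vectors and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn {doc_id: text} into sparse tf-idf vectors.

    Returns (dictionary, vectors), where dictionary maps term -> index and
    vectors maps doc_id -> {term index: weight}.
    Assumption: weight = tf * log(N / df), a textbook variant.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    # Dictionary: every term gets a fixed integer index.
    dictionary = {term: i for i, term in enumerate(sorted(df))}

    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency inside this document
        vectors[doc_id] = {
            dictionary[term]: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return dictionary, vectors


# Hypothetical toy corpus: key is the document id, value is the text.
docs = {
    "doc1": "oil prices rise",
    "doc2": "oil exports fall",
    "doc3": "wheat prices fall",
}
dictionary, vectors = tfidf_vectors(docs)
for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

The dictionary here plays the same role as a term-to-index dictionary file, and each sparse vector plays the role of one entry in a tf-idf vector output.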

For the first preprocessing job, a much quicker way is to reuse the Reuters parser provided in the Lucene benchmark JAR file.
Because it is bundled with Mahout, all you need to do is change to the examples/ directory under the Mahout source tree and run the org.apache.lucene.benchmark.utils.ExtractReuters class. For details, see chapter 8 of the book Mahout in Action.

The generated vectors dir should contain the following items:

  • reuters-vectors/df-count
  • reuters-vectors/dictionary.file-0
  • reuters-vectors/frequency.file-0
  • reuters-vectors/tf-vectors
  • reuters-vectors/tfidf-vectors
  • reuters-vectors/tokenized-documents
  • reuters-vectors/wordcount

We will then use tfidf-vectors to run k-means. You can give a ‘fake’ (not yet existing) initial centers path: when the -k argument is given, Mahout will randomly select k input vectors as the initial cluster centers.

mahout-0.5-cdh3u5:$./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
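As a rough picture of what happens when k is given, here is a minimal Python sketch of k-means with randomly selected initial centers. It is not Mahout’s implementation: it assumes small dense points held in memory, plain Euclidean distance, and a simple stopping rule, and the names kmeans, max_iter and delta are chosen for this example only (they loosely mirror the maximum iterations and convergence threshold in the command above).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=10, delta=0.1, seed=0):
    """Cluster points (a list of tuples) into k groups.

    Initial centers are k distinct input points picked at random.
    Stops after max_iter rounds or once no center moves more than delta.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random seeding from the input points

    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)

        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster

        # Largest distance any center moved in this round.
        moved = max(squared_distance(c, n) ** 0.5 for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:
            break

    return centers, clusters


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

Because the starting centers are random, different seeds can lead to different clusterings on real data; on this toy data the two groups are far enough apart that the result is stable.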

The clustering results of the Mahout run look like this:

Categories: Hadoop, Mahout

SIGGRAPH 2013 Trailer

July 20, 2013

SIGGRAPH 2013 Technical Trailer:

SIGGRAPH 2013 Play List

Dynamic Hair manipulation

Learnt from a not-job duty

July 19, 2013

So different from my daily job, I needed to have a 70-minute video transcribed, a duty outside my job. I decided to post the job on an online ‘service market’ for freelancers. Because of the noise and the accents in the video, the task was relatively hard. I have always respected workers of every kind. I believe that with gratitude, humility and respect, the world can be a better place. This time, I was deeply touched by the few workers I met, who are so hardworking and willing to take on any challenge, no matter how simple and basic the job is (I only say simple/basic in comparison to today’s so-called high-tech stuff). I could see their effort and feel their dedication, as they did their best, using their skills and the right attitude to change their lives and make a better life. I would like to quote what Maynard Webb said during his visit back to eBay:

“You can be anything you want to be in the world. There’s more in all of us.”

“Whatever it is you want to be, go for it and have fun while you’re doing it. Be driven and motivated. We all have a purpose and the more you can focus on your purpose, the more impact will be.”

Categories: MISC, Myself

Let data speak of itself

July 5, 2013

A very data-driven, truly data-centric talk on “big visual data” by Alexei A. Efros.


The “Type” you can create

July 2, 2013

This is a simple but very cool idea that lets you create your own personalized ‘art’ by simply replacing the texture with some shape. Nowadays, personalized, game-like applications are very attractive. Applications like this, if extended to let people collaborate with each other, might be even more fun. Each piece of the final work could span various dimensions: time, subject, color, place, people, etc.

Categories: MISC