Archive for the ‘Mahout’ Category

Mahout k-means Example

July 27, 2013

Here’s the previous example, on logistic regression with Mahout.

Here is my recent tryout of Mahout k-means. A few key points are worth clarifying first. Mahout's k-means tooling is mainly geared toward text processing; if you need to process numerical data, you have to write some utility functions that write the numerical data into the sequence-file vector format. In the standard “Reuters” example, the first few Mahout steps are actually data preprocessing.

To be explicit, in the Reuters example the originally downloaded files are in SGML format, which is similar to XML. So we first need to parse (preprocess) those files into document ids and document text. After that we can convert the files into SequenceFiles, a key-value format in which the key is the document id and the value is the document content. This step is done with ‘seqdirectory’. Then ‘seq2sparse’ applies TF-IDF weighting to convert the id-text data into vectors (Vector Space Model, VSM).
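To make the id-text to vector step concrete, here is a minimal Python sketch of TF-IDF weighting. It assumes the textbook formula tf * log(N / df) and a whitespace tokenizer; Mahout's seq2sparse uses its own analyzer and weighting, so the numbers will not match its output. The function name `tfidf_vectors` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map {doc_id: text} to {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    vectors = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # raw term counts within this document
        vectors[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return vectors

# Hypothetical documents keyed by document id, like the SequenceFile key-value pairs.
docs = {
    "doc-1": "oil prices rise",
    "doc-2": "oil exports fall",
    "doc-3": "wheat prices fall",
}
vectors = tfidf_vectors(docs)
print(vectors["doc-1"])
```

Terms that occur in fewer documents get a larger weight: in this toy corpus "rise" appears in one document and "oil" in two, so "rise" ends up weighted higher in doc-1.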

For the first preprocessing job, a much quicker way is to reuse the Reuters parser provided in the Lucene benchmark JAR file. Because it is bundled with Mahout, all you need to do is change to the examples/ directory under the Mahout source tree and run the org.apache.lucene.benchmark.utils.ExtractReuters class. For details, see chapter 8 of the book Mahout in Action (http://manning.com/owen/MiA_SampleCh08.pdf).

The generated vectors directory should contain the following items:

  • reuters-vectors/df-count
  • reuters-vectors/dictionary.file-0
  • reuters-vectors/frequency.file-0
  • reuters-vectors/tf-vectors
  • reuters-vectors/tfidf-vectors
  • reuters-vectors/tokenized-documents
  • reuters-vectors/wordcount

We will then run k-means on tfidf-vectors. You can pass a placeholder (‘fake’) initial-centers path with -c: when -k is given, Mahout randomly selects k vectors from the input as the initial centers and writes them to that path.

mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
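For intuition about what the command does, here is a small single-machine sketch of k-means with randomly chosen initial centers. It is plain Lloyd's algorithm with Euclidean distance, not Mahout's MapReduce implementation. The parameters `k`, `max_iter`, and `delta` loosely mirror -k, -x, and the 0.1 convergence threshold; the function names and the toy points are my own assumptions.

```python
import math
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, max_iter=10, delta=0.1, seed=0):
    rng = random.Random(seed)
    # Random initial centers drawn from the input points.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        moved = max(math.sqrt(squared_distance(a, b)) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:  # stop once no center moved more than delta
            break
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical 2-D points forming two loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Because the starting centers are random, different seeds can give different clusterings on harder data, which is one reason to experiment with k and the iteration limit.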

The clustering results will look like this

Categories: Hadoop, Mahout

Mahout Logistic Regression

November 2, 2011

classifier.sgd

# Check the help output

$ mahout org.apache.mahout.classifier.sgd.TrainLogistic --help
$ mahout org.apache.mahout.classifier.sgd.RunLogistic --help

# Example of Training

# Train the model. The model is stored in donut.model, which is a JSON file; to read it more easily, try http://jsonviewer.stack.hu/

$ mahout org.apache.mahout.classifier.sgd.TrainLogistic \
--passes 100 \
--rate 50 --lambda 0.001 \
--input /mahout_examples/donut.csv \
--features 21 \
--output /mahout_examples/donut.model \
--target color \
--categories 2 \
--predictors x y xx xy yy a b c --types n n

You should then see output like this in the terminal:
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
No HADOOP_CONF_DIR set, using /usr/lib/hadoop/conf
11/11/01 17:58:21 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 5.048*Intercept Term + 3.747*x + 4.530*y + -3.986*xx + 2.191*xy + -4.723*yy + 0.562*a + -0.580*b + -22.188*c
Intercept Term 5.04769
a 0.56192
b -0.57986
c -22.18806
x 3.74697
xx -3.98555
xy 2.19129
y 4.52954
yy -4.72268
-3.985546155 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -0.579859597 4.529541855 0.000000000 0.000000000 0.000000000 -4.722678608 5.047685107 0.000000000 0.000000000 2.191286892 0.561916360 -22.188056574 0.000000000 0.000000000 3.746971894
11/11/01 17:58:22 INFO driver.MahoutDriver: Program took 858 ms
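To read the fitted model line in the output above, it helps to see how such coefficients turn a row into a score. The sketch below assumes the usual logistic link, score = 1 / (1 + exp(-z)) with z the weighted sum of the predictors plus the intercept. The coefficients are copied from the output; the input row is made up, and I have not checked the result against RunLogistic.

```python
import math

# Coefficients copied from the TrainLogistic output above.
INTERCEPT = 5.04769
WEIGHTS = {
    "a": 0.56192, "b": -0.57986, "c": -22.18806,
    "x": 3.74697, "xx": -3.98555, "xy": 2.19129,
    "y": 4.52954, "yy": -4.72268,
}

def score(row):
    """Logistic score for one row, given as a {predictor name: value} dict."""
    z = INTERCEPT + sum(weight * row[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical row: these values are invented, not taken from donut.csv.
row = {"x": 0.5, "y": 0.5, "xx": 0.25, "xy": 0.25, "yy": 0.25,
       "a": 0.1, "b": 0.2, "c": 0.3}
print(round(score(row), 3))
```

The large negative weight on c stands out: in this model, increasing c pushes the score toward 0 much faster than any other predictor.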

# To test the model
/usr/lib/mahout/bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --help

$ mahout org.apache.mahout.classifier.sgd.RunLogistic \
--input /mahout_examples/donut-test.csv \
--model /mahout_examples/donut.model --auc \
--scores --confusion

Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop

No HADOOP_CONF_DIR set, using /usr/lib/hadoop/src/conf
11/11/02 10:35:27 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.RunLogistic.props found on classpath, will use command-line arguments only
"target","model-output","log-likelihood"
0,0.004,-0.003696
0,0.003,-0.002722
1,0.959,-0.042384
1,0.977,-0.023617
0,0.000,-0.000166
1,0.922,-0.081457
1,0.678,-0.388569
0,0.160,-0.174764
0,0.019,-0.019335
0,0.740,-1.348002
0,0.040,-0.040603
1,0.873,-0.135365
1,0.106,-2.242013
1,0.933,-0.069273
1,0.997,-0.003449
0,0.106,-0.112158
1,0.971,-0.029869
0,0.001,-0.001182
1,0.898,-0.107512
0,0.000,-0.000007
0,0.103,-0.108486
0,0.033,-0.034022
0,0.003,-0.003357
0,0.722,-1.281526
0,0.002,-0.002285
1,0.997,-0.002749
1,0.968,-0.032817
0,0.013,-0.013217
0,0.458,-0.613088
0,0.020,-0.019809
0,0.563,-0.827950
0,0.178,-0.195591
0,0.340,-0.416144
0,0.043,-0.043604
0,0.020,-0.020153
0,0.088,-0.091683
1,0.649,-0.432606
0,0.832,-1.786718
0,0.007,-0.006844
0,0.014,-0.014132
AUC = 0.96
confusion: [[23.0, 1.0], [4.0, 12.0]]
entropy: [[-0.2, -2.3], [-4.2, -0.2]]
11/11/02 10:35:28 INFO driver.MahoutDriver: Program took 312 ms

 

AUC: area under the ROC curve. It is 1.0 for a perfect ranking of the two classes and about 0.5 for random scores.

http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_Under_Curve
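As a sanity check, the summary figures in the RunLogistic output can be recomputed from the printed rows. The sketch below assumes a 0.5 decision threshold, a confusion matrix indexed as [predicted][actual], and the pairwise (rank-based) definition of AUC. Those assumptions reproduce the printed confusion matrix and AUC, but Mahout may compute them differently internally. The per-row log-likelihood only matches to about two decimals, because the printed scores are rounded.

```python
import math

# (target, model-output) pairs copied from the RunLogistic output above.
rows = [
    (0, 0.004), (0, 0.003), (1, 0.959), (1, 0.977), (0, 0.000),
    (1, 0.922), (1, 0.678), (0, 0.160), (0, 0.019), (0, 0.740),
    (0, 0.040), (1, 0.873), (1, 0.106), (1, 0.933), (1, 0.997),
    (0, 0.106), (1, 0.971), (0, 0.001), (1, 0.898), (0, 0.000),
    (0, 0.103), (0, 0.033), (0, 0.003), (0, 0.722), (0, 0.002),
    (1, 0.997), (1, 0.968), (0, 0.013), (0, 0.458), (0, 0.020),
    (0, 0.563), (0, 0.178), (0, 0.340), (0, 0.043), (0, 0.020),
    (0, 0.088), (1, 0.649), (0, 0.832), (0, 0.007), (0, 0.014),
]

def log_likelihood(target, p):
    """Log of the probability the model assigned to the true class."""
    return math.log(p) if target == 1 else math.log(1.0 - p)

def confusion(rows, threshold=0.5):
    """2x2 counts indexed as matrix[predicted][actual]."""
    matrix = [[0, 0], [0, 0]]
    for target, p in rows:
        predicted = 1 if p >= threshold else 0
        matrix[predicted][target] += 1
    return matrix

def auc(rows):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in rows if t == 1]
    neg = [p for t, p in rows if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

print(confusion(rows))                      # [[23, 1], [4, 12]]
print(round(auc(rows), 2))                  # 0.96
print(round(log_likelihood(0, 0.740), 3))   # -1.347 (the output shows -1.348002)
```

Read this way, the confusion matrix says that 4 of the 27 rows with target 0 scored above 0.5, and 1 of the 13 rows with target 1 scored below it.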