Archive for November, 2011

Setting Up for Amazon EC2

November 18, 2011

Here’s how to set up Amazon EC2 and connect to an instance using PuTTY:

1) Create an account in AWS
2) Generate a key pair under “Key Pairs”

Enter a key name, such as ec2, and save the private key file (ec2.pem); it cannot be downloaded again.
Download PuTTYgen, import the ec2.pem file, and set “Type of key to generate” to SSH-2 RSA.
Then save the private key as a .ppk file.

3) Update your “Security Groups” settings; for example, add an “SSH” rule to the default group.

4) Create an instance. In the advanced settings, select your existing key pair and the default security group (since we already updated the default group, the instance will be reachable over SSH).

5) Once the instance has launched, paste its Public DNS into PuTTY’s “Host Name (or IP address)” field.

6) Go to Connection -> Data and enter “ec2-user” as the auto-login username.

For an Ubuntu instance, use “ubuntu” as the auto-login username instead.
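If you prefer plain OpenSSH to PuTTY, steps 5 and 6 collapse into a single command. Here is a small helper that assembles it (the hostname below is a placeholder, not a real instance):

```python
def ssh_command(public_dns, user="ec2-user", key_file="ec2.pem"):
    """Build the OpenSSH equivalent of the PuTTY setup above:
    -i points at the .pem private key, and the user matches the AMI."""
    return "ssh -i {} {}@{}".format(key_file, user, public_dns)

# Amazon Linux AMIs log in as ec2-user; Ubuntu AMIs log in as ubuntu.
print(ssh_command("ec2-xx-xx-xx-xx.compute-1.amazonaws.com"))
print(ssh_command("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", user="ubuntu"))
```

Note that OpenSSH uses the .pem file directly, so the PuTTYgen conversion step is not needed.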

Categories: Amazon Web Services

A note on randomForest in R

November 9, 2011

Using the importance values to select features.

Link: http://www.statmethods.net/advstats/cart.html

RANDOM FORESTS

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new “forest”, and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler’s random forest approach is implemented via the randomForest package.
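The combining step in that description can be sketched in a few lines of Python (the per-tree predictions below are toy values, not the kyphosis data):

```python
from collections import Counter

def vote(tree_predictions):
    """Classification: the forest predicts the majority class across trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average(tree_predictions):
    """Regression: the forest predicts the mean of the per-tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)

print(vote(["absent", "present", "absent"]))  # majority vote -> "absent"
print(average([1.0, 2.0, 3.0]))               # mean -> 2.0
```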

Here is an example.

# Random Forest prediction of Kyphosis data
library(rpart)          # provides the kyphosis data set
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit)              # view results
importance(fit)         # importance of each predictor

For more details see the comprehensive Random Forest website.

Categories: R

Two APIs for Amazon

November 7, 2011

mrjob – Run Hadoop Streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster
http://packages.python.org/mrjob/

boto – An integrated interface to current and future infrastructural services offered by Amazon Web Services.
http://code.google.com/p/boto/

Categories: Uncategorized

Note on Hadoop Speculative Execution

November 7, 2011

From Yahoo Hadoop Tutorial

Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.

By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
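The “first copy wins” behavior can be illustrated with a toy Python simulation; the sleep times stand in for a straggler node with a slow disk, and none of this is Hadoop API:

```python
import threading
import time

def run_with_speculation(task, n_copies=2):
    """Run redundant copies of `task`; the first finisher's output is kept,
    mirroring how Hadoop keeps only the fastest attempt and discards the rest."""
    result = {}
    done = threading.Event()
    lock = threading.Lock()

    def attempt(copy_id):
        value = task(copy_id)
        with lock:
            if not done.is_set():      # first finisher becomes the definitive copy
                result["value"] = value
                result["winner"] = copy_id
                done.set()             # later copies' outputs are discarded

    threads = [threading.Thread(target=attempt, args=(i,)) for i in range(n_copies)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def slow_or_fast_map_task(copy_id):
    # Copy 0 simulates the slow node; copy 1 is a healthy speculative copy.
    time.sleep(0.5 if copy_id == 0 else 0.01)
    return "map-output"

print(run_with_speculation(slow_or_fast_map_task))
```

Both copies produce the same output (they read the same input), so discarding the loser is safe; the only thing speculation buys is latency.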

Categories: Uncategorized

Mahout Logistic Regression

November 2, 2011

classifier.sgd

# Check the help info

$mahout org.apache.mahout.classifier.sgd.TrainLogistic --help
$mahout org.apache.mahout.classifier.sgd.RunLogistic --help

# Example of Training

# To train the model. The model is stored in donut.model, a JSON file; to read it more easily, try http://jsonviewer.stack.hu/

$mahout org.apache.mahout.classifier.sgd.TrainLogistic \
  --passes 100 \
  --rate 50 --lambda 0.001 \
  --input /mahout_examples/donut.csv \
  --features 21 \
  --output /mahout_examples/donut.model \
  --target color \
  --categories 2 \
  --predictors x y xx xy yy a b c --types n n

Then you should see terminal output like this:
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
No HADOOP_CONF_DIR set, using /usr/lib/hadoop/conf
11/11/01 17:58:21 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 5.048*Intercept Term + 3.747*x + 4.530*y + -3.986*xx + 2.191*xy + -4.723*yy + 0.562*a + -0.580*b + -22.188*c
Intercept Term 5.04769
a 0.56192
b -0.57986
c -22.18806
x 3.74697
xx -3.98555
xy 2.19129
y 4.52954
yy -4.72268
-3.985546155 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -0.579859597 4.529541855 0.000000000 0.000000000 0.000000000 -4.722678608 5.047685107 0.000000000 0.000000000 2.191286892 0.561916360 -22.188056574 0.000000000 0.000000000 3.746971894
11/11/01 17:58:22 INFO driver.MahoutDriver: Program took 858 ms
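The scores that RunLogistic prints below are logistic probabilities, conceptually obtained by pushing a weighted sum of the predictors through a sigmoid. Here is a sketch of that computation using the coefficients reported above; the sample point is invented, and Mahout's actual encoder hashes features into the 21-dimensional vector rather than looking them up by name:

```python
import math

# Coefficients reported by TrainLogistic above (intercept plus 8 predictors).
coef = {
    "Intercept Term": 5.04769,
    "x": 3.74697, "y": 4.52954,
    "xx": -3.98555, "xy": 2.19129, "yy": -4.72268,
    "a": 0.56192, "b": -0.57986, "c": -22.18806,
}

def score(point):
    """Model output: sigmoid of the linear combination of predictors."""
    z = coef["Intercept Term"] + sum(coef[k] * v for k, v in point.items())
    return 1.0 / (1.0 + math.exp(-z))

# A made-up sample point, just to illustrate the computation.
p = {"x": 0.4, "y": 0.5, "xx": 0.16, "xy": 0.2, "yy": 0.25,
     "a": 0.7, "b": 0.3, "c": 0.1}
print(round(score(p), 3))
```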

# To test the model
/usr/lib/mahout/bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --help

$mahout org.apache.mahout.classifier.sgd.RunLogistic \
  --input /mahout_examples/donut-test.csv \
  --model /mahout_examples/donut.model --auc \
  --scores --confusion

Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop

No HADOOP_CONF_DIR set, using /usr/lib/hadoop/src/conf
11/11/02 10:35:27 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.RunLogistic.props found on classpath, will use command-line arguments only
"target","model-output","log-likelihood"
0,0.004,-0.003696
0,0.003,-0.002722
1,0.959,-0.042384
1,0.977,-0.023617
0,0.000,-0.000166
1,0.922,-0.081457
1,0.678,-0.388569
0,0.160,-0.174764
0,0.019,-0.019335
0,0.740,-1.348002
0,0.040,-0.040603
1,0.873,-0.135365
1,0.106,-2.242013
1,0.933,-0.069273
1,0.997,-0.003449
0,0.106,-0.112158
1,0.971,-0.029869
0,0.001,-0.001182
1,0.898,-0.107512
0,0.000,-0.000007
0,0.103,-0.108486
0,0.033,-0.034022
0,0.003,-0.003357
0,0.722,-1.281526
0,0.002,-0.002285
1,0.997,-0.002749
1,0.968,-0.032817
0,0.013,-0.013217
0,0.458,-0.613088
0,0.020,-0.019809
0,0.563,-0.827950
0,0.178,-0.195591
0,0.340,-0.416144
0,0.043,-0.043604
0,0.020,-0.020153
0,0.088,-0.091683
1,0.649,-0.432606
0,0.832,-1.786718
0,0.007,-0.006844
0,0.014,-0.014132
AUC = 0.96
confusion: [[23.0, 1.0], [4.0, 12.0]]
entropy: [[-0.2, -2.3], [-4.2, -0.2]]
11/11/02 10:35:28 INFO driver.MahoutDriver: Program took 312 ms

AUC: Area Under the Curve

http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_Under_Curve
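The AUC = 0.96 reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of computing AUC from (target, score) pairs like the ones RunLogistic prints; the four-point data set here is made up so the result is easy to check by hand:

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking definition:
    the fraction of positive/negative pairs ranked correctly
    (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```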