Archive for December, 2011

Machine Learning Blogs

December 31, 2011

Here’s the OPML for some good machine learning blogs. This Q&A thread lists a lot of others.
Maybe I should keep collecting them 🙂

Random Links for AWS, MapReduce, rpy2

December 29, 2011

Still having the issue with using rpy2 in the user data sent to an EC2 instance. Everything works well when opening a terminal on an instance created from my own image, but it fails when the user data is sent via boto. Ran several tests of rpy2, such as python -m ‘rpy2.tests’, and they all passed, but it still does not work.
Weird error like:

RRuntimeError: <no args>
  File “/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/”, line 225, in __call__
    res = self.eval(p)
  File “/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/”, line 82, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File “/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/”, line 34, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
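For reference, the launch call I use looks roughly like the sketch below; the AMI ID, key name, and script contents are placeholders, and the actual boto call is left in comments. One common cause of “works in a terminal but not in user data” is that the user-data script runs as root with a much sparser environment than a login shell, so things like PYTHONPATH may need to be set explicitly.

```python
# Build the user-data script that the instance runs on first boot.
# The paths and commands below are placeholders, not my real job script.
user_data = "\n".join([
    "#!/bin/bash",
    "export PYTHONPATH=/usr/local/lib/python2.7/dist-packages",
    "python -m rpy2.tests > /tmp/rpy2_test.log 2>&1",
    "python /home/ubuntu/run_job.py >> /tmp/job.log 2>&1",
])

# With boto 2.x this is passed at launch time (not executed here):
#   import boto.ec2
#   conn = boto.ec2.connect_to_region("us-east-1")
#   conn.run_instances("ami-xxxxxxxx", key_name="my-key",
#                      instance_type="m1.large", user_data=user_data)
```

Logging each step of the user-data script to /tmp, as above, is the easiest way to see where it diverges from the interactive session.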

Below just some random links for records.

Amazon Web Service:
Get started with Amazon EC2 by using the boto library for Python:
Documentation of Boto, online at

Cloud vs AMR:
Best Cloud Apps and Service:

Use R in Amazon MapReduce

Create Amazon Machine Image:

Launch the instance and save the key pair:



RF tree size

December 29, 2011

The number of trees in a Random Forest should be high enough to ensure that every input sample gets predicted at least a few times. If you want auxiliary information like variable importance or proximity, growing a large number of trees is a good choice, since the results are more stable. For my current test data, however, I have seen that the number of trees is not directly proportional to performance. The features of the POS and NEG classes do not carry very discriminative information, so even a very complex forest cannot rescue the performance.
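As a toy illustration (not the actual experiment): simulate majority voting over independent weak classifiers. With weak “trees”, accuracy improves as the ensemble grows but quickly flattens out, which matches the observation that adding trees stops paying off when the features carry little discriminative information. All numbers below are made up.

```python
import random

random.seed(0)

def ensemble_accuracy(n_trees, p_correct, trials=4000):
    """Fraction of trials where a majority vote of n_trees independent
    weak classifiers (each correct with probability p_correct) is right."""
    wins = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(n_trees) if random.random() < p_correct)
        if votes * 2 > n_trees:  # strict majority is correct
            wins += 1
    return wins / float(trials)

# Weak trees: only slightly better than a coin flip.
acc = {n: ensemble_accuracy(n, p_correct=0.55) for n in (1, 11, 101)}
# Accuracy climbs with ensemble size, then plateaus well below 1.0.
```

Real trees are also correlated with each other, which caps the gains even earlier than this independent-voter toy model suggests.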


Categories: Data Mining

Name Prefix

December 29, 2011

Some name-prefix links collected earlier. Many of them are royal or honorific titles.

Mr., Mrs., Ms., Miss, Dr., A.V.M., AB, Adm., Amb, AMN, Archbishop, Baron, Baroness, Bishop, Brig. Gen., Brigadier, Bro., Cantor, Capt., Cardinal, Chaplain, Cmdr., CMSGT, Col., Consul, Count, Countess, Cpl., CPO, CWO, Dean, Duchess, Duke, Earl, Ens., Eur Eng, Father, Fr., Gen., Gov., H. E., Herr, Hon., HRH, Lady, Lord, Lt., Lt. Cmdr., Lt. Col., Lt. Gen., M., Maj., Maj. Gen., Master, Mlle., Mme., Mother, MSGT, Pastor, PFC, Pres., Prince, Princess, Prof., Rabbi, Radm., Rev., Rt. Hon., Senator, Sgt., Sgt. Maj., Sir, Sister, SMSGT, Speaker, Squad.Ldr., Sr., SrA, Sra., Srta., SSGT, Swami, TSGT, Vadm.
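As a quick example of how such a list gets used in practice, here is a small hypothetical helper that splits a recognized title off the front of a name (only a subset of the prefixes above is included):

```python
# Hypothetical helper: strip a leading honorific/title from a full name,
# using a subset of the prefixes listed above.
PREFIXES = {"Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Prof.", "Rev.", "Capt.",
            "Col.", "Gen.", "Sir", "Lady", "Lord", "Fr.", "Sister"}

def strip_prefix(name):
    """Return (prefix, rest); prefix is '' if the first token is no title."""
    parts = name.split(None, 1)
    if len(parts) == 2 and parts[0] in PREFIXES:
        return parts[0], parts[1]
    return "", name
```

For example, strip_prefix("Dr. John Smith") gives ("Dr.", "John Smith"), while a name with no recognized title passes through unchanged.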

Categories: NLP

rpy2 error when importing numpy array

December 14, 2011

Got an error when importing arrays from numpy into rpy2:

ValueError: Nothing can be done for the type <type ‘numpy.ndarray’> at the moment.
Adding the extra import below (rpy2’s numpy conversion layer) solved the problem:
import rpy2.robjects.numpy2ri
Categories: Python, R

Weekly Record

December 13, 2011

1. Ran into problem of “Task attempt failed to report status for 6003 seconds. Killing!”

Figured out that it is due to not emitting any output when features are missing. Some parts of the data can have a huge fraction of missing features, which causes the MapReduce status not to update for a long time. Basically, the error means that the task stayed in the map or reduce phase for longer than the allowed time without any stdin/stdout activity.

I changed the code, but this can also be solved by increasing the timeout parameter. Here’s a link about it.

Another way is to use the Reporter.

Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).
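For Hadoop Streaming jobs (e.g. Python mappers), the counterpart of the Reporter is writing specially formatted lines to stderr; a reporter:status line resets the timeout clock. A minimal sketch, with the line formats taken from the Streaming protocol:

```python
import sys

def report_status(message):
    """Tell the Streaming framework we are alive (resets the task timeout)."""
    line = "reporter:status:%s" % message
    sys.stderr.write(line + "\n")
    return line

def increment_counter(group, counter, amount=1):
    """Update a Hadoop counter from a streaming task."""
    line = "reporter:counter:%s,%s,%d" % (group, counter, amount)
    sys.stderr.write(line + "\n")
    return line
```

Calling report_status every few thousand records inside the mapper loop is the streaming equivalent of the Java snippet below.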

Java code to report the task status:


if ((++count % 1000) == 0) {
    // 100.0 avoids integer division truncating to 0
    context.setStatus((100.0 * count / len) + "% done");
}

2. Boto connection error:

The requested instance type’s architecture (i386) does not match the architecture in the manifest for ami-c9c70da0 (x86_64). (RequestID: b71b1ee4-5a98-45e2-af1d-7da0db114afb)

It is saying that my created image is 64-bit, while the instance I was going to launch is 32-bit. As a test, I tried launching the instance from a 32-bit image; that went through fine.

Finally, found out that it was because the instance_type was not correct.
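A cheap pre-flight check would have caught this before the API call. The helper and its architecture table below are purely illustrative; consult the EC2 documentation for the real instance-type/architecture matrix.

```python
# Hypothetical pre-flight check: verify the AMI's architecture against
# the architectures an instance type supports before calling run_instances.
# The table below is illustrative only, not an authoritative EC2 reference.
SUPPORTED_ARCH = {
    "t1.micro": {"i386", "x86_64"},
    "m1.small": {"i386"},
    "m1.large": {"x86_64"},
}

def check_arch(instance_type, ami_arch):
    """Raise ValueError before launch if AMI and instance type mismatch."""
    archs = SUPPORTED_ARCH.get(instance_type)
    if archs is None:
        raise ValueError("unknown instance type: %s" % instance_type)
    if ami_arch not in archs:
        raise ValueError("%s does not support a %s AMI (supports: %s)"
                         % (instance_type, ami_arch, ", ".join(sorted(archs))))
```

With this, check_arch("m1.small", "x86_64") fails fast with a readable message instead of a boto connection error after the request is sent.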


3. Other collections on AWS

Is it possible to use a customized AMI for Elastic MapReduce on AWS? Elastic MapReduce doesn’t support custom AMIs at this time. The service instead has a feature called “Bootstrap Actions” that allows you to pass a reference to a script stored in Amazon S3, together with related arguments, to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions (section “Creating a Job Flow with Bootstrap Actions”).

Processing images is one of the typical Elastic MapReduce use cases.

Use S3 or hdfs:

I would like to be able to access and use HDFS directly instead of having to worry about using the S3 bucket for initial or intermediate IO. I am worried about the IO performance of the S3 bucket compared with HDFS performance. I have seen multiple posts that say it doesn’t matter and others that say it can matter.

HDFS vs S3 provide different benefits; HDFS has lower latency but S3 has higher durability. In terms of long term storage (without compute) S3 is the cheaper option.

Would people recommend using EMR or EC2 with a Hadoop 0.20 image for doing something like this?

EMR is highly tuned to offer the best performance possible with S3.

Does the EMR setup support using the HDFS like this with custom JARs?

Definitely. Intermediate data is stored in HDFS unless you configure things otherwise. You are able to choose whether to use HDFS or S3 for your initial data.

Common Problems Running Job flows. <>
Using s3:// instead of s3n://

The Amazon Elastic MapReduce instances run on a pre-defined AMI. To use a customized instance for MapReduce, one way is to run a bootstrap action.




Q: What are Amazon Elastic MapReduce Bootstrap Actions?

Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.

Q: How can I use Bootstrap Actions?

You can write a Bootstrap Action script in any language already installed on the job flow instance, including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the “Developer’s Guide” for details on how to use Bootstrap Actions.
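Putting the answer above into code: with boto 2.x the script location and arguments are wrapped in a BootstrapAction and passed when the job flow is created. The bucket, script name, and arguments below are hypothetical, and the boto calls are shown as comments so the structure is the focus:

```python
# Sketch of starting a job flow with a bootstrap action.
# The S3 paths and arguments below are hypothetical.
bootstrap_action = {
    "name": "install-r-and-rpy2",
    "path": "s3://my-bucket/bootstrap/install_deps.sh",  # script uploaded to S3
    "args": ["--with-rpy2"],
}

job_flow = {
    "name": "my-job-flow",
    "log_uri": "s3://my-bucket/logs/",
    "bootstrap_actions": [bootstrap_action],
}

# With boto 2.x this would become (not executed here):
#   from boto.emr.connection import EmrConnection
#   from boto.emr.bootstrap_action import BootstrapAction
#   conn = EmrConnection()
#   ba = BootstrapAction(bootstrap_action["name"],
#                        bootstrap_action["path"],
#                        bootstrap_action["args"])
#   conn.run_jobflow(name=job_flow["name"], log_uri=job_flow["log_uri"],
#                    bootstrap_actions=[ba])
```

The script runs as root on every instance before the job flow starts, so this is the place to apt-get or pip install anything the mappers need.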

Q: How do I configure Hadoop settings for my job flow?

The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.

Using Amazon EC2 to create an image with Hadoop <>

Categories: Amazon Web Services

Keep Trying

December 8, 2011

Always the same question: how to perform timely and scalable analytical processing of large datasets.

Let’s start building the house piece by piece.

Categories: Uncategorized