Amazon Elastic MapReduce & Bootstrap Action

December 3, 2011

Here’s a bunch of commands for setting up an Amazon EC2 Linux environment. Basically, I’m trying to install Python, R, and the most common Python packages. The available resources are more limited than on an Ubuntu system (where everything is much more straightforward). I still have a problem with rpy2 and hope to fix it soon.

Besides that, here’s something really bothersome: with Amazon Elastic MapReduce there is no direct or easy way to use a personalized image like this. EMR is based on Amazon EC2 Linux, so you can’t set it up like Ubuntu. It supports four Amazon EC2 instance families: Standard, High-CPU, High-Memory, and Cluster Compute. To customize each instance that will run the MapReduce job for you, you need to use a Bootstrap Action, which basically runs a script of shell commands, like the ones below, to set up the environment every time before the job starts.

In my original MapReduce algorithm, the mapper simply loads the JSON, creates an index as the key, and passes it to the reducer. The reducer is coded in Python but calls R via rpy2 and builds a random forest in R. In my case, to bypass the Bootstrap Action, using a Python script as the mapper and an R script as the reducer could be an option.
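
Before moving a mapper/reducer pair to the cluster, it can help to check the plumbing locally. Hadoop streaming feeds records to the mapper on stdin, sorts the mapper output by key, and feeds the result to the reducer on stdin. The sketch below imitates that with an ordinary pipe. The function names `simulate_streaming`, `demo_mapper`, and `demo_reducer` are made up for illustration, and the awk stand-ins only mark where a real Python mapper and R reducer would go.

```shell
#!/bin/bash
# Local stand-in for a Hadoop streaming job: mapper | sort | reducer.
# All function names here are hypothetical; on a cluster the mapper and
# reducer would be separate executables (e.g. a Python script and an R script).

# Run the command named by $1 as mapper and the one named by $2 as reducer.
simulate_streaming() {
    "$1" | sort -k1,1 | "$2"
}

# Stand-in mapper: emit "key<TAB>record", using the first field as the key.
demo_mapper() {
    awk '{ print $1 "\t" $0 }'
}

# Stand-in reducer: count the records seen for each key.
demo_reducer() {
    awk -F'\t' '{ n[$1]++ } END { for (k in n) print k "\t" n[k] }' | sort
}
```

For example, `printf 'a 1\nb 2\na 3\n' | simulate_streaming demo_mapper demo_reducer` prints one line per key with its record count. Swapping the stand-ins for the real scripts keeps the same pipe shape.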

#basic stats on the EC2 machine:
$ cat /proc/cpuinfo

$ sudo yum -y install python-devel
$ sudo yum -y install gcc
$ sudo yum -y install gcc-c++
$ sudo yum -y install subversion gcc-gfortran
$ sudo yum -y install fftw-devel swig
$ sudo yum -y install compat-gcc-34 compat-gcc-34-g77 compat-gcc-34-c++ compat-libstdc++-33 compat-db compat-readline43
$ sudo yum -y install hdf5-devel
$ sudo yum -y install readline-devel
$ sudo yum -y install python-numeric python-numarray Pyrex
$ sudo yum -y install python-psyco
$ sudo yum -y install wxPython-devel zlib-devel freetype-devel tk-devel tkinter gtk2-devel pygtk2-devel libpng-devel
$ sudo yum -y install octave

# To install all the available Python packages
$ sudo yum -y install 'python-*'

# To install numpy
$ sudo yum -y install numpy

$ sudo yum -y install make libX11-devel.* libICE-devel.* libSM-devel.* libdmx-devel.* libx* xorg-x11* libFS* libX* readline-devel gcc-gfortran gcc-c++ texinfo tetex
$ wget http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz
$ tar zxf R-2.13.1.tar.gz && cd R-2.13.1
$ # build R as a shared library (--enable-R-shlib), which rpy2 needs
$ ./configure --enable-R-shlib && make
$ # make coffee… or finish your PhD thesis… (yes, it takes that long)
$ # finally, if all is well:
$ sudo make install
$ cd
$ R --version

$ wget http://sourceforge.net/projects/rpy/files/rpy2/2.2.x/rpy2-2.2.3.tar.gz
$ tar zxvf rpy2-2.2.3.tar.gz
$ cd rpy2-2.2.3
$ sudo python setup.py install


# Install easy_install – make life easier 🙂

$ sudo yum -y install python-setuptools
$ sudo easy_install simplejson   # afterwards you can install Python packages very easily
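
Since a Bootstrap Action is just a script that runs on each instance, the yum commands above could be collected into one file. Here is a minimal sketch of what that might look like. The package list is only a subset of the ones above, and the `--apply` flag and the dry-run default are my own assumptions (so the script can be previewed safely on any machine); they are not something EMR defines. Building R from source is left out.

```shell
#!/bin/bash
# Sketch of a bootstrap action script. The package subset and the --apply
# flag are illustrative assumptions, not part of EMR itself.

# A subset of the packages installed with yum earlier in this post.
PACKAGES="python-devel gcc gcc-c++ gcc-gfortran readline-devel numpy python-setuptools"

# Dry run unless the script is called with --apply as its first argument.
APPLY=0
if [ "${1:-}" = "--apply" ]; then
    APPLY=1
fi

# Run a command, or only print it (prefixed with "+") in dry-run mode.
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Install every package in turn; stop at the first failure.
install_all() {
    for pkg in $PACKAGES; do
        run sudo yum -y install "$pkg" || return 1
    done
    run sudo easy_install simplejson
}

install_all
```

Run without arguments, it only prints the commands it would execute. The idea would be to upload the file to S3 and pass `--apply` as the script argument when creating the job flow, so that the same commands run on every instance before the job starts.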


Related Links:


– Possible to use the customized AMI for Elastic MapReduce on AWS?

Elastic MapReduce doesn’t support customer AMIs at this time. The service instead has a feature called “Bootstrap Actions” that allows you to pass a reference to a script stored in Amazon S3 and related arguments to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265 (section “Creating a Job Flow with Bootstrap Actions”)
