Archive for the ‘Amazon Web Services’ Category

Useful links for AWS

January 20, 2012

Some useful links for AWS:

 

  • Use Bootstrap Actions to do the instance configuration
  • All jobs should be mapper+reducer (no mapper-only jobs in AWS), so use a ‘pass-through’ as your reducer; see the sketch below
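
For a streaming job, the ‘pass-through’ reducer can be as small as an identity filter. A minimal sketch (a hypothetical reducer.py) that just copies stdin to stdout:

#!/usr/bin/env python
# Identity ("pass-through") reducer for Hadoop Streaming:
# echo every line from stdin back to stdout unchanged, so the job
# still has a reduce phase but the mapper output is kept as-is.
import sys

for line in sys.stdin:
    sys.stdout.write(line)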

Distributed cache file for ElasticMapReduce

January 19, 2012

The cache_files option provides a good way to pass extra data to an AWS Elastic MapReduce job (as opposed to the input data, which is fed to the mapper via stdin), such as a parameter file or other auxiliary information. You can also use gzipped input by adding -jobconf stream.recordreader.compression=gzip to the extra arguments, which lets Hadoop decompress the data on the fly before passing it to the mapper. Here’s an example of how to specify the cache files in boto:

from boto.emr.step import StreamingStep

job_name = 'rf-job'   # any identifier for this run

mapper = 's3://<your-code-bucket>/mapper.py'
reducer = 's3://<your-code-bucket>/reducer.py'
input_mr = 's3://<your-input-data-bucket>'
output_mr = 's3://<your-output-bucket>/' + job_name

step_args = ['-jobconf', 'mapred.reduce.tasks=1', '-jobconf', 'mapred.map.tasks=2',
             '-jobconf', 'stream.recordreader.compression=gzip']

# Each entry is <S3 path>#<local alias>; the mapper/reducer opens the file by its alias.
cache_files = ['s3://<your-cache-file-bucket>/randomForest-model-1.txt#rf1.txt',
               's3://<your-cache-file-bucket>/randomForest-model-2.txt#rf2.txt']

step = StreamingStep(name='my-step', mapper=mapper, reducer=reducer,
                     input=input_mr, output=output_mr,
                     step_args=step_args, cache_files=cache_files)
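The StreamingStep above only describes the step; you still need an EMR connection and a job flow to actually run it. A rough sketch of the submission side, assuming boto’s EMR module, a hypothetical region, and a placeholder log bucket (credentials come from the environment or ~/.boto):

import boto.emr

# Connect to EMR; the region is an assumption (use the one holding your buckets)
conn = boto.emr.connect_to_region('us-east-1')

# Submit a job flow that runs the streaming step defined above
jobid = conn.run_jobflow(name=job_name,
                         log_uri='s3://<your-log-bucket>/logs',
                         steps=[step],
                         num_instances=2,
                         master_instance_type='m1.small',
                         slave_instance_type='m1.small')
print(jobid)

Inside the mapper or reducer, the cached file is then available under its alias, e.g. open('rf1.txt').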

[Refs]
Distributed_cacheFile
http://blog.tophernet.com/2011/10/importing-custom-python-package-in.html
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/DistributedCache.html

Use boto to run user-data in an AWS instance

January 4, 2012

I am using boto to launch an instance and run the user code in the background, but found that it’s not so convenient to debug and get the error information. Here’s what I finally did: use traceback to capture the error, and use SMTP to send the error information back to my email. There must be some cleverer ways to do this, ha ~

import boto
from boto.ec2.connection import EC2Connection
import time

ami_id = '----'                      # 64-bit Ubuntu AMI (already set up); use t1.micro as the instance type
instance_type = 't1.micro'
key_pair_name = '----'
AWS_ACCESS_KEY_ID = '----'
AWS_SECRET_ACCESS_KEY = '----'

ec2conn = EC2Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

test_script_00 = """#!/bin/bash
sudo apt-get update
sudo apt-get install -y imagemagick
"""

test_script_01 = """#!/usr/bin/env python
import boto
from boto.s3.key import Key

AWS_ACCESS_KEY_ID = '-------'
AWS_SECRET_ACCESS_KEY = '----'

bucket_name = 'demo-test'
conn_s3 = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn_s3.get_bucket(bucket_name)

k=Key(bucket)
k.key='images/rose111105.jpg'
k.copy('demo-test', 'images/rose111105copy6.jpg')
"""

test_script ="""#!/usr/bin/env python
import smtplib
import boto
from boto.s3.key import Key
import numpy as np
import sys

AWS_ACCESS_KEY_ID = '-----' # Your AWS_KEY
AWS_SECRET_ACCESS_KEY = '---------' # Your Secret KEY

def send_notice(msg='testing'):
    fromaddr = '******@gmail.com'
    toaddrs = '******@gmail.com'
    username = '******'
    password = '******'
    
    server = smtplib.SMTP('smtp.gmail.com:587')
    server.starttls()
    server.login(username,password)
    server.sendmail(fromaddr, toaddrs, msg)
    server.quit()

send_notice("start processing")
#DO YOUR PROCESS
send_notice("Finishing...")
"""

my_reservation = ec2conn.run_instances(ami_id,
                                       instance_type=instance_type,
                                       key_name=key_pair_name,
                                       user_data=test_script)

instance = my_reservation.instances[0]
while not instance.update() == 'running':
    time.sleep(5)
instance.stop()
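
To actually capture the error mentioned above, the processing part of the user-data script can be wrapped in try/except and the formatted traceback mailed on failure. This is only a sketch, reusing the send_notice() helper from the script above:

#!/usr/bin/env python
# Sketch: email the full traceback if the processing step raises.
import traceback

def do_process():
    pass    # placeholder for the real processing code

try:
    send_notice("start processing")
    do_process()
    send_notice("Finishing...")
except Exception:
    # traceback.format_exc() returns the same text a crash would print to stderr
    send_notice("Job failed:\n" + traceback.format_exc())
    raise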

Random Links for AWS, MapReduce, rpy2

December 29, 2011

Still having the issue of using rpy2 in user data sent to an EC2 instance. Everything works well when run from the terminal of an instance created from my own image, but it fails when the user data is sent via boto. Tried several tests of rpy2, like python -m 'rpy2.tests', which went through well, but it still does not work.
Weird errors like:

RRuntimeError", "<no args> \"/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/__init__.py\", line 225, in __call__\n res = self.eval(p)\n", " File \"/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py\", line 82, in __call__\n return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)\n", " File \"/usr/local/lib/python2.7/dist-packages/rpy2-2.2.4dev_20111122-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py\", line 34, in __call__\n res = super(Function, self).__call__(*new_args, **new_kwargs)\n"]]

Below are just some random links for the record.

Amazon Web Service: http://aws.amazon.com/elasticmapreduce/
Get started with Amazon EC2 by using the boto library for Python: http://aws.amazon.com/articles/3998?_encoding=UTF8&jiveRedirect=1
Documentation of Boto, online at http://boto.cloudhackers.com

Cloud vs. EMR:
http://umichcloud.blogspot.com/2011/11/gui-and-running-mapreduce-from-desktop.html
Best Cloud Apps and Service: https://sites.google.com/site/truthkos/old-pages/ci-days-research-in-the-cloud/favoritecloudapps

Use R in Amazon MapReduce
http://blog.revolutionanalytics.com/2009/05/running-r-in-the-cloud-with-amazon-ec2.html
http://benreuven.com/udiwiki/index.php?title=R_with_Amazon_MapReduce

Create Amazon Machine Image:
http://docs.amazonwebservices.com/AWSEC2/2007-08-29/GettingStartedGuide/creating-an-image.html
http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/
http://ged.msu.edu/angus/tutorials/creating-custom-amis.html

Launch the instance and save the key pair:
http://ged.msu.edu/angus/tutorials/renting-a-computer-from-amazon.html

R-glm
http://data.princeton.edu/R/glms.html

Chaining MapReduce jobs:
http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html

http://stackoverflow.com/questions/2499585/chaining-multiple-mapreduce-jobs-in-hadoop
http://stackoverflow.com/questions/4170060/hadoop-map-reduce-chaining

http://stackoverflow.com/questions/2986271/need-help-implementing-this-algorithm-with-map-hadoop-mapreduce

http://stackoverflow.com/questions/6446914/implementing-parallel-for-in-hadoop
http://stackoverflow.com/questions/6523911/implementing-cross-join-in-hadoop
http://stackoverflow.com/questions/6800438/parallel-reducing-with-hadoop-mapreduce
http://stackoverflow.com/questions/7009930/shared-variable-in-map-reduce
http://stackoverflow.com/questions/2888788/global-variables-in-hadoop
http://stackoverflow.com/questions/1217850/streaming-data-and-hadoop-not-hadoop-streaming
http://stackoverflow.com/questions/2333618/hadoop-one-map-and-multiple-reduce

http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
http://www.econsteve.com/r/barr-parallelPresoFeb2011.pdf

Weekly Record

December 13, 2011

1. Ran into problem of “Task attempt failed to report status for 6003 seconds. Killing!”

Figured out that it is due to emitting nothing (no output at all) when there are missing features. Some parts of the data can have a huge portion of missing features, which causes the MapReduce status not to update for a long time. Basically, the error means that the task stayed in map or reduce for more than the allowed time without any stdin/stdout activity.

Changed the code, but there are also other ways to solve this, such as increasing the timeout parameter. Here are some links on this:
<https://issues.apache.org/jira/browse/MAPREDUCE-1308>
<http://stackoverflow.com/questions/5864589/how-to-fix-task-attempt-201104251139-0295-r-000006-0-failed-to-report-status-fo>

Another way is to use the Reporter.

Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).
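
For a streaming job submitted with boto, raising that timeout is just another -jobconf entry in the step arguments (the value is in milliseconds; 0 disables the timeout). A small sketch with a hypothetical 30-minute limit:

# Extra step arguments for the StreamingStep: raise the task timeout to 30 minutes
step_args = ['-jobconf', 'mapred.task.timeout=1800000']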

Some Java code to update the status:

http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/context.py

cnkiTable.put(put);

if ((++count % 1000) == 0) {
    context.setStatus((count / len) + " done!");
    context.progress();
}
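
In Hadoop Streaming (which is what the Python mappers and reducers here use), the same effect is achieved by writing specially formatted lines to stderr: the framework treats reporter:status:<message> and reporter:counter:<group>,<counter>,<amount> lines as progress updates. A minimal sketch:

import sys

def report_status(message):
    # 'reporter:status:<msg>' on stderr counts as a heartbeat/progress update
    sys.stderr.write('reporter:status:%s\n' % message)

def increment_counter(group, counter, amount=1):
    # 'reporter:counter:<group>,<counter>,<amount>' updates a user-defined counter
    sys.stderr.write('reporter:counter:%s,%s,%d\n' % (group, counter, amount))

for i, line in enumerate(sys.stdin):
    # ... process the record here ...
    if i % 1000 == 0:
        report_status('%d records processed' % i)
        increment_counter('MyJob', 'records', 1000)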


2. Boto connection error:

The requested instance type’s architecture (i386) does not match the architecture in the manifest for ami-c9c70da0 (x86_64)

It is saying that my created image is 64-bit, while the instance about to launch is 32-bit. Tried a test launching the instance with a 32-bit image; it went through well.

Finally, found out that it was because the instance_type was not correct: the instance type has to match the AMI’s architecture.

 

3. Other collections on AWS
https://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html

Is it possible to use a customized AMI for Elastic MapReduce on AWS? Elastic MapReduce doesn’t support custom AMIs at this time. The service instead has a feature called “Bootstrap Actions” that allows you to pass a reference to a script stored in Amazon S3, plus related arguments, to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265 (section “Creating a Job Flow with Bootstrap Actions”)

Processing images is one of the typical Elastic MapReduce use cases.

Use S3 or hdfs:

I would like to be able to access and use HDFS directly instead of having to worry about using the S3 bucket for initial or intermediate IO. I am worried about the IO performance of the S3 bucket compared with HDFS performance. I have seen multiple posts that say it doesn’t matter and others that say it can matter.

HDFS and S3 provide different benefits: HDFS has lower latency, while S3 has higher durability. For long-term storage (without compute), S3 is the cheaper option.

Would people recommend using EMR or EC2 with a Hadoop 0.20 image for doing something like this?

EMR is highly tuned to offer the best performance possible with S3.

Does the EMR setup support using the HDFS like this with custom JARs?

Definitely. Intermediate data is stored in HDFS unless you configure things otherwise. You are able to choose whether to use HDFS or S3 for your initial data.

Common Problems Running Job flows. <https://forums.aws.amazon.com/thread.jspa?threadID=30925>
Using s3:// instead of s3n://

The Amazon Elastic MapReduce instances run on a pre-defined AMI. To use a customized instance for MapReduce, one way is to run a bootstrap action:
[1] <http://serverfault.com/questions/253917/what-ami-is-launched-on-amazons-elastic-mapreduce-instances>

[2] <http://aws.typepad.com/aws/2010/04/new-elastic-mapreduce-feature-bootstrap-actions.html>

[3] <http://atbrox.com/2010/10/01/programmatic-deployment-to-elastic-mapreduce-with-boto-and-bootstrap-action/>

[4] <https://github.com/atbrox/atbroxexamples>

Q: What is Amazon Elastic MapReduce Bootstrap Actions?

Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.

Q: How can I use Bootstrap Actions?

You can write a Bootstrap Action script in any language already installed on the job flow instance, including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the “Developer’s Guide”: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ for details on how to use Bootstrap Actions.
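
With boto, a bootstrap action is just such a script in S3 referenced when the job flow is started. A rough sketch (the bucket, script name, and region are placeholders; ‘step’ is a streaming step defined as in the earlier post):

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

# Reference a set-up script you have uploaded to S3 (placeholder path, no args)
setup_action = BootstrapAction('install-packages',
                               's3://<your-code-bucket>/bootstrap.sh', [])

conn = boto.emr.connect_to_region('us-east-1')    # region is an assumption
jobid = conn.run_jobflow(name='bootstrap-demo',
                         log_uri='s3://<your-log-bucket>/logs',
                         bootstrap_actions=[setup_action],
                         steps=[step])

The predefined actions mentioned in the next answer (Configure Hadoop, Configure Memory Intensive) are referenced the same way, using the S3 paths and arguments given in the Developer’s Guide.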

Q: How do I configure Hadoop settings for my job flow?

The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?Bootstrap.html#PredefinedBootstrapActions_ConfigureHadoop

Using Amazon EC2 to create an image with Hadoop <http://wiki.apache.org/hadoop/AmazonEC2>

Categories: Amazon Web Services

Build an Ubuntu Amazon EC2 Instance

December 3, 2011

# Steps to create an Ubuntu Amazon EC2 Instance
– Log in to your AWS Management Console.
– Go to the EC2 tab, then ‘Launch Instance’.
– Select Classic Wizard -> Community AMIs -> search for an Ubuntu AMI (Ubuntu’s official owner ID: 099720109477). You can even try different versions of Ubuntu. (Link: http://uksysadmin.wordpress.com/2010/11/21/amazon-ec2-ubuntu-quickstart-guide/)

# Login Your Instance

* If using PuTTY:
Generate the key pair and save the value (the ec2.pem file), which is important.
Download PuTTYGen, import the ec2.pem file, and select “Type-of-key-to-generate”: SSH1.
Then save the private key, import the key into PuTTY, and from your running instance copy the ‘Public DNS’ into PuTTY’s Host Name.
Unlike Amazon Linux (which uses ‘ec2-user’), here Connection -> Data -> Auto-login username is ‘ubuntu’.

* Access the EC2 instance by SSH
Make the private key readable by the owner only:
$ chmod 400 ec2.pem

To access by SSH (after the @ is your Public DNS, which will be different each time you restart your instance):
$ ssh -i ec2.pem ubuntu@<your-public-DNS>

# Check to see if your EBS volume is already attached
sudo fdisk -l

# After creating the EBS volume and attaching it to the instance using the AWS console:

#Formatted:
mkfs -t ext2 /dev/sdf
#Created mount point (in my case a backup directory):
mkdir /backups
#Mounted it:

mount /dev/sdf /backups
#Edited rc.local:
added mount /dev/sdf /backups to the end, which mounts the EBS volume at boot.

Linux Devices: /dev/sdf through /dev/sdp
Note: Newer Linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp.

# Update the Amazon EC2 instance, which starts out with a default set of software. To install security updates and other pieces of software, run the following commands in order:

$ apt-get -y update
$ apt-get -y dist-upgrade
$ apt-get -y install mercurial less python-matplotlib unzip bzip2 zlib1g-dev ncurses-dev python-dev

# Install pip and virtualenv for Ubuntu 10.10 Maverick and newer. Pip is a better alternative to Easy Install for installing Python packages.

$ sudo apt-get install python-pip python-dev build-essential
$ sudo pip install --upgrade pip
$ sudo pip install --upgrade virtualenv

# install Easy Install on Ubuntu Linux
$ sudo apt-get install python-setuptools python-dev build-essential

# Install whatever you want, here’s my list: R, rpy2, numpy, bpython (use Pygments)
$ sudo apt-get install r-base r-base-dev
$ sudo apt-get update
$ sudo apt-get install python-pip python-dev build-essential

$ sudo apt-get install python-setuptools python-dev build-essential
$ sudo easy_install rpy2
$ sudo apt-get install python-numpy
$ sudo easy_install Pygments
$ sudo easy_install bpython
$ sudo apt-get install python-scipy

# Useful Resources

http://jonathanhui.com/create-and-launch-amazon-ec2-instance-ubuntu-and-centos
http://www.saltycrane.com/blog/2010/02/how-install-pip-ubuntu/
http://www.turnkeylinux.org/blog/ebsmount
http://wiki.bitnami.org/cloud/how_to_connect_to_your_amazon_instance

Amazon Elastic MapReduce & Bootstrap Action

December 3, 2011

Here’s a bunch of commands for setting up the Amazon EC2 Linux environment, basically trying to install Python, R, and the most common Python packages. The resources are limited compared to setting up an Ubuntu system (where everything is much more straightforward). I still have a problem with rpy2 and hope to fix it soon. Besides that, here’s something really bothering me: with Amazon Elastic MapReduce there is no direct/easy way to use such a personalized image. EMR is based on Amazon Linux, so you can’t set it up like Ubuntu. It supports four Amazon EC2 instance families: Standard, High-CPU, High-Memory, and Cluster Compute instances. To customize each instance that will do the MapReduce job for you, you need to use a Bootstrap Action, which basically submits a bash script, like the one below, to set up the environment every time before the job starts.

In my original MapReduce algorithm, the mapper simply loads the JSON, creates an index as the key, and hands it to the reducer; the reducer, coded in Python, calls R via rpy2 and builds a random forest in R. In my case, to bypass this Bootstrap Action, using Python as the mapper and an R script as the reducer could be an option; a rough sketch of the mapper side follows.
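
This is only an illustration of that mapper, not the exact code used here (the index field name is a placeholder): read JSON records from stdin and emit tab-separated key/value pairs for the reducer.

#!/usr/bin/env python
# Sketch of the mapper: load each JSON record, use an index field as the key,
# and pass the record through to the reducer as <key>\t<json>.
import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    key = record.get('index', 0)     # hypothetical index field
    sys.stdout.write('%s\t%s\n' % (key, json.dumps(record)))

The environment set-up commands for the EMR instances are below.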

#basic stats on the EC2 machine:
$ cat /proc/cpuinfo

$sudo yum -y install python-devel

$yum -y install python-devel
$yum -y install gcc
$yum -y install gcc-c++
$yum -y install subversion gcc-gfortran
$yum -y install fftw-devel swig
$yum -y install compat-gcc-34 compat-gcc-34-g77 compat-gcc-34-c++ compat-libstdc++-33 compat-db compat-readline43
$yum -y install hdf5-devel
$yum -y install readline-devel
$yum -y install python-numeric python-numarray Pyrex
$yum -y install python-psyco
$yum -y install wxPython-devel zlib-devel freetype-devel tk-devel tkinter gtk2-devel pygtk2-devel libpng-devel
$yum -y install octave

# To install all the available Python packages
$sudo yum -y install python-*

# To install numpy
$sudo yum -y install numpy

#Installing-R
$ sudo yum -y install make libX11-devel.* libICE-devel.* libSM-devel.* libdmx-devel.* libx* xorg-x11* libFS* libX* readline-devel gcc-gfortran gcc-c++ texinfo tetex
$ wget http://cran.r-project.org/src/base/R-2/R-2.13.1.tar.gz
$ tar zxf R-2.13.1.tar.gz && cd R-2.13.1
$ ./configure && make
$ ./configure --enable-R-shlib && make
$ # make coffee… or finish your PhD thesis… (yes, it takes that long)
$ # finally, if all is well:
$ sudo make install
$ cd
$ R --version

$ wget http://sourceforge.net/projects/rpy/files/rpy2/2.2.x/rpy2-2.2.3.tar.gz
$ tar zxvf rpy2-2.2.3.tar.gz
$ cd rpy2-2.2.3
$ sudo python setup.py install
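
After the install, a quick way to check whether rpy2 can find and talk to R is a tiny test from Python; if R_HOME or the R shared library is misconfigured, this is usually where it fails. A minimal check (nothing EMR-specific):

# Sanity check that rpy2 can locate R and evaluate an expression
import rpy2.robjects as robjects

print(robjects.r('R.version.string')[0])   # prints the R version string
print(robjects.r('sum(1:10)')[0])          # prints 55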

 

# Install easy_install – make life easier 🙂

$ sudo yum install python-setuptools
$ sudo easy_install simplejson                                                # so afterwards you can install python packages very easily

 

Related Links:

http://www.datawrangling.com/on-demand-mpi-cluster-with-python-and-ec2-part-1-of-3
http://www.r-bloggers.com/automating-r-scripts-on-amazon-ec2/
http://www.r-bloggers.com/installing-r-2-13-1-on-amazon-ec2%E2%80%B2s-%E2%80%9Camazon-linux%E2%80%9D-ami-rstats/

Python: Rpy2 with Python 2.6 and R 2.12.1 R_HOME path problem.

– Is it possible to use a customized AMI for Elastic MapReduce on AWS?

Elastic MapReduce doesn’t support custom AMIs at this time. The service instead has a feature called “Bootstrap Actions” that allows you to pass a reference to a script stored in Amazon S3, plus related arguments, to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265 (section “Creating a Job Flow with Bootstrap Actions”)

Setting Up for Amazon EC2

November 18, 2011

Here’s how to set up Amazon EC2 and connect to it using PuTTY:

1) Create an account in AWS
2) Generate your keys under Key Pairs

Input your key name, such as ec2, and save the value (the ec2.pem file), which is important.
Download PuTTYGen, import the ec2.pem file, and select “Type-of-key-to-generate”: SSH1.
Then save the private key.

 

3) Change your ‘Security Groups’ setting; for example, for the default one, add an “SSH” rule

 

4) Create an instance; when you come to the advanced settings, select your existing key and the default security group (since we have already updated the security group, this makes our instance accessible over SSH)

5) With the launched instance, use the Public DNS as PuTTY’s host name/IP,

 

6) go to Connection -> Data and enter “ec2-user” as the Auto-login username.

For an Ubuntu instance, use ‘ubuntu’ as the Auto-login username.

Categories: Amazon Web Services