Distributed cache file for ElasticMapReduce

January 19, 2012

The cache-file option is a good way to give an AWS Elastic MapReduce streaming job extra data, such as a parameter file or other side information, as opposed to the input data, which is fed to the mapper via stdin. If your input is gzipped, you can also add -jobconf stream.recordreader.compression=gzip to the extra arguments so that Hadoop decompresses the data on the fly before passing it to the mapper. Here's an example of how to specify the cache file in boto:

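from boto.emr.step import StreamingStep

job_name = '<your-job-name>'  # placeholder; used as the output prefix
mapper = 's3://<your-code-bucket>/mapper.py'
reducer = 's3://<your-code-bucket>/reducer.py'
input_mr = 's3://<your-input-data-bucket>'
output_mr = 's3://<your-output-bucket>/' + job_name

# Each entry is '<S3 path>#<local name>'; the file shows up under <local name>
# in the task's working directory. The path below is a placeholder.
cache_files = ['s3://<your-code-bucket>/<your-cache-file>#<local-name>']

# Extra Hadoop streaming arguments, including on-the-fly gzip decompression.
step_args = ['-jobconf', 'mapred.reduce.tasks=1', '-jobconf', 'mapred.map.tasks=2',
             '-jobconf', 'stream.recordreader.compression=gzip']

step = StreamingStep(name='my-step', mapper=mapper, reducer=reducer,
                     input=input_mr, output=output_mr,
                     step_args=step_args, cache_files=cache_files)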
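On the mapper side, the cached file can be opened like any local file, using the name given after the '#' in the cache_files entry. Here is a minimal sketch of what that might look like. The file name params.txt, the key=value format, the "keyword" parameter, and the helper names load_params and map_lines are all assumptions made for illustration, not part of the original job:

```python
import os
import sys

PARAM_FILE = 'params.txt'  # assumed local name (the part after '#' in cache_files)


def load_params(path):
    """Parse 'key=value' lines into a dict (hypothetical file format)."""
    params = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith('#'):
                continue
            key, _, value = line.partition('=')
            params[key.strip()] = value.strip()
    return params


def map_lines(lines, params):
    """Yield 'keyword<TAB>1' for every input line containing the keyword."""
    keyword = params.get('keyword', '')
    for line in lines:
        if keyword and keyword in line:
            yield '%s\t1' % keyword


# Act as a streaming mapper only when the cached file is actually present.
if __name__ == '__main__' and os.path.exists(PARAM_FILE):
    for record in map_lines(sys.stdin, load_params(PARAM_FILE)):
        print(record)
```

With a cache_files entry ending in #params.txt, Hadoop makes the file available under that name in each task's working directory, so the mapper does not need to know the S3 path.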

