Home > Hadoop, MapReduce > Hadoop Notes 9/10/12

Hadoop Notes 9/10/12

September 10, 2012 Leave a comment Go to comments

 * hadoop compression

There are several tools like
DEFLATE ( file extentiosn will be .deflate) this is what we often see
gzip
zip
LZO (optimize for speed)

There are also nice different options: -1 means optimize for spped and -9 mieans optimize for space. e.g. gzip -1 file

Another way of doing compression is to use ‘Codecs’
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.

For performance, it is recommended to use the native library for compression and decompression: DEFLATE, gzip, LZO.

If you are using a native library and doing a lot of compression or decompression in your application, consider using “Codecpool”.

For only compress the mapper output, an example( of also choosing the codec type):
conf.set(“mapred.compress.map.output”, “true”)
conf.set(“mapred.output.compression.type”, “BLOCK”);
conf.set(“mapred.map.output.compression.codec”, “org.apache.hadoop.io.compress.GzipCodec”);
Writable

In Hadoop, there are two basic data types:

WritableComparable   (base interface for keys)

Writable (base class interface for values), Writable is a general-purpose wrapper  for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types, e.g.

  • Text (strings),
  • BytesWritable (raw bytes)
  • IntWritable (Int)
  • LongWritable (Long)

Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. Text is a replacement for the UTF8 class. The length of a String is the number of char code units it contains. The length of a Text object is the number of bytes in its UTF-8 encoding. You could basically treat everything as Text, and hack it if it actually has a very complicate content. Also, using you are doing Int value, Text will cost your more time.

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream.

Writable collections

There are four Writable collection types in the org.apache.hadoop.io package: Array Writable, TwoDArrayWritable, MapWritable, and SortedMapWritable. ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (array of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction.

hadoop fs command has a -text option to display sequence files in textual form. e.g.
%hadoop fs -text number.seq | head

Advertisements
Categories: Hadoop, MapReduce
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: