Hadoop Notes 9/10/12
* hadoop compression
There are several compression tools, e.g.:
DEFLATE (file extension will be .deflate); this is what we often see
LZO (optimized for speed)
These tools also take compression-level options: -1 means optimize for speed and -9 means optimize for space, e.g. gzip -1 file
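The speed/space trade-off can be seen by compressing the same file at both levels (a minimal sketch; the file name sample.txt is hypothetical, and exact sizes depend on the input):

```shell
# Build a highly repetitive (very compressible) sample file.
printf 'hadoop %.0s' $(seq 1 10000) > sample.txt

gzip -1 -c sample.txt > fast.gz    # -1: optimize for speed
gzip -9 -c sample.txt > small.gz   # -9: optimize for space

# The -9 output should be no larger than the -1 output.
wc -c fast.gz small.gz
```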
Another way of doing compression is to use ‘Codecs’
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.
For performance, it is recommended to use a native library for compression and decompression; native implementations are available for DEFLATE, gzip, and LZO.
If you are using a native library and doing a lot of compression or decompression in your application, consider using CodecPool, which reuses compressors and decompressors rather than creating a new one for each use.
To compress only the map output (and also choose the codec type), for example:
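As a sketch of the configuration this refers to (property names as in Hadoop 2.x; older releases used mapred.compress.map.output and mapred.map.output.compression.codec instead, and GzipCodec here is just one possible choice):

```xml
<!-- Compress only the intermediate map output, not the final job output. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Choose the codec type used for the map output. -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```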
In Hadoop, there are two basic data types:
WritableComparable (base interface for keys)
Writable (base interface for values). ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. There are also Writable wrappers for the Java primitives, e.g.
- IntWritable (int)
- LongWritable (long)
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String, and it replaces the older UTF8 class. Note that the length of a String is the number of char code units it contains, whereas the length of a Text object is the number of bytes in its UTF-8 encoding. You could basically treat everything as Text and parse it yourself when the content is complicated; however, if your values really are ints, Text will cost you more time than IntWritable.
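The char-count vs. byte-count distinction is a property of UTF-8 itself and can be checked outside Hadoop; a quick sketch:

```shell
# U+20AC (the Euro sign) is one char in a Java String, but its UTF-8
# encoding (written here as explicit octal bytes) is three bytes long,
# so String length and Text length would disagree on it.
printf '\342\202\254' | wc -c    # byte count: 3
```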
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream.
There are four Writable collection types in the org.apache.hadoop.io package: ArrayWritable, TwoDArrayWritable, MapWritable, and SortedMapWritable. ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (arrays of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction.
The hadoop fs command has a -text option to display sequence files in textual form, e.g.
% hadoop fs -text number.seq | head