Basics for Pig

November 2, 2012

Pig is a powerful data-processing tool built on top of Hadoop; scripts written in its language, Pig Latin, are compiled into MapReduce jobs.

Some tutorial links:

http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#GROUP

 

Here are some basic notes for using it:

1) Load data

tmp_table = LOAD 'hdfs_data_file' USING PigStorage('\t') AS (key1:chararray, key2:int, key3:int);
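
The examples that follow refer to columns named age and is_male, which the schema above does not declare. Here is a minimal sketch of a load that would make them work. The file name people.tsv and the name column are hypothetical, and is_male is assumed to be stored as 0 or 1:

```pig
-- Hypothetical input: one person per line, tab-separated (name, age, is_male).
tmp_table = LOAD 'people.tsv' USING PigStorage('\t')
            AS (name:chararray, age:int, is_male:int);

-- Print the declared schema; DESCRIBE does not start a MapReduce job.
DESCRIBE tmp_table;
```

Running DESCRIBE right after LOAD is a cheap way to confirm the column names and types before writing the rest of the script.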

2) Check out the entire table:

MySQL > SELECT * FROM TMP_TABLE;

Pig > DUMP tmp_table;

3) Check out the first 50 rows (this will start a MapReduce job, so don't do it if your table is very big):

MySQL > SELECT * FROM TMP_TABLE LIMIT 50;

Pig > tmp_table_limit = LIMIT tmp_table 50;

DUMP tmp_table_limit;

4) Order Table > tmp_table_order = ORDER tmp_table BY age ASC;
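
ORDER and LIMIT are often combined to get a "top N" result. As a rough sketch, assuming tmp_table has an int column named age (the relation names sorted_by_age and oldest_10 are made up for illustration):

```pig
-- Sort by age, largest first.
sorted_by_age = ORDER tmp_table BY age DESC;

-- LIMIT applied directly after ORDER keeps the first rows in sorted order.
oldest_10 = LIMIT sorted_by_age 10;

DUMP oldest_10;
```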

5) Filter Table > tmp_table_where = FILTER tmp_table BY age > 20;
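
Conditions can be combined with AND, OR and NOT, much like a SQL WHERE clause. A small sketch, again assuming columns age and is_male (with is_male stored as 0 or 1); the relation name adult_males is hypothetical:

```pig
-- Roughly: SELECT * FROM TMP_TABLE WHERE AGE > 20 AND IS_MALE = 1;
-- Note that Pig uses == for equality, not a single =.
adult_males = FILTER tmp_table BY (age > 20) AND (is_male == 1);
```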

6) Inner Join

MySQL > SELECT * FROM TMP_TABLE A JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;

Pig > tmp_table_inner_join = JOIN tmp_table BY age, tmp_table_2 BY age;

7) Left Join

MySQL > SELECT * FROM TMP_TABLE A LEFT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;

Pig > tmp_table_left_join = JOIN tmp_table BY age LEFT OUTER, tmp_table_2 BY age;

8) Right Join

MySQL > SELECT * FROM TMP_TABLE A RIGHT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;

Pig > tmp_table_right_join = JOIN tmp_table BY age RIGHT OUTER, tmp_table_2 BY age;
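
After a join, the result contains the columns of both inputs, so a name that exists on both sides (here age) has to be qualified with the relation name and the :: operator. The sketch below assumes both tmp_table and tmp_table_2 were loaded with a schema that includes an int column age; the names joined_ages and age_2 are made up for illustration:

```pig
tmp_table_inner_join = JOIN tmp_table BY age, tmp_table_2 BY age;

-- Both inputs contribute a column called age, so qualify it as relation::field.
joined_ages = FOREACH tmp_table_inner_join
              GENERATE tmp_table::age AS age, tmp_table_2::age AS age_2;
```

The same qualification applies to the results of the left and right outer joins.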

9) Cross

MySQL > SELECT * FROM TMP_TABLE,TMP_TABLE_2;

Pig > tmp_table_cross = CROSS tmp_table,tmp_table_2;

10) GROUP BY

MySQL > SELECT * FROM TMP_TABLE GROUP BY IS_MALE;

Pig >  tmp_table_group = GROUP tmp_table BY is_male;

11) Group & Count

MySQL > SELECT IS_MALE,COUNT(*) FROM TMP_TABLE GROUP BY IS_MALE;

Pig >  tmp_table_group_count = GROUP tmp_table BY is_male;

tmp_table_group_count = FOREACH tmp_table_group_count GENERATE group,COUNT($1);
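
In the statement above, $1 refers by position to the second field of each grouped row, which is the bag of rows belonging to that group. Pig names that bag after the input relation, so the same count can be written with names instead of positions. This is only a sketch: the relation names grouped and counts and the alias n are made up, and is_male is assumed to be a column of tmp_table:

```pig
grouped = GROUP tmp_table BY is_male;

-- Each row of grouped is (group, tmp_table), where tmp_table is the bag
-- of original rows that share that is_male value.
counts = FOREACH grouped GENERATE group AS is_male, COUNT(tmp_table) AS n;

DUMP counts;
```

Giving the output fields names with AS makes them easier to reference in later statements.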

12) DISTINCT

MySQL > SELECT DISTINCT IS_MALE FROM TMP_TABLE;

Pig > tmp_table_distinct = FOREACH tmp_table GENERATE is_male;

tmp_table_distinct = DISTINCT tmp_table_distinct;

Here are some links if you want syntax highlighting for Pig scripts:

Emacs: https://github.com/cloudera/piglatin-mode

Just follow the directions there to add one line of code to your .emacs file.

VIM:  https://github.com/vim-scripts/pig.vim/blob/master/syntax/pig.vim

https://github.com/motus/pig.vim

If you have permission to change the system configuration, you can copy pig.vim into the system syntax folder, /usr/share/vim/vimXX/syntax, which already contains many .vim files.

Otherwise, you can simply set it up in your .vimrc file (create one in your home directory if it does not exist) by adding the following lines:

filetype on
syntax on
" Create a new filetype named pig for *.pig files.
au BufRead,BufNewFile *.pig set filetype=pig
" Read your syntax file for that filetype.
au! Syntax pig source your-path-to-pig.vim