The objective of this Pig tutorial is to get you up and running with Pig scripts on a real-world dataset stored in Hadoop.
- You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
- To install Hadoop, see my previous blog article on Hadoop Setup
- Your machine should have Java 1.6 installed
- It is assumed you have basic knowledge of Java programming and SQL.
- Basic knowledge of Linux will help you understand many of the Linux commands used in the tutorial.
- Download the Book-Crossing Dataset. This is the data set we will use. (An alternative link to the dataset is on the GitHub page of the tutorials.)
Setting up Pig
Pig has two modes of execution – local mode and MapReduce mode.
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass local to the -x (or -exectype) parameter when starting Pig. This starts the interactive shell called Grunt:
$ pig -x local
Executing Scripts in Batch Mode
Running in Local mode:
$ pig -x local BookXGroupByYear.pig
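The contents of BookXGroupByYear.pig are not listed in this article. A minimal sketch, assuming the script simply mirrors the interactive Grunt session used later in this tutorial (the paths, and the field names beyond those used in the queries, are assumptions based on the Book-Crossing column layout):

```shell
# Sketch only: write a BookXGroupByYear.pig that mirrors the Grunt session
# used later in this tutorial. Paths and the full field list are assumptions.
cat > BookXGroupByYear.pig <<'EOF'
BookXRecords = LOAD 'input/BX-BooksCorrected.txt' USING PigStorage(';')
    AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray,
        YearOfPublication:chararray, Publisher:chararray,
        ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
GroupByYear = GROUP BookXRecords BY YearOfPublication;
CountByYear = FOREACH GroupByYear
    GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
STORE CountByYear INTO 'output/pig_output_bookx' USING PigStorage('\t');
EOF
```

With the script in place, `pig -x local BookXGroupByYear.pig` runs the whole flow in one batch instead of typing each statement into Grunt.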
Note that Pig, in MapReduce mode, reads its input files from HDFS only, and stores the results back to HDFS.
Now we focus on solving a simple but real-world use case with Pig. This is the same problem that was solved in the previous blog articles (Step-by-step MapReduce Programming using Java, and Hive for Beginners using an SQL-like query).
The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a semicolon-delimited CSV file. Some sample rows:
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
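Before writing any Pig, the desired result can be sketched with plain shell tools: the publication year is the fourth semicolon-delimited field, and we want a count per distinct year. A quick sketch on the two sample rows above (row text shortened to the first five fields):

```shell
# Sketch: count books per publication year with awk -- the same aggregation
# the Pig script performs. The two rows are from the dataset excerpt above.
result=$(printf '%s\n' \
  '"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"' \
  '"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada"' \
| awk -F';' '{ gsub(/"/, "", $4); count[$4]++ }       # strip quotes, tally year
             END { for (y in count) print y ":" count[y] }' \
| sort)
echo "$result"
```

This is exactly the shape of output (year:count) the Pig flow below produces, just computed on two rows instead of the full dataset.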
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/\&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg" is changed to
"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"
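The $$$ trick can be checked in isolation on the Urumchi row shown above (shortened to three fields here): replacing every ';' with '$$$' and then turning '"$$$"' back into '";"' leaves only the semicolons inside quoted fields escaped.

```shell
# Sketch: reproduce the semicolon-escaping trick from the sed pipeline above.
# Only the ';' inside the quoted title survives as '$$$'; the real field
# separators are restored by the second substitution.
line='"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber"'
fixed=$(printf '%s\n' "$line" | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g')
echo "$fixed"
```

PigStorage(';') can then split on ';' safely, because no field content contains a bare semicolon any more.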
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.txt input
grunt> BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected.txt'
>> USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray,
>> BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray,
>> ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
2012-11-05 01:09:11,554 [main] WARN org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> GroupByYear = GROUP BookXRecords BY YearOfPublication;
2012-11-05 01:09:11,810 [main] WARN org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
2012-11-05 01:09:11,996 [main] WARN org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> STORE CountByYear
>> INTO '/user/work/output/pig_output_bookx' USING PigStorage('\t');
The username ("work" in our example) in the STORE path depends on the Hadoop setup on your machine and the username it runs under. The output of the above Pig run is stored in the output/pig_output_bookx folder on HDFS. It can be displayed on the screen with:
$ hadoop fs -cat output/pig_output_bookx/part-r-00000
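Because of the CONCAT expression in the FOREACH, each line of the part file has the form year:count. Once the file is copied locally (e.g. with hadoop fs -get), the busiest years can be listed with sort; the counts below are made-up placeholders, not real dataset values:

```shell
# Sketch: the part file holds one "year:count" line per group (values here
# are made up for illustration). Sort numerically, descending, on the count.
printf '%s\n' '2002:13903' '1999:11801' '2001:13715' > part-r-00000
sort -t':' -k2,2 -nr part-r-00000
```

The first line of the sorted output is then the year with the most published books in the sample.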
The snapshot of the output of the Pig flow is shown below:
Comparison with Java MapReduce and Hive
You can see the above output and compare it with the outputs from the MapReduce code in the step-by-step MapReduce guide and the Hive for Beginners blog post. Let's take a look at how the Pig code differs from the Java MapReduce code and the Hive query for the same solution:
It is clear from the above that high-level abstractions such as Hive and Pig reduce both the programming effort required and the complexity of learning and writing MapReduce code. In the small example above, we went from roughly 25 lines of code for Java to 3 for Hive and 4 for Pig.
In this tutorial we learned how to set up Pig and run Pig Latin queries. We revisited the problem already solved with Java MapReduce in the step-by-step MapReduce guide and with HiveQL in Hive for Beginners, and saw how the programming effort is reduced with Pig Latin. Stay tuned for more exciting tutorials from the small world of Big Data.