
We are hiring in Chennai, India

Orzota, an exciting Big Data startup in Silicon Valley, is hiring Java hackers at its India Development Center in Chennai. We are building Big Data platforms and solutions, and we would like to hear from you and your colleagues who would be excited to work on Hadoop and associated technologies.

So, if you are a superb Java programmer who
– has 1+ years of experience in Java
– has worked on application servers, database backends, and SQL
– is comfortable on Linux or similar operating systems
– can hack in Python/shell (preferred)

we would like to hear from you. We need you to augment our team and get immersed in Hadoop and related technologies.

Please send your resume to india-careers@orzota.com


Sai Sivam, Engineering Manager
 

Pig Tutorial for Beginners

Pig is a data flow platform for writing Hadoop operations in a language called Pig Latin. It adds a layer of abstraction on top of Hadoop that simplifies its use by providing a SQL-like interface for processing data, letting the programmer focus on business logic and increasing productivity. It supports a variety of data types and the use of user-defined functions (UDFs) to write custom operations in Java, Python, and JavaScript. Due to its simple interface and support for complex operations such as joins and filters, Pig is popular for performing query operations in Hadoop.
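As a rough illustration of the Java UDF support mentioned above, a minimal UDF might look like the sketch below. The class name UpperCaseUDF and its behavior are purely illustrative and are not used elsewhere in this tutorial.

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // Hypothetical example UDF: upper-cases a chararray field.
  public class UpperCaseUDF extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0 || input.get(0) == null) {
              return null;
          }
          return ((String) input.get(0)).toUpperCase();
      }
  }

After packaging such a class into a jar, it could be registered from Grunt with the REGISTER statement and invoked inside a FOREACH ... GENERATE expression.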

Objective


The objective of this Pig tutorial is to get you up and running with Pig scripts on a real-world dataset stored in Hadoop.

Prerequisites

The following are the prerequisites for setting up Pig and running Pig scripts:
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3).
  • To install Hadoop, see my previous blog article on Hadoop Setup.
  • Your machine should have Java 1.6 installed.
  • It is assumed you have basic knowledge of Java programming and SQL.
  • Basic knowledge of Linux will help you understand many of the Linux commands used in the tutorial.
  • Download the Book Crossing DataSet. This is the data set we will use. (An alternative link to the dataset is on the github page of the Tutorials.)

Setting up Pig

Platform

This Pig tutorial assumes Linux/Mac OS X. If using Windows, please install Cygwin. It is required for shell support in addition to the required software above.

Procedure

Download a stable tarball from one of the Apache download mirrors (for our tutorial we used pig-0.10.0.tar.gz, ~50 MB, which works with Hadoop 0.20.X, 1.0.X, and 0.23.X).
Unpack the tarball in the directory of your choice, using the following command:
  $ tar -xzvf pig-x.y.z.tar.gz
Set the environment variable PIG_HOME to point to the installation directory for convenience:
You can either do
  $ cd pig-x.y.z
  $ export PIG_HOME=`pwd`
or set PIG_HOME in $HOME/.profile so it will be set every time you login.
Add the following line to it.
  export PIG_HOME=<path_to_pig_home_directory>
e.g.
  export PIG_HOME='/Users/Work/pig-0.10.0'
  export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH
Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally.
  $ export JAVA_HOME=<path_to_java_installation_directory>

Execution Modes

Pig has two modes of execution – local mode and MapReduce mode.

Local Mode

Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass local as the value of the -x (or -exectype) parameter when starting Pig. This starts the interactive shell called Grunt:
  $ pig -x local
  grunt>

MapReduce Mode

In this mode, Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster. The cluster can be a pseudo-distributed or fully distributed cluster.
First, check the compatibility of the Pig and Hadoop versions being used. The compatibility details are given on the Pig release page (for our tutorial, the Hadoop version is 1.0.3 and the Pig version is 0.10.0, which is compatible with the Hadoop version we are using).
Next, export the PIG_CLASSPATH variable to add the Hadoop conf directory:
  $ export PIG_CLASSPATH=$HADOOP_HOME/conf/
After exporting the PIG_CLASSPATH, run the pig command, as shown below
  $ pig
  [main] INFO  org.apache.pig.Main – Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
  [main] INFO  org.apache.pig.Main – Logging error messages to: /Users/varadmeru/pig_1351858332488.log
  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://localhost:9000
  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: localhost:9001
  grunt> 
You can see from the log reports that Pig states the filesystem and job tracker it connected to. Grunt is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script, via Grunt, or by embedding the script into Java code (a minimal sketch of the embedding approach appears below). Running on the interactive shell is shown in the Problem section. To run a batch of Pig statements, it is recommended to place them in a single script file and execute it in batch mode.
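For completeness, here is a hedged sketch of the third option, embedding Pig in Java through the PigServer API. The class name BookXEmbedded, the relative input/output paths, and the exact queries are illustrative only; they mirror the statements developed later in this tutorial rather than code shipped with it.

  import java.io.IOException;
  import org.apache.pig.ExecType;
  import org.apache.pig.PigServer;

  // Hypothetical embedded-Pig driver; paths and alias names are only examples.
  public class BookXEmbedded {
      public static void main(String[] args) throws IOException {
          PigServer pig = new PigServer(ExecType.MAPREDUCE);
          pig.registerQuery("BookXRecords = LOAD 'input/BX-BooksCorrected.txt' "
              + "USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, "
              + "BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray, "
              + "ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);");
          pig.registerQuery("GroupByYear = GROUP BookXRecords BY YearOfPublication;");
          pig.registerQuery("CountByYear = FOREACH GroupByYear GENERATE "
              + "CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));");
          // store() triggers execution and writes the result to HDFS.
          pig.store("CountByYear", "output/pig_output_bookx");
      }
  }

Compiled against the Pig and Hadoop jars, such a class could be run like any other Java program; PigServer builds the logical plan from the registered queries, and the store() call triggers the MapReduce execution.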

Executing Scripts in Batch Mode

Running in Local mode:
$ pig -x local BookXGroupByYear.pig
Note that in local mode Pig reads from and writes to the local filesystem, whereas in MapReduce mode it reads its input from HDFS and stores the results back to HDFS.

To run a Pig script file in MapReduce mode:
  $ pig -x mapreduce BookXGroupByYear.pig
OR
  $ pig BookXGroupByYear.pig
It's good practice to give the file a ".pig" extension for clarity and maintainability. The BookXGroupByYear.pig file is available on the github page of the Tutorials for your reference.

Now we focus on solving a simple but real-world use case with Pig. This is the same problem that was solved in the previous blog articles (Step-by-step MapReduce Programming using Java, and Hive for Beginners).

Problem
The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
  …
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year.
Now, as our data is not cleansed and might give us erroneous results, we clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
The sed commands replace the ";" (semicolon) characters appearing inside the field content with $$$, while the actual field delimiters (the ";" between quoted fields) are restored. The pattern "&amp;" is also replaced with a plain '&', and the first line (the header line) is removed. If we don't remove the header line, it will be processed as part of the data, which it isn't.
All the above steps are required to cleanse the data and give accurate results.

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.txt input
Note that Pig, in MapReduce mode, takes its input files from HDFS only and stores the results back to HDFS.
Running Pig Flow using the command line:
$ pig
grunt> BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt'
>> USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,
>> BookAuthor:chararray,YearOfPublication:chararray,
>> Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
2012-11-05 01:09:11,554 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> GroupByYear = GROUP BookXRecords BY YearOfPublication;
2012-11-05 01:09:11,810 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
2012-11-05 01:09:11,996 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> STORE CountByYear
>> INTO '/user/work/output/pig_output_bookx' USING PigStorage('\t');

The username ("work" in our example) in the STORE statement depends on the Hadoop setup on your machine and the username of the Hadoop setup. The output of the above Pig run is stored in the output/pig_output_bookx folder on HDFS. It can be displayed on the screen by:

  $ hadoop fs -cat output/pig_output_bookx/part-r-00000

Output

The snapshot of the output of the Pig flow is shown below:

(Screenshot: output of the Pig flow)

Comparison with Java MapReduce and Hive

You can see the above output and compare it with the outputs from the MapReduce code in the step-by-step MapReduce guide and the Hive for Beginners blog post. Let's take a look at how the code for Pig differs from the Java MapReduce code and the Hive query for the same solution:

Mapper
(Screenshot: Java mapper code; see the MapReduce tutorial below)

Reducer
(Screenshot: Java reducer code; see the MapReduce tutorial below)

Hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
COMMENT 'BX-Books Table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
Pig
BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt' USING PigStorage(';')
AS (ISBN:chararray,
BookTitle:chararray,
BookAuthor:chararray,
YearOfPublication:chararray,
Publisher:chararray,
ImageURLS:chararray,
ImageURLM:chararray,
ImageURLL:chararray);

GroupByYear = GROUP BookXRecords BY YearOfPublication;
CountByYear = FOREACH GroupByYear
GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear INTO '/user/varadmeru/output/pig_output_bookx'
USING PigStorage('\t');

It is clear from the above that high-level abstractions such as Hive and Pig reduce the programming effort required as well as the complexity of learning and writing MapReduce code. In this small example, we reduced roughly 25 lines of Java code to 3 lines of Hive and 4 lines of Pig.

Conclusion

In this tutorial we learned how to set up Pig and run Pig Latin queries. We wrote the query for the same problem that we solved with MapReduce in the step-by-step MapReduce guide and with Hive in Hive for Beginners, and compared how the programming effort is reduced with the use of Pig Latin. Stay tuned for more exciting tutorials from the small world of Big Data.

References

1. We referred to the Pig Latin Basics and Built-in Functions guides from the Apache Pig page.

2. We referred to the Getting Started guide from the Apache Pig page.

Hive Tutorial for Beginners

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

Objective

The objective of this Hive tutorial is to get you up and running Hive queries on a real-world dataset.

Prerequisites

The following are the prerequisites for setting up Hive and running Hive queries:
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3).
  • To install Hadoop, see my previous blog article on Hadoop Setup.
  • Your machine should have Java 1.6 installed.
  • It is assumed you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Basic knowledge of Linux will help you understand many of the Linux commands used in the tutorial.
  • Download the Book Crossing DataSet. This is the data set we will use. (An alternative link to the dataset is on the github page of the Tutorials.)

Setting up Hive

Platform

This Hive tutorial assumes Linux. If using Windows, please install Cygwin. It is required for shell support in addition to the required software above.

Procedure

Download the most recent stable release of Hive as a tarball from one of the Apache download mirrors. For our Hive tutorial, we are going to use hive-0.9.0.tar.gz.
Unpack the tarball in the directory of your choice, using the following command 
  $ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
You can either do
  $ cd hive-x.y.z
  $ export HIVE_HOME=`pwd`
or set HIVE_HOME in $HOME/.profile so it will be set every time you login.
Add the following line to it.
  export HIVE_HOME=<path_to_hive_home_directory>
e.g.
  export HIVE_HOME='/Users/Work/hive-0.9.0'
  export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
Start Hadoop (refer to the Single-Node Hadoop Setup Guide for more information). It should show the processes being started. You can check the started processes using the jps command:
$ start-all.sh
<< Starting various hadoop processes >>
$ jps
  3097 Jps
  2355 RunJar
  2984 JobTracker
  2919 SecondaryNameNode
  2831 DataNode
  2743 NameNode
  3075 TaskTracker
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set appropriate permissions in HDFS before a table can be created in Hive, as shown below:
  $ hadoop fs -mkdir /tmp
  $ hadoop fs -mkdir /user/hive/warehouse
  $ hadoop fs -chmod g+w /tmp
  $ hadoop fs -chmod g+w /user/hive/warehouse

Problem

The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year. This is the same problem that was solved in the previous blog post (Step-by-step MapReduce Programming using Java).
Now, as our data is not cleansed and might give us erroneous results, we clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.csv
The sed commands replace the ";" (semicolon) characters appearing inside the field content with $$$, while the actual field delimiters (the ";" between quoted fields) are restored. The pattern "&amp;" is also replaced with a plain '&', and the first line (the header line) is removed. If we don't remove the header line, Hive will process it as part of the data, which it isn't.
All the above steps are required to cleanse the data and help Hive give accurate results for our queries.

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input
Running Hive using the command line:
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet 
    >   (ISBN STRING, 
    >   BookTitle STRING, 
    >   BookAuthor STRING, 
    >   YearOfPublication STRING, 
    >   Publisher STRING, 
    >   ImageURLS STRING, 
    >   ImageURLM STRING, 
    >   ImageURLL STRING) 
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED  
    > FIELDS TERMINATED BY ';' 
    > STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

The username ("work" in our example) in the second query depends on the Hadoop setup on your machine and the username of the Hadoop setup.

Output

The output of the query is shown below:

(Screenshot: output of the Hive query)

Comparison with Java MapReduce

You can see the above output and compare it with the output of the MapReduce code from the previous blog entry. Let's take a look at how the code for Hive differs from MapReduce:

Mapper
(Screenshot: Java mapper code; see the MapReduce tutorial below)

Reducer
(Screenshot: Java reducer code; see the MapReduce tutorial below)

hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
COMMENT 'BX-Books Table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

It is clear from the above that Hive reduces the programming effort required as well as the complexity of learning and writing MapReduce code. In the small example above, we reduced the lines of code from roughly 25 to 3.

Conclusion

In this tutorial we learned how to set up Hive and run Hive queries. We saw the query for the same problem statement that we solved with MapReduce and compared how the programming effort is reduced with the use of HiveQL. Stay tuned for more exciting tutorials from the small world of Big Data.

 

MapReduce Tutorial

Objective

We will learn the following things with this step-by-step MapReduce tutorial
  • MapReduce programming with a column-delimited (CSV) input file
  • Writing a map function
  • Writing a reduce function
  • Writing the Driver class

Prerequisites

The following are the prerequisites for writing MapReduce programs using Apache Hadoop:
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3).
  • To install Hadoop, see my previous blog article.
  • You should have Eclipse installed on your machine. Any Eclipse version before 3.6 is compatible with the Eclipse plugin (it does not work with Eclipse 3.7). Please refer to Eclipse Setup for Hadoop Development for details.
  • It is expected that you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Download the Book Crossing DataSet. (An alternative link to the dataset is on the github page of the Tutorials.)

Problem

The problem we are trying to solve through this MapReduce tutorial is to find the frequency of books published each year. Our input data set is a csv file which looks like this:

Sample Rows from the input file BX-Books.csv 
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of Books Published each year.

Procedure

1. Open Eclipse in the MapReduce Perspective, as shown below:
(Screenshot: Eclipse in the MapReduce perspective)
2. Create a New MapReduce Project, as shown below:
(Screenshot: creating a new MapReduce project)
and fill in the details of the Project. For our sample project, we have named it "BookCrossingData".
(Screenshot: New MapReduce Project details)
3. Create New Mapper in the BookCrossingData Project 
(Screenshot: New Mapper wizard in the BookCrossingData project)
4. Write the map method as shown below:
(Screenshot: the map method)
The BookXMapper class contains a map function which reads each record using the default record reader. For each record, the key is the byte offset of the line within the file (the cumulative character count up to that line) and the value is the whole line, up to the newline character, as a Text. Using the split() method of the String class, we split the line on the delimiter (";" in our case) to get an array of strings. The 4th entry of the array is the "Year-Of-Publication" field, which becomes the output key of the mapper; the output value is a count of 1. A hedged sketch of such a mapper is shown below.
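The map method itself appeared only as a screenshot in the original post. The following is a rough, hedged reconstruction of what such a mapper could look like using the org.apache.hadoop.mapreduce API; the actual code in the downloadable project may differ in details such as quote handling.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Emits (Year-Of-Publication, 1) for every input record.
  public class BookXMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text year = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // key = byte offset of the line, value = the whole line.
          String[] fields = value.toString().split(";");
          if (fields.length > 3) {
              year.set(fields[3]); // 4th field: "Year-Of-Publication"
              context.write(year, ONE);
          }
      }
  }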
5. Create New Reducer in the BookCrossingData Project 
(Screenshot: New Reducer wizard in the BookCrossingData project)
Write the reduce method as shown below:
(Screenshot: the reduce method)

The BookXReducer class contains a reduce method which takes as parameters a key and an iterable list of values (the values grouped for that key). For our program, we reuse the key from the mapper as the output key of the reducer and add up the individual values from the list. Remember, the output value of the mapper was a new IntWritable(1); adding all those occurrences gives the count of books published for that particular key (the year of publication). A hedged sketch of such a reducer follows.
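As with the mapper, the reduce method was shown as a screenshot; a comparable reducer matching the description above might be sketched as follows (again a reconstruction, not the exact code from the project):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Sums the 1s emitted by the mapper to get the count of books per year.
  public class BookXReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int count = 0;
          for (IntWritable value : values) {
              count += value.get();
          }
          context.write(key, new IntWritable(count));
      }
  }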

6. Create a class named BookXDriver. This class will be our main program to run the MapReduce job we have just written.
(Screenshot: the BookXDriver class)
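The driver was likewise shown only as a screenshot. A minimal sketch of a driver that wires the mapper and reducer together could look like the following; the job name and configuration details are illustrative:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Configures and submits the BookCrossing year-frequency job.
  public class BookXDriver {
      public static void main(String[] args) throws Exception {
          Job job = new Job();
          job.setJarByClass(BookXDriver.class);
          job.setJobName("BookCrossing year frequency");
          job.setMapperClass(BookXMapper.class);
          job.setReducerClass(BookXReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          // args[0] = input path, args[1] = output path (must not already exist).
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }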
7. To run in Eclipse
  • Set the Run Configurations from the menu option of the Project (Right-click on the project)

(Screenshot: Eclipse Run Configurations)

  • In the arguments for a project, set the paths of the input file/directory and the expected output directory. Make sure that the output directory does not exist, as it would throw an error if it already exists.

(Screenshot: run arguments for the project)

  • Run and see the console. The output directory contains two files. The file with the prefix "part-" is the actual output of your MapReduce logic. The file named "_SUCCESS" is just a marker to signify a successful run.

8. To run on a Hadoop cluster (the size of the cluster does not affect how the job runs; it can be single-node or multi-node)

  • Make sure your cluster is started.
  • Export the Eclipse project as a runnable JAR. Right-click the project and you will find the export option with configurations for exporting it as a runnable JAR. For our example, we named the jar file BookCrossingJar.jar.
  • Upload your Book Crossing dataset to HDFS by the following command

$ hadoop fs -put ~/Work/HadoopDev/Input/BX-Books.csv input

  • Run the jar with the HDFS path of the file as the parameter passed to the hadoop command along with the expected path of the output. Make sure that the output directory does not exist, as it would throw an error if it already exists.

$ hadoop jar ~/Work/HadoopDev/BookCrossingJar.jar input output

  • The output is generated and can be seen using the following command:

$ hadoop fs -cat output/*

Download

The running code for this tutorial is present at the github location of the tutorials at https://github.com/Orzota/tutorials.

Conclusion

In this tutorial we learned how to write a mapper, a reducer, and the driver class for running MapReduce programs. We learned two ways of running our MapReduce logic – one using Eclipse, which is suitable for local debugging, and the other using a single-node Hadoop cluster for real-world execution. We also learned about some basic input and output data types.
Stay tuned for more exciting tutorials from the small world of Big Data.