MapReduce Tutorial

by Shanti Subramanyam for Blog

Objective

We will learn the following in this step-by-step MapReduce tutorial:
  • MapReduce programming on a single column of a CSV file
  • Writing a map function
  • Writing a reduce function
  • Writing the Driver class

Prerequisites

The following are the prerequisites for writing MapReduce programs using Apache Hadoop:
  • You should have the latest stable build of Hadoop (1.0.3 at the time of writing).
  • To install Hadoop, see my previous blog article.
  • You should have Eclipse installed on your machine. Eclipse versions up to 3.6 are compatible with the Eclipse plugin (it does not work with Eclipse 3.7). Please refer to Eclipse Setup for Hadoop Development for details.
  • You are expected to have some knowledge of Java programming and to be familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Download the Book Crossing DataSet. (Alternative link to the Dataset at the github page of the Tutorials)

Problem

The problem we solve in this MapReduce tutorial is counting how many books were published in each year. Our input data set is a CSV file that looks like this:

Sample rows from the input file BX-Books.csv:
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row; the other rows are sample records from the file. Our objective is to count the books published in each year.

Procedure

1. Open Eclipse in the MapReduce perspective:
[Screenshot: Eclipse in the MapReduce perspective]
2. Create a new MapReduce project:
[Screenshot: creating a new MapReduce project]
Fill in the details of the project. For our sample project, we have named it “BookCrossingData”.
[Screenshot: project details for BookCrossingData]
3. Create a new Mapper in the BookCrossingData project:
[Screenshot: creating a new Mapper]
4. Write the map method:
[Screenshot: the map method]
The BookXMapper class contains a map method that receives each record from the default record reader. For every line, the key is the byte offset at which the line starts in the file, and the value is the whole line, as a Text, up to the newline character. Using the split() method of the String class with our delimiter (“;”), we get an array of strings. The fourth entry of the array (index 3) is the “Year-Of-Publication” field, which becomes the output key of the mapper; the output value is new IntWritable(1).
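
The original map method is not reproduced in this text, so the code below is not the author's BookXMapper. It is a minimal plain-Java sketch of the per-line logic described above, with Hadoop's Text and IntWritable types left out so that it runs on its own. The class name YearExtractor, the method name extractYear, and the choice to skip the header row and short lines are assumptions made for illustration.

```java
// Hypothetical helper, not part of the tutorial's code: the per-line logic
// of the map step without any Hadoop types.
class YearExtractor {

    // Returns the year of publication for a data record, or null for the
    // header row and for lines with fewer than four fields (an assumption;
    // the tutorial does not say how such lines are handled).
    static String extractYear(String line) {
        String[] fields = line.split(";");
        if (fields.length < 4) {
            return null;
        }
        // Index 3 is the fourth field; drop the surrounding double quotes.
        String year = fields[3].replace("\"", "").trim();
        if (year.equals("Year-Of-Publication")) {
            return null; // header row
        }
        return year;
    }

    public static void main(String[] args) {
        String record = "\"0195153448\";\"Classical Mythology\";"
                + "\"Mark P. O. Morford\";\"2002\";\"Oxford University Press\"";
        // In the real mapper this pair would be written to the context as
        // (year, 1); here it is simply printed.
        System.out.println(extractYear(record) + "\t1");
    }
}
```

One caveat of splitting on “;” alone: if a quoted field itself contains a semicolon, the fields shift and index 3 no longer holds the year. A real job may want to validate the extracted value before emitting it.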
5. Create a new Reducer in the BookCrossingData project:
[Screenshot: creating a new Reducer]
Write the reduce method:
[Screenshot: the reduce method]

The BookXReducer class contains a reduce method that takes two parameters: a key and an iterable list of values (all the values emitted for that key). We reuse the mapper's key as the output key of the reducer and sum the individual values in the list. Recall that the mapper emitted new IntWritable(1) for every record, so adding up all of those ones gives the number of books published for that key (the year of publication).
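
As with the mapper, the original reduce method is not reproduced here. The following is a small plain-Java sketch of the summing step only, using Integer in place of IntWritable; the class name CountSummer and the method name sum are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the body of the reduce method: add up every
// value that was grouped under a single key.
class CountSummer {

    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three records for the same year, each emitted by the mapper as a 1.
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println("2002\t" + sum(ones));
    }
}
```

Summing the values, rather than counting the elements, also stays correct if the values are ever partial counts instead of ones.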

6. Create a class named BookXDriver. This class will be our main program to run the MapReduce job we have just written.
[Screenshot: the BookXDriver class]
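
The driver's Hadoop configuration is not reproduced in this text, and the sketch below does not attempt to recreate it. As a rough stand-in, it simulates in memory the flow that the driver wires together: map each line to a year, group by year, and sum the counts. The class name LocalJobSketch, the method name countByYear, the header-skipping rule, and the placeholder records in main are all assumptions for illustration; no Hadoop classes are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the job, not the tutorial's driver.
class LocalJobSketch {

    static Map<String, Integer> countByYear(List<String> lines) {
        // A TreeMap keeps the years sorted, similar to the sorted keys a
        // reducer receives.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: pull the fourth field out of the line.
            String[] fields = line.split(";");
            if (fields.length < 4) {
                continue;
            }
            String year = fields[3].replace("\"", "").trim();
            if (year.equals("Year-Of-Publication")) {
                continue; // skip the header row (an assumption)
            }
            // Grouping and "reduce" step in one call: add 1 to this year's total.
            counts.merge(year, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder records with made-up field values; only the year matters.
        List<String> lines = Arrays.asList(
                "\"ISBN\";\"Book-Title\";\"Book-Author\";\"Year-Of-Publication\";\"Publisher\"",
                "\"isbn-a\";\"title-a\";\"author-a\";\"2002\";\"publisher-a\"",
                "\"isbn-b\";\"title-b\";\"author-b\";\"2001\";\"publisher-b\"",
                "\"isbn-c\";\"title-c\";\"author-c\";\"2002\";\"publisher-c\"");
        for (Map.Entry<String, Integer> entry : countByYear(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Running a simulation like this on a handful of lines can be a quick sanity check of the parsing logic before submitting the real job.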
7. To run in Eclipse:
  • Open the Run Configurations from the project's context menu (right-click the project).

[Screenshot: Run Configurations in Eclipse]

  • In the arguments for the project, set the paths of the input file/directory and the expected output directory. Make sure the output directory does not already exist; the job fails with an error if it does.

[Screenshot: arguments for the project]

  • Run the job and watch the console. When it finishes, the output directory contains two files. The file with the prefix “part-” holds the actual output of your MapReduce logic. The file named “_SUCCESS” is just a marker signifying a successful run.

8. To run on a Hadoop cluster (the size of the cluster does not matter; it can be single-node or multi-node):

  • Make sure your cluster is started.
  • Export the Eclipse project as a runnable JAR. Right-click the project and choose the export option, which offers the configuration for a runnable JAR. For our example, we created a JAR file named BookCrossingJar.jar.
  • Upload your Book Crossing data set to HDFS with the following command:

$ hadoop fs -put ~/Work/HadoopDev/Input/BX-Books.csv input

  • Run the JAR, passing the HDFS path of the input file and the expected path of the output as arguments to the hadoop command. Make sure the output directory does not already exist; the job fails with an error if it does.

$ hadoop jar ~/Work/HadoopDev/BookCrossingJar.jar input output

  • Once the job finishes, the output can be viewed with the following command:

$ hadoop fs -cat output/*

Download

The working code for this tutorial is available in the tutorials repository on GitHub at https://github.com/Orzota/tutorials.

Conclusion

In this tutorial we learned how to write a mapper, a reducer and the driver class for running MapReduce programs. We also learned two ways of running our MapReduce logic: one using Eclipse, which is suitable for local debugging, and the other on a single-node Hadoop cluster for real-world execution. Along the way we covered some basic input and output data types.
Stay tuned for more exciting tutorials from the Small world of BigData.