- MapReduce programming with the Book Crossing dataset
- Writing a map function
- Writing a reduce function
- Writing the Driver class
- You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
- To install Hadoop, see my previous blog article.
- You should have Eclipse installed on your machine. Any Eclipse version up to and including 3.6 is compatible with the Hadoop Eclipse plugin (it doesn't work with Eclipse 3.7). Please refer to Eclipse Setup for Hadoop Development for details.
- It is expected that you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
- Download the Book Crossing dataset. (An alternative link to the dataset is available on the GitHub page of the tutorials.)
The problem we are trying to solve through this MapReduce tutorial is to find the frequency of books published each year. Our input dataset is a CSV file which looks like this:
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
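The map step only needs to pull the year of publication out of each record. As a minimal plain-Java sketch of that parsing (independent of the Hadoop API; the class name, the field index, and the quote-stripping are our assumptions based on the sample rows above):

```java
// Sketch: extract the year-of-publication field from one Book Crossing CSV row.
// Assumes the ';'-delimited, '"'-quoted layout shown in the sample lines above,
// with the year in the fourth column (index 3). A naive split on ';' is enough
// for this illustration, though a real mapper might need more robust parsing.
public class YearExtractor {
    public static String extractYear(String csvLine) {
        String[] fields = csvLine.split(";");
        if (fields.length < 4) {
            return null; // malformed row; a real mapper would simply skip it
        }
        // Strip the surrounding quotes and whitespace from the year field.
        return fields[3].replace("\"", "").trim();
    }

    public static void main(String[] args) {
        String row = "\"0195153448\";\"Classical Mythology\";\"Mark P. O. Morford\";\"2002\";\"Oxford University Press\"";
        System.out.println(extractYear(row)); // prints 2002
    }
}
```

In the actual mapper, this extracted year becomes the output key, emitted alongside a count of 1 for each record.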
The BookXReducer class contains a reduce method which takes two parameters: a key and an iterable list of values (the values grouped under that key). For our program, we reuse the key from the mapper as the output key of the reducer and add up each individual value from the list. Remember, the output value of the mapper was a new IntWritable(1). We add all the occurrences of new IntWritable(1) to get the count of books published for that particular key (the year of publication).
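Stripped of the Hadoop types, the reducer's summing logic amounts to the following plain-Java sketch (the class and method names here are ours, not part of the Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the reduce step: sum the list of 1s that the mapper emitted
// for a single key (a year of publication).
public class CountSum {
    public static int sumCounts(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // each value is the 1 emitted by the mapper for one book
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three books mapped to the same year each contributed a 1.
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println(sumCounts(ones)); // prints 3
    }
}
```

In the real reducer, the same loop runs over `Iterable<IntWritable>`, calling `get()` on each value, and the sum is wrapped back into an IntWritable before being written out.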
- Open Run Configurations from the project's context menu (right-click the project).
- In the program arguments, set the paths of the input file/directory and the expected output directory. Make sure that the output directory does not exist; the job throws an error if it already exists.
- Run the program and check the console. The output directory contains two files. The file prefixed "part-" is the actual output of your MapReduce logic, as shown below. The file named "_SUCCESS" is just a marker signifying a successful run.
8. To run on a Hadoop cluster (the size of the cluster does not matter; it can be single-node or multi-node):
- Make sure your cluster is started.
- Export the Eclipse project as a Runnable JAR. Right-click the project, choose Export, and select the Runnable JAR file option to configure the export. For our example, we named the JAR file BookCrossingJar.jar.
- Upload your Book Crossing dataset to HDFS with the following command:
$ hadoop fs -put ~/Work/HadoopDev/Input/BX-Books.csv input
- Run the JAR, passing the HDFS path of the input file and the expected output path as arguments to the hadoop command. Make sure that the output directory does not exist; the job throws an error if it already exists.
$ hadoop jar ~/Work/HadoopDev/BookCrossingJar.jar input output
- The output is generated and can be viewed with the following command:
$ hadoop fs -cat output/*
The complete working code for this tutorial is available on GitHub at https://github.com/Orzota/tutorials.