Author: Shanti Subramanyam


We will learn the following things with this step-by-step MapReduce tutorial
  • MapReduce programming with a column
  • Writing a map function
  • Writing a reduce function
  • Writing the Driver class


The following are the prerequisites for writing MapReduce programs using Apache Hadoop
  • You should have the latest stable build of Hadoop (1.0.3 at the time of writing).
  • To install Hadoop, see my previous blog article.
  • You should have Eclipse installed on your machine. Any Eclipse version before 3.6 is compatible with the Eclipse plugin; the plugin does not work with Eclipse 3.7. Please refer to Eclipse Setup for Hadoop Development for details.
  • It is expected that you have some knowledge of Java programming and are familiar with the concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Download the Book Crossing DataSet. (Alternative link to the Dataset at the github page of the Tutorials)


The problem we are trying to solve in this MapReduce tutorial is to find the number of books published each year. Our input data set is a CSV file whose fields are separated by semicolons. It looks like this:

Sample rows from the input file BX-Books.csv
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
The first row of the file is a header row, which is not shown here. The rows above are sample records from the file.


1. Open Eclipse in the MapReduce perspective.
2. Create a new MapReduce project and fill in the details of the project. For our sample project, we have named it “BookCrossingData”.
3. Create a new Mapper in the BookCrossingData project.
4. Write the map method. The complete source is in the tutorial’s GitHub repository, linked at the end of this tutorial.
The BookXMapper class contains a map function which reads each record with the default record reader. For each record, the key is the byte offset at which the line starts in the file, and the value is the whole line, as a Text, up to the newline character. Using the split() method of the String class, we split the line on the delimiter (“;” in our case) to get an array of strings. The fourth entry of the array (index 3) is the field “Year-Of-Publication”, which becomes the output key of the mapper. The output value is new IntWritable(1).
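The parsing that the map function performs on each line can be tried out without Hadoop on the classpath. The following is a minimal plain-Java sketch, not the tutorial's BookXMapper: the class name BookLineMapSketch and the method mapLine are hypothetical, and stripping the surrounding double quotes from the year is an assumption about how the field should be cleaned.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java sketch of the map step. BookLineMapSketch is a hypothetical
// name and uses no Hadoop classes.
class BookLineMapSketch {

    // Mirrors what map() does with one input line: split on ';' and
    // emit (year, 1). Returns null for lines with fewer than four fields.
    static Map.Entry<String, Integer> mapLine(String line) {
        String[] fields = line.split(";");
        if (fields.length < 4) {
            return null; // malformed record, nothing to emit
        }
        // Assumption: the surrounding double quotes are removed so that
        // the key is 2002 rather than "2002".
        String year = fields[3].replace("\"", "").trim();
        return new SimpleEntry<>(year, 1);
    }

    public static void main(String[] args) {
        String line = "\"0195153448\";\"Classical Mythology\";\"Mark P. O. Morford\";"
                + "\"2002\";\"Oxford University Press\"";
        Map.Entry<String, Integer> out = mapLine(line);
        // Print key and value separated by a tab.
        System.out.println(out.getKey() + "\t" + out.getValue());
    }
}
```

Note that a title which itself contains a semicolon would shift the fields; the sketch does not try to handle that case.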
5. Create a new Reducer in the BookCrossingData project and write the reduce method.

The BookXReducer class contains a reduce method which takes two parameters: a key and an iterable list of values (the values grouped for that key). In our program, the reducer reuses the mapper’s key as its own output key and adds up the individual values in the list. Remember, the output value of the mapper was new IntWritable(1). Adding up all the occurrences of new IntWritable(1) gives the count of books published for that particular key (the year of publication).
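The summing described above can be sketched in plain Java. This is an illustration under assumptions, not the tutorial's BookXReducer: the class name YearCountReduceSketch is hypothetical, and Integer stands in for IntWritable.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the reduce step. YearCountReduceSketch is a
// hypothetical name; Integer stands in for Hadoop's IntWritable.
class YearCountReduceSketch {

    // Adds up all values that were grouped under one key (a year).
    static int reduce(String year, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three records for the same year arrive as three ones.
        List<Integer> ones = Arrays.asList(1, 1, 1);
        System.out.println("2002\t" + reduce("2002", ones));
    }
}
```

Because the reducer only adds values, it would produce the same total if some of the ones had already been added together before reaching it.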

6. Create a class named BookXDriver. This class will be our main program to run the MapReduce job we have just written.
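The driver's role is to connect the mapper and the reducer and to point the job at its input and output. The data flow it sets up can be imitated in memory, which may help when reasoning about the expected result. The sketch below is not the tutorial's BookXDriver and uses no Hadoop classes; the name LocalJobSketch and the sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the job that a driver class wires together.
// LocalJobSketch is a hypothetical name; it uses no Hadoop classes.
class LocalJobSketch {

    static Map<String, Integer> run(List<String> lines) {
        // Map phase and shuffle: emit (year, 1) per record and group the
        // values by key. TreeMap keeps the keys in sorted order.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            if (fields.length < 4) {
                continue; // skip malformed records
            }
            String year = fields[3].replace("\"", "").trim();
            grouped.computeIfAbsent(year, k -> new ArrayList<>()).add(1);
        }

        // Reduce phase: sum the grouped values for every key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the same semicolon-separated layout.
        List<String> lines = Arrays.asList(
                "\"1\";\"A\";\"X\";\"2002\";\"P\"",
                "\"2\";\"B\";\"Y\";\"2001\";\"P\"",
                "\"3\";\"C\";\"Z\";\"2002\";\"P\"");
        run(lines).forEach((year, count) -> System.out.println(year + "\t" + count));
    }
}
```

On the real file, the header row would pass through the same code, so its fourth field would show up as one extra key unless it is filtered out.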
7. To run in Eclipse:
  • Open the Run Configurations from the project’s context menu (right-click the project).
  • In the program arguments, set the paths of the input file or directory and the expected output directory. Make sure that the output directory does not exist; the job fails with an error if it already exists.
  • Run the job and watch the console. When it finishes, the output directory contains two files. The file with the prefix “part-” holds the actual output of your MapReduce logic. The file named “_SUCCESS” is just a marker that signifies a successful run.
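If the results in the “part-” file need to be processed further, each line can be split back into a key and a value. The sketch below assumes the job writes plain text with a tab between key and value, which is the default for Hadoop's text output; the class name PartFileLineSketch and the sample line are illustrative only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch for reading one line of a "part-" output file.
// PartFileLineSketch is a hypothetical name.
class PartFileLineSketch {

    // Splits a line of the form key<TAB>value into its two parts.
    // Assumption: the value is an integer count.
    static Map.Entry<String, Integer> parse(String line) {
        String[] keyValue = line.split("\t", 2);
        return new SimpleEntry<>(keyValue[0], Integer.parseInt(keyValue[1].trim()));
    }

    public static void main(String[] args) {
        // Illustrative line, not taken from an actual run.
        Map.Entry<String, Integer> entry = parse("2002\t3");
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
```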

8. To run on a Hadoop cluster (the size of the cluster does not matter; it can be single-node or multi-node):

  • Make sure your cluster is started.
  • Export the Eclipse project as a runnable jar. Right-click the project to find the export option and its configuration for a runnable jar. For our example, we created a jar file named BookCrossingJar.jar.
  • Upload your Book Crossing dataset to HDFS with the following command:

$ hadoop fs -put ~/Work/HadoopDev/Input/BX-Books.csv input

  • Run the jar with the hadoop command, passing the HDFS path of the input file and the expected path of the output as parameters. Make sure that the output directory does not exist; the job fails with an error if it already exists.

$ hadoop jar ~/Work/HadoopDev/BookCrossingJar.jar input output

  • The output is generated and can be viewed with the following command:

$ hadoop fs -cat output/*


The running code for this tutorial is present at the github location of the tutorials at https://github.com/Orzota/tutorials.


In this tutorial we learned how to write a mapper, a reducer and the driver class for running MapReduce programs. We also learned two ways of running our MapReduce logic – one using Eclipse, which is suitable for local debugging and the other using Single-node Hadoop Cluster for real world execution. Also we learned about some basic input and output datatypes. 
Stay tuned for more exciting tutorials from the Small world of BigData.


We will learn the following things with this Eclipse Setup for Hadoop tutorial 
  • Setting Up the Eclipse plugin for Hadoop
  • Testing the running of Hadoop MapReduce jobs


The following are the prerequisites for Eclipse setup for Hadoop program development using MapReduce and further extensions.
  • You should have the latest stable build of Hadoop (1.0.3 at the time of writing).
  • You should have Eclipse installed on your machine. Any Eclipse version before 3.6 is compatible with the Eclipse plugin; the plugin does not work with Eclipse 3.7. Please refer to Link 6 of the Appendix for details of how to get and install Eclipse for your development platform.
  • It is expected that you have a preliminary knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Please refer to the blog article describing Hadoop installation.


1. Download the Eclipse plugin from the link given here.
We are using this plugin to support the newer versions of Eclipse and the newer versions of Hadoop. The Eclipse plugins packaged with earlier versions of Hadoop do not work.
2. After downloading the plugin, copy it into the plugins folder of Eclipse.
  • Windows: Go to the Eclipse folder, located by default at C:\eclipse (you may have installed it elsewhere). Copy the downloaded plugin to the eclipse\plugins folder.
  • Mac OS X: Eclipse comes as an archive. Unarchive it and find the folder named ../eclipse/plugins/. Copy the downloaded plugin into it.
  • Linux: The plugins folder is in the Eclipse installation directory. Normally /usr/share/eclipse is the installation directory of Eclipse on Linux machines.
3. After you have copied the plugin, start Eclipse (restart it if it was already running) so that the changes take effect.
Go to the Window option of the menu bar in Eclipse.
Select the option “Other” from the drop-down list.
Select the option Map/Reduce from the list of perspectives.
When you select the “Map/Reduce” perspective, you’ll notice a few additions to the Eclipse views. Notice the perspective name as well.
The DFS locations are now directly accessible through the Eclipse configurations. Use the MapReduce Locations view to configure Hadoop access from Eclipse.
4. You have now set up Eclipse for MapReduce programming.
Right-click in the Eclipse Package Explorer view, or click the File option in the menu bar of your Eclipse distribution.
Select the New option and then the “Other” option in it (Ctrl+N or Command+N).
Select Map/Reduce Project.
Fill in the details in the Project Wizard.
The Hadoop library location must be the location specified in $HADOOP_HOME.
After the project has been created, create the Mapper, Reducer and Driver classes by right-clicking the project and choosing them from the New option.
Create a Mapper class.
After creating a Mapper class, the generated code snippet looks like this:

… // Necessary Classes Imported

public class BookCrossingMapper extends MapReduceBase implements Mapper {

    public void map(WritableComparable key, Writable values,
            OutputCollector output, Reporter reporter) throws IOException {
    }
}



Please replace the above code snippet with the one below:

… // Necessary Classes imported

public class BookCrossingMapper extends MapReduceBase implements Mapper<K, V, K, V> {

    public void map(K arg0, V arg1, OutputCollector<K, V> arg2,
            Reporter arg3) throws IOException {
        // TODO Replace K and V with the suitable data types
    }
}



Replace K and V with the required Key and Value datatypes such as LongWritable, Text, etc. Repeat the same steps while creating the new Reducer Class.
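To see what binding the type parameters means in practice, here is a small plain-Java sketch that can be compiled without Hadoop. Everything in it is a stand-in: SketchMapper is a hypothetical interface shaped loosely like a four-parameter mapper, and Long, String and Integer take the places of LongWritable, Text and IntWritable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Plain-Java sketch of binding key and value types. All names here are
// hypothetical stand-ins and none of them are Hadoop classes.
class TypeBindingSketch {

    // Four type parameters: input key, input value, output key, output value.
    interface SketchMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> output);
    }

    // Long ~ LongWritable, String ~ Text, Integer ~ IntWritable.
    static class YearSketchMapper implements SketchMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line, BiConsumer<String, Integer> output) {
            String[] fields = line.split(";");
            if (fields.length >= 4) {
                output.accept(fields[3].replace("\"", ""), 1);
            }
        }
    }

    // Runs the mapper on one line and records what it emits.
    static List<String> collect(String line) {
        List<String> emitted = new ArrayList<>();
        new YearSketchMapper().map(0L, line, (key, value) -> emitted.add(key + "=" + value));
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(collect("\"isbn\";\"title\";\"author\";\"2001\";\"publisher\""));
    }
}
```

Once the types are bound, the compiler rejects a map method that emits anything other than the declared output key and value types, which is the benefit of replacing the raw generated signature.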

You are now ready to start programming in MapReduce.

Next Steps
My next blog article will be an intro to MapReduce programming, so stay tuned.



We will learn the following things with this single-node Hadoop tutorial:

  • Setting Up Hadoop in Single-Node and Pseudo-Cluster Node modes.
  • Test execution of Hadoop by running sample MapReduce programs provided in the Hadoop distribution.



  • Either Linux or Windows.
  • Basic knowledge of Linux shell commands.


  • Java™ 1.6.x must be installed.
  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Additional requirement for Windows:

  • Cygwin, which provides the shell support that the Hadoop scripts need.

Installing Software

If your machine doesn’t have the requisite software you will need to install it. For example on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the ‘openssh’ package in the Net category.

Setting up Environment Variables

  • Ubuntu/Fedora (and for most Linux-kernel systems)

$ vi ~/.bash_profile
Add the following lines (if not present):

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=/usr/java/jdk1.6.0_26
(the directory containing Java. Set it according to the actual path in your distribution)

export PATH=$PATH:$HADOOP_HOME/bin
(adds the Hadoop bin directory to the PATH variable)



(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

$ source ~/.bash_profile

(type the above command to activate the new path settings immediately)

  • Mac OS X :

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=$(/usr/libexec/java_home)
(runs the java_home utility to set JAVA_HOME)

export PATH=$PATH:$HADOOP_HOME/bin
(adds the Hadoop bin directory to the PATH variable)

(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

ssh Settings

  • Works on GNU/Linux, Mac OS X and Cygwin only
  • Check that you can ssh to the localhost without using the password for the user:

$ ssh localhost

  • If you cannot ssh to localhost without a password, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

        This creates a new public-private key pair in the directory ~/.ssh, in two files: id_dsa and id_dsa.pub

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

  • The generated public key is appended to the file authorized_keys, which enables passphrase-less authentication.

Download Hadoop

  • To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this tutorial, we’ll use the Hadoop 1.0.3 version only.

Install Hadoop

  1. Unpack the Hadoop Distribution downloaded. Make sure you use the same directory path that was used to set  HADOOP_HOME.
  2. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of the Java installation. This could be the same path as JAVA_HOME set earlier.
  • For Mac users only: In the file bin/hadoop, search for the line:


and change it to: 


  • Try the following command now (from anywhere, as you have already set $HADOOP_HOME/bin in the PATH variable)

$ hadoop

If this shows the usage documentation for the hadoop script, then your initial setup for Hadoop and Java is correct. If it cannot find hadoop, then re-trace the steps above to ensure that everything has been set correctly.

Single Node Deployment

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory. 

$ mkdir input

$ cp conf/*.xml input

Copy the contents of $HADOOP_HOME/conf  to a new input directory.

Now run hadoop from the command line with arguments

$ hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

The parts of this command are:

  • jar hadoop-examples-*.jar: the jar file containing the MapReduce logic
  • grep: the example program to run; its arguments follow
  • input: the input folder
  • output: the output folder
  • 'dfs[a-z.]+': the regular expression for the grep program

[Screenshot: console output of the hadoop command]

Check the output of the grep program:

$ cat output/*
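To get a feel for what the regular expression 'dfs[a-z.]+' picks out, the matching can be reproduced with Java's own regex classes. This is a rough sketch of counting matches per matched string, not the code of the Hadoop example; the class name GrepSketch and the sample text are made up.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts how often each string matching a regular expression occurs.
// GrepSketch is a hypothetical name; it uses no Hadoop classes.
class GrepSketch {

    static Map<String, Integer> countMatches(String text, String regex) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            // Every distinct matched string gets its own counter.
            counts.merge(matcher.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up text that resembles property names in a config file.
        String text = "dfs.replication and dfs.name.dir, then dfs.replication again";
        countMatches(text, "dfs[a-z.]+")
                .forEach((match, count) -> System.out.println(count + "\t" + match));
    }
}
```

The expression matches the literal prefix dfs followed by one or more lowercase letters or dots, so a match ends at the first character outside that set, such as a space, comma or angle bracket.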

Pseudo-Node Deployment

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. 

To do so, edit the following files in the conf  directory of $HADOOP_HOME.

  • conf/core-site.xml







  • conf/hdfs-site.xml







  • conf/mapred-site.xml








Format a new distributed-filesystem:

$ bin/hadoop namenode -format

Usage: java NameNode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

Start the hadoop daemons:

$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

NameNode Details

The NameNode web interface can be accessed at http://localhost:50070

It describes the current usage of HDFS, Live/Dead Nodes, and other necessary administrative details.



The JobTracker can be accessed at http://localhost:50030. It is used to monitor the MapReduce jobs and the tasks submitted to each individual machine. This information is also written to log files, as configured, inside the $HADOOP_HOME/logs folder.
Examples in Hadoop

Copy the input files into the distributed filesystem:

$ bin/hadoop fs -mkdir input

$ bin/hadoop fs -put conf input

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

[Screenshot: a successful run of the MapReduce example]

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files on the distributed filesystem:

$ bin/hadoop fs -cat output/*

View the output from HDFS itself or copy it to local and view in your editor of choice.



When you’re done, stop the daemons with:
$ bin/stop-all.sh

Next Steps

Congratulations. You have successfully installed and configured Hadoop on a single node and run the example application. The next step is to program your own application using the MapReduce API. We will cover this topic in the next tutorial.


1. We referred to the Single-Node Setup guide on the Apache Hadoop site.

We are very excited to launch the Orzota blog. We hope to provide articles on hadoop and related technologies which will hopefully prove useful. We will address both beginner topics for those just getting started on hadoop and also more advanced tips and techniques.