Eclipse Setup for Hadoop Development

Objectives

We will learn the following things with this Eclipse setup for Hadoop tutorial:
  • Setting Up the Eclipse plugin for Hadoop
  • Testing the running of Hadoop MapReduce jobs

Prerequisites

The following are the prerequisites for setting up Eclipse for Hadoop program development using MapReduce and further extensions.
  • You should have the latest stable build of Hadoop (1.0.3 as of this writing).
  • You should have Eclipse installed on your machine. Eclipse versions up to 3.6 are compatible with the Eclipse plugin (it does not work with Eclipse 3.7). Please refer to Link 6 of the Appendix for details on how to get and install Eclipse for your platform of development.
  • It is expected that you have a preliminary knowledge of Java programming and are familiar with the concepts involved in Java programming, such as classes and objects, inheritance, and interfaces/abstract classes.
  • Please refer to the blog describing Hadoop installation (see here).

Procedure

1. Download the Eclipse plugin from the link given here.
We use this plugin to support the newer versions of Eclipse and the newer versions of Hadoop; the Eclipse plugins packaged with earlier versions of Hadoop do not work.
2. After downloading the plugin, copy it into the plugins folder of Eclipse.
  • Windows: Go to the Eclipse folder, located by default at C:\eclipse (you may have installed it elsewhere). Copy the downloaded plugin into the C:\eclipse\plugins folder.
  • Mac OS X: Eclipse comes as an archive. Unarchive it and find the folder ../eclipse/plugins/. Copy the downloaded plugin into it.
  • Linux: The plugins folder is in the Eclipse installation directory, which is normally /usr/share/eclipse on Linux machines.
3. After you have copied the plugin, start Eclipse (or restart it, if it was already running) so that the change is picked up by the Eclipse environment.
Go to the Window option of the menu bar in Eclipse and choose Open Perspective.
Select the option “Other” from the drop-down list.
Select the option Map/Reduce from the list of perspectives.
As you select the “Map/Reduce” perspective, you will notice a few additions to the Eclipse views.
Notice the perspective name.
The DFS Locations are now directly accessible from within Eclipse. Use the Map/Reduce Locations view to configure Hadoop access from Eclipse.
4. You have now set up Eclipse for MapReduce programming.
Right-click in the Eclipse Package Explorer view, or click the File option in the menu bar of your Eclipse distribution.
Select the New option and then “Other” under it (Ctrl+N or Command+N).
Select Map/Reduce Project.
Fill in the details in the Project Wizard.
The Hadoop library location must be the location specified in $HADOOP_HOME.
Now, after the project has been created, create the Mapper, Reducer, and Driver classes by right-clicking the project and choosing them from the New option.
Create a Mapper class.
After creating a Mapper class, the generated code snippet looks like the following:

… // Necessary Classes Imported

public class BookCrossingMapper extends MapReduceBase implements Mapper {

    public void map(WritableComparable key, Writable values,
            OutputCollector output, Reporter reporter) throws IOException {

    }

}

Please replace the above code snippet with the one below:

… // Necessary Classes imported

public class BookCrossingMapper extends MapReduceBase implements Mapper<K, V, K, V> {

    @Override
    public void map(K arg0, V arg1, OutputCollector<K, V> arg2,
            Reporter arg3) throws IOException {
        // TODO Replace K and V with the suitable data types
    }

}

Replace K and V with the required key and value data types, such as LongWritable, Text, etc. Repeat the same steps when creating the new Reducer class. An illustrative example with concrete types is sketched below.
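
As a hedged illustration only (this is not what the wizard generates for you), here is how the BookCrossingMapper might look once K and V are replaced with concrete types, along with a matching Reducer created the same way. The LongWritable/Text input types, the Text/IntWritable output types, the BookCrossingReducer name, and the ';'-separated record format are all assumptions made for this sketch; substitute whatever matches your own data.

// BookCrossingMapper.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper: the input key is the byte offset of the line (LongWritable)
// and the input value is the line itself (Text). It emits the first ';'-separated
// field of each line with a count of 1.
public class BookCrossingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(";"); // assumption: ';'-separated records
        if (fields.length > 0) {
            output.collect(new Text(fields[0].trim()), ONE);
        }
    }
}

// BookCrossingReducer.java (hypothetical class name, created the same way)
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer: sums the counts emitted by the mapper for each key.
public class BookCrossingReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}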

You are now ready to start programming in MapReduce.
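
To show how the generated classes fit together before you start, here is a minimal driver sketch using the same old-style org.apache.hadoop.mapred API that the plugin generates. It assumes the illustrative BookCrossingMapper and BookCrossingReducer above; the driver class name, job name, and command-line argument layout are placeholders for this example only.

// BookCrossingDriver.java (hypothetical class name)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Illustrative driver: wires the mapper and reducer together and submits the job.
public class BookCrossingDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BookCrossingDriver.class);
        conf.setJobName("book-crossing-example");

        // Key/value types produced by the mapper and reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(BookCrossingMapper.class);
        conf.setReducerClass(BookCrossingReducer.class);

        // Input and output paths are taken from the command line, e.g.
        //   hadoop jar bookcrossing.jar BookCrossingDriver input output
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

After packaging the project as a jar, the job can be submitted with the hadoop jar command shown later in the single-node tutorial.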

Next Steps
My next blog article will be an intro to MapReduce programming, so stay tuned.

 

Single-Node Hadoop Tutorial

Objectives

We will learn the following things with this single-node Hadoop tutorial:

  • Setting Up Hadoop in Single-Node and Pseudo-Cluster Node modes.
  • Test execution of Hadoop by running sample MapReduce programs provided in the Hadoop distribution.

Prerequisites

Platform

  • Either Linux or Windows.
  • Basic knowledge of Linux shell commands.

Software

  • Java™ 1.6.x must be installed.
  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Additional requirement for Windows: Cygwin (required for shell support).

Installing Software

If your machine doesn’t have the requisite software you will need to install it. For example on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the ‘openssh’ package in the Net category.

Setting up Environment Variables

  • Ubuntu/Fedora (and for most Linux-kernel systems)

$ vi ~/.bash_profile
Add the following lines (if not present):

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=/usr/java/jdk1.6.0_26
(the directory containing Java. Set it according to the actual path in your distribution)

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

export HADOOP_LOG_DIR=$HADOOP_HOME/logs

(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

$ source ~/.bash_profile

(type the above command to activate the new path settings immediately)

  • Mac OS X (add the following to ~/.bash_profile, as above):

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=$(/usr/libexec/java_home)
(Will execute the script to set JAVA_HOME)

export PATH=$HADOOP_HOME/bin:$PATH
(Adding the hadoop bin path to the PATH Variable)

export HADOOP_LOG_DIR=$HADOOP_HOME/logs
(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

ssh Settings

  • Works on GNU/Linux, Mac OS X and Cygwin only
  • Check that you can ssh to the localhost without using the password for the user:

$ ssh localhost

  • If you cannot ssh to the localhost without using the password for the user, execute the following commands,

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

        This creates a new public-private key pair in the directory ~/.ssh in two files: id_dsa and id_dsa.pub

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

  • The public key of ssh-key generated is copied to a new file called authorized_keys for passphrase-less authentication.

Download Hadoop

  • To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this tutorial, we’ll use Hadoop version 1.0.3.

Install Hadoop

  1. Unpack the downloaded Hadoop distribution. Make sure you use the same directory path that was used to set HADOOP_HOME.
  2. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME as the root of the Java installation. This can be the same path as the JAVA_HOME set earlier.
  • For Mac users only: In the file bin/hadoop, search for the line:

JAVA=$JAVA_HOME/bin/java 

and change it to: 

JAVA=$JAVA_HOME/Commands/java

  • Try the following command now (from anywhere, as you have already set $HADOOP_HOME/bin in the PATH variable)

$ hadoop

If this shows the usage documentation for the hadoop script, then your initial setup for Hadoop and Java is correct. If it cannot find hadoop, then re-trace the steps above to ensure that everything has been set correctly.

Single Node Deployment

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example (run from the $HADOOP_HOME directory) copies the conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ mkdir input

$ cp conf/*.xml input

Copy the contents of $HADOOP_HOME/conf  to a new input directory.

Now run hadoop from the command line with arguments

$ hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

The arguments to the hadoop command above are:

  • jar hadoop-examples-*.jar – the jar file containing the MapReduce logic
  • grep – the example program to run (its arguments follow)
  • input – the input folder
  • output – the output folder
  • 'dfs[a-z.]+' – the regular expression for the grep program

The output of the hadoop command should look like this screen shot.

$ cat output/*

Check the output of the grep program.

Pseudo-Node Deployment

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. 

To do so, edit the following files in the conf  directory of $HADOOP_HOME.

  • conf/core-site.xml

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

  • conf/hdfs-site.xml

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

  • conf/mapred-site.xml

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

Execution

Format a new distributed-filesystem:

$ bin/hadoop namenode -format

(For reference, the namenode command supports the following options: -format | -upgrade | -rollback | -finalize | -importCheckpoint.)

Start the hadoop daemons:

$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

NameNode Details

The NameNode web interface can be accessed at http://localhost:50070

It describes the current usage of HDFS, Live/Dead Nodes, and other necessary administrative details.


JobTracker

The JobTracker can be accessed at http://localhost:50030. It is used to monitor the MapReduce jobs and the tasks submitted to each individual machine. This information is also written to the log files, as configured, inside the $HADOOP_HOME/logs folder.

Examples in Hadoop

Copy the input files into the distributed filesystem:

$ bin/hadoop fs -mkdir input

$ bin/hadoop fs -put conf input

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'


A snapshot after successful run of the MapReduce example:


Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files on the distributed filesystem:

$ bin/hadoop fs -cat output/*

View the output from HDFS directly, or copy it to the local filesystem and view it in your editor of choice.


When you’re done, stop the daemons with:
$ bin/stop-all.sh

Next Steps

Congratulations. You have successfully installed and configured Hadoop on a single node and run the example application. The next step is to program your own application using the MapReduce API. We will cover this topic in the next tutorial.

References

1. We referred to the Single-Node Setup guide on the Apache Hadoop page.

Welcome

We are very excited to launch the Orzota blog. We hope to provide useful articles on Hadoop and related technologies. We will address both beginner topics for those just getting started with Hadoop, as well as more advanced tips and techniques.