
Single-Node Hadoop Tutorial

by Shanti Subramanyam

Objectives

We will learn the following things in this single-node Hadoop tutorial:

  • Setting up Hadoop in single-node (standalone) and pseudo-distributed modes.
  • Test execution of Hadoop by running sample MapReduce programs provided in the Hadoop distribution.

Prerequisites

Platform

  • Either Linux or Windows.
  • Basic knowledge of Linux shell commands.

Software

  • Java™ 1.6.x must be installed.
  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
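
A quick way to sanity-check these prerequisites from a shell (the exact output will vary with your installation):

$ java -version
$ which ssh
$ ps -e | grep sshd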

Additional requirement for Windows: Cygwin must be installed to provide shell support for the Hadoop scripts.

Installing Software

If your machine doesn’t have the requisite software, you will need to install it. For example, on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync
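
On Fedora (also covered in the environment setup below), the rough equivalent would be:

$ sudo yum install openssh-clients openssh-server rsync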

On Windows, if you did not install the required software when you installed Cygwin, start the Cygwin installer and select the ‘openssh’ package in the Net category.

Setting up Environment Variables

  • Ubuntu/Fedora (and most other Linux systems)

$ vi ~/.bash_profile
Add the following lines (if not present):

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=/usr/java/jdk1.6.0_26
(the directory containing Java. Set it according to the actual path in your distribution)

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

export HADOOP_LOG_DIR=$HADOOP_HOME/logs

(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

$ source ~/.bash_profile

(type the above command to activate the new path settings immediately)
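
To confirm the settings took effect in the current shell (the paths should match what you set above):

$ echo $HADOOP_HOME
$ echo $JAVA_HOME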

  • Mac OS X:

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=$(/usr/libexec/java_home)
(runs the java_home utility to set JAVA_HOME to the installed JDK)

export PATH=$HADOOP_HOME/bin:$PATH
(adds the Hadoop bin directory to the PATH variable)

export HADOOP_LOG_DIR=$HADOOP_HOME/logs
(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)
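
On Mac OS X these exports can likewise be added to ~/.bash_profile and then loaded into the current shell:

$ source ~/.bash_profile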

ssh Settings

  • Works on GNU/Linux, Mac OS X and Cygwin only
  • Check that you can ssh to the localhost without using the password for the user:

$ ssh localhost

  • If you cannot ssh to localhost without a password, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

        This creates a new public-private key pair in the ~/.ssh directory, in two files: id_dsa and id_dsa.pub.

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

  • The generated public key is appended to the authorized_keys file to enable passphrase-less authentication.
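
If ssh still prompts for a password after this, loose permissions on ~/.ssh are a common culprit; tightening them (an extra step that is sometimes needed) usually fixes it:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost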

Download Hadoop

  • To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this tutorial, we’ll use Hadoop 1.0.3.
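
For example, from a shell (the URL below points at the Apache archive and is only illustrative; use the mirror you selected):

$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
$ tar -xzf hadoop-1.0.3.tar.gz

The second command unpacks the distribution, which is step 1 of the installation below.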

Install Hadoop

  1. Unpack the downloaded Hadoop distribution. Make sure the resulting directory matches the path used to set HADOOP_HOME.
  2. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME as the root of the Java installation. This should be the same path as the JAVA_HOME set earlier; a sample line is shown after the Mac note below.
  • For Mac users only: In the file bin/hadoop, search for the line:

JAVA=$JAVA_HOME/bin/java 

and change it to: 

JAVA=$JAVA_HOME/Commands/java
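
For reference (as noted in step 2 above), the JAVA_HOME line added to conf/hadoop-env.sh looks like the export used earlier, for example:

export JAVA_HOME=/usr/java/jdk1.6.0_26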

  • Try the following command now (from anywhere, as you have already set $HADOOP_HOME/bin in the PATH variable)

$ hadoop

If this shows the usage documentation for the hadoop script, then your initial setup for Hadoop and Java is correct. If it cannot find hadoop, then re-trace the steps above to ensure that everything has been set correctly.
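
Another quick check is the version subcommand of the hadoop script:

$ hadoop version

This should report the release you unpacked (1.0.3 for this tutorial).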

Single Node Deployment

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory. 

$ mkdir input

$ cp conf/*.xml input

Copy the contents of $HADOOP_HOME/conf  to a new input directory.

Now run hadoop from the command line with arguments

$ hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

The arguments to the hadoop command are:

  • jar – run a job packaged in the named jar file
  • hadoop-examples-*.jar – the jar file containing the example MapReduce programs
  • grep – the example program to run
  • input – the input folder
  • output – the output folder
  • 'dfs[a-z.]+' – the regular expression passed to the grep program

The hadoop command prints the progress of the MapReduce job as it runs.

$ cat output/*

Check the output of the grep program.
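
With the stock conf files of a 1.0.x release, this typically prints a single match along the lines of the following (the exact count depends on the contents of your conf files):

1       dfsadmin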

Pseudo-Distributed Deployment

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. 

To do so, edit the following files in the conf  directory of $HADOOP_HOME.

  • conf/core-site.xml

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

  • conf/hdfs-site.xml

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

  • conf/mapred-site.xml

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

Execution

Format a new distributed-filesystem:

$ bin/hadoop namenode -format

(For reference, the namenode command also accepts -upgrade, -rollback, -finalize and -importCheckpoint in addition to -format.)

Start the hadoop daemons:

$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
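
To confirm that all five daemons are running, jps (shipped with the JDK) should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:

$ jps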

NameNode Details

The NameNode web interface can be accessed at http://localhost:50070

It describes the current usage of HDFS, Live/Dead Nodes, and other necessary administrative details.


JobTracker

The JobTracker can be accessed at http://localhost:50030. It is used to monitor the MapReduce jobs and the tasks submitted to each individual machine. This information is also written to the log files, as configured, inside the $HADOOP_HOME/logs folder.

Examples in Hadoop

Copy the input files into the distributed filesystem:

$ bin/hadoop fs -mkdir input

$ bin/hadoop fs -put conf input
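
To verify the upload, list the directory on HDFS (this is an HDFS listing, not the local filesystem):

$ bin/hadoop fs -ls input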

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'


A successful run of the MapReduce example ends with the job statistics printed to the console.

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files directly on the distributed filesystem:

$ bin/hadoop fs -cat output/*

View the output from HDFS itself, or copy it to the local filesystem and view it in your editor of choice.


When you’re done, stop the daemons with:
$ bin/stop-all.sh

Next Steps

Congratulations! You have successfully installed and configured Hadoop on a single node and run the example application. The next step is to write your own application using the MapReduce API; we will cover this topic in the next tutorial.

References

1. This tutorial draws on the Single Node Setup guide from the Apache Hadoop documentation.
