Orzota


Hive Tutorial for Beginners

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

Objective

The objective of this Hive tutorial is to get you up and running Hive queries on a real-world dataset.

Prerequisites

The following are the prerequisites for setting up Hive and running Hive queries
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
  • To install hadoop, see my previous blog article on Hadoop Setup
  • Your machine should have Java 1.6 installed
  • It is assumed you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Basic knowledge of Linux will help you understand many of the linux commands used in the tutorial
  • Download the Book Crossing DataSet. This is the data set we will use. (Alternative link to the Dataset at the github page of the Tutorials)

Setting up Hive

Platform

This Hive tutorial assumes Linux. If using Windows, please install Cygwin. It is required for shell support in addition to the required software above.

Procedure

Download the most recent stable release of Hive as a tarball from one of the apache download mirrors. For our Hive tutorial, we are going to use hive-0.9.0.tar.gz
Unpack the tarball in the directory of your choice, using the following command 
  $ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
You can either do
  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)
or set HIVE_HOME in $HOME/.profile so it will be set every time you login.
Add the following line to it.
  export HIVE_HOME=<path_to_hive_home_directory>
e.g.
  export HIVE_HOME='/Users/Work/hive-0.9.0'
  export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
Start Hadoop (refer to the Single-Node Hadoop Setup Guide for more information). It should show the processes being started. You can check the started processes by using the jps command:
$ start-all.sh
<< Starting various hadoop processes >>
$ jps
  3097 Jps
  2355 RunJar
  2984 JobTracker
  2919 SecondaryNameNode
  2831 DataNode
  2743 NameNode
  3075 TaskTracker
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set appropriate permissions in HDFS before a table can be created in Hive, as shown below:
  $ hadoop fs -mkdir /tmp
  $ hadoop fs -mkdir /user/hive/warehouse
  $ hadoop fs -chmod g+w /tmp
  $ hadoop fs -chmod g+w /user/hive/warehouse

Problem

The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year. This is the same problem that was solved in the previous blog post (Step-by-step MapReduce Programming using Java).
Our data is not cleansed and could give erroneous results: the file contains a header row, HTML entities such as &amp;, and semicolons inside field values that clash with the field delimiter. We clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.csv
The sed pipeline replaces the pattern "&amp;" with "&", deletes the first line (the header, which Hive would otherwise treat as data), replaces every ";" with "$$$", and finally turns the quoted field separators ("$$$" surrounded by double quotes) back into ";". The net effect is that semicolons inside field values become "$$$", while the actual field delimiters stay ";", so Hive can safely split each record on ";".
These steps cleanse the data and help Hive give accurate results for our queries.

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

                is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input
Running Hive using the command line:
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet 
    >   (ISBN STRING, 
    >   BookTitle STRING, 
    >   BookAuthor STRING, 
    >   YearOfPublication STRING, 
    >   Publisher STRING, 
    >   ImageURLS STRING, 
    >   ImageURLM STRING, 
    >   ImageURLL STRING) 
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ';'
    > STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

The username ("work" in our example) in the LOAD DATA path depends on the Hadoop setup on your machine, i.e., the user under which Hadoop was set up.

Output

The output of the query is shown below:

[Screenshot: output of the Hive query]

Comparison with Java MapReduce

You can compare the above output with the output of the MapReduce code from the previous blog entry. Let's take a look at how the code for Hive differs from MapReduce:

Mapper

[Screenshot: mapper code]

Reducer

[Screenshot: reducer code]

hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ';'
    > STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

It is clear from the above that Hive reduces the programming effort required as well as the complexity of learning and writing MapReduce code. In the small example above, we reduced the lines of code from roughly 25 to 3.

Conclusion

In this tutorial we learned how to set up Hive and run Hive queries. We saw the query for the same problem statement that we solved with MapReduce and compared how the programming effort is reduced with the use of HiveQL. Stay tuned for more exciting tutorials from the small world of BigData.

 

MapReduce Tutorial

Objective

We will learn the following things with this step-by-step MapReduce tutorial
  • MapReduce programming with a column
  • Writing a map function
  • Writing a reduce function
  • Writing the Driver class

Prerequisites

The following are the prerequisites for writing MapReduce programs using Apache Hadoop
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
  • To install hadoop, see my previous blog article.
  • You should have eclipse installed on your machine. Eclipse versions up to 3.6 are compatible with the eclipse plugin (it does not work with Eclipse 3.7). Please refer to Eclipse Setup for Hadoop Development for details.
  • It is expected that you have some knowledge of Java programming and are familiar with the concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Download the Book Crossing DataSet. (Alternative link to the Dataset at the github page of the Tutorials)

Problem

The problem we are trying to solve through this MapReduce tutorial is to find the frequency of books published each year. Our input data set is a csv file which looks like this:

Sample Rows from the input file BX-Books.csv 
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year.

Procedure

1. Open Eclipse in the MapReduce Perspective, as shown below:
[Screenshot: Eclipse in the MapReduce perspective]
2. Create a New MapReduce Project, as shown below:
[Screenshot: creating a new MapReduce project]
and fill in the details of the project. For our sample project, we have named it "BookCrossingData".
[Screenshot: project details for BookCrossingData]
3. Create New Mapper in the BookCrossingData Project 
[Screenshot: creating a new Mapper]
4. Write the map method as shown below:
[Screenshot: the map method]
The BookXMapper class contains a map function which reads each record with the default record reader. By default, the key is the byte offset of the line within the file and the value is the whole line as a Text, up to the newline character. Using the split() method of the String class, we split on the delimiter (";" in our case) to get an array of strings. The 4th entry of the array is the field "Year-Of-Publication", which becomes the output key of the mapper.
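For reference, here is a minimal sketch of what such a mapper could look like with the old org.apache.hadoop.mapred API that Hadoop 1.0.3 ships with. The class name follows the description above, but the exact code in the original screenshot may differ.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BookXMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Split the line on the field delimiter; index 3 is "Year-Of-Publication"
        String[] fields = value.toString().split(";");
        if (fields.length > 3) {
            // Emit (year, 1) for every book record
            output.collect(new Text(fields[3]), ONE);
        }
    }
}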
5. Create New Reducer in the BookCrossingData Project 
[Screenshot: creating a new Reducer]
Write the reduce method as shown below
[Screenshot: the reduce method]

The BookXReducer class contains a reduce method which takes two parameters: a key and an iterable list of values (the values grouped for that key). For our program, we use the mapper's key as the output key of the reducer and add up the individual values from the list. Remember, the output value of the mapper was new IntWritable(1); adding up all these occurrences gives the count of books published for that particular key (the year of publication).
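A corresponding minimal sketch of the reducer, under the same assumptions as the mapper sketch above:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BookXReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int count = 0;
        // Sum the 1s emitted by the mapper for this year of publication
        while (values.hasNext()) {
            count += values.next().get();
        }
        output.collect(key, new IntWritable(count));
    }
}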

6. Create a class named BookXDriver. This class will be our main program to run the MapReduce job we have just written.
[Screenshot: the BookXDriver class]
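As a rough guide, a driver using the old JobConf API could look like the sketch below; the job name and the exact configuration calls are illustrative rather than a copy of the original code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class BookXDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BookXDriver.class);
        conf.setJobName("BookCrossingFrequency");   // illustrative job name

        // Types of the (key, value) pairs produced by the job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(BookXMapper.class);
        conf.setReducerClass(BookXReducer.class);

        // args[0] = input path, args[1] = output path (must not already exist)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

The input and output paths come from the command-line arguments, which is exactly what the run configuration in step 7 and the hadoop jar invocation in step 8 supply.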
7. To run in Eclipse
  • Set the Run Configurations from the menu option of the Project (Right-click on the project)

[Screenshot: Run Configurations menu]

  • In the arguments for a project, set the paths of the input file/directory and the expected output directory. Make sure that the output directory does not exist, as it would throw an error if it already exists.

[Screenshot: program arguments]

  • Run and see the console. The output directory contains two files. The file with the prefix “part-” is the actual output of your MapReduce logic as shown below. The file with name “_SUCCESS” is just a marker to signify a successful run.

8. To run on a Hadoop cluster (the size of the cluster does not matter here; it can be a single node or multi-node)

  • Make sure your cluster is started.
  • Export the Eclipse project as a runnable jar. Right-click the project and you'll find the option to export it as a runnable jar, along with its configuration settings. For our example, we named the jar file BookCrossingJar.jar
  • Upload your Book Crossing dataset to HDFS by the following command

$ hadoop fs -put ~/Work/HadoopDev/Input/BX-Books.csv input

  • Run the jar with the HDFS path of the file as the parameter passed to the hadoop command along with the expected path of the output. Make sure that the output directory does not exist, as it would throw an error if it already exists.

$ hadoop jar ~/Work/HadoopDev/BookCrossingJar.jar input output

  • The output is generated and can be seen using the following command:

$ hadoop fs -cat output/*

Download

The running code for this tutorial is present at the github location of the tutorials at https://github.com/Orzota/tutorials.

Conclusion

In this tutorial we learned how to write a mapper, a reducer and the driver class for running MapReduce programs. We also learned two ways of running our MapReduce logic: one using Eclipse, which is suitable for local debugging, and the other using a single-node Hadoop cluster for real-world execution. We also learned about some basic input and output datatypes.
Stay tuned for more exciting tutorials from the Small world of BigData.
Eclipse Setup for Hadoop Development

Objectives

We will learn the following things with this Eclipse Setup for Hadoop tutorial 
  • Setting Up the Eclipse plugin for Hadoop
  • Testing the running of Hadoop MapReduce jobs

Prerequisites

The following are the prerequisites for Eclipse setup for Hadoop program development using MapReduce and further extensions.
  • You should have the latest stable build of Hadoop (as of writing this article 1.0.3)
  • You should have eclipse installed on your machine. Eclipse versions up to 3.6 are compatible with the eclipse plugin (it does not work with Eclipse 3.7). Please refer to Link 6 of the Appendix for details on how to get and install Eclipse for your development platform.
  • It is expected that you have preliminary knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Please refer to the blog article describing Hadoop installation.

Procedure

1. Download the Eclipse plugin from the link given here.
[Screenshot: Eclipse plugin download]
We are utilizing this plugin to support the newer versions of eclipse and the newer versions of Hadoop. The eclipse-plugins packaged with earlier versions of Hadoop do not work.
2. After downloading the plugin, copy it into the plugins folder of eclipse.
  • Windows: Go to the Eclipse folder, located by default at C:\eclipse (you may have installed it elsewhere). Copy the downloaded plugin to the C:\eclipse\plugins folder.
  • Mac OS X: Eclipse comes as an archive. Unarchive it and find the folder ../eclipse/plugins/. Copy the downloaded plugin into it.
  • Linux: The plugins folder is in the Eclipse installation directory, normally /usr/share/eclipse on Linux machines.
3. After you have copied the plugin, start Eclipse (restart, if it was already started) to reflect the changes in the Eclipse environment.
Go to the Window option of the menu bar in eclipse, 
[Screenshot: Eclipse menu bar]
Select the option “Other” from the drop-down list.
[Screenshot: selecting "Other" from the drop-down list]
Select the option Map/Reduce from the list of Perspectives.
[Screenshot: selecting the Map/Reduce perspective]
As you select the "Map/Reduce" perspective, you'll notice a few additions in the Eclipse views.
Notice the Perspective Name 
[Screenshot: the Map/Reduce perspective name]
The DFS locations are now directly accessible through the Eclipse configurations. Use the MapReduce Locations View to configure Hadoop Access from Eclipse.
[Screenshot: DFS Locations view]
4. You have now set up Eclipse for MapReduce programming.
Right-click on the Eclipse Package Explorer view, or click on the File option in the menu bar of your Eclipse distribution.
[Screenshot: Package Explorer context menu]
Select the New option and then the "Other" option in it (Ctrl+N or Command+N).
[Screenshot: the New > Other dialog]
Select Map/Reduce Project
[Screenshot: selecting Map/Reduce Project]
Fill in the details in the Project Wizard.
[Screenshot: the project wizard]
The Hadoop library location must be the directory specified in $HADOOP_HOME.
[Screenshot: Hadoop library location in the project wizard]
Now, after the project has been created, create the Mapper, Reducer and Driver classes by right-clicking the project and choosing them from the New option.
[Screenshot: New class options]
Create a Mapper class.
[Screenshot: creating a Mapper class]
After creating a Mapper class, the generated code snippet is as follows:

… // Necessary Classes Imported

public class BookCrossingMapper extends MapReduceBase implements Mapper {

    public void map(WritableComparable key, Writable values,
            OutputCollector output, Reporter reporter) throws IOException {

    }
}

Please replace the above code snippet with the one below:

… // Necessary Classes imported

public class BookCrossingMapper extends MapReduceBase implements Mapper<K, V, K, V> {

    @Override
    public void map(K arg0, V arg1, OutputCollector<K, V> arg2,
            Reporter arg3) throws IOException {
        // TODO Replace K and V with the suitable data types
    }
}

Replace K and V with the required Key and Value datatypes such as LongWritable, Text, etc. Repeat the same steps while creating the new Reducer Class.
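For example, if the job reads plain text lines and emits (Text, IntWritable) pairs, as in the Book Crossing example, the parameterized mapper could look roughly like the sketch below; the concrete types here are an assumption for illustration, not the exact generated code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BookCrossingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // key is the byte offset of the line, value is the line itself;
        // emit (Text, IntWritable) pairs via output.collect(...)
    }
}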

You are now ready to start programming in MapReduce.

Next Steps
My next blog article will be an intro to MapReduce programming, so stay tuned.

 

Single-Node Hadoop Tutorial

Objectives

We will learn the following things with this single-node Hadoop tutorial:

  • Setting Up Hadoop in Single-Node and Pseudo-Cluster Node modes.
  • Test execution of Hadoop by running sample MapReduce programs provided in the Hadoop distribution.

Prerequisites

Platform

  • Either Linux or Windows.
  • Basic knowledge of Linux shell commands.

Software

  • Java™ 1.6.x must be installed.
  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Additional requirement for Windows:

  • Cygwin, required for shell support in addition to the required software above.

Installing Software

If your machine doesn’t have the requisite software you will need to install it. For example on Ubuntu Linux:

$ sudo apt-get install ssh
$ sudo apt-get install rsync

On Windows, if you did not install the required software when you installed cygwin, start the cygwin installer and select the 'openssh' package in the Net category.

Setting up Environment Variables

  • Ubuntu/Fedora (and for most Linux-kernel systems)

$ vi ~/.bash_profile
Add the following lines (if not present):

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=/usr/java/jdk1.6.0_26
(the directory containing Java. Set it according to the actual path in your distribution)

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

export HADOOP_LOG_DIR=$HADOOP_HOME/logs

(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

$ source ~/.bash_profile

(type the above command to activate the new path settings immediately)

  • Mac OS X :

export HADOOP_HOME=/usr/root/hadoop-1.0.x
(the directory containing the Hadoop distribution)

export JAVA_HOME=$(/usr/libexec/java_home)
(Will execute the script to set JAVA_HOME)

export PATH=$HADOOP_HOME/bin:$PATH
(Adding the hadoop bin path to the PATH Variable)

export HADOOP_LOG_DIR=$HADOOP_HOME/logs
(optional. For direct access)

export HADOOP_SLAVES=$HADOOP_HOME/conf/slaves
(optional. For direct access)

ssh Settings

  • Works on GNU/Linux, Mac OS X and Cygwin only
  • Check that you can ssh to the localhost without using the password for the user:

$ ssh localhost

  • If you cannot ssh to the localhost without using the password for the user, execute the following commands,

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

        This creates a new pair of public-private keys in the directory ~/.ssh in two files: id_dsa and id_dsa.pub

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

  • The generated public key is appended to a new file called authorized_keys, enabling passphrase-less authentication.

Download Hadoop

  • To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this tutorial, we’ll use the Hadoop 1.0.3 version only.

Install Hadoop

  1. Unpack the Hadoop Distribution downloaded. Make sure you use the same directory path that was used to set  HADOOP_HOME.
  2. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of the Java installation. This could be the same path as JAVA_HOME set earlier.
  • For Mac users only: In the file bin/hadoop, search for the line:

JAVA=$JAVA_HOME/bin/java 

and change it to: 

JAVA=$JAVA_HOME/Commands/java

  • Try the following command now (from anywhere, as you have already set $HADOOP_HOME/bin in the PATH variable)

$ hadoop

If this shows the usage documentation for the hadoop script, then your initial setup for Hadoop and Java is correct. If it cannot find hadoop, then re-trace the steps above to ensure that everything has been set correctly.

Single Node Deployment

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory. 

$ mkdir input

$ cp conf/*.xml input

Copy the contents of $HADOOP_HOME/conf  to a new input directory.

Now run hadoop from the command line with arguments

$ hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

  • jar: run the named jar file (the one containing the MapReduce logic)
  • grep: the example program to run; its arguments follow
  • input: the input folder
  • output: the output folder
  • 'dfs[a-z.]+': the regular expression for the grep program

The output of the hadoop command should look like this screen shot.

$ cat output/*

Check the output of the grep program.

Pseudo-Node Deployment

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. 

To do so, edit the following files in the conf  directory of $HADOOP_HOME.

  • conf/core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

  • conf/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

  • conf/mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

Execution

Format a new distributed-filesystem:

$ bin/hadoop namenode -format

Usage: java NameNode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

Start the hadoop daemons:

$ bin/start-all.sh

The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).

NameNode Details

The NameNode web interface can be accessed at https://localhost:50070

It describes the current usage of HDFS, Live/Dead Nodes, and other necessary administrative details.

[Screenshot: NameNode web UI]

JobTracker

The JobTracker can be accessed at  https://localhost:50030. It is used to monitor the MapReduce jobs and the Tasks submitted to each individual machine. This information is also logged to logging files, as configured, inside the $HADOOP_HOME/logs folder.

[Screenshot: JobTracker web UI]
Examples in Hadoop

Copy the input files into the distributed filesystem:

$ bin/hadoop fs -mkdir input

$ bin/hadoop fs -put conf input

Run some of the examples provided:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'


A snapshot after successful run of the MapReduce example:

[Screenshot: successful run of the MapReduce example]

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hadoop fs -get output output
$ cat output/*

Or view the output files on the distributed filesystem:

$ bin/hadoop fs -cat output/*

View the output from HDFS itself, or copy it to the local filesystem and view it in your editor of choice.

[Screenshot: output viewed from HDFS]

 

When you’re done, stop the daemons with:
$ bin/stop-all.sh

Next Steps

Congratulations. You have successfully installed and configured hadoop on a single node and run the example application. The next step is to program your own application using the MapReduce API. We will cover this topic in the next tutorial.

References

1. We referred to the Single-Node Setup guide on the Apache Hadoop page.

Welcome

We are very excited to launch the Orzota blog. We hope to provide useful articles on Hadoop and related technologies. We will address both beginner topics for those just getting started with Hadoop and more advanced tips and techniques.