Hive Tutorial for Beginners

by Shanti Subramanyam

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.
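To get a feel for HiveQL before we set anything up, here is the query we will build up to by the end of this tutorial; it counts the number of books published in each year and reads just like SQL:

hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;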

Objective

The objective of this Hive tutorial is to get you up and running with Hive queries on a real-world dataset.

Prerequisites

The following are the prerequisites for setting up Hive and running Hive queries:
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3).
  • To install Hadoop, see my previous blog article on Hadoop Setup.
  • Your machine should have Java 1.6 installed.
  • It is assumed you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Basic knowledge of Linux will help you understand many of the Linux commands used in the tutorial.
  • Download the Book Crossing Dataset. This is the data set we will use. (An alternative link to the dataset is available on the GitHub page of the Tutorials.)

Setting up Hive

Platform

This Hive tutorial assumes Linux. If you are using Windows, please install Cygwin; it is required for shell support in addition to the software listed above.

Procedure

Download the most recent stable release of Hive as a tarball from one of the Apache download mirrors. For this Hive tutorial, we are going to use hive-0.9.0.tar.gz.
Unpack the tarball in the directory of your choice, using the following command:
  $ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
You can either do
  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)
or set HIVE_HOME in $HOME/.profile so it is set every time you log in. Add the following line to it:
  export HIVE_HOME=<path_to_hive_home_directory>
e.g.
  export HIVE_HOME='/Users/Work/hive-0.9.0'
  export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
Start Hadoop (refer to the Single-Node Hadoop Setup Guide for more information). It should show the various Hadoop processes being started. You can check which processes are running by using the jps command:
$ start-all.sh
<< Starting various hadoop processes >>
$ jps
  3097 Jps
  2355 RunJar
  2984 JobTracker
  2919 SecondaryNameNode
  2831 DataNode
  2743 NameNode
  3075 TaskTracker
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) in HDFS and set appropriate permissions on them before a table can be created in Hive, as shown below:
  $ hadoop fs -mkdir /tmp
  $ hadoop fs -mkdir /user/hive/warehouse
  $ hadoop fs -chmod g+w /tmp
  $ hadoop fs -chmod g+w /user/hive/warehouse

Problem

The problem we are trying to solve in this tutorial is to find the frequency of books published each year. Our input data set (the file BX-Books.csv) is a CSV file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";
"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row; the other rows are sample records from the file. Our objective is to find the frequency of books published each year. This is the same problem that was solved in the previous blog post (Step-by-step MapReduce Programming using Java).
Because the raw data is not cleansed (semicolons appear inside field values as well as between fields, and HTML entities such as &amp; are embedded in the text), it could give us erroneous results. We clean it up with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.csv
The sed pipeline first replaces the pattern "&amp;" with a plain "&" and deletes the first line (the header line); if we did not remove the header line, Hive would process it as part of the data, which it isn't. It then replaces every ";" with "$$$" and converts the quoted pattern "$$$" (the actual field boundaries) back to ";". The net effect is that only the semicolons appearing inside field values remain as $$$, so ";" can safely be used as the field delimiter when we define the Hive table.
All the above steps cleanse the data and help Hive give accurate results for our queries. For example, the row

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

                is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"
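
As a quick sanity check (assuming both files are still in the current directory), you can confirm that exactly one line, the header, was dropped during the cleanup:
$ # the corrected file should have one line fewer than the original
$ wc -l BX-Books.csv BX-BooksCorrected.csv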

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input
Running Hive using the command line:
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet 
    >   (ISBN STRING, 
    >   BookTitle STRING, 
    >   BookAuthor STRING, 
    >   YearOfPublication STRING, 
    >   Publisher STRING, 
    >   ImageURLS STRING, 
    >   ImageURLM STRING, 
    >   ImageURLL STRING) 
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ';'
    > STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

The username ("work" in our example) in the LOAD DATA path above depends on the Hadoop setup on your machine, i.e., the user name under which Hadoop was set up.
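If you are unsure of the exact HDFS path to use in the LOAD DATA statement, you can list the uploaded file first and use the path that Hadoop reports (a quick check, assuming the earlier upload succeeded):
$ # the relative path 'input' resolves to /user/<your-username>/input in HDFS
$ hadoop fs -ls input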

Output

The output of the query is shown below:

[Screenshot: query output showing the number of books published in each year]
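
If you would also like to save these results to a local file instead of only viewing them on the console, one option is Hive's INSERT OVERWRITE LOCAL DIRECTORY (a sketch; the output directory below is just an example):

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/book-year-counts'
    > select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;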

Comparison with Java MapReduce

You can compare the above output with the output of the MapReduce code from the previous blog entry. Let's take a look at how the code for Hive differs from the MapReduce code:

Mapper

[Screenshot: Java Mapper code from the previous MapReduce tutorial]

Reducer

[Screenshot: Java Reducer code from the previous MapReduce tutorial]

hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ';'
    > STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

It is clear from the above that Hive reduces the programming effort required as well as the complexity of learning and writing MapReduce code. In the small example above, we reduced the lines of code from roughly 25 to 3.

Conclusion

In this tutorial, we learned how to set up Hive and run Hive queries. We wrote a query for the same problem statement that we solved with MapReduce and saw how much the programming effort is reduced by using HiveQL. Stay tuned for more exciting tutorials from the small world of Big Data.

 
