Orzota

Author: Shanti Subramanyam

HBase is a distributed columnar store that provides random, read-write access to Big Data stored in very large tables. It quickly gained popularity as a Big Data technology because, unlike Hadoop, which was used primarily in back-end data warehousing infrastructure, it could also be deployed to serve online transactions. Just as Hive brought SQL to Hadoop (and now to HBase, as described in my earlier blog post), several other projects provide SQL for HBase; they are the subject of this article.

Phoenix

Phoenix is a technology developed by Salesforce.com to put a SQL skin over HBase. Rather than adding a new layer, it exposes HBase functionality through SQL using an embedded JDBC driver that allows clients to run at native HBase speed. The JDBC driver compiles SQL into native HBase calls.
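
To give a flavor of the SQL involved, here is a minimal, hypothetical sketch in Phoenix-style SQL (the table and column names are illustrative, not taken from the Phoenix documentation):

  -- create a table; the primary key constraint becomes the HBase row key
  CREATE TABLE IF NOT EXISTS page_views (
      host      VARCHAR NOT NULL,
      view_day  VARCHAR NOT NULL,
      views     BIGINT
      CONSTRAINT pk PRIMARY KEY (host, view_day));

  -- Phoenix uses UPSERT rather than INSERT
  UPSERT INTO page_views VALUES ('orzota.com', '2013-06-01', 42);

  -- an aggregate query that the driver compiles into HBase scans
  SELECT host, SUM(views)
  FROM page_views
  WHERE view_day >= '2013-01-01'
  GROUP BY host;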

The claim is that the performance of Phoenix is much better than that of Hive over HBase – up to 10x at 100 million rows. Performance also stays more consistent as the number of rows increases, and the optimizations are transparent to the user.

Impala

Cloudera has been promoting Impala heavily since its announcement at Strata last year. Impala is a SQL engine that can run on HDFS, HBase, or both. Cloudera too claims that Impala performance is much better than that of Hive – over 45 times! However, this comparison seems to be based on native Hive, not Hive over HBase. The performance gains come from not writing intermediate results to disk (as MapReduce does), avoiding spin-up/spin-down times, and optimized code, since Impala is written in C++ rather than Java. This does mean that it must use JNI to call into Java for its HBase integration.

Impala also includes authentication via Kerberos, JDBC/ODBC drivers (similar to Phoenix), an interactive shell, and support for popular Hadoop file formats such as Parquet, Avro, Thrift, and sequence files. Parquet was specifically engineered for Impala to get the best performance.
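
Because Impala shares the Hive metastore, an HBase-backed table defined through Hive can be queried from the Impala shell with plain SQL. A hedged sketch, with illustrative table and column names:

  -- query an HBase-backed table registered in the shared Hive metastore
  SELECT user_id, COUNT(*) AS events
  FROM events_hbase
  WHERE user_id = 'u123'   -- row-key predicate that HBase can evaluate
  GROUP BY user_id;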

The use cases for HBase and Impala include the following:

  • Data is being streamed in
  • Need to run queries on data as it is being created
  • Need real time edits of data in-place
  • Need random access to computed results (results can be placed in HBase)

Drill

Apache Drill (the only one in this series that is an Apache project) was inspired by Google's Dremel. It is the most interesting in terms of its emphasis on interactive analysis of large-scale datasets. It is similar to Impala in many ways but is community driven. Although Hive is also a community project, the need for Drill arose because of Hive's tight coupling to MapReduce, which is based on a pessimistic execution principle, i.e., every job is assumed to be long-running and to process lots of data. This causes a lot of overhead for short-running jobs, which Drill addresses.

Drill does well when execution times of 100 ms to 10 seconds are required. Phoenix provides even better response times – below 10 ms.

Drill is unique in its focus on analytical queries, so a lot of emphasis is placed on bulk reads and aggregations. Phoenix is more focused on write optimizations as in the RDBMS world.

Drill also provides extensions to SQL for nested data types, similar to BigQuery. Many systems nest data structures inside HBase cells (for example, a cell can be a whole log record, which in turn has many nested fields). For optimal performance, it is much better to treat these as nested structures and process them directly in the query language.
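
As a rough illustration (Drill was still in early development when this was written, so treat the syntax as a sketch and the names as made up), nested fields can be addressed directly in the query rather than being exploded by the application:

  -- each record carries a nested log_record structure
  SELECT t.log_record.level AS level, COUNT(*) AS cnt
  FROM events t
  WHERE t.log_record.status >= 500
  GROUP BY t.log_record.level;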

Drill leverages recent research approaches (late record materialization, vectorized operators, etc.). As such, it seems the most promising of the three for analytical work.

Summary

With all of these rich technologies providing SQL for HBase, HBase's popularity is bound to grow. Of course, with choice comes decisions: which technology is right for you? Orzota can help analyze your use case and recommend the right set of technologies based not just on technical merits, but also taking into consideration maturity, cost, and vendor support. Please contact us for more information.

HBase is a NoSQL data store that traditionally required a developer to write MapReduce programs to access and analyze the data. Hive, on the other hand, has been gaining popularity with its SQL-like interface. However, using Hive over HDFS leads to long ETL times and no access to real-time data.

The solution? Hive over HBase. This benefits HBase users, since no programming is required, while traditional Hive use cases benefit by becoming much faster and gaining access to real-time data. The notes in this article are from the Hive session at HBaseCon.

Use Cases

Here are some example use cases for when one would want to use Hive over HBase.

Use Case 1: HBase as ETL Data Sink

If you want to analyze data in HDFS and serve the results via online queries, say to a recommender system, use Hive to perform the analysis and then populate HBase, which serves as the data store for online access.
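
A minimal HiveQL sketch of this pattern, using the standard Hive-HBase storage handler (the table and column names are illustrative):

  -- register an HBase table in Hive; :key maps to the HBase row key
  CREATE TABLE hbase_recommendations(user_id STRING, recs STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:recs");

  -- populate it from a Hive table computed over data in HDFS
  INSERT OVERWRITE TABLE hbase_recommendations
  SELECT user_id, recs FROM recommendations_hdfs;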

Use Case 2: HBase as Data Source

Some data lives in HDFS and some in HBase; combine (join) them using Hive. Dimension data that changes frequently can be kept in HBase, while static tables can reside in HDFS.
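
For example (a sketch with illustrative names, assuming both tables are already registered in Hive):

  -- join a static fact table in HDFS with a frequently updated
  -- dimension table backed by HBase
  SELECT f.order_id, f.amount, d.customer_name
  FROM orders_hdfs f
  JOIN customers_hbase d ON (f.customer_id = d.customer_id);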

Use Case 3: Low Latency Warehouse

HDFS and HBase have the same table definitions. HBase receives the continuous updates, while daily periodic dumps go into HDFS. Hive queries can then combine the two sources for real-time query.
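
A hedged sketch of how the two sources might be combined (illustrative table names; both tables are assumed to share the same schema):

  -- union the daily HDFS dump with the live HBase table
  SELECT id, val FROM (
      SELECT id, val FROM warehouse_hdfs
      UNION ALL
      SELECT id, val FROM warehouse_hbase
  ) combined;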

How does this work?

Fair warning. The rest of this article gets pretty technical, delving into the implementation details of Hive for HBase. If you don’t care, then just realize that this technology is a fantastic bridge between the speed and online capability of HBase and the ease of use and SQL-like qualities of Hive.

Hive + HBase Architecture

The StorageHandler layer sits between Hive and HBase (which itself stores its data on HDFS), in parallel with the MapReduce framework. Hive DDL operations are converted to HBase DDL operations via a client hook. All operations are performed by the client; no two-phase commit is used.

Schema Mapping

Hive tables, columns, and column types are mapped to HBase tables and column families. Every field in a Hive table is mapped to one of the following:

  • the table key (using :key as the selector)
  • a column family (cf:), for MAP fields in Hive
  • a single column

The Hive HBase Integration wiki has more details.
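
The mapping is expressed through the hbase.columns.mapping property. The following sketch follows the pattern documented on that wiki (names are illustrative): the first Hive column maps to the row key, the second to a single HBase column, and the third to an entire column family as a MAP:

  CREATE TABLE hbase_users(key STRING, name STRING, props MAP<STRING,STRING>)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,attrs:");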

Type Mapping

This was added in Hive 0.9; previously, all types were converted to strings in HBase. The table-level property hbase.table.default.storage.type can be used to specify the type mapping: if it is binary, Hive types are mapped to bytes, otherwise to strings. If a type is not primitive or a Map, it is converted to a JSON string and serialized.
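
For instance, a hedged sketch of a table definition that stores its values in binary form rather than as strings (names are illustrative):

  CREATE TABLE hbase_metrics(id STRING, cnt BIGINT)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:cnt")
  TBLPROPERTIES ("hbase.table.default.storage.type" = "binary");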

There are still a few rough edges for schema and type mapping which should improve in subsequent versions.

Bulk Load Feature

Instead of writing an MR job to bulk load data into HBase, we can use Hive: sample the source data and save the samples to a file, then run a CLUSTER BY query using HiveHFileOutputFormat and import the resulting HFiles into HBase.
The HBase Bulk Load page has details.
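
An abbreviated, hedged sketch of the flow described on that page (property values, paths, and table names are illustrative, and the sampling step that produces the split keys is omitted):

  SET mapred.reduce.tasks=12;
  SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
  SET total.order.partitioner.path=/tmp/hbase_splits;  -- produced by the sampling step

  -- a staging table whose output format writes HFiles directly
  CREATE TABLE hbsort(id STRING, val STRING)
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
  TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');

  -- totally ordered output, ready to be handed to HBase's bulk load tool
  INSERT OVERWRITE TABLE hbsort
  SELECT id, val FROM source_table CLUSTER BY id;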

Filter Pushdowns

The idea is to pass filter expressions down from the Hive optimizer to the storage layer to minimize the data scanned. To do this, the storage handler negotiates with the Hive optimizer to decompose the filter, and the resulting predicates are pushed into the query plan. This support is being added to Hive.
Find out more at https://cwiki.apache.org/Hive/filterpushdowndev.html
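
As a simple illustration (reusing the hypothetical table from the schema-mapping sketch above), a row-key predicate like the one below can be handed to the HBase scan instead of being evaluated after a full table scan:

  SELECT * FROM hbase_users WHERE key = 'user123';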

There are many use cases for Hive on HBase. But if what you really want is a SQL interface to HBase, there are other options as well. I will cover these in a subsequent blog post.

I attended HBaseCon yesterday. It was a fantastic event with very meaty tracks and sessions.
HBase is a NoSQL database meant for use in online big data workloads, where response time matters. It was originally developed at Powerset in 2008, based on Google's BigTable paper, and it soon became an Apache project. The first Apache release was in 2010 and it has been gaining momentum since. There are lots of production deployments of HBase today, with a wide variety of use cases.

I will share some of the use cases presented at HBaseCon in this article.

HBase at Pinterest

Pinterest is deployed entirely on Amazon EC2. Pinterest uses a follow model where users follow other users. This requires a following feed for every user that gets updated every time a followee creates or updates a pin. This is a classic social media application problem. For Pinterest, this amounts to hundreds of millions of pins per month that fan out into billions of writes per day.
The 'Following Feed' is therefore implemented using HBase. Some specifics:

  • They chose a wide schema where each user’s following feed is a single row in HBase. This exploits the sorting order within columns for ordering (each user wants to see the latest in his feed) and results in atomic transactions per user.
  • To optimize writes, they increased the per-region memstore size. A 512 MB memstore produces 40 MB HFiles instead of the small 8 MB files created with the default memstore size, which leads to less frequent compactions.
  • They take care of the potential for infinite columns by trimming the feed during compactions: there really is not much point having an infinite feed anyway.
  • They also had to do GC tuning (who doesn't?), opting for more frequent but smaller pauses.

Another very interesting fact: they maintain a mean time to recovery (MTTR) of less than 2 minutes. This is a great accomplishment, since HBase favors consistency over availability. They achieve it by reducing various timeout settings (socket, connect, stale node, etc.) and the number of retries. They also avoid a single point of failure by using two clusters, and to protect against NameNode failure they keep a copy on EBS.

HBase at Groupon

Groupon has two distinct use cases: delivering deals to users via email (a batch process) and providing a relevant user experience on the website. They have increasingly tuned their deals to be more accurate and relevant to individual users (personalization).
They started out running Hadoop MapReduce (MR) jobs for email deal delivery and used MySQL for their online application – but ideally wanted the same system for both.
They now run their Relevance and Personalization system on HBase. To cater to the very different workload characteristics of the two systems (email and online), they run two HBase clusters that are replicated, so they have the same content but are tuned and accessed differently.
Groupon also uses a very wide schema – one column family for user history and profile and another for email history.
A 10 node cluster runs HBase (apart from the 100 node Hadoop cluster). Each node has 96GB RAM, 24 virtual cores, 8x2TB disks.
Data in the cluster: 100m+ rows, 2TB+ data, 4.2B data points.

HBase at Longtail Video

This company provides JW Player, an online video player used by over 2 million websites. They have lots of data which is processed by their online analytics tool. They too are completely deployed on AWS and as such use HBase and EMR from Amazon. They read data from and write data to S3.
They had the following requirements:

  • fast queries across data sets
  • support for date-range queries
  • store huge amounts of aggregated data
  • flexibility in dimensions used for rollup tables

HBase fit the bill. They use multiple clusters to partition their read-intensive and write-intensive workloads, similar to Groupon. They are a full-fledged Python shop, so they use HappyBase and run Thrift on all the nodes of the HBase cluster.

Summary

It is clear that HBase adoption has grown and that there are many production deployments, small and large, with varying use cases.
In addition to the cases above, there is of course Facebook's messaging system, as well as OPower, Flurry, Lily, Meetup, and many others.

It is also clear that there is still a lot of work to be done to make it easier to use, configure and tune. After all, we are not even at 1.0!

I attended Cassandra Summit 2013. This is the first year I'm attending this conference, and I was impressed with the turnout as well as the number and quality of the sessions. I summarize some of the content from the first day's sessions below.

Cassandra Keynote

The star of the morning keynote was Jonathan Ellis, co-founder of DataStax.

Yet another benchmark

He started his presentation with a graph showing benchmark data comparing Cassandra with other NoSQL databases (HBase, Voldemort, VoltDB, Redis) and even MySQL. I looked up the original VLDB paper that published this benchmark to understand the workload and testing methodology. The authors were trying to mimic the actions of an APM (Application Performance Management) agent as it collects and records data. Yet they split things up into read- and write-type workloads similar to other benchmarks, which really have nothing to do with their original intent of an APM workload! Personally, I find such simplistic workloads problematic: the servers are unlikely to experience the kind of load a real application would place on them. But this is a topic for another blog post.

Anyway, as you can guess, Cassandra came out on top with many times the performance of HBase when scaled to 32 nodes. Although MongoDB was not part of this comparison, Jonathan said that other benchmarks showed MongoDB performance to be even worse than HBase.

Cassandra 1.2

He then moved on to talk about some of the features of 1.2, summarized briefly below:

  • The focus is on ease of use with CQL
  • Thrift compatibility is retained
  • CQL adds collections to SQL (e.g. multiple email addresses for a user); this is important since Cassandra can't do joins with, say, an address table (see the short CQL sketch after this list)
  • Adds a data dictionary (more like an RDBMS)
  • Adds authentication and authorization (even more like an RDBMS)
  • New CQL native protocol – efficient, lightweight and asynchronous
  • A native Java driver has been available for a few weeks; a .NET driver is in beta, with other drivers coming
  • New tracing feature – helps troubleshoot problematic queries that take a long time
  • Memory improvements: reduced JVM heap usage by moving bloom filters and compression offsets off the heap
  • Compaction throttling: smooth, constant rate limiting
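
A short CQL sketch of a collection column (the schema is illustrative, not from the keynote):

  CREATE TABLE users (
      user_id uuid PRIMARY KEY,
      name text,
      emails set<text>   -- multiple email addresses, no join needed
  );

  UPDATE users SET emails = emails + {'shanti@example.com'}
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;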

Cassandra 2.0

This release is targeted at the end of July. Lots of house cleaning is the focus of this release, hence the major version number. A notable new feature will be the addition of CAS (compare-and-swap) for use when eventual consistency is just not enough. Another RDBMS-like feature: triggers using Java classes!
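
As a rough sketch of what CAS-style CQL looks like (Cassandra 2.0 had not shipped at the time, so treat the syntax as illustrative, with the same made-up users table as above):

  INSERT INTO users (user_id, name)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'shanti')
  IF NOT EXISTS;

  UPDATE users SET name = 'Shanti S'
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
  IF name = 'shanti';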

From the looks of it, Cassandra will soon resemble an enterprise RDBMS in terms of functionality. Hopefully all these additional features will not come at the price of performance.

Use Cases

Several companies presented their use cases. This is a very useful track to help others new to the technology to understand what kinds of applications in their own companies can benefit.

Ground Traffic Control Logistics

This was presented by Jesse Young of Zonar Systems. They collect lots of data from heavy-duty fleets using a GPS-based hardware device, which sends the data to their SaaS portal. Their existing data sits on 100+ database servers using 3000+ databases and spanning 100 TB.

Requirements: replication, fast inserts and retrievals, horizontal scalability, ease of administration. (yeah – who doesn’t want any of these?)

Cassandra was chosen due to built-in replication, speed and easy administration.

They have created a few applications using Cassandra:

  • Photos: cheap storage, easy capacity growth, TTLs (they want photos to expire after a certain period)
  • Elevation Data: heavy reads, key-based, up to 6,000 reads/sec at peak, 150M reads/day
  • ZPass+: this app tracks ridership – millions of users doing 20M writes/day, but traffic is bursty, arriving only when users get on a bus

The interesting thing is that they have not yet moved their core app – GPS data – into Cassandra. That will be the true test.

Intuit’s Consumer Financial Platform

Intuit is creating a common data layer across many of their apps. They chose Cassandra because it can scale and is highly available (one of their main requirements, and one that HBase does not meet). It is easy to replicate across data centers, and it enables fast snapshots and rolling upgrades. Operations needs to make schema changes easily, and this was an important requirement.

Some interesting notes from this presentation included the performance problems they ran into. Cassandra does compaction in the background, causing huge I/O spikes, which made it unsuitable for blob storage; so documents go into DFS (Red Hat) while the metadata lives in Cassandra. They also see huge, unaccounted-for CPU spikes caused by GC activity. It is important to tune the heap size to get more predictable performance.

Summary

Clearly there are production deployments of Cassandra, and many of them. However, it takes 1-3 years to get an enterprise application onto a new data platform like Cassandra, and many customers are still testing the waters with proofs of concept. It's also much easier to start new app development on a new platform than to port an existing app. Cassandra requires a complete re-thinking of the data model, which many find challenging. As the old adage goes: no pain, no gain.

Who is a Data Scientist?

In case you missed it, one of the hottest new job titles in tech is that of the Data Scientist. In traditional IT, data was managed by engineers who ensured that the data warehouse had relevant data, which was then used by business analysts to answer business questions. With the rise of big data, driven by the ability to process large amounts of unstructured (e.g. data from devices), semi-structured (e.g. all sorts of machine logs) and traditional structured data, the notion of the data warehouse as a central RDBMS for analytics is being disrupted. Why not do analysis on all this raw data in the big data platform instead? But analysts who use BI tools do not typically have the ability to deal with raw data, let alone "big data".

Thus was born the role of the data scientist – someone who has a strong computer science background (read: PhD), along with math, statistics, and the business acumen of the analyst. The data scientist explores the data set independently, creates analytical models, and tries to find patterns in the data to answer business questions. Anjul Bhambhri, vice president of big data products at IBM, goes a step further, describing the data scientist role as "part analyst, part artist". She says, "A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It's almost like a Renaissance individual who really wants to learn and bring change to an organization."

Do you need one?

Companies like Facebook, LinkedIn, and Google have derived great benefits from teams of data scientists. Many non-technology businesses have also started hiring them. Macy's has boosted store sales by 10% using big data analytics done by their data scientists (https://www.cnbc.com/id/100638141). Organizations of all sizes can gain superior insights from data analytics, leading to improved decision making.

Starting a data science project

Before you decide that your company needs a data scientist, it is important to take a step back and have a clear understanding of what problems need to be solved.
It is best to start small in the data science space by picking a particular problem and trying to solve it. Note that this is counter to the general view that data scientists explore data and find interesting patterns. For smaller companies, and for those new to big data, the better option is to stay focused on getting results for one initial question. For example, a fairly universal problem for many businesses is to understand why customers are lost (known as customer churn). This can be a very complex analysis involving many different dimensions, from pricing and support to product quality, feedback, and social media, and it can be a daunting task to undertake.
It is possible to break this down, solve a subset of the problem, and continue to build on it.

All of this sounds great: find a data scientist and you can have great new insights into your data and the ability to solve business problems. Except for one small problem – they are hard to find. The McKinsey Global Institute estimates a shortage of 190,000 people with data science skills by 2018.

If you can't find a data scientist, it is possible to build this skill set with a team consisting of a data architect, a data engineer, a data analytics engineer, and a business analyst working cooperatively. This may even be advantageous, since a larger team can handle multiple projects in the long term. Finding people with a narrower skill set may also be easier.

If your organization does not have the expertise to start a data science project, consider partnering with a consulting firm that can help you get started. Bootstrapping from this project may be quicker than trying to start from scratch and learn as you go. Enterprise IT is littered with tales of failed ambitious projects. Don’t be part of that statistic!


Pig is a data flow platform for writing Hadoop operations in a language called Pig Latin. It adds a layer of abstraction on top of Hadoop to simplify its use by giving a SQL-like interface to process data on Hadoop, helping the programmer focus on business logic and increasing productivity. It supports a variety of data types and the use of user-defined functions (UDFs) to write custom operations in Java, Python, and JavaScript. Due to its simple interface and support for complex operations such as joins and filters, Pig is popular for performing query operations in Hadoop.

Objective


The objective of this Pig tutorial is to get you up and running Pig scripts on a real-world dataset stored in Hadoop.

Prerequisites

The following are the prerequisites for setting up Pig and running Pig scripts
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
  • To install hadoop, see my previous blog article on Hadoop Setup
  • Your machine should have Java 1.6 installed
  • It is assumed you have basic knowledge of Java programming and SQL.
  • Basic knowledge of Linux will help you understand many of the linux commands used in the tutorial.
  • Download the Book Crossing DataSet. This is the data set we will use. (Alternative link to the Dataset is on the github page of the Tutorials)

Setting up Pig

Platform

This Pig tutorial assumes Linux/Mac OS X. If using Windows, please install Cygwin. It is required for shell support in addition to the required software above.

Procedure

Download a stable tarball (for our tutorial we used pig-0.10.0.tar.gz (~50 MB), which works with Hadoop 0.20.X, 1.0.X and 0.23.X) from one of the Apache download mirrors.
Unpack the tarball in the directory of your choice, using the following command 
  $ tar -xzvf pig-x.y.z.tar.gz
Set the environment variable PIG_HOME to point to the installation directory for convenience:
You can either do
  $ cd pig-x.y.z
  $ export PIG_HOME=$(pwd)
or set PIG_HOME in $HOME/.profile so it will be set every time you login.
Add the following line to it.
  $ export PIG_HOME=<path_to_pig_home_directory>
e.g.
  $ export PIG_HOME='/Users/Work/pig-0.10.0'
  $ export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH
Set the environment variable JAVA_HOME to point to the Java installation directory, which Pig uses internally.
  $ export JAVA_HOME=<<Java_installation_directory>>

Execution Modes

Pig has two modes of execution – local mode and MapReduce mode.

Local Mode

Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem. To run in local mode, pass the local option to the -x or -exectype parameter when starting Pig. This starts the interactive shell called Grunt:
  $ pig -x local
  grunt>

MapReduce Mode

In this mode, Pig translates the queries into MapReduce jobs and runs them on the Hadoop cluster. This can be a pseudo-distributed or fully distributed cluster.
First, check the compatibility of the Pig and Hadoop versions being used. The compatibility details are given on the Pig release page (for our tutorial, the Hadoop version is 1.0.3 and the Pig version is 0.10.0, which is compatible with the Hadoop version we are using).
Next, export the variable PIG_CLASSPATH to add the Hadoop conf directory:
  $ export PIG_CLASSPATH=$HADOOP_HOME/conf/
After exporting the PIG_CLASSPATH, run the pig command, as shown below
  $ pig
  [main] INFO  org.apache.pig.Main – Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
  [main] INFO  org.apache.pig.Main – Logging error messages to: /Users/varadmeru/pig_1351858332488.log
  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://localhost:9000
  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: localhost:9001
  grunt> 
You can see from the log that Pig reports the filesystem and job tracker it connected to. Grunt is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script, via Grunt, or by embedding the script into Java code. Running on the interactive shell is shown in the Problem section below. To run a batch of Pig statements, it is recommended to place them in a single script file and execute it in batch mode.

Executing Scripts in Batch Mode

Running in Local mode:
$ pig -x local BookXGroupByYear.pig

For running Pig script file in MapReduce mode:
  $ pig -x mapreduce BookXGroupByYear.pig
OR
  $ pig BookXGroupByYear.pig
It's good practice to use a ".pig" extension for the file for clarity and maintainability. The BookXGroupByYear.pig file is available on the github page of the Tutorials for your reference.

Now we focus on solving a simple but real-world use case with Pig. This is the same problem that was solved in the previous blog articles (step-by-step MapReduce programming using Java, and Hive for beginners using SQL-like queries).

Problem
The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
  …
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year.
Since our data is not cleansed and might give us erroneous results, we clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
The sed commands replace the ";" (semicolon) characters that appear inside field content with $$$ (while restoring the actual ";" field delimiters), replace the pattern "&amp;" with just '&', and remove the first (header) line. If we don't remove the header line, it will be processed as part of the data, which it isn't.
All the above steps are required to cleanse the data and give accurate results.

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

                is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.txt input
Note that Pig, in MapReduce mode, takes file from HDFS only, and stores the results back to HDFS.
Running the Pig flow using the command line:
$ pig
grunt> BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt'
>> USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,
>> BookAuthor:chararray,YearOfPublication:chararray,
>> Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
2012-11-05 01:09:11,554 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> GroupByYear = GROUP BookXRecords BY YearOfPublication;
2012-11-05 01:09:11,810 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
2012-11-05 01:09:11,996 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 1 time(s).
grunt> STORE CountByYear
>> INTO '/user/work/output/pig_output_bookx' USING PigStorage('\t');

The username ("work" in our example) in the STORE path depends on the Hadoop setup on your machine and the username under which Hadoop runs. The output of the above Pig run is stored in the output/pig_output_bookx folder on HDFS. It can be displayed on the screen by:

  $ hadoop fs -cat output/pig_output_bookx/part-r-00000

Output

The snapshot of the output of the Pig flow is shown below:

(screenshot: output of the Pig flow)

Comparison with Java MapReduce and Hive

You can see the above output and compare it with the outputs from the MapReduce code in the step-by-step MapReduce guide and from the Hive for beginners blog post. Let's take a look at how the Pig code differs from the Java MapReduce code and the Hive query for the same solution:

Mapper: (screenshot of the Java mapper code from the MapReduce tutorial)

Reducer: (screenshot of the Java reducer code from the MapReduce tutorial)

Hive:

hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
COMMENT 'BX-Books Table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

Pig:

BookXRecords = LOAD '/user/varadmeru/input/BX-BooksCorrected1.txt' USING PigStorage(';')
AS (ISBN:chararray,
BookTitle:chararray,
BookAuthor:chararray,
YearOfPublication:chararray,
Publisher:chararray,
ImageURLS:chararray,
ImageURLM:chararray,
ImageURLL:chararray);

GroupByYear =
GROUP BookXRecords BY YearOfPublication;
CountByYear =
FOREACH GroupByYear
GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear INTO '/user/varadmeru/output/pig_output_bookx'
USING PigStorage('\t');

It is clear from the above that the high level abstractions such as Hive and Pig reduce the programming effort required as well as the complexity of learning and writing MapReduce code. In the small Pig example above, we reduced the lines of code from roughly 25 (for Java) to 3 (for Hive) and 4 (for Pig).

Conclusion

In this tutorial we learned how to set up Pig and run Pig Latin queries. We looked at the Pig script for the same problem that we solved with MapReduce in the step-by-step MapReduce guide and with HiveQL in the Hive for beginners post, and compared how the programming effort is reduced with these higher-level abstractions. Stay tuned for more exciting tutorials from the small world of Big Data.

References

1. We referred to the Pig Latin Basics and Built-in Functions guides from the Apache Pig page.

2. We referred to the Getting Started guide from the Apache Pig page.

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

Objective

The objective of this Hive tutorial is to get you up and running Hive queries on a real-world dataset.

Prerequisites

The following are the prerequisites for setting up Hive and running Hive queries
  • You should have the latest stable build of Hadoop (as of writing this article, 1.0.3)
  • To install hadoop, see my previous blog article on Hadoop Setup
  • Your machine should have Java 1.6 installed
  • It is assumed you have some knowledge of Java programming and are familiar with concepts such as classes and objects, inheritance, and interfaces/abstract classes.
  • Basic knowledge of Linux will help you understand many of the linux commands used in the tutorial
  • Download the Book Crossing DataSet. This is the data set we will use. (Alternative link to the Dataset at the github page of the Tutorials)

Setting up Hive

Platform

This Hive tutorial assumes Linux. If using Windows, please install Cygwin. It is required for shell support in addition to the required software above.

Procedure

Download the most recent stable release of Hive as a tarball from one of the Apache download mirrors. For our Hive tutorial, we are going to use hive-0.9.0.tar.gz.
Unpack the tarball in the directory of your choice, using the following command 
  $ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
You can either do
  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)
or set HIVE_HOME in $HOME/.profile so it will be set every time you login.
Add the following line to it.
  export HIVE_HOME=<path_to_hive_home_directory>
e.g.
  export HIVE_HOME='/Users/Work/hive-0.9.0'
  export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
Start Hadoop (refer to the Single-Node Hadoop Setup Guide for more information). It should show the processes being started. You can check the processes started by using the jps command:
$ start-all.sh
<< Starting various hadoop processes >>
$ jps
  3097 Jps
  2355 RunJar
  2984 JobTracker
  2919 SecondaryNameNode
  2831 DataNode
  2743 NameNode
  3075 TaskTracker
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set appropriate permissions in HDFS before a table can be created in Hive, as shown below:
  $ hadoop fs -mkdir /tmp
  $ hadoop fs -mkdir /user/hive/warehouse
  $ hadoop fs -chmod g+w /tmp
  $ hadoop fs -chmod g+w /user/hive/warehouse

Problem

The problem we are trying to solve through this tutorial is to find the frequency of books published each year. Our input data set (file BX-Books.csv) is a csv file. Some sample rows:

"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher";"Image-URL-S";"Image-URL-M";"Image-URL-L"
"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"https://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
"0002005018";"Clara Callan";"Richard Bruce Wright";"2001";"HarperFlamingo Canada";"https://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg";"https://images.amazon.com/images/P/0002005018.01.LZZZZZZZ.jpg"
  …
The first row is the header row. The other rows are sample records from the file. Our objective is to find the frequency of books published each year. This is the same problem that was solved in the previous blog post (Step-by-step MapReduce Programming using Java).
Since our data is not cleansed and might give us erroneous results, we clean it with the following commands:
$ cd /Users/Work/Data/BX-CSV-Dump
$ sed 's/&amp;/&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.csv
The sed commands replace the ";" (semicolon) characters that appear inside field content with $$$ (while restoring the actual ";" field delimiters), replace the pattern "&amp;" with just '&', and remove the first (header) line. If we don't remove the header line, Hive will process it as part of the data, which it isn't.
All the above steps are required to cleanse the data and help Hive give accurate results for our queries.

"0393045218";"The Mummies of Urumchi;";"E. J. W. Barber";"1999";"W. W. Norton &amp; Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

                is changed to

"0393045218";"The Mummies of Urumchi$$$";"E. J. W. Barber";"1999"; "W. W. Norton & Company"; "https://images.amazon.com/images/P/0393045218.01.THUMBZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.MZZZZZZZ.jpg"; "https://images.amazon.com/images/P/0393045218.01.LZZZZZZZ.jpg"

Now, copy the file into Hadoop:
$ hadoop fs -mkdir input
$ hadoop fs -put /Users/Work/Data/BX-CSV-Dump/BX-BooksCorrected.csv input
Running Hive using the command line:
$ hive
hive> CREATE TABLE IF NOT EXISTS BXDataSet 
    >   (ISBN STRING, 
    >   BookTitle STRING, 
    >   BookAuthor STRING, 
    >   YearOfPublication STRING, 
    >   Publisher STRING, 
    >   ImageURLS STRING, 
    >   ImageURLM STRING, 
    >   ImageURLL STRING) 
    > COMMENT 'BX-Books Table'
    > ROW FORMAT DELIMITED  
    > FIELDS TERMINATED BY ';' 
    > STORED AS TEXTFILE;
OK
Time taken: 0.086 seconds
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
Loading data to table default.bxdataset
Deleted hdfs://localhost:9000/user/hive/warehouse/bxdataset
OK
Time taken: 0.192 seconds
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

The username ("work" in our example) in the LOAD path depends on the Hadoop setup on your machine and the username under which Hadoop runs.

Output

The output of the query is shown below:

(screenshot: output of the Hive query)

Comparison with Java MapReduce

You can see the above output and compare it with the output of the MapReduce code from the previous blog entry. Let's take a look at how the code for Hive differs from MapReduce:

Mapper: (screenshot of the Java mapper code from the MapReduce tutorial)

Reducer: (screenshot of the Java reducer code from the MapReduce tutorial)

hive> CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING, YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING)
COMMENT 'BX-Books Table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;

It is clear from the above that Hive reduces the programming effort required as well as the complexity of learning and writing MapReduce code. In the small example above, we reduced the lines of code from roughly 25 to 3.

Conclusion

In this tutorial we learned how to set up Hive and run Hive queries. We saw the query for the same problem statement which we solved with MapReduce and compared how the programming effort is reduced with the use of HiveQL. Stay tuned for more exciting tutorials from the small world of Big Data.