Hive for HBase

by Shanti Subramanyam

HBase is a NoSQL data store that traditionally required a developer to write MapReduce programs to access and analyze its data. Hive, on the other hand, has been gaining popularity with its SQL-like interface. However, using Hive over HDFS leads to long ETL times and no access to real-time data.

The solution? Hive over HBase. This benefits HBase users by removing the need to write MapReduce programs, while benefiting traditional Hive use cases by making them much faster and providing access to real-time data. The notes in this article are from the Hive session at HBaseCon.

Use Cases

Here are some example use cases for when one would want to use Hive over HBase.

Use Case 1: HBase as ETL Data Sink

If you want to analyze data in HDFS and serve the results via online queries, say to a recommender system, use Hive to perform the analysis and then populate HBase, which serves as the data store for online access.
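A minimal sketch of this pattern, assuming an HBase-backed Hive table named `recommendations` has already been created with the HBase storage handler (all table and column names here are hypothetical):

```sql
-- Aggregate raw clickstream data stored in HDFS and write the
-- results into an HBase-backed Hive table for online serving.
INSERT OVERWRITE TABLE recommendations
SELECT user_id, collect_set(item_id)
FROM clickstream_hdfs
GROUP BY user_id;
```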

Use Case 2: HBase as Data Source

When some data lives in HDFS and some in HBase, Hive can combine (join) them. Dimension data that changes frequently can live in HBase, while large static tables reside in HDFS.
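As an illustrative sketch (table and column names are hypothetical), a join between the two stores looks like any other Hive join:

```sql
-- Join a static fact table in HDFS with frequently-updated
-- dimension data served from an HBase-backed table.
SELECT f.order_id, f.amount, d.customer_name
FROM orders_hdfs f
JOIN customers_hbase d ON f.customer_id = d.customer_id;
```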

Use Case 3: Low Latency Warehouse

HDFS and HBase hold tables with the same definition. HBase receives the continuous updates, while daily periodic dumps go into HDFS. Hive queries can combine the two sources of data for near-real-time queries over the full history.
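One way to sketch this, assuming identically-defined tables (names and the date literal are illustrative):

```sql
-- Combine today's live rows from HBase with historical
-- daily dumps in HDFS.
SELECT key, value FROM events_hbase
UNION ALL
SELECT key, value FROM events_hdfs WHERE dt < '2012-05-22';
```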

How does this work?

Fair warning. The rest of this article gets pretty technical, delving into the implementation details of Hive for HBase. If you don’t care, then just realize that this technology is a fantastic bridge between the speed and online capability of HBase and the ease of use and SQL-like qualities of Hive.

Hive + HBase Architecture

The StorageHandler layer talks to HBase in place of HDFS, and sits in parallel with the MapReduce framework. Hive DDL operations are converted to HBase DDL operations via a client hook. All operations are performed by the client; no two-phase commit is used.

Schema Mapping

Hive tables, columns, and column types are mapped to HBase tables and column families. Every field in a Hive table is mapped to either:

  • the table key (using :key as the selector),
  • a column family (cf:), for MAP fields in Hive, or
  • a column.

The Hive HBase Integration wiki has more details.
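For example, a Hive table can be declared over an HBase table with the HBase storage handler; the mapping string pairs each Hive field with a key, column, or column family (table and column names here are illustrative):

```sql
CREATE TABLE hbase_users(key STRING, name STRING, prefs MAP<STRING,STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- :key -> HBase row key, cf1:name -> a single column,
  -- cf2: -> an entire column family mapped to a Hive MAP
  'hbase.columns.mapping' = ':key,cf1:name,cf2:'
)
TBLPROPERTIES ('hbase.table.name' = 'users');
```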

Type Mapping

This was added in Hive 0.9. Previously, all types were converted to strings in HBase. The table-level property hbase.table.default.storage.type can be used to specify the type mapping: if set to binary, Hive primitive types are mapped to bytes; otherwise they are stored as strings. If a type is not primitive or a MAP, it is converted to a JSON string and serialized.
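As a sketch, binary storage can be requested table-wide via the property above, or per column with a #b suffix in the mapping (table and column names are illustrative):

```sql
CREATE TABLE hbase_metrics(key STRING, cnt BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- the '#b' suffix stores cnt as raw bytes rather than a string
  'hbase.columns.mapping' = ':key,cf:cnt#b'
)
TBLPROPERTIES ('hbase.table.default.storage.type' = 'binary');
```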

There are still a few rough edges for schema and type mapping which should improve in subsequent versions.

Bulk Load Feature

Instead of writing an MR job to bulk load data into HBase, we can use Hive:

  • Sample the source data and save the samples to a file.
  • Run a CLUSTER BY query using HiveHFileOutputFormat and import the result into HBase.

The HBase Bulk Load page has details.
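A hedged sketch of the second step, assuming a staging table configured to emit HFiles (paths and table names are illustrative, and the exact properties have varied across Hive versions):

```sql
-- Staging table whose output format writes HBase HFiles to HDFS
CREATE TABLE hbase_hfiles(key STRING, value STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES ('hfile.family.path' = '/tmp/hfiles/cf');

-- CLUSTER BY sorts rows by key within each reducer, as HFiles require
INSERT OVERWRITE TABLE hbase_hfiles
SELECT key, value FROM source_data CLUSTER BY key;
```

The resulting HFiles can then be handed to HBase's bulk-load tooling for import.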

Filter Pushdowns

The idea is to pass filter expressions down from the Hive optimizer to the storage layer to minimize the data scanned. The optimizer pushes the predicates into the query plan; to do this, the storage handler negotiates with the Hive optimizer to decompose the filter. This support is being added to Hive.
Find out more at https://cwiki.apache.org/Hive/filterpushdowndev.html
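For instance, once pushdown support is in place, a predicate on the row key can be turned into a keyed HBase scan instead of a full table scan (the table name is illustrative):

```sql
-- With filter pushdown, this becomes a narrow HBase scan on the
-- row key rather than a scan of the entire table.
SELECT * FROM hbase_users WHERE key = 'user_12345';
```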

There are many use cases for Hive on HBase. But if what you really want is a SQL interface to HBase, there are other options as well. I will cover these in a subsequent blog post.
