Hive for HBase
HBase is a NoSQL data store that traditionally required a developer to write MapReduce programs to access and analyze its data. Hive, on the other hand, has been gaining popularity with its SQL-like interface. However, running Hive over HDFS means long ETL times and no access to real-time data.
Solution? Hive over HBase. This benefits HBase users, since no MapReduce programming is required, while benefiting traditional Hive use cases by making them much faster and giving them access to real-time data. The notes in this article are from the Hive session at HBaseCon.
Here are some example use cases for when one would want to use Hive over HBase.
Use Case 1: HBase as ETL Data Sink
If you want to analyze data in HDFS and then serve it via online queries, say to a recommender system, use Hive to perform the analysis and populate HBase, which serves as the data store for online access.
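A minimal sketch of this pattern in HiveQL, assuming a hypothetical `recs_hdfs` table holding precomputed recommendations and a hypothetical `recs` HBase table for serving:

```sql
-- Create a Hive table backed by HBase (table and column names are
-- hypothetical; the storage handler class and properties are Hive's).
CREATE TABLE recs_hbase(user_id STRING, recs STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:recs")
TBLPROPERTIES ("hbase.table.name" = "recs");

-- Push the batch-computed results into HBase for online access.
INSERT OVERWRITE TABLE recs_hbase
SELECT user_id, recs FROM recs_hdfs;
```

The online application then reads `recs` directly through the HBase API, while the batch pipeline stays in Hive.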
Use Case 2: HBase as Data Source
Some data lives in HDFS and some in HBase; combine (join) them using Hive. Dimension data that changes frequently can live in HBase, while static tables reside in HDFS.
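Once the HBase table is mapped into Hive, the join is plain HiveQL. A sketch with hypothetical table names, where `orders_hdfs` is a static fact table in HDFS and `customers_hbase` is a frequently updated dimension table in HBase:

```sql
-- Both tables are ordinary Hive tables here; one just happens to be
-- stored in HBase via the storage handler.
SELECT f.order_id, d.customer_name, f.amount
FROM   orders_hdfs f
JOIN   customers_hbase d ON (f.customer_id = d.customer_id);
```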
Use Case 3: Low Latency Warehouse
HDFS and HBase share the same table definitions. HBase receives the continuous updates, while daily periodic dumps go into HDFS. Hive queries can join the two sources of data together for real-time queries.
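One way to combine the two sources is a UNION ALL over both tables. A sketch with hypothetical names, where `events_hbase` receives the continuous updates and `events_hdfs` holds the daily dumps:

```sql
-- Aggregate over real-time rows and historical dumps in one query.
SELECT t.user_id, COUNT(*) AS events
FROM (
  SELECT user_id FROM events_hbase   -- continuous updates
  UNION ALL
  SELECT user_id FROM events_hdfs    -- daily periodic dumps
) t
GROUP BY t.user_id;
```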
How does this work?
Fair warning. The rest of this article gets pretty technical, delving into the implementation details of Hive for HBase. If you don’t care, then just realize that this technology is a fantastic bridge between the speed and online capability of HBase and the ease of use and SQL-like qualities of Hive.
Hive + HBase Architecture
The StorageHandler layer talks to HBase, which in turn runs over HDFS; it sits in parallel with the MapReduce framework. Hive DDL operations are converted to HBase DDL operations via a client hook. All operations are performed by the client; no two-phase commit is used.
Hive tables, columns, and column types are mapped to HBase tables and column families. Every field in a Hive table is mapped to either:
- the table key (using :key as the selector)
- a column family (cf:), which maps to a MAP field in Hive
- a single column
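The three mapping kinds can be shown in one table definition. A sketch with hypothetical table, family, and column names; the mapping string syntax is Hive's:

```sql
-- hbase.columns.mapping entries, in field order:
--   :key      -> the table key
--   attr:     -> a whole column family, exposed as a Hive MAP
--   addr:city -> a single column
CREATE TABLE hbase_people(key   STRING,
                          attrs MAP<STRING, STRING>,
                          city  STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,attr:,addr:city");
```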
The Hive HBase Integration wiki has more details.
This type mapping was added in Hive 0.9; previously all types were converted to strings in HBase. The table-level property hbase.table.default.storage.type can be used to specify the type mapping: if binary, Hive types are mapped to bytes; otherwise to strings. A type that is neither primitive nor a Map is converted to a JSON string and serialized.
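A sketch of the binary mapping, assuming Hive 0.9 or later (table and column names hypothetical):

```sql
-- Store the INT value as bytes in HBase rather than as a string.
CREATE TABLE counts_hbase(key STRING, cnt INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:cnt")
TBLPROPERTIES ("hbase.table.default.storage.type" = "binary");
```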
There are still a few rough edges for schema and type mapping which should improve in subsequent versions.
Bulk Load Feature
Instead of writing an MR job to bulk load data into HBase, we can use Hive: first sample the source data and save the samples to a file, then run a CLUSTER BY query using HiveHFileOutputFormat and import the resulting HFiles into HBase.
The HBase Bulk Load page has details.
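A sketch of the HFile-generation step, following the procedure on that page; the table name, paths, and reducer count below are hypothetical, and the sampling step that produces the range-key file is assumed to have already run:

```sql
SET mapred.reduce.tasks=12;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hb_range_key_list;  -- from sampling

-- A staging table whose output format writes HFiles directly.
CREATE TABLE hbsort(key STRING, val STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');

-- CLUSTER BY sorts and partitions rows by key for the HFiles.
INSERT OVERWRITE TABLE hbsort
SELECT key, val FROM source CLUSTER BY key;
```

The generated HFiles under /tmp/hbsort/cf are then imported into HBase with its bulk load tooling.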
Filter Pushdown
The idea is to push filter expressions down from the Hive optimizer to the storage layer in order to minimize the data scanned. The optimizer pushes the predicates into the query plan; to do this, the storage handler negotiates with the Hive optimizer to decompose the filter. This support is being added to Hive.
Find out more at https://cwiki.apache.org/Hive/filterpushdowndev.html
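As an illustration, consider a query with a predicate on the row key (table name hypothetical):

```sql
-- With pushdown, the key predicate can become a bounded HBase range
-- scan at the storage layer instead of a full table scan.
SELECT * FROM users_hbase
WHERE  key >= 'user_1000' AND key < 'user_2000';
```

Without pushdown, Hive would scan the whole table and apply the filter afterward in the MapReduce job.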
There are many use cases for Hive on HBase. But if what you really want is a SQL interface to HBase, there are other options as well. I will cover these in a subsequent blog post.