HBase use cases
I attended HBaseCon yesterday. It was a fantastic event with very meaty tracks and sessions.
HBase is a NoSQL database meant for online big data workloads, where response time matters. It was originally developed by Powerset, based on Google's BigTable paper, in 2008, and very soon became an Apache project. The first Apache release came in 2010, and it has been gaining momentum ever since. There are lots of production deployments of HBase today, with a wide variety of use cases.
I will share some of the use cases presented at HBaseCon in this article.
HBase at Pinterest
Pinterest is deployed entirely on Amazon EC2. Pinterest uses a follow model, where users follow other users. This requires a following feed for every user that gets updated every time a followee creates or updates a pin. This is a classic social media application problem. For Pinterest, it amounts to hundreds of millions of pins per month that get fanned out as billions of writes per day.
So the ‘Following Feed’ is implemented using HBase. Some specifics:
- They chose a wide schema, where each user’s following feed is a single row in HBase. This exploits HBase’s sorted column ordering (each user wants to see the latest pins in their feed first) and makes updates atomic per user.
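As a sketch of how such newest-first ordering can work: HBase returns columns in ascending byte order, so a common trick (the talk did not spell out Pinterest's exact encoding, so this is an assumption) is to use a reverse timestamp as the column qualifier, making the newest pin sort first.

```python
# Sketch of reverse-timestamp column qualifiers for a wide feed row.
# The encoding here is an assumption, not Pinterest's actual scheme.
LONG_MAX = 2**63 - 1

def feed_column_qualifier(pin_timestamp_ms: int) -> bytes:
    """Newer pins get numerically smaller qualifiers, so they sort first
    in HBase's ascending byte order."""
    reverse_ts = LONG_MAX - pin_timestamp_ms
    return reverse_ts.to_bytes(8, byteorder="big")

# Simulate three pins arriving at different times.
t0 = 1_700_000_000_000
quals = [feed_column_qualifier(t0 + delta) for delta in (0, 1000, 2000)]

# Sorting the qualifiers (as HBase would) puts the newest pin first.
newest_first = sorted(quals)
assert newest_first[0] == feed_column_qualifier(t0 + 2000)
```

Because all qualifiers are fixed-width 8-byte big-endian integers, numeric order and byte order coincide, which is what makes the trick safe.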
- To optimize writes, they increased the per-region memstore size. A 512MB memstore yields 40MB HFiles, instead of the small 8MB files produced by the default memstore size. This leads to less frequent compactions.
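The memstore flush size is an ordinary HBase setting; a hedged sketch of the change (the property name is real, but the talk did not show their exact configuration):

```xml
<!-- hbase-site.xml: raise the per-region memstore flush size to 512MB.
     Value is in bytes; 536870912 = 512 * 1024 * 1024. -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>536870912</value>
</property>
```

The trade-off is more heap consumed by memstores in exchange for fewer, larger flushes and therefore fewer compactions.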
- They deal with the potential for unbounded row growth by trimming the feed during compactions: there really is not much point in keeping an infinite feed anyway.
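The trimming logic itself is simple; a minimal sketch of the idea, assuming a hypothetical cap (the talk did not state their limit or the exact hook they use):

```python
# Sketch of compaction-time feed trimming: keep only the first N column
# qualifiers in sorted (i.e., newest-first, given reverse timestamps) order.
# The cap and mechanism are assumptions for illustration.
MAX_FEED_LENGTH = 5

def trim_feed(columns: dict, limit: int = MAX_FEED_LENGTH) -> dict:
    """Return the row reduced to its first `limit` qualifiers in byte order,
    mimicking what a compaction-time hook could do to bound row width."""
    kept = sorted(columns)[:limit]
    return {q: columns[q] for q in kept}

row = {bytes([i]): b"pin" for i in range(10)}
trimmed = trim_feed(row)
assert len(trimmed) == MAX_FEED_LENGTH
```

Because the newest entries sort first, dropping everything past the cap discards only the oldest feed items.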
- They also had to do GC tuning (who doesn’t?), opting for more frequent but smaller pauses.
Another very interesting fact: they maintain a mean time to recovery (MTTR) of less than 2 minutes. This is a great accomplishment, since HBase favors consistency over availability. They achieve it by reducing various timeout settings (socket, connect, stale node, etc.) and the number of retries. They also avoid a single point of failure by using 2 clusters, and to guard against NameNode failure they keep a copy of its data on EBS.
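The timeout and retry knobs mentioned are standard HBase/HDFS settings; an illustrative fragment (the property names are real, but the values are assumptions, as the talk did not give exact numbers):

```xml
<!-- hbase-site.xml: detect failures and give up faster.
     Values here are illustrative, not Pinterest's actual settings. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- declare a region server dead sooner -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>3</value> <!-- fail fast instead of retrying for minutes -->
</property>

<!-- hdfs-site.xml: steer reads away from stale DataNodes -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
```

Lower timeouts shrink the window during which regions are unavailable after a failure, at the cost of being less forgiving of transient slowness.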
HBase at Groupon
Groupon has two distinct use cases: delivering deals to users via email (a batch process) and providing a relevant user experience on the website. They have increasingly tuned their deals to be more accurate and relevant to individual users (personalization).
They started out running Hadoop MapReduce (MR) jobs for email deal delivery and MySQL for their online application, but ideally wanted a single system for both.
They now run their Relevance and Personalization system on HBase. To cater to the very different workload characteristics of the two systems (email and online), they run 2 replicated HBase clusters: they hold the same content but are tuned and accessed differently.
Groupon also uses a very wide schema: one column family for ‘user history and profile’ and another for ‘email history’.
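A hedged sketch of what such a two-family table could look like in the HBase shell (the table and family names here are hypothetical; the talk did not give them):

```
# hbase shell -- hypothetical names for illustration only
create 'relevance', {NAME => 'profile'}, {NAME => 'email'}
```

Separating the two families lets HBase store and flush the profile data and the email history independently, which suits their different access patterns.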
A 10-node cluster runs HBase (separate from the 100-node Hadoop cluster). Each node has 96GB RAM, 24 virtual cores, and 8x2TB disks.
Data in the cluster: 100M+ rows, 2TB+ of data, 4.2B data points.
HBase at Longtail Video
This company provides JW Player, an online video player used by over 2 million websites. They have lots of data, which is processed by their online analytics tool. They, too, are deployed entirely on AWS, so they use Amazon's HBase and EMR offerings, and they read data from and write data to S3.
They had the following requirements:
- fast queries across data sets
- support for date-range queries
- store huge amounts of aggregated data
- flexibility in dimensions used for rollup tables
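Date-range queries in HBase typically come down to row-key design, since a scan between a start key and a stop key is the native range primitive. A minimal sketch, assuming a hypothetical metric-plus-date key layout (the talk did not describe their actual keys):

```python
# Sketch of date-prefixed row keys so a date range becomes an HBase scan.
# The key layout is an assumption, not Longtail Video's actual schema.
from datetime import date

def rollup_row_key(metric: str, day: date) -> bytes:
    """Metric name plus ISO date: rows for one metric sort chronologically,
    so a date range maps to a (start_key, stop_key) scan."""
    return f"{metric}|{day.isoformat()}".encode()

keys = [rollup_row_key("plays", date(2013, 6, d)) for d in (1, 15, 30)]

# An HBase scan returns exactly the rows in [start, stop); here we
# simulate the scan over our in-memory key list.
start = rollup_row_key("plays", date(2013, 6, 10))
stop = rollup_row_key("plays", date(2013, 6, 20))
in_range = [k for k in sorted(keys) if start <= k < stop]
assert in_range == [rollup_row_key("plays", date(2013, 6, 15))]
```

ISO dates sort lexicographically in chronological order, which is what makes the string-keyed scan correct without any server-side filtering.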
HBase fit the bill. Like Groupon, they use multiple clusters to separate their read-intensive and write-intensive workloads. They are a full-fledged Python shop, so they use HappyBase and run Thrift on all the nodes of the HBase cluster.
It is clear that HBase adoption has grown, and that there are many production deployments, small and large, with varying use cases.
In addition to the cases above, there are of course Facebook’s messaging system, OPower, Flurry, Lily, Meetup, and many others.
It is also clear that there is still a lot of work to be done to make HBase easier to use, configure, and tune. After all, we are not even at 1.0!