Cassandra Summit 2013
I attended Cassandra Summit 2013. This was my first year at the conference, and I was impressed with the turnout as well as the number and quality of the sessions. I summarize some of the content from the first day’s sessions below.
The star of the morning keynote was Jonathan Ellis, co-founder of DataStax.
Yet another benchmark
He started his presentation with a graph showing benchmark data comparing Cassandra with other NoSQL databases (HBase, Voldemort, VoltDB, Redis) and even MySQL. I looked up the original VLDB paper that published this benchmark to understand the workload and testing methodology. The authors were trying to mimic the actions of an APM (Application Performance Management) agent as it collects and records data. Yet they split the test into separate read and write workloads, much like other benchmarks, which really has nothing to do with their original intent of modeling an APM workload! Personally, I find such simplistic workloads problematic: the servers under test are unlikely to see the kind of load a real application would place on them. But this is a topic for another blog post.
Anyway, as you can guess, Cassandra came out on top with many times the performance of HBase when scaled to 32 nodes. Although MongoDB was not part of this comparison, Jonathan said that other benchmarks showed MongoDB performance to be even worse than HBase.
He then moved on to talk about some of the features of 1.2, summarized briefly below:
- Focus is on ease of use with CQL
- Thrift compatibility
- CQL adds collections, which SQL lacks (e.g. multiple email addresses for a user). This is important since Cassandra can’t do joins with, say, an address table.
- Adds a data dictionary (more like an RDBMS)
- Adds authentication and authorization (even more like a RDBMS)
- New CQL native protocol – efficient, lightweight and asynchronous
- Native driver in Java available for a few weeks. .NET driver in beta, other drivers coming
- New tracing feature – helps troubleshoot problematic queries that take a long time
- Memory improvements: reduced JVM heap usage by moving bloom filters and compression offsets off the heap
- Compaction throttling: Smooth constant rate limiting
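To make the collections point concrete, here is a minimal sketch of what a collection column looks like in CQL 3. The statements are held as plain Python strings and nothing here connects to a cluster. The table and column names (`users`, `user_id`, `emails`) are my own made-up examples, not something shown in the keynote.

```python
# CQL statements kept as plain strings; this script does not talk to Cassandra.
# Table and column names (users, user_id, emails) are hypothetical.

# A set<text> column stores several email addresses in the user's own row,
# so no join against a separate address table is needed.
CREATE_USERS = """
CREATE TABLE users (
    user_id text PRIMARY KEY,
    name    text,
    emails  set<text>
)
"""

# Adding an element to the set does not require reading the row first.
ADD_EMAIL = """
UPDATE users
SET emails = emails + {'jsmith@example.com'}
WHERE user_id = 'jsmith'
"""

# The whole collection comes back with the row in a single query.
READ_USER = "SELECT name, emails FROM users WHERE user_id = 'jsmith'"

STATEMENTS = [CREATE_USERS, ADD_EMAIL, READ_USER]

if __name__ == "__main__":
    for stmt in STATEMENTS:
        print(stmt.strip() + ";\n")
```

The statements can be pasted into cqlsh or passed to whichever driver you use.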
The next release, 2.0, is targeted at the end of July. The focus of this release is lots of house cleaning, hence the major version number. A notable new feature will be the addition of CAS (Compare and Swap) for use when eventual consistency is just not enough. Another RDBMS-like feature: triggers using Java classes!
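For readers unfamiliar with compare-and-swap, here is a toy, single-process sketch of the semantics: a write succeeds only if the current value matches what the caller expects. This is my own illustration, with a hypothetical `ToyTable` class and a plain lock. It says nothing about how Cassandra coordinates the check across replicas.

```python
import threading


class ToyTable:
    """In-memory stand-in for a table. NOT how Cassandra implements CAS."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # makes check-then-write atomic

    def insert_if_not_exists(self, key, value):
        """Insert only when the key is absent; return True if applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if(self, key, expected, new_value):
        """Overwrite only when the stored value equals `expected`."""
        with self._lock:
            if key not in self._rows or self._rows[key] != expected:
                return False
            self._rows[key] = new_value
            return True

    def get(self, key):
        with self._lock:
            return self._rows.get(key)


table = ToyTable()
first = table.insert_if_not_exists("jsmith", "alice@example.com")
second = table.insert_if_not_exists("jsmith", "bob@example.com")
swapped = table.update_if("jsmith", "alice@example.com", "carol@example.com")
stale = table.update_if("jsmith", "alice@example.com", "dave@example.com")

print(first, second, swapped, stale, table.get("jsmith"))
```

The second insert and the final update are rejected because the row no longer matches what the caller assumed, which is exactly the guarantee plain eventually consistent writes do not give you.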
From the looks of it, Cassandra will soon resemble an enterprise RDBMS in terms of functionality. Hopefully all these additional features will not come at the price of performance.
Several companies presented their use cases. This is a very useful track to help others new to the technology to understand what kinds of applications in their own companies can benefit.
Ground Traffic Control Logistics
This was presented by Jesse Young of Zonar Systems. They collect lots of data from heavy fleet vehicles using a GPS-based hardware device that sends the data to their SaaS portal. Their existing data lives on 100+ database servers hosting 3000+ databases and spanning 100 TB.
Requirements: replication, fast inserts and retrievals, horizontal scalability, ease of administration. (yeah – who doesn’t want any of these?)
Cassandra was chosen due to built-in replication, speed and easy administration.
They have created a few applications using Cassandra:
- Photos: cheap storage, grow capacity easily, TTLs (they want photos to expire after a certain period)
- Elevation Data: heavy reads, key based, up to 6000 reads/sec at peak. 150m reads/day
- ZPass+: This app tracks ridership. Millions of users doing 20m writes/day. But traffic is bursty, arriving only when users get on a bus.
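A quick back-of-the-envelope calculation puts those numbers in perspective. I am assuming "m" means million and that the daily totals are spread over a full 24 hours, which is my simplification, not something stated in the talk.

```python
# Back-of-the-envelope rates from the figures quoted in the talk.
# Assumptions (mine): "m" = million, averaged over a 24-hour day.

SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(events_per_day):
    """Average rate if the daily total were spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


elevation_avg = average_per_second(150_000_000)  # Elevation Data reads
elevation_peak = 6000                            # quoted peak reads/sec
peak_to_average = elevation_peak / elevation_avg

zpass_avg = average_per_second(20_000_000)       # ZPass+ writes

print(f"Elevation Data: {elevation_avg:.0f} reads/sec average, "
      f"peak is {peak_to_average:.1f}x that")
print(f"ZPass+: {zpass_avg:.0f} writes/sec average")
```

So the elevation service averages roughly 1,700 reads/sec with a peak about 3.5 times higher, while ZPass+ averages only about 230 writes/sec. Since ZPass+ traffic is bursty, its real peaks will be well above that average.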
The interesting thing is that they have not yet moved their core app, GPS data, into Cassandra. That will be the true test.
Intuit’s Consumer Financial Platform
Intuit is creating a common data layer across many of their apps. They chose Cassandra because it can scale and is highly available (one of their main requirements, and one that HBase fails to meet). It is easy to replicate across data centers, and it enables fast snapshots and rolling upgrades. Operations needs to be able to make schema changes easily, and this was an important requirement.
Some interesting notes from this presentation concerned the performance problems they ran into. Cassandra does compaction in the background, causing huge I/O spikes that make it unsuitable for blob storage. So documents go into a DFS (RedHat DFS) while the metadata lives in Cassandra. They also see huge, otherwise unaccounted-for CPU spikes caused by GC activity. It is important to tune the heap size to get more predictable performance.
Clearly there are production deployments of Cassandra, and many of them. However, it takes 1-3 years to get an enterprise application onto a new data platform like Cassandra. Many customers are still testing the waters with proofs of concept. It’s also much easier to start developing a new app on a new platform than to port an existing one. Cassandra requires a complete rethinking of the data model, which many find challenging. As the old adage goes, no pain, no gain.