
Motivation

At Medallia, a key component of our system currently relies on an open source relational database. Since this component mainly queries entries by key, we want to switch to a key-value storage system and take advantage of the benefits such a system provides, including distributed replication, load balancing, and failover. One of our objectives is to re-architect this component to achieve horizontal scalability, which among other things will help us alleviate the high disk storage requirements we currently have.
Recently we took the time to look into this (and other technological improvements too; exciting times at Medallia right now!) and reviewed several options. To make a long story short, we ended up with two finalists: Apache Cassandra and Project Voldemort. These two projects seem to be the most mature open source options in their class, and both provide native decentralized clustering support, including partitioning, fault tolerance, and high availability. Both are based on Amazon’s Dynamo paper; the main difference is that Voldemort follows a simple key/value model, while Cassandra uses a persistence model based on BigTable’s column-oriented design. Both support read consistency, where read operations always return the latest data, which was a requirement for us.

High level comparison

While not an exhaustive list, these are the most relevant pros and cons we identified when reviewing both stores:

Project Voldemort

  • Pros
    • Simpler API
    • Persistence based on Berkeley DB, a mature and well-known key/value database
    • Uses vector clocks instead of simple timestamps, so node (and client) clocks don’t need to be synchronized (see the client sketch after this list)
  • Cons
    • No built-in support for “multiple data center”-aware routing (i.e., no way to guarantee that each data center holds at least one copy of every key)
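
To give a feel for the simpler key/value model and the versioned values, here is a minimal client sketch in Java. It is only an illustration: the bootstrap URL, store name, and keys are placeholders, and the calls reflect the 0.80 client API as we understand it.

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortSketch {
        public static void main(String[] args) {
            // Bootstrap against any node of the cluster (placeholder URL and store name).
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
            StoreClient<String, byte[]> client = factory.getStoreClient("test");

            // Plain key/value operations; versions (vector clocks) are tracked by the client.
            client.put("survey:42", "first value".getBytes());
            Versioned<byte[]> current = client.get("survey:42");   // value plus its vector clock

            // An update that explicitly carries the version it is based on.
            client.put("survey:42",
                       new Versioned<byte[]>("second value".getBytes(), current.getVersion()));
        }
    }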

Apache Cassandra

  • Pros
    • Broader range of systems in production (Facebook, Twitter, Digg, Rackspace)
    • Richer API that supports values with a dynamic column structure. Columns can evolve independently, meaning you can update one column without reading the whole structure (see the sketch after this list).
    • Optimized for writes (by design)
    • Configurable consistency level (specified on each request)
  • Cons
    • File format is still in development, so changes to the internal structure are likely. Due to the flexibility it provides, the file format is more complex and harder to reason about, especially in terms of performance
    • Requires clock synchronization (e.g., NTP) for both nodes and clients
    • Reads are more disk-intensive than competitors
    • Doesn’t support client conflict resolution, so the latest update always wins
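
As a rough illustration of the column-oriented API and the per-request consistency level, here is a minimal sketch against the Cassandra 0.6 Thrift interface. The host, keyspace, column family, and column names are placeholders, and the calls reflect the 0.6 API as we understand it.

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class CassandraSketch {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("localhost", 9160);    // placeholder host/port
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // Update a single column of a row without reading the rest of the structure.
            ColumnPath path = new ColumnPath("Standard1");             // placeholder column family
            path.setColumn("lastLogin".getBytes("UTF-8"));
            client.insert("Keyspace1", "user-42", path,
                          "2010-04-01".getBytes("UTF-8"),
                          System.currentTimeMillis(),
                          ConsistencyLevel.QUORUM);                    // consistency chosen per request

            // Read back just that column, again picking the consistency level per call.
            ColumnOrSuperColumn result = client.get("Keyspace1", "user-42", path,
                                                    ConsistencyLevel.QUORUM);
            System.out.println(new String(result.getColumn().getValue(), "UTF-8"));

            transport.close();
        }
    }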

Performance Tests

To our surprise, this was the only link we found comparing the performance of the two projects, so we decided to write this post and share our research. We used the vpork test framework, which we modified to suit our needs: we upgraded the client code to the latest versions, added a warm-up phase, and added rewrite capabilities. These are the results of our tests:
Setup:

  • Versions
    • Voldemort v0.80.1
    • Cassandra 0.6.0-beta3
  • Boxes: 3 similar nodes with the following spec:
    • 4GB maximum heap size
    • Replication parameters: N=3 (replicas for each entry), R=2 (nodes to wait for on each read), W=2 (nodes to block for on each write). Since R + W > N, reads always see the latest successful write.
    • 8 processors on each server (Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
    • 1TB disk space (Seagate ST31000340NS, 7200 RPM, 32MB cache)
  • Persistence parameters
    • Voldemort (default values)
      • key-serializer: string
      • value-serializer: identity (byte array)
      • persistence engine=bdb (Berkeley DB)
      • bdb.cache.size=1536MB
    • Cassandra
      • ColumnFamily definition: CompareWith="BytesType" RowsCached="10000"
      • ReplicationFactor=3
      • Partitioner=org.apache.cassandra.dht.RandomPartitioner
      • ConcurrentReads=16
      • ConcurrentWrites=32
  • Tests
    • Client Threads: 40
    • Initial load: 5 million records – records present before starting each test
    • WarmUp: 20K records – initial writes before measuring time for each test
    • Number of operations per test: 500K

We ran tests for four different write-rewrite-read configurations. A write is a put operation with a new record (non-existent key). A rewrite is a put operation with an existing key. A read is a get operation on an existing key. These are the configurations we tested (a simplified sketch of the measurement loop follows this list):

  • 50% Write 50% Read
  • 10% Write 40% Rewrite 50% Read
  • 50% Rewrite 50% Read
  • 90% Rewrite 10% Read
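
For readers unfamiliar with this kind of benchmark, the following is a simplified, hypothetical sketch of what each client thread does during a test run. It is not the actual vpork code; the KeyValueStore interface, key naming, and class names are our own placeholders.

    import java.util.Random;
    import java.util.concurrent.atomic.AtomicLong;

    // Stand-in abstraction for either the Voldemort or the Cassandra client.
    interface KeyValueStore {
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    class MixedLoadWorker implements Runnable {
        private final KeyValueStore store;
        private final double writePct, rewritePct;   // e.g. 0.10 and 0.40 for the 10/40/50 mix
        private final AtomicLong nextNewKey;         // shared counter for keys not written yet
        private final int preloadedKeys;             // 5 million records loaded before the test
        private final byte[] payload;                // 1.5 KB or 15 KB value
        private final int opsPerThread;              // 500K operations split across 40 threads
        private final Random rnd = new Random();

        MixedLoadWorker(KeyValueStore store, double writePct, double rewritePct,
                        AtomicLong nextNewKey, int preloadedKeys, byte[] payload, int opsPerThread) {
            this.store = store;
            this.writePct = writePct;
            this.rewritePct = rewritePct;
            this.nextNewKey = nextNewKey;
            this.preloadedKeys = preloadedKeys;
            this.payload = payload;
            this.opsPerThread = opsPerThread;
        }

        public void run() {
            for (int i = 0; i < opsPerThread; i++) {
                double p = rnd.nextDouble();
                long start = System.nanoTime();
                if (p < writePct) {
                    store.put("key-" + nextNewKey.getAndIncrement(), payload);  // write: new key
                } else if (p < writePct + rewritePct) {
                    store.put("key-" + rnd.nextInt(preloadedKeys), payload);    // rewrite: existing key
                } else {
                    store.get("key-" + rnd.nextInt(preloadedKeys));             // read: existing key
                }
                long micros = (System.nanoTime() - start) / 1000;
                // record 'micros' in a histogram to report the average and the 99th,
                // 99.9th, and 99.99th percentile latencies shown in the charts
            }
        }
    }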

We ran all the tests for two different value sizes, 1.5 KB and 15 KB. Even though we evaluated different options, the 15 KB entry size was the most representative scenario for our current needs.
The first pair of charts shows the latency, i.e. the average time it takes a read or write operation to complete in each case. Lower values are better. As expected, Cassandra write (and rewrite) times were consistently faster than Voldemort’s, while read times varied a bit depending on the scenario but were roughly comparable overall.

[Charts: average operation latency, 1.5 KB and 15 KB values]

The second pair of charts shows the 99th percentile latency, i.e. the maximum time within the best 99% of operations; again, lower values are better:

[Charts: 99th percentile latency, 1.5 KB and 15 KB values]

On the front end we have a write-back cache, which means that write operations don’t affect the user experience. Read operations, on the other hand, are directly tied to page loads. That’s why we were concerned about the read-latency peak for Cassandra in the last 15 KB scenario. We ran further tests to measure the 99.9th and 99.99th percentiles, and the difference was even greater: 5050 ms for Cassandra versus 748 ms for Voldemort at the 99.9th percentile, and 9176 ms versus 1129 ms at the 99.99th. This huge difference was a key decision factor for us.
Finally, these two charts show the general throughput in terms of operations (read or write) per second. In this case higher values are better:

[Charts: throughput in operations per second, 1.5 KB and 15 KB values]

Notes:

  • Cassandra’s commit log and data folder are supposed to be placed on different disks to improve performance; we tested with both on the same disk.

Issues found while testing:

  • Voldemort’s client.put(K key, V value) (not the overload that takes a Versioned object) throws ObsoleteVersionException when called with the same key from different threads. The javadoc states “Associated [sic] the given value to the key, clobbering any existing values stored for the key,” so this was unexpected. A possible workaround is sketched below.
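
One way to work around this, under our assumptions (the putWithRetry helper below is a hypothetical sketch, not necessarily what the final solution will use), is to fetch the current version explicitly and retry the versioned put whenever ObsoleteVersionException is thrown:

    import voldemort.client.StoreClient;
    import voldemort.versioning.ObsoleteVersionException;
    import voldemort.versioning.Versioned;

    final class SafePut {
        // Retry loop: re-read the current version and attempt the put again whenever
        // another thread has written a newer version in between.
        static void putWithRetry(StoreClient<String, byte[]> client, String key, byte[] value) {
            while (true) {
                try {
                    Versioned<byte[]> current = client.get(key);
                    if (current == null) {
                        client.put(key, value);  // first write for this key
                    } else {
                        client.put(key, new Versioned<byte[]>(value, current.getVersion()));
                    }
                    return;
                } catch (ObsoleteVersionException e) {
                    // a concurrent writer won the race; loop and try again with the newer version
                }
            }
        }
    }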

And the winner is …

I think there is no clear winner in general terms. The best option depends on many factors that each company has to evaluate, and my own preference changed a few times during the review and tests.
Having said that, we had to choose one, and we decided to go with Project Voldemort. The main reasons were its simplicity, better version handling (vector clocks), the maturity of its persistence layer, and its latency predictability.
We are currently developing the new solution, and it will take some time before we can put it in production, but we wanted to share our preliminary results with everyone who is considering one of these two options, so they’ll have one more data point when making their decision.
We’ll keep you posted on how it goes.
Diego Erdody
Lead Software Engineer
Other useful articles comparing different key-value stores: