Scale is a Dish Best Served Eventually Consistent

Ryan Svihla
3 min read · Sep 23, 2015


A lot of people new to Cassandra find the data modeling it requires tedious and outrageously hard. They long for their RDBMS: if only [insert favorite vendor or project lead here] would make their RDBMS scale like Cassandra, they could tell their bosses to shove off and go back to using their favorite tool. To people saying this (and I hear echoes of it in conversations often), I’ve got news for you: the problem isn’t Cassandra or your favorite tool, it’s your style of data modeling.

Scaling with your favorite RDBMS

Let’s step through what companies have to do to scale their existing databases. Some of this is helped along by a favorite vendor tool of choice, but at the end of the day it’s bolting on the same principles already in place in Cassandra, Riak, and DynamoDB.

Sharding for Horizontal Scale

This decision is driven by write throughput, because you can’t cache writes. In the simplest design, with two databases, you split your dataset in half and build query logic into your application to route each query to the database where the data lives.
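To make that concrete, here’s a minimal sketch of hash-based shard routing in Python. The host names, table, and execute() interface are invented for illustration; the routing rule is the part your application ends up owning forever.

```python
# Hypothetical shard hosts; the real list lives in your config somewhere.
SHARDS = ["db-shard-0.example.com", "db-shard-1.example.com"]

def shard_for(user_id: int) -> str:
    """Route a key to a shard by hashing it. Every read and write
    for this key must go through the same rule, or the data simply
    won't be found later."""
    return SHARDS[hash(user_id) % len(SHARDS)]

def insert_order(conn_pool: dict, user_id: int, total: float) -> None:
    # conn_pool maps host -> a DB-API-style connection (illustrative).
    conn = conn_pool[shard_for(user_id)]
    conn.execute(
        "INSERT INTO orders (user_id, total) VALUES (%s, %s)",
        (user_id, total),
    )
```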

This works great, but what happens when you need to join data that lives on both servers? Most people just do it client side and issue two queries. That works fine, but for how long? What happens when it’s 30 servers?
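A sketch of what that client-side “join” tends to look like, using the same invented connection pool and schema: fan the query out to every shard and merge the results in application code. Tolerable at two shards, painful at 30.

```python
def orders_for_email_domain(conn_pool: dict, domain: str) -> list:
    results = []
    for host, conn in conn_pool.items():      # one round trip per shard
        rows = conn.execute(
            "SELECT u.id, o.total FROM users u "
            "JOIN orders o ON o.user_id = u.id "
            "WHERE u.email LIKE %s",
            (f"%@{domain}",),
        )
        results.extend(rows)                  # merge (and maybe re-sort) client side
    return results
```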

Denormalization

At some point you realize some queries are just way too expensive to do at runtime. So you write a batch job or two to generate simple materialized views that pre-aggregate across all of your shards. This brings your query times back in line.
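Roughly what one of those batch jobs looks like, again assuming the invented connection pool above and a hypothetical daily_revenue summary table: scan every shard, aggregate, and write the result somewhere cheap to read.

```python
def rebuild_daily_revenue(conn_pool: dict, reporting_conn) -> None:
    # Aggregate order totals per day across every shard.
    totals = {}
    for conn in conn_pool.values():
        for day, total in conn.execute(
            "SELECT date_trunc('day', created_at), sum(total) "
            "FROM orders GROUP BY 1"
        ):
            totals[day] = totals.get(day, 0.0) + total

    # Replace the summary table so the expensive query becomes one cheap read.
    reporting_conn.execute("TRUNCATE daily_revenue")
    for day, total in totals.items():
        reporting_conn.execute(
            "INSERT INTO daily_revenue (day, total) VALUES (%s, %s)",
            (day, total),
        )
```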

Data Loss and Replication Lag

Somehow you lose a shard one day. That’s fine because you back your data up, but you’re not getting answers from that shard while it’s down. Worse, your materialized views are now corrupted because data went missing, and you have to rerun your batch jobs. So you set up leader/follower async replication so you can still serve reads while the leader is down.

However, a few days after your last automated failover, you start to find inconsistencies and impossible values in your database. What happened? You have replication lag between leaders and followers, and if you’re multi-datacenter it can be significant, stretching into hours. The following example illustrates the problem.
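Here’s a rough, self-contained sketch of how lag plus failover produces those impossible values. The two-replica setup and the numbers are made up purely for illustration.

```python
class Replica:
    def __init__(self):
        self.orders_total = 0

leader, follower = Replica(), Replica()
replication_queue = []            # async: writes reach the follower eventually

def write(amount):
    leader.orders_total += amount
    replication_queue.append(amount)

def replicate(n):
    # Ship at most n queued writes to the follower (lag = whatever is left).
    for _ in range(min(n, len(replication_queue))):
        follower.orders_total += replication_queue.pop(0)

write(100); write(50)             # leader now says 150
replicate(1)                      # follower has only seen the first write

print(leader.orders_total)        # 150
print(follower.orders_total)      # 100 -- a read here "loses" an order

# Now the leader dies and the follower is promoted. The un-replicated
# write is gone, and any materialized view built from the old leader's
# 150 no longer matches the new leader's 100: an "impossible" value.
```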

You have to hop across multiple servers, so you can’t lock or do a transaction (and even when you can, it’d be brittle and slow), your counts are only eventually consistent, and your followers may not even agree with each other!

So What

You’ve now lost the following from your favorite RDBMS:

  • ACID across the entire dataset
  • General-purpose transactions (what’s left is severely limited)
  • Joins (what’s left is limited and of limited use)

Several of you may now be saying, “Yeah, but I’m fine with all this; at least I don’t have to learn data modeling with Cassandra.” Actually, you’ve just re-implemented Cassandra, but very badly. You’re paying for the overhead of relational technology that you can’t even use, and it’s substantial overhead. Worse, you’re sitting on a b-tree database instead of a log-structured, wide-column one, so you LOSE a ton of features compared to Cassandra in the process.

Summary

Some of you may now think I’m insane and that your favorite database doesn’t work this way. You may want to check again. I’ve worked on some very large deployments and have now ripped out several different database technologies built in the above fashion. Users who already understand distributed systems find Cassandra a joy to work with; those who haven’t scaled before view all these constraints as Cassandra problems instead of what they really are: just plain physics.
