Cassandra: How many nodes are talked to with Quorum? Also should I use it?

Ryan Svihla
Mar 17, 2016


This is a common early point of confusion for users new to Cassandra, so I thought I'd drop a brief note in hopes that someone may stumble onto it.

Quorum is a majority of replicas, where the total number of replicas is the sum of the replication factors across all data centers. But if I have a replication factor of 3 in each of 3 data centers, how many replicas are there? There are 9, so a majority is 5 (sum_of_all_replicas / 2 + 1, rounded down).
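To make the arithmetic concrete, here is a minimal sketch in Python (the helper name is mine, purely for illustration):

```python
# Quorum is a majority of all replicas, summed across every data center.
def quorum_size(rf_per_dc):
    """sum_of_all_replicas / 2 + 1, rounded down."""
    return sum(rf_per_dc) // 2 + 1

print(quorum_size([3, 3, 3]))  # 9 replicas in 3 DCs -> majority is 5
print(quorum_size([3]))        # 1 DC at RF 3        -> majority is 2
```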

Now that that's out of the way, let's take a common scenario:

Assuming 3 data centers with RF 3 in each, and that we are writing data:

LOCAL_ONE - Needs 1 node in the local DC to succeed.

LOCAL_QUORUM - Needs 2 nodes in the local DC to succeed.

EACH_QUORUM - Needs 2 nodes in each DC to succeed, 6 nodes total in this case.

LOCAL_SERIAL - Like LOCAL_QUORUM: needs 2 nodes in the local DC to succeed, but also adds a lot of extra traffic (at least 4x round trips) to enforce ordering and consistency.

ANY - Just has to successfully write a hinted handoff, which is normally a mechanism for making sure a write is retried by the database. Do not use this.

ONE - Needs 1 node in any DC to succeed.

TWO - Needs 2 nodes in any DC to succeed.

THREE - Needs 3 nodes in any DC to succeed.

QUORUM - Needs 5 of the 9 replicas to succeed, from any combination of DCs.

SERIAL - Like QUORUM: needs 5 replicas from any combination of DCs to succeed, but also adds a lot of extra traffic (at least 4x round trips) to enforce ordering and consistency.

ALL - Needs all 9 replicas to succeed. If any replica is down, the write fails.
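As a concrete illustration of how these levels get requested, here is a minimal sketch using the DataStax Python driver. The contact point, keyspace, and table are hypothetical; setting the level per statement is just one way to do it:

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])          # hypothetical contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

# LOCAL_QUORUM: the coordinator waits for 2 of the 3 replicas in the local DC.
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(write, (uuid.uuid4(), 'example payload'))

# EACH_QUORUM: waits for 2 replicas in every DC, 6 nodes total in this scenario.
write.consistency_level = ConsistencyLevel.EACH_QUORUM
session.execute(write, (uuid.uuid4(), 'example payload'))
```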

Which do I pick?

Most applications fall into one of three categories in practice:

  1. It can tolerate potentially long but limited inconsistency, and it needs a lot of scale and/or availability (ironically, since most people believe the opposite, ATMs in banking are a great example of this). This is common in higher-throughput and multi data center active-active use cases, but not all. For these use cases, use LOCAL_ONE if you want to maintain consistent latencies, and ONE if you want the highest chance that writes succeed and do not actually care about latency (this is very rare).
  2. It can tolerate very temporary inconsistency of a few milliseconds. This is the overwhelming majority of well-architected applications (this type of inconsistency happens with stale UI pages all the time and no one is the wiser). I highly recommend LOCAL_QUORUM for most cases. Some folks are fine with QUORUM, but I would use it only sparingly if possible.
  3. It can tolerate no inconsistency and is OK with having potential contention problems. This is either a very badly architected application (most of the time) or one of the rare problem domains that cannot exist with even temporary inconsistency, i.e. something where more than one thing cannot happen at a time and it cannot be sharded, work queues being the most common. Use LOCAL_SERIAL if the work can be partitioned by DC; however, SERIAL may be a better option for these truly “global” use cases (see the sketch after this list).
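For that third category, the usual tool is a lightweight transaction, whose Paxos phase runs at the serial consistency level. A minimal sketch with the DataStax Python driver follows; the work_queue table and its columns are hypothetical:

```python
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])          # hypothetical contact point
session = cluster.connect('my_keyspace')  # hypothetical keyspace

# "Claim" a task exactly once using a lightweight transaction. The Paxos round
# runs at LOCAL_SERIAL; use SERIAL instead for a truly global, cross-DC claim.
claim = SimpleStatement(
    "INSERT INTO work_queue (task_id, owner) VALUES (%s, %s) IF NOT EXISTS",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL)

result = session.execute(claim, (uuid.uuid4(), 'worker-1'))
print(result.was_applied)  # False means another worker claimed the task first
```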

Hey, wait a minute, where are the rest?

Don’t use them, except in extremely rare corner cases.

I know that’s strong language, but the people I see using these often have broken ideas of how failure and support escalations work. You may say your use case or problem domain is different, but it is more likely that I’ve worked on the same problem domain and gotten someone closer to their use case goals with one of the levels above instead of one of the levels below.

I’m going to talk about everything below as if I have RF 3 in all data centers.

TWO — In the multi-DC case this is neither very available nor very consistent. In the single-DC case this is often the same as QUORUM or LOCAL_QUORUM.

THREE — Like TWO, only slower and still not very consistent. In a single DC this is like ALL.

EACH_QUORUM — Summary: if your data is this important use SERIAL, and if SERIAL is too slow use QUORUM. Detail: this means you fail if any multi-DC link goes down, and it’s really slow. At some point one has to ask what the point of having multiple DCs is if it makes you less available. QUORUM gives you nearly as good consistency, is slightly faster, and has the advantage of potentially tolerating a DC going down. The only case where EACH_QUORUM is possibly useful is the rare action that must be in all data centers at write time, but even that still has the downside of not being free from race conditions the way SERIAL is.

ANY — Hints aren’t especially reliable, and in pre-3.0 versions of Cassandra hint generation could murder you. When using ANY during any stressful situation you’re silently accepting a bunch of writes that are likely never to be delivered. Use ONE or LOCAL_ONE instead, depending on your latency needs; at least you know the write got delivered to one node, and repair will make sure it gets shipped around.

ALL — Is really slow, has essentially no availability (a single down replica fails the write), and in fact has WORSE availability than a single machine running MySQL. Most obnoxious to me is that you get no different result than if you had simply succeeded at QUORUM reads and writes all the time. The only time this will help you is if you lose a lot of nodes at once and you want to make sure you got everything. The downside is that the writes are so expensive you’re more likely to lose a lot of nodes at once. Self-defeating design at its finest. Use SERIAL or QUORUM if you really need this level of consistency; they do it better or faster, depending on which you choose.

Wrap up

Not everyone is going to agree with this article, and I’m OK with that. If you have a specific objection I’m willing to listen to reason, but please address what I’m stating above and why a given alternative is not really better. Do not make this about how your application is different; in almost every case there is something better for your goals, as I’ve stated above.

Update

I predictably had some pretty strong pushback on some of the points, so I’ll summarize it and point out my concerns.

ALL does not cause more load

Some really smart people I respect told me this, but I think they’re viewing it in the vacuum of a POC or of a healthy cluster with transient load.

  1. The coordinator absolutely has to do more work: it has to hold responses in memory longer, and it also has to process more responses from the other nodes.
  2. If the application is not limited by ingest at the driver level, the Cassandra nodes will spend more time doing work.
  3. Customers will retry the write, as we always tell them to do when it fails. This means more requests for the same amount of work.
  4. I’ve personally worked with a customer for over a month trying to get them to be able to handle the load increase from LOCAL_ONE to LOCAL_QUORUM. They were not at all limited by ingest in their case.

The only way I agree with this is if you are not pushing the cluster to the max. Unfortunately, when a cluster is under stress (say, failing nodes or a rogue process on the server), the app servers definitely can push the cluster to its max even if they weren’t able to previously. In fact this is the NORMAL case when bad things happen. So this is fundamentally why, no matter how smart these people are, I have a hard time agreeing with this.

EACH_QUORUM as a back pressure mechanism

People I respect a lot are happy with the results of this strategy. Their rationale is that it prevents the app server process from achieving as high a throughput as it otherwise could.

Here is my issue with this thought process:

  1. Much like the argument about ALL above, I believe EACH_QUORUM makes more work happen via retries, more work processing responses, and more time for some objects to hang out in the heap.
  2. You end up falling back to, and succeeding at, the lower consistency level you were going to write with in the first place (see the sketch below).
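To spell out point 2, here is a minimal sketch of the fall-back-and-retry pattern that argument assumes, using the DataStax Python driver; the helper name and the choice of fallback level are mine:

```python
from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.query import SimpleStatement

def write_with_fallback(session, query, params):
    """Try EACH_QUORUM first; if it can't be met, retry at LOCAL_QUORUM.

    The second attempt is exactly the write you would have done at the
    lower consistency level in the first place.
    """
    stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.EACH_QUORUM)
    try:
        return session.execute(stmt, params)
    except (Unavailable, WriteTimeout):
        stmt.consistency_level = ConsistencyLevel.LOCAL_QUORUM
        return session.execute(stmt, params)
```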

So when would this work? When you’re not in an overstressed situation but your ingest is just starting to get a bit too fast (DSE integration with Solr is a good example of this, where Cassandra isn’t stressed but Solr indexing is very stressed). So I can see this as a good strategy where the following is true:

  1. You’re trying to limit something outside of Cassandra that’s related to Cassandra (really any external system that gets written to as part of this workload)
  2. You don’t have a huge amount of writers that are going to beat the crap out of the coordinator anyway
  3. You don’t have your own backpressure mechanism

However, I still believe (and it’s just a belief), based on the logic of it, that in a stressed situation this is potentially counterproductive, as the ingest will certainly outstrip what the coordinator can handle and the extra blocking at the driver level will not save you then.
