The Tale of a Terrible, Horrible, No Good, Very Bad Bug in Cassandra

3/20/2025 12:00:00 AM

A long time ago I worked at a company that used and deployed Apache Cassandra widely. For those of you unfamiliar, Cassandra is a unique database that's widely used by many of your favorite massive-scale websites. Apple is regularly cited as a major user, with over 100 PB of data sitting around in thousands of clusters, but it's relatively ubiquitous in the industry.

What is Cassandra?

I'm no expert in databases, so I won't pretend to give you some detailed explanation about all the intricacies of how Cassandra works or its benefits (the docs have lots of detail you can dig into if that's your thing). Instead, let's consider a high-level summary of Cassandra's general use case, which is handling huge distributed clusters containing massive amounts of data. To support this, Cassandra has to make the usual CAP-theorem-driven trade-offs in flexibility, and the schema requires you to rigidly define how all data is stored, sorted, and queried.
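To make that last point a little more concrete, here's a minimal sketch (in Go, using the gocql driver) of what that rigidity looks like in practice. Everything here is hypothetical: the "demo" keyspace, the table, and the columns are made up for illustration and are not the schema from this story.

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Assumes a locally reachable cluster and a pre-existing "demo" keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "demo"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// The partition key (user_id) determines which replicas own a row; the
	// clustering key (created_at) determines the sort order within a partition.
	// Both are baked into the table definition up front.
	if err := session.Query(`
		CREATE TABLE IF NOT EXISTS messages (
			user_id    uuid,
			created_at timeuuid,
			body       text,
			PRIMARY KEY ((user_id), created_at)
		) WITH CLUSTERING ORDER BY (created_at DESC)`).Exec(); err != nil {
		log.Fatal(err)
	}

	// Reads then have to follow that layout: fetch a partition, optionally
	// sliced by the clustering key. Ad-hoc filtering on other columns isn't
	// part of the model.
	var created gocql.UUID
	var body string
	iter := session.Query(
		`SELECT created_at, body FROM messages WHERE user_id = ? LIMIT 10`,
		gocql.TimeUUID()).Iter()
	for iter.Scan(&created, &body) {
		log.Println(created, body)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}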

Move Fast, and Things Break

One Cassandra feature that we leaned heavily on in the early stages of development was the ability to perform online schema migrations with no impact on clients. This allowed us to easily expand our schema over time, whether to support new features we built or scale-driven re-architectures of how our data was modeled.
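For reference, the migrations themselves are nothing exotic: a schema change in Cassandra is just another CQL statement run against the live cluster. A tiny hypothetical example, continuing the made-up "messages" table from above (again, not our real schema):

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "demo"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Add a column while clients keep reading and writing; the table never
	// goes offline for this.
	if err := session.Query(`ALTER TABLE messages ADD reply_to uuid`).Exec(); err != nil {
		log.Fatal(err)
	}
}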

Of course, on one fateful day, in the midst of a relatively benign ALTER TABLE, we found ourselves with a burst of errors from database clients. I'd put this near the top of the list of driver error messages I don't want to see right after performing a schema migration on a production database:

[E 161206 01:56:58 geventreactor:141] Error decoding response from Cassandra. ver(4); flags(0000); stream(27); op(8); offset(9); len(887); buffer: '\x84\x00\x00\x1b\x08\x00\x00\x03w...'

As it turns out, in the midst of this migration, various rows in our table had somehow become corrupted. This corruption persisted and was easy to see once we found an affected record (which we could do based on logs):

/usr/bin/cqlsh.py:632: DateOverFlowWarning: Some timestamps are larger than Python datetime can represent. Timestamps are displayed in milliseconds from epoch.
Traceback (most recent call last):
 File "/usr/bin/cqlsh.py", line 1301, in perform_simple_statement
 result = future.result()
 File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.5.0.post0-d8d0456.zip/cassandra-driver-3.5.0.post0-d8d0456/cassandra/cluster.py", line 3650, in result
 raise self._final_exception
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 2: invalid start byte

Well You Had Backups, Right?

We did have backups, although, at the time, recovering from them would have been a lengthy process. Instead, we made a crucial discovery that let us resolve the problem far more easily: only records that had been read (or, I think, possibly written as well) during a short time window around the schema migration, which we estimated at under a second, had been corrupted. We used our extensive API logging to work out which specific records could have been corrupted, and then re-wrote them using an alternative data source that held a full copy of the data.

Of course, fixing our databases was crucial, but we had now unlocked a new fear of the once-great ALTER TABLE we had come to love. On top of this, we had uncovered a very serious data corruption issue in a widely used database. We needed to diagnose the issue further, report it upstream, and potentially implement a fix ourselves.

And I'd Do It Again

We first hit this issue on a cluster that had relatively recently been upgraded to Cassandra 3.11 (from 3.0.x). Since we'd never run into the problem before, we assumed it was likely specific to Cassandra 3.11.

Sadly, this assumption proved wrong when we hit the problem again on an older 3.0.x cluster. We managed to recover in a similar way (this was a much smaller cluster as well), but the scope of our issue had just grown massively. All but one of our clusters (our largest and oldest) ran Cassandra 3.x at this point, and it would only be a few weeks before feature work started getting blocked by our newfound inability to perform schema migrations.

Asking For Help

By this point, we had already opened an issue on the Cassandra tracker and begun detailing some of our findings.

Despite this, nobody else was able to reproduce our issue locally, and it seemed unlikely we'd get an easy resolution. We would have to investigate further and find a way to provide a simple reproduction for the Cassandra developers (who are open-source volunteers, after all).

We did talk to, and considered contracting, a few consulting firms that offered to work through the issue for us, but we couldn't reach an agreement that seemed reasonable, so we resolved to do it ourselves.

Breaking Things on Purpose

Luckily, a great skill of mine is breaking things, so I set to work. Because our lead theory had to do with records being read during the ALTER TABLE, I needed a way to keep read traffic flowing while a schema migration was in progress. To do this, I developed a few tools that generated reads and writes against a test cluster with an identical schema and a reduced copy of our production dataset (a rough sketch of their shape follows the list below). During this testing, I made some notable discoveries that helped us narrow in on the problem:

  • The bug required Cassandra to be running as a cluster (i.e. with multiple nodes), suggesting it might have something to do with replication.
  • We had originally hit the issue from Python, which is what I tested with initially. However, I eventually used Go for the reproduction, both for portability and to eliminate any client-driver-specific causes.
  • I tested a few different versions of Cassandra in the 3.x family, and they all exhibited the problem, suggesting it was present throughout the 3.x line. We never tested 2.x because it wasn't relevant to our use (we didn't perform schema migrations on our 2.x cluster).
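For the curious, here's a rough sketch of the general shape of those tools, in Go with gocql. This is not the actual reproduction repository: the node names, keyspace, "items" table (assumed to already exist with id, name, and payload columns), and timings are all made up, and the real version ran against a copy of our production schema.

package main

import (
	"log"
	"sync/atomic"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// The bug only reproduced against a real multi-node cluster, so point the
	// driver at several nodes.
	cluster := gocql.NewCluster("node1", "node2", "node3")
	cluster.Keyspace = "repro"
	cluster.Consistency = gocql.Quorum
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	var stop int32
	id := gocql.TimeUUID()

	// Writer: keep overwriting the same row so replicas briefly diverge.
	go func() {
		for atomic.LoadInt32(&stop) == 0 {
			if err := session.Query(
				`INSERT INTO items (id, name, payload) VALUES (?, ?, ?)`,
				id, "some-name", "some-payload",
			).Exec(); err != nil {
				log.Println("write error:", err)
			}
		}
	}()

	// Reader: quorum reads are what can kick off a read repair against a
	// stale replica; this is where the decode errors surfaced.
	go func() {
		for atomic.LoadInt32(&stop) == 0 {
			var name, payload string
			err := session.Query(
				`SELECT name, payload FROM items WHERE id = ?`, id,
			).Scan(&name, &payload)
			if err != nil && err != gocql.ErrNotFound {
				log.Println("read error:", err)
			}
		}
	}()

	// Let the load build up, then run the schema migration under traffic and
	// give things a moment to misbehave.
	time.Sleep(2 * time.Second)
	if err := session.Query(`ALTER TABLE items ADD extra text`).Exec(); err != nil {
		log.Println("alter error:", err)
	}
	time.Sleep(2 * time.Second)
	atomic.StoreInt32(&stop, 1)
}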

Eventually, I was able to reduce the dataset down to just two rows using our production schema. I created a small GitHub repository with the tools required to reliably reproduce the issue, and within hours of posting it to the original issue thread, one of the upstream developers confirmed the reproduction locally.

The Root Cause

Again, since I'm no expert on databases, I'll defer to the original bug thread for a proper explanation.

Cassandra implements a process known as "read repairs", which serves as a form of anti-entropy within a distributed database. A read repair is simply a process where we correct data on some of the replicas of a record while we're reading from them. This correction is required because replicas may contain outdated versions of the record if they were not part of the quorum of nodes selected when writing the latest update to the record. Cassandra guarantees stale records will never be read back to a client (when you correctly specify your client's consistency level) by ensuring reads are also performed with a quorum of nodes.
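As a small aside, that quorum overlap is the whole trick, and the arithmetic is easy to sanity-check yourself. A tiny illustrative snippet (plain Go, not Cassandra code):

package main

import "fmt"

// quorum returns the number of replicas in a quorum for a given replication
// factor: floor(rf/2) + 1.
func quorum(rf int) int { return rf/2 + 1 }

func main() {
	for _, rf := range []int{3, 5} {
		q := quorum(rf)
		// Because q + q > rf, any write quorum and any read quorum must share
		// at least one replica, so a quorum read always sees the latest write.
		fmt.Printf("rf=%d quorum=%d guaranteed overlap=%d replica(s)\n", rf, q, q+q-rf)
	}
}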

A bug in the read-repair process meant that, during a schema migration, a node could get confused about which cells (the data for individual columns) corresponded to which columns. This could cause some columns to be copied incorrectly into others, or cause updates for some columns to be dropped in favor of the stale record's version. If two columns shared a data type, the result was silent data corruption. If the data types weren't compatible, the result was the client deserialization errors we saw above. Only queries that triggered a read repair within roughly 100ms of the schema migration would cause corruption.

If you'd like to dive deeper into the actual fix, check out the final commit, which landed here.

Takeaways

While I mostly thought this was a fun story, I do think it has some interesting takeaways that are relevant to anyone interested in working on and diagnosing these types of problems:

  • Always start with a theory. It's usually a huge waste of effort to blindly throw resources at a problem. Instead, use the evidence you have to direct you toward a more scoped level of testing and debugging.
  • Reduce what you can, but don't worry about the details. It's more important to get a reproduction that can help others with real-time debugging than it is to get a perfectly minimal test case that might more directly point to the problem.
  • Keep more than one way back. If we hadn't had backups, another readily available copy of the data, and extensive API logs to correlate corrupted rows, it would have been pretty hard to recover the corrupted data so quickly.

Until Next Time?

Funnily enough, this was only the first time we would encounter a severe data corruption bug with Cassandra. The second time involved a lot of popcorn and resulted in the Apache team retroactively marking an entire production feature as experimental. Hopefully, I can find a time to talk more about that in the future :)