SignalFx is built on many open source projects, including Elasticsearch, Kafka, Zookeeper, and Docker, which help us deal with the massive scale and performance requirements of doing streaming analytics and high resolution monitoring for customers like Yelp.
This post is a little bit about SignalFx, but mostly about Cassandra and how we write data to it.
The kind of data that SignalFx processes is primarily timeseries data, “a series of data points typically consisting of successive measurements made over a time interval.” This includes everything from infrastructure metrics like CPU utilization, to application metrics like number-of-jobs-running-per-hour, to customer metrics like API-latency-experienced-by-customer. Cassandra is where we keep all that data; it forms the foundation of our timeseries database (TSDB in the parlance).
Over time, we’ve gone through three different stages of how we write datapoints of each metric timeseries (MTS) we receive from users into Cassandra.
Stage 1 – Vertical Writes
Our Cassandra schema is super obvious. Each row represents an MTS and we write many points in a row to represent an MTS over time. But that isn’t an accurate representation of how datapoints actually arrive at SignalFx from customers. They don’t send us all the points, say, for 5 seconds of MTSa and then all the points for 5 seconds of MTSb and so forth. Instead, what we receive is many points at time t=x for each of N MTS’s. So at every t=x we have to touch every row in the TSDB.
This is bad. Not only is it an inefficient way to write to Cassandra, but also makes it work quite hard. There’s a significant fixed cost to touching a row and a significantly smaller incremental cost to all data written beyond the first point when you touch that row. The more rows you touch, the more times you pay that fixed cost. When writing one point at a time per t=x per row, there are very few points per row over which you can amortize that fixed cost.
Think of doing this with a stacked filing cabinet. Say you have three drawers, vertically stacked. Every second you receive one document for each of the three drawers at once.
- Open the top drawer, file one document.
- Open top-1 drawer, file one document.
- Open top-2 drawer, file one document.
- The next batch of three documents is waiting, go back to step 1.
Stage 2 – Per Time to Per Row
Since we’re not receiving data in accordance with our data model, how about we just transform them into our model before writing them to Cassandra?
This is how we arrived at adding a memory tier in front of Cassandra. In this tier, we’re still writing N points for N MTS’s per time t=x to N rows. But we’re doing it on hardware that doesn’t bear nearly the cost of the TSDB row write. Writing vertically (1 point for each of N rows) still costs a little more than writing horizontally (many points per row for t=1..x) in the memory tier itself, but that cost differential is negligible for our purposes (for now).
Effectively, we’re assembling wide writes in the memory tier. Then, every so often, we write a memory tier row’s worth of points over to the corresponding TSDB row (and discard it from the memory tier), resulting in bearing the fixed cost of writing to a Cassandra row fewer times over any block of time. Or looked at another way: amortizing each row touch over a larger number of points written.
Each datapoint is still written as an object and constitutes a write as far as Cassandra is concerned, pointing the way to further optimization.
Stage 3 – Packed Writes
Can we reduce the number of objects and writes? Effectively, is there a way to change how we use Cassandra’s schema?
Yes, yes we can. 🙂
Let’s take the datapoints in a memory tier row and pack them into a single block containing many points. That block is a single object and a single write to the TSDB. Doing this, we found that the whole is smaller than the sum of its parts. Packing multiple points into a single object and optimizing that object removes redundancies that were present in the each-datapoint-is-an-object model. The block is smaller on disk than the group of individual points.
- Disk utilization growth reduced to roughly half the rate
- CPU utilization decreased by roughly 25%
- Write performance improved 5-8x
Stage 4 – Work In Progress!