At SignalFx, users send high-volume high-resolution infrastructure and application metrics into the service at up to one-second resolution. Those metrics are then processed by our streaming analytics engine, SignalFlow, and simultaneously visualized in streaming charts. Our ingest pipeline is responsible for consuming and aggregating all this data as it comes in and making it available for the other services that make up SignalFx for consumption and processing. The ingest pipeline has to handle this enormous volume and variety of data at near-real-time velocity.
This is the first in a series of posts that will look at how we scaled part of the ingest pipeline. The series will cover topics ranging from using fixed-size ring-buffers for inter-thread communication to designing cache-efficient data structures:
- Identifying bottlenecks
- Changing the system to allow linear scaling
- Changing internal data structures for better performance
Quantizer: A Brief Introduction
Quantizer is one of more than a dozen microservices that make up SignalFx. It’s a relatively simple service that is fed with raw time series data and does some simple math to make sure that data is consumable by downstream services like our time series database or analytics service by aligning datapoints to the right time segments.There’s always some jitter and lag with transporting metrics from users to SignalFx. Quantizer makes sure that the right datapoints end up in the right time segments, down to the second, so that when users try to chart, run analytics against, and alert on them downstream–the right datapoints are being compared to each other.
The workload is characterized by very simple math work on a per transaction basis, but performed on a large number of transactions per second. In this post, we will specifically look at the library that does this transformation, but not the server in which it is embedded.
Why did we need improvements?
Given the simplicity of the workload, we initially expected that a small tier of machines would sufficiently host the service. However, we were surprised to find that this service was CPU-bound. We were never able to push an eight-core server to over 25,000 transactions/second.
What did the profiler say?
At this point, like all good engineers, we pulled out the trusty profiler. To our disappointment, there were no big hotspots at which we could wave our shiny optimization wand to make the problems disappear. The highest at which any one method stood was about 4%. The profile was full of low single-digit percentage methods.
We were convinced we were doing something wrong, but firmly believed that a million transactions per second was possible given simple-math the workload. Because the profiler couldn’t illuminate bad code with a spotlight, we turned to doing experiments and microbenchmarks. However, microbenchmarks are nearly impossible to get right, especially on just-in-time environments like the JVM. Some issues include:
Generating accurate workloads
Details like the nature of the data, the timing of the data, and the ordering of how the data is fed in can change results tremendously. We record all our incoming data to a journal along with timing information. This was a great source of data for us because it accurately captured our workload. Our benchmarks involved reading data from this journal and feeding our library with it. If you don’t have this data, a good first step might be feeding the benchmark with randomized data in a randomized order.
Sending data in a very predictable order (unless your real data is also predictable) can be very misleading misleading. Processors use branch predictors to accelerate applications that have predictable branches. When you send data in sequentially, these branch predictors fire in and make things look rosy. Real data, however, is quite unpredictable and the branch predictors don’t help much. So sending in sequential data can make your benchmarks look much better than they really are.
In the beginning, our naïve benchmark sent in datapoints for a million time series in the exact same order every iteration. When we changed this order to be random across iterations, our library went two times slower.
Dealing with the idiosyncrasies of the runtime environment
Testing on a laptop is fraught with issues. How hard your browser is running might skew benchmarks. The operating system and the JVM version in a local environment might be different from those in a production environment. We’ve found it best to start with a quiet system and actually running our benchmarks on a production server.
Accounting for the compiler and virtual machine
Optimizing compilers are very good at deleting any code that does not generate an observable side effect. Naïve benchmarks are often just measuring the time required to call the timer functionality. We strive to use well-maintained benchmark tools like JMH to reduce the chances of making such mistakes. Similar tools are available for C++. Just using JMH does not absolve one of the other responsibilities described above, but it does help avoid further pitfalls. We believe in designing our benchmarks to assume that JMH was not available to us. We then use JMH to make sure we are extra-safe.
In garbage-collected environments, the work done in the background is not always linear. Even for non-garbage-collected environments, the operating system will itself induce jitter. We strive to run our benchmarks long enough to account for such periodic jitter.
An example is accounting for allocation on the JVM. Allocations in the fast path are very fast—they can reduce to a pointer bump in most cases. The price to pay for allocation on most garbage-collected run times is the collection and not the allocation itself. Compaction during garbage collection can change the locality of memory access. So, a high allocation rate in a short benchmark might never manifest itself as an adverse effect. It might only show up much later and mysteriously slow down the entire system.
Even accounting for all such issues, microbenchmarks are inherently inaccurate. In a real production environment, resources like CPU caches are shared by all code running on the system. Even within a single process, different libraries affect each other in unpredictable ways. For example, a misbehaving library could cause a lot of cache misses and affect other well-behaved libraries. Such effects are lost in a microbenchmark. We prefer measuring our code through metrics so we know how each code push moves the needle.
Examining the Code
A good model for how modern hardware works and an eye for detail go a long way towards finding out why programs are slow. Our intuition as programmers is often wrong! Our mental models of hardware and our application are sometimes too simplistic. Thus, any change driven by intuition must be verified by experiments. Perfectly logical changes can lead to no improvement at all or even regression. With that in mind, the next step was to examine our code, figure out why it was slow, address the issues, and verify through microbenchmarks that we improved.
If you want a recap on how modern CPUs work, I’d recommend reading Martin Thompson’s blog, Fabian Giesen’s blog, and Dan Luu’s post on new CPU features.
Bottleneck 1: Synchronization
Our library was written to be thread safe—callers were allowed to access the same time series from multiple threads. We used Concurrent Hash Maps and coarse-grained locks pervasively. Our library uses multiple threads, each accessing the same concurrent data structures. The number of threads was equal to the number of cores we had minus one. So on an eight-core machine, we ran seven threads. We ran another thread that went through the same shared data structures in a loop and performed work on each.
This setup is bound to suffer from contention. You can read about the effects of contention in this post by Martin Thompson. We came up with a benchmark to verify that contention was a real issue. Our benchmark ran the same workload but varied the number of threads our library ran. We noticed that as we reduced the number of threads, our numbers actually improved:
- Seven threads: benchmark ran at 17,000 transactions per second
- One thread: benchmark ran at 76,000 transactions per second.
That’s a 4.5x improvement just by changing the number of threads used!
This pushed us to see that our library suffered from parallel slowdown. We could just use a single thread and call it a day, but we were not close enough to our goal of more than a million transactions per second. Using a single thread would mean underutilizing CPU capacity. We could run multiple JVMs, but that can be a hassle that’s not very efficient. We’ll explore how we changed the threading model in a future blog post.
Bottleneck 2: Inefficient data structures
A good mental exercise to know how your system works is tracing the flow of a single transaction. For us, it was following the entire path taken from when a raw datapoint comes in to when it is output to downstream services. We also calculated the worst case number of cache misses during a single trace of this path. Cache misses can be very expensive. We used a very simple model to calculate the worst case scenario: any time we touch a new cache line, we calculate that as a cache miss. Thus we count every indirection as a potential cache miss.
Let’s demonstrate by analyzing a simple HashMap get operation. Java’s default HashMap is implemented using separate chaining with lists. Roughly speaking, it is implemented as an array of lists of key value tuples. Let’s assume that our starting point is where we have the reference to the HashMap and the hashCode of the key in our registers. From here on, identify the following steps:
- Access the array element for the bucket to which our key hashes. Since this is a random bucket, we count this as a cache miss.
- This array element is a reference to an actual object in memory. We follow this reference. Since this could be anywhere in memory, we count this as another potential cache miss.
- The entry itself contains references to a key and a value. To verify if the key we seek is indeed the key contained in this entry, we follow the reference to the key so that we can actually compare the contents. This is yet another random jump in memory and, hence, another cache miss. [Note: If the reference being searched is the same as the reference in the entry, then we don’t need to actually check the contents of the key. But, given we create a new key every time we search our HashMaps, we count this as a new cache line load.]
- Let’s imagine that our current entry contains the key we wanted. We now need to return the value associated with this key. We don’t need extra cache misses to find the reference to the value since it was already part of the map entry. However, once the reference is returned, if the caller accesses the contents of the value, it will jump to random memory again and cause yet another cache miss.
So, that’s four cache misses for a simple HashMap lookup without any collisions. Our library used a lot of HashMaps in the data path and, in general, had a lot of indirections through a deep object tree. So, our hypothesis was that the indirections themselves were responsible for the performance problems and not the math that was done at the end of it.
To verify this hypothesis, we designed yet another benchmark. We eliminated all the time series math from our transactions. Instead, we returned a thread local random from all nontrivial math functions. From previous benchmarks, we knew that the JVM was capable of creating hundreds of millions of thread local randoms. So we ran a version of our library that replaced all math with a thread local random computation and compared it against the unchanged library.
We noticed that there was no discernible difference between the two! We were now convinced that our data structures were to be blamed, and the actual math counted for close to nothing in the profile. We’ll look at how we improved the data structures in another post.
A traditional profiler did not help much with locating hotspots in a CPU-intensive service. We came up with hypotheses based on our knowledge of the code and tested them through experiments to gain more confidence. We’ll follow up with more blog posts to detail the actual changes we made and how that affected the performance of Quantizer. In the next one, we’ll specifically look at how we got to linear scaling.
Interested in working on these kinds of problems? We’re hiring engineers for every part of SignalFx!
Side note: Typically, “quantization” conjures images of sampling and lost data. But the quantizer service does not sample the datapoints. It just makes sure they are aligned to the right time segments and performs some data optimizations to speed up the process of shipping around and using the datapoints within the rest of SignalFx.