How We Upgraded Elasticsearch 1.x to 2.x with Zero Downtime: Performance Advantages in Production

In the previous posts of this series, we discussed key differences between Elasticsearch 1.x and 2.x and how we upgraded our production cluster without incurring any service downtime. In this final post, we will illustrate the performance differences observed after switching to Elasticsearch version 2.3.3 in our production cluster.

Our Environment

At SignalFx, our Elasticsearch production cluster has 54 nodes holding more than 1.5 billion documents and 20TB of data. The cluster runs on i2.2xlarge AWS EC2 instances, each with a 16GB heap dedicated solely to Elasticsearch. Before and after the upgrade, the cluster had the same number of indices, the mappings were unchanged, and doc values were enabled for the non-analyzed fields.


We generate more than 100 million index requests on a daily basis. The documents in our Elasticsearch cluster are mutable, meaning indexing requests also include updates to the existing documents in the cluster in addition to new documents.

The chart below displays the rate of indexing requests received by both clusters. Once we start the double writing phase, the 2.3.3 index starts receiving indexing requests similarly to the 1.7 index. However, since we only have primaries on the target index, the indexing rate is lower (approximately one third). Once we increase the replica count to 2, both Elasticsearch 1.7.5 and 2.3.3 start indexing at the same rate. At the end of the migration, we stop double writing, which is seen in the graph below as a complete stop in indexing requests on Elasticsearch 1.7.5.
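The one-third figure follows from counting shard copies. The sketch below works through that arithmetic, assuming the charted rate counts indexing operations on every shard copy and that the 1.7.5 index already had two replicas per primary; the function name is illustrative only.

```python
def shard_level_index_ops(doc_writes: int, replicas: int) -> int:
    """Indexing operations summed over all shard copies.

    Assumption: each document write is applied once on the primary shard
    and once on every replica of that shard.
    """
    copies = 1 + replicas  # one primary plus its replicas
    return doc_writes * copies


# Same stream of document writes sent to both indices during double writing.
source_ops = shard_level_index_ops(1_000, replicas=2)  # 1.7.5: primary + 2 replicas
target_ops = shard_level_index_ops(1_000, replicas=0)  # 2.3.3: primaries only

ratio = target_ops / source_ops
print(f"target/source indexing rate: {ratio:.2f}")
```

Once the target index also has two replicas, both calls return the same count, which matches the point in the chart where the two rates converge.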


The chart below displays the mean indexing time for the two clusters during the double publishing phase, after replicas were added to the target Elasticsearch 2.3.3 cluster. At this point, both clusters receive the same indexing requests. The chart suggests that indexing takes longer in Elasticsearch 2.3.3 (in orange) than in Elasticsearch 1.7.5 (in green). One likely explanation for the higher indexing latency in Elasticsearch 2.3.3 is the new synchronous fsync behavior (as explained here). In Elasticsearch 2.x, the translog is fsync’ed after every write request, in addition to the previous schedule of every 5 seconds. The fsync takes place on both primary and replica shards. The client doesn’t receive a 200 response code until the request is fsync’ed, hence the increase in indexing latency. The behavior can be changed to make the fsync asynchronous. However, we kept the default synchronous behavior to avoid any data loss during node failures.
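For readers who want to see what that trade-off looks like as configuration, here is a minimal sketch that builds the body for an index settings update. It assumes the `index.translog.durability` setting documented for Elasticsearch 2.x, with `request` as the default and `async` as the alternative. The helper name is hypothetical, and the snippet only builds JSON; it does not contact a cluster.

```python
import json


def translog_durability_body(durability: str = "request") -> str:
    """Build a JSON settings body for index.translog.durability.

    "request": fsync the translog before acknowledging each write (the 2.x default).
    "async":   fsync in the background on the periodic sync interval only,
               at the cost of possibly losing the most recent writes on failure.
    """
    if durability not in ("request", "async"):
        raise ValueError(f"unsupported durability: {durability!r}")
    body = {"index": {"translog": {"durability": durability}}}
    return json.dumps(body, indent=2)


# Body one would send with an update-settings request on the index.
print(translog_durability_body("async"))
```

We did not apply the `async` variant in production; it is shown only to make the option concrete.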

CPU Utilization

The chart below shows our CPU consumption for the same time periods as in the indexing load comparison chart above. CPU consumption with Elasticsearch 1.7.5 (in green) is approximately 20% with many more sporadic spikes, which impacts both indexing and query latency. On the other hand, Elasticsearch 2.3.3 CPU utilization (in orange) remains below 10% for the majority of the time. It only gradually rises towards 20% with an increase in the indexing load and does not show any of the troublesome spikiness we saw with the 1.7.5 cluster.


Query Performance

The catalog search in the SignalFx app is powered by Elasticsearch and became noticeably faster immediately after we switched to Elasticsearch 2.3.3. These searches include complex queries with aggregations, in addition to the suggestions we provide as part of the search. While serving queries, the cluster is simultaneously indexing, and this was the case both before and after the upgrade. The indexing rate of the cluster therefore influences query performance at any given time.

In the chart below, you can see that the query response time for the Elasticsearch 1.7.5 cluster (in green) varies between 300ms and 800ms, occasionally peaking as high as 1000ms, while the query response time for the new 2.3.3 cluster (in orange) consistently remains in the lower range of 200ms to 400ms. The response time is also less jittery after the upgrade.


To compare the mean query latency for the two versions running at different points in time, we time-shifted the query response time of the Elasticsearch 1.7.5 cluster by six weeks. The chart displays the comparison of mean search time over the course of one day for Elasticsearch 1.7.5 (in green) and Elasticsearch 2.3.3 (in orange) clusters.
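Time-shifting simply moves the older series forward so that both lines share one time axis. A minimal sketch of the idea, assuming each series is a list of (timestamp, value) pairs; the sample timestamps and latencies are made up for illustration and are not our production data.

```python
from datetime import datetime, timedelta


def time_shift(series, offset):
    """Return a copy of (timestamp, value) pairs with each timestamp moved by offset."""
    return [(ts + offset, value) for ts, value in series]


# Hypothetical samples from the older cluster, six weeks before the comparison day.
old_cluster = [
    (datetime(2016, 5, 1, 12, 0), 7.1),
    (datetime(2016, 5, 1, 12, 1), 6.8),
]

# Move the old samples forward by six weeks so they overlay the new cluster's day.
shifted = time_shift(old_cluster, timedelta(weeks=6))
for ts, value in shifted:
    print(ts.isoformat(), value)
```

Shifting by a whole number of weeks keeps the day of week and time of day aligned, which matters when load follows a weekly pattern.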

The mean query time for the Elasticsearch 2.3.3 cluster is 4ms with a P99 of 5ms, compared to a mean query time of 7ms and a P99 of 9ms for 1.7.5. The improvement in mean search time could be the result of various improvements made to the Elasticsearch 2.x query execution engine, as outlined in the "Better query execution" post by Elastic.


Filter Cache

The drastic reduction in filter cache size is a major highlight of our transition to Elasticsearch 2.3.3. In Elasticsearch 1.7.5 (in green), where the filter cache size was a configurable setting, we set the limit to 20% of the heap, which is 3.2GB in our case. We use filters heavily in our queries and, for performance reasons, had enabled caching for the frequently used filters. After switching to Elasticsearch 2.3.3 (in blue), our filter cache size dropped to a minuscule 18kB, a 99.999% reduction from the original 3.2GB.
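The figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1GB = 10^9 bytes, 1kB = 10^3 bytes); binary units give the same result to three decimal places.

```python
HEAP_BYTES = 16 * 10**9    # 16GB heap per node
CACHE_LIMIT_PERCENT = 20   # filter cache limit configured on 1.7.5

# 20% of a 16GB heap
cache_limit_bytes = HEAP_BYTES * CACHE_LIMIT_PERCENT // 100

# Observed filter cache size after the move to 2.3.3
new_cache_bytes = 18 * 10**3

reduction_percent = (1 - new_cache_bytes / cache_limit_bytes) * 100

print(f"1.7.5 cache limit: {cache_limit_bytes / 10**9:.1f}GB")
print(f"reduction: {reduction_percent:.3f}%")
```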


The chart below makes it clear that our application experienced a high rate of thrashing in the cached results. As our customer base grew and we processed more and more queries, we expected queries to slow down slightly.


Despite dedicating time and effort in the past to optimizing filter cache usage across our cluster, we could not bring down the filter cache thrashing rate. After switching to Elasticsearch 2.3.3, the cache eviction rate dropped from a mean of 55 per second with Elasticsearch 1.7.5 (in green) to a mean of 9 per second. This is a substantial improvement in query execution efficiency and has been a major driver of the performance we’re now observing in our production cluster.
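To put the eviction numbers in perspective, the sketch below derives a per-second rate from two samples of a cumulative eviction counter and computes the relative drop. It assumes evictions are exposed as an ever-increasing counter, as in Elasticsearch node stats; the sample readings are invented so that they reproduce the 55 per second mean quoted above.

```python
def eviction_rate(first, second):
    """Evictions per second between two (unix_seconds, cumulative_evictions) samples."""
    (t0, c0), (t1, c1) = first, second
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (c1 - c0) / (t1 - t0)


def percent_reduction(before, after):
    """Relative drop from before to after, in percent."""
    return (before - after) / before * 100


# Hypothetical counter readings taken 60 seconds apart on the old cluster.
old_rate = eviction_rate((0, 1_000), (60, 4_300))

print(f"old eviction rate: {old_rate:.0f}/s")
print(f"reduction from 55/s to 9/s: {percent_reduction(55, 9):.1f}%")
```

A drop from 55 to 9 evictions per second works out to a reduction of roughly 84%.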


Reflecting on our experience, the upgrade to Elasticsearch 2.3.3 was worth the engineering effort required to deal with the version incompatibilities and to run a migration on such a large index. Apart from the higher indexing latency, we saw a significant improvement in our performance metrics after upgrading to Elasticsearch 2.3.3: search became much faster, the filter cache shrank dramatically and evicted far less often, and CPU utilization is much lower and steadier than before. As we continue to scale and serve an increasing indexing and query load, we are much more confident about the performance of our Elasticsearch cluster.

