In our previous post, we talked about the differences between Elasticsearch version 1.7.5 and version 2.3. In this post we will cover how we managed to upgrade Elasticsearch without incurring any downtime to our service.
At SignalFx, we run a 54 nodes Elasticsearch cluster with more than one billion documents and about 20TB of data. This cluster handles our metadata search layer and is critical to the overall availability of the SignalFx service. Taking the cluster down for an upgrade is not an option.
There are two main challenges involved in upgrading Elasticsearch without incurring any downtime:
- Shutting down the cluster for an upgrade means that you can’t index or query for that duration. It’s unclear how long the upgrade will take (this should definitely be understood if you plan to do an in-place upgrade). There is also no Plan B should the upgrade fail and you end up in a partial upgraded state. This gets more problematic with the size of the data you have. If some aspect of the upgrade doesn’t go as planned, restoring the cluster from a snapshot could take days.
- The 1.7 Java client libraries cannot communicate with a 2.3 server. This means that application code has to use a more recent version of the library. Unfortunately, the 2.3 Java library also cannot communicate with a 1.7 cluster.
Upgrading without Downtime
Upgrading with zero downtime requires two different clusters with identical data but run different versions of Elasticsearch. Updates to one cluster need to be replicated to the second cluster, and clients should be able to direct their query load to either cluster.
When we approached this problem, we saw a lot of similarities to our zero-downtime re-sharding strategy that we talked about in the past. However, the upgrade from Elasticsearch 1.7 to 2.3 is more complicated for the following reasons:
- Cross-cluster migration. In the past, we have run index migrations in the same cluster.
- Incompatible clients. The Elasticsearch 1.x client cannot communicate with Elasticsearch servers running version 2.x (and vice versa). This incompatibility requires an upgrade of both the client and the server at the same time.
Before and during the migration, the Elasticsearch client needs to be able to communicate with the 1.x cluster. After the migration, the client needs to be switched to communicate with the new 2.x cluster.
Client Abstraction and Switching Mechanism
We use the Elasticsearch client as a library across our services. We had to take a series of key steps to make the library capable of communicating with two different clusters. This is a required step in any migration that requires a fallback to the old version in case of failures.
Abstracting the Elasticsearch Client through a Wrapper
- First, we needed to bundle both Java client libraries within the same JVM. This required using the Maven Shade plugin to avoid the package name conflict.
- We then abstracted the two Elasticsearch client implementations behind a common interface. This interface abstracted away all the APIs that we require to talk to Elasticsearch. It required a lot of boilerplate POJOs, Builders, Exceptions, etc. that mimicked the Elasticsearch API while being version agnostic. A big chunk of the work was creating filters and queries. We tried to keep the interface as close as possible to the Elasticsearch client APIs so that our engineers did not need to learn yet another API.
- We created two different implementations for those interfaces that took care of the versioning aspect. Most of the differences were simply in the package name (this was mechanic to a large degree).
Handling Two Clusters at Runtime
As part of our deployment stack, we have a discovery service (based on Apache Zookeeper) that allows a service to locate an Elasticsearch cluster based on its logical name. We also have runtime configuration options that allow us to switch Elasticsearch clients between the logical clusters.
We implemented a proxy layer that can switch dynamically between the different clients and thereby hiding the complexity of having two clusters with different versions. This allowed us to switch the query paths dynamically between the two versions without any change to the client code and without any restarts to the service.
With the client changes and configuration flags we can now switch the query loads back and forth between the two clusters. However, the data exists only in the 1.7 cluster, so switching to 2.3 will result in hitting an empty cluster. This is not good for availability, and required us to handle data migration and indexing across the different clusters.
This framework is the backbone of all the index migrations that we have performed at SignalFx.
The fundamental mechanism behind the zero downtime index migration is the concept of an “index generation”. This is what allows the migration to handle concurrent mutations to the documents while the data is being copied between indices. The index generation is a special field where we write to all our documents indexed in Elasticsearch. Think of this field as a version number that typically stays the same, but occasionally gets incremented during migrations to deal with race conditions. We keep track of the latest generation number in the “indexing state”, which is stored in ZooKeeper. This state is used to coordinate the migrations between the migration process and the indexing nodes.
The migration goes through different phases. There are no requirements that the source and target indices are co-located on the same cluster, which is exactly what we need.
Phase 1: Initialization Phase
- This phase involves standing up a new cluster and creating the target index with the new 2.x mapping configuration. We made small changes to our mapping to deal with changes in syntax (for example, the analyzers). Our mapping is very similar between the two versions.
- The index created on the new cluster has zero replicas, and the refresh interval is set to -1, which turns the indexing refreshes off and speeds up the bulk import (next phase) of the documents from the source index.
- We usually spin up new nodes to avoid additional load on the current cluster. After going through a series of Elasticsearch optimizations, we achieved low CPU/IO load on the cluster. This allowed us to run two clusters on the same hardware (with isolation was done through Docker). In some scenarios, this may not be viable due to resource constraints.
- To keep the migration state, we maintained a record with following fields in ZooKeeper for each index:
- Current index generation number: a version number of the documents that we store the Elasticsearch
- Current index name: the index name to which all indexing operations are going
- Current service name: logical cluster name of the current index
- Extra write index name: an extra index to write to (double writing during migration)
- Extra service name: logical cluster name for the extra index
Phase 2: Bulk Import Phase
We now bump up the index generation of the documents. To simplify, let’s assume that index generation has been bumped from n to n+1. By bumping up the version, we basically create a finite set of documents with generation number n (or less) that needs to be migrated to the new Elasticsearch cluster. Here’s the process:
- We do a scan on the source index and bulk import documents with generation number n or less to the target index.
- We use scrolling for this because it’s the most efficient way to retrieve a large number of documents from an index.
- Keep in mind that scrolling can disallow merged segments from being claimed back by the filesystem, since the scroll is still using them. To avoid that, we rely on a bucketing technique. The bucket number is assigned when the document is created and is a hash of the document ID modulo for the total number of buckets (64k in our case). This allows us to partition the document set into a uniformly distributed set of buckets, which we add as a parameter to scrolling—in effect, preventing scrolling over the entire index at once.
- The bucketing mechanism also allows us to make the bulk migration recoverable in case of failures. Since bulk import at our size could take days, there can occasionally be problems that call for pause, recovery, or rollback.
Phase 3: Double Publishing and Cleanup Phase
At this stage, all the documents with generation number n or less are already migrated to the new Elasticsearch cluster. Documents with generation number higher than n need to be reconciled.
We do the following:
- Update the indexing state so that the indexing now happens on two different indices (which are on two different clusters with two different versions). This now directs all indexers to start writing to the target index while still writing to the source index. We call this “double publishing”.
- We also bump up the generation number once again. The generation number is now n+2 so that we know:
- Anything that changes from this point forward has generation number n+2 and will be written to both indices.
- All documents with generation number n or less have been migrated to the target index during the bulk phase.
- Only documents with generation number n+1 need to be reconciled between the source and target indices.
- Documents with generation number n+1 are a finite set (i.e., their count will either stay the same or decrease, but will never increase).
- Now, query the source index for documents with generation number n+1 and enqueue them for writing. At this stage, we are writing to both indices and the generation number has been bumped to n+2, and therefore these documents will now be indexed with a generation number n+2 in both indices.
- At the completion of the last step, we know that the target index is completely up-to-date with the source index.
- You can create replicas and set the right refresh rate on the index.
Phase 4: Closing Phase
At this point, the data is the same in both indices. There may be a slight difference due to the asynchronous nature of indexing, but the data would eventually be exactly the same if indexing stops.
We can also switch our query load to the new cluster and switch it back if things do not go as expected.
As part of making the switch go more smoothly, we implemented a mechanism so that we can send a percentage of query traffic to the new cluster. This was to be used to progressively warm up the new cluster caches (the query results are thrown away). If you are running both clusters on the same hardware, you should not send 100% of query volume to both clusters. While we didn’t notice any drop in performance for the initial queries when we switched, we ended up not using this mechanism.
This is our monitoring dashboard when we switched the flag to send query traffic to the new cluster.
To finish the migration, the remaining steps for you to do are:
- Change the current service name to the extra service name and current index name to the extra write index, as well as bump up the generation number once again. All these operations are atomic and this stops double writing. We only do this after ensuring there are no issues with the new cluster.
- Close the source Elasticsearch 1.x index, thereby freeing up resources.
- Maintain the source index on disk for a full week, just in case.
- Decommission the old 1.7 cluster all together.
In the next post, we will go through the performance differences in our production services after switching to Elasticsearch version 2.3.3 (spoiler alert: the upgrade was totally worth it!).