
How We Upgraded Elasticsearch 1.x to 2.x with Zero Downtime: Why Upgrade to 2.x

Elasticsearch has progressed rapidly from version 1.0.0, released in February 2014, to version 2.0.0, released in October 2015, and version 5.0, freshly released just a week ago. In the two and a half years since 1.0.0, adoption has skyrocketed, and both vendors and the community have contributed bug fixes, interoperability enhancements, and rich feature upgrades. This has made Elasticsearch one of the most popular distributed search engines for unstructured documents, as well as a great log analysis tool as part of the ELK Stack.

However, despite major improvements with each release, the majority of organizations continue to use the same version as when they initially adopted Elasticsearch — in most cases, version 1.x. The reasons for not upgrading include:

    • Challenges of transitioning the Elasticsearch cluster since it’s not a rolling upgrade and requires a full restart of the cluster (with incurred downtime)
    • Non-negligible engineering cost due to the breaking changes in the 2.x client libraries, as well as the lack of support for 1.x clusters in the 2.x client library
    • Changes in indexing and querying
    • Fear of negative impact on existing deployments
    • Inability to roll back to version 1.x from 2.x

This is the first post in a three-part series, and we’ll start by covering the major differences between the two Elasticsearch versions. In future posts, we will talk about how we managed to run the upgrade with zero downtime and compare performance from running both versions in production.

Why Upgrade to 2.x?

Elasticsearch version 2.x focuses on resiliency, reliability, simplification, and features. According to Elasticsearch’s official website, the 2.0.0 release comprised 2,799 pull requests by 477 committers since version 1.0.0. The release is based on Apache Lucene 5.x and specifically improves query execution and spatial search.

Version 2.x also delivers considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance/failure or an upgrade. The bigger the cluster, the bigger the headache.

Node failures or a reboot can trigger a shard reallocation storm, and entire shards are sometimes copied over the network even when the destination node already holds a large portion of the data. Users have reported recovery times of more than a day just to restart a single node.

With 2.x, recovery of existing replica shards is almost instantaneous thanks to the newly introduced synced flushes. This allows cluster rolling restarts to finish in under an hour, whereas they took up to several days in 1.x. There is also more lenient reallocation, which avoids unnecessary reshuffling during node restarts.
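As a concrete sketch of how this helps in practice, a synced flush can be requested explicitly before a planned restart so that replicas are marked identical to their primaries and recover without copying data over the network (the host below is a placeholder):

      # request a synced flush across all indices before a planned restart
      curl -XPOST 'http://localhost:9200/_flush/synced'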

With Lucene 5.x, merge scheduling is done dynamically using an auto-regulating feedback mechanism. This eliminates past worries about manually adjusting merge throttling settings. It also allows Elasticsearch to provide more stable search performance even when the cluster is under heavy indexing load.

Elasticsearch 2.x solves many of the known issues that plagued previous versions, including:

  • Mapping conflicts (often yielding wrong results)
  • Memory pressures and frequent garbage collections
  • Risk of data loss (due to asynchronous flush)
  • Security breaches
  • Slow recovery during node maintenance or rolling cluster upgrades

Subsequent releases have also introduced major new features including:

  • Query profiling
  • Pipeline aggregations
  • Index compression
  • Re-index API

Because some of these features and improvements result in design-level changes that may affect your existing cluster, Elasticsearch provides a plugin to check whether you can upgrade directly to Elasticsearch version 2.x or whether you need to make changes to your data beforehand.
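As a sketch, the migration check runs as a site plugin on your 1.x cluster; the exact plugin version and install command may differ, so verify against the elasticsearch-migration README:

      # install the migration plugin on an Elasticsearch 1.x node
      ./bin/plugin -i elastic/elasticsearch-migration
      # then run the checks from a browser at:
      #   http://localhost:9200/_plugin/elasticsearch-migration/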


Non-Backward Compatibility

Elasticsearch 2.0.0 eliminated certain features that affect compatibility between versions and require adoption of new approaches to existing workflows:

    • Facets were removed completely in favor of aggregations (see the example after this list). This also means that Kibana 3 will not work with Elasticsearch 2.x.
    • _shutdown API was removed. It was previously used to shutdown a single node or a complete cluster.
    • Thrift/Memcached protocol support was eliminated, now requiring use of REST APIs over HTTP or Java-based APIs.
    • Rivers were removed. Despite their value for syncing data from multiple data sources, they were a main culprit of cluster instability. Instead of rivers, you can either use Logstash or write your own code to sync data into Elasticsearch.
    • MVEL scripting has been replaced by Groovy.
    • File-based mapping templates are no longer supported, in favor of index-based templates.
    • top_children query regularly returned wrong results and has been completely removed in new versions.
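To illustrate the first point, a 1.x terms facet translates almost one-to-one into a 2.x terms aggregation; the tags field below is hypothetical:

      # 1.x terms facet
      {
          "query": { "match_all": {} },
          "facets": {
              "popular_tags": { "terms": { "field": "tags" } }
          }
      }

      # 2.x terms aggregation
      {
          "query": { "match_all": {} },
          "aggs": {
              "popular_tags": { "terms": { "field": "tags" } }
          }
      }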

Despite the vast improvements across Elasticsearch with recent releases, the updates don’t come without shortcomings in terms of the ease of upgrading legacy implementations and the breaking changes involved. These challenges are growing pains relative to the performance enhancements and bug fixes that a 2.x upgrade delivers, but they still require attention during any transition. We found that the biggest considerations were mapping and query syntax, as well as dealing with the lack of compatibility between the client libraries.

Mapping Changes

Elasticsearch developers originally treated an index like a database and a type like a table. This allowed users to create multiple types inside the same index, but it eventually became a major source of headaches because of restrictions imposed by Lucene.

Fields that have the same name in multiple types of a single index are backed by a single field inside Lucene. If a field is mapped as an integer in one document type and as a string in another, queries can return incorrect results and the index can even become corrupted. Issues like this forced a refactoring of mappings and major restrictions on handling mapping conflicts.
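For a concrete (hypothetical) example, the following index creation maps status with two different types across two document types; 2.x rejects this outright, whereas 1.x would let the conflict through:

      curl -XPUT 'http://localhost:9200/logs' -d '{
          "mappings": {
              "web": { "properties": { "status": { "type": "integer" } } },
              "app": { "properties": { "status": { "type": "string" } } }
          }
      }'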

These changes at the mapping level require attention during any transition:

  • Fields must be referenced by full name.
  • Names cannot be referenced using type name prefix.
  • Field names can’t contain dots.
  • Type names can’t start with a dot (reserved for internal Elasticsearch use).
  • Type names may not be longer than 255 characters.
  • Types may no longer be deleted. If an index contains multiple types, you cannot delete any of them from the index; the only solution is to create a new index and reindex the data.
  • The index_analyzer and _analyzer parameters were removed from mapping definitions.
  • Doc values are now enabled by default.
  • A parent type can’t pre-exist; it must be included when creating the child type.
  • The ignore_conflicts option of the put mappings API has been removed. Conflicts can’t be ignored anymore.
  • Documents and mappings can’t contain metadata fields that start with an underscore. So, if you have an existing document that contains a field named _id or _type, it will not work in version 2.x; you need to reindex your documents after dropping those fields.
  • Because doc_values was not the default in version 1.x, you can’t take advantage of doc values in version 2.x without reindexing your data.

Query & Filter Changes

Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.

Queries were used to determine how relevant a document was to a particular search by calculating a score for each document. Filters were used to match exact criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch would cache the matching set in memory as a bitset to retrieve it quickly whenever the same filter was executed again.

However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters are now the same internal object, taking care of both document relevance and matching.

So, an Elasticsearch query that used to look like the following…

      {
          "filtered" : {
              "query": { query definition },
              "filter": { filter definition }
          }
      }

…should now be written like this in version 2.x:

      {
          "bool" : {
              "must": { query definition },
              "filter": { filter definition }
          }
      }

Notably, the older syntax has not been completely removed; as of version 2.3 it is deprecated but still accepted. You may keep using the older syntax for now, and Elasticsearch will automatically convert your queries into the new format and perform the optimizations itself.

Additionally, the confusion caused by choosing between a bool filter and an and / or filter has been addressed by eliminating the and / or filters in favor of the bool query syntax in the example above (see the rewrite sketch below). Rather than incurring the unnecessary caching and memory cost that often resulted from a poorly chosen filter, Elasticsearch now tracks and optimizes for frequently used filters and doesn’t cache at all for segments with fewer than 10,000 documents or less than 3% of the index.
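As a sketch of that rewrite (field names are hypothetical), a 1.x and filter becomes a 2.x bool query with filter clauses:

      # 1.x: "and" filter (used inside a filtered query)
      {
          "and": [
              { "term": { "status": 200 } },
              { "range": { "timestamp": { "gte": "now-1h" } } }
          ]
      }

      # 2.x: equivalent bool query with non-scoring filter clauses
      {
          "bool": {
              "filter": [
                  { "term": { "status": 200 } },
                  { "range": { "timestamp": { "gte": "now-1h" } } }
              ]
          }
      }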

Security, Reliability, and Networking Changes

Elasticsearch now runs with the Java Security Manager enabled by default, which restricts the permissions available to the process after startup. For security reasons, running Elasticsearch as the “root” user is now disabled by default.

Elasticsearch has applied a “durable-by-default” approach to reliability and data duplication across multiple nodes. Documents are now synced to disk before indexing requests are acknowledged, and all file renames are atomic to prevent partially written files.

On the networking side, based on extensive feedback from system administrators, Elasticsearch has removed multicasting (although it is still available as a plugin), and default zen discovery has been changed to unicast. Elasticsearch also now binds to localhost by default, preventing unconfigured nodes from joining public networks.
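A minimal sketch of the corresponding elasticsearch.yml settings on 2.x (the addresses are placeholders):

      # bind to a specific address instead of the localhost default
      network.host: 192.168.1.10
      # unicast discovery: list the nodes to contact when joining the cluster
      discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11"]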

Stats API Changes

Before version 2.0.0, Elasticsearch used the Sigar library for operating-system-dependent statistics. But Sigar is no longer maintained, and Elasticsearch now relies on stats provided by the JVM instead. Accordingly, we see various changes in the monitoring parameters of the node info and node stats APIs:

  • network.* has been removed from nodes info and nodes stats.
  • fs.*.dev and fs.*.disk* have been removed from nodes stats.
  • os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.
  • os.mem.total and os.swap.total have been removed from nodes info.

The id_cache metric, which tracked the memory used by the parent-child data structure, has also been removed from the _stats API; that memory usage is now reported as part of field data.
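To verify which fields your cluster actually reports after the upgrade, you can query the filtered node stats endpoint directly (host is a placeholder):

      curl -XGET 'http://localhost:9200/_nodes/stats/os,jvm?pretty'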

Additional Changes

An upgrade to Elasticsearch 2.x also brings changes to some of the most frequently used individual operations:

  • _optimize API is deprecated in version 2.1.0 and has been replaced by the Force Merge API. For example, an optimize request in version 1.x…
              curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'
    …should be converted to:
              curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
  • Aggregation on the boolean data type will now return binary response keys in the form of 0 and 1 instead of T and F.
  • Delete-by-query has been moved out of core and is replaced by the delete-by-query plugin for more reliability (see the install sketch after this list).
  • Applications written in Java require significant updates. We cover how we deal with this in our second post.
  • Configuration parameters for scripting in elasticsearch.yml have changed.
    Format in version 1.x:
              script.disable_dynamic: false
    Format in version 2.x:
              script.inline: true
              script.indexed: true
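For the delete-by-query change above, a sketch of installing the plugin on a 2.x node (repeat on every node, then restart it):

      sudo bin/plugin install delete-by-query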

Full Cluster Restart vs. New Cluster

Rolling upgrades are not possible while upgrading an Elasticsearch cluster from 1.x to 2.x. Hence, it requires a full cluster restart upgrade. Elastic provides a guide outlining the steps for a full cluster restart upgrade.

There are several things to consider if you opt for a full cluster restart upgrade:

  • A full cluster restart upgrade inherently requires downtime of your entire cluster. This should be planned in advance and coordinated with the services or applications that rely on the cluster being available.
  • Make sure you have a copy of your application running with the code changes according to the latest version. This should be doable if you don’t have a lot of applications or you have a layer that hides Elasticsearch. It is more difficult to manage otherwise. We cover how we deal with this in the next post.
  • There cannot be any mapping conflicts inherited from your current indices.
  • If you rely on multicast discovery, you will have to install the multicast plugin (or switch to unicast) to support the transition, since multicast was removed from core.
  • If you are using data striping to store data from an index on multiple paths, make sure each path has enough disk space. Unlike version 1.x, in 2.x the data of a single shard can only be stored on one path. You should especially worry about this if your document distribution is not uniform across shards or if you have a single shard per node.
  • All plugins now require a descriptor file, so check if your plugins are compatible with version 2.x. Many Elasticsearch 1.x plugins are not going to work in 2.x. Install the latest supported plugins on each node.

If you can afford it, we advise setting up a new cluster and reindexing your data to reflect the new mapping rules for Elasticsearch 2.x. With a new cluster, you are free to redesign your document modelling strategy, for example by creating a separate index where field names conflict or where types need to be dropped.

This may not work in all cases. That’s exactly the problem we ran into: we wanted to do the upgrade while continuing to serve queries and indexing from our production cluster.

Managing Downtime During Upgrade

As a rule of thumb, keep the following things in mind before upgrading Elasticsearch:

    • Run the migration plugin on your production cluster to find data incompatibility early in the upgrade process.
    • Test upgrades in a dev environment before upgrading your production cluster.
    • Always take a snapshot of your data before upgrading.

Since any upgrade from 1.x to 2.x carries so many changes, you must test your indices with the migration plugin. Any mapping/schema conflicts (or other issues) may require reindexing your data.
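For the snapshot step, a minimal sketch using the snapshot API (the repository name and location are placeholders, and the location must be listed under path.repo on each node):

      # register a filesystem snapshot repository
      curl -XPUT 'http://localhost:9200/_snapshot/pre_upgrade_backup' -d '{
          "type": "fs",
          "settings": { "location": "/mnt/backups/elasticsearch" }
      }'

      # snapshot all indices and wait for completion
      curl -XPUT 'http://localhost:9200/_snapshot/pre_upgrade_backup/snapshot_1?wait_for_completion=true'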

In the next blog post, we will cover how we handled the differences between the versions and how we made the transition between the versions without any downtime.

----------------------------------------------------
Thanks!
Mahdi Ben Hamida

Splunk