Elasticsearch has progressed rapidly from version 1.0.0, released in February 2014, to version 2.0.0, released in October 2015, and version 5.0, released just a week ago. During the two-and-a-half years since 1.0.0, adoption has skyrocketed, and both the vendor and the community have contributed bug fixes, interoperability enhancements, and rich feature upgrades. This has made Elasticsearch one of the most popular distributed search engines for unstructured documents, as well as a great log analysis tool as part of the ELK Stack.
However, despite major improvements with each release, the majority of organizations continue to use the same version as when they initially adopted Elasticsearch — in most cases, version 1.x. The reasons for not upgrading include:
This is the first post in a three part series, and we’ll start by covering the major differences between the two Elasticsearch versions. In future posts, we will talk about how we managed to run the upgrade with zero downtime and compare performance from running both versions in production.
Elasticsearch version 2.x focuses on resiliency, reliability, simplification, and features. According to Elasticsearch's official website, releasing 2.0.0 took 2,799 pull requests from 477 committers since version 1.0.0. The release is based on Apache Lucene 5.x and specifically improves query execution and spatial search.
Version 2.x also delivers considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance/failure or an upgrade. The bigger the cluster, the bigger the headache.
Node failures or a reboot can trigger a shard-reallocation storm, with entire shards sometimes copied over the network even though the receiving node already holds most of the data. Users have reported recovery times of more than a day just to restart a single node.
With 2.x, recovery of existing replica shards is almost instantaneous thanks to the newly introduced synced flushes. This allows cluster rolling restarts to finish in under an hour, whereas they took up to several days in 1.x. There is also more lenient reallocation, which avoids unnecessary reshuffling during node restarts.
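The restart procedure built around synced flush can be sketched as a short sequence of REST calls. The helper below only assembles the (method, path, body) tuples and never contacts a cluster; the endpoint paths follow the Elasticsearch 2.x API, and the surrounding orchestration is left as comments.

```python
import json

def rolling_restart_steps():
    """Return the REST calls that bracket a single node restart in 2.x.

    Disabling shard allocation plus a synced flush lets replica shards on
    the restarted node recover almost instantly (via their sync_id) instead
    of being re-copied over the network.
    """
    disable = {"transient": {"cluster.routing.allocation.enable": "none"}}
    enable = {"transient": {"cluster.routing.allocation.enable": "all"}}
    return [
        ("PUT", "/_cluster/settings", json.dumps(disable)),
        ("POST", "/_flush/synced", None),  # persist a sync_id per shard
        # ... restart the node and wait for it to rejoin the cluster ...
        ("PUT", "/_cluster/settings", json.dumps(enable)),
        ("GET", "/_cluster/health?wait_for_status=green", None),
    ]
```

Repeating this loop node by node is what turns a multi-day 1.x rolling restart into one that finishes in under an hour.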
With Lucene 5.x, merge scheduling is done dynamically using an auto-regulating feedback mechanism. This eliminates past worries about manually adjusting merge throttling settings. It also allows Elasticsearch to provide more stable search performance even when the cluster is under heavy indexing load.
Elasticsearch 2.x solves many of the known issues that plagued previous versions, including:
Subsequent releases have also introduced major new features including:
Because some of these features and improvements result in design-level changes that may affect your existing cluster, Elasticsearch provides a plugin to check whether you can upgrade directly to Elasticsearch version 2.x or whether you need to make changes to your data beforehand.
Elasticsearch 2.0.0 eliminated certain features that affect compatibility between versions and require adoption of new approaches to existing workflows:
Despite the vast improvements across Elasticsearch with recent releases, the updates don’t come without shortcomings, in terms of the ease of upgrading legacy implementations and breaking changes. These challenges are like growing pains in the context of the performance enhancements and bug fixes that result from a 2.x upgrade, but they still require attention during any transition. We found that the biggest considerations had to be made for mapping and query syntax as well as dealing with the lack of compatibility between the client libraries.
Elasticsearch developers originally treated an index as analogous to a database and a type as a table. This allowed users to create multiple types inside the same index, but it eventually became a major source of headaches because of restrictions imposed by Lucene.
Fields with the same name in different types of a single index map to a single field inside Lucene. If a field is an integer in one document type but a string in another, the result can be incorrect query results or even index corruption. Several other issues can force mapping refactoring and impose major restrictions on handling mapping conflicts.
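The Lucene restriction can be illustrated with a small check: given per-type mappings in the 1.x layout, flag any field whose declared type disagrees across document types in the same index. This is a hypothetical helper for illustration, not part of Elasticsearch.

```python
def conflicting_fields(mappings):
    """Find field names mapped to different types across document types.

    `mappings` mirrors a 1.x index's per-type layout:
    {type_name: {field_name: es_type}}. Because same-named fields in one
    index share a single underlying Lucene field, any disagreement found
    here is exactly what corrupts queries in 1.x and is rejected in 2.x.
    """
    seen = {}        # field name -> first type encountered
    conflicts = set()
    for fields in mappings.values():
        for field, es_type in fields.items():
            if seen.setdefault(field, es_type) != es_type:
                conflicts.add(field)
    return sorted(conflicts)

index_mappings = {
    "user":  {"id": "integer", "name": "string"},
    "order": {"id": "string",  "total": "double"},  # "id" disagrees
}
```

Running a check like this over your 1.x mappings before upgrading is a quick way to spot indices that will need re-indexing.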
These changes at the mapping level require attention during any transition:
Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.
Queries were used to find out how relevant a document was to a particular query by calculating a score for each document. Filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch, with the help of bloom filters, would cache those documents in memory to retrieve them quickly in case the same filter was executed again.
However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters are now the same internal object, taking care of both document relevance and matching.
So, an Elasticsearch query that used to look like the following…
{
  "filtered": {
    "query":  { query definition },
    "filter": { filter definition }
  }
}
…should now be written like this in version 2.x:
{
  "bool": {
    "must":   { query definition },
    "filter": { filter definition }
  }
}
Notably, the older syntax has not been completely removed; it is deprecated and remains supported through version 2.3. You may keep using it for now, and Elasticsearch will automatically convert your queries into the new format and apply the optimizations itself.
Additionally, the confusion caused by choosing between a bool filter and an and / or filter has been addressed by eliminating and / or filters in favor of the bool query syntax shown above. Rather than incurring the unnecessary caching and memory overhead that a poorly chosen filter often caused, Elasticsearch now tracks frequently used filters and optimizes for them, and it does not cache filters on segments with fewer than 10,000 documents or less than 3% of the index.
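The mechanical rewrite from the old syntax to the new one can be sketched as a small transform over the query body. This mirrors the conversion Elasticsearch 2.x performs automatically on deprecated `filtered` queries; the function itself is illustrative, not part of any client library.

```python
def filtered_to_bool(query):
    """Rewrite a 1.x `filtered` query into the 2.x `bool` equivalent.

    The scoring part moves under `must` and the non-scoring part under
    `filter`; any query without a top-level `filtered` key is returned
    unchanged.
    """
    if "filtered" not in query:
        return query
    filtered = query["filtered"]
    bool_query = {}
    if "query" in filtered:
        bool_query["must"] = filtered["query"]
    if "filter" in filtered:
        bool_query["filter"] = filtered["filter"]
    return {"bool": bool_query}

old = {"filtered": {"query": {"match": {"title": "search"}},
                    "filter": {"term": {"status": "published"}}}}
```

Applying a pass like this to stored query templates is a simple way to remove the deprecated syntax from your own codebase ahead of 2.3.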
Elasticsearch now runs with the Java Security Manager enabled by default, which restricts the permissions available to the process after startup. For security reasons, running Elasticsearch as the "root" user is also disabled by default.
Elasticsearch has applied a “durable-by-default” approach to reliability and data duplication across multiple nodes. Documents are now synced to disk before indexing requests are acknowledged, and all file renames are atomic to prevent partially written files.
On the networking side, based on extensive feedback from system administrators, Elasticsearch has removed multicasting (although it is still available as a plugin), and default zen discovery has been changed to unicast. Elasticsearch also now binds to localhost by default, preventing unconfigured nodes from joining public networks.
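A minimal elasticsearch.yml reflecting the new 2.x networking defaults might look like the fragment below; the host names are placeholders for your own master-eligible nodes.

```yaml
# Bind to localhost unless explicitly configured otherwise (the 2.x default),
# so an unconfigured node cannot join a public network.
network.host: 127.0.0.1

# Multicast has moved out of core; list master-eligible nodes explicitly
# for unicast zen discovery.
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
```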
Before version 2.0.0, Elasticsearch used the Sigar library for operating system-dependent statistics. But Sigar is no longer maintained, and Elasticsearch now relies on stats provided by the JVM instead. Accordingly, several monitoring parameters in the node info and node stats APIs have changed:
The id_cache parameter, which reported memory usage of the parent-child data structure, has also been removed from the _stats API; that information can now be fetched from the field data stats.
An upgrade to Elasticsearch 2.x also brings changes to some of the most frequently used individual operations:
Rolling upgrades are not possible when moving an Elasticsearch cluster from 1.x to 2.x; a full cluster restart upgrade is required. Elastic provides a guide outlining the steps for a full cluster restart upgrade.
There are several things to consider if you opt for a full cluster restart upgrade:
If you can afford it, we advise setting up a new cluster and re-indexing your data to reflect the new mapping rules in Elasticsearch 2.x. With a new cluster, you are free to redesign your document-modeling strategy, for example by creating a separate index where field names conflict, or by dropping types from your index altogether.
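Since 1.x has no built-in re-index API, the copy to a new cluster boils down to a scan-and-scroll loop that reads batches from the old index and bulk-writes them to the new one. The sketch below keeps the loop abstract: `search_fn` and `bulk_fn` are hypothetical callables standing in for whatever client you use, not a real library API.

```python
def reindex(search_fn, bulk_fn, src_index, dst_index, batch_size=500):
    """Copy every document from src_index to dst_index in batches.

    search_fn(index, offset, size) -> list of {"_id": ..., "_source": ...}
    bulk_fn(index, docs)           -> writes the batch into dst_index
    Any remapping (e.g. renaming conflicting fields for 2.x) would be
    applied to each document's `_source` before the bulk call.
    Returns the number of documents copied.
    """
    offset, copied = 0, 0
    while True:
        hits = search_fn(src_index, offset, batch_size)
        if not hits:
            return copied
        bulk_fn(dst_index, hits)
        copied += len(hits)
        offset += batch_size
```

Driving this loop against the old cluster while the new one fills up is the point where you apply whatever mapping fixes the migration check flagged.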
This may not work in all cases, and that is exactly the problem we ran into: we wanted to upgrade while continuing to serve queries and indexing from our production cluster.
As a rule of thumb, keep the following things in mind before upgrading Elasticsearch:
Since any upgrade from 1.x to 2.x carries so many changes, you must test your indexes with the migration test plugin. Any mapping/schema conflicts (or other issues) may require re-indexing your data.
In the next blog post, we will cover how we handled the differences between the versions and made the transition without any downtime.
----------------------------------------------------
Thanks!
Mahdi Ben Hamida