Clustering Optimizations in Splunk 6

One of the new features we introduced in Splunk 6 is Simplified Clustering Management, which lets administrators set up the cluster and monitor its health through an easy-to-use, intuitive UI. In addition to the cool new UI, we added many performance optimizations so that the cluster handles peer failures, and recovers from them, blazingly fast. In this blog post, I’m going to highlight two such performance optimizations.

1. First Searchable Copy Optimization

This optimization is all about making sure that at least one complete searchable copy exists in the cluster, so that business users can continue to use the data while the cluster master is handling peer failures.

Let’s take a look at this with an example. Assume that we have a cluster of 5 nodes with a replication factor (RF) of 3 and a search factor (SF) of 2. Let’s also assume that we lost 2 peers and, with them, all of the searchable copies for some indexes. In Splunk 5, users could not search and use the cluster until the cluster master had ensured that all of the replication policies were met. In some cases this could take a long time, and users were unnecessarily blocked until then.

In Splunk 6, the ordering of recovery has been changed. The first priority is to ensure that one searchable copy exists in the cluster, before the other replication policies are met. Hence, users are able to use the data as soon as we have reached the SF = 1 milestone. This optimization cuts the time users have to wait after peer failures by as much as 5X to 8X.
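To make the new ordering concrete, here is a minimal sketch of the idea in Python. It is only an illustration of "restore searchability first", not Splunk's actual implementation: the `Bucket` class, `recovery_priority`, and `plan_recovery` are hypothetical names invented for this example, and the real cluster master tracks far more state.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical view of one bucket after peer failures."""
    name: str
    copies: int       # surviving copies of the raw data
    searchable: int   # surviving searchable copies


def recovery_priority(bucket, rf, sf):
    """Lower number = fix sooner (illustrative ordering only)."""
    if bucket.searchable == 0:
        return 0      # nobody can search this data: restore SF = 1 first
    if bucket.searchable < sf:
        return 1      # searchable, but below the search factor
    if bucket.copies < rf:
        return 2      # searchable and at SF, but below the replication factor
    return 3          # all policies met, nothing to do


def plan_recovery(buckets, rf=3, sf=2):
    """Return the buckets that need work, most urgent first."""
    pending = [b for b in buckets if recovery_priority(b, rf, sf) < 3]
    return sorted(pending, key=lambda b: recovery_priority(b, rf, sf))


if __name__ == "__main__":
    buckets = [
        Bucket("a", copies=2, searchable=1),
        Bucket("b", copies=1, searchable=0),
        Bucket("c", copies=3, searchable=2),
        Bucket("d", copies=2, searchable=2),
    ]
    print([b.name for b in plan_recovery(buckets)])  # ['b', 'a', 'd']
```

In this toy model, bucket `b` has no searchable copy left, so it jumps to the front of the queue, while bucket `c` already meets RF = 3 and SF = 2 and is skipped.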



2. TSIDX Regeneration Optimization

With this optimization, the search files (TSIDX) are copied from other peers instead of being regenerated from raw data. This helps the cluster meet the replication policies much more quickly.

Let’s go back to the same example and assume that we lost only one peer. We still have one searchable copy left, so users will be able to use the cluster without interruption. In Splunk 5, the cluster master always regenerates the search files from raw data in order to meet the search factor.

In Splunk 6, this has been changed. Whenever possible, we copy the TSIDX files from other peers, so we can avoid the regeneration. Not only does this optimization save time, it also saves the precious CPU cycles that the regeneration work would consume. When we measured this optimization internally, we observed 10X to 12X improvements over Splunk 5.
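The "copy whenever possible, regenerate otherwise" decision can be sketched in a few lines of Python. Again, this is an assumption-laden illustration rather than Splunk code: `choose_repair_action`, the action names, and the peer names are all made up for the example.

```python
def choose_repair_action(searchable_peers, target_peer):
    """Decide how to make the bucket searchable on target_peer.

    searchable_peers: peers that still hold TSIDX files for the bucket.
    Returns a (action, source_peer) tuple; both values are hypothetical labels.
    """
    # Any other peer with a searchable copy can serve as the copy source.
    sources = [p for p in searchable_peers if p != target_peer]
    if sources:
        return ("copy_tsidx", sources[0])
    # No searchable copy survives anywhere: fall back to rebuilding from raw data.
    return ("regenerate_from_raw", None)


if __name__ == "__main__":
    print(choose_repair_action(["peer2"], "peer4"))  # ('copy_tsidx', 'peer2')
    print(choose_repair_action([], "peer4"))         # ('regenerate_from_raw', None)
```

The fallback branch matters: copying is only an option while at least one searchable copy survives, which is exactly the one-peer-lost scenario described above.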

In summary, these two optimizations will greatly help both admins and users. The other good thing is that these are all transparent optimizations that work out of the box without any tuning. And that’s awesome!

Posted by Mustafa Ahamed

Mustafa has been with Splunk for 10 years and leads Product Management for the Splunk Enterprise Platform. He's passionate about large-scale deployments and complex systems. He loves to travel and to explore new places and food!