Splunk Sizing and Performance: Doing More with More

If you’ve browsed the Splunk docs in the last several years or talked to anyone at Splunk about performance in larger Splunk deployments, you’ve probably seen the number we use for estimating how many servers one might need to achieve a given daily indexing volume: about 100GB per day per indexer. It is referenced in several places. Say I want to do 1TB/day of Splunk indexing. That’s going to be about 10 indexing servers and a couple of search heads. Our general guidance was that, given a commodity server (historically an 8-core 8GB+ machine with relatively fast disk), once you reached 100GB of indexing volume per day you should start to look at adding more indexers to service the workload. This has always been a bit of a swag, since if you read the Hardware Capacity Planning sections of the docs, you will see that a variety of dimensions beyond just the indexing volume can impact Splunk performance.

I’ve always held that Splunk indexing in a customer environment usually reaches a relatively well-understood quantity (periodically adjusting for the onboarding of new data) and that the real impact on your interaction with the entire system depends on the search workload. When incident investigations arise, how many users flock to Splunk and run ad-hoc searches? How many scheduled jobs that fire alerts are running? How many accelerations are running to power dashboards, and so on? This 100GB/day number was mostly derived from the observation that a Splunk indexer pulls dual duty, indexing data and then responding to search requests, so we needed some level of server system resources to service the indexing volume and some level to service the search workload.

The consumption of system resources on the indexing portion of this is relatively fixed. In benchmarks we’ve run where we attempt to index a large data set as quickly as possible, we usually only see 3-4 server cores maxed during the test. You can tweak some settings in Splunk, but since the indexing pipelines are largely single-threaded, throwing more cores at the indexing side doesn’t buy you very much. This is sort of by design, given that we employ a map-reduce architecture that is intended to scale horizontally across lots of commodity servers. But if you’ve ever looked at the numbers we’ve published using benchmarking tools like SplunkIt, you’ll see results in the neighborhood of 23K KBPS average and 80K Events per Second (EPS) average on indexing. So if I could sustain 23K KBPS for 24 hours, a little math (23,000 KB/s times 86,400 seconds in a day) yields about 1.8TB/day!
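The "little math" above can be sketched in a few lines of Python. This is only an illustration: the function name and the `binary_units` switch are made up for the example, and whether the benchmark's KBPS figure uses 1,000-byte or 1,024-byte kilobytes is an assumption. The 1.8TB/day figure lines up with the binary interpretation; with decimal units the same rate comes out closer to 2TB/day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_volume_tb(kb_per_second: float, binary_units: bool = True) -> float:
    """Convert a sustained indexing rate in KB/s into TB indexed per day.

    binary_units=True treats 1 TB as 1024**3 KB (an assumption about how
    the benchmark reports KBPS); False treats 1 TB as 1000**3 KB.
    """
    kb_per_day = kb_per_second * SECONDS_PER_DAY
    kb_per_tb = 1024 ** 3 if binary_units else 1000 ** 3
    return kb_per_day / kb_per_tb


if __name__ == "__main__":
    rate = 23_000  # the ~23K KBPS benchmark average quoted above
    print(f"binary units:  {daily_volume_tb(rate):.2f} TB/day")         # 1.85
    print(f"decimal units: {daily_volume_tb(rate, False):.2f} TB/day")  # 1.99
```

Either way, this is a theoretical ceiling for an indexer doing nothing but indexing, not a sizing recommendation.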

So why this 100GB/day per server number if, in theory, we could do 1.8TB/day?  Well, back to our friend the Search workload.  Even though modern operating systems are relatively good at balancing resources, if you pummel an indexer with a high sustained indexing workload, then also pummel it with a high sustained concurrent search workload, search response time will start to degrade and your Splunk experience will suffer.  So we have to find some happy medium where the indexing volume can be at some level to allow a degree of search concurrency.

There’s also the subject of search time over a given quantity of data. Splunk is really good at “needle-in-the-haystack” searches looking for rare terms in, say, 1.8TB of data. No problem. We’re also pretty good at dense reporting searches (and often use summaries or accelerations for this), but trying to run a report on a single machine (without any mapping or reducing or summaries) over 1.8TB of raw data might actually take a single search process more than 24 hours. In the real world, most searches are scoped to a point where this would not happen, but we approach the limit of what a single machine can do without having some other cores working in parallel on the same job.

Now the good news. Given some recent testing we’ve done with indexing and search concurrency, coupled with the steady march of increasing commodity server core counts, we think it is safe to raise the general guidance here from 100GB of indexing per day to 250GB of indexing per day per server. Acknowledging that the “sweet spot” for commodity servers (that is, the best price/performance for commodity server hardware) has shifted to 12-core 12GB+ machines (and probably 16-core very soon), we moved our performance testing to that larger building-block server back in Splunk 6.0.
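As a rough illustration of what the change in guidance means for indexer counts, here is a minimal Python sketch. The function name is hypothetical, 1TB is treated as 1,000GB, and the calculation looks only at daily indexing volume; it deliberately ignores search workload, search heads, and every other dimension that affects a real deployment.

```python
import math


def indexers_needed(daily_volume_gb: float, gb_per_indexer_per_day: float) -> int:
    """Estimate indexer count from daily indexing volume alone.

    A back-of-the-envelope figure: it rounds up to whole servers and
    does not account for search concurrency or retention.
    """
    if daily_volume_gb <= 0 or gb_per_indexer_per_day <= 0:
        raise ValueError("volumes must be positive")
    return math.ceil(daily_volume_gb / gb_per_indexer_per_day)


if __name__ == "__main__":
    daily_volume_gb = 1000  # the 1TB/day example, taken as 1,000GB
    for guidance_gb in (100, 250):  # old and new per-indexer guidance
        count = indexers_needed(daily_volume_gb, guidance_gb)
        print(f"{guidance_gb}GB/day per indexer -> {count} indexers")
    # 100GB/day per indexer -> 10 indexers
    # 250GB/day per indexer -> 4 indexers
```

Treat the result as a starting point; a heavy search workload can easily justify more indexers than the volume alone suggests.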

Now there are, of course, bigger machines out there, and these are fine for environments with greater user counts and higher concurrent search workloads, but we’ve usually stuck with the mantra that spreading your workload across more moderate-sized servers is generally better and perhaps more cost efficient. So now that we have more cores to work with in general, we should be able to sustain more indexing per host while still having enough headroom to service a given search workload.

Let me reiterate that this is still general guidance based upon the principle that a server will need some cores for the indexing workload and some other cores for the search workload. Again, your mileage may vary based upon the search workload. The other good news here is that over the last several releases we’ve made steady improvements to how the backing stores that drive report acceleration and data models are created, resulting in a more evenly spread summarization workload. Couple that with better handling of real-time searches at scale and other general improvements to the storage and search engine, and we end up with a more balanced experience.

Rest assured that we’re not standing still. If, in a few years, the sweet-spot commodity server is 48-core, you can believe that we’re already considering additional ways to make use of that real estate.

Patrick Ogdin