Hardware capacity planning for your Splunk deployment

This documentation applies to the following versions of Splunk: 4.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.6, 4.0.7, 4.0.8, 4.0.9, 4.0.10

Splunk is a very flexible product that can be deployed to meet almost any scale and redundancy requirement. However, that doesn't remove the need for care and planning. This article discusses high-level considerations for Splunk deployments, including sizing and availability.

After you've worked through the general layout of your Splunk search topology, the other sections in this document, along with the formal Admin guide for Splunk, explain in more detail how to implement it.

Reference Hardware

Let's consider a common, commodity hardware server as our standard:

  • Intel 64-bit chip architecture
  • Standard Linux or Windows 64-bit distribution
  • 2 CPUs, 4 cores per CPU, 2.5-3GHz per core
  • 8GB RAM
  • 4x300GB SAS hard disks at 10,000 rpm each in RAID 10
    • capable of 800 IO operations / second (IOPS)
  • standard 1Gb Ethernet NIC, optional 2nd NIC for a management network

For the purposes of this discussion this will be our single server unit. Note that the only exceptional item here is the disk array. Splunk is often constrained by disk I/O first, so always consider that first when selecting your hardware.

Performance Checklist

The first step in deciding on a reference architecture is sizing: can your Splunk deployment handle the load? For the purposes of this guide, we assume that managing forwarder connections and configurations (but not their data!) is free. Therefore we need to look at index volume and search load.

Question 1: Do you need to index more than 2GB per day?

Question 2: Do you need more than 2 concurrent users?

If the answer to both questions is 'NO', then your Splunk instance can safely share one of the above servers with other services, with the caveat that Splunk be allowed sufficient disk I/O on the shared box. If you answered 'YES' to either question, continue.

Question 3: Do you need to index more than 100GB per day?

Question 4: Do you need to have more than 4 concurrent users?

If the answer to both questions is 'NO', then a single dedicated Splunk server of our reference architecture should be able to handle your workload.

Question 5: Do you need more than 500GB of storage?

At a high level, total storage is calculated as follows:

  daily average rate x retention policy x 1/2

This simple calculation is generally safe to use. If you want to base your calculation on the specific type(s) of data that you'll be feeding into Splunk, you can use the method described in "Estimating your storage requirements" in this manual.

Including its indexes, Splunk can generally store raw data at approximately half the original size thanks to compression. Given allowances for the operating system and disk partitioning, the reference server offers about 500GB of usable space. In practical terms, that's ~6 months of fast storage at 5GB/day, or 10 days at 100GB/day.
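The storage arithmetic above can be checked with a few lines of Python. This is only a sketch of the simple calculation: the function names are made up for illustration, and the 0.5 factor is the approximate compression ratio quoted in this article, which varies with the data.

```python
def storage_needed_gb(daily_gb, retention_days, compression=0.5):
    """Total storage = daily average rate x retention period x 1/2."""
    return daily_gb * retention_days * compression


def retention_days(usable_gb, daily_gb, compression=0.5):
    """How many days of data fit in usable_gb at the given daily rate."""
    return usable_gb / (daily_gb * compression)


if __name__ == "__main__":
    print(storage_needed_gb(100, 10))  # 500.0 GB for 10 days at 100GB/day
    print(retention_days(500, 5))      # 200.0 days, roughly 6 months
    print(retention_days(500, 100))    # 10.0 days
```

For data that compresses better or worse than average, pass a different compression value instead of the default.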

If you need more storage, you can either opt for more local disks for fast access (required for frequent searching) or consider attached or network storage (acceptable for occasional searching). Low-latency connections over NFS or CIFS are acceptable for searches over long time periods where instant search returns can be compromised to lower cost per GB. Shares mounted over WAN connections and standby storage such as tape are never acceptable.

Beyond 100GB/Day

If you have requirements greater than 100GB/day or 4 concurrent users, you'll want to leverage Splunk's scale-out capabilities. That involves using distributed search to run searches in parallel across multiple indexers at once, and possibly load balancing the incoming data with auto load balanced Splunk forwarders.

Also, at this scale it is very likely that you'll have high availability or redundancy requirements, covered in greater detail below.

Question 6: Do you need more than 300GB/day of daily indexed volume?

If you do not - i.e. you are between 100GB/day and 300GB/day - you should be able to have multiple dual-purpose Splunk boxes that are searching across each other.

Image:DSNoSearchHead.png
Example of a search user searching on one Splunk instance and having their search distributed to other instances.

Additional Considerations

  • you can use a third party load balancer to assign users to different Splunk instances
  • the recommended best practice for Splunk forwarder management is to use auto load balancing
  • you can use Splunk deployment server to propagate Splunk apps and user preferences between instances
  • if you need more than 4 concurrent search users 'per server' this deployment is not appropriate
    • for example if you have 2 reference servers but need more than 8 search users

Beyond 300GB/Day

For deployments of 300GB/day or larger, consider a three-tier Splunk deployment. In this model, search is separated from indexing by creating Splunk search heads, which are instances of Splunk that only do searching. That allows for more efficient use of hardware and lets you scale search usage (mostly) independently of index volume.

Image:PerfDS.png
Example Splunk distributed topology. This example could handle up to 400GB/day and 8 concurrent search users for common use cases.
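The checklist questions so far amount to a small decision procedure. The sketch below is one possible reading of it, not an official sizing tool: the function name and the returned labels are invented for illustration, and the thresholds are the approximate ones used in this article.

```python
def choose_topology(daily_gb, concurrent_users):
    """Map daily index volume and concurrent users to a deployment style."""
    # Questions 1 and 2: both answers must be 'NO' to share a server.
    if daily_gb <= 2 and concurrent_users <= 2:
        return "shared server"
    # Questions 3 and 4: both answers must be 'NO' for one dedicated server.
    if daily_gb <= 100 and concurrent_users <= 4:
        return "single dedicated server"
    # Question 6: below 300GB/day, dual-purpose boxes search across each other.
    if daily_gb < 300:
        return "multiple dual-purpose servers"
    # 300GB/day or more: separate search heads from indexers.
    return "three-tier deployment"


if __name__ == "__main__":
    for volume, users in [(1, 1), (50, 3), (200, 6), (400, 8)]:
        print(volume, users, choose_topology(volume, users))
```

The sketch does not check the additional consideration that the dual-purpose layout stops being appropriate once you need more than 4 concurrent search users per server.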

Dividing Up Indexing and Searching

At daily volumes above 300GB/day, it makes sense to slightly modify our reference hardware to reflect the differing needs of indexers and search heads. Search heads need little disk I/O and little local storage, but they are far more CPU bound than indexers. Therefore we can change our recommendations to:

Search Head

  • Intel 64-bit chip architecture
  • Standard Linux or Windows 64-bit distribution
  • 4 CPUs, 4 cores per CPU, 2.5-3GHz per core
  • 4GB RAM
  • 2x300GB SAS hard disks at 10,000 rpm each in RAID 0
  • standard 1Gb Ethernet NIC, optional 2nd NIC for a management network

Given that a search head will be CPU bound, if fewer, more performant servers are desired, adding more and faster CPU cores is best.

Note: The guideline of 1 core per active user still applies. Don't forget to account for scheduled searches in your CPU allowance as well.

Indexer

  • Intel 64-bit chip architecture
  • Standard Linux or Windows 64-bit distribution
  • 2 CPUs, 4 cores per CPU, 2.5-3GHz per core
  • 8GB RAM
  • 8x300GB SAS hard disks at 10,000 rpm each in RAID 10
    • capable of 1200 IO operations / second (IOPS)
  • standard 1Gb Ethernet NIC, optional 2nd NIC for a management network

The indexers will be busy both writing new data and servicing the remote requests of search heads. Therefore disk I/O is the primary bottleneck.

At these daily volumes, local disk is unlikely to provide cost-effective storage for the time frames over which speedy search is desired, which suggests fast attached storage or networked storage. While there are too many types of storage to be prescriptive, here are guidelines to consider:

  • indexers do many bulk reads
  • indexers do many disk seeks

Therefore...

  • more disks (specifically, more spindles) are better
  • total throughput of the entire system is important, but...
  • disk to controller ratio should be higher, similar to a database

Ratio of indexers to search heads

Technically, there is no practical Splunk limitation on the number of search heads an indexer can support, or the number of indexers a search head can search against. However, system limitations suggest a ratio of approximately 8 to 1 for most use cases. That is only a rough guideline; for example, if you have many searchers compared to your total data volume, more search heads make sense. In general, the best use of a separate search head is to populate summary indexes. This search head will then act like an indexer to the primary search head that users log into.

Accommodating many simultaneous searches

A common question for a large deployment is: how do I account for many concurrent users? Let's take as an example a system that may have at peak times 48 concurrent searches. The short answer is that we can accommodate 48 simultaneous searches on a cluster of indexers and search heads where each machine has enough RAM to prevent swapping. Assuming that each search takes 200MB of RAM per system, that is roughly 10GB additional RAM (beyond indexing requirements). This is because CPU will degrade gracefully with more concurrent jobs but once the working set of memory for all processes exceeds the physical RAM, performance drops catastrophically with swapping.
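The memory estimate is simple multiplication, but it is worth comparing against the RAM actually installed. A minimal sketch, assuming the 200MB per search figure above and decimal gigabytes (1GB = 1000MB) to match the rounded numbers in the text; the function names and the other_use_gb parameter are hypothetical.

```python
def search_ram_gb(concurrent_searches, mb_per_search=200):
    """RAM needed on each system for the given number of concurrent searches."""
    return concurrent_searches * mb_per_search / 1000


def will_swap(physical_ram_gb, concurrent_searches, other_use_gb=0.0,
              mb_per_search=200):
    """True if searches plus other memory use exceed physical RAM."""
    needed = other_use_gb + search_ram_gb(concurrent_searches, mb_per_search)
    return needed > physical_ram_gb


if __name__ == "__main__":
    print(search_ram_gb(48))   # 9.6, i.e. roughly 10GB
    print(will_swap(8, 48))    # True: more than the 8GB reference server has
    print(will_swap(16, 48, other_use_gb=2))  # False
```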

The caveat here is that once the number of concurrent searches exceeds the number of cores that were free before the searches arrived, each search's run time grows in proportion to the ratio of searches to free cores. For example, suppose the indexers were doing nothing before the searches arrived and have 8 cores each. Suppose a single search (all searches being identical) takes 10s to complete. Then the first 8 searches will each take 10s to complete since there is no contention. However, since there are only 8 cores, if there are 48 searches running, each search will take 48/8 = 6x longer than if only 1-8 searches were running. So now, every search takes ~1 minute to complete.

This leads to the observation that the most important thing to do here is add indexers. Indexers do the bulk of the work in search (reading data off disk, decompressing it, extracting knowledge and reporting). If we want to return to the world of 10s searches, we use 6 indexers (one search head is probably still fine, though it may be appropriate to set aside a search head for summary index creation). Searches 1-8 now take 10/6 = ~1.7s each, and with 48 searches, each takes 10s.

Unfortunately, the system isn't typically idle before searches arrive. If we are indexing 150 GB/day, at peak times, we probably are using 4 of the 8 cores doing indexing. That means that the first 4 searches take 10s, and having 48 searches running takes 48/4 = 12x longer, or 2 min to complete each.

Now one might say: let me put sixteen cores per indexer rather than eight and avoid buying some machines. That makes a little bit of sense, but is not the best choice. The number of cores doesn't help searches 1-16 in this case; they still take 10s. With 48 searches, each search will take 48/16 = 3x longer, which is indeed better than 6x. However, it's usually not too much more expensive to buy two 8 core machines, which has advantages: the first few searches will now just take 5s (which is the most common case) and we now have more aggregate I/O capacity (doubling the number of cores does nothing for I/O, adding servers does).
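The reasoning in the last few paragraphs follows a simple contention model, which can be written down directly. This is only a sketch of that model, with a hypothetical function name; it assumes identical searches, one core per search on each indexer, and work that divides evenly across indexers, as the examples above do.

```python
def search_time_s(base_s, searches, free_cores, indexers=1):
    """Estimated run time of each search, in seconds.

    base_s      time for one search on one idle indexer
    searches    number of concurrent searches
    free_cores  cores per indexer not already busy (e.g. with indexing)
    indexers    number of indexers sharing the work
    """
    slowdown = max(1.0, searches / free_cores)
    return base_s * slowdown / indexers


if __name__ == "__main__":
    print(search_time_s(10, 8, 8))                # 10.0: no contention
    print(search_time_s(10, 48, 8))               # 60.0: 6x slower
    print(search_time_s(10, 48, 8, indexers=6))   # 10.0: back to 10s
    print(search_time_s(10, 48, 4))               # 120.0: half the cores index
    print(search_time_s(10, 48, 16))              # 30.0: one 16-core machine
    print(search_time_s(10, 4, 8, indexers=2))    # 5.0: two 8-core machines
```

The model ignores disk I/O, which is exactly what a second server adds and extra cores do not.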

The lesson here is to add indexers. Doing so reduces the indexing load on each system, which frees cores for search. Also, since the performance of almost all types of search scales with the number of indexers, searches will be faster, which mitigates the slowness caused by resource sharing. Additionally, by making every search faster, we will often avoid having concurrent searches at all, even with concurrent users. In realistic situations with hundreds of users, each user will run a search every few minutes, though not at the exact same time as other users. By reducing the search time by a factor of 6 (by adding more indexers), the concurrency factor will be reduced (not necessarily by 6x, but by some meaningful factor). This, in turn, lowers the concurrency-related I/O and memory contention.
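One way to see why faster searches reduce concurrency is Little's law: in a steady state, the average number of searches in flight equals the arrival rate multiplied by the average search time. The sketch below applies it with made-up numbers (300 users, one search every 3 minutes each); the function name is hypothetical and the model is an illustration, not a Splunk formula.

```python
def average_concurrency(users, minutes_between_searches, search_time_s):
    """Average searches in flight = arrival rate x search time (Little's law)."""
    seconds_between_searches = minutes_between_searches * 60
    return users * search_time_s / seconds_between_searches


if __name__ == "__main__":
    slow = average_concurrency(300, 3, 60)  # 1-minute searches
    fast = average_concurrency(300, 3, 10)  # 10-second searches
    print(round(slow, 1), round(fast, 1))
```

In practice search time itself depends on how many searches are running, so the real reduction is not exactly proportional.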

Summary of Performance Recommendations

Daily Volume            Number of Search Users   Recommended Indexers   Recommended Search Heads
< 2GB/day               < 2                      1, shared              N/A
2GB/day to 100GB/day    up to 4                  1                      N/A
200GB/day               up to 8                  2                      1
300GB/day               up to 12                 3                      1
400GB/day               up to 8                  4                      1
500GB/day               up to 16                 5                      2
1TB/day                 up to 24                 10                     2

Note that these are approximate guidelines only. You should feel free to modify based on the discussion here for your specific use case, and to contact Splunk for more guidance if needed.

High Availability and Data Redundancy

Many Splunk deployments require some form of redundancy, either to protect the data from loss or the search service from outage, and sometimes both. In general, Splunk's solution to this problem is a straightforward matter of data duplication; however, we will look at three specific deployment possibilities.

Data Duplication

The easiest method of ensuring data will not be lost is to keep two original copies, made by cloning the data coming from Splunk forwarders.

Image:DupeByForwarding.png

In this approach, the data is duplicated and available instantly, should you need to cut over to the stand-by Splunk instance. Note that while you can simply have one Splunk forward to the next Splunk (as shown here for the offsite location) to save on network usage, there is a risk that the last few events are not sent on after a hard shutdown. If that is acceptable, the topology can be even simpler.

High Availability

The goal of a high availability deployment is both data survivability and service uptime. To accommodate this kind of deployment, you need to duplicate both the data and the physical hardware providing service, not unlike other web based applications. Also, redundancy needs to be considered for all three tiers of service - splunkweb searching, splunkd indexing and forwarding.

Image:HAwALB.png

In this topology, there are two data-complete functional groups. In the picture, both groups are servicing search requests to optimize hardware costs; alternatively, the second group could be kept idle to ensure neither disruption nor degradation of search services.

Things to note about this topology

  • search heads can be load-balanced to allow for user redirection should one go down
  • as long as there is at least one indexer up in a clone group, the dataset remains intact
    • so long as the surviving indexers can handle the load; it is recommended to stop searching against a degraded group to ensure it doesn't fall behind

Performance Considerations

Splunk has three primary roles: indexer, searcher and forwarder. In many cases a single Splunk instance may perform two or all three roles at once. Each role has its own performance requirements and bottlenecks.

  • indexing, while relatively inexpensive in resources, is often disk I/O bound
  • searching can be both CPU and disk I/O bound
  • forwarding uses very few resources, and is rarely a bottleneck

As you can see, disk I/O is frequently the limiting factor in Splunk performance, and deserves extra consideration in your planning. That also makes Splunk a poor virtualization candidate unless dedicated disk access can be arranged.

CPU

  • allow 1 CPU core for every 1MB/s of indexing volume
  • allow 1 CPU core for Splunk's optimization routines for every 2MB/s of indexing volume
  • allow 1 CPU core per active searcher (be sure to account for scheduled searches)

Disk I/O

  • assume 50 IOPS per 1MB/s of indexing volume
  • allow 50 IOPS for Splunk's optimization routines
  • allow 100 IOPS per search, or an average of 200 IOPS per search user

Memory

  • allow 200-300MB for indexing
  • allow 500MB per concurrent search user
  • allow 1GB for the operating system to accommodate OS caching

Total Storage

  • allow 15% overhead for OS and disk partitioning
    • on this system there is ~500GB of usable storage
  • conservatively Splunk can, including indexes, compress original logs by ~50%
    • compression rates vary based on the data
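The ~500GB figure can be reproduced from the reference hardware. A minimal sketch, assuming that RAID 10 halves raw capacity (every disk is mirrored) and that RAID 0 uses all of it; the function name is hypothetical and the 15% overhead is the allowance given above.

```python
def usable_storage_gb(disks, disk_gb, raid_level=10, overhead=0.15):
    """Usable space after RAID and the OS/partitioning allowance."""
    if raid_level == 10:
        raw_gb = disks * disk_gb / 2  # mirrored pairs: half the raw space
    elif raid_level == 0:
        raw_gb = disks * disk_gb      # striping only: all of the raw space
    else:
        raise ValueError("only RAID 0 and RAID 10 are handled in this sketch")
    return raw_gb * (1 - overhead)


if __name__ == "__main__":
    reference = usable_storage_gb(4, 300)  # about 510GB, i.e. ~500GB
    indexer = usable_storage_gb(8, 300)    # about 1020GB
    # At ~50% compression, usable space holds about twice as much original log data.
    print(round(reference), round(indexer), round(reference / 0.5))
```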

Based on these estimates, this machine will be disk I/O bound if there are too many active users or too many searches per user. That is the most likely limitation for this hardware, possibly followed by CPU if the searches are highly computational in nature, such as many uses of stats or eval commands in a single search.
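The CPU, disk I/O and memory rules of thumb can be combined into a rough load estimate and compared with the reference server. This is a sketch under several assumptions: the names are invented for illustration, indexing is spread evenly over the day (peaks will be higher), 1GB is taken as 1024MB, indexing memory uses the upper figure of 300MB, and disk I/O uses the average of 200 IOPS per search user.

```python
SECONDS_PER_DAY = 86_400

# Capacity of the single reference server described at the top of this article.
REFERENCE_SERVER = {"cores": 8, "iops": 800, "memory_mb": 8 * 1024}


def indexing_rate_mb_s(daily_gb):
    """Average indexing rate in MB/s for a given daily volume."""
    return daily_gb * 1024 / SECONDS_PER_DAY


def estimate_load(daily_gb, search_users):
    """Apply the per-resource allowances listed above."""
    rate = indexing_rate_mb_s(daily_gb)
    return {
        # 1 core per MB/s, 1 core per 2MB/s for optimization, 1 per searcher
        "cores": rate + rate / 2 + search_users,
        # 50 IOPS per MB/s, 50 for optimization, 200 per search user
        "iops": 50 * rate + 50 + 200 * search_users,
        # 300MB indexing, 500MB per search user, 1GB for the OS
        "memory_mb": 300 + 500 * search_users + 1024,
    }


def bottlenecks(load, server=REFERENCE_SERVER):
    """Names of resources whose estimated demand exceeds the server."""
    return [name for name, need in load.items() if need > server[name]]


if __name__ == "__main__":
    for volume, users in [(50, 3), (100, 4)]:
        load = estimate_load(volume, users)
        print(volume, users, {k: round(v, 1) for k, v in load.items()},
              bottlenecks(load))
```

With these allowances, disk I/O is the first resource to run out as users are added, while CPU and memory still have headroom.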

Applied Performance

With the information above, it is possible to estimate required hardware for most Splunk use cases by considering the following:

  • the amount of daily indexed volume (disk I/O, CPU)
  • the required retention period (total storage)
  • the number of concurrent search users (disk I/O, CPU)

Note that not all search users consume the same amount of resources. While there are in-depth guides for search cost analysis available, consider these very rough guidelines.

  • dashboard-heavy users trigger many searches at once
  • dashboards also suggest many scheduled searches
  • searching for rare events across large datasets (e.g. all time) is disk I/O intensive
  • calculating summary information is CPU intensive
    • if done over long time intervals, can also be disk I/O intensive
  • alerts and scheduled searches run even if no one sees their results

What does that mean in real life?

  • executive users with many dashboards and summaries require both CPU and disk I/O
  • operations users searching over recent and small datasets require fewer resources
  • forensic and compliance users searching over long timeframes require disk I/O
  • alerting and scheduled searches over short timeframes are inexpensive; over long timeframes potentially very expensive.