Splunk administrators increasingly face the challenge of providing Splunk service to many users and applications while keeping the quality of that service in line with overall business needs. They typically run a shared Splunk infrastructure, which makes it hard to allocate Splunk resources by workload priority and to enforce the resource isolation needed to prevent noisy-neighbor problems.
In this series of three Splunk blogs, I will share how Splunk Workload Management may be used to solve these challenges. In the first installment below, I will describe how to configure the feature and use it to reserve resources for ingestion and search workloads.
Splunk Workload Management was first released in Splunk Enterprise 7.2 and later improved in Splunk Enterprise 7.3. This blog includes capabilities that are generally available to Splunk Enterprise customers.
How to Get Started with Workload Management
Workload management is a rule-based framework for allocating system resources (compute and memory) to various workloads and managing those workloads through their lifecycle. For resource allocation, workload management uses the cgroups functionality of the Linux operating system, so you need to set up a cgroup hierarchy on all servers (search heads and indexers) before using workload management for resource allocation. There are several ways to do this, depending on whether or not you are using systemd.
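On non-systemd hosts, one common approach is to define the hierarchy with libcgroup's cgconfig service. The snippet below is an illustrative sketch, assuming Splunk runs as the splunk user; the group name and ownership are assumptions to adapt to your environment. (On systemd hosts, enabling Splunk boot-start under systemd sets up the relevant unit and cgroup for you instead.)

```
# /etc/cgconfig.conf -- illustrative cgroup hierarchy for Splunk on a
# non-systemd host; group name and uid/gid here are assumptions
group splunk {
    perm {
        admin { uid = splunk; gid = splunk; }
        task  { uid = splunk; gid = splunk; }
    }
    cpu {}
    memory {}
}
```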
The overall system resources allocated to Splunk are divided into three predefined categories: Ingest, Search, and Miscellaneous. Within each category, you allocate resources and create resource pools. The Ingest and Miscellaneous categories support only a default pool, whereas the Search category can contain several pools, each drawing on the Search category's allocation. If the CPU resources allocated to a pool are not fully utilized, they are automatically shared with workloads running in other pools. Memory allocation, by contrast, is a hard limit: it is the maximum memory that workloads in a pool can use.
All key Splunk processes run in the Ingest pool, so protection against out-of-memory (OOM) conditions can be provided by setting appropriate memory limits for each pool. In the example below, even in the worst-case scenario when several searches are running, the Ingest pool gets a minimum of 30% of memory because the Search and Miscellaneous pools have a combined memory limit of 70% (65% + 5%). This provides memory protection in heavily loaded deployments. Modular and scripted inputs run in the Miscellaneous pool.
For search heads, the resource settings can be defined in the graphical user interface. Once defined, the settings are saved in the workload_pools.conf file. For indexer clusters, this file needs to be pushed out through the Cluster Master. In most scenarios, workload_pools.conf should be the same across search heads and indexers, but that is not a requirement. If you are using multiple standalone search heads or search head clusters with the same indexer cluster, you may need a different configuration file for each search head (cluster) and for the indexer cluster. More on this in the second blog!
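As a concrete sketch, a workload_pools.conf matching the memory split described above might look like the following. The pool names and exact weights here are illustrative assumptions; verify the stanza and setting names against the workload_pools.conf reference for your Splunk version.

```
# workload_pools.conf -- illustrative; pool names and weights are assumptions
[workload_category:ingest]
cpu_weight = 20
mem_weight = 30

[workload_category:search]
cpu_weight = 75
mem_weight = 65

[workload_category:misc]
cpu_weight = 5
mem_weight = 5

[workload_pool:standard_search]
category = search
cpu_weight = 30
mem_weight = 40
default_category_pool = true

[workload_pool:priority_search]
category = search
cpu_weight = 70
mem_weight = 60
```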
Note: This is just an example. The allocation will depend on your specific workload and deployment.
Once the workload pools are defined, you may create workload rules to place searches in different search pools. The rules are a logical combination of predicates such as app, user, role, and index (e.g., if index=security AND role=security_team, then place the search in the Priority pool). The rules are evaluated in order, and once a rule is matched, the corresponding action is taken. Rules listed below the matched rule are not evaluated, so the ordering of rules is very important. For example, in the scenario below, Rule #2 will never be matched because Rule #1 already matches every search against index=security. Generally, more restrictive rules should be listed first.
- Rule #1: If index=security, then place search in Standard pool
- Rule #2: If index=security AND role=security_team, then place search in Priority pool
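With the ordering corrected, the two rules above could be expressed in workload_rules.conf roughly as follows, with the more restrictive rule listed first in the evaluation order. The rule and pool names are placeholders; check the stanza and attribute names against the workload_rules.conf reference for your Splunk version.

```
# workload_rules.conf -- illustrative; rule and pool names are placeholders
[workload_rules_order]
rules = security_team_to_priority, security_to_standard

[workload_rule:security_team_to_priority]
predicate = index=security AND role=security_team
workload_pool = priority_search

[workload_rule:security_to_standard]
predicate = index=security
workload_pool = standard_search
```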
Resource Allocation for Ingestion and Search Workloads
Oftentimes you want to ensure that a heavy search workload does not cause data ingestion to lag or drop and, conversely, that search execution is not impacted by a heavy ingestion load. This can be achieved by allocating CPU and memory resources across the Ingest and Search categories. The example below illustrates the CPU utilization by ingest and search workloads during four periods of time with workload management enabled.
- Period 1: Only a single search is running in Default_Search pool. As there is no CPU contention, it gets as much CPU as required (34%) although the CPU allocation was 14%.
- Period 2: Two searches are running, one each in Default_Search and HighPriority_Search pool. Again, there is no CPU contention so both searches get as much CPU as required (33%) although the CPU allocation was much lower.
- Period 3: No search is running but a large ingest workload (blue bars) has started. As can be seen at the onset of ingest and also at the tail end when no search is running, the ingest workload gets as much CPU as required, up to 80-90%, although the allocation is 20%.
- Period 4: Multiple searches and ingestion workload are running concurrently. There is CPU resource contention and hence workload resource reservation kicks in. The searches running in HighPriority_Search and Standard_Search pools get the allocated CPU resources (~27%) while the ingestion workload gets the remaining CPU cycles.
CPU Utilization Monitoring from Splunk Monitoring Console
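Beyond the Monitoring Console dashboards, you can chart this yourself by searching the introspection data directly. The search below is a sketch that breaks per-process CPU usage out by process type; the field names (data.pct_cpu, data.process_type) and the one-minute span are assumptions to verify against the introspection schema of your Splunk version.

```
index=_introspection sourcetype=splunk_resource_usage component=PerProcess
| timechart span=1m sum(data.pct_cpu) AS total_cpu BY data.process_type
```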
Workload management enables you to assign Splunk resources to meet your business and operational requirements. You may use a set of rules to automatically place searches in different pools and also provide granular access controls to certain users to choose their own workload pools. In addition, rich monitoring capabilities allow you to track utilization and fine tune the resource allocations.
Overall, Workload Management puts you in control to allocate Splunk resources more precisely to serve your internal business requirements without over-provisioning capacity. In the next blog, I will describe how to configure workload pools in complex deployment scenarios, and how to handle use cases like internal chargeback and high-priority search execution. In the meantime, check out the demo video to get started with Splunk Workload Management.