Meltdown Patch and the Impact on Infrastructure Supporting Splunk Solutions

The depth and breadth of the Meltdown and Spectre vulnerabilities surprised many in the industry. As more information is published and patches are released, it has become clear that the impact is broad and varied. In most cases, applying the patches affects system performance, which in turn can require capacity expansion. This article covers the impact of the Meltdown patches as observed in our internal testing.

As our customers know, no two Splunk deployments are exactly the same. How the Meltdown vulnerability, and the many layers of operating system patches it requires, will affect your environment depends on both the hardware and the software applications you’re running. Because so many variables can influence performance, it’s difficult to give broad-based guidance about the best path to protecting your infrastructure.

To help guide your evaluation, Splunk conducted a series of performance tests (both in the cloud and on-premises) with characteristic Splunk workloads so you can predict the potential impact patching may have on your environment.

The Splunk testing included performance analyses at the search, index, and cluster levels, and covered different virtualization types (bare metal, EC2 virtual machines, and VMware), different hardware specifications, and different Linux distributions and kernel versions. Below is the summary test matrix; after it, a short sketch shows one way to verify a host’s patch state.

  • Virtualization: bare metal, virtual machine (EC2), VMware
  • Hardware: Dell PowerEdge C6320 servers, AWS EC2 m5.12xlarge Hardware Virtual Machine (HVM), AWS EC2 c3.8xlarge (HVM)
  • OS: CentOS 7.0, Ubuntu 16.04
  • Linux kernel: 3.10, 4.4, 4.14
  • Splunk: splunk-7.0.1 (x86_64)
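
Each host’s patch state was verified before and after every run. Below is a minimal sketch of one such check in Python (the sysfs vulnerabilities interface only exists on kernels that ship mitigation reporting, roughly 4.15 and later or distro backports; on older kernels the result may be inconclusive):

    #!/usr/bin/env python3
    """Report whether the Meltdown (KPTI) mitigation is active on this host."""
    from pathlib import Path

    SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")

    def meltdown_status() -> str:
        # Kernels with mitigation reporting expose a one-line status here,
        # e.g. "Mitigation: PTI" or "Vulnerable".
        if SYSFS.exists():
            return SYSFS.read_text().strip()
        # Older kernels: look for an explicit pti=/nopti boot parameter;
        # the absence of one is inconclusive.
        for token in Path("/proc/cmdline").read_text().split():
            if token in ("nopti", "pti=off"):
                return "KPTI explicitly disabled on the kernel command line"
        return "unknown (no sysfs report; check dmesg for 'page tables isolation')"

    if __name__ == "__main__":
        print(meltdown_status())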

Overall, the test results show that the post-patch performance impact varies with workload, hardware specification, and virtualization type. For workloads that are IO bound, we recommend increasing capacity; the test results below can help guide the degree of increase needed.

Performance baseline and workload impact

We tested the performance baseline with an m5.12xlarge instance on AWS (HVM) with the following specifications:

  • CPU: Intel Xeon Platinum 8175M @ 2.50GHz (48 vCPU)
  • Memory: 192GB
  • Disk: 650GB gp2 EBS (3,000 burst IOPS)
  • OS: Ubuntu 16.04 with kernel 4.14.12-041412
  • Splunk version: 7.0.1 on Linux, 64-bit

The performance results vary based on the workload (a small timing harness for bucketing your own workloads follows the list):

  • Measurable: up to 34% degradation. Heavily (buffered) IO-bound workloads, such as csv/file lookup operations and forwarding to an indexer without SSL encryption.
  • Modest: up to 10% degradation. Indexing; search workloads with lookup, stats, or tstats; forwarding to an indexer with SSL; and similar operations were affected less than the “Measurable” category. These operations are IO bound but less CPU intensive.
  • Small: up to 5% degradation. CPU-bound operations, such as data model acceleration and dense search without lookup or stats, showed at most a 5% performance hit.
  • Minimal: 0 to 5% degradation. Most workloads, including search head clustering, indexer clustering, bundle push, and quick search, showed negligible impact.
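
To see where one of your own workloads falls among these buckets, time the same repeatable operation before and after patching. Below is a minimal harness, assuming a local Splunk instance with the CLI at /opt/splunk/bin/splunk; the search string, lookup name, and credentials are placeholders:

    #!/usr/bin/env python3
    """Time a repeatable Splunk workload so pre- and post-patch runs compare."""
    import statistics
    import subprocess
    import time

    # Placeholder probe: a lookup-heavy search exercises the IO-bound
    # "Measurable" category; any repeatable command works here.
    CMD = [
        "/opt/splunk/bin/splunk", "search",
        "index=_internal | lookup my_lookup host OUTPUT site",  # hypothetical lookup
        "-auth", "admin:changeme",  # placeholder credentials
    ]

    def time_runs(cmd, runs=5):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(cmd, check=True, capture_output=True)
            samples.append(time.perf_counter() - start)
        return samples

    if __name__ == "__main__":
        median = statistics.median(time_runs(CMD))
        print(f"median wall-clock time: {median:.2f}s")
        # Run once before and once after patching on the same data; the
        # ratio of the two medians is the degradation for this workload.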

Performance impact on upgraded Linux kernel 4.14

Intel processors with Process-Context Identifiers (PCID) enabled suffer less performance degradation from the Meltdown patch. Since different kernel versions support PCID differently, we tested kernels 4.14 and 4.4 on identical hardware.

Workload performance differences were evident even before the patch was applied: the 4.14 kernel had better baseline performance in most IO-bound scenarios, with a 5% gain in workloads such as indexing and search, and a 12% gain in heavily IO-bound workloads such as csv/file lookup operations.
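
Whether a CPU exposes PCID, and the INVPCID instruction that newer kernels can additionally exploit, is visible in the flags line of /proc/cpuinfo. A quick check:

    #!/usr/bin/env python3
    """Check /proc/cpuinfo for the PCID/INVPCID CPU feature flags."""

    def cpu_flags() -> set:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    # All cores report the same flags; the first line suffices.
                    return set(line.split(":", 1)[1].split())
        return set()

    if __name__ == "__main__":
        flags = cpu_flags()
        for feature in ("pcid", "invpcid"):
            print(f"{feature}: {'present' if feature in flags else 'absent'}")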

Performance impact on different hardware

We also verified the impact on older hardware that doesn’t support PCID by executing all of the above tests on a c3.8xlarge instance, which uses the older Ivy Bridge microarchitecture without PCID support.

c3.8xlarge:

  • CPU: Intel Xeon E5-2680 v2 @ 2.80GHz (32 vCPU)
  • Memory: 60GB
  • Disk: 650GB gp2 EBS (3,000 burst IOPS)
  • OS: Ubuntu 16.04 with kernel 4.4.0-1047-aws
  • Virtualization type: HVM
  • Splunk version: 7.0.1 on Linux, 64-bit

m5.12xlarge:

  • CPU: Intel Xeon Platinum 8175M @ 2.50GHz (48 vCPU)
  • Memory: 192GB
  • Disk: 650GB gp2 EBS (3,000 burst IOPS)
  • OS: Ubuntu 16.04 with kernel 4.4.0-1047-aws
  • Virtualization type: HVM
  • Splunk version: 7.0.1 on Linux, 64-bit

The results reveal a larger performance degradation on the c3.8xlarge when the patch is applied under heavily IO-bound workloads such as indexing, sparse search, and kvstore lookup: in some cases more than double the impact seen on the m5.12xlarge instance.

For example, we observed up to a 10% search performance hit on the m5.12xlarge after patching, but a 20% search performance hit on the c3.8xlarge.

Performance impact on virtualization type

We ran the heavily IO-bound test cases to validate the performance impact across virtualization types, covering a physical host and an EC2 VM with the following specifications:

Bare metal physical host:

  • CPU: 2x Intel Xeon E5-2650 v4 @ 2.20GHz (48 vCPU total)
  • Memory: 64GB @ 2400MHz (2x 32GB RDIMMs; 2 available)
  • Disk: 6x 800GB SSD @ 6Gbps (RAID 0; 4.8TB total)
  • OS: CentOS 7.0 with kernel 4.14

EC2 instance (m5.12xlarge, HVM):

  • CPU: Intel Xeon Platinum 8175M @ 2.50GHz (48 vCPU)
  • Memory: 192GB
  • Disk: 650GB gp2 EBS (3,000 burst IOPS)
  • OS: CentOS 7.0 with kernel 4.14

Similar performance results were noted. Splunk Cloud observed significant (as much as 30%) performance degradation from the Meltdown patches in AWS. Most of this was observed on Paravirtual (PV) instances, and it was significantly reduced by switching to HVM instances. Page table operations under PV are virtualized and known to be slower, so it’s not surprising that the KPTI patch exacerbates the degradation. Additionally, the Splunk Cloud instances that experienced the largest performance impact were in the c3 instance family, which uses the older Ivy Bridge microarchitecture (see the PCID/INVPCID notes above); this is consistent with the test results above.
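
Identifying which of your EC2 instances still run paravirtualized is straightforward via the EC2 API. Here is a short sketch using boto3, assuming AWS credentials and a default region are already configured:

    #!/usr/bin/env python3
    """List EC2 instances by virtualization type to flag PV hosts for migration."""
    import boto3

    ec2 = boto3.client("ec2")

    def list_pv_instances():
        paginator = ec2.get_paginator("describe_instances")
        for page in paginator.paginate():
            for reservation in page["Reservations"]:
                for inst in reservation["Instances"]:
                    if inst["VirtualizationType"] == "paravirtual":
                        yield inst["InstanceId"], inst["InstanceType"]

    if __name__ == "__main__":
        for instance_id, instance_type in list_pv_instances():
            print(f"{instance_id}\t{instance_type}\tconsider migrating to an HVM type")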

Performance tests in a non-lab environment

These tests used a unique, mixed workload in a non-lab environment (with many external variables, making it closer to the real world), simulating mixed search and indexing scenarios on servers running on both bare metal and VMware. We replayed one day’s worth of workload in this environment with and without the patch to further validate the performance impact.

Search head (VMware VM):

  • CPU: Intel Xeon E5-2697 v2 @ 2.70GHz (12 cores)
  • Memory: 24GB
  • Disk: 200GB
  • OS: CentOS 7.4 with kernel 3.10.0-693
  • Splunk version: 7.0.1 on Linux, 64-bit

Indexer (bare metal server):

  • CPU: Intel Xeon E5-2690 v3 @ 2.60GHz (48 cores)
  • Memory: 64GB
  • Disk: 3TB
  • OS: CentOS 7.4 with kernel 3.10.0-693
  • Splunk version: 7.0.1 on Linux, 64-bit

The results on the bare metal indexer servers show up to a 30% performance impact after patching for a mixed load of long-running searches. However, for a sample load with fewer than 10 concurrent searches, the impact was negligible. Please note that this is very much a function of the workload and the weight of IO in the workload mix.

We weren’t able to disable the patch in the VMware layer and thus couldn’t measure its impact in isolation, but patching the OS on the search head VM yielded a 17% performance impact. In total, there was a 45% performance impact with both indexers and search heads patched (including the patch in the VMware layer).

Summary

As you can see, the performance impact of the Meltdown patches varies considerably based on the nature of the workload, the hardware and configuration, and system constraints, such as whether the workload is CPU bound or IO bound.

In general, we recommend:

  • Checking whether your hardware supports PCID before patching
  • Using HVM rather than PV instances in AWS
  • Upgrading to the latest Linux kernel, if possible

Most workloads should be largely unaffected after patching, but workloads with heavy IO should be watched carefully. We strongly encourage setting up performance monitoring before and after patching, and additional capacity may be needed to help ensure performance levels are met.
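
As a starting point for that monitoring, even a lightweight sampler of CPU and iowait time, captured over identical windows before and after patching, is informative. Here is a sketch using the third-party psutil package (on Linux, iowait is reported as its own CPU-time field):

    #!/usr/bin/env python3
    """Sample CPU utilization and iowait for pre- vs post-patch comparison."""
    import psutil  # third-party: pip install psutil

    def sample(duration_s=60, interval_s=5):
        rows = []
        for _ in range(duration_s // interval_s):
            t = psutil.cpu_times_percent(interval=interval_s)
            # iowait only exists on Linux; sustained high iowait marks the
            # workloads this article flags as most patch-sensitive.
            rows.append((t.user, t.system, getattr(t, "iowait", 0.0)))
        return rows

    if __name__ == "__main__":
        print("user%\tsystem%\tiowait%")
        for user, system, iowait in sample():
            print(f"{user:.1f}\t{system:.1f}\t{iowait:.1f}")

    # Capture a window before patching and an identical window after; a rise
    # in system or iowait time under the same workload suggests added
    # capacity may be needed.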

Posted by Krishna Tammana
