Data Integrity is back, baby!

I’m sitting in my living room near Boulder, watching the Republican presidential debate happening right down the road at the University of Colorado. Each candidate is doing their best to come across as someone with integrity who’s ready to lead our country into the future. But this far into the debate, the responses are getting pretty repetitive…

So it’s a perfect time to check out something with some real integrity: the new Data Integrity feature in Splunk 6.3, now generally available. It lets you prove that your indexed data has not been tampered with after indexing. Some historical background: we used to have two similar features, one called Block Signing and the other called Event Hashing. However, the former didn’t work with distributed search, and the latter didn’t work with index replication. In practice neither was a realistic option, because most Splunk installations are configured with distributed search, index replication, or both.

The new Data Integrity feature works with both distributed search and clustered configurations. It’s particularly important if you need to prove that your ingested Splunk data has not been tampered with after indexing (think of compliance regulations like PCI 10.5.5). You turn it on at the individual index level, and in this release it can only be enabled from the CLI by editing the indexes.conf file. Also note that the hashes used for the integrity check are calculated at index time, so if you enable the feature on an index that already has data in it, that existing data has no hashes and will fail the integrity check. It’s probably best to turn this on for a new index whose integrity you need to guarantee.

Here’s a quick walkthrough. I’ve created a simple index called pci_data in my local copy of Splunk:


Then I open indexes.conf and add the directive “enableDataIntegrityControl = true” to the stanza where the index is defined:
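Roughly, the stanza ends up looking like what this little Python helper prints. Only enableDataIntegrityControl comes from this walkthrough; the three path settings are the usual layout for a new index and are my assumption about your environment, and integrity_stanza is just a helper name I made up.

```python
def integrity_stanza(index_name: str) -> str:
    """Return an indexes.conf stanza with data integrity control turned on.

    The homePath/coldPath/thawedPath values are assumed defaults; adjust
    them to match wherever your index actually lives.
    """
    lines = [
        f"[{index_name}]",
        f"homePath = $SPLUNK_DB/{index_name}/db",
        f"coldPath = $SPLUNK_DB/{index_name}/colddb",
        f"thawedPath = $SPLUNK_DB/{index_name}/thaweddb",
        "enableDataIntegrityControl = true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(integrity_stanza("pci_data"))
```

Paste the printed stanza into the indexes.conf that defines your index (or just add the last line to the stanza you already have).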


Then I add some data to the index. If you look at the hot bucket where the data gets indexed, you’ll see an “l1Hashes” temp file get created. As new data lands in the hot bucket, that file is updated with SHA-256 hashes calculated on slices of the data (128 KB each, which is configurable):
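If you want a feel for what “a hash per slice” means, here’s a minimal Python sketch of the idea. To be clear, this is not Splunk’s code or its on-disk format; slice_hashes and SLICE_SIZE are names I invented, and I’m simply cutting a byte string into fixed-size pieces and hashing each one.

```python
import hashlib

SLICE_SIZE = 128 * 1024  # 128 KB, matching the slice size mentioned in the post


def slice_hashes(raw: bytes, slice_size: int = SLICE_SIZE) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size slice of `raw`.

    The final slice may be shorter than `slice_size`.
    """
    return [
        hashlib.sha256(raw[offset:offset + slice_size]).hexdigest()
        for offset in range(0, len(raw), slice_size)
    ]


if __name__ == "__main__":
    data = b"x" * (300 * 1024)  # 300 KB of made-up "rawdata"
    for i, digest in enumerate(slice_hashes(data)):
        print(i, digest)
```

Feeding it 300 KB of data gives three digests: two for the full 128 KB slices and one for the 44 KB remainder.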


Once the hot bucket rolls to warm, the .tmp file gets finalized, and an L2Hash file gets created containing a hash of the l1Hashes file (this is safe to do at that point because warm buckets are read-only, so their contents should not change):
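The “hash of the hashes” idea is easy to picture in a few lines of Python. Again, this is a sketch under my own assumptions: level2_hash is a hypothetical name, and joining the level-1 digests with newlines is just a convenient encoding I picked, not how Splunk lays out the file.

```python
import hashlib


def level2_hash(level1_hashes: list[str]) -> str:
    """Hash the newline-joined level-1 hex digests into a single SHA-256 digest."""
    payload = "\n".join(level1_hashes).encode("ascii")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    l1 = [hashlib.sha256(chunk).hexdigest() for chunk in (b"slice one", b"slice two")]
    print(level2_hash(l1))
```

The point of the second level is that changing, adding, or removing any single level-1 entry changes the one top-level digest, so that is the only value you need to protect carefully.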


To check whether your index has integrity, run the check-integrity command. It compares the hash data in the l1Hashes file with the L2Hash file, then checks those hashes against the rawdata slices in the index, and reports any discrepancies:
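Conceptually, that two-step comparison looks something like the Python sketch below. This is my own illustration of the logic, not what check-integrity actually runs; verify_bucket and sha256_hex are made-up names, and the newline-joined encoding of the level-1 list is an assumption.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_bucket(slices: list[bytes], stored_l1: list[str], stored_l2: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means everything matched."""
    problems = []

    # Step 1: the stored level-1 list must still match the stored level-2 hash.
    if sha256_hex("\n".join(stored_l1).encode("ascii")) != stored_l2:
        problems.append("level-1 hash list does not match level-2 hash")

    # Step 2: every slice must still hash to its stored level-1 entry.
    if len(slices) != len(stored_l1):
        problems.append(f"expected {len(stored_l1)} slices, found {len(slices)}")
    for i, (data, expected) in enumerate(zip(slices, stored_l1)):
        if sha256_hex(data) != expected:
            problems.append(f"slice {i} has been modified")

    return problems
```

If someone edits a slice, step 2 catches it; if they also rewrite the matching level-1 entry to cover their tracks, step 1 catches that instead.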


Obviously, indexes with a lot of data take a while to verify, but verification runs outside of splunkd so it doesn’t affect indexing performance. You can back up the hash files somewhere else to keep them from being tampered with, and bring them back in for the verification process (this would need to be scripted).
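Since the backup has to be scripted anyway, here’s a rough Python starting point. It assumes the hash files can be recognized by names starting with l1Hashes or L2Hash (the names used in this post; check what your buckets actually contain), and backup_hash_files plus both directory arguments are hypothetical. Keep the backup location outside the index directory.

```python
import shutil
from pathlib import Path


def backup_hash_files(index_db: Path, backup_root: Path) -> list[Path]:
    """Copy files named l1Hashes* or l2Hash* (any case) from under `index_db`
    into `backup_root`, keeping the same bucket subfolder layout."""
    copied = []
    for path in sorted(index_db.rglob("*")):
        name = path.name.lower()
        if path.is_file() and name.startswith(("l1hashes", "l2hash")):
            target = backup_root / path.relative_to(index_db)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

Restoring for a verification run would be the same copy in the other direction.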

To prove to an auditor that you can integrity-check your data, show the places in indexes.conf where you have configured the feature, and demonstrate that you can run integrity checks as needed. You could even script regular integrity checks and alert if they indicate tampering.
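A scheduled check could be as simple as the Python wrapper sketched here. The “check-integrity -index <name>” form is the CLI syntax as I understand it, so confirm it against the docs for your version; the function names and the install path in the usage line are my own placeholders. I’m deliberately not guessing what the output or exit code looks like on failure: the wrapper just hands both back so your alerting can decide.

```python
import subprocess
from pathlib import Path


def check_integrity_command(splunk_home: str, index: str) -> list[str]:
    """Build the argv for an integrity check of one index."""
    return [str(Path(splunk_home) / "bin" / "splunk"), "check-integrity", "-index", index]


def run_check(command: list[str], runner=subprocess.run) -> tuple[int, str]:
    """Run the check and return (exit code, combined output) for the caller to judge.

    `runner` is injectable so the wrapper can be exercised without Splunk installed.
    """
    result = runner(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

From cron you would call something like run_check(check_integrity_command("/opt/splunk", "pci_data")) and feed the result into whatever alerting you already have, or index the output back into Splunk and alert on it there.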

For more info, check out our official documentation. And for a whole lot more detail, have a look at the slides and the recording from Dhurva Bhagi’s presentation on this new feature at .conf 2015. Dhurva’s presentation covers how this works in clustered environments, what kind of performance hit you might take, and how much disk space you need (sneak preview: both are negligible).

Stay untampered, my friends.

James Brodsky