Pollution is Bad

Pollution is bad for the environment and bad for Splunk. When your Splunk datastore gets polluted, your search results suffer. Pollution can also be difficult, if not impossible, to clean up without re-indexing.

Pollution can happen for a number of reasons:

  • the wrong timestamp is extracted (events are dated in the past or future)
  • events are broken at the wrong place
  • incorrect metadata (host, source, sourcetype) is associated with an event

What does this mean to you? First, pollution can cause time-bound searches to return inaccurate results. For example, if you are searching over the last 24 hours and events are incorrectly dated a week ago, they will not be returned as part of the result set, and any subsequent operations (e.g. stats, timechart) on that result set will be inaccurate. Second, pollution can skew the event count. If Splunk inadvertently breaks an event into multiple parts, the reported event count will differ from the true event count. Third, if the wrong sourcetype or host is assigned to an event, searches that filter on sourcetype or host will miss that event or return events that do not belong.
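
To make the first two effects concrete, here is a small illustration in plain Python. It does not talk to Splunk at all; the event list, the field names (true_time, indexed_time) and the fixed "now" are all made up for the example. It simply applies a last-24-hours filter to a handful of events, one of which was stamped a week too early, and then counts what happens when a single multi-line event is broken on every newline.

```python
from datetime import datetime, timedelta

# Fixed reference time so the example is reproducible (assumed value).
now = datetime(2024, 1, 8, 12, 0, 0)

# Hypothetical events: true_time is when the event really happened,
# indexed_time is the timestamp that was extracted at index time.
events = [
    {"msg": "login ok", "true_time": now - timedelta(hours=2),
     "indexed_time": now - timedelta(hours=2)},
    {"msg": "login failed", "true_time": now - timedelta(hours=5),
     "indexed_time": now - timedelta(days=7)},  # misdated by a week
    {"msg": "logout", "true_time": now - timedelta(hours=20),
     "indexed_time": now - timedelta(hours=20)},
]


def in_last_24h(ts, reference):
    """Return True if ts falls inside the 24 hours before reference."""
    return reference - timedelta(hours=24) <= ts <= reference


# What the search should return versus what it returns.
expected = [e for e in events if in_last_24h(e["true_time"], now)]
returned = [e for e in events if in_last_24h(e["indexed_time"], now)]
print(len(expected), len(returned))  # 3 2

# One multi-line event that gets broken on every newline.
multiline_event = "2024-01-08 10:00:00 ERROR boom\n  at step_one\n  at step_two"
true_count = 1
broken_count = len(multiline_event.split("\n"))
print(true_count, broken_count)  # 1 3
```

The misdated "login failed" event silently drops out of the 24-hour window, so any count or chart built on the result is off by one, and the badly broken event is counted three times instead of once.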

What can you do if any of the conditions above threaten the integrity of your Splunk installation? It is possible to delete events so that they are no longer returned in search results, but even delete has its limitations: it hides events from searches without reclaiming the disk space they occupy. The alternative is to clean the index and re-index the data. This is a very heavy-handed approach and assumes you do not mind losing or reprocessing millions or billions of events, or months or years of data.

Preventing pollution is the best policy. Problems can easily go undetected in a sea of events. Catching them early, before they accumulate and become harder to address, saves you time and spares you difficult decisions about re-indexing.

Here are some simple ways to help you defeat contamination:

  1. When first setting up Splunk or adding a new data source, run through some safety checks to make sure Splunk is indexing the data sensibly. Check out the attached on-boarding checklist for some suggested sanity checks.
  2. For testing, use a staging environment, not your production Splunk installation. Get a sample of the data and see how Splunk handles it. Use Splunk Free, use your desktop, use your neighbor’s desktop, anything but the production Splunk server. If no alternative to the production server is available, at least set up a sandbox index where you can test the new data to your heart’s content. When you’re done testing, divert the data stream to the default index (or wherever you need it to go), then delete the sandbox index. Cleaning out a whole index is much easier than trying to surgically remove events from one.
  3. Remove the guesswork from timestamp extraction, line breaking, and sourcetyping by configuring them explicitly instead of leaving them to automatic detection. For your convenience, these three topics are covered separately in my previous blog posts.
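
As a rough idea of what a sanity check on a data sample could look like before the data ever reaches Splunk, here is a minimal sketch in plain Python. It is not the on-boarding checklist mentioned above and it does not use any Splunk API. The timestamp format, the rule that every event starts with a timestamp, the 30-day age limit, and the names split_events and check_sample are all assumptions chosen for the illustration; adjust them to your own data.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: every event starts with "YYYY-MM-DD HH:MM:SS".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def split_events(raw_text):
    """Group lines into events; a line starting with a timestamp opens a new event."""
    events = []
    for line in raw_text.splitlines():
        if EVENT_START.match(line) or not events:
            events.append(line)
        else:
            # Continuation line (e.g. a stack trace) belongs to the previous event.
            events[-1] += "\n" + line
    return events


def check_sample(raw_text, now, max_age=timedelta(days=30)):
    """Return the events plus a list of (event_index, problem) tuples."""
    events = split_events(raw_text)
    problems = []
    for i, event in enumerate(events):
        match = EVENT_START.match(event)
        if not match:
            problems.append((i, "no leading timestamp"))
            continue
        try:
            ts = datetime.strptime(match.group(0), TIME_FORMAT)
        except ValueError:
            problems.append((i, "unparseable timestamp"))
            continue
        if ts > now:
            problems.append((i, "timestamp in the future"))
        elif now - ts > max_age:
            problems.append((i, "timestamp older than max_age"))
    return events, problems


sample = (
    "2024-01-08 10:00:00 INFO service started\n"
    "2024-01-08 10:00:05 ERROR request failed\n"
    "  Traceback (most recent call last):\n"
    "  ValueError: bad input\n"
    "2034-01-08 10:00:09 INFO clock looks wrong\n"
)
events, problems = check_sample(sample, now=datetime(2024, 1, 8, 12, 0, 0))
print(len(events))  # 3
print(problems)     # [(2, 'timestamp in the future')]
```

If the number of events or the list of problems surprises you on a small sample, that is a cheap signal to sort out timestamp extraction and line breaking before the data goes anywhere near a production index.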

Pollution is not our friend.

Vi Ly
