How To Determine When a Host Stops Sending Logs to Splunk...Expeditiously

So I've only been at Splunk for 8 months, and in the short amount of time I've been here, one of the most common questions I've been asked is “How do I get an alert when Splunk is not receiving logs?” As a matter of fact, if I had $0.05 each time I was asked this question, I would have $0.25!

Surprisingly, with this being such an often-asked question, I haven't been able to find much documentation on how to accomplish this using the native features of Splunk. In this blog, I aim to share with you some ideas on how to answer this with Splunk using Search Processing Language (SPL).

A Bit of Background

Before getting right into the meat and potatoes of how to accomplish this, let’s take a short detour to try to explain the methodology behind the upcoming SPL. Doing this will allow you to not only understand “what” works, but the “how” and “why” behind it.  

Let me preface this section of the article by saying that with Splunk, there is definitely more than one way to accomplish this. There’s an article that talks about how to monitor inactive hosts using metadata. In fact, my initial iteration was totally different (and totally inefficient) and after discussing with some colleagues, they helped me accomplish the same outcome with a much faster search. Enter stage right...tstats!

Side note: for a quality explanation of tstats (and just accelerating access to data in Splunk), reference this amazing .conf16 presentation entitled "How to Scale: From _raw to tstats (and beyond!)."

Special Considerations

Please note that this particular functionality relies on a few components being correct in the data. Specifically:

  • Splunk must be set to an accurate time
  • The timestamp in each event maps to a time that is close to when the event is received and indexed by Splunk
  • Splunk has received data for this index, host, source or sourcetype within the time range you are searching over

The second point is the most important, because this methodology compares the timestamp in each event against a relative time window to decide whether the event arrived in time. The use case is most applicable to “real-time” deployments where Splunk receives data from a high-frequency source, such as a syslog server or logs pushed via the HTTP Event Collector.
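One way to sanity-check the second point is to compare each event's timestamp with the time Splunk indexed it. The following is a rough sketch rather than part of the method described in this post. The 15-minute window and the field names lagSeconds, avgLag and maxLag are my own choices for illustration, and because it reads raw events across every index, it is best kept to a short time range:

```spl
index=* earliest=-15m
| eval lagSeconds = _indextime - _time
| stats avg(lagSeconds) as avgLag max(lagSeconds) as maxLag by host
| sort - maxLag
```

Hosts near the top of the list have the largest gap between event time and index time. If that gap approaches the size of your alert window, the alert searches in this post are likely to flag those hosts even while they are still sending data.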

Default Indexed Fields

The default fields that Splunk indexes as part of each event include:

  1. Host
  2. Source
  3. Sourcetype
  4. Time (_time)

This is important to note because this is all of the information we need in order to determine when Splunk has not received an event after a certain time period. Since we have this information, we can:

  • Determine the timestamp of each event based on the host, source or sourcetype received by Splunk
  • Calculate a relative timestamp to use to determine if a log is outside of the receive window
  • Check to see if the timestamp of each event is within or outside the window of the relative timestamp

Once we understand these items, we can now craft a search within Splunk to detect and alert when an event has not been received.
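The relative timestamp from the second item above can be previewed on its own before it is built into a full search. This is a minimal sketch, assuming a 5-minute receive window; the field names cutoff and cutoffReadable are illustrative only:

```spl
| makeresults
| eval cutoff = relative_time(now(), "-5m"), cutoffReadable = strftime(cutoff, "%c")
| table cutoff cutoffReadable
```

Any event whose _time is lower than cutoff falls outside the receive window.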

Alert When There is No Data From a Specific Host

In the case where you want to be notified when events are no longer being received from a certain host, a search can be crafted to compare the timestamp of the latest event from the host to the relative time window.

| tstats latest(_time) as latest where index=* earliest=-24h by host
| eval recent = if(latest > relative_time(now(),"-5m"),1,0), realLatest = strftime(latest,"%c")
| where recent=0

Figure 1. Screenshot of Splunk showing host without any new events in last 5 minutes.

Let’s take a look at the SPL and break down each component to annotate what is happening as part of the search:

| tstats latest(_time) as latest where index=* earliest=-24h by host

Run a tstats search that pulls the latest event’s “_time” field from any index accessible to the user. The search looks back at most 24 hours and groups the results by host name.

| eval recent = if(latest > relative_time(now(),"-5m"),1,0), realLatest = strftime(latest,"%c")

Create a new field called “recent”. To set its value, check whether the latest event time is greater (more recent) than the current time minus 5 minutes. If it is, set recent to 1; if not, set it to 0. Also create a field called “realLatest” that converts the latest time from epoch to a human-readable format using the strftime function.

| where recent=0

Return all results where the recent flag is set to 0. (A flag of 1 means the host has sent events within the last 5 minutes.)

In doing so, Splunk will now use the timestamp in the latest log it received from the host in calculating whether or not it has sent an event within the window of when Splunk expects to receive data. This SPL statement can easily be adjusted for source and sourcetype as well.
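As an example of that adjustment, here is a sketch of the same search grouped by sourcetype. The extra minutesSilent field is my own addition for illustration and is not needed for the alert itself; it shows roughly how long each sourcetype has been quiet:

```spl
| tstats latest(_time) as latest where index=* earliest=-24h by sourcetype
| eval recent = if(latest > relative_time(now(),"-5m"),1,0), realLatest = strftime(latest,"%c")
| eval minutesSilent = round((now() - latest) / 60)
| where recent=0
```

Swapping “sourcetype” for “source” in the “by” clause should give the equivalent view per source.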

In a nutshell, this uses the tstats command (very fast) to look at all of your hosts and identify those that have not reported data within the last five minutes.

Please keep in mind that this search only works for hosts that have sent at least one event within the time range you are searching (i.e. the previous 24 hours). A host with no events in that range will not appear in the results at all. I will create another article in the future that will provide guidance on how to use a lookup file of hosts to check against, in case the hosts do not exist in Splunk and/or are outside of the search window.

Alert When There is No Data to a Specific Index

In the case where you want to be alerted if no data has been received in a specific index within a certain time period, you simply substitute “index” for “host” in the “by” clause of the above query, as shown below:

| tstats latest(_time) as latest where index=* earliest=-24h by index
| eval recent = if(latest > relative_time(now(),"-5m"),1,0), realLatest = strftime(latest,"%c")
| where recent=0

Figure 2. Screenshot of Splunk showing index without any new events in last 5 minutes.
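To watch a single index rather than all of them, the wildcard can be replaced with an index name. In this sketch, web_logs is a hypothetical index name used purely for illustration:

```spl
| tstats latest(_time) as latest where index=web_logs earliest=-24h by index
| eval recent = if(latest > relative_time(now(),"-5m"),1,0), realLatest = strftime(latest,"%c")
| where recent=0
```

As with the host search, this only returns a row if the index received at least one event in the previous 24 hours, so an index that has been silent for longer than the search window will produce no result.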

Final Thoughts

Now that you have the SPL query to identify when assets within Splunk are not sending data, you can create alerts, reports and dashboards to proactively monitor and respond when a device may be offline or have some other issue preventing it from sending data. Creating those items is beyond the scope of this article, but in the wealth of knowledge that is Splunk’s documentation site, you can find instructions on how to do so.

I hope that this post was helpful and informative!

Jonathan Torian
