(Hi all–welcome to the latest installment in the series of technical blog posts from members of the SplunkTrust, our Community MVP program. We’re very proud to have such a fantastic group of community MVPs, and are excited to see what you’ll do with what you learn from them over the coming months and years.
–rachel perkins, Sr. Director, Splunk Community)
This is part 2 of a series. Find part 1 here: http://blogs.splunk.com/2016/02/11/whats-next-next-level-splunk-sysadmin-tasks-part-1/
In this brief series of posts we are covering 3 things Splunk admins should do soon after getting data into Splunk. In Part 1 we talked a bit about making sure values in the host field are correct. This time we are going to talk about making sure the local time on servers is set correctly.
One of the great things about Splunk is that as data comes in, Splunk will look for timestamps and automagically place events chronologically. What if the local time on server generating the logs was 15 minutes slow or fast, though? Remember that if you set the timerange selector in Splunk to look for events from the last 60 minutes, you aren’t getting back events that were generated in the last 60 minutes so much as events that Splunk has understood to have been generated in the last 60 minutes based on the timestamp in the events and other time-based Splunk settings. This is a nuanced but impactful difference. If logs are generated in UTC but Splunk doesn’t know that, when you search for events from the previous 60 minutes the events you see could be hours old depending on where in the world you actually sit. This makes it tough to identify when the server/device generating the events itself is off, and gets into a larger discussion of looking for time issues across all your data (which we will get into in the next article). The case where the server’s clock is off is nuanced enough to warrant its own discussion and should be resolved before we talk about larger time issues.
So how do we detect this clock skew in your data? Well instead of waiting for events to be generated (ie you logging into a server while looking at a clock) we are going to have the system generate one for us. Generally speaking it doesn’t much matter when the event is generated as much as it IS generated. Why not just get the average delta by host? Well what happens if the local clock is off and the server is generating data set to a couple different time zones? This way we can focus in on one event to minimize other noise.
Since one of the options for the WMIC OS call used in the last article was to show the local date and time (LocalDateTime) let’s just reuse that. As a refresher here is the info.
@echo off wmic /node:"%COMPUTERNAME%" os get bootdevice, caption, csname, description, installdate, lastbootuptime, localdatetime, organization, registereduser, serialnumber, servicepackmajorversion, status, systemdrive, version /format:list inputs.conf [script://.\bin\wmic_os.bat] disabled = 0
## Run once per day interval = 86400 sourcetype = Windows:OS source = wmic_os
In the spirit of reuse we might as well go ahead and use the Unix:Version sourcetype that comes in the Linux TA. Because of the format of Linux logs, we will be able to use the default time field in Splunk ‘_time’. Just make sure to enable the script in the inputs.
Bringing the data together
Like last time, we will want to bring the data together in one query to look for issues across both OSes. Some work will need to be done in extracting the time values from the Windows events, and in both cases to compare the event generation time to when the data was brought into Splunk. We will get more into this in the next article but this particular step means exposing the otherwise hidden _indextime field. This involves a simple eval statement like | eval index_time = _indextime. There will always be some delay between when the data was generated and when it comes into Splunk, but generally speaking that should just be a few seconds or so. Because of this delay you will want to ignore what will hopefully(!) be the large majority of these logs, though. The OSU team chooses to look at logs where this time differential is over 5 minutes. The query we use is:
sourcetype=windows:os OR sourcetype=unix:version | eval index_time = _indextime | eval type = if(isnotnull(LocalDateTime), "Windows", "Linux") | eval local_date_time = if(isnotnull(LocalDateTime), strptime(LocalDateTime, "%Y%m%d%H%M%S.%6N"), _time) | eval delta_time = local_date_time - index_time | eval abs_delta_time = abs(delta_time) | where abs_delta_time > 300 | eval sec = floor(abs_delta_time%60) | eval min = floor((abs_delta_time%3600) / 60) | eval hrs = floor((abs_delta_time - abs_delta_time%3600) / 3600) | eval skew = tostring(hrs) + "h " + tostring(min) + "m " + tostring(sec) + "s" | eval direction = case(delta_time<0, "Behind", delta_time>0, "Ahead", 1=1, "???") | eval time_str = strftime(_time, "%T %F") | eval local_date_time = strftime(local_date_time, "%T %F") | table index host, type, skew, direction, local_date_time, time_str | sort index host | rename host AS "Host", type AS "Server Type", skew AS "Time Skew", direction AS "Time Skew Direction", local_date_time AS "Host System Time", time_str AS "Splunk Timestamp"
If you have data coming in from systems in multiple timezones you will need to account for those instances. However, the query above will hopefully give you a baseline to work from; adjust as needed. The most common causes I’ve seen for this issue is the time zone being set incorrectly or NTP not being enabled on the local server.
Special thanks to Alex Moorhead, a student worker in the OSU Splunk shop, who came up with the query above for us.