TIPS & TRICKS

Statistics and Windows Perfmon

Sometimes things that you expect to be trivial turn out to be less so, and you learn about the pitfalls only by falling into them. One such thing is Windows Perfmon. To save valuable license usage, the Splunk Perfmon implementation squashes zero values. In other words, a zero value is not logged. This is normally not a big deal. After all, if you are recording a time chart of the % Processor Time, you might do something like this:

index=perfmon counter="% Processor Time" instance="_Total" | timechart avg(Value) by host

When you turn this into a chart, you can specify that null values be rendered as zero and you have a nice chart.

But is it correct?

Let’s say you are monitoring your perfmon counters every minute. At each one-minute interval, the splunk-perfmon.exe process wakes up and polls for the counter value. If the value is not zero, it emits an event with the value. As an example, suppose your counter is the number of current connections to the IIS process, sampled every 60 seconds. When we run our timechart, these values are placed into time-span buckets, say one bucket every 5 minutes. If the bucket is full (all five samples were non-zero), then the value reported by avg(Value) is correct. Similarly, if the bucket is empty, then the value is null, which is handled properly by the nullValueMode on the chart. But what if the bucket is partially full?

In our example, let’s say we get the following samples: {0,3,2,0,0} for our five one-minute intervals. The average for this set is 1 (a total of 5 connections divided by the 5 samples). But zero values are squashed (i.e. not emitted), so what the timechart sees in the bucket is {3,2}, for an average of 2.5. That is two and a half times the true value. Worse, we don’t know why a sample is missing: it could be because the value was zero, but it could also be because the server was down. When doing statistical analysis, zero is relevant, so we need to fix that.
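The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch of the averaging problem, not how Splunk computes avg(Value) internally; the helper name bucket_average is made up for this example.

```python
from statistics import mean

def bucket_average(samples, squash_zeros):
    """Average one timechart bucket, optionally dropping zero samples.

    squash_zeros=True mimics the default behaviour described above,
    where zero values are never emitted and so never reach the bucket.
    (Hypothetical helper, for illustration only.)
    """
    values = [v for v in samples if v != 0] if squash_zeros else list(samples)
    if not values:
        return None  # empty bucket: the chart would see a null value
    return mean(values)

# Five one-minute samples from the example in the text
samples = [0, 3, 2, 0, 0]

true_avg = bucket_average(samples, squash_zeros=False)  # all five samples
seen_avg = bucket_average(samples, squash_zeros=True)   # only {3, 2} survive

print("average with zeros kept:    ", true_avg)
print("average with zeros squashed:", seen_avg)
```

Note that a bucket containing only zeros comes back as None here, which corresponds to the empty-bucket case that the chart's null handling already covers.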

Fortunately, there is a good way to correct this. On your Splunk Universal Forwarder, edit the file %SPLUNK_HOME%\bin\scripts\splunk-perfmon.path. This is a text file that contains the following:

$SPLUNK_HOME\bin\splunk-perfmon.exe

In other words, the path file simply runs the normal executable. However, splunk-perfmon.exe can also take arguments, and one of them is of interest to us. Change the file to:

$SPLUNK_HOME\bin\splunk-perfmon.exe -showzero

The -showzero argument tells splunk-perfmon.exe to emit zero values. Now you can do statistical analysis on your perfmon data. This includes not only averages, but also statistics like the 95th percentile.
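To see why percentiles are affected as well, here is a small Python sketch. The sample values are invented for illustration, and the nearest-rank percentile used here is an assumption chosen for simplicity; Splunk's own percentile functions may use a different method and give slightly different numbers.

```python
def nearest_rank_percentile(values, p):
    """Return the p-th percentile (p is an integer from 1 to 100) by nearest rank.

    Illustrative helper only; not necessarily the method Splunk uses.
    """
    ordered = sorted(values)
    # ceil(p * n / 100) using integer arithmetic to avoid float rounding
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical data: 20 one-minute samples, 15 of which are zero
all_samples = [0] * 15 + [1, 2, 3, 4, 10]

# What reaches the index when zero values are squashed
squashed = [v for v in all_samples if v != 0]

p95_all = nearest_rank_percentile(all_samples, 95)
p95_squashed = nearest_rank_percentile(squashed, 95)
p50_all = nearest_rank_percentile(all_samples, 50)
p50_squashed = nearest_rank_percentile(squashed, 50)

print("95th percentile, zeros kept:    ", p95_all)
print("95th percentile, zeros squashed:", p95_squashed)
print("median, zeros kept:             ", p50_all)
print("median, zeros squashed:         ", p50_squashed)
```

With the zeros missing, both the median and the 95th percentile shift upward, because the remaining samples are ranked only against other non-zero samples.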

There are a couple of obvious caveats here:
1. This is a system-wide change. All the perfmon data from all apps will now record zero values.
2. This will increase your license usage. How much? That all depends on how many zero values you are getting.

Ultimately, the decision rests with you: do you do statistics on your perfmon data? If so, you need to make this change. If your needs are a little less statistical (maybe correlation with the Windows Event Logs), then you probably don’t need this change.

Posted by Splunk