It has been an interesting exercise. We were able to get access to Cisco’s product labs where I could (remotely) access some of their high-end hardware, and I was able to test the SNMP collector against the Nexus series 3000, 5000, and 7000 switches.
Lots of devices use SNMP, and the MIBs which I’ll discuss below are relatively universal among switch/router devices, so consider this a practical example of working with SNMP data in Splunk, and some lessons we learned along the way.
In Part 1, I walked through what I had to do to get the data about the switches into Splunk using the SNMP Modular Input. In Part 2 (which is this post) I’ll explain just a few of the things you can do with that data once it’s in Splunk!
Step 4. Writing SPL Queries
Now that we have SNMP data being pulled into Splunk, we can start analyzing and reporting on it. In my testing I was querying 3 switches for just 3 tables of information, and only querying every 5 minutes, since I was doing all this from my dev laptop. This came out to about 100,000 events an hour, which is obviously just a drop in the bucket compared to what you would get if you were gathering all the information available via SNMP.
The simple stuff like device status, sysName, software versions and so on is easy to query, but the tricky part of working with SNMP data in Splunk (and the motivation for writing a second part to this blog post) turned out to be dealing with tables.
A lot of SNMP data is table-based: when you have a switch with 150 ethernet “interfaces,” you have the same information for each one, so you can either write a query to pull a single piece of information about a single ethernet port, or you can do a bulk get for the whole table and get all the information for all the entries at once.
The SNMP Modular Input gives you two primary choices to deal with tables: putting each field=value pair as a separate event in Splunk (which results in them being parsed automatically), or writing all the results as a single string (which means you need to write regular expressions to parse the data). Additionally (partly as a result of my frustration with these two options), it now has custom response handlers: the idea is that you can provide a custom response handler in Python to convert the data to tabular CSV format (or whatever makes sense to you) and get it into Splunk in customized ways. This last option was added after I’d finished my work, so I haven’t written such a handler. Instead, I’ve written queries to turn the split bulk data into tabular data…
I’ve posted this to GitHub, but the basics of the queries are worth explaining. Most of them are in the macros.conf and the savedsearches.conf in the GitHub project, although there are some in the rejected.xml view which I didn’t think I would use, but kept around just in case.
First, some transforms
When dealing with SNMP table data, the output from the SNMP modular input in Splunk looks like this:
MIBNAME::PropertyName."123456789" = "value"
When working with tables, the default SNMP Modular Input entries include the unique identifier number in quotes. In the case of the ifTable, ifXTable, dot3StatsTable and so on, the unique identifier refers to a specific “interface” (that is, it identifies a particular ethernet jack), and when you do a single snmp bulk get, there are a dozen or more properties returned for each interface, and dozens or hundreds of interfaces on each router.
Thus, in the transforms file I have this transform, a regular expression that defines fields for the MIB, the unique identifier, the property name, and the value, as well as ensuring that the Property=Value field is being generated.
[snmp_mib-uid]
REGEX = ([^:]+)::([^\.]+)\.("?)([^"]*)\3 = \"([^\"]*)\"(?= |\n|$)
FORMAT = MIB::$1 UID::$4 Name::$2 $2::$5 Value::$5
Given that transform, every event that’s coming from these tables will have not only the usual _host and _time, but also the UID to identify which interface jack it’s referring to. For my own sake, I want all of the events that are returned from a single query to be grouped into single events per UID: that is, for the sake of displaying the data, I want one event with a bunch of fields for each ifEntry in the ifTable.
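To make the extraction concrete, here is a minimal Python sketch of what the transform's pattern is meant to pull out of one line of table output. It assumes the MIB and property name are separated by `::`, as in the sample line above; the function name `parse_snmp_line` is hypothetical, and Splunk itself applies the regex through transforms.conf rather than through code like this.

```python
import re

# Same pattern as the transform, written as a Python raw string.
# Groups: 1 = MIB, 2 = property name, 3 = optional quote, 4 = UID, 5 = value.
SNMP_LINE = re.compile(r'([^:]+)::([^\.]+)\.("?)([^"]*)\3 = "([^"]*)"(?= |\n|$)')

def parse_snmp_line(line):
    """Return the fields the FORMAT line would create, or None if no match."""
    m = SNMP_LINE.search(line)
    if m is None:
        return None
    mib, name, _quote, uid, value = m.groups()
    # Mirrors FORMAT = MIB::$1 UID::$4 Name::$2 $2::$5 Value::$5
    return {"MIB": mib, "UID": uid, "Name": name, name: value, "Value": value}

fields = parse_snmp_line('MIBNAME::PropertyName."123456789" = "value"')
print(fields)
```

The back-reference `\3` is what lets the same pattern accept a UID with or without surrounding quotes.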
Then, some lookups
In order to be able to show the sysName associated with each device I’ve queried, I create a lookup table which I update using a scheduled saved search:
sourcetype=nexus_snmp_info | bucket _time span=60s |stats first(*) as * by host |table host, sysName, sysLocation | outputlookup snmpInfo.csv
The idea is to bucket by 60 seconds just to make sure that if any of the results have a slightly different timestamp they’re all normalized to the same time. Remember this data gets queried once an hour, but sometimes the SNMP modular input writes them with a timestamp that varies by a second or two, and it can ruin the use of the stats command. The “stats first(*) as * by host” serves to group all of the events with the same host into a single event with fields for each of the fields found in the original events. It returns the first value for each field name, but we’re only concerned with a single result in any case. In other words, it turns something like this:
_host=192.168.0.71 SNMPv2-MIB::sysName="7k1"
_host=192.168.0.71 SNMPv2-MIB::sysLocation="San Jose, CA"
_host=192.168.0.71 SNMPv2-MIB::sysContact="The IT guy!"
into something like this:
_host=192.168.0.71 MIB=SNMPv2-MIB sysName=7k1 sysLocation="San Jose, CA" sysContact="The IT guy!"
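If it helps to see the grouping outside of SPL, the following Python sketch imitates what “stats first(*) as * by host” does to those three events. It is only an illustration of the idea, not how Splunk implements stats; the helper name `first_by` is made up for this example.

```python
def first_by(events, key):
    """Merge events that share `key`, keeping the first value seen per field."""
    grouped = {}
    for event in events:
        row = grouped.setdefault(event[key], {})
        for field, value in event.items():
            # setdefault keeps an existing value, so the first one wins.
            row.setdefault(field, value)
    return list(grouped.values())

# The three single-field events from the example above.
events = [
    {"host": "192.168.0.71", "sysName": "7k1"},
    {"host": "192.168.0.71", "sysLocation": "San Jose, CA"},
    {"host": "192.168.0.71", "sysContact": "The IT guy!"},
]

merged = first_by(events, "host")
print(merged)
```

Each host ends up as one row carrying every field that appeared in any of its events, which is the shape the lookup table needs.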
Finally, some search!
When we’re ready to look at actual performance data, we’ll want to collect that data into a single event per interface, so the “by” clause of the stats command changes, and we’ll use search terms to narrow down the events we process to only the fields we’re actually interested in:
sourcetype=nexus_snmp if*Octets OR if*Speed OR ifDescr |bucket _time span=60s | stats first(*) as * by _time, host, UID
That gives us events that make sense when you look at them in tables, but to actually make performance calculations, we will need to calculate deltas. For instance, here’s Cisco’s documentation of how to calculate Bandwidth Utilization Using SNMP, which shows that you need to calculate the delta of inOctets (and outOctets) and divide by the time delta. Calculating a delta in Splunk can be confusing if you’ve never done it before, but there are several simple approaches. My favorite is the streamstats command, which allows you to calculate the delta between two measurements of the same value using the range function.
Given our earlier search, we’ll tack on the streamstats:
sourcetype=nexus_snmp if*Octets OR if*Speed OR ifDescr |bucket _time span=60s | stats first(*) as * by _time, host, UID |streamstats current=true window=0 global=false count(UID) as dc,range(if*Octets) as delta*, range(_time) as seconds by host, UID
You’ll notice this time we’re doing something a little more sophisticated with our wildcards: the range function is being applied to every field which matches “if*Octets” … and a matching output field will be created with the name “delta*” where the * is replaced with the portion of the original field name that matched the * there. In other words: the total range (delta) in value of ifInOctets becomes the new field deltaIn and ifOutOctets becomes deltaOut, and if they’re available, ifHcInOctets becomes deltaHcIn, and ifHcOutOctets becomes deltaHcOut.
Of course, for these numbers to mean anything, they have to be divided by the calculated field “seconds” which is the time range: the amount of time which has passed. With the streamstats “window” set to zero, this will calculate an ever increasing delta (from the first event to the current event), so you’ll want to limit the search to a timespan that only returns a few results, or else you’ll be showing the average over the whole time period.
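The effect of the window setting is easier to see with numbers. This Python sketch mimics a running range (max minus min) over a counter and its timestamps; the sample values are invented, and `running_range` is a hypothetical helper, not a Splunk function. A window of 0 stands for “every event so far”, while a window of 2 covers only the current and previous sample.

```python
def running_range(values, window=0):
    """Range (max - min) over the current value and the values before it.

    window=0 means everything seen so far; window=N means the last N values,
    with the current one included.
    """
    out = []
    for i in range(len(values)):
        start = 0 if window == 0 else max(0, i + 1 - window)
        seen = values[start:i + 1]
        out.append(max(seen) - min(seen))
    return out

# Hypothetical samples taken every 300 seconds from one interface.
times = [0, 300, 600, 900]
octets = [1000, 4000, 10000, 11500]

print(running_range(octets), running_range(times))        # ever-growing deltas
print(running_range(octets, 2), running_range(times, 2))  # per-interval deltas
```

With window 0 both the octet delta and the seconds keep growing from the first event, so dividing one by the other gives an average since the start of the search range rather than a current rate.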
The other option is to change the window on streamstats to 2, so that each delta covers only the current and previous samples and you get a rolling rate instead:
sourcetype=nexus_snmp if*InOctets OR if*OutOctets OR if*Speed |bucket _time span=60s | stats first(*) as * by _time,host,UID |streamstats current=true window=2 global=false count(UID) as dc,range(if*Octets) as delta*, range(_time) as seconds by host, UID |where dc=2 | eval bytesIn = coalesce(deltaHcIn, deltaIn)*8 |eval bytesOut = coalesce(deltaHcOut, deltaOut)*-8 |eval "bytesInPSec"=floor(bytesIn/seconds) |eval "bytesOutPSec"=floor(bytesOut/seconds) | bucket _time span=10m |chart sum(bytesInPSec) as in, sum(bytesOutPSec) as out over _time by sysName
Note that in this case I prefer the “hc” (high capacity) values over the non-hc fields if they’re present (on modern, faster routers), because the standard counters are only 32 bits wide and can hit their maximum and wrap around on fast links. I therefore use eval with coalesce to pick the first one that’s present, then multiply by 8 to convert octets to bits (an octet is one 8-bit byte, so despite their names, bytesIn and bytesOut hold bit counts), and divide by the time, as stated earlier.
Because we’re using a -8 for bytesOut, and +8 for bytesIn, when we chart these, the output shows up as negative on a chart and input shows up as positive, allowing us to have paired charts showing both input and output clearly.
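As a rough cross-check of the eval steps, here is the same arithmetic in Python for a single event. The field names follow the query above; the numbers are invented, the `coalesce` and `rates` helpers are hypothetical, and the sketch assumes at least one of the two counters is present in each direction.

```python
import math

def coalesce(*values):
    """Return the first value that is not None, like SPL's coalesce()."""
    for value in values:
        if value is not None:
            return value
    return None

def rates(event):
    """Per-second input (positive) and output (negative) rates in bits."""
    bits_in = coalesce(event.get("deltaHcIn"), event.get("deltaIn")) * 8
    bits_out = coalesce(event.get("deltaHcOut"), event.get("deltaOut")) * -8
    seconds = event["seconds"]
    return math.floor(bits_in / seconds), math.floor(bits_out / seconds)

# Hypothetical event: high-capacity counter available inbound only.
event = {"deltaHcIn": 3_000_000, "deltaIn": 3_000_000,
         "deltaOut": 1_500_000, "seconds": 300}

print(rates(event))
```

The negative multiplier on the outbound side is only a charting convenience: it mirrors output below the axis so both directions fit on one chart.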
Additionally in the github project I have example queries showing summary configuration information (sysName, # interfaces, location, uptime, software version, etc), and counts of ports by speed, errors and dropped packet counts, etc.
Some Final Notes
If you need to monitor hundreds or even thousands of switches, you should know that although I did all of this on one box for my demo (and queried the status of the switches only every 5 minutes), you can scale this out by deploying the SNMP collector to multiple Splunk forwarders and distributing the work of collecting the data.
I mentioned the problems I had dealing with the data format and the changes that have been made to the SNMP modular input to accommodate future users. I also had some issues with connectivity where one of the switches was not configured properly for my SNMP queries, and found that the error handling in the SNMP collector didn’t tell me which of my input stanzas was the one timing out. That’s also been fixed, and the current release shows the input stanza name and “destination” string whenever it has errors.
I also included in the github project a report which shows all the errors since the last time the modular input was restarted — that is: since the last time we logged the message about snmp.py being a new scheduled exec process:
index=_internal snmp.py source="*\\var\\log\\splunk\\splunkd.log" |eval restart=if(match(message,"New scheduled exec process:.*snmp\.py.*"),1,0) |head (restart=0)
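The trick in that search is that Splunk returns events newest first, so “head” with a condition keeps events only until it reaches the restart message. A small Python sketch of the same idea follows, using itertools.takewhile. The log messages are invented for illustration, and `since_last_restart` is a hypothetical name.

```python
import re
from itertools import takewhile

# Same pattern as the match() call in the search.
RESTART = re.compile(r"New scheduled exec process:.*snmp\.py.*")

def since_last_restart(messages):
    """Keep messages until the first restart marker (input is newest first)."""
    return list(takewhile(lambda m: not RESTART.search(m), messages))

# Hypothetical splunkd.log messages, newest first.
messages = [
    "ERROR ExecProcessor - snmp.py: timeout for stanza snmp://switch2",
    "ERROR ExecProcessor - snmp.py: timeout for stanza snmp://switch1",
    "INFO ExecProcessor - New scheduled exec process: python snmp.py",
    "ERROR ExecProcessor - snmp.py: error from before the restart",
]

recent = since_last_restart(messages)
print(recent)
```

Anything logged before the most recent restart is dropped, which keeps the report focused on errors from the current run of the input.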