I relaized that as part of the previous monitoring bundle post i forgot to explain something cool/critical.
When we first conceived of the scripted inputs we used ps, top, netstat, as examples. It was going to be so easy and cool to eat
ps output and get graphs of VM usage by process. Totally obvious until we tried it. The ps output in splunk works best as one event, with the header at the top and a repeated line per process:
Looks great! I can search for “sourcetype::ps splunkd” and get back all the times splunkd was running. But the problem comes when wanting to report on VM usage. How do i get our kv extractor to support a search that is “average VSZ for splunkd by time”. In our search langauge ( or using the UI ) you can say something like:
"sourcetype::ps splunkd | stats avg(VSZ) by _time
What we want is to produce a table and graph that is the average value for the key “VSZ” over time for just the one the process “splunkd”. But the above wont do that as there is no VSZ = in the “event” and worse than that, there are many values drawn out in the VSZ column.
The problem is that we have made ps output into one event with the header and multiple rows per process. This is good as it keeps logically together all the ps output and makes it look like the output from the command. We could have split it up on input but that would be really ugly with each line just a row of data. Even if we duplicated the header for each row it would have been sucky.
We were confronted with how to access individual row/column values withing this type of output – keep in mind that many OS commands operate this way : top, lsof, ls -las, …
In usual splunk fashion we did not want our sever to have specific knowledge on how to process top, ps, etc, output.
Enter Steve Zhang – i think it was his first week here at splunk and he just nailed it.
He wrote the multikv search processor that would find a header and rip events like this into single line events with kv pairs generated for each col/row. SWEET!
It works something like this – its takes the following type of event ( header(col) + multiple lines(rows)):
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 41 0.0 -0.0 28324 680 ?? Ss 8:25PM 0:00.01 /usr/sbin/memberd -x
root 42 0.0 -0.1 29268 2208 ?? Ss 8:25PM 0:00.18 /usr/sbin/securityd
root 44 0.0 -0.0 27864 484 ?? Ss 8:25PM 0:00.62 /usr/sbin/notifyd
root 46 0.0 -0.1 30988 2976 ?? Ss 8:25PM 0:00.61 /usr/sbin/DirectoryService
root 48 0.0 -0.0 27676 876 ?? Ss 8:25PM 0:00.37 /usr/sbin/distnoted
root 51 0.0 -0.0 27252 240 ?? Ss 8:25PM 0:07.55 /usr/sbin/update
root 62 0.0 -0.1 37828 2020 ?? S 8:25PM 0:00.12 /usr/sbin/blued
And at search time if you pipe as result set through
multikv turns it into something that looks like the following where each line is a seperate event:
USER=root PID=41 %CPU=0.0 %MEM=0.0 VSZ=28342 VSZ=680 TT=?? STAT=Ss STARTED=08:25PM TIME=0.00.01 COMMAND=/usr/sbin/memberd -x
USER=root PID=42 %CPU=0.0 %MEM=0.1 VSZ=29268 VSZ=2208 TT=?? STAT=Ss STARTED=08:25PM TIME=0.00.18 COMMAND=/usr/sbin/secuirtyid -x
USER=root PID=44 %CPU=0.0 %MEM=0.0 VSZ=287864 VSZ=484 TT=?? STAT=Ss STARTED=08:25PM TIME=0.00.62 COMMAND=/usr/sbin/notifyd -x
and so on…
This means that by taking the original search and pipeing it to multikv we can report on any row/column pair – so now do the following search
sourcetype::ps splunkd | multikv | stats avg(VSZ) by _time
The problem with the above is that multikv is going to breakout every row and thus the avg(VSZ) above will be the average for ALL processes. Think of multikv turning ps output into a table of rows and columns. So, we need the ability for multikv to not break every ps entry into its own event, just the splunkd ones. No problem, enter Dr. Zhang. Multikv was enhanced to support both filter and field so that you specify which rows and columns you want extracted.
By adding the following filter argument multikv will take the entire ps event and pull out just the splunkd row.
sourcetype::ps splunkd | multikv filter splunkd | stats avg(VSZ) by _time
The filter argument to multikv will filter out all rows that do not match – you can supply multiple coma separated. For example following will will pull out all rows that have splunkd OR python:
sourcetype::ps splunkd | multikv filter splunkd, pythond | stats avg(VSZ) by _time
So now we will get just rows with splunkd or python but we will still get all fields (columns). In some cases you don’t want/need all columns. Just as the filter argument helped control which rows we wanted, we can limit columns using the
fields argument. The following will pull out just the RSS and VSZ fields (columns) for splunkd and python entries in ps
sourcetype::ps splunkd | multikv filter splunkd, python fields RSS, VSZ
So to get just the splunkd VSZ over time we can just do the following: ( click to enlarge )
The above focused on explaining how to work with by filtering out just the splunkd process. It can be very intereting to look at all process by time.
If we wanted to, we could easily use multikv along with timechart to graph all processes vm usage over time. Just remove the
filter argument and have timechart split by host. This done easily enough via the UI:
- Search for “index::monitoring sourcetype::ps | multikv”
- Click the report link to go to report mode
- Click on the VSZ field
- Change the popup/builder to average by time split by host
So, if your using the monitoring bundle or have your own header, row, col events used Dr. Z’s multikv!!!
**as usual, please, please, please, send/post/call with suggestions on how to improve**