
Very often you want to find “problems” in your IT data, but you don’t know what to look for. How can you find these problems with Splunk?
In Splunk’s new search language, there are several search operators that can help you. I’ll describe only a subset of what is possible.
- 1) You can search for unexpected events by looking at those that do not cluster into large groups. For example, you can cluster the errors in the last hour and report on the events that belong to the smallest clusters (e.g., ‘error | cluster showcount=true | sort cluster_count | head 5’).
- 2) You can find unexpected events by finding values that are statistically unusual, such as values that lie many standard deviations from the mean. For example, you can search for sendmail events with anomalous ‘delay’ values (e.g., ‘sourcetype=sendmail_syslog | anomalousvalue delay action=filter pthresh=0.02’).
- 3) You can use machine learning to find events that have unexpected values based on historical context (e.g., ‘* | anomalies blacklist=boringevents’).
- 4) It’s a little bit of a hand-wave, but you can do really cool graphical reports that often make anomalies visually obvious. For example, you could create a timechart of average cpu_seconds by host and spot problems at a glance (e.g., ‘sourcetype=top | timechart avg(cpu_seconds) by host’).
- 5) Finally, Splunk is extensible: you can define your own search operators. If you know how to find the events that interest you, you can write a simple script and integrate it with a search platform that deals with billions of events in seconds. Since Splunk uses a scalable map-reduce framework, your script will run in that framework and scale automatically.
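To make the "far from the mean" idea in item 2 concrete, and to show the kind of logic a custom script from item 5 might contain, here is a minimal Python sketch. It is not Splunk's implementation of ‘anomalousvalue’ and it does not use any Splunk API; the function name flag_outliers, the dict-per-event representation, and the default threshold are all assumptions made for illustration.

```python
import statistics


def flag_outliers(events, field, threshold=3.0):
    """Return the events whose `field` value lies more than `threshold`
    sample standard deviations away from the mean of that field.

    `events` is assumed to be a list of dicts, one per event
    (a hypothetical representation, not a Splunk data structure).
    """
    # Collect the numeric values of the field, skipping events without it.
    values = [float(e[field]) for e in events if field in e]
    if len(values) < 2:
        return []  # not enough data to estimate a spread

    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out

    return [
        e for e in events
        if field in e and abs(float(e[field]) - mean) / stdev > threshold
    ]


# Ten ordinary 'delay' values and one very large one.
sample = [{"delay": d} for d in (1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 100)]
outliers = flag_outliers(sample, "delay", threshold=2.0)
```

With a small sample, a single extreme value inflates the standard deviation itself, so a lower threshold (2.0 here) or a more robust statistic such as the median may work better in practice.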
Once you have searches that find unexpected events, you can set alerts for them. You can also combine events into ‘transactions’ and look for anomalies in groups of events.
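As a rough illustration of the transaction idea, the Python sketch below groups events that share a key and summarizes each group, so that unusually long or unusually large groups stand out. It is only a sketch of the concept, not how Splunk builds transactions; the function name build_transactions, the ‘session’ key, and the use of plain numeric ‘_time’ values are assumptions.

```python
from collections import defaultdict


def build_transactions(events, key):
    """Group events sharing the same `key` value and summarize each group.

    Each event is assumed to be a dict with a numeric '_time' field
    (seconds). Returns one summary dict per group, longest duration first.
    """
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)

    transactions = []
    for value, members in groups.items():
        times = [m["_time"] for m in members]
        transactions.append({
            key: value,
            "eventcount": len(members),           # events in the group
            "duration": max(times) - min(times),  # first to last event
        })

    # Longest-running groups first, since those are often the odd ones.
    transactions.sort(key=lambda t: t["duration"], reverse=True)
    return transactions


events = [
    {"_time": 0, "session": "a"},
    {"_time": 5, "session": "a"},
    {"_time": 1, "session": "b"},
    {"_time": 61, "session": "b"},
    {"_time": 62, "session": "b"},
]
tx = build_transactions(events, "session")
```

Here session "b" spans 61 seconds across three events while session "a" spans 5 seconds across two, so "b" sorts to the top as the group worth a closer look.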
----------------------------------------------------
Thanks!
David Carasso