TIPS & TRICKS

Digging into metrics.log

Occasionally people ask for help in identifying a rogue data input that is suddenly spewing events. If it’s hidden in a ton of similar data it can be difficult to sort out which one is actually the problem. One place to look is the Splunk internal metrics.log. You can find it by searching the internal index (add “index=_internal” to your search) or just look in the file itself (located in $SPLUNK_HOME/var/log/splunk.)

Before I get into what can be found there, I need to explain what metrics.log is not. It is a sampling over 30 second intervals, so it will not give you an exact accounting of all your inputs. For each type of item reported, you get the top ten hot sources over the interval, based on the size of the event (_raw.) It is different from the numbers reported by LicenseManager, which include the indexed fields. Also, the default configuration only maintains the metrics data in the internal index a few days, but by going to the files you can see trends over a period of months if your rolled files go that far back.

A typical metrics.log has stuff like this:

03-13-2008 10:48:55.620 INFO Metrics – group=pipeline, name=tail, processor=tail, cpu_seconds=0.000000, executes=31, cumulative_hits=73399
03-13-2008 10:48:55.620 INFO Metrics – group=pipeline, name=typing, processor=annotator, cpu_seconds=0.000000, executes=63, cumulative_hits=134912
03-13-2008 10:48:55.620 INFO Metrics – group=pipeline, name=typing, processor=clusterer, cpu_seconds=0.000000, executes=63, cumulative_hits=134912
03-13-2008 10:48:55.620 INFO Metrics – group=pipeline, name=typing, processor=readerin, cpu_seconds=0.000000, executes=63, cumulative_hits=134912
03-13-2008 10:48:55.620 INFO Metrics – group=pipeline, name=typing, processor=sendout, cpu_seconds=0.000000, executes=63, cumulative_hits=134912
03-13-2008 10:48:55.620 INFO Metrics – group=thruput, name=index_thruput, instantaneous_kbps=0.302766, instantaneous_eps=2.129032, average_kbps=0.000000, total_k_processed=19757, load_average=0.124023
03-13-2008 10:48:55.620 INFO Metrics – group=per_host_thruput, series=”fthost”, kbps=0.019563, eps=0.096774, kb=0.606445
03-13-2008 10:48:55.620 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.283203, eps=2.032258, kb=8.779297
03-13-2008 10:48:55.620 INFO Metrics – group=per_index_thruput, series=”_internal”, kbps=0.275328, eps=1.903226, kb=8.535156
03-13-2008 10:48:55.620 INFO Metrics – group=per_index_thruput, series=”_thefishbucket”, kbps=0.019563, eps=0.096774, kb=0.606445
03-13-2008 10:48:55.620 INFO Metrics – group=per_index_thruput, series=”default”, kbps=0.007876, eps=0.129032, kb=0.244141
03-13-2008 10:48:55.620 INFO Metrics – group=per_source_thruput, series=”/applications/splunk3.2/var/log/splunk/metrics.log”, kbps=0.272114, eps=1.870968, kb=8.435547
03-13-2008 10:48:55.620 INFO Metrics – group=per_source_thruput, series=”/applications/splunk3.2/var/log/splunk/splunkd.log”, kbps=0.003213, eps=0.032258, kb=0.099609
03-13-2008 10:48:55.620 INFO Metrics – group=per_source_thruput, series=”/var/log/apache2/somedomain_access_log”, kbps=0.007876, eps=0.096774, kb=0.244141
03-13-2008 10:48:55.620 INFO Metrics – group=per_source_thruput, series=”filetracker”, kbps=0.019563, eps=0.096774, kb=0.606445
03-13-2008 10:48:55.620 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.007876, eps=0.129032, kb=0.244141
03-13-2008 10:48:55.620 INFO Metrics – group=per_sourcetype_thruput, series=”filetrackercrclog”, kbps=0.019563, eps=0.096774, kb=0.606445
03-13-2008 10:48:55.620 INFO Metrics – group=per_sourcetype_thruput, series=”splunkd”, kbps=0.275328, eps=1.903226, kb=8.535156
03-13-2008 10:48:55.620 INFO Metrics – group=queue, name=aeq, max_size=10, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0
03-13-2008 10:48:55.620 INFO Metrics – group=queue, name=aq, max_size=10, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0
03-13-2008 10:48:55.620 INFO Metrics – group=queue, name=tailingq, current_size=0, largest_size=0, smallest_size=0
03-13-2008 10:48:55.620 INFO Metrics – group=queue, name=udp_queue, max_size=1000, filled_count=0, empty_count=0, current_size=0, largest_size=0, smallest_size=0

There’s a lot more there than just volume data, but for now I’ll focus on investigating data inputs. “group=” identifies what type of thing being reported on and series the particular item. For incoming events, the amount of data processed is in the thruput group, as in per_host_thruput. In my case, I’m only indexing data from one host so per_host_thruput actually can tell me something useful: that right now host “grumpy” indexes around 8k in a 30-second period. Since there is only one host I could add it all up and get a good picture of what I’m indexing, but if I had more than 10 hosts I would only get a sample.

03-13-2008 10:49:57.634 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.245401, eps=1.774194, kb=7.607422
03-13-2008 10:50:28.642 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.237053, eps=1.612903, kb=7.348633
03-13-2008 10:50:59.648 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.217584, eps=1.548387, kb=6.745117
03-13-2008 10:51:30.656 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.245621, eps=1.741935, kb=7.614258
03-13-2008 10:52:01.661 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.311051, eps=2.290323, kb=9.642578
03-13-2008 10:52:32.669 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.296938, eps=2.322581, kb=9.205078
03-13-2008 10:53:03.677 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.261593, eps=1.838710, kb=8.109375
03-13-2008 10:53:34.686 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.263136, eps=2.032258, kb=8.157227
03-13-2008 10:54:05.692 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.261530, eps=1.806452, kb=8.107422
03-13-2008 10:54:36.699 INFO Metrics – group=per_host_thruput, series=”grumpy”, kbps=0.313855, eps=2.354839, kb=9.729492

For example, I know that access_common is a popular sourcetype for events on this webserver, so it would give me a good idea of what was happening:

03-13-2008 10:51:30.656 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.022587, eps=0.193548, kb=0.700195
03-13-2008 10:52:01.661 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.053585, eps=0.451613, kb=1.661133
03-13-2008 10:52:32.670 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.031786, eps=0.419355, kb=0.985352
03-13-2008 10:53:34.686 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.030998, eps=0.387097, kb=0.960938
03-13-2008 10:54:36.700 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.070092, eps=0.612903, kb=2.172852
03-13-2008 10:56:09.722 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.023564, eps=0.290323, kb=0.730469
03-13-2008 10:56:40.730 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.006048, eps=0.096774, kb=0.187500
03-13-2008 10:57:11.736 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.017578, eps=0.161290, kb=0.544922
03-13-2008 10:58:13.748 INFO Metrics – group=per_sourcetype_thruput, series=”access_common”, kbps=0.025611, eps=0.225806, kb=0.793945

But I’ve got way more than 10 sourcetypes, so at any particular time some other one could spike and access_common wouldn’t be reported. per_index_thruput and per_source_thruput work similarly.

With this in mind, lets dissect the standard saved search “KB indexed per hour last 24 hours”.

index::_internal metrics group=per_index_thruput NOT debug NOT sourcetype::splunk_web_access | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB

This means look in the internal index for metrics data of group per_index_thruput, ignore some internal stuff and make a report showing the sum of the kb values. For cleverness, we’ll also rename the output to something meaningful, “totalKB”. The result looks like this:

sum of kb vs. time for results in the past day
_time totalKB
1 03/12/2008 11:00:00 922.466802
2 03/12/2008 12:00:00 1144.674811
3 03/12/2008 13:00:00 1074.541995
4 03/12/2008 14:00:00 2695.178730
5 03/12/2008 15:00:00 1032.747082
6 03/12/2008 16:00:00 898.662123

Those totalKB values just come from the sum of kb over a one hour interval. If I like, I can change the search and get just the ones from grumpy:

index::_internal metrics grumpy group=per_host_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB

sum of kb vs. time for results in the past day
_time totalKB
1 03/12/2008 11:00:00 746.471681
2 03/12/2008 12:00:00 988.568358
3 03/12/2008 13:00:00 936.092772
4 03/12/2008 14:00:00 2529.226566
5 03/12/2008 15:00:00 914.945313
6 03/12/2008 16:00:00 825.353518

index::_internal metrics access_common group=per_sourcetype_thruput | timechart fixedrange=t span=1h sum(kb) | rename sum(kb) as totalKB

sum of kb vs. time for results in the past day
_time totalKB
1 03/12/2008 11:00:00 65.696285
2 03/12/2008 12:00:00 112.035162
3 03/12/2008 13:00:00 59.775395
4 03/12/2008 14:00:00 35.008788
5 03/12/2008 15:00:00 62.478514
6 03/12/2008 16:00:00 14.173828

Splunk
Posted by

Splunk

Join the Discussion