TIPS & TRICKS

Some details on metrics.log: its data, format, and utility


Metrics.log has a variety of introspection information for reviewing
Splunk's behavior.  Let's look at what's available:

First, metrics.log is a periodic report on the workings of various
pieces of Splunk.  Essentially it's a summary, every 30 seconds or so,
of recent activity.

Structure of the lines:

01-27-2010 15:43:54.913 INFO  Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958

Boilerplate: the timestamp, the 'severity' (which I believe is always
INFO for metrics events), and then the kind of event, "Metrics".
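
As an aside, if you want to poke at these lines outside of splunk, a
quick-and-dirty sketch like the following will pull one apart.  It's
Python, and the regex simply assumes the key=value layout shown above
(values are sometimes quoted); it is not an official parser.

  import re

  # One metrics.log line, as shown above.
  line = ('01-27-2010 15:43:54.913 INFO  Metrics - group=pipeline, '
          'name=parsing, processor=utf8, cpu_seconds=0.000000, '
          'executes=66, cumulative_hits=301958')

  def parse_metrics_line(line):
      """Return (timestamp, dict of key=value fields) for one line."""
      timestamp = ' '.join(line.split(' ', 2)[:2])
      fields = dict(re.findall(r'(\w+)="?([^,"]*)"?', line))
      return timestamp, fields

  ts, fields = parse_metrics_line(line)
  print(ts, fields['processor'], fields['executes'], fields['cpu_seconds'])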

Next comes the group, which indicates what kind of metrics data the
line carries.  There are a few types in the file:

- pipeline
- queue
- thruput
- udpin_connections
- mpool

Pipeline messages are reports on the splunk pipelines, which are the
pieces of 'machinery' that process and manipulate events flowing into
and out of the splunk system.  You can see how many times data reached
a given machine in the splunk system (executes), and you can see how
much cpu time each machine used (cpu_seconds).

Plotting totals of cpu_seconds by processor can show you where the cpu
time is going in indexing activity.  Looking at the numbers for
executes can give you an idea of dataflow.  For example, if you see:

group=pipeline, name=merging, processor=aggregator, ..., executes=998
group=pipeline, name=merging, processor=readerin, ... , executes=698
group=pipeline, name=merging, processor=regexreplacement, ..., executes=698
group=pipeline, name=merging, processor=sendout, ... , executes=698

Then it's pretty clear that a large portion of your items aren't
making it past the aggregator.  This might indicate that many of your
events are multiline and are being combined in the aggregator before
being passed along.
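
If you'd like to do that sort of totaling outside of splunk (the same
data is searchable in the _internal index, of course), here is a rough
Python sketch that sums cpu_seconds and executes per processor across
a metrics.log file.  The key=value regex is just an assumption based
on the format above, and the filename comes from the command line.

  import re
  import sys
  from collections import defaultdict

  cpu = defaultdict(float)   # processor -> total cpu_seconds
  execs = defaultdict(int)   # processor -> total executes

  with open(sys.argv[1]) as f:
      for line in f:
          if 'group=pipeline' not in line:
              continue
          fields = dict(re.findall(r'(\w+)="?([^,"]*)"?', line))
          proc = fields.get('processor', 'unknown')
          cpu[proc] += float(fields.get('cpu_seconds', 0))
          execs[proc] += int(fields.get('executes', 0))

  # Busiest processors, by total cpu time, first.
  for proc in sorted(cpu, key=cpu.get, reverse=True):
      print(f'{proc:25s} cpu_seconds={cpu[proc]:10.3f} executes={execs[proc]}')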

---------

Queue lines look as follows
... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

Most of these values aren't interesting.  current_size, especially
considered in aggregate across events, can tell you which portions of
splunk indexing are the bottlenecks.  If current_size remains near
zero, then the splunk (indexing) system probably is not being taxed in
any way.  If the queues remain near 1000, then more data is being fed
into the system (at the time) than it can process in total.

Sometimes you will see messages such as
... group=queue, name=parsingqueue, blocked!!=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0

This message contains the 'blocked' string, indicating that the queue
was full, and someone tried to add more and couldn't.  A queue will
become unblocked as soon as the code pulling items out of it pulls an
item.  Many blocked queue messages in a row indicate that data is not
flowing at all for some reason.  A few scattered blocked messages
indicate that flow control is operating, which is normal for a busy
indexer.

If you want to look at the queue data in aggregate, graphing the
average of current_size is probably a good starting point.
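
Here's a rough sketch of that sort of aggregation in Python, with the
same assumptions about the key=value layout as above.  It averages
current_size per queue name and also counts how many lines carried the
blocked!! marker:

  import re
  import sys
  from collections import defaultdict

  sizes = defaultdict(list)    # queue name -> current_size samples
  blocked = defaultdict(int)   # queue name -> count of blocked!! lines

  with open(sys.argv[1]) as f:
      for line in f:
          if 'group=queue' not in line:
              continue
          fields = dict(re.findall(r'(\w+)="?([^,"]*)"?', line))
          name = fields.get('name', 'unknown')
          sizes[name].append(int(fields.get('current_size', 0)))
          if 'blocked!!' in line:
              blocked[name] += 1

  for name, samples in sizes.items():
      avg = sum(samples) / len(samples)
      print(f'{name:20s} avg current_size={avg:8.1f} blocked={blocked[name]}')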

There are queues in place for data going into the parsing pipeline,
and for data between parsing and indexing.  Each networking output
also has its own queue, which can be useful to determine whether the
data is able to be sent promptly, or alternatively whether there is
some network or receiving-system limitation.

Generally, filled_count and empty_count cannot be productively used
for inferences.

---------

Thruput lines (thruput as in the English word throughput) come in a
few flavors.

Firstly, thruput is measured in the indexing pipeline; if your data is
not reaching that pipeline for some reason, it will not appear in this
data.

First there is a catchall line, which looks like this:
... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703

This is the best line to look at when working on performance tuning or
evaluating indexing load, etc. It is trying to capture the total
indexing data load.  On my box, there is basically nothing to speak
of.

Note: In thruput lingo, 'kbps' does not mean kilobits per second; it
means kilo*bytes* per second.  The industry standard term would be
something like KBps or kBps or KB/s.

instantaneous_kbps over time is probably the most useful figure to look at
in aggregate, but I've rarely been interested in this line.
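
If you do want to chart it, a minimal sketch along these lines (again
just assuming the line format shown above) prints timestamp and
instantaneous_kbps pairs that you can feed into whatever graphing tool
you like:

  import re
  import sys

  with open(sys.argv[1]) as f:
      for line in f:
          if 'name=index_thruput' not in line:
              continue
          # The timestamp is the first two space-separated tokens.
          timestamp = ' '.join(line.split(' ', 2)[:2])
          m = re.search(r'instantaneous_kbps=([\d.]+)', line)
          if m:
              print(f'{timestamp}\t{m.group(1)}')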

Following the catchall are a variety of breakouts of the indexing
thruput, lines such as:

... group=per_host_thruput, series="joshbook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422
... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422
... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422

Here the data load is broken out by host, index, source, and
sourcetype.  This can be useful to answer two questions:  

- which data categories are busy?
- when were my data categories busy?

The series value says which host, or index, etc., the line is about.
The kb value indicates the number of kilobytes processed since the
last sample.  Graphing kb in aggregate can be informative.  The
summary indexing status dashboard uses this data, for example.
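
A small sketch of that kind of aggregation, totaling the kb field per
series for one of the breakout groups.  per_sourcetype_thruput is just
the example here; swap in per_host_thruput and friends as you like:

  import re
  import sys
  from collections import defaultdict

  group = 'per_sourcetype_thruput'   # or per_host_thruput, per_index_thruput, ...
  totals = defaultdict(float)        # series -> total kilobytes

  with open(sys.argv[1]) as f:
      for line in f:
          if f'group={group}' not in line:
              continue
          fields = dict(re.findall(r'(\w+)="?([^,"]*)"?', line))
          totals[fields.get('series', 'unknown')] += float(fields.get('kb', 0))

  # Busiest series first.
  for series, kb in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
      print(f'{series:40s} {kb:12.1f} KB')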

NOTE: the per_x_thruput categories are not complete.  By default they
show the ten busiest of each type, for each sampling window.  If you
have 2000 active forwarders, you cannot expect to see the majority of
them in this data.  The sampling quantity can be adjusted, but it will
increase the chattiness of metrics.log, and the resulting indexing
load and _internal index size.  The value is adjustable in
limits.conf, [metrics] maxseries = num
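
For example, a stanza along these lines would raise the cap (the
particular number is purely illustrative):

  # limits.conf
  [metrics]
  maxseries = 50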

I recommend ignoring the 'eps' value, as it has accuracy issues.

---------

udpin_connections lines are essentially metering on udp input.

group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00, _udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00

Be aware that it's quite achievable to max out the ability of the
operating system (let alone splunk) to handle udp packets at high
rates.  This data may be useful to determine whether any data is
coming in at all, and at what times it rises.  There is no guarantee
that all packets sent to this port will be received and thus metered.

---------

mpool lines represent memory used by the splunk indexer code only
(not any other pipeline components).  This information is probably not
useful to anyone other than splunk developers.

group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0

max_used_interval represents the number of bytes (in this example a
bit over 4KB) used during the reporting interval (since the last
output).

max_used represents the maximum amount of memory, in bytes, in use at
any time during the component's lifetime (most likely since splunk
start). In this example the high water mark was around 53KB.

avg_rsv is the average size of a memory allocation across the runtime
of the system.

capacity is the ceiling on memory use for the indexer.

used is the current use of memory by the indexer.

In this case we can see that some memory is sometimes in use, although
at the time of sample, none is in use, and that generally the use is
low.

----------------------------------------------------
Thanks!
Joshua Rodman

Splunk