From Splunk Wiki
Troubleshooting Indexed Data Volume
So data is flowing into Splunk. Where's it coming from? What's chatty today? Who just blew the doors off our indexing volume?
The first thing to check is how many apps are running. Have you installed the Unix app? It can index a lot of data very quickly because it runs lots of scripted inputs. What about other apps, or other inputs? Where did you (or someone else) tell Splunk to get data from? The searches below will help you figure this out.
There are a few tools available to answer these questions.
Data volume seen by the license code
You may care about this for licensing concerns. Or you may just want to sanity-check what quantity of data the licensing code is seeing.
You can review some of this information in the Manager portion of the Splunk 4.0.x interface, or you can run a search against the _internal index to chart daily volume per host:
index=_internal todaysBytesIndexed LicenseManager-Audit source=*license_audit.log | eval Daily_Indexing_Volume_in_MBs = todaysBytesIndexed/1024/1024 | timechart avg(Daily_Indexing_Volume_in_MBs) by host
This is snarfed from: http://www.splunk.com/base/Documentation/latest/Installation/AboutSplunklicenses#View_your_license_and_usage_details
You can also review the license_audit.log file itself in your Splunk installation, if you need history longer than 28 days. If you do, you may find the following Unix incantation useful. It keeps only the part of each line before the first "][" marker, which gives a more readable variation on the file.
awk '{ i = index($0, "]["); print (i ? substr($0, 1, i - 1) : $0) }' license_audit.log > readable-license-audit.log
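If you are already working on the raw file, the same megabyte conversion used in the search above can be done offline. The following is a minimal sketch, not a tested recipe: the function name license_mb is made up for illustration, and it assumes each line starts with a date and a time field and carries a literal todaysBytesIndexed=<digits> pair. Check both assumptions against your own file first.

```shell
# Hypothetical helper: print timestamp and todaysBytesIndexed (in MB) per line.
# Assumption: fields 1 and 2 are date and time, and the line contains a
# literal "todaysBytesIndexed=<digits>" pair.
license_mb() {
    awk '
        BEGIN { key = "todaysBytesIndexed=" }
        match($0, key "[0-9]+") {
            # Take the digits that follow the key.
            bytes = substr($0, RSTART + length(key), RLENGTH - length(key))
            printf "%s %s %.2f MB\n", $1, $2, bytes / 1024 / 1024
        }
    ' "$@"
}
# Usage: license_mb license_audit.log
```

Lines without the field are skipped silently, so the output may be shorter than the input.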
Quick summary information by host, sourcetype, source
Okay, so there's a problem with the data volume: it's higher than you expected, or higher than you planned for. Or you just want a better picture, in bulk, of where the data is coming from.
The metrics.log data already has totals for this on a reasonable interval, so we can mine this.
Splunk Metrics Reports has searches for this purpose in the section 'How much was indexed'. For example:
index=_internal group="per_host_thruput" | eval mb=kb/1024| timechart span=1d sum(mb) by series
index=_internal group="per_source_thruput" | eval mb=kb/1024| timechart span=1d sum(mb) by series
index=_internal group="per_sourcetype_thruput" | eval mb=kb/1024| timechart span=1d sum(mb) by series
These searches show a sample of the top producers in each category. The default sample size is 10, so if you receive, say, 20 sourcetypes, the picture will be incomplete: you get only the 10 busiest series in each sub-minute sampling window. This gives you a quick general picture of what's going on, not a to-the-byte accurate value.
To see how much data Splunk has actually written to your various indexes, use this search:
index=_internal group="per_index_thruput" | eval mb=kb/1024| timechart span=1d sum(mb) by series
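The same per-series totals can be roughed out straight from a metrics.log file when the search interface is not handy. This is a sketch under assumptions rather than a supported tool: the function name thruput_mb is made up, and it assumes each metrics line holds comma-separated pairs along the lines of group=per_sourcetype_thruput, series="syslog", ..., kb=12.5. It sums over the whole file instead of per day.

```shell
# Hypothetical helper: sum kb per series for one metrics.log group, print MB.
# Assumption: lines hold comma-separated pairs such as
#   group=per_sourcetype_thruput, series="syslog", kbps=0.4, eps=1.2, kb=12.5
thruput_mb() {
    group=$1; shift
    awk -v group="$group" '
        index($0, "group=" group ",") {
            series = ""; kb = 0
            n = split($0, parts, /, */)
            for (i = 1; i <= n; i++) {
                if (parts[i] ~ /^series=/) {
                    series = substr(parts[i], 8)   # drop the "series=" prefix
                    gsub(/"/, "", series)          # drop surrounding quotes
                }
                if (parts[i] ~ /^kb=/) kb = substr(parts[i], 4)
            }
            if (series != "") total[series] += kb
        }
        END { for (s in total) printf "%s %.2f\n", s, total[s] / 1024 }
    ' "$@" | sort
}
# Usage: thruput_mb per_sourcetype_thruput metrics.log
```

Like the searches, this only sees the series that made it into each sample, so the totals are a lower bound rather than an exact count.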
Set up a scheduled search to alert you if a license violation occurs
First off, learn how to set up a scheduled search with an email alert trigger. You can then use the search string below as the basis for your alert. It returns results only if the quota-exceeded count has changed between consecutive audit events:
index=_internal source=*license_audit.log LicenseManager-Audit | delta quotaExceededCount as quotadiff | stats first(quotadiff) as quotadiff | search quotadiff>0
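To look back through an archived license_audit.log for past violations, the same "did the counter go up" idea can be sketched in awk. This is illustrative only: the function name quota_increases is made up, and it assumes the file is in chronological order and that each line carries a literal quotaExceededCount=<digits> pair.

```shell
# Hypothetical helper: print each line whose quotaExceededCount is higher than
# on the previous line that carried the field (file order, oldest first).
# Assumption: lines contain a literal "quotaExceededCount=<digits>" pair.
quota_increases() {
    awk '
        BEGIN { key = "quotaExceededCount=" }
        match($0, key "[0-9]+") {
            count = substr($0, RSTART + length(key), RLENGTH - length(key)) + 0
            if (seen && count > prev) print
            prev = count; seen = 1
        }
    ' "$@"
}
# Usage: quota_increases license_audit.log
```

The first line with the field is never printed, because there is nothing earlier to compare it against.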
Counting event sizes over a time range
Yet to be written up.
