TIPS & TRICKS

Merriam-Webster says, “collect: to get things from different places and bring them together.”

THE ‘collect’ COMMAND

The ‘collect’ command is used to replicate data from one index into another.  The assumed usage, and original intent, of this command is to aggregate granular events into a summary index.  In fact, the documentation states that is its purpose in the Synopsis – “Put search results into a summary index.”  But that need not always be the case.  There are other uses for the ‘collect’ command as well – result sets can be collected at either a granular or an aggregate (summary) level, and they may be retained in regular or summary indexes.
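As a minimal sketch (the index, sourcetype, and field names here are hypothetical), a search can copy its matching events into another index with a single ‘collect’ at the end.  Note that the destination index must already exist – ‘collect’ does not create it:

```
index=web sourcetype=access_combined status=500
| collect index=web_errors
```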

There are no licensing implications to duplicating the data in another index – after the original data is indexed, it is not counted again by the license meter when it is stored in another index via the collect command.

Why would one want to use the ‘collect’ command?  Two very good reasons are search performance and data retention.

Search performance:  A database analogy would be summary tables (or aggregate tables or reporting tables).  The idea is to store a copy of an aggregate search (query) to improve subsequent searches that request the data at that level of aggregation or higher.  In this way the data is “preprocessed,” and searches read far less data to produce the result set than they would by scanning all the granular events.  (HINT: Summarizing at the first level of aggregation above the raw event is often the most useful, as it typically reduces the number of events by orders of magnitude and can serve as the basis for all higher levels of aggregation.)
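For example, a scheduled search could roll raw web events up to hourly statistics – the first level of aggregation above the raw event – and write them to a summary index.  The index and field names below are hypothetical:

```
index=web sourcetype=access_combined
| bin _time span=1h
| stats count AS event_count, avg(response_time) AS avg_response_time BY _time, host, status
| collect index=web_summary
```

Daily or weekly trend reports can then run against ‘web_summary’ instead of the raw events.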

Data Retention:  This aspect also has multiple applications.  These include retention policies for summary indexes and policies for “merged indexes” (to be explained below).

It is quite often the case that access to granular, detailed data is required for recent events, but only lightly- or highly-summarized data is needed for older data to support baselining or trending.  In this scenario, the retention policy for the granular data could be set quite short, say 1 week or 1 month, while the retention policy for the summary index, populated by the ‘collect’ command, could be set much longer, say 6 months or 6 years.  Baselining and/or trending searches therefore read far less data, do less compute-intensive work, and perform much better than searches over the raw event-level data.
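Retention periods are set per index in indexes.conf via frozenTimePeriodInSecs.  A sketch of the scenario above, with hypothetical index names:

```
# indexes.conf
[web]                               # granular events
frozenTimePeriodInSecs = 2592000    # ~30 days

[web_summary]                       # populated by 'collect'
frozenTimePeriodInSecs = 189216000  # ~6 years
```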

The second case, a “merged index”, is an index populated by a search that reads data from two separate indexes – which may or may not have different retention policies – and collects the results into another index with its own retention policy.  Most often the search that populates the merged index would include the ‘transaction’ command, which surfaces multiple fields from two or more indexes.  Some of those fields might come from an index with a relatively short retention period compared to the retention requirements for the transaction event.  In this case, collecting the output of the ‘transaction’ command into another index allows assigning a longer retention period to the transaction event.
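A sketch of such a search, with hypothetical index and field names – order events and shipment events are stitched together on a shared ID, and the resulting transaction events are written to an index that can carry its own, longer retention policy:

```
(index=orders OR index=shipments) order_id=*
| transaction order_id maxspan=7d
| collect index=order_history
```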

----------------------------------------------------
Thanks!
dlux

Posted by Splunk