Summary indexing lets you deal with large volumes of data efficiently by reducing the data into smaller subsets, working on those individually, and finally collating all of the results to get a final result.
Summary indexing is commonly used with large volumes of firewall or web access log data. Imagine that you need to report on monthly firewall traffic (e.g. top sources, top destinations, top services). Assuming your environment generates around 10 million firewall events a day, the report at the end of the month would have to deal with roughly 280M to 310M events (28 to 31 days of data), so running a monthly report could take some time.
The simplest approach is to run a daily search that summarizes the firewall data and stores the result in a separate index (summary). You could store it in the same index, but retention periods are normally different for summary data.
So schedule the saved search Do Not Click - Summary Index - Firewall Daily Summary Source IP to run every day at 02:00am and collect this summary data. For example:
starthoursago=26 endhoursago=2 eventtype=firewall | stats count by src_ip | sort count desc | head 200
The actual search executed by the system would look something like:
search starthoursago=26 endhoursago=2 eventtype=firewall | stats count by src_ip | sort count desc | head 200 | addinfo | collect addtime index="summary" \
marker="info_search_name=\"Do Not Click - Summary Index - Firewall Daily Summary Source IP\",report=\"firewall_daily_summary_src_ip\""
Notice that several additional commands were automatically added by the system and that there was also an additional definition on the saved search page. An extra field (report) was added with a value (firewall_daily_summary_src_ip) as well; this field helps differentiate between multiple data sets in the summary index for later reporting purposes.
This reduces the daily amount of data from 10M events to 200 events containing the top 200 src_ip addresses. At the end of each month the system then only needs to deal with 5,600 to 6,200 events (200 events for each of 28 to 31 days) to calculate the monthly top 20. The search for that would be:
startmonthsago=1 index=summary report=firewall_daily_summary_src_ip | stats sum(count) by src_ip | sort sum(count) desc | head 20
This search would produce the desired results fairly quickly.
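The two-stage flow described above (a daily top-200 summary followed by a monthly roll-up) can be sketched outside the search language. The Python below is only an illustration under stated assumptions: the function names daily_summary and monthly_top and the sample events are hypothetical, and the code mimics the stats, sort and head steps rather than reproducing how the product implements them.

```python
from collections import Counter

def daily_summary(events, field="src_ip", limit=200):
    """Mimic `stats count by <field> | sort count desc | head <limit>`."""
    counts = Counter(event[field] for event in events)
    return [{field: value, "count": n} for value, n in counts.most_common(limit)]

def monthly_top(summary_rows, field="src_ip", limit=20):
    """Mimic `stats sum(count) by <field> | sort sum(count) desc | head <limit>`."""
    totals = Counter()
    for row in summary_rows:
        totals[row[field]] += row["count"]
    return totals.most_common(limit)

# Hypothetical raw events for three "days" (illustrative data only).
days = [
    [{"src_ip": "10.0.0.1"}] * 5 + [{"src_ip": "10.0.0.2"}] * 3,
    [{"src_ip": "10.0.0.1"}] * 2 + [{"src_ip": "10.0.0.3"}] * 4,
    [{"src_ip": "10.0.0.2"}] * 6,
]

# Each day contributes at most `limit` rows to the summary index.
summary_index = [row for day in days for row in daily_summary(day)]

print(monthly_top(summary_index, limit=2))
# [('10.0.0.2', 9), ('10.0.0.1', 7)]
```

The monthly step only ever touches the summary rows, which is why the volume it processes is bounded by the daily limit times the number of days rather than by the raw event count.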
You can already see a benefit: not only can we quickly report on last month's data, but we can also do so on a rolling basis for the last 30 days. The benefit becomes even clearer if you require weekly, quarterly and annual reports, which is a very common scenario. If needed, you could also run another summary collection on the summary data itself (e.g. monthly summaries of daily data).
Similar searches would be required for top destinations (dst_ip) and top services (dst_port). Ideally you would want to schedule all of these at least an hour apart to reduce the load on the system. The configured searches for this would be:
Every day at 03:00am run the saved search Do Not Click - Summary Index - Firewall Daily Summary Destination IP - extra fields report=firewall_daily_summary_dst_ip
starthoursago=27 endhoursago=3 eventtype=firewall | stats count by dst_ip | sort count desc | head 200
Every day at 04:00am run the saved search Do Not Click - Summary Index - Firewall Daily Summary Destination Port - extra fields report=firewall_daily_summary_dst_port
starthoursago=28 endhoursago=4 eventtype=firewall | stats count by dst_port | sort count desc | head 200
In some cases it might be possible to combine the three searches into a single search. This would improve performance, but it can compromise the accuracy of your final results, so it is highly advisable to increase the number of results collected. For example:
Every day at 02:00am run the saved search Do Not Click - Summary Index - Firewall Daily Summary - extra fields report=firewall_daily_summary
starthoursago=26 endhoursago=2 eventtype=firewall | stats count by src_ip, dst_ip, dst_port | sort count desc | head 2000
Important: The searches were named Do Not Click - * on purpose and shouldn't be shared or added to any dashboards. Clicking or running these searches from the UI will pollute your summary index, and this will produce incorrect results.
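To illustrate why an extra run pollutes the summary index, the following Python sketch sums summary rows the way the monthly report does. It is a simplified model, not the product's behavior in detail: the rollup function and the sample counts are hypothetical, and it assumes the accidental run collects exactly the same rows as the scheduled one (a real manual run would typically cover a shifted, partly overlapping window).

```python
from collections import Counter

def rollup(summary_rows):
    """Sum the collected counts per src_ip, as the monthly report does."""
    totals = Counter()
    for row in summary_rows:
        totals[row["src_ip"]] += row["count"]
    return dict(totals)

# Hypothetical rows written by one scheduled run of the summary search.
scheduled_run = [
    {"src_ip": "10.0.0.1", "count": 500},
    {"src_ip": "10.0.0.2", "count": 300},
]

clean_index = list(scheduled_run)
# Assumption: someone clicks the saved search and the same rows are collected again.
polluted_index = scheduled_run + scheduled_run

print(rollup(clean_index))     # {'10.0.0.1': 500, '10.0.0.2': 300}
print(rollup(polluted_index))  # {'10.0.0.1': 1000, '10.0.0.2': 600}
```

Because the report simply sums whatever is in the summary index, nothing downstream can tell the duplicate rows apart from legitimate ones.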
The values used above are for illustration purposes. With 10 million events a day you should probably run the summary collection searches more often, so that each run has fewer events to deal with.
Also, to deal with large volumes you should use the CLI dispatch command instead of the search command. In that case you will need to schedule the searches through the OS cron facility, and you are responsible for the full search command, including the addinfo and collect commands. A more frequent collection might, for example, use a time window such as:
startminutesago=70 endminutesago=10
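When you change how often a summary search runs, its time window has to match the schedule, otherwise events are either counted twice or missed. The Python sketch below checks that property for the windows used in this article (26 to 2 hours ago on a daily schedule, 70 to 10 minutes ago on an hourly one). The helper names are hypothetical, and the model assumes each window is a half-open interval measured back from the run time; it does not model how the product treats events exactly on a boundary.

```python
from datetime import datetime, timedelta

def window(run_time, start_ago, end_ago):
    """Return the (start, end) of the window searched by a run at run_time."""
    return run_time - start_ago, run_time - end_ago

def windows_tile(first_run, interval, start_ago, end_ago, runs=5):
    """True if consecutive runs' windows neither overlap nor leave gaps."""
    previous_end = None
    for i in range(runs):
        start, end = window(first_run + i * interval, start_ago, end_ago)
        if previous_end is not None and start != previous_end:
            return False
        previous_end = end
    return True

first_run = datetime(2024, 1, 1, 2, 0)  # arbitrary illustrative start time

# Daily schedule with starthoursago=26 endhoursago=2: a 24-hour window.
print(windows_tile(first_run, timedelta(days=1),
                   timedelta(hours=26), timedelta(hours=2)))      # True

# Hourly schedule with startminutesago=70 endminutesago=10: a 60-minute window.
print(windows_tile(first_run, timedelta(hours=1),
                   timedelta(minutes=70), timedelta(minutes=10)))  # True

# Mismatch: a 23-hour window on a daily schedule leaves a one-hour gap.
print(windows_tile(first_run, timedelta(days=1),
                   timedelta(hours=25), timedelta(hours=2)))      # False
```

The offset at the end of each window (2 hours, or 10 minutes) gives late-arriving events time to be indexed before they are summarized, while the window length stays equal to the scheduling interval.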