Astronomy and Summary Indexing

I had the pleasure last week of viewing Saturn’s rings at Rutgers University’s observatory. It was my first time actually seeing the rings through a professional telescope, and the planet does look like what we often see in textbook pictures. After the viewing, I started thinking that astronomy records a lot of data that needs to be indexed for search and aggregated for reports. I asked the professor conducting the tour if he had any logs of astrometry data, and he took out his paper notebook to show me. Obviously, in Splunk terms, that was not what I was asking to see.

In seriousness, the professor told me that optical telescopes, radio telescopes, and spectrometers can generate over 1 TB of computer data per day. I think much of this data may be images used for trend analysis of observed readings, but the rest is the usual time series data that requires searching, analytical investigation, and reporting. Since this is unstructured time series data generated by software, Splunk could easily be used to do what it does best for this use case: index, analyze through search, and present aggregated reports.

For instance, suppose we have the following data:

Fri May 21 22:34:40 EDT 2010 star=n14532 1.01
Fri May 21 22:35:40 EDT 2010 star=n14532 1.00
Fri May 21 22:35:40 EDT 2010 star=n32344 1.62
Fri May 21 22:36:40 EDT 2010 star=n14532 0.99
Fri May 21 22:37:40 EDT 2010 star=n32344 1.60

The last number on each line represents the observed magnitude (an object’s brightness) of different stars in this computer-generated log file. I could index this data into Splunk and, assuming that number has been extracted into a field named observed_magnitude, plot the average observed magnitude by star over time with a simple search command:

sourcetype=starlog|timechart avg(observed_magnitude) by star

This would end up looking something like this:

[Chart: Average Observed Magnitude]

With two stars and very few events, this isn’t terribly exciting. However, in real observations, with billions of galaxies and trillions of stars, the volume of data becomes challenging to manage, and our simple timechart search becomes a handy mechanism to analyze and plot the data in the same manner.
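To make the sample concrete, here is a minimal Python sketch of the arithmetic behind that chart, run outside Splunk. The regular expression and the names LINE_PATTERN and average_magnitude_by_star are my own assumptions for illustration, not anything Splunk provides, and the sketch computes one overall average per star rather than the per-time-bucket averages that timechart would plot.

```python
import re
from collections import defaultdict

# The five sample events shown earlier in this post.
SAMPLE_LOG = """\
Fri May 21 22:34:40 EDT 2010 star=n14532 1.01
Fri May 21 22:35:40 EDT 2010 star=n14532 1.00
Fri May 21 22:35:40 EDT 2010 star=n32344 1.62
Fri May 21 22:36:40 EDT 2010 star=n14532 0.99
Fri May 21 22:37:40 EDT 2010 star=n32344 1.60
"""

# Hypothetical pattern (an assumption): "star=<id>" followed by the
# trailing number, which we treat as the observed magnitude.
LINE_PATTERN = re.compile(
    r"star=(?P<star>\S+)\s+(?P<observed_magnitude>-?\d+(?:\.\d+)?)\s*$"
)


def average_magnitude_by_star(log_text):
    """Return {star: average observed magnitude} for the given log text."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in log_text.splitlines():
        match = LINE_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not look like star readings
        star = match.group("star")
        totals[star] += float(match.group("observed_magnitude"))
        counts[star] += 1
    return {star: totals[star] / counts[star] for star in totals}


averages = average_magnitude_by_star(SAMPLE_LOG)
for star, avg in sorted(averages.items()):
    print(f"{star}: {avg:.2f}")
```

For the five sample events this works out to an average of about 1.00 for n14532 and 1.61 for n32344.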

The next question is: what if you wanted to perform the same calculations over a 30-day period in which 8 billion events have been recorded? Computing the average observed magnitude of thousands of stars across billions of raw events is not going to be an instantaneous search, no matter what technology you use. Fortunately, Splunk ships with a feature called summary indexing that solves this problem quite easily.

Summary Indexing

A summary index is an index of an existing index. It contains a time series aggregate summary of prior calculations performed on data in another index. In our example, if we schedule a search to run every hour that takes the average observed magnitude of all the events indexed in the last hour, these aggregate hourly readings can be placed in the summary index. For a 30-day period, we can then take an average of the existing averages recorded in the summary index, and the search will be magnitudes (pardon the pun) faster than going through all 8 billion events at once. Allow me to walk through our example to explain this in practice.

First, create a summary index with Splunk Manager, or use the index called summary that Splunk ships with out of the box. Then, in our example, save the search

sourcetype=starlog|sitimechart avg(observed_magnitude) by star

to use the last hour for its earliest time, and schedule it to run every hour and save its results in the summary index. Let’s call this search “Summary Timechart for Stars”. Notice that timechart has been replaced by sitimechart in this example. This tells Splunk that only the aggregate results of this search will be returned and saved to the summary index. Reporting commands such as top, timechart, chart, rare, and stats each have an si-prefixed version to be used for this purpose.

Now, if we want to find the average observed magnitude of our events for a 30 day period, we would simply run the following search:

index="summary" search_name="Summary Timechart for Stars"|timechart avg(observed_magnitude) by star

Your results and corresponding report will come back quicker because, in layman’s terms, we are now taking an average of averages. The concept of summary indexing is much more comprehensive than this, and I encourage you to read the documentation for further details. Because of the sheer volume of data produced by astronomy, summary indexing is a great way to increase search performance. The same approach could be applied to any large collection of data that is indexed and aggregated.
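One caveat about the "average of averages" shorthand: a plain mean of hourly means is only exact when every hour holds the same number of events. The Python sketch below illustrates the general idea of keeping a sum and a count per hour so that the long-range average can be rebuilt exactly from the summary rows alone. It is an illustration of the concept under my own assumptions, not a description of how Splunk stores summary data internally; the readings and all function names are made up for the example.

```python
from collections import defaultdict

# Made-up (hour, star, magnitude) readings; hour 0 has three events and
# hour 1 has only one, so the hours are deliberately uneven.
RAW_EVENTS = [
    (0, "n14532", 1.00),
    (0, "n14532", 1.02),
    (0, "n14532", 0.98),
    (1, "n14532", 1.10),
]


def summarize_by_hour(events):
    """Collapse raw events into one (sum, count) row per hour and star."""
    summary = defaultdict(lambda: [0.0, 0])
    for hour, star, magnitude in events:
        row = summary[(hour, star)]
        row[0] += magnitude
        row[1] += 1
    return {key: tuple(row) for key, row in summary.items()}


def average_from_summary(summary, star):
    """Exact long-range average, computed from the summary rows only."""
    rows = [row for (_hour, name), row in summary.items() if name == star]
    total = sum(row_sum for row_sum, _count in rows)
    count = sum(row_count for _sum, row_count in rows)
    return total / count


def naive_average_of_averages(summary, star):
    """Unweighted mean of hourly means; drifts when hours are uneven."""
    hourly = [
        row_sum / row_count
        for (_hour, name), (row_sum, row_count) in summary.items()
        if name == star
    ]
    return sum(hourly) / len(hourly)


summary = summarize_by_hour(RAW_EVENTS)
raw_average = sum(m for _h, _s, m in RAW_EVENTS) / len(RAW_EVENTS)
exact = average_from_summary(summary, "n14532")
naive = naive_average_of_averages(summary, "n14532")
print(raw_average, exact, naive)
```

Here the sum-and-count roll-up matches the average over the raw events (about 1.025), while the unweighted mean of the two hourly means comes out at about 1.05. With four events the gap is small, but the point is that two summary rows are enough to recover the same answer as a scan of every raw event.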

Posted by Nimish Doshi

Nimish is Director, Technical Advisory for Industry Solutions providing strategic, prescriptive, and technical perspectives to Splunk's largest customers, particularly in the Financial Services Industry. He has been an active author of Splunk blog entries and Splunkbase apps for a number of years.
