Metrics to the Max! Dramatic Performance Improvements for Monitoring and Alerting on Metrics Data

As more and more organizations embark upon the “digital” journey, service availability and reliability have become ever more important. These organizations rely heavily on monitoring tools not only to monitor the health of their systems but also to help troubleshoot and find the root cause of an outage.

Customers today rely on various monitoring tools to help “keep the lights on.” These tools work well independently, but the real problem arises during an outage, when a user tries to switch between tools without losing context.

IT admins constantly toggle between dashboards (resulting in the infamous swivel-chair effect) and struggle to correlate events, which increases mean time to repair (MTTR).

Other challenges these IT operations teams face include the need to manage and maintain multiple systems, which also increases total cost of ownership. A key challenge is the inability to leverage the metrics data that's hidden in log files.

To help address these challenges, in Splunk Enterprise 7.0, we introduced support for metrics data. Splunk now offers a single platform to investigate and analyze both logs and metrics.

In the first version of Metrics, we introduced a new metrics data store; two new SPL commands, “mstats” for mathematical and statistical transformations and “mcatalog” for displaying the schema of the data stored in the metrics data store; and event annotation, a feature that lets users add context to metrics with log data.

We continue to invest in Metrics, and the latest release, Splunk Enterprise 7.1, adds several new capabilities: an enhanced metrics data store that makes searches 7-10x faster than the previous version of the data store, and an enhanced “mstats” command that supports multiple measures in one search. For example, with 7.0 a single mstats search could perform multiple aggregations on only one measure, so covering two measures meant stitching two searches together:

| mstats avg(_value) AS cpu WHERE metric_name=cpu.user.value AND index=foo span=10s
| appendcols [
| mstats max(_value) AS latency WHERE metric_name=app.request.latency AND index=foo span=10s ]

With 7.1, we can perform multiple aggregations on multiple measures in one search:

| mstats avg(cpu.user.value) AS cpu max(app.request.latency) AS latency WHERE index=foo span=10s
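
Conceptually, the single search computes both aggregations per 10-second bucket in one pass over the data, instead of running two searches and joining their columns. The Python sketch below illustrates only that idea; it is not how mstats is implemented, and the sample points are invented for the example.

```python
from collections import defaultdict

SPAN = 10  # seconds per bucket, mirroring span=10s

# (epoch seconds, metric name, value); invented sample points.
points = [
    (100, "cpu.user.value", 20.0),
    (104, "cpu.user.value", 40.0),
    (103, "app.request.latency", 0.25),
    (108, "app.request.latency", 0.75),
    (112, "cpu.user.value", 50.0),
    (115, "app.request.latency", 0.5),
]

cpu_values = defaultdict(list)  # bucket start -> cpu samples
latency_max = {}                # bucket start -> largest latency seen

for ts, name, value in points:  # a single pass over all points
    bucket = ts - ts % SPAN     # align the timestamp to its bucket start
    if name == "cpu.user.value":
        cpu_values[bucket].append(value)
    elif name == "app.request.latency":
        latency_max[bucket] = max(value, latency_max.get(bucket, value))

# One output row per bucket, with both aggregations side by side.
rows = [
    {
        "_time": b,
        "cpu": sum(cpu_values[b]) / len(cpu_values[b]) if b in cpu_values else None,
        "latency": latency_max.get(b),
    }
    for b in sorted(set(cpu_values) | set(latency_max))
]

for row in rows:
    print(row)
```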

We also added a new SPL command called the “mcollect” command, which helps extract metrics from log events. This can be useful for converting your historical log data into metrics so you can take advantage of the massive performance increase!

To understand Metrics in more detail, let us look at some sample data on Airline On-Time Performance, which is made available by the Bureau of Transportation Statistics and contains departure and arrival data for all scheduled nonstop flights within the United States of America. This data has been indexed into a Splunk event index, and our users wish to analyze it to determine the routes with the lowest average delay.

To address this use case, we’ll first create a new metrics index called “arrDelayMetric” that will help answer queries with sub-second latency.

While most customers use traditional methods to populate the metrics data store (like CollectD or StatsD), they also rely heavily on log events as an important source of metrics data. Logs are considered a goldmine of information. Each log line can be instrumented to capture a variety of information, so it becomes imperative to leverage this data. The “mcollect” command is a huge differentiator that allows users to extract this information.

The next step is to extract the relevant fields from the log events and import them into our metrics index. We can perform this action seamlessly using the “mcollect” command.

A measurement (metric event) has four main components: a timestamp, a metric name, a numeric value, and dimensions. Splunk stores the name of the measurement in metric_name and its floating-point value in _value. Dimensions provide additional information about the metric being measured; in this case they are fields such as the destination (Dest), the origin, and the carrier (UniqueCarrier).
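
To make the shape of a measurement concrete, the Python sketch below models the conversion described above: one log event becomes one data point with a timestamp, a metric_name, a numeric _value, and a set of dimensions. It only illustrates the data model and is not how “mcollect” is implemented. The `ArrDelay`, `Origin`, and `TailNum` field names, the metric name `airline.arrdelay`, and the `event_to_metric` helper are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Sequence


@dataclass
class MetricPoint:
    """One measurement: timestamp, metric name, numeric value, dimensions."""
    time: float        # epoch seconds
    metric_name: str
    value: float       # plays the role of the _value field
    dimensions: Dict[str, str] = field(default_factory=dict)


def event_to_metric(event: Mapping[str, str],
                    metric_name: str,
                    value_field: str,
                    dimension_fields: Sequence[str]) -> MetricPoint:
    """Turn one parsed log event into a metric point (hypothetical helper)."""
    return MetricPoint(
        time=float(event["_time"]),
        metric_name=metric_name,
        value=float(event[value_field]),
        # Keep only the requested dimensions that are present in the event.
        dimensions={k: event[k] for k in dimension_fields if k in event},
    )


# A made-up flight record; the field names are assumed for illustration.
sample_event = {
    "_time": "1514764800",
    "ArrDelay": "12",
    "Origin": "SFO",
    "Dest": "LAX",
    "UniqueCarrier": "WN",
    "TailNum": "N123",   # not requested as a dimension, so it is dropped
}

point = event_to_metric(sample_event, "airline.arrdelay", "ArrDelay",
                        ["Origin", "Dest", "UniqueCarrier"])
print(point)
```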

The question we want to answer is which carrier has the shortest delay between San Francisco (SFO) and Los Angeles (LAX).
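
Answering this question amounts to filtering measurements on two dimensions (origin and destination) and averaging the value by a third (carrier). The Python sketch below shows that logic on a handful of invented data points. The delay values are illustrative only and are not taken from the BTS data set, and `average_delay_by_carrier` is a hypothetical helper, not a Splunk API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (origin, destination, carrier, arrival delay in minutes); invented values.
Point = Tuple[str, str, str, float]


def average_delay_by_carrier(points: Iterable[Point],
                             origin: str, dest: str) -> Dict[str, float]:
    """Average the delay per carrier for a single origin/destination pair."""
    delays: Dict[str, List[float]] = defaultdict(list)
    for o, d, carrier, delay in points:
        if o == origin and d == dest:      # filter on the two dimensions
            delays[carrier].append(delay)  # group by the carrier dimension
    return {c: sum(v) / len(v) for c, v in delays.items()}


points = [
    ("SFO", "LAX", "WN", 4.0),
    ("SFO", "LAX", "WN", 8.0),
    ("SFO", "LAX", "AA", 20.0),
    ("SFO", "LAX", "AA", 30.0),
    ("SFO", "JFK", "AA", 5.0),  # different route, excluded by the filter
]

averages = average_delay_by_carrier(points, "SFO", "LAX")
best = min(averages, key=averages.get)  # carrier with the lowest average
print(averages, best)
```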

The resulting chart shows that carrier ‘WN’ has the shortest delay and ‘AA’ has the longest delay between SFO and LAX. This is good information, but what if I want to investigate the root cause of this delay for AA? We can do this using correlations.

Splunk allows users to correlate metrics data with event index data, so we can look at the data in the event index for more context. It turns out that the delays have been primarily due to late aircraft arrival.

The Metrics Data Store has been designed to handle large volumes of data arriving with varied latency. The enhanced metrics data store, now available with Splunk Enterprise version 7.1, colocates time-series, making searches about 7-10x faster than version 7.0!

Posted by Hema Mohan
