May 20, 2016

Reducing Alert Noise: Transformations and Dynamic Thresholds

By Joe Ross

Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.

This is the second in a series of posts on using analytics both to compose the most salient signals out of raw metrics and to configure useful alerts on them.

Part 1: Static Thresholds, Durations, and Transformations
Part 2: Transformations and Dynamic Thresholds
Part 3: Ranges for Firing and Clearing
Part 4: Rates of Change

As discussed in the first post – Static Thresholds, Durations, and Transformations – we’ve found that duration conditions and simple transformations greatly reduce the number of false alerts we receive from using static thresholds. In this post we will discuss characteristics of a signal which may be considered suspicious but are missed by duration conditions (“false negatives”). Here is an example.

Mean Job Wait Time - Raw Metric

The metric represents the mean time a job spends on a queue across a number of machines running the same service. Analysis revealed that this spiky behavior was due to the machine with the longest queue wait time failing to report its data in time for the job to proceed. (Deciding when to proceed with analysis before all machines have reported results, and how to handle missing data, are separate topics.) In real time, we want to be alerted to this kind of spikiness so we can react: in this case, we want to be pointed to the troublesome machine and determine whether it needs to be restarted. In our example, the threshold and duration condition fail to fire, yet the queue wait time is reaching levels that are undesirable for the end-user experience. The SignalFx analytics engine supports transformations, which allow us to be alerted to the spiky behavior even when duration conditions are not met.
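To make the false negative concrete, here is a minimal Python sketch of a static threshold with a duration condition. The function name, the sample values, and the threshold of 8 are illustrative assumptions, not data from the chart above or part of SignalFx.

```python
def fires_with_duration(values, threshold, duration):
    """Return True if values stay above threshold for `duration` consecutive points."""
    run = 0
    for v in values:
        # Extend the current run while above the threshold, otherwise reset it.
        run = run + 1 if v > threshold else 0
        if run >= duration:
            return True
    return False


# Hypothetical wait times, one point per minute.
spiky = [2, 9, 2, 10, 3, 11, 2, 12, 3, 12]      # spikes alternate with normal values
sustained = [2, 3, 9, 10, 11, 12, 12, 3, 2, 2]  # one sustained excursion

spiky_fires = fires_with_duration(spiky, threshold=8, duration=3)
sustained_fires = fires_with_duration(sustained, threshold=8, duration=3)
print(spiky_fires, sustained_fires)
```

The spiky series crosses the threshold five times and climbs higher each time, but it never stays above it for three points in a row, so only the sustained series fires.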

The Rolling Maximum

The first transformation that captures the essential upward trend in this example is the rolling maximum. Whereas the rolling mean replaces a signal with its average value over a window, the rolling maximum replaces a signal with the largest value in the window. The rolling maximum does not attempt to approximate the true signal. Instead, it captures trends in the worst-case scenario (assuming high levels are bad). On this chart the rolling maximum is shown along with the original signal.

Mean Job Wait Time - Raw and Rolling Max

Since the rolling maximum is a “pessimistic summary” of the window, thresholds based on rolling maximum transformed signals should be set at levels higher than thresholds based on the original signal, and the duration required must be longer than the window size (since a single elevated value determines the rolling maximum for a period equal to the window size).
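The following Python sketch shows the rolling maximum next to the rolling mean, and why the duration has to exceed the window size. The helper names, sample data, and window of 3 points are assumptions chosen for illustration.

```python
from collections import deque


def rolling(values, window, summarize):
    """Apply `summarize` to the trailing `window` points at each step."""
    buf = deque(maxlen=window)  # oldest point is dropped automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(summarize(buf))
    return out


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical spiky signal: the spikes grow while the troughs stay low.
spiky = [2, 9, 2, 10, 3, 11, 2, 12, 3, 12]
rolling_max = rolling(spiky, window=3, summarize=max)
rolling_mean = rolling(spiky, window=3, summarize=mean)

# A single elevated value holds the rolling maximum up for a full window.
one_spike = [1, 1, 10, 1, 1, 1, 1]
one_spike_max = rolling(one_spike, window=3, summarize=max)

print(rolling_max)
print(one_spike_max)
```

In the spiky series the rolling maximum rises steadily and never drops back to the troughs, which is the upward trend a duration condition can act on. In the single-spike series the value 10 persists for exactly three points, so a duration of three points or fewer would fire on one isolated spike.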

Rate of Change

There is a second transformation that captures the spiky behavior: the absolute value of the rate of change. We have two approaches for building a detector based on the absolute value of the rate of change of a signal (the “transformed signal”). The first is to manually inspect historical values of the transformed signal and decide on a threshold and duration; the other is to use further analytics to compare the very recent history of the transformed signal to its semi-recent past. This comparison is achieved by summarizing the semi-recent past via some statistical functions. For example, we could require that the absolute value of the rate of change be above the 99th percentile (calculated over the last day) for 5 minutes. This is one way of capturing that the last 5 minutes were very different from the preceding 24 hours.

In SignalFx, we can construct the signal as follows:

Absolute Value Rate of Change Calculation

Then the detector can be set to fire when J is above K for a duration of 5 minutes, where J is the absolute value of the rate of change and K is its 99th percentile over the last day. The absolute value of the rate of change is large when the signal oscillates wildly between small and large values, but note that it is also large when the signal experiences steep ascent or descent. This is unlikely to be a problem from the standpoint of monitoring the signal: sustained steep descent is rare for metrics bounded from below, and sustained steep ascent is also worth detecting.
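The same logic can be sketched outside SignalFx in plain Python. This is not SignalFlow and not the product's implementation. It assumes one data point per minute (so a day is 1440 points), a nearest-rank percentile, and a baseline that excludes the current point; all function names and sample values are hypothetical.

```python
import math


def abs_rate_of_change(values):
    """Absolute difference between consecutive points."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


def percentile(xs, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def detect(values, baseline_window, pct, duration):
    """Return indices of the transformed signal at which the detector fires.

    J is the absolute rate of change. K is the `pct` percentile of J over the
    preceding `baseline_window` points. The detector fires once J has been
    above K for `duration` consecutive points.
    """
    j = abs_rate_of_change(values)
    fired, run = [], 0
    for i in range(baseline_window, len(j)):
        k = percentile(j[i - baseline_window:i], pct)  # dynamic threshold
        run = run + 1 if j[i] > k else 0
        if run >= duration:
            fired.append(i)
    return fired


# One calm day at 1-minute resolution, then wild oscillation or a single spike.
calm = [5 + (i % 2) for i in range(1441)]
wild = [30, 4, 35, 3, 40, 2]
single_spike = [30, 5, 6, 5, 6, 5]

fired_on_wild = detect(calm + wild, baseline_window=1440, pct=99, duration=5)
fired_on_single = detect(calm + single_spike, baseline_window=1440, pct=99, duration=5)
print(fired_on_wild, fired_on_single)
```

The wild oscillation keeps J above K for five consecutive minutes and fires, while the isolated spike produces only two large changes and does not.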

Dynamic Thresholds

Whereas in the last post we focused on improving detectors based on static thresholds, in this example we’re employing a dynamic threshold (namely, the signal K) — one that changes with time.

We may expect a service owner to be familiar with the basic performance profile (e.g. known good envelope for utilization of CPU, memory, and disk) of the service, so for these metrics we can construct high-quality detectors based on static thresholds. On the other hand, the distribution of the absolute value of the rate of change of some metric is more obscure. How can we establish a baseline (a sense of the range of typical values) for a complicated signal? Only by observing those values over time do we gain more information about the distribution. A value above the last day’s 99th percentile is extreme; if we observe values above the last day’s 99th percentile for a duration of 5 minutes, we have an indication the signal is changing state. This logic applies to any signal, but it is particularly valuable for complicated derived signals whose typical values we have no direct experience of.

Moreover, the baseline is continuously updated with new data. The performance profile of a service may change with regularly deployed code changes, or when an upstream service is altered. A dynamic threshold does not need to be manually reset in response to such events. It is designed precisely to continuously capture the new normal.
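A small Python sketch of this self-updating behavior follows. As a simplification it applies the dynamic threshold to the raw signal rather than to a derived one. The data, the 20-point window, the 90th percentile, and the static threshold of 15 are all illustrative assumptions.

```python
import math


def rolling_percentile(values, window, pct):
    """Nearest-rank percentile of the `window` points preceding each index.

    Returns None until a full window of history is available.
    """
    rank = max(1, math.ceil(pct * window / 100))
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
            continue
        ordered = sorted(values[i - window:i])
        out.append(ordered[rank - 1])
    return out


# Hypothetical metric: steady near 10, then a deploy shifts the normal level to about 20.
before = [10 + (i % 3) for i in range(50)]
after = [20 + (i % 3) for i in range(50)]
signal = before + after

static_threshold = 15
dynamic = rolling_percentile(signal, window=20, pct=90)

static_breaches = [i for i, v in enumerate(signal) if v > static_threshold]
dynamic_breaches = [
    i for i, (v, k) in enumerate(zip(signal, dynamic))
    if k is not None and v > k
]
print(len(static_breaches), dynamic_breaches)
```

After the shift at index 50, the static threshold is breached at every subsequent point until someone resets it by hand. The dynamic threshold is breached only while the new values are still entering the window, then settles at the new level.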
