Reducing Alert Noise: Transformations and Dynamic Thresholds

Monitoring modern infrastructure poses fundamentally new challenges in terms of data volume and velocity. Collecting the metrics emitted by machines is only the first step. To extract value from that data, we need a method of expressing service, team, or business goals against that stream of data. That method is analytics.

This is the second in a series of posts on how to use analytics both to compose the most salient signals to monitor from raw metrics and to configure useful alerts.

As discussed in the first post – Static Thresholds, Durations, and Transformations – we’ve found that duration conditions and simple transformations greatly reduce the number of false alerts produced by static thresholds. In this post, we discuss signal characteristics that may be considered suspicious but are missed by duration conditions (“false negatives”). Here is an example.

Mean Job Wait Time - Raw Metric

The metric represents the mean time a job spends on a queue across a number of machines running the same service. Analysis revealed that the spiky behavior was due to the machine with the longest queue wait time failing to report its data in time for the job to proceed. (Deciding when to proceed with analysis before all machines have reported results, and how to handle missing data, are separate topics.) In real time, we want to be alerted to this kind of spikiness so we can react: in this case, so we can be pointed to the troublesome machine and determine whether it needs to be restarted. In our example, the threshold and duration condition fail to fire, yet the queue wait time is reaching levels that degrade the end user experience. The SignalFx analytics engine supports transformations that allow us to be alerted to the spiky behavior even when duration conditions are not met.

The Rolling Maximum

The first transformation captures the essential upward trend in this example: the rolling maximum. Whereas the rolling mean replaces a signal with its average value over a window, the rolling maximum replaces it with the largest value in the window. The rolling maximum does not attempt to approximate the true signal; instead, it captures trends in the worst-case scenario (assuming high levels are bad). The chart below shows the rolling maximum along with the original signal.

Mean Job Wait Time - Raw and Rolling Max

Since the rolling maximum is a “pessimistic summary” of the window, thresholds on the transformed signal should be set higher than thresholds on the original signal, and the required duration should be longer than the window size (since a single elevated value determines the rolling maximum for a period equal to the window size).
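As a minimal sketch of this idea (in Python with pandas, not SignalFlow; the data and window size are hypothetical), the rolling maximum can be computed like so — note how a single spike keeps the transformed signal elevated for the full window:

```python
import pandas as pd

# Hypothetical mean job wait times (seconds), sampled once per minute.
wait = pd.Series([2, 3, 2, 30, 2, 3, 28, 2, 3, 2])

# Rolling maximum over a 3-sample window: each point becomes the
# largest value observed in the last 3 samples.
rolling_max = wait.rolling(window=3, min_periods=1).max()

# The two brief spikes (30 and 28) each persist in the rolling max
# for a full window, so spikiness the raw signal only touches
# briefly becomes a sustained elevation in the transformed signal.
print(rolling_max.tolist())
# → [2.0, 3.0, 3.0, 30.0, 30.0, 30.0, 28.0, 28.0, 28.0, 3.0]
```

This also illustrates why the duration condition should exceed the window size: here a lone spike sustains the rolling maximum for three samples on its own, so a shorter duration would fire on a single outlier.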

Rate of Change

There is a second transformation that captures the spiky behavior: the absolute value of the rate of change. We have two approaches for building a detector based on the absolute value of the rate of change of a signal (the “transformed signal”). The first is to manually inspect historical values of the transformed signal and decide on a threshold and duration; the other is to use further analytics to compare the very recent history of the transformed signal to its semi-recent past. This comparison is achieved by summarizing the semi-recent past via some statistical function. For example, we could require that the absolute value of the rate of change be above the 99th percentile (calculated over the last day) for 5 minutes. This is one way of capturing that the last 5 minutes were very different from the preceding 24 hours.

In SignalFx, we can construct the signal as follows:

Absolute Value Rate of Change Calculation

Then the detector can be set to fire when J is above K for a duration of 5 minutes. The absolute value of the rate of change is large when the signal oscillates wildly between small and large values, but note it is also large when the signal experiences steep ascent or descent. This is unlikely to be a problem from the standpoint of monitoring the signal: sustained steep descent is rare for metrics bounded from below, and sustained steep ascent is also worth detecting.
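The J-above-K logic can be sketched outside SignalFx as well. The following Python/numpy snippet (hypothetical data; J and K mirror the names in the screenshot above) builds a day of quiet history followed by 5 minutes of wild oscillation, and checks whether the detector would fire:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical signal: a quiet day of 1-minute samples, then
# 10 samples (last 5 covered below) oscillating between 10 and 40.
quiet = rng.normal(10, 0.5, 1440)
spiky = np.tile([10.0, 40.0], 5)
signal = np.concatenate([quiet, spiky])

# J: absolute value of the rate of change (per-sample difference).
j = np.abs(np.diff(signal))

# K: the 99th percentile of J over the preceding day of history.
k = np.percentile(j[:1440], 99)

# Fire only if J stays above K for the full 5-minute duration.
recent = j[-5:]
fires = bool((recent > k).all())
print(fires)
# → True: the oscillation produces |rate of change| = 30 per sample,
# far above the quiet day's 99th percentile.
```

The duration condition (`.all()` over the last 5 samples) is what prevents a single noisy difference from triggering the alert.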

Dynamic Thresholds

Whereas in the last post we focused on improving detectors based on static thresholds, in this example we’re employing a dynamic threshold (namely, the signal K) — one that changes with time.

We may expect a service owner to be familiar with the basic performance profile of the service (e.g., a known good envelope for utilization of CPU, memory, and disk), so for these metrics we can construct high-quality detectors based on static thresholds. On the other hand, the distribution of the absolute value of the rate of change of some metric is more obscure. How can we establish a baseline (a sense of the range of typical values) for a complicated signal? Only by observing those values over time do we gain information about the distribution. A value above the last day’s 99th percentile is extreme; if we observe values above the last day’s 99th percentile for a duration of 5 minutes, we have an indication that the signal is changing state. This logic applies to any signal, but it is particularly valuable for complicated derived signals for which we have no direct sense of what constitutes typical values.

Moreover, the baseline is continuously updated with new data. The performance profile of a service may change with regularly deployed code changes, or when an upstream service is altered. A dynamic threshold does not need to be manually reset in response to such events. It is designed precisely to continuously capture the new normal.
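A small sketch (Python/numpy; the metric, shift point, and window size are hypothetical) shows a trailing-percentile baseline absorbing a regime change on its own:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical metric whose level shifts after a deploy at sample 1440.
before = rng.normal(10, 1, 1440)
after = rng.normal(25, 1, 1440)
metric = np.concatenate([before, after])

# Dynamic threshold: 99th percentile over a trailing one-day window.
def trailing_p99(x, window=1440):
    return np.array([
        np.percentile(x[max(0, i - window):i + 1], 99)
        for i in range(len(x))
    ])

threshold = trailing_p99(metric)

# Shortly after the shift, the threshold still reflects the old regime;
# a day later, the window contains only post-deploy data and the
# threshold has moved to the new normal with no manual reset.
print(threshold[1445], threshold[-1])
```

Here the threshold near sample 1445 is still close to the old regime's tail (roughly the 99th percentile of values centered at 10), while by the end of the series it tracks the new level centered at 25.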


Posted by Joe Ross

Joe is a data scientist at SignalFx. Previously Joe worked at other startups and in academia as a professor and researcher in mathematics.