Every monitoring team wants the holy grail of thresholds built into their overall monitoring strategy. These thresholds are applied to datasets in order to:
- Define problems within the system.
- Identify the severity of that problem.
When defining thresholds, know that they will likely change over time, and understand why making them adaptive matters: adaptive thresholds can absorb that change and remain effective.
Common types of thresholds
The two most common types of thresholds are static and adaptive:
- Static thresholds, as the name implies, are unchanging and have a static number associated with each severity.
- Adaptive thresholds recalculate over time as the underlying data changes, so they stay accurate as "normal" shifts.
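The difference can be sketched in a few lines. This is an illustrative example, not a Splunk API: the variable names, the 3-sigma rule, and the sample data are all assumptions.

```python
from statistics import mean, stdev

# Hypothetical hourly error counts from recent history (illustrative data).
recent_values = [12, 15, 11, 14, 13, 16, 12, 15]

STATIC_CRITICAL = 50  # a static threshold: one fixed number per severity

def static_breach(value):
    # Static check: the threshold never changes.
    return value >= STATIC_CRITICAL

def adaptive_critical(history, sigmas=3):
    # Adaptive check: the threshold is recomputed from recent history,
    # here as mean + N standard deviations (an assumed rule of thumb).
    return mean(history) + sigmas * stdev(history)

print(static_breach(55))                          # True
print(round(adaptive_critical(recent_values), 1)) # ≈ 18.8
```

As the history window moves, `adaptive_critical` returns a different value, while `STATIC_CRITICAL` stays fixed until someone edits it.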
In this article, let’s explore some tips for configuring adaptive thresholding.
High-value use cases for adaptive thresholding
While you may want to apply adaptive thresholds to every server or service so nothing is missed, doing so can backfire and generate far more noise than value.
A smarter approach is to set adaptive thresholds on business-aligned metrics for core systems, such as customer traffic patterns or login attempts. Adaptive thresholds must be recalculated regularly to keep up with changing conditions in the environment, so they can be resource-expensive depending on how much data is used to calculate them.
Now, let’s turn to the tips.
Take a hybrid approach
You can use a hybrid approach of static and adaptive thresholding to minimize resource usage while maximizing monitoring. This includes:
- Applying adaptive thresholding to the aggregate or service level.
- Applying static thresholds to the underlying entities that represent that aggregate view.
As an example, adaptive thresholding may detect a large decrease in traffic at the aggregate level; you can then use the static thresholds on the underlying servers to quickly identify which one is lower than the others.
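The hybrid pattern might look like the following sketch. Everything here is assumed for illustration: the server names, the 3-sigma aggregate rule, and the per-server static floor.

```python
from statistics import mean, stdev

# Illustrative traffic counts; server names are hypothetical.
history_aggregate = [1000, 980, 1020, 990, 1010, 1005, 995, 1000]
current_per_server = {"web-01": 240, "web-02": 250, "web-03": 60, "web-04": 245}

# Adaptive check at the aggregate level: flag a drop below mean - 3 sigma.
current_total = sum(current_per_server.values())
floor = mean(history_aggregate) - 3 * stdev(history_aggregate)

if current_total < floor:
    # Static check on the underlying entities to pinpoint the culprit.
    STATIC_MIN_PER_SERVER = 150
    low = [s for s, v in current_per_server.items() if v < STATIC_MIN_PER_SERVER]
    print("aggregate drop; low servers:", low)
```

Only the single aggregate series needs the (more expensive) adaptive calculation; the per-server floors are cheap static comparisons.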
Use the past to determine ‘normal’
When configuring thresholds, it is common to look back at the last 24 hours or 7 days to identify what is normal, then set that value with some tolerance between each severity.
A better method would be to compare the same day in previous weeks and months to determine what is normal for that day, and set thresholds based on that.
In a typical weekly traffic pattern, weekends have very low values compared to weekdays, and not all weekdays are the same. Time of day must also be taken into account, as volume peaks and then falls back within a one-day period.
This seasonality can also occur at the month or year level as it's common to have a busy season, which can greatly change the normalcy of your baseline and thresholds.
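One simple way to capture this day-of-week and time-of-day seasonality is to key the baseline on both. A minimal sketch, with hypothetical sample data standing in for what you would pull from your metrics store:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (day_of_week, hour, value) samples from prior weeks;
# 0 = Monday, 5 = Saturday. In practice these come from historical data.
samples = [
    (0, 9, 120), (0, 9, 130), (0, 9, 125),   # Mondays at 09:00
    (5, 9, 30),  (5, 9, 35),  (5, 9, 28),    # Saturdays at 09:00
]

# Baseline keyed by (day-of-week, hour) so weekends and peak hours
# each get their own notion of "normal".
baseline = defaultdict(list)
for dow, hour, value in samples:
    baseline[(dow, hour)].append(value)

expected = {k: mean(v) for k, v in baseline.items()}
print(expected[(0, 9)])  # 125 -- Monday-morning normal
print(expected[(5, 9)])  # 31  -- Saturday-morning normal
```

A Monday-morning value of 31 would be alarming, while the same value on Saturday morning is perfectly normal; a single flat baseline cannot make that distinction.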
Account for entropy
When you think of entropy, you may think of energy or the laws of thermodynamics, but the concept applies to data as well. In a time-series thresholding sense, we can define entropy as a gradual decline into disorder: data changes over time, and thresholds should change in response.
Misconfigured thresholds are the biggest driver of false-positive alerts, which cause end users to lose trust in the monitoring. Once trust is lost, it's easy to miss real impact to the environment, making adaptive thresholding counter-productive to the overall monitoring strategy.
Once deployed, verify that thresholding and alerting behave as intended, and use post-mortems as an opportunity to adjust thresholds. You can do this by comparing historical data to more recent data to see whether there are significant changes your adaptive thresholding may not yet have learned.
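That historical-versus-recent comparison can be as simple as a relative-change check. This is a generic drift sketch, not a Splunk feature; the data and the 20% cutoff are assumptions.

```python
from statistics import mean

# Illustrative daily volumes: an older window and a recent one.
historical = [100, 105, 98, 102, 101, 99, 103]
recent = [140, 150, 145, 148, 152, 147, 149]

def drift_ratio(old, new):
    # Relative change of the recent mean versus the historical mean.
    return (mean(new) - mean(old)) / mean(old)

ratio = drift_ratio(historical, recent)
if abs(ratio) > 0.20:  # a 20% shift suggests the baseline needs refreshing
    print(f"significant drift: {ratio:+.0%}")
```

A sustained shift like this, surfaced in a post-mortem, is a signal to shorten the lookback window or force a recalculation so the thresholds catch up.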
Use proven statistical methods to let data determine the thresholds
To avoid the work-intensive task of manually reconfiguring thresholds on a regular basis, Splunk IT Service Intelligence (ITSI) provides a user-friendly approach to creating adaptive thresholds at the aggregate service level.
This works by looking back a determined number of days and calculating the baseline that will identify normalcy relative to time-of-day and day-of-week. Once the baseline is calculated, it’s then possible to calculate the standard deviation and enable severity levels based on how much the expected value deviates from the baseline.
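In the spirit of that approach, the core calculation can be sketched as follows. This is not ITSI's implementation; the lookback data, the band widths, and the severity names are all assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical lookback window for one (day-of-week, hour) slot.
lookback = [200, 210, 190, 205, 195, 200, 208, 192]

mu, sigma = mean(lookback), stdev(lookback)  # baseline and spread

# Severity bands expressed in standard deviations from the baseline
# (the specific band widths here are assumed, not ITSI defaults).
def severity(value):
    z = abs(value - mu) / sigma
    if z >= 3:
        return "critical"
    if z >= 2:
        return "high"
    if z >= 1:
        return "medium"
    return "normal"

print(severity(201))  # normal
print(severity(240))  # critical
```

Because the bands are derived from the data rather than hand-picked numbers, recomputing the lookback window automatically moves every severity boundary with it.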
Stick to high-value use cases for adaptive thresholding
It's important to identify high-value use cases when applying adaptive thresholding: it can be compute-expensive and, depending on the sensitivity applied, can generate a lot of false alerts. In addition, adaptive thresholding requires some experimentation to verify it's working properly.
This is not a one-size-fits-all solution. It takes time and patience to get it working correctly. If you’re willing to put in that time, then the outcome will be worth it.
The original version of this blog was published by Steve Koelpin. Steve is a former Splunk professional services consultant and 5x Splunk Trust MVP. He specializes in Splunk IT Service Intelligence, Splunk Machine Learning Toolkit, and general Splunk development. While not behind the keyboard, he is best known as dad.
This posting does not necessarily represent Splunk's position, strategies or opinion.