
Anomaly detection identifies data points or events that deviate from normal behavior.
If you think of this in the context of continuous time-series datasets, the normal or expected value is the baseline, and the limits around it represent the tolerance associated with the variance. If a new value deviates above or below these limits, that data point can be considered anomalous.
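To make the baseline-and-limits idea concrete, here's a minimal sketch in Python: it builds a rolling baseline from recent history and flags any point that falls outside baseline ± k standard deviations. The window size and k are illustrative choices, not prescribed values.

```python
import statistics

def flag_anomalies(series, window=5, k=2.0):
    """Flag points outside a rolling baseline +/- k standard deviations.

    `window` and `k` are illustrative choices, not prescribed values.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        baseline = statistics.mean(history)        # expected value
        tolerance = k * statistics.stdev(history)  # variance-based limits
        if abs(series[i] - baseline) > tolerance:
            anomalies.append(i)
    return anomalies

data = [10, 11, 10, 12, 11, 10, 30, 11, 10, 12]
print(flag_anomalies(data))  # index 6 (value 30) deviates beyond the limits
```

This is the simplest possible version of the idea; production models replace the rolling mean with something more robust, but the baseline/tolerance structure is the same.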
Anomaly detection is a key use case for machine learning algorithms, and one that might seem like magic. We know, of course, that accurate anomaly detection relies on a combination of historical data and ongoing statistical analysis. Importantly, these models depend heavily on the quality and size of the underlying data, both of which directly affect the accuracy of alerting.
Let’s discuss a few common challenges with anomaly detection and how to solve them.
Challenge #1: Data quality
When building an anomaly detection model, one primary question you may have is:
“Which algorithm should I use?” This greatly depends on the type of problem you're trying to solve, of course, but one thing to consider is the underlying data.
Data quality — that is, the quality of the underlying dataset — is going to be the biggest driver in creating an accurate usable model. Data quality problems can include:
- Null data or incomplete datasets
- Inconsistent data formats
- Duplicate data
- Different scales of measurement
- Human error
(Read our full data quality explainer.)
How to solve data quality issues
So, how do you improve data quality? Here are some best practices:
- Discard or fill null values. Common fill values include the expected value or the median of the dataset.
- Standardize all data formats prior to fitting your model.
- Remove duplicate data based on timestamps, assuming a time-series dataset.
- Pre-process your input features into a standard scale before building a model.
- Reduce dependency on data that is manually entered by humans.
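The first four practices above can be sketched in a few lines of plain Python. The records here are hypothetical, and the cleaning order (dedupe, fill, scale) is one reasonable choice, not the only one:

```python
from statistics import median

# Hypothetical raw time-series records: (timestamp, value); None marks a null reading.
raw = [
    (1, 10.0), (2, None), (3, 12.0), (3, 12.0),  # duplicate timestamp
    (4, 11.0), (5, 200.0),
]

# 1. Remove duplicates based on timestamp (keep the first occurrence).
seen, deduped = set(), []
for ts, val in raw:
    if ts not in seen:
        seen.add(ts)
        deduped.append((ts, val))

# 2. Fill null values with the median of the observed values.
observed = [v for _, v in deduped if v is not None]
fill = median(observed)
filled = [(ts, v if v is not None else fill) for ts, v in deduped]

# 3. Pre-process all values into a standard 0-1 range (min-max scaling).
values = [v for _, v in filled]
lo, hi = min(values), max(values)
scaled = [(ts, (v - lo) / (hi - lo)) for ts, v in filled]
print(scaled)
```

In practice you'd reach for a dataframe library for this, but the steps are the same: deduplicate, fill, then scale, all before fitting the model.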
Challenge #2: Training sample sizes
Having a large training set is important for many reasons. If the training set is too small, then…
- The algorithm doesn’t have enough exposure to past examples to build an accurate representation of the expected value at a given time.
- Anomalies will skew the baseline, which will affect the overall accuracy of the model.
Seasonality is another common problem with small sample sets. Not every day or week is the same, which is why having a large enough sample dataset is important. Customer traffic volumes may spike during the holiday season, or could significantly drop depending on the line of business. It’s important for the model to see data samples for multiple years so it can accurately build and monitor the baseline during common holidays.
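One simple way to account for seasonality, sketched below with hypothetical traffic counts, is to build a separate baseline per season (here, per month) from multiple years of history, so a recurring December spike is expected rather than flagged:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly traffic counts keyed by (year, month); December spikes
# every year, so a single global baseline would mislabel it as anomalous.
traffic = {
    (2021, 11): 100, (2021, 12): 300,
    (2022, 11): 110, (2022, 12): 320,
    (2023, 11): 105, (2023, 12): 310,
}

# Group observations by month across years, then average per month.
by_month = defaultdict(list)
for (year, month), value in traffic.items():
    by_month[month].append(value)

baseline = {month: mean(values) for month, values in by_month.items()}
print(baseline)  # December's baseline reflects the expected holiday spike
```

With only one year of data, the model has a single December observation and no way to tell a seasonal spike from an anomaly; multiple years are what make the per-season baseline trustworthy.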
How to solve for lack of dataset size
Unfortunately, thin datasets are the most difficult problem to solve. You’ll have to capture real data in the wild to get an accurate understanding of what’s considered normal. You could build synthetic datasets by extrapolating from current datasets, but this will likely lead to overfitting.
Challenge #3: False alerting
Anomaly detection is an excellent tool in a dynamic environment because it can learn from the past to identify expected behavior and flag anomalous events. But what happens when your model continuously generates false alerts and is consistently wrong?
It’s hard to gain trust from skeptical users and easy to lose it — which is why it’s important to ensure a balance in sensitivity.
How to solve for model sensitivity
A noisy anomaly detection model may technically be correct in its alerting, but its alerts can be written off as noise when reviewed manually. A common cause is model sensitivity: if the limits are too tight around the baseline, normal variance will regularly cross those limits.
To solve this, you could:
- Increase the confidence interval to widen those limits.
- Adjust your alert to look for 1.5x or 2x the original limit.
Another way to fix false alerting is to increase the sample size the algorithm uses to build the model. More examples of historical data should translate into higher accuracy and fewer false alerts.
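The two limit-widening options above amount to scaling the tolerance band. A minimal sketch, where the multiplier k plays the role of the confidence interval width:

```python
from statistics import mean, stdev

def limits(history, k=2.0):
    """Return (lower, upper) alert limits as baseline +/- k standard deviations.

    Raising `k` (e.g., from 2.0 to 3.0) widens the limits and reduces
    false alerts, at the cost of missing smaller deviations.
    """
    baseline = mean(history)
    spread = stdev(history)
    return baseline - k * spread, baseline + k * spread

history = [10, 11, 10, 12, 11, 10, 12, 11]
tight = limits(history, k=2.0)
wide = limits(history, k=3.0)  # widened limits: fewer, higher-confidence alerts
print(tight, wide)
```

The trade-off is explicit in k: a wider band suppresses noise but also delays detection of genuine, smaller anomalies, so tune it against how much alert fatigue your reviewers can tolerate.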
Challenge #4: Imbalanced distributions
Another method of building an anomaly detection model would be to use a classification algorithm to build a supervised model. This supervised model will require labeled data to understand what is good or bad.
A common problem with labeled data is distribution imbalance. Systems are healthy most of the time, which means 99% of the labeled data may be skewed toward the good state. Because of this natural imbalance, the training set may not have enough examples to learn and associate with the bad state.
How to solve imbalance
This is another problem that can be hard to solve: because of the imbalanced distribution, you may simply not have enough labeled examples of the bad (anomalous) state to learn from. A common technique is to scale down the number of good examples so the classes are more balanced. Ideally, you should retain enough bad (anomalous) examples that the model can accurately learn and reinforce their behavior.
Once you identify an anomaly
Once an anomaly has been identified, it’s important to act on that insight and quantify the impact of the anomaly. A good strategy is to:
- Find the difference between the limits and the anomalous value and compare it against past anomalies. This will identify the severity, which helps score the anomaly and prioritize urgency.
- Once the severity is understood, correlate this anomalous value with known changes or incidents in the environment to gain more context around underlying problems within the environment.
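The severity-scoring step can be sketched as follows. The scoring convention here (distance beyond the limit, measured in limit widths, then ranked against past anomalies) is a hypothetical one, chosen only to illustrate the idea:

```python
def severity(value, baseline, upper_limit):
    """Score how far an anomalous value sits beyond the upper limit,
    in units of the limit width (a hypothetical scoring convention)."""
    return (value - upper_limit) / (upper_limit - baseline)

def urgency(score, past_scores):
    """Rank the new anomaly against past ones: the fraction of past
    anomalies it meets or exceeds, usable as a prioritization percentile."""
    if not past_scores:
        return 1.0
    return sum(s <= score for s in past_scores) / len(past_scores)

past = [0.2, 0.5, 1.1, 0.8]                                 # prior anomaly scores
score = severity(value=150, baseline=100, upper_limit=120)  # 1.5 limit-widths over
print(score, urgency(score, past))  # exceeds all past anomalies -> urgency 1.0
```

Correlating that score with known changes or incidents is then a lookup against your change log or incident timeline, which is environment-specific and not shown here.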
Anomaly detection may seem like magic — but it’s all about using historical data to identify patterns, applying statistics to build limits around those patterns, and alerting when something is not normal. These models are highly dependent on data quality and sample size, both of which affect the overall alerting.
If you are able to ensure good-quality datasets over a long period of time, then you, too, can build reliable, production-quality machine learning models.
This article was written by Steve Koelpin. Steve is a former Splunk professional services consultant and 5x Splunk Trust MVP. He specializes in Splunk IT Service Intelligence, Splunk Machine Learning Toolkit, and general Splunk development. While not behind the keyboard, he is best known as dad.
This posting does not necessarily represent Splunk's position, strategies or opinion.