Anomaly detection identifies data points or events that deviate from normal behavior.
If you think of this in the context of continuous time-series data, the normal or expected value is the baseline, and the limits around it represent the tolerance for expected variance. If a new value deviates above or below these limits, that data point can be considered anomalous.
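To make this concrete, here’s a minimal sketch of a rolling baseline with tolerance limits, using pandas on synthetic data (the window size and the multiplier k are illustrative choices, not recommendations):

```python
import numpy as np
import pandas as pd

# Synthetic metric for illustration: a steady signal with one injected spike
rng = np.random.default_rng(42)
values = pd.Series(100 + 10 * rng.standard_normal(500))
values.iloc[250] = 180  # an obvious anomaly

# Baseline = rolling mean; tolerance = k * rolling standard deviation
window, k = 48, 3
baseline = values.rolling(window, min_periods=window).mean()
tolerance = k * values.rolling(window, min_periods=window).std()

# Flag anything that deviates above or below the tolerance band
anomalies = (values - baseline).abs() > tolerance
print(values[anomalies])
```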
Anomaly detection is a key use case for machine learning algorithms, and one that might seem like magic. In reality, accurate anomaly detection relies on a combination of historical data and ongoing statistical analysis. Importantly, these models are highly dependent on the quality and size of the data they’re trained on, which directly affects the reliability of the alerting.
Let’s discuss a few common challenges with anomaly detection and how to solve them.
When building an anomaly detection model, one primary question you may have is:
“Which algorithm should I use?” This greatly depends on the type of problem you're trying to solve, of course, but one thing to consider is the underlying data.
Data quality — that is, the quality of the underlying dataset — is going to be the biggest driver in creating an accurate, usable model. Data quality problems can include issues like missing values, duplicate records, inconsistent formats, and mislabeled events.
(Read our full data quality explainer.)
So, how do you improve data quality? Here are some best practices:
Having a large training set is important for many reasons. If the training set is too small, then…
Seasonality is another common problem with small sample sets. Not every day or week is the same, which is why having a large enough sample dataset is important. Customer traffic may spike during the holiday season or drop significantly, depending on the line of business. The model needs to see multiple years of data so it can accurately build and monitor the baseline through recurring events like holidays.
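One rough way to account for seasonality is to learn a separate baseline per calendar bucket. This sketch groups by day of week on synthetic multi-year data; the bucket choice and all values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily traffic with a weekday/weekend pattern
idx = pd.date_range("2022-01-01", periods=730, freq="D")
rng = np.random.default_rng(0)
traffic = pd.Series(
    1000 + 200 * (idx.dayofweek < 5) + 30 * rng.standard_normal(len(idx)),
    index=idx,
)
traffic.iloc[100] += 400  # inject an anomalous day

# Seasonal baseline: expected value and spread per day of week,
# so a quiet Sunday isn't misread as an anomaly
grouped = traffic.groupby(traffic.index.dayofweek)
baseline = grouped.transform("mean")
spread = grouped.transform("std")

anomalies = (traffic - baseline).abs() > 3 * spread
print(traffic[anomalies])
```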
Unfortunately, thin datasets are the most difficult problem to solve. You’ll have to capture real data in the wild to get an accurate understanding of what’s considered normal. You could build synthetic datasets by extrapolating from current data, but this will likely lead to overfitting.
Anomaly detection is an excellent tool in a dynamic environment because it can learn from the past to identify expected behavior and flag anomalous events. But what happens when your model continuously generates false alerts and is consistently wrong?
It’s hard to gain trust from skeptical users and easy to lose it — which is why it’s important to ensure a balance in sensitivity.
A noisy anomaly detection model may technically be right when it alerts on anomalies, yet still get written off as noise when reviewed manually. One reason for this is model sensitivity: if the limits are too tight around the baseline, normal variance will routinely cross them.
To solve this, you could lower the model’s sensitivity: widen the limits around the baseline so that normal variance stays within tolerance, as sketched below.
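As a hedged illustration, scikit-learn’s IsolationForest exposes a contamination parameter that acts as exactly this kind of sensitivity knob; the data below is synthetic and the values are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 1))  # mostly "normal" synthetic data

# contamination is the expected share of anomalies; lowering it
# loosens the limits and reduces the number of alerts
for contamination in (0.05, 0.01, 0.001):
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)  # -1 marks a flagged point
    print(f"contamination={contamination}: {(labels == -1).sum()} points flagged")
```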
Another way to reduce false alerting is to increase the sample size the algorithm uses to build the model. More historical data should translate to higher accuracy and fewer false alerts.
Another method of building an anomaly detection model is to use a classification algorithm to build a supervised model. A supervised model requires labeled data in order to learn what is good and what is bad.
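As a minimal sketch of that approach, here’s a RandomForestClassifier trained on synthetic labeled data (the features and the intentionally skewed 99:1 split are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
# Synthetic labeled data: "good" points cluster near 0, "bad" points are shifted
X_good = rng.normal(0.0, 1.0, size=(990, 3))
X_bad = rng.normal(4.0, 1.0, size=(10, 3))
X = np.vstack([X_good, X_bad])
y = np.array(["good"] * 990 + ["bad"] * 10)

# The classifier learns to separate the two labeled states
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict([[0.1, -0.2, 0.0], [4.2, 3.8, 4.1]]))  # expect: good, bad
```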
A common problem with labeled data is distribution imbalance. Systems spend most of their time in a good state, which means the labeled data may be 99% skewed toward good. Because of this natural imbalance, the training set may not have enough examples of the bad state to learn from.
This is another problem that can be hard to solve. A common workaround is to scale down the number of good-state examples (undersampling) so they are closer in number to the bad-state examples; a minimal sketch follows below. Ideally, you should have enough bad (i.e., anomalous) examples for the model to accurately learn and reinforce its behavior, but given the natural imbalance, you may simply not have enough labeled anomalies to work with.
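Here’s the undersampling sketch with pandas; the DataFrame, the label column, and the 3:1 target ratio are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical labeled dataset, heavily skewed toward "good"
df = pd.DataFrame({
    "latency_ms": [20, 22, 19, 21, 500, 18, 23, 480],
    "label": ["good", "good", "good", "good", "bad", "good", "good", "bad"],
})

bad = df[df["label"] == "bad"]
good = df[df["label"] == "good"]

# Scale down the majority "good" class toward the size of the "bad" class
good_down = good.sample(n=min(len(good), len(bad) * 3), random_state=0)
balanced = pd.concat([good_down, bad]).sample(frac=1, random_state=0)  # shuffle
print(balanced["label"].value_counts())
```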
Once an anomaly has been identified, it’s important to act on that insight and quantify the impact of that anomaly.
Anomaly detection may seem like magic, but it’s really about using historical data to identify patterns, applying statistics to build limits around those patterns, and alerting when something falls outside of normal. These models are highly dependent on data quality and sample size, which directly affect the reliability of the alerting.
If you’re able to ensure good-quality datasets over a long period of time, then you, too, can build reliable, production-quality machine learning models.