
Anomaly detection identifies data points or events that deviate from normal behavior.
If you think of this in the context of continuous time-series datasets, the normal or expected value is the baseline, and the limits around it represent the tolerance associated with the variance. If a new value deviates above or below these limits, that data point can be considered anomalous.
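To make the baseline-and-limits idea concrete, here's a minimal sketch in Python: it builds a rolling baseline from recent history and flags any point that falls outside baseline ± k standard deviations. The window size and k are illustrative choices, not prescribed values.

```python
import statistics

def flag_anomalies(series, window=5, k=2.0):
    """Flag points outside a rolling baseline +/- k standard deviations.

    `window` and `k` are illustrative choices, not prescribed values.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        baseline = statistics.mean(history)        # expected value
        tolerance = k * statistics.stdev(history)  # variance-based limits
        if abs(series[i] - baseline) > tolerance:
            anomalies.append(i)
    return anomalies

data = [10, 11, 10, 12, 11, 10, 30, 11, 10, 12]
print(flag_anomalies(data))  # index 6 (value 30) deviates beyond the limits
```

This is the simplest possible version of the idea; production models replace the rolling mean with something more robust, but the baseline/tolerance structure is the same.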
Anomaly detection is a key use case for machine learning algorithms, and one that might seem like magic. We know, of course, that accurate anomaly detection relies on a combination of historical data and ongoing statistical analysis. Importantly, these models depend heavily on the quality and size of the underlying data, both of which directly affect the accuracy of alerting.
Let’s discuss a few common challenges with anomaly detection and how to solve them.
Challenge #1: Data quality
When building an anomaly detection model, one primary question you may have is:
“Which algorithm should I use?” This greatly depends on the type of problem you're trying to solve, of course, but one thing to consider is the underlying data.
Data quality — that is, the quality of the underlying dataset — is going to be the biggest driver in creating an accurate usable model. Data quality problems can include:
- Null data or incomplete datasets
- Inconsistent data formats
- Duplicate data
- Different scales of measurement
- Human error
(Read our full data quality explainer.)
How to solve data quality issues
So, how do you improve data quality? Here are some best practices:
- Discard or fill null values. Common fill values include the expected value or the median of the dataset.
- Standardize all data formats prior to fitting your model.
- Remove duplicate data based on timestamps, assuming a time-series dataset.
- Pre-process your input features into a standard scale before building a model.
- Reduce dependency on data that is manually entered by humans.
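The first four practices above can be sketched in a few lines of plain Python. The records here are hypothetical, and the cleaning order (dedupe, fill, scale) is one reasonable choice, not the only one:

```python
from statistics import median

# Hypothetical raw time-series records: (timestamp, value); None marks a null reading.
raw = [
    (1, 10.0), (2, None), (3, 12.0), (3, 12.0),  # duplicate timestamp
    (4, 11.0), (5, 200.0),
]

# 1. Remove duplicates based on timestamp (keep the first occurrence).
seen, deduped = set(), []
for ts, val in raw:
    if ts not in seen:
        seen.add(ts)
        deduped.append((ts, val))

# 2. Fill null values with the median of the observed values.
observed = [v for _, v in deduped if v is not None]
fill = median(observed)
filled = [(ts, v if v is not None else fill) for ts, v in deduped]

# 3. Pre-process all values into a standard 0-1 range (min-max scaling).
values = [v for _, v in filled]
lo, hi = min(values), max(values)
scaled = [(ts, (v - lo) / (hi - lo)) for ts, v in filled]
print(scaled)
```

In practice you'd reach for a dataframe library for this, but the steps are the same: deduplicate, fill, then scale, all before fitting the model.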
Challenge #2: Training sample sizes
Having a large training set is important for many reasons. If the training set is too small, then…
- The algorithm doesn’t have enough exposure to past examples to build an accurate representation of the expected value at a given time.
- Anomalies will skew the baseline, which will affect the overall accuracy of the model.
Seasonality is another common problem with small sample sets. Not every day or week is the same, which is why having a large enough sample dataset is important. Customer traffic volumes may spike during the holiday season, or could significantly drop depending on the line of business. It’s important for the model to see data samples for multiple years so it can accurately build and monitor the baseline during common holidays.
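One simple way to account for seasonality, sketched below with hypothetical traffic counts, is to build a separate baseline per season (here, per month) from multiple years of history, so a recurring December spike is expected rather than flagged:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly traffic counts keyed by (year, month); December spikes
# every year, so a single global baseline would mislabel it as anomalous.
traffic = {
    (2021, 11): 100, (2021, 12): 300,
    (2022, 11): 110, (2022, 12): 320,
    (2023, 11): 105, (2023, 12): 310,
}

# Group observations by month across years, then average per month.
by_month = defaultdict(list)
for (year, month), value in traffic.items():
    by_month[month].append(value)

baseline = {month: mean(values) for month, values in by_month.items()}
print(baseline)  # December's baseline reflects the expected holiday spike
```

With only one year of data, the model has a single December observation and no way to tell a seasonal spike from an anomaly; multiple years are what make the per-season baseline trustworthy.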
How to solve for lack of dataset size
Unfortunately, thin datasets are the most difficult problem to solve. You’ll have to capture real data in the wild to get an accurate understanding of what’s considered normal. You could build synthetic datasets by extrapolating from current datasets, but this will likely lead to overfitting.
Challenge #3: False alerting
Anomaly detection is an excellent tool in a dynamic environment because it can learn from the past to identify expected behavior and flag anomalous events. But what happens when your model continuously generates false alerts and is consistently wrong?
It’s hard to gain trust from skeptical users and easy to lose it — which is why it’s important to ensure a balance in sensitivity.
How to solve for model sensitivity
A noisy anomaly detection model may technically be correct in its alerting, but its alerts can be written off as noise when reviewed manually. A common cause is model sensitivity: if the limits are too tight around the baseline, normal variance will regularly cross those limits.
To solve this, you could:
- Increase the confidence interval to widen those limits.
- Adjust your alert to look for 1.5x or 2x the original limit.
Another way to fix false alerting is to increase the sample size the algorithm uses to build the model. More examples of historical data should translate into higher accuracy and fewer false alerts.
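The two limit-widening options above amount to scaling the tolerance band. A minimal sketch, where the multiplier k plays the role of the confidence interval width:

```python
from statistics import mean, stdev

def limits(history, k=2.0):
    """Return (lower, upper) alert limits as baseline +/- k standard deviations.

    Raising `k` (e.g., from 2.0 to 3.0) widens the limits and reduces
    false alerts, at the cost of missing smaller deviations.
    """
    baseline = mean(history)
    spread = stdev(history)
    return baseline - k * spread, baseline + k * spread

history = [10, 11, 10, 12, 11, 10, 12, 11]
tight = limits(history, k=2.0)
wide = limits(history, k=3.0)  # widened limits: fewer, higher-confidence alerts
print(tight, wide)
```

The trade-off is explicit in k: a wider band suppresses noise but also delays detection of genuine, smaller anomalies, so tune it against how much alert fatigue your reviewers can tolerate.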
Challenge #4: Imbalanced distributions
Another method of building an anomaly detection model would be to use a classification algorithm to build a supervised model. This supervised model will require labeled data to understand what is good or bad.
A common problem with labeled data is distribution imbalance. Systems are healthy most of the time, which means 99% of the labeled data may be skewed toward the good state. Because of this natural imbalance, the training set may not have enough examples to learn and associate with the bad state.
How to solve imbalance
This is another problem that can be hard to solve: because of the imbalanced distribution, you may simply not have enough labeled examples of the bad (anomalous) state to learn from. A common technique is to scale down the number of good examples so the classes are more balanced. Ideally, you should retain enough bad (anomalous) examples that the model can accurately learn and reinforce their behavior.
Once you identify an anomaly
Once an anomaly has been identified, it’s important to act on that insight and quantify the impact of the anomaly. A good strategy is to:
- Find the difference between the limits and the anomalous value and compare it against past anomalies. This will identify the severity, which helps score the anomaly and prioritize urgency.
- Once the severity is understood, correlate this anomalous value with known changes or incidents in the environment to gain more context around underlying problems within the environment.
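The severity-scoring step can be sketched as follows. The scoring convention here (distance beyond the limit, measured in limit widths, then ranked against past anomalies) is a hypothetical one, chosen only to illustrate the idea:

```python
def severity(value, baseline, upper_limit):
    """Score how far an anomalous value sits beyond the upper limit,
    in units of the limit width (a hypothetical scoring convention)."""
    return (value - upper_limit) / (upper_limit - baseline)

def urgency(score, past_scores):
    """Rank the new anomaly against past ones: the fraction of past
    anomalies it meets or exceeds, usable as a prioritization percentile."""
    if not past_scores:
        return 1.0
    return sum(s <= score for s in past_scores) / len(past_scores)

past = [0.2, 0.5, 1.1, 0.8]                                 # prior anomaly scores
score = severity(value=150, baseline=100, upper_limit=120)  # 1.5 limit-widths over
print(score, urgency(score, past))  # exceeds all past anomalies -> urgency 1.0
```

Correlating that score with known changes or incidents is then a lookup against your change log or incident timeline, which is environment-specific and not shown here.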
Anomaly detection may seem like magic — but it’s all about using historical data to identify patterns, applying statistics to build limits around those patterns, and alerting when something is not normal. These models are highly dependent on data quality and sample size, both of which affect the overall alerting.
If you are able to ensure good-quality datasets over a long period of time, then you, too, can build reliable, production-quality machine learning models.
This article was written by Steve Koelpin. Steve is a former Splunk professional services consultant and 5x Splunk Trust MVP. He specializes in Splunk IT Service Intelligence, Splunk Machine Learning Toolkit, and general Splunk development. While not behind the keyboard, he is best known as dad.
This posting does not necessarily represent Splunk's position, strategies or opinion.