Published Date: March 14, 2023
Clustering is a machine learning technique in which data points are grouped together based on similar properties. It’s an exploratory data analysis approach that allows you to quickly identify linkage, or hidden relationships, between the data points in a dataset. Clustering is typically an unsupervised technique applied to unlabeled data, though semi-supervised variants exist that incorporate a small amount of labeled data.
To understand clustering, it helps to understand the two main methods of machine learning: supervised and unsupervised.
As its name indicates, supervised machine learning requires the presence of a supervisor, typically a data scientist, who guides the algorithm through the learning process. The supervisor provides the learning algorithm with training data that is “labeled” — tagged with the correct answer — to train it to accurately classify data or predict outcomes from new data.
In unsupervised machine learning, algorithms are provided with training data that isn’t labeled. Without known outcomes for data comparison, unsupervised machine learning algorithms must analyze similar data to find hidden patterns and sort the information according to similarities and differences, so the data can be further processed.
Clustering is critical for uncovering the hidden structure in large volumes of unlabeled data and has myriad applications across industries, ranging from anomaly detection to customer segmentation to the recommendation systems used by Netflix and Amazon. In the following sections, we’ll look at how clustering works and the various ways it’s used. We’ll also explain its many benefits and when it makes sense to use the technique.
What does clustering mean?
Clustering means grouping data points around their shared characteristics and similar features. Each group, called a cluster, is a subset of a larger dataset. Each data point within a cluster is closer to that cluster’s center than to other cluster centers in the dataset.
To visualize clustering, imagine you have 20 dogs — your dataset — and you want to sort them by age and weight. If you created a scatter plot of this data, you’d probably find that many of the dogs fit together into several subgroups. If you drew a circle around each of these groups, you’d find there is less distance between each dog and the center of its own group than between it and the centers of the surrounding groups. Each of these groups is an example of a data cluster.

In a clustering scatterplot diagram, each data point within a cluster is closer to its center than to other cluster centers in the dataset.
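As a rough sketch of this idea, the snippet below builds a synthetic version of the 20-dog dataset (the ages and weights are invented for illustration), clusters it with k-means, and verifies that each dog is closer to its own cluster center than to any other:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "20 dogs" dataset: columns are age (years) and weight (kg).
# The values are made up; three loose subgroups are baked in for illustration.
rng = np.random.default_rng(0)
dogs = np.vstack([
    rng.normal([1, 5], [0.5, 1.5], size=(7, 2)),   # puppies: young and light
    rng.normal([6, 10], [1.0, 2.0], size=(7, 2)),  # small adult dogs
    rng.normal([7, 30], [1.5, 4.0], size=(6, 2)),  # large adult dogs
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dogs)

# Each dog should be nearer to its own cluster center than to any other center.
dists = np.linalg.norm(dogs[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
assert (dists.argmin(axis=1) == kmeans.labels_).all()
print(kmeans.cluster_centers_)  # approximate (age, weight) center of each subgroup
```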
How does clustering work?
Clustering works by looking for relationships or trends in sets of unlabeled data that aren’t readily visible. The clustering algorithm does this by sorting data points into different groups, or clusters, based on the similarity of their features. The algorithm’s guiding principle is that data points with many similarities tend to be closer together in the same cluster, while data points with highly dissimilar features tend to be farther apart in separate clusters. These clusters can then be analyzed to uncover useful insights.
Customer segmentation is one of the most common problems addressed with clustering. All retail companies work with sales data that includes variables such as each customer’s name, age, and gender; the items sold in each sale; and the profit or loss made per sale, among other things. Companies can use this data to discover customer purchasing patterns such as what gender group spends more per purchase, what age group buys more of a certain product, or who is most likely to make an impulse purchase.
Once the data is clustered into separate groups, those groups can be analyzed to uncover trends and other insights. For example, you can determine which features the data points within a cluster share, as well as which features tend to drive two data points apart into different clusters. In this way, the retailer can quickly identify trends in the data that provide a deeper understanding of its customers’ purchasing behavior.
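A minimal sketch of this kind of segmentation, assuming a synthetic table of customer ages and average spend per purchase (the column names and values here are hypothetical), might look like:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical sales data: customer age and average spend per purchase.
rng = np.random.default_rng(1)
customers = pd.DataFrame({
    "age": np.concatenate([rng.normal(25, 4, 50), rng.normal(45, 5, 50), rng.normal(65, 6, 50)]),
    "avg_spend": np.concatenate([rng.normal(30, 8, 50), rng.normal(120, 25, 50), rng.normal(60, 15, 50)]),
})

# Scale the features so age and spend contribute comparably to distances.
X = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Summarize each segment to surface purchasing patterns.
print(customers.groupby("segment")[["age", "avg_spend"]].mean())
```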
What are the types of clustering methods?
The different types of clustering algorithms follow different rules for determining similarity among data points, but most adhere to one of the following models (a sketch comparing a representative algorithm from each family appears after this list):
- Connectivity models: These models build clusters based on the notion that the closer two data points are, the more similar they are to each other. There are two approaches to the connectivity model: in one, each data point starts in its own cluster, and the closest pairs of clusters are merged step by step; in the other, all data points start in a single large cluster, which is split step by step as the distances between points grow. The hierarchical clustering algorithm is an example of this model.
- Centroid models: Centroid-based algorithms define similarity by the closeness of a data point to the centroid of the clusters — each data point is assigned to a cluster based on its squared distance from the centroid. What distinguishes centroid models is that they require that the number of clusters be known before data is assigned. The k-means algorithm is the most widely used centroid-based clustering algorithm.
- Distribution models: Distribution-based clustering algorithms consider all data points to be part of a given cluster based on the probability that they belong to it. As the distance of a data point from the center of a cluster increases, the probability of it being a part of that cluster decreases. The expectation-maximization algorithm is based on this model.
- Density models: Density-based clustering algorithms identify areas of varied data-point density in the data space. They isolate these different density regions and assign the data points within each to the same cluster. In this model, outliers are not assigned to any cluster and are simply ignored. The DBSCAN and OPTICS clustering algorithms are popular examples of density models.
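To make the four families concrete, here is a small comparison sketch using scikit-learn, with one representative algorithm per model (the parameters are illustrative, not tuned):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# One representative algorithm per model family.
models = {
    "connectivity (hierarchical)": AgglomerativeClustering(n_clusters=3),
    "centroid (k-means)": KMeans(n_clusters=3, n_init=10, random_state=0),
    "density (DBSCAN)": DBSCAN(eps=0.7, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "->", len(set(labels) - {-1}), "clusters found")

# Distribution model: expectation-maximization fits a Gaussian mixture.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("distribution (EM) ->", gmm.predict(X)[:10])
```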
What is a clustering algorithm?
A clustering algorithm is a machine learning algorithm used to segment a dataset into groups of data points based on similar features. Clustering algorithms are commonly used in data science to segregate data so that trends and relationships can be more easily identified.
Clustering algorithms can be broadly broken down into two types:
- Hard clustering: In the hard clustering method, each data point belongs to exactly one cluster: it is either fully assigned to a cluster or excluded from it entirely.
- Soft clustering: In the soft clustering method, a data point can belong to more than one cluster. Rather than being assigned to a single cluster, each data point is assigned a probability, or likelihood, of being in each cluster, sometimes resulting in data points overlapping multiple clusters. The sketch below contrasts the two approaches.
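A minimal sketch of the difference, using k-means for hard assignments and a Gaussian mixture (via scikit-learn) for soft ones:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=7)

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])  # one integer label per point, e.g. [0 1 1 0 0]

# Soft clustering: each point gets a probability of belonging to each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5].round(3))  # one row per point, one column per cluster
```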
There are hundreds of different clustering algorithms that can be categorized into one of these two types. Some of the more popular include:
- K-means clustering: This is one of the simplest and most commonly used unsupervised learning algorithms. The “k” in k-means is the number of clusters, which the user chooses based on an initial observation of the dataset. The algorithm assigns each data point to the cluster whose center, or centroid, is closest to it, then recomputes each centroid as the mean of the points assigned to it. This process repeats for a given number of iterations or until the centroids stop changing (see the from-scratch sketch after this list).
- DBSCAN: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This algorithm groups data points together by finding neighborhoods whose density exceeds a particular threshold. The threshold is defined by two user-set parameters: how close data points must be to count as neighbors, and the minimum number of points required in a neighborhood. DBSCAN starts from a random “unvisited” data point and finds all data points within the user-defined vicinity. If there are enough neighbors in this vicinity, the original data point and all its neighbors are assigned to a cluster; if not, the original data point is marked as noise. This process continues until every data point is either part of a cluster or labeled as noise (a usage sketch follows this list).
- Mean-shift algorithm: This centroid-based algorithm finds dense areas of data points by iteratively shifting candidate center points toward the densest nearby region until they converge on cluster centers. Each data point is then assigned to the cluster of its nearest center. The mean-shift algorithm is similar to the k-means algorithm but has the advantage of not requiring the number of clusters to be specified ahead of time.
- Fuzzy c-means algorithm: This soft-clustering algorithm assigns a membership grade that indicates the degree to which each point belongs in each cluster. Points closer to the centroid of the cluster have stronger membership than points on the edge of the cluster. Fuzzy c-means clustering takes a similar approach to k-means, but because of this weighted membership, no data point is exclusively a member of any single cluster — hence the name “fuzzy.”
- Hierarchical algorithms: These connectivity-based clustering algorithms sort data into clusters based on a hierarchy of observed data similarity. They can be categorized into two types: divisive (top-down) clustering and agglomerative (bottom-up) clustering. Divisive clustering starts with all data points in a single cluster and repeatedly splits it into subclusters where the similarity between data points is lowest. Agglomerative clustering starts with every data point in its own cluster and merges the most similar clusters one by one into larger clusters (see the hierarchical sketch after this list).
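To ground the k-means description above, here is a minimal from-scratch sketch of the assign-then-update loop on synthetic data (a production implementation would add smarter initialization, such as k-means++):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # Assignment step: label each point with its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (keeping the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stop when centroids don't change
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic blobs; the recovered centers should land near (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```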
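For DBSCAN, the two user-set parameters map directly onto scikit-learn’s eps and min_samples arguments; the values below are illustrative, chosen for this synthetic two-moons dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that centroid-based methods handle poorly.
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps: how close points must be to count as neighbors;
# min_samples: the minimum neighborhood size needed to form a cluster.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters:", n_clusters, "| noise points:", int((db.labels_ == -1).sum()))
```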
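And for agglomerative hierarchical clustering, SciPy’s linkage function records the bottom-up merge history, which can then be cut at any level to yield a desired number of clusters (the three-blob dataset here is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three compact blobs along a line.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])

# Agglomerative (bottom-up): every point starts in its own cluster, and the
# most similar clusters are merged step by step; Z records the merge history.
Z = linkage(X, method="ward")

# Cut the hierarchy so that exactly three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # roughly 30 points per cluster
```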
How is clustering used?
Clustering is used in a variety of real-world applications across industries. Some of the most popular include:
- Anomaly detection: Clustering algorithms can find outlying data points in massive volumes of data that may indicate significant events, such as degrading system performance, financial fraud and cybersecurity threats.
- Customer segmentation: This kind of clustering analysis can be used to identify individuals with similar characteristics and behavior. This enables businesses to adapt their advertising and marketing to target customers more effectively, improve customer retention and achieve growth objectives.
- Exploratory data analysis: Clustering enables users to explore large volumes of data to uncover trends, relationships, differences and similarities between data points. It provides a more nuanced understanding of datasets and informs how to best make use of them.
- Genetics: Clustering analysis is used in evolutionary biology to classify species of plants and animals, group genes with similar functions, and support disease research, among many other applications.
What are some examples of clustering?
There are many examples of clustering being used in everyday life. Some of these include:
- Cybersecurity: Clustering algorithms comb through massive volumes of IT infrastructure data to detect anomalies that may indicate a network intrusion.
- Product recommendations: Online retailers use clustering to analyze individual customers’ purchasing behavior and make product recommendations during browsing or suggest add-ons during checkout.
- Credit card fraud: Clusters of typical customer purchasing patterns can be used to identify outliers and alert banks to potential fraud (the sketch after this list shows the idea).
- Machine maintenance: Manufacturing companies use clustering analysis to help monitor the behavior of their machinery. Machine systems typically show atypical behavior well in advance of a failure, and clustering can find the anomalies in machine input and output parameters to help with preventative maintenance.
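As a rough sketch of the fraud example, a density-based algorithm can flag transactions that fit no dense cluster of normal behavior; the transaction amounts and times below are synthetic, and the parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical transactions: amount (dollars) and hour of day (synthetic values).
rng = np.random.default_rng(3)
normal = np.column_stack([rng.normal(60, 15, 500), rng.normal(14, 3, 500)])
odd = np.array([[2500.0, 3.0], [1800.0, 4.5]])  # large, late-night purchases
X = StandardScaler().fit_transform(np.vstack([normal, odd]))

# Points that fit no dense cluster are labeled -1 (noise): fraud candidates.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
print("flagged transaction indices:", np.where(labels == -1)[0])
```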
What is clustering in big data?
Clustering is an essential data mining tool for big data. Data mining is the process of discovering patterns and trends in large datasets to extract useful insights. As data volume has grown and data warehousing technology has evolved, data mining has become an essential technique for organizations to operationalize their raw data.
Clustering analysis helps make sense of big data because it can be performed quickly and without much prior knowledge about the dataset. Clustering algorithms can simply explore the data and surface what’s interesting. Users can then rely on the insights to make better business decisions.
What is clustering in AI?
Clustering is a machine learning technique in AI. AI, or artificial intelligence, is an umbrella term that encompasses many types of computing methods designed to replicate human intelligence. These include fields such as Natural Language Processing, Computer Vision and Pattern Recognition.
Machine learning is the most widely implemented field of AI, and it’s particularly important in data science. Machine learning uses algorithms to process large amounts of data, make classifications or predictions, and uncover actionable insights. Machine learning is classified as one of two types — supervised and unsupervised — and each uses different algorithms and computational techniques to accomplish different outcomes. Clustering is classified as an unsupervised learning technique.
When do you use clustering techniques?
Clustering techniques can be used in the following situations:
- When working with large unlabeled datasets: Clustering is an efficient way to apply data analytics to large volumes of unstructured data. It requires no labeled training data and can quickly organize unwieldy data into something usable.
- When you don’t know how your data is divided: When working with unstructured data, you likely won’t know how to divide it. Clustering results can give you a deeper understanding of your data so you can decide what steps to take to make use of it.
- When you don’t have the resources to annotate your data manually: The more data you’re dealing with, the less feasible it is to manually annotate, classify and categorize it. Clustering algorithms can reduce the time it takes to perform these tasks and surface answers more quickly.
- When you want to find anomalies in your data: Many clustering algorithms are particularly sensitive to atypical and outlier data points. This makes clustering especially useful for uncovering anomalies in your data that can help detect problems or optimize your data collection for greater accuracy.
What are the benefits of clustering?
Clustering offers several benefits, including:
- Less complexity: Unlike supervised machine learning techniques, clustering doesn’t require users to tag data and train the algorithm, which reduces complexity.
- Faster, more accurate analysis: Clustering algorithms can evaluate and gain insights from raw datasets far more quickly and accurately than people can.
- Ability to find hidden patterns: Clustering extracts insights from unlabeled data by discovering the commonalities, differences and relationships among various data points.
- Better business decisions: Organizations can use the insights provided by data clustering to meet performance goals and make better-informed business decisions (e.g., an online retailer may change the way it markets products based on insights about customer purchasing patterns).
What is a clustering problem?
A clustering problem is any issue that can impede cluster analysis. Often, a problem that hampers one clustering algorithm can be solved by switching to a different one.
Some common problems include:
- High dimensionality: Large datasets can contain many dimensions, or attributes. With high-dimensional data, it can be more difficult for clustering algorithms to identify meaningful groups than with lower-dimensional data; reducing the number of dimensions first often helps (see the sketch after this list).
- Different types of attributes: Many algorithms are designed to cluster numerical data. However, data can also be categorical, ordinal, or binary, and some datasets may contain a mix of these attributes.
- Scalability: Some clustering algorithms work well on small datasets but struggle with larger ones, which can contain millions of data objects. Clustering a sample of a larger dataset is a possible workaround, but it can bias the results.
- Noisy data: Most datasets include atypical, missing, unknown or erroneous data. Some clustering algorithms are sensitive to this type of data, which can result in poor-quality clusters.
- Arbitrary cluster shapes: Many clustering algorithms tend to find spherical clusters with similar size and density. However, a data cluster can be any shape. This can lead to skewed results if the algorithm can’t identify arbitrary shapes.
- Data velocity: With new patterns constantly emerging from datasets, it can be difficult to know when to analyze existing data and when to wait until more data is collected. Increased velocity may also mean trends change as data is being collected.
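For the high-dimensionality problem above, a common mitigation is to project the data onto a handful of principal components before clustering. A minimal sketch on synthetic 100-dimensional data (the component and cluster counts are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 500 synthetic points with 100 features each; only a few directions
# actually carry the cluster structure.
X, _ = make_blobs(n_samples=500, n_features=100, centers=4, random_state=0)

# Project onto the top principal components before clustering.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels[:20])
```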
Whether for market research, customer insights or business processes, your organizational data is one of your most valuable assets. Analyzing this information and uncovering the most important data is critical, but rarely easy. Clustering can help you better manage and take advantage of your data so you can gain a deeper understanding of your operations and your customers.
