When I first came across the term machine learning (ML) models, I pictured futuristic sci-fi robots tirelessly working behind the scenes while we humans effortlessly enjoyed the benefits. While the reality isn’t quite that cinematic, ML models are undeniably intelligent and transformative.
You may have noticed how Spotify always knows what we want to hear next. Or how our email sorts the junk from real messages. That’s machine learning doing its thing. But what’s going on behind the scenes? What are these models people keep talking about, and how do they work?
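In this guide, I’ll break down what ML models are, the main types of models, the algorithms behind them, and how to choose the right one.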
Machine learning (ML) models are algorithms that learn patterns from data and use those patterns to make predictions or automate decisions without being directly programmed for every specific task. In fact, these models are behind many of the intelligent systems we use every day.
Here’s how it works:
Figure: Working of ML models.
Parameters are values that the model learns from data to make predictions. They determine how inputs are transformed into outputs, such as weights in a linear equation or connections in a neural network. Well-fitted parameters mean better performance; parameters that are fitted poorly, or fitted too closely to the training data (overfitting), lead to weak predictions on new data.
But hyperparameters are different. We set them before training, like the learning rate or model size. They guide how the model finds the best parameters. Together, parameters and hyperparameters make the model work well with new data.
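To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a made-up dataset and a one-variable linear model; the function name `train_linear_model` and the specific values are illustrative only. The weight `w` and bias `b` are parameters the model learns, while `learning_rate` and `epochs` are hyperparameters we pick up front.

```python
# Toy data that follows y = 2x + 1 exactly (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def train_linear_model(xs, ys, learning_rate, epochs):
    """Fit y = w * x + b with batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hyperparameters: chosen by us before training starts.
w, b = train_linear_model(xs, ys, learning_rate=0.05, epochs=2000)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Changing `learning_rate` or `epochs` changes how (and whether) training reaches good values of `w` and `b`, which is the sense in which hyperparameters guide the search for parameters.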
When it comes to machine learning, there’s no single solution that fits every problem. There are four main types of machine learning models (supervised, unsupervised, reinforcement, and self-supervised), each designed to learn in a different way. Let’s explore them in detail and see which one fits which job.
Supervised learning is like a teacher guiding a student. The model is trained on labeled data, which means the input data comes with the correct answers. It analyzes the data, makes predictions, then compares those predictions to the correct answers (output) and adjusts itself to improve accuracy.
Take Gmail’s spam detection as an example. Gmail trains its models on emails that are already labeled "spam" or "not spam." This way, the model picks up patterns like specific phrases or suspicious links and learns to recognize what shouldn’t be in our inbox.
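The predict, compare, adjust loop can be sketched with a tiny perceptron over word counts. This is not how Gmail's filter works; the four emails, the labels, and the function names are invented for illustration.

```python
# Tiny labeled dataset (invented for illustration): 1 = spam, 0 = not spam.
training_emails = [
    ("win free prize now", 1),
    ("free money click now", 1),
    ("meeting agenda for monday", 0),
    ("lunch on monday", 0),
]

def train_perceptron(examples, epochs=10):
    """Learn one weight per word by correcting mistakes on labeled examples."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            words = text.split()
            score = bias + sum(weights.get(word, 0.0) for word in words)
            prediction = 1 if score > 0 else 0
            error = label - prediction  # 0 when the prediction is correct
            if error != 0:
                # Adjust: nudge the bias and word weights toward the correct answer.
                bias += error
                for word in words:
                    weights[word] = weights.get(word, 0.0) + error
    return weights, bias

def predict(text, weights, bias):
    """Return 1 (spam) if the summed word weights are positive, else 0."""
    score = bias + sum(weights.get(word, 0.0) for word in text.split())
    return 1 if score > 0 else 0

weights, bias = train_perceptron(training_emails)
print(predict("free prize", weights, bias))
print(predict("monday meeting", weights, bias))
```

Words that appear in spam examples end up with positive weights and words from legitimate emails with negative ones, so the labels are what steer the learning.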
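There are two types of supervised learning: classification, which predicts a category (such as spam or not spam), and regression, which predicts a continuous number (such as a house price).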
Unsupervised learning is where things get a little more independent. Unlike supervised learning, unsupervised learning works with data that doesn’t come with labels. The model identifies patterns and groups on its own, without being instructed on what to find.
There are three main types of unsupervised learning techniques:
Clustering groups similar data points into clusters based on shared traits. If a business has a huge customer base but doesn’t know much about them, clustering can identify patterns. It may group customers by their shopping habits or interests, without needing pre-labeled data. These insights are then used for targeted marketing to satisfy shoppers' intent.
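Clustering works in a few different ways. Two common approaches, both covered in the algorithm tables later in this guide, are K-Means clustering and hierarchical clustering.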
Spotify is a great example of this. It uses clustering algorithms to group listeners based on their music preferences. These groups aren't pre-labeled; Spotify identifies natural patterns, such as grouping people who listen to similar artists or genres. This way, it recommends new songs that match our tastes.
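As a rough illustration of clustering listeners without labels, here is a minimal k-means sketch in plain Python. It is not Spotify's actual system; the listener data (weekly hours of rock and jazz) and the choice of starting centroids are assumptions made for the example.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical listeners: (hours of rock per week, hours of jazz per week).
listeners = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
centroids, clusters = k_means(listeners, centroids=[listeners[0], listeners[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Nobody told the algorithm which listeners are "rock fans" or "jazz fans"; the two groups fall out of the distances alone.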
Association rule learning finds relationships between items in large datasets. It’s widely used in retail, where algorithms analyze shopping carts to see which items are often bought together. You've probably seen “People who bought this also bought…” suggestions. That’s association rules at work: they learn from past data to make such suggestions.
Figure: Example of the association rule.
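Two numbers usually sit behind a rule like "if butter, then bread": support (how often the items appear together) and confidence (how often the rule holds when its "if" part is present). The sketch below computes both on a handful of made-up shopping baskets; the basket contents and function names are illustrative.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print("support(bread, butter):", support({"bread", "butter"}, baskets))
print("confidence(butter -> bread):", confidence({"butter"}, {"bread"}, baskets))
print("confidence(bread -> butter):", confidence({"bread"}, {"butter"}, baskets))
```

Note that the rule is directional: in this toy data everyone who buys butter also buys bread, but not everyone who buys bread buys butter.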
Dimensionality reduction compresses large, complex datasets by removing or combining irrelevant and redundant features while preserving the important details. Techniques such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) identify the directions in the data that carry the most information and discard the rest as noise.
A real-world example of this is Apple’s Face ID. It captures a 3D scan of our face with thousands of data points. But instead of processing all of them, it uses machine learning to reduce the data to the most important features. This way, the phone recognizes our face quickly and securely, without overloading the system with unnecessary information.
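To show the idea on the smallest possible case, the sketch below applies PCA to 2D points and keeps only one coordinate per point. It says nothing about how Face ID is implemented; the data and function names are assumptions, and the closed-form angle used here only works for two features.

```python
import math

def first_principal_component(points):
    """Return the unit direction of greatest variance for 2D points, plus the mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # For a symmetric 2x2 matrix, the leading eigenvector lies at this angle.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return (math.cos(angle), math.sin(angle)), (mean_x, mean_y)

def project(points):
    """Reduce each 2D point to a single coordinate along the first component."""
    (dx, dy), (mx, my) = first_principal_component(points)
    return [(x - mx) * dx + (y - my) * dy for x, y in points]

# Made-up measurements where the two features move almost in lockstep.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
print(project(data))
```

Because the two features are nearly redundant, one number per point keeps almost all of the variation, which is the trade dimensionality reduction is making.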
Reinforcement learning trains models through trial and error. The model interacts with an environment, makes decisions, and receives rewards or penalties based on its actions. Over time, it learns which actions lead to better outcomes and which don’t.
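Here is a minimal sketch of that trial-and-error loop using tabular Q-learning, one common reinforcement learning algorithm. The environment is a made-up five-cell corridor with a reward at the far end; the reward values, the hyperparameters, and the function name are all assumptions for illustration.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor: start at state 0, reward at the last state."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 moves left, action 1 moves right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Explore sometimes, otherwise take the best-known action.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), n_states - 1)
            # Reward for reaching the goal, small penalty for every other step.
            reward = 1.0 if next_state == n_states - 1 else -0.01
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = train_q_table()
policy = ["right" if right > left else "left" for left, right in q[:-1]]
print(policy)
```

The agent is never told that "right" is correct. It only sees rewards and penalties, and the table of action values gradually comes to favor the moves that pay off.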
Waymo’s self-driving cars use reinforcement learning to make smarter decisions on the road. They are trained in virtual environments, where they go through millions of different driving situations and learn by trial and error.
The Waymo Driver’s system gathers data from sensors and uses AI to understand what's happening around it, from spotting pedestrians and cyclists to reading traffic lights and temporary stop signs. After training on over 100k miles of city driving, reinforcement learning made Waymo’s cars safer and more reliable in challenging situations.
Self-supervised learning is a middle ground between supervised and unsupervised learning. It doesn’t require human-labeled data; instead, it learns by predicting parts of the data from other parts. Rather than being fed the answers, the model creates its own labels from the raw data. For example, it may hide part of an image or sentence and learn to guess what’s missing.
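Take BERT, for example. It uses self-supervised learning by training on two tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), which generate their own training signals from raw text without needing manual labels. In MLM, some of the words in a sentence are hidden and the model learns to predict them from the surrounding context. In NSP, the model sees two sentences and learns to predict whether the second one actually follows the first in the original text.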
These pre-training tasks allow BERT to learn general language patterns and context, which can later be fine-tuned for specific NLP tasks like classification or question answering.
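The "creates its own labels" step is easy to see in code. The sketch below builds a masked-word training example from raw text. It is a simplification: real BERT works on subword tokens and masks a fraction of them per sequence, and the function name and whitespace tokenization here are assumptions.

```python
import random

def make_masked_example(sentence, mask_token="[MASK]", seed=0):
    """Hide one word and return (masked input, position, hidden word)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    position = rng.randrange(len(tokens))
    label = tokens[position]  # the "answer" comes from the text itself
    masked = tokens.copy()
    masked[position] = mask_token
    return " ".join(masked), position, label

masked, position, label = make_masked_example("the cat sat on the mat")
print(masked, "| hidden word:", label)
```

No person labeled anything here: the training target is simply the word that was removed, which is why raw text alone is enough for pre-training.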
Now that we’ve covered the types of machine learning models, let’s look at the supervised ML algorithms that train them.
| Algorithm | Purpose | How It Works |
|---|---|---|
| Linear regression | Predict continuous values | Fits a straight line through data points to model the relationship between input and output. |
| Logistic regression | Classification (binary) | Uses a linear combination of inputs, then applies a sigmoid function to output a probability between 0 and 1. |
| Decision tree | Classification or Regression | A flowchart-like structure that splits data by asking yes/no questions at each node. |
| Random forest | Classification or Regression | Builds many decision trees on different parts of the data and combines their results (majority vote for classification, average for regression). |
| Support Vector Machine (SVM) | Classification | Draws the best possible boundary (hyperplane) between different classes to maximize the margin between them. |
| K-Nearest Neighbors (KNN) | Classification or Regression | Predicts from the `k` closest data points (neighbors): the majority label for classification, or the average value for regression. |
| Gradient boosting | Classification or Regression (often highly accurate) | Sequentially builds small decision trees, where each new tree focuses on correcting the mistakes of the previous ones (boosting). |
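As one worked instance from the table, here is a minimal K-Nearest Neighbors classifier in plain Python. The training points, the labels, and the function name `knn_predict` are made up for the example.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (feature 1, feature 2) and a class label.
train = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"), ((7.0, 6.0), "large"), ((6.5, 7.5), "large"),
]
print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.8, 6.9)))
```

KNN has no training step at all: it stores the labeled examples and does its work at prediction time, which is part of why it is easy to set up on small datasets.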
Now, let’s look at unsupervised ML algorithms:
| Algorithm | Purpose | How It Works |
|---|---|---|
| K-Means clustering | Group data into clusters | Picks `K` cluster centers (centroids), assigns each point to the nearest center, recalculates centers, and repeats until stable clusters are formed. |
| Hierarchical clustering | Build a cluster hierarchy | Starts with each data point as its own cluster, then repeatedly merges the closest clusters to form a tree (dendrogram). |
| Apriori algorithm | Discover association rules | Finds frequent item sets in data and then derives rules like "If A, then B" based on items that commonly appear together. |
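The merge-the-closest-pair loop behind hierarchical clustering can be sketched in a few lines. This version works on 1D values, uses single linkage (distance between the two closest members), and stops at a requested number of clusters rather than recording the full dendrogram; those choices, like the data, are assumptions for illustration.

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering on 1D values."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3))
```

Running the loop all the way down to one cluster, and noting the order of merges, is what produces the tree shown in a dendrogram.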
Since we have so many ML models available, each with its strengths, it’s not easy to choose the right one. You should first consider what kind of problem you’re solving and what kind of data you’re working with.
Here’s a simple way to approach it:
The first step is to clearly define your goal. Ask yourself: What am I trying to predict or understand? If your objective is to categorize items — such as determining whether an email is spam or not — a classification model like logistic regression or decision trees may be the ideal choice.
If you need to predict a number, such as estimating a house price, you'll want a regression model like linear regression, or perhaps gradient boosting if you want higher accuracy.
But if you want to find hidden patterns without any labels to guide you, consider unsupervised models like K-Means clustering, as it can group similar data points without predefined categories.
Once you know your goal, take a close look at the data you have. If your dataset comes with clear answers, meaning labeled examples that show what the right outcome should be, then go with supervised learning. But if your data lacks labels altogether, unsupervised learning is a better choice, and in some cases, self-supervised techniques may provide an even smarter route.
Sometimes the output alone isn’t enough; we also need to understand why our model made a certain decision. This transparency is particularly necessary in sensitive areas such as healthcare or finance. So, if explainability is your priority, simpler models like linear regression or decision trees can help you see how the model reaches its conclusions.
On the other hand, if getting the most accurate predictions is more important than being able to explain every step, then complex models like random forests or gradient boosting may be the better choice, even though they behave more like black boxes.
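To see what "explainable" looks like in practice, consider the simplest possible decision tree: a single split, often called a decision stump. The sketch below is a hedged illustration with invented data (number of links per email) and only tries rules of the form "predict 1 when the value is above a threshold".

```python
def fit_stump(values, labels):
    """Find the single threshold on one feature that misclassifies the fewest examples."""
    best_threshold, best_errors = None, len(values) + 1
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        # Rule under test: predict 1 when value > threshold, else 0.
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# Hypothetical data: links in an email, and whether it was spam (1) or not (0).
link_counts = [0, 1, 1, 2, 5, 6, 8, 9]
is_spam = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, errors = fit_stump(link_counts, is_spam)
print(f"Rule: predict spam if links > {threshold} ({errors} training errors)")
```

The whole model is one sentence a person can read and challenge. A random forest or gradient boosting model combines many such trees, which typically helps accuracy but makes the reasoning much harder to trace.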
If your dataset is small and you need quick results, simpler models like K-Nearest Neighbors are often a good choice because they’re easy to set up and need no lengthy training. But when you work with vast amounts of data, or when you care more about squeezing out every bit of predictive power, it’s better to train sophisticated models like gradient boosting, even if they take longer to train and tune.
Ultimately, the best way to choose a model is to get hands-on. Try out a few different models, see how they perform, and compare their results side by side. Often, the right choice only becomes obvious once you see how each model handles the real data.
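A side-by-side comparison can be as simple as training each candidate on one part of the data and scoring it on examples it has not seen. The sketch below compares a "predict the average" baseline with a least-squares line; the house sizes and prices are made up, and the split (every fourth example held out) is an arbitrary choice for the example.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the average training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given examples."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: house size (hundreds of sq ft) and price (thousands of dollars).
sizes = [8, 10, 12, 14, 16, 18, 20, 22]
prices = [155, 205, 240, 290, 315, 370, 395, 450]

# Hold out every fourth example for testing; train on the rest.
train_x = [x for i, x in enumerate(sizes) if i % 4 != 3]
train_y = [y for i, y in enumerate(prices) if i % 4 != 3]
test_x = [x for i, x in enumerate(sizes) if i % 4 == 3]
test_y = [y for i, y in enumerate(prices) if i % 4 == 3]

results = {
    "mean baseline": mse(fit_mean(train_x, train_y), test_x, test_y),
    "linear regression": mse(fit_line(train_x, train_y), test_x, test_y),
}
for name, error in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: test MSE = {error:.1f}")
```

The same pattern extends to any set of candidates: keep the held-out data fixed, swap the model, and compare the scores.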
Machine learning isn’t mystical — but it sure is cool once you understand how much it impacts our daily lives.
We’ve covered a lot, from the different types of machine learning models to how these models are quietly shaping the tools we use every day. Whether it's Gmail sorting spam or Spotify suggesting your next favorite song, machine learning is everywhere. It's not going anywhere.
But here's the catch: just like with any new technology, there’s no one-size-fits-all. The right model depends on your problem, the data you have, and how much accuracy or transparency you need. So, if you take anything away from this, let it be this: explore multiple models, experiment, and let the data show you the way. This way, you will find the perfect fit for what you’re trying to achieve.
This posting does not necessarily represent Splunk's position, strategies or opinion.