Introduction to Reinforcement Learning

Like humans, machines can learn from experiences (and mistakes)! This concept is known as reinforcement learning, and it’s at the core of some of the most prominent AI breakthroughs in the last decade.

Large language models (LLMs) such as ChatGPT, the Nobel Prize-winning protein structure prediction technology AlphaFold, and AlphaGo, which achieved a historic victory against a human champion at the complex ancient Chinese game of Go, all incorporate reinforcement learning algorithms.

In this article, we will review key reinforcement learning concepts and approaches, as well as the advancements behind recent technological breakthroughs.

What is reinforcement learning?

Reinforcement learning (RL) refers to an AI methodology in which an agent interacts with its environment and learns an optimal policy that maximizes a reward function.

This concept is partly inspired by animal psychology, especially the ideas of trial-and-error learning and of reinforcing behavior through reward and punishment.

(Related reading: machine data and machine customers.)

How is reinforcement learning different from supervised and unsupervised learning?

Yes, reinforcement learning shares some similarities with supervised learning. But its approach and goal are vastly different, and they stand in even sharper contrast to unsupervised learning.

In the supervised learning approach, you create a model that learns to map inputs to known labels. For example, consider a set of animal images, each correctly labeled with the animal's name. A supervised learning model learns how the features within each image map to its corresponding label (the labels are provided, hence "supervised" learning).

In the unsupervised learning approach, the labels of such images are not explicitly provided. Instead, the model extracts features within the images to separate them. Images with similar features are clustered together — each cluster is considered a separate label class.

In both supervised and unsupervised learning, when the model is exposed to new (unknown) images, it can read the features within the image to predict its corresponding label.
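
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available; the toy two-feature dataset and its labels are invented for illustration. A supervised classifier is trained on labeled examples, while an unsupervised clustering model groups the same points without any labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy dataset: two features per sample (hypothetical image statistics).
X = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],   # "cat"-like points
              [0.90, 0.80], [0.80, 0.95], [0.85, 0.90]])  # "dog"-like points
y = np.array([0, 0, 0, 1, 1, 1])                          # labels are provided

# Supervised: learn a mapping from features to the known labels.
clf = LogisticRegression().fit(X, y)
print("supervised prediction:", clf.predict([[0.12, 0.18]]))  # -> label 0

# Unsupervised: no labels; group similar points into clusters instead.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_)
```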

How reinforcement learning differs

Reinforcement learning is different: there are no labeled examples to learn from. Instead, the agent interacts with an environment and receives reward or penalty signals for its actions. The agent eventually learns to associate an action policy or a value with every state it may encounter.

Another key difference exists: the goal for the agent is to learn how to act in the environment — not to learn patterns within data.

Reinforcement learning has deep roots in the field of optimal control. Together with supervised learning, it has helped produce some of the most promising AI breakthroughs: Reinforcement Learning from Human Feedback (RLHF) combines RL algorithms such as Proximal Policy Optimization (PPO) with human preference data to align the behavior of LLMs.

How reinforcement learning works

The agent interacts with its surrounding environment by perceiving the current environment state and observing how an action (interaction) transitions it to another state.

At each interaction step, the agent receives an input containing information about the present environment state. The agent then chooses an action to interact with the environment and transition to another state. A value is associated with this state transition in the form of a reinforcement signal — either reward or penalty.

The agent's behavior policy should maximize the long-run (cumulative) reward, and this policy is learned systematically over time through trial and error.
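
The interaction loop described above can be sketched in a few lines of Python. This is a minimal, hand-rolled illustration; the one-dimensional "corridor" environment and its reward values are invented for this example, not taken from the article. The agent observes the current state, picks an action, and the environment returns the next state and a reinforcement signal.

```python
import random

class CorridorEnv:
    """Tiny environment: states 0..4, where the goal (state 4) pays a reward of +1."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else -0.01   # small penalty per step
        done = self.state == 4
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = random.choice([0, 1])          # a (so far untrained) random policy
    state, reward, done = env.step(action)  # environment returns new state + signal
    total_reward += reward
print("episode finished, total reward:", round(total_reward, 2))
```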

Simple demo

Here’s a simple demonstration of a reinforcement learning model, where a mouse (the agent) has to traverse the maze (the environment) to find cheese (the reward).

The goal of this agent is to maximize the long-term reward, which requires exploring the environment.

The environment also includes states where an adversary can meet the agent and reduce the accumulated reward.


The mouse can settle for a smaller reward (a single cheese slice) or walk around the maze in search of a larger reward (three slices of cheese in the same box). In other words, the mouse can either exploit the nearby reward it already knows about or explore further in the hope of finding something better.

This is essentially the exploration versus exploitation problem of reinforcement learning. You can find a more technical and detailed explanation of these concepts here.
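
A common way to balance these two strategies is an epsilon-greedy rule: with a small probability the agent explores a random action; otherwise it exploits the action it currently believes is best. The short sketch below is a generic illustration, and the epsilon value and estimated action values are assumptions made for the example.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))                            # explore
    return max(range(len(action_values)), key=lambda a: action_values[a])      # exploit

# Example: current value estimates for actions [up, down, left, right]
estimated_values = [0.2, 0.0, 0.5, 0.1]
choices = [epsilon_greedy(estimated_values, epsilon=0.1) for _ in range(1000)]
print("fraction choosing the best-known action:", choices.count(2) / 1000)
```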

Approaches for reinforcement learning

An important question arises: what exactly should the agent learn in order to reach its desired states? Reinforcement learning offers the following approaches:

Value-based learning

Here, the agent's goal is to learn how valuable it is to be in a certain state, or to take a specific action in that state.

Once these values are learned, the agent acts by choosing the action with the highest estimated value in its current state. Examples include Q-learning, SARSA, and Deep Q-Networks (DQN).
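
As a concrete sketch of value-based learning, the tabular Q-learning update below assigns each state-action pair a value Q(s, a) and nudges it toward the observed reward plus the discounted value of the best next action. The learning rate, discount factor, and sample transition are illustrative assumptions, not values from the article.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> estimated value, defaults to 0.0
alpha, gamma = 0.1, 0.99      # learning rate and discount factor (assumed values)
actions = [0, 1, 2, 3]        # e.g., up / down / left / right

def q_update(state, action, reward, next_state):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example transition: taking action 3 in state "A" led to state "B" with reward 1.0
q_update("A", 3, 1.0, "B")
print(Q[("A", 3)])   # -> 0.1 after one update
```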

Policy-based learning

In this strategy, the agent doesn’t learn values. Instead, the agent learns a policy — a function that maps states to actions.

The policy explicitly defines which action to take (or how likely each action is) in any given state. This makes policy-based learning effective in environments with high-dimensional or continuous action spaces. Examples include REINFORCE, Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).
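
As a minimal sketch of the policy-based idea, the code below parameterizes a softmax policy over action preferences and applies a REINFORCE-style update that makes rewarded actions more likely. The preference table, learning rate, and sample return are assumptions made for illustration.

```python
import math

# Action preferences (policy parameters) for a single state "A"; assumed starting values.
theta = {("A", 0): 0.0, ("A", 1): 0.0}
alpha = 0.1  # learning rate (assumed)

def policy(state, actions=(0, 1)):
    """Softmax policy: pi(a|s) = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    prefs = [math.exp(theta[(state, a)]) for a in actions]
    total = sum(prefs)
    return [p / total for p in prefs]

def reinforce_update(state, taken_action, return_g, actions=(0, 1)):
    """REINFORCE step: theta[s, b] += alpha * G * (1[b == taken_action] - pi(b|s))."""
    probs = policy(state, actions)
    for i, b in enumerate(actions):
        grad_log_pi = (1.0 if b == taken_action else 0.0) - probs[i]
        theta[(state, b)] += alpha * return_g * grad_log_pi

# Suppose taking action 1 in state "A" eventually earned a return of +2.0:
reinforce_update("A", taken_action=1, return_g=2.0)
print(policy("A"))  # action 1 is now more probable than action 0
```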

Model-based learning

In this approach, the agent builds or uses a model that simulates the dynamics of the environment, that is, how actions cause transitions between states.

The agent uses this model to simulate future outcomes, either by being given the environment's dynamics or by learning them from experience. Because the agent can plan ahead with the model, this approach is useful for reducing sample complexity. Examples of model-based learning include Dyna-Q and the Monte Carlo Tree Search planning used in AlphaGo and AlphaZero.
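
Dyna-Q is a simple illustration of this idea: the agent records the transitions it has experienced as a learned model, then replays imagined transitions from that model to refine its value estimates without extra real interaction. The sketch below is a minimal outline under assumed parameter values, reusing a Q-learning-style update.

```python
import random
from collections import defaultdict

Q = defaultdict(float)             # state-action value estimates
model = {}                         # learned model: (state, action) -> (reward, next_state)
alpha, gamma, actions = 0.1, 0.95, [0, 1]

def learn(state, action, reward, next_state, planning_steps=5):
    """Dyna-Q: one real update, then several simulated updates from the learned model."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    model[(state, action)] = (reward, next_state)          # remember the real transition

    for _ in range(planning_steps):                        # plan with imagined experience
        (s, a), (r, s2) = random.choice(list(model.items()))
        best = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

# One real experience is enough to trigger several planning updates:
learn("A", 1, reward=1.0, next_state="B")
print(round(Q[("A", 1)], 3))
```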

Model-free learning

In this approach, the agent does not model the environment's dynamics at all. Instead, the agent relies purely on trial and error to learn.

Model-free learning is useful when the environment's dynamics are unknown or too complex to model tractably. Examples include DQN, DDPG, and A3C.

A comprehensive report on the latest state-of-the-art algorithms is available here. The image below illustrates the overlap between these four approaches:

[Image: the overlap between value-based, policy-based, model-based, and model-free approaches]

Final thoughts

Reinforcement Learning provides a flexible and powerful framework for training agents that learn from their environment—not just from data. As RL continues to merge with deep learning, optimization, and human feedback, it's becoming a core pillar of intelligent decision-making in AI systems.

Whether it’s mastering ancient games, helping humans communicate with language models, or navigating the physical world, reinforcement learning continues to push the frontier of what machines can learn to do.
