How Chain of Thought (CoT) Prompting Helps LLMs Reason More Like Humans
Large language models (LLMs) are conversational AI models trained on vast amounts of data from the internet. Often containing hundreds of billions of parameters, these models can also be fine-tuned on specialized knowledge bases in domains such as mathematics, computer science, and biology.
Yet sometimes it seems that LLMs don’t perform well — particularly on language tasks that require reasoning or common sense.
In fact, complex reasoning tasks have seriously challenged the scaling laws of large conversational AI models. Conversational AGI is simply not ready just yet, even for simple problems such as counting the number of Rs in the word “strawberry”. So much so that OpenAI released a new model engineered specifically for complicated reasoning queries like the strawberry problem: the o1 model, codenamed Strawberry.
Why is that and how can you help solve complicated reasoning tasks with an LLM?
How humans use reason
Consider your own thought process when you solve an arithmetic problem or a question that requires common sense.
Usually, you would decompose the problem into simple intermediate steps. Solving one step will lead you closer to solving the next one — you will eventually reach a final solution that is reasoned and verified with logic at every step. This reasoning and logic may be hidden, unexplainable, or considered “common sense”. And that’s why it’s so difficult for language models to get right, right now.
What is CoT prompting?
The term chain of thought (CoT) prompting refers to the practice of decomposing a complex query into intermediate steps and providing few-shot examples that demonstrate those step-by-step answers.
Deciding when to use CoT is important. Chain of thought is particularly valuable for complex tasks, and it works best with larger models; smaller models will likely perform worse with CoT. Think of CoT prompting as a more advanced framework for prompt engineering, where the model consumes examples in order to:
- Produce the desired output.
- Improve the quality of that output.
Because CoT is a more advanced framework, let’s first detour into few-shot prompting, the foundation for chain of thought prompting.
What is few-shot prompting?
Few-shot prompting refers to the manual process of providing the LLM with some examples of a task in order to generate a desired response. The few-shot method is valuable for scenarios lacking extensive training data.
A few-shot prompt can be an example of a similar prompt-response combination. “Vanilla” few-shot methods do not necessarily provide any additional context or logic for solving the query in intermediate steps.
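For illustration, a minimal few-shot prompt might simply pair example inputs with example outputs and leave the last answer blank for the model to complete. The classification task, reviews, and labels below are purely illustrative:

```python
# A "vanilla" few-shot prompt: example input/output pairs with no reasoning steps.
# The sentiment-classification task and reviews are illustrative only.
few_shot_prompt = """
Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:
"""
# The model is expected to complete the final line with "Positive",
# imitating the pattern of the two labeled examples above.
```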
From few-shot to chain of thought prompts
Extending a few-shot prompt is what makes a “chain of thought” prompt. You literally make the prompt longer: you ask your question and also provide an example of the prompt-response pattern you’d like to see.
This example contains a sequence of intermediate steps that can guide the model to reason over the decomposed parts of the prompt, quite similar to how we humans process thoughts. With the few-shot prompting approach, the model acquires context to reason and generate an output, and that output will follow the intermediate steps demonstrated in the provided examples.
Chain of thought uses few-shot prompts as examples to guide the LLM from a starting point (to build context) through the desired intermediate steps (to establish the process) that the LLM can then follow in its responses.
In this way, CoT is similar to teaching a young child something new by exposing them to some (few-shot) examples and guiding them through the reasoning process step by step.
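As a minimal sketch, here is how a few-shot example turns into a chain of thought example once the intermediate reasoning is written out. The word problems are illustrative:

```python
# A chain of thought prompt: the few-shot example now includes the intermediate
# reasoning steps, not just the final answer. The word problems are illustrative.
cot_prompt = """
Q: A library has 120 books. It lends out 45 and receives 30 new donations.
How many books does it have now?
A: The library starts with 120 books. Lending out 45 leaves 120 - 45 = 75.
Receiving 30 donations gives 75 + 30 = 105. The answer is 105.

Q: A bakery makes 200 rolls. It sells 140 in the morning and bakes 60 more
in the afternoon. How many rolls does it have at the end of the day?
A:
"""
# The model is guided to reproduce the step-by-step pattern:
# 200 - 140 = 60, then 60 + 60 = 120. The answer is 120.
```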
Few-shot vs. zero-shot prompting
Let’s contrast few-shot prompting with zero-shot prompting. In a zero-shot prompt, the LLM is not provided with any example to gain additional context; it must reason on its own.
The LLM is expected to respond to a complex query that may not be part of the model’s training data. You can also ask the LLM to infer step by step by decomposing the query itself, without actually providing (few-shot) examples of this process.
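For contrast, a zero-shot version of the same kind of query provides no worked examples at all:

```python
# Zero-shot: the same kind of question, with no worked examples.
# The model must reason entirely on its own.
zero_shot_prompt = (
    "A bakery makes 200 rolls. It sells 140 in the morning and bakes 60 more "
    "in the afternoon. How many rolls does it have at the end of the day?"
)
```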
Properties of chain of thought prompting
OK, so what is it about chain of thought prompting that helps LLMs to reason in ways similar to humans? These properties are what help a language model solve complex conversational tasks:
- Decomposing multi-step problems into intermediate tasks. Each task is allocated additional computation and memory.
- The behavior of the model across the intermediate tasks can be interpreted. Any logical error can be debugged within the environment of that intermediate task.
- Commonsense reasoning can be introduced at intermediate steps with Few-Shot prompt examples. In fact, if reasoning can be introduced via language, the LLM can adopt it.
- The performance of an LLM improves substantially if you introduce few-shot CoT examples within the prompt.
Example of CoT prompting
Let’s consider a simple arithmetic example. The image below illustrates the differences between a standard prompt and a chain of thought prompt.
In this example, standard prompting uses a few-shot prompt that solves a similar arithmetic query as an example. Here, the few-shot example does not provide steps or context that the model can use to solve the problem as decomposed intermediate tasks.
On the right, a few-shot example is provided with chain of thought prompting. It provides intermediate steps that serve as an example of how to decompose the next question in the prompt. The LLM then:
- Reviews the first question.
- Understands how the user expects the question to be decomposed.
- Follows through intermediate steps to reach an answer.
This knowledge serves as a context to answer the next question in the user prompt.
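If you want to try a comparison like this yourself, the sketch below sends both a standard few-shot prompt and a chain of thought prompt to the same model. It assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable; the model name and the arithmetic questions are illustrative placeholders:

```python
# Compare a standard few-shot prompt with a chain of thought prompt.
# Assumes the OpenAI Python SDK (pip install openai) and the OPENAI_API_KEY
# environment variable; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()

standard_prompt = (
    "Q: A farm has 15 cows and buys 8 more. How many cows does it have?\n"
    "A: 23\n\n"
    "Q: A shop has 32 apples, sells 20, and buys 6 more. How many apples does it have?\n"
    "A:"
)

cot_prompt = (
    "Q: A farm has 15 cows and buys 8 more. How many cows does it have?\n"
    "A: The farm starts with 15 cows. Buying 8 more gives 15 + 8 = 23. The answer is 23.\n\n"
    "Q: A shop has 32 apples, sells 20, and buys 6 more. How many apples does it have?\n"
    "A:"
)

for name, prompt in [("standard", standard_prompt), ("chain of thought", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} prompting ---")
    print(response.choices[0].message.content)
```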
CoT for long prompt queries
But what about long prompt queries? A popular example here is coding scripts, where the user expects the LLM to explain the logic, find errors, and describe the functionality and outputs of the code.
If you have ever pasted a long coding script into ChatGPT, you may find that the response is not relevant to your query.
One of the reasons is that LLMs can understand complex code but may still need a clear CoT description of your query. You may not need to introduce few-shot coding examples; instead, provide a clear description of the intended functionality and how it can be reached.
In this case, you may simply need to ask the LLM to follow a step-by-step chain of thought prompting process. Ultimately, this is similar to zero-shot prompting but following a CoT approach.
An example of this is shown in the figure below, where researchers simply ask the model to “think step by step” as part of the zero-shot prompt query (Box C, on the lower-left side).
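As a minimal sketch of the same idea, zero-shot CoT can be as simple as appending that instruction to your query; the example question is illustrative:

```python
# Zero-shot chain of thought: no worked examples, just an explicit
# instruction to reason step by step before answering.
zero_shot_cot_prompt = (
    "A train leaves at 9:40 and the trip takes 2 hours and 35 minutes. "
    "What time does it arrive? Let's think step by step."
)
```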
Prompt best practices for users
The idea of CoT is to simplify the reasoning process for the LLM. Machines don’t think in the same way as humans. And while LLMs may already have the factual information, they may need guidance from the user. This is not necessarily a limitation of the LLM either: LLMs should not be expected to interpret a query based purely on user intent.
Instead, the user is expected to communicate (prompt) their intent to clarify the query. Techniques like few-shot or zero-shot CoT prompt the LLM to connect the dots in the right direction.
Since LLMs aim for general intelligence, you can use chain of thought prompting to focus the knowledge of the LLM in the direction best suited to the intent of your query. The following user-side practices can help:
Think step by step
For example, instead of asking: “What is the capital of the country with the largest GDP?”, you can break down the query into intermediate steps: “Which country has the largest GDP? Then, name the capital of that country.”
For more complex coding queries, you can achieve similar results by following the Few-Shot or the Zero-Shot CoT approach.
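As a rough sketch, a coding query can be structured as explicit steps rather than one broad request; the task described below is illustrative:

```python
# A coding query structured as explicit steps instead of one broad request.
# The function and requirements described are illustrative only.
coding_cot_prompt = """
I will paste a Python function that parses log lines. Please:
1. Summarize what the function does, line by line.
2. List the assumptions it makes about the input format.
3. Point out any edge cases that could raise an exception.
4. Suggest a fix for each issue you find.
"""
```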
External verification loops
Prompt the LLM to evaluate its own answer. For example, follow up a query with the following:
- “Review your response for factual errors.”
- “Verify your response with recent academic literature.”
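As a minimal sketch of such a loop, the snippet below asks the original question and then feeds the model’s own answer back with one of the verification prompts above. It assumes the OpenAI Python SDK; the model name is a placeholder:

```python
# A simple self-verification loop: ask the question, then feed the model's
# own answer back with a request to check it. Assumes the OpenAI Python SDK
# and OPENAI_API_KEY; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

messages = [{"role": "user",
             "content": "Which country has the largest GDP? Then, name the capital of that country."}]
first = client.chat.completions.create(model=MODEL, messages=messages)
answer = first.choices[0].message.content

# Follow-up turn: one of the verification prompts listed above.
messages += [{"role": "assistant", "content": answer},
             {"role": "user", "content": "Review your response for factual errors."}]
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```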
Fact-checking and fine-tuning
A general-purpose LLM may not have knowledge or context about a niche topic that has not been published online.
In this scenario, you can attempt to retrain (fine-tune) your LLM with your own dataset, which is resource-intensive and only works with open-source models that you deploy in your own systems. Or you can manually add context to the prompt, which only works if you already have the context you need.
Retrieval-Augmented Generation (RAG)
If you have access to an external knowledge base, such as a vector database, you can retrieve this knowledge in real time during CoT prompting. Here’s how it works:
- User query: “Specify the industry best-practice steps for deploying a high-performance and secure microservices pipeline to production.”
- Retrieval: The RAG pipeline may find relevant internal documents such as “microservice deployment policy v3”, “CI/CD guidelines for Kubernetes”, and more.
- Generation: The LLM reads the retrieved documentation and composes a response based on the updated knowledge. For example: “To deploy a microservice to production, follow these steps:
- Merge code to the main branch
- Ensure all CI checks pass
- Submit a change request via Jira
- Await approval from DevOps
- Use the prod-deploy.sh script for release…”
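To make the flow concrete, here is a toy, self-contained sketch of the retrieve-then-generate pattern. It uses a simple word-overlap score instead of a real embedding model and vector database, and the document titles and contents are illustrative only:

```python
# Toy RAG sketch: retrieve the most relevant internal documents with a simple
# word-overlap score, then prepend them to the prompt. Real pipelines use an
# embedding model and a vector database; everything here is illustrative.

documents = {
    "microservice deployment policy v3":
        "Merge code to the main branch, ensure all CI checks pass, "
        "submit a change request via Jira, and await DevOps approval.",
    "CI/CD guidelines for Kubernetes":
        "Use the prod-deploy.sh script for release and verify rollout "
        "status before closing the change request.",
    "office travel policy":
        "Book flights through the internal portal at least 14 days in advance.",
}

def relevance(query: str, text: str) -> int:
    """Count how many query words appear in the document (toy relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

query = ("Specify the industry best-practice steps for deploying a "
         "high-performance and secure microservices pipeline to production.")

# Retrieval: keep the two highest-scoring documents as context.
ranked = sorted(documents.items(), key=lambda kv: relevance(query, kv[1]), reverse=True)
context = "\n\n".join(f"{title}:\n{text}" for title, text in ranked[:2])

# Generation: the retrieved context is placed ahead of the user query, so the
# LLM composes its answer from the retrieved knowledge.
rag_prompt = f"Answer using only the following internal documents:\n\n{context}\n\nQuestion: {query}"
print(rag_prompt)
```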
The challenge of hallucinations — and how it’s being solved
Now, unlike power users in the tech industry, the average LLM user does not have the resources and expertise to retrain and deploy specialized open-source LLMs or integrate RAG tooling into their LLM pipeline.
Many users expect and want AI companies to deliver on their promises of AGI sooner rather than later. But what they frequently witness are symptoms of hallucination (such as the strawberry problem). Vendors such as OpenAI are responding to the challenge of hallucination in two distinct directions:
Universal AGI
One example is the Q* algorithm, a code name for an OpenAI algorithm or LLM architecture pipeline that allows the model to intelligently reason through steps. (To be clear, this Q* algorithm is not Q-Learning.)
OpenAI never published a paper on this, and we don’t really know whether it was an actual algorithm capable of moving beyond prompt-based reasoning, such as CoT, to a more autonomous and intelligent reasoning mechanism.
Over-engineering to perfect a user-centric LLM solution
This is closer to what we see with LLMs today: the Reinforcement Learning from Human Feedback (RLHF) workflow is now even more deeply rooted in the latest OpenAI LLMs. Here’s how it works:
- Pretraining: The LLM is pretrained on a large data set (unsupervised learning approach).
- Fine-tuning: The model is fine-tuned on a smaller, high-quality dataset of human-written prompt-response pairs, a step known as Supervised Fine-Tuning (SFT). OpenAI has focused especially on the SFT approach for code-related prompt tasks.
- Rank and reward: Human experts rank the model outputs, and these rankings are used to train a reward model. (The reward model defines the reward signal that describes the value of an action taken by the reinforcement learning agent in a given state.) Since the agent follows reward feedback generated by human experts, it is more likely to mimic a human developer instead of producing hallucinated outputs.
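For a sense of what the reward-model step looks like in code, here is a minimal sketch of a pairwise (Bradley-Terry style) ranking loss in PyTorch, a common choice for training reward models. The tiny linear model and random features are stand-ins, not OpenAI’s actual implementation:

```python
# Minimal sketch of training a reward model from human rankings with a
# pairwise (Bradley-Terry style) loss. The tiny linear "reward model" and
# random features are illustrative stand-ins, not a production setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(128, 1)  # toy stand-in for a transformer-based reward head
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend features for (chosen, rejected) response pairs from human ranking.
chosen_features = torch.randn(8, 128)    # responses the human experts preferred
rejected_features = torch.randn(8, 128)  # responses the human experts ranked lower

chosen_reward = reward_model(chosen_features)      # shape: (8, 1)
rejected_reward = reward_model(rejected_features)  # shape: (8, 1)

# The loss is small when preferred responses score higher than rejected ones.
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```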
Modern LLMs and reasoning models, such as GPT-4 Turbo, GPT-4.5, Operator, and o1 pro, are designed specifically to reduce hallucinations by emulating the human approach to learning, reasoning, and responding.