How Chain of Thought (CoT) Prompting Helps LLMs Reason More Like Humans
Large language models (LLMs) are conversational AI models trained on vast amounts of data from the internet. Often containing hundreds of billions of parameters, these models can also be fine-tuned on specialized knowledge bases in domains such as mathematics, computer science, and biology.
Yet sometimes it seems that LLMs don’t perform well — particularly on language tasks that require reasoning or common sense.
In fact, complex reasoning tasks have seriously challenged the scaling laws of large conversational AI models. Conversational AGI is simply not ready just yet, even for simple problems such as counting the number of Rs in the word “strawberry”. So much so that OpenAI released a new model engineered specifically for complicated reasoning queries like the strawberry problem: the o1 model, codenamed Strawberry.
Why is that and how can you help solve complicated reasoning tasks with an LLM?
How humans use reason
Consider your own thought process when you solve an arithmetic problem or a question that requires common sense.
Usually, you would decompose the problem into simple intermediate steps. Solving one step will lead you closer to solving the next one — you will eventually reach a final solution that is reasoned and verified with logic at every step. This reasoning and logic may be hidden, unexplainable, or considered “common sense”. And that’s why it’s so difficult for language models to get right, right now.
What is CoT prompting?
The term chain of thought (CoT) prompting refers to the practice of decomposing a complex query into intermediate steps and providing few-shot examples that demonstrate those step-by-step answers.
Deciding when to use CoT is important. Chain of thought is particularly valuable for complex tasks, and it works best with larger models; smaller models will likely perform worse with CoT. Think of CoT prompting as a more advanced framework for prompt engineering, where the model consumes examples in order to:
- Produce the desired output.
- Improve the quality of that output.
Because CoT is a more advanced framework, let’s first detour into few-shot prompting, the foundation for chain of thought prompting.
What is few-shot prompting?
Few-shot prompting refers to the manual process of providing the LLM with some examples of a task in order to generate a desired response. The few-shot method is valuable for scenarios lacking extensive training data.
A few-shot prompt can be an example of a similar prompt-response combination. “Vanilla” few-shot methods do not necessarily provide any additional context or logic for solving the query in intermediate steps.
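For illustration, a minimal few-shot prompt might simply pair example inputs with example outputs and leave the last answer blank for the model to complete. The classification task, reviews, and labels below are purely illustrative:

```python
# A "vanilla" few-shot prompt: example input/output pairs with no reasoning steps.
# The sentiment-classification task and reviews are illustrative only.
few_shot_prompt = """
Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:
"""
# The model is expected to complete the final line with "Positive",
# imitating the pattern of the two labeled examples above.
```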
From few-shot to chain of thought prompts
Extending a few-shot prompt is what makes a “chain of thought” prompt. You literally make the prompt longer: you ask your question and also provide an example of the prompt-response pattern you’d like to see.
This example contains a sequence of intermediate steps that can guide the model to reason over the decomposed parts of the prompt, quite similar to how we humans process thoughts. With the few-shot prompting approach, the model acquires context to reason and generate an output, and that output will follow the intermediate steps demonstrated in the provided examples.
Chain of thought uses few-shot prompts as examples to guide the LLM from a starting point (to build context) through the desired intermediate steps (to establish the process) that the LLM can then follow in its responses.
In this way, CoT is similar to teaching a young child something new by exposing them to some (few-shot) examples and guiding them through the reasoning process step by step.
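As a minimal sketch, here is how a few-shot example turns into a chain of thought example once the intermediate reasoning is written out. The word problems are illustrative:

```python
# A chain of thought prompt: the few-shot example now includes the intermediate
# reasoning steps, not just the final answer. The word problems are illustrative.
cot_prompt = """
Q: A library has 120 books. It lends out 45 and receives 30 new donations.
How many books does it have now?
A: The library starts with 120 books. Lending out 45 leaves 120 - 45 = 75.
Receiving 30 donations gives 75 + 30 = 105. The answer is 105.

Q: A bakery makes 200 rolls. It sells 140 in the morning and bakes 60 more
in the afternoon. How many rolls does it have at the end of the day?
A:
"""
# The model is guided to reproduce the step-by-step pattern:
# 200 - 140 = 60, then 60 + 60 = 120. The answer is 120.
```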
Few-shot vs. zero-shot prompting
Let’s contrast few-shot prompting with zero-shot prompting. In a zero-shot prompt, the LLM is not provided with any example to gain additional context; it must reason on its own.
The LLM is expected to respond to a complex query that may not be part of the model’s training data. You can also ask the LLM to infer step by step by decomposing the query itself, without actually providing (few-shot) examples of this process.
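For contrast, a zero-shot version of the same kind of query provides no worked examples at all:

```python
# Zero-shot: the same kind of question, with no worked examples.
# The model must reason entirely on its own.
zero_shot_prompt = (
    "A bakery makes 200 rolls. It sells 140 in the morning and bakes 60 more "
    "in the afternoon. How many rolls does it have at the end of the day?"
)
```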
Properties of chain of thought prompting
OK, so what is it about chain of thought prompting that helps LLMs to reason in ways similar to humans? These properties are what help a language model solve complex conversational tasks:
- Decomposing multi-step problems into intermediate tasks. Each task is allocated additional computation and memory.
- The behavior of the model across the intermediate tasks can be interpreted. Any logical error can be debugged within the environment of that intermediate task.
- Commonsense reasoning can be introduced at intermediate steps with Few-Shot prompt examples. In fact, if reasoning can be introduced via language, the LLM can adopt it.
- The performance of an LLM improves substantially if you introduce few-shot CoT examples within the prompt.
Example of CoT prompting
Let’s consider a simple arithmetic example. The image below illustrates the differences between a standard prompt and a chain of thought prompt.
In this example, standard prompting uses a few-shot prompt that solves a similar arithmetic query as an example. Here, the few-shot example does not provide steps or context that the model can use to solve the problem as decomposed intermediate tasks.
On the right, a few-shot example is provided with chain of thought prompting. It provides intermediate steps that serve as an example of how to decompose the next question in the prompt. The LLM then:
- Reviews the first question.
- Understands how the user expects the question to be decomposed.
- Follows through intermediate steps to reach an answer.
This knowledge serves as a context to answer the next question in the user prompt.
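If you want to try a comparison like this yourself, the sketch below sends both a standard few-shot prompt and a chain of thought prompt to the same model. It assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable; the model name and the arithmetic questions are illustrative placeholders:

```python
# Compare a standard few-shot prompt with a chain of thought prompt.
# Assumes the OpenAI Python SDK (pip install openai) and the OPENAI_API_KEY
# environment variable; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()

standard_prompt = (
    "Q: A farm has 15 cows and buys 8 more. How many cows does it have?\n"
    "A: 23\n\n"
    "Q: A shop has 32 apples, sells 20, and buys 6 more. How many apples does it have?\n"
    "A:"
)

cot_prompt = (
    "Q: A farm has 15 cows and buys 8 more. How many cows does it have?\n"
    "A: The farm starts with 15 cows. Buying 8 more gives 15 + 8 = 23. The answer is 23.\n\n"
    "Q: A shop has 32 apples, sells 20, and buys 6 more. How many apples does it have?\n"
    "A:"
)

for name, prompt in [("standard", standard_prompt), ("chain of thought", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {name} prompting ---")
    print(response.choices[0].message.content)
```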
CoT for long prompt queries
But what about long prompt queries? A popular example here is coding scripts, where the user expects the LLM to explain the logic, find errors, and describe the functionality and outputs of the code.
If you have ever pasted a long coding script into ChatGPT, you may find that the response is not relevant to your query.
One of the reasons is that LLMs can understand complex code but may still need a clear CoT description of your query. You may not need to introduce few-shot coding examples; instead, provide a clear description of the intended functionality and how it can be reached.
In this case, you may simply need to ask the LLM to follow a step-by-step chain of thought prompting process. Ultimately, this is similar to zero-shot prompting but following a CoT approach.
An example of this is shown in the figure below, where researchers simply ask the model to “think step by step” as part of the zero-shot prompt query (Box C, on the lower-left side).
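As a minimal sketch of the same idea, zero-shot CoT can be as simple as appending that instruction to your query; the example question is illustrative:

```python
# Zero-shot chain of thought: no worked examples, just an explicit
# instruction to reason step by step before answering.
zero_shot_cot_prompt = (
    "A train leaves at 9:40 and the trip takes 2 hours and 35 minutes. "
    "What time does it arrive? Let's think step by step."
)
```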
Prompt best practices for users
The idea of CoT is to simplify the reasoning process for the LLM. Machines don’t think in the same way as humans. And while LLMs may already have the factual information, they may need guidance from the user. This is not necessarily a limitation of the LLM either: LLMs should not be expected to interpret a query based purely on user intent.
Instead, the user is expected to communicate (prompt) their intent to clarify the query. Techniques like few-shot or zero-shot CoT prompt the LLM to connect the dots in the right direction.
Since LLMs aim for general intelligence, you can use chain of thought prompting to focus the knowledge of the LLM in the direction best suited to the intent of your query. The following user-side practices can help:
Think step by step
For example, instead of asking: “What is the capital of the country with the largest GDP?”, you can break down the query into intermediate steps: “Which country has the largest GDP? Then, name the capital of that country.”
For more complex coding queries, you can achieve similar results by following the Few-Shot or the Zero-Shot CoT approach.
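As a rough sketch, a coding query can be structured as explicit steps rather than one broad request; the task described below is illustrative:

```python
# A coding query structured as explicit steps instead of one broad request.
# The function and requirements described are illustrative only.
coding_cot_prompt = """
I will paste a Python function that parses log lines. Please:
1. Summarize what the function does, line by line.
2. List the assumptions it makes about the input format.
3. Point out any edge cases that could raise an exception.
4. Suggest a fix for each issue you find.
"""
```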
External verification loops
Prompt the LLM to evaluate its own answer. For example, follow up a query with the following:
- “Review your response for factual errors.”
- “Verify your response with recent academic literature.”
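As a minimal sketch of such a loop, the snippet below asks the original question and then feeds the model’s own answer back with one of the verification prompts above. It assumes the OpenAI Python SDK; the model name is a placeholder:

```python
# A simple self-verification loop: ask the question, then feed the model's
# own answer back with a request to check it. Assumes the OpenAI Python SDK
# and OPENAI_API_KEY; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

messages = [{"role": "user",
             "content": "Which country has the largest GDP? Then, name the capital of that country."}]
first = client.chat.completions.create(model=MODEL, messages=messages)
answer = first.choices[0].message.content

# Follow-up turn: one of the verification prompts listed above.
messages += [{"role": "assistant", "content": answer},
             {"role": "user", "content": "Review your response for factual errors."}]
second = client.chat.completions.create(model=MODEL, messages=messages)
print(second.choices[0].message.content)
```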
Fact-checking and fine-tuning
A general-purpose LLM may not have knowledge or context about a niche topic that has not been published online.
In this scenario, you can attempt to retrain (fine-tune) your LLM with your own dataset, which is resource-intensive and only works with open-source models that you deploy in your own systems. Or you can manually add context to the prompt, which only works if you already have the context you need.
Retrieval-Augmented Generation (RAG)
If you have access to an external knowledge base, such as a vector database, you can retrieve this knowledge in real time during CoT prompting. Here’s how it works:
- User query: “Specify the industry best-practice steps for deploying a high-performance and secure microservices pipeline to production.”
- Retrieval: The RAG pipeline may find relevant internal documents such as “microservice deployment policy v3”, “CI/CD guidelines for Kubernetes”, and more.
- Generation: The LLM reads the retrieved documentation and composes a response based on the updated knowledge. For example: “To deploy a microservice to production, follow these steps:
- Merge code to the main branch
- Ensure all CI checks pass
- Submit a change request via Jira
- Await approval from DevOps
- Use the prod-deploy.sh script for release…”
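To make the flow concrete, here is a toy, self-contained sketch of the retrieve-then-generate pattern. It uses a simple word-overlap score instead of a real embedding model and vector database, and the document titles and contents are illustrative only:

```python
# Toy RAG sketch: retrieve the most relevant internal documents with a simple
# word-overlap score, then prepend them to the prompt. Real pipelines use an
# embedding model and a vector database; everything here is illustrative.

documents = {
    "microservice deployment policy v3":
        "Merge code to the main branch, ensure all CI checks pass, "
        "submit a change request via Jira, and await DevOps approval.",
    "CI/CD guidelines for Kubernetes":
        "Use the prod-deploy.sh script for release and verify rollout "
        "status before closing the change request.",
    "office travel policy":
        "Book flights through the internal portal at least 14 days in advance.",
}

def relevance(query: str, text: str) -> int:
    """Count how many query words appear in the document (toy relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

query = ("Specify the industry best-practice steps for deploying a "
         "high-performance and secure microservices pipeline to production.")

# Retrieval: keep the two highest-scoring documents as context.
ranked = sorted(documents.items(), key=lambda kv: relevance(query, kv[1]), reverse=True)
context = "\n\n".join(f"{title}:\n{text}" for title, text in ranked[:2])

# Generation: the retrieved context is placed ahead of the user query, so the
# LLM composes its answer from the retrieved knowledge.
rag_prompt = f"Answer using only the following internal documents:\n\n{context}\n\nQuestion: {query}"
print(rag_prompt)
```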
The challenge of hallucinations — and how it’s being solved
Now, unlike power users in the tech industry, the average LLM user does not have the resources and expertise to retrain and deploy specialized open-source LLMs or integrate RAG tooling into their LLM pipeline.
Many users expect and want AI companies to deliver on their promises of AGI sooner rather than later. But what they frequently witness are symptoms of hallucination (such as the strawberry problem). Vendors such as OpenAI are responding to the challenge of hallucination in two distinct directions:
Universal AGI
One example is the Q* algorithm, a code name for an OpenAI algorithm or LLM architecture pipeline that allows the model to intelligently reason through steps. (To be clear, this Q* algorithm is not Q-Learning.)
OpenAI never published a paper on this, and we don’t really know whether it was an actual algorithm capable of moving beyond prompt-based reasoning, such as CoT, to a more autonomous and intelligent reasoning mechanism.
Over-engineering to perfect a user-centric LLM solution
This is closer to what we see with LLMs today: the Reinforcement Learning from Human Feedback (RLHF) workflow is now even more deeply rooted in the latest OpenAI LLMs. Here’s how it works:
- Pretraining: The LLM is pretrained on a large data set (unsupervised learning approach).
- Fine-tuning: The model is fine-tuned on a smaller, high-quality dataset of human-written prompt-response pairs, a step known as Supervised Fine-Tuning (SFT). OpenAI has focused especially on the SFT approach for code-related prompt tasks.
- Rank and reward: Human experts rank the model outputs, and these rankings are used to train a reward model. (The reward model defines the reward signal that describes the value of an action taken by the reinforcement learning agent in a given state.) Since the agent follows reward feedback generated by human experts, it is more likely to mimic a human developer instead of producing hallucinated outputs.
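For a sense of what the reward-model step looks like in code, here is a minimal sketch of a pairwise (Bradley-Terry style) ranking loss in PyTorch, a common choice for training reward models. The tiny linear model and random features are stand-ins, not OpenAI’s actual implementation:

```python
# Minimal sketch of training a reward model from human rankings with a
# pairwise (Bradley-Terry style) loss. The tiny linear "reward model" and
# random features are illustrative stand-ins, not a production setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(128, 1)  # toy stand-in for a transformer-based reward head
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend features for (chosen, rejected) response pairs from human ranking.
chosen_features = torch.randn(8, 128)    # responses the human experts preferred
rejected_features = torch.randn(8, 128)  # responses the human experts ranked lower

chosen_reward = reward_model(chosen_features)      # shape: (8, 1)
rejected_reward = reward_model(rejected_features)  # shape: (8, 1)

# The loss is small when preferred responses score higher than rejected ones.
loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```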
Modern LLMs and reasoning models, such as GPT-4 Turbo, GPT-4.5, Operator, and o1 pro, are designed specifically to reduce hallucinations by emulating the human approach to learning, reasoning, and responding.