Large language models (LLMs) are conversational AI models trained on vast amounts of data from the internet. Typically containing hundreds of billions of parameters or more, these models are often fine-tuned on specialized knowledge bases in domains such as mathematics, computer science, and biology.
Yet sometimes it seems that LLMs don’t perform well — particularly on language tasks that require reasoning or common sense.
In fact, complex reasoning tasks have seriously challenged the scaling laws of large conversational AI models. It seems that conversational AGI is simply not ready just yet, even for simple reasoning problems such as counting the number of Rs in the word “strawberry”. So much so that OpenAI released a new model engineered specifically for complicated reasoning queries such as the Strawberry R’s problem, and called it o1 (code-named Strawberry).
Why is that and how can you help solve complicated reasoning tasks with an LLM?
Consider your own thought process when you solve an arithmetic problem or a question that requires common sense.
Usually, you would decompose the problem into simple intermediate steps. Solving one step will lead you closer to solving the next one — you will eventually reach a final solution that is reasoned and verified with logic at every step. This reasoning and logic may be hidden, unexplainable, or considered “common sense”. And that’s why it’s so difficult for language models to get right, right now.
The term chain of thought (CoT) prompting refers to the practice of decomposing complex user queries into intermediate steps and providing few-shot examples that demonstrate those step-by-step answers.
Deciding when to use CoT is important. Chain of thought is particularly valuable for complex tasks, and its benefits tend to show up only in larger models; smaller models will likely perform worse with CoT. Think of CoT prompting as a more advanced framework for prompt engineering, where the AI model consumes examples in order to learn how to break a problem into steps and reason through each one.
Because CoT is a more advanced framework, let’s detour into few-shot prompting, the foundation for chain of thought prompting.
Few-shot prompting refers to the manual process of providing the LLM with some examples of a task in order to generate a desired response. The few-shot method is valuable for scenarios lacking extensive training data.
A few-shot prompt can be an example of a similar prompt-response combination. “Vanilla” few-shot methods may not provide any additional context or logic for solving the prompt query in intermediate steps.
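To make this concrete, here is a minimal sketch of a “vanilla” few-shot prompt in Python. The arithmetic questions are purely illustrative; the point is that the exemplar shows only a final answer, with no intermediate reasoning.

```python
# A "vanilla" few-shot prompt: one worked example (question + final answer),
# followed by the new question we actually want the model to answer.
# Note that the exemplar gives no intermediate reasoning steps.
few_shot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

print(few_shot_prompt)  # this string is what you would send to the LLM
```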
Extending the few-shot prompt is what creates a “chain of thought” prompt. Here’s how: you make the prompt longer by asking your question and also providing an example of the prompt-response you’d like to see.
This example contains a sequence of intermediate steps that can guide the model to reason over the decomposed parts of the prompt, quite similar to how we humans process thoughts. With the few-shot prompting approach, the model acquires the context it needs to reason and generate an output, and that output will follow the intermediate steps shown in the provided examples.
Chain of thought takes few-shot prompts as examples to guide the LLM from a starting point (to build context), through the desired intermediary steps (to build desired process) that the LLM can follow in its responses.
In this way, CoT is similar to teaching a young child something new by exposing them to some (few-shot) examples and guiding them on the reasoning process, doing so step by step.
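Concretely, the only thing that changes from the few-shot sketch above is the exemplar answer: it now spells out the reasoning the model should imitate. Here is a minimal sketch using the same illustrative arithmetic example.

```python
# The same few-shot prompt, extended into a chain of thought prompt:
# the exemplar answer now walks through the intermediate steps,
# so the model has a reasoning pattern to imitate for the new question.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

print(cot_prompt)  # send this string to the LLM in place of the vanilla prompt
```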
Let’s contrast few-shot prompting with the zero-shot prompt. In a zero-shot prompt, the LLM is not provided with any example from which to gain additional context, so it must reason on its own.
The LLM is expected to respond to a complex prompt query that may not be directly covered by its training data. You can also ask the LLM to reason step by step by decomposing the prompt query itself, without actually providing (few-shot) examples of this process.
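For contrast, here is what the zero-shot versions of the same illustrative question look like: the bare question, and a variant that asks the model to decompose the problem without showing it an example of how.

```python
# Zero-shot: no examples at all; the model reasons entirely on its own.
zero_shot_prompt = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)

# A zero-shot variant that asks for decomposition without demonstrating it.
zero_shot_decomposed = (
    zero_shot_prompt
    + " Break the problem into intermediate steps before giving the final answer."
)
```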
OK, so what is it about chain of thought prompting that helps LLMs reason in ways similar to humans? The key properties are that CoT decomposes a complex task into manageable steps and makes each intermediate reasoning step explicit, which is what helps a language model solve complex conversational tasks.
Let’s consider this simple example we looked at above:
This image illustrates the difference between a standard prompt and a chain of thought prompt.
In this example, standard prompting uses a few-shot prompt that solves a similar arithmetic query as an example. However, the few-shot example does not provide steps or context that the model can use to solve the problem as decomposed intermediary tasks.
On the right, a few-shot example is provided with the chain of thought prompting. It provides intermediary steps that show how to decompose the next question in the prompt. The LLM then picks up this pattern of breaking the problem into steps and reasoning through each one.
This knowledge serves as a context to answer the next question in the user prompt.
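To try a comparison like the one in the image yourself, here is a hedged sketch that sends a standard few-shot prompt and its chain of thought counterpart to a chat model. It assumes the official OpenAI Python SDK, an OPENAI_API_KEY environment variable, and a model name like gpt-4o; swap in whatever client and model you actually use.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

QUESTION = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?\nA:"
)

# Standard few-shot exemplar: final answer only.
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\nA: The answer is 11.\n\n" + QUESTION
)

# Chain of thought exemplar: the same example with its reasoning spelled out.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n\n" + QUESTION
)

for label, prompt in [("Standard", standard_prompt), ("Chain of thought", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} prompt ---")
    print(response.choices[0].message.content)
```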
But what about long prompt queries? A popular example here is coding scripts, where the user expects the LLM to explain the reasoning, find errors, and describe the functionality and outputs of the code.
If you have ever used a long coding script with ChatGPT, you may find that the response is not relevant to your query.
One of the reasons is that LLMs can understand complex code, but they may still need a clear, CoT-style description of your query. You may not need to introduce few-shot coding examples; instead, provide a clear description of the desired functionality and how it can be reached.
In this case, you may simply need to ask the LLM to follow a step-by-step chain of thought process. Ultimately, this is zero-shot prompting combined with a CoT approach.
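Here is a sketch of that zero-shot CoT pattern applied to a coding query: no few-shot examples, just a clear description of the task plus an explicit instruction to work through the code step by step. The buggy function is invented purely for illustration.

```python
# Zero-shot CoT for a coding query: no examples, just a clear, step-by-step ask.
buggy_code = """
def average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / (len(numbers) - 1)   # suspicious denominator
"""

code_review_prompt = f"""Review the following Python function.
Think step by step:
1. Describe what the function is intended to do.
2. Walk through its logic line by line.
3. Identify any bugs and explain why they are wrong.
4. Suggest a corrected version.

{buggy_code}
"""

print(code_review_prompt)  # send this to the LLM as a single user message
```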
An example of this is shown below, where researchers simply ask the model to “think step by step” as part of the zero-shot prompt query in Box C, on the lower-left side:
The idea of CoT is to simplify the reasoning process for the LLM. Machines don’t think in the same way humans do. And while we can reasonably assume that LLMs already hold the factual information, they may need guidance from the user. This is not necessarily a limitation of the LLM either: LLMs should not be expected to infer the exact intent behind a user query on their own.
Instead, the user is expected to communicate (prompt) their intent to better clarify the query. Techniques like Few-Shot or Zero-Shot CoT prompt the LLM to connect the dots in the right direction.
Since LLMs aim for universal intelligence (AGI), you can use chain of thought prompting to steer the knowledge and intelligence of the LLM in the direction best suited to the intent of your query. Several user-side mitigations can help, starting with decomposing the query yourself.
For example, instead of asking: “What is the capital of the country with the largest GDP?”, you can break down the query into intermediate steps: “Which country has the largest GDP? Then, name the capital of that country.”
For more complex coding queries, you can achieve similar results by following the Few-Shot or the Zero-Shot CoT approach.
External verification loops prompt the LLM to evaluate its own answer. For example, follow up a query with something like: “Review your previous answer, verify each step, and correct any errors you find.”
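A verification loop can be scripted as a simple two-turn conversation: ask the question, then feed the model’s answer back and ask it to check each step. This is a minimal sketch assuming the OpenAI Python SDK and a chat model such as gpt-4o; the wording of the verification prompt is only an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
MODEL = "gpt-4o"   # assumption: any chat model you have access to

def ask(messages):
    """Send the running conversation and return the assistant's reply text."""
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

# Turn 1: the original query, decomposed into intermediate steps.
messages = [{
    "role": "user",
    "content": "Which country has the largest GDP? Then, name the capital of "
               "that country. Explain your reasoning step by step.",
}]
answer = ask(messages)

# Turn 2: the verification loop -- feed the answer back and ask for a check.
messages.append({"role": "assistant", "content": answer})
messages.append({
    "role": "user",
    "content": "Review your previous answer. Verify each step, point out any "
               "errors, and provide a corrected final answer if needed.",
})
print(ask(messages))
```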
It is likely that a universal LLM may not have knowledge or context about a niche topic that has not already been published online.
In this scenario, you can attempt to retrain (fine-tune) the LLM with your own data set, which is resource-intensive and only works with open-source models that you deploy on your own systems. Or you can manually add context to the prompt, which only works if you already have the context you need.
Given that you already have access to an external knowledge base in the form of a vector database, you can retrieve this knowledge in real time during CoT prompting. Here’s how it works: the user’s query is embedded and used to search the vector database, the most relevant passages are retrieved, and those passages are injected into the prompt as additional context that the LLM can reason over step by step.
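Here is a minimal sketch of that retrieval-augmented CoT flow. The vector_db object, its search method, and the ask_llm helper are hypothetical placeholders for whatever vector database client and LLM client you actually use.

```python
def retrieve_context(vector_db, query: str, k: int = 3) -> str:
    """Hypothetical retrieval step: search the vector database for the k most
    relevant passages and join them into a single context block."""
    passages = vector_db.search(query, k=k)       # placeholder API
    return "\n\n".join(p.text for p in passages)  # placeholder result shape

def rag_cot_prompt(vector_db, question: str) -> str:
    """Inject retrieved knowledge into a chain of thought prompt."""
    context = retrieve_context(vector_db, question)
    return (
        "Use only the context below to answer the question. "
        "Think step by step and note which passage supports each step.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example usage (placeholders):
# prompt = rag_cot_prompt(my_vector_db, "How do I rotate the API keys for service X?")
# answer = ask_llm(prompt)   # ask_llm stands in for your LLM call
```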
Now, unlike the power users in the tech industry, the average user of LLMs does not have the resources and expertise to retrain and deploy specialized open source LLMs or integrate RAG tooling into their LLM pipeline.
Many users expect and desire AI companies to deliver on their promises of AGI sooner rather than later. But what they frequently witness are symptoms of hallucination (such as the Strawberry R’s problem). Vendors such as OpenAI are responding to the challenge of hallucination in two distinct directions:
One example is the Q* algorithm, a code name for an OpenAI algorithm or LLM architecture pipeline that allows the model to intelligently reason through steps. (To be clear, this Q* algorithm is not Q-Learning.)
OpenAI never published a paper on this. We don’t really know if it was an actual algorithm capable of moving beyond prompt-based reasoning, such as CoT, to a more autonomous and intelligent prompting mechanism.
This is closer to what we see with LLMs today: the Reinforcement Learning from Human Feedback (RLHF) flow is now even more deeply rooted in the latest OpenAI LLMs. Here’s how it works: the model generates candidate responses, human reviewers rate or rank those responses, a reward model is trained on the rankings, and the LLM is then fine-tuned with reinforcement learning to prefer the responses the reward model scores highly.
Modern LLMs, such as the o1 and o1 pro reasoning models, GPT-4 Turbo, and GPT-4.5, as well as products like Operator, are designed to reduce hallucinations by emulating the human approach of learning, reasoning, and responding.