What is Agent Tokenomics?
Observability
Pratik Bhavsar
Key takeaways
- The agent bill compounds on every step. Each step re-sends the same instructions, schemas, and history through a stateless API, so a fifteen-word request can balloon into tens of thousands of billed tokens before the task finishes.
- Five buckets reveal what your tokens actually bought. Sorting every token into context, reasoning, retrieval, coordination, or governance turns one flat invoice into a map of which spend drove task progress and which was pure overhead.
- Set the bar for what counts as a good output, then track what it really costs to hit it, including the eval cost, not just the token bill.
Somewhere this month, an agent processed a fifteen-word user request and burned 60,000 tokens doing it. 62% of those tokens were the model re-reading context it had already seen. 3,000 went to tool schemas the agent never used.
Enterprise AI spending climbed from $1.2 million per company in 2024 to $7 million in 2026, even as per-token API prices fell roughly 280 times over the same period. The unit got cheaper, but the number of tokens per task increased dramatically. Uber burned through its entire 2026 Claude Code budget by April. Teams are still budgeting with chatbot-era cost models, where one prompt equals one call. Agents broke that assumption, and the math needs to be updated.
Cisco's Jeetu Patel put it plainly at Cisco Live 2026: the risk is when the cost of tokens and the value those tokens produce start to diverge. Getting visibility into that gap is the whole game. While the cost of generating tokens has dropped significantly (e.g., 1000x over 12-18 months), token consumption has exploded (10,000x to 100,000x).
To get visibility into this consumption, Patel described a stack-level approach to tokenomics: measuring utilization at the GPU layer, performance at the model layer, task completion at the application layer, and behavior at the agent layer, which he called "the hardest" part. Patel said the market is still in phase one, and companies need to get familiar with token spend before they can get good at governing it. Building that familiarity is what this series is all about.
What Is Agent Tokenomics?
In plain terms, agent tokenomics is the discipline of classifying and governing how tokens move through an agent system. Tokens function as the working capital of intelligence: they consume compute and price every unit of AI work. The question that matters is which tokens bought task progress, which bought reliability, and which bought nothing.
Types of Tokens
Every token in an agent call falls into one of these categories. This classification is the foundation for every optimization decision that follows.
Context Tokens
Context tokens are the system instructions, conversation history, and tool schemas: everything that rides along on every call. Some are necessary, like the safety policy and task state. Others are overhead, like schemas for tools the agent will never invoke on this task or stale conversation turns from steps that no longer matter. Research from the Stanford Digital Economy Lab found that re-sent context accounts for 62% of agent inference bills. The waste within that 62% is specific: irrelevant schemas, stale history, stable prefixes that could be cached but are not.
Reasoning Tokens
Reasoning tokens are generated while the model thinks: planning, chain-of-thought, and intermediate steps. They lead to better outputs on hard, multi-step problems but burn money on easier ones, so most APIs for reasoning models now let you dial reasoning depth up or down per call.
Retrieval Tokens
Retrieval tokens come from RAG, where you pull in documents to ground the model in facts. Too little retrieval and the model hallucinates. Too much and the context fills with noise, which pushes the model into extra reasoning loops to compensate. Long retrieved contexts can also trigger more compactions, raising the risk of dropping useful information along the way.
Tool Tokens
Tool tokens come from the model calling tools and reading back what they return: API responses, database rows, search results, the output of a function. The call itself is cheap, but the response can be large and lands in the context window, where it rides along on every later step of the loop.
Coordination Tokens
Coordination tokens appear in multi-agent systems: role prompts, synchronization messages, shared state. They buy specialization but grow fast. After a point, communication overhead outpaces the value of splitting work across agents.
Governance Tokens
Governance tokens cover validation, security checks, evaluation, and human review triggers. They are often invisible in the bill, which means they are never budgeted deliberately.
The goal is to understand which buckets earn their cost, and which are pure overhead. That distinction is what separates a real optimization strategy from guesswork.
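As a rough illustration of the classification, each segment of a call can be tagged with a bucket and totalled per bucket. This is a minimal sketch, not a schema from any tracing tool: the `Bucket` enum, the `Segment` type, both helper functions, and all token counts in `EXAMPLE_CALL` are hypothetical names and numbers chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    """The token categories described in this section."""
    CONTEXT = "context"
    REASONING = "reasoning"
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    COORDINATION = "coordination"
    GOVERNANCE = "governance"


@dataclass(frozen=True)
class Segment:
    """One labelled slice of a call (hypothetical structure)."""
    label: str
    bucket: Bucket
    tokens: int


def tokens_by_bucket(segments):
    """Sum token counts per bucket."""
    totals = Counter()
    for seg in segments:
        totals[seg.bucket] += seg.tokens
    return totals


def bucket_shares(segments):
    """Return each bucket's fraction of the call's total tokens."""
    totals = tokens_by_bucket(segments)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: count / grand_total for bucket, count in totals.items()}


# Made-up numbers for a single call, for illustration only.
EXAMPLE_CALL = [
    Segment("system instructions", Bucket.CONTEXT, 1_500),
    Segment("tool schemas", Bucket.CONTEXT, 3_000),
    Segment("customer records", Bucket.RETRIEVAL, 2_500),
    Segment("planning", Bucket.REASONING, 800),
    Segment("order API response", Bucket.TOOL, 1_200),
    Segment("output validation", Bucket.GOVERNANCE, 300),
]

if __name__ == "__main__":
    for bucket, share in sorted(bucket_shares(EXAMPLE_CALL).items(),
                                key=lambda item: item[1], reverse=True):
        print(f"{bucket.value:<13} {share:.1%}")
```

Sorting by share puts the dominant bucket at the top, which is usually the first place to look for overhead.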
The Cost of Tokens
Let's look at a single call through this lens of tokenomics. A user types: "check if this customer's last three orders are eligible for a refund." About 12 words.
The system sends 1,500 context tokens of system instructions, another 3,000 context tokens of tool schemas (including tools irrelevant to this task), and 2,500 retrieval tokens of customer data. The user's message of just 20 tokens becomes roughly 7,000 input tokens before the model produces anything.
The output is roughly 3,000 tokens: reasoning, a single tool call, and the final answer.
Then the agent loops. Every API call is stateless, so the full accumulated context is re-sent with each step. Call two: 10,000 tokens. Call three: 13,000. Call four: 17,000 as tool results accumulate. By call five, the model has re-processed the same context four times over, and the context window is now dominated by information the model already acted on in previous steps.
A five-step task that should cost less than 10,000 tokens of useful work ends up consuming over 40,000.
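The compounding can be sketched in a few lines. The example below uses the input sizes quoted for calls one through four (7,000, 10,000, 13,000, 17,000). The per-step growth values are simply the differences between those figures, and the function name is hypothetical. Call five is left out because the walkthrough gives no size for it.

```python
def input_tokens_per_call(first_call_input, added_per_step):
    """Input size of each call when the full accumulated context is re-sent.

    Each call carries everything the previous call carried, plus whatever
    the previous step added (model output, tool results).
    """
    sizes = [first_call_input]
    for added in added_per_step:
        sizes.append(sizes[-1] + added)
    return sizes


# Growth between calls, derived from the figures in the walkthrough.
ADDED_PER_STEP = [3_000, 3_000, 4_000]

sizes = input_tokens_per_call(7_000, ADDED_PER_STEP)
total_billed_input = sum(sizes)

# Tokens that entered the context for the first time at some point.
first_seen = 7_000 + sum(ADDED_PER_STEP)

# Everything else was billed again after the model had already seen it.
resent = total_billed_input - first_seen

if __name__ == "__main__":
    print("input per call:", sizes)
    print("total billed input:", total_billed_input)
    print("re-sent tokens:", resent)
```

Across the first four calls this comes to 47,000 billed input tokens, of which 30,000 are context the model had already processed on an earlier call.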
When LeanOps audited 30 engineering teams running agents in production, they found a 20x spread between the cheapest and most expensive developers on the same team doing similar work. One company went from $87,000/month to $24,000 after classifying its token spend and changing the architecture. The savings came entirely from overhead that nobody had categorized until someone looked. The surprising thing is that sprint velocity stayed flat!
Metrics for Agent Tokenomics
Token yield tells you how much quality you get per million tokens. You pick the quality bar that matters for the workflow. For a support agent that might mean the task got done, hallucination stayed under your limit, and nobody had to escalate to a human. A session counts as successful only when it clears every bar you marked as required. Token yield is then successful sessions per million tokens. Your evals decide what "successful" means, so the number only works once they are in place. A refund lookup and summarization are different workflows, so each gets its own bar and its own yield.
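A minimal sketch of the token yield calculation follows, assuming each session records its total tokens and a pass/fail result for every eval check. The `Session` type, the check names in `REQUIRED_CHECKS`, and the sample sessions are hypothetical. Your own evals define the real checks.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One agent session: total tokens used and eval results (hypothetical)."""
    tokens: int
    checks: dict = field(default_factory=dict)  # check name -> passed?


# Illustrative bar for a support workflow; the names are assumptions.
REQUIRED_CHECKS = ("task_completed", "hallucination_under_limit", "no_escalation")


def is_successful(session, required=REQUIRED_CHECKS):
    """A session succeeds only if it passes every required check."""
    return all(session.checks.get(name, False) for name in required)


def token_yield(sessions, required=REQUIRED_CHECKS):
    """Successful sessions per million tokens across all sessions."""
    total_tokens = sum(s.tokens for s in sessions)
    if total_tokens == 0:
        return 0.0
    successes = sum(1 for s in sessions if is_successful(s, required))
    return successes / total_tokens * 1_000_000


ALL_PASS = {name: True for name in REQUIRED_CHECKS}

SAMPLE_SESSIONS = [
    Session(tokens=40_000, checks=ALL_PASS),
    Session(tokens=60_000, checks=ALL_PASS),
    # Escalated to a human, so it does not count as successful.
    Session(tokens=100_000, checks={**ALL_PASS, "no_escalation": False}),
]

if __name__ == "__main__":
    print(f"token yield: {token_yield(SAMPLE_SESSIONS):.1f} successes per 1M tokens")
```

Failed sessions still count in the denominator, since their tokens were billed. Computing the yield per workflow, each with its own `required` tuple, keeps a refund lookup from being judged against a summarization bar.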
On the other side, cost per accepted task is the real price of one good result. Here is how we can calculate it.
cost per accepted task = model cost + tool/runtime cost + human review cost
Take a support agent on a cheap model at $0.02 a task. That looks efficient. But say 40% of its answers need a human to review them, fifteen minutes each, and your support staff costs $30 an hour. Each review costs $7.50, so for those answers the real cost is about $7.52. For the clean ones it stays $0.02. The blended number is about $3.02 per task, far above the invoice.
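The arithmetic behind the support-agent example can be worked through as a short sketch. The function names are hypothetical, and the tool/runtime cost is assumed to be zero because the example does not give one. A fifteen-minute review at $30 an hour costs $7.50, so a reviewed answer costs $7.52, and the blended cost across all tasks is $0.02 + 0.40 x $7.50 = $3.02.

```python
def review_cost(minutes, hourly_rate):
    """Cost of one human review."""
    return minutes / 60 * hourly_rate


def blended_cost_per_task(model_cost, tool_cost, review_rate, minutes, hourly_rate):
    """Average cost per task when only a fraction of tasks need review.

    Follows the formula above: model cost + tool/runtime cost + human
    review cost, with the review cost weighted by how often it occurs.
    """
    return model_cost + tool_cost + review_rate * review_cost(minutes, hourly_rate)


MODEL_COST = 0.02   # dollars per task, from the example
TOOL_COST = 0.0     # assumption: the example gives no tool/runtime cost
REVIEW_RATE = 0.40  # share of answers that need a human review
REVIEW_MINUTES = 15
HOURLY_RATE = 30.0  # dollars per hour

reviewed_task_cost = MODEL_COST + TOOL_COST + review_cost(REVIEW_MINUTES, HOURLY_RATE)
clean_task_cost = MODEL_COST + TOOL_COST
blended = blended_cost_per_task(
    MODEL_COST, TOOL_COST, REVIEW_RATE, REVIEW_MINUTES, HOURLY_RATE
)

if __name__ == "__main__":
    print(f"reviewed task: ${reviewed_task_cost:.2f}")
    print(f"clean task:    ${clean_task_cost:.2f}")
    print(f"blended:       ${blended:.2f}")
```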
How To Get Started With Tokenomics
Do not start by switching models or pruning context. Start with the numbers you can already pull.
The full categorization is the goal, not the starting point. Splitting reasoning tokens from retrieval tokens takes granularity most teams have not wired up yet. So, begin with what every API response already gives you: total input and output tokens per call.
Find your worst offenders, then add the finer bucket split as tracing improves.
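As a starting point, per-call totals can be rolled up by workflow and ranked. The sketch below assumes you have already logged each call as a record with a workflow name plus input and output token counts. The field names (`workflow`, `input_tokens`, `output_tokens`) and the sample records are hypothetical, so map them to whatever your provider's usage data actually returns.

```python
from collections import defaultdict


def rank_workflows(call_records):
    """Aggregate token totals per workflow and sort by total tokens, descending."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0})
    for record in call_records:
        entry = totals[record["workflow"]]
        entry["calls"] += 1
        entry["input_tokens"] += record["input_tokens"]
        entry["output_tokens"] += record["output_tokens"]
    return sorted(
        totals.items(),
        key=lambda item: item[1]["input_tokens"] + item[1]["output_tokens"],
        reverse=True,
    )


# Made-up log records, for illustration only.
SAMPLE_RECORDS = [
    {"workflow": "refund_lookup", "input_tokens": 7_000, "output_tokens": 3_000},
    {"workflow": "refund_lookup", "input_tokens": 10_000, "output_tokens": 1_000},
    {"workflow": "summarize_ticket", "input_tokens": 4_000, "output_tokens": 500},
]

if __name__ == "__main__":
    for workflow, stats in rank_workflows(SAMPLE_RECORDS):
        total = stats["input_tokens"] + stats["output_tokens"]
        per_call = total / stats["calls"]
        print(f"{workflow:<18} total={total:>7} calls={stats['calls']} avg/call={per_call:,.0f}")
```

The workflows at the top of the ranking are the ones worth tracing in finer detail first.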
Check Your Own Tokenomics for Free
If you code with Claude Code or Codex, you can watch these numbers live. Token Meter, an open-source tool from Galileo Agent Labs, follows your local agent logs and streams the cost of a run to a dashboard on your own machine. It runs entirely locally, with no API keys and no telemetry, so it is a low-risk way to see where your tokens go before you change anything. Point it at your next long run and watch which bucket fills.
Token Meter's menu-bar widget is a companion to the full dashboard: it lives in the macOS menu bar and surfaces the state of your active agent session at a glance.
FAQs
Why are AI agents so expensive to run?
AI agents are expensive because they loop, and every step re-sends the accumulated context through a stateless API, so token use compounds with each call. A short request can bill several times more tokens than the useful work actually needs.
How can I reduce AI agent token costs?
Start by classifying one week of agent calls into context, reasoning, retrieval, coordination, and governance tokens to see which bucket dominates the bill. Cut the largest source of overhead first, since optimizing before you measure usually trims the wrong thing.
How do I track AI agent token usage?
Use an observability tool that traces each agent call and breaks down token usage for you, instead of reading totals off a billing dashboard. Good tracing shows tokens per call and per workflow, along with task outcomes, so you can see where the spend actually goes before trying to cut it.
How do I test whether a cheaper model is good enough?
Run the step on both the current and cheaper model across a real sample of your traffic, then score the outputs against the same bar you set for that workflow. Study the shape of the failures, since a recoverable miss is acceptable but a rare catastrophic one is not.