The Hidden Cost of Agentic AI: Why Most Projects Still Die Before Production
Paul Lacey
Key takeaways
- The demo is cheap. The production agent is not. The costs that kill agentic projects rarely show up on the model invoice. They hide in evaluation, infrastructure, runtime loops, and a pricing model that taxes the testing you needed to catch the problem.
- Cheaper tokens, bigger bills. Token prices are falling, but agents burn 5 to 30 times more tokens per task than a single chatbot call. The unit price drops while the unit count explodes, so total spend climbs.
- You can't control a cost you can't see. Agent observability makes the spend visible, turn by turn, so you catch the runaway loop before it reaches the bill.
- Plan for cost from day one. Route each task to the right-sized model, evaluate 100% of agents cost-effectively, and set thresholds that stop a runaway agent before it spends.
V2 refresh, June 2026. Originally published Aug 21, 2025 by Galileo (now a part of Cisco).
Your demo delivered. The room was sold. Then you shipped, and the bill arrived in a shape nobody expected.
You’re not careless. Even the most disciplined teams are getting blindsided. In April 2026, Uber’s CTO announced they had burned through the company’s entire annual AI coding budget in four months. Microsoft cut most of its internal Claude Code licenses. Meta now ranks engineers on a usage leaderboard it calls “Claudeonomics” (Fortune, May 2026). These aren’t stories about reckless spenders. They’re proof that agents spend faster than anyone plans for.
Here’s the part that should give you hope: agentic projects rarely die on the main compute bill. They die from the hidden costs that were never budgeted. Gartner expects over 40% of agentic AI projects to be canceled by the end of 2027. MIT found 95% of GenAI deployments return zero measurable value. S&P watched abandonment jump from 17% to 42% in a single year. None of those is a model problem. They’re visibility problems. And a cost you can see is a cost you can fix.
Five costs hide between your demo and production: tokens, data, evals, guardrails, and a pricing model that punishes you for checking the other four. Make them visible and you ship before the invoice and the incident show up together. That visibility has a name: agent observability.
Tokens: Cheaper Units, Bigger Bills
Token prices are in freefall. GPT-4-class capability that cost $20 per million tokens in 2022 runs under a dollar today. Gartner expects inference to cost 90% less by 2030. Tempting to conclude the cost problem solves itself.
That trend line is the trap.
Agents don’t make one call. They loop, plan, call tools, retry, hand off, and resend their full context at every step. By step 20 you’ve paid for the same history 20 times. A 2026 Concordia University study clocked the waste: a 2-to-1 input-to-output ratio, a “communication tax,” with code review alone eating 59% of every token spent. Add it up and agents burn 5 to 30 times more tokens per task than a chatbot (Gartner, 2026). The price per token drops, but the token count explodes, and the bill climbs anyway.
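To make the resend effect concrete, here is a minimal sketch of the arithmetic. The function name and the token counts (a 1,000-token starting context, 500 tokens added per step) are illustrative assumptions, not measurements from any of the studies cited above.

```python
def cumulative_input_tokens(steps: int, start_tokens: int, added_per_step: int) -> int:
    """Total input tokens billed when every step resends the full history.

    All three parameters are hypothetical knobs for illustration.
    """
    total = 0
    history = start_tokens
    for _ in range(steps):
        total += history            # the whole context is sent again on this step
        history += added_per_step   # this step's output and tool result join the history
    return total


if __name__ == "__main__":
    single_call = cumulative_input_tokens(1, 1_000, 500)
    agent_run = cumulative_input_tokens(20, 1_000, 500)
    print(single_call, agent_run)  # 1000 115000
```

Under these assumptions a 20-step run bills 115,000 input tokens against 1,000 for a single call. The growth is quadratic in the number of steps, which is why a falling unit price does not rescue the bill on its own.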
EY put it in dollars. One customer-service interaction went from $0.04 to $1.20 in three years, 30 times higher, while token prices fell. Their phrase for it: these costs stay “structurally invisible until designed for visibility.” The macro data agrees. CloudZero’s survey of 500 engineering leaders found average AI spend up 36% to roughly $85,500 a month, with those spending over $100,000 a month more than doubling, from 20% to 45%. The demo is cheap. Production is not, and Gartner found pilots run just 15 to 25% of the real bill.
So stop shopping for a cheaper model. A cheaper model saves pennies a call. Catching the run that took 40 turns instead of 4 saves the project. Watch token volume per agent, per session, per tool call. Then three moves pay off:
- Route simple work to small models. Save the frontier models for the reasoning that needs them.
- Instrument token use at the agent and session level, so a runaway loop surfaces in minutes, not on the monthly bill.
- Track cost next to quality as a first-class metric. An agent that's right but ruinously expensive still failed.
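The three moves above can be sketched in a few lines. This is a rough illustration, not a reference implementation: `TokenLedger`, `pick_model`, the task categories, and the model names are all hypothetical, and a real deployment would feed the ledger from whatever usage data its model provider returns.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    """Tracks token use per (agent, session) against a per-session budget."""
    session_budget: int
    usage: dict = field(default_factory=lambda: defaultdict(int))

    def session_total(self, session: str) -> int:
        return sum(t for (_, s), t in self.usage.items() if s == session)

    def record(self, agent: str, session: str, tokens: int) -> bool:
        """Record a call; return False once the session exceeds its budget."""
        self.usage[(agent, session)] += tokens
        return self.session_total(session) <= self.session_budget


# Hypothetical task categories that a small model can handle.
SIMPLE_TASKS = {"classify", "extract", "summarize"}


def pick_model(task_kind: str) -> str:
    """Route simple work to a small model; model names are placeholders."""
    return "small-model" if task_kind in SIMPLE_TASKS else "frontier-model"


if __name__ == "__main__":
    ledger = TokenLedger(session_budget=1_000)
    print(ledger.record("planner", "s1", 600))  # True: still within budget
    print(ledger.record("coder", "s1", 600))    # False: session is now at 1,200
    print(pick_model("classify"), pick_model("plan"))
```

The caller decides what a `False` means: pause the agent, alert someone, or fall back to a cheaper path. The point is that the check runs on every call, not at the end of the month.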
Data: Garbage In, Bill Out
Your agent reasons on the data you hand it. Hand it noise and you get shaky evals, cascading errors, and a demo that dies under real traffic. RAG agents see this first, because the final response is only as good as what you put in front of the model.
RAND traced where these projects actually break, and most breaks sit upstream of the model: the wrong problem targeted, the wrong data sourced, and infrastructure that buckles at scale. In RAG, noisy embeddings cut retrieval accuracy 20 to 30%. That triggers retries. Retries burn tokens. The token trap, feeding itself.
Fix it at the source. See the data first: the roughly 3% empty and 35% garbage samples in a typical retrieval set. The 290 confusing records buried in 18,000. The malformed JSON and duplicates padding your context window by 15 to 25% before the model even reads them. Clean early and you cut inference cost on every future call. Skip it and you rent that waste for the lifespan of the agent.
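A first cleaning pass does not need to be elaborate. The sketch below assumes each retrieval record arrives as a JSON string with a `text` field; that record shape and the function name `clean_records` are assumptions for illustration. It drops empty records, malformed JSON, and duplicates, and reports how many of each it removed.

```python
import json


def clean_records(raw_records):
    """Return (kept_docs, dropped_counts) for a list of JSON strings."""
    seen = set()
    kept = []
    dropped = {"empty": 0, "malformed": 0, "duplicate": 0}
    for raw in raw_records:
        if not raw or not raw.strip():
            dropped["empty"] += 1
            continue
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            dropped["malformed"] += 1
            continue
        text = doc.get("text") if isinstance(doc, dict) else None
        if not isinstance(text, str):
            text = ""
        # Collapse whitespace and case so trivial variants count as duplicates.
        normalized = " ".join(text.split()).lower()
        if not normalized:
            dropped["empty"] += 1
            continue
        if normalized in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(normalized)
        kept.append(doc)
    return kept, dropped


if __name__ == "__main__":
    sample = [
        '{"text": "Reset your password"}',
        "",
        '{"text": "reset  your password"}',
        '{"text": ',
        '{"text": "   "}',
    ]
    kept, dropped = clean_records(sample)
    print(len(kept), dropped)
```

Exact-match deduplication after normalization is the simplest option; near-duplicate detection would catch more, at the price of more machinery. The dropped counts are worth logging either way, since they are the visibility this section argues for.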
Evals: The Test You Skip Is the Failure You Ship
Evaluation is where budgets die quietly. Your agent needs more than an accuracy score:
- Did it finish every goal?
- Did it pick the right tool?
- Did it hallucinate?
- Did it leak something it shouldn’t?
Running that across production traffic at frontier-model prices gets expensive fast. So the instinct is to run it less. That’s the failure. The eval you skip is the failure you ship.
Make evals cheap enough to run on everything, always. That’s the whole point of purpose-built evaluation models. Splunk Agent Observability’s Luna-2 small language models run 20-plus quality and safety checks at once, under 200ms, at up to 96% lower cost than LLM-as-judge on frontier models. Now you stop rationing evaluation as a quarterly audit and start running it as always-on quality control across 100% of your traffic. Alert right away, so you can intervene before agents go awry.
And you can’t fix what you can’t trace. API logs say a call happened. They won’t show you the execution graph, the redundant calls, or the loop that ran 30 times because one agent misunderstood another. Splunk Agent Observability gives you Timeline, Graph, and Conversation views to trace behavior step by step across agents, memory, and tool calls. Fewer reruns, less wasted eval, faster fixes. That is the whole game when 40% of projects die on cost and complexity. In a market as competitive as AI, detecting and fixing issues faster can be the difference between winning and losing.
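Even without a dedicated tool, the idea of spotting a loop in a trace can be sketched simply. The example below assumes a trace is a list of steps, each a dict with `agent`, `tool`, and `args` keys; that shape and the function name are hypothetical and do not reflect the format of any particular product.

```python
from collections import Counter


def find_redundant_calls(trace, max_repeats=3):
    """Return identical (agent, tool, args) calls seen more than max_repeats times."""
    counts = Counter((step["agent"], step["tool"], step["args"]) for step in trace)
    return {call: n for call, n in counts.items() if n > max_repeats}


if __name__ == "__main__":
    # One agent repeats the same search 30 times; another makes a single lookup.
    trace = [{"agent": "researcher", "tool": "search", "args": "q=refund policy"}] * 30
    trace.append({"agent": "billing", "tool": "lookup", "args": "id=7"})
    print(find_redundant_calls(trace))
```

A count over identical calls only catches the crudest loops. It says nothing about why the agent repeated itself, which is where step-level views of the execution earn their keep.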
Guardrails: You Can’t Bolt Safety on After the Breach
Every autonomous agent widens your attack surface. The moment agents touch private data, call tools, and take in untrusted input, they open holes that static rules miss. The OWASP Top 10 for Agentic Applications, released in late 2025, names them: goal hijacking, tool misuse, identity and privilege abuse, memory poisoning. In security, 95% caught is a failing grade. The 5% that slips through is the breach you can’t take back.
Safety can’t live in a review queue. A human approver can’t keep pace with an agent firing hundreds of actions a second. It has to run inline, at machine speed. Luna-2 checks continuously for PII exposure, tool misuse, policy violations, and unsafe content, and routes to a human, or to a watchdog agent, when confidence drops. In March 2026, Galileo (now a part of Cisco) shipped Agent Control, an open-source control plane licensed under Apache 2.0: write one behavioral policy, enforce it across every agent.
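What "inline" means in practice: the check sits in the path of every action and returns a decision before the action goes out. The sketch below is a deliberately simple stand-in, not Luna-2 or Agent Control. The regex patterns, the `check_action` function, the decision labels, and the 0.8 confidence threshold are all assumptions chosen for illustration, and regexes alone would miss most real PII.

```python
import re

# Illustrative patterns only: an email address and a US SSN-shaped number.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_action(output_text: str, confidence: float, min_confidence: float = 0.8) -> str:
    """Return 'block', 'escalate', or 'allow' for one agent action."""
    if EMAIL.search(output_text) or SSN.search(output_text):
        return "block"      # possible PII exposure: stop the action
    if confidence < min_confidence:
        return "escalate"   # low confidence: route to a human or a watchdog agent
    return "allow"


if __name__ == "__main__":
    print(check_action("Contact jane@example.com for details", 0.99))
    print(check_action("Your order has shipped", 0.50))
    print(check_action("Your order has shipped", 0.95))
```

The ordering is a design choice: a policy violation blocks regardless of confidence, and only clean but uncertain actions are escalated.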
Real-time guardrails keep pace with risk as your agents explore new paths, and they head off breaches that run into the millions per incident. The cheapest incident is the one that never reaches a user.
Pricing: Tools That Bill per Test Teach Teams To Stop Testing
Here’s the cost nobody names: the pricing model itself. Most evaluation and monitoring tools bill at frontier model rates, so every test costs serious money. Every guardrail check costs as much as the agent run itself and adds significant latency. So teams check fewer things, less often. Some drop evaluation entirely when the cost spikes. That is exactly backward. You end up rationing the work that keeps agents and your company’s reputation safe, one skipped test at a time.
Flip the incentive. Fine-tuned small language models (SLMs) priced at up to 96% below frontier rates let you test and guardrail without fearing the bill. Distilling expensive evaluators into compact Luna-2 models is what makes watching 100% of traffic affordable instead of aspirational. Keep the price low and the latency minimal, and your team iterates as fast as the technology allows.
You Don’t Need To Burn To Scale
Agentic AI projects don’t die for lack of ideas or models. They die quietly, and it usually traces to one blind spot: visibility of the agent’s actions. No shared definition of “working.” Debugging with no traceability. Per-test costs that pull teams back from testing. Risk that grows in the dark until it lands in production. Every one of these is a visibility failure.
Agent observability turns the lights on: low-latency evals, step-level traces, real-time guardrails, and usage-aware spend tracking, from development through production. Catch the runaway loop, the noisy retrieval, the policy drift, and the silent failure while they’re still cheap to fix.
Tokens keep getting cheaper. The bill keeps getting bigger. The teams that win the next two years won’t be the ones that own the cheapest model, or the ones that call the expensive frontier model the least. They’ll be the ones that can see what their agents are doing in real time and adjust for efficiency and accuracy.
Pick one agent. Make it observable. Ship it before the invoice and the incident arrive together. If you want to learn more about how to make your agents observable, download The Agentic Shift: Redefining Observability for the AI Era.
FAQ
How much more do AI agents cost to run than a chatbot? Five to 30 times more tokens per task, because of reasoning loops, tool calls, retries, and multi-agent coordination (Gartner, 2026).
What share of agentic AI projects fail, and why? Gartner expects over 40% to be canceled by the end of 2027, driven by escalating costs, unclear business value, and weak risk controls.
If token prices are dropping, why are agent costs rising? The price per token falls while the number of tokens per task rises faster. EY clocked one customer-service interaction going from $0.04 to $1.20, roughly 30x, even as token prices fell.
What is agent observability? Seeing what your agent actually did, not just whether the request finished: which decisions it made, which tools it called, what it cost, and whether the output was correct and safe.