Observability Challenges in Multi Agentic Environments

Artificial Intelligence June 22, 2026 Rod Soto

Key takeaways

Agentic AI workflows require new forms of observability that track decisions, tool usage, and interactions across multiple agents, not just traditional application activity.
Organizations need visibility into agent identities, workflow states, data movement, and stored information to investigate issues and understand outcomes.
Observability and security controls can help teams trace agent behavior, detect risks like prompt injection and misuse, and support safer AI deployments.

The current drive to implement Agentic AI has brought a number of multiple challenges, starting from the basic need for observability. Agentic AI is being sold as the next step in implementing and integrating generative ai in the enterprise. Agentic AI security is very different from classic application and software security. We go from deterministic to stochastic, from a request that enters a service, passes a queue, queries a database, calls an API and returns to a single request that can be decomposed by an agent router delegated to many other agents, enriched by retrieval, pass through multiple model loops, tools, pause for human approval (if applicable), execute a request via API or webhook and finalized by an evaluator.

These agentic workflows are completely different in nature and before we can even think of analytics, we must be able to see and trace the entire pipeline and map each component in a way that makes sense to analysts and operators. When dealing with multi agent systems observability needs to go beyond operations, it needs to provide a clear picture and reconstruction of intent, propagation, decision-making and possible side effects across the workflow that behaves more like a somewhat organized interaction than a program or application execution.

Observability of Agents must include the fundamentals of these executions and interactions which include prompt chains, model decisions, tool contracts, human intervention, and more. What happened and can we trust it to happen again?

In this blog, I am attempting to address that question.

What is an Agent?

LLMs can reason but not act; their reasoning is also not deterministic but stochastic; this introduces ephemeral uncertainty. An agent can narrow that reasoning and refine it for specific actions and decisions. These actions are usually taken via a tool call or MCP protocol which is the way LLMs can retrieve and reach into multiple resources that provided to an Agent are expected to be executed and completed in the form of a judgement or decision.

Agents are applications that work with LLMs; they need the LLMs for reasoning, but they also need several other resources to perform their job efficiently. Some of those resources include the use of tools such as API calls, executing queries against databases, and retrieving specific files. Agents also can act with partial, monitored or full autonomy until their objective is complete and they may leverage additional resources such as RAG, Memory, Orchestrated planning, Structured Reasoning, Context Awareness, Delegation and collaboration with other agents, and Role or Persona styles.

In other words, AI agents are applications that leverage the reasoning of LLMs with specific decision making, goal-oriented vector(s) leveraging tools, memory, and other capabilities in different levels of autonomy workflows.

Agent workflows can go from single agent to multiple agent orchestration interaction; agents can be reactive, proactive or interactive. Agents can also have different architectures, and there are several frameworks to coordinate and orchestrate them.

Agents are the next step in the implementation of LLM driven technologies, with the aim of making them more efficient, customized to specific use cases, specialized in skills, lower latency and the ability to apply validation, guardrails and monitoring. Agents have multiple components.

General Components of an Agent

Goal: What the agent is trying to achieve.
Perception: Inputs from the environment (user input, data, signals)
Memory: Where context and past knowledge is kept (short-term & long term)
Reasoning: A decision on what to do next in terms of single or multiple steps
Decision Core (LLM): Leveraging of Large Language Model (System Prompts, Hooks)
Tools: External capabilities (APIs, MCP, Systems, Databases, Files)
Action: Executes actions
Feedback Loop: Observes results and adjust behaviors or re executes specific parts of the workflow.

Agent Harness

An agent harness is the orchestration framework that binds all of the above components together into a functioning, executable pipeline. It manages the flow of information between perception, memory, reasoning, and action layers while handling tool registration, conversation state, and safety guardrails. In practice, harnesses are implemented through frameworks like LangChain, AutoGen, CrewAI, or MCP-compatible runtimes that standardize how tools and context are exposed to the decision core. From a security standpoint, the harness is a critical attack surface — weaknesses in how it validates tool outputs, enforces identity boundaries, or manages memory can be exploited to hijack agent behavior or cause cascading failures across the workflow.

Benefits of AI Agents

If agents are used efficiently, they provide advantages in the form of automation, faster workflows, adaptive operations, and the ability to extend operations taking advantage of automation, orchestration, and reasoning from powerful generative AI models. Here are some of the benefits of using agents.

Higher productivity and automation. Agents can automate repetitive multi step workflows, freeing humans to focus on higher value analysis.
Faster decisions. Continuously analyze large, heterogeneous data sets and surface insights or next actions much faster than manual processes.
Improved quality and fewer errors. In many agentic systems, it is possible to implement self-checkouts, cross validation, and collaboration with other tools or agents, reducing manual mistakes across complex processes.
Scalability and always-on coverage. Agents run 24/7 and can be replicated at a low cost and scale to handle more tickets, customers or tasks.
Access skills not present in-house. In some cases, Agents can bring specialized capabilities (coding, data analysis) helping support teams when those skills are not available.

Agents, however, also come up with a number of challenges when it comes to implementation, deployment, and operations.

Agentic development challenges

Agent visibility and monitoring is at the moment negligible and challenging. No frameworks or specific logging monitoring methods industry wide at the moment. (March 2026)
Some industry research suggests that a very large number of agentic projects do not make it into production, with failure rates ranging from 40% up to 90% of enterprise projects. * *
Agents can indeed make things more complicated, with studies showing less effectiveness, creating bottlenecks, introducing vulnerabilities and being at least effective at times than querying LLMs directly. *
Agents in multiple orchestration settings and looping can indeed increase the cost of operation. With an exponential growth in cost the more agents are implemented and a concave line when it comes to their benefit eventually plateauing and minimizing such benefits versus cost. *
Agents can go rogue and even collude and be harmful to an enterprise (AI psychosis, sycophancy). * *

HITL, HOTL, HOOTL?

One of the main mechanisms considered for monitoring and validating agentic workflows is the involvement of humans. This involvement varies from Human IN the loop, where Human input is required and workflow cannot proceed without human review. For example, in a SOAR running enrichment where containment playbook steps require analyst approval (e.g. block and IP).

Human ON the loop, where workflow runs autonomously by default and at a higher-level humans supervise and intervene only when necessary. Following the above example, it would assume that must playbook run in auto mode an analysts supervise dashboards and only step in when certain risks or anomaly triggers fire.

Finally Human OUT of the loop where workflow is designed to operate without any human intervention. In this case, we are looking at autonomous agents performing the execution, analysis and implementation of SOAR playbooks, with humans just reviewing periodic reports or post-incident artifacts. Whatever level of Human involvement is present at an Agentic workflow can influence positively or at times even negatively, complicating things even further.

Why Agentic Security is Different

AI Agents represent a shift from just using LLMs to leveraging their reasoning with the use of tools and specific workflows to focus, enhance and extend their reasoning. This also brings security challenges, as not only are we adding more components around the output of a LLM, but we also must consider the following differentiators:

With the stochastic nature of the generative AI, results can vary, and they are not deterministic.
Unsafe delegation of action under ambiguous authority (No framework or standard for agent identity)
Agents autonomously plan, use tools, maintain memory and act toward goals, introducing risks beyond communication injection
Possibility of the “Confused Deputy” event, when an agent with higher privileges gets tricked by a less privileged actor into misusing authority.
No Boundary between instructions and data. The system often treats instructions and what its reasons over data in the same channel and representation, so it blurs them together and influences each other. In other words, the LLM is just predicting tokens; it only sees a stream of tokens and learns patterns over that stream. This makes defense and prevention even more difficult.
Agents use tools for mediated actuation, and different levels of autonomy at times with delegated identity with high privileges. Tools introduce supply chain attack risks.

The OWASP Top 10 for Agentic Applications for 2026 It is a framework that highlights the most critical security risks in autonomous and agentic AI systems. It provides security guidance to address risks known to affect the expansion of the attack surface because of the combination of autonomy, tool use, external data, memory, and multi-step workflows. The following is a graphical mapping of these risks in agentic workflows.

The Observability Challenge

Based on all these challenges presented by the different nature of Agentic AI workflows security, we first must address if we can even observe these flows in a useful manner that allows us to address possible risks and attacks. Here are some key items that we would need to capture to address this observability challenge.

Trace every step: Log the full execution path of each agent action, including tool calls, LLM prompts/responses and decision branches. We need end-to-end tracing of these workflows.
Capture Inputs & Outputs at Each Node: Log what data enters and exists for every agent node or function, not just outputs. This is necessary to identify were hallucinations, bad tool calls, or unexpected state changes originated.
Monitor Latency & Token Usage Per Step: Track per Node timing and token consumption beyond aggregate metrics. This may reveal bottlenecks, runaway loops, and cost of hotspots in complex pipelines.
Log Tool Call Success/Failure with context: Every external tool invocation (API calls, MCPs, database queries, etc.) should be logged with its parameters, return values, and error codes. Agent failures often cascade from a single silent tool error.
Implement Semantic Evaluation beyond just errors: Beyond technical logs, assess quality of agent reasoning. Need to assess if the goal was met, if the right tool was used.

Tracing Across Agents --> Following the Conversation

In our pursuit of tracing every step of these workflows, I created a couple of agentic workflows with multiple agents; these agentic workflows were very different in nature. One was security oriented (detection testing), the other was customer service agentic workflow. The objective was to try to pinpoint possible consistent workflow labels that may identify stages or phases of these agentic workflows. There appear to be at least 4 categories where we could try to understand the data and entities that interact in agent workflows:

Agent Identity related information: API, Keys, Oauth Tokens, Service Accounts, Role and Policy definitions, capability descriptors
Pipeline State: Execution graph, current step, error states, retry metadata, tool call history.
Data in Transit: Prompts, tool input/outputs (MCP), external API payloads, streaming logs, callbacks, webhooks, crossing trust boundaries at runtime.
Data at Rest: Long term agent memory stores (vector DBs, knowledge bases, output files), session logs, traces, audit records, replay buffers, cached tool results, local embeddings, model fine tuning data.

LLMs take part in all four categories.

The following fields were found in our testing of agentic workflows

Security related workflow

In the above screenshots we can see some identity related fields such as run_id, agent function, and stage of pipeline.

Customer workflow

In the above screenshot we can also see identity related fields such as agent_id, host_ip, host_name, unique_pids.

Security related workflow

In the above screenshot, we can see the status and statistics of the different pipeline steps and stages, including loops agent and testing.

Pipeline status with details

In this screenshot we can see the details and the suggested fixes from guardrail agent and the different attempts in the security related agentic workflow (detection testing).

Pipeline escalation events

In the following screenshots, we have actual details of the pipeline steps and statuses allowing operator to clearly pinpoint sessions, specific agent, status, escalation and tests.

Customer Service workflow

The following screenshot shows the Customer Service agentic workflow by state, by pipeline event result and numbers.

Data in Transit Fields

Customer service data in transit statistics

Now we can actually see the Data in Transit broken down by workflow channels, individual customers, number of messages and type of event.

Customer service sample messages

Following the above screenshot in this next one we can see the specific messages.

Customer service tool calls

In this specific screenshot, we can see the specific tool calls, sessions and number of calls.

External write options involving financial data flow

In this analytic, we can see the breakdown of type of operation, action type, sessions, occurrences and block reasons for a refund processed by the Customer Service agentic workflow.

LLM backend calls

In this simple analytic we can see the number of calls to LLM framework (0llama)

Security Workflow Data in Transit

Sample of output messages by agent interaction

In the above image, we can see specific messages by stage, agent interaction, and message specifics.

Security related workflow tool call transit data

In this image, we can see a full breakdown of the tool used, the stage in pipeline, statuses, source,sourcetype of the data in transit, number of results, and sample event counts.

Data at Rest Fields

Customer service (what is stored)

Stored Log Audit by outcome (Closed, Escalated, Rejected)

Audit Record of sessions (Extensive record of sessions by customer and sessions)

Customer Service sensitive fields stored at rest

Security Related workflow Data at Rest

Data stored inventory by stage

Stored outputs by session (tracking specific sessions, type of data, escalation, time and output file)

Who Changed What? Based on Which Prompt? Which Model? And Which Tool?

As seen in the above screenshots, it is possible to break down these agentic workflows and dissect them for analysis and investigations. The following is an example of a security related (detection testing) workflow analysis.

We can see in the above session trace, the agent type by function, the specific step in the pipeline, what has been changed, the actual LLM model used, the time it happened, and the unique prompt identifying hash.

We can use the 4 categories previously proposed to pinpoint sequences and relationships between these events that lead to outcomes that may be expected or unexpected.

Detecting Security Incidents Not Just Errors

Here are some common threats based on the 4 proposed categories and the OWASP security guidance so we can understand their effects in the different steps of agentic workflows:

1. Identity misuse and spoofing

Stolen or shared credentials for agent/tools.
Impersonation of agents, tools, or data resources (fake upstream, forged messages).

2. State and data tampering

Modification of workflow, intermediate state, tool results, or stored memories/logs.
Poisoning knowledge bases, caches, or configuration to shape future behavior.

3. Prompt and content Injection

Malicious instructions embedded in user input, retrieved documents, or tool outputs.
Indirect prompt injection via compromised external systems or data sources.

4. Information disclosure and data leakage

Sensitive data exposed in prompts, completions, logs, traces, or memory stores.
Model or corpus exfiltration (weights, RAG data, long term logs).

5. Abuse of capabilities / over-privilege

Agents or tools granted broader actions than necessary (filesystem, network, production APIs).
Goal / intent hijacking: steering the agent to use legitimate tools for malicious ends.

6. Audit and provenance gaps

Flooding agents/tools with requests, large contexts, or complex prompts.
Inability to prove which inputs, policies, and tools led to a given action or output.

7. Denial of service and resource exhaustion

Flooding agents/tools with requests, large contexts, or complex prompts.

State and storage growth (logs, memories, caches) degrading performance or availability.

8. Cross-boundary trust confusion

Treating untrusted inputs (user, web, third-party tools) as a trusted internal state.
Insufficient isolation between tenants, agents, or environments.

9. Policy and guardrail bypass

Attacks that route around filters/guardrails by switching channels, tools, or formats.
Inconsistent enforcement of policy across orchestration, LLM, and tool layers.

10. Supply chain and configuration compromise

Malicious or vulnerable models, libraries, tools, or plugins.
Dangerous configuration drift (debug modes, verbose logging, test keys in production).

Finally, we can pinpoint specific instances where a possible attack happened and see it through the actual steps of the workflow, as it can be seen in the screenshot below where a prompt injection attack was blocked by the guardrail agent based on specific reasons (input_sanitation_failed, out_of_scope, auth_scope invalid) and it was written to the LOG_AUDIT file for tracing and forensic purposes. Organizations may also implement other defense mechanisms applicable to their agentic workflows (i.e. Firewalls, WAF)

Conclusion

Agentic observability is challenging, and the current lack of standards for observability poses a huge risk for organizations. The use of OTel (OpenTelemetry) or Cisco AppDynamics for example, is a way to output Agentic workflows into useable data for security analysts and operators. However, no widely adopted or ratified schema exists that may cover normalizing and unifying these steps within a security-driven purpose.

An emerging complementary approach is the use of LLM gateways such as LiteLLM and Bifrost, which act as a proxy layer between agentic applications and underlying LLM providers. These gateways can centralize logging of prompts, completions, tool calls, and latency metrics across multi-model or multi-provider deployments — offering a single choke point for observability and, increasingly, policy enforcement. While they do not yet define a security-focused schema on their own, they represent a pragmatic interim layer for organizations seeking visibility into LLM traffic before broader standards mature.

In this blog I have shown by executing and tracing a couple of simple agentic workflows, that there are multiple components, tools, steps and vectors in play, that need to be traced from the beginning to the end in any agentic workflow, regardless of its complexity. Organizations should be aware that this lack of observability is a huge exposure and need to be addressed as condition prior of deploying Agents in sensible or production environments. We can, however—as it has been shown in this blog—implement observability and apply some detection logic within the steps of Agentic applications and get actionable meaningful results.

If you are considering implementation of these agentic workflows, Splunk and Cisco also have a number of security content that focuses on Enterprise LLMs, local LLMs, MCP tools and enterprise Agentic deployments.

style

two-column

Joint first-time participation! Cisco & Splunk as One Team ~ Hardening 2025 Invisible Divide ~

Security

12 Minute Read

Joint first-time participation! Cisco & Splunk as One Team ~ Hardening 2025 Invisible Divide ~

The Hardening Project is a community-driven competition sponsored by industries, academia and government agencies, dedicated to maximizing the value of defensive technology. Splunk joined forces with Cisco, standing together as "One Team" to protect what matters most.