Observability Challenges in Multi Agentic Environments

Artificial Intelligence Rod Soto

Key takeaways

  1. Agentic AI workflows require new forms of observability that track decisions, tool usage, and interactions across multiple agents, not just traditional application activity.
  2. Organizations need visibility into agent identities, workflow states, data movement, and stored information to investigate issues and understand outcomes.
  3. Observability and security controls can help teams trace agent behavior, detect risks like prompt injection and misuse, and support safer AI deployments.

The current drive to implement agentic AI has brought a number of challenges, starting with the basic need for observability. Agentic AI is being sold as the next step in implementing and integrating generative AI in the enterprise, yet agentic AI security is very different from classic application and software security. We go from deterministic to stochastic behavior. A classic request enters a service, passes through a queue, queries a database, calls an API, and returns. An agentic request can be decomposed by an agent router, delegated to many other agents, enriched by retrieval, passed through multiple model loops and tools, paused for human approval (if applicable), executed via an API or webhook, and finalized by an evaluator.

These agentic workflows are completely different in nature, and before we can even think of analytics, we must be able to see and trace the entire pipeline and map each component in a way that makes sense to analysts and operators. In multi-agent systems, observability needs to go beyond operations: it needs to provide a clear picture and reconstruction of intent, propagation, decision-making, and possible side effects across a workflow that behaves more like a somewhat organized interaction than a program or application execution.

Observability of agents must cover the fundamentals of these executions and interactions, including prompt chains, model decisions, tool contracts, human intervention, and more. What happened, and can we trust it to happen again?

In this blog, I am attempting to address that question.

What is an Agent?

LLMs can reason but not act, and their reasoning is stochastic rather than deterministic, which introduces ephemeral uncertainty. An agent can narrow that reasoning and refine it for specific actions and decisions. These actions are usually taken via a tool call or the Model Context Protocol (MCP), which is how LLMs retrieve and reach into multiple resources. Once provided to an agent, these actions are expected to be executed and completed in the form of a judgement or decision.

Agents are applications that work with LLMs. They need the LLMs for reasoning, but they also need several other resources to perform their job efficiently. Some of those resources include tools such as API calls, queries against databases, and retrieval of specific files. Agents can also act with partial, monitored, or full autonomy until their objective is complete, and they may leverage additional resources such as RAG, memory, orchestrated planning, structured reasoning, context awareness, delegation and collaboration with other agents, and role or persona styles.

In other words, AI agents are applications that combine the reasoning of LLMs with goal-oriented decision making, using tools, memory, and other capabilities in workflows with different levels of autonomy.

Agent workflows can range from a single agent to multi-agent orchestration, and agents can be reactive, proactive, or interactive. Agents can also have different architectures, and there are several frameworks to coordinate and orchestrate them.

Agents are the next step in the implementation of LLM-driven technologies, with the aim of making them more efficient, customized to specific use cases, specialized in skills, and lower in latency, with the ability to apply validation, guardrails, and monitoring. Agents have multiple components.

General Components of an Agent

Agent Harness

An agent harness is the orchestration framework that binds all of the above components together into a functioning, executable pipeline. It manages the flow of information between perception, memory, reasoning, and action layers while handling tool registration, conversation state, and safety guardrails. In practice, harnesses are implemented through frameworks like LangChain, AutoGen, CrewAI, or MCP-compatible runtimes that standardize how tools and context are exposed to the decision core. From a security standpoint, the harness is a critical attack surface — weaknesses in how it validates tool outputs, enforces identity boundaries, or manages memory can be exploited to hijack agent behavior or cause cascading failures across the workflow.

Benefits of AI Agents

Used efficiently, agents provide advantages in the form of automation, faster workflows, adaptive operations, and the ability to extend operations by combining orchestration with reasoning from powerful generative AI models. Here are some of the benefits of using agents.

Agents, however, also come with a number of challenges when it comes to implementation, deployment, and operations.

Agentic development challenges

HITL, HOTL, HOOTL?

One of the main mechanisms considered for monitoring and validating agentic workflows is the involvement of humans. This involvement varies. With a human IN the loop (HITL), human input is required and the workflow cannot proceed without human review. For example, in a SOAR running enrichment, containment playbook steps require analyst approval (e.g., block an IP).

With a human ON the loop (HOTL), the workflow runs autonomously by default, and humans supervise at a higher level and intervene only when necessary. Following the above example, most playbooks would run in auto mode while analysts supervise dashboards and only step in when certain risk or anomaly triggers fire.

Finally, with a human OUT of the loop (HOOTL), the workflow is designed to operate without any human intervention. In this case, we are looking at autonomous agents performing the execution, analysis, and implementation of SOAR playbooks, with humans just reviewing periodic reports or post-incident artifacts. Whatever level of human involvement is present in an agentic workflow can influence it positively or, at times, negatively, complicating things even further.
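The three oversight modes can be sketched as a small gate that decides what happens to a single playbook step. This is only an illustration under stated assumptions: the names `Oversight` and `gate_step`, the `risk_score` input, and the 0.7 threshold are hypothetical and do not come from any SOAR product or from the workflows tested here.

```python
from enum import Enum


class Oversight(Enum):
    HITL = "human_in_the_loop"       # step blocks until a human decides
    HOTL = "human_on_the_loop"       # runs automatically unless a risk trigger fires
    HOOTL = "human_out_of_the_loop"  # runs automatically, reviewed after the fact


def gate_step(mode, risk_score, approved=None, risk_threshold=0.7):
    """Return a (decision, reason) pair for one playbook step.

    `risk_score` and `risk_threshold` are hypothetical inputs; a real
    deployment would derive them from its own risk or anomaly triggers.
    """
    if mode is Oversight.HITL:
        if approved is None:
            return ("pending", "awaiting_analyst_approval")
        if approved:
            return ("execute", "analyst_approved")
        return ("blocked", "analyst_rejected")
    if mode is Oversight.HOTL:
        if risk_score >= risk_threshold:
            return ("pending", "risk_trigger_fired")
        return ("execute", "auto_below_threshold")
    # HOOTL: nothing stops the step; humans only see reports afterwards.
    return ("execute", "autonomous")
```

Logging the returned reason next to each step is one way to make the level of human involvement itself observable, since the same step can end differently depending on the mode.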

Why Agentic Security is Different

AI agents represent a shift from just using LLMs to leveraging their reasoning with tools and specific workflows that focus, enhance, and extend that reasoning. This also brings security challenges, as not only are we adding more components around the output of an LLM, but we also must consider the following differentiators:

The OWASP Top 10 for Agentic Applications for 2026 is a framework that highlights the most critical security risks in autonomous and agentic AI systems. It provides security guidance to address the expanded attack surface created by the combination of autonomy, tool use, external data, memory, and multi-step workflows. The following is a graphical mapping of these risks in agentic workflows.

The Observability Challenge

Given all the challenges presented by the different nature of agentic AI workflow security, we must first ask whether we can even observe these flows in a way that lets us address possible risks and attacks. Here are some key items that we would need to capture to address this observability challenge.

Tracing Across Agents: Following the Conversation

In pursuit of tracing every step of these workflows, I created a couple of agentic workflows with multiple agents. They were very different in nature: one was security oriented (detection testing), and the other was a customer service workflow. The objective was to pinpoint consistent workflow labels that may identify stages or phases of these agentic workflows. There appear to be at least 4 categories where we could try to understand the data and entities that interact in agent workflows:

LLMs take part in all four categories.
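One way to picture a per-event record grouped by category is the sketch below. It assumes the four categories line up with the sections that follow (identity, workflow state, data in transit, data at rest). The `AgentEvent` class and its layout are hypothetical and are not the schema used in the tested workflows; only field names such as `run_id` and `agent_id` are taken from the screenshots described in this post.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentEvent:
    """Hypothetical event record with one group of fields per category."""
    identity: dict = field(default_factory=dict)         # who acted (run_id, agent_id, host)
    workflow_state: dict = field(default_factory=dict)   # stage, status, escalation
    data_in_transit: dict = field(default_factory=dict)  # messages, tool calls, LLM calls
    data_at_rest: dict = field(default_factory=dict)     # stored outputs, audit records

    def categories_present(self):
        # Names of the categories that carry at least one field.
        return [name for name, value in asdict(self).items() if value]

    def to_json(self):
        # One JSON object per event, ready to ship to a log pipeline.
        return json.dumps(asdict(self), sort_keys=True)


event = AgentEvent(
    identity={"run_id": "r-001", "agent_id": "guardrail"},
    workflow_state={"stage": "testing", "status": "escalated"},
)
```

Keeping the categories as separate groups makes it easy to ask, for any event, which of the four views it contributes to and which are missing.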

The following fields were found in our testing of agentic workflows

Security related workflow

In the above screenshots we can see some identity related fields such as run_id, agent function, and stage of pipeline.

Customer workflow

In the above screenshot we can also see identity related fields such as agent_id, host_ip, host_name, unique_pids.

Security related workflow

In the above screenshot, we can see the status and statistics of the different pipeline steps and stages, including loops agent and testing.

Pipeline status with details

In this screenshot we can see the details and the suggested fixes from guardrail agent and the different attempts in the security related agentic workflow (detection testing).

Pipeline escalation events

In the following screenshots, we have actual details of the pipeline steps and statuses, allowing an operator to clearly pinpoint sessions, specific agents, status, escalation, and tests.

Customer Service workflow

The following screenshot shows the Customer Service agentic workflow by state, by pipeline event result and numbers.

Data in Transit Fields

Customer service data in transit statistics

Now we can actually see the Data in Transit broken down by workflow channels, individual customers, number of messages and type of event.

Customer service sample messages

Following on from the above screenshot, the next one shows the specific messages.

Customer service tool calls

In this specific screenshot, we can see the specific tool calls, sessions and number of calls.

External write options involving financial data flow

In this analytic, we can see the breakdown of type of operation, action type, sessions, occurrences and block reasons for a refund processed by the Customer Service agentic workflow.

LLM backend calls

In this simple analytic we can see the number of calls to the LLM framework (Ollama).

Security Workflow Data in Transit

Sample of output messages by agent interaction

In the above image, we can see specific messages by stage, agent interaction, and message specifics.

Security related workflow tool call transit data

In this image, we can see a full breakdown of the tool used, the stage in the pipeline, statuses, source, sourcetype of the data in transit, number of results, and sample event counts.

Data at Rest Fields

Customer service (what is stored)

Stored Log Audit by outcome (Closed, Escalated, Rejected)

Audit Record of sessions (Extensive record of sessions by customer and sessions)

Customer Service sensitive fields stored at rest

Security Related workflow Data at Rest

Data stored inventory by stage

Stored outputs by session (tracking specific sessions, type of data, escalation, time and output file)

Who Changed What? Based on Which Prompt? Which Model? And Which Tool?

As seen in the above screenshots, it is possible to break down these agentic workflows and dissect them for analysis and investigations. The following is an example of a security related (detection testing) workflow analysis.

In the above session trace, we can see the agent type by function, the specific step in the pipeline, what has been changed, the actual LLM model used, the time it happened, and the unique hash identifying the prompt.
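A trace record with those fields could be assembled roughly as follows. This is a minimal sketch: the post does not say which hash function produced the prompt identifier, so SHA-256 is an assumption here, and the function and key names (`prompt_hash`, `trace_record`) are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def prompt_hash(prompt: str) -> str:
    # Assumption: SHA-256 over the UTF-8 prompt text. Any stable hash would do.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def trace_record(agent, step, change, model, prompt, when=None):
    """Answer who changed what, at which step, with which model and prompt."""
    when = when or datetime.now(timezone.utc)
    return {
        "agent": agent,        # agent type by function
        "step": step,          # step in the pipeline
        "change": change,      # what was changed
        "model": model,        # LLM model used
        "time": when.isoformat(),
        "prompt_hash": prompt_hash(prompt),
    }
```

Storing the hash instead of the raw prompt lets an analyst correlate events that share a prompt without copying potentially sensitive prompt text into every record.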

We can use the 4 categories previously proposed to pinpoint sequences and relationships between these events that lead to outcomes that may be expected or unexpected.

Detecting Security Incidents Not Just Errors

Here are some common threats based on the 4 proposed categories and the OWASP security guidance so we can understand their effects in the different steps of agentic workflows:

1. Identity misuse and spoofing

2. State and data tampering

3. Prompt and content Injection

4. Information disclosure and data leakage

5. Abuse of capabilities / over-privilege

6. Audit and provenance gaps

7. Denial of service and resource exhaustion

8. Cross-boundary trust confusion

9. Policy and guardrail bypass

10. Supply chain and configuration compromise

Finally, we can pinpoint specific instances where a possible attack happened and follow it through the actual steps of the workflow, as can be seen in the screenshot below, where a prompt injection attack was blocked by the guardrail agent based on specific reasons (input_sanitation_failed, out_of_scope, auth_scope invalid) and written to the LOG_AUDIT file for tracing and forensic purposes. Organizations may also implement other defense mechanisms applicable to their agentic workflows (e.g., firewalls, WAF).
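The shape of such a guardrail decision can be sketched as below. The block reason labels are copied from the post as written. Everything else is an assumption for illustration: the regex patterns, the function names, and the scope strings are hypothetical, and a real guardrail agent would use far more robust detection than two patterns.

```python
import json
import re

# Hypothetical patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]


def guardrail_check(message, requested_action, allowed_actions,
                    caller_scopes, required_scope):
    """Return the list of block reasons; an empty list means allow."""
    reasons = []
    if any(p.search(message) for p in INJECTION_PATTERNS):
        reasons.append("input_sanitation_failed")
    if requested_action not in allowed_actions:
        reasons.append("out_of_scope")
    if required_scope not in caller_scopes:
        reasons.append("auth_scope invalid")
    return reasons


def audit_line(session_id, agent, reasons):
    # One JSON line per decision; the caller appends it to its audit log file.
    return json.dumps({
        "session_id": session_id,
        "agent": agent,
        "decision": "blocked" if reasons else "allowed",
        "block_reasons": reasons,
    }, sort_keys=True)
```

Recording every reason that fired, rather than only the first, keeps the audit trail useful for forensics, because a single request can fail several checks at once.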

Conclusion

Agentic observability is challenging, and the current lack of standards for observability poses a huge risk for organizations. Using OTel (OpenTelemetry) or Cisco AppDynamics, for example, is one way to turn agentic workflows into usable data for security analysts and operators. However, no widely adopted or ratified schema exists that normalizes and unifies these steps for a security-driven purpose.

An emerging complementary approach is the use of LLM gateways such as LiteLLM and Bifrost, which act as a proxy layer between agentic applications and underlying LLM providers. These gateways can centralize logging of prompts, completions, tool calls, and latency metrics across multi-model or multi-provider deployments — offering a single choke point for observability and, increasingly, policy enforcement. While they do not yet define a security-focused schema on their own, they represent a pragmatic interim layer for organizations seeking visibility into LLM traffic before broader standards mature.
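The choke-point idea can be illustrated with a toy wrapper that sits between agents and a model backend and records every call. This sketch deliberately does not use the LiteLLM or Bifrost APIs. `LoggingGateway` and the `backend(model, prompt)` callable are hypothetical stand-ins for whatever client a deployment actually uses.

```python
import time


class LoggingGateway:
    """Toy proxy layer: every LLM call passes through and is recorded."""

    def __init__(self, backend, clock=time.perf_counter):
        self.backend = backend  # hypothetical callable: (model, prompt) -> completion
        self.clock = clock
        self.log = []           # in-memory stand-in for a central log sink

    def complete(self, agent_id, model, prompt):
        start = self.clock()
        completion = self.backend(model, prompt)
        latency = self.clock() - start
        self.log.append({
            "agent_id": agent_id,
            "model": model,
            "prompt": prompt,
            "completion": completion,
            "latency_s": latency,
        })
        return completion


# Usage with a fake backend that just echoes the prompt in upper case.
gateway = LoggingGateway(lambda model, prompt: prompt.upper())
reply = gateway.complete("triage-agent", "local-model", "summarize alert 42")
```

Because every agent calls the same `complete` method, policy checks such as the guardrail decisions discussed earlier could later be added at this single point instead of inside each agent.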

In this blog I have shown, by executing and tracing a couple of simple agentic workflows, that there are multiple components, tools, steps, and vectors in play that need to be traced from beginning to end in any agentic workflow, regardless of its complexity. Organizations should be aware that this lack of observability is a huge exposure that needs to be addressed as a condition prior to deploying agents in sensitive or production environments. We can, however, as has been shown in this blog, implement observability and apply some detection logic within the steps of agentic applications and get actionable, meaningful results.

If you are considering implementing these agentic workflows, Splunk and Cisco also offer security content that focuses on enterprise LLMs, local LLMs, MCP tools, and enterprise agentic deployments.
