The AI Paradox and the Hidden Costs of Downtime

CIO Office | Hanlin Fang, VP of Product Management

The cost of digital failure is rising as fast as the pace of innovation, and AI is accelerating both. Splunk’s latest research, The Hidden Costs of Downtime, reveals that the aggregate cost of downtime and service degradation for the Global 2000 has soared to a staggering $600 billion annually. That is a 50% increase in just two years.

A troubling paradox lies at the heart of this statistic: AI is simultaneously becoming a powerful tool in the fight against downtime and a source of new vulnerabilities. While 56% of AI users say the technology has reduced downtime at their organization, every technology leader surveyed has experienced an AI-related outage in the past year. We can reconcile this paradox not by slowing innovation, but by changing how AI is deployed, observed, and governed. Digital resilience in the AI era depends on trusted machine data, clear accountability, and human expertise predicated on a strong operational data foundation.

How AI improves incident triage, security, and observability

AI is reducing cognitive burden on human teams, enabling faster, more informed decision-making. According to the survey, organizations are spending a median $24.5 million annually on AI tools to prevent and respond to downtime. Top investments include AI-driven security automation (85%) and AI-powered observability (65%). With 56% of users reporting that AI has reduced the overall risk of downtime, it seems these investments are paying off.

Across security and IT, the primary use cases for AI are incident triage and root cause analysis. When an outage occurs, identifying the source can be the most time-consuming step, prolonging business impact. Generative AI assistants are proving invaluable here, helping human investigators correlate vast amounts of data, summarize complex incidents, and suggest actionable next steps. Meanwhile, agentic AI systems are being deployed to execute common fixes, such as code rollbacks, that can resolve issues before they escalate into business-crippling events. These systems are designed to act autonomously within defined boundaries, make decisions, and perform tasks with limited human supervision.

The good news is that respondents using AI to manage and triage incidents and orchestrate workflows cite tangible benefits. Seventy-four percent of these "AI Workflow and Triage Experts" avoided having to publicly disclose a data breach last year, compared to just 54% of their non-expert counterparts. They are also nearly three times more likely to say they’ve never lost customers because of downtime. The bad news? Technology leaders across security, ITOps, and engineering admit that AI-related downtime is becoming increasingly common. This underscores why organizations must ensure their AI operates on trusted, governed data while also providing humans with the visibility needed to validate decisions and intervene when necessary.

The hidden risks of AI outages and shadow AI

Despite speeding up incident triage and root cause analysis, AI has introduced a new breed of disruptions. Half of organizations have experienced downtime stemming from incorrect AI-driven automation and model drift. And although a striking 44% of technology leaders say they’re already leveraging agentic AI to combat downtime, 68% worry their AI agents will behave unpredictably and cause outages. Most concerning, every technology leader admits their organization has endured some form of AI-related downtime in the past year, setting a dangerous new standard.

The root of this problem often lies in the rush to deploy. Organizations are eager to capture AI’s competitive advantages but are doing so without the necessary guardrails like clearly defined owners and escalation paths. To make matters worse, 66% of technology leaders reveal that employees use unapproved shadow AI tools to assist with their jobs, adding another layer of risk into their environment.

When organizations integrate AI into production systems without human oversight, the consequences are predictable: integration bugs, latency from AI inference pipelines, and even cyberattacks like prompt injections and data poisoning.

To understand why AI causes outages, we can look at the nature of the errors. Sometimes the AI system itself becomes unavailable or unreliable. Worse, AI can become the cause of disruption when an automated decision, tool call, or remediation action changes production behavior without enough human validation. A small error in the AI logic is then amplified across the entire network. This is the classic black box problem, rooted in a lack of observability, explainability, and action auditability. If teams cannot understand why an AI made a decision, they cannot effectively troubleshoot when things go wrong.

Building a framework for resilience and AI governance through machine data

The solution to the AI paradox is not to retreat from innovation, but to change how we govern it. This requires a fundamental shift in deployment strategy. Organizations that successfully minimize AI-related downtime don’t necessarily have the most sophisticated technology, but they give humans control with continuous monitoring and fast intervention when outcomes inevitably stray.

The foundation for AI oversight is machine data: the metrics, events, logs, and traces that let humans see what an AI did, detect issues early, and correct course before small errors become outages.

But human oversight is only as effective as the visibility behind it. To maintain control, organizations should treat machine data as their sensory system. Without it, an AI agent is essentially a black box operating in a vacuum. To build this “sensory system,” leaders can rely on the four core pillars of machine data: metrics, events, logs, and traces (MELT).

By enriching these core data types with critical context such as topology, configuration changes, service ownership, deployment history, and business impact, AI-driven operations become even more powerful, enabling organizations to transition from reactive recovery to proactive resilience.

Gaining AI visibility through machine data

By correlating metrics, events, logs, and traces, organizations can create an operational view of their AI’s performance. They can see the data behind AI decision-making, providing the context necessary for human validation.

Preventing system failure with early intervention

Small errors like an AI agent misinterpreting a minor traffic spike as a DDoS attack often precede large-scale outages. With robust machine data, these anomalies appear as early warning signals. Human experts can then step in, override the AI, or adjust the parameters before the system spirals into a wider, more costly incident.
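The traffic-spike example above can be sketched as a simple sanity check on an agent's verdict. This is a minimal illustration, not a production detector: the function name `needs_human_review`, the `"ddos"` verdict label, and the z-score threshold of 3.0 are all assumptions made for the example.

```python
from statistics import mean, stdev


def needs_human_review(baseline_rps, current_rps, agent_verdict, z_threshold=3.0):
    """Flag an agent's DDoS verdict for review when traffic is near the baseline.

    baseline_rps: recent requests-per-second samples (at least two values).
    current_rps: the reading the agent reacted to.
    agent_verdict: hypothetical label emitted by the agent, e.g. "ddos".
    """
    mu = mean(baseline_rps)
    sigma = stdev(baseline_rps)
    if sigma > 0:
        z = (current_rps - mu) / sigma
    else:
        # Flat baseline: any deviation counts as extreme, none counts as zero.
        z = 0.0 if current_rps == mu else float("inf")
    # A "ddos" verdict on a spike below the threshold is a likely false alarm.
    return agent_verdict == "ddos" and z < z_threshold


baseline = [100, 110, 95, 105, 90, 100]
print(needs_human_review(baseline, 115, "ddos"))  # minor spike: route to a human
print(needs_human_review(baseline, 400, "ddos"))  # large spike: verdict is plausible
```

In practice the baseline would come from the metrics pipeline, and the flag would feed whatever escalation path the organization already uses.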

Ensuring AI accountability and regulatory governance

As regulatory pressures increase, organizations must be able to prove why their systems behaved the way they did. Machine data provides the immutable record required for compliance, ensuring that AI-driven actions are documented, auditable, and aligned with business policies.
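One way to make a record of AI-driven actions tamper-evident is to chain each entry to the hash of the previous one. The sketch below assumes an in-memory list and hypothetical field names (`actor`, `action`, `reason`); it illustrates the idea only and is not a description of how any particular product stores audit data.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first record in the chain


def _digest(body):
    """Hash a record body deterministically (keys sorted before serializing)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log, actor, action, reason):
    """Append an audit entry that is linked to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"actor": actor, "action": action, "reason": reason, "prev_hash": prev_hash}
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


audit_log = []
append_record(audit_log, "agent-7", "rollback_deployment", "error rate above threshold")
append_record(audit_log, "j.doe", "override", "rollback target was wrong")
print(verify(audit_log))
```

Editing any earlier entry changes its hash and breaks the chain, which is what lets a reviewer trust the sequence of documented actions.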

Machine data is the bridge between human intent and AI execution. It transforms AI from a "set-it-and-forget-it" tool into a transparent, manageable asset. A governed data fabric for machine data gives organizations a way to connect signals across security, ITOps, engineering, and business ops, so AI agents are not acting on isolated telemetry, but shared operational truth.

Organizations that manage the collection and analysis of this data well remain in the driver's seat. They can use AI to accelerate operations while keeping a firm hand on the wheel for stability.

Bridging human intelligence and AI autonomy

Digital resilience relies on a human-centric model of governed autonomy. The goal is not to keep humans in every operational loop; AI agents can accelerate signal analysis and automate routine tasks. To succeed, these agents must operate within clear policy boundaries, with full observability, auditable decision-making, and reliable escalation paths for high-impact or low-confidence situations.
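The policy boundaries and escalation paths described above can be sketched as a small routing function. Everything here is a hypothetical illustration: the action names, the impact labels, and the 0.8 confidence floor are assumptions chosen for the example, not values from the report.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    ESCALATE = "escalate_to_human"
    DENY = "deny"


@dataclass(frozen=True)
class ProposedAction:
    name: str          # e.g. "rollback_deployment"
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0
    impact: str        # "low", "medium", or "high"


# Hypothetical policy values, chosen for illustration only.
ALLOWED_ACTIONS = {"restart_service", "rollback_deployment", "scale_out"}
MIN_CONFIDENCE = 0.8


def route(action: ProposedAction) -> Decision:
    """Decide whether a proposed action runs, escalates, or is denied."""
    if action.name not in ALLOWED_ACTIONS:
        return Decision.DENY  # outside the policy boundary entirely
    if action.impact == "high" or action.confidence < MIN_CONFIDENCE:
        return Decision.ESCALATE  # high-impact or low-confidence: ask a human
    return Decision.AUTO_EXECUTE  # routine, in-policy, confident


print(route(ProposedAction("restart_service", 0.95, "low")))
print(route(ProposedAction("rollback_deployment", 0.60, "medium")))
print(route(ProposedAction("drop_database", 0.99, "high")))
```

Keeping the policy in plain data (an allowlist and a threshold) means the boundary itself can be reviewed and audited separately from the agent's logic.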

The future of digital resilience will be defined by the synergy between human expertise and machine intelligence. While humans provide essential judgment, accountability, and business context, AI delivers the speed, scale, and continuous analysis required for future-proofed operations. Success will belong to organizations that pair a robust data foundation with the visibility and control necessary to build lasting trust in AI.

Navigate the complexities of the AI era with confidence. Download The Hidden Costs of Downtime report to uncover actionable insights for aligning your security, ITOps, and engineering teams around shared resilience goals.
