AI SRE

Stop guessing, start resolving with an agentic teammate to troubleshoot issues faster.

Free Edition: Get Observability Cloud free for up to 15 hosts.

How it works

Embedded AI and agentic support across the entire incident lifecycle

An agentic teammate to troubleshoot and fix issues

Get an extra set of eyes on your systems to make sure everything’s running as planned. If something goes wrong, the AI SRE automatically finds the probable root cause, builds a plan, and provides a step-by-step guide to getting everything back up and running.

Meets you where you are

Whether you're new to observability or a veteran, our built-in agentic AI capabilities help you reduce mean time to resolution (MTTR) and gain actionable insights from day one.

Free your time for what matters most

AI SRE helps you spend less time firefighting issues and more time focused on what matters most, so you can build what’s next.

Agentic observability for the new speed of business

Explore the documentation

AI-driven detection

Automatically detect issues

Anticipate system failures, bottlenecks, and performance degradation, and configure detectors to prevent customer-impacting incidents.

AI troubleshooting agent

Let our agent troubleshoot for you

The AI troubleshooting agent automatically sifts through all metric, log, and trace data, identifies whether your application or infrastructure is at fault, and surfaces the most likely root causes, all in plain language and within your existing workflow.

Remediation plan

Get to resolution faster

Stop context switching across multiple screens, tabs, and tools and manually reviewing tons of documentation. Instead, get a ranked list of probable causes, clear impact analysis, and actionable recommendations right when and where you need them.

AI Assistant in Observability Cloud

Get insights and answers in plain English

Easily extract insights from Observability Cloud and accelerate investigations using natural language. If you need more help, just ask the AI assistant.

Splunk MCP Server & agentic AI

Use Splunk capabilities in one unified MCP server

Leverage a secure interface to connect your local AI agents, LLMs, tools, and data with Observability Cloud data to build custom AI workflows and debug performance issues in production without leaving your environment.

We work with amazing customers.

See why the world’s leading organizations rely on Splunk.

CUSTOMER STORY

Repay Pays it Forward with AI Assistant in Observability Cloud

With so many different systems with various endpoints, knowing it all is impossible. So it’s not just about efficiency but also identifying the unknown anomalies and getting insights from the data like a subject matter expert.

Van Wolfe, VP of Platform Engineering at Repay

50% faster triage

30% reduction in transaction latency

Resources

Explore more from Splunk

5 Big Myths of AI and Agentic AI

Separate AI fact from fiction. Learn how agentic AI reshapes observability and security while helping teams work smarter and faster.

Read the e-book

AI SRE FAQs

The AI SRE brings together agentic and embedded AI across Splunk Observability Cloud. It is an agentic AI experience spanning the entire incident response lifecycle, including detection, troubleshooting, and remediation. This AI-native user experience, which includes the AI Assistant in Observability Cloud and Splunk MCP Server, delivers in-context insights and frees teams from manual troubleshooting so they can focus on what matters most and build what’s next.

Troubleshooting issues has always been a game of hide-and-seek for engineering teams, especially when it comes to modern applications running in complex environments like Kubernetes. Traditionally, DevOps engineers and site reliability engineers (SREs) have had to hunt through dashboards, logs, and metrics across multiple screens and tools to pinpoint the cause of an outage or performance problem. This is where the AI SRE can help.

The AI SRE acts like an SRE teammate by automatically sifting through all metric, log, and trace data, identifying whether your application or infrastructure is at fault, and surfacing the most likely root causes, all in plain language and within your existing workflow. Rather than context switching across multiple screens, tabs, and tools and manually reviewing tons of documentation, teams get a ranked list of probable causes, clear impact analysis, and actionable recommendations right when and where they need them.

The real value for engineering teams comes from the way the AI troubleshooting agent turns hours of manual investigation into just minutes of insight. When teams view an alert, the agent analyzes everything from recent deployments to Kubernetes events and historical incidents, even highlighting patterns from previous fixes. It doesn’t just stop with symptoms; it provides a concise root cause analysis (RCA) summary so teams can act confidently and quickly, reducing downtime and keeping services running smoothly. With Observability Cloud’s AI at your side, your team spends less time firefighting and more time building what matters.

After reviewing the suspected root causes and evidence provided by the troubleshooting agent, teams can use the AI remediation plan in Splunk Observability Cloud. The AI remediation plan generates guided steps for implementing a long-term resolution that helps reduce or eliminate these issues going forward. Teams can complete or undo steps as needed throughout the remediation flow and receive a summary of associated actions. Once complete, teams can mark the alert as resolved and provide feedback if the outcome doesn’t meet their expectations.

The top benefit of the AI SRE is reducing the manual toil currently needed to find and fix issues before customers are impacted. That in turn helps teams drastically reduce mean time to resolution (MTTR) so they can focus on what matters most and build what’s next.
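
For teams that want to track the MTTR figure mentioned above, the following is a minimal sketch of the arithmetic, assuming MTTR is taken as the average of resolution time minus detection time across incidents. The function name and the sample incident records are hypothetical and for illustration only; they are not part of Splunk Observability Cloud or any Splunk API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident records: (detected_at, resolved_at)
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),    # 60 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),  # 30 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 0:45:00
```

Comparing this value over equal time windows before and after a tooling change is one simple way to check whether resolution times are actually trending down.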

Related capabilities

Application Performance Monitoring

Solve problems faster in monoliths and microservices by immediately detecting problems from new changes, confidently troubleshooting the source of an issue, and optimizing service performance.

Explore APM

Infrastructure Monitoring

Improve hybrid cloud performance with instant visibility and real-time alerts.

Explore Infrastructure Monitoring

AI Assistant in Observability Cloud

Get expert guidance in plain English to find and fix issues faster.

Explore AI Assistant

AI Observability

Observe the performance, quality, security, and cost of your AI stack.

Explore AI Observability

Get started

Experience the embedded AI in Observability Cloud for free.

Contact sales
Free trial