Today, IT and site reliability engineering (SRE) teams face pressure to remediate problems faster than ever, within environments that are larger than ever, while contending with architectures that are more complex than ever.
In the face of these challenges, artificial intelligence has become a must-have feature for managing complex application performance or availability problems at scale. Without AI, it’s simply not feasible to address the types of issues IT, SRE and dev teams must manage today within the mean time to repair (MTTR) constraints they are expected to meet.
That’s why AI is built into the core of Splunk APM (Application Performance Monitoring). By using AI to help teams understand the root cause of complex problems and to suggest the fastest path to remediation, Splunk APM makes it possible to meet MTTR goals and avoid alert fatigue, even for teams that must trace tens of thousands of requests per second.
In this blog post, we explain why AI-driven APM is so crucial for modern teams, then walk through the ways in which Splunk APM uses AI to enable easy, fast and scalable troubleshooting in even the most complex of cloud-native application environments.
Why AI and why now?
In the past, it was easy enough to troubleshoot software performance or availability issues without the help of AI. When applications were deployed as monoliths on individual virtual machines, software environments were small enough in scale and simple enough architecturally for teams to manage them by hand.
Today, however, the nature of software environments has changed. To manage performance and reliability effectively in modern, cloud-native environments, teams need AI on their side.
With AI, teams gain faster insights into the root cause of problems. In turn, they can resolve issues faster.
When your application environment includes dozens or hundreds of microservice instances, it’s rarely easy to determine what the underlying cause of a problem is. The root cause of a page that is slow to load, or a failed user authentication, could lie in any of the services through which requests must flow.
Attempting to find the root cause of an issue manually would mean testing each service individually or analyzing complex trace data by hand. Both processes are slow and tedious.
With AI, however, you can analyze complex traces almost instantaneously. AI automatically highlights the anomalies to help you drill down into the root cause of the issue.
Manual analysis may work in small-scale environments that include just a few applications and servers.
But when you migrate to platforms like Kubernetes, where you may have hundreds of pods spread across dozens of nodes, it’s simply not possible to collect and analyze traces manually. There are too many traces and too much data in each one to analyze by hand.
Here, AI makes it possible to scale by automatically pulling the anomalies out of all of your traces, even if you are monitoring hundreds of transactions each second.
Solving complex performance problems often requires correlating data from multiple sources. You may need to compare several traces to understand the nature of a performance issue, for example. Or, you may have to pair trace data with other observability sources, like logs, to find the root cause.
But it’s not always readily apparent which data correlates with which other data. If you take a manual approach, you’ll spend a lot of time determining which data to compare.
AI speeds and simplifies this process by automatically correlating interrelated sets of data. It gives you full visibility into the context of performance problems in a fast, easy way.
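To make the idea of correlation concrete, here is a minimal sketch of the simplest case: pairing log entries with traces through a shared trace ID. It assumes that both data sources carry a `trace_id` field. The function name `attach_logs` and the record layout are hypothetical and chosen for illustration only; this is not how Splunk APM performs correlation internally.

```python
from collections import defaultdict


def attach_logs(traces, logs):
    """Group log messages under the trace that shares their trace_id.

    Hypothetical record layout: each trace has "trace_id" and "duration_ms",
    and each log entry has "message" and, optionally, "trace_id".
    """
    logs_by_trace = defaultdict(list)
    for entry in logs:
        trace_id = entry.get("trace_id")
        if trace_id is not None:  # skip logs that cannot be correlated
            logs_by_trace[trace_id].append(entry["message"])

    return {
        trace["trace_id"]: {
            "duration_ms": trace["duration_ms"],
            "logs": logs_by_trace.get(trace["trace_id"], []),
        }
        for trace in traces
    }


traces = [
    {"trace_id": "a1", "duration_ms": 950},
    {"trace_id": "b2", "duration_ms": 120},
]
logs = [
    {"trace_id": "a1", "message": "db timeout"},
    {"trace_id": "a1", "message": "retrying query"},
    {"message": "uncorrelated startup log"},
]
correlated = attach_logs(traces, logs)
```

In this sample, the slow trace `a1` ends up with both database log messages attached, while the log entry without a trace ID is left out.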
AI in Splunk APM
AI-powered correlation and analysis form part of the core functionality of Splunk APM, Splunk’s full-stack application monitoring solution. Splunk APM offers a variety of ways to use AI to make troubleshooting faster, easier and more scalable, no matter which types of applications you deploy or how complex your architecture may be.
Interpreting complex traces
The longer your traces, and the more traces you have, the harder it is to interpret trace data in order to get to the root cause of performance problems quickly.
To address this challenge, Splunk APM leverages statistical models and pattern recognition to identify anomalous behavior within traces. Admins can view time series charts for each service in their environment, which present a high-level overview of requests, errors and latency. From there, they can select components of the chart that correspond with changes in service performance in order to drill down into the data associated with the change. Splunk APM automatically surfaces tags associated with the trend, helping teams spotlight the root cause of the issue.
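As a rough illustration of what surfacing tags associated with a trend can mean, the sketch below ranks tag values by how over-represented they are among slow traces compared with all traces. This is a simplified stand-in, not Splunk APM's actual statistical model. The function name `surface_tags`, the trace layout and the fixed latency cutoff are all assumptions made for the example.

```python
from collections import Counter


def surface_tags(traces, latency_threshold_ms):
    """Rank (key, value) tags by over-representation in slow traces.

    The score is the tag's share among slow traces minus its share among
    all traces, so a tag seen in every slow trace but in only half of all
    traces scores 0.5.
    """
    slow = [t for t in traces if t["duration_ms"] > latency_threshold_ms]
    if not slow:
        return []

    all_counts = Counter((k, v) for t in traces for k, v in t["tags"].items())
    slow_counts = Counter((k, v) for t in slow for k, v in t["tags"].items())

    ranked = [
        (tag, n_slow / len(slow) - all_counts[tag] / len(traces))
        for tag, n_slow in slow_counts.items()
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


traces = [
    {"duration_ms": 120, "tags": {"version": "v1", "region": "us-east"}},
    {"duration_ms": 110, "tags": {"version": "v1", "region": "us-west"}},
    {"duration_ms": 900, "tags": {"version": "v2", "region": "us-east"}},
    {"duration_ms": 950, "tags": {"version": "v2", "region": "us-west"}},
]
top = surface_tags(traces, latency_threshold_ms=500)
```

With this sample data, the `version=v2` tag ranks first because it appears in every slow trace, while the region tags are spread evenly and score zero.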
The result is performance troubleshooting that is not only faster, but also more efficient. AI-driven analysis allows teams to identify problems more quickly than they could through manual investigation. Splunk APM also helps teams identify the most serious issues and their root causes, which in turn helps engineers determine which problems to prioritize.
Surfacing root cause errors
When you’re faced with a spike in errors, it can be difficult to determine which errors are the root cause of the issue, and which are secondary.
Splunk APM makes it easy for engineers to drill down into alerts. Using AI, Splunk fully contextualizes errors by providing data about duration and impacted services and endpoints. Teams can view the full error path of each error to determine which service it originates in and which other services it affects. With this data, engineers can surface the root cause of each error. They can also associate errors in other services with the root cause error, providing them with visibility into the relationships between primary and secondary errors.
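One way to picture the distinction between primary and secondary errors is sketched below, under the assumption that each span records its parent span and an error flag. An errored span with no errored children is treated as a root cause candidate, and the errors in its callers are treated as secondary. The field names and the helpers `root_cause_errors` and `error_path` are hypothetical, and the heuristic is a simplification rather than Splunk APM's implementation.

```python
def root_cause_errors(spans):
    """Return errored spans that have no errored child spans."""
    parents_of_errors = {
        s["parent_id"] for s in spans if s["error"] and s["parent_id"] is not None
    }
    return [s for s in spans if s["error"] and s["span_id"] not in parents_of_errors]


def error_path(spans, span_id):
    """Return service names from the trace root down to the given span."""
    by_id = {s["span_id"]: s for s in spans}
    path, seen = [], set()
    current = by_id.get(span_id)
    while current is not None and current["span_id"] not in seen:
        seen.add(current["span_id"])  # guard against malformed, cyclic data
        path.append(current["service"])
        current = by_id.get(current["parent_id"])
    return list(reversed(path))


spans = [
    {"span_id": "1", "parent_id": None, "service": "frontend", "error": True},
    {"span_id": "2", "parent_id": "1", "service": "checkout", "error": True},
    {"span_id": "3", "parent_id": "2", "service": "payment", "error": True},
    {"span_id": "4", "parent_id": "1", "service": "catalog", "error": False},
]
roots = root_cause_errors(spans)
path = error_path(spans, roots[0]["span_id"])
```

Here the error in `payment` is the only root cause candidate, and its error path runs from `frontend` through `checkout` to `payment`, which marks the errors in the two upstream services as secondary.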
Here again, the result for SRE and IT teams is faster and more efficient resolution. Instead of using trial and error to get to the root cause of errors, they can pinpoint the source of issues quickly, and work effectively in the face of alert storms.
Bringing contextual insights
Sometimes, the root cause of performance or availability problems on a microservice is not the service itself, but the infrastructure hosting it. Troubleshooting these types of issues can be especially challenging for teams that manage cloud-native environments like Kubernetes, which abstracts underlying infrastructure in ways that sometimes make it difficult to trace surface-level issues to their source.
Splunk APM uses AI to detect patterns and correlate individual containers with services within cloud-native environments. It also groups errors by cluster and supplies metadata tags that give deeper context on the mappings between services and infrastructure.
Using these insights, SREs and developers can more quickly understand the relationship between performance problems and infrastructure, even in distributed, cloud-native environments that abstract infrastructure and host constantly changing workloads.
Dynamic alert thresholds
AI also helps keep performance management workflows manageable by setting dynamic alert thresholds.
When your environment constantly scales up or down, fixed alerting thresholds can lead to alert storms that overwhelm your team. You can’t reliably establish a minimum number of service instances that should exist within a cluster, or a maximum amount of memory that a service is allowed to consume, if the number of requests that the cluster or service needs to support constantly fluctuates. In other words, there is no “normal” when traffic patterns and demand change constantly throughout the day.
Splunk APM addresses this issue by using AI to adjust alert thresholds automatically. Admins can configure alert rules to create APM detectors, which control when alerts fire based on dynamic conditions defined in the alert rules. Through their dynamic nature, APM detectors help display the impact radius of performance issues like errors and latency spikes. And, because Splunk APM automatically contextualizes errors with service topology and dependency information, it allows teams to perform deep incident analysis and triage problems based on their severity.
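To show the general idea behind a dynamic threshold, the sketch below derives the alert threshold from a rolling window of recent values (mean plus a multiple of the standard deviation) instead of a fixed number. It is a generic technique shown under stated assumptions, not the detector logic that Splunk APM ships. The class name `DynamicThreshold` and its parameters are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flag values that exceed mean + sigmas * stddev of recent history."""

    def __init__(self, window=30, sigmas=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # old values drop off automatically
        self.sigmas = sigmas
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value breaches the threshold built from prior values."""
        breach = False
        if len(self.history) >= self.min_samples:
            threshold = mean(self.history) + self.sigmas * pstdev(self.history)
            breach = value > threshold
        self.history.append(value)
        return breach


detector = DynamicThreshold(window=30, sigmas=3.0, min_samples=5)
latencies_ms = [100, 102, 98, 101, 99, 100, 400]
alerts = [detector.observe(v) for v in latencies_ms]
```

Only the final value, 400 ms, triggers an alert in this sample. Because the threshold follows the recent baseline, a gradual and sustained rise in traffic moves the threshold with it rather than firing on every data point. One design caveat: this sketch also adds breaching values to the history, so a long incident will eventually raise the baseline.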
In short, alerting in Splunk APM not only saves engineers from alert storms and alert fatigue, but also provides the context they need to respond to alerts efficiently and effectively.
Performance management and reliability engineering are more challenging than ever — and attempting to troubleshoot issues manually is less effective than ever.
You need to evolve your troubleshooting strategy by leveraging native AI features to speed up and simplify complex workflows. Key to AI-driven, directed troubleshooting is a real-time, NoSample™ full-fidelity approach to application performance monitoring that allows for unlimited cardinality exploration. Faster troubleshooting, easier root cause analysis and more efficient remediation lead to happier customers, not to mention more productive SRE and IT teams.
This posting is my own and does not necessarily represent Splunk's position, strategies, or opinion.