How to Use Observability to Reduce MTTR

When you’re operating a web application, the last thing you want to hear is “the site is down." Regardless of the reason, the fact that it is down is enough to cause anyone responsible for an app to break out into a sweat. As soon as you become aware of an issue, a clock starts ticking — literally, in some cases — to get the issue fixed. Minimizing this time between an issue occurring and its resolution is arguably the number one goal for any operations team.

Want to skip the reading and experience it for yourself? Start a trial instantly.

Unfortunately, the complexity and distributed nature of modern applications makes it harder to troubleshoot and resolve issues than ever before. Modern applications can be running in multiple hosting environments, distributed around the world, with requests routed to different instances with every single request. Tracking down why certain requests are failing and then fixing it can be a nightmare, especially without a meaningful way to see exactly what is happening in your architecture and where the issues exist.

Enter Observability — an evolution of classical monitoring solutions that lets you see inside your application. In the rest of this post, I’ll explain a bit more about measuring your mean time to resolution (MTTR) and how Observability can help you reduce it.

When issues occur, they are resolved through the model shown above. Something changes that causes an incident to happen. The incident is detected through an observability detector (or an angry customer tweet, or a legacy monitoring platform). Some unfortunate soul is then notified and has to get a clue about what caused the incident. Finally, someone else has to resolve the issue and make sure that it won’t recur. The time it takes from when the incident happens to when it is finally resolved is, of course, the ‘time to resolve.’ Averaging your time to resolve across all of your incidents creates ‘mean time to resolve’ (MTTR). This metric can help determine your application’s overall quality and stability and the extent to which your observability solution is doing its job. For a more detailed look at the issue resolution process, and how observability helps, check out our detailed infographic.

Now that I’ve discussed what MTTR is and why it’s important to minimize it, let’s discuss how observability helps you reduce it.

Observability gives you visibility into three key types of data: metrics, traces, and logs. Using this data, you can find out when incidents occur, locate the faulty component, then understand why the component is broken. Troubleshooting incidents requires these three types of data — each one used to understand a different part of the problem.

With Splunk Observability Cloud, not only do you have access to these three key pillars of data, but you also have access to real user monitoring and synthetic monitoring to identify issues occurring in user browsers, third-party JavaScript libraries or even public networks. Furthermore, our real-time streaming analytics detects incidents in mere seconds. There’s no risk that the broken traces might be missing data — we do not sample and instead ingest, analyze and store all trace data in full fidelity (see why sampling is a bad idea in observability). Once a problematic trace is identified, directed troubleshooting powered by AI and ML helps pinpoint the source of the problem and live service logs from the service, automatically analyzed for context, appear right away. Integration with Splunk On-Call gets the right person the first time, and they can seamlessly review any metric, trace or log in context to quickly find out exactly what happened, where and why. These actions combine to minimize MTTR — Splunk customers can go from issue detection to understanding the cause in under four minutes 90% of the time.

As a provider of a technology platform for first responders, Mark43 knows seconds matter during outages. An early adopter of microservice technology, they also understood the complexity of those services and the importance of reducing MTTR. Their legacy monitoring solution just didn’t cut it — they needed observability. After adopting Splunk Infrastructure Monitoring and Application Performance Monitoring, they reduced MTTR down by 95% to mere minutes. 

Clearly, observability platforms provide immense benefit to solving issues quickly, the first time. Being able to isolate issues in such a short time means that MTTR will be lower and customers and developers will both be happier. Only Splunk Observability Cloud provides one integrated experience for identifying, troubleshooting, and resolving issues, based on every transaction, with full integration of metrics, traces, and logs — in addition to integrated incident response and AI-directed troubleshooting.

Your applications are critical to your business — you wouldn’t run them otherwise — so take action today to get your MTTR under control. Start a free trial of Splunk Observability Cloud and go from “don’t know” to “fixed” faster than ever.

Greg Leffler
Posted by

Greg Leffler

Greg heads the Observability Practitioner team at Splunk, and is on a mission to spread the good word of Observability to the world. Greg's career has taken him from the NOC to SRE to SRE management, with side stops in security and editorial functions. In addition to Observability, Greg's professional interests include hiring, training, SRE culture, and operating effective remote teams. Greg holds a Master's Degree in Industrial/Organizational Psychology from Old Dominion University.

Show All Tags
Show Less Tags