Why You Need Real-Time for Faster MTTR

“If you ain’t first, you’re last.”

While that famous one-liner from Ricky Bobby (Will Ferrell) in the cult hit Talladega Nights is more joke than catchphrase, it hits home for those of us in the world of DevOps and Observability. Faster is better. And in our technology-driven world of online transactions and complex environments, faster isn’t just better — it’s crucial.

Want to skip the reading and experience it for yourself? Start a trial instantly.

Today, our systems are increasingly complex and distributed, with transactions flowing across the entire stack. The concern is that two dimensions are moving at the same time: our applications are changing constantly (microservices, continuous deployment) while our infrastructure is increasingly elastic and ephemeral (containers, orchestration, serverless). We have moved from the simple to the complex, where we need to probe, sense and respond constantly, in real time.

In today’s instant-gratification world, where a delay of 3.7 seconds can result in a lost sale, you can’t afford delays in either spotting or responding to incidents. Obviously, faster is better.

But what does fast mean when we are observing our environment? Being fast enough in the face of these complex systems requires three principal characteristics:

We need very fine granularity with high-speed reporting. It’s not uncommon to see first-generation monitoring tools still reporting at resolutions of 30 to 60 seconds. Next-generation monitoring and observability solutions, like the Splunk Observability Cloud, can report metrics at 1-second resolution. After all, when the average execution time of a serverless function is around 1 second, think how many events you might miss in even a 60-second window across your entire environment.
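To make that concrete, here is a minimal sketch in plain Python (synthetic numbers, no real monitoring client) of how a short spike that is obvious at 1-second resolution vanishes into a 60-second average:

```python
# Sketch: how coarse reporting intervals hide short-lived spikes.
# The CPU series below is synthetic and purely illustrative.

def downsample(series, window):
    """Average `series` into buckets of `window` samples."""
    return [
        sum(series[i:i + window]) / window
        for i in range(0, len(series), window)
    ]

# 60 one-second CPU readings: steady 20% load with a 2-second spike to 100%.
cpu = [20.0] * 60
cpu[30] = cpu[31] = 100.0

at_1s = max(cpu)                    # 1-second resolution sees the spike
at_60s = max(downsample(cpu, 60))   # a 60-second average flattens it

print(f"peak at 1s resolution:  {at_1s:.1f}%")   # 100.0%
print(f"peak at 60s resolution: {at_60s:.1f}%")  # ~22.7%
```

The spike that would page someone at 1-second resolution reads as a barely elevated average at 60 seconds.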

We need to ingest data in real time. Our users operate in real time and so should we. When you are dealing with only one or two servers, polling (pull) might keep up. But when you’re operating containers and microservices at scale, you can’t poll fast enough, and push is the only viable method to keep data flowing.
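As a rough illustration of the push model, here is a sketch in plain Python; `send_to_collector` is a hypothetical stand-in for a real metrics client, not any particular product’s API:

```python
# Sketch: push-based reporting -- each source emits on its own schedule,
# so no central poller has to keep up with thousands of targets.
import time

collected = []

def send_to_collector(name, value, ts):
    """Hypothetical stand-in for a real metrics client's send call."""
    collected.append((name, value))

def push_loop(read_metric, interval_s=1.0, iterations=3):
    """Emit a fresh reading every `interval_s` seconds."""
    for _ in range(iterations):
        send_to_collector("cpu.utilization", read_metric(), time.time())
        time.sleep(interval_s)

# Short interval only so the demo finishes quickly.
push_loop(lambda: 42, interval_s=0.01)
print(len(collected), "readings pushed")  # 3 readings pushed
```

In a pull model, the loop would live in the monitoring system and have to visit every target; here each target is responsible for its own cadence.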

With this much data, it is tempting to reduce the load, and sampling or filtering seems like an easy way to do it. But filtering and sampling reduce the accuracy of analysis and create visualization blind spots. Linear or low-pass filtering smooths your visualization, destroying the sharp edges (and the loss of those sharp edges might be just what cuts you). Bandpass filtering removes the outliers (or, conversely, the baseline), which can make things look better or worse than they actually are. Sampling, whether head- or tail-based, discards data that might be useful in the future. After all, we’re going to be resolving issues that we didn’t expect or even suspect (unknown unknowns).
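A quick sketch of that first point, using an illustrative latency series and a simple trailing moving average (a basic low-pass filter):

```python
# Sketch: a moving average (low-pass filter) erodes the sharp edges
# that often matter most. The latency values are illustrative only.

def moving_average(series, k):
    """Trailing k-point moving average (shorter windows during warm-up)."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - k + 1):i + 1]
        out.append(sum(window) / len(window))
    return out

latency_ms = [50, 52, 51, 950, 53, 50, 49, 51]   # one 950 ms outlier
smoothed = moving_average(latency_ms, 4)

print(f"raw peak:      {max(latency_ms)} ms")       # 950 ms
print(f"smoothed peak: {max(smoothed):.1f} ms")     # 276.5 ms
```

The 950 ms spike that a user actually felt shows up in the filtered view as a modest bump.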

We must have extremely fast detection, based on powerful (and flexible) computation and coupled with AI/ML capabilities to help identify and alert on those things we didn’t even know could go wrong.

After all, the complexity of our systems and apps means we have challenges building mental maps of them. Microservices create complex, loosely coupled transactional paths. Those services may be hosted on multiple virtual compute environments, under orchestration like Kubernetes, and in turn create multiple mappings from app to infrastructure. In short, system behavior is unpredictable, and failure conditions don’t repeat in exactly the same way as before. Sometimes we don’t even suspect something can happen, but observability can alert us to unexpected or unplanned changes. Our systems need to learn and extend detection for us, recognizing anomalies, whether sudden changes or deviations from historical performance.
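One simple (and deliberately naive) way to flag deviations from historical performance is a rolling z-score; the window and threshold below are illustrative assumptions, not a description of any product’s detector:

```python
# Sketch: flag values that deviate sharply from recent history.
# Window size and threshold are illustrative assumptions.
import statistics

def zscore_anomalies(series, window=10, threshold=3.0):
    """Return indices where a value is more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev and abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Steady request rate, then a sudden jump at the last sample.
requests = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 250]
print(zscore_anomalies(requests))  # [10] -- only the jump stands out
```

Real anomaly detection also has to handle seasonality, trends and noisy baselines, which is where the AI/ML capabilities above come in.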

That real-time, context-rich data also allows us to go even further. With the right choices, we can provide better controllability. In this complex world, and in line with DevOps practices, we need to consider how best to automate our responses. If we receive and process the complete set of data to make an informed choice, we can start building automated responses that take action to re-stabilize a system, like auto-remediation after a code deploy. We can also tie runbooks to our alerting, to reduce the time lost choosing the appropriate response. Automation, in conjunction with keeping the right team informed, allows us to focus on improvements, not reactions, and do so quickly.
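A toy dispatcher shows the shape of tying runbooks to alerts; the alert types and remediation actions here are entirely hypothetical:

```python
# Sketch: mapping alert types to runbook actions so the first response
# is automatic. All alert names and actions here are hypothetical.

def roll_back_deploy(alert):
    return f"rolling back {alert['service']} to previous version"

def restart_service(alert):
    return f"restarting {alert['service']}"

RUNBOOK = {
    "high_error_rate_after_deploy": roll_back_deploy,
    "service_unresponsive": restart_service,
}

def handle_alert(alert):
    """Dispatch an alert to its runbook action; fall back to a human."""
    action = RUNBOOK.get(alert["type"])
    return action(alert) if action else f"paging on-call for {alert['type']}"

print(handle_alert({"type": "service_unresponsive", "service": "checkout"}))
# -> restarting checkout
```

The value is less in the dispatch table itself than in the fact that the common, well-understood failures never wait on a human to look up the runbook.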

Further, with real-time and full-fidelity data, we can even start using predictive analytics, which can help us get ahead of problems. Prediction depends on having every possible piece of data, without any selection bias, so don’t try this with sampled or filtered data. And we do need to recognize that prediction isn’t necessarily perfect, but with understanding and tailored statistical analysis, like the kind you can do with SignalFlow, you can make prediction part of your active response repertoire to catch and resolve anomalies before they impact your users.
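As a back-of-the-envelope illustration of prediction (far simpler than the tailored statistical analysis mentioned above), a least-squares trend line can project when a growing metric crosses a limit; the disk-usage numbers are made up:

```python
# Sketch: project when a trending metric hits a ceiling.
# The disk-usage figures are illustrative, not real data.

def linear_fit(ys):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Disk usage (%) sampled hourly: when does the trend cross 100%?
usage = [60, 62, 64, 66, 68, 70]
slope, intercept = linear_fit(usage)
hours_to_full = (100 - intercept) / slope
print(f"projected full in ~{hours_to_full:.0f} hours from sample 0")  # ~20
```

Even this crude projection turns a future outage into a scheduled task; richer models just widen the range of problems you can get ahead of.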

Observability depends on being fast, accurate and precise, so data is the critical path for our ability to respond and resolve issues rapidly. Our data has to be streaming: real-time in both collection and analysis. We need fast detection and alerting, allowing our systems themselves to help us help them.

And with observability, “Faster is better” also means “Better is faster”. Better detection, better response and even better resolutions are driven by the streaming data and real-time analysis of the Splunk Observability Cloud.

Your applications are critical to your business and their performance is crucial to your happy customers. So take action today to get your MTTR under control. Start a free trial of Splunk Observability Cloud and go from “don’t know” to “fixed” faster than ever.

Posted by Dave McAllister