Why You Need Real-Time for Faster MTTR
“If you ain't first, you're last.”
While that famous one-liner from Ricky Bobby (Will Ferrell) in the cult hit Talladega Nights is more joke than catchphrase, it hits home for those of us in the world of DevOps and Observability. Faster is better. And in our technology-driven world of online transactions and complex environments, faster isn’t just better — it’s crucial.
Today, our systems are increasingly complex and increasingly distributed, with transactions flowing across the entire stack. The concern is that two dimensions are moving at the same time: our applications keep changing (microservices, continuous deployment) while our infrastructure behaves in increasingly elastic and ephemeral ways (containers, orchestration, serverless). We have moved from the simple to the complex, where we need to probe, sense and respond, constantly and in real-time.
In today’s instant-gratification world, where 3.7 seconds can result in a lost sale, you can’t afford any delay in either spotting or responding to incidents. Obviously faster is better.
But what does fast mean when we are observing our environment? Well, being fast enough for our complex systems requires three principal characteristics:
We need to ingest data in real-time. Our users operate in real-time and so should we. When you are dealing with only one or two servers, it might be remotely possible for either push or pull to keep up. But when you’re operating containers and microservices at scale, you can’t pull that fast, and push is the only viable method to keep data flowing.
We must have extremely fast detection, based on powerful (and flexible) computation and coupled with AI/ML capabilities to help identify and alert on those things we didn’t even know could go wrong.
After all, the complexity of our systems and apps makes it hard to build mental maps of them. Microservices create complex, loosely coupled transactional paths. Those services may be hosted on multiple virtual compute environments, under orchestration like Kubernetes, which in turn creates multiple mappings from app to infrastructure. In short, system behavior is unpredictable and failure conditions don’t repeat in exactly the same way as before. Sometimes we don’t suspect that something can happen, but observability can help by alerting us to changes that are unexpected or unplanned. Our systems need to learn and extend detection for us, recognizing anomalies whether they are sudden changes or deviations from historical performance.
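To make the two kinds of anomaly mentioned above a little more concrete, here is a minimal Python sketch. It is not how any particular product detects anomalies; the function names, window size and thresholds are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def sudden_change(values, window=5, threshold=3.0):
    """Return True if the newest value sits more than `threshold`
    standard deviations away from the mean of the preceding `window` values.
    (window and threshold are illustrative defaults, not recommendations.)"""
    if len(values) < window + 1:
        return False  # not enough data to judge
    recent = values[-(window + 1):-1]
    mu = mean(recent)
    sigma = stdev(recent)
    if sigma == 0:
        # A perfectly flat window: any movement at all is a change.
        return values[-1] != mu
    return abs(values[-1] - mu) / sigma > threshold


def historical_deviation(current, history, tolerance=0.25):
    """Return True if `current` differs from the mean of comparable
    historical values (e.g. same hour in previous weeks) by more than
    `tolerance`, expressed as a fraction of that mean."""
    baseline = mean(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance
```

For example, a latency series that hovers around 100 ms and then jumps to 250 ms trips `sudden_change`, while a reading of 130 against a historical baseline of 100 trips `historical_deviation` at the default 25% tolerance. Real detectors would usually need to account for seasonality and noise as well.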
That real-time and context-rich data also allows us to go even further. With the right choices, we can provide better controllability. In this complex world, and in line with DevOps practices, we need to consider how best to automate our responses. We can start building automated responses that take action to re-stabilize a system, like auto-remediation after a code deploy, provided we receive and process the complete set of data needed to make an informed choice. We can also tie runbooks to our alerting, to reduce the downtime spent choosing the appropriate response. Automation, in conjunction with keeping the right team informed, allows us to focus on improvements, not reactions, and do so quickly.
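As a rough illustration of tying runbooks and automated responses to alerts, the Python sketch below picks a response for an incoming alert. Everything in it is hypothetical: the `Alert` fields, the detector names, the runbook text and the ten-minute rollback window are assumptions for the example, not part of any real alerting API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str         # hypothetical: service the alert is about
    detector: str        # hypothetical: name of the detector that fired
    fired_at: float      # seconds since epoch
    data_complete: bool  # True if we hold the full, unsampled data set


# Hypothetical mapping from detector name to runbook.
RUNBOOKS = {
    "error_rate_spike": "Check the latest deploy; roll back if errors persist",
    "latency_regression": "Compare traces before and after the change",
}


def choose_response(alert, last_deploy_at, rollback_window_s=600):
    """Pick an action and a runbook for an alert.

    Auto-remediate (roll back) only when the alert fired shortly after
    a deploy AND the data behind it is complete; otherwise just notify.
    """
    runbook = RUNBOOKS.get(alert.detector, "Escalate to on-call")
    seconds_since_deploy = alert.fired_at - last_deploy_at
    recent_deploy = 0 <= seconds_since_deploy <= rollback_window_s
    action = "rollback" if recent_deploy and alert.data_complete else "notify"
    return {
        "action": action,
        "runbook": runbook,
        "notify": alert.service + "-oncall",  # the team is informed either way
    }
```

The design choice worth noting is that the team is notified in both branches, and the automated action is gated on data completeness, which mirrors the point that an informed choice needs the complete set of data.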
Further, with real-time and full-fidelity data, we can even start using predictive analytics, which can help us get ahead of problems. Prediction depends on having every possible piece of data, without any selection bias, so don’t try this with sampled or filtered data. And we do need to recognize that prediction isn’t necessarily perfect, but with understanding and tailored statistical analysis, like the kind you can do with SignalFlow, you can make prediction part of your active response repertoire to catch and resolve anomalies before they impact your users.
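One simple form of prediction is to fit a straight line to recent measurements and estimate when it will cross a limit. The Python sketch below does this with an ordinary least-squares fit. It is a generic illustration, not SignalFlow, and the function name and inputs are assumptions; it also presumes the samples are complete rather than sampled or filtered.

```python
def time_to_threshold(samples, threshold):
    """Estimate how long until a metric reaches `threshold`.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns the estimated seconds from the last sample until the fitted
    line crosses the threshold, 0.0 if the fit is already at or above it,
    or None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        return None  # all samples share one timestamp; no trend to fit
    # Ordinary least-squares slope and intercept.
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom
    intercept = v_mean - slope * t_mean
    last_t = ts[-1]
    if slope * last_t + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_t
```

For instance, disk usage samples of 50%, 56%, 62% and 68% taken a minute apart grow by about 0.1 percentage points per second, so a 90% limit would be reached in roughly 220 seconds after the last sample. A straight line is a deliberately naive model; as noted above, prediction isn’t necessarily perfect.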
And with observability, “Faster is better” also means “Better is faster”. Better detection, better response and even better resolutions are driven by the streaming data and real-time analysis of the Splunk Observability Cloud.
Your applications are critical to your business and their performance is crucial to your happy customers. So take action today to get your MTTR under control. Start a free trial of Splunk Observability Cloud and go from “don’t know” to “fixed” faster than ever.
----------------------------------------------------
Thanks!
Dave McAllister