Modernizing On-Call for Better Results
One of the most important ideas we evangelize at VictorOps is that we, as software makers and digital service providers, must start thinking of our systems in a more holistic, “Systems Thinking” way. VictorOps, just like any other company, is a socio-technical system: both the technology and the humans who build and operate it must be considered part of the “system.”
Because of this, we need new ways of incorporating human considerations into how we build and operate services. As good as we may be at software development and architecture, it still takes humans to respond when systems inevitably break.
Some organizations are experimenting with self-healing systems, but these are mostly focused on infrastructure. Companies using self-healing systems are still far outnumbered by those of us who rely on our subject matter experts and developers to restore service whenever a disruption occurs.
On-call and reliability are two sides of the same coin, forever tied to each other. As new instrumentation exposes new areas of the system that deserve closer attention, we often forget the challenges of scaling and improving our on-call practices.
It has become rarer and rarer for us to encounter customers who are establishing Network Operations Centers (NOCs). The concepts outlined throughout this book should highlight a shift in how urgently service disruptions are treated. Businesses can’t wait until Monday morning when someone from IT gets into the office. Revenue, reputation, and more are tied directly to the site and its accompanying digital services. If the site is down for even a few moments, that’s a HUGE problem.
This is an urgency that most companies can’t outrun or ignore. Instilling a sense of urgency around restoring service whenever a problem occurs is necessary for improving the reliability of a service. Because of this, companies are cutting out as many “middlemen” as possible and getting the person or team who is most qualified in that moment to address and resolve an incident.
Bouncing support tickets through a tiered-support group while the clock ticks away is devastating to a company undergoing its own digital transformation, particularly early in the journey. The stakes are often higher for a company that has a big name but is still new to this way of serving customers.
Tech startups and Silicon Valley garage projects can afford to experience some downtime. It’s the price their users pay to ride the wave of innovation; as early adopters, they are willing to take that gamble just to be among the first to use the service. But once a large, well-known organization has experienced a significant outage, customers will have difficulty reconciling the company’s enormous resources with any amount of downtime in the service on which they have come to “rely.”
Modernizing on-call practices should not involve outsourced or tiered support, tickets waiting to be created and assigned, or any other activity that further delays the restoration of service. When the system is no longer “normal,” we have already deemed the associated alert or incident actionable; therefore, the person or team most qualified to restore service (in that moment) should in all cases be the first responder, not someone who is going to escalate.
First responders should rotate often, and experience and systems knowledge will, of course, vary from engineer to engineer. While a more senior architect on the team may appear to be the most qualified, we never want to encourage a superhero mentality where only specific individuals hold the “know-how” to solve our most critical problems. Modernizing on-call rotations also means bringing in more of the team while making more of the system available to all. With the right context, alongside safe and accountable access to the same tools as any other engineer, even the most junior developer should be able to respond to an incident first and begin making a positive impact on restoring service. As the moment varies, so should the “qualified engineer.”

Transfer knowledge through your on-call practices. Have empathy not only for the customer’s perspective but for the first responders as well. What would be most helpful when acknowledging an incident? How can I solve this problem the fastest? What can I give “future me” to restore service as quickly as possible?
How would we know the answers to these questions? How could I possibly know what the future will need in a moment like this? As we become more familiar with our systems and use our tools earlier in the SDLC, engineers start to figure out what is going to be helpful. Being alerted during office hours to issues in pre-production environments caused by new code means that problems can be caught much sooner. Not only that, but we have verified that we can catch (or see) what we are looking for before it goes to production. Additionally, we gain an intimate knowledge of the monitoring and alerting tools that are eventually used in production.
By the time this kind of problem has the opportunity to surface in production, we should have instrumented the service better and established only actionable alerts that contain whatever relevant context may be helpful to the responding engineer. It is all part of a bigger effort of continuously learning more about our systems.
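To make the idea of an “actionable alert with context” concrete, here is a minimal sketch in Python. This is not the VictorOps API or any real alerting schema; every field name, URL, and value below is an illustrative assumption about the kind of context a first responder might want handed to “future me.”

```python
# Hypothetical sketch of an alert payload that carries responder context.
# None of these field names come from a real product; they illustrate
# the kinds of context that turn a page into an actionable alert.

def build_alert(entity_id, message, runbook_url, dashboard_url, recent_change):
    """Bundle the context a first responder needs into the alert itself."""
    return {
        "entity_id": entity_id,          # which component is unhealthy
        "message": message,              # what "not normal" means here
        "runbook_url": runbook_url,      # where "future me" wrote down the fix
        "dashboard_url": dashboard_url,  # the graphs that show the symptom
        "recent_change": recent_change,  # the change most likely at fault
    }

alert = build_alert(
    entity_id="checkout-api",
    message="p99 latency above 2s for 5 minutes",
    runbook_url="https://wiki.example.com/runbooks/checkout-latency",
    dashboard_url="https://grafana.example.com/d/checkout",
    recent_change="deploy of checkout-api v1.42, 14:02 today",
)

# Every field is populated, so the page that wakes the on-call engineer
# arrives with enough context to start acting, not a bare "service down."
assert all(alert.values())
```

The point of the sketch is the contrast: a bare “service down” page forces the responder to go hunting for runbooks, graphs, and recent changes, while an alert built this way delivers them up front.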
Improving Mental Models
“Learning from Failure” has long been a message we repeat. It means strengthening and improving our mental models of how systems actually work versus how we think they work. A large gap between individual understandings of how systems behave almost always exists; moreover, no single perception of the system is correct. The more we isolate our understandings of systems from one another and avoid exercises that help us build more realistic understandings of them, the longer and more harmful a service disruption will be.
It is with this in mind that we determined our next assignment: a full-day event to proactively poke at our systems and improve our understanding of how they actually work, what we have visibility into, and ultimately what can be done in the near term to improve things.