“I know we don’t have tests for that, but it’s a small change; it’s probably fine…”
“I ran the same commands I always do, but…something just doesn’t seem quite right.”
“That rm -rf sure is taking a long time!”
If you’ve worked in software operations, you’ve probably heard or uttered similar phrases. They mark the beginning of the best “Ops horror stories” the hallway tracks of Velocity and DevOps Days the world over have to offer. We hold onto and share these stories because, back at that moment in time, what happened next to us, our teams, and the companies we work for became an epic journey.
Incidents (and managing them, or…not, as the case may be) are far from a “new” field: indeed, as an industry, we’ve experienced incidents as long as we’ve had to operate software. But the last decade has seen a renewed interest in digging into how we react to, remediate, and reason after the fact about incidents.
This increased interest has been largely driven by two tectonic shifts playing out in our industry: the first began almost two decades ago and was a consequence of a change in the types of products we build. An era of shoveling bits onto metallic dust-coated plastic and laser-etched discs that we then shipped in cardboard boxes to users to install, manage, and “operate” themselves has given way to a cloud-connected, service-oriented world. Now we, not our users, are on the hook to keep that software running.
The second industry shift is more recent, but just as notable: the DevOps movement has convincingly made the argument that “if you build it, you should also be involved (at least in some way) in running it,” a sentiment that has spurred many a lively conversation about who needs to be carrying pagers these days! This has resulted in more of us, from ops engineers to developers to security engineers, being involved in the process of operating software on a daily basis, often in the very midst of operational incidents.
I had the pleasure of meeting Jason at Velocity Santa Clara in 2014, after I’d presented “A Look at Looking in the Mirror,” a talk on the very topic of operational retrospectives. Since then, we’ve had the opportunity to discuss, deconstruct, and debate (blamelessly, of course!) many of the ideas you’re about to read. In the last three years, I’ve also had the honor of spending time with Jason, sharing our observations of and experiences gathered from real-world practitioners on where the industry is headed with post-incident reviews, incident management, and organizational learning.
But the report before you is more than just a collection of the “whos, whats, whens, wheres, and (five) whys” of approaches to post-incident reviews. Jason explains the underpinnings necessary to hold a productive post-incident review and to be able to consume those findings within your company. This is not just a “postmortem how-to” (though it has a number of examples!): this is a “postmortem why-to” that helps you to understand not only the true complexity of your technology, but also the human side, which together make up the socio-technical systems that are the reality of the modern software we operate every day.
Through all of this, Jason illustrates the positive effect of taking a “New View” of incidents. If you’re looking for ways to get better answers about the factors involved in your operational incidents, you’ll learn myriad techniques that can help. But more importantly, Jason demonstrates that it’s not just about getting better answers: it’s about asking better questions.
No matter where you or your organization are in your journey of tangling with incidents, you have in hand the right guide to start improving your interactions with incidents.
And when you next hear one of those hallowed phrases that you know will mark the start of a great hallway track tale, you’ll be confident, after reading this guide, that once you’ve all pulled together to fix the outage and the dust has settled, you’ll know exactly what you and your team need to do to turn that incident on its head and harness all the lessons it has to teach you.
— J. Paul Reed DevOps consultant and retrospective researcher San Francisco, CA July 2017