We never had a name for that huddle and discussion after I’d lost months’ worth of customer data. It was just, “Let’s talk about last night.” That was the first time I’d ever been a part of that kind of investigation into an IT-related problem.
At my previous company, we would perform RCAs following incidents like this. I didn’t know there was another way to go about it. We determined that the proximate cause was a bug in a backup script unique to Open CRM installations on AWS. More importantly, we all walked away with much more knowledge about how the system worked, armed with new action items to help us detect and recover from future problems like this much faster. As with the list of action items in Chapter 6, we set in motion many ways to improve the system as a whole rather than focusing solely on the one distinct part of the system that had failed under highly unusual circumstances.
It wasn’t until over two years later, after completely immersing myself in the DevOps community, that I realized the exercise we had performed (intentionally or not) was my very first post-incident review. I had already read blog posts and absorbed presentation after presentation about the absence of root cause in complex systems. But it wasn’t until I made the connection back to that first post-incident review that I realized it’s not about the report or discovering the root cause. It’s about learning more about the system and opening new opportunities for improvement: gaining a deeper understanding of the system as a whole and accepting that failure is a natural part of the process. Through that awareness, I finally saw the value in analyzing the distinct phases of an incident’s lifecycle. By setting targets for small improvements throughout detection, response, and remediation, I could make dealing with and learning from failure a natural part of the work done.
Thinking back on that day now gives me a new appreciation of what we were doing at that small startup and how advanced it was in a number of ways. I also feel fortunate that I can share that story and the stories of others, and what I’ve learned along the way, to help reshape your view of post-incident analysis and how you can continuously improve the reliability and availability of a service.
Post-incident reviews are so much more than discussing and documenting what happened in a report. They are often seen as only a tool to explain what happened and identify a cause, severity, and corrective action. In reality, they are a process intended to improve the system as a whole. By reframing the goal of these exercises as an opportunity to learn, a wealth of areas to improve become clear.
As we saw in the case of CSG International, the value of a post-incident review goes well beyond the artifact produced as a summary. They were able to convert local discoveries into improvements in areas outside of their own. They’ve created an environment for constant experimentation, learning, and making systems safer, all while making them highly resilient and available.
Teams and individuals are able to achieve goals much more easily with ever-growing collective knowledge regarding how systems work. The results include better team morale and an organizational culture that favors continuous improvement. The key takeaway: focus less on the end result (the cause and fix report) and more on the exercise that reveals many areas of improvement.
When challenged to review failure in this way, we find ingenious ways to trim seconds or minutes from each phase of the incident lifecycle, making for much more effective incident detection, response, and remediation efforts.
Post-incident reviews act as a source of information and solutions. This can create an atmosphere of curiosity and learning rather than defensiveness and isolationism. Everyone becomes hungry to learn.
Those who stick to the old-school way of thinking will likely continue to experience frustration. Those with a growth mindset will look for new ways of approaching their work and will consistently seek out opportunities to improve.
Which of the following is your next best step?
Do nothing and keep things the way they are.
Establish a framework and routine to discover improvements and understand more about the system as a whole as it evolves.
If you can answer that question honestly to yourself, and you are satisfied with your next step, my job here is done.
Wherever these suggestions and stories take you, I wish you good luck on your journey toward learning from failure and continuous improvement.
Serving as a DevOps Champion and advisor to VictorOps (now Splunk On-Call), Jason Hand writes, presents, and coaches on the principles and nuances of DevOps, modern incident management practices, and learning from failure. Named “DevOps Evangelist of the Year” by DevOps.com in 2016, Jason has authored two books on the subject of ChatOps and regularly contributes articles to Wired.com, TechBeacon.com, and many other online publications. Cohost of “The Community Pulse,” a podcast on building community within tech, Jason is dedicated to the latest trends in technology, sharing lessons learned, and helping people continuously improve.
I’d like to give an extra special “thank you” to the many folks involved in the creation of this report.
The guidance and flexibility of my editors Brian Anderson, Virginia Wilson, Susan Conant, Kristen Brown, and Rachel Head was greatly appreciated and invaluable. Thank you to Matthew Boeckman, Aaron Aldrich, and Davis Godbout for early reviews, as well as Mark Imbriaco, Courtney Kissler, Andi Mann, John Allspaw, and Dave Zwieback for their amazing and valuable feedback during the technical review process. Thanks to Erica Morrison and John Paris for your wonderful firsthand stories to share with our readers.
Thank you to J. Paul Reed, who was the first presenter I saw at Velocity Santa Clara in 2014. His presentation, “A Look At Looking In the Mirror,” was my first personal exposure to many of the concepts I’ve grown passionate about and have shared in this report.
Special thanks to my coworkers at VictorOps and Standing Cloud for the experiences and lessons learned while being part of teams tasked with maintaining high availability and reliability. To those before me who have explored and shared many of these concepts, such as Sidney Dekker, Dr. Richard Cook, Mark Burgess, Samuel Arbesman, Dave Snowden, and L. David Marquet…your work and knowledge helped shape this report in more ways than I can express. Thank you so much for opening our eyes to a new and better way of operating and improving IT services.
I’d also like to thank John Willis for encouraging me to continue spreading the message of learning from failure in the ways I’ve outlined in this report. Changing the hearts and minds of those set in their old way of thinking and working was a challenge I wasn’t sure I wanted to continue in late 2016. This report is a direct result of your pep talk in Nashville.
Last but not least, thank you to my family, friends, and especially my partner Stephanie for enduring the many late nights and weekends spent in isolation while I juggled a busy travel schedule and deadlines for this report. I’m so grateful for your patience and understanding. Thank you for everything.