Problems with Post-Incident Reviews Today
We’ve all seen the same problems repeat themselves. Recurring small incidents, severe outages, and even data losses are stories many in IT can commiserate over in the halls of tech conferences and on the forums of Reddit. It’s the nature of building complex systems. Failure happens.
However, in many cases we fall into the common trap of repeating the same process over and over again, expecting the results to be different. We investigate and analyze problems using techniques that have been well established as “best practices.” We always feel like we can do it better. We think we are smarter than others—or maybe our previous selves—yet the same problems seem to continue to occur, and as systems grow, the frequency of the problems grows as well.
Attempts at preventing problems always seem to be an exercise in futility. Teams become used to chaotic, reactionary responses to IT problems, unaware that the way things were done in the past may no longer apply to modern systems. We have to change the way we approach the work. Sadly, in many cases, we don’t have the authority to bring about change in the way we do our jobs. Tools and process decisions are made from the top.
Directed by senior leaders, we fall victim to routine and the fact that no one has ever stopped to ask if what we are doing is actually helping. Traditional techniques of post-incident analysis have had minimal success in providing greater availability and reliability of IT services. In Chapter 6, we will explore a fictional case study of an incident to illustrate an example service disruption and post-incident review, one that more closely represents a systematic approach to learning from failure in order to influence the future.
When previous attempts to improve our systems through retrospective analysis have not yielded IT services with higher uptime, it can be tempting to give up on the practice altogether. In fact, for many these exercises turn out to be nothing more than boxes for someone higher up the management ladder to check. Corrective actions are identified, tickets are filed, “fixes” are put in place…. Yet problems continue to present themselves in new but familiar ways, and the disruptions or outages continue to happen more frequently and with larger impact as our systems continue to grow.
This chapter will explore one “old-school” approach—root cause analysis—and show why it is not the best choice for post-incident analysis.
Root cause analysis (RCA) is the most common form of post-incident analysis. It’s a technique that has been handed down through the IT industry and its communities for many years. Adopted in large part from industries such as manufacturing, RCAs lay out a clear path of asking a series of questions to understand the true cause of a problem. In the lore of many companies the story goes, “Someone long ago started performing them. They then became seen as a useful tool for managing failure.” Rarely has anyone stopped to ask if they are actually helping our efforts to improve uptime.
I remember learning about and practicing RCAs in my first job out of college in 1999, as the clock ticked closer and closer to Y2K. It was widely accepted that “this is how you do it.” To make matters worse, I worked for a growing manufacturing and assembly company. This approach to solving “downtime” problems was ingrained in the industry and the company ethos. When something broke, you tracked down the actions that preceded the failure until you found the one thing that needed to be “fixed.” You then detailed a corrective action and a plan to follow up and review every so often to verify everything was operating as expected again.
We employed methods such as the “5 Whys” to step through what had happened in a linear fashion to get to the bottom of the problem.
Q: Why did the mainframe lose connection to user terminals?
A: Because an uplink cable was disconnected.
Q: Why was a cable disconnected?
A: Because someone was upgrading a network switch.
Q: Why was that person upgrading the switch?
A: Because a higher-bandwidth device was required.
Q: Why was more bandwidth required?
A: Because the office has expanded and new users need access to the system.
Q: Why do users need access to the system?
A: To do their work and make the business money.
If we were to stop asking questions here (after 5 Whys), the root cause for this problem (as a result of this line of questioning) would be identified as: the business wants to make money!
Obviously this is a poor series of questions to ask, and therefore a terrible example of a 5 Whys analysis. If you’ve performed one in the past, you are likely thinking that no one in their right mind would ask these questions when trying to understand what went wrong.
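To make the critique concrete, the linear nature of a 5 Whys chain can be sketched as a simple data structure. This is a hypothetical illustration (the names `five_whys` and `root_cause` are mine, not part of any formal RCA tooling): the “root cause” is merely whichever answer the chain happens to terminate on, so both the questions chosen and the stopping depth determine the conclusion.

```python
# Hypothetical sketch: a 5 Whys exercise modeled as a linear list of
# (question, answer) pairs, taken from the example above.
five_whys = [
    ("Why did the mainframe lose connection to user terminals?",
     "An uplink cable was disconnected."),
    ("Why was a cable disconnected?",
     "Someone was upgrading a network switch."),
    ("Why was that person upgrading the switch?",
     "A higher-bandwidth device was required."),
    ("Why was more bandwidth required?",
     "The office has expanded and new users need access to the system."),
    ("Why do users need access to the system?",
     "To do their work and make the business money."),
]

def root_cause(chain, depth=5):
    """Return the answer at the given depth -- the 'root cause' this
    particular line of questioning arrives at. Note the conclusion is
    an artifact of where the questioning stops, not of the system."""
    return chain[min(depth, len(chain)) - 1][1]

# Stopping after five whys blames the business motive...
print(root_cause(five_whys))
# ...while stopping two questions earlier yields a different "cause."
print(root_cause(five_whys, depth=3))
```

The point of the sketch is that a linear chain can only ever surface one cause per path walked, while real incidents in complex systems have many interacting contributors that no single path captures.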
Herein lies the first problem with this approach: reasoning about a series of events is not objective. It will vary depending on the perspectives of those asking and answering the questions.
Had you performed your own 5 Whys analysis on this problem, you would have likely asked a completely different set of questions and concluded that the root of the problem was something quite different. For example, perhaps you would have determined that the technician who replaced the switch needs further training on how (or more likely when) to do this work “better.” This may be a more accurate conclusion, but does this identification of a single cause move us any closer to a better system?
This brings us to the second problem with this approach: our line of questioning led us to a human. By asking “why” long enough, we eventually concluded that the cause of the problem was human error and that more training or formal processes are necessary to prevent this problem from occurring again in the future.
This is a common flaw of RCAs. As operators of complex systems, it is easy for us to eventually pin failures on people. Going back to the lost data example in the Introduction, I was the person who pushed the buttons, ran the commands, and solely operated the migration for our customer. I was new to the job and my Linux skills weren’t as sharp as they could have been. Obviously there were things that I needed to be trained on, and I could have been blamed for the data loss. But if we switch our perspective on the situation, I in fact discovered a major flaw in the system, preventing future similar incidents. There is always a bright side, and it’s ripe with learning opportunities.
The emotional pain that came from that event was something I’ll never forget. But you’ve likely been in this situation before as well. It is just part of the natural order of IT. We’ve been dealing with problems our entire professional careers. Memes circulate on the web joking about the inevitable “Have you rebooted?” response trotted out by nearly every company’s help desk team. We’ve always accepted that random problems occur and that sometimes we think we’ve identified the cause, only to discover that either the fix we put in place caused trouble somewhere else or a new and more interesting problem has surfaced, rendering all the time and energy we put into our previous fix nearly useless. It’s a constant cat-and-mouse game. Always reactionary. Always on-call. Always waiting for things to break, only for us to slip in a quick fix to buy us time to address our technical debt. It’s a bad cycle we’ve gotten ourselves into.
Why should we continue to perform an action that provides little to no positive measurable results when the availability and reliability of a service are ultimately our primary objectives?
We shouldn’t. In fact, we must seek out a new way: one that aligns with the complex conditions and properties of the environments we work in, where the members of our team or organization strive to be proactive movers rather than passive reactors to problems. Why continue to let things happen to us rather than actively making things happen?
No matter how well we engineer our systems, there will always be problems. It’s possible to reduce their severity and number, but never to zero. Systems are destined to fail. In most cases, we compound the issue by viewing success as a negative: the absence of failure, the avoidance of criticism or incident.
We’ve been looking at post-incident analysis the wrong way for quite some time. Focusing on avoiding problems distracts us from seeking to improve the system as a whole.
What makes a system a system? Change. This concept is central to everything in this book. Our systems are constantly changing, and we have to be adaptable and responsive to that change rather than rigid. We have to alter the way we look at all of this.
If you’re like me, it’s very easy to read a case study or watch a presentation from companies such as Amazon, Netflix, or Etsy and be suspicious of the tactics they are suggesting. It’s one thing to learn of an inspirational new approach to solving common IT problems and accept that it’s how we should model our efforts. It’s something quite different to actually implement such an approach in our own companies.
But we also recognize that if we don’t change something we will be forever caught in a repeating cycle of reactionary efforts to deal with IT problems when they happen. And we all know they will happen. It’s just a matter of time. So, it’s fair for us to suspect that things will likely continue to get worse if some sort of change isn’t effected.
You’re not alone in thinking that a better way exists. While stories from “unicorn” companies like those mentioned above may seem unbelievably simple or too unique to their own company culture, the core of their message applies to all organizations and industries.
For those of us in IT, switching up our approach and building the muscle memory to learn from failure is our way to a better world, and much of it lies in a well-executed post-incident analysis.
In the next four chapters, we’ll explore some key factors that you need to consider in a post-incident analysis to make it a successful “systems thinking” approach.