Chapter 7 | How to Improve Post-Incident Reviews

Approach to Continuous Improvement

There are two main philosophical approaches to both what the analysis is and the value it sets out to provide organizations. For many, its purpose is to document in great detail what took place during the response to an IT problem (self-diagnosis). For others, it is a means to understand the cause of a problem so that fixes can be applied to various aspects of process, technology, and people (self-improvement).

Regardless of the approach you take, the reason we perform these exercises is to learn as much about our systems as possible and uncover areas of improvement in a variety of places. Identifying a “root cause” is the reason most commonly given for why analysis is important and helpful. However, this approach is shortsighted.

The primary purpose of a post-incident review is to learn.

Discovering Areas of Improvement

Problems in IT systems arise in many different forms. In my opening example in the Introduction, our system did not experience an outage, but rather an unexpected (and bad) outcome as a result of normal operation. The backup process was not performing as it was expected to, but the system as a whole did not suffer a disruption of service. All signs pointed to a healthy working system. However, as we learned from the remediation efforts and post-incident review, there were latent problems in the migration process that caused a customer to lose data. Despite not directly disturbing the availability of our system, it certainly impacted an aspect of reliability. The system wasn’t reliable if it couldn’t perform the tasks related to its advertised functionality, namely seamless migration of applications between cloud providers.

Another situation might be the failure of a database in a staging environment. Despite not harming the production environment or real customers, whatever problem exists in staging will likely make its way to production if not caught and addressed. Discovering and analyzing such “non-production” disruptions is important as well. Often an incident in a development or staging environment is a precursor to that same condition occurring in the production environment.

Many organizations build infrastructure and test continuously. Disruptions in a non-production environment will have an impact on cycle time and ultimately the ability to deliver value.

Incidents show themselves in many ways. Learning as much as possible from them dramatically increases the reliability of a service. Analyzing many aspects of the data that is available before and during an incident is the path to that learning.

Think about the lists of learnings and action items from the case study in Chapter 6. We discovered nine interesting findings. Those converted perfectly into nine specific action items, each of which was prioritized and assigned an owner. Opportunities to learn and improve are there if we just know where to look.

Facilitating Improvements in Development and Operational Processes

In many cases, teams focus too much on the one thing that broke and what needs to be done to fix it. All too often we don’t take the time to examine existing processes tied to the development and operations of software and infrastructure that make up the service.

In the case study in Chapter 6, the review process allowed Cathy to inform a larger audience, including key decision makers, how work was performed. It allowed everyone to informally examine the “way we do things” and spot inefficiencies that had emerged along the lifespan of the system as a whole. Efforts that made sense previously may no longer be optimal today.

Because post-incident analysis includes discussions around process, it helps subject matter experts and management form a much clearer picture of the system as a whole, including how components go from ideas to functional features that impact the bottom line.

These analyses allow senior developers to spot improvement opportunities in the delivery pipeline. They allow management to see where breakdowns in communication and collaboration are delaying restoration of service. The entire organization begins to see the big picture and where their individual efforts play a role in delivering and operating IT services.

Continuous improvement means that we are constantly evaluating many aspects of IT, and a well-executed post-incident review is a great way to easily identify new and better ways to improve areas of development, operations, security, and more.

Identifying Trade-offs and Shortcomings in IT

One of the best things to come out of a post-incident review is that teams begin to see where current trade-offs exist and to identify inherent problems or technical debt that has crept into the system.

The post-incident review in the case study in Chapter 6 exposed problems with detection methods, escalation policies, communication channels, access to systems, available documentation, and much more. Often a result of technical debt, these problems are easily addressed yet rarely prioritized.

These shortcomings are important to identify and address early on so that a clear picture of the system exists for everyone and obvious areas of improvement become visible. A good example of this is the difference between software or operations work as it is designed and how it is actually being performed.

Work as designed versus work as performed

It’s easy to set in motion process improvements or implement a new tool to attempt to provide more reliable and available IT services, but all too often the work that is actually performed looks quite a bit different than how it was scoped out and established previously.

Etsy’s “Debriefing Facilitation Guide” suggests approaching exercises like the post-incident review with a true “beginner’s mind,” enabling all involved to frame the exercise as an opportunity to learn.

To better understand work as designed versus work as performed, the guide recommends asking the following questions:

• How much of this is new information for people in the room?

• How many of you in the room were aware of all the moving pieces here?

Workarounds, changes to teams or tools, and general laziness can all play a part in creating a gap between work as imagined and agreed upon and what actually happens. When we discuss what happened and exactly how during a post-incident review, these differences are surfaced so that the teams can evaluate discrepancies and how they can or should be corrected.

Efficiency versus thoroughness

In many cases, drift in work as performed compared to how it was drawn up initially comes down to a constant struggle to manage the trade-off between being highly efficient or extremely thorough. This is very prevalent when it comes to remediation efforts. When a service disruption occurs, first responders are working as quickly as possible to triage and understand what’s going on. Engineers with various areas of expertise work in parallel to investigate what’s going on, forming theories on how to remediate the problem and taking action to restore services efficiently.

This aim for speed directly impacts the ability of responders to be thorough in their investigation and actions to recover. We are forced to find a balance between being extremely efficient or extremely thorough; we can’t be on the extreme side of both at the same time.

The efficiency–thoroughness trade-off, and what that balance looks like, becomes much clearer during post-incident review. By asking questions that inquire objectively about how work is accomplished and the decision-making process involved, we can point out patterns of extreme swings in either direction between efficiency and thoroughness.

Hastily performing actions during remediation efforts may help to restore service quickly, but it’s common for instinctual actions taken by responders to have little or no impact—or worse, cause more harm than good.

Recognizing the trade-off will go a long way in improving the behavior of first responders and teams that are pulled in to assist during remediation efforts.

We’ve laid the groundwork regarding what makes up a post-incident review, but there are still a number of questions to consider when planning to conduct the exercise:

• What warrants the exercise?

• When should we perform it?

• Who should be there?

• How long will it take?

• What documents should come from the exercise?

• Who should have access to the artifacts created?

The following chapters will address these questions.
