Now that we’ve established the “what” and “why” of a post-incident review, we can begin to take a closer look at the “how.”
There are various approaches to conducting a successful analysis. Some follow strict formats, while others tend to be much more informal. This book attempts to provide some structure or guidance for conducting your next analysis. If you are worried about it being too restrictive or not providing what management is asking for, don’t panic. If management needs additional information, it is reasonable to adjust the suggested guidelines provided in Chapter 10 to meet your needs.
My advice when asked to stray too much from the guide is to keep in mind the basic principles of Chapter 5, where we discussed that the point of this exercise is to learn so that we may improve our methods of knowing about a problem sooner (detection) as well as how we can recover (response and remediation) sooner. Cause, severity, impact, and additional topics of interest may be included in your own analysis, but don’t let them distract you from a learning opportunity.
A well-executed post-incident review has a clearly stated purpose and a repeatable framework. Let’s take a moment to address some of the key components.
Having many diverse perspectives on what took place during response and remediation efforts helps to bring high-value improvements to the surface. Rather than simply identifying what went wrong and targeting that as the thing to be fixed, we now understand that there are many contributing factors to an incident. Avoiding opportunities to discuss and improve them all but guarantees a scenario where engineers are in a constant break/fix loop, chaotically reacting to service disruptions. Focusing those perspectives and efforts on slight but constant improvements to the entire incident lifecycle provides the greatest gains.
Essential participants in the post-incident review include all of the people involved in decisions that may have contributed to the problem or recovery efforts:
• (Primary) First responders
• (Secondary) Escalation responders (additional subject matter experts who joined in)
• Incident Commander (where necessary)
• Management and stakeholders in other areas
We want to involve all the people who identified, diagnosed, or were affected by the problem. Others who are curious about the discussion and process or have something to contribute should be included as well. Anyone who has ideas or something to share that may help us explore value-adding improvements should be invited to join in.
The role of Incident Commander can be performed by engineers or leadership, depending on the circumstances, structure, and culture of the organization. Differences exist in the enterprise versus smaller companies. Leadership may assume or be assigned the role, but that is not always the case.
In many cases, leveraging an objective third-party facilitator can provide several key benefits. Including someone who wasn’t directly involved in the incident removes opportunities for human bias to taint the process of learning. When possible, get someone who wasn’t involved to facilitate.
There are predictable flaws that arise when bias begins to creep into conversations regarding remediation efforts. One example of this is hindsight bias, or the inclination, after an incident, to see it as having been predictable—despite there having been little or no objective basis for predicting the incident, you truly believe you could have foreseen it somehow. Failing to spot bias allows us to go down a path of discussion and reasoning that provides no real value. It’s very easy to begin using counterfactual statements such as “I should have known X” or “If I had just done Y.” The result of this is simply discussing alternative versions of events that have already occurred.
Another common bias that manifests during the post-incident review is confirmation bias, or the tendency to interpret new evidence as supporting or confirming what we believe has occurred in the system. Theories should be formed, but thoroughly tested to reduce confirmation bias. We should rely only on hypotheses that have been verified.
All members participating in the exercise should be vigilant to spot bias as it occurs in discussions, and should be encouraged to speak up.

What Is the Purpose of a Post-Incident Review?
Now that we have gathered all parties to the exercise, the tone and mission should be established. First and foremost, we are here for one reason only: to learn.
Yes, understanding the causes of problems in our systems is important. However, focusing solely on the cause of a problem misses a large opportunity to explore ways in which systems can be designed to be more adaptable. More important than identifying what may have caused a problem is learning as much as possible about our systems and how they behave under certain conditions. In addition to that, we want to scrutinize the way in which teams form during incident response. Identifying and analyzing data points regarding these areas won’t necessarily bring you to a root cause of the problem, but it will make your system much more “known”—and a greater and more in-depth understanding of the system as a whole is far more valuable to the business than identifying the root cause of any one problem.
In addition to establishing a space to learn, post-incident reviews should be considered an environment in which all information should be made available. No engineers should be held responsible for their role in any phase of the incident lifecycle. In fact, we should reward those who surface relevant information and flaws within our systems, incentivizing our engineers to provide as much detail as possible.
When engineers make mistakes but feel safe to give exhaustive details about what took place, they prove to be an invaluable treasure chest of knowledge. Collectively, our engineers know quite a lot about many aspects of the system. When they feel safe to share that information, a deeper understanding of the system as a whole is transferred to the entire team or organization. Blaming, shaming, or demoting anyone involved in an incident is the surest way for that deeper understanding to not take place. Encourage engineers to become experts in areas where they have made mistakes previously. Educating the rest of the team or organization on how not to make similar mistakes in the future is great for team culture as well as knowledge transfer.
One important behavior to watch out for is when leadership “disappears,” expecting teams to do all of the analysis and improvement coordination themselves. It’s important that leadership attend and demonstrate that they value the learning process, as well as share what they themselves have learned. CxOs and managers can then communicate with others, including broader leadership within the organization, further sharing what was learned to cross-pollinate across teams. Transparency remains an untapped opportunity in many organizations, especially enterprises. Be transparent and share the findings as broadly as possible.

When Should You Conduct a Post-Incident Review?
Analyses conducted too long after a service disruption are of little value to the overall mission of a post-incident review. Details of what took place, memories of actions taken, and critical elements of the conversations will be lost forever. Performing the review exercise as quickly as possible ensures that the maximum amount of relevant information and context about the incident can be accurately captured. The longer we wait to discuss the events, the less we’ll be able to recall, and the likelihood of establishing useful action items is diminished. Try to conduct the review exercise the following business day, or at your first opportunity. Scheduling the analysis and inviting key stakeholders as soon as recovery efforts have concluded helps get the appointment on the calendar before something else comes up.
Performing a post-incident review on all significant incidents will uncover areas of improvement. However, performing one for every incident that comes up may prove to be too frequent. At the very least, these learning reviews should be conducted for all Sev1 and Sev2 incidents.
TIP: Set aside post-incident review exercises as something separate from “regular work,” and establish them as a safe space for inquiry and learning. Make the review a ritual.
Team members don’t necessarily work in the same office, or even the same time zones. When possible, take advantage of in-person meetings, but they are not essential. In fact, virtual conference calls and meetings can allow more diverse perspectives to be remotely pulled into the discussion. This in turn can help avoid groupthink pitfalls and provide more opportunities for genuine analysis of not only what went wrong, but how well teams responded.
Once we have gathered the people involved and established our intent, it’s time to begin discussing what took place.
Establish a Timeline
The facilitator may begin by first asking, “When did we know about this problem?” This helps us to begin constructing a timeline. Establishing when we first knew about the problem provides a starting point. From there, we can begin describing what we know.
NOTE: A mindful communication approach—describing the events of the timeline rather than explaining them—helps to ward off cognitive bias and natural tendencies toward blame.
The timeline should include as many perspectives as possible. The more data points collected, the clearer the overall picture of exactly what took place during the entire incident. Collect information on what the engineers actually did, and how. What contributed to response and remediation efforts, and what affected the time to recover?
Note: In Chapter 10, I’ve provided a guide to help document and plot the data you collect during your post-incident review. This may be helpful when you try to understand better what engineers were doing, how that work was accomplished, and whether it helped or hurt your recovery efforts.
As engineers continue to describe their experiences, the facilitator may want to probe deeper by asking questions like “How do we currently perform X?” An open-ended inquiry such as this may spark a deeper discussion around alternate, better methods, or at the very least point out action items to tackle inefficiencies in remediation efforts.
While discussing what took place, naturally we will point out things that did not work out so well. Obvious opportunities for improvement will present themselves. After an incident, those are typically the things we tend to focus on. We often overlook the things that went well. However, these too should be collected, analyzed, and shared as learnings back into the system and the organization. Just as there are many factors related to failures within IT systems, so too are there multiple factors involved in things that go well.
NOTE: While the events of the timeline are being described, listen for and help participants gain awareness of bias, blaming, and counterfactual statements.
As engineers step through the process of investigating, identifying, and then working to solve the problem, conversations are happening among all who are involved. Engineers sharing in great detail exactly what they were thinking and doing and the results of those actions helps others involved in the firefight build a shared context and awareness about the incident. Capturing the dialogue will help spot where some engineers may need to improve their communication or team skills.
NOTE: Having an objective moderator run the exercise can help prevent one person (or small group) from steamrolling the conversation and avoids “groupthink.” Having another person asking questions from a genuine place of curiosity forces thoughtful discussion of actions rather than relying on assumptions and existing beliefs around state and behavior.
Included in those conversations should be fairly detailed documentation describing exactly what happened throughout the timeline. This may include specific commands that were run (copied and pasted into chat), or it could be as simple as engineers describing which commands they were using and on which systems. These descriptions teach others how to perform the same type of investigation—and not only will others learn something from discussing these tasks, but the engineers can analyze if they are taking the most efficient route during their investigation efforts. Perhaps there are better ways of querying systems. You may never know if you don’t openly discuss how the work is accomplished.
TIP: Take note of all tasks completed during the timeline. By indicating whether each task had a positive or negative impact, we can learn from it and identify areas of improvement.
One method teams have employed to facilitate not only the conversation and remediation efforts but the forthcoming post-incident review is the use of ChatOps.
NOTE: ChatOps is the practice of leveraging group collaboration tools to dually serve as the interface for development and operations efforts, as well as the conversations related to those interactions.
This new method of interacting, combining tools and conversations into a single interface, provides a number of benefits. In the context of a post-incident review, a system of record is automatically created that serves as a detailed recollection of exactly what took place during the response to the incident. When engineers combine conversations with the methods in which they investigated and took action to restore the service, a very clear timeline begins to emerge. Details on which engineers ran what specific commands at what time, and what the results were, provide a clear and accurate account of what took place. Much of the timeline is already constructed when ChatOps is part of the detection, response, and remediation efforts.
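To illustrate, here is a minimal sketch of how an exported chat transcript might be turned into the beginnings of a timeline. The log format, user names, and messages below are entirely hypothetical; real ChatOps tooling records far richer detail.

```python
import re

# Hypothetical exported chat-log format: "[HH:MM] user: message"
LOG_LINE = re.compile(r"\[(\d{2}:\d{2})\] (\w+): (.*)")

def build_timeline(chat_log):
    """Turn raw chat lines into (time, user, message) timeline entries."""
    timeline = []
    for line in chat_log.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            timeline.append(m.groups())
    return timeline

log = """\
[14:02] alerts: CRITICAL db-primary latency > 500ms
[14:04] alice: acknowledged, pulling up dashboards
[14:09] bob: failing over to db-replica
[14:17] alerts: RECOVERY db-primary latency OK
"""

for entry in build_timeline(log):
    print(entry)
```

Even a crude extraction like this gives the facilitator a starting point for the “When did we know about this problem?” discussion.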
Not all tricks and commands used during a firefight are suitable for chat or available via chatbots and scripts, but describing what was done from the engineer’s local terminal at the very least helps to share more with the team about how to diagnose and recover from problems. Incidents become teaching opportunities as engineers step through the incident lifecycle.
One example of this is the querying and retrieval of relevant metrics during the response to an incident. First responders typically begin their triaging process by exploring a number of key metrics such as time-series data and dashboards. The collection of time-series data is extremely easy and cost-effective. You may not need to watch a dashboard all day in an attempt to predict problems, but having those data points available during the investigation process helps engineers understand the current state of systems and reason about how they got there.
Any metrics that were leveraged during response and remediation efforts should be included in the review discussion. Not only does it help to tell the story of how teams responded to problems, but it also shares helpful information with a larger audience. Some engineers may not have been aware of such metrics or how to access them. Studying the ways that engineers used (or didn’t use) metrics during a firefight will present areas for improvement.
Two important metrics in a post-incident review are the time it took to acknowledge and recover from the incident. A simple way to measure improvement in responding to failure is the average time it takes to acknowledge a triggered incident. How are teams organized? How are first responders contacted? How long does it take to actually begin investigating the problems? Improvements in each of these areas (and more) will reduce the time it takes teams to recover from problems.
Likewise, measuring the average time it takes to recover from disruptions presents a metric to focus improvements. Once teams are formed, how do they work together in unison to investigate and eventually recover from the problem? Where is there friction in this process? Do engineers have the necessary tools or access to systems? Where can we make things better? Consistently reducing the average time it takes to recover from a problem reduces the cost of downtime.
Cost of Downtime can be calculated as follows: Deployment Frequency × Change Failure Rate × Mean Time to Recover × Hourly Cost of the Outage
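As a quick illustration, the formula can be expressed as a small function. The figures plugged in below are invented for the example:

```python
def cost_of_downtime(deploy_frequency, change_failure_rate,
                     mttr_hours, hourly_cost):
    """Estimate the cost of downtime over a period.

    deploy_frequency    -- deployments in the period (e.g., per month)
    change_failure_rate -- fraction of deployments causing an incident
    mttr_hours          -- mean time to recover, in hours
    hourly_cost         -- cost of one hour of outage, in dollars
    """
    return deploy_frequency * change_failure_rate * mttr_hours * hourly_cost

# 20 deploys/month, 10% failure rate, 2h average recovery, $10,000/hour
print(cost_of_downtime(20, 0.10, 2, 10_000))  # → 40000.0
```

Halving MTTR in this example saves $20,000 a month, which is exactly why the review focuses so heavily on shortening detection and recovery.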
In other words, improving the time to recover lowers the real dollar cost of failure. What accounts for the gap between “time to acknowledge” and “time to recover” is examined during a well-executed post-incident review, and countless ways in which teams can improve can be discovered there.
Mean time to recover (MTTR) is an arithmetic average. Averages are not always a good metric because they don’t account for outliers and can therefore skew the reality of what’s actually happening. Still, when coupled with other metrics, MTTR helps to tell a story in which you can discover where to focus improvement efforts.
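A quick illustration of how a single outlier skews the mean (the recovery times are invented):

```python
from statistics import mean, median

# Recovery times (minutes) for five incidents; one outlier skews the mean
recovery_minutes = [10, 12, 11, 9, 120]

print(mean(recovery_minutes))    # → 32.4 (suggests half-hour recoveries)
print(median(recovery_minutes))  # → 11 (closer to the typical incident)
```

Reporting the median, or the mean with and without outliers, alongside MTTR helps keep the story honest.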
As mentioned previously, transparency is extremely important, and not only within teams or the organization as a whole. Customers and users of our services expect transparent and timely communication about problems and remediation efforts. To provide this feedback, many employ status pages as a quick and easy way of issuing real-time updates regarding the state of a service.
Information regarding when status pages were updated and what they were to say should also be examined during a post-incident review. Including this in the discussion can help reduce the time it takes to inform our customers and users that we know about a problem and are investigating it, shortening that feedback loop.
For many stakeholders, the first question they have is “How bad is it?” Those in management positions may need just a high-level view of what’s going on, freeing up engineers to continue remediation efforts without too many interruptions. Previously, we defined an incident. One component of that incident that many will ask about and which can help to create context during a post-incident review is the level of severity. Including this metric in analysis discussions means that it is documented.
Tracking the overall improvements in reduction of Sev1 or Sev2 outages will provide evidence of any increases in reliability and availability and decreases in the cost of downtime.
Along with the severity of the incident, many in management roles may be very concerned about the possible impact on any service level agreements (SLAs) that are in place. In some cases, harsh penalties are incurred when SLAs are broken. The sad truth about SLAs is that while they are put in place to establish a promised level of service, such as 99.999% (five nines) uptime, they incentivize engineers to avoid innovation and change to the systems. Attempts to protect the systems hinder opportunities to explore their limits and continuously improve them. Nevertheless, discussing the potential impact to SLAs that are in place helps everyone understand the severity of the failure and any delays in restoring service. If engineers are to make the case that they should be allowed to innovate on technology and process, they will have to ensure that they can work within the constraints of the SLA or allowable downtime.
Someone in attendance at the exercise should be able to speak to questions regarding customer impact. Having a member of support, sales, or another customer-facing team provide that kind of feedback to the engineers helps to build empathy and a better understanding of the experience and its impact on the end users. Shortening the feedback loop from user to engineer helps put things into perspective.
TIP: Throughout the discussion of the timeline, many ideas and suggestions will emerge regarding possible improvements in different areas. It’s important to document these suggestions as this sets us up for concrete action items to assign and prioritize.
When discussing an incident, evidence will likely emerge that no single piece or component of the system can be pointed to as the clear cause of the problem. Systems are constantly changing, and their interconnectivity means that several distinct problems typically contribute, in various ways and to varying degrees, to failures. Simple, linear systems have a direct and obvious relationship between cause and effect. This is rarely the case with complex systems. Many still feel that by scrutinizing systems, breaking them down, and figuring out how they work we should be able to understand and control them. Unfortunately, this is often impossible.
Causation and correlation are difficult to triangulate, but discussion surrounding what factors contributed to a problem, both before and during remediation efforts, does have great value. This is where suggestions can be made with regard to countermeasures against factors that impacted the system. Perhaps additional monitoring would help engineers notice an early signal of similar problems in the future. Identifying and discussing as many factors as possible will set the foundation for what is to come next: action items.
NOTE: In a well-executed post-incident review, proximate or distal causation may be relevant to management who only want a high-level understanding of what went wrong. However, in complex systems there is never a single root cause.
Once we have had a chance to discuss the events that took place, conversations that were had, and how efforts helped (or didn’t) to restore service, we need to make sure our learnings are applied to improving the system as quickly as possible. A list of action items should be captured when suggestions are made on how to better detect and recover from this type of problem. An example of an action item may be to begin monitoring time-series data for a database and establish thresholds to be alerted on. Shortening the time it takes to detect this type of problem means teams can form sooner and begin remediation efforts faster.
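As a sketch of that example action item, a naive threshold check over time-series samples might look like the following. The metric, threshold, and sample values are illustrative, not a real monitoring API:

```python
def breaches(samples, threshold, min_consecutive=3):
    """Return True when `min_consecutive` samples in a row exceed
    `threshold` -- a simple guard against alerting on one-off spikes."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False

# Per-minute database connection counts; threshold chosen for illustration
print(breaches([80, 95, 310, 340, 360, 120], threshold=300))  # → True
print(breaches([80, 310, 95, 340, 120, 360], threshold=300))  # → False
```

A real implementation would live in the monitoring platform, but even this sketch makes the action item concrete enough to assign, estimate, and track.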
Action items should be assigned to owners responsible for seeing them through. Ticketing systems are often used for accountability and to ensure that the work is prioritized appropriately and not simply added to the bottom of the backlog, never to actually be completed. It is recommended that action items that come from the exercise are made a top priority. Pausing existing development until the action items are complete indicates that teams respect the post-incident analysis process and understand the importance of balancing service reliability and availability with new features.
Collaboration between ops and product teams to ensure time is spent on these concerns along with new features allows for agile software development on infrastructure the entire team has confidence in.
A great new feature provides no value to the business if it is operated on an unreliable and unstable system.
NOTE: Don’t allow for extended debate on action items. Place ideas into a “parking lot” for later consideration if need be, but come up with at least one action item to be implemented immediately.
Internal process reports and external summary reports are the primary artifacts of a post-incident review. Internal reports serve to share the learnings about what happened with the larger team or organization and feed work back into improving the system. Transparency and sharing emphasize an importance placed on learning and improving. External reports provide the same benefit to stakeholders outside of the company and typically lack sensitive information or specific system details.
During remediation of service disruptions, a lot of information is retrieved and discussed openly among the responders to the event. It’s important to capture this context for a deeper understanding of where improvements can be made. However, some information may be unnecessary or too sensitive to share with the general public. As a result, an internal investigation will typically serve as the foundation for a summarized version to be made available to the public as quickly as possible.
Save and store your analysis report artifacts in a location where they are available to everyone internally for review or to assist during future similar incidents.
Like we saw in Chapter 6’s case study, users of systems and services are often aware of issues before the provider, particularly if that service is part of their own workflow. When someone relies on the uptime of a service or component, such as an API endpoint, its availability may directly impact their service or business.
In the modern age of connectivity, we generally have a low tolerance for services that are unreliable. More and more, end users want (or need) to be able to access IT services nearly 24×7×365.
Following an incident, it’s important to share relevant information with the general public. Often shorter and less detailed than the internal documentation created, an external analysis report serves as a summary of the high-level details of what happened, what was done to restore service, and what new commitments are being made to improve the availability of the service moving forward.
For most, the cause of a disruption in service isn’t nearly as important as knowing that everything was done to recover as quickly as possible. Users want to know that something has been learned from this experience that will reduce the likelihood of a repeat occurrence of the problem, as well as being kept apprised of any improvements being made.
In Chapter 10, we will take a look at a guide that can serve as the starting point for both the process and summary reports.
Like in many large organizations prior to DevOps transformation, the Ops organization at CSG International frequently dealt with production issues and fixed them with little help or involvement from the Development org. With little insight into the frequency or nature of these issues from the developers’ perspective, Operations staff held post-incident reviews at the leadership level, but Development wasn’t involved.
Development was concerned primarily with releases. Any problems that occurred during a release were followed by a post-incident review. Problems with operability in the production environment were not part of their concern.
This left a large gap in involving the right people in the right discussions. We weren’t looking at things holistically.
After we joined our Development and Operations organizations together in 2016, Development began getting a front-row seat to the outages and Operations finally got some needed relief from a staffing standpoint. We also started holding joint analysis exercises (called After Action Summaries) at the leadership level.
Room for Improvement
Ultimately, CSG wanted three things from their post-incident reviews:
Identify opportunities for improvement.
Share knowledge. We were single-threaded in numerous areas, and this was a good opportunity to share knowledge. What did the troubleshooting look like? What steps were taken? What was the result of those steps? What went well that should be repeated next time?
Shape company culture.
Simplifying the act of having these discussions and looking for growth opportunities reinforces a culture of continuous improvement. This helps overcome the culture of “fix it and move on.”
Lack of involvement from Development meant Operations often band-aided problems. Automation and DevOps best practices were not typically considered key components of improvements to be made to the system. Often problems were attacked with more process and local optimization, without addressing proximate cause. In a lot of cases, teams simply moved on without attacking the problem at all.
The Shift in Thinking
The first large post-incident review I participated in was related to a product I inherited as part of our DevOps reorganization. I was still fairly new to working with the product and we had a major failed upgrade, causing our customers numerous issues. It was very eye-opening for me to simply walk through the timeline of change preparation, change implementation, and change validation.
Coming from the development world, I was very familiar with retrospectives. However, I’d never applied the concept to production outages before.
We invited members of multiple teams and just had a conversation. By simply talking through the incident, a number of things came to light:
• There were pockets of knowledge that needed to be shared.
• We had a number of bad practices that needed to be addressed. So many assumptions had been made as well. While we were working to remove silos within a team of Dev and Ops, we still had not fully facilitated the cross-team communication.
• This was a great way to share knowledge across the teams and also a great vehicle to drive culture change. As the team brainstormed about the issues, many great points were brought up that would not have been otherwise.
It quickly became clear this should be done for many more incidents as well.
After we saw the real value in the exercise, we set out to hold a team-wide post-incident review on every production outage.
We didn’t have a lot of formality around this—we just wanted to ensure we were constantly learning and growing each time we had an issue.
Additionally, we beefed up the process around our leadership After Action Summaries (AASs). We moved the tracking of these from a SharePoint Word doc to JIRA. This allowed us to link to action items and track them to closure, as well as providing a common location where everyone can see the documents.
One of the biggest challenges has been analysis overload. Similar to agile team retros, you have to select a few key items to focus on and not try to boil the ocean. This takes discipline and senior leadership buy-in to limit work in process and focus on the biggest bang for the buck items.
The End Result
Teams hold post-incident reviews for outages where one or more of the following criteria are met:
• Major incident
• Estimated cause or path forward
• Good opportunity for knowledge sharing
These continue to be more informal in nature, which encourages openness, sharing, and brainstorming of future solutions. All members of a team are invited. If it’s a cross-team issue, members of other teams are invited as well. Meeting minutes are captured on a Confluence wiki page. We always ensure we discuss the timeline of the event, what went wrong, and things we can do to improve for next time. Where possible, we aim to solve things with automation over process. If the opportunity arises, we also discuss helpful troubleshooting steps and best practices. The focus is on things we can do to improve. If a person failed, we inspect ways the system they operated within failed to guard against human error. Action items are part of the minutes and are also recorded as JIRA tasks so they can be tracked.

While the process around team-level analysis is still fairly informal, our AAS process is much more structured. All major incidents and associated action items are entered into our ticketing system to be tracked.
Part of the transformation we’ve experienced is the attitude that we must get better and we are empowered to do so. Failure should not be accepted, but we learn from it and improve our systems as a result. When we have an outage now, we fix the immediate issue and then immediately flip the narrative to “What did we learn to help us avoid this happening again?” Another key component was the switch to tracking follow-ups in JIRA. This ensures follow-up and accountability so things aren’t lost.
While post-incident reviews provide a number of benefits, I find the culture shift they unconsciously drive has been the most valuable thing about them for us.