The structure of a post-incident review was described in Chapter 9, but here we’ll look more closely at the output the exercise provides, as well as a procedural guide on how to conduct your own.
This chapter presents a guide to help get you started. A downloadable version is available at http://postincidentreviews.com.
Begin by reflecting on your goals and noting the key metrics of the incident (time to acknowledge, time to recover, severity level, etc.) and the total time of each individual phase (detection, response, remediation).
We are here to LEARN the following:
1.) How do we know sooner? (Detection)
2.) How do we recover sooner? (Response & Remediation)
1.) Time to Acknowledge ____________
2.) Time to Recover ____________
3.) Severity ____________
4.) Customer Impacted (yes/no) ____________
5.) Incident Commander (optional) ____________
Elapsed Time of the Following Phases:
Detection ____________
Response ____________
Remediation ____________
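These metrics fall out directly from the incident's timestamps. A minimal sketch of the arithmetic (the timestamp values here are hypothetical):

```python
from datetime import datetime

# Hypothetical timestamps pulled from the incident record
detected     = datetime(2023, 4, 1, 9, 0)    # alert fired
acknowledged = datetime(2023, 4, 1, 9, 12)   # responder acknowledged the page
restored     = datetime(2023, 4, 1, 10, 30)  # service restored

time_to_acknowledge = acknowledged - detected
time_to_recover     = restored - detected

print(f"Time to Acknowledge: {time_to_acknowledge}")  # 0:12:00
print(f"Time to Recover: {time_to_recover}")          # 1:30:00
```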
Establish and Document the Timeline
Document the details of the following in chronological order, noting their impact on restoring service:
• Date and time of detection
• Date and time of service restoration
• Incident number (optional)
• Who was alerted first?
• When was the incident acknowledged?
• Who else was brought in to help, and at what time?
• Who was the acting Incident Commander? (optional)
• What tasks were performed, and at what time?
• Which tasks made a positive impact to restoring service?
• Which tasks made a negative impact to restoring service?
• Which tasks made no impact to restoring service?
• Who executed specific tasks?
• What conversations were had?
• What information was shared?
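One lightweight way to capture these details is a small record per timeline entry. A sketch, with illustrative field names and values (nothing here is prescribed by the guide itself):

```python
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    """One chronological entry in the incident timeline (fields are illustrative)."""
    timestamp: str    # when the event occurred, e.g. "09:12"
    who: str          # person or system involved
    kind: str         # "task", "auto", or "human" (conversation, information shared)
    description: str  # what was done, said, or shared
    impact: str       # effect on restoring service: "positive", "negative", or "neutral"

# A hypothetical timeline in chronological order
timeline = [
    TimelineEntry("09:00", "monitoring", "auto", "Latency alert fired", "neutral"),
    TimelineEntry("09:12", "on-call engineer", "human", "Acknowledged page, began triage", "positive"),
    TimelineEntry("09:40", "on-call engineer", "task", "Restarted cache service", "positive"),
]
```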
Plot Tasks and Impacts
Identifying the relationships between tasks, automation, and human interactions and their overall impact on restoring service helps to expose the three phases of the incident lifecycle that we are analyzing: detection, response, and remediation.
Areas of improvement identified in the detection phase will help us answer the question “How do we know sooner?” Likewise, improvements in the response and remediation phases will help us with “How do we recover sooner?”
Indicate the type and impact of each entry on the timeline by classifying it into one of three types (task, auto, or human interaction) and one of three levels of impact (negative, neutral, or positive).
Plot the tasks, automations, and human interactions along a timeline, marking each entry's impact on restoring service (negative, neutral, or positive). By noting and blocking out the phases of detection, response, and remediation, new target conditions can be established (e.g., acknowledge incidents 50% faster).
By plotting the tasks unfolding during the lifecycle of the incident we can visualize and measure the actual work accomplished against the time it took to recover. Because we have identified which tasks made a positive, negative, or neutral impact on the restoration of service, we can visualize the lifecycle from detection to resolution. This exposes interesting observations, particularly around the length of each phase, which tasks actually made a positive impact, and where time was either wasted or used inefficiently. The graph highlights areas we can explore further in our efforts to improve uptime.
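Before reaching for a plotting tool, even a rough tally of impacts per phase can surface the same observations. A sketch, using hypothetical timeline entries:

```python
from collections import Counter

# Hypothetical entries: (minutes since detection, phase, impact on restoring service)
entries = [
    (0,  "detection",   "neutral"),
    (12, "response",    "positive"),
    (25, "response",    "negative"),   # e.g., a restart that made things worse
    (40, "remediation", "positive"),
    (55, "remediation", "neutral"),
]

# Count impacts within each phase to see where time was well or poorly spent
by_phase = {}
for _, phase, impact in entries:
    by_phase.setdefault(phase, Counter())[impact] += 1

for phase in ("detection", "response", "remediation"):
    print(phase, dict(by_phase.get(phase, {})))
```

A long response phase dominated by neutral or negative entries is exactly the kind of signal the full timeline plot makes visible.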
Throughout the discussion, it’s important to probe deeply into how engineers are making decisions. Genuine inquiry allows engineers to reflect on whether their approach was the best one for each specific phase of the incident. Perhaps another engineer can suggest a quicker or safer alternative method. Best of all, everyone in the company learns about it.
In Chapter 6, Gary exposed Cathy to a new tool as a result of discussing the timeline in detail. Those types of discoveries may seem small and insignificant, but collectively they contribute to the organization’s tribal knowledge and help ensure the improvement compass is pointed in the right direction.
Engineers will forever debate and defend their toolchain decisions, but exposing alternative approaches to tooling, processes, and people management encourages scrutiny of their role in the organization’s ongoing continuous improvement efforts.
The most important part of the report is contained here. Genuine inquiry within an environment that welcomes transparency and knowledge sharing not only helps us detect and recover from incidents sooner, but builds a broader understanding about the system among a larger group.
Be sure to document as many findings as possible. If any member participating in the post-incident review learns something about the true nature of the system, that should be documented. If something wasn’t known by one member of the team involved in recovery efforts, it is a fair assumption that others may be unaware of it as well. The central goal is to help everyone understand more about what really goes on in our systems and how teams form to address problems. Observations around “work as designed” vs. “work as performed,” as mentioned in Chapter 7, emerge as these findings are documented.
Many factors that may have contributed to the problem and remediation efforts will begin to emerge during discussions of the timeline. As we just covered, it’s good to document and share that information with the larger teams and the organization as a whole. The better understanding everyone has of the system, the better teams can maintain reliability.
What significant components of the system or tasks during remediation were identified as helpful or harmful to the disruption and recovery of services?
NOTE: Factors relating to remediation efforts can be identified by marking each task discussed in the timeline with a value of positive, negative, or neutral.
As responders describe their efforts, explore whether each task performed moved the system closer to recovery or further away. These are the factors to evaluate more deeply for improvement opportunities.
How Is This Different from Cause?
While not quite the same as establishing cause, factors that may have contributed to the problem in the first place should be captured as they are discovered. This helps to uncover and promote discussion of information that may be new to others in the group (i.e., provides an opportunity to learn).
In the case study example in Chapter 6, the unknown service that Cathy found on the host could have easily been identified as the root cause. However, our approach allowed us to shed the responsibility of finding cause and continue to explore more about the system. The runaway process that was discovered seemed like the obvious problem, and killing it seemed to have fixed the problem (for now). But as Greg pointed out, there are other services in the system that interact with the caching component. What if it actually had something to do with one of those mystery services?
In reality, we may never have a perfectly clear picture of all contributing factors. There are simply too many possibilities to explore. Further, the state of the system when any given problem occurred will be different from its current state or the one moments from now, and so on. Still, document what you discover. Let that open up further dialogue.
Perhaps with infinite time, resources, and a system that existed in a vacuum, we could definitively identify the root cause. However, would the system be better as a result?
Finally, action items will have surfaced throughout the discussion. Specific tasks should be identified, assigned an owner, and prioritized. Tasks without ownership and priority sit at the bottom of the backlog, providing no value to either the analysis process or system health. Countermeasures and enhancements to the system should be prioritized above all new work. Until this work is completed, we know less about our system’s state and are more susceptible to repeated service disruptions. Tracking action item tasks in a ticketing system helps to ensure accountability and responsibility for work.
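That prioritization rule can be made concrete with a trivial backlog sort; the items, owners, and priority scheme below are hypothetical:

```python
# Hypothetical action items surfaced during the review; every item gets an
# owner and a priority (lower number = more urgent) before entering the backlog.
action_items = [
    {"task": "Add alert on cache saturation", "owner": "cathy", "priority": 1},
    {"task": "Document runbook for cache restarts", "owner": "greg", "priority": 2},
    {"task": "Audit unknown services on hosts", "owner": "gary", "priority": 1},
]

# Unowned items would sit at the bottom of the backlog, so reject them up front
assert all(item["owner"] for item in action_items), "every task needs an owner"

# Countermeasures sort ahead of lower-priority new work
backlog = sorted(action_items, key=lambda item: item["priority"])
print([item["task"] for item in backlog])
```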
NOTE: Post-incident reviews may include many related incidents due to multiple monitoring services triggering alarms. Rather than performing individual analyses for each isolated incident number, a time frame can be used to establish the timeline.
These exercises provide a great deal of value to the team or organization. However, there are likely others who would like to be informed about the incident, especially if it impacted customers. A high-level summary should be made available, typically consisting of several or all of the following sections:
• Services Impacted
• Customer Impact
• Proximate Cause
• Countermeasures or Action Items
John Paris (Service Manager, Skyscanner) and his team decided they needed to create a platform for collective learning. Weekly meetings were established with an open invite to anybody in engineering to attend. In the meetings, service owners at various levels have the opportunity to present their recent challenges, proposed solutions, and key learnings from all recent post-incident reviews.
“There are many benefits from an open approach to post-incident reviews,” says Paris. “But the opportunities for sharing and discussing outcomes are limited, and as the company grows it becomes harder to share learnings throughout the organization. This barrier, if not addressed, would have become a significant drag on both throughput and availability as the same mistakes were repeated squad by squad, team by team.”