People make mistakes, technology breaks down, and processes aren’t infallible. But when incidents happen, what can we do about them? What can we learn?
As with all things, learning isn’t a binary action; it’s a process. When an incident occurs, organizations typically conduct a post-mortem analysis and generate a post-incident review to uncover what went wrong and why. These reports are critical for identifying where our processes and guardrails break down so we can learn how to do better next time.
Post-mortem incident report tips
When it comes to incident response, committing to learn from each incident is half the battle. The other half is setting up a successful process for continuous education by creating accurate and helpful post-mortem incident reports.
The goal of these reports is to provide the information you need to grow. Still, there are a few things you should (and shouldn’t) do to ensure you’re getting the most out of them.
(Read more about incident management & incident response metrics.)
1) Don’t assign blame
We’ve all heard about “blameless post-mortems.” But, what does it really mean to be “blameless” in DevOps and IT? While it doesn’t mean there are no consequences for malicious actions, a blameless culture recognizes that:
- Everyone makes mistakes.
- Consequences without context will de-emphasize learning and continuous improvement over time.
When creating a post-incident review, it’s critical to avoid assigning blame to any one person. Instead, focus on where the process broke down (or where more process is needed). It doesn’t matter if Erica pushed the button or Jacob wrote the function. Those actions may have contributed to the incident, but the true failure is almost always a lack of checks and balances along the way.
2) Do take responsibility
A good post-mortem report should avoid pointing fingers. Instead, everyone involved should take responsibility for both the process and the incident itself. In order to foster a blameless culture, it’s important to emphasize the fact that everyone owns the quality process. When a problem that “you caused” gets re-framed into a problem that “we own,” it allows engineers to focus more clearly on what can be done to make things better rather than waste time trying to deflect blame.
3) Don’t procrastinate
When should you perform a post-mortem? If your answer to that question wasn’t “immediately,” then you’re not doing them soon enough.
While it may sound counterintuitive, you should always create a post-mortem incident report while the proverbial incident iron is still hot. This way, the incident is treated with an appropriate level of criticality and all the details are still fresh in everyone’s minds. If you wait too long to recap and evaluate, details will be missed, passion will be low and the need to improve the process won’t feel so urgent.
A good rule of thumb: If you’re going to conduct a post-incident review, do it now or don’t do it at all.
4) Do gather information
When it comes to post-mortems, data is like time. It’s better to have too much and not need it than not enough and need more. Start by establishing a timeline of what exactly happened and then flesh it out with as much detail as possible. This should include:
- The sequence of events that led up to the incident
- How and when the incident itself occurred
- Who was impacted
- How many support cases were generated
- Who responded to the incident and how quickly
- What the response team did to resolve the incident
- Anything else you can think of
This post-mortem information will be invaluable in determining the root cause and will make identifying areas for improvement significantly easier.
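One way to keep this information consistent is to capture it in a small structured record rather than free-form notes. The following is a minimal sketch, not a prescribed format: the class names, the field names, the event labels ("trigger", "responded", "resolved") and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime      # when the event happened
    kind: str         # hypothetical label, e.g. "trigger", "responded", "resolved"
    description: str  # what happened, in plain language


@dataclass
class IncidentRecord:
    title: str
    impacted: List[str] = field(default_factory=list)    # who was impacted
    responders: List[str] = field(default_factory=list)  # who responded
    support_cases: int = 0                                # support cases generated
    events: List[TimelineEvent] = field(default_factory=list)

    def add_event(self, at: datetime, kind: str, description: str) -> None:
        """Record an event and keep the timeline in chronological order."""
        self.events.append(TimelineEvent(at, kind, description))
        self.events.sort(key=lambda event: event.at)

    def elapsed(self, start_kind: str, end_kind: str) -> Optional[timedelta]:
        """Time from the first start_kind event to the first end_kind event.

        Returns None if either kind of event has not been recorded.
        """
        start = next((e.at for e in self.events if e.kind == start_kind), None)
        end = next((e.at for e in self.events if e.kind == end_kind), None)
        if start is None or end is None:
            return None
        return end - start


# Made-up sample values, entered out of order on purpose.
record = IncidentRecord(
    title="Checkout errors",
    impacted=["web customers"],
    responders=["on-call engineer"],
    support_cases=12,
)
record.add_event(datetime(2024, 1, 5, 10, 42), "resolved", "Rolled back the deploy")
record.add_event(datetime(2024, 1, 5, 10, 0), "trigger", "Deploy started")
record.add_event(datetime(2024, 1, 5, 10, 12), "responded", "On-call acknowledged the page")

print(record.elapsed("trigger", "responded"))  # how quickly the team responded
print(record.elapsed("trigger", "resolved"))   # how long the incident lasted
```

Because events are sorted as they are added, people can contribute details in whatever order they remember them and the timeline still reads chronologically.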
5) Don’t be vague
Details, details, details. When creating a post-mortem report, don’t be vague. While the minutiae may seem unimportant, these details can be crucial to root cause analysis.
More importantly, by putting as much detail as possible in the report, you eliminate the need to regroup unnecessarily with the incident response team, ensuring that learnings can be extracted directly from the report itself.
6) Do define clear owners
While “we” should always take responsibility, it’s important to identify who owns any action items that come out of the post-incident review. As the saying goes, “if everyone owns it, nobody owns it.” By defining clear owners for action items, you ensure any work that needs to get done as a result of the incident report has a person accountable for it.
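If action items are tracked as data, a simple check can flag any that leave the review without an owner. This is only a sketch under assumed names: `ActionItem` and `unowned` are hypothetical, and the sample items are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str] = None  # one named person, not a team or "everyone"


def unowned(items: List[ActionItem]) -> List[ActionItem]:
    """Return the action items that still have no named owner."""
    return [item for item in items if not (item.owner or "").strip()]


# Invented sample items for illustration.
items = [
    ActionItem("Add a confirmation step to the deploy button", owner="Erica"),
    ActionItem("Write a regression test for the failing function", owner="Jacob"),
    ActionItem("Investigate alert delay"),
]

for item in unowned(items):
    print(f"Needs an owner: {item.summary}")
```

Running a check like this before the review ends makes it harder for an item to slip through with nobody accountable for it.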
7) Don’t lose focus
Engineers like solving problems. Unfortunately, a post-mortem isn’t always the time to do that. Don’t get me wrong, the purpose of a post-mortem is to identify…
- What went wrong
- How we can prevent it from happening in the future
But a post-mortem is not always the time to get into the weeds of solving technical problems. In some cases, a post-mortem is a great place to do a root cause analysis. Other times, problems have nuanced technical causes that require deeper investigation.
While we may want to dig into these problems right away, we should instead identify them as action items coming out of a post-incident review. Otherwise, we might be distracted from the post-mortem’s primary goal.
8) Do use a consistent template
The best way to maintain focus when creating a post-mortem incident report is to use a fixed template. When time is critical, it’s important not to waste it by performing a post-mortem without an agenda. A fixed template can be used for every post-mortem at your organization to ensure all information follows a consistent pattern, especially when it comes to:
- The timeline
Templates can, in turn, allow you to perform post-mortems on your post-mortem workflows to further improve the overall process over time.
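A fixed template can be as simple as an ordered list of required sections plus a check that none are left empty. The sketch below assumes a plain-text report; apart from the timeline, the section names are hypothetical, and `render_report` is an illustrative helper rather than part of any particular tool.

```python
from typing import Dict, List

# Hypothetical section names; only the timeline is named in the article.
SECTIONS: List[str] = ["Summary", "Timeline", "Impact", "Response", "Action items"]


def render_report(title: str, content: Dict[str, str]) -> str:
    """Render a plain-text report with every section present, in a fixed order."""
    missing = [name for name in SECTIONS if not content.get(name, "").strip()]
    if missing:
        raise ValueError("Missing sections: " + ", ".join(missing))
    lines = [f"Post-mortem: {title}", ""]
    for name in SECTIONS:
        lines += [name, content[name].strip(), ""]
    return "\n".join(lines).rstrip() + "\n"


# Placeholder text stands in for the real content of each section.
draft = {name: "TODO: fill in before the review" for name in SECTIONS}
print(render_report("Checkout errors", draft))
```

Refusing to render a report with missing sections is a deliberate choice here: it surfaces gaps while the details are still fresh instead of after the report has been filed.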
Always be learning
Incidents happen. But that doesn’t mean we can’t do something about them. By defining what a high-quality post-mortem incident report looks like, you can ensure the lessons learned with each incident aren’t lost.
Eliminating blame, taking responsibility, gathering information, and focusing on tangible outcomes are all critical steps towards building a failure-tolerant culture that can learn from its mistakes.
This article was written by Zach Flower (https://sweetcode.io/author/zflower/). Zach is a web developer, writer and polymath. He has an eye for simplicity and usability and strives to build products with both the end user and business goals in mind. Zach is a regular contributor at Fixate IO.
This posting does not necessarily represent Splunk's position, strategies or opinion.