Incident Review: How To Conduct Incident Reviews & Postmortems

Key Takeaways

Conduct structured, blameless incident reviews promptly using a consistent agenda and defined roles to identify root causes and ensure accountability.
Document incidents thoroughly and share actionable insights across teams to foster continuous learning and incremental process improvement.
Measure the effectiveness of incident reviews with key metrics like mean time to detect, mean time to resolve, and completion rates for action items to enhance operational resilience.

In IT and business, disruptions and outages often accompany change, such as new system rollouts or deployments. An incident review, sometimes called an incident postmortem, is a structured process for analyzing and learning from such incidents within an organization’s systems.

The incident review process documents:

What went wrong in a given incident.
Why an incident happened.
Strategies to ensure similar issues don't repeat in the future.

The best part of an incident review is that, when done well, it yields a set of specific actions that improve service quality, like automating recovery processes.

So, let’s take a look at the incident review process. In this article, you will learn what an incident review/postmortem is, the steps involved, and the best practices to maximize valuable takeaways.

What is incident review?

Organizations routinely encounter system, site, and machine failures. These disruptions to normal service operations are called “incidents”, and they range from minor to severe depending on their nature and impact.

Importantly, there's something for teams to learn from almost every incident. And that’s what the review is meant to capture: the lessons learned from a critical examination of an event or failure within a system. In general, incident review processes involve:

Documenting the incident.
Diagnosing its root cause.
Evaluating its impact.
Creating an action plan to prevent these incidents.

So we can say that the incident review process is one part of your incident response and incident management strategy.

Interestingly, postmortems have long been a part of aviation and manufacturing industries. Only more recently have these concepts gained popularity in the business and technology space, too.

Why incident postmortems are necessary

Yes, it’s true that these reviews are optional, unless of course your team or organization mandates them. Still, we think every smart organization should conduct incident reviews. An incident review:

Allows a detailed analysis of the incident, to truly understand where the breakdown(s) occurred: in people, processes, and/or technologies.
Supports ongoing high system availability.
Clarifies why a system behaved differently after a change, so you can avoid repeating the same mistakes.

It is a great tool for learning about incident patterns in your systems.

Who performs incident review/postmortem?

Different teams, such as DevOps and SREs, collaborate to review and analyze incidents using real-time collaboration tools. Ideally, one person should own the postmortem report. That owner can be anyone, from a DevOps engineer or SRE to an incident manager/commander.

(This function may even live within a CSIRT: computer security incident response team.)

Importantly, every organization or team must define its own criteria for when an incident warrants a review or postmortem. You can automate this trigger so that a review is opened automatically when conditions such as the following are met:

A certain number of users are affected.
Internal or external users report an outage.
The organization experiences a certain amount of revenue loss during an outage.
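As a rough illustration, the criteria above could be encoded as a small check. This is only a sketch: the `IncidentSnapshot` fields, the threshold values, and the choice to open a review when any single criterion is met are all assumptions, and each team would substitute its own.

```python
from dataclasses import dataclass


@dataclass
class IncidentSnapshot:
    """Facts about an incident that the review criteria look at (hypothetical fields)."""
    users_affected: int
    outage_reported: bool          # reported by internal or external users
    estimated_revenue_loss: float  # in the organization's currency


# Hypothetical thresholds; each team would set its own.
USER_THRESHOLD = 500
REVENUE_LOSS_THRESHOLD = 10_000.0


def review_reasons(incident: IncidentSnapshot) -> list[str]:
    """Return the criteria this incident meets; an empty list means no review."""
    reasons = []
    if incident.users_affected >= USER_THRESHOLD:
        reasons.append("users_affected")
    if incident.outage_reported:
        reasons.append("outage_reported")
    if incident.estimated_revenue_loss >= REVENUE_LOSS_THRESHOLD:
        reasons.append("revenue_loss")
    return reasons


minor = IncidentSnapshot(users_affected=12, outage_reported=False, estimated_revenue_loss=0.0)
major = IncidentSnapshot(users_affected=2_000, outage_reported=True, estimated_revenue_loss=500.0)

print(review_reasons(minor))  # []
print(review_reasons(major))  # ['users_affected', 'outage_reported']
```

Returning the list of matched criteria, rather than a bare yes/no, lets the postmortem record why the review was opened.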

Steps of incident review/postmortems

Every organization has a different structure of postmortem steps that works for them. In general, teams will create a postmortem report and also hold a meeting afterwards to communicate everything to the wider team.

Let’s look at both.

Creating a postmortem report

These are sections to understand and include in any incident review documentation.

Incident summary

The first step of a postmortem is to write a summary of the incident that gives an overview of the initial problem. State the type of incident: a service problem, a bug in the code, or a site failure, for example.

Identifying the root cause

This step involves identifying the incident's root cause and what triggered it. Typically, the system automatically alerts the team via email or call. Different types of incident triggers include:

IT monitoring and application monitoring tools raising the incident through an automated built-in process.
Users reporting incidents or outages.
Team members identifying that something went wrong.

Often, IT or SRE team members must respond to the alerts immediately to resolve problems. A backup person must always be available in case the primary on-call responder is unavailable.

Impact on users

Not all incidents are the same. Severity varies: an incident can affect a single user or the entire site. The most severe cases are those where a service is down for all users or data is compromised.

Even a minor incident causes some inconvenience. Whatever the severity, with an incident response plan ready, you analyze and record how the incident impacted users.

(Related reading: Understand how incident severity levels work.)

(See how Splunk solutions support the entire incident management practice.)

Document detection and resolution

In this step, you document how the incident was detected. Did internal teams report it, or did an external user complain?

Here, team members document the delay between the start of the incident and the initial report, which can range from minutes to hours. The longer an incident goes unreported, the greater the loss. You also document how the incident was resolved, along with the timeline and duration of the actions taken.

In some cases, detecting the problem takes longer than resolving it. The goal should be to minimize the duration of incident detection and resolution.
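To make the timeline concrete, here is a minimal sketch that splits one incident into detection and resolution time and averages those figures across incidents. The function names and timestamps are made up for illustration, and it assumes time to resolve is measured from detection; some teams measure it from the start of the incident instead.

```python
from datetime import datetime, timedelta


def incident_durations(started: datetime, detected: datetime, resolved: datetime) -> dict[str, timedelta]:
    """Split an incident into time-to-detect and time-to-resolve."""
    if not (started <= detected <= resolved):
        raise ValueError("expected started <= detected <= resolved")
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - detected,  # assumption: measured from detection
        "total": resolved - started,
    }


def mean_duration(values: list[timedelta]) -> timedelta:
    """Average a list of durations, e.g. to get a mean time to detect."""
    if not values:
        raise ValueError("need at least one duration")
    return sum(values, timedelta()) / len(values)


# Illustrative timestamps only.
first = incident_durations(
    started=datetime(2024, 1, 10, 9, 0),
    detected=datetime(2024, 1, 10, 9, 45),
    resolved=datetime(2024, 1, 10, 10, 15),
)
second = incident_durations(
    started=datetime(2024, 2, 2, 14, 0),
    detected=datetime(2024, 2, 2, 14, 15),
    resolved=datetime(2024, 2, 2, 16, 15),
)

print(first["time_to_detect"])   # 0:45:00
print(first["time_to_resolve"])  # 0:30:00
print(mean_duration([first["time_to_detect"], second["time_to_detect"]]))  # 0:30:00
```

In the first incident, detection took longer than resolution, which is the situation described above.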

Acknowledge what went well & what went wrong

Here, you simply want to acknowledge the good outcomes and the things that could have been better. (As we’ll see later, this is not the time for blame.) This section of the report identifies:

Positive aspects of the response and actions that worked.
Areas where the response or system fell short.
Any fortunate circumstances that helped mitigate the impact.

Map an action plan for the future

The crux of a postmortem is to learn from the report and map an action plan. Here, team members outline specific steps to prevent similar incidents in the future, including:

Mitigation
Prevention
Process improvements

Lessons learned/postmortem meeting

Why and when should you hold postmortem meetings? These are the most common questions. You can hold these meetings in two scenarios, either:

Only when something goes wrong.
At the end of every project.

In these meetings, the team discusses what worked well and what went wrong, and commits to learning from the mistakes moving forward. Everyone who worked on the project should attend, so that the whole team can focus on constructive feedback about the systems and processes that failed instead of blaming specific people.

Incident review/postmortems: Best practices

The following are best practices for conducting incident postmortems:

Don't blame humans

The main goal should be to fix systems and processes, not blame individuals. Rather than focusing on who made the change, find out why your system was vulnerable to it.

Blameless incident postmortems make the system resilient and reliable. And just as a person shouldn't be blamed, the entire credit for success shouldn't be given to one person — after all, a system's success and failure don't rely on a single person.

Improve your action plan

Identify areas for improvement and update your existing action plan so that you prevent similar incidents in the future, or are prepared to respond if they do occur.

Here’s what you can do to improve the existing action plan during a postmortem:

Identify gaps and weaknesses in the existing action plan.
Gather diverse perspectives from stakeholders to uncover blind spots.
Prioritize improvements based on impact, likelihood, and resources.
Define specific, actionable steps with assigned owners and deadlines.
Address systemic issues by revising policies, training, or implementing new tools.
Regularly review and update your plan accordingly.
Effectively communicate and provide training on the updated action plan.
Establish a mechanism for tracking implementation and ensuring accountability.
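One lightweight way to track implementation and accountability is to record each action item with an owner and a deadline, then report the completion rate and anything overdue. The sketch below assumes a simple in-memory list; the `ActionItem` fields, names, and dates are hypothetical, and in practice this data would usually live in a ticketing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items marked done (0.0 when there are none)."""
    if not items:
        return 0.0
    return sum(item.done for item in items) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has passed."""
    return [item for item in items if not item.done and item.due < today]


# Illustrative data only.
items = [
    ActionItem("Add alert on queue depth", "alice", date(2024, 3, 1), done=True),
    ActionItem("Document rollback steps", "bob", date(2024, 3, 8)),
    ActionItem("Automate failover test", "carol", date(2024, 4, 1)),
    ActionItem("Review on-call rota", "dave", date(2024, 3, 5), done=True),
]

print(completion_rate(items))  # 0.5
print([i.title for i in overdue(items, today=date(2024, 3, 15))])  # ['Document rollback steps']
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the overdue check easy to test.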

(Related reading: incident response plans & disaster recovery plans.)

Think beyond prevention

Prevention shouldn't be your only focus. Automation is invaluable for early detection, and it can also reduce the number of incidents and help mitigate those that do occur, regardless of severity. With automation, you can:

Detect incidents earlier.
Analyze and validate incidents seamlessly.
Speed up the recovery.
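As one small example of automated early detection, the sketch below flags a possible incident when the error rate over the most recent requests crosses a threshold. The `ErrorRateDetector` class, window size, and threshold are illustrative assumptions, not a description of any particular monitoring tool.

```python
from collections import deque


class ErrorRateDetector:
    """Flag a possible incident when the recent error rate crosses a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.25):
        self.results = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def observe(self, failed: bool) -> bool:
        """Record one request outcome; return True if the window looks unhealthy."""
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        error_rate = sum(self.results) / len(self.results)
        # Wait for a full window so a single early failure does not raise an alert.
        return window_full and error_rate >= self.threshold


detector = ErrorRateDetector(window=4, threshold=0.5)
outcomes = [False, False, False, True, True, False]
alerts = [detector.observe(failed) for failed in outcomes]
print(alerts)  # [False, False, False, False, True, True]
```

A real system would typically send such a signal to an alerting or on-call tool, which is also where automated validation and recovery steps could be attached.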

(Related reading: security automation & RPA: robotic process automation.)

Increase team morale

You should take mistakes as learning opportunities to enhance system resilience and reliability. Doing so will increase team morale and build a high-performing team. Maintaining a friendly culture will help your teams collaborate and communicate openly, leading to efficient and smooth operations.

Learn from incident postmortems

6.41 million data records were breached worldwide in 2023. The best organizations can do is to identify issues before an incident occurs. But that's only one part of the story: you also need to learn from previous incidents to avoid repeating them. You can do that by preparing incident review, or lessons learned, reports.
