Published Date: December 13, 2022
Incident management is a process within IT service management (ITSM) that identifies and corrects IT incidents to keep an organization’s services running smoothly or, if they’re taken offline, restore them as quickly as possible to minimize impact to the business and end users (including your customers). Incidents can range from something as simple as running out of disk space all the way to a cyberattack.
There are numerous ways to approach incident management, and policies, tools and service-level agreements (SLAs) will vary across organizations. In general, IT teams try to prevent incidents through regular software updates, event monitoring and other practices, and they have an incident response plan in place to quickly resolve incidents and identify the root cause to prevent future occurrences.
Establishing an incident management process for your organization is important because service interruptions can be extremely costly, potentially running up to hundreds of thousands of dollars per hour — not including regulatory fines and customer attrition.
In the following sections, we’ll look at the phases and best practices of incident management and how it can help organizations reduce harmful downtime.
What is the cost of a system outage?
Incident response is critical because incidents of any size can be quite costly. According to Uptime Institute’s 2022 Outage Analysis Report, downtime costs continue to rise:
- More than 60% of outages cost over $100,000, which is an increase from 39% in 2019.
- 15% of outages cost over $1 million, which is an increase from 11% in 2019.
In the fall of 2021, for example, Facebook, WhatsApp and Instagram were unreachable for around six hours. Over 14 million users reported they couldn’t use any of Facebook’s apps or services during that time. Experts estimated that each minute of downtime cost the company $163,565, totaling around $60 million in lost revenue that day alone.
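As a rough check on that estimate, the arithmetic can be sketched in a few lines of Python. The per-minute figure and the six-hour duration come from the paragraph above; the function name `outage_cost` is purely illustrative, and the duration is approximate, so the result is a ballpark rather than an exact loss.

```python
# Figures quoted in the article; the six-hour duration is approximate.
COST_PER_MINUTE_USD = 163_565
OUTAGE_HOURS = 6


def outage_cost(cost_per_minute: float, hours: float) -> float:
    """Return the estimated revenue lost over an outage of the given length."""
    return cost_per_minute * hours * 60


total = outage_cost(COST_PER_MINUTE_USD, OUTAGE_HOURS)
print(f"Estimated loss: ${total:,.0f}")  # Estimated loss: $58,883,400
```

The result, just under $59 million, is consistent with the "around $60 million" estimate.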
System outages are inevitable for every business, but shifting from a reactive to a proactive approach to incident management can not only improve mean time to resolution (MTTR) but also, importantly, reduce the frequency and impact of outages.
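To make MTTR concrete, here is a minimal sketch of how it might be computed from incident records. It assumes each incident is stored as a (detected, resolved) pair of timestamps and that MTTR is measured from detection to resolution; the sample incidents are made up for illustration, and real tooling may define the start and end points differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average the detected-to-resolved duration across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data: (detected, resolved)
incidents = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 45)),      # 45 minutes
    (datetime(2022, 2, 14, 22, 10), datetime(2022, 2, 15, 0, 10)),  # 120 minutes
    (datetime(2022, 3, 8, 13, 30), datetime(2022, 3, 8, 13, 45)),   # 15 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this value over time is one way to see whether process changes are actually shortening resolution.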
What is security incident response?
When an incident occurs, a security system alert spurs the incident response team into action. The team combines human expertise with automation to identify and understand key metrics, then analyze and resolve the security threat or incident in real time to mitigate any negative impact on the company.
In addition to troubleshooting, security incident response also includes creating incident reports, responding preemptively and implementing defensive measures that prevent future attacks. For example, following the notorious RockYou2021 password leak or the SolarWinds hack, administrators immediately got to work securing and auditing systems and IT infrastructure to prevent malicious attackers from gaining access and compromising their systems again.
How is security incident response related to incident management?
Security incident response is a specific workflow within the larger role of incident management. According to ITIL, an incident management process addresses any “unplanned interruption to an IT service or a reduction in the quality of an IT service.” Human error, technological failure, a security breach or any number of other occurrences can cause interruptions. The goal of incident management is to identify the cause of the incident, understand its impact and urgency and determine a response to restore normal service and business operations as quickly as possible.
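One common way to turn impact and urgency into a response priority is a simple matrix. The sketch below assumes three levels for each dimension and a five-step priority scale; the level names, the `priority` function and the exact mapping are illustrative conventions rather than something prescribed by the text above, and organizations typically tune the matrix to their own SLAs.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Map impact and urgency to a priority from 1 (critical) to 5 (lowest)."""
    return int(impact) + int(urgency) - 1


print(priority(Level.HIGH, Level.HIGH))   # 1
print(priority(Level.HIGH, Level.LOW))    # 3
print(priority(Level.LOW, Level.LOW))     # 5
```

An outage affecting every customer (high impact, high urgency) lands at priority 1, while a cosmetic issue with a workaround falls to the bottom of the queue.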
Security incident response is a similar process, but it’s applied specifically to security incidents. A security incident could be an attempted intrusion, a policy violation, a malware infection or any other event that poses a threat to computer security. When an organization identifies a security incident, the incident response team — sometimes called a CSIRT — assesses the scope, and determines and executes the necessary steps to resolve it. Strong security incident resolution is essential for preventing or mitigating damage and liabilities that result from security incidents.
What are the phases of the incident response life cycle?
There are four phases of the incident response life cycle as outlined by the National Institute of Standards and Technology (NIST):
1. Preparation: This foundational phase helps organizations identify potential risks and put systems in place to deal with inevitable security incidents. A formal risk assessment is an example of a preparatory activity.
2. Detection and analysis: Activities in this phase combine incident management software and automation with human expertise to build systems that proactively monitor, detect, prioritize and analyze high-priority incidents. An incident prevented is always preferable to an incident solved.
3. Containment, eradication and recovery: This third phase takes place once an incident occurs. Containment seeks to prevent further damage, for example by cutting off an affected server from the network, which gives the security administrator or support staff time to remove the threat and restore service quality by reloading apps or restoring from a backup.
4. Post-incident activity: This post-mortem is particularly important to prevent similar incidents from happening in the future. Root cause analysis can determine how the incident happened in the first place in a blameless environment aimed at problem-solving and continuous improvement.
When used together, the four phases of the incident response life cycle improve MTTR.
How do you craft a modern security incident response plan?
The first step to implementing an effective incident management plan is to form an incident response team, composed of internal or external personnel or a mix of both. Then, you can get to work developing a plan of action. (For an example of how to do this with Splunk, feel free to download our complimentary copy of Using Splunk to Develop an Incident Response Plan.)
How do you shape an incident response to your organization’s specific needs?
Within the preparation phase of the incident response life cycle, here are some key activities to help set your organization up for successful incident response.
Make an inventory of assets: Categorizing assets helps you determine which systems and data are most critical to your business and prioritize the order in which they’d need to be addressed and recovered after a security incident.
Assemble a security incident response team: Assign team members roles and responsibilities, and be sure to include representatives from departments outside of IT, such as finance, operations and legal, to ensure communication with the appropriate individuals during a security incident.
Look for security clues: Start by defining what constitutes a security incident for your organization, so you know what to look for. Then develop policies for how they’re detected and reported.
Create a security incident action plan: This should include a list of all relevant tasks based on the threat, including key performance indicators (KPIs), and who is responsible for handling each one. Then test the plan to determine its effectiveness and streamline as needed — including testing and consolidating your incident management tools.
Evaluate your team’s response: Analyzing response time and successes and failures during an incident will allow you to build your knowledge base to improve the plan for future incidents.
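The asset inventory described in the list above can be sketched as a small data structure. This example assumes each asset carries a criticality tier (1 is most critical) and a recovery time objective in hours; the field names, tiers and sample assets are hypothetical and would come from your own risk assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    owner: str
    criticality: int                      # 1 = most critical (assumed scale)
    recovery_time_objective_hours: float  # target time to restore service


def recovery_order(assets: list[Asset]) -> list[Asset]:
    """Sort by criticality first, then by the tightest recovery objective."""
    return sorted(
        assets,
        key=lambda a: (a.criticality, a.recovery_time_objective_hours),
    )


# Hypothetical inventory for illustration only.
inventory = [
    Asset("internal-wiki", "IT", 3, 72),
    Asset("payments-api", "Finance", 1, 1),
    Asset("customer-db", "Operations", 1, 4),
    Asset("email", "IT", 2, 8),
]

for asset in recovery_order(inventory):
    print(asset.criticality, asset.name, asset.owner)
```

Keeping the owner alongside each asset also gives the response team a starting point for who to contact when that asset is affected.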
What do you do when an incident has occurred?
The ever-evolving landscape of threats (and the complexity of hybrid, distributed environments) means that it’s not a matter of if your organization will experience an incident but when. For a deeper dive, the ITIL incident management brain trust offers a detailed Major Incident Management Guide, but the following steps can provide a general framework for approaching any incident:
Gather all the facts: Before you jump into action, it’s important to understand the nature and scope of the problem. At minimum, you need to quickly determine which services and users are affected, the potential business impact, who is looking at the issue, who needs to be notified and whether the issue raises any compliance or legal concerns.
Communicate with the right people: In the event of an incident, you’ll need a list of who to contact and how, which in many cases is through incident logging. In addition to members of the response team, you should also communicate with other stakeholders across the business and the user base of the affected service, and consider escalation to any relevant regulatory bodies.
Develop an action plan: Key teams need to determine and implement the best response to the incident based on the gathered facts. The incident manager should coordinate all the team activity and work to keep the response plan efficient and on track.
Keep everyone in the loop: As teams work on the issue, the incident manager should regularly check in to ensure that deadlines are met, while proactively updating other stakeholders on their progress.
Request emergency change approvals: Once you’ve found a resolution for the incident, perform tests to ensure that it works. If necessary, the incident manager should start the emergency change management process so that the response teams can quickly implement the fix.
Let people know it’s fixed: Once the fix is deployed and checked, a small user control group confirms the service is working correctly. The incident response team will subsequently notify everyone that the incident has been resolved.
Perform a quick review: Spend a few moments with the teams recapping what measures they took and any lessons they learned while the event is still fresh in everyone’s mind. Schedule a blameless post-incident review for a deeper assessment following recovery.
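The fact-gathering step at the top of this list lends itself to a simple checklist. The sketch below assumes a record with one field per question from that step and reports which ones are still unanswered; the class and field names are hypothetical, and a real incident management tool would track far more.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class IncidentFacts:
    """One field per question to answer before jumping into action."""
    affected_services: list[str] = field(default_factory=list)
    affected_users: Optional[str] = None
    business_impact: Optional[str] = None
    responders: list[str] = field(default_factory=list)
    notify: list[str] = field(default_factory=list)
    compliance_concern: Optional[bool] = None  # False counts as answered

    def missing(self) -> list[str]:
        """Return the names of fields that have not been filled in yet."""
        missing_names = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == []:
                missing_names.append(f.name)
        return missing_names


facts = IncidentFacts(
    affected_services=["checkout"],
    business_impact="orders failing",
    compliance_concern=False,
)
print(facts.missing())  # ['affected_users', 'responders', 'notify']
```

An incident manager could refuse to move to the action plan until `missing()` returns an empty list.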
What is a blameless post-incident review?
The last phase of incident response is no less important than the phases that precede it, especially when you want to prevent an incident from happening again. Blameless post-incident reviews are a critical part of the incident life cycle. Teams need open analysis of their incident response process to continuously improve their operational efficiency. The blameless post-incident review enables this analysis by looking at both the technical and human shortcomings of the response effort after incident closure.
In a blameless post-incident review, incident response team members and others involved in or impacted by the incident come together to gain a better understanding of the event to prevent it from happening again. The review is designed to identify tools and processes to improve, not to assign blame; this not only allows on-call responders to act without hesitation during an incident, but also leads to more innovative ideas and better applications.
To learn eight best practices for incident review and postmortem reports, read the Splunk Learn blog.
With the growing complexity of IT environments and the increasing number and sophistication of threats, organizations face an unprecedented level of risk. Incident management allows you to mitigate that risk by enabling you to detect and resolve incidents more quickly. Because outages and other incidents are inevitable for every business, incident management is a vital business process to help you launch an immediate response and prevent costly downtime that could jeopardize your customers’ experience, your organization's reputation and your bottom line.
