Incident management is the process of identifying and correcting IT incidents that threaten or interrupt a business’s services. A component of IT service management (ITSM), incident management aims to keep services running or — if they’re taken offline — restore them as quickly as possible, while minimizing the impact to the business.
An “incident,” according to Information Technology Infrastructure Library (ITIL), is “an unplanned disruption, or impending disruption, to an IT service.” By this broad description, anything from degrading network quality to running out of disk space to a cyberattack would qualify as an incident. The process of detecting and responding to security-related incidents is called security incident management.
There are numerous ways to approach incident management, and policies, tools and service-level agreements (SLAs) will vary across organizations. In general, IT teams try to prevent incidents through regular software updates, event monitoring and other practices, and they have an incident response plan in place to quickly resolve incidents and identify the root cause to prevent future occurrences.
Incident management is important because service interruptions can be extremely costly, potentially running up to hundreds of thousands of dollars per hour — not including regulatory fines and customer attrition.
In the following sections, we’ll look at the phases and best practices of incident management and how it can help organizations reduce harmful downtime.
What is security incident response?
Security incident response is the process of identifying, analyzing and resolving security threats or incidents in real time with a combination of computer and human investigation and analysis to minimize negative impacts on the business. The process usually starts with the security system alerting the incident response team that an incident has occurred. The response team investigates and analyzes the incident to determine its validity and scope, assess its impact and develop a mitigation plan.
A security incident can be anything from an active threat to a successful data breach, and can originate inside or outside of an organization. An employee using a work computer to access a gambling website, a vendor downloading data they’re not authorized to view and a malware attack are all examples of security incidents.
In addition to troubleshooting, security incident response also includes responding pre-emptively and implementing defensive measures that prevent future attacks. For example, after the Heartbleed vulnerability and the EternalBlue exploit became public, administrators at affected companies secured and audited their systems and IT infrastructure to prevent attackers from gaining access and compromising those systems again.
Security incident response is a specific process within the larger role of incident management. As defined by ITIL, incident management addresses any “unplanned interruption to an IT service or a reduction in the quality of an IT service.” Human error, technological failure, a security breach or any number of other occurrences can cause interruptions. The goal of incident management is to identify the cause of the incident, understand its impact and urgency and determine a response to restore normal service as quickly as possible.
Security incident response is a similar process, but it’s applied specifically to security incidents. A security incident could be an attempted intrusion, a policy violation, a malware infection or any other event that poses a threat to computer security. When an organization identifies a security incident, the incident response team, sometimes called a computer security incident response team (CSIRT), assesses the scope, then determines and executes the steps needed to resolve it. Strong security incident resolution is essential for preventing or mitigating damage and liabilities that result from security incidents.
There are four phases of the incident response life cycle as outlined by the National Institute of Standards and Technology (NIST):
1. Preparation: The first phase is designed to help organizations determine the risks to their systems and data, outline problem management strategies and put mechanisms in place to deal with security incidents. This can include performing a formal risk assessment, implementing the tools and processes to analyze and mitigate incidents, prioritizing threats, creating and training an Incident Response Team and putting together an Incident Response Plan (IRP) in accordance with the NIST life cycle guidelines.
2. Detection and analysis: In this phase, the service operation sets up systems to proactively monitor, detect, prioritize and analyze high-priority incidents, with the aim of recognizing any irregular and suspicious threats or activity in the network environment that might disrupt workflow. Detection and analysis are generally done through a combination of human investigation and security tools that automate security processes. With automation and effective execution, this phase can often minimize the spread and impact of an incident.
3. Containment, eradication and recovery: The third phase addresses security incident resolution. Containment aims to stop the incident from causing further damage; for example, disconnecting the affected server from the network and implementing firewall rules to block the attacker can stop a malware attack from spreading. In eradication, security administrators or support staff remove the threat, deleting the malware from the infected server and making sure it doesn’t exist anywhere else in the system. Finally, support staff recover the system to its state prior to the malware infection and restore service quality by reloading apps or restoring data from backups.
4. Post-incident activity: Phase four encompasses steps to prevent similar incidents from happening again. Using data collected from the incident and post-mortem meetings, the organization determines how the incident happened, what preventative measures to strengthen or add, how to improve monitoring and alerting processes, and how to streamline help desk and service requests, remediation and recovery processes. You’ll need to address any legal or regulatory compliance issues in this phase as well.
Altogether, the four phases are designed to build on a comprehensive knowledge base; the effectiveness of phase three relies heavily on the success of phases one and two. To provide optimal protection and restore service quickly, organizations need to implement all four phases together.
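As a rough illustration of how the four phases build on one another, the life cycle can be modeled as a small state machine. This is a minimal sketch, not a NIST artifact: the `Phase` enum, the `ALLOWED` transition table and the `advance` helper are hypothetical names, and the specific transitions (including the loop from containment back to detection) are assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = 1
    DETECTION_AND_ANALYSIS = 2
    CONTAINMENT_ERADICATION_RECOVERY = 3
    POST_INCIDENT_ACTIVITY = 4


# Assumed transitions for illustration: each phase leads to the next,
# containment may loop back to detection when new indicators turn up,
# and post-incident lessons feed back into preparation.
ALLOWED = {
    Phase.PREPARATION: {Phase.DETECTION_AND_ANALYSIS},
    Phase.DETECTION_AND_ANALYSIS: {Phase.CONTAINMENT_ERADICATION_RECOVERY},
    Phase.CONTAINMENT_ERADICATION_RECOVERY: {
        Phase.DETECTION_AND_ANALYSIS,
        Phase.POST_INCIDENT_ACTIVITY,
    },
    Phase.POST_INCIDENT_ACTIVITY: {Phase.PREPARATION},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Under these assumptions, a team cannot jump from preparation straight to post-incident activity; an attempt to do so raises an error, which mirrors the point that the later phases depend on the earlier ones.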
Effective security incident response hinges on having a strategy in place before an incident occurs. The ISO/IEC Standard 27035 outlines a five-part process for incident response management:
1. Prepare to deal with incidents.
2. Identify and report potential security incidents.
3. Assess incidents and decide what action to take.
4. Respond to incidents by containing, investigating and resolving them.
5. Document key takeaways and learnings from every incident.
Every organization will execute this plan a bit differently, but there are some best practices that can help shape security incident response to your business needs:
Triaging and responding to threats will vary by organization, but there are a few best practices for categorization and prioritization that can provide a framework for an effective and efficient process:
Cybersecurity tools can support the triage process, and even make it more effective. Automation and orchestration can relieve security teams of time-consuming data analysis and collection so they can focus on investigating and resolving critical incidents.
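One widely used way to categorize and prioritize incidents is an impact and urgency matrix, which is also the kind of rule an automation tool can apply before a human looks at the queue. The sketch below is illustrative only: the three-level scale, the 1-to-5 priority range and the names `LEVELS`, `priority`, `Incident` and `triage` are assumptions, not part of any standard or product.

```python
from dataclasses import dataclass

# Assumed three-level scale; lower numbers are more severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Map impact and urgency to a priority from 1 (most critical) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


@dataclass
class Incident:
    title: str
    impact: str   # how much of the business is affected
    urgency: str  # how quickly the effect must be addressed


def triage(incidents):
    """Return incidents ordered most critical first; ties keep input order."""
    return sorted(incidents, key=lambda i: priority(i.impact, i.urgency))


queue = triage([
    Incident("disk nearly full on build server", "low", "medium"),
    Incident("payment API down", "high", "high"),
])
for incident in queue:
    print(priority(incident.impact, incident.urgency), incident.title)
```

With this scale, the payment outage is ranked priority 1 and handled first, while the disk warning is ranked priority 4. Real organizations tune both the levels and the mapping to their own SLAs.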
DevOps uses incident management to support security monitoring in software applications and the development environment. While ITIL informs incident management for ITSM, there is no official guide for DevOps teams. Instead, incident management in this context is based around the core DevOps principles of breaking down organizational silos, increasing collaboration and transparency and utilizing lightweight processes. As such, it can be summarized in a handful of steps:
Detection: DevOps incident response teams collaboratively identify system vulnerabilities and plan responses to potential incidents. They also set up monitoring tools and alert systems and maintain runbooks that outline what to do when they detect an incident.
Response: Most DevOps incident management teams receive information from the monitoring tools, assess the severity and impact of the incident and follow the runbook to escalate the problem to the right responders through the appropriate communication channels.
Resolution: The incident manager works with the relevant teams to fix the issue, recover systems and data, and return the app to normal operation.
Analysis: At this “closure” stage, the incident management team comes together to share lessons learned in a “blameless post-incident review,” with the goal of improving systems and preventing similar incidents from happening again.
Readiness: The incident management teams evaluate their readiness for the next incident, applying what they learned in the blameless post-incident review to adjust their monitoring and alerting tools, update their runbook processes and team responsibilities, discuss possible workarounds and implement permanent fixes for the resolved issue in the development pipeline.
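The response step above (assess severity, then follow the runbook to reach the right responders) can be sketched in a few lines. Everything here is hypothetical: the severity labels, the thresholds, the `RUNBOOK` table and the alert fields are assumptions chosen for illustration, and a real team would take them from its own monitoring tools and runbooks.

```python
# Hypothetical runbook: who to notify, and how, for each severity level.
RUNBOOK = {
    "sev1": {"notify": ["on-call engineer", "incident manager"], "channel": "phone"},
    "sev2": {"notify": ["on-call engineer"], "channel": "chat"},
    "sev3": {"notify": ["owning team"], "channel": "ticket"},
}


def assess_severity(error_rate: float, users_affected: int) -> str:
    """Classify an alert using assumed thresholds."""
    if error_rate >= 0.5 or users_affected >= 1000:
        return "sev1"
    if error_rate >= 0.05 or users_affected >= 50:
        return "sev2"
    return "sev3"


def escalate(alert: dict) -> dict:
    """Look up the runbook entry that matches the alert's severity."""
    severity = assess_severity(alert["error_rate"], alert["users_affected"])
    return {"service": alert["service"], "severity": severity, **RUNBOOK[severity]}


page = escalate({"service": "checkout", "error_rate": 0.6, "users_affected": 10})
print(page["severity"], page["channel"], page["notify"])
```

Keeping the routing rules in a plain data structure makes them easy to revise after a blameless post-incident review, which is where the readiness step adjusts thresholds and responsibilities.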
Blameless post-incident reviews are a critical part of the incident lifecycle. By design, DevOps teams need open analysis of their incident response process to continuously improve their operational efficiency. The blameless post-incident review enables this analysis by looking at both the technical and human shortcomings of their response efforts.
In a blameless post-incident review, incident response team members and others involved in or impacted by the incident come together to gain a better understanding of the event to prevent it from happening again. The review is designed to identify tools and processes to improve, not to assign blame; this not only allows on-call responders to act without hesitation during an incident, but also leads to more innovative ideas and better applications.
A prepared plan of attack is the best way to navigate through the stress and uncertainty of a major incident. ITIL offers a detailed Major Incident Management Guide, but the following steps can provide a general framework for approaching any incident:
According to a 2020 ITIC Hourly Cost of Downtime survey, 40% of enterprises polled said a single hour of downtime can cost from $1 million to more than $50 million, not including legal fees, fines or compliance penalties.
Data shows that any kind of interruption to worker productivity, including downtime, can take a toll. A UC Irvine study indicates that it takes around 23 minutes to refocus after an interruption. While actual costs related to outages vary from organization to organization, it’s well-established that a single system outage can cost an organization millions of dollars, and that’s without factoring in associated costs like loss of business opportunities, reduced productivity and damaged reputation.
System outages are inevitable for every business, but shifting from a reactive to a proactive approach to incident management can reduce their frequency and impact.
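MTTD stands for “mean time to detect” or “mean time to discover,” and MTTR stands for “mean time to respond” (it is also commonly expanded as “mean time to repair,” “mean time to resolve” or “mean time to recover”). Both are metrics used to quantify the effectiveness of a team’s incident management processes.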
MTTD is a key performance indicator for incident management, measuring how long a problem exists before the organization or appropriate parties become aware of the issue. A shorter MTTD means outages and other disruptions go unnoticed for less time, and the lower the MTTD, the less downtime cost an organization will incur. Organizations discover issues either from end users who report an outage to the service desk or from alerts raised by monitoring and management tools.
MTTR represents the average time it takes to repair and restore an affected component or system to functionality, reflecting both how maintainable an organization’s equipment is and how efficiently the team resolves IT incidents. The MTTR clock starts the moment a failure is detected and includes diagnostic time, repair time, testing and all other activities that occur until service is restored to normal. Together, MTTD and MTTR make up the duration of a cyber incident.
MTTR is important because it’s a powerful predictor of IT incident costs. The higher an IT team’s MTTR, the greater the risk that the organization will experience significant downtime when IT incidents occur, potentially leading to business disruptions, customer dissatisfaction and revenue loss.
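To make the two metrics concrete, here is a minimal sketch that computes them from incident timestamps. The sample incidents and the field names `started`, `detected` and `resolved` are made up for illustration; in practice these values would come from an incident management platform.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(intervals):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(delta.total_seconds() for delta in intervals) / 60


# Hypothetical incident records.
incidents = [
    {"started": datetime(2024, 1, 5, 9, 0),
     "detected": datetime(2024, 1, 5, 9, 10),
     "resolved": datetime(2024, 1, 5, 10, 10)},
    {"started": datetime(2024, 2, 1, 14, 0),
     "detected": datetime(2024, 2, 1, 14, 30),
     "resolved": datetime(2024, 2, 1, 15, 0)},
]

# MTTD: from the start of the problem until someone becomes aware of it.
mttd = mean_minutes(i["detected"] - i["started"] for i in incidents)
# MTTR: from detection until service is restored to normal.
mttr = mean_minutes(i["resolved"] - i["detected"] for i in incidents)

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 20 min, MTTR: 45 min
```

In this sample, the average incident lasts 65 minutes end to end (20 minutes undetected plus 45 minutes to restore service), so faster detection and faster repair both shorten the outage.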
An incident management platform is the first line of defense during an incident. It provides critical support for each phase of the incident management process, with features that include incident identification, logging, diagnosis and investigation, issue escalation and resolution. There are numerous platforms available — selecting the appropriate platform will largely depend on the size and scope of your organization, compliance requirements and budgetary considerations.
The first step to implementing an effective incident management plan is to form an incident response team, composed of internal or external personnel or a mix of both. From there, you’ll need to decide what constitutes an incident for your organization and perform an incident threat analysis by assessing potential threats, risks and infrastructure failures. You can then start designing response plans for different scenarios, training staff and practicing through simulated breaches with the goal of continuously improving your incident response.
With the growing complexity of IT environments and the increasing number and sophistication of threats, organizations face an unprecedented level of risk. Incident management allows you to mitigate that risk by enabling you to detect and resolve incidents more quickly. While outages and other incidents are inevitable for every business, incident management is the most effective way to launch an immediate response and prevent costly downtime that could jeopardize your organization's reputation and bottom line.