DATA INSIDER

What Is Incident Management?

Incident management is the process of identifying and correcting IT incidents that threaten or interrupt a business’s services. A component of IT service management (ITSM), incident management aims to keep services running or, if they’re taken offline, to restore them as quickly as possible while minimizing the impact on the business.

An “incident,” according to the Information Technology Infrastructure Library (ITIL), is “an unplanned disruption, or impending disruption, to an IT service.” By this broad description, anything from degrading network quality to running out of disk space to a cyberattack would qualify as an incident. The process of detecting and responding to security-related incidents is called security incident management.

There are numerous ways to approach incident management, and policies, tools and service-level agreements (SLAs) will vary across organizations. In general, IT teams try to prevent incidents through regular software updates, event monitoring and other practices, and they have an incident response plan in place to quickly resolve incidents and identify the root cause to prevent future occurrences.

Incident management is important because service interruptions can be extremely costly, potentially running up to hundreds of thousands of dollars per hour — not including regulatory fines and customer attrition.

In the following sections, we’ll look at the phases and best practices of incident management and how it can help organizations reduce harmful downtime.

Security Incident Response

What is security incident response?

Security incident response is the process of identifying, analyzing and resolving security threats or incidents in real time with a combination of computer and human investigation and analysis to minimize negative impacts on the business. The process usually starts with the security system alerting the incident response team that an incident has occurred. The response team investigates and analyzes the incident to determine its validity and scope, assess its impact and develop a mitigation plan.

A security incident can be anything from an active threat to a successful data breach, and can originate inside or outside of an organization. An employee using a work computer to access a gambling website, a vendor downloading data they’re not authorized to view and a malware attack are all examples of security incidents.

In addition to troubleshooting, security incident response also includes acting pre-emptively and implementing defensive measures that prevent future attacks. For example, following the notorious attacks that relied on the Heartbleed vulnerability and the EternalBlue exploit, administrators at affected companies immediately secured and audited systems and IT infrastructure to prevent malicious attackers from gaining access and compromising their systems again.

How is security incident response related to incident management?

Security incident response is a specific process within the larger role of incident management. As defined by ITIL, incident management addresses any “unplanned interruption to an IT service or a reduction in the quality of an IT service.” Human error, technological failure, a security breach or any number of other occurrences can cause interruptions. The goal of incident management is to identify the cause of the incident, understand its impact and urgency and determine a response to restore normal service as quickly as possible.

Security incident response is a similar process, but it’s applied specifically to security incidents. A security incident could be an attempted intrusion, a policy violation, a malware infection or any other event that poses a threat to computer security. When an organization identifies a security incident, the incident response team, sometimes called a computer security incident response team (CSIRT), assesses the scope, then determines and executes the steps needed to resolve it. Strong security incident response is essential for preventing or mitigating the damage and liabilities that result from security incidents.

What are the phases of the incident response life cycle?

There are four phases of the incident response life cycle as outlined by the National Institute of Standards and Technology (NIST):

1. Preparation: The first phase is designed to help organizations determine the risks to their systems and data, outline problem management strategies and put mechanisms in place to deal with security incidents. This can include performing a formal risk assessment, implementing the tools and processes to analyze and mitigate incidents, prioritizing threats, creating and training an Incident Response Team and putting together an Incident Response Plan (IRP) in accordance with the NIST life cycle guidelines.

2. Detection and analysis: In this phase, the organization sets up systems to proactively monitor, detect, prioritize and analyze high-priority incidents, with the aim of recognizing any irregular or suspicious activity in the network environment that might disrupt workflow. Detection and analysis are generally done through a combination of human investigation and security tools that automate security processes. With automation and effective execution, this phase can often minimize the spread and impact of an incident.

3. Containment, eradication and recovery: The third phase addresses security incident resolution. Containment aims to stop the incident from causing further damage; for example, disconnecting the affected server from the network and implementing firewall rules to block the attacker can stop a malware attack from spreading. In eradication, security administrators or support staff remove the threat, deleting the malware from the infected server and making sure it doesn’t exist anywhere else in the system. Finally, in recovery, support staff return the system to its state prior to the malware infection and restore service quality by reloading apps or restoring data from backups.

4. Post-incident activity: Phase four encompasses steps to prevent similar incidents from happening again. Using data collected from the incident and post-mortem meetings, the organization determines how the incident happened, what preventative measures to strengthen or add, how to improve monitoring and alerting processes, and how to streamline help desk and service requests, remediation and recovery processes. You’ll need to address any legal or regulatory compliance issues in this phase as well.

Altogether, the four phases are designed to build on a comprehensive knowledge base; the effectiveness of phase three relies heavily on the success of phases one and two. To provide optimal protection and restore service quickly, organizations need to implement all four phases together.
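The detection step in phase two can be illustrated with a small Python sketch. It assumes a hypothetical event format, a (source IP, event type) pair, and a hypothetical rule that flags any source with five or more failed logins. Neither assumption comes from NIST, and real security tools apply far richer logic.

```python
from collections import Counter
from typing import Iterable

# Hypothetical event shape: (source_ip, event_type). Real log formats will differ.
Event = tuple[str, str]


def flag_suspicious_sources(events: Iterable[Event], threshold: int = 5) -> list[str]:
    """Return source IPs with at least `threshold` failed logins, busiest first."""
    failures = Counter(ip for ip, kind in events if kind == "login_failed")
    return [ip for ip, count in failures.most_common() if count >= threshold]
```

A rule like this only surfaces candidates; a human analyst or a downstream tool would still need to confirm whether a flagged source represents a real incident.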

[Figure: incident response life cycle diagram]

How do you craft a modern security incident response plan?

Effective security incident response hinges on having a strategy in place before an incident occurs. The ISO/IEC Standard 27035 outlines a five-part process for incident response management:

1. Prepare to deal with incidents.

2. Identify and report potential security incidents.

3. Assess incidents and decide what action to take.

4. Respond to incidents by containing, investigating and resolving them.

5. Document key takeaways and learnings from every incident.

Every organization will execute this plan a bit differently, but there are some best practices that can help shape security incident response to your business needs:

  • Make an inventory of assets. Determine what systems and data are most critical for your business activity and prioritize the order in which they’d need to be addressed and recovered after a security incident.
  • Assemble a security incident response team. Assign team members roles and responsibilities, and be sure to include representatives from departments outside of IT, such as finance, operations and legal, so that you can reach the appropriate individuals during a security incident.
  • Look for security clues. Start by defining what constitutes a security incident for your organization, so you know what to look for. Then develop policies for how they’re detected and reported. 
  • Create a security incident action plan. This should include a list of all relevant tasks based on the threat and who is responsible for handling each one. Then test the plan to determine its effectiveness and refine it as needed. 
  • Evaluate your team’s response. Analyzing a response’s successes and failures around service delivery will allow you to improve the plan for the next security incident.
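One way to picture the action plan described above is as a mapping from threat type to tasks, each with a named owner. The sketch below is only an illustration: the threat names, tasks and owners are hypothetical, and the helper simply reports tasks that nobody is responsible for yet, which is one gap a test of the plan might uncover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanTask:
    description: str
    owner: Optional[str]  # None means nobody has been assigned yet


# Hypothetical action plan keyed by threat type; contents are illustrative only.
ACTION_PLAN: dict[str, list[PlanTask]] = {
    "malware": [
        PlanTask("Isolate the infected host from the network", "network team"),
        PlanTask("Restore data from backups", None),
    ],
    "policy_violation": [
        PlanTask("Notify legal and HR", "security lead"),
    ],
}


def unowned_tasks(plan: dict[str, list[PlanTask]]) -> list[tuple[str, str]]:
    """List (threat, task description) pairs that have no owner."""
    return [
        (threat, task.description)
        for threat, tasks in plan.items()
        for task in tasks
        if task.owner is None
    ]
```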

How do you triage and determine the appropriate level of response for threats?

Triaging and responding to threats will vary by organization, but there are a few best practices for categorization and prioritization that can provide a framework for an effective and efficient process:

  • Identify: Once an incident is confirmed, you should start collecting evidence it leaves behind. This involves analyzing log files and other data sources to help identify any compromised or infected endpoints.
  • Investigate: Once you’ve collected all the evidence about the incident, you can piece it together to get a picture of the path the attacker took. Following the incident trajectory will also allow you to determine the attacker’s target.
  • Resolve: Visualizing the attack path allows you to identify the most business-critical targets and prioritize your response accordingly. You can use the information collected during the investigation stage to remove the malware and restore the infected systems in order of importance to your business operations.
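The "restore in order of importance" idea can be sketched in a few lines of Python. The `Endpoint` type and its 1-to-N criticality scale (1 being the most business-critical) are assumptions made for this example, not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    criticality: int  # hypothetical scale: 1 = most business-critical
    compromised: bool


def restoration_order(endpoints: list[Endpoint]) -> list[str]:
    """Names of compromised endpoints, most business-critical first."""
    affected = [e for e in endpoints if e.compromised]
    # Ties on criticality are broken alphabetically so the order is stable.
    return [e.name for e in sorted(affected, key=lambda e: (e.criticality, e.name))]
```

In practice the criticality ranking would come from the asset inventory your organization prepared before the incident.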

Cybersecurity tools can support the triage process, and even make it more effective. Automation and orchestration can relieve security teams of time-consuming data analysis and collection so they can focus on investigating and resolving critical incidents.

Incident Management Systems

How is DevOps used for incident management?

DevOps uses incident management to support security monitoring in software applications and the development environment. While ITIL informs incident management for ITSM, there is no official guide for DevOps teams. Instead, incident management in this context is based around the core DevOps principles of breaking down organizational silos, increasing collaboration and transparency and utilizing lightweight processes. As such, it can be summarized in a handful of steps:

Detection: DevOps incident response teams collaboratively identify system vulnerabilities and plan responses to potential incidents. They also set up monitoring tools and alert systems and maintain runbooks that outline what to do when they detect an incident.

Response: Most DevOps incident management teams receive information from the monitoring tools, assess the severity and impact of the incident and follow the runbook to escalate the problem to the right responders through the appropriate communication channels.

Resolution: The incident manager works with the relevant teams to fix the issue, recover systems and data, and return the app to normal operation.

Analysis: At this “closure” stage, the incident management team comes together to share lessons learned in a “blameless post-incident review,” with the goal of improving systems and preventing similar incidents from happening again.

Readiness: The incident management teams evaluate their readiness for the next incident, applying what they learned in the blameless post-incident review to adjust their monitoring and alerting tools, update their runbook processes and team responsibilities, discuss possible workarounds and implement permanent fixes for the resolved issue in the development pipeline.
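As a rough illustration of the response step, the sketch below assesses severity from two inputs and looks up who to notify in a runbook. The severity levels, thresholds, responder roles and channels are all hypothetical; each team defines its own.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # full outage or widespread impact
    SEV2 = 2  # partial degradation
    SEV3 = 3  # minor issue


# Hypothetical runbook: who to notify, and how, for each severity level.
RUNBOOK = {
    Severity.SEV1: {"notify": ["on-call engineer", "incident manager"], "channel": "phone"},
    Severity.SEV2: {"notify": ["on-call engineer"], "channel": "chat"},
    Severity.SEV3: {"notify": ["service owner"], "channel": "ticket"},
}


def assess_severity(service_down: bool, users_affected_pct: float) -> Severity:
    """Map monitoring data to a severity level using illustrative thresholds."""
    if service_down or users_affected_pct >= 50:
        return Severity.SEV1
    if users_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


def escalate(service_down: bool, users_affected_pct: float) -> dict:
    """Return the runbook entry for the assessed severity."""
    return RUNBOOK[assess_severity(service_down, users_affected_pct)]
```

Keeping the runbook as data rather than logic makes it easier to update after a post-incident review without touching the assessment code.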

What is a blameless post-incident review?

Blameless post-incident reviews are a critical part of the incident lifecycle. By design, DevOps teams need open analysis of their incident response process to continuously improve their operational efficiency. The blameless post-incident review enables this analysis by looking at both the technical and human shortcomings of their response efforts.

In a blameless post-incident review, incident response team members and others involved in or impacted by the incident come together to gain a better understanding of the event to prevent it from happening again. The review is designed to identify tools and processes to improve, not to assign blame; this not only allows on-call responders to act without hesitation during an incident, but also leads to more innovative ideas and better applications.

What are some techniques for major incident management in systems?

A prepared plan of attack is the best way to navigate through the stress and uncertainty of a major incident. ITIL offers a detailed Major Incident Management Guide, but the following steps can provide a general framework for approaching any incident:

  • Gather all the facts. Before you jump into action, it’s important to understand the nature and scope of the problem. At minimum, you need to quickly determine which services and users are affected, the potential business impact, who is looking at the issue, who needs to be notified and whether the issue raises any compliance or legal concerns.
  • Communicate with the right people. In the event of an incident, you’ll need a list of who to contact and how. In addition to members of the response team, you should also be communicating with other stakeholders across the business, the user base of the affected service and any relevant regulatory bodies.
  • Develop an action plan. Key teams need to determine and implement the best response to the incident based on the gathered facts. The incident manager should coordinate all the team activity and work to keep the response plan efficient and on track.
  • Keep everyone in the loop. As teams work on the issue, the incident manager should regularly check in to ensure that deadlines are met, while proactively updating other stakeholders on their progress.
  • Request emergency change approvals. Once you’ve found a resolution for the incident, perform tests to ensure that it works. If necessary, the incident manager should start the emergency change management process so that the response teams can quickly implement the fix.
  • Let people know it’s fixed. Once the fix is deployed and checked, a small user control group confirms the service is working correctly. The incident response team will subsequently notify everyone that the incident has been resolved.
  • Perform a quick review. Spend a few moments with the teams recapping what measures they took and any lessons they learned while the event is still fresh in everyone’s mind. Schedule a blameless post-incident review for a deeper assessment following recovery.
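The fact-gathering step at the top of this list lends itself to a simple checklist structure. The sketch below is one possible shape, with hypothetical field names, that reports which facts are still unknown so the incident manager can see what to chase next.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentFacts:
    # Every field starts as None, meaning "not yet determined".
    affected_services: Optional[list[str]] = None
    affected_users: Optional[str] = None
    business_impact: Optional[str] = None
    current_investigators: Optional[list[str]] = None
    people_to_notify: Optional[list[str]] = None
    compliance_concerns: Optional[bool] = None

    def missing(self) -> list[str]:
        """Names of the facts that have not been established yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

Note that `compliance_concerns=False` counts as an established fact; only `None` is treated as unknown.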

Risks of Downtime

What is the cost of a system outage?

According to a 2020 ITIC Hourly Cost of Downtime survey, 40% of enterprises polled said a single hour of downtime can cost from $1 million to more than $50 million, not including legal fees, fines or compliance penalties.

Data shows that any kind of interruption to worker productivity, including downtime, can take a toll. A UC Irvine study indicates that it takes around 23 minutes to refocus after an interruption. While actual costs related to outages vary from organization to organization, it’s well established that a single system outage can cost an organization millions of dollars, and that’s without factoring in associated costs like loss of business opportunities, reduced productivity and damaged reputation.

System outages are inevitable for every business, but shifting from a reactive to a proactive approach to incident management can reduce their frequency and impact.

[Figure: cost of system outage graph]

What is MTTD/MTTR?

MTTD stands for “mean time to detect” or “mean time to discover,” and MTTR stands for “mean time to respond,” though the “R” is also commonly read as repair, recover or resolve. Both are metrics used to quantify the effectiveness of a team’s incident management processes.

MTTD is a key performance indicator for incident management, measuring how long a problem exists before the organization or appropriate parties become aware of the issue. A shorter MTTD means an organization suffers from outages and other disruptions for less time. In addition, the lower the MTTD, the less cost an organization will incur due to downtime. Organizations discover issues either through end users who report an outage to the service desk or through alerts from monitoring and management tools.

MTTR represents the average time it takes to repair and restore an affected component or system to functionality. It reflects how well an organization’s equipment is maintained, as well as the team’s efficiency in resolving IT incidents. The MTTR clock starts the moment a failure is detected and includes diagnostic time, repair time, testing and all other activities that occur until service is restored to normal. Together, MTTD and MTTR make up the duration of a cyber incident.

MTTR is important because it’s a powerful predictor of IT incident costs. The higher an IT team’s MTTR, the greater the risk that the organization will experience significant downtime when IT incidents occur, potentially leading to business disruptions, customer dissatisfaction and revenue loss.
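Both metrics are simple averages over incident timestamps, as the following Python sketch shows. The `IncidentRecord` type and the cost helper are assumptions for illustration: in practice the start time of an incident is often an estimate made after the fact, and multiplying duration by a flat hourly cost is only a rough approximation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentRecord:
    started: datetime   # when the problem began (often estimated afterward)
    detected: datetime  # when the team became aware of it
    resolved: datetime  # when service was restored to normal


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[IncidentRecord]) -> timedelta:
    """Mean time from the start of an incident to its detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[IncidentRecord]) -> timedelta:
    """Mean time from detection to restored service."""
    return _mean([i.resolved - i.detected for i in incidents])


def estimated_downtime_cost(duration: timedelta, hourly_cost: float) -> float:
    """Rough cost estimate: duration in hours times a flat hourly cost."""
    return duration.total_seconds() / 3600 * hourly_cost
```

For a set of records, `mttd(records) + mttr(records)` gives the mean incident duration, which can be passed to `estimated_downtime_cost` along with your organization's own hourly cost figure.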

Getting Started

What are some incident management tools you can implement to build defenses?

An incident management platform is the first line of defense during an incident. It provides critical support for each phase of the incident management process, with features that include incident identification, logging, diagnosis and investigation, issue escalation and resolution. There are numerous platforms available — selecting the appropriate platform will largely depend on the size and scope of your organization, compliance requirements and budgetary considerations.

How do you get started implementing an effective incident management plan?

The first step to implementing an effective incident management plan is to form an incident response team, composed of internal or external personnel or a mix of both. From there, you’ll need to decide what constitutes an incident for your organization and perform an incident threat analysis by assessing potential threats, risks and infrastructure failures. You can then start designing response plans for different scenarios, training staff and practicing through simulated breaches with the goal of continuously improving your incident response.

The Bottom Line: Effective incident management is essential for every business

With the growing complexity of IT environments and the increasing number and sophistication of threats, organizations face an unprecedented level of risk. Incident management allows you to mitigate that risk by enabling you to detect and resolve incidents more quickly. While outages and other incidents are inevitable for every business, incident management is the most effective way to launch an immediate response and prevent costly downtime that could jeopardize your organization's reputation and bottom line.
