The question isn't whether an incident will happen: it's when it will happen. Systems will crash. Software will fail. Vendors will suffer an outage of their own. It's your job to be prepared for these problems, and incident severity levels are one of the tools you need.
Incidents have varying impacts on your business and customers. Incident severity levels are how you classify their impact and manage your response. When you use severity levels properly…
- Your teams react faster.
- Your organization improves mean time to repair (MTTR).
- Your stakeholders better understand how problems affect customers.
In this article, let's look at what incident severity levels are, how to use them and how they differ from priority levels.
What are severity levels?
A vital part of the incident management practice, severity levels measure how acutely an event impacts your business. Whether an event is internal, such as equipment or software failures, or external, such as a security breach or a vendor outage, it has a specific effect on your ability to serve your clients. The severity level reflects that impact.
Depending on the organization, severity levels commonly range from one to three, four or five. One, or SEV 1, is the most severe, and the highest number in your system (3, 4 or 5) is the least severe.
There's no universal definition for severity levels. How you define them depends on what's important to your organization and your users. For some companies, only three levels make sense. For others, dividing incidents into five may be a better idea. Here are definitions for five levels:
- SEV 1: A critical incident that affects a large number of users in production.
- SEV 2: A significant problem affecting a limited number of users in production.
- SEV 3: An incident that causes errors, minor problems for users, or a heavy system load.
- SEV 4: A minor problem that affects the service but doesn't have a serious impact on users.
- SEV 5: A low-level deficiency that causes minor problems.
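If you track incidents in code, the five definitions above could be captured in a small lookup. This is only a sketch: the names `Severity`, `DESCRIPTIONS` and `is_more_severe` are illustrative assumptions, not part of any particular tool.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scheme (an assumption); a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Descriptions copied from the five definitions in the article.
DESCRIPTIONS = {
    Severity.SEV1: "A critical incident that affects a large number of users in production.",
    Severity.SEV2: "A significant problem affecting a limited number of users in production.",
    Severity.SEV3: "An incident that causes errors, minor problems for users, or a heavy system load.",
    Severity.SEV4: "A minor problem that affects the service but doesn't have a serious impact on users.",
    Severity.SEV5: "A low-level deficiency that causes minor problems.",
}


def is_more_severe(a: Severity, b: Severity) -> bool:
    """Return True when `a` is more severe than `b` (SEV 1 outranks SEV 2)."""
    return a < b
```

Using an integer-backed enum keeps the "lower number is worse" ordering explicit, so comparisons and sorting behave the way responders expect.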
Why use incident severity levels?
When an incident occurs, your teams need to know:
- Who is responsible for managing the response to the problem?
- How will the team members communicate with each other?
- How serious is the issue?
- What steps are the team permitted to take to clear it?
- How will they report on and track the incident?
For example, when an outage occurs that affects all users, a typical response is "All hands on deck!" But having everyone focus on a single problem is usually counterproductive: it leads to confusion and to duplicated or even contradictory efforts. Defining a severity level and attaching processes to it leads to a better response. (Even better: Designate an Incident Commander so you already know who's calling the shots.)
Defining severity levels should be part of your incident management plan. Well-defined levels go a long way toward answering these questions in advance, and they save your team time because responders know what to do as soon as an incident is assigned a level.
Incident severity level examples
Using our questions above, let's see what the answers might be for a SEV 1 incident:
- Who is responsible for managing the response to the problem? The associated department head, or a designated management team member, is responsible.
- How will the team members communicate with each other? There will be an open call bridge.
- How serious is the issue? SEV 1 means a majority of customers are affected.
- What steps are the team permitted to take to clear it? With a SEV 1 outage, you can take all measures, including restarting production processes.
- How will they report on and track the incident? The responsible party will issue hourly reports to management.
For a SEV 5 incident, the answers are very different:
- An engineer or developer is responsible for managing the problem.
- Communication occurs over the incident tracking system.
- SEV 5 means a minor, non-emergency issue.
- You can only take steps to address a SEV 5 in production during a change window.
- The engineer will track the issue in the incident or software tracking system.
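One way to make these answers available the moment a level is assigned is to store them as a lookup keyed by severity. The sketch below encodes only the SEV 1 and SEV 5 answers given above; `ResponsePlan`, `PLAYBOOK` and `plan_for` are hypothetical names, and the other levels are left undefined on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    owner: str            # who manages the response
    channel: str          # how the team communicates
    allowed_actions: str  # what responders may do to clear the issue
    reporting: str        # how the incident is reported and tracked


# Only the two levels described in the article are filled in.
PLAYBOOK = {
    1: ResponsePlan(
        owner="department head or designated management team member",
        channel="open call bridge",
        allowed_actions="all measures, including restarting production processes",
        reporting="hourly reports to management",
    ),
    5: ResponsePlan(
        owner="engineer or developer",
        channel="incident tracking system",
        allowed_actions="production changes only during a change window",
        reporting="incident or software tracking system",
    ),
}


def plan_for(severity: int) -> ResponsePlan:
    """Look up the pre-agreed plan; fail loudly if a level has no plan yet."""
    try:
        return PLAYBOOK[severity]
    except KeyError:
        raise LookupError(f"No response plan defined for SEV {severity}") from None
```

Raising an error for an undefined level is a deliberate choice here: a missing plan is something you want to discover while writing the playbook, not during an incident.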
Severity levels are a common reference for everyone involved in responding to incidents. With an assigned level and a clear set of procedures, the right teams get to work on clearing the issue. Without them, you'll either lose time working out the rules of engagement or create more issues by not having them.
Incident severity vs priority: Are they the same?
From a distance, severity and priority look like the same thing. If you have a SEV 1 incident, it's obvious that you're going to clear it before a SEV 2, so what's the difference between severity and priority?
- Severity measures impact. If you look at the level definitions above, they describe the levels in terms of how they affect the user community or the services.
- Priority measures urgency. It tells you how quickly you need to fix an issue and which issue you need to address first.
Priority and severity often match up perfectly. An outage that prevents all users from using a service is both high priority and SEV 1. This is an example of technical issues and business priorities being in alignment. But sometimes these priorities don't align:
- Getting a new feature into production might be a high priority, but the service is working fine. There is no incident, so no severity level applies.
- A mobile application may have an embarrassing typo, and it needs to be fixed ASAP. The technology team may classify it as a SEV 5 incident, but it's also high priority.
Even though these two classifications can be at odds, they're both important methods of communication. Severity tells stakeholders how serious an issue is. Priority tells technology staff what they need to work on next.
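To keep the two ideas separate in a tracking system, you might record them as independent fields and order the work queue by priority. This is a minimal sketch: the `Incident` class, the numeric priority scale (1 is most urgent) and the sample incidents are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: int  # impact: 1 is the most severe
    priority: int  # urgency: 1 is the most urgent (hypothetical scale)


def work_queue(incidents):
    """Order work by priority first; severity only breaks ties."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


incidents = [
    Incident("Heavy load on reporting service", severity=3, priority=3),
    Incident("Embarrassing typo in mobile app", severity=5, priority=1),
    Incident("Outage for all users", severity=1, priority=1),
]

for incident in work_queue(incidents):
    print(f"P{incident.priority} SEV {incident.severity}: {incident.title}")
```

In this example the typo is handled ahead of the SEV 3 load problem even though its severity is lower, which is the mismatch described above.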
Defining incident severity levels: Best practices
Incident severity levels are a simple enough concept. Unfortunately, simple doesn’t mean easy to implement. You can't copy them from a blog post or white paper and immediately put them into use. You need to adapt them to your organization by taking several factors into consideration, such as:
- Your user community
- Your software and hardware systems
- Your business requirements
Still, these best practices can help your organization define (and adhere to) incident severity levels.
Uniformity is key
Best practice: Adopt a unified set of levels and descriptions for your entire company.
Using different incident severity levels for different applications or software stacks, especially if you're in a large organization, might look like a good idea. But it will undermine one of the biggest benefits of creating the levels in the first place: clear communication about incidents. Different levels or definitions will make it hard for stakeholders to understand what an incident means. It may even confuse engineers and developers who work on different applications.
Keep it simple
Best practice: Use the smallest number of severity levels you can. No more, no less.
Too many levels will quickly become confusing. One reason incident severity levels exist is so that when an incident occurs, you can assign it a level and get to work. Too many levels will slow this down. Too few will lead to lumping incidents together. Subtle (or even not so subtle) differences between incidents will disappear when they're forced into the same category.
How do you get it right? Get the stakeholders together and come up with a plan. Go over past incidents and see how they fit into a proposed framework. Examine previous root cause analyses. Try it out and don't be afraid to change your scheme if you need to.
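Replaying past incidents against a proposed scheme can be as simple as counting how many land in each level. The sketch below assumes a made-up history in which each incident is reduced to the percentage of users affected; the cut-off values are placeholders, not recommendations.

```python
from collections import Counter

# Hypothetical history: percentage of users affected in each past incident.
PAST_INCIDENTS = [0.5, 2, 3, 8, 12, 35, 60, 90]


def level_for(pct_affected, cutoffs):
    """Return a 1-based level; `cutoffs` are listed from most to least severe."""
    for level, cutoff in enumerate(cutoffs, start=1):
        if pct_affected >= cutoff:
            return level
    return len(cutoffs) + 1


def distribution(history, cutoffs):
    """Count incidents per level, including levels that received none."""
    counts = Counter(level_for(p, cutoffs) for p in history)
    return {level: counts.get(level, 0) for level in range(1, len(cutoffs) + 2)}


three_levels = distribution(PAST_INCIDENTS, cutoffs=[50, 10])
five_levels = distribution(PAST_INCIDENTS, cutoffs=[50, 25, 10, 5])
print("3 levels:", three_levels)
print("5 levels:", five_levels)
```

A level that collects most of the history suggests incidents are being lumped together, while levels that stay nearly empty suggest the scheme has more categories than you need.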
Create clear guidelines for assigning severity levels
Best practice: Make it easy to assign severity levels.
If your organization can't quickly assign the right severity level to an incident, you won't reap the advantages of having a system in place. So, you need specific rules that make the right level not just easy to assign, but self-evident. You don't want to waste time arguing over the severity of an incident.
You need to designate the level and get to work. So, create rules that rely on measurable impact, such as:
- The percentage of clients affected
- Feature categories
- System impact
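Rules like these can be written down as a small decision function so that the same inputs always yield the same level. The following is a sketch under stated assumptions: the `Impact` fields, the 50% and 10% thresholds and the five-level scale are all placeholders that you would replace with your own definitions.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    pct_clients_affected: float  # percentage of clients affected, 0-100
    core_feature: bool           # does it hit a core feature category?
    system_degraded: bool        # e.g. heavy load or elevated error rates


def assign_severity(impact: Impact) -> int:
    """Map measurable impact to a level; all thresholds are placeholders."""
    if impact.core_feature and impact.pct_clients_affected >= 50:
        return 1  # core feature down for most clients
    if impact.core_feature and impact.pct_clients_affected > 0:
        return 2  # core feature down for a limited set of clients
    if impact.system_degraded or impact.pct_clients_affected >= 10:
        return 3  # errors, heavy load, or a sizeable non-core problem
    if impact.pct_clients_affected > 0:
        return 4  # minor problem noticed by a few clients
    return 5      # low-level deficiency with no measurable client impact
```

For example, `assign_severity(Impact(80, True, False))` returns 1, while `assign_severity(Impact(2, False, False))` returns 4. Because every branch depends on a measurable input, there is little left to argue about once the numbers are known.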
Using incident severity levels
Now you've got a great understanding of incident severity levels and how to use them. Effectively, these levels are communication tools, so you can share the impact of a problem and quickly get the right teams engaged to solve it. Of course, severity and priority are related in incidents, but they are still very different.
This posting does not necessarily represent Splunk's position, strategies or opinion.