How do we know what an incident is or when it is appropriate to perform a post-incident review?
An incident is any unplanned event or condition that places the system in a negative or undesired state. The most extreme example of this is a complete outage or disruption in service. Whether it is an ecommerce website, a banking app, or a subcomponent of a larger system, if something has happened causing the operability, availability, or reliability to decrease, this is considered an incident.
In short, an incident can be defined as an unplanned interruption in service or a reduction in the quality of a service.
A standard classification of incidents helps teams and organizations share the same language and expectations regarding incident response. Priority levels as well as an estimation of severity help teams understand what they are dealing with and the appropriate next steps.
Priority
To categorize types of incidents, their impact on the system or service, and how they should be actively addressed, a priority level is assigned. One common categorization of those levels is:
Severe incidents are assigned the critical priority, while minor problems or failures that have well-established redundancy are typically identified with a warning priority. Incidents that are unactionable or false alarms have the lowest priority and should be identified simply as information to be included in the post-incident review.
First responders should never be alerted to unactionable conditions. For the health and sanity of our teams, priority and severity levels help everyone set expectations.
Categorization of severity levels often varies depending on industry and company culture. One example of labeling severity levels is as follows:
• Sev1 (Critical)—Complete outage
• Sev2 (Critical)—Major functionality broken and revenue affected
• Sev3 (Warning)—Minor problem
• Sev4 (Warning)—Redundant component failure
• Sev5 (Info)—False alarm or unactionable alert
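As a minimal sketch (not from the source), the severity-to-priority mapping above could be encoded so that paging decisions fall out of it directly. The names and structure here are hypothetical:

```python
from enum import Enum

class Priority(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"

# Hypothetical mapping of the severity labels above to priority levels.
SEVERITY_PRIORITY = {
    1: Priority.CRITICAL,  # Sev1: complete outage
    2: Priority.CRITICAL,  # Sev2: major functionality broken, revenue affected
    3: Priority.WARNING,   # Sev3: minor problem
    4: Priority.WARNING,   # Sev4: redundant component failure
    5: Priority.INFO,      # Sev5: false alarm or unactionable alert
}

def should_page_responder(severity: int) -> bool:
    """Only actionable incidents should alert a first responder;
    Info-level events are recorded for the post-incident review."""
    return SEVERITY_PRIORITY[severity] is not Priority.INFO
```

Encoding the policy as data rather than ad hoc judgment is one way to give teams the shared language and expectations the text describes.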
Noise and unactionable alerts account for approximately half of the “top-reported problems” of life on-call.
One way to assess the health and maturity of an organization with respect to incident management, post-incident analysis, and the phases outlined in this section is to dissect the flow and properties of incidents. One method to do this is J. Paul Reed and Kevina Finn-Braun’s Extended Dreyfus Model for Incident Lifecycles, which describes the typical language and behaviors used by various people and teams in an organization, from a novice practitioner to an advanced one. (It should be noted that organizations and teams can fall into different categories for each of the stages of the incident lifecycle.)
Five phases make up the entire lifecycle of an incident. Each phase may vary in degree of maturity with respect to the tools, processes, and human elements in place. Some teams may be far advanced in detection but weak and therefore beginners in response and remediation. Typically, these teams have established well-defined processes to detect failures but need to focus improvement efforts on responding to and recovering from problems.
Knowing about a problem is the initial step of the incident lifecycle. When it comes to maintaining service availability, being aware of problems quickly is essential. Monitoring and anomaly detection tools are the means by which service owners or IT professionals keep track of a system’s health and availability. Regularly adjusting monitoring thresholds and objectives ensures that the “time to know” phase of an incident is continuously improved upon, decreasing the overall impact of a problem. A common approach to monitoring is to examine conditions and preferred states or values. When thresholds are exceeded, email alerts are triggered. However, a detection and alerting process that requires an individual to read and interpret an email to determine if action needs to be taken is not only difficult to manage, it’s impossible to scale. Humans should be notified effectively, but only when they need to take action.
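The principle above, that thresholds trigger detection but humans should only be notified when action is required, can be sketched as a simple alert router. This is an illustrative example, not a real monitoring tool's API; all names are hypothetical:

```python
# Hypothetical alert router: evaluate a metric against its threshold and
# notify a human only when the breach is actionable.

def evaluate(metric_name, value, threshold, actionable):
    if value <= threshold:
        return None  # preferred state; nothing to do
    if actionable:
        # A first responder needs to act: page, don't email.
        return {"action": "page", "metric": metric_name, "value": value}
    # Unactionable breaches are logged for the post-incident review,
    # never sent to a human.
    return {"action": "log", "metric": metric_name, "value": value}
```

The design choice mirrors the text: the system, not a person reading email, decides whether a condition warrants waking someone up.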
Once a problem is known, the next critical step of an incident’s lifecycle is the response phase. This phase typically accounts for nearly three-quarters of the entire lifecycle.
Establishing the severity and priority of the problem, followed by investigation and identification, helps teams formulate the appropriate response and remediation steps. Consistent, well-defined response tactics go a long way toward reducing the impact of a service disruption.
Responding to problems of varying degree is not random or rare work. It is something that is done all the time. It is this responsiveness that is the source of reliability.
TIP: Examine the “response” rather than focusing solely on the outcome (i.e., cause or corrective action).
In organizations that implement post-incident reviews, tribal knowledge, or critical information that often only resides in the heads of a few, is spread beyond subject matter experts. First responders have relevant context to the problem as well as documentation and access to systems to begin investigation. As new information is identified and shared with the greater group to create a shared contextual awareness of the situation, roles such as Incident Commanders often prove to be helpful in establishing effective and clear communication (especially between large teams or in the case of cascading failures). As organizations mature in their responses, teams begin to track metrics regarding the elapsed time of the incident. In an effort to improve the average duration it takes to recover from a service disruption (i.e., mean time to recover), teams find improved ways to collaborate and understand how best to respond to a problem. If there’s a phase in which to identify areas of improvement, it’s this one, in terms of how teams form and begin to manage the incident.
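The mean time to recover metric mentioned above is straightforward to compute from incident records. A minimal sketch, assuming each record carries detection and recovery timestamps (the field names are made up for illustration):

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_time_to_recover(incidents):
    """MTTR: average elapsed time from detection to recovery."""
    durations = [
        (i["recovered"] - i["detected"]).total_seconds() for i in incidents
    ]
    return timedelta(seconds=mean(durations))

# Example records: one 30-minute incident and one 60-minute incident.
incidents = [
    {"detected": datetime(2017, 5, 1, 9, 0),
     "recovered": datetime(2017, 5, 1, 9, 30)},
    {"detected": datetime(2017, 5, 2, 14, 0),
     "recovered": datetime(2017, 5, 2, 15, 0)},
]
```

For the sample data, `mean_time_to_recover(incidents)` works out to 45 minutes. Tracking this over time shows whether improvements to response and collaboration are actually shortening incidents.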
A well-crafted post-incident review will uncover the improvements that can be made in team formation and response. By focusing less on the cause of problems and more on how well teams formed, areas of improvement are continuously identified; this emphasis on improving the response to incidents indicates a solid understanding of “systems thinking.” A proactive response plan provides the most effective method of decreasing downtime.
The response phase can be unpacked even further into the following stages:
• Triage (What's going on?)
• Investigation (How bad is it?)
• Identification (How long has it been occurring?)
We don’t know what we don’t know. If detection and response efforts are chaotic and undefined, the resulting remediation phase will be just as unsuccessful. In many situations, remediation may start with filing a ticket as part of established procedures. Without a sense of urgency, recovering from a service disruption as quickly as possible is not made a priority. Knowledge is not shared, improvements to the system as a whole are not made, and teams find themselves in a break/fix cycle. This chaotic firefighting is no way to approach remediation.
In the following chapter, we’ll read about how CSG International approached their own problem with a lack of knowledge sharing. They were able to improve on this shortcoming as well as identify bad practices they had in place. Along with increasing the sharing of knowledge, these improvements helped to shape their company culture.
Through post-incident reviews, teams look to understand more regarding contributing factors of incidents during remediation. A small change or “fix” in one area may have larger implications elsewhere. Reliance on ticketing systems to actively “manage” incidents begins to fade as teams begin to realize that not only do incidents need to be managed in real time, but a deeper discussion is required to fully address a problem. A stronger sense of ownership for the availability of a service leads to a proper sense of urgency and a system of prioritization, allowing teams to remediate effectively and continuously improve their methods. A belief that strength is greatest in a collective team and a desire for open sharing of information and knowledge are part of the evolving culture.
Focusing on response efforts feeds the evolution as more aspects of the incident lifecycle are improved. Detection and response methods evolve, reducing cognitive load and improving response times.
Methods of automating much of the remediation phase become central to improvements, focusing more on how humans can be involved only when required. Service resiliency and “uptime” become byproducts of continuously improving aspects of the detection, response, and remediation phases, rather than attempting to protect systems.
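The idea of involving humans only when required can be sketched as a runbook of automated remediation steps with human escalation as the fallback. This is an illustrative pattern under assumed names, not a specific tool's API:

```python
# Hypothetical automated remediation: attempt each runbook step in order;
# escalate to a human only when automation is exhausted.

def remediate(incident, runbook, page_human):
    for step in runbook:          # each step returns True on success
        if step(incident):
            return f"auto-remediated by {step.__name__}"
    return page_human(incident)   # automation failed; a human must act

# Example step: a (stubbed) service restart that always succeeds.
def restart_service(incident):
    return True
```

Automation handles the known, rehearsed fixes; the pager fires only for problems that genuinely need human judgment.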
The analysis phase of an incident is often missing entirely in organizations with low maturity in preceding phases of the lifecycle. Frequently considered a low-priority exercise, “day job” responsibilities commonly prevent analysis from taking place. Additionally, a fear of reprimand for involvement in an outage prevents many from sharing critical details that are relevant to not only the problem itself, but the response. Teams feel that their jobs are on the line when incidents happen as it is common for blame to be assigned to individuals, incentivizing them to keep useful and sometimes critical information to themselves. As teams mature, a tendency to ask more about “how” something happened than “why” helps avoid such negative incentive structures.
Root cause analysis becomes less common once teams recognize the difference between simple and complex systems. Achieving high availability in complex systems requires a different approach.
Clarification that learning from an incident is the top priority helps to establish expectations for any analysis. Gaining a diverse and multifaceted understanding of the incident as it cycles through the phases helps to uncover new and interesting ways to answer the questions “How do we know about the problem faster?” and “How do we recover from the problem faster?”
Embracing human error is part of a healthy and mature understanding of complex IT environments. Successful analysis requires accurate and complete data to thoroughly discuss and scrutinize. This allows for a better understanding of incident management throughout the lifecycle. A clear path and a focus on continuous improvement of managing the entire incident lifecycle helps build resilient systems.
Organizations have gone to great lengths to try to predict and prevent disruptions to service. In simple systems where causation and correlation are obvious, outages may be reduced by looking for known patterns that have previously led to failure.
However, in complex systems such obvious relationships and patterns are only understood in hindsight. As a result, mature organizations understand that prediction and prevention of service disruptions is practically impossible. This is not to say that efforts should not be made to identify patterns and act accordingly. However, relying heavily on spotting and avoiding problems before they happen is not a viable long-term or scalable solution.
Instead, teams begin to take more of a “readiness” approach rather than concentrating on “prevention.” Efforts become focused more on reviewing documentation, response processes, and metrics in a day-to-day context.
Signs that an organizational culture of continuous improvement is gaining traction include updating out-of-date documentation or resources as they are found. Providing the team and individuals with the space and resources to improve methods of collaborating over important data and conversations brings the greatest value in terms of readiness and, as a result, leads to highly resilient systems with greater availability.
As teams and organizations mature, crew formation and dissolution are considered the top priority in responding to, addressing, and resolving incidents. Frequent rehearsal of team formation means teams are well practiced when problems inevitably emerge. Intentionally creating service disruptions in a safe environment assists in building muscle memory around how to respond to problems.
This provides great metrics to review and analyze for further improvements to the various phases of an incident’s lifecycle. Creating failover systems, tuning caching services, rewriting small parts of code: all are informed by analysis and prepare the team for larger challenges. Service disruptions cannot be prevented entirely; the only logical path toward providing highly available systems is for teams to collaborate and continuously learn and improve the way they respond to incidents.
We use the idea of “readiness” rather than “prevention” for this final phase because our focus is on learning and improvement rather than prediction and prevention. This is a distinction from the Dreyfus model mentioned earlier.
These five distinct phases make up the incident lifecycle in its entirety. Post-incident reviews are reflected specifically in the “analysis” phase of an incident, but the learnings that come from them are fed back in to improve each phase of the lifecycle. This is why we value learning over identifying the root cause of an incident: by focusing solely on determining the cause of problems large and small, we miss the opportunity to learn more about possible improvements across all phases of an incident and the system as a whole.
It is within these phases and subphases that we’ll focus our efforts on what to learn and how to improve. By establishing the goals of what we can learn to help us “know sooner” and “recover sooner,” we will uncover action items to shorten the impact of a service disruption. In the following chapter, I’ll show you how to conduct your own post-incident review.