Successful organizations understand that continued success depends on a core value, lived daily, of continuous improvement. Entering a new market or defending an existing position requires a constant evaluation of the current state.
Retrospective analysis has become a commonplace exercise in many organizations, particularly as IT orgs have begun to fully adopt Agile software development and DevOps principles. Shortened software development and delivery cycles mean tighter feedback loops for understanding what went well, what didn’t, and how we can do better next time. In fact, the shrinking time available to identify problems, investigate options, and take action is a key reason why post-incident reviews have become so important and prevalent in modern IT organizations.
Continuously improving development practices, delivery pipelines, and operational techniques creates the best environment for business success. Long cycle times to produce new features and lengthy service disruptions are common in organizations that have not yet begun their journey into modern IT incident management and retrospective analysis. Making the idea of continuous improvement a central part of company culture provides extraordinary value across the entire organization.
Identifying what to improve can often feel like an exercise in who has the most authority. With so many areas of effort and concern in an IT organization, the decision of where to begin may simply come from those in senior positions rather than from the individuals closest to the work. This isn’t wholly unjustified: practitioners of the work itself may struggle to see the forest for the trees. Without the presence of generalists or consultants trained in seeing the big picture, identifying the work that brings the most value can be challenging.
Understanding the entire process, from idea to customer usage, requires a close examination of every step in the lifecycle. This makes it possible to identify inefficiency and friction and examine them further.
Creating a value stream map, a method for visualizing and analyzing the current state of how value moves through the system, is a common practice to assist in understanding this flow. Mapping the entire process and the time between handoffs not only helps all stakeholders have a much larger and clearer picture, but begins to surface areas of possible improvement. Finding ways to trim the time it takes to move through the value stream paves the path toward continuous delivery and highly available systems.
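The arithmetic behind a value stream map is simple: record when each handoff occurs, then look at how long work waits between stages. The sketch below assumes a hypothetical set of timestamped handoffs; the stage names and times are illustrative, not drawn from any real pipeline.

```python
from datetime import datetime

# Hypothetical handoff timestamps for one change moving through the value
# stream. Stage names and times are invented for illustration.
handoffs = [
    ("idea approved",      datetime(2017, 5, 1, 9, 0)),
    ("development done",   datetime(2017, 5, 3, 15, 0)),
    ("code review passed", datetime(2017, 5, 4, 11, 0)),
    ("deployed",           datetime(2017, 5, 5, 10, 0)),
]

# The gap between consecutive handoffs shows where work waits the longest.
for (stage, start), (next_stage, end) in zip(handoffs, handoffs[1:]):
    print(f"{stage} -> {next_stage}: {end - start}")

# Total lead time: first handoff to last.
total = handoffs[-1][1] - handoffs[0][1]
print(f"total lead time: {total}")
```

Even a rough map like this tends to surface one or two stages where work sits far longer than anywhere else, which is where improvement effort pays off first.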
Post-incident reviews work in much the same way. When we are responsible for maintaining the availability and reliability of a service, anything that prevents us from knowing about a problem is friction in the system. Likewise, delays in efforts to remediate and resolve service disruptions are considered waste.
Identifying impediments, such as delays in contacting the right individual during the detection phase of the incident timeline, sets the stage for a clear direction on not only what needs to be improved but where to begin. Regardless of where friction exists in the overall continuous delivery lifecycle, everything that makes it into a production environment and is relied on by others started in the mind of an individual or team. Focusing early and often on where the most value can be gained across the entire process reveals your path of continuous improvement.
We will explore flow and waste in the value stream again later when we break down the components of a post-incident review. These elements will emerge as we plot tasks along the timeline, noting their impact on service recovery and how much time elapsed. Exposing and addressing waste in the phases of detection, response, and remediation leads to higher availability.
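Plotting tasks along the incident timeline amounts to the same kind of arithmetic as the value stream map, applied to the detection, response, and remediation phases. The event names and times below are hypothetical, chosen only to show how each phase's elapsed time falls out of the timeline.

```python
from datetime import datetime, timedelta

# Illustrative incident timeline; all events and times are invented.
events = {
    "failure occurred":  datetime(2017, 6, 1, 2, 0),
    "alert fired":       datetime(2017, 6, 1, 2, 25),
    "responder engaged": datetime(2017, 6, 1, 2, 50),
    "service restored":  datetime(2017, 6, 1, 4, 10),
}

# Each phase is the gap between two timeline events.
detection   = events["alert fired"] - events["failure occurred"]
response    = events["responder engaged"] - events["alert fired"]
remediation = events["service restored"] - events["responder engaged"]

print(f"detection:   {detection}")    # time we didn't know about the problem
print(f"response:    {response}")     # time to get the right people engaged
print(f"remediation: {remediation}")  # time spent restoring service
```

Laying the phases out this way makes the waste visible: here, the failure went unnoticed for 25 minutes before anyone was even paged.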
The key to successful continuous improvement is reducing the time it takes to learn. Delaying the time it takes to observe the outcomes of efforts—or worse, not making an effort to receive feedback about the current conditions—means we don’t know what’s working and what’s not. We don’t know what direction our systems are moving in. As a result, we lose touch with the current state of our three key elements: people, process, and technology.
Think about how easy it is for assumptions to exist about things as simple as monitoring and logging, on-call rotations, or even the configuration of paging and escalation policies. The case study in Chapter 6 will illustrate a way in which we can stay in touch with those elements and more easily avoid blind spots in the bigger picture.
Post-incident reviews help to eliminate assumptions and increase our confidence, all while lessening the time and effort involved in obtaining feedback and accelerating the improvement efforts.
Feedback loops exist in many forms. Monitoring of systems, customer sentiment, and improvements to incident response are only a few forms of feedback loops many companies rely on.
Retrospective analysis is a common practice in many IT organizations. Agile best practices suggest performing retros after each development sprint to understand in detail what worked, what didn’t, and what should be changed. Conditions and priorities shift; teams (and the entire organization) aren’t locked into one specific way of working. Likewise, they avoid committing to projects that may turn out to be no longer useful or no longer on a path to providing value to the business.
It’s important to note that these retrospectives include the things that worked, as well as what didn’t quite hit the mark. A common pitfall of post-incident review efforts is focusing solely on the things that went poorly. There is much to learn from what went well when it comes to identifying and recovering from a problem.
In an attempt to place emphasis on the importance of understanding the good and the bad in a post-incident analysis, many teams refer to the exercise as simply a learning review. With the agenda spelled out in the name, it is clear right from the beginning that the purpose of taking the time to retrospectively analyze information regarding a service disruption is to learn. Understanding as much about the system as possible is the best foundation for building and operating it.
Regardless of what you call these reviews, it is important to clarify their intention and purpose. A lot of value comes from a well-executed post-incident review. Not only are areas of improvement identified and prioritized internally, but respect from customers and the industry as a whole emerges due to the transparency and altruism of sharing useful information publicly.
These reviews go by many names, including:
• Incident Debriefing
• After Action Report
• Rapid Improvement Event
Many call these reviews simply postmortems. However, this type of language insinuates an investigation into the “cause of death” (i.e., failure), which is counter to the true purpose of the analysis. The idea of death sets the wrong tone and allows fear to bias or persuade individuals toward self-preservation and justification rather than encouraging genuine transparency and learning. Thus, in this book I have chosen to discuss a type of learning review known as a post-incident review.
We had no label for it in my opening story. All I could have told you at the time was that I wasn’t scared to tell everyone what had happened, I was never blamed, and when it was all over and done, we went about our day like normal, with a broader understanding of the system and holding a list of action items to make it even better.
What you choose to call the analysis is up to you. Understanding the value the process provides, regardless of naming convention, is the true takeaway from this book. Don’t get hung up on the name, so long as you follow the principles laid out.
It is important to clarify the objective of the analysis as an exercise in learning. It is common to approach these exercises much like troubleshooting failures in simple systems, such as manufacturing and assembly systems, and to focus the effort on identifying the cause. In simple, linear systems, cause and effect are obvious and the relationships between components of the system are clear. Failures can be linked directly back to a single point of failure. As the Cynefin complexity framework makes clear, this in no way describes the systems we build and operate in IT. Complexity forces us to take a new approach to operating and supporting systems: one that looks more toward what can be learned rather than what broke and how it can be prevented from ever happening again.
Depending on the scope and scale of issues that may contribute to IT problems, information can be gained from a variety of areas in not only the IT org, but the entire company. Learning becomes the differentiating factor between a high-performing team and a low-performing team.
Shortening the feedback loops to learn new and important things about the nature of the system and how it behaves under certain circumstances is the greatest method of improving the availability of the service.
To learn what will contribute the most to improvement efforts, we begin by asking two questions—questions designed to get right to the heart of shortening critical feedback loops. Answering the first two questions leads to the third and most important of the group:
• How do we know sooner?
• How do we recover sooner?
• How (specifically) will we improve?
How do we know sooner?
The longer it takes to know about a problem, the lengthier the delay in restoration of services will be. Because of this, one of the main takeaways of any post-incident review should be an answer to the question, “How do we know sooner?” Here are just a few areas within the detection phase where improvement opportunities for technology, process, and people may exist:
• Monitoring and logging
• On-call rotations
• Humane paging policies
• Documentation and runbooks
• Communication and collaboration methods
• Escalation paths
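Escalation paths in particular are easy to reason about as a simple ordered policy: page each target in turn, and escalate when an acknowledgment doesn't arrive in time. The sketch below is a minimal illustration of that logic; the targets, timeouts, and `notify`/`wait_for_ack` callables are hypothetical stand-ins, not any real paging service's API.

```python
# Hypothetical escalation policy: targets are paged in order, escalating
# whenever an acknowledgment doesn't arrive within the step's timeout.
ESCALATION_PATH = [
    {"target": "primary on-call",   "timeout_s": 300},
    {"target": "secondary on-call", "timeout_s": 300},
    {"target": "team lead",         "timeout_s": 600},
]

def page_with_escalation(incident, notify, wait_for_ack):
    """Return the target that acknowledged, or None if the path is exhausted.

    notify(target, incident) sends the page; wait_for_ack(target, timeout_s)
    returns True if the target acknowledged within the timeout. Both are
    injected so the policy itself stays independent of any paging tool.
    """
    for step in ESCALATION_PATH:
        notify(step["target"], incident)
        if wait_for_ack(step["target"], step["timeout_s"]):
            return step["target"]
    return None
```

Reviewing an incident against a policy like this makes questions concrete: was the primary ever paged, how long did each timeout cost, and did the path have enough depth?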
In Chapter 6, we’ll dive into a fictional story illustrating a service disruption. As you follow along with the recovery efforts and analysis on day two, notice how a list of improvements within the detection phase of the incident is produced that includes several items mentioned in the preceding list. As with most post-incident reviews, action items from that case study included work to shorten the time it takes to know about a problem. The sooner we know, the sooner we can do something about it.
With the always-changing nature of IT systems, it is extremely important to have a finger on the pulse of the health and state of a system. Shortening the time it takes to detect and alert first responders to a problem, be it potential or active, means that teams may in fact be able to reduce the impact of the problem or prevent it from affecting end users altogether.
A seasoned engineering team will always make answering this question its highest priority.
Once we are aware of problems, such as missing backup data required for a successful cloud migration, the recovery process begins, focusing on elements of response and remediation. Within each of these efforts lies data that, when analyzed, will point to areas of improvement as well. Understanding how teams form, communicate, and dig into the related observable data always uncovers opportunities for improvement. It’s not good enough to only be paged about a problem. Detecting and alerting is an important first step of restoring service, but a very small piece of the larger availability picture.
A majority of remediation efforts are spent investigating and identifying the problem. Once the data has been explored, allowing theories and hypotheses to be built, efforts shift to actually repairing and restoring the services.
Take note in the next chapter’s case study of how the time required to restore services is extended due to poor availability of documentation. Efforts to restore service often can’t begin until information on access and triaging procedures has been obtained. Not every problem has been seen before, either. Documentation may not exist. If it doesn’t, how can that be confirmed early on to prevent engineers from wasting time searching for something that does not exist? Again, this phase of recovery typically holds many areas of potential improvement if we just take the right approach to discovering them. Identifying weaknesses, such as a lack of monitoring or incomplete information about and access to parts of the system, should be a priority.
Multiple areas of the incident lifecycle will prove to be excellent places to focus continuous improvement efforts. The return on investment for IT orgs is the ability to recover from outages much sooner. The 2017 State of DevOps Report found that high-performing organizations had a mean time to recover (MTTR) from downtime 96 times faster than that of low performers.
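MTTR itself is a simple aggregate: the mean of time-to-recover across incidents over some window. Definitions vary between organizations, so treat the sketch below, with invented durations, as one common reading rather than a standard formula.

```python
from datetime import timedelta

# Hypothetical incident durations (detection through restoration)
# for one reporting period; the values are invented for illustration.
downtimes = [
    timedelta(minutes=18),
    timedelta(hours=2, minutes=5),
    timedelta(minutes=42),
]

# MTTR as the arithmetic mean of time-to-recover across incidents.
mttr = sum(downtimes, timedelta()) / len(downtimes)
print(f"MTTR: {mttr}")
```

Tracking this number across review cycles is one way to check whether the improvements coming out of post-incident reviews are actually shortening recovery.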
Scrutinizing the details related to recovery helps to expose methods that can help achieve that improvement.
By focusing on the two previous questions, several areas of improvement should emerge within the detection, response, and remediation phases of the incident lifecycle. The next thing to do is to establish how we are going to go about making these improvements.
Action is required in order to fully begin the journey of learning and improvement. Thus, post-incident reviews mustn’t be reduced to something that more closely resembles a debate or blame game, with little to no focus on actively prioritizing improvements to the system. Nor is this simply an exercise in “finding the cause and fixing it.”
The aim is to identify actionable improvements: enhancements or countermeasures to minimize or offset the impact of a future similar occurrence. Regardless of the approach, the goal is to make improvements to the system that will increase its availability and reliability.
Traditional approaches to post-incident analysis are often unpleasant exercises. It’s common to see an opportunity to learn more about the system wasted because excessive time was spent debating the cause and who should be “held accountable.”
In so doing, not only do we narrow our opportunity for learning, but we create an environment in which it’s easy to point the finger of blame toward one another, or even ourselves.
The systems thinking approach to IT-related service disruptions and the retrospective analysis is best understood with an example. We turn to that next.