All companies rely on technology in numerous and growing ways. Increasingly, the systems that make up these technologies are expected to be “highly available,” particularly when the accessibility and reliability of the system tie directly to the organization’s bottom line. For a growing number of companies, the modern world expects 100% uptime and around-the-clock support of their services.
Advancements in the way we develop, deliver, and support IT services within an organization are a primary concern of today’s technology leaders. Delivering innovative and helpful service to internal and external stakeholders at a high level of quality, reliability, and availability is what DevOps has set out to achieve. But with these advancements come new challenges. How can we ensure that our systems not only provide the level of service we have promised our customers, but also continue a trajectory of improvement in terms of people, process, and technology so that these key elements may interact to help detect and avoid problems, as well as solve them faster when they happen?
Anyone who has worked with technology can attest that eventually something will go wrong. It’s not a matter of if, but when. And there is a direct relationship between the size and complexity of our systems and the variety of factors that contribute to both working and non-working IT services.
In most companies, success and leadership are about maintaining control: control of processes, technologies, and even people. Most of what has been taught or seen in practice follows a command and control pattern. Sometimes referred to as the leader–follower structure, this is a remnant of a labor model that was successful when mankind’s primary work was physical.
In IT, as well as many other roles and industries, the primary output and therefore most important work we do is cognitive. It’s no wonder the leader–follower structure L. David Marquet describes in Turn the Ship Around! has failed when imposed in modern IT organizations. It isn’t optimal for our type of work. It limits decision-making authority and provides no incentive for individuals to do their best work and excel in their roles and responsibilities.
Initiative is effectively removed as everyone is stripped of the opportunity to utilize their imagination and skills. In other words, nobody ever performs at their full potential.
This old-view approach to management, rooted in physical labor models, doesn’t work for IT systems, and neither does the concept of “control,” despite our holding on to the notion that it is possible.
The degree of control we have depends on the information available and the scale or resolution at which we are able to perceive it. Predictability and interaction are the necessary components of control.
Unfortunately, we never have absolute certainty at all scales of resolution. Information we perceive is limited by our ability to probe systems as we interact with them. We can’t possibly know all there is to know about a system at every level. There is no chance of control in the IT systems we work on. We are forced to make the most of unavoidable uncertainty.
To manage the uncertain and make IT systems more flexible and reliable, we apply new models of post-incident analysis. This approach means we gather and combine quantitative and qualitative data, analyze and discuss the findings, and allow theories from unbiased perspectives regarding “normal” (but always changing) behaviors and states of our systems to form, spreading knowledge about how the system works.
This growth mindset approach encourages teams to carve out time for retrospection in order to analyze what has taken place in as much detail as possible. Along the way, specific details are surfaced about how things transpired and what exactly took place during recovery efforts. Those may include:
• A detailed timeline of events describing what took place during remediation
• Key findings that led to a deeper and unified understanding and theory of state and behavior with regard to people, process, and technology
• An understanding of how these elements relate to the evolving systems we build, deliver, operate, and support
These details are analyzed more deeply to discover ways to know about and recover from problems sooner. A larger understanding of the system as well as everything that contributed to a specific incident being discussed emerges as the timeline comes together. Thus, details for actionable improvements in many areas are provided along an incident’s timeline.
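The timeline described above lends itself to a simple data structure. The following is a minimal Python sketch, not a prescribed tool: the `TimelineEvent` fields, the helper names, and the sample events are all hypothetical and invented for illustration. It orders events chronologically and reports the longest gap between consecutive entries, which is one possible place to start looking for ways to detect or recover sooner.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in an incident timeline (field names are illustrative)."""
    at: datetime   # when it happened
    actor: str     # person or system involved
    note: str      # what was observed or done


def sorted_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.at)


def gaps(events):
    """Return (earlier, later, elapsed) for each pair of consecutive events."""
    ordered = sorted_timeline(events)
    return [(a, b, b.at - a.at) for a, b in zip(ordered, ordered[1:])]


def longest_gap(events):
    """Return the consecutive pair with the largest elapsed time, or None."""
    pairs = gaps(events)
    return max(pairs, key=lambda p: p[2]) if pairs else None


# Hypothetical sample data, invented purely for illustration.
events = [
    TimelineEvent(datetime(2024, 1, 1, 9, 0), "monitoring", "alert fired"),
    TimelineEvent(datetime(2024, 1, 1, 9, 25), "on-call engineer", "alert acknowledged"),
    TimelineEvent(datetime(2024, 1, 1, 9, 40), "on-call engineer", "rollback started"),
    TimelineEvent(datetime(2024, 1, 1, 10, 0), "on-call engineer", "service restored"),
]

earlier, later, elapsed = longest_gap(events)
print(f"Longest gap: {elapsed} between '{earlier.note}' and '{later.note}'")
```

With this sample data, the longest gap (25 minutes) falls between the alert firing and its acknowledgment, which would steer the conversation toward detection and paging rather than toward the fix itself. A gap is only a prompt for discussion; the qualitative account of what people saw and decided during that time still has to come from the review.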
Nearly all practitioners of DevOps accept that complicated, complex, and chaotic behaviors and states within systems are “normal.” As a result, there are aspects to the state and behavior of code and infrastructure that can only be understood in retrospect. To add to the already non-simple nature of the technology and process concerns of our systems, we must also consider the most complex element of the system: people, specifically those responding to service disruptions.
We now know that judgments and decisions from the perspective of the humans involved in responses to managing technology must be included as relevant data. “Why did it make sense to take that action?” is a common inquiry during a post-incident review. Be cautious when inquiring about “why,” however, as it can subtly suggest that a judgment will be attached to the response, rather than simply being the beginning of a conversation.
An alternative to that question is, “What was your thought process when you took that action?” Often arriving at the same result, this line of inquiry helps to avoid judgment or blame.
This slight difference in language completely changes the tone of the discussion. It opens teams up to productive conversations. One approach that can help reframe the discussion entirely and achieve great results is to begin the dialogue by asking “What does it look like when this goes well?”
Consider the company culture you operate in, and construct your own way to get the conversation started. Framing the question in a way that encourages open discussion helps teams explore alternatives to their current methods. With a growth mindset, we can explore both the negative and the positive aspects of what transpired.
We must consider as much data as possible, representing both positive and negative impacts on recovery efforts, from as many diverse and informative perspectives as we can. Excellent data builds a clearer picture of the specifics of what happened and drives better theories and predictions about the systems moving forward, and how we can consistently improve them.
Observations that focus less on identifying a cause and fix of a problem and more on improving our understanding of state and behavior as it relates to all three key elements (people, process, and technology) lead to improved theory models regarding the system.
This results in enhanced understanding and continuous improvement of those elements across all phases of an incident: detection, response, remediation, analysis, and readiness.
Surfacing anomalies or phenomena that existing behavior and state theories cannot explain allows us to seek out and account for exceptions to the working theory. Digging into these outliers, we can then improve the theories regarding aspects that govern our working knowledge of the systems.
During analysis of an incident, such as the data loss example described in the Introduction, we categorize observations (missing data) and correlations with the outcomes (unsuccessful data migration) of interest to us. We then attempt to understand the causal and contributing factors (data had not been backed up in months) that led to those outcomes. I presumed that the speedy migration was explained by a recent feature release, which led me to dismiss it as something worth investigating sooner. That assumption directly impacted the time it took to understand what was happening. These findings and deeper understanding then turn into action items (countermeasures and enhancements to the system).
Why is this important? Within complex systems, circumstances (events, factors, incidents, status, current state) are constantly in motion, always changing. This is why understanding contributing factors is just as important as (or possibly more important than) understanding anything resembling cause.
Such a “systems thinking” approach, in which we examine the linkages and interactions between the elements that make up the entire system, informs our understanding of it. This allows for a broader understanding of what took place, and under what specific unique and interesting circumstances (that may, in fact, never take place again).
What is the value in isolating the “cause” or even a “fix” associated with a problem with such emergent, unique, and rare characteristics?
To many, the answer is simple: “none.” It’s a needle-in-a-haystack exercise to uncover the “one thing” that caused or kicked off the change in desired state or behavior. And long-standing retrospective techniques have demonstrated predictable flaws in improving the overall resiliency and availability of a service.
This old-school approach of identifying cause and corrective action does little to further learning, improvements, and innovation. Furthermore, such intense focus on identifying a direct cause indicates an implicit assumption that “systems fail (or succeed) in a linear way, which is not the case for any sufficiently complex system.” The current hypothesis with regard to post-incident analysis is that there is little to no real value in isolating a single cause for an event. The aim isn’t only to avoid problems but also to be well prepared, informed, and rehearsed to deal with the ever-changing nature of systems, and to allow for safe development and operational improvements along the way.
Failure can never be engineered out of a system. With each new bit that is added or removed, the system is being changed. Those changes are happening in a number of ways due to the vast interconnectedness and dependencies. No two systems are the same. In fact, the properties and “state” of a single system now are quite different from those of the same system even moments ago. It’s in constant motion.
Working in IT today means being skilled at detecting problems, solving them, and multiplying the effects by making the solutions available throughout the organization. This creates a dynamic system of learning that allows us to understand mistakes and translate that understanding into actions that prevent those mistakes from recurring (or at least reduce their impact, i.e., graceful degradation).
By learning as much as possible about the system and how it behaves, IT organizations can build out theories on their “normal.” Teams will be better prepared and rehearsed to deal with each new problem that occurs, and the technology, process, and people aspects will be continuously improved upon.