Humans will always need to make judgment calls during the response to and recovery from IT problems. They decide which tasks to execute during remediation based on what they know at the time. It can be tempting to judge these decisions in hindsight as good or bad, but this temptation should be resisted.
When accidents and failures occur, instead of looking for human error we should look for how we can redesign the system to prevent these incidents from happening again.
A company that validates and embraces the human elements of incidents and accidents learns more from a post-incident review than one that punishes people for their actions, omissions, or decisions. Celebrating transparency and learning opportunities shifts the culture toward learning from the human elements. That said, gross negligence and harmful acts must not be ignored or tolerated.
Human error should never be a “cause.”
How management chooses to react to failure and accidents has measurable effects. A culture of fear is established when teams are incentivized to keep relevant information to themselves for fear of reprimand. Celebrating discovery of flaws in the system recognizes that actively sharing information helps to enable the business to better serve its purpose or mission. Failure results in intelligent discussion, genuine inquiry, and honest reflection on what exactly can be learned from problems.
Blaming individuals for their role in either the remediation or the incident itself minimizes the opportunity to learn. In fact, identifying humans as the cause of a problem typically adds more process, more approvals, and more friction to the system. Blaming others for their involvement does only harm.
The alternative to blaming is praising. Encourage behavior that reinforces our belief that information should be shared more widely, particularly when exposing problems in the way work gets done and the systems involved. Celebrate the discovery of important information about the system. Don’t create a scenario where individuals are more inclined to keep information from surfacing.
Nurturing discovery through praise will encourage transparency. Benefits begin to emerge as a result of showing the work that is being done. A stronger sense of accountability and responsibility starts to form.
Make Work (and Analysis) Visible
Many principles behind DevOps have been adopted from lean manufacturing practices. Making work visible is an example of this. It reinforces the idea that sharing more information about the work we do and how we accomplish it is important, so we can analyze and reflect on it with a genuine and objective lens.
Real-time conversations often take place in group chat tools during the lifecycle of an incident. If those conversations are not captured, it’s easy to lose track of the what, who, and when data points. These conversations are relevant to the work and should be made visible for all to see. This is a big reason for the adoption of practices like ChatOps, which we’ll learn more about in Chapter 9.
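Capturing the what, who, and when of those conversations need not be complicated. As a minimal sketch (the `IncidentTimeline` class and its methods here are hypothetical, not part of any particular ChatOps tool), an append-only timeline that stamps each chat event is enough to preserve those data points for later review:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TimelineEvent:
    """One captured data point: who said or did what, and when."""
    who: str
    what: str
    when: datetime


@dataclass
class IncidentTimeline:
    """Collects chat events so the what/who/when survive the incident."""
    events: List[TimelineEvent] = field(default_factory=list)

    def record(self, who: str, what: str) -> TimelineEvent:
        # Stamp every event in UTC so timelines from different
        # responders and tools can be merged unambiguously.
        event = TimelineEvent(who, what, datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def as_transcript(self) -> str:
        # Render a human-readable transcript for the post-incident review.
        return "\n".join(
            f"[{e.when:%H:%M:%S}] {e.who}: {e.what}" for e in self.events
        )


timeline = IncidentTimeline()
timeline.record("alice", "Database replica lag is spiking.")
timeline.record("bob", "Failing over to the standby now.")
print(timeline.as_transcript())
```

In practice a chat platform's own history or a bot integration plays this role; the point is simply that the record is created as the work happens, not reconstructed from memory afterward.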
Teams should be able to respond, collaborate, and resolve issues in a shared and transparent space. When work and conversations are visible, it becomes easier to spot areas for improvement, particularly as third-party observers and stakeholders watch remediation efforts unfold.
Think back to the data loss incident described in the Introduction. My testimony the following morning on exactly what took place, when, and how helped make the work visible. Others who had no previous exposure to this work now had questions and suggestions on how to improve the process and the customer experience. It was an eye-opening exercise for many to discover a truer sense of how our systems were designed, including those responsible for responding to failures in IT. Everyone felt included, engaged, and empowered.
Making work visible during incident response helps to paint a very clear picture of what has occurred, what can be learned from the unplanned disruption of service, and what can be improved. Another important benefit of making work more transparent is that a greater sense of accountability exists.
It is very common for people to mistake responsibility for accountability. We often hear people proclaim, “Someone must be held accountable for this outage!” To that, I’d say they are correct. We must hold someone accountable, but let’s be clear on what we mean by that.
Holding someone accountable means that we expect them to provide an accurate “account” of what took place. We expect responders to share as much information and knowledge as can be gleaned from the incident.
However, this is rarely what people mean when they demand accountability. What they actually mean is that someone must be held responsible.
While blaming individuals is clearly counterproductive, it is important to seek out and identify knowledge or skill gaps that may have contributed to the undesirable outcome so that they can be considered broadly within the organization.
There’s a small but significant difference between responsibility and accountability. Everyone is responsible for the reliability and availability of a service. Likewise, everyone should be held accountable for what they see and do as it relates to the development and operation of systems.
We often see a desire to hold someone responsible for outcomes, even though the person in question may not have had the authority required to meet that level of responsibility, much less the perfect knowledge that would have been required.
We need to evolve and create an environment where reality is honored and we extract it in a way that creates a culture of learning and improvement. Making problems visible provides an opportunity to learn, not an opportunity to blame.
Leaders demanding to know who pushed the button that ultimately deleted a month’s worth of customer data during a routine migration, for example, are not encouraging the discovery of how to make things better. They are expecting someone to justify their actions or accept blame for the negative outcome. Unfortunately, the ultimate result of blaming, as pointed out previously, is that individuals keep information to themselves rather than sharing it.
When engineers feel safe to openly discuss details regarding incidents and the remediation efforts, knowledge and experience are surfaced, allowing the entire organization to learn something that will help avoid a similar issue in the future.