A DevOps Guide to Incident Response Software
Incident: A problem, represented by an alert, that could negatively impact customers, your employees, and the stakeholders inside or outside of your organization.
In order to stay competitive in today’s market, businesses are expected to innovate — quickly. Many engineering teams feel pressure to build, deploy, and operate services with increasing speed. High performing teams innovate faster and maintain their sanity because they’re able to quickly recover from incidents.
As we move from agile development to rapid deployment, teams need to think beyond a reactive operations center. That’s why choosing the right on-call and incident response system is more than just the icing on the cake of a successful DevOps culture. Incident response is the cornerstone of engaging high-performing engineering and ops teams who champion uptime and own on-call instead of fearing it. Ultimately, rethinking and retooling your approach to DevOps and incident response is imperative to delivering products and applications that keep businesses relevant.
The purpose of this buyer’s guide is to discuss why progressive, high-performing teams choose to invest in high-performance incident response software. From the challenges across the SDLC to specific incident response product features, we’ll lay out everything you need to consider when choosing an incident response solution.
High availability is essential to business success—an issue complicated by the increasing deployment demands of a highly competitive market. Accordingly, investing in processes to ensure near-zero downtime alongside rapid deployment is mission critical for the entire engineering and IT department.
Here, we break down how incident response is key to maintaining a culture of availability without slowing down the innovation process—and how DevOps is the essential piece for successfully executing this shift.
More advanced companies use historical incident data to proactively prepare teams to resolve events faster, and to prevent those events in the first place. This in turn becomes a competitive advantage as highly functional “on-call” teams help prevent revenue loss, protect brand reputation, and drive customer satisfaction.
Today’s teams must manage incidents across the entire lifecycle, folding in detection, response, remediation, analysis, and readiness. In this section, we’ll dive into the five phases of the incident lifecycle.
For each phase, we’ll cover the definition. Then, we’ll discuss how it relates to the features and functionality you need in on-call and incident response software to do more than react to alerts.
The response phase is the delivery of a notification to an incident responder via any means and the first steps the responder takes to address the alert. Thus, a detection threshold is passed, an email/SMS/chat/phone call is sent (notification), and someone acknowledges receipt (response).
How It Relates to Incident Response Software
There are a few key features to ensure the response happens effectively. You can think about these features as on-call essentials or, depending on how thin the feature set is, “basic alerting.” The leading incident response tools on the market will offer:
- Dynamic scheduling
- Team-specific rotations
- Automated escalation(s)
- Scheduled overrides
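To make the list above concrete, here is a minimal sketch of how a rotation, a scheduled override, and an automated escalation policy might fit together. It is not any vendor's implementation: the names `Override`, `on_call`, and `escalation_target`, the weekly shift length, and the 15/30-minute escalation waits are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Override:
    """A scheduled override: `user` covers on-call from `start` until `end`."""
    user: str
    start: datetime
    end: datetime


def on_call(rotation, rotation_start, shift_length, at, overrides=()):
    """Return who is on call at time `at`.

    An active override wins; otherwise the rotation advances one person per shift.
    """
    for override in overrides:
        if override.start <= at < override.end:
            return override.user
    shifts_elapsed = (at - rotation_start) // shift_length
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(policy, minutes_unacknowledged):
    """Return who should be notified for an alert that is still unacknowledged.

    `policy` is a list of (wait_minutes, target) pairs sorted by wait time.
    """
    current = policy[0][1]
    for wait_minutes, target in policy:
        if minutes_unacknowledged >= wait_minutes:
            current = target
    return current


# Hypothetical team data, for illustration only.
rotation = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1)
week = timedelta(days=7)
overrides = [Override("dave", datetime(2024, 1, 9), datetime(2024, 1, 10))]
policy = [(0, "primary"), (15, "secondary"), (30, "engineering-manager")]

print(on_call(rotation, start, week, datetime(2024, 1, 3)))
print(on_call(rotation, start, week, datetime(2024, 1, 9, 12), overrides))
print(escalation_target(policy, 20))
```

In this sketch, the first week belongs to alice, dave's override takes precedence over bob's regular shift on January 9, and an alert left unacknowledged for 20 minutes escalates to the secondary responder.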
These feature sets are essential, yet in isolation, they’re simply not robust enough to support a true DevOps culture. High-performing DevOps teams tend to focus on less reactive environments, investing in the people, process, and tooling to ensure teams are proactively preparing, minimizing, and preventing incidents. Accordingly, every second during response provides an opportunity for improved reliability and uptime.
This is an important point: Developers will not positively respond to (read: adopt) a highly reactive on-call management tool. The tool needs to offer context, collaboration, and visibility.
Many high-performing teams have found success through ChatOps tooling and workflows that centralize communication and set up the first responder for success. While receiving a basic notification in Slack/Stride/Mattermost is great, a contextual alert with a visual indication of the current state, plus links to relevant runbooks or dashboards, saves the responder valuable time digging into the error.
When purchasing an incident response tool, buyers should look not only for bidirectional chat integrations and ChatOps functionality but also the ability to configure alerts to fit team needs—any information present in the alert payload can be used to provide additional details to the on-call responder. Straightforward contextual details attached to each alert will reduce the stress of on-call and provide a next-level technique for resolving incidents faster.
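As a rough illustration of what a contextual alert could look like, the sketch below turns a raw alert payload into a chat message with a state indicator, a dashboard link, and a runbook link. The payload keys (`service`, `state`, `summary`, `dashboard_url`), the `runbooks` mapping, and the example.com URLs are hypothetical; a real monitoring tool will have its own payload format.

```python
def build_chat_message(payload, runbooks):
    """Turn a raw alert payload into a contextual chat message.

    The payload keys and the `runbooks` mapping are assumptions for this
    sketch; adapt them to whatever your monitoring tool actually sends.
    """
    state_labels = {"critical": "[CRITICAL]", "warning": "[WARNING]", "ok": "[OK]"}
    state = payload.get("state", "unknown")
    label = state_labels.get(state, "[UNKNOWN]")
    lines = [f"{label} {payload['service']}: {payload['summary']}"]
    # Only add links that are actually available for this alert.
    if "dashboard_url" in payload:
        lines.append(f"Dashboard: {payload['dashboard_url']}")
    runbook = runbooks.get(payload["service"])
    if runbook:
        lines.append(f"Runbook: {runbook}")
    return "\n".join(lines)


# Hypothetical alert and runbook index.
alert = {
    "service": "checkout-api",
    "state": "critical",
    "summary": "p99 latency above 2s for 5 minutes",
    "dashboard_url": "https://example.com/dashboards/checkout",
}
runbooks = {"checkout-api": "https://example.com/runbooks/checkout-latency"}

print(build_chat_message(alert, runbooks))
```

The point of the design is that the responder sees the state, the affected service, and the next place to look in a single message, rather than a bare "something is wrong" notification.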
The analysis phase, often referred to as postmortem or post-incident review, is the learning process after an incident is resolved. While the historic approach to this phase has relied heavily on Root Cause Analysis (RCA), increasingly complex systems have led progressive teams away from relying only on single causal entity analysis. Instead, teams are increasingly looking towards models that address system complexities, e.g. Cynefin, to better understand the holistic, multi-faceted cause of an incident.
How It Relates to Incident Response Software
When we discuss analysis, there are a few key pieces necessary for incident response software to support a healthy Post-Incident Review (PIR). The first is the Incident Dashboard or Timeline, which is helpful for providing a quick view of misbehaving systems before and during the incident; who shipped something to production; who was taking action; what actions that individual was taking; and what communication was happening throughout the incident. All of these pieces serve as critical data for an effective PIR.
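Conceptually, an incident timeline is a chronological merge of events from several systems. The following is a minimal sketch under that assumption; the `build_timeline` helper, the `(timestamp, source, description)` tuple shape, and the sample events are all hypothetical and not taken from any particular product.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources into one chronological timeline.

    Each event is a (timestamp, source, description) tuple.
    """
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


# Hypothetical sample data from three separate systems.
monitoring = [
    (datetime(2024, 3, 5, 10, 2), "monitoring", "checkout latency above threshold"),
]
deploys = [
    (datetime(2024, 3, 5, 9, 58), "deploy", "release shipped to production"),
]
chat = [
    (datetime(2024, 3, 5, 10, 4), "chat", "on-call engineer acknowledged the alert"),
    (datetime(2024, 3, 5, 10, 15), "chat", "on-call engineer rolled back the release"),
]

for timestamp, source, description in build_timeline(monitoring, deploys, chat):
    print(f"{timestamp:%H:%M} [{source}] {description}")
```

Seeing the deploy land a few minutes before the latency alert is exactly the kind of context a reviewer would otherwise have to piece together by hand.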
Close readers may notice some nuances to words we’ve chosen (or avoided) as we discuss incident analysis, namely “Post-Incident Review” and “root-cause analysis” (RCA).
Post-Incident Review is our replacement for post-mortems. You can learn more about our approach to the Post-Incident Review, including why it’s so essential for DevOps teams, here. Our decision to avoid the term RCA follows the same reasoning: given the complexity of today’s people and systems, incidents rarely trace back to a single root cause.
The second is reporting on mean time to acknowledge (MTTA) and mean time to resolve (MTTR). MTTA/MTTR reporting allows your teams to visualize and uncover the underlying trends in a team’s ability to respond to and resolve incidents. By holistically analyzing the impact of incident volume, along with your team’s use of the incident response software, you can identify levers to lower MTTA/MTTR and minimize the cost of downtime.
The third is a Post-Incident Review report. Distinct from the internal PIR process itself, this PIR is a tangible report where individuals, including leadership, can quickly pull a timeframe of data (no more manual aggregation of emails, Slack, SMS, and monitoring systems) for key learnings. This report facilitates a PIR, or “retrospective”, and documents long-term action items. Out-of-the-box PIR reporting allows your team to quickly and easily access monitoring data, system actions, and human remediation to better understand the who, what, when, where, and why of an incident. All of this analysis is essential for the preparedness and readiness required for teams to not only quickly resolve incidents in production, but also improve the reliability of systems to proactively address issues before they occur.
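For readers who want to see the arithmetic behind these two metrics, here is a small sketch. It assumes both are measured from the moment the alert triggers (some teams measure resolution time from acknowledgment instead), and the `mtta_mttr` helper and the incident records are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mtta_mttr(incidents):
    """Return (MTTA, MTTR) in minutes for a list of incident records.

    Assumption: both metrics are measured from the `triggered` timestamp.
    """
    ack_minutes = [
        (i["acknowledged"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    resolve_minutes = [
        (i["resolved"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    return mean(ack_minutes), mean(resolve_minutes)


# Hypothetical incident history.
incidents = [
    {
        "triggered": datetime(2024, 3, 5, 10, 0),
        "acknowledged": datetime(2024, 3, 5, 10, 4),
        "resolved": datetime(2024, 3, 5, 10, 40),
    },
    {
        "triggered": datetime(2024, 3, 6, 14, 0),
        "acknowledged": datetime(2024, 3, 6, 14, 10),
        "resolved": datetime(2024, 3, 6, 15, 0),
    },
]

mtta, mttr = mtta_mttr(incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 7.0 min, MTTR: 50.0 min
```

Tracking these averages per team and per time window is what turns them into a trend you can act on, rather than a single number.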
These are the most important questions to ask of your solution:
Questions for on-call management
- Will I find contextual alerts with abundant information for resolution?
- Does the tool have built-in automation to reduce noise and alert responders only during critical incidents?
- Does the tool support collaboration with bidirectional group chat integrations?
- Does the software support international notifications?
- Does this tool support/integrate with my existing critical toolchain components?
- Can I access a variety of reports, including MTTA/MTTR and overall incident frequency?
- Is there a native mobile app that supports on-the-go on-call?
- How easy is it to conduct a thorough post-incident review?
- How hard is it to access historical data?
- How can I configure alerts?
- Are there varied levels of user permissions?
- Do I have SDLC visibility to see when things are shipped to production?

- How likely is it that my development team would use this tool? Would they find value in alerting? Or would they simply be inundated with noisy alerts that make on-call miserable?
- Does this tool prepare me for continuous learning and continuous improvement?
- Can I access out-of-the-box performance metrics to report on SLAs and uptime?
- How easy is it to conduct a thorough post-incident review?
- Does this tool surface when new code is pushed into production?
- Is this tool built for DevOps standardization? Or would we need to migrate to a new tool as our team progresses?