By Scott Fitzpatrick
One of the most important processes for any development organization is incident response. When issues occur within applications (and they will), teams need to be able to respond quickly and efficiently to restore service and limit the impact on end users.
Below, I’ll examine some of the important metrics for measuring the effectiveness of an incident response strategy. Additionally, I’ll detail how an incident management platform that features a unified UI and enhanced functionality for managing incident data can help to improve both an incident response strategy and the applications this strategy supports.
Key Performance Indicators for Incident Response
Accurately assessing the effectiveness of any process requires data, and incident response is no exception. Let’s take a look at a few metrics that are indicative of an organization’s efficiency in responding to problems within its applications.
Mean Time To Acknowledge (MTTA)
Mean time to acknowledge represents the average time taken from the moment a problem is detected by the alerting system to the moment it is recognized by response personnel. When MTTA is longer than acceptable, the organization should ensure they have appropriate on-call coverage and examine their methods for alerting on-call personnel.
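To make the definition concrete, here is a minimal sketch of the MTTA calculation in Python. The record layout (`triggered_at`, `acknowledged_at`) and the sample timestamps are assumptions made for illustration, not fields from any particular alerting tool. Incidents that were never acknowledged are left out of the average.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Average of (acknowledged_at - triggered_at) over acknowledged incidents.

    Returns None when no incident has been acknowledged yet.
    """
    deltas = [
        inc["acknowledged_at"] - inc["triggered_at"]
        for inc in incidents
        if inc.get("acknowledged_at") is not None
    ]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical sample data: two acknowledged incidents and one still open.
incidents = [
    {"triggered_at": datetime(2024, 1, 1, 9, 0),
     "acknowledged_at": datetime(2024, 1, 1, 9, 4)},
    {"triggered_at": datetime(2024, 1, 2, 22, 30),
     "acknowledged_at": datetime(2024, 1, 2, 22, 40)},
    {"triggered_at": datetime(2024, 1, 3, 3, 15),
     "acknowledged_at": None},
]

mtta = mean_time_to_acknowledge(incidents)
print(mtta)  # 0:07:00
```

Whether unacknowledged incidents should be excluded or counted up to the current time is a policy choice; this sketch simply skips them.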
Mean Time To Resolve (MTTR)
Mean time to resolve is the average time taken for a problem to be resolved, from the moment the incident is detected until the moment it has been fixed. An organization with a high MTTR should ensure that issues are being acknowledged in a timely fashion and evaluate ways to accelerate root cause analysis.
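One way to see where resolution time is going is to split it into the time spent waiting for acknowledgement and the time spent on diagnosis and repair. The sketch below assumes hypothetical `detected_at`, `acknowledged_at`, and `resolved_at` fields on each incident record and only considers incidents that have been resolved.

```python
from datetime import datetime, timedelta


def mean_delta(pairs):
    """Average duration over a list of (start, end) datetime pairs."""
    return sum((end - start for start, end in pairs), timedelta()) / len(pairs)


def resolution_breakdown(incidents):
    """Return MTTR plus its two components for resolved incidents."""
    resolved = [i for i in incidents if i.get("resolved_at") is not None]
    if not resolved:
        return None
    return {
        # Full span: detection to fix.
        "mttr": mean_delta([(i["detected_at"], i["resolved_at"]) for i in resolved]),
        # Time lost before anyone picked the incident up.
        "detect_to_ack": mean_delta(
            [(i["detected_at"], i["acknowledged_at"]) for i in resolved]),
        # Time spent on analysis and the fix itself.
        "ack_to_resolve": mean_delta(
            [(i["acknowledged_at"], i["resolved_at"]) for i in resolved]),
    }


# Hypothetical sample data.
incidents = [
    {"detected_at": datetime(2024, 2, 1, 10, 0),
     "acknowledged_at": datetime(2024, 2, 1, 10, 5),
     "resolved_at": datetime(2024, 2, 1, 10, 45)},
    {"detected_at": datetime(2024, 2, 3, 14, 0),
     "acknowledged_at": datetime(2024, 2, 3, 14, 15),
     "resolved_at": datetime(2024, 2, 3, 15, 15)},
]

breakdown = resolution_breakdown(incidents)
for name, value in breakdown.items():
    print(name, value)
```

With this sample data the breakdown is 1:00:00 overall, of which 0:10:00 is acknowledgement delay and 0:50:00 is analysis and repair, which would point toward root cause analysis as the place to look first.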
Alert Frequency
Most modern development teams leverage monitoring solutions and alert functionality in an effort to learn of application problems as early as possible. Alert frequency is a measurement of how often certain alerts are being triggered. A specific alert being generated more often than expected requires analysis to gain a deeper understanding of why this is occurring. It’s possible that the alert threshold is simply too sensitive and needs to be adjusted, or that the problem referenced by the alert needs to be addressed in a more permanent manner within the application. An overall high number of alerts leads to alert fatigue and less diligence among responding team members for each alert.
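As a rough illustration, alert frequency can be tallied per alert name and compared against how often each alert is expected to fire in a given period. The alert names, the `expected_max` thresholds, and the default of one firing per period are all assumptions for this sketch.

```python
from collections import Counter


def alert_frequency(alerts):
    """Count how many times each alert name fired."""
    return Counter(alert["name"] for alert in alerts)


def noisy_alerts(alerts, expected_max, default_max=1):
    """Return alerts that fired more often than expected in the period.

    expected_max maps alert name -> highest acceptable count;
    names missing from the map fall back to default_max.
    """
    counts = alert_frequency(alerts)
    return {
        name: count
        for name, count in counts.items()
        if count > expected_max.get(name, default_max)
    }


# Hypothetical alerts collected over one week.
alerts = (
    [{"name": "disk_usage_high"}] * 5
    + [{"name": "api_latency"}] * 2
    + [{"name": "cert_expiry"}]
)
expected_max = {"disk_usage_high": 3, "api_latency": 3}

flagged = noisy_alerts(alerts, expected_max)
print(flagged)  # {'disk_usage_high': 5}
```

A flagged alert is only a prompt for investigation: the tally cannot tell an overly sensitive threshold apart from a genuine recurring problem.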
How a unified UI can help improve these metrics and enable continuous improvement
Splunk On-Call is an incident management offering for centralized tracking and managing of all incident data. By leveraging its all-encompassing UI and enhanced functionality for getting the alerts to the right people the first time, organizations can gain insights into their incident response processes that enable them to better the metrics discussed above and continuously improve both their applications and response processes.
Let’s examine a few of the features available within Splunk On-Call and exactly how they can help.
Context-rich alerting, intelligent routing, and collaboration: Reducing the time to resolve
When working to resolve application problems, information providing context is extremely important. Splunk On-Call enables collecting these types of details up-front by supporting the ingestion of context-rich alerts via numerous integrations with popular monitoring tools. This lays the foundation for the analysis portion of the response process. Furthermore, on-call rotations and escalation policies can be defined within Splunk On-Call, ensuring that these detailed alerts are in the right hands as soon as possible (thereby helping to reduce MTTA and MTTR).
In line with alert details, Splunk On-Call also lets responders review similar incidents that have occurred in the past. This provides additional context that could prove useful in narrowing the search for root cause. A runbook link can also be provided right in the same interface. If that isn’t enough and the incident requires multiple sets of hands to come up with the fix, additional personnel from other teams can be quickly added to the incident from within the UI. In the same vein, communication regarding incidents is made even easier through platform integration with collaboration tools like Slack and Microsoft Teams. Both feature two-way integration that allows communication and incident actions to take place directly in the collaboration tool, while ensuring that all of these events remain visible in a single pane of glass within the Splunk On-Call UI.
In short, Splunk On-Call equips response personnel with context and historical data, and makes it easy for responders to reach out and collaborate with all relevant personnel and teams.
Increased visibility into response drives continuous improvement within applications
Tracking all incident information in a centralized manner has an impact that goes beyond reaching a resolution for a single occurrence. The timeline from the moment the alert was triggered through the implementation of the fix (and all events that took place in between) will prove exceptionally valuable in conducting an effective post-incident review. This is exactly the type of detail that Splunk On-Call provides.
The purpose of a post-incident review is to learn from the incident, document it, and make sure that the root cause is fully understood. Moreover, teams should identify and take appropriate actions to prevent (or, at the very least, limit) recurrences of the same problem. Splunk On-Call simplifies the process of analyzing what happened and why through its Post-Incident Review reporting feature.
Post-Incident Review enables teams to pull a comprehensive event log defining the entire response process for the incident. This log aggregates alert information, escalations, interactions between response personnel, and more. In doing so, teams no longer need to manually reconstruct the incident’s timeline and can rest assured they have all the information needed to conduct a thorough review. This facilitates the construction of more complete documentation for faster remediation the next time around, and enables teams to implement the proper preventative measures to help limit incident recurrence.
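To illustrate the kind of manual reconstruction such a log replaces, the sketch below merges events from several hypothetical sources into one chronological timeline. It is not Splunk On-Call’s API or export format; the event fields (`at`, `kind`, `detail`) and the sample events are invented for the example.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge events from several sources into one list ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event["at"])


# Hypothetical events gathered from three separate places.
alerts = [
    {"at": datetime(2024, 3, 5, 2, 0), "kind": "alert",
     "detail": "checkout error rate above threshold"},
]
escalations = [
    {"at": datetime(2024, 3, 5, 2, 10), "kind": "escalation",
     "detail": "paged secondary on-call"},
]
chat = [
    {"at": datetime(2024, 3, 5, 2, 4), "kind": "chat",
     "detail": "primary on-call acknowledged"},
    {"at": datetime(2024, 3, 5, 2, 25), "kind": "chat",
     "detail": "rollback deployed, errors subsiding"},
]

timeline = build_timeline(alerts, escalations, chat)
for event in timeline:
    print(f'{event["at"]:%H:%M} [{event["kind"]}] {event["detail"]}')
```

Doing this by hand means first locating every source and agreeing on timestamps, which is the effort a centralized event log removes.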
Furthermore, visualizations detailing alert frequency are available for review at any time. This information can be leveraged to identify alerts whose thresholds may need to be reconfigured (should they be deemed too sensitive), or actual trouble spots within applications. In the long run, this information (and the actions taken based on it) leads to applications that improve continuously, and to a reduction in alert noise that may be hurting morale and taking up time that would be better spent on other initiatives.
Analytics and reporting for response process improvement
With all incident information being tracked within a single platform, Splunk On-Call has an abundance of data that can be leveraged to provide insight into the key incident response metrics discussed earlier. This includes visualizations and statistics that facilitate a deeper understanding of how the organization is performing in regard to incident acknowledgement and resolution. The hard numbers provided via MTTA and MTTR reports provide a jumping-off point for analysis of the response process. Reviewing and adjusting alerting techniques, on-call rotations, and escalation policies represent just a few of the possible actions an organization could take, should the data reveal shortfalls in these areas.
Splunk On-Call is an incident management platform featuring rich functionality that can help an organization acknowledge and resolve system problems in a timely manner.
By tracking and managing all incident data within Splunk On-Call, organizations unlock actionable insights that drive continuous improvement in both their applications and their response processes. Over time, this leads to the development of highly reliable applications and enables a highly effective response when incidents occur.
Find out more about how Splunk On-Call works with Splunk Observability Cloud to resolve outages faster with intelligent and automated incident response. Watch this demo video and sign up for your free trial today.