Application quality and incident response pack a punch when it comes to customer experience. A software team should always be working to improve on both. But — where to start?
- Which metrics should you collect?
- How can analysis of these metrics facilitate these improvements?
Read on to hear about five key metrics essential to incident response. Discover how these metrics and incident management KPIs can provide insights that add value to your customers — both in the quality of your application and the efficiency of your incident response strategy.
The best part? We’ve included real-world examples that show just how these metrics help organizations get better.
Common incident response metrics
Improving the customer experience generally comes down to two steps:
- Making application changes to improve the quality of the application/service and implementing new features that provide value to the customer.
- Improving the incident management practice to quickly and seamlessly resolve issues encountered by the customer.
Now, let’s look at five metrics that can help an organization take these steps more thoughtfully:
- Issue classification
- Time to acknowledgment (MTTA)
- Time to resolution (MTTR)
- Incident report time
- Escalation statistics
1) Issue classification (determining the most reported application issues)
The first metric to analyze, issue classification, may also be the most impactful.
Here, we want to look at the most reported issues with the given application. Track commonly reported errors and performance issues, and report them to the development team for root cause analysis via post-incident reviews. Repeated failures of the same functionality will likely trace back to the same root cause which, once resolved, could fix the problem for good.
For example, application slowness may be the result of poorly constructed queries. Simply optimizing these queries could lead to better performance and happier customers.
(Improve your incident review/postmortems with these best practices.)
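As a rough illustration of issue classification, the sketch below counts how often each category of issue is reported. The record layout and the category names are assumptions made up for this example, not the output of any particular incident management tool.

```python
from collections import Counter

# Hypothetical incident records; field names and categories are illustrative only.
incidents = [
    {"id": 1, "category": "checkout-timeout"},
    {"id": 2, "category": "login-error"},
    {"id": 3, "category": "checkout-timeout"},
    {"id": 4, "category": "slow-search"},
    {"id": 5, "category": "checkout-timeout"},
    {"id": 6, "category": "login-error"},
]


def top_issues(records, n=3):
    """Return the n most frequently reported categories with their counts."""
    counts = Counter(record["category"] for record in records)
    return counts.most_common(n)


print(top_issues(incidents, n=2))
# [('checkout-timeout', 3), ('login-error', 2)]
```

The categories at the top of this list are natural candidates for a post-incident review.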
2) Time to acknowledgment (MTTA)
The time it takes for an incident response team to acknowledge a reported incident can reveal a lot about the effectiveness of your overall incident management practice. While the acknowledgment time for any particular incident may not indicate a trend, calculating the mean time to acknowledgment (MTTA) can help determine if your incident management strategy needs improvement.
A better incident management strategy can facilitate faster response times and let customers know they’re not forgotten, which goes a long way towards customer satisfaction. Changes that can improve MTTA include:
- Setting up additional or repeating time-based alerts to inform the necessary incident response personnel of newly created issues, ensuring faster acknowledgment and fewer gaps in on-call coverage.
- Restructuring current schedules and/or adding on-call staff to ensure adequate staffing to handle the volume of issues.
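The MTTA calculation described above can be sketched in a few lines. This is a minimal example that assumes each incident carries a `reported_at` and an `acknowledged_at` timestamp. Both field names are hypothetical, and incidents that have not been acknowledged yet are simply skipped.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_acknowledge(incidents):
    """Mean minutes between report and acknowledgment, or None if no data."""
    delays = [
        (inc["acknowledged_at"] - inc["reported_at"]).total_seconds() / 60
        for inc in incidents
        if inc.get("acknowledged_at") is not None  # skip unacknowledged incidents
    ]
    return mean(delays) if delays else None


# Hypothetical sample data for illustration.
incidents = [
    {"reported_at": datetime(2024, 1, 8, 9, 0), "acknowledged_at": datetime(2024, 1, 8, 9, 4)},
    {"reported_at": datetime(2024, 1, 8, 13, 30), "acknowledged_at": datetime(2024, 1, 8, 13, 42)},
    {"reported_at": datetime(2024, 1, 9, 2, 15), "acknowledged_at": datetime(2024, 1, 9, 2, 23)},
    {"reported_at": datetime(2024, 1, 9, 3, 0), "acknowledged_at": None},
]

print(mean_time_to_acknowledge(incidents))  # 8.0 (minutes)
```

Skipping unacknowledged incidents flatters the number, so in practice you may want to report the count of open, unacknowledged incidents alongside it.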
3) Time to resolution (MTTR)
Another important incident response metric to track is the time to resolution for reported incidents. The goal, of course, is to resolve incidents as quickly and efficiently as possible. Calculating the mean time to resolve (MTTR), both overall and for particular types of issues, can show where to focus when improving your incident response strategy. (MTTR isn't limited to incident response: it's also an important failure metric for IT systems.)
Sometimes, improving documentation, communication and knowledge sharing alone can reduce MTTR. But you might need to dig deeper or make bigger changes to significantly improve efficiency in this area.
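To make the MTTR calculation concrete, here is a small sketch that computes the mean time to resolve overall and per issue category. The `category`, `reported_at` and `resolved_at` fields are assumptions for illustration; a real incident tracker will name and store these differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def resolution_hours(incident):
    """Hours between an incident being reported and being resolved."""
    return (incident["resolved_at"] - incident["reported_at"]).total_seconds() / 3600


def overall_mttr(incidents):
    """Mean time to resolve across all incidents, in hours."""
    return mean([resolution_hours(inc) for inc in incidents])


def mttr_by_category(incidents):
    """Mean time to resolve per issue category, in hours."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc["category"]].append(resolution_hours(inc))
    return {category: mean(hours) for category, hours in grouped.items()}


# Hypothetical sample data for illustration.
incidents = [
    {"category": "database", "reported_at": datetime(2024, 1, 8, 9, 0), "resolved_at": datetime(2024, 1, 8, 11, 0)},
    {"category": "database", "reported_at": datetime(2024, 1, 9, 14, 0), "resolved_at": datetime(2024, 1, 9, 18, 0)},
    {"category": "network", "reported_at": datetime(2024, 1, 10, 8, 0), "resolved_at": datetime(2024, 1, 10, 14, 0)},
]

print(overall_mttr(incidents))      # 4.0
print(mttr_by_category(incidents))  # {'database': 3.0, 'network': 6.0}
```

Breaking the figure down by category shows which kinds of incidents pull the overall mean up, which is often more actionable than the single number.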
Here’s a real-world example: When Carrefour, the eighth-largest global retailer, wanted to improve customer experience across its online channels, it focused on improving MTTR by using actionable insights into system performance. This MTTR improvement means Carrefour is now…
- Responding 3x faster to security threats.
- Making smarter decisions about preventing incidents in the first place.
(Discover what Carrefour calls “the cornerstone” of their security operations.)
4) Incident report time
Tracking exactly when each incident occurs can also highlight important trends, even if the incidents are seemingly unrelated. For example:
- Is application slowness commonly detected and reported on Monday mornings? Maybe traffic to your application is significantly higher at that time, and scaling might be necessary to permanently prevent the problem.
- Did an issue present itself after a particular deployment? Perhaps the problem is specific to that deployment rather than a wider issue. In that case, you may be able to simply roll the deployment back.
Knowing this type of information helps the development team track down problems more quickly and easily.
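Both questions above can be explored with simple grouping. The sketch below buckets incident timestamps by weekday and hour, and lists incidents reported shortly after a deployment. The two-hour window and the sample timestamps are arbitrary assumptions chosen for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def incidents_by_weekday_hour(timestamps):
    """Count incidents per (weekday name, hour of day) bucket."""
    return Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)


def incidents_after_deploy(timestamps, deployed_at, window=timedelta(hours=2)):
    """Return incidents reported within `window` after a deployment."""
    return [ts for ts in timestamps if deployed_at <= ts <= deployed_at + window]


# Hypothetical report times for illustration.
reported = [
    datetime(2024, 1, 8, 9, 5),
    datetime(2024, 1, 8, 9, 40),
    datetime(2024, 1, 10, 15, 20),
    datetime(2024, 1, 15, 9, 15),
]

print(incidents_by_weekday_hour(reported).most_common(1))
# [(('Monday', 9), 3)]
print(incidents_after_deploy(reported, deployed_at=datetime(2024, 1, 10, 15, 0)))
# [datetime.datetime(2024, 1, 10, 15, 20)]
```

A cluster in one bucket is a prompt for investigation, not proof of a cause. It still needs to be checked against traffic and deployment records.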
5) Escalation statistics
Are incidents frequently being escalated or rerouted to different units within the organization? If so, the incident response strategy likely needs some changes. These changes can range widely, for example:
- Slightly adjusting the alerting process so the correct personnel are informed of relevant issues in a timelier manner.
- Overhauling the issue classification process to provide the team with more granular detail, increasing the likelihood that the right people are the first to tackle the problem.
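One way to quantify escalations is to record the ordered list of teams each incident passed through. The sketch below assumes a hypothetical `assigned_teams` field holding that list, and treats any incident handled by more than one team as escalated or rerouted.

```python
from collections import Counter


def escalation_rate(incidents):
    """Fraction of incidents that were handled by more than one team."""
    if not incidents:
        return 0.0
    escalated = sum(1 for inc in incidents if len(inc["assigned_teams"]) > 1)
    return escalated / len(incidents)


def reroute_paths(incidents):
    """Count (first team, final team) pairs for incidents that changed hands."""
    return Counter(
        (inc["assigned_teams"][0], inc["assigned_teams"][-1])
        for inc in incidents
        if len(inc["assigned_teams"]) > 1
    )


# Hypothetical sample data for illustration.
incidents = [
    {"id": 1, "assigned_teams": ["frontend", "database"]},
    {"id": 2, "assigned_teams": ["database"]},
    {"id": 3, "assigned_teams": ["frontend", "platform", "database"]},
    {"id": 4, "assigned_teams": ["network"]},
]

print(escalation_rate(incidents))               # 0.5
print(reroute_paths(incidents).most_common(1))  # [(('frontend', 'database'), 2)]
```

A pair that shows up repeatedly, such as incidents starting with one team and ending with another, suggests that alerts for that type of issue could be routed to the final team from the start.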
(Use incident severity levels to your advantage. And beware that severity is not the same as priority.)
Why incident response is so important
It’s easy to see how collecting and analyzing the right incident response metrics can improve the incident management process and enhance application quality. But why is this so important?
Look no further than online retailers, financial institutions and social media companies. Slow incident response times and frequent application issues can quickly sully a company’s reputation, leaving it to fight an uphill battle against its competitors.
But a positive customer experience can mean the difference between being the go-to organization or being completely irrelevant. Just ask Papa Johns, the world’s third-largest pizza delivery company. To keep all its operations running smoothly, it needed visibility into its complex hybrid environment. Today, the team can find and fix issues fast:
“It used to take us days to find out about issues with a new release. Now with our custom dashboard built with Splunk Dashboard Studio, we can pinpoint and fix a problem on the same day so that customers can place orders seamlessly,” says Willie James, director of resiliency services at Papa Johns.
(See how Papa Johns keeps up with increased customer demand — and innovates faster.)
Reliability and prompt issue resolution can help cultivate trust between an organization and its customer base, leading to recurring customers and a positive reputation that draws in new customers.