The sooner you know about a problem, the sooner you can address it, right? Imagine if you could do that in your most important apps and software.
Well, that’s exactly what MTTA measures. Let’s take a look.
What does MTTA mean?
Mean Time to Acknowledge (MTTA) is the average time it takes a team to acknowledge an incident after an alert is issued. It is calculated by adding up, across all incidents, the time between when each alert was created and when it was acknowledged, then dividing by the number of incidents.
This metric is used to evaluate how well incident management teams respond to alerts. It reflects both:
- The effectiveness of the monitoring mechanism.
- The capacity of the incident management team to respond to the issued alert.
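As a minimal sketch of the calculation described above, the following Python snippet averages the gap between alert creation and acknowledgement. The function name and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Return MTTA as a timedelta for (created_at, acknowledged_at) pairs."""
    if not incidents:
        raise ValueError("need at least one acknowledged incident")
    # Sum the acknowledgement delay of every incident, then average.
    total = sum(
        ((acked - created) for created, acked in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical sample data: alerts acknowledged after 4, 10 and 1 minutes.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4)),
    (datetime(2024, 1, 1, 13, 30), datetime(2024, 1, 1, 13, 40)),
    (datetime(2024, 1, 2, 2, 15), datetime(2024, 1, 2, 2, 16)),
]
mtta = mean_time_to_acknowledge(incidents)
print(mtta)  # 0:05:00
```

In practice the timestamps would come from your alerting or ticketing tool rather than a hard-coded list.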
What MTTA can tell us
This metric is particularly useful for improving service dependability, which includes the availability, reliability and effectiveness of an IT service. Failing to detect, and therefore respond to, an IT incident translates into steep downtime costs:
- 76% of surveyed organizations lose data due to a downtime incident.
- The true cost of a downtime incident can range from $10,000 per hour to $5 million per hour.
MTTA at network scale
MTTA isn’t just about monitoring for specific alerts. The challenge facing incident management teams primarily centers around the scale of networking operations.
Network nodes and endpoints generate large volumes of log streams in real time. These logs describe network activity at scale, including traffic flows, connection requests, TCP/IP protocol data exchanges and network security information. Advanced monitoring and observability tools use this information to make sense of every alert, but not every alert represents a potential incident.
These tools are designed to recognize patterns of anomalous network behavior. Since the network behavior evolves continuously, predefined instructions cannot accurately determine the severity of impact represented by an alert.
Instead, the behavior of networking systems is modeled and generalized by a statistical system such as a machine learning model. Deviations from normal behavior are classified as anomalous and therefore require acknowledgement from the incident management team.
Because these models generalize system behavior that is continuously changing, some important alerts slip under the radar, while many common, less important alerts do not warrant any control action. Additionally, it may take several alerts in succession to definitively point to an incident that requires a control action, and that action may not be entirely automated.
These gaps add delay both in issuing an alert and in the incident management team acknowledging it.
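To make the idea concrete, here is a simplified sketch in Python. It stands in for a real machine learning model with a plain z-score against a historical baseline, and it only asks for acknowledgement once several anomalous readings occur in a row. The threshold values, function names and sample data are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def needs_acknowledgement(values, baseline, consecutive=3):
    """Return True once `consecutive` anomalous readings occur in a row."""
    streak = 0
    for value in values:
        streak = streak + 1 if is_anomalous(value, baseline) else 0
        if streak >= consecutive:
            return True
    return False


baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # hypothetical requests/sec
single_spike = [100, 400, 101, 99]   # one outlier, then back to normal
sustained = [100, 400, 420, 450]     # three outliers in succession

print(needs_acknowledgement(single_spike, baseline))  # False
print(needs_acknowledgement(sustained, baseline))     # True
```

Requiring a streak trades a little detection speed for fewer false alarms, which is the same tension the section describes.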
Ways to reduce MTTA
So how do you reduce MTTA? The following best practices are key to reducing the Mean Time to Acknowledge for an IT incident:
Information that can be used to issue an alert is generated in silos: network logs, application logs, network traffic data. This information must be integrated and collected in a consumable format within a scalable data platform, such as a data lake or data lakehouse that ingests data of all structures and formats into a scalable cloud-based repository.
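One way to picture this integration step is a small normalizer that maps records from different silos onto a shared schema. The sketch below is only illustrative: the `Event` fields, the pipe-delimited network log format and the application log keys are invented for this example and do not describe any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common schema shared by every source (hypothetical fields)."""
    timestamp: datetime
    source: str
    severity: str
    message: str


def from_network_log(line):
    """Parse a hypothetical 'epoch_seconds|LEVEL|text' network log line."""
    epoch, level, text = line.split("|", 2)
    ts = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return Event(ts, "network", level.lower(), text)


def from_app_log(record):
    """Convert a hypothetical application log dict with an ISO 8601 'time'."""
    ts = datetime.fromisoformat(record["time"])
    return Event(ts, "application", record["level"].lower(), record["msg"])


# Merge both silos into one time-ordered stream.
events = sorted(
    [
        from_network_log("1704099600|WARN|link saturation on eth0"),
        from_app_log({"time": "2024-01-01T08:59:30+00:00",
                      "level": "ERROR", "msg": "upstream timeout"}),
    ],
    key=lambda e: e.timestamp,
)
for event in events:
    print(event.timestamp.isoformat(), event.source, event.severity, event.message)
```

Once every source shares one schema, downstream alerting logic can reason about a single ordered stream instead of several incompatible ones.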
Because network behavior changes rapidly, the mathematical models that represent this behavior must be adaptable, learning continuously. That means they also require continuous training on real-time data streams.
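A lightweight illustration of continuous learning is an exponentially weighted baseline that updates with every new observation, so older behavior gradually fades. This is a sketch of one simple technique, not a full machine learning pipeline; the class name and the `alpha` value are assumptions.

```python
class AdaptiveBaseline:
    """Exponentially weighted mean and variance, updated per observation."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # higher alpha forgets old behavior faster
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = float(x)
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        # Standard incremental form of the exponentially weighted variance.
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)

    def zscore(self, x):
        """Distance of x from the current baseline in standard deviations."""
        if self.mean is None or self.var == 0:
            return 0.0
        return (x - self.mean) / self.var ** 0.5


baseline = AdaptiveBaseline(alpha=0.2)
for value in [100] * 20:   # hypothetical steady traffic level
    baseline.update(value)
for value in [200] * 50:   # traffic permanently shifts to a new level
    baseline.update(value)
print(round(baseline.mean))  # 200
```

Because the baseline follows the shift, the new traffic level stops looking anomalous after a while, which is the adaptability the paragraph above calls for.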
Real-time decision making
It is important to reduce the time to issue an accurate alert, especially when only a pattern or series of alerts can point to a specific incident. This requires real-time analytics processing capabilities to make sense of the acquired data.
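As a rough sketch of this kind of streaming analysis, the snippet below keeps a sliding time window of alerts per service and signals an incident only when enough alerts cluster together. The class name, window size and threshold are hypothetical, and the code assumes alerts arrive in timestamp order.

```python
from collections import defaultdict, deque


class AlertCorrelator:
    """Raise an incident when several alerts for one service fall in a time window."""

    def __init__(self, window_seconds=60, min_alerts=3):
        self.window = window_seconds
        self.min_alerts = min_alerts
        self.recent = defaultdict(deque)  # service -> timestamps of recent alerts

    def observe(self, service, timestamp):
        """Record an alert; return True if the service now warrants an incident."""
        timestamps = self.recent[service]
        timestamps.append(timestamp)
        # Drop alerts that have aged out of the window.
        while timestamps and timestamp - timestamps[0] > self.window:
            timestamps.popleft()
        return len(timestamps) >= self.min_alerts


correlator = AlertCorrelator(window_seconds=60, min_alerts=3)
results = [correlator.observe("db", t) for t in (0, 20, 100, 110, 120)]
print(results)  # [False, False, False, False, True]
```

Each alert is processed as it arrives in constant amortized time, so the decision does not wait for a batch job.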
Once an alert has been issued, incident management, risk management and governance processes often add to the response delay because of the checks and countermeasures they require.
By integrating automation tools into your monitoring systems, you can reduce these delays. You’ll still want a risk, governance and incident management framework to streamline automation and reduce the risk of responding automatically to incident alerts.
Business value alignment
Focus your resources on alert categories that have the largest impact on your…
- Business operations
- Service dependability for mission critical functions
- End-user experience
It’s likely not possible or viable to invest incident management resources in every issue, especially those that do not directly impact SLA performance and service dependability. Instead, prioritize based on the biggest impact to users and the business.
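A simple way to express this prioritization is an impact score that ranks alerts before anyone is paged. The weights, field names and sample alerts below are purely hypothetical; a real scheme would be tuned to your own SLAs and services.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    affects_sla: bool
    mission_critical: bool
    users_affected: int


def impact_score(alert):
    """Score an alert from 0 to 100 using illustrative, made-up weights."""
    score = 0
    if alert.affects_sla:
        score += 50
    if alert.mission_critical:
        score += 30
    # Up to 20 points for end-user reach, one point per 100 users.
    score += min(alert.users_affected // 100, 20)
    return score


alerts = [
    Alert("staging disk usage", False, False, 0),
    Alert("checkout latency", True, True, 5000),
    Alert("report export errors", False, False, 800),
]
ranked = sorted(alerts, key=impact_score, reverse=True)
print([a.name for a in ranked])
# ['checkout latency', 'report export errors', 'staging disk usage']
```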
Incident management is a highly data-driven function. Traditional tools that follow fixed alert thresholds may require ongoing manual effort to align incident management performance with the service dependability goals of your organization. Advanced incident management technologies are therefore important for two key reasons:
- Acquiring data from a multitude of siloed sources.
- Making sense of data patterns to issue the most impactful alerts in real-time.
Identify the root cause
Incidents, and therefore alerts, can recur unless the resolution procedure addresses the underlying cause. Identifying the incident types that contribute most to your MTTA and understanding their root cause can help IT teams establish long-term, impactful resolutions.
This reduces the burden on incident management teams of responding to repeated issues while potentially eliminating the underlying cause.
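To find which incident types contribute most to MTTA, you could group acknowledgement times by type and rank them. The following is a minimal sketch with made-up incident types and durations; the function name is hypothetical.

```python
from collections import defaultdict


def ack_time_by_type(incidents):
    """Return [(type, total_minutes, count)] sorted by total acknowledgement time."""
    totals = defaultdict(lambda: [0.0, 0])
    for incident_type, minutes_to_ack in incidents:
        totals[incident_type][0] += minutes_to_ack
        totals[incident_type][1] += 1
    return sorted(
        ((t, total, count) for t, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )


# Hypothetical history of (incident type, minutes until acknowledged).
history = [
    ("disk_full", 12),
    ("cert_expiry", 45),
    ("disk_full", 15),
    ("disk_full", 20),
    ("dns", 3),
]
ranking = ack_time_by_type(history)
print(ranking[0])  # ('disk_full', 47.0, 3)
```

A type that is both frequent and slow to acknowledge, like the recurring `disk_full` entries here, is a natural candidate for root cause analysis.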
What about MTTF & MTTR?
In addition to the service dependability metric MTTA, it is also important to consider Mean Time to Failure (MTTF) and Mean Time to Recovery (MTTR) and how these impact your MTTA improvement decisions.
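For context, both metrics are simple averages, just like MTTA. The sketch below uses the common definitions (MTTF as average operating time before a failure, MTTR as average time to restore service) with hypothetical figures; your tooling may define the start and end points slightly differently.

```python
def mean_time_to_failure(uptime_hours):
    """Average operating time before a failure occurred."""
    return sum(uptime_hours) / len(uptime_hours)


def mean_time_to_recovery(recovery_hours):
    """Average time taken to restore service after a failure."""
    return sum(recovery_hours) / len(recovery_hours)


# Hypothetical figures for three failures of one service.
uptime_hours = [700, 500, 900]   # hours of operation before each failure
recovery_hours = [2, 1, 3]       # hours to recover from each failure

print(mean_time_to_failure(uptime_hours))     # 700.0
print(mean_time_to_recovery(recovery_hours))  # 2.0
```

Since recovery cannot begin until an alert is acknowledged, a lower MTTA also helps lower MTTR.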
This posting does not necessarily represent Splunk's position, strategies or opinion.