
Failure Metrics & KPIs for IT Systems

The game in enterprise IT is this: delivering amazing services to your customers while also reducing costs.

That means the time it takes to respond to an incident is critical. Incidents can ruin service delivery and destroy your budget. Certain incidents almost surely deliver a poor customer experience.

Response times, you say? Yep, we’re talking about MTTR, but that’s not all. Mean time to resolve is no secret formula for success, and it certainly isn’t The One Metric that defines a system’s health or an IT team’s performance. In fact, several additional metrics, known as failure metrics, measure different things that together support your reliability, availability and maintainability.

So, let’s look at the topic of failure metrics in IT systems and infrastructure.

What are failure metrics?

Failure metrics are performance indicators that enable organizations to track the reliability, availability and maintainability of their equipment and systems.

These failures can range from the smallest desktop service requests, such as basic troubleshooting of a device or a connectivity problem, to major incidents like server failures and other malfunctioning components. They can have ripple effects for anyone currently using the system, or even downstream effects where the failure of one part of a system affects how another part functions.



The term “failure” doesn’t solely refer to non-functioning devices or systems, such as a crashed file server. Failure in this context can also denote systems that are still running but have intentionally been taken offline because of age or degraded performance. Any system that isn’t meeting its objectives can be declared a failure.

And right here is the perfect time to briefly detour into the system design attributes of RAM.

RAM: reliability, availability & maintainability

Often abbreviated as RAM, reliability, availability and maintainability are system design attributes that influence the lifecycle costs of a system and its ability to meet its mission goals. As such, RAM can be a measure of an organization’s confidence in its hardware, software and networks.

Individually, these attributes can illuminate the strengths and weaknesses of a system. Understood together, they impact:

  • System productivity
  • Overall customer satisfaction
  • Your organization’s bottom line

Taken together, RAM can be used to determine a complex system’s uptime (reliability) and downtime/outage patterns (maintainability), as well as its uptime percentage over a particular span of time (availability).

Reliability

Reliability is defined as the probability that a system will consistently perform its intended function without failure for a given period of time.

Simply, reliability means that your IT systems work consistently and don't unexpectedly break down. For example, when you press the power button on your computer, it should start up every time without crashing or freezing. Now play that out over the many computers, devices and various apps each of your users is running.

Of course, all hardware and software is subject to failure, so failure metrics like MTBF, MTTF and failure rates are often used to measure and predict component and system reliability. Importantly, reliability is not the same as redundancy – that’s a different concept.

(Reliability is the heart of the SRE practice.)

Availability

In this context, availability is the probability that a system is operating as designed when it needs to be used. Therefore, it is a function of both reliability and maintainability, and can be calculated by dividing MTBF by the sum of MTBF and MTTR: A = MTBF / (MTBF + MTTR).
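
To make that formula concrete, here’s a minimal Python sketch with hypothetical numbers (the 500-hour MTBF and two-hour MTTR are purely illustrative):

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical: a system that runs ~500 hours between failures
    # and takes ~2 hours on average to repair.
    print(f"Availability: {availability(500, 2):.2%}")  # 99.60%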

(Availability also plays a crucial role in the CIA cybersecurity triad.)

Maintainability

Maintainability describes the ease and speed with which a system and its component parts can be repaired or replaced, then restored to full operation after a failure.

A system’s maintainability depends on a host of factors, including:

  • The quality of the equipment and its installation
  • The skill and availability of IT personnel that support it
  • The adequacy and efficiency of maintenance and repair procedures

How do you measure maintainability? MTTR is one of the benchmark metrics used to determine the maintainability of a component or system, with a low MTTR figure indicating high maintainability.



Popular failure metrics: MTTR vs. MTBF vs. MTTF… and more!

With all that background set, let’s look at some common failure metrics. These metrics help you track and analyze the frequency, severity and impact of failures – so you can identify and prioritize areas that require improvement to enhance reliability and performance. All in the name of excellent customer experience.

(A note about “mean” vs. “average”: While the two terms have different meanings, especially in statistics, they are used interchangeably in the context of failure metrics. In other words, there is no difference between mean time to repair and average time to repair.)

Mean time to repair (MTTR)

Let’s start with everyone’s go-to failure metric: mean time to repair. Also known as mean time to recover or resolve, MTTR is the average time it takes for a failed system to be fixed and restored to full functionality.

It is a relative measure: depending on the system, MTTR could be measured in anything from minutes to days. Mathematically, the MTTR metric is defined as:

MTTR = Time elapsed as downtime / number of incidents

or

MTTR = Time elapsed as maintenance / number of repairs

If MTTR is continuously improving, that can only be seen as a positive sign. Getting a handle on MTTR is one of the best ways to monitor, measure and ultimately improve your overall metrics and performance.
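
As a quick illustration of the first formula, here’s a minimal Python sketch using made-up incident durations:

    # Hypothetical downtime, in minutes, for each incident in a reporting period.
    incident_downtime_minutes = [42, 18, 96, 30]

    # MTTR = time elapsed as downtime / number of incidents
    mttr = sum(incident_downtime_minutes) / len(incident_downtime_minutes)
    print(f"MTTR: {mttr:.1f} minutes")  # 46.5 minutes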

(Read our full MTTR explainer.)

Mean time between failures (MTBF)

MTBF tracks how often something breaks down. It is the average amount of time between one failure and the next. Basically, if one thing just failed, how long before the next one will?

As with MTTR, this is a relative metric that can be applied to anything from an individual component to an entire system. MTBF can be useful as an overall measure of the performance and reliability of an individual system or for your infrastructure as a whole.

MTTR and MTBF are often used together to calculate the uptime of a system: MTBF describes how long it runs between breakdowns, and MTTR describes how quickly it can be brought back to a functioning state after one. A healthy trend would be a steady reduction in MTTR combined with increasing MTBF, describing a system with minimal downtime and the ability to quickly recover when downtime does happen.
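
If you log failure timestamps, a rough MTBF calculation is just the average gap between consecutive failures. Here’s a minimal Python sketch with hypothetical timestamps:

    from datetime import datetime

    # Hypothetical timestamps of consecutive failures for one system.
    failures = [
        datetime(2023, 6, 1, 9, 0),
        datetime(2023, 6, 8, 14, 30),
        datetime(2023, 6, 20, 3, 15),
        datetime(2023, 6, 29, 22, 45),
    ]

    # MTBF here is treated simply as the average gap between consecutive failures.
    gaps_hours = [(b - a).total_seconds() / 3600 for a, b in zip(failures, failures[1:])]
    mtbf_hours = sum(gaps_hours) / len(gaps_hours)
    print(f"MTBF: {mtbf_hours:.1f} hours")  # roughly 228.6 hours in this example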

Mean time to failure (MTTF)

Mean time to failure is the average time a device or system spends in a functioning (working) state between failures. Typically, IT teams collect this data by observing system components for several days or weeks.

While similar to MTBF, MTTF is normally used to describe replaceable items such as a tape drive in a backup array. This way you can know how many to keep on hand over a given time period.

MTBF, above, is used with items that can be either repaired or replaced.
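
As a back-of-the-envelope sketch of that spares planning, here’s a hypothetical Python example (the 8,000-hour MTTF and the fleet size are made up for illustration):

    # Hypothetical: tape drives with an observed MTTF of 8,000 operating hours.
    mttf_hours = 8_000
    fleet_size = 20                     # drives in service
    planning_horizon_hours = 24 * 365   # one year of continuous operation

    # Rough expected number of replacements over the year,
    # assuming failed drives are replaced rather than repaired.
    expected_replacements = fleet_size * planning_horizon_hours / mttf_hours
    print(f"Plan for roughly {expected_replacements:.0f} replacement drives")  # ~22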

Mean time to detect (MTTD)

Mean time to detect is all about one thing: how long it takes your team to detect an issue. Formally, MTTD is the average elapsed time between when a problem starts and when it is detected. MTTD denotes the span of time before IT receives a trouble ticket and starts the MTTR clock.



Mean time to investigate (MTTI)

Mean time to investigate is the average amount of time between when a fault is detected (as measured by MTTD above) and when the IT team begins to investigate it. We can also say it’s the gap between detection and the start of the MTTR clock.
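
Here’s a minimal Python sketch that computes both MTTD and MTTI from hypothetical incident timestamps (problem start, detection, start of investigation):

    from datetime import datetime

    # Hypothetical records: (problem started, detected, investigation began)
    incidents = [
        (datetime(2023, 7, 1, 10, 0), datetime(2023, 7, 1, 10, 12), datetime(2023, 7, 1, 10, 20)),
        (datetime(2023, 7, 3, 2, 5), datetime(2023, 7, 3, 2, 35), datetime(2023, 7, 3, 2, 50)),
    ]

    def minutes(delta):
        return delta.total_seconds() / 60

    # MTTD: average time from problem start to detection.
    mttd = sum(minutes(detected - started) for started, detected, _ in incidents) / len(incidents)
    # MTTI: average time from detection to the start of the investigation.
    mtti = sum(minutes(began - detected) for _, detected, began in incidents) / len(incidents)

    print(f"MTTD: {mttd:.1f} minutes, MTTI: {mtti:.1f} minutes")  # 21.0 and 11.5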

Mean time to restore service (MTRS)

Mean time to restore service sounds more obvious: how long does it take you to get it back up and running?

Formally, MTRS is the average time between when a fault is detected and when the service is restored to full working order from the users’ perspective. MTRS differs from MTTR like so:

  • MTTR tracks how long it takes to repair an item.
  • MTRS describes how long it takes to restore full service after that item is repaired.

Mean time between system incidents (MTBSI)

MTBSI is the average time between two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS.
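
A quick worked example with hypothetical figures:

    # Hypothetical figures for one service, in hours.
    mtbf_hours = 340.0   # average uptime between failures
    mtrs_hours = 4.5     # average time to restore full service once a fault is detected

    # MTBSI: average time between two consecutive incidents.
    mtbsi_hours = mtbf_hours + mtrs_hours
    print(f"MTBSI: {mtbsi_hours:.1f} hours")  # 344.5 hours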

Failure rate

The failure rate is a metric that tracks how often something fails, expressed as a number of failures per unit of time.
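
Here’s a minimal sketch with hypothetical numbers; note that for a roughly constant failure rate, MTBF works out to its reciprocal:

    # Hypothetical: 6 failures observed over 4,380 hours of operation (about half a year).
    failures_observed = 6
    operating_hours = 4_380

    failure_rate = failures_observed / operating_hours             # failures per hour
    print(f"Failure rate: {failure_rate:.4f} failures per hour")   # 0.0014
    print(f"Equivalent MTBF: {1 / failure_rate:.0f} hours")        # 730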

Failure metrics are essential for managing downtime and its potential to negatively affect the business. They provide IT with the quantitative and qualitative data needed to better plan for and respond to inevitable system failures.

To use failure metrics effectively, you must collect a large amount of specific, accurate data. Doing that manually would be tedious and time-consuming; fortunately, you don’t have to.

With the right enterprise software, you can easily collect the necessary data and calculate these metrics, drawing from a variety of sources, both digital and non-digital, with just a few clicks.

Prevent failures: Splunk for observability & security

With Splunk and our Unified Security and Observability Platform, security teams, ITOps pros and engineers get solutions built for their specific needs plus all the benefits of a unified data platform.

The Splunk Observability portfolio provides complete visibility across your full stack: everything that supports your infrastructure, applications and the digital customer experience. Splunk Observability provides the insights you need to ensure the continuous health, reliability and performance of your business — and the applications and infrastructure that it runs on.

Splunk’s security and observability solutions are powered by the Splunk platform, which provides comprehensive visibility across the hybrid and edge technology landscape, as well as powerful tools for investigation and response, all at scale. 


This posting does not necessarily represent Splunk's position, strategies or opinion.

Posted by Chrissy Kidd

Chrissy Kidd is a technology writer, editor and speaker. Part of Splunk’s growth marketing team, Chrissy translates technical concepts to a broad audience. She’s particularly interested in the ways technology intersects with our daily lives.