
Failure Metrics & KPIs for IT Systems

The game in enterprise IT is this: delivering amazing services to your customers while also reducing costs.

That means the time it takes to respond to an incident is critical. Incidents can ruin service delivery and destroy your budget. Certain incidents almost surely deliver a poor customer experience.

Response times, you say? Yep, we’re talking about MTTR, but that’s not all. Mean time to resolve is no secret formula for success, and it certainly isn’t The One Metric that defines a system’s health or an IT team’s performance. In fact, several additional metrics, known as failure metrics, measure different things that together support your reliability, availability and maintainability.

So, let’s look at the topic of failure metrics in IT systems and infrastructure.

What are failure metrics?

Failure metrics are performance indicators that enable organizations to track the reliability, availability and maintainability of their equipment and systems.

These failures can range from the smallest desktop service requests, such as basic troubleshooting of a device or a connectivity problem, to major incidents like server failures and other malfunctioning components. They can have ripple effects for anyone currently using the system, or even downstream effects where the failure of one part of a system affects how another part functions.



The term “failure” doesn’t solely refer to non-functioning devices or systems, such as a crashed file server. Failure in this context can also denote systems that are still running but have intentionally been taken offline because of age or degraded performance. Any system that isn’t meeting its objectives can be declared a failure.

And right here is the perfect time to briefly detour into the system design attributes of RAM.

RAM: reliability, availability & maintainability

Often abbreviated as RAM, reliability, availability and maintainability are system design attributes that influence the lifecycle costs of a system and its ability to meet its mission goals. As such, RAM can be a measure of an organization’s confidence in its hardware, software and networks.

Individually, these attributes can illuminate the strengths and weaknesses of a system. Understood together, they impact:

  • System productivity
  • Overall customer satisfaction
  • Your organization’s bottom line

Taken together, RAM can be used to determine a complex system’s uptime (reliability) and downtime/outage patterns (maintainability), as well as its uptime percentage over a particular span of time (availability).

Reliability

Reliability is defined as the probability that a system will consistently perform its intended function without failure for a given period of time.

Simply, reliability means that your IT systems work consistently and don't unexpectedly break down. For example, when you press the power button on your computer, it should start up every time without crashing or freezing. Now play that out over the many computers, devices and various apps each of your users is running.

Of course, all hardware and software is subject to failure, so failure metrics like MTBF, MTTF and failure rates are often used to measure and predict component and system reliability. Importantly, reliability is not the same as redundancy – that’s a different concept.

(Reliability is the heart of the SRE practice.)

Availability

In this context, availability is the probability that a system is operating as designed when it needs to be used. Therefore, it is a function of both reliability and maintainability, and can be calculated by dividing MTBF by the sum of MTBF and MTTR: A = MTBF / (MTBF + MTTR).
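
To make that formula concrete, here’s a minimal Python sketch with hypothetical numbers (the 500-hour MTBF and two-hour MTTR are purely illustrative):

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical: a system that runs ~500 hours between failures
    # and takes ~2 hours on average to repair.
    print(f"Availability: {availability(500, 2):.2%}")  # 99.60%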

(Availability also plays a crucial role in the CIA cybersecurity triad.)

Maintainability

Maintainability describes the ease and speed with which a system and its component parts can be repaired or replaced, then restored to full operation after a failure.

A system’s maintainability depends on a host of factors, including:

  • The quality of the equipment and its installation
  • The skill and availability of IT personnel that support it
  • The adequacy and efficiency of maintenance and repair procedures

How do you measure maintainability? MTTR is one of the benchmark metrics used to determine the maintainability of a component or system, with a low MTTR figure indicating high maintainability.



Popular failure metrics: MTTR vs. MTBF vs. MTTF… and more!

With all that background set, let’s look at some common failure metrics. These metrics help you track and analyze the frequency, severity and impact of failures – so you can identify and prioritize areas that require improvement to enhance reliability and performance. All in the name of excellent customer experience.

(A note about “mean” vs. “average”: While the two terms have different meanings, especially in statistics, they are used interchangeably in the context of failure metrics. In other words, there is no difference between mean time to repair and average time to repair.)

Mean time to repair (MTTR)

Let’s start with everyone’s go-to failure metric: mean time to repair. Also known as mean time to recover or resolve, MTTR is the average time it takes for a failed system to be fixed and restored to full functionality.

It is a relative measure: depending on the system, MTTR could be measured in anything from minutes to days. Mathematically, the MTTR metric is defined as:

MTTR = Time elapsed as downtime / number of incidents

or

MTTR = Time elapsed as maintenance / number of repairs

If MTTR is continuously improving, that can only be seen as a positive sign. Getting a handle on MTTR is one of the best ways to monitor, measure and ultimately improve your overall metrics and performance.
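
As a quick illustration of the first formula, here’s a minimal Python sketch using made-up incident durations:

    # Hypothetical downtime, in minutes, for each incident in a reporting period.
    incident_downtime_minutes = [42, 18, 96, 30]

    # MTTR = time elapsed as downtime / number of incidents
    mttr = sum(incident_downtime_minutes) / len(incident_downtime_minutes)
    print(f"MTTR: {mttr:.1f} minutes")  # 46.5 minutes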

(Read our full MTTR explainer.)

Mean time between failures (MTBF)

MTBF tracks how often something breaks down. It is the average amount of time between one failure and the next. Basically, if one thing just failed, how long before the next one will?

As with MTTR, this is a relative metric that can be applied to anything from an individual component to an entire system. MTBF can be useful as an overall measure of the performance and reliability of an individual system or for your infrastructure as a whole.

MTTR and MTBF are often used together to calculate the uptime of a system: MTBF describes how long it runs between breakdowns, and MTTR describes how quickly it can be brought back to a functioning state after one. A healthy trend would be a steady reduction in MTTR combined with increasing MTBF, describing a system with minimal downtime and the ability to quickly recover when downtime does happen.
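
If you log failure timestamps, a rough MTBF calculation is just the average gap between consecutive failures. Here’s a minimal Python sketch with hypothetical timestamps:

    from datetime import datetime

    # Hypothetical timestamps of consecutive failures for one system.
    failures = [
        datetime(2023, 6, 1, 9, 0),
        datetime(2023, 6, 8, 14, 30),
        datetime(2023, 6, 20, 3, 15),
        datetime(2023, 6, 29, 22, 45),
    ]

    # MTBF here is treated simply as the average gap between consecutive failures.
    gaps_hours = [(b - a).total_seconds() / 3600 for a, b in zip(failures, failures[1:])]
    mtbf_hours = sum(gaps_hours) / len(gaps_hours)
    print(f"MTBF: {mtbf_hours:.1f} hours")  # roughly 228.6 hours in this example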

Mean time to failure (MTTF)

Mean time to failure is the average time a device or system spends in a functioning (working) state between failures. Typically, IT teams collect this data by observing system components for several days or weeks.

While similar to MTBF, MTTF is normally used to describe replaceable items such as a tape drive in a backup array. This way you can know how many to keep on hand over a given time period.

MTBF, above, is used with items that can be either repaired or replaced.
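
As a back-of-the-envelope sketch of that spares planning, here’s a hypothetical Python example (the 8,000-hour MTTF and the fleet size are made up for illustration):

    # Hypothetical: tape drives with an observed MTTF of 8,000 operating hours.
    mttf_hours = 8_000
    fleet_size = 20                     # drives in service
    planning_horizon_hours = 24 * 365   # one year of continuous operation

    # Rough expected number of replacements over the year,
    # assuming failed drives are replaced rather than repaired.
    expected_replacements = fleet_size * planning_horizon_hours / mttf_hours
    print(f"Plan for roughly {expected_replacements:.0f} replacement drives")  # ~22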

Mean time to detect (MTTD)

Mean time to detect is all about one thing: how long it takes your team to detect an issue. Formally, MTTD is the average elapsed time between when a problem starts and when it is detected. MTTD denotes the span of time before IT receives a trouble ticket and starts the MTTR clock.



Mean time to investigate (MTTI)

Mean time to investigate is the average amount of time between when a fault is detected (as measured by MTTD above) and when the IT team begins to investigate it. We can also say it’s the gap between detection and the start of the MTTR clock.
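
Here’s a minimal Python sketch that computes both MTTD and MTTI from hypothetical incident timestamps (problem start, detection, start of investigation):

    from datetime import datetime

    # Hypothetical records: (problem started, detected, investigation began)
    incidents = [
        (datetime(2023, 7, 1, 10, 0), datetime(2023, 7, 1, 10, 12), datetime(2023, 7, 1, 10, 20)),
        (datetime(2023, 7, 3, 2, 5), datetime(2023, 7, 3, 2, 35), datetime(2023, 7, 3, 2, 50)),
    ]

    def minutes(delta):
        return delta.total_seconds() / 60

    # MTTD: average time from problem start to detection.
    mttd = sum(minutes(detected - started) for started, detected, _ in incidents) / len(incidents)
    # MTTI: average time from detection to the start of the investigation.
    mtti = sum(minutes(began - detected) for _, detected, began in incidents) / len(incidents)

    print(f"MTTD: {mttd:.1f} minutes, MTTI: {mtti:.1f} minutes")  # 21.0 and 11.5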

Mean time to restore service (MTRS)

Mean time to restore service sounds more obvious: how long does it take you to get it back up and running?

Formally, MTRS is the average time between when a fault is detected and when the service is restored to full working order from the users’ perspective. MTRS differs from MTTR like so:

  • MTTR tracks how long it takes to repair an item.
  • MTRS describes how long it takes to restore full service after that item is repaired.

Mean time between system incidents (MTBSI)

MTBSI is the average time between two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS.
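
A quick worked example with hypothetical figures:

    # Hypothetical figures for one service, in hours.
    mtbf_hours = 340.0   # average uptime between failures
    mtrs_hours = 4.5     # average time to restore full service once a fault is detected

    # MTBSI: average time between two consecutive incidents.
    mtbsi_hours = mtbf_hours + mtrs_hours
    print(f"MTBSI: {mtbsi_hours:.1f} hours")  # 344.5 hours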

Failure rate

The failure rate is a metric that tracks how often something fails, expressed as a number of failures per unit of time.
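
Here’s a minimal sketch with hypothetical numbers; note that for a roughly constant failure rate, MTBF works out to its reciprocal:

    # Hypothetical: 6 failures observed over 4,380 hours of operation (about half a year).
    failures_observed = 6
    operating_hours = 4_380

    failure_rate = failures_observed / operating_hours             # failures per hour
    print(f"Failure rate: {failure_rate:.4f} failures per hour")   # 0.0014
    print(f"Equivalent MTBF: {1 / failure_rate:.0f} hours")        # 730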

Failure metrics are essential for managing downtime and its potential to negatively affect the business. They provide IT with the quantitative and qualitative data needed to better plan for and respond to inevitable system failures.

To use failure metrics effectively, you must collect a large amount of specific, accurate data. Doing that manually would be tedious and time-consuming; fortunately, you don’t have to.

With the right enterprise software, you can easily collect the necessary data and calculate these metrics, drawing from a variety of sources, both digital and non-digital, with just a few clicks.

Prevent failures: Splunk for observability & security

With Splunk and our Unified Security and Observability Platform, security teams, ITOps pros and engineers get solutions built for their specific needs plus all the benefits of a unified data platform.

The Splunk Observability portfolio provides complete visibility across your full stack: everything that supports your infrastructure, applications and the digital customer experience. Splunk Observability provides the insights you need to ensure the continuous health, reliability and performance of your business — and the applications and infrastructure that it runs on.

Splunk’s security and observability solutions are powered by the Splunk platform, which provides comprehensive visibility across the hybrid and edge technology landscape, as well as powerful tools for investigation and response, all at scale. 


This posting does not necessarily represent Splunk's position, strategies or opinion.

Posted by Chrissy Kidd

Chrissy Kidd is a technology writer, editor and speaker. Part of Splunk’s growth marketing team, Chrissy translates technical concepts to a broad audience. She’s particularly interested in the ways technology intersects with our daily lives.