Published Date: December 16, 2022
Mean time to repair (MTTR) is an important performance metric (a.k.a. a “failure metric”) in IT that represents the average time between the failure of a system or component and when it is restored to full functionality. (The acronym MTTR can also stand for mean time to recovery, mean time to resolve and mean time to resolution, all of which are used interchangeably.) It is one of the most visible and useful metrics to determine how well an organization’s IT infrastructure, systems and equipment are performing, and how efficient and effective the IT team is when responding to critical incidents.
When calculating MTTR, the clock starts ticking as soon as a failure is detected. MTTR includes the time it takes to diagnose the problem, repair it, test the fix and complete any other procedures that must take place before the service is up and running and normal operations resume. A low MTTR is therefore preferable to a high one: it indicates that the system was offline for a relatively short period of time, whereas a high MTTR suggests that users or customers were inconvenienced for longer. MTTR is also a relative measure; whether a given figure counts as low or high depends on how it compares, like for like, with comparable figures.
A high MTTR should prompt IT administrators to reevaluate their approach to troubleshooting, from monitoring and detection through diagnosis and resolution, with the goal of reducing MTTR and, with it, potential unplanned downtime.
Most service-level agreements (SLAs) between a customer and a service provider or vendor include MTTR in some manner as a guarantee of performance, and a high MTTR can lead to high penalties. It’s important to remember that MTTR represents a typical repair time, not a guaranteed one. A vendor claiming an MTTR of 24 hours is saying that’s how long it usually takes to complete a repair, but individual incidents could take more or less time to resolve.
What are failure metrics?
Failure metrics (such as MTTR) are performance indicators that allow organizations to track the reliability of their equipment and systems, from common desktop service requests — such as basic troubleshooting of a piece of equipment or connectivity problems — to server failures and other malfunctioning components that can have a significant impact. The term “failure” doesn’t only refer to non-functioning devices or systems (such as a crashed file server); it can also denote systems that are running but, due to degraded performance, have intentionally been taken offline. Any system that isn’t meeting its objectives can be declared a failure.
A note about “mean” vs. “average”: While the two terms have different meanings, especially in statistics, they are used interchangeably in the context of failure metrics. In other words, there is no difference between mean time to repair and average time to repair.
Common failure metrics include:
- Mean time to repair (MTTR): As described above, MTTR is the average time it takes for a failed system to be fixed and restored to full functionality. It is a relative measure, and therefore, depending on the system, could be measured in anything from minutes to days.
- Mean time between failures (MTBF): The average amount of time that elapses between one failure and the next. As with MTTR, this is a relative metric that can be applied to anything from an individual component to an entire system. MTBF can be useful as an overall measure of the performance and reliability of an organization’s systems and infrastructure.
- Mean time to failure (MTTF): The average time a device or system spends in a functioning state between failures. Typically, IT teams collect this data by observing system components for several days or weeks. While similar to MTBF, MTTF is normally used to describe replaceable items such as a tape drive in a backup array, whereas MTBF is used with items that can be either repaired or replaced.
- Mean time to detect (MTTD): The average elapsed time between when a problem starts and when it is detected. MTTD covers the span of time before IT receives a trouble ticket and starts the MTTR clock.
- Mean time to investigate (MTTI): The average amount of time between when a fault is detected and when the IT team begins to investigate it. It can also be described as the time between MTTD and the start of the MTTR clock.
- Mean time to restore service (MTRS): The average time between when a fault is detected and the service is restored to full working order from the users’ perspective. MTRS differs from MTTR in that MTTR denotes how long it takes to repair an item, while MTRS refers to how long it takes to restore full service after that item is repaired.
- Mean time between system incidents (MTBSI): The average time between two consecutive incidents. MTBSI can be calculated by adding MTBF and MTRS.
- Failure rate: A metric that tracks how often something fails, expressed as a number of failures per unit of time.
Failure metrics are essential for managing downtime and its potential to negatively affect the business. They provide IT with the quantitative and qualitative data needed to better plan for and respond to inevitable system failures.
To use failure metrics effectively, you must collect a large amount of specific, accurate data. This would be tedious and time-consuming to do manually, but modern enterprise software can easily collect the necessary data and calculate these metrics, drawing from a variety of sources with just a few clicks.
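As a rough illustration of how such software might derive these figures, the following Python sketch computes MTTD, MTTR and MTBF from a list of incident records. The `Incident` class, its field names and the sample timestamps are hypothetical, and MTBF is assumed here to mean the average gap between the start of one failure and the start of the next; a real tool may define these boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    """One failure record; the field names are hypothetical."""
    started: datetime    # when the problem actually began
    detected: datetime   # when monitoring or a ticket flagged it
    restored: datetime   # when the service was back to normal


def mean_hours(deltas):
    """Average a list of timedeltas and return the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def failure_metrics(incidents):
    """Return MTTD, MTTR and MTBF in hours for a non-empty incident list."""
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = mean_hours([i.detected - i.started for i in ordered])
    # The MTTR clock is assumed to start at detection.
    mttr = mean_hours([i.restored - i.detected for i in ordered])
    # Assumption: MTBF is the average start-to-start gap between failures.
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]
    mtbf = mean_hours(gaps) if gaps else None  # undefined for one incident
    return {"mttd": mttd, "mttr": mttr, "mtbf": mtbf}


# Made-up sample data: two incidents, two days apart.
incidents = [
    Incident(datetime(2022, 12, 1, 8, 0), datetime(2022, 12, 1, 8, 30),
             datetime(2022, 12, 1, 10, 30)),
    Incident(datetime(2022, 12, 3, 8, 0), datetime(2022, 12, 3, 8, 30),
             datetime(2022, 12, 3, 12, 30)),
]
metrics = failure_metrics(incidents)
print(metrics)
```

With this sample data, the sketch yields an MTTD of 0.5 hours, an MTTR of 3 hours and an MTBF of 48 hours.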
Figure: Failure metrics timeline
What are reliability, availability and maintainability?
Often abbreviated as RAM, reliability, availability and maintainability are system design attributes that influence the lifecycle costs of a system and its ability to meet its mission goals. As such, RAM can be a measure of an organization’s confidence in its hardware, software and networks.
Each of these attributes can illuminate the strengths and weaknesses of a system and their respective impact on productivity, customer satisfaction and the organization’s bottom line.
Reliability refers to the probability that a system will consistently perform its intended function without failure for a given period of time. Because reliability is to some extent a function of the quality of a product, it’s an inherent characteristic of a system’s various components (as well as its design). However, all hardware and software is subject to failure, so failure metrics like MTBF, MTTF and failure rate are often used to measure and predict component and system reliability.
Availability is the probability that a system is operating as designed when it needs to be used. Therefore, it is a function of both reliability and maintainability, and can be calculated by dividing MTBF by the sum of MTBF and MTTR (A = MTBF / (MTBF + MTTR)).
Maintainability describes the ease and speed with which a system and its component parts can be repaired or replaced, then restored to full operation after a failure. A system’s maintainability depends on a host of factors, including the quality of the equipment and its installation, the skill and availability of IT personnel, and the adequacy and efficiency of maintenance and repair procedures. MTTR is one of the benchmark metrics used to determine the maintainability of a component or system, with a low MTTR figure indicating high maintainability.
Taken together, RAM can be used to determine a system’s uptime (reliability) and downtime/outage (maintainability) patterns, as well as its percentage of uptime over a particular span of time (availability).
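The availability formula, A = MTBF / (MTBF + MTTR), can be sketched in a few lines of Python. The function name and the sample figures (997 hours between failures, 3 hours to repair) are assumptions chosen for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability A = MTBF / (MTBF + MTTR) as a fraction of 1."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Hypothetical system: 997 hours between failures, 3 hours to repair.
a = availability(mtbf_hours=997.0, mttr_hours=3.0)
print(f"{a:.1%}")  # prints 99.7%
```

Because MTTR sits in the denominator, shortening repair time raises availability even when the failure frequency stays the same.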
Why is MTTR important?
The purpose of MTTR is to track the time that business-critical systems are unavailable for use, which makes it a valuable metric when analyzing the overall severity and impact of an IT incident. It’s also a performance metric that can measure the efficiency and effectiveness of the IT team in responding to incidents. Reducing overall MTTR is a key goal for IT teams. A comparatively low MTTR indicates an IT team that is doing its job well, whereas a high (or rising) MTTR is a sign of problems, either with the IT team’s performance or elsewhere in the system. In any case, MTTR is a valuable benchmark.
What is the difference between MTTR and MTBF?
MTBF is a metric of how often something breaks down. After a breakdown, MTTR describes how quickly it can be brought back to a functioning state. While they measure different things, they can be used together to calculate the uptime of a system. A healthy trend would be a steady reduction in MTTR combined with increasing MTBF, describing a system with minimal downtime and the ability to quickly recover when it happens.
How is MTTR calculated?
MTTR is determined by taking the amount of downtime directly attributable to failures and dividing it by the total number of failures. For example, if a system failed four times in one month, resulting in a total downtime of 12 hours, the MTTR calculation would look like this:
12 hours downtime / 4 failures = 3 hours MTTR
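The same arithmetic can be expressed as a short Python sketch. The individual outage durations below are made up; they are chosen only so that four failures add up to the 12 hours of downtime in the example.

```python
def mean_time_to_repair(repair_hours):
    """Return total downtime divided by the number of failures."""
    if not repair_hours:
        raise ValueError("at least one failure is required")
    return sum(repair_hours) / len(repair_hours)


# Four hypothetical outages totaling 12 hours of downtime.
outages = [2.0, 5.0, 1.0, 4.0]
result = mean_time_to_repair(outages)
print(result)  # prints 3.0
```

Note that the 3-hour average hides the spread: one of these sample outages lasted 5 hours and another only 1.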
What is MTTR in ITIL?
MTTR is a key metric in the Information Technology Infrastructure Library (ITIL), a framework documented in a series of written volumes that detail best practices for better aligning IT service management (ITSM) with business needs.
ITIL breaks down IT functions into several measurable processes, including service catalog management, service level management, risk management, capacity management, availability management, IT service continuity management, compliance management, IT architecture management and supplier management.
MTTR is noted along with MTBF, MTBSI and MTRS as a measurement for incident and problem management that may be included in a service level agreement (SLA).
What is MTTR in DevOps?
In the IT practice known as DevOps, MTTR is called mean time to recovery, but the equation is otherwise the same. Because of the nature of DevOps, MTTR is used as a measure of the time it takes the DevOps team to recover from an issue during production. It is usually figured as average production downtime over the previous 10 occurrences of downtime.
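One way to track an average over the previous 10 occurrences of downtime is a fixed-size rolling window. The sketch below is a minimal illustration, not a standard DevOps tool: the `RollingMTTR` class, its method names and the sample durations are all hypothetical.

```python
from collections import deque


class RollingMTTR:
    """Average downtime over the most recent `window` occurrences."""

    def __init__(self, window: int = 10):
        # A deque with maxlen silently drops the oldest entry when full.
        self._durations = deque(maxlen=window)

    def record(self, downtime_minutes: float) -> None:
        """Add the duration of one production outage, in minutes."""
        self._durations.append(downtime_minutes)

    def value(self):
        """Return the rolling MTTR, or None if nothing is recorded yet."""
        if not self._durations:
            return None
        return sum(self._durations) / len(self._durations)


# Twelve made-up outages lasting 1 to 12 minutes; only the last 10 count.
tracker = RollingMTTR(window=10)
for minutes in range(1, 13):
    tracker.record(minutes)
current = tracker.value()
print(current)  # prints 7.5
```

A rolling window reflects recent performance, so the figure reacts to process improvements instead of being anchored by old incidents.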
Metrics are always essential to ensure and quantify DevOps success. Though MTTR can be skewed by the volume of new features being added to an app, code complexity and other production variables, it generally provides an accurate measure of a team’s capabilities. Ideally, MTTR will shrink as an organization’s DevOps implementation matures.
For anyone hoping to communicate the value of a DevOps approach, MTTR can be a valuable metric, used to express the amount of time (and money) saved by improving productivity and reducing downtime.
What is MTTR in continuous development?
MTTR is also a useful metric for anyone describing an organization’s continuous development (CD) process.
Speed of software development and delivery is a vital driver to the success of most organizations. A robust continuous delivery process employs a “build, measure, learn” feedback loop to ensure that it’s always improving and meeting business goals.
Because speed and stability are the foundation of continuous development, metrics that help evaluate and improve these issues are essential. There are no standardized metrics to monitor for continuous development, and ultimately each organization must decide which metrics are right for its goals. MTTR is frequently used as a measure of how quickly the IT team is able to respond to issues in the CD pipeline with the aim of increasing stability.
How do you lower MTTR?
While many of the issues that contribute to a high MTTR will be unique to each organization (requiring specific evaluation of its particular IT processes and procedures), there are several general steps to lower MTTR that are likely to benefit any business.
Analyze incidents: The first step to improving MTTR is to understand the incidents that cause it. Modern IT software can bring together metrics from all over your system to help with root cause analysis, making sure you have the most reliable MTTR calculation possible and that you’re getting useful insights.
Monitor, monitor, monitor: If you want to fix an issue, you have to know what it is and where and when it occurred. An advanced IT monitoring solution will give you real-time, uninterrupted data to help you fully understand your system’s performance and give you all the data related to any fault or failure.
Have an incident response plan: Organizations with a carefully planned incident response protocol are much more likely to respond quickly and effectively to issues and therefore have a lower MTTR. For many organizations, this likely includes an IT service management (ITSM) approach. Companies that have successfully undergone full digital transformation may take a more flexible approach, employing cross-functional collaboration tools and constructing specific responses — even explicit checklists — for each incident. The key to any plan, regardless, is to have a clear understanding of who to notify of an incident, how it should be documented and what steps should be taken to rectify it.
Automate your incident-management system: A quick response starts with making sure the right people receive accurate information about a problem quickly. Alerting a team member with a phone call may be fine for low-priority incidents during business hours. But what happens if a failed server takes your website offline at 8 p.m. on a Friday? Automated incident management systems handle the process of sending alerts in multiple channels (phone calls, SMS texts, email, etc.) to all incident responders, reducing the time frame to notify people and ensuring that everyone is in the loop.
With enterprise IT being pressured to increase service levels while reducing costs, incident-response times are critically important. MTTR isn’t a secret formula for success. Nor is it the one metric that defines a system’s health or an IT team’s performance. That being said, it is a very powerful measure. If MTTR is continuously improving, that can only be seen as a positive sign. System downtime is one of the most insidious and visible issues an IT department can struggle with. Getting a handle on MTTR is one of the best ways to monitor, measure and ultimately improve your overall metrics and performance.
