DATA INSIDER

What are Service Level Objectives and Service Level Indicators (SLOs/SLIs)?

Service Level Objectives (SLOs) are reliability targets for technology products and services. Service Level Indicators (SLIs) are the measurements that are compared against those reliability targets. Often used together as acronyms, SLOs and SLIs help businesses determine when various key performance indicators (KPIs) are being met, which can tell the enterprise whether performance or reliability is falling. They are also used for incident management, especially for troubleshooting problems when (or ideally before) they develop. Both terms are key to managing Service Level Agreements (SLAs), which define the minimum service levels that a provider will deliver to a customer. In the case of a web service, this is usually a specified level of uptime or a maximum response time.

The management of SLOs, SLIs and SLAs is of particular interest to the site reliability engineer (SRE), who ensures networks and services are running as expected. Availability and reliability are of paramount concern to the SRE, who must balance the desire for near-100% uptime against the cost of providing this level of service. Other common SLOs cover error rates, throughput rates and the responsiveness of services such as service desk operations.

Here we’ll outline the differences between SLOs, SLIs and SLAs, discuss how SLOs and SLIs are calculated and show you how to create appropriate SLOs for your organization.

Understanding SLOs and SLIs

What is the difference between Service Level Objectives (SLOs) and Service Level Indicators (SLIs)?

Service Level Objectives (SLOs) define targets for certain metrics, while Service Level Indicators (SLIs) reflect ongoing measurements of those metrics. Since the two terms are closely related, they are often confused. It is easiest to think of an SLO as a target that you would like a service to reach, expressed as a percentage: 99.999% uptime or, equivalently, no more than 0.001% downtime from outages, for example. If a dashboard shows a dial from 0 to 100%, a mark at 99.999% would indicate the SLO. In this case, the SLI would be the needle on that dial that varies over a period of time. If that needle dips below the 99.999% mark, a performance threshold has been crossed. This prompts an alert and compels the site reliability engineer to take action. As part of a broader service level management ecosystem, SLOs and SLIs cannot exist without each other.
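The dial analogy can be sketched in a few lines of Python. This is only an illustration: the function name `slo_is_met` and the sample readings are hypothetical, and a real monitoring system would evaluate the SLI over a defined time window rather than on single readings.

```python
def slo_is_met(sli_percent: float, slo_percent: float) -> bool:
    """Return True when the measured SLI (the needle) is at or above the SLO (the mark)."""
    return sli_percent >= slo_percent


SLO_UPTIME = 99.999  # the mark on the dial, taken from the example in the text

# Hypothetical SLI readings: one above the mark, one below it.
for measured in (99.9995, 99.998):
    status = "OK" if slo_is_met(measured, SLO_UPTIME) else "ALERT: SLO breached"
    print(f"SLI {measured}% vs SLO {SLO_UPTIME}%: {status}")
```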

How is an SLI calculated?

SLIs are simply calculated as the percentage of total events that are considered acceptable. If you are measuring the total number of HTTP requests that are completed successfully, the formula for the corresponding SLI would be:

(successful requests / total requests) x 100

Similarly, consider an SLI that is designed to measure whether a server is becoming too slow. The organization would set a maximum acceptable latency (say, 400 milliseconds) and would calculate the SLI as such:

(requests completed in less than 400 milliseconds / total requests) x 100
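As a rough sketch, both formulas can be computed from a list of request records. The `Request` class, the function names and the sample data below are hypothetical, and treating any HTTP status below 500 as "successful" is an assumption that each organization would define for itself.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to complete the request


def sli_percent(good: int, total: int) -> float:
    """Percentage of events that were acceptable."""
    if total == 0:
        raise ValueError("an SLI is undefined when there are no events")
    return 100 * good / total


def availability_sli(requests: list[Request]) -> float:
    # Assumption: any status below 500 counts as a successful request.
    good = sum(1 for r in requests if r.status < 500)
    return sli_percent(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float = 400.0) -> float:
    # Counts requests completed in less than the latency threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli_percent(good, len(requests))


# Hypothetical sample: one server error and two slow responses out of five.
sample = [
    Request(200, 120.0),
    Request(200, 380.0),
    Request(503, 90.0),
    Request(200, 650.0),
    Request(200, 410.0),
]

print(f"availability SLI: {availability_sli(sample):.1f}%")
print(f"latency SLI:      {latency_sli(sample):.1f}%")
```

Note that the two SLIs are independent: the failed request in the sample was fast, so it counts against availability but not against latency.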

Any SLI must be calculated over a given amount of time, such as every minute, over the course of an hour or over an entire month. SLOs are commonly stated over fairly lengthy periods. For example, under the Amazon Compute service agreement, customers are due a 100% credit on their bill if monthly uptime falls below 95%. Evaluating an SLI over a shorter time period can be useful in troubleshooting problems, but for the purposes of ensuring SLA compliance, the organization will also need to consider an SLI with a longer time frame.
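To illustrate why the time frame matters, the sketch below evaluates the same events over short and long windows. The function `sli_by_window`, the event format (a Unix timestamp in seconds paired with a good/bad flag) and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def sli_by_window(events, window_seconds):
    """Group (timestamp_seconds, is_good) events into fixed windows.

    Returns a dict mapping each window's start time to its SLI percentage.
    """
    totals = defaultdict(lambda: [0, 0])  # window start -> [good, total]
    for ts, is_good in events:
        start = ts - ts % window_seconds
        totals[start][0] += int(is_good)
        totals[start][1] += 1
    return {
        start: 100 * good / total
        for start, (good, total) in sorted(totals.items())
    }


# Hypothetical events across two hours: the first hour is clean,
# the second hour has two failures.
events = [
    (0, True), (900, True), (1800, True), (2700, True),
    (3600, True), (4500, False), (5400, False), (6300, True),
]

hourly = sli_by_window(events, 3600)   # short window, useful for troubleshooting
overall = sli_by_window(events, 7200)  # long window, closer to compliance reporting
print("hourly:", hourly)
print("overall:", overall)
```

The hourly view exposes the bad second hour, while the longer window averages it into a single figure, which is why both views tend to be useful.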

What is a good SLO?

The qualities of a good Service Level Objective were initially laid out by Rick Sturm, Wayne Morris and Mary Jander in their 2000 book Foundations of Service Level Management. The authors state that good SLOs will have these key features:

  • Attainability: An SLO of 100% is functionally unattainable and represents a theoretical goal, not a useful SLO. SLOs should represent a minimum acceptable level of performance and never be considered “impossible.”
  • Meaningfulness: Many organizations set SLOs for metrics that aren’t meaningful to the enterprise. CPU utilization is often cited as a meaningless SLO; for most organizations, this is irrelevant to both users and the enterprise.
  • Measurability: If a metric cannot be accurately measured, it will not be useful as an SLO.
  • Controllability: An SLO that sets a maximum bar for the number of lightning strikes to the data center, for example, would not be valuable.
  • Understandability: Some metrics may not have immediate relevance to IT management. For example, “Packet collision” is a commonly used proxy for system performance, but it doesn’t offer any real meaning to users and should therefore not be used as an SLO.
  • Affordability: Does an organization really need 99.9999% uptime if 99.99% is acceptable? If an organization must invest in multiple redundant data centers, failover protocols and additional staff, the cost of the investment to ensure the SLO is met probably far outweighs the benefit the organization receives from roughly 52 extra minutes of uptime per year.
  • Mutual Acceptance: All SLOs must be agreed upon by all parties involved — typically the service provider and the customer.
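The arithmetic behind the affordability trade-off is easy to check. The sketch below converts an uptime SLO into allowed downtime per year; the function name is hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime permitted over the period by an uptime SLO given as a percentage."""
    return (100 - slo_percent) / 100 * period_minutes


for slo in (99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% uptime allows {minutes:.2f} minutes of downtime per year")

# Extra uptime gained by moving from 99.99% to 99.9999%.
gain = allowed_downtime_minutes(99.99) - allowed_downtime_minutes(99.9999)
print(f"99.99% -> 99.9999% gains about {gain:.0f} minutes per year")
```

Under these assumptions, 99.99% allows about 52.6 minutes of downtime per year and 99.9999% allows about half a minute, so the gain is roughly 52 minutes.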

Once an SLO is selected, the appropriate value of that SLO must be set. This is part of a negotiation between the service provider and the customer (whether internal or external), though in many cases, SLOs may be presented as “take it or leave it” values.

Appropriate SLO values include availability, meaningfulness and measurability, but never an ideal figure or maximum threshold.

Ultimately, it is important to understand that an SLO should never be considered an ideal figure or maximum threshold for any metric, but rather a minimum acceptable level for the organization to achieve its business goals.

What is the error rate of an SLI?

The error rate of an SLI is the proportion of activity that fails to meet the acceptance criterion the SLI measures. In other words, the error rate is the complement of the SLI: 1 – SLI = error rate.

The error rate is simply another way to think about performance metrics, and the maximum error rate that an SLO allows is often called the error budget. 99% uptime per month may be more immediately understandable (and cautionary) when expressed by its error rate of 1% downtime per month; the same holds when the SLO is on the order of 99.999% uptime (a 0.001% error rate). The budget may also be expressed as an absolute amount of time: 1% of a 30-day month is 7.2 hours of downtime per month, for example.

Conceptually, the idea of an error budget gives IT management the tools it needs to make more strategic decisions about downtime and latency. For example, if management knows that a server absolutely must be taken offline with no backup available and that the organization has an error budget of 7.2 hours per month of allowable downtime, management can plan the outage so that total downtime for the month stays within that budget.
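A minimal sketch of this kind of budgeting follows. It assumes a 30-day month, under which a 99% uptime SLO yields the 7.2-hour figure; the function names and the sample downtime values are hypothetical.

```python
HOURS_PER_30_DAY_MONTH = 30 * 24  # 720 hours (assumed month length)


def error_rate(sli_percent: float) -> float:
    """Error rate as the complement of the SLI, in percent."""
    return 100 - sli_percent


def error_budget_hours(slo_percent: float,
                       period_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Total downtime the SLO allows over the period."""
    return (100 - slo_percent) / 100 * period_hours


def remaining_budget_hours(slo_percent: float,
                           downtime_so_far_hours: float,
                           planned_hours: float = 0.0) -> float:
    """Budget left after unplanned downtime so far and any planned outage."""
    return error_budget_hours(slo_percent) - downtime_so_far_hours - planned_hours


budget = error_budget_hours(99.0)
# Hypothetical month: 1.5 hours already lost, 4 hours of maintenance planned.
left = remaining_budget_hours(99.0, downtime_so_far_hours=1.5, planned_hours=4.0)
print(f"monthly budget: {budget:.1f} h, remaining after maintenance: {left:.1f} h")
```

A negative remaining budget would signal that the planned outage should be shortened or postponed to keep the month within the SLO.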

SLAs, KPIs and SREs

How do you define Service Level Agreements (SLAs)?

Service Level Agreements are formal agreements — either between an organization and an outside party or within the organization — that specify Service Level Objectives for a provided service. SLAs formalize SLOs and, when used with a service provider (commonly a cloud service provider), usually establish penalties if an SLO is not achieved.

External SLAs also include other contractual terms, such as the mechanism required to apply for credits or refunds, exclusions for certain actions (such as customer error), a formal definition of how SLIs are calculated, and processes for terminating or altering the SLA.

Legal, incident response or engineering teams can use internal SLAs to measure the effectiveness of in-house operations like service desk operations, internal network performance and uptime, or even the proportion of chargebacks due to fraud.

How do KPIs relate to SLAs?

Key Performance Indicators (KPIs) are metrics that measure performance of a real-world business process. The term can be applied broadly to processes, systems or even individual workers. In the IT world, KPIs are commonly used to define system performance on an internal or external network, and they are ideally linked to business results. For example, the uptime of an ecommerce website is directly related to the revenue driven by that website; both of these may be used as KPIs.

KPIs are closely related to SLIs, and the terms are often used interchangeably in SLAs. However, some argue that while KPIs should be purely quantitative, SLIs should be somewhat more abstract, offering some level of reflection on the customer experience and a more strategic outlook instead of simply gauging performance.

How does an SRE define SLOs, SLIs and SLAs?

A Site Reliability Engineer (SRE) will have a distinct point of view on SLOs, SLIs and SLAs. Since SREs are primarily tasked with ensuring network and service availability, they focus these service-level metrics heavily on uptime and reliability. A system that is not available is not reliable, so SLAs are designed to ensure maximum availability. SLIs are examined in real time to monitor system conditions and historical SLIs are analyzed by an SRE to determine whether and when a system is likely to fail in the future. AI tools can be particularly instrumental in this analysis.

As with any SLO, as the required availability increases, so does the cost. SREs are well aware that 100% availability is an impossibility. Also, in many cases 100% uptime may not even be appropriate or desirable. For many services, planned downtime is intentional. It prevents over-reliance or inflated expectations about a single service’s availability and it can also be instrumental in discovering and preventing improper usage and security violations.

Benefits and Applications

What are various applications of SLOs and SLIs?

SLOs and SLIs are used in a variety of business and technology applications, both internally and externally. SLOs are widely used to measure the performance and reliability of cloud and SaaS products, covering web hosting uptime, cloud storage availability and the latency or responsiveness of cloud application hosting services. Although SLOs are commonly outlined in vendor contracts with cloud service companies, it is often the client’s responsibility to monitor SLIs to ensure compliance with the stated SLOs.

Here are some typical SLI applications you’re likely to encounter:

  • Server uptime: How often is a server or web service available and responsive?
  • Server latency: How long did it take the server or service to respond to a request?
  • Error rate: How often did a request (to a web server, for example) result in an error response?
  • Throughput/performance: How fast is data being delivered on a given channel?
  • Utilization: How frequently is a certain service being used?
  • Data freshness: What portion of data being delivered to users is the most recent version of the data?

SLI applications include latency, error rate and throughput, among others.

SLIs can also be used to gauge human performance. For example:

  • Service desk responsiveness: How quickly are help desk calls answered (or resolved)?
  • Escalation level: How frequently are help desk calls escalated to a higher severity?

In order to be useful, each of these SLIs should be measured over a set time period and benchmarked against a corresponding SLO to set a minimum threshold for these metrics.

Why do we need SLOs?

SLOs are critical to ensure an organization is receiving the services it is paying for. SLOs gauge the performance of your internal systems and people and can even quantify customer expectations and satisfaction. Without SLOs, an organization has no way to monitor these performance metrics and won’t know whether conditions are good or bad, improving or getting worse.

End users today are more demanding than ever when it comes to web services and applications. Short periods of downtime and minor latency issues may seem negligible, yet are keenly felt by the seasoned online user. In fact, latency of more than one second can cause users to lose focus on the task they are performing. After 10 seconds, users are likely to become so frustrated that they totally abandon the task they are working on.

Ensuring a responsive and error-free experience is increasingly a critical function of every business. Monitoring SLIs and gauging them against SLOs is the most effective way for a business to gain much-needed, quantifiable insight into its customer-facing service environment.

Getting Started

How do you get started implementing SLOs and SLIs?

SLOs and SLIs are important for nearly all enterprises. To get started, you need a deep understanding of how your various systems work and how employees and customers interact with them. Understanding your architecture and how that plays into the customer experience is vital.

In general, the best advice is to start with the basics. Simple SLOs around network and service availability and latency are the most common SLOs for a reason: they are easy to calculate, easy to understand and make a significant and obvious impact on the user. Specific SLO targets can be harder to pin down, but by monitoring SLIs and iterating over time, analysts can fine-tune SLOs to determine the level that offers the maximum value to the organization.

It’s also important to note that for many organizations, SLOs and SLIs aren’t created internally. They come built into SLAs that vendors have generated. For smaller organizations, these SLOs are unlikely to be negotiable — but they still represent important metrics that the organization should keep tabs on.

What are some important guidelines/best practices around SLOs/SLIs?

Here are some key best practices around setting and managing SLOs and SLIs:

  • Focus on metrics that matter: Ensure that SLOs are relevant and have a genuine impact on the customer.
  • Understandability is key: SLOs need to be easily understood by stakeholders outside of the IT department, so choose those that align well with business and customer needs.
  • Ensure SLIs are verifiable: You may need to audit SLIs periodically to be sure that they are being measured properly. If dashboards show an availability SLO is being met, but users are complaining that services are offline, you should audit the way your metrics are generated.
  • Minimize the total number of SLOs: Don’t overwhelm your dashboard with SLOs; ensure that only the most relevant, non-duplicative metrics are being tracked to keep things simple and maximize the utility of the system.
  • Act when SLOs are out of compliance: Act immediately to determine why the SLI has fallen below expectations and work on a remediation plan right away.
  • Periodically review all SLOs: Businesses and technologies both change, sometimes dramatically. An SLO that was relevant last year may be outdated — or irrelevant — today.

The Bottom Line: SLOs and SLIs are a core part of modern infrastructure management

Customers are more demanding than ever, and one of the clearest ways to measure the quality of their experience is through the use of SLOs and SLIs. By paying careful attention to SLOs and their corresponding SLIs, an organization can ensure that systems are running in accordance with expectations, that customers are happy and that profits are maximized.
