Chapter 9 | Measuring Success in SRE

Measuring for Program Success

Measuring for outcomes is always top of mind when we approach goals. We may be aiming for specific targets, but circling back to confirm that the resulting outcome is in fact what we were after is extremely important, and small course corrections are usually required along the way. Outcomes may be more general than targets, but they often attract the attention and support of decision makers earlier.

We needed to establish key measurements and thresholds, both to hold ourselves accountable for our efforts and to communicate expectations across the entire organization. Nearly every resource you find on site reliability engineering talks about the key metrics used to establish high-level objectives, the indicators of movement toward or away from those objectives, and ultimately the agreements that apply should objectives go unfulfilled.

Service Level Indicators (SLIs) tell us how we are performing against our Service Level Objectives (SLOs), and our Service Level Agreement (SLA) outlines the consequences (good or bad) of meeting or missing those objectives. Once we have data to observe, we begin orienting ourselves to it and establish what we believe our SLIs and SLOs to be.

Service Level Indicators (SLI)

An empathetic understanding of customer needs and expectations helps to inform indicators. At first look, there are many possible indicators that could be measured. However, we found that landing on only a handful of indicators that really matter was the right choice. A good balance of indicators helps teams accurately examine and understand the important aspects of the system.
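To make the idea of an indicator concrete, here is a minimal sketch of one common way to express an SLI: the fraction of "good" events out of all events in a measurement window. The names EventCounts and sli, and the sample numbers, are hypothetical and are not taken from our platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCounts:
    """Counts of events observed over one measurement window (hypothetical shape)."""
    good: int   # events that met the customer's expectation
    total: int  # all valid events in the window


def sli(counts: EventCounts) -> float:
    """Return the indicator as the fraction of good events (0.0 to 1.0)."""
    if counts.total == 0:
        # Assumption: with no traffic we report 1.0 rather than divide by zero.
        return 1.0
    return counts.good / counts.total


# Made-up example: 9,990 of 10,000 alerts were processed within expectations.
alerts = EventCounts(good=9_990, total=10_000)
print(f"alert processing SLI: {sli(alerts):.4f}")
```

Expressing every indicator as the same kind of ratio keeps a handful of SLIs easy to compare side by side.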

Service Level Objectives (SLO)

Service Level Objectives (SLOs) are established measurements that inform Service Level Agreements (SLAs). An SLO sets a target value (or range of values) for an indicator, against which we assess the overall trajectory of the service and, eventually, whether the objective itself was set accurately.
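As an illustration of "a target value (or range of values)", the sketch below models an SLO as a lower bound with an optional upper bound and checks a measured value against it. The SLO class, its field names, and the example targets are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A hypothetical objective: a minimum target with an optional maximum."""
    name: str
    minimum: float
    maximum: Optional[float] = None  # None means "no upper bound"

    def is_met(self, measured: float) -> bool:
        """Return True when the measured value falls inside the target range."""
        if measured < self.minimum:
            return False
        return self.maximum is None or measured <= self.maximum


# Made-up targets: a single target value, and a range of acceptable values.
availability = SLO("alert processing success", minimum=0.999)
notify_seconds = SLO("seconds to first notification", minimum=0.0, maximum=30.0)

print(availability.is_met(0.9995))   # measured value is above the target
print(notify_seconds.is_met(42.0))   # measured value is outside the range
```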

Service Level Agreements (SLA)

What happens when SLOs are breached is what SLAs address. The council does not accept ownership of constructing SLAs because SLAs are closely tied to business and product decisions typically managed higher up the chain of command. The council will, however, be involved in helping to avoid the consequences of missed SLOs.

Assignment Three: Establish Blackbox Metrics and Service Level Expectations

Our next assignment was to begin establishing our own blackbox metrics and expectations regarding service levels. This included coming to an agreement on thresholds, how we should address violations, what we should make visible right away, and what types of alerts should go to engineering.

As a group, we determined that we would collaboratively define SLIs and SLOs with interested parties (e.g., Engineering, Product). We would ensure that all teams agree on what constitutes a violation. We also determined that the metrics we had surrounding service level expectations should be made visible through dashboards.

An assortment of dashboards related to service expectations began circulating within our Slack groups. We also put alerts in place to reinforce the importance of our new metrics. As a group, we would address violations and aim to achieve our SLOs with near perfection, allowing us to maximize change velocity without violating an SLO.
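One common way SRE teams balance change velocity against an SLO is an error budget: the amount of failure the SLO still permits. The sketch below is an illustration of that general technique, not a description of our own tooling; the function names and the 10% reserve are hypothetical.

```python
def error_budget_remaining(sli_value: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target   # failure the SLO tolerates
    actual_failure = 1.0 - sli_value     # failure actually observed
    return 1.0 - actual_failure / allowed_failure


def can_ship(sli_value: float, slo_target: float, reserve: float = 0.1) -> bool:
    """Allow risky changes only while more than `reserve` of the budget is left."""
    return error_budget_remaining(sli_value, slo_target) > reserve


# Made-up numbers: a 99.9% target with a measured SLI of 99.95%.
print(f"budget left: {error_budget_remaining(0.9995, 0.999):.0%}")
print("ship?", can_ship(0.9995, 0.999))
```

Under this reading, running well above the target means unspent budget that could have gone toward faster change, while dropping below it is a signal to slow down.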

Key Takeaways

When we returned for the next council meeting, engineers had already begun plumbing in new instrumentation such as the Grafana “OpsDash”, building dashboards, and cleaning up existing murky data collection. This led to creating and socializing deployment dashboards (also in Grafana) that are available to the whole company at any time.

We updated the metrics for our Jenkins jobs, along with annotations for those jobs in Grafana, and we added a number of new metrics for deployment health and general health. All of this resulted from simply establishing a few key metrics and processes that we felt were important to gain visibility into.

Engineers added Prometheus to our bootstrap process, making metrics collection and experimentation accessible to any developer who is interested. We also implemented a new abstraction for collecting metrics in our platform. We determined healthy “golden signal” metrics around our socket traffic in the browser and established thresholds for all three visible panes (people, timeline, incident).

Initial golden signals:

  • Alerts Received / Processed
  • Incidents Created / Resolved
  • WebSocket Connections Connected / Disconnected
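Each of the signals above is a pair of counts. One possible way to read a pair, sketched below, is as the difference between the two sides: alerts not yet processed, incidents still open, and connections currently open. The PairedSignal class, the limits, and all numbers are hypothetical; they are not our actual thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairedSignal:
    """A hypothetical signal made of two running counts."""
    name: str
    opened: int = 0   # received / created / connected
    closed: int = 0   # processed / resolved / disconnected

    def outstanding(self) -> int:
        """Items counted on the first side that have not reached the second."""
        return self.opened - self.closed


def over_limit(signals: List[PairedSignal], limits: Dict[str, int]) -> List[str]:
    """Return names of signals whose outstanding count exceeds its limit."""
    return [s.name for s in signals
            if s.name in limits and s.outstanding() > limits[s.name]]


# Made-up counts for one observation window.
signals = [
    PairedSignal("alerts", opened=1200, closed=1195),
    PairedSignal("incidents", opened=40, closed=31),
    PairedSignal("websocket connections", opened=800, closed=650),
]
limits = {"alerts": 50, "incidents": 5}  # assumed thresholds, for illustration

for s in signals:
    print(f"{s.name}: {s.outstanding()} outstanding")
print("over limit:", over_limit(signals, limits))
```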

These would serve as a starting point, and we are already exploring more signals, such as “Time to First Notification” (TTFN), as our best leading indicator. Could this possibly be an even better Netflix-like PPS metric?
