Chapter 7 | Monitoring and Alerting for SRE

If you can’t monitor a service, you don’t know what’s happening. You can’t be reliable.
Site Reliability Engineering: How Google Runs Production Systems


Monitoring, a subset of observability, tells engineers when acceptable thresholds have been breached, provided that service levels have been established and system health data is collected consistently. It also shows when something is moving the current state considerably away from pre-established baselines. So long as monitoring is instrumented early in the software development life cycle (SDLC), engineers can decide early on which types of problems they should be alerted to.
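To make the two conditions above concrete, here is a minimal Python sketch, not our production logic. It assumes a static threshold stands in for the service level and a simple standard-deviation test stands in for "moving away from the baseline". The function names, the 300 ms threshold, and the sample latencies are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical service-level threshold for p95 latency, in milliseconds.
LATENCY_THRESHOLD_MS = 300


def breaches_threshold(value, threshold=LATENCY_THRESHOLD_MS):
    """Return True when a reading exceeds the agreed threshold."""
    return value > threshold


def deviates_from_baseline(value, baseline, max_sigma=3.0):
    """Return True when a reading is more than max_sigma sample standard
    deviations away from the mean of the baseline readings."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline readings
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Made-up baseline of recent p95 latency readings (ms).
baseline_ms = [118, 122, 120, 125, 119, 121, 123, 120]

for reading in (122, 250, 350):
    print(reading,
          breaches_threshold(reading),
          deviates_from_baseline(reading, baseline_ms))
```

With this made-up data, a 250 ms reading is still under the 300 ms threshold, yet it sits far outside the baseline. That is the kind of early signal the second check is meant to surface.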

We want to avoid unactionable alerts, which lead not only to burnout but also to future inaction. When everything is urgent, nothing is urgent. With too much noise, it’s impossible to know what’s actually important from the customer’s perspective, so critical issues often end up being ignored.

As we began establishing various metrics and going deeper into our discussions around monitoring and alerting, the council turned its attention to two distinct ways of looking at monitoring: Black Box and White Box.

Black Box & White Box Monitoring

When you hear the terms Black Box and White Box monitoring, there are a couple of ways to interpret them. One is to think of Black Box as “pull-based” monitoring and White Box as “push-based” monitoring. James Turnbull’s book, “The Art of Monitoring,” gives one of the better explanations of these two types of monitoring from this perspective.

However, another explanation of Black Box and White Box monitoring exists. If we abstract away all of the inner workings of VictorOps and look purely at what a customer expects from the service, that is Black Box monitoring. Is it working? What “it” is can vary, but the idea is binary: it’s either good or bad, working or not. We are monitoring externally visible behavior as a user would see it. How does the “value” look?
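The binary “is it working?” question can be sketched as a small external probe. This is an illustrative sketch only: the `probe` function, its parameters, and the example URL are assumptions, and a real check would usually exercise a full user journey rather than a single request. It uses only Python’s standard `urllib`.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Black-box check: True if the endpoint answers with a 2xx status
    within timeout_s seconds, False for anything else.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # HTTP errors, timeouts, refused connections, and malformed URLs
        # all look the same from the customer's side: it is not working.
        return False
```

A caller might run something like `probe("https://status.example.com/health")` on a schedule (the URL is a placeholder). The probe knows nothing about the internals of the service. It only reports what a user would experience.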

Going deeper takes us to White Box monitoring in this model. This is monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics (O’Reilly).
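As one illustration of the last item, here is a rough sketch of an HTTP handler that emits internal statistics, built on Python’s standard `http.server`. The `/stats` path, the counter names, and the port are assumptions made for the example, not a description of how VictorOps exposes its internals.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class Stats:
    """Thread-safe counters the service updates as it does its work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = time.monotonic()
        self._counters = {"requests_total": 0, "errors_total": 0}

    def incr(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def snapshot(self):
        """Return a copy of the counters plus process uptime in seconds."""
        with self._lock:
            data = dict(self._counters)
        data["uptime_seconds"] = round(time.monotonic() - self._started, 3)
        return data


STATS = Stats()


class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = json.dumps(STATS.snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def serve(port=8080):
    """Serve the internal statistics on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatsHandler).serve_forever()
```

Application code would call `STATS.incr("requests_total")` as it handles work, and calling `serve()` would expose the counters at `/stats` for a collector to scrape. Unlike a Black Box probe, these numbers are only meaningful to someone who knows how the system works inside.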

Our SLIs will likely contain a mixture of both Black and White Box metrics. The next assignment for the council will be to discuss and record possible Black Box metrics. We established a vision and goal to keep in mind when suggesting metrics.

Things to keep in mind when selecting Black Box metrics:

Vision: Catch issues before customers reach out, and reduce the time to notify for customer-affecting issues.

Goal: Page the most applicable team, not necessarily the team whose product caught the issue.
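The goal above amounts to routing on ownership rather than on who detected the problem. A minimal sketch of that idea follows. The team names, component names, and the `team_to_page` helper are all hypothetical, and it assumes each alert already identifies the affected component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    check: str        # black-box check that failed
    detected_by: str  # team whose monitoring caught the issue
    component: str    # component the failure points to


# Hypothetical ownership map: component -> team best placed to fix it.
OWNERS = {
    "login": "identity-team",
    "notifications": "delivery-team",
}
FALLBACK_TEAM = "sre-on-call"  # assumed catch-all rotation


def team_to_page(alert, owners=OWNERS, fallback=FALLBACK_TEAM):
    """Pick the team to page from the affected component,
    not from the team that happened to detect the problem."""
    return owners.get(alert.component, fallback)


alert = Alert(check="synthetic-login", detected_by="web-team", component="login")
print(team_to_page(alert))
```

Here the web team’s synthetic check caught the failure, but the page goes to the team that owns login. Anything without a known owner falls back to a shared rotation.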
