Chapter 8 | Business Reliability Engineering

In engineering and ops, we can quickly forget about our secondary “customer” — our internal departments. At VictorOps, our sales, support, marketing, and success teams rely on access to our digital resources from our top-level domain. Outages to digital services often impact entire teams that rely on those services to do their jobs.

More and more, systems are being plugged together through cooperative APIs and automation. Marketing and Sales teams rely on the flow of accurate data from customers and prospects as their sensitive (and often difficult to obtain) information is moved between various customer relationship and business analysis systems.

Visitors to the site are encouraged to download digital resources to further educate themselves on the most advanced methods of oncall and incident management. As I’m sure you are perfectly aware, in exchange for these digital resources, potential customers (hopefully) will share with us their information so that we can keep them updated on what’s going on at VictorOps and the world of building scalable systems faster and safer. Love it or hate it, this is business and it’s something nearly all organizations rely on to operate. VictorOps is no different. We need to make sure that aspects of the website that are marketing and “top-of-funnel” focused are also behaving as the customer (i.e. Marketing) expects.

For example, when a visitor to the site downloads our Post-Incident Report Book, their information is stored in a number of systems, pushed all around through automation—automation that we don’t know much about. The process of obtaining this information safely and correctly relies on several steps and tools. Unless someone reaches out to us, how do we know if it’s NOT working?

Measuring For Normal in SRE

We now have a list of top concerns as well as a better idea of how much effort and reward is involved with prioritizing related work. However, along with getting better visibility around potential worrisome areas, we also need to establish what a “healthy” VictorOps system looks like.

There were several suggestions regarding the specific scenarios we should be watching for and measuring. For instance, when the connection between the web client and back-end system is experiencing trouble, a bright gold bar displays at the top of the screen so users are aware that something is wrong.

What we sometimes forget is that WE also know about it, but are we watching for it? Thankfully, the problem is rarely something on our end when the gold bar appears. There are many stops along the way in the complex system through which the alerts are delivered. From shoddy WiFi to a DNS problem on the customer’s end, numerous reasons exist as to why the web client and the back-end aren’t talking to each other.

A user on a 99.9% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability. With this in mind, rather than simply maximizing uptime, SRE seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.
Site Reliability Engineering: How Google Runs Production Systems

 

At least for the moment, we don’t care which specific disconnection scenarios exist or how we can protect ourselves from them. For now, we only need to determine what is acceptable. What is “normal”? We know disruptions happen. We can point fingers about who is to blame or what the cause is, but the point is to discover the expected behavior. This will be our baseline.

Obviously, we know when the gold bar is displayed. Multiple logs can tell us that part of the story. But how often is the gold bar displayed? When it is displayed, how long does the user see it on average? Are many users experiencing it simultaneously, or is it sporadic? What kind of information is hiding between the lines of the gold bar data that we have been (or could start) collecting?
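To make those questions concrete, here is a minimal sketch (in Python) of how gold bar display events might be aggregated. The event format, field names, and sample records are assumptions for illustration, not the actual client telemetry:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical client-side events: each record notes when the gold bar
# appeared and disappeared for a given user session.
events = [
    {"user": "u1", "shown": "2018-06-01T14:02:10", "hidden": "2018-06-01T14:02:25"},
    {"user": "u2", "shown": "2018-06-01T14:02:12", "hidden": "2018-06-01T14:03:40"},
]

def parse(ts):
    return datetime.fromisoformat(ts)

# How long was the bar visible each time it appeared?
durations = [(parse(e["hidden"]) - parse(e["shown"])).total_seconds() for e in events]

# Which distinct users saw the bar during the same minute?
by_minute = defaultdict(set)
for e in events:
    by_minute[parse(e["shown"]).strftime("%Y-%m-%d %H:%M")].add(e["user"])

print(f"gold bar shown {len(events)} times")
print(f"average visible duration: {sum(durations) / len(durations):.1f}s")
# If many distinct users see the bar in the same minute, the problem is more
# likely on our end than on a single customer's network.
widespread = {minute: users for minute, users in by_minute.items() if len(users) > 1}
print(f"minutes with multiple affected users: {list(widespread)}")
```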

One tip we learned from our friends at Netflix was that they watch a single metric very carefully: PPS (Plays Per Second). In other words, how many times per second do customers press or click the “Play” button on Netflix? This single metric acts as both an overall health check and a leading indicator of trouble in the system. Once a healthy “plays per second” baseline has been established, setting a reasonable threshold on either side of it means that teams can be alerted to possible trouble early on.
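As a rough illustration of how a single leading-indicator metric like PPS could be tracked against a baseline, here is a minimal in-process sketch. The baseline, tolerance, and rolling window values are assumptions, and a real system would push counters to a time-series store rather than keep them in memory:

```python
import time
from collections import deque

class RateMonitor:
    """Track a rolling events-per-second rate and flag deviations from a baseline.

    Illustrative sketch only; the baseline and tolerance are made-up values,
    not Netflix or VictorOps production numbers.
    """

    def __init__(self, baseline_per_sec, tolerance=0.5, window_secs=60):
        self.baseline = baseline_per_sec
        self.tolerance = tolerance      # alert if the rate drifts this far from baseline
        self.window = window_secs
        self.events = deque()

    def record(self, ts=None):
        """Call once per 'play' (or equivalent) event."""
        self.events.append(ts if ts is not None else time.time())

    def rate(self, now=None):
        now = now if now is not None else time.time()
        # Drop events that have fallen out of the rolling window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events) / self.window

    def check(self, now=None):
        r = self.rate(now)
        low = self.baseline * (1 - self.tolerance)
        high = self.baseline * (1 + self.tolerance)
        if not (low <= r <= high):
            return f"ALERT: rate {r:.2f}/s outside expected {low:.2f}-{high:.2f}/s"
        return None
```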

What might be our own version of the PPS metric? What about how long it takes for information to be displayed to the user? In what way could we more closely measure how long it takes for information to display in the incident timeline? Could this be our PPS, or is there something else that might be even better to look at?

Ideally, the population of data into the timeline is so fast that humans would never know or have reason to think, “This seems to be taking longer than I expected.” If it is slow, how slow exactly? And what threshold is deemed too slow? In order to really know how fast (or slow) it is, however, we would need instrumentation to measure it, a data store to collect it, a dashboard to visualize it in real-time, and, of course, alerts for when established thresholds are breached. Find something. Try it. Adjust.
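Here is one hypothetical shape that instrumentation could take: a timing wrapper around the timeline-population code path that emits a latency metric and flags breaches of a chosen threshold. The metric name, the stand-in metrics client, and the 2000ms threshold are all assumptions:

```python
import time
from contextlib import contextmanager

# Stand-in metrics client; represents whatever instrumentation library
# (statsd, a Prometheus client, etc.) is actually in use.
class MetricsClient:
    def timing(self, name, millis):
        print(f"metric {name}={millis:.1f}ms")

metrics = MetricsClient()

@contextmanager
def timed(metric_name, alert_threshold_ms=None):
    """Measure elapsed wall-clock time for a block and emit it as a metric."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.timing(metric_name, elapsed_ms)
        if alert_threshold_ms is not None and elapsed_ms > alert_threshold_ms:
            # In practice this would page through the alerting pipeline,
            # not print; the threshold is an assumed placeholder.
            print(f"ALERT: {metric_name} took {elapsed_ms:.0f}ms (> {alert_threshold_ms}ms)")

# Usage: wrap the code path that populates the incident timeline.
with timed("timeline.populate_ms", alert_threshold_ms=2000):
    time.sleep(0.05)  # placeholder for the real timeline-population work
```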

Response Metrics

Another suggestion was to closely watch the response time for searching for users within the system. One of the most powerful features during a firefight is the ability to quickly pull in additional responders and collaborate in real-time.

Getting everyone up to speed on what is known about the incident speeds up the recovery efforts. To do this, much like on social media, “mentioning” individuals and teams is achieved in the incident timeline by typing “@” in the chat field. This triggers a search module to display potential matches as the user begins typing. How fast (or slow) is this response time, and what value should trigger cause for alarm? Clearly, we were going to need to not only roll out new instrumentation, but also collect data over time, establish thresholds, and build actionable alerts that contextually inform first responders what is broken and where to start looking. Not only do we want better observability on these things, we also want to learn more about how new code needs to be instrumented, as well as what makes for an actionable and helpful alert.
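Below is a sketch of what an “actionable” alert for the @-mention search might carry, assuming a hypothetical search_fn backend, a placeholder 500ms threshold, and an example runbook URL. The point is the context attached to the alert, not the specific values:

```python
import time

def timed_user_search(query, search_fn, threshold_ms=500):
    """Run the @-mention typeahead search; return the results plus an
    actionable alert payload if the latency threshold is breached.

    search_fn and the 500ms threshold are placeholders; a real threshold
    would come from the baseline data collected over time.
    """
    start = time.monotonic()
    results = search_fn(query)
    elapsed_ms = (time.monotonic() - start) * 1000

    alert = None
    if elapsed_ms > threshold_ms:
        alert = {
            "summary": f"user search latency {elapsed_ms:.0f}ms exceeds {threshold_ms}ms",
            "component": "incident-timeline/user-search",
            # Context that makes the alert actionable for a first responder:
            "runbook": "https://example.internal/runbooks/user-search-latency",
            "suggested_first_steps": [
                "check search index health",
                "check recent deploys to the timeline service",
            ],
        }
    return results, alert

# Usage with a stand-in search backend:
results, alert = timed_user_search("jas", lambda q: ["jason.h", "jasmine.k"])
```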

The engineers who wrote the code are likely the ones responding to issues with it in our production environment. They should decide and establish the right types of metrics and alerts to provide during an incident related to their code and functionality. This way, engineers become much more familiar with the tooling earlier in the software development lifecycle.

Duplicating monitoring, alerting, and on-call rules in a development or staging environment means the engineers responding to the problem in production are already well-versed in the tools used during real-life response and remediation efforts. They are also likely the ones who established the monitoring thresholds, on-call policies, and (hopefully) actionable alerts.
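One way to picture that duplication is to define the rules once as data and evaluate the same definitions in every environment, so responders practice with identical tooling before an incident ever reaches customers. The rule names, metrics, and thresholds below are illustrative assumptions:

```python
# Shared alert-rule definitions, evaluated identically in staging and production.
ALERT_RULES = [
    {"name": "timeline-populate-slow", "metric": "timeline.populate_ms", "max": 2000},
    {"name": "user-search-slow", "metric": "search.user_lookup_ms", "max": 500},
]

def evaluate(environment, latest_metrics):
    """Return the alerts that fire for one environment's latest metric values."""
    fired = []
    for rule in ALERT_RULES:
        value = latest_metrics.get(rule["metric"])
        if value is not None and value > rule["max"]:
            fired.append(f"[{environment}] {rule['name']}: {value}ms > {rule['max']}ms")
    return fired

# The same rules run against staging first, then production.
print(evaluate("staging", {"timeline.populate_ms": 3100, "search.user_lookup_ms": 120}))
print(evaluate("production", {"timeline.populate_ms": 450, "search.user_lookup_ms": 95}))
```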

If the information provided regarding a problem isn’t helpful in reducing the time to detection or resolution, it’s easier and less expensive to spot that in pre-production environments. Should the problem rear its ugly head in production, the proper metrics are in place to spot it, the right teams or individuals are contacted, and they are immediately familiar with the current status through helpful context appended to an alert that has already been confirmed as “actionable”.

What about the round-trip interactions when a user triggers a new incident?

There are multiple things that take place during some of the most important pieces of functionality. For instance, when a user has been alerted to an incident, their first action is to acknowledge it. How long does it take for the user to press “ACK” on their mobile device, for VictorOps to receive and process this request, and for the user’s confirmation to be displayed as an action? While that time should be extremely small, how small exactly?
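Answering “how small exactly?” requires a timestamp at each hop of the acknowledgement. This sketch uses made-up timestamps and hop names to show the kind of per-hop breakdown that instrumentation would produce:

```python
from datetime import datetime

# Hypothetical timestamps captured at each hop of a single acknowledgement;
# in reality these would come from client and server instrumentation.
ack_events = {
    "mobile_ack_pressed":     "2018-06-01T14:05:02.120",
    "server_received":        "2018-06-01T14:05:02.310",
    "server_processed":       "2018-06-01T14:05:02.390",
    "confirmation_displayed": "2018-06-01T14:05:02.640",
}

def ms_between(a, b):
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() * 1000

# Print the latency of each leg of the round trip, then the total.
hops = list(ack_events)
for first, second in zip(hops, hops[1:]):
    print(f"{first} -> {second}: {ms_between(ack_events[first], ack_events[second]):.0f}ms")

total = ms_between(ack_events["mobile_ack_pressed"], ack_events["confirmation_displayed"])
print(f"total ACK round trip: {total:.0f}ms")
```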

Once our new instrumentation was capturing more granular data and we started averaging out various metrics of the system, we were ready to start setting targets to either maintain or achieve. It was time for us to establish Service Level Indicators, Service Level Objectives, and Service Level Agreements. These help us determine what “normal” is. The baseline then serves as our early indicator that something is no longer “normal”.
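For a concrete (if simplified) picture of how a baseline turns into an SLI and an SLO, here is a sketch that treats the fraction of fast requests as the indicator and compares it to an objective. The 250ms target and 99.5% objective are assumptions, not actual VictorOps commitments:

```python
def latency_sli(latencies_ms, target_ms=250):
    """SLI: fraction of requests that completed within the latency target."""
    good = sum(1 for latency in latencies_ms if latency <= target_ms)
    return good / len(latencies_ms)

def check_slo(sli, objective=0.995):
    """SLO: the level we commit to; falling below it burns error budget."""
    margin = sli - objective
    status = "OK" if margin >= 0 else "SLO BREACH"
    return f"{status}: SLI={sli:.4f}, objective={objective}, margin={margin:+.4f}"

# Example with a hypothetical window of ACK round-trip measurements (ms).
window = [120, 180, 90, 260, 150, 210, 300, 140, 110, 95]
print(check_slo(latency_sli(window)))
```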
