Chapter 1 | An Introduction to Site Reliability Engineering (SRE)

What is Site Reliability Engineering (SRE)?

For most of us, our day begins with a few routine tasks. We turn on the lights, brew a pot of coffee, and warm the shower—a common series of events for anyone’s morning.

As we move from task to task, there are expectations. We expect the lights to turn on when we flip a switch. We are confident turning the appropriate knob will cause warm water to come from the shower head. In fact, we expect the same experience each and every time. Consistently. Reliably.

Rarely do we stop to question or understand how these services are delivered to us on demand. Admittedly, there is very little reason to give much thought to the underlying complexity that delivers on-demand electricity, gas, water, or even the internet. These services “just work,” and we rely on them to accomplish goals in other aspects of our lives. Simply put, these services enable us to achieve expected outcomes; we come to expect and depend on them.



But it’s not just the electricity and hot water we have come to rely on. Availability and accessibility are of equal importance, playing a significant role in assessing reliability and quality of service. When we flip a switch, light doesn’t just emerge from nowhere. It comes from a provider and is delivered through a complex network of components, each with its own opportunity to fail. We rely on both the end-result service (i.e., electricity) and the way in which it is made available to us.


Modern Society Expects Reliable Technology

Digital services work in a similar way. Because the same provider builds and operates the application and infrastructure, functionality and reliability are part of the same value proposition. Energy companies can produce electricity, but if users can’t access the service (electricity or gas), the value of the energy and business is diminished.

In most parts of the world, an immense and growing spectrum of digital services and technology influences nearly every aspect of our lives. To some degree, each of us relies on digital resources provided by businesses, governments, individual contributors, and more, and that reliance is steadily increasing.

Let’s take travel as an example. For any given business trip, I personally rely on a variety of digital services. My airline alerts me to check in for my flight through its mobile app. I hail a ride-share to the airport using another mobile application on my phone. The driver takes me to the airport in the most efficient way thanks to GPS and live-traffic updates. I scan my boarding pass at the TSA podium using my watch. While I wait to board my flight, I listen to a podcast streaming from my tablet. Suddenly, just as we are about to push back from the gate, I remember my mortgage payment is due this week, so I execute the transaction from my banking app.

All of this is part of the “Digital Transformation” changing the way we interact with the world around us. End users have certain expectations regarding reliability, and organizations are evolving in order to hold up their end of the agreement. The transformation is changing the way we set our expectations around the quality and reliability of these digital services.


24-Hour Tech Availability

Access to this growing digital functionality and information is expected to be available and operating at all times. Much like the pipes that deliver water to our homes, the complex inner workings of delivering a service to the end user are critically important to the overall quality (read: reliability) of the service, yet they are intentionally abstracted away from the end user and, to some degree, from the organization providing the service. We encounter layers upon layers of abstractions meant to simplify everything, including software, keeping it healthy, and getting it to the person who needs it.



With the shift towards always-available digital services, the need to provide reliable and improved access to these services has increased. As a result, innovative engineering practices have emerged across nearly every industry to meet the modern world’s demand for access to digital services whenever, wherever, and however people want them.

In fact, this “reliability” expectation has given birth to services and tools, VictorOps included, while also supporting entirely new approaches to building, operating, and iterating on software and infrastructure. VictorOps, much like a utility company, provides services to enable its end users. Specifically, we empower the makers of the world to build resilient systems.

This is accomplished by designing, building, operating, and improving the VictorOps software as a service, including the underlying physical and virtual infrastructure. Our constant pursuit is to explore new methods for both delivering quality software sooner and receiving faster feedback from real usage. Continuing to develop better methods of delivering software as a service to meet the changing needs of our user base is at the heart of our own journey into site reliability engineering, or “SRE.”

