Chapter 3 | What is SRE?


In many organizations, Site Reliability Engineering (SRE) is the responsibility of very specific teams or individuals, typically those familiar with operations-like engineering efforts. They keep the critical infrastructure and applications up and running. Think of them as the keepers of “Production”. System Administrators, IT Operations, an SRE team, or individual engineers (i.e., SREs) typically own this responsibility.

In some cases, individual reliability engineers are embedded with development teams, while in other cases, there’s a central SRE team. However, an increasingly common approach to engineering, in which roles such as development, operations, quality, security, and others are combined into small, loosely coupled, yet highly collaborative teams, has empowered organizations to respond to problems much faster when they inevitably arise. Perhaps more importantly, these collaborative teams are able to deliver value (in the form of digital services) to the end user much more quickly.

Terms such as DevOps have emerged to give a name to organizational efforts to bring previously disparate conversations about building, deploying, and operating applications and infrastructure into the same group. Siloed conversations about responsibilities slowed the process of delivering value because teams were essentially incentivized against each other. Developers were encouraged to pump out new functionality while operations teams were incentivized to maximize the availability of resources (i.e., uptime). Without realizing it, competing efforts were at work to both introduce and limit the one common cause of IT failure: change. Conflicting incentive structures are a classic flaw in the makeup of many IT organizations.

As a company, VictorOps has an inherent passion for reliability. It was founded by software builders and systems architects who deeply relate to those tasked with the pressure of maintaining system uptime, and a culture of high availability has always been strong within the organization. It’s ingrained in the majority of our work and what we think about each day.

Engineering teams and IT professionals around the world rely on us to alert and assist in the mitigation of disruptions to services critical to the business. As our CTO puts it:

“Reliability is our most important feature.”
Dan Jones, CTO, VictorOps

 

If we experience a problem impacting our service, the issue creates a ripple effect that reaches our customers, our customers’ customers, and so on.

The value we, as a business, deliver lies not only in the rapidly improving service itself (on-call and incident management) but also in our customers’ ability to rely on that service to work as expected when they need it most: during their own high-stress service disruptions.

 

What is Site Reliability?

Protecting the VictorOps customer experience AND increasing our ability to deliver value more quickly is ultimately what we are attempting to tackle as a company-wide SRE effort. Still, the responsibilities and expectations associated with our SRE efforts need to be specific about which problems we are trying to own and solve.

First, we began our efforts by defining and focusing on two primary areas tied to the customer-experience aspect of reliability: correctness and availability.

Correctness:

    • Functions as expected

    • Data is consistent

    • Consistent, predictable performance

    • Consistent innovation

Availability:

    • Always on (24/7/365)

    • Minimal downtime (planned or unplanned)

    • Resilient to failure / fails gracefully

    • Global accessibility

The relationship between correctness and availability demands a balanced approach. Like efficiency and thoroughness, each carries its own incentive structure, and those incentives are often at odds with each other.
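As a loose, hypothetical sketch (not a description of VictorOps tooling), one way to make that balance visible is to express both areas as measurable service level indicators. The names and numbers below are purely illustrative:

```python
# Hypothetical sketch: turning "correctness" and "availability" into measurable
# service level indicators (SLIs). All names and numbers are illustrative.
from dataclasses import dataclass


@dataclass
class SLI:
    name: str
    good_events: int    # interactions that met expectations (correct and on time)
    total_events: int   # all interactions observed in the measurement window

    @property
    def ratio(self) -> float:
        # Share of interactions that were "good"; 1.0 if nothing was observed.
        return self.good_events / self.total_events if self.total_events else 1.0


# Correctness: did the service do what the customer expected (right data, on time)?
correctness = SLI("notifications_delivered_correctly", good_events=99_812, total_events=99_930)

# Availability: was the service reachable and responsive when customers needed it?
availability = SLI("successful_requests", good_events=998_720, total_events=999_113)

for sli in (correctness, availability):
    print(f"{sli.name}: {sli.ratio:.4%}")
```

Framing both areas as ratios of good events to total events keeps the trade-off between them visible in the same terms, which helps when negotiating the balance described above.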

For most modern organizations, velocity is more than a “nice to have.” Halting feature work in order to focus engineering resources on improving only the reliability of a service doesn’t usually sit well with product owners and management. We need to achieve a balance between reliability and deployment speed.

 

Reliability from a Customer Perspective

VictorOps customers depend on us when there is an active problem within their own system. Their experience with the VictorOps service as they acknowledge, triage, collaborate, and resolve issues is far more important than whether or not the VictorOps core servers are experiencing high levels of CPU usage. Is VictorOps empowering them to do their best work?

Metrics such as CPU and memory usage are important to have observability around, but they do little to communicate the experience from the customer’s point of view. Users don’t give a damn if we have our own datacenter, a multi-cloud architecture, or a couple of hamsters on a wheel plugged into a Raspberry Pi. They do give a damn about fixing their own broken application or service. VictorOps enables them to resolve service disruptions as well as retrospectively analyze incident response efforts for deeper learning. They rely on us to enable them to solve their own problems. Plain and simple.

Here’s a real question…

What is the user experience while interacting with VictorOps during an active incident?

 

This is an observability question, and it is where we need the highest-fidelity data if we want to answer it accurately.

More specifically, what happens (exactly) when:

• Someone interacts with the software we’ve built,

• running on the infrastructure we’ve architected,

• delivered through the pipelines we currently have in place,

• using processes and tooling that have been established over the life of the service…

• during an active incident?

Do we know? Is it possible to find out? Is it “knowable”? Some engineers have intimate knowledge of parts of the system. Others haven’t been with the company long enough to share the same mental representation of how the system actually works. What data needs to be collected in order to begin answering the questions above?
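One way to start making that question answerable (a minimal sketch under assumed field names, not a description of VictorOps internals) is to emit a wide, structured event for every user interaction, carrying enough context to reconstruct the customer experience after the fact:

```python
# Minimal sketch: one structured event per user interaction, rich enough in
# context that "what happened during the incident?" can be asked after the fact.
# The field names and the emit() destination are hypothetical.
import json
import time
import uuid
from typing import Optional


def emit(event: dict) -> None:
    # Stand-in for shipping the event to a log pipeline or event store.
    print(json.dumps(event))


def record_interaction(user_id: str, action: str, incident_id: str,
                       duration_ms: float, ok: bool,
                       error: Optional[str] = None) -> None:
    emit({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "action": action,              # e.g. "acknowledge_alert", "reroute_page"
        "incident_id": incident_id,    # ties the interaction to an active incident
        "duration_ms": duration_ms,    # how long the interaction took end to end
        "ok": ok,
        "error": error,
        "service_version": "1.2.3",    # hypothetical build identifier
    })


# Example: a customer acknowledged an alert during an active incident.
record_interaction("user-42", "acknowledge_alert", "inc-7781",
                   duration_ms=182.5, ok=True)
```

The point is less about the specific fields and more about capturing each interaction with enough context that new questions can be asked later without new instrumentation.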

 

Scalability from a Customer Perspective

Consistent operability isn’t quite enough to satisfy today’s end users. The tech world moves fast. When was the last time you installed software from a disc and could operate it without a connection to the internet? That’s rarely how things work today. We access services from our phones or devices whenever and wherever it’s convenient for us. And because access to the best and most innovative software is only a “Sign up Now” button away, vendor lock-in isn’t quite as prominent as it once was. That’s great news for consumers and end users. It’s a bit more worrisome for companies realizing that functionality and differentiating features quickly become commodities, and that the only real chance at differentiating yourself in the market is to outpace the competition on feature releases and demonstrate dominance in reliable infrastructure.

Customers will educate themselves and choose the service that is, of course, reliable, but they will also pay close attention to the manner in which the service itself adapts to their own changing needs.

How innovative is the service? The answer reflects how empathetic the vendor is to the ever-changing landscape of IT.

Scalability is of great concern to end users, whether they explicitly make the claim or not. It is directly related to the overall reliability of a service. You must demonstrate the ability to keep up with and support your customers. If you can’t enable them to succeed as things become more complex and mission-critical, the end user will begin searching for a more suitable partner with whom to explore the future of software.

We need to optimize for delivering improvements to our service more safely and more quickly.

Our users expect that the tools they leverage today will grow with them into the foreseeable future. They expect to influence and shape the roadmap of the service by providing feedback to welcoming and eager product teams. We must be able to introduce changes to our systems based on feedback from the customer’s experience.

Finding ways to improve our ability to scale was important enough for us to call it out as part of the problem we own and solve.

 

Our journey towards curating a specific culture of reliability is an ongoing one. But what we’ve learned and where we are headed all started with asking questions. Throughout this text, I’ll share with you what those questions were, what kinds of conversations they generated, and what new questions and discussions they led to. The final sections of this text will conclude with the very first VictorOps Chaos Day orchestrated under SRE. We will use chaos engineering to learn how our system handles failure, then incorporate that information into future development.

Embracing risk is a big part of the cultural change we are trying to bring about, not only in ourselves but in the rest of the tech community. It’s one thing to say we embrace risk; we need to mean it, and we need to demonstrate the critical relationship between this new embrace of risk and its positive impact on the reliability and scalability of the VictorOps service.

There are no clear “best practices” for SRE. There is no official playbook. Like DevOps, there is no one-size-fits-all approach to Site Reliability Engineering. What works for a company like Google or Facebook doesn’t make sense for us. What works for VictorOps likely won’t plug and play into your organization without some adjustments.

 

Embedded vs. Dedicated SRE

Very early, we evaluated two of the more popular approaches to SRE: embedded and dedicated. After many conversations internally, as well as interviews with reliability engineers from Twitter, Netflix, GitHub, and others, we made the decision to resist the tendency to hire directly into the role of SRE. Likewise, we wanted to avoid unintentionally creating a new silo by forming an “SRE team.”

Worried that a specific team might create assumptions about who owned our availability, we concluded that our approach to SRE should not be limited to a distinct team. From our perspective, the dedicated model (i.e., a distinct site reliability engineer or team) takes the responsibility of building reliable systems away from the majority of the engineering team almost entirely. We also weren’t in love with the embedded model, as it carried the same problem. It might mean a larger team with more context, but we knew we wanted reliability and scalability to fall on the shoulders of everyone.

We are building a culture of reliability

Much of what we wanted to accomplish was going to require a shift in mindset: a shift in what we care about and how we accomplish the goals associated with that care. We wanted to communicate explicitly that SRE was not a project. It’s not an initiative we take on for a few months until we achieve some empirically measured goal, such as 99.99% availability. This initiative ought to align with a cultural change in not only our engineering team but also the entire company—a change to align the company with the objectives of the business and the needs of the customer.

A growth mindset with a hunger for continuous improvement is part of the company culture that is often hard to build and sustain. Something like this doesn’t just emerge out of nowhere. It requires a change agent: a champion to challenge the status quo (i.e., how we do things around here).

 

Getting Buy-in for SRE Change

We needed buy-in from management, from the Product team, as well as from all corners of the engineering team. We needed everyone to have a clear sense of responsibility and control over their role in our SRE efforts. We also knew that someone needed to champion this effort.

Without a champion, it would be too easy for our SRE aspirations to get lost in the day-to-day business.

We chose to look internally for an individual to lead our efforts and create a company-wide focus. Someone who would serve as a coach to our entire engineering team, supporting and enabling them to embrace and own reliability in each of their own domains.

One platform engineer stepped forward and offered to assume this role. Much of their work on the “Platfrastructure” team (Platform & Infrastructure) was already tied to these concerns. Likewise, they were becoming increasingly curious about the principles of DevOps and our ability to get new functionality to users while also maintaining a hardened system. It was a natural fit.

For SRE to succeed, our engineers needed to see and feel that their engagement was valuable. We valued transparency and feedback in pursuit of genuine inquiry and continued learning; we saw (and see) this as a hunger to expose more and develop a greater sense of the system, including the people.

Above all, we wanted to know the truth about our systems, including the human components.

 

This new hunger led teams across the entire organization to begin talking about a common challenge: increasing velocity while maximizing uptime. Reframed from reactive to proactive, that challenge now seems a whole lot more interesting. It looks a whole lot more like an engineering problem: with support from the rest of the company, we can prepare for trouble (i.e., unplanned work) by engineering ways to shorten feedback loops and expedite the remediation of service disruptions. That’s something everyone from upper management to technical support can get behind. We’ll all play a role in solving it.

Start With Questions

“Monitoring tells you whether a system is working; observability lets you ask why it isn’t working.”
Baron Schwartz, CEO, VividCortex

 

Asking questions was the most important step early on for us, and in a really generic sense, observability is just that—asking and answering questions, any question. It is about filling in the blanks on what is known, or even knowable, about our systems. If someone has a question about any aspect of our system, we want to be able to get an answer we feel confident about, because we are going to make some really important decisions based on those understandings of reality.

In order to reliably answer questions, you need access to information. Not only that, but you have to be able to make sense of it. No matter what question you have about your system, you should be able to answer it. It’s about moving closer and closer to a clearer understanding of the reality of our systems. To us, this is what observability means. We can’t improve what we can’t measure. We can’t measure what we don’t see. And we’ll never even know what to look for if we don’t know what is important. What’s important to VictorOps can help us shape what SRE is to VictorOps.
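Continuing the hypothetical sketch from earlier, once interactions are captured as structured events, answering a question becomes a matter of filtering and aggregating them. The function below is an illustrative assumption, not a real VictorOps query:

```python
# Illustrative sketch: answering "how long did it take users to acknowledge
# alerts during incident X?" from a collection of structured events (dicts).
from statistics import median
from typing import Optional


def ack_latency_ms(events: list, incident_id: str) -> Optional[float]:
    """Median duration (ms) of successful alert acknowledgements in one incident."""
    durations = [
        e["duration_ms"]
        for e in events
        if e.get("incident_id") == incident_id
        and e.get("action") == "acknowledge_alert"
        and e.get("ok")
    ]
    return median(durations) if durations else None


# Usage (assuming events have been collected somewhere):
# print(ack_latency_ms(collected_events, "inc-7781"))
```

The specific question matters less than the ability to ask it at all; the same event data should support the next question we haven’t thought of yet.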

There are many great explanations of what observability is and is not. I suggest reading anything from Charity Majors, Baron Schwartz, Cindy Sridharan, or Jonathan Schwietert on the subject. Each has a depth of understanding that goes beyond the scope of this book but is well worth your time. I definitely recommend giving their work a read.

 
