Whether you’ve only heard of DevOps and SRE or have fully jumped on one bandwagon or the other, you may have wondered how the two relate. What’s the difference? Are they really just different ways of looking at the same problem?
The term DevOps hit the market first, but SRE wasn’t too far behind. And though they have different origin stories, they both focus on autonomy, automation, and iteration.
So why do these paradigms exist? And why do we need both? Let’s look at this further.
Let’s begin with a base definition of each.
DevOps encompasses the practices, principles, and culture that fuse development and operations. It’s not a specific tool or technology, but the idea that teams should own their production support.
In the past, and still in some organizations, software operations tasks around support, on-call, and maintenance have been the responsibility of operations teams or application maintenance teams.
That allowed the application developers to focus solely on shipping features. However, it also created a divide between the practices that helped operate and maintain code in production and the practices that delivered new features. As a result, developers had little incentive to make the application resilient, observable, and maintainable. If you don’t see the problem, it doesn’t exist.
DevOps bridges this gap between development and operations, breaking down the silos between the two sets of work and the people doing it.
Site Reliability Engineering
SRE, coined by Google, applies engineering practices to infrastructure and operations. As with DevOps, SRE isn’t just about tools but also culture and principles.
Part of SRE involves embracing risk. Instead of trying to minimize risk by delaying deployments, which tends to have the opposite effect because changes pile up into larger and riskier releases, we assess risk and make sure our reliability matches what our customers need.
To attain the proper levels of reliability, SRE drives the use of metrics like Service Level Indicators and Service Level Objectives. We’ll go into those later.
Additionally, as we’re applying engineering practices to infrastructure and operations, there’s a heavy emphasis on automation and eliminating toil. Here, the goal is to reduce repetitive and uninteresting tasks so that engineering efforts can be applied to more difficult problems.
That automation is important both in the release process and when operating in production.
The North Star
As the descriptions above show, both DevOps and SRE involve automation, operations, and more that we haven’t covered yet.
However, there’s one north star goal that both share: continuous improvement. Whether it’s improving availability, deployment frequency, or automation, every aspect involves looking at what’s currently happening, assessing opportunities, and improving the situation.
Now let’s talk about culture for a bit.
Both SRE and DevOps drive autonomy, increase transparency, and leverage automation. It’s about engineering.
Additionally, SRE principles drive a blameless culture. Mistakes happen, but they happen because the system or processes are not set up in a way that ensures success. That’s why practices like postmortems, which analyze outages or failures, focus on the facts and the system rather than assigning blame to individuals.
DevOps focuses on creating shared responsibility between development and operations. This may mean that dev folks and ops folks work very closely together. Or it may mean that devs take on ops responsibilities. It depends on what works for the organization that you’re in.
DevOps & SRE Tools
Next, how do the tools that DevOps and SRE folks use differ? And what types of problems do they solve?
First, let’s remember that DevOps and SRE aren’t role definitions or team definitions. They’re principles, practices, and culture. So a day in the life of an application engineer, DevOps engineer, and site reliability engineer can vary from company to company. In smaller teams or organizations, many engineers follow practices from all three disciplines to improve operations for their product.
Though the disciplines vary greatly, let’s look at an example we’ve seen.
In a day in the life of a DevOps engineer, you may be working on product features. Or you may be improving the CI/CD pipeline. Perhaps there’s a better way to configure rollbacks or deployment operations that will reduce the amount of time a bad deployment affects an application. Furthermore, you could be improving production monitoring or working to fix bugs.
As an SRE, you may also be working on bug fixes to improve reliability or improving production monitoring and resiliency. You may be helping a team analyze their recent outages or availability concerns through postmortems, analysis, or pairing with the team.
For both, the tools are similar. You’ll be using logging, metrics, and tracing tools like Splunk. To improve, you need to know where your metrics stand.
DevOps & SRE Metrics
In this section, we’ll cover SRE metrics and DevOps metrics. Both SRE and DevOps track metrics that indicate how the application or system functions. Some of these metrics bring value to both SRE and DevOps, while others fall to one side or another a bit more frequently.
Let’s start with metrics around how we write code and build applications, which capture how well we shift left with testing and security and how quickly we can ship. Then we’ll move on to operational metrics.
DORA Metrics (Deployment Frequency, Lead Time for Changes, Mean Time to Recovery, Change Failure Rate)
DORA metrics measure how well a team works to prioritize and release work in a DevOps model. These metrics include:
Deployment Frequency —How often code deploys to production. To improve deployment frequency, we need a few key practices from DevOps. First, we automate deployment pipelines with little to no manual approvals. For that we need automated tests and code scans.
Lead Time for Changes —How long it takes for a committed change or feature to deploy to production. For lead time, a smooth deployment pipeline with automated checks and fast tests provides the tools.
Mean Time to Recovery (MTTR) —How long it takes for the system to recover during an outage. For MTTR, we need a robust and automated rollback ability from DevOps. We also need good metrics and monitoring of our production systems from SRE.
Change Failure Rate —How many deployed changes result in failure. To improve the change failure rate, we need to shift left with automated tests, security scans, and automated configuration.
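To make the four definitions above concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record, its field names, and the sample data are hypothetical and exist only for illustration. The sketch also assumes that recovery time is measured from the moment the failing change reached production until service was restored. Your own tooling may define these boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did the change cause a failure?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def hours(delta: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list, period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    # Assumption: recovery time runs from deployment to restoration.
    recoveries = [hours(d.restored_at - d.deployed_at)
                  for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(hours(d.deployed_at - d.committed_at)
                                for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": mean(recoveries) if recoveries else None,
    }


# Made-up sample data: four deployments in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4),
               failed=True, restored_at=t0 + timedelta(days=1, hours=5)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=4)),
]

print(dora_metrics(sample, period_days=7))
```

With this sample, the average lead time is four hours, one in four changes failed, and the single failure took one hour to recover from.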
Four Golden Signals
Now that we’ve explored the DORA metrics, which touched lightly on monitoring, what has SRE brought to the table? Most famously, SRE pushed the idea of signals and metrics that provide a quick view of how your system runs.
Though the four golden signals don’t cover all monitoring needs, they provide the base on which thorough and valuable monitoring can be built.
Latency —How long it takes to serve a request.
Traffic —How much demand is on the server. For example, how many transactions or requests per second are handled.
Errors —How many requests fail.
Saturation —How much of your system resources are used.
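As a rough illustration of the four signals, the sketch below derives them from a handful of request records and a CPU reading. Everything here is an assumption made for the example: the `Request` record, treating HTTP 5xx responses as errors, using the median as the latency summary, and using CPU cores as the saturated resource. Real monitoring tools typically report percentiles and track several resources.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_cores_used, cpu_cores_total):
    """Summarize the four golden signals for one observation window."""
    if not requests:
        raise ValueError("need at least one request")
    # Assumption: a 5xx status code counts as a failed request.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_median_ms": median(r.duration_ms for r in requests),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": cpu_cores_used / cpu_cores_total,
    }


# Made-up sample: four requests observed over two seconds on a 4-core host.
sample_requests = [
    Request(120, 200),
    Request(80, 200),
    Request(300, 500),
    Request(100, 200),
]

print(golden_signals(sample_requests, window_seconds=2,
                     cpu_cores_used=3, cpu_cores_total=4))
```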
Though these metrics came from SRE roots, they add value to DevOps as well, since they provide insights as to how our system runs in production.
Service Level Indicators, Objectives, and Agreements
Next, as we progress further into operations, let’s talk about SLIs, SLOs, and SLAs. These metrics provide information on how our system behaves over time.
Service Level Indicators —These metrics show real-time performance of your app. For example, you may have an indicator that measures uptime of your application. It shows, for example, that your app is available 99% of the time over the course of one month.
Service Level Objectives —Objectives take it up one level and indicate to the team what values they should strive for. Looking at the previous example, perhaps the actual SLO of availability is 99.9% over the course of one month.
Service Level Agreement —Finally, SLAs indicate any contractual obligations based on available SLOs. Typically SLAs aren’t as strict as SLOs, as there are financial implications to meeting or not meeting the stated SLA.
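The relationship between the three can be sketched in a few lines of Python. This example reuses the availability figures from above (a measured 99% against a 99.9% objective). The 99.5% SLA, the function names, and the downtime values are hypothetical and chosen only to show an SLA that is looser than the SLO.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window

SLO_TARGET = 0.999  # internal objective, taken from the example above
SLA_TARGET = 0.995  # hypothetical contractual commitment, looser than the SLO


def availability_sli(downtime_minutes, window_minutes=MINUTES_PER_30_DAYS):
    """SLI: the fraction of the window in which the app was available."""
    return (window_minutes - downtime_minutes) / window_minutes


def meets_target(sli, target):
    """True when the measured indicator is at or above the target."""
    return sli >= target


# 432 minutes of downtime is 1% of the window, so the SLI is 99%.
sli = availability_sli(downtime_minutes=432)
print(f"SLI: {sli:.3%}")
print("meets SLO:", meets_target(sli, SLO_TARGET))
print("meets SLA:", meets_target(sli, SLA_TARGET))
```

With 100 minutes of downtime instead, the same functions would report a missed SLO but a kept SLA, which is the cushion the looser agreement is meant to provide.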
Now we’ll introduce metrics that measure our operational reliability and incidents.
Mean Time to Resolve/Repair (MTTR) —What is the mean time to resolve an incident in production?
Mean Time to Detect (MTTD) —What is the mean time to detect a failure or outage?
Mean Time Between Failures (MTBF) —What is the mean time between failures or outages?
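Here is one way these three averages might be computed from incident timestamps. The `Incident` record and the sample data are hypothetical. Definitions also vary between teams, so this sketch assumes that time to resolve is measured from the start of the failure (not from detection) and that time between failures is the healthy gap from one incident’s resolution to the next incident’s start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one production incident."""
    started_at: datetime   # when the failure began
    detected_at: datetime  # when monitoring or a person noticed it
    resolved_at: datetime  # when service was restored


def average(deltas):
    """Mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents):
    """Return (MTTD, MTTR, MTBF) as timedeltas; MTBF is None for one incident."""
    if not incidents:
        raise ValueError("need at least one incident")
    ordered = sorted(incidents, key=lambda i: i.started_at)
    mttd = average([i.detected_at - i.started_at for i in ordered])
    # Assumption: resolution time is measured from the start of the failure.
    mttr = average([i.resolved_at - i.started_at for i in ordered])
    # Assumption: time between failures is resolution-to-next-start.
    gaps = [nxt.started_at - prev.resolved_at
            for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = average(gaps) if gaps else None
    return mttd, mttr, mtbf


# Made-up sample: three incidents.
base = datetime(2024, 1, 1)
sample_incidents = [
    Incident(base,
             base + timedelta(minutes=10),
             base + timedelta(hours=1)),
    Incident(base + timedelta(hours=241),
             base + timedelta(hours=241, minutes=20),
             base + timedelta(hours=243)),
    Incident(base + timedelta(hours=363),
             base + timedelta(hours=363, minutes=30),
             base + timedelta(hours=366)),
]

mttd, mttr, mtbf = incident_metrics(sample_incidents)
print("MTTD:", mttd)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```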
That’s a sample of metrics that may provide value in your DevOps and SRE journey. And most of these metrics can apply to both DevOps and SRE.
There are also subsets of these that apply to one more than the other. For example, error budgets come from SRE and calculate how much availability you can lose while still meeting your SLOs and SLAs. Knowing how much budget remains tells us when we can deploy faster and take some risks.
Also, you may have heard of the Accelerate metrics, which come from DevOps and focus on lead time, change failure rate, deployment frequency, time to restore, and availability.
Depending on your particular pain points, you will want to focus on the metrics that truly measure what matters. And that may change over time.
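The arithmetic behind an error budget is simple enough to sketch. Assuming an availability SLO of 99.9% over a 30-day window, the budget is the remaining 0.1% of that window, or roughly 43.2 minutes of allowed downtime. The function names and the 30 minutes of downtime used below are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo, window_minutes=MINUTES_PER_30_DAYS):
    """Total downtime the SLO allows within the window."""
    return (1 - slo) * window_minutes


def remaining_budget_minutes(slo, downtime_minutes,
                             window_minutes=MINUTES_PER_30_DAYS):
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, window_minutes) - downtime_minutes


budget = error_budget_minutes(0.999)
remaining = remaining_budget_minutes(0.999, downtime_minutes=30)

print(f"budget: {budget:.1f} minutes, remaining: {remaining:.1f} minutes")
if remaining > 0:
    print("Budget remains: there is room to ship and take some risk.")
else:
    print("Budget exhausted: prioritize reliability work.")
```

A positive remainder is the signal that a team can keep shipping. A zero or negative remainder suggests slowing down and spending effort on reliability until the window rolls over.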
Why You Need Both
As hinted at the end of the metrics section, there isn’t one solution that works for everything. Whether it’s the culture, the metrics, or the principles, you will want to pull from what your organization needs most.
Ultimately, SRE and DevOps work together and overlap in many ways. And looking at our development and operations from both viewpoints can uncover new ways to provide automation, visibility, and the opportunity to improve what will make the biggest impact.
This posting is my own and does not necessarily represent Splunk's position, strategies, or opinion.