DevOps gained popularity in order to combat siloed workflows, decreased collaboration and a lack of visibility across the software development lifecycle. While establishing a culture of DevOps has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone specifically dedicated to developing systems that increase site reliability and performance.
That’s where a site reliability engineer (SRE) comes into the picture.
Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.
So, in this article, let’s…
- Review SRE concepts
- Break common SRE misconceptions
- Define the basic roles and responsibilities of a site reliability engineer
- Look at salaries
- Show how SRE can drastically improve the resilience of your people, processes and technology
What does an SRE do?
Site reliability engineering was originally developed by Google. In the words of Ben Treynor, SRE is “what happens when you ask a software engineer to design an operations function.”
In a traditional setup of siloed IT operations and software development teams, developers would throw their code over to IT professionals. Then, IT would be in charge of deployment, maintenance and any on-call responsibilities associated with the system in production. Luckily, DevOps came along and forced developers to share accountability for systems in production, own their code and take on-call responsibilities.
DevOps pushed shared responsibility for the reliability of your applications and infrastructure. And, while this is a great first step forward, it doesn’t proactively help teams add resilience to their system. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace.
Site reliability engineering is a way to bridge the gap between developers and IT operations, even in a DevOps culture. It isn’t SRE versus DevOps; it’s SRE with DevOps. In some ways, SRE resembles a more proactive form of quality assurance (QA). Site reliability engineers are dedicated full-time to improving the reliability of systems in production, both through the software they build and through hands-on work such as:
- Fixing issues
- Responding to incidents
- Usually taking on-call responsibilities
Aside from its growing role today, SRE’s biggest claim to fame might be the four golden signals of monitoring:
- Latency
- Traffic
- Errors
- Saturation
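To make the golden signals concrete, here is a minimal Python sketch that derives all four (latency, traffic, errors and saturation) from one window of request records. The `Request` record, the `golden_signals` function, the choice of 95th-percentile latency and the treatment of 5xx statuses as errors are all assumptions for illustration, not something prescribed by any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize one monitoring window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = [r.latency_ms for r in requests]
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(latencies, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": used_capacity / total_capacity,
    }


window = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
signals = golden_signals(window, window_seconds=2, used_capacity=30, total_capacity=40)
```

In practice these numbers would come from a metrics system rather than an in-memory list, but the four quantities being tracked are the same.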
Day in the life of SREs: myth breaking
So, before we move into the formal responsibilities and job descriptions, let’s bust some common SRE myths. We actually asked a couple of SREs what the day-to-day job looks like.
Vivek Basavegowda Ramu works as an International Performance Testing Expert with UnitedHealth Group and Optum Technology in the U.S. Vivek says that SRE falls into the same organizational area as non-functional testing, that is, performance testing and engineering. His overall charge, in simple terms, is to ensure application performance is optimal and reliable, so that there are fewer issues in production and so that business and revenue streams are not impacted.
According to him, people often mistake site reliability engineering for an additional layer in the hierarchy focused solely on monitoring and application/environment uptime.
“In reality, the [SRE] role demands developing and [maintaining] the system and its services, automating the deployment process, ensuring system scaling, investigating and resolving [outages], identifying and implementing preventive measures proactively, collaborating with key stakeholders.”
Percy Grunwald of Hosting Data pushes back on the same misconception. Grunwald says:
“A significant portion of my time is spent on proactive measures such as capacity planning, performance tuning and implementing infrastructure as code. Additionally, SREs work closely with development teams to ensure that new features and services are designed and deployed in a way that meets the reliability and performance goals of the organization.”
So, while many folks think that SREs only put out fires (that is, “fix problems”), their remit is really holistic: optimizing the system for performance. Let’s see how they do that.
Common SRE roles and responsibilities
Implementing an SRE team will greatly benefit both IT operations and software development teams. Not only can SRE drive deeper reliability to systems in production but it will likely help IT, support and development teams spend less time working on support escalations — giving them focused time to build new features and services.
So, let’s look at common site reliability engineering roles and responsibilities you can expect to see.
(Compare this with common DevOps roles & responsibilities.)
Building software to help DevOps, ITOps & support teams
SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.
Fixing support escalation issues
Similar to the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But, as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production – leading to fewer support escalations.
Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.
Optimizing on-call rotations & processes
More often than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role will have a lot of say in how the team can improve system reliability through the optimization of on-call processes.
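One small piece of on-call process that lends itself to automation is the rotation itself. The sketch below is only an illustration, assuming week-long shifts handed out round-robin; the `build_rotation` function and the engineer names are hypothetical, and real schedules usually also handle time zones, overrides and secondary responders.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one primary on-call engineer per week, round-robin."""
    if not engineers:
        raise ValueError("need at least one engineer")
    schedule = []
    for week, engineer in enumerate(islice(cycle(engineers), weeks)):
        shift_start = start + timedelta(weeks=week)
        shift_end = shift_start + timedelta(days=6)  # inclusive last day
        schedule.append((shift_start, shift_end, engineer))
    return schedule


# Hypothetical team; the start date is a Monday.
rotation = build_rotation(["ana", "ben", "chen"], date(2023, 2, 6), weeks=4)
for shift_start, shift_end, engineer in rotation:
    print(f"{shift_start} to {shift_end}: {engineer}")
```

Generating the schedule from code rather than a spreadsheet makes it easy to review changes and to check that the load is spread evenly.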
SRE teams will help add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.
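As a rough sketch of what adding context to alerts can look like, the Python snippet below attaches a runbook and an owning team to a raw alert before it reaches the on-call responder. The alert fields, the `RUNBOOKS` and `OWNERS` lookup tables, the service name and the fallback values are all hypothetical; a real setup would pull this information from a service catalog or the alerting tool's own configuration.

```python
# Hypothetical lookup tables; in practice these might live in a service catalog.
RUNBOOKS = {"checkout-api": "runbooks/checkout-api.md"}
OWNERS = {"checkout-api": "payments-oncall"}

DEFAULT_RUNBOOK = "runbooks/general-triage.md"
DEFAULT_OWNER = "sre-oncall"


def enrich_alert(alert):
    """Return a copy of the alert with a runbook and owner attached."""
    service = alert["service"]
    enriched = dict(alert)  # leave the original alert untouched
    enriched["runbook"] = RUNBOOKS.get(service, DEFAULT_RUNBOOK)
    enriched["owner"] = OWNERS.get(service, DEFAULT_OWNER)
    return enriched


raw_alert = {"service": "checkout-api", "summary": "High error rate"}
print(enrich_alert(raw_alert))
```

Falling back to a general triage runbook and a default owner means that an alert from an unlisted service still reaches someone with instructions to start from.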
Documenting “tribal” knowledge
SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.
Conducting post-incident reviews
Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings and taking action on their learnings.
Then, site reliability engineers are often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.
Salaries for SREs
OK, so we can confidently say that site reliability engineers carry plenty of responsibility, and that their talent and skills are necessary for preventing total digital chaos for any business. Another way to put it: SREs can make a great income. As with any salary discussion, your experience, location and organization make the biggest impact on how much you can earn.
Still, let’s look at some averages. As of February 2023, Glassdoor reports that the average annual pay is $104,459 for SREs working in the U.S., but additional take-home money like year-end bonuses actually puts that number closer to $128,000 a year. Plus, that average varies widely: more experienced SREs could earn closer to $165,000 or even more.
- ZipRecruiter reports a U.S. national average of $130,238 per year.
- One outlier puts the median at $236,000, including other compensation. At the highest end, Gremlin has seen up to $450,000 per year.
Where does SRE fit on your team?
Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes and technology within any organization. Whether your team has already taken on a full-blown DevOps culture or you’re still attempting to make the transition, SRE offers numerous benefits to speed and reliability.
SRE fits right at the crossroads of IT operations, support and software engineering. SRE serves as the perfect blend of skills to tighten the relationship between IT and developers – leading to shorter feedback loops, better collaboration and more reliable software.
Ready for an SRE approach? Learn how to hire the best candidates with these SRE interview questions and see average IT salaries in your area.
Pros & cons of being a Site Reliability Engineer
While SREs can’t spend all their time building new features for customers, they’re constantly making an impact on customer experience. In fact, if you’re looking for a role designed to help customers the most – then SRE is it. Site reliability engineering not only improves the lives of customers but, when done right, improves the lives of:
- On-call teams
- IT professionals
- Software developers
SRE can be one of the most fulfilling roles for a software engineer. It can help you better understand the struggles of IT and support, making you a better developer going forward. For more support, check out the state of DevOps today and these must-attend DevOps & SRE conferences.
This posting does not necessarily represent Splunk's position, strategies or opinion