DevOps gained popularity as a way to combat siloed workflows, poor collaboration and a lack of visibility across the software development lifecycle. While establishing a DevOps culture has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have anyone dedicated to developing systems that increase site reliability and performance.
That’s where a site reliability engineer (SRE) comes into the picture.
Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.
So, in this article, let’s…
- Define the basic roles and responsibilities of a site reliability engineer.
- Show how SRE can drastically improve the resilience of your people, processes and technology.
In a traditional setup of siloed IT operations and software development teams, developers would throw their code over to IT professionals. Then, IT would be in charge of deployment, maintenance and any on-call responsibilities associated with the system in production. Luckily, DevOps came along and forced developers to share accountability for systems in production, own their code and take on-call responsibilities.
DevOps pushed shared responsibility for the reliability of your applications and infrastructure. And, while this is a great first step forward, it doesn’t proactively help teams add resilience to their system. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace.
Site reliability engineering is a way to bridge the gap between developers and IT operations, even in a DevOps culture. It isn’t SRE versus DevOps; it’s SRE with DevOps. SRE works like a more proactive form of quality assurance (QA). Site reliability engineers are dedicated full-time to creating software that improves the reliability of systems in production, and their work also includes:
- Fixing issues
- Responding to incidents
- Usually taking on-call responsibilities
Aside from its growing role today, SRE’s biggest claim to fame might be the four golden signals of monitoring: latency, traffic, errors and saturation.
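As a rough illustration of how these signals can be derived from raw request data, the sketch below summarizes one window of requests. The `Request` record, its field names and the `capacity_rps` figure are assumptions made for this example and do not come from any particular monitoring tool. Treating saturation as traffic divided by an assumed capacity is a simplification; real systems usually measure it from resource utilization.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize one time window of requests into the four golden signals."""
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": 0.0}

    # Latency: 95th percentile of request duration (nearest-rank method).
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    latency_p95 = durations[rank - 1]

    # Traffic: requests served per second in this window.
    traffic = len(requests) / window_seconds

    # Errors: share of requests that failed with a server-side (5xx) status.
    error_rate = sum(1 for r in requests if r.status >= 500) / len(requests)

    # Saturation (simplified assumption): how close traffic is to capacity.
    saturation = traffic / capacity_rps

    return {"latency_p95_ms": latency_p95, "traffic_rps": traffic,
            "error_rate": error_rate, "saturation": saturation}


# Hypothetical sample: ten requests in a five-second window, two of them failing.
sample = [Request(duration_ms=10.0 * (i + 1), status=500 if i < 2 else 200)
          for i in range(10)]
signals = golden_signals(sample, window_seconds=5, capacity_rps=4)
print(signals)
```

In practice these numbers would come from a metrics pipeline rather than an in-memory list, but the calculation behind each signal stays much the same.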
Common SRE roles and responsibilities
Implementing an SRE team will greatly benefit both IT operations and software development teams. Not only can SRE make systems in production more reliable, but it will likely help IT, support and development teams spend less time on support escalations, giving them focused time to build new features and services.
So, let’s look at common site reliability engineering roles and responsibilities you can expect to see.
Building software to help DevOps, ITOps & support teams
SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.
Fixing support escalation issues
Similar to the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But, as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production – leading to fewer support escalations.
Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.
Optimizing on-call rotations & processes
More often than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role has a lot of say in how the team improves system reliability through the optimization of on-call processes.
SRE teams will help add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.
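One way to picture "adding context to alerts" is a small enrichment step that runs before an alert reaches the on-call responder. The sketch below is only an illustration: the `Alert` shape, the `RUNBOOKS` and `OWNERS` lookup tables, the service name and the example URL are all hypothetical, and a real setup would pull this data from a service catalog or the alerting tool itself.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables; in practice these might live in a service catalog.
RUNBOOKS = {"checkout-api": "https://runbooks.example.com/checkout-api"}
OWNERS = {"checkout-api": "payments-team"}


@dataclass
class Alert:
    service: str
    summary: str
    context: dict = field(default_factory=dict)


def enrich(alert: Alert) -> Alert:
    """Attach a runbook link and owning team so the responder has context up front."""
    alert.context["runbook"] = RUNBOOKS.get(alert.service, "no runbook on file")
    alert.context["owner"] = OWNERS.get(alert.service, "unassigned")
    return alert


known = enrich(Alert(service="checkout-api", summary="High error rate"))
unknown = enrich(Alert(service="legacy-batch", summary="Job failed"))
print(known.context)
print(unknown.context)
```

The fallback values matter as much as the happy path: an alert that arrives marked "no runbook on file" tells the team exactly which documentation gap to close next.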
Documenting “tribal” knowledge
SRE teams gain exposure to systems in both staging and production, and they work with every technical team. They take part in software development, support, IT operations and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of leaving this knowledge siloed in the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.
Conducting post-incident reviews
Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings and taking action on what they learn.
Then, site reliability engineers are often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.
Where does SRE fit on your team?
Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes and technology within any organization. Whether your team has already taken on a full-blown DevOps culture or you’re still attempting to make the transition, SRE offers numerous benefits to speed and reliability.
SRE fits right at the crossroads of IT operations, support and software engineering. SRE serves as the perfect blend of skills to tighten the relationship between IT and developers – leading to shorter feedback loops, better collaboration and more reliable software.
Pros & cons of being a Site Reliability Engineer
The survey in Catchpoint’s 2021 SRE Report indicated that site reliability engineers were some of the happiest employees in software development and IT. While SREs can’t spend all their time building new features for customers, they’re constantly making an impact on customer experience. In fact, if you’re looking for a role designed to help customers the most, SRE is it.
Site reliability engineering not only improves the lives of customers but, when done right, improves the lives of:
- On-call teams
- IT professionals
- Software developers
SRE can be one of the most fulfilling roles for a software engineer. It can help you better understand the struggles of IT and support, making you a better developer going forward. For more support, explore these DevOps & SRE conferences.
This posting does not necessarily represent Splunk's position, strategies or opinion