Site reliability engineering continues to gain traction in software development and IT. SRE is at the crossroads of software development and IT operations. In Ben Treynor’s words, SRE is “what happens when you ask a software engineer to design an operations function.”
Site reliability engineering is a way for developers to actively build services and functions to improve the resilience of people, processes and technical systems. SRE lives somewhat in the shadows – contributing greatly to the team’s overall productivity and the reliability of the team’s applications and infrastructure. If constantly improving the efficiency and resilience of the software delivery lifecycle appeals to you, then you should look at working in SRE.
We’ve put together this SRE interview guide for both candidates and hiring managers, so you’ll be prepared for your next SRE interview.
Overview of the SRE role
A site reliability engineer is essentially a mix of a software developer and a traditional IT operations professional.
- Like IT professionals, SREs are highly skilled at identifying weaknesses and blind spots in their infrastructure and systems.
- Unlike traditional IT, SRE teams also have the autonomy and ability to write and deploy code that proactively fixes problems and avoids incidents.
SRE inherently feeds into a forward-thinking, efficient DevOps culture. By taking the time to identify reliability concerns and building a team dedicated to addressing them, you’ve already started to shift reliability and testing further left into the development lifecycle. Additionally, SRE helps feed IT concerns and information back into the development teams – leading to faster, more resilient software development.
SRE helps break the stereotype that developers don’t take accountability for the services they build. Along with DevOps methodologies, SRE helps bridge the gap between IT and developers. And, even if your team still believes in the “throw-it-over-the-wall” mentality between traditional IT and development, SRE teams can still retroactively add value to your systems. By running tests in production and continuously adding new functionality dedicated to resilience, SRE teams constantly find new ways to make people, processes and technology better.
SRE primary roles & responsibilities
The first question you need to ask yourself is, “Do I want to work as an SRE?” To answer that question, you need to know what you’re getting into. Even before you start interviewing for that next SRE role, you should understand the common responsibilities of a site reliability engineer, including these:
- Building services for DevOps, ITOps & customer support teams
- Remediating support escalation cases
- Taking & enhancing on-call responsibilities
- Documenting & sharing knowledge
- Conducting post-incident reviews that actually work
Interview questions for Site Reliability Engineers
While every engineering and IT organization is built differently, there are a few common questions you can expect during an SRE interview. These questions and explanations will help you prepare when heading into an SRE interview.
What’s the difference between SRE and DevOps?
The answer to this question will vary from team to team. Generally, this is an opportunity for you to highlight:
- The importance of SRE
- How you’ve used site reliability engineering in the past to bolster resilience and productivity
Some organizations will have dedicated DevOps teams, while others will simply follow DevOps methodologies. You’ll satisfy the interviewer as long as you’re thoughtful about the way you’ve used SRE in the past and how you see it contributing to overall reliability and efficiency in IT and software development in the future.
(Read more in our DevOps vs. SRE comparison.)
What appeals to you about becoming an SRE?
As in most other job interviews, it’s important to show why you’re excited about the role. SRE isn’t always viewed as the most glamorous role, and many developers will shy away from it. So, it’s important to speak to why you’re excited about building services that improve system reliability and lead to greater customer and employee happiness.
Being part of an SRE team should excite you because you’ll be able to make a large impact that affects everyone from product managers to end users.
How does your current deployment pipeline look? What are the biggest issues?
At first, this seems like a simple question — but beware: it’s a loaded one. The interviewer wants to determine your ability to analyze your deployment pipeline and make intelligent decisions for changing it. SRE teams are crucial for:
- Identifying monitoring deficiencies and deployment bottlenecks.
- Surfacing reliability concerns to the applicable parties.
Being able to determine where your team can make the biggest improvements to resilience without drastically affecting employee productivity or process will show that you’re able to problem-solve at a high level.
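One way to back up an answer about deployment bottlenecks is with per-stage timing data. The sketch below is a minimal illustration, not a prescribed method: the stage names, the sample durations and the `rank_stages` helper are all hypothetical, and it assumes you can export how long each pipeline stage took for a handful of recent runs.

```python
from statistics import median
from typing import Dict, List, Tuple


def rank_stages(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (stage, median duration in minutes) pairs, slowest first.

    Each run is a hypothetical mapping of stage name -> minutes spent.
    """
    durations: Dict[str, List[float]] = {}
    for run in runs:
        for stage, minutes in run.items():
            durations.setdefault(stage, []).append(minutes)
    # Median is less sensitive to a single unusually slow run than the mean.
    medians = [(stage, median(values)) for stage, values in durations.items()]
    return sorted(medians, key=lambda pair: pair[1], reverse=True)


# Illustrative data only: three recent pipeline runs.
recent_runs = [
    {"build": 4, "test": 12, "deploy": 3},
    {"build": 5, "test": 15, "deploy": 2},
    {"build": 6, "test": 11, "deploy": 4},
]

for stage, minutes in rank_stages(recent_runs):
    print(f"{stage}: {minutes} min (median)")
```

The stage at the top of the list is a reasonable first place to look, though a slow stage is not automatically the biggest reliability concern.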
How does your team monitor their system and track “success”?
This is an excellent technical question to determine how you’ve set up monitoring and alerting tools and how you’ve helped define the “healthy” state of a system in the past.
If you want to join an SRE team, you’ll need to understand how you can leverage both internal and external outputs to determine overall system health. Then, you should be able to translate that information into insights and action for IT and engineering teams.
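As a rough illustration of turning internal and external outputs into a single “healthy” verdict, here is a minimal sketch. The signal names, the thresholds and the `evaluate_health` function are assumptions made for this example; real thresholds depend on the service and on what your team has agreed counts as healthy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSignals:
    """Hypothetical inputs; names and units are illustrative."""
    error_rate: float          # internal: fraction of failed requests (0.0 to 1.0)
    p95_latency_ms: float      # internal: 95th-percentile request latency
    probe_success_rate: float  # external: fraction of synthetic checks passing


def evaluate_health(signals: HealthSignals,
                    max_error_rate: float = 0.01,
                    max_p95_latency_ms: float = 500.0,
                    min_probe_success_rate: float = 0.99) -> List[str]:
    """Return human-readable problems; an empty list means 'healthy'."""
    problems: List[str] = []
    if signals.error_rate > max_error_rate:
        problems.append(
            f"error rate {signals.error_rate:.2%} exceeds {max_error_rate:.2%}")
    if signals.p95_latency_ms > max_p95_latency_ms:
        problems.append(
            f"p95 latency {signals.p95_latency_ms:.0f} ms exceeds "
            f"{max_p95_latency_ms:.0f} ms")
    if signals.probe_success_rate < min_probe_success_rate:
        problems.append(
            f"external probe success {signals.probe_success_rate:.2%} is below "
            f"{min_probe_success_rate:.2%}")
    return problems


# Example: internal metrics look fine, but external probes are failing.
current = HealthSignals(error_rate=0.002, p95_latency_ms=310.0,
                        probe_success_rate=0.95)
for problem in evaluate_health(current):
    print("UNHEALTHY:", problem)
```

Returning the list of problems, rather than a bare true/false, keeps the output actionable for the IT and engineering teams who have to respond.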
What tools, programming languages & architectures are you familiar with?
This is a quick, straightforward question. The interviewer wants to know whether you’re familiar with the languages and technical systems you’ll need to use in order to do your job.
What’s the relationship between your ITOps and engineering teams? How could that relationship improve?
Because of SRE’s involvement in so many aspects of the engineering organization and business, it’s important that you can identify human bottlenecks in productivity. With this question, the interviewer is trying to determine how you would go about solving issues between cross-functional teams. Most of the time, it’s as simple as finding ways to improve the communication and visibility across different departments – helping people find the information they need when they need it.
What does the on-call setup look like? In a perfect world, how would you structure on-call for your team?
Being a steward for on-call efficiency and quality of life will likely be a core responsibility for any site reliability engineer. So, for any SRE interview, it’s likely you’ll need to show how you would go about setting up a humane on-call experience. What can you do to improve the on-call experience?
Make sure you address this question from the viewpoint that on-call isn’t simply about processes and tooling — but that people need to be a core focus when setting up your on-call rotations and alert rules.
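To make the people-first point concrete, the sketch below builds a weekly rotation that respects two humane constraints: nobody is scheduled while they are marked unavailable, and nobody is on call two weeks in a row. The `build_rotation` function, the roster names and the constraints themselves are hypothetical choices for illustration, not a standard algorithm or a feature of any particular on-call tool.

```python
from itertools import cycle
from typing import Dict, List, Optional, Set


def build_rotation(engineers: List[str], weeks: int,
                   unavailable: Optional[Dict[int, Set[str]]] = None) -> List[str]:
    """Assign one primary on-call engineer per week (week numbers start at 0).

    Walks the roster in round-robin order, skipping anyone who is marked
    unavailable that week or who was on call the previous week.
    """
    unavailable = unavailable or {}
    schedule: List[str] = []
    roster = cycle(engineers)
    for week in range(weeks):
        away = unavailable.get(week, set())
        previous = schedule[-1] if schedule else None
        # Try each engineer at most once per week before giving up.
        for _ in range(len(engineers)):
            candidate = next(roster)
            if candidate not in away and candidate != previous:
                schedule.append(candidate)
                break
        else:
            raise ValueError(f"no eligible engineer for week {week}")
    return schedule


# Illustrative roster: "ben" is on vacation during week 1.
rotation = build_rotation(["ana", "ben", "chen"], weeks=4,
                          unavailable={1: {"ben"}})
for week, engineer in enumerate(rotation):
    print(f"week {week}: {engineer}")
```

Note that plain round-robin does not rebalance load after someone is skipped, so a real schedule would likely also track how many shifts each person has taken.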
SRE offers autonomy & improvement
Being an SRE can be one of the most fulfilling roles you’ll ever have on an engineering team. You should have the autonomy to make organizational changes and run experiments that lead to greater reliability in the system. And, many times, you’ll find yourself in a position where you can make the lives of customers and colleagues much better.
You can also expect to learn more in a number of IT and software development disciplines, improving your knowledge of the entire software delivery lifecycle and making you a better developer.