DATA INSIDER

What Is Site Reliability Engineering (SRE)?

Site reliability engineering (SRE) is a method of applying software engineering strategies to IT operations. Tasks that have historically been performed manually by ITOps teams are handed over to a dedicated SRE team that uses software and automation to manage production systems and solve problems. 

The goal of SRE is ultimately to create and support scalable and highly reliable software systems. Historically, operations managed dozens, or at most hundreds, of machines and were able to perform tasks like production system management and incident response manually. But as systems have extended or migrated to the cloud, system administrators are increasingly responsible for thousands of hosts — far exceeding most operations teams’ bandwidth. SRE solves these operations problems by using software code to automate governance and optimization of these systems.

SRE is performed by teams of site reliability engineers — software engineers who also have IT operations experience — as well as cloud architects and other embedded SRE partners whose job is to ensure websites and apps are always available to end users. Their ability to write code and maintain large-scale IT environments makes them uniquely suited to overseeing the automation and management of system administration and IT operations tasks.

SRE is an important practice in cloud-native environments, where it helps teams balance the release of new features and services against the need to keep them available. In this piece, we’ll look at how SRE works, its relation to DevOps and how it can benefit your organization.

What are the benefits of SRE?

Site reliability engineers deliver customer value by ensuring reliability of software development and incident response life cycles, resulting in several benefits to those frameworks, including:

  • Observability into the health of services: SRE teams are involved in many different areas of an organization’s systems, which gives them insight into how those systems are connected and how they work together. Site reliability engineers know how to track metrics, logs and traces across many disparate services to generate a holistic picture of system health, which provides them the context they need when an incident occurs.
  • Stronger ties between developers and operations: Site reliability engineers strengthen the relationships between developers and ITOps by introducing automation and improving communication that benefits both teams. SRE can reveal and shore up weaknesses in the release pipeline and create accountability around on-call availability and incident response.
  • Modernization of the NOC: Network operations centers (NOCs) have typically relied heavily on repetitive human labor for triaging incidents and alerts and determining how to route them to the right person. SRE modernizes these processes with automation and machine learning, enabling alerts to be automatically directed to the person responsible for fixing the problem.
  • Organization of on-call structures and alerting workflows: Site reliability engineers bring deep knowledge of how to build an effective on-call process and optimize alerts. They can determine the best approach to on-call schedules and alert rules, identify the best way to route alerts through systems, and take on some of the on-call responsibilities themselves.
  • Surfacing of production concerns: With deep visibility into production environments, site reliability engineers are responsible for observability of system health and services, allowing them to pinpoint deficiencies that can potentially impact customers. As these issues come to the surface, teams can address fixes early on in the product road map.
  • Stewardship across engineering teams: Among many other things, site reliability engineers help improve, increase and enforce best practices and support inter-departmental resilience across the organization.
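
The automated alert routing described in the NOC and on-call bullets above can be sketched as a lookup from an alert's service to the engineer currently on call. This is a minimal illustration under stated assumptions, not any product's API: the ownership table, the weekly rotation rule and the names `OWNERS`, `ROTATIONS`, `FALLBACK`, `on_call` and `route_alert` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical mapping of service name to owning team (illustration only).
OWNERS = {"checkout": "payments", "search": "discovery"}

# Hypothetical weekly rotations: each team lists engineers in rotation order.
ROTATIONS = {
    "payments": ["alice", "bob"],
    "discovery": ["carol", "dave", "erin"],
}

# Hypothetical catch-all queue for alerts from services with no known owner.
FALLBACK = "noc-triage"


@dataclass
class Alert:
    service: str
    summary: str


def on_call(team: str, day: date) -> str:
    """Pick the engineer on call for the ISO week that contains `day`."""
    engineers = ROTATIONS[team]
    week = day.isocalendar()[1]
    return engineers[week % len(engineers)]


def route_alert(alert: Alert, day: date) -> str:
    """Return who should receive the alert, falling back to manual triage."""
    team = OWNERS.get(alert.service)
    if team is None:
        return FALLBACK
    return on_call(team, day)
```

A real deployment would read ownership and schedules from a paging or scheduling system rather than hard-coded tables; the point of the sketch is only that routing becomes a deterministic function instead of a manual triage step.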

What are key practices in site reliability engineering?

All SRE practices are geared toward improving system reliability and can be grouped into five core categories:

  • Availability: The SRE team is responsible for maintaining the availability of systems and services once they’re in production, starting by setting service-level indicators (SLIs), service-level objectives (SLOs) and service-level agreements (SLAs) for the underlying service. SLIs are the metrics, such as uptime or request latency, that enterprises use to measure the service level provided to customers. SLOs set the target values for those SLIs (99.99% availability, for example). Both are then used to create the SLA, which states the expected reliability of the service and how the team will respond if that goal isn’t met.
  • Performance: Once availability stabilizes, SRE teams can turn their attention to improving service performance. Latency, page load speed and other performance metrics don’t necessarily impact overall availability, but they can compound over time, eventually deterring customers from using the service if they occur frequently enough. SRE teams assist development teams and application support, fix bugs and proactively identify performance issues across the system, gradually taking on smaller performance issues and fixes as overall service reliability improves.
  • Monitoring: SRE teams are responsible for deciding what to monitor and then implementing the appropriate monitoring solutions based on how the respective services measure uptime and performance. They ultimately need to implement a solution that provides a holistic view of system health to engineering or IT teams — one of SRE’s biggest challenges.
  • Incident response: SRE teams are critical for incident response — the mobilization of an incident (not to be confused with incident management, which is the system of record and audit trail). SRE team members must be available to respond to, explain and review any incidents that occur within the system. This may include auditing production workflows, processes, alert criteria and other factors around a deployment. Typically, SRE teams use an on-call playbook to guide their responses to an event. They also facilitate blameless post-mortems to understand what caused a particular issue and how to prevent it from happening again and to document any improvement that needs to be made.
  • Preparation: Ultimately, SRE teams help development and IT teams be more prepared, understand the health of their services better and respond to incidents quickly and effectively. The integration of site reliability engineers into development and IT allows developers to learn more about the production environment and helps ITOps teams get involved earlier in the development life cycle. A big part of this effort includes deployment prep, working with the engineers to make sure a new service is observable. The result is a more proactive, responsive approach to reliability concerns.
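
To make the relationship between an SLI and an SLO concrete, the following sketch computes an availability SLI from request counts, compares it with an SLO target and reports how much of the resulting error budget is left. It is an illustration under assumptions rather than a standard implementation: the function names, the request-based definition of availability and the 30-day window are choices made here.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo: float) -> float:
    """Share of the error budget left: 1.0 untouched, 0.0 spent, < 0 overspent."""
    if not 0.0 < slo < 1.0 or total_requests <= 0:
        raise ValueError("need 0 < slo < 1 and at least one request")
    allowed_failures = total_requests * (1.0 - slo)
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime a time-based availability SLO tolerates over `days` days."""
    return days * 24 * 60 * (1.0 - slo)
```

Under these definitions, a 99.99% availability target leaves roughly 4.3 minutes of downtime in a 30-day window, which is why the choice of target has such a direct effect on how much risk a team can take with releases.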

What are the “four golden signals” of SRE?

SRE practices require visibility and transparency across all services and applications within a distributed system. But measuring the performance and availability of distinct services in these environments is complex. To make the process more workable, Google’s SRE team developed the four golden signals, one of several frameworks for effective distributed systems monitoring that establish benchmarks indicating when a system is healthy. 

  • Latency: This is the time it takes to serve a request. Teams define a benchmark for “good” latency and monitor the latency of successful requests separately from that of failed requests to track the health of the system. By tracking latency across the entire system, SRE teams can help determine which services are not performing well and detect incidents earlier.
  • Traffic: This is a measure of how much stress the system is taking from users or transactions running through the service at a given time. Monitoring real-user interactions and traffic in the application or service allows SRE teams to see how customers experience that product while also seeing how changes in demand impact the system.
  • Errors: This refers to the rate at which requests fail. SRE teams have to monitor the rate of errors occurring across the entire system, create an error budget and define which errors are most critical. This enables teams to understand the health of a service from the customer’s point of view and respond quickly to fix frequent errors.
  • Saturation: This refers to the overall capacity of the system and the resources it has available, providing SRE teams visibility into the capacity of a given service. Most systems begin to degrade before they reach 100% utilization, so SRE teams must set a benchmark for a “healthy” percentage of utilization (i.e., one that secures service performance and availability for customers).
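
As a rough illustration of the four signals above, the sketch below summarizes one monitoring window from a list of request records. Everything in it is an assumption made for the example: the `Request` record, the choice of a nearest-rank 95th percentile for latency, the caller-supplied `utilization` figure and the 80% "healthy" threshold are hypothetical, not values prescribed by the framework.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    duration_ms: float
    ok: bool  # True if the request succeeded


def percentile(values: List[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_seconds: float,
                   utilization: float,
                   healthy_utilization: float = 0.8) -> dict:
    """Summarize one window. `utilization` is a 0-1 fraction supplied by
    the caller (for example CPU or connection-pool usage)."""
    ok = [r.duration_ms for r in requests if r.ok]
    failed = [r.duration_ms for r in requests if not r.ok]
    total = len(requests)
    return {
        # Latency: tracked separately for successful and failed requests.
        "latency_p95_ok_ms": percentile(ok, 95),
        "latency_p95_failed_ms": percentile(failed, 95),
        # Traffic: demand on the service, in requests per second.
        "traffic_rps": total / window_seconds,
        # Errors: share of requests that failed.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: flag utilization above the chosen healthy benchmark.
        "saturated": utilization > healthy_utilization,
    }
```

In practice these values would come from a metrics pipeline rather than an in-memory list, but the same four numbers per service and per window are what dashboards and alert rules are typically built on.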

What is the role of SRE in DevOps?

The role of SRE in DevOps is to ensure that the apps and services that DevOps teams develop are available to end users when they need them. Although SRE and DevOps are often discussed together, they’re two distinct disciplines. 

DevOps is recognized both as a practice and as a set of principles. DevOps as a practice is an approach to IT delivery that brings people, practices and tools together to eliminate the silos between development and operations teams. As its name indicates, it bridges the gap between software development, which creates the application code, and IT operations, which puts those applications into production, makes them available to end users and maintains their reliability. 

DevOps as a principle, or DevOps culture, grew out of the agile movement, which established principles to guide better software development practices emphasizing gradual delivery, team collaboration and continual planning and learning. By bringing software development and IT operations together, DevOps extends agile principles across the entire software development life cycle (SDLC), optimizing the entire workflow with a goal of continuous improvement. High-performing DevOps teams not only see faster code iterations and deployments but overall shorter time to market for new ideas, fewer bugs and more stable infrastructure.

SRE is a key function of DevOps principles, and a peer to DevOps as a practice, but with narrower responsibilities. Although DevOps dictates that developers own and operate their product in practice — both by writing code and addressing related problems — the push to constantly develop new features for their apps often renders that undertaking impractical. Site reliability engineers can step in and use their knowledge of both software development and IT operations to oversee the management of the code, including deployment, configuration and monitoring. At a higher level, SRE tightens the relationship between development and operations by ensuring the rapid deployment of new software and features, as well as the stability of the IT infrastructure.

How is automation fundamental to SRE?

A chief concern of SRE is reducing redundant human effort through automation. Site reliability engineers benefit from automated operations tasks like log analysis, performance tuning and many others. Automation is also critical for reducing mean time to resolution (MTTR), mitigating the impact of downtime and outages and providing expanded context for incident response activities such as monitoring and alerting, as well as patching.

Automation is essential in modern development environments, where speed, consistency, efficiency and adaptability are critical. Perhaps most importantly, it reduces the number of mundane, repetitive operations tasks, freeing SRE team members to focus on creating new tools, monitoring infrastructure changes and performing other tasks that improve reliability.
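
One way to check whether automation is actually shortening incident response is to compute MTTR, along with mean time to acknowledge (MTTA), from incident timestamps. The sketch below assumes a hypothetical `Incident` record with three timestamps and measures both metrics from the moment the incident was opened; real incident tools may define the start and end points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from an incident opening to its acknowledgement."""
    return mean_delta([i.acknowledged - i.opened for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from an incident opening to its resolution."""
    return mean_delta([i.resolved - i.opened for i in incidents])
```

Comparing these figures before and after an automation change (for example, automated alert routing) gives a simple, if coarse, measure of its effect.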

What is the necessary philosophy and skill set to implement SRE properly?

Effective SRE requires a holistic understanding of systems and how they work together. Site reliability engineers have to approach the system as a whole, giving as much weight to its interconnections as its individual components. With that in mind, teams can effectively implement SRE by following seven principles laid out in “The Site Reliability Workbook”:

  • Operations is a software problem: As the main tenet of SRE, this dictates that a software engineering approach is the solution to a software problem.
  • Manage by service-level objectives: Maintaining 100% availability isn’t the goal of SRE, and failure is expected. SRE works with the product team to set an agreed-upon availability target and manages the service to that SLO.
  • Work to minimize toil: Repetitive, tedious, manual work should never be the default approach — any task or operation that can be automated should be.
  • Automate this year’s job away: Determine what tasks or operations to automate and create a strategy for implementation.
  • Move fast by reducing the cost of failure: Identify and fix problems early in the life cycle to reduce or minimize impact to the customer.
  • Share ownership with developers: Remove silos and reduce boundaries so that development and SRE teams share visibility and ownership.
  • Use the same tooling, regardless of function or job title: One SRE team can’t effectively support multiple development teams if each is using separate tooling. Standardization of tool sets is key to a successful SRE practice.

What are the roles and responsibilities of a site reliability engineer?

Site reliability engineers have a range of responsibilities. Some of the most common SRE roles include:

  • Building software to help operations and support teams: Site reliability engineers use their development skills to build and implement software that helps IT and support staff do their jobs better. This can range from building a new tool to shoring up weaknesses in software delivery to adjusting existing monitoring tools to changing code in production. 
  • Fixing support escalation issues: Early on, site reliability engineers spend time fixing support escalation cases, a workload that decreases as system reliability improves. Because of their diverse skill set and experience, site reliability engineers have the necessary expertise to route issues to the appropriate people and teams.
  • Optimizing on-call rotations and processes: Site reliability engineers are usually expected to be available during an incident, giving them a lot of say in how to optimize the on-call process to improve system reliability. SRE teams can add automation and context to alerts to improve collaborative incident response, as well as update runbooks and documentation to help prepare on-call teams for future incidents.
  • Documenting knowledge: SRE teams are involved in virtually every aspect of the software development life cycle, which gives them a wealth of historical knowledge about services and processes. Site reliability engineers can then regularly iterate on their learnings and maintain runbooks so engineering teams can get the information they need when they need it — a benefit that enhances stewardship and facilitates trust between teams.
  • Conducting post-incident reviews: SRE teams are tasked with ensuring that software developers and ITOps professionals are conducting blameless reviews, documenting their findings and putting what they learn into action. Site reliability engineers are also responsible for any post-incident action items that involve building or optimizing part of the SDLC or incident life cycle.

What should I know before getting started with SRE?

SRE aligns closely with the DevOps movement and depends on tight interaction between development and operations teams, making a culture of trust, collaboration and continuous improvement essential for a team to thrive.

In addition, site reliability engineers need a combination of development and operations skills, as well as an understanding of traditional software engineering tools and practices. They also need to understand systems holistically and be open to new ways of moving them toward reliability.

Finally, SRE requires a significant commitment. It’s not a tool or patch you apply to fix a flawed system. It requires some measure of cultural change, the introduction of new processes and new ways of thinking about operations. As with other initiatives, SRE may start with a single champion, but a mature SRE practice will eventually need buy-in from the top.

The Bottom Line: Make reliability a feature of your software development

Increasing and maintaining uptime is a constant struggle for every organization. But businesses that have effective SRE processes have a leg up on competitors, with greater system resilience and, consequently, a larger percentage of successful releases. When incidents occur, they have a shorter mean time to acknowledge and repair (MTTA/MTTR). Less time spent fixing production issues means that all teams, from developers to SRE and operations, can focus on delivering business value in their particular disciplines. As a result, reliability becomes a feature of software development rather than an impediment to it.
