SRE (site reliability engineering) is a discipline used by software engineering and IT teams to proactively build and maintain more reliable services. SRE is a functional way to apply software development solutions to IT operations problems. From IT monitoring to software delivery to incident response – site reliability engineers are focused on building and monitoring anything in production that improves service resiliency without harming development speed.
Site reliability engineering is often used as a highly-integrated method for tightening the relationship between developers and IT teams. An SRE’s biggest role is to improve the overall resilience of a system and provide visibility to the health and performance of services across all applications and infrastructure. Site reliability engineers can actively write code to improve service resilience and flexibility. Then, they can help spread information across DevOps and business teams – encouraging a blameless culture focused on workflow visibility and collaboration.
The traditional approach to service management
IT service management (ITSM), has been around since the beginning of computers. System administrators (sysadmins) would handle everything from assembling software components, deploying them and responding to incidents. Then, with the introduction of personal computers, IT professionals needed to define universal principles for reliably handling applications and infrastructure.
The growing adoption of technology gave way to ITIL – a set of defined rules for all of IT operations. For a while, defined rules worked well – software developers wrote the code and gave it to sysadmins who were in charge of configuring and deploying the services. However, this proverbial fence led to a natural division of labor between software developers and sysadmins.
Then, with the birth of the internet and highly complex, integrated systems, Agile software development practices and CI/CD became a necessity. In order to keep up with the faster delivery of always-on services, IT service management practices also needed to change.
- Tightening feedback loops
- Improving collaboration between software developers and IT operations
A DevOps methodology gets IT teams involved earlier in the software delivery lifecycle while also increasing developer accountability for services in production. Application and infrastructure testing is automated and integrated throughout more of the development lifecycle and developers take on-call responsibilities for the services they build.
With the faster delivery of more complex applications and cloud infrastructure, teams needed a way to proactively address reliability concerns – leading to the creation of the modern practice of SRE.
The origin of SRE
The discipline of site reliability engineering first popped up at Google. They needed to make a shift towards an IT-centric organization, aligning everyone across the business – from engineering to sales. SRE became a way to move forward with this vision. Google’s then VP of Engineering, Ben Treynor, defined SRE as:
Fundamentally, it’s what happens when you ask a software engineer to design an operations function. - Google Interview With Ben Treynor
Google started to treat issues that were normally solved manually as software problems – formalizing an SRE team to apply software development expertise to traditional IT operations problems. With developers focused solely on making operations better, the team can build resilience into their services without harming development speed – automating numerous manual tasks and tests, increasing visibility into system health and improving collaboration across all of IT and engineering.
(Learn more about the SRE role.)
The core components of SRE
40-90% of the total costs of a system are incurred after birth. - Google’s SRE Book
While most DevOps and IT professionals are constantly focused on improving the development process, a large number of teams don’t focus on their systems in production. But, the vast majority of application and infrastructure costs are incurred after deployment. It stands to reason that development teams need to spend more time supporting current services. In order to reallocate their time without impeding velocity, SRE teams are forming – dedicating developers to the continuous improvement of the resilience of their production systems.
The core responsibilities of SRE teams normally fall into these categories:
Availability is the term for the amount of time a device, service or other piece of IT infrastructure is usable.
Setting service-level objectives, agreements and indicators (SLOs, SLAs and SLIs) for the underlying service. SLIs are the actual unit of measurement defining the service level that customers can expect of the system. SLIs form the basis of SLOs which are the desired outputs of the system (e.g. 99.99% availability, etc.). SLAs are based on SLOs and given to customers to communicate the expected reliability of the service they’ll be using, and the way the team will react if those numbers aren’t met.
- What are the expectations of internal teams, customers and other external stakeholders?
- What’s the overall importance of service to most organizations?
- Is 99.999% availability really necessary?
These types of questions listed above will help facilitate the speed at which businesses can reliably release new features and services – dictating the way SRE teams initially set SLOs, SLAs and SLIs.
Over time, as SRE teams spend more time working in production environments, engineering organizations begin to see more resilient architecture with further failover options and faster rollback capabilities. These companies can then set higher expectations for customers and stakeholders, leading to impressive SLOs, SLAs and SLIs that drive greater business value. While the greater development and IT teams are in charge of maintaining a consistent release pipeline, SRE teams are tasked with maintaining the overall availability of those services once they’re in production.
(Go deeper into availability management.)
As teams gain maturity in SRE and availability becomes less erratic, they can start to focus on improving service performance metrics like latency, page load speed and ETL.
- Which services or nodes are frequently failing?
- Are customers consistently experiencing page load or latency lag?
While overall availability may not be impacted by performance errors, customers who frequently encounter performance issues will experience fatigue and may be likely to stop using the service.
SRE teams should not only help application support and development teams fix bugs, but they should help proactively identify performance issues across the system. Small performance issues can build up over time and become larger customer-facing incidents. As overall service reliability improves, teams will open up more time to identify small performance issues and fix them.
(Explore the application performance monitoring practice.)
In order to identify performance errors and maintain service availability, SRE teams need to see what’s going on in their systems. Naturally, the SRE team is assigned the great task of implementing monitoring solutions. Because of the way disparate services measure performance and uptime, deciding what to monitor and how to do so effectively is one of the hardest parts of being a site reliability engineer.
At the end of the day, SREs need to think of monitoring as a way to surface a holistic view of a system’s health. Anyone from any department in engineering or IT should be able to look at a single source to determine the overall performance and availability of the services they support.
The need for cross-service, cross-team visibility led to the creation of SRE’s golden signals. The goldens signals serve as a foundation for actionable monitoring and alerting for DevOps and IT teams. In a few pages, we’ll go over SRE’s four golden signals of monitoring and show why they’re such a powerful foundation for service reliability.
The continuous improvement of monitoring, incident response and the optimization of service availability and performance will inherently lead to more resilient systems. At the end of the day, SRE teams build the foundation for a more prepared engineering and IT team. With the monitoring resources provided by the SRE team, the development and IT team can deploy new services quickly and respond to incidents in seconds.
A prepared team knows the health of their services and how to respond when there’s a problem. When site reliability engineers are integrated into engineering and IT, developers are exposed to more of their production environment, and IT operations are involved earlier in the software development lifecycle. SRE teams serve the organization with the weapons and the transparency they need to combat reliability concerns. A reactive SRE team simply responds to issues and fixes them. But, a proactive SRE team puts the resilience of the system directly in the hands of individual team members.
Effective implementation of the core components of SRE requires visibility and transparency across all services and applications within a system. But, measuring the performance and availability of disparate services on a single scale can be convoluted. So, Google’s SRE team developed the four golden signals as a way to consistently track service health across all applications and infrastructure.
What are the four golden signals of SRE?
The four golden signals of SRE are:
SRE’s golden signals define what it means for the system to be “healthy.” Establish benchmarks for each metric showing when the system is healthy – ensuring positive customer experiences and uptime. While a team could always monitor more metrics or logs across the system, the four golden signals are the basic, essential building blocks for any effective monitoring strategy.
Latency refers to the time taken to serve a request.
Define a benchmark for “good” latency rates. Then, monitor the latency of successful requests against failed requests to keep track of health. Tracking latency across the entire system can help identify which services are not performing well and allows teams to detect incidents faster. The latency of errors can help improve the speed at which teams identify an incident – meaning they can dive into incident response faster.
Traffic refers to the stress from demand on the system. The more traffic, the more "stress".
How much stress is the system taking at a given time from users or transactions running through the service? Depending on the business, what you define as traffic could be vastly different from another organization. Is traffic measured as the number of people coming to the site or as the number of requests happening at a given time? By monitoring real-user interactions and traffic in the application or service, SRE teams can see exactly how customers experience the product while also seeing how the system holds up to changes in demand.
Errors, or the error rate, defines the rate of requests that are failing.
SRE teams need to monitor the rate of errors happening across the entire system but also at the individual service level. Whether those errors are based on manually defined logic or they’re explicit errors such as failed HTTP requests, SRE teams need to monitor them. It’s also important to define which errors are critical and which ones are less dangerous. This can help teams identify the true health of a service in the eyes of a customer and take rapid action to fix frequent errors.
Saturation is defined as the overall capacity of the service. The saturation is a high-level overview of the utilization of the system:
- How much more capacity does the service have?
- When is the service maxed out?
- Because most systems begin to degrade before utilization hits 100%, SRE teams also need to determine a benchmark for a “healthy” percentage of utilization.
- What level of saturation ensures service performance and availability for customers?
Taking action with your monitoring
SRE’s Four Golden Signals in the Incident Management Lifecycle
The four golden signals serve as an excellent jumping-off point for actionable monitoring. Tracking the latency, traffic, errors and saturation for all services in near real-time will help all teams identify issues faster. The golden signals also give teams a single pane of glass view into the health of all services – whether they’re maintained by that specific team or not. Instead of disparate monitoring across every feature or service, you can roll all monitoring metrics and logs into a single location.
Effective monitoring will not only lead to improved incident management but it will improve the entire incident lifecycle over time.
Using SRE to facilitate a DevOps mindset
Site reliability engineers expose themselves to many aspects of the system, inherently improving the collaboration between developers and IT operations teams. Facilitating a DevOps mindset through SRE leads to breakthroughs in your team’s productivity and your system’s resilience. When an incident occurs, instead of passing blame between development and IT, SRE opens transparent discussions about how they can improve. SREs are the gatekeepers for efficient, reliable software development practices that don’t force all production responsibilities to IT teams. Learn more by attending an upcoming DevOps conference or event.
Implementing SRE and the four golden signals of monitoring will improve cross-functional visibility and collaboration, bringing IT operations and developers together. Join the teams that are already using SRE’s four golden signals to promote a positive engineering culture and drive better customer experiences.
What is Splunk?
This posting is my own and does not necessarily represent Splunk's position, strategies, or opinion.