Creating a Process for Continuous Improvement
Creating an efficient, streamlined process to raise, discuss, and effect improvements to the reliability and scalability of VictorOps was a high priority early on, and we wanted to build a first-class development workflow to address it. To achieve this, council members would regularly collect SRE concerns and improvement ideas from their teams. These concerns would then be vetted together in front of the council in order to build an SRE backlog, which included breaking work down into team-specific stories as well as epic-level work. During subsequent program planning sessions, teams would pull work into sprints. On a regular cadence, teams would present before-and-after improvements once a concern had been addressed. The combination of these efforts would help shape and provide input to the SRE roadmap.
Create a Formal Submission Process
We wanted to standardize what would be needed for all future submissions. This would allow us to evaluate and prioritize them accordingly. As a result, we established a formal process and outlined a few basic guidelines each concern would be evaluated against. Once a concern was identified, it would be raised in the following council session. The council has three initial criteria that each concern must address.
Required criteria to raise concerns to the council:
- Why is this SRE?
- Why is this important?
- What is involved?
Collectively, the council would evaluate each concern and either accept or reject it. If a concern was accepted, we would create an “epic” together, ensuring all relevant details were captured in our project planning tool.
If the council deems the epic to be properly vetted, a story would then be submitted by the council member who raised it. From there, it follows the path of any other engineering effort: work is assigned during sprint planning, and engineers follow their normal routine of building, testing, and deploying to the pre-production environment, at which point we begin gathering results from the instrumentation that has just been added. This gives us more visibility into the health of the system.
Trying to better understand the reality of our systems naturally leads to discussions about instrumenting applications earlier in the SDLC. Once engineers realize that they will be the ones responding to problems in production environments, it begins to make a lot of sense to instrument earlier on.
Engineers become familiar with the monitoring and alerting tools. They get to craft their own alerts, ensuring that when they are woken in the middle of the night for a problem, they know with greater certainty that this is indeed an actionable alert and, because they’ve seen this before in pre-production environments, they know exactly what kind of detail, context, and tools they will require in that sleepy-eyed moment. It’s like helpful engineers from the past… traveling forward in time to help out during an outage!
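One lightweight way to make that past-self help concrete is to require that every alert definition carry its context with it. The sketch below is purely illustrative (the `Alert` fields, URLs, and `page_worthy` helper are our own invention, not a VictorOps API); it shows the kind of metadata an engineer might attach while the pre-production behavior is still fresh:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """An alert definition authored by the engineer who owns the service."""
    name: str
    condition: str      # human-readable trigger, e.g. "p99 latency > 2s for 5m"
    actionable: bool    # vetted in pre-production: a human can actually act on it
    runbook_url: str    # what to do in that sleepy-eyed moment
    dashboards: list = field(default_factory=list)  # links to the relevant telemetry

def page_worthy(alert: Alert) -> bool:
    # Only wake someone up for alerts that are actionable and carry context.
    return alert.actionable and bool(alert.runbook_url)

latency_alert = Alert(
    name="checkout-p99-latency",
    condition="p99 latency > 2s for 5m",
    actionable=True,
    runbook_url="https://wiki.example.com/runbooks/checkout-latency",
    dashboards=["https://grafana.example.com/d/checkout"],
)
print(page_worthy(latency_alert))  # True
```

The point of the sketch is the review gate, not the data model: if an alert can’t pass a check like `page_worthy` before it ships, it probably shouldn’t page anyone in production.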
Framework for Submitting SRE Concerns
1. Team representative submits a concern to the council
2. Council assesses concern using the following guidelines:
a. Why is this SRE?
b. Why is this important?
c. What is involved?
3. Council determines if the concern is valid
4. If valid, an “epic” is created
5. If the epic is vetted, a story for the work is submitted by the council member
6. Sprint planning
7. Engineers build, test, and deploy to pre-production
8. Gather results
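As a rough sketch, the framework above can be modeled as a tiny state machine. Everything here is hypothetical (the stage names and `assess` helper are ours, not drawn from any real project-planning tool); it simply shows that a concern only advances when all three guideline questions have answers:

```python
from enum import Enum

class Stage(Enum):
    SUBMITTED = 1   # step 1: team representative raises a concern
    REJECTED = 2    # council found the concern invalid
    EPIC = 3        # steps 3-4: a valid concern becomes an "epic"

GUIDELINES = ("Why is this SRE?", "Why is this important?", "What is involved?")

def assess(concern: dict) -> Stage:
    """Council assessment (step 2): valid only if every guideline is answered."""
    if all(concern.get(question) for question in GUIDELINES):
        return Stage.EPIC
    return Stage.REJECTED

concern = {
    "Why is this SRE?": "Queue backlogs delay alert delivery.",
    "Why is this important?": "Late alerts directly hurt customer reliability.",
    "What is involved?": "Instrument queue depth; add autoscaling.",
}
print(assess(concern))  # Stage.EPIC
```

A concern missing any of the three answers would come back `Stage.REJECTED`, which mirrors how the council sends incomplete submissions back to the representative.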
Tools for SREs
Of course, in our efforts to create an observable system, conversations around tooling surfaced. The council would direct the team to useful and powerful tooling for instrumenting the system and serve as a unified resource for toolset decision making. Architecture reviews and decisions would be a social, group effort.
As we continue our journey, we want to make sure that all new tooling MUST be data-driven. If any existing tooling is polluted or is inhibiting effective usage, let’s correct that. When we don’t have accurate data and telemetry on the flow of value through our system (reminder: the value is the service VictorOps offers AND the underlying infrastructure on which it is provided), we have a very limited view of reality.
A tangible result of our SRE efforts should be that we have empirically reduced the unknowns and increased what is now “knowable.” You don’t know what you don’t know… and that’s a problem when it comes to reliability. We need to operate on realities, not hunches. We need to be able to prove that work is important, and the benefits should be measurable.
SRE aims to alleviate overhead for all other teams affected by this problem domain.
Meeting Frequency and Format
Sixty-minute meetings would be held every other week. All meetings were (and are) open to anyone in the company interested in either contributing to reliability conversations or learning more about how the system currently works and proposed improvements.
The council was opt-in, and until efforts were more formalized, SRE work was not to interfere with existing planned sprint work. There were no obligations to contribute to SRE conversations, yet everyone was encouraged to. The experience was to remain collaborative and engaging rather than a top-down “project”: create ownership that helps to move the needle on our own culture of reliability.
Most of the key responsibilities of the council became obvious very quickly. However, to formalize them, we established that the SRE council was at least initially responsible for the following:
Bring Concerns From Your Team
In order to encourage our entire engineering team to embrace and own reliability in their own domains, the council coaches and encourages individuals to raise any concerns or ideas. Improvements would be made continuously to process and tooling to improve the system from a holistic point of view. By diversifying our council, we had subject matter experts from all corners of the business bringing ideas and concerns that others would not have known about.
Vet Concerns in Order to Build SRE Backlog
Coming up with ideas is one thing, but if work is never performed to address the concerns, no improvements will be made. Functionality and features are perceived as a better use of engineering resources unless we can make a bulletproof argument that our concerns and the associated work are actually tied to improving the system from the customer’s experience. We knew we needed a process to convert these concerns into engineering work: a first-class workflow that moves reliability and scalability work into our backlog, where it is prioritized as important engineering work. The council would help break down high-level work into detailed story-level representations, as well as be a representative during backlog refinement and sprint planning exercises.
Present Before and After Improvements Once a Concern is Addressed
To encourage accountability and acknowledgment for improving the system, we asked that representatives present before-and-after results of improvements during the next program increment planning week.
We want to regularly demonstrate to the organization how we are continuously looking for methods to evaluate and improve the technology, process, and people as they relate to building, deploying, operating, and supporting the “value” of the VictorOps service—including minimizing the disruption of services from these efforts.
Provide Input to SRE Roadmap
Along our journey, the council would provide input to the overall SRE roadmap. By unifying an understanding of SRE and associated efforts across the council and organization, we will produce a comprehensive SRE roadmap with input from all teams and outline specifics on how we will get there. This would be an ongoing effort, as the needs and objectives of the business can and will change quickly and often dramatically. Bringing value to the end user is the ultimate goal. What that value looks like in the form of functionality may shift and change, but reliability and scalability will remain a constant priority.
SRE Council Responsibilities:
- Bring concerns from your team
- Vet concerns in order to build SRE backlog
- Form into epic level work - break into team-specific stories
- During program planning: Teams pull work into sprints
- Present before/after improvements once a concern is addressed
- Provide input to SRE roadmap
The SRE Council is NOT responsible for:
- Responding to immediate customer needs
- Discovering bugs in functionality and issues with user experience
- Exploring or defining creative user functionality
To dive deeper into the responsibilities of SRE, there are a few more things our council chose to keep outside the scope.
According to the Support team, SRE was NOT responsible for responding to the immediate needs of customers. While attending to and communicating trends indicating future reliability issues for customers is greatly appreciated, SRE was not part of an escalation path for customer issues received by the support team. When we asked our QA team, they let us know that discovering bugs in functionality and issues with user experience was NOT the responsibility of SRE.
SRE would instead look for ways to support identifying reliability problems in the user experience through a number of approaches.
Not only did we solicit feedback from our different engineering teams, but we also wanted to hear from members of the Product team. Inviting input from many different perspectives should give us a more holistic approach to what SRE means to VictorOps and align our objectives and incentives.
To the Product team, SRE was NOT responsible for work related to exploring or defining creative user functionality. Ideas and feedback pertaining to product enhancements are always welcome, yet SRE would not own this as a core responsibility.
Ensuring that new functionality is instrumented from a reliability perspective means bringing multiple areas of expertise together to inform improvements to the overall product faster and with fewer service disruptions. Involving Product Owners in these discussions surfaces effort that may be relevant to sprint planning and feature work. Don’t forget to share findings that may require engineering resources beyond feature work.
When we asked the front-end engineers where an SRE’s role ends, they made it clear that building out new systems and user functionality was their domain—and outside of the expectations for an SRE. If SRE could help ensure that new functionality is instrumented from a reliability perspective, the front-end engineers would own the rest.
Our IT Operations team informed us that building and supporting infrastructure that runs the product was NOT an expectation of SRE. However, any help with forecasting demand and proactively triggering automated scalability efforts would be greatly appreciated.
Last, we got together with our data team to gather their feedback on what SRE should NOT be for them. Their answer was simply…
There were no real surprises with these conversations. Most teams are clear on their role and responsibility in delivering value to the end user. However, it did help surface talking points and suggestions around what efforts SRE might be able to bring to the table to increase our overall reliability, as well as increase our ability to deliver functionality (read: value) to the end user faster.
When examining these expectations, we realized that when we put ourselves in the perspective of the end user and empathetically understand the problems they are solving for, it was clear that the ideal customer profile, as they say, sounded a whole lot like ourselves.
VictorOps needs to be able to deliver value in the form of features that enable customers to do what they love (build systems that enable others) and we need to do it faster while still maintaining reliability.
This is a common challenge for many of our customers. While some are just looking for better ways to reduce downtime, others are experimenting with ways to introduce change (and therefore chances for failure) faster and faster into their systems, continuously improving the system with each release. Releases that used to go out to end users once every three weeks now take place at least once a week and, in some cases, even more often, with the intention of speeding up further over time.
Involving the Product Team
At first glance, from a product owner’s perspective, SRE might present what appears to be a “competing” value stream. For product owners, it’s about getting functionality out the door as efficiently as possible.
Example: As a user… I want to…
This is the language, and as a result the incentive structure, that product owners work within: the “user story.”
The user doesn’t see the relationship between functionality and reliability. They do not necessarily know that they care about how the service is brought to them. They just want to perform their own task at hand.
Without an honest conversation with product owners about the relationship of feature velocity and system reliability, opposing incentives may cause dysfunction when prioritizing engineering resources for functionality, reliability, or scalability.
Thankfully, our product owners care a great deal about reliability from the customer’s perspective. And not only do they understand that relationship, they and everyone on the engineering team can’t wait to achieve greater confidence and speed in the delivery pipeline.
As data-driven decision-makers themselves, they believe that the council’s data-driven approach supports effective prioritization and the best approach to balancing reliability with scalability from the customer’s perspective.
In order to achieve this balance, quantifiably measuring “reliability” using instrumentation of the running system in production became a top priority. Accordingly, we needed to find ways to examine and verify correctness and availability while also tracking release frequency.
Measuring how often something goes wrong with releases is also related and important. How quickly our team was made aware of problems and able to swarm to them, both right after changes to the system were made (deployment) and during unplanned service disruptions, are metrics we watch closely. With an increase in deployment frequency, it becomes even more critical to have these metrics available. These data points and observations would then inform a hypothesis for improvements, rather than opinions or hunches. Delivering the greatest value to our end user required us to challenge assumptions about how our system behaved.
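As an illustration of the kind of measurements described above, the snippet below computes two of them from raw timestamps: deployment frequency and mean time to acknowledge. The function names and sample data are our own inventions; real numbers would come from your deployment pipeline and alerting tool.

```python
from datetime import datetime, timedelta

def mean_time_to_acknowledge(incidents):
    """incidents: (opened, acknowledged) datetime pairs from the alerting tool."""
    deltas = [acknowledged - opened for opened, acknowledged in incidents]
    return sum(deltas, timedelta()) / len(deltas)

def deploys_per_week(deploy_times):
    """Deployment frequency over the observed span (assumes at least two deploys)."""
    span_weeks = (max(deploy_times) - min(deploy_times)) / timedelta(weeks=1)
    return len(deploy_times) / span_weeks

incidents = [
    (datetime(2019, 5, 1, 3, 0), datetime(2019, 5, 1, 3, 4)),    # acked in 4 minutes
    (datetime(2019, 5, 9, 14, 0), datetime(2019, 5, 9, 14, 6)),  # acked in 6 minutes
]
deploys = [datetime(2019, 5, d) for d in (1, 3, 6, 8, 13, 15)]   # six deploys over two weeks

print(mean_time_to_acknowledge(incidents))  # 0:05:00
print(round(deploys_per_week(deploys), 1))  # 3.0
```

Tracked over time, trends in numbers like these (rather than any single reading) are what turn a hunch about reliability into a testable hypothesis.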
Armed with this hypothesis, we could now take a data-driven approach to improving the underlying infrastructure of the system along with the application and experience of the customer. For any organization, knowing where to focus resources is essential. In our experience, when the data tells you where you have the biggest problems or where you’ll get the largest return on engineering effort, resource allocation decisions become much easier. Access to high-fidelity data helps to create a well-informed and proactive engineering team.