SRE Use Case at VictorOps
For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.
The SRE Council
The council would be made up of at least one representative from each of the core teams with an immediate stake in reliability and scalability (e.g., web client, mobile client, platform, QA, and IT Operations). Our SRE Champion would facilitate discussions during scheduled meetings and serve as the main point of contact for SRE outside of council gatherings. Reliability in the customer experience would continue to improve, as would our ability to scale the speed and confidence of deployments.
But how can we sum this up in specifics?
We were able to get buy-in from management on SRE efforts by communicating that we are most focused on, and empathetic toward, the end user’s experience. From the end user’s perspective, SRE would create a unified vision, mission, responsibility, and goal for the continued reliability of our product (i.e., VictorOps). The quality of our service when it comes to reliability must always be examined with true care towards the expectations of the end user. Empathy is necessary.
3 Tips to Facilitate the Culture of SRE
You’ve no doubt heard many times that changes like these aren’t accomplished solely with adjustments to tooling and process. And they definitely aren’t accomplished by hiring an individual or even an entire team to “implement SRE” (or DevOps). There must be a cultural shift of some kind within the company. In order to move quickly and in unison, we must all maintain a sense of empowerment and freedom for engineers to explore and “own” their SRE efforts. The council will serve as the main point of contact for SRE, but engineers are encouraged to take ownership and make proactive decisions based on data.
Facilitating the culture of SRE:
1. Empower each engineer’s “reliability feels,” so they can take ownership of improvements
2. Proactively expose dependencies across systems starting with dialogue and data
3. Use the council as the point of contact for reliability conversations
Taking ownership of something means empowering engineers to do what they think is right. We would encourage each engineer to engage their “reliability feels”. In other words, if something feels like a concern, bring it up to the council and assume ownership for improving it.
Ordaining engineers with “you are empowered now” rarely works. In many cases, dependencies, as well as system and team dynamics, prevent teams from actually being able to make much of a difference. Because of this, we made it clear that removing barriers and any resistance to the flow of value had to be a real priority from early on.
When the council met, each member was responsible for bringing concerns related to SRE from their respective domains. As a group, we would aggregate and vet concerns in order to begin adding work to our engineering backlog. Once a concern had been addressed, representatives would present before-and-after results supported by data. As a unit, they would provide input to the future SRE roadmap and efforts. We’d all decide together.
Along with a concise mission statement, we felt that formalizing the responsibility of the council as it relates to the mission statement made sense. Because we were attempting to formalize and legitimize SRE (at VictorOps), explicitly spelling these out felt appropriate, especially if we planned to share our journey outward.
Template for DevOps Mission, Vision and Goals
Mission: Provide an avenue to direct VictorOps’ hunger for reliability.
Vision: What’s the big picture here? What are we trying to achieve? What will the SRE council own and solve? Formally, we established and communicated to the company that the official vision of SRE was: SRE will maintain and continuously improve reliability and scalability in the customer experience.
Goals: Aiming for buy-in across our entire organization, there were a few conversations that surfaced early on around establishing some clear goals. We want to monitor and improve the customer experience in order to achieve an optimal balance between high reliability and scalability as it relates to deployment speed.
From a high level, this was broken down into:
• Bring visibility to the system’s reliability and scalability through instrumentation
• R&D for unknown concerns
∙ Load testing (API & benchmark)
∙ Game day exercises to uncover unknown aspects of the system in a controlled environment
• Address and facilitate a resolution to current reliability and scalability concerns
• Tackle existing known concerns
• Focus on proactive actions (demand forecasting/capacity planning)
• Proactively pursue future concerns
∙ Capacity & Saturation metrics
∙ Anomaly detection
∙ Product and Management team input to understand where we’re going
• Operate with transparency and genuine inquiry
• Open council meetings
• Communicate vision and roadmap to VictorOps
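To make the "Capacity & Saturation metrics" and "Anomaly detection" items above a little more concrete, here is a minimal Python sketch. It is purely illustrative and not taken from VictorOps' tooling: the function names, the 80% saturation limit, the window size, the z-score threshold, and the sample CPU numbers are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def is_saturated(utilization, limit=0.8):
    """Return True when a utilization ratio (0.0-1.0) meets or exceeds the limit.

    The 0.8 default is an assumed threshold, not a recommendation.
    """
    return utilization >= limit


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no spread to compare against;
            # this sketch simply skips those points.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples for one host (fraction of capacity).
cpu_utilization = [0.41, 0.43, 0.40, 0.42, 0.44, 0.43, 0.91, 0.42]

anomalous = find_anomalies(cpu_utilization)
saturated = [i for i, u in enumerate(cpu_utilization) if is_saturated(u)]
print("anomalous sample indices:", anomalous)
print("saturated sample indices:", saturated)
```

With these sample values, only the spike to 0.91 is flagged by both checks. In practice the data would come from whatever instrumentation the teams already have, and a council member could bring flagged points like this to a meeting as the data behind a concern.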