First Steps to Create an SRE Culture
We want to see and analyze what’s happening before investing much planning or effort. Placing more emphasis on observability is the simplest way to shorten feedback loops.
We want to turn high-fidelity data into information that fuels changes to the system. Improvements to reliability, enhancements to our delivery pipeline, shortened feedback loops to engineers, and faster deployments of features, as well as improvements in human performance, can all be driven by data.
To work closely with other teams and use this information to collaborate on solutions, we would need to collect more data to improve our observability into the system. We would also need baselines that set our expectations of what a healthy system looks like (to us). What is “normal”?
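One way to make “normal” concrete is to summarize historical measurements and flag values that fall far outside them. The following is a minimal sketch, not a description of how VictorOps established its baselines: the function names, the three-standard-deviation tolerance, and the latency samples are all assumptions made for illustration.

```python
import statistics


def build_baseline(samples):
    """Summarize historical samples as a mean and standard deviation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }


def is_abnormal(value, baseline, tolerance=3.0):
    """Return True if value is more than `tolerance` standard deviations from the mean."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]


# Hypothetical request latencies in milliseconds (illustrative, not real data).
history = [120, 118, 125, 122, 119, 121, 124, 120]
baseline = build_baseline(history)

print(is_abnormal(123, baseline))  # within the usual range
print(is_abnormal(180, baseline))  # far outside the usual range
```

A mean and standard deviation are only one possible definition of a baseline. Percentiles or seasonal comparisons may suit metrics that are skewed or that vary by time of day.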
At the end of each council meeting, we established action items for the group. The first four assignments kicked off some of the deepest discussions we’ve had about the reliability and resiliency of our service, not to mention the kinds of questions we can’t currently answer.
We asked each member to go back to their team and return with a list of the most obvious concerns they could think of regarding the reliability of the VictorOps service: something that has always bothered them and that could easily be confirmed or ruled out with data. Is “that thing” that’s bothering you something we can actually see in the system using data?
These early conversations pointed out obvious blind spots in our own system. The truth is you don’t know what you don’t know about systems. When it comes to reliability, the last thing you want to do is make decisions based on emotions or anecdotes. All efforts should be aimed at exposing the knowable and amplifying the known. Observability (a superset of monitoring, logging, tracing, etc.) is becoming significantly more important because it allows you to learn and know more about your systems.
What Do You Worry About Most?
One by one, we went around the table and asked each representative on the council to share their list of concerns. What really keeps them up at night?
Our IT Operations representative pointed out some recently uncovered blind spots in monitoring. Scalability was also a growing concern, as our customer base had recently exploded.
For our Data team, not having enough good data in pre-production environments was making it hard to test effectively. Monitoring was often too noisy and, as a result, alerts weren’t always meaningful or even actionable. Third-party tooling use was beginning to sprawl, and we felt that we had poor visibility into the things that were touching our system.
For the representative from the Web client team, exception monitoring was at the top of the list and offered the largest opportunity for improvement. They also mentioned that there was no tie between exception tracking and deployments, another blind spot that was becoming more and more worrisome.
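The missing tie between exception tracking and deployments can be illustrated with a small sketch that attributes each exception to the most recent deployment before it. This is a hypothetical example, not the Web client team's tooling: the function name, the integer timestamps, and the version labels are all assumptions.

```python
from bisect import bisect_right
from collections import Counter


def exceptions_per_deploy(deploys, exception_times):
    """Count exceptions per deploy, attributing each to the latest deploy at or before it.

    deploys: list of (timestamp, version) tuples
    exception_times: iterable of timestamps
    """
    ordered = sorted(deploys)
    deploy_times = [t for t, _ in ordered]
    counts = Counter()
    for ts in exception_times:
        idx = bisect_right(deploy_times, ts) - 1
        if idx < 0:
            continue  # exception predates the first known deploy
        counts[ordered[idx][1]] += 1
    return counts


# Hypothetical data: timestamps are arbitrary integers, versions are made up.
deploys = [(100, "v1"), (200, "v2"), (300, "v3")]
exception_times = [50, 110, 210, 215, 290, 305]

counts = exceptions_per_deploy(deploys, exception_times)
for version, count in counts.items():
    print(version, count)
```

A spike in the count for one version is a hint, not proof, that the deployment introduced a regression. Traffic volume and the time each version spent in production would need to be considered as well.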
Scalability issues with the UI and UX were brought up as well. We needed to get the design team involved sooner and give them better data to make informed decisions before our web client became unable to meet user demand and expectations. They also felt that the deployment process could use some tweaks.
We asked the council members to provide a short list of top concerns. Dozens of ideas were presented. Once we had a list of solid concerns, the next meeting would be to discuss methods of observing data related to them. In order to build and test theories around how certain aspects of the system work under certain circumstances, we would need greater visibility.
What Did We Learn?
We needed more data, and that would require engineering time. But we were in a good position to make a significant positive impact in a very short period of time. Although the running system may not be well understood by all, engineering cares deeply about reliability, especially when it comes to VictorOps scaling to meet the needs of customers who are themselves experiencing fast growth and demand.
We had a lot of input on SRE concerns so far, but no way of prioritizing them or assessing the risk of each one. Now that we had a list of concerns to address, we needed to break them down further so we could prioritize. We needed to understand what is involved in making data related to these concerns obtainable. For each concern, we wanted to determine the value, effort, and blockers involved in adding instrumentation that specifically addresses it. Additionally, we asked representatives to advise the council, where they could, on the complexity, risk, and any supporting evidence. This should help us sort the concerns in a few ways.
Each council representative was then asked to begin researching the following information as it relates to each concern:
- Value
- Effort
- Blockers
If possible, provide the following as well:
- Complexity
- Risk
- Evidence
The IT Operations representative, who had previously named monitoring coverage and scalability as their top concerns, informed the council of the following:
Monitoring coverage:
- Value: High
- Effort: Low-Medium
- Blockers: More of a time commitment than it should be.
Scalability:
- Value: High
- Effort: High
- Blockers: Time is a large blocker on this one. Spinning up new servers takes a cross-department effort. IT needs to create and provision the server, and dev needs to deploy to it.
Our Data team, which had said that a lack of good data in staging, noisy monitoring and alerting, and SaaS tooling sprawl were all proving problematic, came up with:
Monitoring of ETL processes:
- Value: High, we will actually know if ETL is broken, on fire, or just working
- Effort: Moderate, we have some tooling in place with Sumo, but that is it
- Risk: ETL breaks silently
- Evidence: Count the SE's
Tests (all levels):
- Value: High, we really have no testing in ETL, which makes failing fast impossible and leaves all validation manual
- Effort: High, nothing exists right now.
- Blockers: Data volume issues. ETL is heavily influenced by db size, and production has had a number of issues that can’t be seen in other testing environments.
- Risk: Continued bugginess and unreliability of ETL/reporting, customer churn
- Evidence: (See above)
Monitoring of SaaS tools:
- Value: Moderate, we have minimal monitoring of our third-party tools, causing a lack of visibility into current state, failures, and bugs
- Effort: High, most of these tools provide minimal options for alerting/monitoring so in most cases the monitoring/alerting we have is very noisy
Key Prioritization Takeaways
This exercise helped us to better understand the lift involved with efforts associated with these concerns. As a collaborative team, we all had a much clearer picture of the risk involved when contrasted with the reward it would provide. With this information, as a group, we could make decisions moving forward on how we prioritize SRE-related work.
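As an illustration of how such assessments could feed a ranking, the sketch below records the value and effort ratings the Data team reported and sorts the concerns by a simple value-to-effort ratio. The numeric mapping, the `Concern` class, and the scoring rule are assumptions made for this example. The council did not describe a formula, and blockers, risk, and evidence would still need human judgment.

```python
from dataclasses import dataclass

# Assumed numeric mapping for the qualitative ratings (illustrative only).
LEVELS = {"Low": 1.0, "Low-Medium": 1.5, "Moderate": 2.0, "Medium": 2.0, "High": 3.0}


@dataclass
class Concern:
    name: str
    value: str
    effort: str
    blockers: str = ""

    def score(self):
        """Higher value and lower effort produce a higher score."""
        return LEVELS[self.value] / LEVELS[self.effort]


# Ratings as reported by the Data team.
concerns = [
    Concern("Tests (all levels)", "High", "High", "Data volume issues"),
    Concern("Monitoring of SaaS tools", "Moderate", "High"),
    Concern("Monitoring of ETL processes", "High", "Moderate"),
]

# Sort so that the best value-for-effort concerns come first.
ranked = sorted(concerns, key=lambda c: c.score(), reverse=True)
for concern in ranked:
    print(f"{concern.score():.2f}  {concern.name}")
```

Sorting by a different key, such as effort alone or risk once it has been rated, gives the group another view of the same list.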
Within just two 60-minute sessions (and some research outside of the council meetings) we had generated nearly 200 legitimate questions, hypothesized how we could collect data to answer them, and begun analyzing them in order to prioritize them.