Companies all across the world are adjusting to new work-from-home policies and taking precautions to limit the impact COVID-19 is having on the lives of employees and customers. The virus has created a ripple effect, impacting everything from a visit to the local grocery store to countless conference cancellations. And the world became aware of this crisis only a little over a month ago. Tech companies have responded by asking, and even requiring, employees to work remotely. The CEO of Zoom Video Communications stated publicly that usage is at an all-time high, most likely due to restrictions on travel.
For companies that deliver applications that enable remote work and collaboration, this has obvious implications. Supporting a sudden and potentially sustained burst of utilization requires a business continuity plan. Executives at these companies must be asking:
- How do we keep our employees safe and productive?
- How do we continue to meet SLAs as usage increases?
- What is our capacity planning strategy?
- For incidents that do occur, are we adequately prepared to address them?
- As the utilization of services increases, what is the impact on margins?
Indeed, these questions should be top of mind for companies in the remote-work space, but even companies that now have more employees working remotely on in-house applications face similar challenges.
These are questions we’re thinking about here at Splunk, where we treat data as the fuel that helps us make better decisions.
From a technical operations perspective, we’ve identified four areas where companies can find these answers:
- Measure what matters
- Drive standardization of tools
- Employ an effective escalation policy
- Make learning a part of the process
Measure What Matters
Access to accurate, discoverable, and timely data is what drives collaborative planning and response. Even in the era of the cloud, resources are not limitless. It is critical to develop a deep understanding of infrastructure utilization and of how application changes over time have affected performance and reliability, particularly when capacity planning. However, baseline analysis alone doesn't adequately safeguard against future incidents. An effective metrics system will be capable of firing an alert within seconds, keeping mean time to detect (MTTD) and mean time to acknowledge (MTTA) low. Distributed tracing has become the go-to debugging approach for more complex application architectures, where multiple services are called to fulfill individual requests. Beyond identifying causality during incidents, tracing can help technical teams better understand the overall impact on application performance: aggregating the metadata contained within traces produces tag-specific service level indicators (SLIs).
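To make the idea of tag-specific SLIs concrete, here is a minimal sketch of grouping spans by a tag and computing a request count, error rate, and 95th-percentile latency per tag value. The `Span` record, its field names, and the `slis_by_tag` helper are assumptions invented for illustration; they are not the schema or API of any particular tracing product.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Span:
    """A simplified, hypothetical span record (not a vendor schema)."""
    service: str
    duration_ms: float
    error: bool
    tags: Dict[str, str]


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Return the nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slis_by_tag(spans: Iterable[Span], tag_key: str) -> Dict[str, Dict[str, float]]:
    """Group spans by the value of one tag and compute SLIs per group."""
    groups: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        # Spans missing the tag are bucketed together rather than dropped.
        groups[span.tags.get(tag_key, "unknown")].append(span)

    result: Dict[str, Dict[str, float]] = {}
    for tag_value, members in groups.items():
        durations = [s.duration_ms for s in members]
        errors = sum(1 for s in members if s.error)
        result[tag_value] = {
            "requests": len(members),
            "error_rate": errors / len(members),
            "p95_ms": nearest_rank_percentile(durations, 95),
        }
    return result


# Invented sample data: four spans tagged region=us, one tagged region=eu.
sample_spans = [
    Span("checkout", 100.0, False, {"region": "us"}),
    Span("checkout", 200.0, False, {"region": "us"}),
    Span("checkout", 300.0, True, {"region": "us"}),
    Span("checkout", 400.0, False, {"region": "us"}),
    Span("checkout", 50.0, False, {"region": "eu"}),
]
slis = slis_by_tag(sample_spans, "region")
```

Grouping by a different tag key (a customer tier or a release version, for example) reuses the same function, which is what makes trace metadata useful for comparing how an incident affects different slices of traffic.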
Drive Standardization of Tools
Unfamiliarity with tools and data sets used across teams creates a huge obstacle to responsiveness and cross-team collaboration. It is not uncommon for two teams to produce different metrics from the same data sets. The more tools in use, the more likely a team will encounter data that is, or appears to be, inconsistent. Time is then spent debating dashboard and data validity rather than on capacity planning and updating runbooks. When something does go wrong, the last thing the incident manager wants to run into is conflicting tools and dashboards. As open-source data collection grows in popularity, and IT operations companies grow the breadth of their offerings, there are more options than ever to collapse the observability stack.
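As a small illustration of how two teams can derive different metrics from the same data, the sketch below computes an "error rate" two ways from identical per-minute counts. The numbers and function names are invented for the example, and it assumes every minute has at least one request.

```python
from typing import List, Tuple

# Per-minute (requests, errors) counts. Values are invented for illustration.
MINUTES: List[Tuple[int, int]] = [(1000, 10), (10, 5), (1000, 10)]


def request_weighted_error_rate(minutes: List[Tuple[int, int]]) -> float:
    """Total errors divided by total requests across the whole window."""
    total_requests = sum(requests for requests, _ in minutes)
    total_errors = sum(errors for _, errors in minutes)
    return total_errors / total_requests


def mean_of_minute_error_rates(minutes: List[Tuple[int, int]]) -> float:
    """Average of each minute's own error rate, ignoring traffic volume."""
    rates = [errors / requests for requests, errors in minutes]
    return sum(rates) / len(rates)


weighted = request_weighted_error_rate(MINUTES)
unweighted = mean_of_minute_error_rates(MINUTES)
print(f"request-weighted: {weighted:.2%}, mean of minutes: {unweighted:.2%}")
```

With this data, the request-weighted figure is about 1.2 percent while the mean of per-minute rates is about 17 percent, because one low-traffic minute dominates the unweighted average. Neither number is wrong, but two dashboards showing them side by side under the same label would invite exactly the kind of debate described above.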
Employ an Effective Escalation Policy
At any given time, an on-call SRE has many dozens of active Slack, Teams, or Mattermost channels. Many of these channels have unread yet urgent messages. Every tool in the stack has a notification policy and a webhook, making it more important than ever that valid alerts do not go unnoticed. Unfortunately, an on-call engineer and a backup may not be sufficient, especially if there aren’t adequate channels for these individuals to interact with and respond to alerts. Mean time to resolution (MTTR) is directly influenced by the context that a responder receives within these alerts. If alerts are constantly being escalated, there is a good chance that the front line is not adequately prepared or informed on how to handle a specific alert. This is a great opportunity to revisit the on-call process, how post-mortems are documented and shared, and the effectiveness of the overall process.
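One way to picture an escalation policy is as an ordered list of steps, each firing after an alert has gone unacknowledged for a set time. The sketch below is a hypothetical model, not the configuration format of any paging tool; the responder names, time thresholds, and the `escalation_rate` helper are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    responder: str      # who gets paged at this step
    after_minutes: int  # minutes unacknowledged before this step fires


# Hypothetical policy: names and thresholds are illustrative only.
POLICY: List[EscalationStep] = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("backup-on-call", 10),
    EscalationStep("engineering-manager", 25),
]


def responders_to_page(
    policy: Sequence[EscalationStep],
    minutes_unacknowledged: int,
    acknowledged: bool = False,
) -> List[str]:
    """Return everyone who should have been paged by now for one alert."""
    if acknowledged:
        return []  # an acknowledged alert stops escalating
    return [
        step.responder
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]


def escalation_rate(highest_step_reached: Sequence[int]) -> float:
    """Fraction of alerts that escalated past the first step (index 0)."""
    if not highest_step_reached:
        return 0.0
    escalated = sum(1 for step in highest_step_reached if step > 0)
    return escalated / len(highest_step_reached)
```

For example, `responders_to_page(POLICY, 12)` includes both the primary and the backup, while an acknowledged alert pages no one. Tracking something like `escalation_rate` over time gives a rough signal for the situation described above: a persistently high value suggests the front line lacks the context or runbooks to resolve alerts on its own.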
Make Learning a Part of the Process
According to the Bureau of Labor Statistics, employment of software developers is projected to grow 21 percent from 2018 to 2028, much faster than the average for all occupations. The introduction of less experienced engineers into the workforce, coupled with the trend toward smaller services that share more complex interactions with peer services, creates huge gaps in the understanding of system-wide and application behavior. Historically, junior engineers would have to shadow senior engineers to tap into all that institutional knowledge. This is not a scalable process. Additionally, carrying the cognitive load of a frequently changing, distributed system is unrealistic and unsustainable. Today, companies focused on increasing learning within their engineering teams are fostering collaboration, making work visible, and using post-mortems as a way to accelerate learning. It is common to see service owners share their recent releases or their use of a technology, both internally and publicly. These events enable engineering teams to network personally and gain a deeper understanding of what adjacent groups are doing. Tools like Jira and the presence of TV-monitor dashboards make it easy for engineers to quickly visualize what is important to track and why. A consistent, objectively derived post-mortem output that tracks activity and interactions turns the retrospective process into a predictable and information-rich moment where engineers learn about escalations, the tooling, and the teams involved. These experiences all provide context that is hard to come by in traditional onboarding.
While writing this post, I read an article mentioning a web collaboration company that had scaled capacity in preparation for the increased demand caused by COVID-19 and the temporary closure of offices in impacted areas. Right now, this exercise is being performed all over the globe. Some teams are preparing their capacity plans by relying on a combination of instinct and data.
At Splunk, we prefer to start with the data, and to leverage the tools that make the best use of that data.