Chapter 11 | Chaos Engineering

What is Chaos Engineering?

Assignment Four: Make The Case For Chaos

To encourage teams to begin thinking about how they could learn more about the system and become aware of problems in the customer experience before the customer notices, the council decided that scheduling a Game Day exercise would help. With a date set on the calendar, council representatives understood when indicators (i.e., Black-box metrics) would need to be established and how they could be observed. The Game Day exercise would require that those SLIs be clearly defined and that a method for triggering a breach of their thresholds be established.

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Casey Rosenthal
CTO, Backplane; previously Engineering Manager, Chaos Team at Netflix

 

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses.

These experiments follow four steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.

2. Hypothesize that this steady state will continue in both the control group and the experimental group.

3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large. (source: http://principlesofchaos.org/)
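
To make these steps concrete, here is a minimal sketch of what one such experiment might look like in code. It is only an illustration under assumptions: the probe URL, the failure-injection script, and the 5% tolerance are hypothetical placeholders, and a real experiment would use whatever steady-state metric and injection mechanism the system already has.

    # A minimal sketch of the four experiment steps, assuming a hypothetical
    # staging endpoint whose probe success rate serves as the steady-state metric
    # and a hypothetical script that injects the failure.
    import subprocess
    import time
    import urllib.request


    def steady_state(url: str, samples: int = 30, interval: float = 1.0) -> float:
        """Step 1: define 'steady state' as the probe success rate."""
        successes = 0
        for _ in range(samples):
            try:
                with urllib.request.urlopen(url, timeout=2):
                    successes += 1  # any successful response counts
            except OSError:  # connection errors, HTTP errors, timeouts
                pass
            time.sleep(interval)
        return successes / samples


    if __name__ == "__main__":
        url = "https://staging.example.com/health"  # hypothetical probe endpoint

        # Step 2: hypothesize that steady state holds in the experimental group.
        control = steady_state(url)
        print(f"control success rate: {control:.2%}")

        # Step 3: introduce a real-world event (e.g., kill a service instance).
        subprocess.run(["./inject_failure.sh"], check=True)  # hypothetical script

        # Step 4: try to disprove the hypothesis by comparing against the control.
        experimental = steady_state(url)
        print(f"experimental success rate: {experimental:.2%}")
        if control - experimental > 0.05:  # tolerance is a judgment call
            print("Hypothesis disproved: steady state was disrupted.")
        else:
            print("No meaningful difference observed; confidence increases.")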

Why Chaos Engineering?

Contrary to what the name may indicate, chaos events are not performed in a chaotic fashion. Ultimately, they boil down to a specific set of well-planned scientific experiments. For VictorOps, SRE is a scientific practice that aims to make data-driven decisions to improve a system’s reliability and scalability, as observed by the customer.

We are actively pursuing deeper knowledge of our systems in order to improve them, while recognizing that this is a constant effort. Several members of our engineering team had previously conducted Game Day exercises, so an internal presentation was given to the entire organization. This helped to set expectations and communicate to the broader company what would be taking place, how, and most importantly, why.

Chaos Buy-In

Before we could get too crazy with our Chaos Engineering aspirations, we needed to get buy-in from everyone on the council, and those representatives needed buy-in from leadership. We needed a plan that outlined the process, why we were doing this, what our goals were, a risk analysis, and so on.

We knew that all simulated service disruptions would be performed in our pre-production environment in order to increase our confidence that they wouldn’t impact users. Still, there is always a small chance a customer could experience something. Remember, we don’t know what we don’t know. What if a service in our staging environment is actually talking to something in our production environment? We are still learning the system’s reality. Because we need to “reduce the blast radius,” as they say, our initial exercises would take place in our pre-production “staging” environment.

This means we needed to ask questions about how we could make staging behave as closely as possible to the customer-facing environment. How do we want alerts to be delivered? It’s probably not smart to co-mingle delivery methods in the event that a real incident is triggered for a separate problem while engineers are rehearsing failure in a simulated scenario. How long would it take us to figure out which alerts were “real” and which weren’t?

Test Plan Checklist

As a group, the council discussed and came up with a formal test plan, setup considerations, and preparation checklist.

Test Plan:

  • Black-box Alerts
  • Discuss test plans for each alert
  • Meet with IT to evaluate risk
  • Staging VictorOps org cleanup: discuss how close it should be to production’s VictorOps org
  • Special paging configurations, because alerts will go off at any time a failure is detected

Things to Check:

  • Consider email-only for paging policies
  • No employee should be notified on their personal device
  • Teams/escalation policies/rotations: what should these look like?

Solid observability of the system is required before Game Day testing can be successful. This was why so much emphasis was put on determining our metrics in Assignment Three and then ensuring we had visibility into them. A lack of structure for the day would be detrimental to the Game Day efforts, so some sort of defined plan needed to be established. All tests would initially take place in our staging environment, but future exercises would take place in production.

Teams testing at the same time can and will collide with each other. Be ready for this. Something else that came up from previous experience was that we needed not only to minimize the blast radius of our efforts but also to limit how long the Game Day exercise would take. A 14-hour day (typically on the weekend) was not the right approach. Engineers will lose focus and interest if Game Day exercises go on too long. We would not be “randomly unplugging shit”. That is not the best place to start, nor is it part of the principles of chaos.

What are the principles of chaos? Glad you asked.

Principles of Chaos:

  • Build a Hypothesis around Steady State Behavior
  • Vary Real-world Events
  • Run Experiments in Production
  • Automate Experiments to Run Continuously
  • Minimize Blast Radius

Chaos Day FAQ

About a week prior to the event, our champion sent a company-wide email on behalf of the council. The message outlined the agenda for our Chaos Day event and offered responses to frequently asked questions.

What is a Chaos Day?

A dedicated time for performing experiments on a system.

Chaos Day Schedule

9:00 am: Kickoff with training for the day

10:00 am: Experimentation begins

3:30 pm: Retro and Lean-coffee

What is the goal of Chaos Day?

Using the principles of chaos engineering, we will learn how our system handles failure and then incorporate that information into future development.

What is the goal of our experiments?

This time around, we’re verifying our first round of Black-box Alerts in our staging environment.

What is the scope of a single experiment?

Some experiments will affect only one or two web browser clients, while others could affect a major backend service, which would affect many other services and, potentially, clients as well.

Who is involved?

All of VictorOps Engineering and anyone else who has the capacity and cares to join and observe.

Will we be able to demo our product during this time?

Yes. These experiments are only in our staging environment.

Who will this affect?

The goal is to not affect production systems. However, it is possible we could unintentionally affect production. If something does happen that is affecting or has affected production systems, we’ll communicate that ASAP. We’ll have a dedicated Chaos Incident Commander in communication with Ops Support.

How do we ensure it does not affect customers?

We’re doing our best to prevent this but cannot absolutely guarantee there will be no effects. The test plan is being reviewed by the SRE Council along with IT Operations.

What happens if there is a Sev 1 SE on that day?

Chaos does not take priority over production emergencies. The appropriate people will be brought in and tasked per usual.

What happens if we actually break Staging in a way which takes longer than a day to fix?

We’re aiming to avoid this with back-out criteria for experiments and reset criteria for bad/overloaded data. If, however, a long recovery time is needed, we’ll communicate this and make arrangements with affected teams.

Chaos Team Roles

In our final council meeting before the Chaos Day, we discussed and established well-defined roles for everyone. This would ensure some degree of standardization in the make-up of teams during the exercise. Documenting as much as possible was the first role we wanted to assign, so we needed a recorder:

Recorder

  • Ensure a hypothesis and risk assessment have been created
  • Record how the experiment unfolds
  • Collect data (graphs, alerts, times, etc.) while the experiment is performed
  • Time to know (from the moment we trigger the failure until monitoring has identified it)
  • Time to detection (from trigger time until VictorOps has notified us)
  • Note whether or not the Black-box alerts were triggered
  • Gather information from the mini-retro after the test

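As a rough illustration of the two timing measurements above, here is a small helper the recorder might use; the function name and timestamps are purely hypothetical and exist only to show how the durations are computed.

    # Hypothetical helper for computing the Recorder's two durations from
    # timestamps captured during an experiment; all names and times are illustrative.
    from datetime import datetime, timedelta


    def experiment_timings(triggered_at: datetime,
                           monitoring_identified_at: datetime,
                           victorops_notified_at: datetime) -> dict[str, timedelta]:
        return {
            # from the moment we trigger the failure until monitoring identifies it
            "time_to_know": monitoring_identified_at - triggered_at,
            # from trigger time until VictorOps has notified us
            "time_to_detection": victorops_notified_at - triggered_at,
        }


    # Example with made-up timestamps from a single experiment
    timings = experiment_timings(
        triggered_at=datetime(2018, 6, 1, 10, 15, 0),
        monitoring_identified_at=datetime(2018, 6, 1, 10, 16, 30),
        victorops_notified_at=datetime(2018, 6, 1, 10, 17, 10),
    )
    print(timings["time_to_know"])       # 0:01:30
    print(timings["time_to_detection"])  # 0:02:10
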
Someone should be responsible for driving the experiments as well:

Driver

  • Perform the experiment
  • Provide a full history of actions performed (command line, Jenkins jobs, Puppet modifications, etc.)
  • Verify the alert was triggered

A technical lead (typically the council representative) would assume the role of incident commander, acting as the main point of contact and maintaining a high-level, holistic awareness of the experiments:

Incident Commander (Tech Lead)

  • Ensure a back-out plan is defined
  • Keep an eye on the back-out plan during tests
  • In an incident, handle any communication with the Chaos Incident Commander for the day

In addition to team incident commanders, one engineer played the role of the event I.C., communicating across all experiments throughout the day.

Chaos Incident Commander

  • Communicate with Ops Support & Incident Commander for the team under test
  • Update internal Statuspage & Slack channel

Chaos Day Guidelines & Outcomes

For each defined experiment, the following guidelines were provided. Again, this helped to standardize the exercise across all teams and served as a checklist for engineers.

Step-by-step guide

  • Describe test to group
  • Develop a hypothesis for what is expected to happen
  • Take a poll to gauge the group’s assessment of risk
  • Execute test steps
  • Once complete, perform a mini-retro on the test
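
One way to standardize this checklist across teams is to capture each experiment as a simple structured record. The sketch below is purely illustrative, under the assumption that teams record hypotheses, risk polls, and retro notes; the field names are not an existing VictorOps artifact.

    # Illustrative record for standardizing the per-experiment checklist;
    # field names are hypothetical, not an existing VictorOps schema.
    from dataclasses import dataclass, field


    @dataclass
    class ChaosExperiment:
        description: str                   # described to the group up front
        hypothesis: str                    # what we expect to happen
        risk_votes: list[int] = field(default_factory=list)   # e.g., 1 (low) to 5 (high)
        steps_executed: list[str] = field(default_factory=list)
        retro_notes: list[str] = field(default_factory=list)

        @property
        def perceived_risk(self) -> float:
            """Average of the group's risk poll (0 if no votes yet)."""
            return sum(self.risk_votes) / len(self.risk_votes) if self.risk_votes else 0.0


    # Hypothetical example for one experiment
    experiment = ChaosExperiment(
        description="Stop one instance of a notification worker in staging",
        hypothesis="The black-box alert for delivery latency fires within 5 minutes",
        risk_votes=[2, 3, 2, 4],
    )
    print(f"Perceived risk: {experiment.perceived_risk:.1f} / 5")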

What did we learn?

Having a dedicated time set aside that was well communicated in advance helped on many fronts. Not only did it give us a kick in the butt to determine SLIs and SLOs, it helped formalize the event. Collaboration during the experiments was fantastic.

In fact, digging into unexpected problems together was fun. As a group, we learned much about the process of Chaos Engineering. We learned a great deal about tools we had in place and how we need to improve monitoring in a few places.

In terms of opportunities for improvement, the most obvious one from our Chaos Day event was finding ways for more people to be involved and participate, not only from the engineering teams but from the rest of the company.

It became clear that some engineers felt left out even though it was an open exercise. People wanted to contribute but we had only defined a limited number of roles. In future exercises, more people throughout the company will play a specific role in the day.

Suggestions were made to move the event to a different part of our development cycle. Chaos Day might have attracted more attention if it had taken place at a slower point for the development teams.

Inherently, the experiments individual teams chose affected only their own services. This is not the reality of complex systems. In future chaos events, we intend to group experiments by functional areas of the product in order to test cross-functionally.

In general, there were several suggestions on how we could prepare for the day a little better. Now that we’ve been through one, it has been determined that:

  • Test plans are much more important than we realized ahead of time
  • Organization of the test plans would help divvy up work to more people
  • Schedule focused time windows ahead of time for teams/groups of tests

Last, we intentionally left the human response questions out of these experiments. For our next event, we’ll want to observe and capture more around the elements associated with getting the right people on the (injected) problem as quickly as possible.

Ideally, engineers will have established thresholds, paging policies, contextual alerts, runbooks, and anything else the first responder would need. Not preparing first responders with what they need in those moments is a weak spot for most organizations. At VictorOps, we’ve learned a few things about how to recover from failure quickly, but there’s always room for improvement. So, we’ll begin adding these ideas to our future test plans.
