Chaos testing is a part of site reliability engineering (SRE). In chaos testing, we intentionally break things in and around a given application to learn how it behaves when something fails.
The purpose of chaos testing is to assess how software systems respond to scenarios like network outages, hardware failures, database failures, and server or cluster node failures in the infrastructure.
Chaos testing helps to identify possible vulnerabilities and improve system reliability by exposing hidden issues before they cause real-world outages in production.
A primary function of chaos engineering, chaos testing was developed by Netflix in 2010 during their effort to migrate to Amazon Web Services. The migration was prompted by a prior system outage, which highlighted the need for more reliable infrastructure. Netflix decided to switch to a microservice architecture to increase system reliability.
For this effort, they developed Chaos Monkey, a tool to create purposeful disruptions to test the system's resilience. Netflix used this tool to verify its system resilience by randomly shutting down virtual machine instances in its infrastructure.
The business outcome was significant: Netflix completed the migration smoothly, without seriously affecting its users.
By 2012, Chaos Monkey's source code was available on GitHub under the Apache 2.0 license, which opened the tool to a much wider audience.
Chaos testing differs from traditional software testing.
Unlike regular testing, chaos testing introduces disruptions and unpredictable settings to validate system stability. It specifically assesses how systems perform under stress, whereas traditional tests evaluate both functional and non-functional aspects of software.
The Chaos Testing Pyramid starts from the bottom (the foundation) with unit testing of isolated components, progresses to integration testing to check how these components interact, and finally proceeds to system testing where the whole system faces real-world chaotic scenarios.
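At the base of the pyramid, a chaos-style unit test can inject a failure into a single dependency and check that the component degrades gracefully. The sketch below is illustrative only: `get_recommendations`, `fetch_from_service`, and the fallback list are hypothetical names, not part of any specific tool.

```python
from unittest import mock

FALLBACK_TITLES = ["Popular title A", "Popular title B"]  # hypothetical default content


def fetch_from_service(user_id):
    """Stand-in for a network call to a downstream service (hypothetical)."""
    return [f"Personalized title for {user_id}"]


def get_recommendations(user_id, fetch=fetch_from_service):
    """Return personalized results, or a static fallback if the dependency fails."""
    try:
        return fetch(user_id)
    except ConnectionError:
        # Degrade gracefully instead of propagating the outage to the caller.
        return FALLBACK_TITLES


def test_survives_dependency_outage():
    # Inject the fault: every call to the dependency raises ConnectionError.
    broken_fetch = mock.Mock(side_effect=ConnectionError("simulated outage"))
    result = get_recommendations("user-1", fetch=broken_fetch)
    assert result == FALLBACK_TITLES
    broken_fetch.assert_called_once_with("user-1")


test_survives_dependency_outage()
```

The same idea scales up the pyramid: at the integration and system levels, the injected fault moves from a mocked function to a real network link, process, or node.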
This approach identifies vulnerabilities across all levels and helps to increase system resilience and fault tolerance.
Today, most organizations run IT infrastructure that includes distributed systems, cloud technologies, and microservices. This variety and broad distribution make deployments more complex, and the more complex a deployment is, the more likely failures become.
Chaos testing is essential for companies to improve system resilience by proactively testing how systems handle these complexities. Below are some top reasons why companies should perform chaos testing.
Importantly, chaos testing complements traditional testing methods like unit, integration, and end-to-end testing, because it can be carried out using live data in a real environment.
In 2015, a significant incident highlighted the value of chaos engineering. Amazon's DynamoDB faced availability issues in one of its regions, resulting in the dreaded downtime. This affected more than 20 AWS services in that region, causing failures for numerous websites.
Among the users of these services was Netflix. Importantly, Netflix experienced much less downtime than others using AWS in this same region. Why the difference? Their proactive use of Chaos Kong, a larger-scale sibling of Chaos Monkey that simulates the loss of an entire AWS region, had helped them build more resilient systems.
With the concepts explained, let’s now turn to the practical side of chaos testing and chaos engineering.
Chaos experiments range from simple manual actions in test environments to complex automated tests in production. Here are a few major experiment types.
Chaos testing functions as a form of experimental testing by introducing unpredictable elements to evaluate system behavior. This follows the steps typical of scientific experimentation.
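One common way to structure such an experiment is: measure a steady-state metric, state a hypothesis about it, inject a fault, observe the same metric, and roll back. The sketch below walks through that loop against a toy in-memory service. `SimulatedService`, `success_rate`, `run_experiment`, and the 0.99 threshold are all hypothetical choices for illustration, not part of any real tool.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedService:
    """Toy service with several replicas; names and behavior are hypothetical."""
    replicas: int = 3
    healthy: list = field(default_factory=list)

    def __post_init__(self):
        self.healthy = [True] * self.replicas

    def handle_request(self):
        # A request succeeds if at least one replica is healthy.
        return any(self.healthy)


def success_rate(service, requests=100):
    """Measure the steady-state metric: fraction of successful requests."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(service, threshold=0.99, rng=None):
    rng = rng or random.Random()
    # 1. Measure steady state before the fault.
    baseline = success_rate(service)
    # 2. Hypothesis: the success rate stays at or above `threshold`
    #    when one replica is lost.
    victim = rng.randrange(service.replicas)
    # 3. Inject the fault.
    service.healthy[victim] = False
    try:
        # 4. Observe the same metric under failure.
        observed = success_rate(service)
    finally:
        # 5. Roll back so the experiment leaves no lasting damage.
        service.healthy[victim] = True
    return {"baseline": baseline, "observed": observed,
            "hypothesis_holds": observed >= threshold}


result = run_experiment(SimulatedService(replicas=3), rng=random.Random(0))
print(result)
```

With three replicas the hypothesis holds, because one healthy replica is enough to serve requests. Running the same experiment with `replicas=1` disproves it, which is exactly the kind of finding a chaos experiment is meant to surface.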
Many tools are available. Here are a few major ones that can be used to carry out chaos testing.
This is an open-source, cloud-native chaos engineering tool that improves system resilience by simulating various faults. Its user-friendly dashboard makes it easy to configure and control experiments. However, it lacks features like scheduled attacks and node-level testing.
Chaos Monkey is an open-source tool that tests system resilience by randomly terminating virtual machine (VM) instances. It allows configurable scheduling and monitoring, but it is limited to one experiment type and requires custom coding.
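The core idea, random instance termination, can be illustrated without any cloud account. The sketch below is not Chaos Monkey's code or API. `Instance`, `terminate_random_instance`, and the in-memory fleet are hypothetical stand-ins that only mark an object as stopped.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """In-memory stand-in for a VM instance (hypothetical, no cloud API involved)."""
    instance_id: str
    group: str
    running: bool = True


def terminate_random_instance(fleet, group, rng=None):
    """Pick one running instance in `group` at random and mark it terminated.

    Returns the terminated instance, or None if the group has nothing running.
    """
    rng = rng or random.Random()
    candidates = [i for i in fleet if i.group == group and i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # a real tool would call the cloud provider here
    return victim


fleet = [Instance(f"vm-{n}", "web") for n in range(3)] + [Instance("vm-db", "db")]
victim = terminate_random_instance(fleet, "web", rng=random.Random(42))
remaining = [i.instance_id for i in fleet if i.group == "web" and i.running]
print(f"terminated {victim.instance_id}; still running in 'web': {remaining}")
```

Scoping the selection to one group keeps the disruption contained, and passing a seeded `random.Random` makes a run repeatable when a failure needs to be reproduced.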
This is a hosted chaos engineering platform. It aims to improve system reliability by offering multiple attack types as a SaaS product. It provides an easy-to-use UI, API support for manual integrations, and a variety of reliability evaluations. However, it lacks customizability and robust reporting features.
Additional tools are available for chaos testing in specific environments, such as Pumba for Docker environments and LitmusChaos for Kubernetes environments.
These best practices will help achieve the best results from chaos testing experiments.
Chaos testing is somewhat more complex than regular testing. For example, simulating real-world disruptions can be challenging and resource-intensive.
Also, because most log files, particularly error logs, are recorded on the server side, observing the output of generated responses can be difficult. For example, if the quality assurance (QA) team cannot see the server logs, they may need to ask the DevOps team for the error logs. Once you have the required logs and test results, distinguishing important failures from non-critical system responses requires thorough attention to detail.
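One way to reduce this triage effort is to filter log lines by severity and by a list of known, non-critical messages. The sketch below assumes a simple `<timestamp> <LEVEL> <message>` line format and a hypothetical ignore list. Real log formats and the messages worth ignoring will differ from system to system.

```python
import re

# Assumed line format: "<timestamp> <LEVEL> <message>" (hypothetical).
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

CRITICAL_LEVELS = {"ERROR", "FATAL"}
# Messages expected during the experiment that should not count as failures (hypothetical).
IGNORED_PATTERNS = [re.compile(r"retrying request"), re.compile(r"health check skipped")]


def critical_failures(lines):
    """Return parsed entries that are high severity and not on the ignore list."""
    findings = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        if match["level"] not in CRITICAL_LEVELS:
            continue
        if any(p.search(match["msg"]) for p in IGNORED_PATTERNS):
            continue
        findings.append((match["ts"], match["level"], match["msg"]))
    return findings


sample_log = [
    "2024-01-01T10:00:00Z INFO experiment started",
    "2024-01-01T10:00:05Z ERROR retrying request to inventory-service",
    "2024-01-01T10:00:09Z ERROR checkout failed: upstream timeout",
    "2024-01-01T10:00:12Z WARN latency above threshold",
]
for entry in critical_failures(sample_log):
    print(entry)
```

In this sample, only the checkout failure is reported. The retry message is an expected side effect of the injected fault, so it is filtered out.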
Finally, as in any scientific experiment, if you do not have a clear and specific hypothesis, the test results may be ambiguous.
To close, let’s look at common questions people ask about chaos testing.
When it comes to automation, chaos testing is similar to other testing types. The ability to automate chaos testing depends on the type of test cases being used and the resources required. Using automation in chaos testing helps to:
Overall, automation will minimize human errors and decrease both time and cost.
Nowadays, the majority of companies integrate CI/CD pipelines for product development to accelerate product updates, decrease manual error risks, and help release product milestones faster. An automated CI/CD process needs to pinpoint application vulnerabilities and understand performance impacts when components fail during build time.
For this reason, chaos testing should be performed within the DevOps environment in order to:
While chaos testing helps to recover from system failures and improves overall security, it may not be an answer to outages arising from these situations:
Chaos testing can indeed be performed in a production environment. In fact, to maximize the effectiveness of chaos experiments, it's recommended to conduct them as close to the production environment as possible.
The ideal approach is to run all experiments directly in production, which helps in understanding how applications behave under real conditions.