Chaos Engineering Tutorial: Comprehensive Guide With Best Practices
This guide explains the basics and benefits of chaos engineering and how it impacts the testing team and ensures high-quality software.
Chaos engineering is the discipline of testing distributed software or systems by deliberately introducing failures, allowing engineers to study how the system behaves and to make changes based on the outcome, so that those failures do not reach end users. It is closely tied to Site Reliability Engineering (SRE), which attempts to quantify the impact of the improbable.
In chaos engineering, practitioners intentionally inject failure into a system to assess its resiliency. The discipline involves forming hypotheses, running experiments, and comparing the outcome with a steady state.
An example of chaos engineering in a distributed system is randomly taking down services to observe the responses and the impact on users. An application typically depends on four layers: networking, storage, compute, and the application itself. In chaos experiments, injecting turbulent or faulty conditions into any part of this stack is a valid experiment. This approach permits controlled testing of failures at scale.
Evolution of Chaos Engineering
Some Internet organizations pioneered distributed, large-scale systems. The complexity of these systems required a new approach to testing for failure, which led to the creation of chaos engineering.
In 2010, Netflix moved from physical infrastructure to cloud infrastructure provided by Amazon Web Services (AWS). The requirement was to ensure that the Netflix streaming experience would not be affected if AWS lost an instance. In response, the Netflix team developed a tool called Chaos Monkey.
In 2011, the Simian Army came into existence. It added new failure-injection modes to Chaos Monkey, enabling testing against a more complete suite of failures and building resilience into the system. The goal was a cloud architecture in which individual components could fail without impacting the entire system's availability.
In 2012, Netflix shared the source code of Chaos Monkey on GitHub. Netflix claimed it was the best defense against unexpected large-scale failures: by running the tool frequently and causing failures routinely, teams were forced to build services with strong resiliency.
In 2014, Netflix created a new role, Chaos Engineer. Kolton Andrus, the Gremlin co-founder, and his team announced a new tool, Failure Injection Testing (FIT), which offered developers more granular control over the failure injection's 'blast radius.' FIT gave developers control over the scope of failures, helping them gain the insights of chaos engineering while mitigating the potential downsides.
In 2016, Matthew Fornaciari and Kolton Andrus founded Gremlin, the first managed chaos engineering solution. In late 2017, Gremlin became publicly available. In 2018, Gremlin launched Chaos Conf, the first large-scale conference dedicated to chaos engineering. In only two years, attendance grew roughly tenfold, drawing veterans from industries such as delivery, finance, retail, and software.
In 2020, AWS added chaos engineering to the reliability pillar of the AWS Well-Architected Framework (WAF). Toward the end of that year, AWS announced the Fault Injection Simulator (FIS), a fully managed service that runs chaos experiments natively on AWS services.
In 2021, Gremlin published the first 'State of Chaos Engineering' report. It covered the main advantages of chaos engineering, the growth of the practice among organizations, and how frequently top-performing teams ran chaos experiments.
How Does Chaos Engineering Work?
Chaos engineering begins with analyzing the expected behavior of a software system. Here are the steps involved in implementing chaos experiments; a minimal sketch of the whole loop follows the list.
- Hypothesis: Engineers ask what should happen when they change a variable, for example assuming that services will continue uninterrupted if instances are terminated at random. A hypothesis consists of a question and an assumption.
- Testing: Engineers use simulated uncertainty, load testing, and network and device monitoring to test their hypotheses. A failure in the stack breaks the hypothesis.
- Blast Radius: Using failure analysis, engineers can learn what happens under unstable cloud conditions. A test's effect is known as its 'blast radius.' Chaos engineers can manipulate the blast radius by controlling the tests.
- Insights: The findings of the experiment help make software and microservices more resilient to unforeseeable events.
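To make these steps concrete, here is a minimal Python sketch of the loop: measure a steady-state metric, inject a fault, measure again, and check the hypothesis. The endpoint URL, the steady-state threshold, and the inject_fault/remove_fault helpers are hypothetical placeholders for this example, not part of any specific tool.

```python
import time
import urllib.request

STEADY_STATE_SUCCESS_RATE = 0.99             # assumed baseline, measured beforehand
TARGET_URL = "http://localhost:8080/health"  # hypothetical service endpoint


def measure_success_rate(samples: int = 50) -> float:
    """Probe the service and return the fraction of successful responses."""
    ok = 0
    for _ in range(samples):
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=2) as resp:
                ok += 1 if resp.status == 200 else 0
        except OSError:
            pass  # count network/HTTP errors as failed probes
        time.sleep(0.1)
    return ok / samples


def inject_fault():
    """Placeholder: terminate an instance, add latency, etc. (tool-specific)."""
    raise NotImplementedError


def remove_fault():
    """Placeholder: undo whatever inject_fault() did."""
    raise NotImplementedError


def run_experiment():
    # Hypothesis: the success rate stays at or above the steady state
    # even while the fault is active.
    baseline = measure_success_rate()
    inject_fault()
    try:
        during = measure_success_rate()
    finally:
        remove_fault()  # always clean up to limit the blast radius

    holds = during >= STEADY_STATE_SUCCESS_RATE
    print(f"baseline={baseline:.2%} during_fault={during:.2%} hypothesis_holds={holds}")


if __name__ == "__main__":
    run_experiment()
```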
Advantages of Chaos Engineering
Chaos experiments yield valuable insights. These are leveraged to decrease the frequency of high-severity incidents (SEVs), shorten the time needed to detect SEVs, enhance the system design, better understand system failure modes, reduce the on-call burden, and minimize incidents. These are the technical advantages.
On the business side, the organization can strengthen its SEV management program, improve on-call training for engineering teams, keep engineers more engaged and satisfied, and prevent enormous losses in maintenance and revenue.
For customers, it becomes possible to avoid outages that hamper daily activities, which means the organization's service is more durable and more available.
Some other advantages are the following.
- Team members gain engagement and confidence in implementing disaster recovery methods, which makes the application more dependable.
- The system becomes more resilient in the face of failures, so overall system availability increases.
- The team can confirm how the system behaves in the event of failure and then take appropriate action.
- There will be a drop in production incidents in the future.
Who Uses Chaos Engineering?
A chaos engineering team is typically part of a small DevOps team, often working with pre-production and production software applications. However, because chaos experiments have broad implications across various systems, they can affect groups and stakeholders at all levels of the organization.
Various stakeholders can participate in and contribute to a disruption involving hardware, networks, and cloud infrastructure, including network and infrastructure architects, risk specialists, cybersecurity teams, and even procurement officers.
Principles of Chaos Engineering
The principles of chaos engineering are divided into four practices. The assumption is that the system is in a steady state, and the task is to find variance from that state. The harder the steady state is to disrupt, the more robust the system.
Start by Defining the Baseline (Steady State)
You must know the characteristics of the normal, or steady, state; this is pivotal to detecting regression or deviation. Based on what you are testing, select a metric that is a good measure of normalcy, such as the response time or the completion of a user journey within a stipulated time. In an experiment, the steady state is the control group.
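As an illustration, here is a small Python sketch that captures one possible steady-state metric, the 95th-percentile response time of a health endpoint, before any experiment begins. The endpoint URL and sample counts are assumptions for the example.

```python
import statistics
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical endpoint


def sample_latencies(n: int = 100) -> list[float]:
    """Collect n response-time samples, in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(HEALTH_URL, timeout=2):
            pass
        latencies.append((time.perf_counter() - start) * 1000)
        time.sleep(0.05)
    return latencies


latencies = sample_latencies()
p95 = statistics.quantiles(latencies, n=20)[-1]  # last cut point = 95th percentile
print(f"steady-state p95 latency: {p95:.1f} ms")
```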
Hypothesize That the Steady State Will Hold
If a hypothesis is assumed to hold for the system in perpetuity, there is little scope left for testing. Chaos engineering is designed to run against steady, robust systems and to detect faults such as infrastructure or application failures. Running chaos experiments against unsteady systems adds little value, because such systems are already known to be unstable and unreliable.
Introduce Experiments or Variables
The experiment involves introducing variables into the system and observing how the system responds to them. Such experiments represent real-world scenarios that affect one or more of the application pillars: infrastructure, storage, networking, and compute. For example, a failure could be either a network interruption or a hardware failure, as in the sketch below.
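As a hedged illustration of introducing a variable, the sketch below randomly terminates one container from a labeled target group, in the spirit of Chaos Monkey. The chaos-target label and the use of Docker are assumptions for this example; run such code only in an environment you are explicitly allowed to disrupt.

```python
import random
import subprocess


def list_target_containers() -> list[str]:
    """Hypothetical convention: containers in the blast radius carry the
    label chaos-target=true; adjust the filter to your own environment."""
    out = subprocess.run(
        ["docker", "ps", "--filter", "label=chaos-target=true",
         "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return [name for name in out.stdout.splitlines() if name]


def kill_random_container() -> str:
    """Inject the variable: terminate one randomly chosen target container."""
    targets = list_target_containers()
    if not targets:
        raise RuntimeError("no target containers found")
    victim = random.choice(targets)
    subprocess.run(["docker", "kill", victim], check=True)
    return victim


if __name__ == "__main__":
    print(f"terminated container: {kill_random_container()}")
```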
Attempt to Contradict the Hypothesis
Suppose the hypothesis is that the steady state will hold. Differences between the experiment group and the control group are disruptions, or variances, from the steady state, and they contradict the hypothesis of stability. You can then focus on the design changes or fixes that lead to a more stable and robust system.
Chaos Engineering: Advanced Principles
At Sun Microsystems, computer scientist L. Peter Deutsch and his colleagues drafted a list of eight fallacies of distributed systems:
- The network is homogenous.
- The transport expenses are zero.
- There is one admin.
- Topology never undergoes modifications.
- The network is secure.
- The bandwidth is infinite.
- There is zero latency.
- The network is dependable.
These are false assumptions that engineers and programmers make about distributed systems. When applying chaos experiments to a problem, the eight fallacies are a good starting point.
Chaos engineers regard them as core principles for understanding network and system problems. The underlying theme of the fallacies is that networks and systems can never be 100% dependable or perfect. Because this is widely accepted, the concept of 'five nines' exists for highly available systems.
So chaos engineers strive for less than 100% availability; the closest they can get to perfection is 99.999%. In distributed computing environments, these false assumptions are easy to make, and they help identify the random problems that arise in complicated distributed systems.
Chaos Engineering Tools
Netflix is regarded as the pioneer of chaos experiments and was the first company to use chaos engineering in a production environment. It designed test automation platforms, made them open source, and collectively named them the 'Simian Army.'
The suite of the Simian Army included many tools, some of which are the following:
- Latency: It introduces latency to simulate degradation and network outages.
- Chaos Monkey: It randomly disables production instances to cause a system failure without impacting customer activity. Its primary use is testing the system's resilience: it disables one production system to create an outage and then observes how the remaining systems respond.
- Chaos Gorilla: It is similar to Chaos Monkey but operates on a bigger scale: it drops an entire availability zone during testing.
- Chaos Kong: It disables all the availability zones in an AWS region.
Over time, a large number of chaos-inducing programs have been created to test the capabilities of the streaming service, so the Simian Army suite keeps expanding.
Following are some other chaos engineering tools.
- AWS Fault Injection Simulator: It provides fault templates that AWS can inject into production instances. The platform has protective safeguards and built-in redundancy, so failure injection testing does not cause broader system problems. (See the usage sketch after this list.)
- Gremlin: It works with Kubernetes and AWS and focuses on the finance and retail sectors. Its built-in safety mechanisms halt chaos experiments when they reach a stage where they could threaten the system.
- Monkey-Ops: An open-source tool implemented in Go, designed to test and terminate random components and deployment configurations.
- Simoorg: An open-source failure-inducing program that LinkedIn uses to run chaos experiments.
- Security Monkey: It detects vulnerabilities and security violations and then terminates the offending instances. It is regarded as an extension of Conformity Monkey.
- Janitor Monkey: It disposes of waste, ensuring that the cloud service functions correctly and carries no unused resources.
- Conformity Monkey: It detects instances that do not follow best practices and sends an email notification to each instance's owner.
- Doctor Monkey: It checks the health status of the system and its associated components, for example using CPU load to find unhealthy instances, which are then fixed.
- Latency Monkey: It creates communication delays that simulate network outages, checking the fault tolerance of the service.
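For reference, the following is a minimal sketch of starting an AWS FIS experiment from Python with boto3, assuming an experiment template has already been created. The template ID is a placeholder, and the exact parameter names and response shape should be verified against the current boto3 documentation.

```python
import uuid

import boto3

# Assumption: an FIS experiment template (for example, "stop some EC2 instances
# tagged as chaos targets") was created beforehand; its ID is a placeholder here.
TEMPLATE_ID = "EXT000EXAMPLE"

fis = boto3.client("fis")

# Parameter names follow the FIS StartExperiment API; treat them as assumptions
# and confirm against the boto3 docs for your SDK version.
response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),      # idempotency token
    experimentTemplateId=TEMPLATE_ID,
)

experiment = response["experiment"]
print(f"started experiment {experiment['id']}, status: {experiment['state']['status']}")
```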
The rationale for using the word 'monkey' is the following:
No one can predict when a monkey might enter a data center or what it would destroy. A data center is a collection of servers hosting all the critical functions behind online activities. Imagine a monkey entering this data center and behaving randomly: destroying devices, ripping out cables, and wrecking everything within reach.
IT managers therefore face the challenge of designing the information systems they are accountable for so that they keep functioning despite the destruction caused by such monkeys.
Theory Guiding Chaos Engineering
In a chaos experiment, the basic idea is to intentionally break a system and gather data that can be leveraged to improve the system's resiliency. This type of engineering is closely related to software testing and software quality assurance approaches, and it is mainly suited to sophisticated distributed systems and processes.
It is cumbersome to predict error-prone situations and resolve those errors in advance. The size and complexity of a distributed system give rise to random events: the larger and more complex the system, the more unpredictable its behavior.
To test a system and determine its weaknesses, turbulent conditions are purposely created in a distributed system. This chaos experiment results in the identification of the following problems.
- Performance Bottlenecks: Scenarios where performance and efficiency can be improved.
- Hidden Bugs: These issues result in software malfunction, such as glitches.
- Blind Spots: These refer to the locations where the monitoring software fails to procure sufficient data.
In the current era, a rising number of organizations are moving to the cloud or to the enterprise edge, and as a result their systems are becoming more complex and distributed. The same holds for software development methodologies that emphasize continuous delivery.
The rising complexity of an organization's infrastructure, and of the processes within it, increases the need for the organization to adopt chaos engineering.
Example of Chaos Engineering
Consider a distributed system that handles a finite number of transactions per second. Chaos testing can determine how the software responds when it reaches the transaction limit: does the system crash, or does performance degrade?
Now consider a distributed system that experiences a single point of failure or a shortage of resources. Chaos experiments determine how the system responds in these two scenarios. If the system fails, developers make design modifications; after the modifications, the chaos tests are repeated to verify the expected outcome.
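To probe the transaction-limit scenario described above, a simple load ramp can be sketched as follows. The endpoint URL and the concurrency levels are assumptions for the example; the idea is simply to watch where the error rate starts to climb.

```python
import concurrent.futures
import urllib.request

ENDPOINT = "http://localhost:8080/api/transactions"  # hypothetical endpoint


def one_request(_: int) -> bool:
    """Return True if the request succeeded, False on any error."""
    try:
        with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


# Ramp up concurrency and observe where the error rate starts to climb.
for workers in (10, 50, 100, 200):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, range(workers * 5)))
    error_rate = 1 - sum(results) / len(results)
    print(f"concurrency={workers:4d}  error_rate={error_rate:.1%}")
```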
In 2015, a well-known real-world failure demonstrated the relevance of chaos engineering. Amazon's DynamoDB suffered an availability issue in one region, and more than 20 AWS services that depended on DynamoDB failed there during operation.
The websites that leveraged these services, one of which was Netflix, were down for multiple hours. Among them, however, Netflix was the least affected. The reason: Netflix had used Chaos Kong, a chaos engineering tool that disables all the availability zones of an AWS region, that is, the AWS data centers serving a specific geographic area, to prepare for exactly this scenario. By using this tool, Netflix had gained hands-on experience in handling regional outages. The incident strongly cemented the significance of chaos experiments.
Difference Between Testing and Chaos Engineering
Testing does not generate new knowledge. The test engineer knows the features of the system under consideration and writes test cases that make statements about its known properties. Using existing knowledge, the tests make an assertion, and after the test runs the assertion evaluates to either true or false.
Chaos engineering is experimentation, and it generates new knowledge. In an experiment, a hypothesis is proposed. If the hypothesis is not contradicted, your confidence in it grows; if it is refuted, you learn something new.
That prompts an inquiry into why the hypothesis was wrong. Thus, chaos experiments have two possible results: increased confidence in the system or the discovery of new properties of the system. In a nutshell, this is about exploring the unknown.
Even an enormous quantity of testing cannot match the insights produced by experiments. Tests are written by humans who must state their assertions ahead of time, whereas experimentation is a formal way of discovering novel properties. After new system properties are discovered through experiments, you can translate them into tests.
Suppose you capture new assumptions about a system and encode them into a new hypothesis. The result is a 'regression experiment,' which can be used to explore how the system changes over time. It was the problems of complicated systems that gave birth to chaos experimentation, which is why experimentation is favored over testing.
What Does Chaos Engineering Not Imply?
Frequently, chaos engineering is confused with anti-fragility and 'breaking stuff in production.'
Anti-Fragility
Nassim Taleb introduced the concept of anti-fragility. He coined the term 'anti-fragile' to describe systems that grow stronger when exposed to random stress, arguing that the term 'hormesis' did not sufficiently capture the ability of complex systems to adapt.
Some have remarked that chaos experiments are the software version of the process described by anti-fragility. However, the two terms imply different concepts. In anti-fragility, you add chaos to a system and hope that it does not succumb but responds in a way that increases its strength. Chaos engineering, by contrast, alerts the team to the chaos already inherent in the system so that the team can make it more resilient.
In anti-fragility, the first step toward making a system more robust is to identify its weak areas and eliminate them. Resilience engineering, by contrast, proposes that understanding what goes right in safety provides more information than understanding what goes wrong.
Another step in anti-fragility is the addition of redundancy, a step that stems from intuition. Yet resilience engineering documents several instances where redundancy has caused safety failures: redundancy can introduce failures almost as easily as it can reduce them.
Resilience engineering has decades of supporting research behind it, whereas anti-fragility is a theory that sits outside peer review and academia. Both schools of thought deal with complicated systems and chaos, which leads people to regard them as identical. However, chaos experiments have a grounding in empiricism that is absent from the anti-fragility spirit. The two are therefore distinct.
Breaking Stuff
Some now treat chaos experiments and 'breaking stuff in production' as synonyms. On closer investigation, a better synonym for a chaos experiment is 'fixing stuff in production.'
Breaking stuff is relatively easy. The more challenging job is to diminish the blast radius, think critically about safety, decide whether fixing something is worthwhile, and conclude whether the investment in experimentation is essential. That is what differentiates chaos experiments from 'breaking stuff.'
Baseline Metrics Before Initiating Chaos Engineering
It is essential to procure the following metrics before initiating chaos experiments.
The application metrics are breadcrumbs, context, stack traces, and events. The high-severity incident (SEV) metrics are MTBF (mean time between failures), MTTR (mean time to resolution), and MTTD (mean time to detection) for SEVs by service, the total number of SEVs per week by service, and the total number of incidents per week by SEV level.
The alerting and on-call metrics are the top 20 most frequent alerts per week for each service, noisy alerts by service per week (self-resolving), time to resolution for alerts per service, and total alert counts by service per week.
The infrastructure monitoring metrics are network (packet loss, latency, and DNS), state (clock time, processes, and shutdown), and resource (memory, disk, IO, and CPU).
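As an example of turning raw incident records into the SEV metrics above, here is a small Python sketch that computes MTTD and MTTR. The incident records and field names are hypothetical, and teams may define MTTR from different start and end points.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records exported from an incident-management tool.
incidents = [
    {"service": "checkout", "started": "2021-03-01T10:00",
     "detected": "2021-03-01T10:04", "resolved": "2021-03-01T10:40"},
    {"service": "checkout", "started": "2021-03-08T14:00",
     "detected": "2021-03-08T14:02", "resolved": "2021-03-08T14:25"},
]


def minutes(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-like timestamps."""
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60


mttd = mean(minutes(i["started"], i["detected"]) for i in incidents)   # time to detect
mttr = mean(minutes(i["detected"], i["resolved"]) for i in incidents)  # time to resolve
print(f"MTTD: {mttd:.1f} min  MTTR: {mttr:.1f} min")
```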
After you gather all the preceding metrics, you can determine whether the chaos experiments have generated a successful impact. Also, you can set aims for your teams and determine the success metrics.
With these metrics collected, you will be able to answer some pertinent questions, including the following.
- When the CPU spikes, what are the general upstream or downstream effects?
- Which are the top three main reasons that cause a CPU spike?
- What is an appropriate incident-reduction goal for the upcoming quarter?
- Which top five services have the most incidents?
- Which top five services have the most alerts?
The Sequence of Chaos Engineering Experiments
Let us assume we have a sharded MySQL database: a group of 100 MySQL hosts with multiple shards per host. In Region A, there is a primary database host along with two replicas; in Region B, there is a pseudo-primary and two pseudo-replicas.
In this scenario, the sequence of the chaos experiments is as follows.
- First Experiment: Known-Knowns: These are the things you are aware of and understand. The first step is to increase the number of replicas from two to three: a new replica is cloned from the primary and added to the cluster. If a replica shuts down, it is removed from the cluster. Now you can start the experiment.
After you shut down one replica, measure the time to detect the shutdown, remove the replica, kick off the clone, complete the clone, and add the clone to the cluster. Run this shutdown experiment at a steady frequency, making sure it never leaves the cluster with zero replicas at any moment (a measurement sketch follows this list).
Then report the mean time to recover after a replica shutdown, and break this average down by day and hour to determine the peak hours.
- Second Experiment: Known-Unknowns: These are the things you are aware of but do not completely understand. You know that the clone will happen, and you have logs that tell you whether the clone succeeded or failed. However, you do not know the weekly average of the mean time, measured from when the failure occurs until the clone is successfully added to the cluster.
The first experiment yields results and data. For example, if the cluster has only a single replica, you get an alert after five minutes. But you do not yet know whether the alerting threshold needs adjusting to avoid incidents more effectively.
Now leverage this data to answer the questions posed by the second experiment. Use the weekly average of the mean time from observing a failure to adding a clone to understand the impact of this chain of activities, and to judge whether a five-minute alerting threshold is appropriate for preventing SEVs.
- Third Experiment: Unknown-Knowns: These are the things you understand but are not aware of. If you simultaneously shut down two replicas in a cluster, you do not know how long it takes to clone two new replicas from the existing primary on a Monday morning.
However, you know that transactions can be handled by the pseudo-primary and its two replicas. In this experiment, increase the number of replicas to four while shutting down two replicas. Then, over many Monday mornings across several months, record the time needed to clone two new replicas from the existing primary and compute the mean time for this process.
This experiment can uncover unknown issues, for example, that the primary cannot bear the load of cloning and backups at the same time and that the replicas should therefore be used more effectively.
- Fourth Experiment: Unknown-Unknowns: These are the things you are neither aware of nor understand. If you shut down an entire cluster in the main region, you do not know what the outcome will be.
You also need to find out whether the pseudo region can fail over effectively. In this experiment, you shut down the entire cluster: the primary and both replicas. In a real-life scenario, this failure would be unexpected, so you would not be prepared to handle it.
Such a shutdown requires some engineering work. Assign high priority to that work so the system can handle this failure scenario; once it is in place, you can proceed with the chaos experiments.
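The sketch below illustrates how the first experiment's recovery stages could be timed. The shutdown_replica and cluster_state helpers are hypothetical hooks into your own orchestration or cluster manager, not a real API.

```python
import time


def shutdown_replica(replica_id: str) -> None:
    """Hypothetical hook: stop one MySQL replica via your orchestration."""
    raise NotImplementedError


def cluster_state() -> dict:
    """Hypothetical hook: return {'replicas': [...], 'cloning': bool}
    from your cluster manager."""
    raise NotImplementedError


def wait_for(condition, description: str) -> float:
    """Poll until condition() is true; return elapsed seconds."""
    start = time.monotonic()
    while not condition():
        time.sleep(5)
    elapsed = time.monotonic() - start
    print(f"{description}: {elapsed:.0f}s")
    return elapsed


def measure_recovery(replica_id: str) -> float:
    before = cluster_state()
    shutdown_replica(replica_id)
    total = 0.0
    total += wait_for(lambda: replica_id not in cluster_state()["replicas"],
                      "shutdown detected and replica removed")
    total += wait_for(lambda: cluster_state()["cloning"], "clone kicked off")
    total += wait_for(lambda: not cluster_state()["cloning"], "clone completed")
    total += wait_for(lambda: len(cluster_state()["replicas"]) == len(before["replicas"]),
                      "clone added back to the cluster")
    print(f"total recovery time: {total:.0f}s")
    return total
```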
Best Practices in Chaos Engineering
The implementation of chaos experiments is guided by the following pillars:
Render Sufficient Coverage
You can never attain 100% test coverage in software: expanding coverage is time-consuming, and you can never account for every scenario. You can, however, improve coverage by identifying the testing that has the maximum impact, that is, testing the scenarios with the gravest consequences.
Some examples are network failures, network saturation, and non-availability of storage.
Run Experiments Frequently and Automatically
Infrastructure, systems, and software are subject to constant modification, and their health or condition can change quickly. So the optimum place to experiment is the CI/CD pipeline: execute the experiments whenever a modification is made. The potential impact of a change is best measured as the change begins its confidence-building journey through the pipeline.
Conduct Experiments in the Production Environment
The production environment carries real user activity, and its traffic load and traffic spikes are real. If you decide to run chaos experiments in the production environment, you can thoroughly test the resilience and strength of the production system and obtain all the essential insights.
Minimize the Blast Radius
You cannot hamper production in the name of science, so it is responsible practice to restrict the blast radius of chaos experiments. Concentrate on small experiments; they can still provide the insights you need. Focus on tightly scoped tests, for example, testing the network latency between two specific services, as in the sketch below.
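A hedged sketch of such a narrowly scoped experiment follows: it adds artificial latency on a single network interface with Linux tc/netem for a short, bounded window and always cleans up. The interface name, delay, and duration are assumptions; dedicated tooling such as Gremlin or AWS FIS may be preferable in practice.

```python
import subprocess
import time

IFACE = "eth0"          # assumption: the interface carrying traffic between
                        # the two services under test
DELAY = "100ms"         # assumed artificial delay
DURATION_SECONDS = 60   # keep the experiment short to limit the blast radius


def add_latency() -> None:
    # Requires root; tc/netem adds artificial delay on the chosen interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY],
        check=True,
    )


def remove_latency() -> None:
    subprocess.run(
        ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
        check=True,
    )


if __name__ == "__main__":
    add_latency()
    try:
        time.sleep(DURATION_SECONDS)  # observe the dependent service meanwhile
    finally:
        remove_latency()              # always clean up, even on interruption
```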
Chaos engineering teams should adhere to a disciplined method during their experiments and test the following:
- The areas that they are unaware of and do not entirely comprehend.
- The areas that they are unaware of but they comprehend.
- The areas that they are aware of and do not entirely comprehend.
- The areas that they are aware of and comprehend.
Conclusion
In the current software development life cycle, including chaos experiments helps organizations increase the speed, flexibility, and resiliency of their systems and operate distributed systems smoothly. It also allows issues to be remediated before they affect the system. Organizations are therefore finding that running chaos experiments is highly significant, and implementing them helps realize the vision of better results in the future.