January 17, 2020

Part 1: Learning Chaos Engineering – Exploring Fragile Software Systems

Modern distributed systems increasingly adopt the microservice architectural style for cloud-based applications. However, the prevalence of microservice architectures and container orchestration technologies such as Kubernetes increases the complexity of assessing the resilience of such systems. In the next two blog posts I will illustrate how system weaknesses can be revealed and explored by applying resilience testing, or rather chaos engineering, to a sample application.

Companies want their systems to be reliable, and users expect those systems to be available even during peak hours. Imagine an online retail platform such as Amazon becoming unavailable for several minutes on Black Friday. Does not sound like a big issue? In 2013 Amazon was unavailable for approximately 15 minutes, which cost the retailer $66,240 per minute [1]!

Amazon is not the only case: many businesses would suffer huge revenue losses if they went down for just a couple of minutes. In August, British Airways had to cancel more than 100 flights, and more than 200 others were delayed, because of several system failures [2]. History provides many more such incidents [3]. Hence, businesses and developers are increasing their efforts to strengthen the resilience of their systems. However, as stated by Edward A. Murphy:

"Anything that can go wrong will go wrong."

Current resilience engineering approaches aim to avoid failures in the first place, for example by distributing components across multiple machines (distributed system architectures) or by deploying multiple instances of a single component (redundancy). By now, however, developers know that software systems will fail. Somehow. Sometime. At the worst possible moment. Therefore, we need to know how our systems can fail and how they behave when they do.

How can we prepare for this?

We prepare by making sure that our systems can withstand possible failures and recover from them properly.

Resilience, as stated by Hollnagel and Woods [4], means that a system is able to cope with complexity and has the ability to retain control. Chaos engineering is a methodology for investigating a system’s resilience by executing controlled experiments within a given environment.

In this blog post we explain the fundamentals of chaos engineering, which the next blog post builds on when we put our knowledge into practice. We introduce the fundamental terms and bust the most common myths about chaos engineering.

What is Chaos Engineering?

As stated in the Principles of Chaos [6], chaos engineering can be defined as follows.

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production

The initial idea of chaos engineering emerged from Netflix’s need to migrate its privately hosted infrastructure into a cloud environment [7]. As a result, Netflix created Chaos Monkey [8]. Chaos Monkey shuts down single AWS EC2 instances in order to observe whether the loss of a single instance results in a system outage [9]. Netflix originally called this practice resilience testing and later renamed it chaos engineering. Early approaches suggested experimenting in a production environment in order to get the most out of the experiments.

However, according to recent research, many practitioners suggest moving through several stages before experimenting in a production environment. It is advisable to begin within an isolated environment and to move to the next stage only once you are confident in the process and in your experiments. Unfortunately, experimenting in an isolated environment does not provide the same insights as experimenting in production.

Chaos engineering boils down to the following process:

  • Defining a Steady-State – In order to know whether your system is working as expected, define a set of business measurements together with a set of tolerances that give detailed information about the current state of your system [12]. See [13] for an example of such a set of business measurements.
  • Hypothesis – In order to run an experiment you need to state a hypothesis, ideally as a "statement of belief". In the hypothesis you assume that your system behaves as intended in a specific situation, e.g., "We believe that our system is still able to serve the data content after one out of two instances of the database server is lost".
  • Run Experiment – In this step you execute your experiment and observe your system for as long as necessary. Document any findings in a clear and understandable way so that they can be analyzed precisely.
  • Verify – In this step you analyze your findings against your hypothesis. You may use any meaningful test or verification method, e.g., statistical hypothesis tests such as the t-test or the Kolmogorov-Smirnov test.
  • Adjust – If you have found any deviations, fix them immediately! Once you believe you have fixed the issue, go back to the hypothesis step and either state a new hypothesis or re-run the previous experiment.

Continuously re-running the same experiments should always be part of your experiment plan, as this verifies on an ongoing basis that previously fixed issues do not find their way back into your system.
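The process above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `Experiment` class, its field names, and the toy `replicas` dictionary are hypothetical and do not belong to any chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment; all field names are illustrative assumptions."""
    hypothesis: str
    steady_state_ok: Callable[[], bool]  # probe: is the system within tolerance?
    inject: Callable[[], None]           # action: introduce the turbulent condition
    rollback: Callable[[], None]         # undo the injection


def run_experiment(exp: Experiment) -> str:
    # The steady state must hold beforehand, otherwise the result says nothing.
    if not exp.steady_state_ok():
        return "aborted: system not in steady state"
    exp.inject()
    try:
        # Verify: does the steady state still hold under the turbulent condition?
        held = exp.steady_state_ok()
    finally:
        # Always undo the injection, even if the probe raised an exception.
        exp.rollback()
    return "hypothesis held" if held else "deviation found: fix, then re-run"


# Toy system (assumption): reads work while at least one database replica is up.
replicas = {"db-1": True, "db-2": True}

exp = Experiment(
    hypothesis="Reads are still served after one of two database instances is lost",
    steady_state_ok=lambda: any(replicas.values()),
    inject=lambda: replicas.update({"db-1": False}),
    rollback=lambda: replicas.update({"db-1": True}),
)
print(run_experiment(exp))
```

Because the experiment is an ordinary value, the same `exp` can be re-run on a schedule, which is one simple way to get the continuous verification described above.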

Myth Busting Chaos Engineering

I suspect that many rumors claim to describe what chaos engineering is, but most of them actually describe what it is not! Let’s demystify some of them:

  1. Breaking random components in a given system – Admittedly, the first tools and theories of chaos engineering were based on randomly terminating EC2 instances, but over the last months and years it has become clear that uncontrolled chaos yields less value than a systematic approach.
  2. We run Chaos Monkey so we are doing chaos engineering – As Grady Booch once said: "A fool with a tool is still a fool". Although Chaos Monkey was initially announced as the start of Netflix’s chaos engineering era, it is not enough to just run it. Chaos engineering can be applied on many levels, not only on the infrastructure level. Running a chaos engineering tool without thoroughly investigating whether it fits your purpose may or may not help you gain the knowledge you seek.
  3. Breaking things just for the purpose of breaking them – Doing so prevents you from gaining the new knowledge that thorough experimentation with your system would provide.
  4. Injecting a failure which we know our system cannot handle – As Russ Miles, Benjamin Wilms, and several others have pointed out, we do not gain any new information from an experiment if we already know that our system cannot uphold our hypothesis.
  5. We need to test in production in order to do real chaos engineering – Although the definition above suggests doing so, many people in the field (e.g., Russ Miles, Benjamin Wilms) suggest starting out small, e.g., within an isolated test environment. Production should be the last step in the chaos engineering cycle and should only be considered once you are confident in your previous efforts in applying chaos engineering.
  6. We are doing chaos engineering so we can drop other tests – Definitely not! Chaos engineering is not a substitute for unit, integration, or other tests but rather complements them.

Chaos Engineering Fundamental Terms

So far we have used several terms without explaining them, namely steady-state, steady-state hypothesis, and blast radius. In the following we describe them in more detail, as we will need them in order to apply chaos engineering to an application.

Steady-State Description

Let’s begin with the steady-state: The steady-state of an application is a set of business measurements in combination with a set of tolerances that give detailed information about the current state of your system [12]. A steady-state can be described by a single metric, such as measured response times, or by a compound metric. A good example is Netflix’s stream starts per second (SPS), a business metric compounded from multiple metrics that are collected over time. However, identifying the set of measurements unique to your system is not enough to describe a steady-state. In a chaos engineering experiment we want to observe whether a turbulent situation has a significant impact on our system behavior, i.e., causes a significant change in our measurements. For example, assume you use response times to describe the steady-state of your system. When we run a chaos experiment, we want to observe whether the response times change, for better or for worse. Any deviation from the reference distribution provides valuable information about how the system behaves during the injection phase.
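To make "deviation from the reference distribution" concrete, the following sketch compares response times recorded before and during an injection using the two-sample Kolmogorov-Smirnov statistic, i.e., the largest distance between the two empirical distribution functions. The sample values, the function names, and the tolerance of 0.3 are assumptions chosen purely for illustration. The sketch computes only the statistic and no p-value; a real analysis would more likely rely on a statistics library such as SciPy (`scipy.stats.ks_2samp`).

```python
from bisect import bisect_right


def ks_statistic(reference, observed):
    """Largest vertical distance between the two empirical CDFs."""
    ref = sorted(reference)
    obs = sorted(observed)
    distance = 0.0
    # Both empirical CDFs are step functions, so it suffices to compare them
    # at the data points themselves.
    for x in ref + obs:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_obs = bisect_right(obs, x) / len(obs)
        distance = max(distance, abs(cdf_ref - cdf_obs))
    return distance


def within_steady_state(reference, observed, tolerance=0.3):
    # The tolerance is a hypothetical value; it has to be chosen per system.
    return ks_statistic(reference, observed) <= tolerance


# Response times in milliseconds (made-up numbers).
baseline = [100, 105, 110, 95, 102, 108, 99, 101]
during_injection = [150, 160, 155, 170, 149, 158, 162, 151]

print(ks_statistic(baseline, during_injection))
print(within_steady_state(baseline, during_injection))
```

With these made-up samples the two distributions do not overlap at all, so the check reports that the system has left its steady state. Note that the statistic flags a change in either direction, faster as well as slower responses.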

Steady-State Hypothesis

The steady-state hypothesis is a hypothesis, or an assumption, about the system’s steady-state. It can be stated in various ways. A common way is to phrase it as a statement of belief. For example,

We believe that our system is still able to maintain its core services despite the loss of the database

A good start is to ask questions like "What happens if …" or "Will this still work if …". However, you may quickly identify a lot of candidates that are worth implementing as a chaos experiment. At this point, you need to decide which experiments are worth executing and which to drop. Of course, you can design full-blown experiments for each "what if …" scenario, but this costs a lot of time, money, and effort. Therefore, you should consider ways to prioritize the identified hypotheses. A valuable approach is incident analysis. Known methods are Fault Tree Analysis and Failure Modes and Effects Analysis. Depending on the system, applying these methods correctly takes a lot of time and effort. Nonetheless, they give you a good indication of which experiments or hypotheses are worth considering and which can be neglected for now.
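As one possible way to prioritize, the sketch below ranks hypotheses by the risk priority number commonly used in Failure Modes and Effects Analysis, i.e., the product of severity, occurrence, and detection ratings on a scale from 1 to 10. The `Hypothesis` class, the candidate statements, and all ratings are invented for illustration. In practice the ratings come out of a proper analysis of your own system.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(candidates, budget):
    """Return the `budget` candidates with the highest risk priority number."""
    return sorted(candidates, key=lambda h: h.rpn, reverse=True)[:budget]


# Hypothetical candidates with made-up ratings.
candidates = [
    Hypothesis("Core services survive the loss of one database instance", 9, 4, 3),
    Hypothesis("Pages still render when the cache is unavailable", 5, 6, 2),
    Hypothesis("Checkout works with 500 ms extra latency between services", 6, 7, 6),
]

for h in prioritize(candidates, budget=2):
    print(h.rpn, h.statement)
```

The `budget` parameter stands for the number of experiments you can afford to design and run in the current iteration. Everything below the cut is postponed, not discarded.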

Blast Radius

The blast radius is the effect your experiment might have on your system. Minimizing the blast radius should be among your highest priorities, especially if you run chaos engineering in production [11]. Reducing the blast radius mostly means reducing your experiment size, e.g., by reducing the number of components you target within a single experiment.
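A simple way to limit the experiment size is to cap the share of components a single experiment may touch. The following sketch assumes a hypothetical helper `pick_targets` and made-up instance names. The rule of always leaving at least one instance untouched is an additional assumption, not something the cited sources prescribe.

```python
import math
import random


def pick_targets(instances, max_fraction, seed=None):
    """Pick at most `max_fraction` of the instances as experiment targets.

    Assumption: at least one instance is always left untouched.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in the interval (0, 1]")
    instances = list(instances)
    limit = min(math.floor(len(instances) * max_fraction), len(instances) - 1)
    limit = max(limit, 0)  # guards against an empty instance list
    # A seed makes the selection reproducible, so an experiment can be re-run.
    rng = random.Random(seed)
    return rng.sample(instances, limit)


fleet = [f"service-{i}" for i in range(10)]  # hypothetical instance names
targets = pick_targets(fleet, max_fraction=0.2, seed=42)
print(len(targets), "of", len(fleet), "instances targeted")
```

Starting with a small fraction and raising it only after the hypothesis has held several times mirrors the staged approach described earlier.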

Outlook

In this blog post we have discussed and explained the fundamentals necessary to understand chaos engineering. In the next blog post, we will introduce an example application built with state-of-the-art technologies. Additionally, we will give an introduction to Chaostoolkit, which we will use to build and execute chaos engineering experiments.
