One of our favorite job interview questions is: “What do you know about Chaos Engineering?” It is an intriguing practice, yet many people have never heard of it. When we explain that it is about deliberately introducing chaos in production, most people are shocked. So what is it all about?
What is it?
In short, Chaos Engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions. The practice was pioneered at Netflix and is being adopted by more and more companies. Teams that practice it will, for example, automatically shut down servers and services to test the resiliency of the system. Do all your applications handle failures (infrastructure failures, network failures, or application failures) correctly? Do they have correct fallback flows?
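To make "fallback flow" concrete, here is a minimal sketch in Python. The names `call_with_fallback`, `fetch_recommendations`, and `default_recommendations` are hypothetical and exist only for illustration; the primary call fails on purpose, as if a chaos experiment had just killed the server behind it.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Try the primary dependency; on a connection or timeout error, use the fallback."""
    try:
        return primary()
    except (ConnectionError, TimeoutError):
        return fallback()


def fetch_recommendations() -> list[str]:
    # Hypothetical remote call. It always fails here, simulating a killed server.
    raise ConnectionError("recommendation service unreachable")


def default_recommendations() -> list[str]:
    # Hypothetical degraded answer that needs no remote dependency.
    return ["most-read-today"]


result = call_with_fallback(fetch_recommendations, default_recommendations)
```

The point of a chaos experiment is to find out whether the fallback branch really runs in production, instead of assuming that it does.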
Why should you do it?
Quality software is not only about pretty code. It should also be about resiliency. If a small gust of wind causes a critical failure in your application, you have an issue.
But why in production?
There are always differences between your production setup and your test environments: different scaling rules, different load, different context. Although you should strive to keep a test environment as close to production as possible, some differences will always remain. The Chaos Engineering culture starts from the principle that true resiliency can only be tested on a live system.
When not to do it?
Although we strongly believe in chaos by default, there are sometimes good reasons not to introduce chaos in production, or at least not continuously and automatically.
At ‘DPG Media’ (formerly known as ‘De Persgroep’), we implement chaos by default on all our applications. However, one of the third-party applications at ‘DPG Media’ relies on sticky user sessions. This means that if we kill servers in our Chaos processes, the affected users lose their connection and need to log in again. Third-party software often has known issues that will never be resolved by the supplier. In my opinion, it is a good example of software that is not chaos-ready. You don’t want users to be automatically logged out every time your Chaos process triggers and tests your resiliency. That would be user bullying.
Yes, you should do chaos by default. But if it has user impact, as in the example above, consider running it as a scheduled process at acceptable intervals instead. We should test our systems for resiliency to improve the quality of our products, not to bully our users at regular or random intervals.
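One way to express "chaos by default, but scheduled when users are affected" is a small policy check. The sketch below is only an illustration: the function name, the `user_impacting` flag, and the Tuesday 03:00 to 04:00 window are assumptions, not a description of how any real setup is configured.

```python
from datetime import datetime, time

# Hypothetical low-traffic window for experiments that affect users:
# Tuesdays (weekday 1) between 03:00 and 04:00.
ALLOWED_WEEKDAY = 1
WINDOW_START = time(3, 0)
WINDOW_END = time(4, 0)


def chaos_allowed(now: datetime, user_impacting: bool) -> bool:
    """Chaos by default; user-impacting chaos only inside the agreed window."""
    if not user_impacting:
        return True
    in_window = WINDOW_START <= now.time() < WINDOW_END
    return now.weekday() == ALLOWED_WEEKDAY and in_window
```

An application with sticky sessions would be flagged as `user_impacting`, so its servers would only be killed inside the window, while every other application keeps getting chaos at any moment.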
Chaos Monkey is a tool developed by Netflix to automatically introduce chaos to your system. It can, for example, be installed on AWS cloud infrastructure to terminate servers behind load balancers. Does your setup automatically start up new healthy and correctly configured servers? Does your software handle the broken connection correctly? Let the monkey try it out for you!
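The idea can be illustrated with a toy simulation in Python. This is not Chaos Monkey's code and it does not call any AWS API; `ServerPool`, `reconcile`, and `unleash_monkey` are hypothetical names standing in for a group of servers behind a load balancer, the mechanism that replaces lost servers, and the monkey itself.

```python
import random
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ServerPool:
    """Toy stand-in for a group of servers behind a load balancer."""
    desired: int
    healthy: set[str] = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def reconcile(self) -> None:
        # Start fresh servers until the desired capacity is reached again.
        while len(self.healthy) < self.desired:
            self.healthy.add(f"server-{next(self._ids)}")

    def terminate(self, server_id: str) -> None:
        self.healthy.discard(server_id)


def unleash_monkey(pool: ServerPool, rng: random.Random) -> str:
    """Terminate one randomly chosen healthy server and return its id."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim


pool = ServerPool(desired=3)
pool.reconcile()                      # initial capacity: three servers
victim = unleash_monkey(pool, random.Random(42))
degraded = len(pool.healthy)          # one server short after the kill
pool.reconcile()                      # the pool heals itself with a new server
```

In a real setup the `reconcile` step is what you are testing: if nothing brings the pool back to its desired size, or the replacement server comes up misconfigured, the experiment has found a resiliency gap.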