
Antifragility FAQ

 

How do you build antifragile systems?

The most famous example is Chaos Engineering: the practice of intentionally injecting random failures into a system.

"Imagine a monkey entering a data center. The monkey randomly rips cables and destroys devices. The challenge is to design the information system in a way that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."

 

This ‘Chaos Monkey’ was invented by Netflix. It randomly disables production instances to make sure that Netflix can survive this common type of failure without any customer impact. The Monkey runs during business hours, while engineers are standing by to address any problems, learn about the remaining weaknesses of the system, and build automatic recovery mechanisms to deal with them. The engineers love it because it challenges them to the max and because they are paged less often at night and on weekends.
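The idea can be illustrated with a minimal sketch. This is not Netflix's implementation: the Instance class, the unleash_monkey function, the 09:00-17:00 weekday window and the rule of never terminating the last running instance are all assumptions made for this example, and a real tool would call a cloud provider's API rather than flip a flag in memory.

```python
import random
from datetime import datetime, time


class Instance:
    """Hypothetical stand-in for a production instance."""

    def __init__(self, name):
        self.name = name
        self.running = True


def in_business_hours(now, start=time(9, 0), end=time(17, 0)):
    """True on weekdays between start and end (assumed office hours)."""
    return now.weekday() < 5 and start <= now.time() < end


def unleash_monkey(fleet, now, rng, probability=0.5):
    """Maybe terminate one random running instance; return it, or None."""
    if not in_business_hours(now):
        return None  # engineers are not standing by, so do nothing
    running = [i for i in fleet if i.running]
    # Assumed safety guard: never take down the last running instance.
    if len(running) < 2 or rng.random() >= probability:
        return None
    victim = rng.choice(running)
    victim.running = False
    return victim


if __name__ == "__main__":
    fleet = [Instance(f"web-{n}") for n in range(4)]
    victim = unleash_monkey(fleet, datetime.now(), random.Random())
    print("terminated:", victim.name if victim else "nothing")
```

Passing the clock and the random generator in as arguments keeps the sketch easy to test; the system under test is then expected to keep serving traffic from the remaining instances.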

 

Inspired by the success of the Chaos Monkey, Netflix introduced many other monkeys to induce different kinds of failures, like the Latency Monkey (which induces artificial delays), the Conformity Monkey (which shuts down instances that don’t adhere to best practices) and the Security Monkey (which terminates instances with security violations or vulnerabilities).
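To give a feel for what a Latency Monkey does, here is a small in-process sketch. It is only an illustration under stated assumptions: the latency_monkey decorator, its parameters and the fetch_title function are hypothetical names invented for this example, and Netflix's tool injects delays between services rather than inside a single function.

```python
import functools
import random
import time


def latency_monkey(max_delay_s, probability=1.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call by up to max_delay_s seconds.

    rng and sleep are injectable so the behaviour can be tested
    without actually waiting.
    """
    rng = rng or random.Random()

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # With the given probability, add an artificial delay first.
            if rng.random() < probability:
                sleep(rng.uniform(0, max_delay_s))
            return func(*args, **kwargs)

        return wrapper

    return decorate


@latency_monkey(max_delay_s=0.05)
def fetch_title(movie_id):
    # Hypothetical service call; stands in for a real network request.
    return f"title-{movie_id}"
```

Callers of fetch_title now see slow responses at random, which reveals whether their timeouts and fallbacks actually work.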

Netflix's experience suggests that this works: once their systems had become resilient to the Monkey, they built the Chaos Gorilla, which simulates the outage of an entire AWS Availability Zone.

Later they introduced Chaos Kong because they were looking for even more extreme cases of failure. It made them resilient to the unavailability of an entire AWS Region.

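Nowadays, Chaos Monkey can be downloaded as open source: https://github.com/Netflix/chaosmonkey. The other monkeys were published separately as the Simian Army project, which is no longer actively maintained.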

Other examples that contribute to antifragility are:

  • continuous deployment
  • reducing technical debt
  • microservices
  • A/B testing
  • autoscaling
  • focus on MTTR (mean time to recovery)
  • canary releases
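
Of the practices listed above, MTTR is the easiest to make concrete: it is the average time between detecting an incident and recovering from it. The sketch below shows the arithmetic; the function name and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over (detected_at, recovered_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the downtime of every incident, then divide by the incident count.
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Made-up incidents: one lasting 10 minutes, one lasting 30 minutes.
incidents = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 10)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),
]
print(mean_time_to_recovery(incidents))  # 0:20:00
```

Focusing on this number rewards fast detection and automated recovery rather than the attempt to prevent every failure.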