Today’s software systems are, essentially, controlled chaos—and lightly controlled chaos, at that. This makes it exceptionally challenging to model the behavior of those systems. Our systems are quickly becoming larger and larger, with more and more moving parts. It is not uncommon for enterprises to have over 1,000 microservices and millions of containers running thousands of applications containing many thousands of open source libraries.
Why are things growing more complex? From microservices to CI/CD pipelines, from service meshes to blue/green deployments, new practices and technologies help make our applications and infrastructure more resilient and scalable. The downside is ever-increasing complexity, and the rate of that increase is itself accelerating.
In addition, software has officially taken over every layer of the networking and hardware stack (which sounds like an oxymoron, but it’s containers and virtualization turtles all the way down until you hit bare metal). Software is, by definition, easy to rewrite, and that ease has multiplied the number of changes we make to our systems, which adds still more complexity. In practice, software only ever grows more complex; it never gets simpler, because it is always trying to do more and encompass more. Even when you try to simplify software, you usually just move the complexity around rather than eliminating it.
Introducing Turbulence
Here is how Netflix, the organization that pioneered chaos engineering, defines the practice: “Chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s ability to withstand turbulent conditions.” Turbulence is the opposite of order, but it’s where all systems end up.
At the beginning of a system’s life cycle, there is always an orderly plan for deployment and operations: a clean diagram of your resources, code repositories and containers, and always a nice 3D diagram of the AWS instances you will deploy to and the network topology.
In reality, our systems are never this simple. After the system goes live is when we start learning how it actually operates, through unforeseen events: marketing asks you to refactor the pricing module, an outage in the payments API forces you to hardcode a token, or a DNS outage hits (well, you know about DNS outages). In the span of a year or two, you go from 300 to 800 microservices, and you have to add Kubernetes clusters and a container network interface (CNI) to properly secure the networking of those clusters. And so on. The bottom line? Almost all production systems evolve into a state that we do not understand and cannot easily map back to our original systems diagram.
Chaos engineering is a mechanism for dealing with this growing complexity. The basic idea is to ask the computer, application, network or any other environmental element a series of questions. Consider a legacy system like a mainframe. How did our legacy systems become so stable? They are well-documented, and engineers feel confident in how they operate. But was the system always that way? Is it the same system the engineers originally designed? The answer is no.
Legacy systems actually went through a trial-by-fire; a constant parade of unforeseen surprises. These surprises were, effectively, questions. Through these surprises, engineers learned how the system would actually operate (or wouldn’t operate) in the face of chaotic and unforeseen events. This process created iterative feedback that helped the engineers learn how to make the system continuously more stable.
The point of chaos engineering is to generate this same type of feedback loop without the pain, stress and repercussions of unforeseen (and generally unwanted) events. Thus, chaos engineering is a proactive way of introducing adverse conditions to ensure that the system can still do what it’s supposed to do when unforeseen events transpire. Because everything today is software and change is constant, the questioning must also be constant.
For example, chances are that in the time since we originally designed any given system, it has changed a lot. This means some or all of the mitigation and resilience we programmed in (retry logic, failovers, circuit breakers) may be old news, designed for an older system. When the triggers for those conditions actually occur, the resilience mechanisms we designed may not perform as we intended.
Chaos engineers proactively ask the question, “Under the conditions in which you are designed to perform, do you do what you are supposed to?” We introduce those conditions ourselves, in a controlled manner with a limited blast radius, to put the system to the test and verify that our assumptions line up with its actual behavior. The practice of chaos engineering is intended to be performed on live, running systems without the risk of taking down critical production assets or ambushing unsuspecting applications and their associated DevOps and development teams. This gives engineering teams enhanced context, which reduces uncertainty by building confidence in their understanding of their systems. When engineers have better context, they make smarter, better-informed decisions where resilience is concerned.
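The “controlled manner with a limited blast radius” idea can be sketched in a few lines. This is a hypothetical illustration, not any real chaos tool: the target structure, the `opt_in` flag, and the function names are all assumptions made for the example.

```python
import random

def run_experiment(targets, steady_state_check, inject_fault,
                   blast_radius=0.05, seed=0):
    """Perturb at most `blast_radius` of the opted-in targets, then verify
    the steady-state hypothesis still holds for each perturbed target."""
    # Only experiment on systems whose owners have opted in.
    opted_in = [t for t in targets if t.get("opt_in")]
    rng = random.Random(seed)  # deterministic sampling for repeatability
    sample_size = max(1, int(len(opted_in) * blast_radius))
    sample = rng.sample(opted_in, min(sample_size, len(opted_in)))
    results = []
    for target in sample:
        inject_fault(target)                 # introduce the adverse condition
        passed = steady_state_check(target)  # did it do what it's supposed to?
        results.append({"target": target["name"], "hypothesis_held": passed})
    return results
```

With a 5% blast radius over a fleet of services, only a handful of opted-in targets are ever perturbed at once, so a failed hypothesis stays a small, observable event rather than an outage.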
Chaos Engineering for Cybersecurity
Cybersecurity is a context-dependent discipline: its effectiveness depends on how well you understand the systems you are protecting. Software engineers are constantly changing the system, and they must have the flexibility to change it because their primary task is to deliver business value. Yet to apply cybersecurity practices effectively, we must know what we are trying to secure. From that knowledge, we can determine what most needs securing and how to secure it on a given system in its most current state.
Before an engineer can build robust security measures that provide effective cybersecurity, the engineer must first understand the system. This “system first” dependency is a foundational reason why our security systems and controls, even the immutable and ephemeral ones, drift considerably from the systems they protect over time: our security rests on an understanding of the system’s state that is continuously going stale.
Without some sort of testing or instrumentation feedback loop in our post-deployment world, we don’t know there is a problem with our security until there is an actual problem. Finding out that the room is on fire after the house has been burning for hours is too late. The result is the self-defeating series of reactive incident fire drills, war rooms and outages that we find ourselves in today. If we do not break this reactive crisis loop of chasing the tiger’s tail, it inevitably leaves us taking two steps back for every step forward we make with engineering teams.
When this drift goes undetected, we are left with our wounds exposed and attackers salivating at the meal to come. Engineers would always rather discover sources of drift through an internal engineering failure or a controlled test than through a successful real-world attack. Because engineers are constantly changing the system, the system and its security drift apart over time, and the security designed to protect the system is rarely recalibrated against it. The way to achieve that recalibration is through continuous testing or experimentation.
Testing is not chaos engineering. Testing is verification or validation of something you already know to be either true or false, such as vulnerability to a specific CVE or exposure to an attack pattern. In experimentation, you are trying to derive new information and explore the unknown unknowns. From this exploration, you seek to build a better understanding of how a system reacts and then share that understanding with the people designing the system.
Live security incidents are not effective measures of detection because, by then, it is already too late in the life cycle to effectively address the problem. Worse, incidents cause stress, fear and irrational behaviors. People freak out. Not surprisingly, a security incident is not a good learning environment. Yet security incidents serve the same purpose as chaos experiments: asking questions about a system and helping teams see how to improve it. Chaos engineering offers a better way to ask these questions, under conditions where it is easier to learn and observe system behaviors, providing clues to improving security. In other words, the purpose of security chaos engineering is to help us proactively understand our systems and where the security gaps are before an adversary has a chance to take advantage of them.
There are multiple use cases for security chaos engineering, including:
- Incident response: Testing the efficacy of runbooks and response plans, and determining whether data collection and alerting systems work as intended
- Security control validation: Introducing conditions that should trigger one or more controls and observing whether they behave as expected
- Security observability: Introducing unexpected attacks or breaches to see whether your systems detect and report the incident accurately
- Compliance monitoring: Documenting, across all chaos engineering events, that systems do what they say they are supposed to do
A Practical Example
A team I headed wrote an open source application, ChaoSlingr, designed to generate chaos engineering events for security experiments. The application consisted of three serverless functions to generate the experiment, apply it to infrastructure and then monitor the resulting impact on the system.
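As a rough sketch of that three-stage shape (generate the experiment, apply it, monitor the impact), here is a hypothetical local simulation. The function names and data structures are illustrative only and are not ChaoSlingr’s actual API:

```python
# Hypothetical three-stage pipeline, mirroring the generate/apply/monitor
# shape described in the text. All names here are illustrative assumptions.

def generate_experiment(port=5150):
    """Stage 1: describe the adverse condition to introduce."""
    return {"action": "open_port", "port": port, "detected": None}

def apply_experiment(experiment, infrastructure):
    """Stage 2: apply it, simulating a misconfigured security group."""
    infrastructure["open_ports"].append(experiment["port"])
    return experiment

def track_impact(experiment, monitoring):
    """Stage 3: record whether the monitoring stack noticed the change."""
    experiment["detected"] = experiment["port"] in monitoring["alerted_ports"]
    return experiment
```

In the real tool each stage would be a separate serverless function triggered in sequence; the value of the split is that the monitoring stage can report on the experiment independently of whatever the injection stage managed to break.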
One example of how we used ChaoSlingr was an exercise where we induced an open port. Enterprises and engineers inadvertently open ports in firewalls all the time. Sometimes this is a result of misconfiguration. It may also be the result of a software upgrade after which the developer or AppSec engineer forgot to reset the port to the proper security settings. In theory, firewalls should automatically detect an unauthorized open port.
For the experiment, we selected a set of resources that had opted in to our experiments, selected a security group and opened the port. At the time, we were relatively new to Amazon Web Services. We were surprised to find that the firewall detected an unauthorized port opening only 60% of the time. We had discovered clear evidence of configuration and security tool drift.
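A detection rate like that 60% figure falls out of a simple tally over repeated experiment runs. The sketch below uses hypothetical result records to show the arithmetic:

```python
def detection_rate(results):
    """Fraction of experiment runs in which the control fired an alert."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["detected"]) / len(results)

# Hypothetical data: 10 runs where the firewall caught 6 unauthorized openings.
runs = [{"detected": d} for d in (True,) * 6 + (False,) * 4]
detection_rate(runs)  # 0.6
```

Tracking this rate across repeated runs is what turns a one-off surprise into a measurable drift signal you can watch over time.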
We also noticed that a cloud-native configuration and management tool we had deployed was identifying and alerting on the unauthorized port opening almost 100% of the time. We then realized that correlating and responding to these events was complicated: it was hard to map an incident back to the log data describing how our systems responded to it. We added a metadata pointer to each alert, which let us correlate the information more easily and saved valuable time in finding the affected security group and port.
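The metadata-pointer idea can be sketched as a correlation ID stamped on each injected event, so that alerts and log lines can be mapped back to the experiment that caused them. The names below are hypothetical, not the tooling we actually used:

```python
import uuid

def inject_with_metadata(event):
    """Stamp an injected event with a unique experiment ID (assumed scheme)."""
    event["experiment_id"] = str(uuid.uuid4())
    return event

def correlate(alerts, logs, experiment_id):
    """Return only the alerts and log lines tagged with this experiment."""
    matched_alerts = [a for a in alerts if a.get("experiment_id") == experiment_id]
    matched_logs = [l for l in logs if l.get("experiment_id") == experiment_id]
    return matched_alerts, matched_logs
```

With the ID propagated into each alert, finding the affected security group and port becomes a filter on one field instead of a manual hunt through unrelated log streams.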
In short, the exercise gave us excellent context on where our systems were drifting and failing to keep up, which of our systems were the most reliable, and how quickly and easily we could respond to these incidents by tracking down the right log files.
Security Chaos Engineering Programs Build Trust
In any environment where we are managing complex systems, trust is critical. You must trust that your systems will handle unforeseen events gracefully and not cost your company money or reputational currency. You must trust that the people designing and building your systems are constantly trying to improve them. You must trust that the security teams protecting your systems are up to the task. As systems grow more and more complex, the number of unknown unknowns multiplies. As humans, with our limited processing capacity, we cannot solve for unknown unknowns at scale, continuously. This is the task of security chaos engineering and the processes around it: to create a programmatic way to experiment against and find answers to unknown unknowns. Your systems, developers, customers and your CEO will thank you for it.
About Secure Software Summit
The Secure Software Summit brought together the world’s leading innovators, practitioners and academics of secure software development to share and teach the latest methods and breakthroughs in secure coding and deployment practices. If you care about developing, releasing and securing software, delivering new features fast and building things right from the start, click on the link below to get access to all of the sessions from this summit.
https://go.shiftleft.io/secure-software-summit-2022-replay