How We Use Chaos Engineering to Make Cloud Computing Less Vulnerable to Cyber Attacks

Cloud computing has become a key part of today’s technology, providing the foundation for global connectivity. It enables companies, governments, and individuals to use and build cloud-based services, and underpins a vast number of systems we rely on every day, including telecommunications, transportation, healthcare, banking, and even streaming services.

These systems, like any hardware or software, are susceptible to failures and cyberattacks that can occur unpredictably. Cybercriminals are becoming more determined, and their attacks are becoming more sophisticated and frequent. One tactic these groups often use is the DDoS (Distributed Denial of Service) attack, which floods a company’s systems with more requests and traffic than they can handle.

This blocks legitimate users from accessing the service, causing serious problems for businesses, including lost revenue and reduced customer loyalty. The stakes are high for companies like Google and Amazon, which offer cloud computing services to host consumer data, systems, and services.

In our latest study, we showed how cloud computing systems can actually be hardened by stress. We combined a technique known as chaos engineering with adaptive strategies that help the system learn from failures and cyberattacks.

In one of its quarterly threat intelligence reports, cloud security company Cloudflare noted a 65% increase in DDoS attacks in Q3 2023 compared to the previous quarter. Its Q2 2024 report recorded four million DDoS attacks.

In addition to DDoS and other targeted attacks, businesses using cloud software are also vulnerable to outages caused by anything from connection problems to physical server failures – some of which can also be the result of cyberattacks. Sometimes, even something as minor as a typo can bring down cloud sites.

On July 19, 2024, a problem with CrowdStrike’s Falcon sensor caused Windows hosts connected to Microsoft’s Azure cloud computing system to crash, triggering a global IT outage. The Falcon sensor is designed to prevent cyberattacks, but this outage was not the result of one: it was caused by a technical issue with an update. On July 31, a bug in Microsoft’s DDoS defense caused an eight-hour outage in Azure.

Untangling the fragility

Resolving major outages like these poses significant challenges because of the complexity of the cloud and its many dependencies on other systems, including cybersecurity systems. Deploying reliable fixes can take anywhere from hours to days, and in some cases, such as the CrowdStrike outage, even longer.

Such incidents demonstrate the fragility of our technology infrastructure in general, and of cloud-based systems in particular. Current solutions focus on managing the aftermath of these incidents rather than addressing the root causes by building more robust and resilient cloud systems. A key step in preventing failures is to make advanced software testing standard practice, so that a system’s resilience and reliability under pressure can be assessed before something goes wrong.

In our research, we help cloud consumers address these threats by doing exactly that, so that cloud systems are better prepared for major attacks and outages and can keep functioning through them. Those who operate cloud systems must also adapt and learn from previous incidents to strengthen them.

We used a technique called chaos engineering – deliberately attacking and experimenting with cloud-based applications – to test how the system responds to such attacks.

In one of our recent articles, we discovered that we can use this technique to more accurately predict how a system will react to an attack. Chaos engineering involves deliberately introducing errors into a system and then measuring the results. This technique helps identify and address potential gaps and weaknesses in the design, architecture, and operational practices of a system.

Such experiments may include disabling a service, injecting errors or latency (a delay in how the system responds to a command), simulating cyberattacks, terminating processes or tasks, or simulating a change in the environment in which the system is running or in the way it is configured.
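To make the idea of injecting latency and errors concrete, here is a minimal Python sketch. It is an illustration only, not the tooling used in our experiments: the names `with_faults`, `InjectedFault`, `get_balance`, and `run_experiment` are hypothetical, and the "service" is a local function standing in for a real cloud API.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real response when an error is injected."""


def with_faults(func, error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap `func` so each call is delayed by `latency_s` seconds and
    fails on purpose with probability `error_rate`."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # injected latency before the call proceeds
        if rng.random() < error_rate:
            raise InjectedFault("injected error")  # injected failure
        return func(*args, **kwargs)

    return wrapper


def get_balance(account_id):
    """Hypothetical service call, standing in for a real cloud API."""
    return {"account": account_id, "balance": 100}


def run_experiment(calls=1000, error_rate=0.2, seed=42):
    """Call the fault-injected service repeatedly and return the
    fraction of calls that succeeded."""
    faulty = with_faults(get_balance, error_rate=error_rate, rng=random.Random(seed))
    ok = 0
    for i in range(calls):
        try:
            faulty(i)
            ok += 1
        except InjectedFault:
            pass  # a real experiment would record how callers cope here
    return ok / calls


if __name__ == "__main__":
    print(f"observed success rate: {run_experiment():.3f}")
```

The random generator and the sleep function are passed in as parameters so that a run can be repeated with the same seed and tested without real delays.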

In recent experiments, we injected faults into live cloud-based systems to understand how they behave in stressful scenarios, such as attacks or faults. By gradually increasing the intensity of these “fault injections,” we determined the maximum stress point of the system.

As a result of these injections, we observed a decrease in performance and service availability. In this way, the chaos engineering experiments revealed problems that traditional performance metrics were unable to detect.
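The idea of gradually raising the fault intensity until the system gives way can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not our experimental setup: the function names, the 99% availability target, and the model of a client that retries failed requests are all hypothetical choices made for illustration.

```python
import random


def availability(error_rate, retries=2, requests=2000, seed=0):
    """Simulated fraction of requests that succeed when each attempt fails
    with probability `error_rate` and the client retries up to `retries` times."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(requests):
        for _attempt in range(retries + 1):
            if rng.random() >= error_rate:
                ok += 1
                break  # this request succeeded, stop retrying
    return ok / requests


def find_breaking_point(target=0.99, levels=None):
    """Step through increasing fault intensities and return the first one
    at which availability falls below `target`, plus everything measured."""
    if levels is None:
        levels = [i / 20 for i in range(21)]  # 0%, 5%, ..., 100% injected errors
    history = []
    for rate in levels:
        observed = availability(rate)
        history.append((rate, observed))
        if observed < target:
            return rate, history
    return None, history  # the target was never violated


if __name__ == "__main__":
    point, history = find_breaking_point()
    for rate, observed in history:
        print(f"error rate {rate:.2f} -> availability {observed:.4f}")
    print("breaking point:", point)
```

In a real experiment, `availability` would be replaced by measurements taken from the running system while faults are being injected, rather than by a simulation.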

Learning from chaos

Chaos engineering is a great tool for improving the performance of software systems. But to achieve what we call “antifragility”—systems that can become stronger, not weaker, under stress and chaos—we need to integrate chaos testing with other tools that transform systems to become stronger under attack.

In our recent work, we presented an adaptive framework to do exactly this. This framework, called “Unfragile,” uses chaos engineering to introduce failures gradually and evaluate the system’s response to these stresses.
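The general pattern of stressing a system gradually and adapting whenever it falls short can be sketched as a simple loop. The code below is a hypothetical illustration of that pattern only and is not the Unfragile framework itself: the function names, the 99% target, and the choice of "add one more retry" as the adaptation are assumptions made for the example.

```python
import random


def success_rate(error_rate, retries, requests=2000, seed=0):
    """Simulated fraction of requests that succeed when each attempt fails
    with probability `error_rate` and up to `retries` retries are allowed."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(requests):
        if any(rng.random() >= error_rate for _ in range(retries + 1)):
            ok += 1
    return ok / requests


def adaptive_chaos(target=0.99, max_retries=6, levels=(0.1, 0.2, 0.3, 0.4)):
    """Raise the injected error rate level by level. Whenever the measured
    success rate misses `target`, adapt (here: allow one more retry) and
    measure again before moving on to a harsher level."""
    retries = 0
    log = []
    for rate in levels:
        observed = success_rate(rate, retries)
        while observed < target and retries < max_retries:
            retries += 1  # hypothetical adaptation step
            observed = success_rate(rate, retries)
        log.append((rate, retries, observed))
    return log


if __name__ == "__main__":
    for rate, retries, observed in adaptive_chaos():
        print(f"error rate {rate:.1f}: retries={retries}, success={observed:.4f}")
```

Each round leaves the simulated system configured to tolerate a higher fault level than before, which is the sense in which stress makes it stronger. A real system would have many more possible adaptations than a retry count, such as timeouts, replicas, or rate limits.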
