Key Takeaways
- Chaos Engineering (CE) and Disaster Recovery Testing (DiRT) are essential methodologies for addressing modern technological challenges beyond traditional error budgets.
- DiRT enhances system resilience by intentionally instigating failures, exposing hidden risks, and improving disaster recovery effectiveness.
- Maturity models such as the new Disaster Recovery Testing Maturity Assessment (DiRMA) framework provide a structured path for strengthening DiRT implementation, helping organizations overcome cultural resistance and the difficulty of measuring impact.
- DiRMA evaluates DiRT adoption across three dimensions (people, processes, and tools) and assigns maturity levels ranging from Introductory to Advanced.
- Continuous improvement is key, with DiRMA emphasizing ongoing enhancement of DiRT practices through monitoring, feedback, and adaptation to evolving technologies.
In today’s complex technological landscape, traditional error budgets are no longer sufficient to address modern challenges such as cloud outages, AI bias, data loss, and regulatory compliance. To build more resilient systems, companies like Google, Netflix, Slack, and Capital One have adopted structured methodologies such as CE and DiRT. While these approaches improve system reliability by deliberately introducing failures, implementing them effectively presents challenges, including cultural resistance, lack of ownership, and difficulty measuring their impact.
To address these challenges, organizations have developed maturity models, structured frameworks that assess the effectiveness of reliability programs and guide their improvement. However, while CE maturity models exist, they do not account for the unique characteristics of DiRT, which goes beyond system resilience to evaluate business processes and human responses.
This article introduces DiRMA, a new framework designed to measure and improve the maturity of DiRT programs across three key dimensions: people, processes, and tools. By assessing an organization’s current capabilities and providing a structured path for advancement, DiRMA helps teams overcome common obstacles and build a more resilient disaster recovery strategy.
The following sections will explore the fundamentals of DiRT, compare existing Chaos Engineering maturity models, and detail how DiRMA provides a comprehensive approach to evaluating and enhancing disaster recovery readiness.
DiRT Overview
DiRT is a structured approach to stress-testing systems by intentionally triggering controlled failures. Originally pioneered in large-scale technology infrastructures, DiRT helps organizations proactively identify weaknesses and refine their recovery strategies. Unlike traditional disaster recovery methods, which rely on theoretical scenarios, DiRT forces teams to confront real operational disruptions in a controlled manner, ensuring that failure responses are both effective and repeatable. The methodology consists of a coordinated and organized set of events in which a group of engineers plans and executes real and fictitious outages for a defined period to test the effective response of the teams involved [Climent, 2019]. Tests are categorized into tiers based on the organizational breadth of those expected to respond to or be impacted by the testing, as described in Table 1.
Table 1. General Description of DiRT Tiers

| Tier | Description |
| --- | --- |
| Tier 3 | Exercises test resilience in specific systems or isolated products. Experiments are not expected to impact other external applications. |
| Tier 2 | Testing focuses on probing the dependencies of a shared system or product, so experiments target services, such as databases or APIs, that are used by other applications. |
| Tier 1 | Experimentation tests the organizational response to an enterprise-level event. Exercises are usually fictitious and involve people and processes. |
According to information Google has shared publicly, the following examples illustrate the types of real-world tests that would be performed at each tier. Tables 2, 3, and 4 show scenarios and what a team can learn by practicing exercises in each tier.
Table 2. Example of a DiRT test in Tier 1

| Name | Vulnerability exploited in a core system |
| --- | --- |
| Scenario | A security vulnerability is exploited, with the potential to be leveraged by a threat agent to compromise the availability of an e-commerce application. |
| What to learn | |
Table 3. Example of a DiRT test in Tier 2

| Name | Database degrades the response times of a service |
| --- | --- |
| Scenario | A change in a database degrades the quality and the response times of an application. |
| What to learn | |
Table 4. Example of a DiRT test in Tier 3

| Name | Test in an isolated service |
| --- | --- |
| Scenario | A change in configuration parameters during a deployment increases CPU and memory consumption. |
| What to learn | |
In addition to the tiers, tests are also classified in terms of prioritization, communication protocols, and impact expectations. Every test must include a revert/rollback plan in case something goes wrong, and each is reviewed and approved by a cross-functional technical team separate from the coordinating team. The lifecycle of a test is illustrated in Figure 1.
Figure 1. Illustration of the DiRT Test Lifecycle, based on the model presented in Chapter 5, contributed by Jason Cahoon, in the “Chaos Engineering” book [Rosenthal and Jones, 2020]
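To make this concrete, here is a minimal sketch that models a DiRT test plan as a data structure and checks the two gating conditions named above (a rollback plan plus cross-functional approval). The type and field names are illustrative assumptions, not part of any published DiRT tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    """DiRT tiers as described in Table 1."""
    ISOLATED_SYSTEM = 3      # a single system or isolated product
    SHARED_DEPENDENCY = 2    # shared services such as databases or APIs
    ORGANIZATIONAL = 1       # enterprise-level, involving people and processes


@dataclass
class DirtTestPlan:
    """Hypothetical DiRT test plan; field names are assumptions for
    illustration. The source only states that tests carry a tier,
    prioritization, communication protocols, impact expectations,
    a revert/rollback plan, and cross-functional approval."""
    name: str
    tier: Tier
    priority: str                # e.g. "high", "medium", "low"
    communication_protocol: str  # who is notified before and during the test
    expected_impact: str
    rollback_plan: str           # mandatory: how to revert if something goes wrong
    approved_by: list[str] = field(default_factory=list)

    def is_ready_to_run(self) -> bool:
        # A test runs only with a rollback plan and approval from a
        # cross-functional team distinct from the coordinating team.
        return bool(self.rollback_plan) and len(self.approved_by) > 0
```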
Google’s practical resilience testing covers a broad range of scenarios, such as disconnecting complete data centers, forcing application traffic rerouting, introducing configuration changes to live services, and deploying deliberately flawed service versions. Google also experiments with making unavailable the people who hold undocumented knowledge or experience, or with removing documentation, process elements, or communication channels. More information can be found in [Rosenthal and Jones, 2020].
In the book, Rosenthal and Jones explain that DiRT offers a means of verifying conjecture and proving a system’s behavior empirically, leading to a deeper understanding and ultimately more stable systems. Given the importance of resilience testing, it is valuable to have a method to determine how to start, how to progress to advanced levels, and to evaluate how well one is progressing in this journey, which is precisely what DiRMA seeks to do.
Chaos Engineering Maturity Models
Maturity assessment models have been created to help organizations understand and improve their capabilities in particular areas, such as security, reliability, and innovation. These models have evolved along with organizations to offer a better view of the current state of adoption, implementation, and sophistication of the relevant subject matter. They use surveys, performance data, and observation to gather insights about key indicators such as:
- High employee turnover
- Inconsistent communication
- Frequent project delays
- Lack of clear goals
- Conflicting priorities
- Poor decision-making processes
- Low morale
- High stress levels
- Lack of standardized procedures
For CE in particular, two maturity models have been documented in the literature: the CE Maturity Assessment Model from Netflix [Rosenthal, Hochstein, Blohowiak, Jones and Basiri, 2017] and the CE Maturity Model from Harness [Mukkara, 2022]. Both were designed to guide organizations in their journey toward building more resilient systems through controlled experimentation.
The first, the Chaos Maturity Model (CMM), is based on two dimensions: sophistication and adoption. Sophistication measures the validity and safety of experiments, ranging from elementary (manual, non-production) to advanced (fully automated, integrated with development). Adoption measures the coverage of chaos experimentation, from “in the shadows” (unsanctioned, few systems) to “cultural expectation” (frequent experiments for all services, part of onboarding).
The second model, the Chaos Engineering Maturity Model (CEMM), proposed by Mukkara for Harness [Mukkara, 2022], was designed to guide organizations in progressively adopting and scaling CE practices. It emphasizes a gradual approach, dividing CE maturity into four levels, each with specific goals and actions, starting with basic experiments and progressing to full integration into the development lifecycle and production environments.
- Level 1 – Test/Start is the foundational stage, where organizations begin experimenting with CE and engineers select less critical services for conducting basic experiments.
- Level 2 – Automate emphasizes automating simple chaos experiments within continuous delivery pipelines. Here organizations begin collecting reliability metrics, such as service health and resiliency scores, to track improvements (a minimal scoring sketch follows this list).
- At Level 3 – Scale, successful chaos engineering practices are scaled across all teams and services, and auto-remediation is implemented for experiments that result in system failures.
- Finally, at Level 4 – Expert, CE is fully integrated into production environments, where chaos experiments are developed based on production incidents and validated in lower environments.
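The reliability metrics mentioned at Level 2 could start as simply as the sketch below, which scores each service by the fraction of chaos experiments it survived. This pass-rate definition is an assumption for illustration; it is not Harness’s published resiliency-score formula.

```python
from collections import defaultdict


def resiliency_scores(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute a naive 0-100 resiliency score per service.

    `results` pairs a service name with whether a chaos experiment
    passed (i.e., the steady state held). The score is the pass rate;
    this definition is illustrative, not the CEMM's official metric.
    """
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for service, ok in results:
        total[service] += 1
        passed[service] += ok  # bool counts as 0 or 1
    return {s: 100.0 * passed[s] / total[s] for s in total}


if __name__ == "__main__":
    runs = [("checkout", True), ("checkout", False), ("search", True)]
    print(resiliency_scores(runs))  # {'checkout': 50.0, 'search': 100.0}
```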
In essence, the objective of maturity assessment models is to provide a structured and systematic way for organizations to understand their current capabilities, identify areas for improvement, and achieve their strategic goals.
Disaster Recovery Testing Maturity Assessment (DiRMA)
DiRMA is inspired by the DiRT program, created in 2006 by Google to inject failures into critical systems, business processes, and people dynamics to expose reliability risks and provide preemptive mitigations. Since some organizations have already started their journey toward creating environments for DiRT, in which they can launch failures, determine their level of resilience, and test their incident response processes, it is essential to have frameworks, like the CE maturity assessments above, to evaluate the effectiveness of a program like DiRT.
To lay the first foundations for the development of such models, this article presents DiRMA: a practical framework for evaluating and improving organizations’ readiness in terms of DiRT. DiRMA is inspired by the Production Maturity Assessment (PMA) [CRE, 2021], also created by Google, which evaluates where a team lies on the SRE spectrum. In the same spirit, DiRMA uses an employee survey, group discussions, and leadership observations to determine, on a scale from one to five, the adoption level of DiRT.
DiRMA maps the results onto three key dimensions: people, process, and tools. These dimensions give companies a clear picture of their current state and the next steps to reach the desired level of DiRT. The next sections explain the methodology, the three dimensions, and the five levels in the model.
DiRMA answers these questions:
- How are people involved with DiRT, ranging from in-shadows to a complete training program?
- How does DiRT use systems and business metrics to ensure the program’s reliability and accuracy?
- How does DiRT utilize historical data to forecast capacity needs and inform resource allocation decisions?
The framework determines a level of maturity, ranging from introductory to advanced, across three dimensions evaluated during DiRT exercises as practiced at Google. The methodology is illustrated in Figure 2.
Figure 2. DiRMA Map proposed in this article
Specifically, each dimension (people, processes, and tools) has a set of questions. Each question has five answer options, and each option has an associated score, as illustrated in Tables 5 and 6. With this structure, once participants complete the survey, the framework determines the organization’s score in each dimension by averaging the participants’ scores for each question and dimension.
All questions carry the same weight, so the organization’s value is computed as the mean, although the median or mode, which are more robust to outliers, could be used instead. An example of the process is illustrated in Figure 3.
Table 5. DiRMA questions and answer options example

Question 1: How are people involved with DiRT, ranging from in-shadows to a complete training program?

| Answer Options | |
| --- | --- |
| Option 1 | ☐ No experiments have been run yet. |
| Option 2 | ☐ Early adopters infrequently perform DiRT. |
| Option 3 | ☐ Multiple teams are interested and engaged. |
| Option 4 | ☐ A team is dedicated to the practice of DiRT. |
| Option 5 | ☐ DiRT is part of the engineering onboarding process. |
Table 6. Association among Answer Options, Scores, and Maturity Levels

Question: How are people involved with DiRT, ranging from in-shadows to a complete training program?

| Answer | Option 1 | Option 2 | Option 3 | Option 4 | Option 5 |
| --- | --- | --- | --- | --- | --- |
| Score | 1 | 2 | 3 | 4 | 5 |
| Maturity Level | Introductory | Elementary | Basic | Sophisticated | Advanced |
Figure 3. DiRMA Process proposed in this article
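To illustrate the scoring just described, the sketch below averages participants’ 1-5 answers per dimension and maps the result to the maturity levels of Table 6. The data layout and the rounding cut-off are assumptions for illustration; DiRMA itself only specifies equal weights and the mean (or a more robust alternative such as the median).

```python
from statistics import mean

# Maturity levels from Table 6, indexed by the rounded 1-5 score.
LEVELS = {1: "Introductory", 2: "Elementary", 3: "Basic",
          4: "Sophisticated", 5: "Advanced"}


def dimension_scores(answers: dict[str, list[int]]) -> dict[str, float]:
    """Average the 1-5 answers collected per dimension.

    `answers` maps a dimension ("people", "process", "tools") to all
    participants' scores for that dimension's questions. Every question
    is weighted equally, so the mean is used; `statistics.median` could
    be swapped in for a measure more robust to outliers.
    """
    return {dim: mean(scores) for dim, scores in answers.items()}


def maturity_level(score: float) -> str:
    # Rounding to the nearest whole score is an assumed cut-off rule.
    return LEVELS[max(1, min(5, round(score)))]


if __name__ == "__main__":
    survey = {
        "people": [2, 3, 3, 2],
        "process": [4, 4, 5, 3],
        "tools": [1, 2, 2, 1],
    }
    for dim, score in dimension_scores(survey).items():
        print(f"{dim}: {score:.1f} -> {maturity_level(score)}")
```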
The Three Dimensions of DiRMA
DiRMA assesses the DiRT adoption level in terms of people, processes, and tools:
- People: DiRMA delves into the knowledge, mindset, and attitudes of the individuals involved in the disaster recovery program. It emphasizes the importance of evaluating different roles, including operations engineers, developers, product owners, architects, managers, and executives. Although measuring mindset or attitudes within an organization is complex because they are inherently intangible, the framework proposed here gains insight into them through a survey with questions designed to gauge employees’ feelings about their work, the organization, and their sense of belonging.
- Process: DiRMA analyzes and assesses the maturity of processes within the disaster recovery program delivery. It highlights the need to consider the various subprocesses and the involvement of different teams and roles, emphasizing the importance of interviewing the right people.
- Tools: DiRMA evaluates the sophistication of the tools employed in disaster recovery program delivery, such as fault injection tools, monitoring platforms, and automation scripts. It recognizes that technology encompasses both technical and user experience aspects and acknowledges the diverse tools used for injecting and observing failures. Sophistication is measured in terms of the environment in which the tools are used, the setup configured, automatic result analysis, and whether experiments are terminated manually or run fully automated. Other criteria include whether results are tracked over time and whether tooling supports interactive comparison of experiment and control.
The Evolution of DiRMA Maturity
DiRMA defines distinct maturity levels, visualized as an evolutionary journey. To aid understanding of the methodology, DiRMA maps these levels onto a graphic representation:
- Introductory (Padawan): at this initial stage, disaster recovery efforts, even if started with DiRT, are often disorganized, ad hoc, and potentially chaotic. Success relies heavily on individual effort and lacks repeatability. Processes are poorly defined and documented, hindering replication.
- Defined (Senior Padawan): DiRT is repeatable because processes are defined, established, and documented. Basic project management techniques apply, and successes in key process areas can be replicated.
- Managed (Knight): stakeholders actively monitor and control DiRT within the organization through data collection and analysis. Process metrics are used, and the effective achievement of process objectives is evident across various operational conditions.
- Advanced (Jedi): DiRT reaches an optimized level, where processes undergo continuous improvement through monitoring and feedback. The focus is on continually enhancing process performance through both incremental and innovative technological changes and improvements.
Although DiRMA is based on successful models such as the CMM from Netflix, the CEMM from Harness, and the PMA from Google, and although it has been used in academic settings, the journey toward complete validation has only just begun. In the future, DiRMA will need to be applied in other scenarios so that more data can be collected and feedback on the lessons learned can be gathered.
In the long term, DiRMA and the other maturity assessment models will have to adapt to the rapidly evolving landscape of emerging technologies by incorporating more dynamic and data-driven assessment methodologies and fostering greater interoperability between different maturity frameworks. Finally, a critical area for development should be the integration of human-centric factors, such as organizational culture and individual learning, to ensure that maturity models truly drive sustainable and meaningful progress.
Conclusions
In the evolution of reliability practices, traditional error budgets are insufficient for modern technological challenges (cloud outages, AI bias, etc.). Companies like Netflix, Slack, Capital One, and Google have adopted structured methodologies like Chaos Engineering (CE) and Disaster Recovery Testing (DiRT) to enhance reliability.
However, in implementing programs like DiRT, organizations have faced hurdles like cultural resistance, lack of ownership, and difficulty in measuring the business impact of reliability programs. Maturity models help address these challenges by providing a structured path for improvement.
Considering this scenario, this article introduced DiRMA, a framework that provides actionable insights on how to implement DiRT within an organization. By using DiRMA, organizations can systematically identify areas for improvement and build a more robust and resilient disaster recovery plan.
DiRMA provides a structured approach to assessing and enhancing disaster recovery readiness. By evaluating DiRT adoption across people, processes, and tools, organizations can systematically improve resilience and adaptability in an evolving technological landscape.