Key Takeaways
- Automation is often involved in counter-intuitive ways in software incidents (e.g., impeding resolution or complicating human response).
- Strictly dividing tasks between those that automation handles and those that humans handle shapes the design of systems in ways that make resolving incidents more difficult.
- Automation can degrade how humans acquire and retain knowledge and skills in unanticipated ways, because people get less hands-on experience with the systems the automation is meant to run.
- Cognitive science, particularly the principles of Joint Cognitive Systems, can inform better design of systems and tools that incorporate or rely on automation.
- Ideally, designers of complex software systems would implement automation as a way to augment and improve human work rather than aiming to replace it.
On August 1, 2012, Knight Capital Group, a financial services company, released an update that ended up losing the company over $460 million in roughly forty-five minutes and severely impacted the stock prices of numerous other companies. By the next day, Knight Capital’s value had plummeted by seventy-five percent, leading to its acquisition at a fraction of its original value and the eventual dissolution of the company. More than a decade has passed since an automated process helped bring down a major financial firm in under an hour, and yet our understanding of and beliefs about automation have not evolved, even as we hurtle toward an AI-dominated industry. That stagnation is driven in part by the notion that AI will simply address the limitations of traditional automation.
Our research has shed light on these assumptions and limitations, and on how they continue to wreak havoc on software systems that are increasingly reliant on automation (and, increasingly, AI). Below I lay out some common assumptions and misconceptions about automation and its role in software (and software incidents), what our research has found about how automation shows up in those incidents, and some ideas for designing automated tools that better help people handle incidents.
Myths and Misconceptions About Automation
While automation is involved in nearly all software at this point, I was curious how tools and systems designed to operate independently of humans, things like CI/CD deployment pipelines or Kubernetes pod auto-scaling, might contribute to, assist with, or otherwise be involved in incidents when software doesn’t work as designed or expected. My interest in researching the role of automation in software incidents came both from the many stories and anecdotes I’ve heard from folks in the SRE and incident response domains, and from watching AI continue to dominate tech conversations, tooling, and products.
Research from other domains like aviation has demonstrated a number of myths and misconceptions about how automation contributes to accidents, and I set out to see whether the same held true for software incidents. The primary misconception to emerge from studies of automated cockpits and pilots is the assumption that automation should replace the work that humans do.
This misconception has been described in human factors research as “functional allocation by substitution” or the “substitution myth”, promoting the idea that all we need to do is give separate tasks to computers and people according to their strengths. The substitution myth refers to the flawed assumption that automation can simply replace human functions in a system without fundamentally altering how the system or human work operates.
This misconception is built on framings like HABA-MABA (“Humans Are Better At / Machines Are Better At”), which treat human and machine strengths as fixed and system design as merely a matter of allocating tasks accordingly (Dekker, S., & Woods, D. D. (2002). MABA-MABA or Abracadabra? Progress on Human-Automation Co-ordination. Cognition, Technology & Work, 4, 240-244).
Dekker and Woods argue that this substitution-based approach fails because:
- Automation doesn’t just replace existing tasks, it transforms human work, creating new tasks, roles, and challenges.
- These changes are often qualitative and unpredictable, not just quantitative.
- The belief in simple substitution leads to poor design assumptions, neglects coordination needs, and overlooks how real human work is adapted in practice.
Their research revealed that automation, contrary to these prevailing beliefs, often contributes to incidents and accidents in unforeseen ways, imposing unexpected burdens on humans responsible for overseeing or interacting with it.
The Unintended Consequences of Automation
What we saw in the software incident reports from the VOID mirrors what Dekker and Woods articulated above. We found that automation contributes to incidents in a variety of ways, often in multiple ways in a single incident! Automation can be a contributing factor to an incident happening, it can alert people to an incident, it can (very rarely) solve the incident on its own, and, unfortunately, it can also make it even harder for people to solve the incident than if the automation weren’t present in the first place.
Perhaps the most public example of this was Facebook employees being unable to get into their own data centers during a large outage in 2021. A command propagated automatically through their system and unintentionally took down all the connections in their backbone network, effectively disconnecting Facebook’s data centers globally. This cascaded into a DNS issue that made it impossible for the rest of the internet to find their servers, and because the outage also affected the secure access protocols needed to get Facebook employees on site, it took extra time just to get people into the buildings and able to work on the servers.
In particular, my research found many of the same patterns that researcher Lisanne Bainbridge articulated in her foundational paper, “Ironies of Automation.” A few of those patterns include:
- Designing for only desirable outcomes
Designers of automation tend to imagine the desired outcomes of automation (e.g., lower workload, higher accuracy) and to assume that only those desired outcomes will occur. As an example, take the entire arena of “configuration errors”: a CI/CD pipeline is great until it pushes a change that takes the whole system down, without any sign that something awry was about to happen.
- Unanticipated negative consequences
Automation does not have access to all the real-world parameters needed for accurate problem solving in every context, and when its unanticipated negative consequences surface, it can actually make it harder for humans to affect the system in the ways they want. Retry storms are a great example of this in action: the automation is doing what it was putatively designed to do, but in extreme circumstances it can make the problem much worse, and interrupting that process is rarely trivial (the first sketch after this list illustrates the pattern).
- Deskilling paired with passive monitoring
Humans still have to monitor that the automation is working properly, yet they often lack the knowledge or context about how the automation is supposed to function to judge whether it actually is working properly. Proper knowledge of the system, and of how the automation works within it, requires frequent use and exposure to the system operating in a variety of scenarios; that kind of knowledge only develops through practical, hands-on feedback about how effective the system is and where and when edge cases or unexpected outcomes occur. If you’ve ever stared at an alerts dashboard unsure of what exactly is going on with a system you’re unfamiliar with, you’ve probably experienced this firsthand.
- New, unanticipated forms of human work
When an automated system fails, the amount of knowledge required to make things right again is likely greater than that required during normal operations, and the failure immediately creates new and numerous items of work. Because the designers of automation can’t fully automate the human “parts,” the human is left to cope with whatever remains when the automated parts don’t behave as expected, and that remainder is often more complex, not less.
- Lack of visibility into automated system activity
Automation creates systems that require people to provide intelligence outside of or beyond known instructions (e.g., runbooks). To debug a system with automation in it, one needs to understand not only the overall system but also the automation: what it should be doing and how it is going wrong. This is a fundamentally more complex task than debugging without the automation involved! This knowledge often lives in pockets or silos of experience, such that when Amy the Kafka expert goes on vacation, no one else knows the intricacies of consumer rebalancing issues like she does (the second sketch below shows one small way automation can make its own activity more visible).
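To make the retry-storm pattern concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `call_dependency` function and the specific limits are invented for illustration); the point is only the shape of the behavior: a naive retry loop adds load exactly when a dependency can least absorb it, while a bounded, jittered backoff eventually gets out of the way and hands the problem to a human.

```python
import random
import time

MAX_ATTEMPTS = 5          # hypothetical retry budget
BASE_DELAY_SECONDS = 0.2  # hypothetical starting backoff interval


def call_dependency():
    """Placeholder for a call to a downstream service that may be overloaded."""
    raise TimeoutError("dependency timed out")


def naive_retry():
    # The retry-storm shape: every caller hammers the struggling dependency
    # in a tight loop, amplifying the very problem it is trying to outlast.
    while True:
        try:
            return call_dependency()
        except TimeoutError:
            continue  # retry immediately, forever


def bounded_retry_with_backoff():
    # Same intent (eventually succeed), but each attempt waits longer, adds
    # jitter so callers don't retry in lockstep, and stops after a fixed
    # budget so humans can intervene instead of fighting the automation.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_dependency()
        except TimeoutError:
            delay = BASE_DELAY_SECONDS * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
    raise RuntimeError("retry budget exhausted; surfacing the failure to a human")
```

Plenty of real systems implement some version of the second function; the irony Bainbridge describes is that even the “good” version still has to be understood, tuned, and sometimes interrupted by people.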
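On the visibility point, one small design move is for automation to narrate its own decisions in a form humans can inspect later. The sketch below is again hypothetical (an imagined autoscaler with an invented `record_decision` helper), but it shows the idea: every automated action carries its own explanation of what was done, why, and based on which inputs.

```python
import json
import time


def record_decision(action: str, reason: str, details: dict) -> None:
    """Emit a structured, human-readable record of what the automation did and why."""
    print(json.dumps({
        "timestamp": time.time(),
        "actor": "autoscaler",   # which piece of automation acted
        "action": action,        # what it did
        "reason": reason,        # why it believed this was the right move
        "details": details,      # the inputs it used to decide
    }))


def maybe_scale(current_replicas: int, cpu_utilization: float) -> int:
    # Hypothetical scaling rule: add a replica when CPU crosses a threshold.
    if cpu_utilization > 0.8:
        new_replicas = current_replicas + 1
        record_decision(
            action="scale_up",
            reason="cpu_utilization above 0.8 threshold",
            details={"cpu": cpu_utilization, "from": current_replicas, "to": new_replicas},
        )
        return new_replicas
    return current_replicas
```

Whether these records land in logs, traces, or a chat channel matters less than the fact that responders can later reconstruct what the automation thought it was doing.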
How Joint Cognitive Systems Better Support Humans Working with Automation
A Joint Cognitive System (JCS) is one in which humans and machines work together based on principles of shared cognitive efforts, not simply dividing the work to be done across humans and machines separately. Instead of replacing human work, JCS advocates for making automated components into effective “team players” that can augment and support human work.
In their research, Gary Klein and colleagues describe ten key aspects of this joint activity, which they collectively characterize as:
“…an agreement (often tacit) to facilitate coordination, work toward shared goals, and prevent breakdowns in team coordination. This … involves a commitment to some degree of goal alignment. Typically this entails one or more participants relaxing their own shorter-term goals in order to permit more global and long-term team goals to be addressed”.
It’s too much to go through all ten key elements here, but it’s worth noting that in order to have this goal alignment (and adjust it over time), those working together must:
- Be mutually predictable in their actions
In highly interdependent tasks like software operations, we can only plan our actions effectively when we can accurately anticipate the actions of others. Skilled teams achieve this predictability through shared knowledge and coordination mechanisms developed over time through extensive collaboration. Despite the common refrain of “human error” in incidents, humans are in general quite predictable in their work, and we have established means for checking when something seems unpredictable (call, chat, email, etc.). So this goal is ultimately about making automation more predictable, and about giving humans better ways of “seeing” what is going on with it.
- Be mutually directable
Being directable is the ability to deliberately assess and modify another party’s actions in a collaborative setting as the situation and priorities change (a very common reality for software teams). Effective coordination requires that team members respond adequately to each other’s actions as the work progresses (the sketch after this list imagines what this might look like for automation).
- Maintain common ground
This is the most important attribute of a well-honed JCS. Common ground reflects the pertinent knowledge, beliefs, and assumptions that all the involved parties share, enabling each party to understand the communication that coordinates their joint actions. An erosion of common ground can lead to a potentially disastrous breakdown in team functioning.
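None of this prescribes a specific implementation, but as a thought experiment, here is a minimal sketch of what a more predictable, directable automated teammate could look like. Every name in it (`Proposal`, `announce`, `await_human_response`) is invented for illustration; the pattern is simply: state your intent where responders can see it, leave a window for them to redirect or cancel you, and report back what you did.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    action: str        # e.g., "restart the payment-service pods"
    rationale: str     # why the automation believes this will help
    wait_seconds: int  # how long humans have to object or redirect


def announce(proposal: Proposal) -> None:
    """Announce intent where responders will see it (an incident channel, for example)."""
    print(f"[automation] proposing: {proposal.action} because {proposal.rationale}; "
          f"reply within {proposal.wait_seconds}s to cancel or change it")


def await_human_response(proposal: Proposal) -> Optional[str]:
    """Placeholder: a real system might poll a chat thread or an approvals API."""
    time.sleep(proposal.wait_seconds)
    return None  # None means no one objected


def run(proposal: Proposal) -> None:
    announce(proposal)                        # predictable: say what you are about to do
    response = await_human_response(proposal)
    if response is not None:
        print(f"[automation] deferring to human direction: {response}")  # directable
        return
    print(f"[automation] executing: {proposal.action}")
    # ...perform the action, then report the result back to the same channel,
    # which helps maintain common ground about the current state of the system.
```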
Now, if you’re thinking, “Well I don’t know of any automated systems that do this!” then you’re not alone, because neither do I. Not yet. I am pretty curious about some recent work in this area from Honeycomb. They appear to grasp many of the principles I describe above, and can articulate the why (and why not) of using LLMs and AI:
“As writing code gets easier and easier, understanding code has gotten harder and harder. Writing code has never been the hardest part of software development; it has always been operating, maintaining, and iterating on it. In order to truly boost productivity, AI will have to get better at helping us with these things”.
I certainly hope that more designers of automated (and AI-based) tools and systems factor these kinds of principles into their designs in order to make automation a better team player and help us do better work ourselves. I am often asked if “better AI” will simply solve this problem for us, and my answer is always the same: If the people designing AI tools aren’t considering these joint cognitive factors, then no, it will not. It will just increase the scale and rate at which humans have to tackle these same problems all over again.
If people are interested in learning more, I helped launch the Resilience in Software Foundation, a diverse, interdisciplinary community focused on tackling these kinds of issues in our industry.
So much of my (and other people’s) research highlights how underappreciated and misunderstood the role of human expertise in software really is. I’ve always been fascinated by human expertise: how we acquire it, what it looks (and feels) like in action, and how ephemeral it can seem at times. It’s easy to take for granted once it is established, and it can’t simply be cloned and systematically scaled. As the AI machine grinds ever forward, I sincerely hope we continue to invest in the human expertise that keeps our systems running.