Transcript
Shane Hastie: Good day folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down with Courtney Nash. Courtney, welcome. Thanks for taking the time to talk to us.
Courtney Nash: Hi Shane. Thanks so much for having me. I am an unabashed lover of podcasts, and so I’m also very excited to get the chance to finally be on yours.
Shane Hastie: Thank you so much. My normal starting point with these conversations is who’s Courtney?
Introductions [00:56]
Courtney Nash: Fair question. I have been in the industry for a long time in various different roles. My best-known stint, to some people at least, was as an editor for O’Reilly Media for almost 10 years. I chaired the Velocity Conference, and that sent me down the path I would say I’m currently on, through the early days of DevOps and that whole development in the industry, which turned into SRE. I was managing the team of editors, one of whom was smart enough to see the writing on the wall that maybe there should be an SRE book or three or four out there. And through that time at O’Reilly, I focused a lot on what you focus on, actually: people and systems and culture.
I have a background in cognitive neuroscience, in cognitive science and human factors studies. And that collided with all of the technology and DevOps work when I met John Allspaw and a few other folks who are now really leading the charge on trying to bring concepts around learning from incidents and resilience engineering to our industry.
And so the tail end of that journey for me ended up working at a startup where I was researching software failures, really, for a company that was focusing on products around Kubernetes and Kafka, because they always work as intended. And along the way I started looking at public incident reports and collecting those and reading those. And then at some point I turned around and realized I had thousands and thousands of these things in a very shoddy ad hoc database that I still to this day maintain by myself, possibly questionable. But that turned into what’s called The VOID, which has been the bulk of my work for the last four or five years. And that’s a large database of public incident reports.
Just recently we’ve had some pretty notable ones that folks may have paid attention to, things like when Facebook went down in 2021 and they couldn’t get into their data center. Ideally companies write up these software failure reports, software incident reports, and I’ve been scooping those up into a database and essentially doing research on them for the past few years, trying to bring a data-driven perspective to our beliefs and practices around incident response and incident analysis. That’s The VOID. Most recently I produced some work that I spoke at QCon about, which is how we all got connected: what I found about how automation is involved in software incidents, based on the database that we have available to us in The VOID.
Shane Hastie: Can we dig into that? The title of your talk was "Exploring the Unintended Consequences of Automation in Software". What are some of those and where do they come from?
Research into unintended consequences [03:43]
Courtney Nash: Yes. I’m going to flip your question and talk about where they come from and then talk about what some of them are. A really common through line for my work and for other people in this space, resilience engineering and learning from incidents, is that we’re really not the first to look at things through this lens. There have been a lot of researchers and technologists looking at incidents in other domains, particularly safety-critical domains, so things like aviation, healthcare, power plants, power grids, that type of thing. A lot of this came out of Three Mile Island.
I would say the modern discipline that we now know as resilience engineering, married with others that have been around even longer, like human factors research, really started looking at systems-level views of incidents, in this case pretty significant accidents, ones threatening the life and wellbeing of humans.
There were a lot of high-consequence, high-tempo scenarios, and a huge body of research already exists on them. And so what I was trying to do with a lot of the work I’m doing with The VOID is pull that information as a through line into what we’re doing. Some of this research is really evergreen; even though it’s now software systems and technology, there are a lot of commonalities with what folks have already learned in these other domains.
In particular, automated cockpits, automation in aviation environments, is where a lot of the inspiration for my work came from. And also, you may or may not have noticed that our industry is super excited about AI right now. I thought I’m not going to go fully tackle AI head on yet, because I think we still haven’t learned the things we could about automation, so I’m hoping to start back a little ways and work from first principles.
Some of that research really talks about literally what I called my talk: unintended consequences of automation. Some of this research into aviation and automated cockpits found that automating these human-computer environments had a lot of unexpected consequences. The people who designed those systems had specific outcomes in mind. And we have the same set of beliefs in the work that we do in the technology industry.
Humans are good at these things and computers are good at these things so why don’t we just assign the things that humans are good at to the humans and yada yada. This comes from an older concept from the ’50s called HABA-MABA (humans-are-better-at/machines-are-better-at) from a psychologist named Paul Fitts. If anyone’s ever heard of the Fitts list, that’s where this comes from.
Adding automation changes the nature of the work [06:15]
But that’s not actually how these kinds of systems work. You can’t just divide up the work that cleanly. It’s such a tempting notion. It feels good and it feels right, and it also means, oh, we can just give the crappy work, as it were, to the computers and that’ll free us up. But the nature of these kinds of systems, these complex distributed systems, you can’t slice and dice them. That’s not how they work. And so that’s not how we work in those systems with machines, but we design our tools and our systems and our automation still from that fundamental belief.
That’s where this myth comes from, and these unintended consequences. Some of the research we came across shows that adding automation into these systems actually changes the nature of human work. This is really the key one. It’s not that automation replaces work and we’re freed up to go off and do all of these other things; it actually changes the nature of the work that we have to do.
And on top of that, it makes it harder for us to impact an automated system when it’s not doing what it’s supposed to be doing, because we don’t actually have access to the internal machinations of what’s happening. You could apply this logic to AI, but you could also back it all the way up to just: what is your CI/CD doing? Or when you have auto-scaling across a fleet of Kubernetes pods and it’s not doing what you think it’s doing, you don’t actually have access to what it was doing, or should have been doing, or why it’s now doing what it’s doing.
It actually makes the work that humans have to do harder, and it changes the nature of the work they’re doing to interact with these systems. And just recently, some really modern research from Microsoft Research in Cambridge and Carnegie Mellon looked at this with AI, and at how it can degrade people’s critical thinking skills and abilities when you have AI in a system, depending on how much people trust it or not.
There’s some really nice modern research that I can also point to. With some of the older stuff, people are like, "Oh, it came out in 1983", and I’m like, "Yes, but it’s still actually right". Which is what’s crazy. We see these unintended consequences in software systems just constantly. I went into The VOID reports and really just read as many as I could that looked like they had some form of automation in them. We looked for things that included self-healing or auto-scaling or auto config. There are a lot of different things we looked for, but we found a lot of these unintended consequences, where software automation caused problems and then humans had to step in to figure that out.
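To make that search concrete, here is a minimal sketch, not The VOID’s actual tooling, of how one might flag public incident reports that mention automation-related terms like the ones above; the keyword patterns and the toy report data are assumptions purely for illustration.

```python
import re

# Hypothetical automation-related search terms, loosely based on the
# examples Courtney mentions (self-healing, auto-scaling, auto config);
# the actual list used for The VOID analysis is not given here.
AUTOMATION_TERMS = [
    r"self[- ]?heal\w*",
    r"auto[- ]?scal\w*",
    r"auto[- ]?config\w*",
    r"auto[- ]?remediat\w*",
]
PATTERN = re.compile("|".join(AUTOMATION_TERMS), re.IGNORECASE)

def mentions_automation(report_text: str) -> bool:
    """Return True if an incident report mentions any automation-related term."""
    return bool(PATTERN.search(report_text))

# Toy stand-ins for public incident reports: (report id, report text).
reports = [
    ("example-2021-03", "Auto-scaling added capacity faster than the database could absorb."),
    ("example-2022-11", "An operator rolled back a manual config change within ten minutes."),
]
flagged = [rid for rid, text in reports if mentions_automation(text)]
print(flagged)  # ['example-2021-03']
```

A keyword pass like this is only a starting point; as she describes, the reports still have to be read, since a match says nothing by itself about whether the automation contributed to the incident or merely appears in the write-up.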
The other unintended consequence is that sometimes automation makes it even harder to solve a problem than it would’ve been were it not involved in the system. The Facebook one is, I feel, one of the more well-known versions of that, where they literally couldn’t get into their own data center. Amazon had one like that for AWS in 2021 as well, where a resource exhaustion situation then wouldn’t allow them to actually access the logs to figure out what was going on.
The myth comes from this separation of human and computer duties. And then the kinds of unintended consequences we see are humans having to step into an environment they’re not familiar with to try to fix something when they don’t yet understand why or how it’s going wrong, and sometimes that thing actually making it even harder to do their job. All of these are the same phenomena we saw in the research from those other domains; it’s just that now we’re able to see them in our own software systems. That’s the very long-winded answer to your question.
Shane Hastie: If I think of our audience, the technical practitioners who are building these tools, building these automation products, what does this mean to them?
The impact on engineering [10:16]
Courtney Nash: This is a group I really like to talk to. I like to talk to the people who are building the tools, and then I like to talk to the people who think those tools are going to solve all their problems, not always the same people. A lot of the people who are building these are building them for their own teams; they’re cobbling together monitoring solutions and other things. It’s not even that they necessarily have some vendor product, although that is certainly increasingly a thing in this space. I was just talking to someone else about this. We have armies of user experience researchers out there, people whose job is to make sure that the consumer end of the things these companies build works for people, is intuitive, and does what they want. And we don’t really do that for our internal tools or for our developer tools.
And it is a unique skill set, I would say, to be able to do that. A lot of the time, I learned recently on another podcast, it tends to fall on the shoulders of staff engineers. Who’s making sure the internal tooling works? You may be so lucky as to have a platform team or something like that. But in particular, the more people can be aware of that myth, the HABA-MABA Fitts list, the better. I had this belief myself about automating things and automating computers. And just to preface this: I’m not anti-automation. I’m not saying don’t do it, it’s terrible, we should just go back to rocks and sticks. I’m a big fan of it in a lot of ways, but I’m a fan of it when the designers of it understand the potential for some of those unintended consequences.
Instead of thinking of replacing work that humans might do, think of augmenting that work. How do we make it easier for us to do these kinds of jobs? That might be writing code, that might be deploying it, that might be tackling incidents when they come up. The fancy, nerdy academic jargon for this is joint cognitive systems. Think in those terms instead of in terms of replacement, or functional allocation, another good nerdy academic term: we’ll give the machines this piece, we’ll give the humans those pieces.
How do we have a joint system where that automation is really supporting the work of the humans in this complex system? In particular, how do you allow them to troubleshoot it, to introspect it, to actually understand it? The very nerdy versions of this research even lay out possible ways of thinking about what these computers can do to help us. How can we help them help us? What does that joint cognitive system really look like?
And the bottom-line answer is that it’s more work for the designers of the automation, and that’s not always something you have the time or the luxury for. But if you can step out of the box of "I’m just going to replace the work you do", knowing that’s not really how it works, and into "how can these tools augment what our people are doing?", that’s what I think is important for those people.
And the next question people always ask me is, "Cool, who’s doing it?" My answer up until recently was, "Nobody". Record scratch. I wish. However, I have seen some work from Honeycomb, which is an observability tooling vendor, that is very much along these lines. I’m not paid by Honeycomb, and I’m not employed by or on staff at Honeycomb; this is me as an independent third party finally seeing this in the wild. I don’t know what that’s going to look like or how that’s going to play out, but I’m watching a company that makes tooling for engineers think about this and about how to do it. That gives me hope, and I hope it also empowers other people to see that Courtney is not just spouting off all this academic nonsense; it’s possible. It’s just definitely a very different way of approaching especially developer or SRE types of tooling.
Shane Hastie: My mind went to observability when you were describing that.
Courtney Nash: Yes.
Shane Hastie: What does it look like in practice? If I am one of those SREs in the organization, what do I do given an incident’s likely to happen, something’s going to go wrong? Is it just add in more logs and observability or what is it?
Practical application [14:40]
Courtney Nash: Yes and no. I think of course it’s always very annoyingly bespoke and contextually specific to a given organization and a given incident. But this is why the learning from incidents community is so entwined with all of this. Instead of looking for just technical action-item fixes out of your incidents, you’re looking at what you learned about why people made the decisions they made at the time, another nerdy research concept called local rationality. If you go back and look at these incidents from the perspective of trying to learn from them, you’re learning not just about what technically happened, but about what happened socio-technically with your teams: were there pressures from other parts of the organization?
All of these things matter. I would say SREs investing in learning from incidents are going to figure out, first, how to better support those people when things go wrong. What couldn’t we get access to, or what information didn’t we have at the time? What made it harder to solve this problem? But also, what did people do when that happened that made things work better? Did they work around tools, and what did that look like? What didn’t they know? What couldn’t they know that our tooling could perhaps have told them?
And so that’s why I think you see so many learning from incidents people and so many resilience engineering people talking around this topic: I can’t just come to you and say, "You should do X", because I have no idea how your team is structured or what the economic and temporal pressures on that team are. The local context is so important. The people who build those systems, and the people who then have to manage them when they go wrong, are going to be able to figure out what the systemic things going on are, especially if it’s a lack of access to what X, Y, or Z was doing. Go back and look at what made it hard for people, and also at what natural adaptations they themselves took on to make it work or to solve the problem.
And again, it’s like product management and it’s like user experience: you’re not going to just silver-bullet this problem. You’re going to be fine-tuning and figuring out what it is that can give you that control or visibility or what have you. There is no product out there that does that for you. Sorry, product people. That’s the reason investing in learning from their incidents is going to help them the most, I would, biasedly, offer.
Shane Hastie: We’re talking in the realm of socio-technical systems. Where does the socio come in? What are the human elements here?
The human aspects [17:14]
Courtney Nash: Well, we built these systems. Let’s just start with that. It’s the same premise as designing automation: we design all kinds of things for all kinds of outcomes and aren’t prepared for all of the unexpected outcomes. The human element, for me, in this particular context, is that software is built by people and software is maintained by people. The through line from all of this other research I’ve brought up is that if you want to have a resilient or a reliable organization, the people are the source of that. You can’t engineer five nines, you can’t slap reliability on stuff. It is people who make our systems work on a day-to-day basis. And we are, I would argue, actively working against that truth as an industry right now.
There’s a lot of socio in complex systems, but for me, that’s the nut of it. The real crux of the situation is that we are largely either unaware of or unwilling to look closely at how important people are to keeping things running and building and moving. If you take these ironies or unexpected consequences of automation and scale them up in the way we are currently contemplating with AI, we have a real problem, I believe, with the maintainability, the reliability, the resilience of our systems.
And it won’t be apparent immediately. It won’t be, oh shoot, that was bad, we’ll just roll that back. That’s not the case. I’m seeing this when talking to people about interviewing junior engineers. There is a base of knowledge that humans build up from direct contact with these systems that automated systems can’t have yet; that’s certainly not the world we live in, despite all the hype we might be told. I am most worried about the erosion of expertise in these complex systems. For me, that’s the most important part of the socio part of the socio-technical system, other than how we treat people. And those are related too, I’d argue.
Shane Hastie: If I’m a technical leader in an organization, what do I do? How do I make sure we don’t fall into that trap?
Listen to your people [19:36]
Courtney Nash: Listen to your people. You’re going to have an immense amount of pressure to bring AI into your systems. Some of it is very real and warranted, and you’re not going to be able to ignore it. You’re not going to be able to put a lid on it and set it aside. Faced with probably a lot of pressure to bring in AI and more automation, those types of things, I think the most important thing for leaders to do is listen to the people who are using those tools, who are being asked to bring them into their work and their workflow. Also find the people who seem to be wizards at it already. Why are some people really good at this? Tap into that. Try to figure out where those sources of expertise and knowledge with these new ways of working are coming from.
And again, I ask people all the time: let’s say you work at a company that produces something. You may work for a big distributed systems company, but it’s still a product company, a Netflix or an Apple or whatever. "Do you A/B test stuff before you release it? Why don’t you do that with new stuff on your engineering side?" Think about how much planning and effort goes into a migration or moving from one technology to another.
We could go monolith to microservices, we could go pick your digital transformation. How long did that take you? And how much care did you put into it? Maybe some of it was too long or too bureaucratic or what have you, but I would argue that we tend to YOLO internal developer technology way faster and way looser than we do the things that, as the perception goes, actually make us money.
The more that leaders of technical teams can listen to their people and roll things out deliberately, the better. How are you going to decide what success looks like? Integrating AI tools into your team, for example, what does that look like? Could you lay down some ground rules for it? And if that’s not happening in two months or three months or four months, what do your people think you should be doing? I feel like it’s the same age-old argument about developer experience, but I think the stakes are a little higher because we’re rushing so fast into this.
Technical leaders, listen to your people, use the same tactics you use for rolling out other high-stakes, high-consequence things, and don’t just hope it works. Have some ground rules for what that should look like, and be willing to reevaluate and rethink how you approach it. But I’m not a technical leader, so they might balk at that advice. And I understand that.
Shane Hastie: If I can swing back to The VOID, to this repository that you’ve built up over years. You identified some of the unintended consequences of automation as something that’s coming up. Are there other trends that you can see or point us towards that you’ve seen in that data?
Trends from the VOID data [22:31]
Courtney Nash: Some of the earliest work I did was really trying to myth-bust some things that I had always had a hunch were not helping us, and were in fact hurting us, as an industry, but I didn’t have the data for it. The canonical one is MTTR. I wouldn’t call it a trend, except in that everybody’s doing it. But we used the data we have in The VOID to show that things like duration or severity of incidents are extremely volatile and not terribly statistically reliable, to try to give teams ammunition against ideas that I think are actually harmful. They can have pretty gnarly consequences in terms of the way metrics are assigned to team performance, the incentivization of really weird behaviors, and things that on the whole aren’t helping people manage very complex, high-stakes environments.
I’ve long thought that MTTR was problematic, but once I got my hands on the data, and I have a strong background in statistics, I was able to demonstrate that it’s not really a very useful metric. It’s still widely used in the industry, though. I would say it’s an uphill battle that I have definitely not, I don’t even want to say won, because I don’t see it that way. But I do believe we have some really unique data to counteract a lot of these common beliefs, and to show things like the fact that severity actually is not correlated with duration.
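To illustrate the statistical point, here is a minimal simulation sketch, not The VOID’s actual analysis: it assumes incident durations follow a skewed, heavy-tailed distribution (a lognormal with made-up parameters, purely for illustration), groups them into quarters, and computes the "MTTR" a dashboard would show. The mean swings from quarter to quarter even though nothing about the underlying system has changed.

```python
import random
import statistics

random.seed(7)

def simulated_quarter(n_incidents: int = 40) -> list[float]:
    """Incident durations in minutes, drawn from a heavy-tailed lognormal
    distribution -- an illustrative assumption, not data from The VOID."""
    return [random.lognormvariate(3.5, 1.2) for _ in range(n_incidents)]

for quarter in range(1, 5):
    durations = simulated_quarter()
    mttr = statistics.mean(durations)      # the "MTTR" a dashboard would report
    median = statistics.median(durations)  # more stable, but still a blunt summary
    print(f"Q{quarter}: MTTR = {mttr:6.1f} min, median = {median:6.1f} min")

# The mean bounces around between quarters purely from sampling noise in
# the long tail of durations, so treating quarter-over-quarter MTTR as a
# team performance signal invites the perverse incentives described here.
```

Changing the seed changes which quarters look "good" or "bad", which is the volatility and statistical unreliability described above.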
There are a lot of arguments on teams about how we should assign severity and what severity needs to be. And again, it’s these Goodhart’s law things: the second you make it a metric, it becomes a target, and then all these perverse behaviors come out of that. Those are some of the past things that we’ve done.
I would say the one trend that I haven’t chased yet, or that I don’t have the data for in any way yet, is that I really do think companies that invest in learning from their incidents have some form of competitive advantage.
Again, this is a huge hunch. It’s a lot like, I think, where Dr. Nicole Forsgren was in the early days of DevOps and the DORA work: they had these theories about organizational performance and developer efficiency and performance, and they collected a huge amount of data over time to test those theories. I really do believe that there is a competitive advantage for organizations that invest in learning from their incidents, because it gets at all these things that we’ve been talking about. But like I said, if you want to talk trends, I think that’s one, but I don’t have the data for it yet.
Shane Hastie: You’re telling me a lot of really powerful interesting stuff here. If people want to continue the conversation, where do they find you?
Courtney Nash: Thevoid.community, which is quite possibly the weirdest URL, but domain names are hard these days. That is the easiest way to find all of my past research. There are links to a podcast and a newsletter there. I’m also on all the social things, obviously, and speaking at a few events this year. Generally that’s the best spot. I post a lot on LinkedIn, I will say, and I’m surprised by that. I didn’t use to be much of a LinkedIn person, but I’ve actually found that the community discussing these topics there is very lively. If you’re looking for any current commentary, I would say, strangely, and I can’t believe I’m saying this, that The VOID on LinkedIn is probably the best place to find us.
Shane Hastie: You also mentioned, when we were talking earlier, an online community for resilience engineering. Tell us a little bit about that.
Courtney Nash: There’ve been a few fits and starts to try to make this happen within the tech industry. There is a Resilience Engineering Association. Again, the notion of resilience engineering long precedes us as technology and software folks. That organization exists, but recently a group of folks have put together a Resilience in Software Foundation and there’s a Slack group that’s associated with that.
There are a few things emerging that are specific to our industry, which I really appreciate, because sometimes it is really hard to go read all this other wonky research, and then you’ve asked these questions even just today in this podcast: okay, but as an SRE manager, what does that mean for me? There’s definitely some community starting to build around that and around resilience in software, which The VOID has been involved with as well. I think it’s going to be a great resource for the tech community.
Shane Hastie: Thank you so much.