Transcript
Michael Stiefel: Welcome to The Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job. Today’s guest is Justin Sheehy, who has spent most of his career thinking about how to make systems, composed of both computers and humans, safer and more resilient. He’s currently chief architect for Infrastructure Engineering Operations at Akamai. Previously, he’s played roles including chief technologist for Cloud Native Storage at VMware, CTO at Basho, principal scientist at MITRE, distinguished systems engineer in multiple places, senior architect, and more. He’s equally happy writing code, doing research or leading people.
It’s great to have you here on the podcast. And I’d like to start out by asking you, were you trained as an architect? How did you become an architect? It’s not something where you woke up one morning and said, “Today I’m going to be an architect”.
How Did You Become An Architect? [01:26]
Justin Sheehy: Thank you for that kind introduction, Michael. So I certainly wasn’t trained as an architect. I think that title as we use it in the software industry is not one with much of a training apparatus in most places. And it also was a much less common title in software-related fields, of course, I’m not talking about building architecture here, when I first ended up with it, which was a little over 25 years ago. So I think the story there is that in early 2000, I was in my first time around at Akamai. So I’m back there now, I’ve done some things in between. But at that point, my career until then had been mostly in distributed systems engineering in the way that it existed in the 90s.
And then the type and scale of systems that we were trying to build in and around 2000 at Akamai were kind of unknown, in that even though some parts of it may have been solved by a couple of other companies already, those weren’t yet things with a lot of practical literature, broad understanding, they were new even if you weren’t really the first one trying to solve some of those things. And we knew working on this that if we succeeded at what we were trying to build, we were going to have to live with those systems coming under much heavier load and much greater scale than we could test or simulate, right? You don’t get to have a second internet in that way.
Michael Stiefel: There’s no production internet and a separate test internet.
Justin Sheehy: Exactly. There’s a lot you can do, but there’s some that you can’t. So we knew that a lot of our ability to keep succeeding came down to the architecture of those systems. Meaning that they could handle that load even though they hadn’t experienced it before, and, I think just as importantly to that architecture, they could be improved in place in ways that wouldn’t be too disruptive. I think that’s an important architectural element too. And so those of us there that ended up responsible not just for shipping code but also actively managing those kinds of constraints and communicating with our peers ended up being called architects. So that’s how I first ended up with that kind of a title.
And as you said in the intro, I’ve had things adjacent to that title off and on for the 25 years since then, but I also don’t think it’s what I am, right? To me, it’s a job, it’s a good job. It’s what I’m doing now, it’s what I’ve done before, it’s one I might do again, but I don’t treat it as an identity so much as I treat it as a role of the moment.
How To Change Software Safely [03:46]
Michael Stiefel: That’s very interesting. I’ve talked to you before and listened to several of your talks and you’ve talked about building defense in depth, how to make safe changes, how the effects of undetected bugs can be minimized. Can you explain a little more about this? And how does this relate to being a responsible developer or architect?
The Relationship between Incidents and Pushing Out Changes [04:07]
Justin Sheehy: Sure. So one of the things you might be referring to there when you talk about making changes is an article that my colleague and friend Jon Reed and I wrote for ACM Queue. And part of what prompted that was that we were seeing, both in our own current work but also at many enterprises where our friends, customers, partners, whoever, worked, that pushing out a change was the triggering event that led to many outages and incidents. Now, I want to be really, really clear before we go farther down this, that most systems need frequent changes. And it’s not very visible how many outages and incidents are prevented or stopped by pushing a change. So this isn’t at all about blaming people or tech change for incidents, and it’s not about making fewer changes or making them more slowly.
But what we were trying to do is recognize that, and we believe this is true, a high proportion of incidents are at least in part triggered by changes. You’ll hear me use words like triggered or things like that because we could do a whole other podcast on the fact that the phrase root cause is both never useful and in fact slightly destructive. Changes are not root causes, nothing is in an interesting incident, but they might be part of the triggering chain of events.
Michael Stiefel: The proximate cause, if you wish.
Justin Sheehy: Yes, absolutely. That’s a fair term. A proximate cause, but that’s very different from being the cause. And if you go and read the incident reports that are published by companies that publish them, which most don’t most of the time, but some do, I think you’ll see that many major incidents, certainly not all of them, do in fact have pushing out a change as a proximate cause. So if we recognize that that’s true, what we decided, and this is where architecture starts to come into it, is that it’s worth taking a principled approach to thinking about how we can make that connection happen less often or less severely, right? So that we can make changes with less worry.
Michael Stiefel: So I see that’s where sort of the architecture comes in because you’re trying to look to the future to see if you can make things resilient to failure rather than fragile by default.
Justin Sheehy: Yes. In fact, our whole approach was to in fact assume that things are going to go wrong when you make changes just like any other time.
Michael Stiefel: It’s like life.
Justin Sheehy: Yes, exactly. And so when I said “we took a principled approach” a few minutes ago, what I mean there is that not only were we not trying to make systems that can’t have failure, that’s not a very effective goal to have, we also weren’t inventing any new concepts or new techniques. In fact, I think many people when they read the article that we wrote about this will look at it and think that a lot of it is either boring or well-known or obvious. And I think that’s great. I’m happy about that. The more people feel that way, the better state we’re in. But I also would say to those people that the parts of it that you think are well-known or reliably followed locally are not likely to actually be the exact same ones in another environment. So they’re not the same in terms of being uniformly understood and applied.
So what we were hoping to do was to provide a very simple lightweight framework and some principles so people could figure out where in that space they have room for improvement when it comes to change safety, so that they could have a shared method for thinking about how they would focus their efforts if they want to make those things better or safer.
The Importance of a Shared Language for Discussing Incidents [07:42]
Michael Stiefel: Several things come to mind when you say that, but first of all, in my consulting career, one of the things I’ve learned is that what’s obvious to one person is not obvious to others. And if you think something’s important, you should say it, because you would be surprised how many people just don’t realize it. And the other thing, what I hear you saying, and tell me if you think I’m incorrect about this, is that one of the things you’re trying to do is also to come up with a common language. So that when people talk about these things and someone says something, everyone knows exactly what they mean.
Justin Sheehy: That’s exactly right. In fact, there’s a funny moment within the past year when there was a very major public outage that had nothing to do with my employer, and two of my previous colleagues who had seen the work that Jon and I wrote about long before we wrote an article about it, messaged me in that language and said, “Oh, if only they’d done more of this principle and this principle, they might’ve actually had a better time and they might not have been in the news”. And it’s being able to have that shared language, where we were able to have a 30-second conversation, where we both knew what we were talking about.
Michael Stiefel: And I think that’s important for when architects talk to DevOps or SRE talks to DevOps or SRE feeds back to the architect how things could be made better. If you speak the same language, it’s much more effective.
Justin Sheehy: Yes. In fact, I love that you brought up those different elements of organizational application and culture, because that was part of how we built this. Over the past, I guess, six or seven years at least, I and a couple of my colleagues spent time working with our other colleagues in all of those roles, especially when investigating things after incidents, but not only then, trying to figure out how people talked about what they did and boiling that down into, “Wait, are there ways that we can take what these SREs over here said, and what this person who’s an engineer but not an SRE said, and this other person who’s an incident responder perhaps, and create a way that they’ll all agree that when I say this one phrase, it’s referring to what they do?” And that’s one of those things where it’s not very visible how much work it is, but it really pays off in the end.
So in fact, I would say that the heart of what we did is something that looks very, very simple when we’re done, and that was the goal of it. If you go and look at that article, we wrote six very simple things, we called them change safety principles, right? And it’s this list of just six things. And again, like I said before, most people reading them would be like, “Yes, of course you do that. These aren’t interesting”. Right? But it was about having those phrases to identify that category of work for different communities.
Observability vs. Malleability [10:33]
Michael Stiefel: And one of the things that struck me about it is this distinction between observability and malleability. It’s not enough to be able to know that you’re sinking. The Titanic knew it was sinking. The Titanic, to use your words, was not malleable enough to really do anything about it.
Justin Sheehy: Yes, that’s absolutely true, right? And so part of this is figuring out which of those pieces you have to a point that you’re happy and comfortable with, so that you can look at the others and focus on them, right? You can imagine, to take the metaphor you just used and wildly mangle it, right? If the Titanic in that model was in fact extremely malleable, some sort of ship that we don’t have today that could change itself in the midst of an accident, but was missing the observability to know that that was happening, it still would’ve been a problem, right? So it’s not that one of these things or any of these things is more important or supersedes the others. It’s an attempt to say, “Look, you need to get to the right level for all of these for your systems”. And I want to talk about right level in a second because there’s not a generic answer to that.
Michael Stiefel: So I would like to get into that issue of right level in a moment, but I think what you’re saying is important because there’s been a lot of emphasis in the past on observability, telemetry. How do we figure out what’s happening or what happened? But if you have not done the right things or you’re not malleable enough or you don’t know when’s the best time to make a change, what constitutes a large change? What’s a small change? You mentioned before the distinction between gradual and slow. All these things have to be understood by the organization.
Justin Sheehy: They have to be understood. And what I was really pointing at by making that note about right level is that how much of each of those elements of making your system safer when you make changes is necessary is not the same from system to system, right? So there are some things for which outages have different consequences, for instance, or some things where even if outages have a little consequence, your ability to get back to a good state is enormously good, so you may be able to do less of some of the other things just out of comfort that, to use the earlier phrasing, “Okay, we’re so malleable that as long as we’re good enough at some of these other things, we don’t have to invest too hard”. And so that balancing act is still required with this framework.
Michael Stiefel: What you’re saying, if I hear you correctly, is the refrain that comes up over and over again: software is not independent of the business. You have to understand your business. I remember when cloud computing first became popular, I took my electric bill, did some investigation, and did a calculation showing that the electricity coming into my house was not 100% reliable. It was 99.97% reliable, and that was good enough, especially if I had backup systems.
Justin Sheehy: Exactly. And for a great number of systems, trying to shoot for, and this isn’t really… I don’t want to focus too hard on sort of availability and numbers of nines and things like that. That’s its own topic that you can go into a long hole on. But it is worth noticing that, as you said, for some things, 99% is probably great, right? For many things it is. Now, for an international bank, it might not be because you may have transactions flowing continuously 24 hours a day and actually have a financial loss for every second of outage. So you may actually find it worth investing even beyond that point of diminishing returns that some other system might need.
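To make those availability numbers a little more tangible, here is a rough back-of-envelope sketch in Python. The targets listed are common illustrative values, not figures quoted by either speaker apart from the 99.97% in the electric bill example, and the arithmetic is the whole point.

```python
# Rough downtime budgets implied by some common availability targets.
# Purely illustrative arithmetic; picking the right target is a business decision.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.99, 0.999, 0.9997, 0.9999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} available ~ {downtime_hours:.1f} hours of downtime per year")
```

At 99.97%, the figure from the electric bill example, that works out to roughly two and a half hours of outage per year, which is why backup systems made it good enough; an international bank losing money for every second of downtime may reasonably pay for more.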
Michael Stiefel: Right. I think you hit the nail on the head when you said that if you spend money on getting nines that you don’t need, that’s less money for giving customers what they want or making your systems more reliable. No one has infinite money. And I did some consulting with a manufacturer of ATMs. And even ATMs are not perfect. And again, now we’re getting a little specific, but sometimes it’s better for the ATM to give you your money even though it can’t get back to the main system. This is a business trade-off.
Everything is Related to a Business Decision [15:07]
Justin Sheehy: Absolutely. And those ATMs that did that were often programmed with the particular amount of money that they were willing to do that with until they could get back online. So that business had clearly made a choice like, “This is the maximum risk we’re willing to overtly take in this situation”. And that’s great. I love that you keep bringing that up because over the years, one of the things that’s often frustrated me the most is when someone in a relatively technical role, whether that’s engineering, operations, or something like that, uses a phrase like “Well, that’s a business decision” to frame it as something that’s just a law of nature, out of their hands. And I would maintain not only that that’s not true, but that you can turn that all the way around: every single engineering and operations decision is a business decision, right? An engineer writing software for their employer makes hundreds of business decisions every day. They’re not actually a separate thing.
Michael Stiefel: When you write an if statement and say, “this branch, this branch, or this branch”, that’s a business decision.
Justin Sheehy: Absolutely. And it gets even more obvious when you start talking… What you said is true, but it makes it even more obvious for folks when you add, say for instance, how long some piece of your code is going to wait before it times out doing something. That is very much a decision that is relevant to the rest of the business. And so I don’t believe those things are separable. And that’s what I mean when I talk about most of our systems being composed of our software, our people and everything else. They’re not really independent of each other.
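As a small, purely hypothetical sketch of the point being made here: every constant and branch below is invented for illustration, but each one quietly encodes a business decision, not just a technical one.

```python
import requests  # third-party HTTP client, assumed installed; any client makes the same point

# Hypothetical refund handler. Each constant below is a business choice baked into code.
AUTO_APPROVE_LIMIT = 50.00        # refunds under this amount skip manual review
PAYMENT_API_TIMEOUT_SECONDS = 3   # how long we make the customer wait before giving up

def process_refund(amount: float, order_id: str) -> str:
    if amount <= AUTO_APPROVE_LIMIT:
        # Business decision: small refunds favor customer experience over fraud risk.
        decision = "auto-approved"
    else:
        # Business decision: larger refunds trade speed for human oversight.
        decision = "queued for manual review"

    try:
        # Business decision: after 3 seconds we would rather fail fast than block checkout.
        requests.post(
            "https://payments.example.com/refunds",  # placeholder endpoint
            json={"order": order_id, "amount": amount, "decision": decision},
            timeout=PAYMENT_API_TIMEOUT_SECONDS,
        )
    except requests.Timeout:
        decision = "deferred: payment provider too slow"
    return decision
```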
Michael Stiefel: Which says that it’s also somebody’s responsibility, and usually the architect is the best one to do this, to make that clear to the business people. They can’t dump it. Just like, as you said in the example, someone says, “Everything’s a business decision”, the business people can’t say, “Oh, that’s just technology and they can fix that”.
Justin Sheehy: Yes, absolutely. And I think that’s often helped by having some of the people, at least, in any given environment, making what people think of as business decisions be ones that are at least connected enough to the technology implementation going on that they can say like, “Wait, let’s make sure that we’re making these decisions in harmony with each other”.
Autonomous Agents and Software Change Management [17:24]
Michael Stiefel: So this raises an interesting question, and I think this is maybe a little bit of a curveball, but we’re seeing, increasingly, AI being put into the business world. And especially, we’re starting to have autonomous agents, where the state may not be known because they’re learning. Do you have any thoughts about how this affects the malleability, the observability or your ability to measure the change that you make?
Justin Sheehy: Oh, jeez. So there are three or four interesting questions in there. One thing that immediately comes to mind, and I’m going to temper it afterwards, is an old quote that I’m probably going to misquote and fail to attribute, but that you just reminded me of, which is that “a computer can never be held accountable. Therefore, it is never acceptable to allow a computer to make a business decision”. So that’s a framing that matters here, right? When you were talking a minute ago about the developer, whether they’re writing an if statement or a timeout or something, making those decisions. The reason that’s an interesting sentence is that it means we can talk to that person about why that choice was made and inform why they make the next choice, which is not actually a conversation that you can have in a grounded, fact-based way with an AI tool, at least not any of the ones that exist today.
Artificial Intelligence and Accountability [18:43]
Now, I think a lot of the AI systems people are starting to use are very valuable when used to augment certain human capabilities, much like calculators and things like that, but for different classes of tools. But we have to be careful when doing those kinds of things not to move the decision-making into a place where not only is there no accountability, but that means there’s no correctability, no ability to understand why something is the way it is and therefore change the why and change the next decision. I think we could talk for hours about just that topic because there’s an old phrase, and again, you’ve given me these curveballs before I thought about them, so I don’t have the citation ready at hand, but it’s from the larger field of cybernetics: every augmentation is an amputation.
And so something we have to be careful of is that we only amputate as it were the things we’re willing to not have full human thoughtful control over. So I think that there’s some very powerful augmentation possible with some of these tools, but taking decision-making capability and moving it into a place where there’s no real shared understanding is something that worries me.
Michael Stiefel: So when you say amputation, it’s an interesting phrase. Are you saying something about what we should put around the agents? Because it’s really the agents that strike me the most. With the LLMs, using those tools to write code, we as developers or programmers can understand when something doesn’t look right.
Justin Sheehy: I’m going to catch you up. Even there, I think it’s tricky. I think we can, but will we? That is a different question, because the primary design goal of the LLM-based tooling here, and now I’m not talking about other things either in developer tooling or the larger AI field, but the primary design goal of LLM-based tooling, and this is what Google says, this is what OpenAI says, I’m not the one coming up with this, is to create plausible text. Which means if it looks right to you, the AI is successful. So there’s a little bit of an almost frenemies or accidental arms race going on there, where, when it comes to your ability to see that something looks right, it’s actually making you think it looks right. That is the goal of these tools. So that’s difficult.
Michael Stiefel: Where it sort of feeds back your own biases to you and you really don’t get independent thought. There’s a mirroring going on.
Artificial Intelligence and the Loss of Software Understandability [21:13]
Justin Sheehy: Exactly. Now, when you asked about amputation, right? The metaphor is obvious, right? If I were to get an artificial hand, I have to lose a hand to do that. But for instance, if we take all of our work of a given kind with code, all of it, and we start offloading that to something, and here I don’t care what kind of tool it is, doesn’t matter if it’s AI or something else, then we are, due to lack of practice, going to see that skill atrophy over time. And someone who starts out their career offloading a whole category of work onto any kind of tooling is never going to fully build the skills that are hidden in that tooling.
Michael Stiefel: Just a simple example is compilers. How many people understand machine code?
Justin Sheehy: Right. I’ve written a few compilers. I might be the wrong sample set for you there. But to be clear, I’m not saying this to say let’s not automate things, let’s not use tools. I’m one of the biggest fans of automation you’ll find. It’s about being careful about which things we really think are going to be done so much better, so much more reliably, forever, by a given tool set that we’re willing to have humans not have to be too skilled at them anymore, right? And part of that means the tooling has to have certain qualities, right?
For instance, now it’s not that interesting for people to memorize giant multiplication tables anymore because everyone’s got a calculator all the time and we know that our pocket calculators, or now our phones with a calculator app, are going to multiply as well as or better than we can. So that’s a skill where it’s worth people learning a little bit about how it works, but it’s okay if people don’t build that skill to a huge degree because they have tooling that stands in for it.
Michael Stiefel: Provided they understand that when they make a data entry… When I played with slide rules, you had to know where that decimal point was.
Justin Sheehy: Absolutely.
Michael Stiefel: I remember when compilers… People still worried about compilers making… I remember one time early in my career there was a problem because a compiler put a machine instruction crossing a page boundary. We’re going back to things that maybe people don’t remember or don’t know, but the point is, to your point, we’ve gotten to the point with the technology, with compilers, with debuggers, the linker, even file systems. I’m old enough to have played with JCL, where you manipulated things on disk. There was no file system. So to your point, we have gotten tools good enough that people don’t have to worry about what’s going on underneath.
Justin Sheehy: Right. In computing, a file system is a great example there. Even though they will get some things wrong some of the time. People that have done things like written databases and storage engines and all that don’t necessarily trust file systems very much, but we still probably trust them better than we trust even a really good programmer to handle durable storage without a file system.
Michael Stiefel: Correct. And you tend to know what kind of mistakes to expect from the file system. You have dirty reads and synchronous writes. There’s a whole class of things that you know about. The thing is with AI agents, we don’t know these things yet.
AI’s Best Guess vs. Admitting What is Not Known in Tooling [24:28]
Justin Sheehy: Right. The systems people are using now that are LLM-based and therefore, probabilistic word generators and things like that, it’s really hard to know the things. It doesn’t make those things not useful, but it means you have to think about which categories of work you’re comfortable offloading to them in what ways.
Michael Stiefel: And perhaps what that means, and again I’m thinking of AI agent style, is perhaps there are boundaries that say, “Under every circumstance you can’t do X”.
Justin Sheehy: Right. Well, so actually, you also bring up to me one of the big shifts that we’ve seen due to LLMs, and we’re way off track from change safety, but that’s okay, this is fun, which is people are using them now for human language translation. And here I mean languages like Japanese or Irish, not languages like C++ or Rust or something. And a lot of their shared sort of history and heritage from a software point of view comes from work on language translation, but they’re not the same systems. To me, there’s a really fascinating difference between most translation software that we were using, say, three years ago, and most of the things people are using when they ask an LLM to translate a body of text. The LLM is probably in many cases faster, has access to more data, and can translate some things that the old translation software couldn’t for some reason.
But there was this really interesting feature that was fundamental to translation software before, that these don’t have, which is that if it got to a piece of text that it couldn’t translate, the output would be clearly marked as untranslatable in some way by the software. How it was marked depends on your system, but you would often have little bits of, “I don’t know how to translate this”. An LLM can never say that. So you’ll always get something.
Sometimes it’ll be right. And so which of those situations you’d rather be in is something I wonder if people even think about when deciding what to use. Would you rather have something that produces pretty good-looking output in the language you’re translating to that’s probably mostly right? Or would you rather have something that might not be quite as slick, that might have a couple of pieces in it that aren’t successfully translated, but that doesn’t have anything in it of a probabilistic nature? In the sense that it’s either able to translate it based on known structure and grammar and vocabulary, or it’s not.
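A toy sketch of that difference, with a three-word vocabulary invented purely for illustration: a dictionary- or rule-based translator can mark exactly what it could not translate, while an LLM-style generator will always produce fluent output with no such marker.

```python
# Toy dictionary-based "translator": anything outside its known vocabulary
# is explicitly flagged rather than guessed at.
KNOWN_VOCABULARY = {"hello": "bonjour", "world": "monde", "cat": "chat"}  # invented

def translate_with_markers(text: str) -> str:
    out = []
    for word in text.lower().split():
        if word in KNOWN_VOCABULARY:
            out.append(KNOWN_VOCABULARY[word])
        else:
            # The older behavior described above: admit what you don't know.
            out.append(f"[UNTRANSLATABLE: {word}]")
    return " ".join(out)

print(translate_with_markers("hello strange world"))
# -> bonjour [UNTRANSLATABLE: strange] monde
# A plausible-text generator would instead always emit fluent output,
# with no built-in marker separating solid translation from best guess.
```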
AI’s Best Guess vs. Admitting What is Not Known in Production [26:52]
Michael Stiefel: The flip answer to that question is it depends. But the reason why I brought this up is because to bring it back to change management is that when you introduce these kinds of systems in not necessarily tooling but actual production software, how do you manage the changes with something that’s intrinsically unpredictable?
Justin Sheehy: Yes. So I would say that’s a great question. And so putting that in that context, one thing that a few people have pointed out to us, which I’m glad they noticed, but I don’t take it as the criticism it’s sometimes phrased as, is that of those six principles that we put in the article, two of them, so fully 1/3 of the framework, can basically be read as: write down what you’re doing. It is that simple. It’s documenting and defining how you change things and each plan that you have, so that your planning and reasoning and understanding is shared with your colleagues. And so one of the things that’s important there is that what you’re documenting isn’t just sort of repeatable processes, it embeds in it those decisions made by people. And so you can know which things someone meant to do and when and hopefully why. And that helps when you’re trying to figure out: did something even go wrong? Why did it go wrong? Did someone just make some change that’s affecting what I’m looking at?
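As one hedged illustration of the “write down what you’re doing” idea, a team might capture each change as a small structured record alongside the change itself. The fields here are hypothetical, not taken from the six principles in the article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChangeRecord:
    """One documented change: what was done, by whom, why, and how to undo it."""
    summary: str        # what is changing
    author: str         # who decided to make the change
    reason: str         # why, for the colleagues and incident responders who come after
    rollback_plan: str  # how to get back to a known-good state
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example entry that an incident responder could later correlate with an alert.
record = ChangeRecord(
    summary="Raise cache TTL from 30s to 120s on the edge tier",
    author="jdoe",
    reason="Reduce origin load during the sales event",
    rollback_plan="Revert the config change and redeploy the previous version",
)
print(record)
```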
Why is Documenting Plans and Reasons Difficult? [28:13]
Michael Stiefel: Why is it so difficult for organizations to do?
Justin Sheehy: Oh, jeez, it is. I have asked that question 1,000 times and it’s really strange to me because I consider it so fundamental. First off, from a reader, from a user of those things, having the details of how things work, why someone did something, what’s happening when, that’s just so valuable all the time. Doesn’t matter whether you’re trying to modify some code someone else wrote, whether you’re trying to, again, diagnose an incident, something else like that, but I would go farther than that. I think it is basic professionalism. I think of communication, written, thoughtful, prose communication as a core necessary component of both architecture and engineering, however you define it. If you are not writing down what you do and how and why in a way that’s findable and readable and useful to your colleagues and the people that come after you, in my opinion, you shouldn’t be calling yourself either an engineer or an architect. It is one of the core fundamentals.
Now that said, to come to the question you actually asked, which is why is it so hard for organizations to do this? I think one of the tricky things is that for all of that work, not just to create, but to curate and manage all of that human language, all of that writing down what we do, the payoff to the organization is both indirect and not immediate. So when someone’s thinking in terms of short-term investment, where do I spend my dollars? Let’s say I’m a vice president at some other company deciding which people to fund to do what. It’s a lot easier to draw a straight, clear line from “have someone write and ship more code no matter what” to some value on the revenue-generating element of the business or the cost-savings side. It’s less immediate and less direct when you put that investment into knowledge curation and creation and all those things.
And I’ve seen even people who absolutely do value those things have a harder time, especially in an executive role, justifying making those the priorities, which I think they weren’t in most organizations. And here I’m not even talking about any particular company, I’ve seen this over and over again throughout my whole career. And it’s something I struggle with, because the times I’ve seen an organization invest in this for real, I’ve seen it pay off enormously. And when I say invest, I don’t even mean big software companies.
I’ve seen a 40 or 50 person startup decide that this was a priority, and then all of a sudden the next person hired was productive in a week instead of a month. And all of a sudden someone switching projects and joining someone else’s effort was able to dive right in and not treat everything as mysterious legacy, but as something where they have people working with them, even if those people don’t work there anymore, because they have their thoughtful words to build on. And that’s why literally 1/3 of our change safety principles are, “Hey, write it down”.
The Target Audience for Code is the Next Person, Not the Compiler [31:17]
Michael Stiefel: So this is interesting from two perspectives. One, when I got started many, many years ago in the software business, it was made clear to me, not so much by employees as by my reading, that software is read more than it’s written. And the generalization of that is that people use what you do or look at what you do more than you write it. So they should be the audience, not your own ease of use or, in those days, the fancy C statements you could write.
Justin Sheehy: I agree with that, that our primary audience of our software is not actually the compiler or interpreter, it’s the next person that comes along. But I would go even farther than that. And again, you’re sending me off in directions that I don’t have the citations for handy. I want to clarify that this next bit is not my idea, but I can’t remember whose it is off the top of my head, but that the primary responsibility and purpose of a team that owns a given piece of software, and here, it doesn’t matter what form that is, whether it’s operating as a service or whatever, and we can be including SREs in this or not depending on the kind of organization, but the primary purpose of a team involved heavily in software is the creation and constant maintenance of a shared model of that system and how it works and how it’s supposed to work and why.
Notice that I did not say creating the actual software. That is important. Doing that well is something you do at the same time as creating that shared model between your colleagues, but the shared model is the thing that makes you really successful as a software team over time. And by the way, this is true even when that team is one person, it doesn’t matter if it’s one person or 50, because I’ve come back to things I’ve written a long time ago. And the ones where I valued that in the creation have been things I could come back to. And the ones where I haven’t, maybe I can’t.
Michael Stiefel: I’ve had the situation where I’ve come back to code that I wrote and I was not clear about it. I thought I had made a mistake, and started making changes, and then I realized I was right the first time. It’s just that I didn’t explain why I did what I did. Yes.
Justin Sheehy: You didn’t explain it to yourself in that case, right?
Do We Need A Disaster To Learn to Do the Right Thing? [33:31]
Michael Stiefel: So the question arises now, is it necessary for an organization to have a disaster to learn this? Because that’s the only way the C-suite will realize that this is a worthwhile investment.
Justin Sheehy: So in my experience, it’s not always necessary. I have seen organizations make these choices in good ways without that. I would flip it on its head. I would say that it’s not necessary to have a disaster to make some of these good investment decisions. I’ve seen some of them made, both at my current employer and some past ones without that, but I think one should never waste a disaster.
And in fact, one of the best framings of incidents that I’ve seen and then reused is as unplanned investments. And unplanned investments in that you didn’t choose to have them. What you do get to choose is how you structure what you’re going to get back out of them. There’s a cost, you paid it, you didn’t get to choose when, you didn’t get to choose how much. Your choices ahead of time affected that, but you don’t get to choose it at the time. What you do get to choose is what growth as an organization, what learning as an organization, what improvement you get out of that unplanned investment.
So I would say no, a disaster isn’t necessary. It may be in some cases, but it isn’t, generically. But it’s absolutely true that if you’re in an environment that isn’t investing in this way and you’re trying to support those investments and a disaster connected to those things happens to occur, that might be an easier time to get your colleagues on board with the value of such an investment.
Finding the Truth vs. Assigning Blame [35:13]
Michael Stiefel: And I think it’s important to go back to something that you said before about proximate versus root causes. An organization in this situation has a choice. It can learn, or it can look for people or things to blame. Those are very contradictory things, because if you do the latter, if you look for people to blame, they will not tell the truth and you’ll never find out what actually happened. So you have to make that choice.
Justin Sheehy: Well, there are multiple layers there. If you’re in an environment that looks for people to blame, frankly, you’re probably in trouble, for all the reasons you said. And people don’t do good work when they’re scared. And people are going to look quite sensibly to protect themselves. So I think that element of blamelessness, I’m not saying it’s universal, but it’s well understood. But there’s another layer of it, which is that I also think a lot of people move past that and switch to which piece of our system do we blame? What bit of our software had a bug, or what deployment did a step get missed in? And finding the blame in that sense. And I think that still has a lot of those problems, not all the same ones. And here I’m not just saying that in the sense of the fact that people attach their identity to those things and you’re still sort of blaming the people. I don’t mean that. That might be true too.
But what I mean is if you do that, you get tunnel vision. And as soon as you find that thing, you come up with a plan to fix that thing, but you didn’t keep your eyes open to the broader picture and learning. So most of the most terrifying incidents, outages, and near-incidents that I’ve seen are ones where it didn’t work that way. And you might be able to pretend it did. You can certainly point at something somewhere that had a bug, but a lot of the most terrifying situations when it comes to that kind of environment are where many of the elements of a larger system are each individually operating basically as intended, but the emergent properties of them and some moment in the environment combine to put you in a situation you didn’t want to be in, to have some loss.
Michael Stiefel: Let me give you an example. Let’s say you’re an airline pilot, and you make the wrong decision and the plane crashes or there is some incident. The thing you also have to recognize is that the person and the software were in a system together. And you have to ask the question, why did the pilot make the decision that he or she made at that time? That’s not blaming the pilot, that’s not blaming the software. So you may have to change both. The pilot may have been misled. The software may have had a bug that led the pilot to make a different decision. So you can’t say, “Blame the pilot”. And you can’t blame the software.
Justin Sheehy: Right. And it’s a great direction you pointed to for the example, because a lot of getting these things right has actually come from the aviation industry. Aviation accident investigation is actually a pretty mature field that has gotten really far in that area. And one place I’ve seen that applied, for instance when doing post-incident analysis, is in making the really important point that saying we’re being blameless, and meaning it, and doing all those things right, does not at all mean that we’re not going to talk about mistakes people made.
Just like we’re not going to avert our eyes from a bug in our software. It’s that we don’t stop there. We don’t get to that and say, “Oh, well, they messed up. So we need to make sure that we don’t have someone mess up again”. What we instead need to do is, well, why did that person who presumably did not come to work planning to cause an outage or something worse, why did they think they were making the right decision? What context were they in that did or didn’t give them information that led to that?
And also, even if we can’t prevent them from making that decision again in the future, which we might try to do also, why did that one moment, that one mistaken decision or action, why was that able to cause such harm? Why weren’t there other controlling systems that were able to impact this? So I think talking about human error in this kind of context, and there have been multiple books written on that topic, is really valuable as long as you use it as a way to open up more areas of investigation instead of as a way to say, “Hey, now we know what went wrong”, and stop the conversation.
Michael Stiefel: Right. And shoot the person who did it and satisfy everybody that they found somebody to blame. And I don’t mean literally shoot them, obviously.
Justin Sheehy: No, no, no, I know. But even places that think they’ve moved past that part, or actually have, where the people might not be afraid of that, and so they’re a level beyond and better than what you’re talking about, I think may still in some cases be at the point where you can almost overcorrect on that and be like, “Well, yes, someone messed up. We’re not looking at that”. No, no, no. Let’s look at that. Let’s assume that I was that person and I was fully skilled for that job and trying to do it. What’s missing? Was it in my operational context? Was it some poor information? Was it that the thing I tried to do I thought was going to do something else?
Assessing An Organization’s Ability to Make Changes [40:24]
Michael Stiefel: We could talk about this for hours because this is fascinating to me because it’s the interaction of humans with automation and increasingly with society and automation. And we could go on for this forever, but I do want to ask you one other thing because I think this is important, is, what questions should you ask about your organization in order to assess its ability to make changes?
Justin Sheehy: Oh, that’s a good one. I haven’t thought about it in exactly that framework before, but one way to turn that around and come at it is: why does your organization have whatever level of confidence it has when it makes a change? Ask that question as an exploratory question, and say, “When we change our systems…”. And that can be whatever way you do it, right? That can be merging a PR in a system that has continuous deployment all the way to somewhere that has long QA cycles and has to burn things in on chips and do things. There’s an enormous spectrum of what pushing a change can mean. But ask yourself at an organizational level, why do we believe it’ll go well?
Instead of focusing on the negative, turn that all the way around. And this is where things like resilience engineering come in: let’s instead focus on what actually succeeds for our organization and figure out how to magnify and use that to the greatest advantage.
Because guess what? A lot of places have organizational gaps and problems that will in fact be obstacles to this, and they are not the same from place to place. And the answer can’t just be fix those, it’s often not that simple. But if you look at, “All right, how have we gotten well-founded confidence in things before?” And use that as a way to open up, “Hey, what are we actually really strong at here? Can we in fact use that more and use that in fact to help us where our gaps are?” And I think when you as a team or an organization, and this can be anything, again, from a two person company running a little website to an enormous enterprise, get a really well-founded confidence, and what I mean by well-founded is that it’s not just arrogance, it’s based on facts and analysis, in the safety of what you do, then you get to speed up, you get to go faster.
And that’s where one of the seeming contradictions in this comes apart, which is that when I talk to some people about doing all this work to make changes more safe, an immediate obvious objection by some is, “Well, that’s going to slow us down. Doing all that stuff, writing those things down, oh my gosh, that’s going to take me a long time to write what I do and all these other pieces. We can’t wait for that”. That might be true at a given moment, but if that’s always true and you never invest, you can’t ever build that well-founded confidence that lets you go fast. People like to talk about things like continuous deployment, pushing out changes constantly, maybe hundreds a day. Well, I think that there are two ways that you can become okay with that. One of them is to be operating a system with almost no consequences.
I can push out changes every minute to my personal website that nobody cares about if it’s down. No consequences means I don’t matter. But that’s not the interesting case. The other case is where I’m prepared to justify to myself and others why that’s actually the best call, why doing that all the time is both less likely to cause too much damage, and that doing those changes constantly is actually making my system converge in a better, probably safer direction by doing so. But you don’t just get that simply by moving fast. You have to do it by building reasons to believe that doing that will make your system safer instead of less safe.
Michael Stiefel: Planes, trains and automobiles go faster than they used to because people spent the time to figure out how to make them safe.
Justin Sheehy: Exactly. And things in automobiles like crumple zones or even seat belts and things like that turn into the ability to drive more safely, drive faster. I think a great example there would be even today if we move past older technology, look at the difference between a street car and a Formula One car. You’re wearing different safety gear if you’re piloting a Formula One than you are if you’re in a Toyota Corolla on the street. And it’s not that one of those is making better choices, but the one that wants to go faster and not die has actually invested a bit in being able to do so.
The Architect’s Questionnaire [45:00]
Michael Stiefel: This discussion to me is very, very fascinating because I’m always interested in where humans and technology interact, but I’d like to sort of ask you now those questions that I ask all the people who appear on the podcast because I think it humanizes architects and architecture and it’s interesting to see how different people answer the questions differently. What is your favorite part of being an architect?
Justin Sheehy: Sure. So I want to first address this whole category of these questions, because I think our industry at large doesn’t really know what an architect is. We can talk about some things it has in common, like communication and design and all these things, but all I can answer from are the roles I’ve had where that’s true, so I’d be careful about generalizing. But for me, I think one of my favorite parts is that I get to see so many different problems and come at them from so many different directions that it doesn’t get boring. It stays interesting because of the kinds of problems, the kinds of ways that you bring to thinking about them, and the way that the rest of the industry helps us keep having new lenses to analyze our problems. It just stays interesting.
Michael Stiefel: What’s your least favorite part of being an architect?
Justin Sheehy: So I want to be really clear that I don’t think this next thing is exactly bad, because it’s part of how the world works now in a positive sense, but it can still be difficult, and it’s that our work is never really done, with some very, very rare exceptions. Most of the time, most of the systems that we work on and design, for the past couple of decades at least, once they’re operating, they’re operating in contact with the real world, which is not static. And the demands on them are not static; their situations, their constraints, those things change. And so even really good architecture work, even that which might last a while, often doesn’t really get to be done. You have to revisit it. And so it’s hard, I don’t mean impossible, because there are still moments of satisfaction and points of completion and launch and all that, but you never really get that same satisfaction that you can get in some other disciplines of the thing being completely done.
Michael Stiefel: Interesting. Is there anything creatively, spiritually or emotionally rewarding about architecture or being an architect?
Justin Sheehy: Oh, those are some interesting words to apply to that kind of role. I’m not sure if I have, off the top of my head, spiritual and emotional aspects that I would say about it. That’s not where I get most of my spiritual or emotional reward in life. But the other word you used there absolutely applies for me, which is that architecture is certainly a very compellingly creative endeavor. I think one of the ways that I look at most architecture work is as a matter of trying to solve, essentially, a problem with many different constraints. And like we talked about before, with those constraints, you don’t get to say, “Oh, just the technical ones”. Right? They’re technical, operational. Some of them are from our customers, some of them are cost driven. And so you have these different constraints that can’t even be expressed in apples to apples ways.
And so finding good architectural solutions I think often includes finding a shared language, and that refers back to the topic we were talking about earlier, but also to other problems to help express and understand all of our constraints at the same time so that we can approach solutions. And I think that work is deeply creative and valuable.
Michael Stiefel: What turns you off about architecture or being an architect?
Justin Sheehy: So I don’t know so much about being an architect, but one thing that turns me off about the approach to thinking about architecture that I sometimes see is this notion that it’s something you hand off. That you do the architecture for a system or a problem or something and then give it to someone to go deal with the implementation and operation and walk away. And it refers back to the difficulty of being done. And that’s why I think that’s not exactly a pure negative, because it’s necessary. I think that doing a good job at this kind of role is harmed when people don’t have a continuous process of engagement with their real stakeholders, with their real running systems. So I think it’s when people have too many simplifying assumptions, where they’re just like, “I designed something beautiful, what happens next is your problem”. And they might have some unhelpful abstractions. That kind of architecture turns me off.
Michael Stiefel: I think these are important things because, as you say, whether it’s life as an architect or life in general, we have to be honest with ourselves about what we like and don’t like, because it helps us do those things that we have to do.
Justin Sheehy: Absolutely.
Michael Stiefel: Do you have any favorite technologies?
Justin Sheehy: Oh. All right. There are two different ways I want to answer that question. And I have almost opposing answers, because on the one hand I take a very pragmatic, and not at all what people think of as religious, approach when it comes to making choices within the context of a given project. So I very much think that the right programming language choice, database, web framework, communication system, you name it, those choices ought to be dictated by the actual practical constraints of the project at hand as a much more important thing than my preferences. And don’t get me wrong.
The preferences of a team working on something are among the things that should be valued but shouldn’t be the leading way you make choices. Now, that said, I am very much a nerd about exactly some of those disciplines. Programming languages and databases, for instance, are both things that I care a great deal about and I appreciate and of course, I have opinions about.
So I think the way that I might identify the favorite technologies I have, and I’m not going to dodge your question, I’ll name a few, but the way that I would categorize them is that I tend to really like technologies that make promises and then show that they can keep them. And that’s just the kind of direction I go. I’ve spent some time doing formal methods and programming language theory and things like that. And so among more recent things, for instance, I have a lot of appreciation for Rust’s model of keeping certain kinds of promises about memory use, or, for something a little more esoteric but a little more popular now, CRDTs, or conflict-free replicated data types, and how those can make certain promises that even when you do things out of synchronization, you won’t lose certain convergences in your data.
And some of the things I’ve built or contributed to over the years, some of the ones I’m most proud of and most like, have a similar philosophy, in that the developers, designers, and architects think hard about what promises we’re going to make to whoever cares about this software, how we’re going to justify those promises, and how we’re going to be sure that we keep them. Now, some of those technologies, by the very nature of doing that, both the ones I’ve helped make and the ones I’ve liked to use over the years, some of them are relatively esoteric.
So to come back to the first half of my answer, I don’t recommend them very often, some of those things, but that doesn’t mean I don’t value them. I like them, they’re favorites. Sometimes only because of the way that they’ve nudged the industry forward a little better by showing what’s possible. And so my favorites tend to be not the things I pick, but the things that have helped me either think about how to make promises or think about how to solve problems.
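To make the CRDT promise mentioned above concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest conflict-free replicated data types. This is a standard textbook construction, not code from any system discussed in this conversation.

```python
# G-Counter: a grow-only counter CRDT. Each replica increments only its own slot;
# merge takes the element-wise maximum, so merging is commutative, associative,
# and idempotent, and replicas converge no matter the order of synchronization.

class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas accept writes while out of contact, then sync in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(5)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 8  # the promised convergence
```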
Michael Stiefel: Interesting. So what about architecture do you love?
Justin Sheehy: All right. So love, again, some of the words here are interesting, because I don’t know that there’s something about it I love, in that, again, that’s a big word for me, but I can say what I enjoy. In one of your earlier questions, I was talking about constraint problems as a way to think about what the problem is. And I think with those kinds of problems, the problems facing most architects, you come up with that common language and then you have to figure out how to solve the next step. Figuring out how to solve for all these different complex organizational and technical constraints, those can be really fun puzzles to solve. And so that, I really enjoy. So maybe I’m coming up against the corner of love there, but I think there’s a danger in that environment, because I would frame it as a very junior architect mistake to love that too much.
And so then to think about the puzzle you have too perfectly, and to not take into account the very messy nature of the world it exists in. And so to me, it ends up being more rewarding not to have the sort of most on paper elegant solution that can be nice too, but to solve a puzzle in a way that accounts for that messiness and accounts for the real world. When you feel like you can justify doing that, that’s a really enjoyable feeling.
Michael Stiefel: Well, it also says, I think you’ve almost answered the next question, which is what about architecture do you hate? Because it sounds like people who don’t do that is sort of-
Justin Sheehy: So that would be a way to answer that. I would also say that for me, hate is maybe an even stronger word than love, right? I don’t think I hate anything about architecture. The things I reserve hatred for are things like cruelty and greed, things that are more deeply damaging to people, not things about designing systems. But yes, doing this work in a way that is blind to people, I think, would be something that I at least dislike.
Michael Stiefel: What profession other than being an architect would you like to attempt?
Justin Sheehy: I would go farther than attempt. So in my case, in addition to all of my software related work, I am a practitioner and instructor of karate and some other martial arts. I do that multiple times a week. That’s an extremely rewarding vocation. It doesn’t pay the bills the same way that the software industry does, but I’ve been doing it actually longer than I’ve been doing software. So had I not ended up in our industry doing this kind of work, that would be most likely my profession as well as a vocation.
Michael Stiefel: Do you ever see yourself not being an architect anymore?
Justin Sheehy: Oh, sure. I don’t mean as a goal, but at the very beginning of our conversation when you asked me how I became an architect, I think I said something about how, to me, it’s not an identity. It’s a job. It’s a really good job. It’s a job I like doing, but it’s not essential to who I am. I think much more about, is the work I’m doing, something I’m good at, something that’s rewarding in some ways. And those ways can come in different forms. And whatever the next role is from a professional point of view, if it’s something that I can do well and that others find valuable and rewarding, whether that has architect in either the name or description, to me, that’s not one of the more interesting parts. I just happen to have developed the skills that make me sometimes end up in architect jobs, but if the next one doesn’t have that, I barely notice.
Michael Stiefel: So we did talk before about the difficulty of something being done, but certainly phases of projects are done. And when a project or a phase of a project is done, what do you like to hear from the clients or your team?
Justin Sheehy: Yes. So right, we’ve already identified that done is a little weird in today’s world for most systems, especially ones that are operating as a service that people are interacting with all the time. But again, you still have launches and major revisions and moments of a kind of doneness.
Michael Stiefel: Moments of satisfaction, hopefully.
Justin Sheehy: Yes. Well, so actually I think for me, and this is one of the reasons why I think having a longer or bigger career in this that’s rewarding can be a challenge for some folks, I think it requires a sense of delayed gratification, because to me, the most rewarding elements after some of those done points are ones that you can’t get right away. For instance, that a given architecture has allowed a system to change later without having to have a long, expensive re-architecture process. To me, that’s the kind of thing that I love to hear. I’ve had the good fortune to have a few things, and to be really clear, it’s a small minority of the things I’ve worked on, stay in use for 25 years or more in some cases. Doesn’t mean they haven’t changed. They have changed. And other people have had to change them and probably said nasty things about me at the time, sometimes.
Michael Stiefel: Sort of like having children.
Justin Sheehy: Right, exactly. But the fact that they were able to change those systems as opposed to having no choice but to deal with them, to me, that’s really rewarding. So something that can continue to exist in the world and change, by having done that, that to me is just really rewarding to hear or see or find out about. If anything, more so than those immediate satisfying moments of, “Hey, the thing’s running”. Which is cool, don’t get me wrong, I like that. But I like even more hearing, “Well, yes, now we had to deal with this other kind of use case or this other kind of traffic or this whatever. And well, so we had to change it, but we could”. That, “Oh, okay”. That means we got something right when we thought about the approach to designing a system.
Michael Stiefel: That’s what I meant when it’s like having children. You watch them grow up and being capable of living on their own, not necessarily doing the exact same things they did when you taught them at 10 or whatever.
Justin Sheehy: Indeed. And what’s rewarding is seeing not that they make decisions the same way you do because I don’t know about you, but that’s not my experience.
Michael Stiefel: They’re different people.
Justin Sheehy: Right. Exactly. But that you’ve hopefully helped them and I say helped here because one has to be careful not to get their ego in the wrong place, but hopefully played some part in them developing a framework for that decision making that you can look at and say, “All right, as their world changed and they had to make decisions that I couldn’t have prepared them for, they did it. They kept going”. That alone is something to be celebrated.
Michael Stiefel: Yes. I found this conversation very fascinating. I always enjoy when we talk about people, technology, the ability to make changes because life is not static, as you said, the people are not static. The people inside are not static, the world outside is not static, the machines are not static, and the machines that interact with those other machines are not static. The world, if anything, has become more and more dynamic, and these issues are just more and more important. It was a pleasure having you and having this conversation.
Justin Sheehy: It was my pleasure, Michael. This was a really fun conversation. I always like talking about this stuff and I like talking with you about them, so thanks for inviting me.
Michael Stiefel: Thank you very much.