Looking for Root Causes is a False Path: A Conversation with David Blank-Edelman

News Room | Published 1 December 2025

Transcript

Michael Stiefel: Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job. Today we are going to talk about something that is very important to architects but is not often explicitly discussed. We have spoken quite a bit on this podcast about reliability and designing for failure, but we have not spoken about what we do to make our system designs more robust, rather than just fixing them after a failure.

Today’s guest is David Blank-Edelman. He is the program lead for Microsoft’s SRE Academy, the program for onboarding and training Azure SREs and others who strive to improve reliability and quality. He has roughly 40 years of experience in the operations space, he’s the co-founder of the SREcon conference, and he is the curator of Seeking SRE (O’Reilly) and the author of Becoming SRE (also O’Reilly). He spoke at the recent InfoQ Dev Summit in Boston, and that’s where I got a chance to listen to him and meet him. Welcome to the podcast.

David Blank-Edelman: Well, thank you. Lovely to come here and lovely to see all the ships at sea.

Becoming a Site Reliability Engineer [01:49]

Michael Stiefel: Okay. Reliability looks very different if you come at it not from the perspective of an architect, but from the perspective of site reliability engineering. You’ve spoken about how the obvious questions often do not lead to the right answers, an idea that can be counterintuitive, which makes me ask: how did you decide to be a site reliability engineer, and how different is that perspective from that of an architect?

David Blank-Edelman: Well, I think it was pretty easy for me to naturally fall into this space. I had, very early on, realized that I enjoyed the world of serving other people and allowing other people to get the best out of technology. That started at the beginning of college, when I entered the operations space: I was in charge of a VAX/VMS cluster once upon a time, and later I was part of the CS department of the college I was in, helping to run their systems. So I started in the land of systems administration, as perhaps some of your listeners have, or have certainly encountered it, and from there, as things moved on, I kept my eye on, “Where’s operations going? Where did it go when it came to DevOps?”

And then I started to hang out with people who were identifying themselves as site reliability engineers, largely from Google at the time. They were saying some really, really interesting things about how you build and run large systems at scale, and I thought, “Well, that’s really interesting”, and from there I fell into thinking, “Okay, this is the field I want to be in”.

Michael Stiefel: So it actually seemed to be a quite natural evolution. It’s almost like you grew up being a site reliability engineer.

What do Architecture and Site Reliability Engineering Have in Common? [03:31]

David Blank-Edelman: Yes. But it’s not like that picture of the evolution of man, where there’s a tadpole on one side and you get yourself to Homo erectus. I don’t claim that sysadmin grew up to become DevOps, which then grew up to be SRE. I think these are all paths that make sense, and they have their own evolution. But I realize that didn’t answer your architecture question. In many ways, I think the direct answer to your question is that the difference between SRE and architecture is one of focus.

The two have a very strong overlap. You can’t do SRE stuff if you’re not paying attention to architecture, if you’re not bringing architectural skills to it. And I would make the argument that… I don’t want to say a good architect, because that really implies my judgment, but the architects that I appreciate are ones that have spent some time thinking about reliability concerns. They understand reliability as this emergent property of systems and what they can do to make it more or less likely to emerge, and they just pay attention to that.

Just like I appreciate the ones that also think about privacy or the ones that also think about security. These are all emergent properties that really matter a lot to the people that are using our system. And ultimately I’m here to serve, and so in both ways, I’m trying to serve using SRE and I think architects are trying to do the same using architecture. And I think it’s super, super important and I think we have a lot to learn still from the land of architecture, in my opinion.

Designing for the Full System Life Cycle [04:58]

Michael Stiefel: Well, it’s interesting you put it that way, because very often I think of the architect’s role as worrying about everything that cannot be written into a use case. In other words, you cannot write a use case that says, “Make the system reliable, make the system scale, make the system secure”. And in my experience, if there’s nobody responsible for it, it doesn’t happen. That very much goes along with reliability – however we define that term, let’s just use it for the moment. You can’t write on a use case, “make the system reliable”. Someone has to see all the pieces of the system and watch that property emerge. Which also leads into something that both of us have read, Letters To a Young Poet by Rilke, and the idea of living the question, because all these things that don’t fit into a use case are not things that can be solved once and for all. Perhaps you might want to develop that idea a little bit, because I think it’s very important.

David Blank-Edelman: Well, I have struggled a little bit with this, because once upon a time I was doing a bunch of stuff at my day job with Microsoft – and I work at Microsoft, but I’m not here representing them in any way. When I was presenting something we were talking about called the Well-Architected Framework, which was attempting to lay down some principles and some guidance as to how to architect things, the first thing I thought was, “Okay, here’s the problem with this analogy, the part that doesn’t work for me…” Like, you go to an architect and you say, “Please make me a building”, and then you have a long conversation about what that building is, what it has to do, what the parameters are, et cetera, and they make you a thing and then you build the building. But you don’t often anticipate the architect knocking on your door a year later going, “Hi, I’m here to architect some more on your building”.

And so what I understood – and I think this is another place where SRE probably has opinions – is that some of it is keeping the thing in a state. Keeping it in a well-architected state is also architecture; it’s just not the architecture you think of that often. When we’re not talking software, when we’re talking atoms, you don’t think of the architect showing up later to architect at you some more. And if that’s the case, then what do you do about this notion that you and I both believe in, which is that, A, you never get there, and B, things don’t stay the same, and there are all these other forces acting on this – entropy, et cetera. So what do you do as an architect when you’re aware of entropy? This is a super fun question.

And I’ve seen really fun things where people go so far as to say, “Okay, I’m building stuff in so that there’s a way to iterate”. And I think that’s super cool. I’ve seen people say, “Well, I’m going to actually figure out how to make this building degrade gracefully over time”, so that later on you’re like, “Cool, then we can start again”. I think those are really interesting ideas from real-world architects, and I often wonder how often architects in the software space think about these things. I mean, in my experience, people don’t usually think about what it is like to sunset a system. Are you building this so that later on you can turn it down? Because you and I both know someday we’re going to turn it down. It’s not going to live forever.

What is the Goal of Site Reliability Engineering? [08:10]

So has it got what’s necessary built into it to think about not only the, “Congratulations, I built it. Congratulations, I’m running it and I’m fixing it as I run it”, because that’s the state of the world, but also the someday when I want to say, “This is no longer serving my needs. What is the easiest and most graceful way for me to demolish it?” And I find that to be super interesting. I’ve given talks about destruction and demolishing things and what we can learn from the people that do that, because I think it’s a problem we don’t think enough about. Now, most people’s problem starts at the beginning, with, “Congratulations, I built something. Is it doing what I want it to do?”, ’cause when you talk about the definition of reliability, it’s often, “Am I meeting expectations?”, and there’s lots of expectations you could be meeting.

In the talk that you referenced, I talk about reliability as having all these different facets, because usually people talk about availability. That’s where we often start the conversation. “Availability, is it working or is it not? Is it on? Is it off?” And as an SRE, you’re probably going to spend most of your time in that space. But if you’re an SRE and you’re dealing with some sort of system that requires certain parameters around latency, then you might be paying a lot of attention to latency. My friends over at Xbox think a lot about latency, because in a game that’s crucial. It’s the part of the system where that’s the thing. But there’s also, let’s say I’m running a pipeline, I might care about throughput. Or let’s say I’m running a batch system, I might care a lot about, “Did I complete the work over the whole batch?” Or I’m running sports scores or an election thing, maybe I care about freshness of data. Or durability, if I’m running some sort of storage system, right? All of these are facets of reliability.

If I said to you, “Congratulations, I have this disk” – I’m looking around at the various disks on my desk – and if I put a bit there and I didn’t get it back out again, you wouldn’t call that reliable. These are all aspects of this, and when I talk to people, I try to expand their idea of reliability. The good news is we haven’t used the word resilience yet. I’m looking forward to when we jump into that world, but I want to be very clear that I’m staying in the land of reliability for a second. If we want to talk about resilience later, that’ll be a fun talk as well. The thing that’s interesting to me about your definition is the notion that you’re asking people to know what they don’t know, and I don’t know how much preparation people have, or whether they know what to do about the things they don’t know.
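To make those facets concrete, here is a minimal sketch, in Python, of how each facet might be expressed as a measurable service-level indicator. The names, thresholds, and numbers are all invented for illustration; they are not from the conversation.

```python
from dataclasses import dataclass

@dataclass
class SLI:
    """A service-level indicator: the fraction of events that met an expectation."""
    name: str
    good: int   # events that met the expectation
    total: int  # all events measured

    def ratio(self) -> float:
        # Convention: an SLI with no measurements counts as fully met.
        return self.good / self.total if self.total else 1.0

# Each facet of "reliability" gets its own indicator (all numbers made up).
slis = [
    SLI("availability", good=99_982, total=100_000),  # request succeeded?
    SLI("latency",      good=99_500, total=100_000),  # answered under 200 ms?
    SLI("throughput",   good=1_430,  total=1_440),    # pipeline minutes at target rate?
    SLI("freshness",    good=287,    total=288),      # scoreboard updated within 5 min?
    SLI("durability",   good=10**9,  total=10**9),    # every bit written was read back?
]

for sli in slis:
    print(f"{sli.name:>12}: {sli.ratio():.4%}")
```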

The SRE Mindset

Openness [10:25]

Michael Stiefel: Well, what I can say about that is, first of all, you have to have an openness. In other words, that’s what’s involved in Rilke’s famous expression, “living the question”, because an architecture is never done. You should never think of it as done. So that’s the first thing you have to understand. And going back to your house example, most people, when they have a house, don’t go back to the architect and say, “Please put a jacuzzi right smack in the middle of the living room”. But we do that in software all the time.

David Blank-Edelman: Oh, yes, totally, totally.

Maintaining Reliability as Systems Evolve [11:03]

Michael Stiefel: So that’s why you have to assume that you have these architectural questions, some of which you’ve raised – and I appreciate your making clear what we mean by reliability, ’cause I think that’s very important. But systems evolve. You may have a reliable system today, but someone asks for a new feature, and that affects the reliability, because it comes back to something you talked about before, which is that reliability is from the customer’s point of view, not the system’s point of view. And I’m going to keep using the word reliability as a shorthand for all those things you mentioned before. So not only will your system evolve, but there are future systems which you will design. You’ll never get an ultimate understanding even of what those concepts mean. That’s what I mean about having to be open to the fact that you don’t know things.

David Blank-Edelman: Well, yes, and what are you going to do about it? Just feel cool that you don’t know things? To me, we are going in three different directions that have many neurons lining up in my head, so I’m going to do my best to be very clear on the paths I see in what we’re talking about. On the openness question, I want to say it’s not only openness; for SREs, there’s a really strong desire to make sure that we understand how we’re going to get data, how we’re going to get signal on the system, so that we can understand it. Not just now, but in the future. How do we know whether it’s working? How do we know it’s not working? How do we know whether something is impacting it? And there are many ways to get signals. One is the stuff you built in, the things that go boop, boop, boop, that pinging noise you hear in the background. Then there’s the, “It had a failure”, and we go back and we look at that failure.

Learning From Failure [12:55]

So learning from failure is absolutely crucial to SRE, and I would like to assert it is for architecture as well. I have all these great books about architects. I’m not going to be able to say his name – it begins with P, it’s a Polish name, I’m certain you know who I’m talking about. He has really great books about systems and how, when they’re tightly coupled, it’s problematic. Anyway, there’s that aspect of things: where do we get the information that allows us to not only stay open, but to be active with it, to create that feedback loop? Because reliability improves due to a feedback loop. I’m willing to make that assertion very clearly. So that’s part one.

Complex Systems are Almost Always at the Verge of Failure [13:31]

Part two is when you’re talking about the fact that you add features. In many ways, it can be a number of things. You understood the components of your system very well: you built them, you understand their failure modes, and it all makes sense to you. And then somebody comes to you and says, “It would be really cool if I could embed your thing in my thing”, or, “Maybe, just maybe, wouldn’t it be great if I could query your thing from that and include it”, or, “Wouldn’t it be great if we had this feature that combined our two things”. And then you start to have failures and you’re like, “Well, wait a second. Everything is working exactly as designed at the time I designed it”. However, it’s this conglomeration of overlapping contributing factors that comes into play, where now all of a sudden the complex system is behaving in the ways that complex systems do, which is to say that they’re almost always on the verge of failure.

The other thing I might mention: I encourage your listeners to go find, if they haven’t already, Dr. Richard Cook, an anesthesiologist who spent a lot of time working in the resilience engineering field – a super, super amazing guy. He wrote a short paper everybody can read. It’s about five pages, called How Complex Systems Fail, and if you want to go read it, there’s a website called how.complexsystems.fail.

Michael Stiefel: We can put a link in the show notes.

David Blank-Edelman: Oh, yes. Yes, okay. But one of the things that he points out there is that… So now we start getting into this realm of complexity, and the complexity happens not because we built complex things, but because the interactions between these things are not necessarily foreseen – or if they are foreseen, you’re lucky. There will be things you don’t know, which calls for humility. And so maybe openness and humility are really, really closely tied to each other, as an architect and as an SRE: just saying, “Okay, I don’t know everything, but I really want to learn”. ‘Cause I think that the basis of SRE, the SRE mindset, is curiosity.

The thing that I tell people – and we can go into definitions of SRE per se – is that it’s all based on how does the system work, and then how does the system fail? And when I say how the system works, I don’t mean on your whiteboard. I mean how does the system work when it’s actually running? And when I say how does the system fail, I mean it as a way to learn more about how the system works. That’s really the thing.

And how does it work for the customer? How does it work when I scale it? How does it work when I bring in people who don’t speak English well? How does it work when I bring in people who don’t speak English and their writing system goes the opposite direction? Or how do I make sure that people who have color blindness can see?

Michael Stiefel: Or somebody made a change in the field that’s not in the original document. If you know anything about, for example, fire engineering, buildings have to have something called as-built drawings. In other words, it’s not enough that the fire professionals come to a burning building and have what the architect laid out. They need to see the building as it was built.

David Blank-Edelman: Right. Which everybody, including the fire people, will then say, “This gets me…” – I don’t know what percentage – “…of the way”. But not all the way, because you might’ve built it, but I guarantee you that somebody has now moved those desks around, or they’ve got the air wall in a different direction, or-

Michael Stiefel: … or put some asbestos in the wall.

David Blank-Edelman: Yes. Or somebody decided, “What I really need to do is put a jacuzzi in the middle of the room”, and all of a sudden, congratulations, the challenge for the firefighter in that particular case goes up. So I think this is all really important to note: we understand things to a certain degree, and it is incumbent on us to, wherever possible, understand more, because only with a greater understanding of what is true now, as best we can grasp it, can we be effective. And again, that holds for that moment, and then the next moment comes and now we’re in a different place, now we’re in a different river. So I think it’s just really important to have that openness. I really like your description of that, and that’s why we both love Rilke so much.

Michael Stiefel: Right. Henry Kissinger once said that, “A foreign policy success is just an admission ticket to the next crisis”. And essentially, when you have a failure and you learn from the failure, you’re making a change, so you’re changing the system. There’s no guarantee that the system will work in exactly the same way it did before. So you’re just waiting for the next failure.

David Blank-Edelman: Yes. Well, it’s terrible that we’re swapping Henry Kissinger quotes because I don’t particularly appreciate him, but he also said, “There can’t be a crisis. I have no room on my schedule”.

Michael Stiefel: Yes. Yes.

David Blank-Edelman: And I think of that a lot. So yes, I mean, well, that’s the other tricky question. I was having a discussion with somebody who was like, “Hey, we believe firmly what you should be doing is fixing your current problems and not trying to go back very far into your debt”. And I was like, “Cool. But you realize in fixing your current problems, you’re creating the next ones”. I mean, I dig the treadmill and I understand it’s the treadmill we’re all on, but don’t try to mollify yourself to say, “Cool, things are better”. No, they’re just different and they’re setting up for the next different thing.

Failures Do Not Have a Root Cause [18:37]

Michael Stiefel: So that leads into something important that you’ve talked about, and it’s also a little bit of a bugaboo of mine. When we have a failure and we’re trying to understand our system… I think you know what’s coming next. Someone will tell us, “Well, let’s get to the root cause of the failure”.

David Blank-Edelman: Oh, I see. You’re just trying to provoke me. I’m okay with that.

Michael Stiefel: Well, I think it’s important. It’s important.

David Blank-Edelman: Yes. It’s not the first time that you’ve heard me go off on this, so I’ll gladly go off on it for your listeners. So I think what you’re trying to point out to me… I have tried to make it one of my missions to get people to stop calling it root cause, to stop talking about root causes or root cause analysis, because one of the things in almost all cases, especially nowadays with more complex systems, is that we find out there’s no one root cause. It’s not like we can say, “Colonel Mustard killed such and such with a candlestick in the kitchen, and congratulations, we can, with absolute certainty, describe a single thing”, because when we do that, it turns out that that’s not really the whole picture, and we’re just deluding ourselves. And my problem is that when we say root cause analysis, we’re asserting that there might be one root cause.

And so people will twist themselves backwards to try to find the thing and blame it on the one cause, and sometimes it’s the person. I learned not to believe in root cause analysis because there are just so many different things that come up. Every time you look at this, you’re like, “Okay, cool. I get the fact that you think that the disk filled up. So should we just keep the disk from filling up? No, no. What was the problem with the system that was emitting so many logs that it filled the disk up? Isn’t that a thing?” And so in my world, it’s far better to talk about things like triggers, as in, “What triggered this whole sequence of things?”, and also, “What are the contributing factors?” That, to me, is a better way of describing a problem, because the thing is that sometimes those contributing factors – and I promise I’ll shut up in a second – are sociotechnical. In fact, almost always they’re sociotechnical.

It’s not like the human was terrible, and it’s also not like that one component failed. If that’s the only thing we focus on, then we never really level up. If I say to you, “Congratulations, this thing broke and it took us three hours to fix it”, and you say, “Okay, we fixed it”, and nobody goes back and asks, “Why did it take three hours?” – “Oh, well, because the documentation was wrong” – if nobody took you down that path at all, then you would never fix the documentation, and congratulations, your next outage is going to be three hours or longer. I mean, maybe shorter, because someone figured it out and maybe they walked around remembering it. But do you want to depend on somebody’s memory the next time it goes down? I don’t.

So that’s part of where my concern comes with this terminology. And I work at a company that used to call things root cause analysis and still does sometimes when I’m not looking. And I try very hard to say, “Look, hey, let’s talk about this in a way that really represents the reality of the situation”. That, to me, is what I’m trying to do.
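As a rough sketch of what dropping the root-cause framing can look like in an incident record (Python, with a hypothetical schema and made-up details drawn from the disk-full example above): there is a trigger and a list of contributing factors, and deliberately no single root-cause field.

```python
from dataclasses import dataclass, field

@dataclass
class ContributingFactor:
    description: str
    sociotechnical: bool  # people-and-process factors count as much as component ones

@dataclass
class IncidentRecord:
    # Deliberately no `root_cause` field: we record what set the sequence off
    # and everything that had to line up for it to become an outage.
    trigger: str
    contributing_factors: list[ContributingFactor] = field(default_factory=list)

incident = IncidentRecord(
    trigger="Disk on the log host reached 100% capacity",
    contributing_factors=[
        ContributingFactor("Upstream service entered a retry loop, emitting many times the normal log volume", False),
        ContributingFactor("Log-rotation job had been failing silently for weeks", False),
        ContributingFactor("Alerting covered absolute free space only, not disk-growth rate", False),
        ContributingFactor("Runbook pointed at a decommissioned dashboard, slowing diagnosis", True),
    ],
)

for f in incident.contributing_factors:
    kind = "sociotechnical" if f.sociotechnical else "technical"
    print(f"{kind:<15} {f.description}")
```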

Michael Stiefel: Well, let me put it in a perspective that’ll be very direct for people.

David Blank-Edelman: Sure.

Human Error is a Symptom [21:40]

Michael Stiefel: Suppose there is an airline crash and someone goes and says, “Well, the pilot pushed the wrong switch”. So the question becomes, “Why did the pilot think this was the right switch to push? Was it the switch wasn’t designed properly? Were the instruments giving them incorrect or misleading readings? Was the documentation that this person got trained on wrong? Did someone make an assumption about what conditions the airplane would be operating in, and they never expected some piece to malfunction in quite this failure mode? Did something else fail that the pilot was not warned about when the switch was…?” So as you would say, there is no one root cause to that failure, and very often we want to find a victim to blame.

David Blank-Edelman: Yes. So that’s the problem with human error as a thing, because human error is a symptom, it’s not a cause in any way. And when you say human error, often it means you’re just going to stop looking at the problem right at the exact moment where you’re about to find something. And just so people don’t think that you’re making up this abstract thing: the B-17 bombers during World War II had a very specific problem. Some of this stuff is still classified, but most of it is out there. They would come in and land, and then on their way back to the taxiway, some of them – some large number of them in the theater – would pancake right onto the runway. They would just kersplat onto the runway, and this happened to a lot of B-17s. It didn’t happen to any other aircraft, even though there were even more of them in the theater at the time.

And so they brought in all the experts to look at the mechanical situation. They looked at the electrical stuff, and they couldn’t find any problem at all. So they just called it human error. But then the Air Force hired somebody to come in – he was an industrial psychologist – and he interviewed people, and he went and looked in the cockpit of the B-17. The B-17 had two switches that were right next to each other and looked almost exactly the same. They were small, they were side by side. One of them did the flaps, the other one did the landing gear. And because they were so close, and because they weren’t distinguishable, people were flipping the wrong one. And this is why today there are FAA requirements that say, “Your flaps control must look like this. Your landing gear control must look like this”, and they’re very different looking knobs.

So that, to me, is really, really, super important. I mean, that’s one other problem with this idea that there’s a root cause. I mean, what’s the root cause there? Yes, sure, somebody clicked the wrong switch, but really that’s not the important thing.

Michael Stiefel: So one of the things that occurs to me is that this idea of root cause came from looking at simple systems, when things were not complicated, when things were not non-linear. But we build more complicated systems, and they become more non-linear, and there are many things that contribute to something happening, even successfully. I mean, the flip side of failure is the fact that for things to succeed, many things have to work correctly. So it should not be surprising that when something fails, it’s because many things did not work correctly, which means that at any given point in time, we’re looking at the probability of success versus the probability of failure.

For example, there is a finite probability that someone kicks over the cable, or that there’s a power outage at your site and at the site of your fallback systems. In other words, you never get to 100%; no matter how many nines you have, there’s still a probability of success or failure. So we have to start thinking of probabilistic judgment as part of engineering judgment, and not be surprised when things go right or wrong, because it’s all a matter of probability.
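A small worked example of that probabilistic framing, with made-up availability figures: when every component in the chain must work for a request to succeed, the probabilities multiply, and the whole ends up less reliable than any single part.

```python
# Availability composes multiplicatively across hard (serial) dependencies.
# All figures below are invented for illustration.
components = {
    "load balancer":   0.9999,
    "application":     0.999,
    "database":        0.9995,
    "third-party API": 0.995,
}

p_success = 1.0
for name, availability in components.items():
    p_success *= availability

print(f"Combined availability:  {p_success:.4%}")                        # ~99.34%
print(f"Expected downtime/year: {(1 - p_success) * 365 * 24:.1f} hours")  # ~57.7 hours
```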

David Blank-Edelman: Okay, I have a lot of things to say in response to that ’cause I think it’s a super good question. First, I don’t know if I would agree that the root cause was from a simpler, gentler, kinder time back when we were just sitting and churning our butter.

Michael Stiefel: Well, I mean, if you have an ax and a handle falls off the ax, it’s pretty clear what’s wrong.

David Blank-Edelman: Well, but no, I mean, let’s argue this for just one moment-

Michael Stiefel: Okay.

David Blank-Edelman: … ’cause I think that’s fun. I mean, I might argue that what also unchanged was the complexity, depth and understanding of interconnectedness of our analysis. So if the handle falls off your ax, you even then probably said, “Well, who is supposed to be responsible for making sure that the axes are properly kept up”, or, “It was up to me”, or, “I noticed it was wobbly before. Why didn’t I have some time set aside to keep my tools up to date?”, or whatever. I’m really stretching your analogy a little too far, but you get where I’m going.

Michael Stiefel: Yes, yes.

David Blank-Edelman: So that’s the first thing I want to say. The second thing is that one could look at it probabilistically, but – well, I don’t know if the people in our data centers do, but I don’t know how many people sit there and say, “Let me think of all the possible ways this can fail, the way an insurer would, and then calculate each probability as best I can”.

The Appropriate Level of Reliability [27:02]

Michael Stiefel: Well, my idea was not to calculate the probabilities, but to understand that you can never get to 100%.

David Blank-Edelman: Yes. So the good news is I’m with you there, because SRE talks about the appropriate level of reliability, and you’re searching for that appropriate level. Because, if we leave axes and cables aside, if I say to you, “Great, I would like 100% reliability”, that means a couple of things. Thing number one: what about your dependencies? Let’s say you want your customers to get to something 100% of the time. Well, I don’t know about you, but my cable went out fairly recently for a little bit. There goes the 100% right there. But aiming for 100% also means I don’t have the headroom to make changes, because even making a change takes time. And if I don’t have the chance to make changes because I don’t have that headroom, that doesn’t work either – then we never evolve our systems.
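SRE usually formalizes that headroom as an error budget: the gap between your target and 100% is the unreliability you are allowed to spend on changes, maintenance, and plain bad luck. A minimal sketch, with illustrative targets only:

```python
# The gap between an SLO and 100% is the error budget: room to change things.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed unavailability, in minutes, over a rolling window."""
    return (1.0 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):7.1f} minutes of budget per 30 days")
```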

And so that’s why SRE has very specific ways of talking about reliability: how to think about it, how to set goals, how to notate them. And in those cases, if you have a system that you only need once every Thursday to generate a report, should I wake you up at 2:00 AM on a Sunday when it’s not working? Maybe not. That’s what we mean by appropriate. So I agree that having a better understanding that things aren’t always working, or aren’t always failing, is really super important. The other little thing I’ll tag onto this for your listeners: there’s some really super interesting research in the resilience engineering community by a person at MIT called Nancy Leveson.

And Nancy Leveson did some really great work on what we call Safety-II or Safety-III, which starts by saying, “Okay, folks, I’m glad that we spent a lot of time looking at those 30 minutes when you had an outage. What about the rest of the time? You can spend all your time figuring out what went on in those 30 minutes. Or you can also spend time figuring out what went right in the two days before, when everything was working great. What are the things contributing to that success, and can we identify and strengthen those, as opposed to just trying to deal with the exceptions?”

And I think that’s a super interesting way of looking at it; to me, that breaks apart the RCA idea. As you were saying – and this is the line I blithely repeat, which I stole from John Allspaw, another person in resilience engineering – what was the root cause of things going well yesterday? The first time I heard John say that, and I’m directly quoting him, my head went poof. All of a sudden I had inverted the situation in a way that helped illuminate what an outage was about.

Michael Stiefel: To take that insight about success and failure, really failure, as you said, is a signal and a piece of data about why the-

David Blank-Edelman: It can be.

Michael Stiefel: Can be.

David Blank-Edelman: Can be.

Architects and Site Reliability Engineers Need to Communicate… Often [29:56]

Michael Stiefel: Can be, yes. So the question is, how does that signal get transferred back to the architect, so the architect can understand, “Well, my system was not working the way I thought it was”, or, “Could I improve it?” How do we get that feedback loop from the people at the front lines to the architect, or the designers?

David Blank-Edelman: I wish to express a very, I don’t know what you would call it, opinion, which is we almost never do.

Michael Stiefel: Yes.

David Blank-Edelman: That’s what I want to say. So if we want to talk about this in sort of a theoretical way, I’m all for it because I wish it happened more often.

Michael Stiefel: This is the thing. I want our listeners to start to think about this, and to think about maybe it’s just having lunch with the SREs every now and then. In other words, as you were mentioning before, it’s the sociotechnical things that are often the barriers, and trying to institutionalize things is thinking about it the wrong way. The thing is to have this communication between people, and they will figure out, in their organization, “What’s the best way to make sure this happens?”

David Blank-Edelman: As best we all know. I continue not wanting to be too like, “Oh, if you just knew, everything would be fine”.

Michael Stiefel: No, no, but I’m saying that maybe if once a month they got together for lunch and then they figured out, “Well, you’re in the next office from mine as it turns out, maybe we can meet”. In other words, I just want to emphasize the importance of this and then people can figure out how to do it in their own way.

David Blank-Edelman: I have two thoughts. Thought number one is I would strongly encourage your audience to figure out how to build things into their designs that get them that data, and to have the hunger for the question, “How is my system going to behave in production?” How it behaves when I wrote it down is different. Have that real hunger, and then try to think, “And how am I going to gather that information?” Because sometimes an SRE will go back to an architect and say, “Cool, I like your system. How do I know when it’s succeeding, and how do I know when it’s failing? What’s your definition of success? Because your definition of success might have an implicit notion about latency that I don’t know about, or it might have an implicit notion of scale that you haven’t really told me about. This will work as long as you don’t stand up four of them – it’ll be fine for three, but I make no guarantee about four”.

So I think being as explicit as you can about your desires is really important, but also trying to pass that information on, and figuring out, “How am I as an architect going to get this information for myself?”, and to want that information. I am not going to try to shove anything down anybody’s throat, but I want them to want that. And this is so funny, because there’s this middle group of people who, in theory, are taking whatever it is you’re architecting and building it. We conceive of this as if the brains in jars sit there and think big thoughts, then they tell the gnomes that make the cookies, and the gnomes make a bunch of cookies, and then there are a bunch of people whose job is to make sure the cookies run. And then we’ve had all these arguments like, “How do any of these people know about any other part of the system?”

And some of the answer to that question is also to invite SREs, just like you should invite security people, into your architectural meetings, into your design meetings. You want somebody who’s going to walk into the room, walk over to the whiteboard and go, “What are you going to do about the single point of failure there?” Because they’ve seen it fail 14 times with this configuration.

Now, maybe you didn’t know that – and I’m assuming you didn’t when you put that configuration on the whiteboard – but it’s really good if you can get somebody to show up and say, “Hey, here’s what I’ve experienced”, and to pass that on. So I really encourage that as a thing. I also talk about how great it is to spend time in other groups – like a take-your-architect-to-work day, where they hang out with you. Some organizations have that built in, where you basically rotate through different groups as a way of really understanding what other people’s lives are like. I think that’s great. I think it’s another way to instill the right things.

Michael Stiefel: Well, to pick a slightly different example that lands on the same idea: if you want systems where you can put out changes with great rapidity and great frequency, you have to design the system to be able to do that, because you have to understand intrinsically the coupling of the system. So maybe there’s something in the SRE world that is sort of like coupling, that architects can understand. If they want to make the system more reliable, available, such that the customer gets what they want and the system degrades gracefully, et cetera, et cetera, what should they be thinking about?

David Blank-Edelman: I want to say, in SRE we call that coupling too. It’s not like the French have a different word for everything. We understand tightly coupled systems equally well, or systems with many different things that have lots of connections. So that’s the thing, I think. The other thing that’s really useful for both sets of people is to construct stuff out of known blocks of not highly complex things, such that we can reason about those blocks and how they interact with each other. If I know that this component is going to behave this way, and I understand it and I understand its failure modes, and I start to put it together with other things, then it’s a lot easier for me, when it winds up in my hands, to be able to say, “Oh, I see. This looks like a failure mode from that component”, because I’ve seen this.

But if it’s sort of like, “Hey, I’m going to give you an animal you’ve never seen before, please take care of it”, that’s a lot harder than saying, “Oh, I see. Okay, a reptile. Reptiles need this and that”. I know I’m mixing metaphors left and right here, but what I’m mostly trying to say is that having composable parts that are well understood, and well instrumented, goes a long way. And if you’re going to couple things, make sure you instrument that coupling really well.

Michael Stiefel: That’s an important idea, because if you use third-party systems, it is not always apparent to you exactly what all the failure modes are. So that sounds like a very important principle: instrument points of joining with third-party systems, instrument places where pieces or components come together, because it’s more likely than not that’s where your real failures are going to happen. Because if a component fails – now I’m just thinking from an architectural view – if you’re designing for failure, you always understand that a component can fail. So you make some assumptions about what happens if that component fails. What you don’t very often understand is the interactions between multiple components, and/or how to detect that the system is degrading as opposed to failing outright. So to your point, it’s more the interactions between the components that are harder to understand, as opposed to the component itself.

David Blank-Edelman: I don’t know if I’d claim more; I’m going to say as important.

Michael Stiefel: Okay.

David Blank-Edelman: Do you know what I’m saying?

Michael Stiefel: Yes, yes, yes.

David Blank-Edelman: That’s the case. I don’t think that instrumenting the middle makes the two sides work any better necessarily, but it might give you some sense, certainly, of the class of problems that happen in the communication between the two. And I assert that that is its own class of problems.

Michael Stiefel: Okay.

David Blank-Edelman: I think that that’s really, really important. But if you just watch the middle and you never watch the two sides, you’re still going to have problems as well, is what I’m saying.

Practical Advice for Post-Incident Reviews [38:00]

Michael Stiefel: Okay. So one of the last things I’d like to talk about: in the process of finding out about system failures, and going from the proximate cause to thinking about all the reasons why it happened, do you have any thoughts about the type of questions you have to ask? Because certain questions rule things out by their very nature. Prosecutors, for example, like to ask questions that have yes-or-no answers, like, “When did you stop beating your wife?”

David Blank-Edelman: Right, right. Both me and my wife are not so keen on that question, strangely enough. But I guess what I would say is this. So in my world, we think a lot about post-incident reviews, right?

Michael Stiefel: Yes.

David Blank-Edelman: And that’s where my head goes when you say this. The thing that I think is really important is to start with the how and the what before you get to the why. I know that sounds really, really basic, but let me explain what I mean. Spend as much time as you can describing what went on before you try to get to “How do I fix it?” and “What was the problem?” If you immediately jump to, “Hey, the disk was overloaded”, then you don’t really get a sense of, “How did we detect this? Who was involved? What did we see? What went well? What didn’t go well?” If you immediately move to fix it, then you’ve skipped over a whole bunch of potential stuff that would be useful for learning things.

Now, when you’re in the middle of an outage, it may be the case that you want to mitigate even before you know what’s happening – “If I can stop the customer’s pain by moving all my traffic to a different region, or by cutting back to a previous release” – rather than saying, “Well, I’m going to spend the next four or five hours trying to figure out where the bug is”. But in the review, what I’m saying is: understand the how and the what well before you get into the why, because people really love to jump to, “Okay, cool. Got the problem, done, see you, I’m out of here. I got my repair item. I’m out of here”. And I just want to say that that behavior leads to very shallow reviews. There are lots of other traps you can fall into when you’re doing these reviews, but that language really matters, and that focus.

And you’ve heard me go off on the other thing, which is the five whys that some people are really, really enamored with. People say, “Oh, it’s part of the analysis, you should run the five whys”. For those who haven’t heard of it: you ask, “Okay, what was the problem?” “Well, the server went down”. “Okay, why?” “Well, because it ran out of disk space”. “Why?” “Well, because…” I don’t know.

Michael Stiefel: “A config file was missing”.

David Blank-Edelman: “Too many logs”. And then, “Well, why were there too many logs?” “Well, because the cleanup system didn’t run well”. “Well, why is that?” “Because the config was wrong”. And the problem with those exercises, where you’re just drilling down hoping to get to one root cause, is that you’ve just shed a whole bunch of stuff. Like, “Well, that thing that was spinning out a lot of logs, was it having a problem? Shouldn’t we know whether that thing was screaming in pain, rather than just looking at the fact that your disk was out of space?” So to me, it’s about staying with what is true. “What are all the things we know? What do we know? What don’t we know? Who was involved? What did they know?” That sort of stuff is just as important. Just like your pilot example: what the pilot knew at that moment is as important to me as whether they flicked the switch, because then I can figure out what I am doing poorly around giving context to the people who absolutely need context and control in that situation.
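One way to keep a review on the how and the what before the why is to give the write-up a fixed shape that defers repair items to the end. A sketch of such a skeleton follows; the section names are illustrative, not any standard template.

```python
# A post-incident review skeleton that forces description before diagnosis.
REVIEW_SECTIONS = [
    ("What happened",         "Timeline of events, in order, with timestamps"),
    ("How we found out",      "Which signal fired first: a page, a customer, luck?"),
    ("Who was involved",      "And what did each person know at the time?"),
    ("What went well",        "Mitigations, tooling, and docs that actually helped"),
    ("What went poorly",      "Wrong runbooks, missing access, slow escalation"),
    ("Contributing factors",  "Everything that had to line up, sociotechnical included"),
    ("Why, and repair items", "Only after all of the above is written down"),
]

for title, prompt in REVIEW_SECTIONS:
    print(f"{title}\n  {prompt}\n")
```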

Michael Stiefel: It’s almost like at a crime scene, you want to know the facts before you start attributing motives to people. “Where was the blood? Where were the fingerprints? Where was the point of entry?” And not worrying yet about, “Why did they come through the door instead of the window?”

David Blank-Edelman: That is an interesting question in itself. I think “this instead of that” is a better why, and the fact that you noticed they went through the window is really good – “What actually happened?” But if you’re like, “Obviously they shot him”, and you didn’t notice that there were three other bullet casings from three different guns, well, how good was your investigation?

Michael Stiefel: Or there was a very clever criminal who figured you’d notice the bullets but wouldn’t really see the poison. That was the thing that actually was…

David Blank-Edelman: Right. Again, we’re way into another analogy set.

Michael Stiefel: But the point is things can be misleading.

David Blank-Edelman: Well, not only misleading, but it’s wasting the opportunity. ’Cause here’s the thing we were talking about before: outages and failures can be a loss of time, money, health, reputation, hiring, and stuff like that, or they can help you level up. Your call. If every time this happens you whack the big red button, and every time you whack the big red button you lose customers, totally up to you. You make your own decisions. But if you can learn from it and figure out, “Oh, wait a second. This happens every single time, I’ve just noticed that. Let’s go fix the thing”, now you’re not losing customers. It’s up to you to take advantage and to level up from this sort of stuff. And I’m interested in teaching people how to make use of that once they’ve made that decision.

The Architect’s Questionnaire [43:10]

Michael Stiefel: At this point, I’d like to go and ask my guests some questions to humanize the person behind the interview.

David Blank-Edelman: Sorry, he noticed I wasn’t human. Anyway, go ahead while I take off my reptile mask. I thought I’d get through the entire interview pulling off the species thing. Okay, great. Ask away.

Michael Stiefel: That’s because I was looking for the root cause.

David Blank-Edelman: Look, you can’t taunt me with that forever. You’re not the only person who says root cause at me, so I am a little inured to it.

Michael Stiefel: So what is your favorite part of your job, being an SRE?

David Blank-Edelman: Well, my job is also running a training program for SREs and people who are just getting into reliability, so I can give you both answers.

Michael Stiefel: But whatever answer you want to give is fine.

David Blank-Edelman: Well, I’ll give you both. I mean, it’s really fun to lead people and to help people understand how to level up, and how to think about things in this way that really serves them. I really dig that part. The part about SRE that’s really fun is that it’s all about curiosity, that you’re ideally encountering new things all the time, that you’re tracking down problems wherever they are, however far you have to go, whatever part of the earth you have to go to, whether it’s the big picture or the small picture. I really love diagnosis. I really love helping people and supporting people. That’s been what I’ve done all my life, and so that’s why I dig all that stuff.

Michael Stiefel: What’s your least favorite part of your job?

David Blank-Edelman: Well, that’s always tricky because it’s being recorded. I have worse days, I think. The days where the dragon wins. That’s not so super fun. Sometimes when late stage capitalism rears its head in ways I would prefer it wouldn’t, that’s less fun. So I don’t have a lot of bad days or I wouldn’t keep on doing this, so I’m pretty positive.

Michael Stiefel: Is there anything creatively, spiritually, or emotionally compelling about architecture or being an architect?

David Blank-Edelman: I’m sure there is, and I would ask an architect that question.

Michael Stiefel: I meant an SRE, as I described.

David Blank-Edelman: Yes, I mean, I find it really compelling because it’s all about learning. It’s all about the questions. I find that really quite good. I think it’s fun. I think it’s funny, and bringing architecture to it is really important. It allows me to bring many different things to this sort of stuff.

Michael Stiefel: And what turns you off about reliability engineering or being an SRE?

David Blank-Edelman: The stuff that turns me off are situations where we blame people instead of going deeper into the situation, where we don’t take the time to understand it, and where we don’t have the desire to assume that people had the best of motives. You can’t fire your way to reliability, and I think a lot of people still haven’t figured that out yet. That sort of stuff turns me off.

Michael Stiefel: Do you have any favorite technologies?

David Blank-Edelman: Parenting. Parenting is a favorite technology. Fermentation. I like fermentation a lot. As a bread baker, I’m a big fan of fermentation. That’s one of my favorite technologies.

Michael Stiefel: Okay. Okay, fair enough. What about reliability engineering do you love?

David Blank-Edelman: I’ve said a little bit of this before. I love the challenge. I love the fact that I have to be relentlessly collaborative to be successful at it. I really like working with people. I really like helping people succeed, and this forces me to do that every single day, and I dig that a lot.

Michael Stiefel: What about it do you hate?

David Blank-Edelman: Hate’s a strong word. I hate not having the chance to be creative. There are times where you get boxed in, and being creative isn’t the right thing to do at those times. So I like the times where it can go other ways, or where I can break through using creativity.

Michael Stiefel: What other profession besides your current one would you like to attempt?

David Blank-Edelman: I’ve done professional bread baking. I’d consider doing that, more of that. That’s super fun. I often thought that doing cognitive science was cool. Farming sounds cool sometimes. Organic farming sounds really cool. I mean, there’s lots of stuff in technology, but I don’t think those are the interesting answers, quite frankly.

Michael Stiefel: Do you ever see yourself as not being an SRE anymore?

David Blank-Edelman: It is my experience so far, in talking to lots of people, that you may no longer be paid for it, but you’ll always be thinking that way. And I would say the same of architects. I would find it very hard to believe that architects who are no longer employed, or no longer doing it officially, don’t think about things architecturally. It’s the way we’re built. It’s the way we like to operate. And I think the same is true of SREs. So technically, someone will stop paying me someday, but I will always be looking at things and thinking, “Oh, well, how can that fail?”, or, “How can I scale that?”, or, “What are the right questions to ask in this moment?”

Michael Stiefel: And when a piece of work is done, done – be it fixing a failure, making a deep change to a project where the signals are really interpreted and the project changes, or even finishing a project entirely – what do you like to hear from the people you work with, or from the clients, if you come into contact with them?

David Blank-Edelman: I’d like to hear that, “We enjoyed working on this together”, and that, “It brought us a sense of satisfaction and success, right livelihood”, et cetera. I really enjoy that. I like it when other people feel good about the interaction and engagement that we had and that we feel that we did the right thing, not just for ourselves, but also for the people it was serving. If you say, “Congratulations, we’re done, and now a few billion people are using it”, that feels pretty good.

Michael Stiefel: Well, thank you very much. I found this very interesting and I think important for architects to get a different view of the world, sort of watching how the pigs get slaughtered as opposed to just thinking about how to cook a meal.

David Blank-Edelman: Well, as a vegetarian, I encourage us to continue to iterate on our particular analogies. But yes, it’s been really super cool talking to you and to your audience. I’m always up for talking about this and I’ll always talk to you about architecture ’cause it’s so, so important. Thank you for all you do, and thank you for asking all the right questions. How’s that? That’s a nice way to end.

Michael Stiefel: Okay, well, thank you very much. I enjoyed talking to you, and maybe we’ll do this again.

Mentioned:

  • How Complex Systems Fail by Dr. Richard Cook – how.complexsystems.fail
