Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

News Room | Published 31 March 2026

Transcript

Michael Stiefel: Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job.

Once again, as we have done several times in the past, we are going to talk about something that is very important for architects but is not often explicitly discussed. We are going to focus on how to use software failures to improve software architecture.

Today’s guest is Lorin Hochstein, who is a Staff Software Engineer for Reliability at Airbnb. He was previously a senior staff engineer at Coupon, a senior software engineer at Netflix, a senior software engineer at SendGrid Labs, lead architect for cloud services at Nimbus Services, a computer scientist at the University of Southern California’s Information Sciences Institute, and an assistant professor in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln.

Lorin has a Bachelor of Computer Engineering from McGill University, an MS in Electrical Engineering from Boston University, and a PhD in Computer Science from the University of Maryland. He is a proud member of the Resilience in Software Foundation and the Resilience Engineering Association.

Welcome to the podcast, Lorin.

Lorin Hochstein: Hey, Michael.

How Did You Become A Reliability Engineer? [01:52]

Michael Stiefel: Reliability looks very different if you come at it not from this perspective of an architect but from the perspective of site reliability engineering. How did you decide to be interested in reliability and how is the perspective of a reliability engineer different from that of an architect?

Lorin Hochstein: I’ll just start off with the standard disclaimer that this is my opinion and not my employer’s. I don’t think anyone wakes up and decides to be a site reliability engineer, an SRE. There’s no explicit path for that. I was a traditional software engineer at the time. I applied to Netflix on what was called at the time their Chaos team, which was building fault injection tools. I worked on Chaos Monkey; I wrote the second version of it. And what I found on that team was that I became a lot more interested in how the system actually failed, in real failures rather than the synthetic ones we were injecting into the systems.

We would try to intentionally make things break like what happens if you fail requests to this non-critical service or what happens if you add latency here. But real incidents I discovered, like looking over at what the SREs were doing, seemed a lot more interesting to me. And so I just got hooked on them and I moved over to what was called the core team at Netflix, which was the Central SRE team and they were the ones that had to do incident management. The incident management itself is not my personal passion. I do it because it lets me get closer to the incidents and do analysis type stuff, but that’s really what… I just got hooked on it, on learning how these systems fail in very bizarre ways.

The Limits of Chaos Monkey and Fault Injection [03:35]

Michael Stiefel: Were you able to incorporate any of that reality into Chaos Monkey or that was just not practical?

Lorin Hochstein: Chaos Monkey itself is relatively simple, all it does is terminate instances. It just turns off virtual machines but that’s only one kind of failure. Another system we built on that team, on the Chaos team was called the Chaos Automation Platform where we would run experiments by selectively injecting RPC failures. Netflix operates what’s called a microservice architecture, where there’s a whole bunch of different services that are talking to each other. And so it would try to say, “Well, what happens if there’s some failure from service A talking to service B?” Which is different than just a server or a container or a pod going down in service B. And so there, we were relying on an existing fault injection library that had been built at the platform level where you could inject failures into the RPC calls. But really, I didn’t see a lot of feedback go back from the actual incidents we were having into the tooling that was being built to support that.

The tooling worked based on a certain set of failures that you could actually inject, based on the libraries things were built on. But the truth is that real incidents happen because of a confluence of different things happening at the same time. Typically, when you do a chaos experiment, you're failing one thing at a time; I've never seen people actually try to do multiple at once. And there are so many different possible combinations that you just can't cover that space. Real incidents are too messy to reproduce in a way that's generalizable, that wouldn't just stop that specific one from happening again, which people are generally pretty good at preventing anyway.
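The single-fault RPC experiment described here can be sketched in a few lines. Everything below is hypothetical (the service names, the injector API); the actual platform-level libraries at Netflix are internal, so this is only a minimal illustration of the idea of failing calls to one named downstream service at a time:

```python
import random
import time

class FaultInjector:
    """Inject failures or latency into calls to one named downstream service.

    A deliberately simplified stand-in for a platform-level fault injection
    library. Note it targets a single service per experiment, which is
    exactly the one-fault-at-a-time limitation discussed above.
    """

    def __init__(self, target_service, failure_rate=0.0, added_latency_s=0.0):
        self.target_service = target_service
        self.failure_rate = failure_rate
        self.added_latency_s = added_latency_s

    def call(self, service, func, *args, **kwargs):
        if service == self.target_service:
            if self.added_latency_s:
                time.sleep(self.added_latency_s)  # simulate a slow downstream
            if random.random() < self.failure_rate:
                raise ConnectionError(f"injected failure calling {service}")
        return func(*args, **kwargs)

# Experiment: fail 100% of calls to a non-critical service and verify
# the caller degrades gracefully instead of erroring out entirely.
injector = FaultInjector("recommendations", failure_rate=1.0)

def homepage():
    try:
        recs = injector.call("recommendations", lambda: ["row1", "row2"])
    except ConnectionError:
        recs = []  # fallback: render the page without recommendations
    return {"page": "home", "recommendations": recs}

print(homepage())  # the page still renders, with empty recommendations
```

The interesting signal, as Lorin notes, is less the experiment result itself than whether a team is comfortable turning the injector on at all.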

Michael Stiefel: You would say though that Chaos Monkey was still useful, it’s just not useful enough.

Lorin Hochstein: Yes, ChAP was the more general experimentation platform. Chaos Monkey was useful in forcing people to think about a certain type of failure. It was a forcing function for the architecture: you needed to be able to withstand a particular instance or pod going down at any point in time. So you couldn't maintain state on a thing that might just go away, and you had to have a cluster; you can't just have one thing running, because when you take it down, the whole thing goes down. What I noticed when I got there is that the real value in the chaos stuff is: do people feel comfortable turning it on? If you say, "We can't do that experiment to kill these instances, or we can't do that experiment to fail calls to this non-critical service, because I know the system is going to break", well, there you know what your problem is. You know you need to architect your system to withstand that.

Real Incidents Provide the Real Learning [06:22]

But once people have done that, Chaos Monkey can test for regressions, to make sure that you haven't fallen back and become vulnerable again. But generally speaking, its work has been done; people have internalized those rules and incorporated them into their architectural designs already. Most of the value, I would say, is in forcing people to think, "Okay, how do I actually architect my system so that it can withstand those failures? How do I build in fallback behavior so that when this non-critical service is down, I can serve stale data or something reasonable? Do I have the retries set up correctly?" Things like that.
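The serve-stale-data fallback mentioned here can be sketched as a small wrapper around a fetch call. The names, cache shape, and staleness window are all assumptions for illustration, not any particular production implementation:

```python
import time

class StaleFallbackClient:
    """Serve the last known-good value when the downstream call fails.

    Illustrative sketch only: a real client would also bound retries,
    emit metrics, and distinguish error types.
    """

    def __init__(self, fetch, max_staleness_s=300.0):
        self.fetch = fetch                  # callable that hits the real service
        self.max_staleness_s = max_staleness_s
        self._cached = None
        self._cached_at = None

    def get(self):
        try:
            self._cached = self.fetch()
            self._cached_at = time.monotonic()
            return self._cached
        except Exception:
            # Downstream is unhealthy: serve stale data if it's fresh enough.
            if (self._cached_at is not None
                    and time.monotonic() - self._cached_at < self.max_staleness_s):
                return self._cached
            raise  # nothing usable cached; surface the failure

# Simulated downstream that succeeds once, then goes down.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("service down")
    return {"greeting": "hello"}

client = StaleFallbackClient(flaky_fetch)
print(client.get())  # fresh fetch succeeds
print(client.get())  # fetch fails; the stale value is served instead
```

This is the architectural shape Chaos Monkey-style experiments are meant to force: the caller keeps working, degraded, when a non-critical dependency disappears.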

How Do Architects Learn From Real Incidents? [06:59]

Michael Stiefel: That raises the question then, if it’s only from real incidents that you can gather knowledge of how a complex system fails, how do you get this knowledge back to the architects so that both the system architecture that you’re working on can be improved but also you learn something for the future?

Lorin Hochstein: I think it’s the key question. I would say the hardest problem in an organization that’s non-trivial in its size is how to get the right information into the heads of the people who need it because there’s too much information. You could spend all your time, say, reading docs or something like that and do no other work and still not even absorb everything that you would need to know. What I would say to that is you would want the architects to attend the incident review meetings. Most companies, at least all the ones that I’ve been at, after at least some of the incidents, typically the more severe ones, there’s some sort of incident review meeting that is open to the entire company where they go over what happened in the incident. And that is a great way to learn about not just failures but actually how the system normally works.

And you will learn things about how the system works that you would never have known even if you initially designed the system, because people use it in ways that are surprising, changes happen over time that invalidate initial assumptions about how the system works, and you can't see that stuff normally. But when it breaks is when we have a chance to actually spend time looking into it. And also there's the postmortem document or the incident writeup. Reading that writeup, attending the review meeting, and actually being able to talk to the people who were involved and have conversations about it, I think, is where the real value is.

Michael Stiefel: Do you actually find that happening or is this something that just architects are not interested in?

Lorin Hochstein: I do see it happening. I do see high-level people attending incident review meetings. I don't know how much they're internalizing generally; one of the challenges is how do you know what people are learning, and whether they're learning. I will mention one thing that is very disappointing and one thing that is very encouraging to me. The disappointing thing: there's usually a Slack channel associated with an incident review meeting, and I will say, "If you learned anything in this incident review, put it in this Slack thread". And there's really very little traffic in that thread; not many people actually post there, unfortunately, although I try to put in things that I've learned.

On the other hand, people keep coming back. The incident review meetings that I've attended are surprisingly well attended, including by the sort of high-level engineers we basically call architects. So they've got to be getting something out of those meetings. These are optional; they don't have to come unless they were directly involved, and yet they keep coming. They must be getting some kind of value out of them. Sometimes they contribute and make suggestions about how you could architect the system differently, but not always. Their time is scarce, just like all of our time is scarce, and they are making time to attend these meetings, so they must be getting something out of it.

Michael Stiefel: From my experience, you often learn more from failure than you do from actual success, as disappointing as the failure is. One of the things that comes to mind is the Cloudflare outage on December 5th. To my understanding, what they were trying to do was actually improve the system, and they wound up destabilizing it.

Advanced Failure Mitigation Can Lead To More Failures [10:38]

Lorin Hochstein: Yes. I have found that… I even have a conjecture about that, which some of my colleagues call Lorin’s Law, which is that once you reach a certain level of reliability, then all of your large failures are going to be either because someone was taking some action to mitigate a smaller incident and then something happened during that mitigation, or it’s some subsystem that was designed to improve reliability had some unexpected interaction with the rest of the system. We talk about simplicity being important for reliability but if you look at any real system, the ones that have gotten more reliable, they’ve added complexity over time to increase that reliability.

Even if you look at a car, look at seat belts or airbags or anti-lock brakes. Those are all increases in the complexity of the system; there's a trade-off and it's a good trade-off. I'm glad we have anti-lock brakes and seat belts and airbags, and I'm glad that we have health checks and load balancers and all sorts of monitors and things like that. But these things are like your immune system, which can attack its own body. With these complex reliability systems that are monitoring and trying to take action, things can go wrong because you can't see the entire space. You didn't realize that this other thing would happen at the same time; there's this latent bug we never hit until now. Pretty much every large-scale Amazon incident whose public write-up I've read, for example, is like that. It's always some monitoring system or some system that's designed for reliability. We talk about trying to reduce complexity, and that's good, but at the end of the day we're always increasing complexity to increase reliability, and that creates new complex failure modes. And that's just life.

Homeostasis and Failures Due to Resource Saturation [12:17]

Michael Stiefel: Well, it's interesting that you mention the immune system, because one of the most fundamental functions of the human body is to maintain homeostasis. And the problem with maintaining homeostasis in the human body is that you have all these feedback loops going on. It's a very complicated, non-linear system. When certain things get out of bounds and interact with other things, you get things like the immune system attacking itself. It would seem to me that these large, complicated systems have some notion of homeostasis. And because they get complicated, and you get external and internal pressures on various things, they can wander out of homeostasis and you wind up with these problems, which are inevitable.

Lorin Hochstein: I think failures are fundamentally unavoidable. We can recover quickly or slowly, we can do better or worse, but really we cannot build a perfect system that never fails. The world is just too complex, no human being can understand all of the code that people are all changing at the same time and the changing traffic patterns internally and changing things underneath you. The world is just too dynamic. And like you said, the way you deal with dynamism is through feedback, that’s how we build control systems. But feedback always has the risk of instability.

Once upon a time, I studied electrical engineering, and much of control theory is about stability. How do you ensure that this system you're building doesn't go unstable? One of the most common failure modes I've seen is what's called saturation, where something gets overloaded. That is extremely, extremely common in complex system failures and distributed systems. It could be that all the logic is correct, but the cloud is not fully elastic; it might eventually run out of resources. You probably hit some limit somewhere, and then bad things happen.
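The saturation failure mode is easy to see in a toy discrete-time queue model (the numbers are illustrative, not from any real system): as long as offered load stays under capacity the backlog stays bounded, but even a small excess grows without bound, and latency grows with it:

```python
def simulate_queue(arrival_rate, service_rate, steps=100):
    """Toy discrete-time queue: each tick, `arrival_rate` requests arrive
    and up to `service_rate` are completed. Returns queue depth over time."""
    depth, history = 0, []
    for _ in range(steps):
        depth += arrival_rate           # new work arrives
        depth = max(0, depth - service_rate)  # server drains what it can
        history.append(depth)
    return history

# Below capacity: the queue stays bounded (here it drains every tick).
print(simulate_queue(arrival_rate=90, service_rate=100)[-1])   # 0

# Just 10% past capacity: the queue grows every tick -- saturation.
print(simulate_queue(arrival_rate=110, service_rate=100)[-1])  # 1000
```

Every component's logic in this model is "correct", yet the system still fails once some limit is exceeded, which is the point: retries, traffic shifts, or a slow dependency can push a healthy system over that cliff.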

Michael Stiefel: I had David Blank-Edelman on the podcast a couple of months ago. And one of the things he mentioned, which goes to your point, is that very often we should focus on how the system actually worked. And it’s a miracle sometimes that it actually does work and doesn’t fail more than it does.

Lorin Hochstein: It kind of is, right? Look how dependent our entire world is today on software and yet we don’t have catastrophic failures constantly happening to us. Things pretty much work. It’s kind of shocking actually. It is legitimately surprising. Do you know Gerald Weinberg? He’s a famous software author. He said that if software engineers wrote software the way… What is it? Civil engineers…

Michael Stiefel: Construction engineers built buildings…

Lorin Hochstein: The way software engineers built software, the first woodpecker that came along would destroy civilization. But we’ve had plenty of woodpeckers and no civilization destroyed here.

Michael Stiefel: I've read quite a few of his books, and I think The Psychology of Computer Programming is a classic. He had a very big gift for simplifying very complicated things and explanations and getting to the point.

Lorin Hochstein: Yes, he understood the human aspect a lot. He wrote a book on general systems thinking, which I like very much.

Risk Mitigation and Tradeoffs [15:29]

Michael Stiefel: In some sense you're also saying, and maybe this will appeal more to the engineering mind, that you are not really eliminating risks, you're trading them off.

Lorin Hochstein: That's right. Yes. There's a guy named Todd Conklin who works out of one of the national labs, and he's a safety guy. One thing I really like that he says is that you don't manage risk, you manage the capacity to absorb risk, which I think is really nice. What you can do is be prepared so that when the risk materializes, you can deal with it effectively. And that's one of the big ideas from a research area that really inspires me, called Resilience Engineering: you try in advance to build up this general capacity for dealing with issues, so that when things go wrong, when something unexpected happens, you are as well positioned as possible to deal with it and mitigate it, even though you don't know what it's going to be.

And so one of the advantages of the cloud in this sense is that you can scale up, you can throw capacity at the problem. And we often do that; scaling up is a very common mitigation strategy, and it works well because you can throw more resources at the problem. Even staffing on-call rotations is a form of generic capacity. It's not the software architecture, it's the human part of the entire system architecture. You have these resources that you can make available very, very quickly. You don't know exactly what the problem is going to be, but you know they have the expertise to be able to solve it. We don't think about staffing on-call rotations as an architectural concern, but it is part of the architecture of the entire system; it's just not part of the software architecture.

Socio-technical Constraints [17:10]

Michael Stiefel: Well, if you look at any large software problem, and it's true of smaller software problems too, you have to look at the universe of constraints, including the social ones as well as the technical ones. My favorite example of this, and this topic is a little far afield, is the people who advocated, for example, agile development. That depends critically on having independent software developers who are able to articulate their ideas. And that's only a part of the software world; there are plenty of software engineers who want to be passive and just do their job and go home at the end of the day. You really have to look at all the resources: financial, people, risk. Where are you in the range of risk? There's a big difference between designing the software for an airplane and designing the software for some video game.

Lorin Hochstein: Even within just an airplane, look at the control software versus the entertainment system. I've seen many failures in entertainment systems, but that's fine. I don't want my ticket to be five times more expensive to get more reliable entertainment software. It's an inconvenience, but I don't want to pay for that. I do want to pay to not crash.

Michael Stiefel: Yes. Yes. I always like to give this very personal example. Many, many years ago I wrote some test software for some medical equipment and then I left the project and went on to other things. And then one day I walk into my doctor’s office and lo and behold, he wants to administer a test to me using the equipment that I had written the test software for. I said to myself, “I sure hope I did a good job”.

The Build vs. Buy Decision and Organizational Complexity [19:06]

Lorin Hochstein: One thing I'll say about that, in terms of thinking about the larger constraints, is that one very common question you're always faced with in the software world is build versus buy. Do I build this in-house or do I go to a vendor? One of the issues that comes up in the buy case is that if there's an incident that involves some interaction between your software and the vendor's software, now you have to coordinate across two different organizations. And the further you are organizationally from the people you're working with, the harder it's going to be to resolve that. I've almost never seen that taken into account when making that decision: thinking about, "Well, when the incident happens, we won't know if it's us or them or an interaction". But it's just an order of magnitude harder. It's easier if it's just your team; another team is further away, and a vendor is much further away in the virtual org chart. So that is a constraint that is often overlooked, but it's a real thing, and it's just much harder to deal with an incident across organizational boundaries.

Michael Stiefel: I remember early in my software career, we were building software and I had to interact with the compiler people and the database people, and then our group was moved to a different building. All of a sudden, same company, same language, no different culture, it became infinitely more difficult to get information, because I couldn't just meet them at lunch or wander into their office if I had a question. Of course, this was in the days before real email and what have you. But that, I think, is underestimated.

Lorin Hochstein: Sure. It’s funny, there is the physical architecture of the organization that impacts the way the whole system works. You don’t think of architecture in terms of building architecture but it impacts the way that the system functions.

Robustness vs. Resilience [20:53]

Michael Stiefel: There's a difference between robustness, where you try to make something as strong as possible, and resilience. And again, if you go back to the human body, the human body is not always robust but it is resilient.

Lorin Hochstein: It is, yes. It can adapt to all sorts of things that nature never imagined it would have to do.

Michael Stiefel: Right. You could say it is an evolving design. In other words, as evolutionary pressures act on the body and the body reacts, there's sort of a design to it. You can look at it and ask, "How does this work from an engineering perspective?" and then abstract away a design, even if there wasn't one originally.

Lorin Hochstein: Yes. And one thing our systems have in common with that is they evolve over time. You may have designed a system initially, but there are lots of incremental changes over time, and they might invalidate the initial design assumptions. You do the best you can as you evolve it, based on your understanding of the world and the constraints. But we end up with things that are not necessarily optimal for the problems we're facing today, because we're constrained by history, just like our bodies. I can complain about my knees; I don't think they're particularly well designed, but that's how they evolved.

Michael Stiefel: And as the systems get older and the technology around them changes, very often like the aging human, they become less resilient simply because the world around them has changed.

Lorin Hochstein: It becomes harder to change. This is something I think was recognized back in the '70s: software becomes harder to change over time, but we have to keep changing it. Yes. The robustness-resilience distinction is really important because I think it's not super well known in our industry. Resilience is often used in software as a synonym for robustness, but they really are different. Robustness is designing for the kinds of failures that you can anticipate, and there are a lot of failures we can anticipate. We know a lot of things that could potentially go wrong, and there's a ton of architectural patterns that are designed to handle known failures. But we are always going to hit something we didn't design for. And that's where resilience comes in: how can we be best prepared to deal with the problems that we did not anticipate, that we were not explicitly designed for? In trying to design for problem X, we may now actually be more vulnerable to problem Y without even realizing it.

We Make the Same Mistakes Over and Over Again [23:10]

And so you need both, you definitely need both. But our industry historically is really focused on robustness and doesn't think in terms of, "Well, what can we do to generally get better at dealing with the unknown?" Engineers are not good at thinking about how to deal with problems that we cannot anticipate. Prepare to be surprised.

Michael Stiefel: Right. As Rumsfeld said, "It's not the known unknowns, it's the unknown unknowns".

Lorin Hochstein: And the amazing thing is that it keeps happening. There are things that happen to us over and over again, and yet we don't quite internalize the lesson. In every incident, someone says, "I never imagined that this kind of thing would happen". And I can tell you that the next time an incident happens, it will be the same: "I never imagined this would happen". Our field is also famous for not being great at estimation; we seem to make the same mistakes over and over and over again. I don't think I've seen in my lifetime a significant improvement in our ability to estimate software project completion time. This seems to be a very hard problem; we don't seem to be able to do it very well.

Michael Stiefel: I think other industries have this problem too and they’ve owned up to it in a way because there are things like price escalators or cost escalators in the contracts. I think part of the problem is that in software, unlike other forms of engineering, you’re not doing the same thing over and over again. In other words, you can be a civil engineer and make a career out of building the same bridge over and over again. That’s not how software works. If I want another copy of Microsoft Word, I just copy the bits. Intrinsically, you are doing something that has probably not been done exactly that way before. It becomes very difficult to find tooling to estimate costs because you’re always pushing the frontier in some way.

Lorin Hochstein: Yes. I’m always a little hesitant to compare with other fields just because I’ve never worked in construction, say.

Michael Stiefel: But usually in the construction industry, they have estimation books. And they know in winter weather, it'll take so long to… Even the delays that I've had in home remodeling are usually more about time than they are about cost. In other words, you know what the parts are. No one comes into a kitchen remodel and decides to put a jacuzzi in the middle of the kitchen. We do that all the time in software.

Lorin Hochstein: There is definitely something… What did Fred Brooks say? Ethereal about the nature of software that we are…

Michael Stiefel: Yes. Yes. Yes.

Lorin Hochstein: On the one hand, we're constrained only by our imaginations. But on the other hand, there are resources underneath: you're running on physical machines and everything is resource constrained. One of the insights I've gotten on the SRE side is that it's not ethereal magic stuff; there are actual physical and virtual resources that you're always running on, and that you can run out of.

The Blameless Culture and Personal Responsibility [26:09]

Michael Stiefel: Just to change focus for a minute. I find that we still have not learned this in society; there's always a great temptation to blame somebody or something. And if you remember the conversation we had at the end of one of the talks at the San Francisco QCon, somebody raised the objection, "Well, if you have this focus on not trying to blame humans..". Which is good, because if you blame humans, then they won't tell you the truth and you'll never find out what really happened. Take an airplane crash, for example. It's agreed upon by the airlines that there will be a certain liability. But you're not going to blame somebody in the incident review, because if you do, they won't tell you what really happened and you won't learn from it.

But in general, we seem to not have learned this because people want to blame humans. At the same time… Well, sometimes how do you have accountability for this? Because at some point there is some human responsibility somewhere.

Lorin Hochstein: On that topic, I think it's a very human response to say something bad happened, so somebody must have done something wrong. This is how we understand the world. And there are some people in my field who prefer the term blame-aware rather than blameless: people are going to blame, it's going to happen. It's just something that humans do. One of the reasons that I am a big fan of at least the idea of blamelessness is that we're looking for systemic problems, not individual ones. You can look at it two ways. You can look at it and say somebody did something wrong, they didn't test well enough, for example. And so what do you do? You tell them to test better next time. You monitor them: "Hey, do better next time". What can you really do? I guess you can fire them. But maybe there's a problem that makes it harder to test; maybe you could only catch it in end-to-end testing, and the end-to-end tests are flaky, or we don't have good support for that.

Michael Stiefel: Or they weren’t given enough time to do the test.

Lorin Hochstein: Yes. That's a great one, production pressure. If there are problems in the system that are increasing the likelihood that errors happen, and you don't attack those systemic problems, then you're going to have the same issues; someone else will make those mistakes. If you don't change the system, the system is not going to change. And that means you have to look for the systemic issues, and blame doesn't look at systemic issues, it looks at individual ones. It says, "What was the problem with this person, that they weren't following the right procedure or were rushing or whatever?" That doesn't help you improve the system.

What I like to do is think about, imagine every decision that was made leading up to this was rational. Everyone based on the constraints they were working under and based on the information they had at the time, they made decisions that made sense and yet this incident still happened. How could an incident happen given that everyone is making rational decisions based on their constraints and their local knowledge? And I feel like you’re going to learn a lot more about how incidents happen by doing that, by assuming that individuals are actually doing their job.

In terms of accountability, one of the reasons why I get uncomfortable with that language is that in my experience, incidents are frequently due to interactions across multiple components or teams. And accountability is really about, "Okay, who's the throat to choke?" You know what I mean? Who's the person who's going to be on the hook? But if you're focused on finding an individual, then you're not going to see the interactions. And those are the ones I worry about a lot more. I don't think accountability can resolve problems that are interactions across teams; maybe there's bad information flow and they don't understand each other. And so that's why I'm always a little allergic to accountability discussions. But I understand that it is one of the tools that management uses. We are in large organizations, it's hard to run a large organization, and this is one of the levers management has to ensure things get done. And so the question is, how do we reconcile the need for accountability with understanding problems that might not be solvable through accountability mechanisms?

Michael Stiefel: My favorite example of this is – an airplane crashes because the pilot flips the wrong switch. Okay? You have a proximate cause, a human being made a bad decision. The question is, why did this individual flip that switch? Were the two switches close together so that it looked like it? Was the airplane in a mode that no one ever thought it was? Were the dials wrong, giving incorrect information? There could be a lot of reasons. Yes, the human made the bad decision. But why did the human make the bad decision?

Lorin Hochstein: Yes. I think trying to get into the heads of the people when they made those decisions, that’s the ultimate goal I think of a good incident review. Can we get into their heads to figure out why they did something that from the outside seems bonkers? Why would you do that?

Michael Stiefel: I suppose from the accountability point of view, if a person seems to be involved in a lot of incidents that can’t be explained, or constantly uses poor judgment, or doesn’t estimate things right, I suppose you could then exercise accountability. But that’s the exception rather than the rule. And that conclusion is the result of looking at it through a blameless lens.

Lack of Competence Should Show Up in Everyday Work [31:32]

Lorin Hochstein: There are sometimes issues of competency, but my hypothesis would be that it wouldn’t just be in incidents where you would see that. If someone is really not competent in a certain way, then I would think a manager should be able to see that in their day-to-day work. You know what I mean? It shouldn’t just come out in incidents. I would be uncomfortable using incidents as the lens to assess that, especially because some services are more critical than others. If it’s the front-door service, then anytime there’s a big problem, that service might be involved. Some services just have large blast radiuses inherently because of architectural decisions. You might be trying to change that, but then you’ll see people on that team show up over and over again, and it’s just because of the architecture of the system and that happens to be a vulnerable part.

I think that does shine a light on maybe you need to make an architectural change. But I wouldn’t say, well, just because someone pushed a change to that particular thing and then it broke, they’re at fault. It’s more like, “Well, why is it dangerous to make changes to that service?” And because there aren’t that many people on these teams, teams are generally relatively small, it wouldn’t surprise me to see some people come up over and over again. And often the people I see over and over again tend to be more operationally sophisticated, because they are operating critical services and they need to be able to respond quickly when they break. So I will say, as an incident commander, I actually am happy to see people I’ve seen several times before. I know them, I trust them. When some non-critical service has weird behavior and people get brought in who have never had to deal with this before, who don’t actually really know a lot about how it works, those incidents are much, much harder. Those people don’t have the scars, they don’t have the operational expertise.

Michael Stiefel: It’s like blaming, in a fire, blaming the fire department because they always show up for the fire. Well, of course they always show up for the fire because the fire is someplace else.

Lorin Hochstein: Yes. Statistically, people in hospitals are more likely to die, but it doesn’t mean you should avoid a hospital if you’re sick.

Software Reliability Principles Are Not Widespread [33:26]

Michael Stiefel: Then this raises… and maybe this is the final sum-up question before I get into the questionnaire that I like to ask all the people who appear on the podcast: why are these ideas not as widespread as they should be? At least in my opinion, and I’m sure in your opinion as well. Is it because the software resilience community has not done a good job explaining them, or because it’s not a corporate priority? And is this really different from other engineering disciplines?

Lorin Hochstein: I wish I knew the answer to that, because you’re asking more generally, why do certain ideas spread and others don’t? There are ideas that we know spread: agile spread enormously, DevOps spread. And then there are other ones which didn’t. I don’t really know. If I knew what it would take… I am one of the people who is trying to make this spread. These are ideas that came from a different field that we’re trying to make spread. And sometimes that succeeds. Lean came in from manufacturing, and ideas around Lean have spread very successfully in our industry, I would say. I don’t really know what it takes for an idea to spread. I want to say these are squishy human stuff, but agile is squishy human stuff, DevOps is squishy human stuff. It’s kind of related. I don’t know.

I really wish I did know why it’s taking longer to spread. I got hooked on it through John Allspaw posting on Twitter many years ago and he would post papers and stuff. And I’m like, “All right. To shut them up, I’ll read the papers”. And then I got hooked on it. But it tends to have come from an academic-y background, and it’s hard to transfer academic ideas, I think. Although you see success in transferring academic ideas in distributed systems, those have made it over. Yes, I don’t really know. I don’t really have a great theory as to why it’s not spreading as much as I like. But we’re trying. I think we’re doing better than we did 10 years ago.

Michael Stiefel: You would think that economics would force this a little bit. You would think, the examples of the large companies, if maybe they would explain a little more how they did their incident reviews when they have these outages like Cloudflare or Amazon or these things.

Lorin Hochstein: I think we’re making slow progress but I think it’s not necessary to embrace these to survive. And so it’s amazing how… I don’t want to say how poorly an organization can do but organizations don’t have to be optimal in order to be going concerns once they reach a certain size and momentum. They will eventually decline and fall but they can take a long time. And so at the margins, I don’t know how much of a difference this would… You wouldn’t see it in this short-term success of the company. I don’t know with other fields but people rotate very quickly through our industry in terms of companies. If someone has been around for two years, to me, that’s like, “Oh, that’s a pretty substantial period of time you’ve been at a company”. Where my parents, for example, were at the same company for their entire lives.

And so we’re very… I don’t know. It’s very fleeting, the experience within individual companies, and it’s like that for all of them. You’d think they’d all be very vulnerable, but the momentum keeps them going. And so I would like to say it’s a matter of survival to learn this stuff, but it really isn’t; all companies have a certain amount of resilience already. One thing we do well: when we hire, we hire for expertise. This is one of the things that all companies do. When you hire someone, you don’t say, “Okay, tell me what specifically you’re going to build inside my company when you come”. Nobody does that. I don’t like the way we actually do coding interviews, but we are hiring people for general expertise when we hire them.

And everybody does that and everyone understands and they pay more money for seniors than juniors because of that. And that actually goes a long way. And there’s a lot of people behind the scenes doing this stuff implicitly. I think we could do much better. I hope things like this podcast get these ideas out but I think it’s just taken a long time.

Michael Stiefel: Well, certainly if people move from company to company, that’s part of how these ideas spread. I’m sure you brought these ideas with you. Is there anything, reflecting on the conversation we have, that you wanted to bring up that we haven’t covered or talked about?

The Importance of Storytelling [37:48]

Lorin Hochstein: One thing I would bring up is just the idea of storytelling, using stories around incidents to inform people about the system. I think that there is pressure once again from leadership from above to just give me the bullet points, what do I need to know? But really we don’t know what other people are going to learn from any particular incident. And human beings just absorb a lot more content through stories than they do through a PowerPoint with bullets on it, a graph. And once again, our industry, software engineers and architects, we’re not trained to tell stories. This is not something that we learn about in schools, say.

But at my current company, Airbnb, we actually have a storytelling session that we do once a quarter, run by myself and another engineer who came over from Twitter, where he was doing the same thing; he brought it to Airbnb. It’s called Once Upon An Incident. We get three storytellers once a quarter, and they talk about an older, impactful incident. And we get a lot of good attendance at that too. One thing I hope is that it encourages people to tell more stories about incidents, at least internally. It is a way of spreading knowledge, and as human beings, we’re just wired for that sort of thing.

Michael Stiefel: I’ve always found that when I give a presentation at a conference, even if it’s a technical one, if I cast it as a story, people relate to it more than if I just give a dry presentation.

Lorin Hochstein: Yes, we love it.

Michael Stiefel: Yes. If you look at the social science, they claim that from an evolutionary perspective perhaps, storytelling was very important in building the earliest human communities.

The Architect’s Questionnaire [39:37]

Michael Stiefel: To get to the questionnaire: what is your favorite part of being involved with software reliability?

Lorin Hochstein: I have to admit, I love a good, complex incident. I love the story of like, “Oh, actually this change had been made two years ago and no one noticed at the time that it was there, but it set the stage. And then this other change happened here”. I find it fascinating. I just really enjoy learning about the complexity of how all these different things interacted and happened to get through all of our defenses. This perfect storm. So many incidents are perfect storms; it’s just learning about the specific details of that, and learning about, “Oh, this team assumed that the other team had deployed already because they normally deploy on Wednesdays, but there was something that had delayed them this week”. There are just all these little details about how the real work gets done in the system. I love learning how people actually really do their work and how things actually happen. A good incident writeup has a lot of those details, and I love that stuff. I read it for fun, kind of thing.

Michael Stiefel: It’s almost like it’s a murder mystery or a crime mystery.

Lorin Hochstein: Sometimes it’s like a horror story. Oh my God, you can see the trap has been set. The bug is there and it’s just like someone is going to hit it. Yes. They don’t know it and then they take this action. Oh no, they don’t know what’s happening.

Michael Stiefel: Unlike in a real horror story, “Don’t you know? Can’t you see Freddy Krueger is behind you?”

Michael Stiefel: What is your least favorite part of your job?

Lorin Hochstein: I think my least favorite part of my job is the administrative stuff I need to do that I don’t think advances the business at all, but has to be done just for the company to go. An example: we just did performance reviews. And I hate that stuff because I’m like, “Ugh, I don’t think there’s any real value in doing this”. I understand why it has to be done, but anytime I’m doing work that I don’t think is actually valuable for the company, it’s so hard for me to motivate myself to do it. Now, I will say that being on call makes me anxious. I’m an anxious person. I guess my least favorite is being woken up at two in the morning because an alert has fired and it turns out to not be a real thing; that’s probably pretty high up there in things I don’t enjoy.

Michael Stiefel: Do you think, just something that occurred to me off the top of my head, that AI might make an impact here on trying to be the first responder for certain incidents?

Lorin Hochstein: Yes, there’s a lot of work in that area right now; there are a lot of different companies doing AI SRE stuff. I’m taking a wait-and-see approach myself: “Okay, is it going to be useful? Is it going to save us time? Is it going to help?” I don’t think it’ll take over. You know what I mean? It won’t be 100%, though I would love it if it was 100% and we didn’t have to staff humans on call anymore. But it’s still early. There are a bunch of companies trying to do this, we don’t know how well it’s going to work, and I am being agnostic. Let’s see what happens. But I think there’s promise there. I think it could make the easy ones easier, but the hard ones are the ones that I tend to worry about the most.

Michael Stiefel: But that’s one less thing to worry about if it’s successful.

Lorin Hochstein: Sure.

Michael Stiefel: Is there anything creatively, spiritually, or emotionally compelling about software reliability engineering or being an SRE?

Lorin Hochstein: Well, there’s something very synthetic, and holistic, about it, which is different from traditional engineering. Traditional engineering is very analytic: you break problems apart. This is a very big thing in architecture, the separation of concerns, for example. You want to decompose the system in a way so that you can work on the individual pieces. SRE is the entire opposite, because when everything is working properly, analysis works great, you break things down. But when something is broken somewhere and the system is not working, now you have to see how the entire system works to figure that out. I don’t know if spiritual is the right term, but it’s a very holistic view that I find to be very different from the traditional analytic approach. And I find it very rewarding to think that way, to think very holistically of the entire system, especially when you start to include the people in the overall system, not just the software but the people responding to it.

One thing I’ll say that I find rewarding: I’m on the incident command on-call rotation, and there’s an ad hoc team that forms when an incident happens. The incident commander keeps things moving forward so people don’t get stuck, makes sure the different paths get explored, things like that. And that can actually be very rewarding. It’s stressful, but you are there to support other people in fixing the system, and it can feel very, very rewarding that you are helping other people get things back for the customers.

Michael Stiefel: You’re like the doctor helping the patient recover.

Lorin Hochstein: Yes. But I’m helping coordinate other people to do their work, I’m helping other people do their work. And I personally like that. When I’ve been in software engineering directly, it’s always been engineering tools. I like helping other people work better and incident command is that, not with tooling but with coordination. But I find that can be very rewarding.

Michael Stiefel: What turns you off about software reliability engineering or being an SRE?

Lorin Hochstein: One of the things I find frustrating is the traditional view about metrics. That’s how leadership deals with things, because the world is too big. One of the reasons we use numbers is to make it easier to deal with the world; the world is big and messy and complex. I understand why leadership does that, but I find it frustrating to boil things down. Like, “What was the time to resolve this incident? What are the trends on that?” I don’t like having to record those numbers and do trends on them. I don’t think that’s insightful, but it’s one of the things that gets asked for.

When I’m asked to do things that I don’t find constructive and that take my time, those metrics are one of those things. Is this a sev one or a sev two? Whenever you’re asking which bucket something falls in, I really don’t enjoy that. There’s no insight to be gained. You’re not learning anything more about something by asking which label to put on it, whether it goes in bucket A or bucket B.

Michael Stiefel: Part of that, of course, is the fact that when you reduce things to numbers, what can’t be measured gets neglected.

Lorin Hochstein: Right, exactly.

Michael Stiefel: And also, if you have simple metrics… really, metrics sometimes need to be triangulated. In other words, it’s not just the metric, how long did the incident take, but what was the complexity of the incident? You have to take several numbers and put them together, rather than rely on single numbers.

Lorin Hochstein: And so John Allspaw talks about there’s a distinction between a complex incident handled well and a simple incident handled poorly, and they both might have the same resolution time. And just looking at that resolution time doesn’t tell you how well people did in responding to that incident.

Michael Stiefel: Yes. Yes. Do you have any favorite technologies?

Lorin Hochstein: Oh, I have a soft spot for Clojure, but I’ve never actually used it professionally. I just do hobbyish stuff with it; when I write my own little scripts, I do it in Clojure. I’ve also enjoyed playing with some of the formal modeling tools. TLA+ and Alloy are lightweight formal methods tools that I’ve historically been interested in.

Michael Stiefel: What do they do, these modeling tools?

Lorin Hochstein: Those are tools that you use to build a mathematical model of a software system and then check that some property holds. For example, I want to model a concurrent algorithm and check that there’s never any deadlock, or that you never have two threads in the same critical section at the same time, stuff like that. I’ve had fun with those, but these are hobby things, the fun things I’ve done on the side. That’s really it, I guess.
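The core mechanic Lorin describes can be sketched in a few lines. The following is a minimal Python illustration (not TLA+ or Alloy themselves; the state representation, transition rules, and function names are invented for this example): it exhaustively explores every reachable state of a toy two-thread lock protocol and checks the mutual-exclusion property in each state, which is essentially what a model checker does at much larger scale.

```python
from collections import deque

# Toy illustration of model checking: enumerate every reachable state
# of a two-thread lock protocol and verify that no reachable state has
# two threads in the critical section at once.

IDLE, WAITING, CRITICAL = "idle", "waiting", "critical"

def replace(pcs, tid, pc):
    """Return a copy of the program counters with thread tid set to pc."""
    new = list(pcs)
    new[tid] = pc
    return tuple(new)

def next_states(state):
    """Yield every state reachable in one step from the given state."""
    pcs, lock_holder = state
    for tid in (0, 1):
        pc = pcs[tid]
        if pc == IDLE:
            yield (replace(pcs, tid, WAITING), lock_holder)   # request entry
        elif pc == WAITING and lock_holder is None:
            yield (replace(pcs, tid, CRITICAL), tid)          # acquire the lock
        elif pc == CRITICAL:
            yield (replace(pcs, tid, IDLE), None)             # release the lock

def check_mutual_exclusion():
    """Breadth-first search of the state space; return (ok, counterexample)."""
    initial = ((IDLE, IDLE), None)
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        pcs, _ = state
        if pcs.count(CRITICAL) > 1:
            return False, state  # invariant violated: counterexample found
        for succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return True, None  # invariant holds in every reachable state

ok, counterexample = check_mutual_exclusion()
print(ok)  # True: the lock protocol preserves mutual exclusion
```

TLA+’s TLC model checker and Alloy’s analyzer perform the same kind of exhaustive search, but over declarative specifications rather than hand-coded transition functions, and with far more powerful state-space handling.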

Michael Stiefel: What about software reliability engineering do you love?

Lorin Hochstein: I love getting to see the entire system. Everyone else zooms in on one specific aspect of the system, and I love that we get to see the whole thing. Now, that makes it harder, and one reason it makes it harder is that you’re always going to hit a problem where you didn’t even know that thing existed. But I love learning about how that stuff exists. I really love that we, as part of our regular work, get to see glimpses of the entire system.

Michael Stiefel: What about software reliability engineering do you hate?

Lorin Hochstein: I hate that I cannot show you how many incidents didn’t happen because of software reliability work.

Michael Stiefel: Yes.

Lorin Hochstein: I can’t do an ROI. It’s a little bit like plumbing, where you only notice it when something is not working, and so it’s not appreciated. A lot of the reliability work doesn’t look like much because it’s around spreading information to different people. We don’t always have an artifact to show at the end of the day, like, “Look, we built X”. The work is often not physically tangible. And even I don’t know. I can say, “Well, look, I’m doing great things, but you can’t see them”. Sometimes I don’t know if I’m having an impact or not. If I lead an incident review or help with a writeup, I have no idea whether that’s had an impact. I will never know. And that can be a little disillusioning, to say, “I will never know if I’m actually having an impact or not”.

Michael Stiefel: What profession other than being an SRE would you like to attempt?

Lorin Hochstein: I was a professor once upon a time. I don’t know if I would go back to that. When I retire, I think I would like to just be a permanent student as a profession. That’s what I would want to do. I loved being in school. I can see just doing that for the rest of my life once I don’t have to work anymore, just taking courses and learning about different things.

Michael Stiefel: Do you ever see yourself not being an SRE anymore?

Lorin Hochstein: It’s hard for me to imagine that. I’ve tried to go back to just regular infrastructure platform software engineering but I keep getting pulled back into reliability. And I write about software reliability as a hobby on my blog so clearly this is where my head is. And so I think about it too much. It’s just too much part of my identity at this point that it’s hard for me to imagine. Unless I get super burnt out and try to swing back to regular software engineering again, I think I’m going to be in it for the long haul.

Michael Stiefel: When a project or an incident review or however you want to think of a project is done, what do you like to hear from the clients or your team?

Lorin Hochstein: That’s a good question. My favorite is, “Hey, here’s where I used this. Here’s where it was useful to me that you did this”. On my team, we build some tooling; we’re not just incident responders. If I see someone using that tooling effectively, that’s actually the thing that gives me the most positive feedback. Like, “Hey, someone is actually able to use this stuff and do work with it”. More than someone saying, “Hey, this is useful to me”, seeing them actually use it in action is the thing that makes me happiest.

Michael Stiefel: And you see the world at least incrementally in a better place.

Lorin Hochstein: Yes, and I help with that. It’s funny, I remember when I was very early in my career, I was like, “Oh, I’m writing this code, it’s never going into production”. And then later on I’m like, “Oh my God, the code I’m writing, it’s going into production”. Every time someone flips and I’m like, “Oh no”. But it does feel good to see people use the stuff that you’ve built.

Michael Stiefel: Well, thank you very much for being on the podcast. I found the discussion very interesting. Hopefully, listeners will find it interesting as well.

Lorin Hochstein: Yes, I enjoyed it too. Thanks so much, Michael.

Mentioned:

  • Weinberg, G. M. (1971). The Psychology of Computer Programming. Van Nostrand Reinhold.
  • Weinberg, G. M. (1975). An Introduction to General Systems Thinking. Wiley.
  • Brooks, F. P. (1986). “No Silver Bullet—Essence and Accidents of Software Engineering”. Information Processing 86, 1069-1076.
  • Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.
  • Conklin, T. (2012). Pre-Accident Investigations: An Introduction to Organizational Safety. Ashgate Publishing.
  • Allspaw, J. (2015). Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages (Master’s thesis). Lund University.
  • Netflix Technology Blog (2011). The Netflix Simian Army.
  • Lamport, L. (2002). Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley.
  • Jackson, D. (2006). Software Abstractions: Logic, Language, and Analysis. MIT Press.
  • Lorin Hochstein’s Blog
