Transcript
David Blank-Edelman: My name is David Blank-Edelman. I am an SRE Academy Program Lead at Microsoft. Since one of the things I do a lot is help people with public speaking, I thought it would be good to start off by giving you a few of my top 10 tips for successful public speaking. The first one is: insult the audience, which is what I may have done with this title. I'm hoping you understand that when I say that, I don't mean you personally, I mean you collectively, and that I'm using these questions as a way of opening different doors for you. I'm hoping to have a conversation with you.
Let’s talk a little about questions first. I think it’s really important in this world that we have a little bit more poetry. I just want to start this talk with a letter that was written by the poet, Rilke, that I like very much. He wrote, “I’d like to beg you, dear Sir, as well as I can, to have patience with everything unresolved in your heart and to try to love the questions themselves as if they were locked rooms or books written in a very foreign language. Don’t search for the answers, which could not be given to you now, because you would not be able to live them. The point is to live everything. Live the questions now. Perhaps then, someday far in the future, you will gradually, without even noticing it, live your way into the answer”. That’s why we’re in the land of questions for this particular talk. Isn’t Rilke awesome?
1. Is My ‘Something’ Working Reliably?
I have seven questions for you. Then you'll be able to ask me questions if you want. The first question I want to start with is this one right here. Let's talk a little bit about reliability. It's in the middle of the talk, so we might as well talk about it now. What I want to do first is open up your understanding of reliability a little bit more as we start to get into these questions. The thing is, when people think about reliability, most of the time they're thinking about availability. Is it up or is it down? In fact, as an SRE or somebody who's working in reliability, you're probably going to spend most of your time there, just straight up. But that's not the only thing to be paying attention to.
The other possibilities, the other expectations that you're trying to meet as part of doing reliability work: you might care about latency, because you know that slow is the new down. Or maybe you're working on pipeline systems, and you have to care about throughput. Or maybe you're running batch systems, and you care about how much coverage you got over the data you have to process. Correctness is something we should measure but very often don't. This one's a little fun: fidelity. Here's the best example. Netflix runs a huge number of microservices. When you go to the Netflix homepage, as many of you probably do, you see that it's a bunch of sections: maybe some recommendations, maybe here's the newest thing, here's what's going on in your area.
You know that if, for example, the recommendation engine goes down, the next time somebody goes to Netflix's homepage, they don't say, sorry, no movies today, this part is down. Instead, they serve you the page in a degraded state. They take away that piece and replace it with something else. That's fidelity: how often am I giving you the full experience? Sometimes that's important to measure. If you're dealing with sports scores or election results, maybe you care about freshness. Then there's durability. If you're running any sort of storage system, and maybe some of you folks have some connection with that, it's important that if you write a bit in, you get a bit out.
The thing I really want to get you started with as we talk about this is that all these things that go into reliability have to be measured from the customer's perspective, not from the component's perspective. When we're asking, is such and such reliable, we have to be thinking from the customer's perspective, not from the component's perspective.
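To make that distinction concrete, here's a minimal sketch, with hypothetical function names and made-up numbers, of the difference between measuring a component-level signal (how many servers are up) and a customer-level signal (how many customer requests actually succeeded):

```python
# Hypothetical sketch: customer-perspective vs. component-perspective
# availability. Numbers and function names are illustrative only.

def customer_availability(requests):
    """requests: list of (status_code, latency_ms) tuples as customers saw them."""
    if not requests:
        return 1.0
    # Count any non-5xx response as "good" from the customer's point of view.
    good = sum(1 for status, _ in requests if status < 500)
    return good / len(requests)

def component_availability(servers_up, servers_total):
    """The naive component-level view: fraction of servers still running."""
    return servers_up / servers_total

# 86 of 100 servers up looks bad at the component level...
print(component_availability(86, 100))   # 0.86

# ...but if customers saw only 10 failures in 1,000 requests,
# the customer-perspective availability is still 99%.
reqs = [(200, 120)] * 990 + [(503, 0)] * 10
print(customer_availability(reqs))       # 0.99
```

The point of the sketch is that the two numbers can legitimately disagree, and the customer-side one is the one reliability work cares about.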
In order to make sure that this works, I'm going to give you all a little bit of a quiz, so be prepared to answer the following question. I'm going to set up a little scenario for you, and you tell me what you think. I don't know what you do for a living, but I'm here to tell you that the thing you should consider switching to is selling tote bags with clouds on them, because cloud is really big. Let's assume you switch to that, and in order to do that, you set up an entire server farm to serve it, and let's say you have 100 servers. In case you're curious: yes, this is 100 things, and doing it in PowerPoint is a big pain. I'm sparing no expense for you. Imagine something happens in the data center. Either there's a power outage in a rack, or maybe somebody has released a new version of software that they shouldn't have, and 14 of these machines, metaphorically or for real, burst into flame.
Here comes the quiz. I'm going to ask you whether the answer to this question is A, B, or C. This situation: is it A, not a big deal? If I'm on the beach, I can ask for another fruity drink and a coconut with a paper umbrella. It'll be fine, and when I feel like it, I'll come off the beach and deal with it. Is it B, I should go to my desk right now and start working on it? That seems like a reasonable thing. Or is it C, an existential crisis, and even if it's 2 a.m., I'm going to scramble everybody, including the people with C in their title? How many people think it's A? How many people think it's B? How many people think it's C?
The answer’s here. The answer is, if you can’t see it, it depends. Because if none of your customers notice, if nobody notices that there was a problem, maybe it’s A, maybe you’re fine. If things are slower, that’s a problem, maybe it’s B. Just maybe, if in fact your main revenue stream is down, I’m here to tell you that it could be C, it could be an existential crisis. That’s what I mean by it depends, and that’s what I mean by this question must be carefully constructed. One of the things that you will see about all these questions is that maybe by just inserting a word or two into the question itself, I’m going to change how you think about that question and change what you’re asking.
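One way to make "it depends" operational is to compare the customer-visible error rate against an SLO's error budget. This is a sketch only: the function name, the 99.9% target, and the 10x threshold for escalating from B to C are arbitrary assumptions, not a prescription:

```python
# Hypothetical sketch of the "it depends" logic: severity is decided by
# customer-visible impact against an SLO, not by how many servers burned.
# The thresholds below are made up for illustration.

def severity(slo_target, observed_success_rate):
    """Map an observed customer success rate to the talk's A/B/C answers."""
    error_budget = 1.0 - slo_target               # e.g. 0.001 for a 99.9% SLO
    observed_errors = 1.0 - observed_success_rate
    if observed_errors <= error_budget:
        return "A"   # nobody noticed; enjoy the beach
    if observed_errors <= 10 * error_budget:
        return "B"   # noticeable; go to your desk
    return "C"       # existential crisis; scramble everybody

print(severity(0.999, 0.9995))  # A
print(severity(0.999, 0.995))   # B
print(severity(0.999, 0.90))    # C
```

The same 14-servers-on-fire incident lands in any of the three buckets depending entirely on what customers experienced.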
2. How Do I Get Rid of All My Failures or Errors?
How do I get rid of all of my failures or errors in whatever it is I run? Let's talk about that. In order to get there, I need to first take you into the depths of the SRE mindset. How many people here have SRE in their title or reliability in their title? Doesn't count for the person who's got reliability in their company name. The SRE mindset, I think, can be encapsulated in two questions. How does a system work, and how does a system fail? That's what we're keenly interested in at all times. To be very clear about it, the second question lives in service of the first. When we look at failures, we're doing it because we want to really understand how a system works. Because quite frankly, the SRE mindset starts with curiosity. That's what we're all about.
If I'm going to make you switch from being a SWE to an SRE in this talk, welcome aboard. I want you to understand that it's all about curiosity. If I go back to these questions, I want to say that these questions are a little more sophisticated than you might think, because I'm asking how a system works in real life, in production. I'm not asking how it works on your whiteboard. That's awesome, but that's not the same thing. I'm asking questions like, how does a system work when I scale it? I'm asking questions about how it works for the customers. That's another important thing. What happens if I have less operational load? We'll talk about toil later. How does the system work for more people? That might be a question I ask. Could I do something to increase the accessibility of that service? How could I make it go faster? How could I make it better?
The other question is, when does a system stop working? Every system that you build is going to come to the end of its useful runway. You want to see that before you get there, just like you want the plane to be in the air before the end of the runway. Thinking about how systems work and how they fail is really important as part of this. The other thing I would say helps define the SRE mindset, and might be useful to you, is that we have very specific relationships to people and to errors and failure, which is the actual question we're asking. To people, it's pretty obvious. SRE and reliability work is critically collaborative. It is not possible to do any of this work by yourself. If you're not talking to the people that own the system, the people that write the system, then you're not doing SRE work, because you're just making stuff up for yourself.
The other thing that I think is actually a little more fun to talk about is to come back to the question from a moment ago: how do I get rid of all my failures and errors? The relationship to errors and failure is really super interesting. I first really started to grasp that idea when I was hanging out at a conference, and I was sitting with a friend, and somebody came along who we had never met, he and I, and they sat down at our table, and we started talking, as one does at these conferences. He was saying, I just finished building out a whole new CI/CD system for my team. We're feeling really good about it. We've done a lot of good work to eliminate all the errors so nothing bad gets through to production. We're like, that's interesting. We talked for a little while. He went off on his way, and then my friend who was sitting there, an ex-Googler, said, "If I was building that, I probably would let some of the errors go through, because I'd want to know what was going on".
That was the thing I understood was really important for SREs and for reliability work, and it took me quite some time to figure it out. When I went back to John, that's his name, and said, "Do you remember this conversation?" He's like, "What are you talking about? I have no memory of this at all". It was really striking to me. We don't view failures or errors as an adversary that we must completely stomp out. We look at them as signal. In some ways, it goes back to those questions we talked about a moment ago: how does a system work? How does it fail? That's why we're paying attention to these things, because they're telling us something about the system we probably didn't know before, thanks to its failures.
3. What is the Root Cause of ‘Some Outage’?
We're going to talk a little about this term, root cause, and this question: what is the root cause of some outage? This is the gateway to the part about learning from failure. I'm going to take apart that term root cause, split it into pieces, and it will lie bleeding at my feet by the time I'm done. Let me give you a scenario to remember. I'm going to have to read this scenario because it's got a lot of detail, and we're going to go through it as we go. Imagine that Pat is walking into the data center and trips over a power cable. Oscar left the cable on the floor while racking and testing a new server. The cable unplugs and the server loses power. When the server loses power, the database server instance provisioned to run on that server by the automation, or orchestration, or whatever it is that Susan set up, goes down. This database server held an important shard of the data that the application they run depends on. The sharding was configured by Yasmin, all that sharding stuff.
The backend for the application, written by Neeraj, slowly starts to lock up as threads busy-wait trying to reach the data they need to continue processing. The response time of Sarah's frontend for the application begins to get slower as the connections to the backend hang and then time out. The monitoring system set up by Liz notices this is a problem and sends out an alert to all the people on the team. That alert, for some reason, is delayed in reaching everybody. Finally, the load balancer configured by Sam just starts handing out error 500s like candy. A customer who is trying to purchase a widget on the website gives up in frustration and purchases it from a competitor. My question for you is, who is responsible? Who's responsible for that loss of a sale?
Participant: If the servers were running on wireless electricity, this wouldn’t have happened.
David Blank-Edelman: You’re saying it’s really a law of physics thing. That’s really interesting.
Who’s responsible for the loss of the sale?
Participant: All of them.
David Blank-Edelman: All of them, every one of them, fire them?
Participant: No, don’t fire them, but all of them are responsible.
David Blank-Edelman: They are?
Participant: Yes.
David Blank-Edelman: Really? Anybody disagree with that, that it's all of them who are responsible for this problem? It's all of them, and it's no one's fault. That's hard to reconcile. The thing to understand here is I'm taking this apart by showing you that in these situations, it's safe to say the system is responsible. It's the collection of all the people, and it's the collection of none of the people. It's a collection of what they chose to do and what they didn't choose to do. Just understand that when we're talking about failures these days, we're really talking about things that aren't as simple as a single problem or a single root cause. That's what I want to ask you: what's the deal with root cause, then? Let me continue to take this apart. I was thrilled that this particular reference was mentioned in the keynote. This is a lovely paper you should go read.
If you haven't read Dr. Richard Cook's paper, "How Complex Systems Fail", go read it now. It's four pages, tops, and it's really worthwhile. What you'll start to understand is that in this world of complex systems, talking about root causes doesn't really work anymore. There isn't a single thing. It's not like you're looking for some causal chain like Colonel Mustard killed Clarence in the kitchen with a candlestick. We no longer have that causal chain. Looking for it is not going to get you where you want to go.
What do you say instead? These days, people often identify maybe a trigger, maybe not. Maybe you get lucky you can find a trigger. At the very least, you would probably talk about contributing factors. I think it’s really important when you start to talk about this, because when you start to do this thing that we all know we should be doing when something fails, is to have some post-incident review where we get in a room and we try to learn from it, and we try to document it, that we’re going to want to talk in this language.
I'm the child of a sociologist, and that meant that when I was 4 or 5 years old, driving with my father in the car, he was talking about how language constructs reality, which was pretty heavy for me at the time, but it stuck with me. I believe that. If root cause language drives people towards a single causal chain, then talking about contributing factors makes it far easier to look at the whole picture. Do people still say root cause analysis? Do people from Microsoft still say root cause analysis? Yes, we do. Do I spend all my time trying to stamp that out? Yes, I do. Do I succeed? No, I don't.
As a bonus, since I said there were only seven questions, I’m going to give you an extra five questions. How many people here have heard the five whys? I’m so sorry, because I am here to tell you that the five whys are terrible. Here’s why, and I’m going to show you just how terrible they are. The system was down.
Participants: Why?
David Blank-Edelman: Because the server stopped running.
Participants: Why?
David Blank-Edelman: Because the disk filled up.
Participants: Why?
David Blank-Edelman: Because a log file got too big.
Participants: Why?
David Blank-Edelman: Because the log file trimmer stopped running.
Participants: Why?
David Blank-Edelman: Because it got left out of a config file. I go, the root cause is that the log trimmer got left out of a config file. Let’s make sure we have the log trimmer in all of our config files. Now we’re good. What just happened in that process when we were going through that? When you experienced that, what are we left with? The conclusion I came to is true. You can’t disagree with that, probably. You could, but you probably won’t. What just happened in that process when we were just taking the five whys?
Participant: You could have mitigated that at many different levels, but you only addressed the very last step.
David Blank-Edelman: Yes. You've got it exactly right. The problem with the five whys is that you shed all this really super interesting, super useful, and perhaps really relevant information. No, I'm not saying you shouldn't look at a situation and try to go deeper into it. Maybe you ask why, though I think you should be asking another question instead, which we'll get to in a moment. Just understand the five whys are terrible because they do that shedding of information. There are lots of questions like: was somebody on vacation? Does somebody know how to set this up? Was the doc wrong? None of that came into the conversation. Wouldn't you like to know if your documentation is wrong? You've shed all that as part of that 3-year-old's attempt to get an answer to why the sky is blue. I think it's really important to understand that those whys are terrible questions, and they're not the questions we're going to talk about here.
How many people get a chance to do post-incident reviews? Or maybe you call it a post-mortem, or a retrospective, as long as you don't mind fighting with the Agile people. This is why I don't use retrospectives, because the Agile people have sharp sticks and they're going to use them. I'm going to give you four traps that you will fall into when you're doing post-incident reviews, I guarantee you, and now you'll notice them, which might help you, because if you spot them, you can then take care of them.
The first one, perhaps my favorite, and usually people start moaning at this point, is attributing something to human error. People often say, yes, that was a human error. They did that, sorry. The thing I want you to understand is that you could say that, but that’s not super useful because, why did the human make that mistake? What was wrong with the process? What was wrong with the context they had? What documentation did they have? What did the tool not do? When you say human error, you stop the investigation right at the point where it gets interesting, right where you can learn from it. If you stop that investigation, what do you get? A bunch of check marks on a spreadsheet that say human error, human error, human error. You’re never going to learn anything. That’s one of the things that you will see happen. This also leads to blame and other stuff like that. It’s just not the best of ideas. Let’s talk about this one, counterfactual reasoning.
Counterfactual reasoning is when you try to tell the story of something that didn't happen to try to explain something that did. The operator should have done this. They ought to have done that. You say that person should have failed over to the other region instead, and they didn't. The weird thing about that is instead of investigating what went on in the moment, you're making up this cute little story that didn't happen in the hopes that you can maybe get some information about what really did happen. That's why counterfactual reasoning is a problem. Then there's normative language. Normative language uses words like ought to, should have, could have, would have, failed to.
Again, the problem with this, when you say that the person ought to have done that, is that you're assuming exactly the thing that the person making the decision didn't have at the time, which is knowledge of what the result of their decision would be. You're judging them by something they didn't have. That's not super useful. It's far more useful to try to figure out what they did know than to blame them with, if they hadn't done that, it wouldn't have been a problem.
One more, mechanistic reasoning, which happens all the time. How many of you run production systems? I’ll tell you what I usually do with this. Mechanistic reasoning is what a former colleague of mine called, I would have got away with it if it wasn’t for those meddling kids. What that means is the system would have worked perfectly if it weren’t for those darn humans screwing it up by changing things. The problem with that, in addition to the Scooby-Doo reference, the problem with that is that makes this assumption that the machines would just keep on going if there weren’t humans involved. Everybody knows that often the thing that gets things back to running is the human in the system. It’s denying the human’s role and it’s blaming the human in the system, instead of saying, I see, the humans are the things that are providing some of the adaptive capacity here. It’s really useful to understand that when we say the computers are perfect and the humans are terrible that that’s not the story. It’s not the real story how this goes.
Just to mess with you a little bit more, root causes. The day before your system went down and you had that outage, what’s the root cause of it succeeding? What’s the root cause of its success? The day before, it was running fine, we can all agree that’s probably true, or the 10 minutes before, I don’t know how often your things go down. Ten minutes before it went out, what was the root cause of it running? Anybody? It was plugged in. That doesn’t hurt, I agree. I just want you to understand here, and in fact, thinking a little bit about that idea about what was it that made it so that the system would keep running? What is it that makes it so that the problem wasn’t worse?
All those things, what are the things that you could do to support it so it doesn’t get to that outage? Instead of spending your time thinking about that half an hour it was down, what about speaking about all the time outside that? This is the other reason why root cause falls down, because there’s not necessarily a root cause of success. There might be contributing factors to your success, and I totally agree you should look at those and you should really strengthen those. That’s part of some of the work that’s been done, especially in the resilience engineering and the safety areas. Nancy Leveson is another name that comes up a lot. She did some really great stuff about Safety II and Safety III. She’s trying to say, what if we paid attention not to just the outage?
4. What Role Should SRE Take in My Org?
What role should SRE, or you may substitute reliability work, take in your org? What role should it be? I'm going to give you a model. It's not meant to be a maturity model, if you've encountered things like that. It's just meant to be some possible ways you can think about the people doing reliability in your place. The first one that almost invariably people start with is firefighting. You spend all your time fighting the fire to get the thing up, to get it fixed, on a repeated basis, and that's a really important role. I'm not denigrating any of these roles. Super important. This is what you're going to do, for sure, at first. Then, you can start to move on.
The tricky thing is, often what people do is move on to gatekeeping. Because what happens is you're like, I just spent the last year and a half of my time trying to put out fires, getting things to work again. I spent all this time, I spent time in a data center during Father's Day, and now I just want to prevent the Visigoths at the gate from ruining this perfect production system that I've spent all this time on. They become gatekeepers. I'm here to tell you that's a very bad place to be in your org, because nobody really loves gatekeepers. You know, because you've met them. You've been them, probably. Some people think it would be far better for you to skip over this and figure out what you can do to help people get to where they want to go. Just like when I step off a plane, I'm not super thrilled to see the customs person whose job it is to make sure that everything passing in is cool. I accept it, and I'm fine. It's ok.
The next thing you might find yourself doing is being an advocate, in which you’re being part of more of the stages of the life cycle. You’re an advocate for reliability. You’re trying to help empower people from that. You’re going around being more part of the process in an attempt to get reliability to be more part of it. We seem to have no problem saying, let’s pay attention to this from a security perspective. It’s not always that easy to get people to be part of it from a reliability perspective. In some ways, this then leads to the idea of being a partner. Where now you’re, maybe, if you get lucky, part of the roadmap planning. If people can agree that reliability is as much of an important feature as anything else, then maybe you’re part of the roadmap.
Then, finally, and this is the one I would like you to feel free to disagree with me on, but probably later: engineer is the way. Ben Ferguson talks about it. Somehow we all merge, and everybody does reliability. Everybody is doing this. It's a little tricky. It doesn't mean that you stop doing reliability work. You may still be focused on that, but there's some notion that now we've mind-melded with the rest of the org in a proper way. Does this often happen? Not in my experience. When it does happen, it's not all that stable, because it's very easy to get knocked back to another place.
The thing I just want to say about this is, this is not a set of stages. Just like Kübler-Ross's stages of grief: she spent a lot of her time towards the end of her career trying to remind people that her stages of grief were not a finite state machine where you went from one to the next to the next. Maybe All That Jazz, the movie, didn't help any, but you know what I'm saying? There's some notion that you move from thing to thing. I just want to mess with you by moving these around a little bit to be clear that the order doesn't matter, and you may be going back and forth. It may be that you wake up one day and look in the mirror, and you're like, I'm the gatekeeper now, even though I've been somewhere else. Or you might wake up and say, there are fires to be had. There are always fires to be had. That's that.
5. How Can I Sell SRE Internally?
How do I sell SRE or reliability work internally? Because I get this all the time. It's like, ok, I want to build this group. I'm part of this group. I have to convince other people to do things. How do we do that? I could just stop and say don't, but you should sell it, so instead I'm going to give you a lot of don'ts. I want to take it from the negative here. I want to do that thing where you take a block of marble and you just chip away all the things that aren't David, and congratulations, you get the statue. It's not exactly that easy. The first thing I want to say is when people try to sell reliability work, they often do it like an insurance salesman. Like, nice system you have here. It would be an awful shame if something happened to it.
In fact, how much do you think it would cost you if you had a downtime? That's not an uncommon way to sell this, because people think, I'll just say I'll save you the cost of outages. The problem with that is you're trying to quantify something that hasn't happened yet, and to sell something based on something you don't have any real data on. Maybe you've had outages, but it's not all that great as a way of doing this. I also want to say, when you start to talk about this work to an engineering leader, they will start to nod and be like, yes, I want that. Yes, that's great. It gets really tempting to oversell. Because they want to hear: you mean I just have to put a dime in the machine once every six months and I get reliability out, and I can do everything I want, and there'll never be a problem, and everything will be great? Because we're people pleasers.
At least I'm a people pleaser. I'd be tempted to say, yes, you'll get that. You don't want to do that. You shouldn't let your mouth write a check that your tail can't cash, to quote some of the old bluesmen. It's really important to understand that people will want that fantasy, and don't be trying to sell it to them. It's also not all that seldom to end up in this situation where, congratulations, reliability is now a tax. It's ok, we're just a cost center, where we're going to charge you 20% of your time, and hopefully things will be better for you. Just give me 20% of your budget. It's a bad tax to have. It shouldn't be considered that way, like an onerous thing that you have to pay. Security gets treated as a tax, compliance gets treated as a tax, that sort of thing.
I want to suggest, and this is really important from experience, that you cannot claim, and I'm not saying you would, that this effort will be painless or invisible. When somebody goes in to do an SRE engagement, they need to understand that, congratulations, I might make you do something different than what you're doing now. Maybe you don't have access to something now, or maybe we're going to have a different set of procedures that we know works because we spent a lot of time on this, but it's not going to be painless. You might have to build something into your app so that it's actually emitting data. What a crazy idea. It's not going to be painless, and you can't really suggest that it is.
The other thing that we do a lot, which you're going to see in some of the topics, is we try to talk to everybody in SLI, SLO language. You walk up to anybody you're talking to, whether they're business people or whoever, and you're talking SLIs and SLOs, and they're just going to go, cool acronym. If you're talking to business people, they're going to want to talk to you about customer acquisition, or customer retention, or maybe they're going to want to talk about sales or whatever. You want to speak their language when you're trying to sell and talk about reliability.
6. How Do We Automate Away Toil?
How do we automate away toil? I want to start with the SRE definition of toil, because when we talk about toil, we're not talking about the colloquial sense, which means things I have to do that I don't like doing, like taking the garbage out. We're talking about a specific definition that I'm pulling out of the SRE books, from Vivek Rau, who has two chapters in the SRE workbook. I thought, wouldn't it be cool if I made this into a bingo card and forced everybody to play bingo? It turns out not to look so great, but there it is. Just in case you're concerned, I can do a bingo card if I really want to, but I didn't. Let's talk about what it means for something to be toil. You probably know these things. Anything that's manual. Anything that's probably repetitive.
It doesn't have to be all these things, but the more it's like these things, the more likely it's toil. Is it automatable? Is it tactical? The one that I like the best is: no enduring value. Those things that we do over and over again, and the system doesn't get better, the reliability doesn't get better, we just have to do them. The more we have of those, the worse off we are. Then, Vivek Rau uses computer science language. I will translate. He says, O(n) with service growth. What he means is that the toil grows linearly as the service grows, when what you really want is work that scales sublinearly, which I realize also sounds like CS talk. Let me say it a different way. If, in order for me to run the system, I have to spend an hour to create 10 accounts, then when I get 100 people, it's 10x.
Then when I have 1,000 people, it's 10x that again. What we're really after is being able to grow the service without having to grow the human effort like that. This is the backdrop to what I want to talk about. That is the bingo card that I don't want to talk about. Here's what I want to do to get you to think about toil in a slightly more sophisticated way. The way I'm doing that is by using cool bolding to ask: do we actually automate away toil? I want you to think a little bit about this. Here's my question for you. I have a problem with my system.
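The 10x arithmetic above can be sketched in a few lines of Python. This is a toy model, not from the talk: the per-account minutes follow the "one hour for 10 accounts" example, but the automation's fixed cost is an invented number for illustration.

```python
# Toy model: toil that is O(n) with service growth vs. toil held
# sublinear by automation. All numbers are invented for illustration.

MINUTES_PER_MANUAL_ACCOUNT = 6  # 1 hour for 10 accounts, per the example


def manual_toil_hours(accounts: int) -> float:
    """O(n) toil: every account costs the same human effort."""
    return accounts * MINUTES_PER_MANUAL_ACCOUNT / 60


def automated_toil_hours(accounts: int) -> float:
    """Sublinear toil: a fixed cost to kick off and verify a batch
    provisioning job, no matter how many accounts it creates."""
    return 0.25  # hypothetical 15 minutes of human attention


for n in (10, 100, 1000):
    print(f"{n:>5} accounts: manual {manual_toil_hours(n):6.1f}h, "
          f"automated {automated_toil_hours(n):.2f}h")
```

The manual column grows 10x at every step while the automated column stays flat; that flat line is what "sublinear with service growth" buys you.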
Once every couple of days, it decides to lock up. I have a person, that person reboots the server when that happens. Everybody think that’s toil? Does that sound great? I get something that reboots the server for me automatically. Have I eliminated the toil? How many of you think I have not eliminated the toil? Why do you think that? Because I haven’t eliminated the root cause. See, now, this is the problem.
Participant: In our organization, we have a similar issue. I think that we don't spend enough time understanding the problem and resolving it. Rather, we put a Band-Aid on it.
David Blank-Edelman: You might have eliminated the toil in this situation. Do you feel good about it? Does anybody feel good about the fact that there's something automatically rebooting your server? What if you have a memory leak, is everybody cool with that? That's fine. You don't feel good about it. I just want to say that toil elimination isn't super easy. Are outages toil? How many people here think outages are toil? By the definitions that we use; I gave you definitions, we all have definitions. How many people here think outages are not toil? It's really interesting that you think that. Why do you think they're not toil?
Participant: They might not be repetitive. They can come from unforeseen conditions.
David Blank-Edelman: The question is, are all outages toil? I don't know. Are novel outages, where I've learned something new, toil? I don't know. I'd be really hard pressed to call that toil. Do I want a lot of novel outages? If I have to choose whether they're novel or not, then yes, novel. If I want fewer outages, not necessarily. It's just really interesting to think about this. There's also a difference, when it comes to toil, between a new service that you're just standing up and an established service, because new services are invariably going to be noisier. You haven't tuned the monitoring. You haven't done anything with that yet.
Just understand that toil and its relationship to services sometimes has to do with when you're measuring it. I would expect a new thing to cause a lot more intensive work. I just want to say, and this is stuff we were talking about, there is a hit on complexity. If you're going to put automation in place, maybe you're going to be trading your toil for complexity. Congratulations, now I have a really complex automated thing that I have to keep up and make sure is doing the right thing. I have this theory, which I argue with a friend about, that there's a conservation of toil: you never destroy toil, you just either move it around or turn it into complexity. I don't know if it's true. I think you want to understand that there's a complexity add-on here.
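As a concrete illustration of that trade, here is a minimal sketch of the decision at the heart of an "automatically reboot the server" watchdog. Everything here is hypothetical: the budget, window, and decision logic are invented for illustration, not something the talk prescribes. Notice that even this toy version needs rate limiting so a crash loop escalates to a human instead of quietly masking an outage, and the underlying memory leak is still there.

```python
# Hypothetical sketch of the auto-reboot band-aid. The reboot toil is
# gone, but we now own this logic: the toil became complexity.


def should_restart(healthy, recent_restarts, now,
                   budget=3, window_seconds=3600):
    """One watchdog decision.

    Restart only if the service is unhealthy AND we haven't exhausted
    the restart budget for the window -- a crash loop should page a
    human, not silently hide an outage.

    Returns (restart?, restart timestamps still inside the window).
    """
    recent = [t for t in recent_restarts if now - t < window_seconds]
    return (not healthy) and len(recent) < budget, recent


# A healthy service is never restarted.
decision, _ = should_restart(healthy=True, recent_restarts=[], now=0)

# A locked-up service is restarted until the budget runs out, at which
# point a real watchdog would escalate to a person.
history = []
for t in (0, 600, 1200, 1800):
    decision, history = should_restart(False, history, now=t)
    if decision:
        history.append(t)
```

In real life this decision sits inside a loop with health probes, logging, and alerting, each of which is more code we have to keep correct: the conservation-of-toil point in miniature.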
Then, finally, just to stir things up a little: whose toil are we talking about? That's the thing we never ask. My toil? Your toil? Our customers' toil? Pay attention to that. That's important to realize when we're talking about this. Somehow, we just say, it's toil, and that means it's our toil.
I also want to say that I dig Ironies of Automation by Lisanne Bainbridge. If you haven't read it, I've got the link here so you can get to it fast. This is the paper that was talked about in the keynote. It's another really good piece, which talks about some of the issues around automation. How many people here have seen the movie Fantasia, made by Disney? There's a scene in it, the Sorcerer's Apprentice scene, and I don't want to give anything away because not everybody's seen it. The Sorcerer's Apprentice scene, to my mind, is also a really good metaphor for what happens when you automate things.
7. Is My ‘Something’ Resilient?
Is my something resilient? When you ask, is my something resilient, I think in most cases, in fact almost all of them, you probably mean fault tolerant. Or maybe you mean redundant. Or maybe you mean preventively designed, which I don't know if that's a thing. Maybe you mean highly available, or self-healing. You didn't really mean resilient, and I'll tell you why. When people say, let's make our apps resilient, no, most of the time that's not what they're talking about. The thing to understand is that resilient, especially in the resilience engineering community, means a lot more than just these things. I have lost the battle with my employer, and you will probably lose the battle with your employer, to stop calling it resilient when we really just mean fault tolerant. Like, cool, we have resilient cloud architecture. No, it's fault tolerant.
The reason why is that resilience can mean a number of things, and there's a lovely paper, which I'll point you at in a moment, that lays them out. The first meaning, the one people are subbing in when they say fault tolerance, is rebound: the system rebounds from disruptive or traumatic events and returns to previous or normal activities. That's the first level, but that's not really where we want to go here. This is from a lovely summarization of a paper by David Woods, and I'm going to point you at the paper itself. Robustness means being able to manage increasing complexity, stresses, and challenges. The software is able to handle increasing problems, that's what robust means. It doesn't mean that I have another one of them and I turn the other one on; it means I can handle that situation.
Graceful extensibility means the system extends its performance or brings extra adaptive capacity to bear when surprise events challenge its boundaries. It knows what to do in those situations. Sustained adaptability is the ability to adapt to future surprises as conditions continue to evolve. I'm here to tell you that very few times are we talking about adaptive capacity, but that to me is what I think of when I think of resilience. That ability to handle surprises, that ability to handle situations you didn't anticipate. Restarting the server, that is not resilient. Maybe it's fault tolerant, maybe it's not. There's a lovely paper called Resilience is a Verb, by Dr. David Woods, that I highly recommend; there it is. If you haven't read it, go check it out. Those are the questions I asked, just to walk you through a number of things.
Customer-Driven Metrics, and Proxy Metrics
Participant: Regarding customer-driven metrics, when you’re talking about your first point, do you caution against proxy metrics?
David Blank-Edelman: Do I caution against proxy metrics? I think if you get a choice, it's far better to measure the thing you want to measure than the thing that's trying to approximate the thing you're trying to measure, is the way I would put it. I think the closer you can get to the customer, the better. You can't always do that. Sometimes you have to fake it. You know it's a proxy. Just go into it knowing that you're using a proxy. I think that's fine.
What is System Adaptability?
Participant: When you say adaptability, you mean the system knows? For example, if I have Black Friday or something like that, when you have a surge of inputs into your system, what you mean by adaptability is that you don't need to do anything to the system and it can react to these kinds of events? Something like that?
David Blank-Edelman: It can react to surprises. Let me give you the example that John Allspaw gives. My job is to drive something around in my car, maybe I’m an Uber driver or something like that, or maybe I’m a delivery person. If I have a spare tire, that’s not resilience, that’s fault tolerance. That’s high availability. Tire blows out, I’m cool. If I know how to call another driver, or if I understand how transportation works, or I understand how the bus system works, that’s resilience. It’s more than just having a replacement there. It’s understanding how to cope with unforeseen situations at a higher level.
