Transcript
Newman: I’m going to be talking about something very basic, which is almost like the fundamental ideas that I think every developer should know if they’re unfortunate enough to have to work on a distributed system, which is probably most of you. The reason I’ve written this talk is because I’m working on a book, which is designed for the situation where you’ve been dropped onto a project where somebody ill-advisedly made the choice to use microservices, which is always a terrible idea. What are you now going to do to survive? The book’s out in early access. I’ll talk later about how you can read the book.
I wanted to start this book with talking about some of the foundational principles that I think every developer should be able to get their heads around. Because you don’t need to do a comparative analysis of Paxos versus Raft versus SWIM, or be able to explain the nuances of CAP Theorem, or Harvest and Yield and why they’re ever so slightly different. Because really what it comes down to is this: timeouts, retries, and idempotency. Timeouts, giving up. Retries, trying again. Idempotency, making it all a bit safer.
Along the way during this talk, we’re going to be exploring these topics and we’ll also be maybe poking a little bit at this quote. It’s a very famous quote, “The definition of insanity is doing the same thing over and over again and expecting a different result”. Beyond the fact this is absolutely not a definition of insanity that anybody in the mental health community would sign up to, I also think it’s a flawed statement in many ways, which we’ll poke through as we go along.
Context
We should first set some context. I mentioned that these are ideas that are useful in the context of a distributed system. I’ve become quite interested in meaning and the meaning of words and phrases over the last few years. I think it’s always appropriate, when we’re going to use these ideas as a foundation for our time together, to define some of them a little bit. When I say a distributed system, there are two definitions of distributed system I’m thinking of.
The first is a very dry definition, that a distributed system is one which consists of two or more computers communicating with each other via networks. This is a very dry concept. One computer, network, two computers. Probably your system is a little bit more complicated than this, but this is technically a distributed system. A lot of the challenges associated with distributed systems really come down to these things, networks being evil. I think there’s a much more evocative and maybe more accurate definition of a distributed system that I think speaks more to the challenges we face when we use them, and that’s this one from Leslie Lamport. A distributed system is one in which the failure of a computer you didn’t even know existed renders your own computer unusable. We like to think our distributed systems are perfect edifices and that we are building ever greater amazing things, probably with some more GenAI than we should be using at this point in time.
All the while, something is conspiring to bring the whole thing crashing down around our ears. This is why there are complicated concepts in the field of distributed systems. It’s fun as a techie sometimes to revel in all of this complexity. I’d also like to try and strip things back a little bit and look at, fundamentally, what it is that makes distributed systems tricky. At this point, I’ve boiled it down to three rules. It used to be two rules. It’s now three. I think we can all remember three things together.
The Three Golden Rules of Distributed Computing
What are the three golden rules of distributed systems, according to Sam Newman? The first one is you can’t beam information instantaneously between two points. Yes, I am aware there’s a thing called quantum entanglement. However, as I understand it, A, quantum entanglement only works for read states. B, and very importantly, it’s not a networking protocol. We have to accept that the transmission of information takes time. This is only partly under your control. You might pretend otherwise, but it is only partly under your control.
Secondly, sometimes you are not able to reach the thing that you want to talk to. Again, you can do everything you can to try and increase the chance that the thing you’re talking to is actually there. You can do things like have redundancy, for example, and we’ll touch on that a little bit later on. Even so, it might not be there, and you can’t control for that all of the time. The third challenge, or golden rule, of distributed systems is that resources are not infinite. Resource pools can and do run out. Taken together, I think this trio of golden rules underpins almost all the complexity that we have to deal with in the world of distributed systems. We just put different abstractions on top of it. We’re going to be looking at these foundational ideas of timeouts, retries, and idempotency, and coming back to these three rules to understand how they influence how we think about our systems.
1. Timeouts
We should start at the beginning with timeouts, because they are foundational. They are quite important concepts. Probably most of you have had to work with timeouts in some way, shape, or form. The timeout we’re talking about here is a threshold after which we’re going to terminate a request if it hasn’t already completed. Typically, we’d be talking about a timeout on the client side. We’re asking a server to do something. It hasn’t happened quickly enough, and so we give up. That’s it: we’re giving up after a certain period of time. The quitter’s protocol. This is fundamentally because we know things take time, and we know that things aren’t necessarily entirely under our control. We also have timeouts because sometimes the thing we want to talk to might not be there.
The connection may be established, but the server at the other end is just not responding, and that’s why we have timeouts: fundamentally, because of these two issues. In this situation here, I want to make a payment. I go off to the payment service: Hi, can I make a payment? Time passes. We hear nothing, and we go, I’m giving up. Now, of course, it’s worth at this point examining why we give up. Why not wait forever? There are two key reasons we don’t wait forever.
The first is, when you are waiting, resources are being allocated as part of that waiting process. Yes, even if you are doing non-blocking I/O and pretending you’re doing asynchronous programming, you are still holding on to resources. Connections are open. Threads are blocked and waiting. CPU is being used to slice between various things. If you are waiting a lot, a lot of the computing resources of your system are busy waiting, and those computing resources then cannot be used for other things. This becomes one of the enemies of a system, because if you completely saturate your computing resources and you run out, systems tend to get slow and start falling over.
The other issue, of course, is if we wait forever, other people might be waiting for us, like the human being at the laptop, because they are not going to wait forever, so why should our computers? Really, when we look at timeouts, yes, it’s about things taking time, but it’s also, to a large degree, about controlling for resource consumption. By timing out on things, we are freeing up computing resources.
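To make that concrete, here is a minimal sketch (not from the talk) of a client-side timeout using Java’s built-in HttpClient; the payment-service URL and the 200 ms and 500 ms values are made up purely for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PaymentClient {
    // Hypothetical endpoint, purely for illustration.
    private static final URI PAYMENTS = URI.create("https://payments.internal/pay");

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(200)) // give up if we can't even establish a connection
            .build();

    String makePayment(String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(PAYMENTS)
                .timeout(Duration.ofMillis(500))    // give up if the whole exchange takes too long
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // If the payment service doesn't answer in time, send() throws HttpTimeoutException:
        // the thread and connection are freed up, and the caller decides what to do next.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```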
Getting timeouts right can be difficult. I suspect a lot of you are probably using a library like Polly or resilience4j. You might not even realize you’re using it. In fact, a lot of Polly is now built into the latest version of .NET, which is great to see, but often people are just sticking with the defaults. Sometimes that’s ok, but a lot of the time it isn’t. The challenge with a timeout is that if you time out too quickly, you might give up on something that would otherwise have worked, and that’s wasteful and annoying. Can you give me that thing? Then just before Leslie does, I say, no, forget it. Leslie’s like, but I had the thing for you, and I’ve already gone. That’s annoying because we’ve consumed the resources and taken the time to almost succeed. This gets compounded if you time out too quickly and then retry, because then you’re doing it all over again, potentially.
Timing out too quickly, which you might want to do because you’re worried about resource contention, can lead to some not-great behaviors and can actually make things worse, especially where you then have a further backoff and a retry. We’ll get into retries later. Of course, if you wait too long to time out, you’re going to increase your resource contention, which in turn can decrease your system stability because you are reducing the resource pools available to you, in addition to the system just feeling like it’s slow. In general, you want things to be fast rather than slow because, A, people like that, and B, it tends to mean you’re using fewer resources for a shorter period of time.
What can we do about this? How do we get our timeouts right? The first thing you need to do is to have some ability to understand what normal performance is for your system. What I mean by that is to have a baseline understanding of how long things normally take when they work. What you’re going to want to be able to do is put your response times for a service on a diagram a bit like this. This is a histogram. For those people who haven’t seen a histogram before, this is basically a count of occurrences.
As we go along the x-axis here, we’re looking at the time taken for a given successful request, and the y-axis is telling us how many of those requests fell into that bucket. That’s what a histogram is: it’s about counting occurrences. This allows you to get a sense of the shape of how long things take, which can be really important. Averages are worthless for this type of stuff. This already gives us really interesting information. We’re thinking, I’ve got some outliers out here, and there’ll always be outliers when you have a sufficient number of calls. This is a slightly contrived example.
Quite often, these successful outliers might be far more skewed out towards the right. They worked, but they did take quite a long time compared to the rest. There can be lots of reasons why you get outliers, and the conversation about tail latencies is for another time. You might look at this and say, what I could do then is I could time out at that 260 millis, because based on all the information we’ve gathered so far, all of our successful calls complete within 260 milliseconds. Maybe that’s where I’m going to put my timeout threshold.
On the other hand, you realize you’re waiting a very long period of time just in case you have some outliers, and so maybe you compromise that a bit and say, actually, we’re going to pull in our timeout threshold to here so that we free up those computing resources a bit quicker, and we accept we might time out some of those calls. This is also partly why people often fixate on the 99th percentile latencies rather than the 100th percentile latencies, because you’re trying to ignore the outliers. I think that’s fine when you’re tracking your SLOs; I’m not always sure it’s the right thing to do with timeouts, because one of the issues is you probably don’t want to ignore tail latency for too long. We’ll come back to that a little bit later on. This helps you. Already you’re making more informed decisions about where your timeout thresholds should be. If all of your calls normally complete within 250 milliseconds and your timeout is 1 second, just cut it in half. Your system will thank you.
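As a rough illustration of reading a threshold off that baseline (my sketch, not anything shown on the slides), you could sort your successful response times and compare a high percentile against the worst case.

```java
import java.util.Arrays;

public class TimeoutFromBaseline {
    /** Returns the latency (in millis) at the given percentile, e.g. 0.99 for p99. */
    static long percentile(long[] successLatenciesMillis, double p) {
        long[] sorted = successLatenciesMillis.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        // A made-up baseline of successful response times, in milliseconds.
        long[] observed = {120, 130, 135, 140, 150, 155, 160, 180, 210, 260};
        System.out.println("Worst successful call: " + percentile(observed, 1.00) + " ms"); // 260
        System.out.println("90th percentile:       " + percentile(observed, 0.90) + " ms"); // 210
        // With real traffic you'd look at something like p99; a timeout between that and the
        // worst case, plus some headroom, is a more informed starting point than an arbitrary
        // default such as 1 second.
    }
}
```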
The second thing to do, though, is to overlay this with your understanding of what your users expect should happen. How long are your users going to wait? I’m sure some of you might get the reference here. Because there’s no point waiting 30 seconds for a service to complete if the average user in your system is going to start mashing that F5 key 5 seconds in. Even if your system could tolerate a longer threshold, if your users can’t, use your user expectations as the guide.
The reason this is important is because if I’m a server and I’m waiting for another server to do something, which is a common occurrence in a microservice architecture, and I’m trying to do that to satisfy the request coming from, say, a browser or a mobile device, if the user has got bored and hit the big old refresh button, there’s a good chance that even if I get the answer they want, they’ve moved on, their browser state has moved on, but those resources are still tied up. In a situation where you could be more generous with your timeouts from a technical point of view, in terms of keeping your system healthy, but the users are only going to wait a second, the user expectations override that.
Of course, there are also some things you could do at this point, which is, maybe we could redesign the UI experience. We could maybe make it a little bit more asynchronous in terms of the way information flows through, and you could look at the design of your UX to stop people just hitting the refresh button.
The third thing, and this is quite important, is make sure you can change your timeouts without changing code and without requiring a redeploy. This is a very straightforward thing. Yes, you can hardcode your timeouts. No, you should not hardcode your timeouts. They should be in some form of configuration so you can change them in the live system.
Firstly, this is useful so you can tailor and tune your timeouts when doing load testing, but also because sometimes you find something’s gone wrong in production, and fixing a timeout would help you out hugely. If you have to wait for a redeploy of your system, that’s not going to be great. This is a simple little configuration; I think this is from a Spring Boot use of resilience4j. Just have some ability to specify your timeouts on a per-service basis, sticking this in your config store or a text file, whatever else it might be.
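The slide itself isn’t reproduced here, but a minimal sketch of the same idea in plain Java with resilience4j might look like this; the file path, property key, and service name are all made up, and in Spring Boot you’d more likely put the values in application.yml.

```java
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.io.FileInputStream;
import java.time.Duration;
import java.util.Properties;

public class TimeoutConfig {
    // Reads something like "payment-service.timeout-millis=500" from an external file,
    // so the threshold can be changed without touching code or redeploying.
    static TimeLimiter paymentTimeLimiter() throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("/etc/myapp/resilience.properties")) {
            props.load(in);
        }
        long millis = Long.parseLong(props.getProperty("payment-service.timeout-millis", "500"));

        TimeLimiterConfig config = TimeLimiterConfig.custom()
                .timeoutDuration(Duration.ofMillis(millis))
                .build();
        return TimeLimiter.of("payment-service", config);
    }
}
```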
Fundamentally, what timeouts are about is prioritizing the health of the overall system over the success of a single request. We’re accepting the premise that one request might be dropped to keep the rest of the system healthy. That is an idea that runs across a lot of the foundational resilience patterns out there. In general, we also want to look for behavior that’s fairly consistent. If your service has a wide range of normal behavior, it becomes quite difficult to start setting those thresholds. What you really want is a much tighter band. If you’ve got really inconsistent behavior, where anything from here to here is normal, you then have to set your timeouts quite a long way out to cover a sufficient proportion of your traffic, and that in turn can mean more resource contention. Rather than this, having something like this would be a bit nicer. Again, these are slightly contrived examples.
One thing to mention about these tail latencies: it is worth looking into them. If you’re getting tail latencies that come up a lot, understand why. Is it a different piece of functionality that’s being used, for example? It might be the same endpoint, but with much more expensive operations, in which case separating those endpoints out could be useful, because then you could have different timeouts for different behavior. You also have to remember that behind these tail latencies there are often humans, and so, yes, you might be timing out on them, but if that’s leading to a poor experience on the other side, that might not be great. There are things you can do, like request hedging, when you get quite advanced on these things. There’s a quite well-known paper in Communications of the ACM called “The Tail at Scale”, so that’s worth digging out if you want to go a bit deeper into some of the strategies that people use to deal with tail latencies.
2. Retries
We’ve given up. We’ve got to that point, and we thought, Leslie is just not giving us what we need, and we’re giving up. We’re moving on. Then we thought, but maybe she will now, so we do retry. Retrying makes sense, and it makes sense for a lot of reasons, despite what this guy might say. That’s because sometimes you can’t reach the thing you want to talk to, and quite often, that is a transient issue. I used to work at an engineering company, and we had these different buildings on a campus. There was trunking that ran underground, where all the pipework used to go for the various electrical supplies and things like that. The room next door to here is actually called Whittle, named after Frank Whittle, who developed the jet engine during World War II. That was the site where I was based, and there was still the old equipment knocking around. We started having issues with our network.
I mentioned earlier that sometimes not being able to talk to something is outside of your control. These are rabbits, and we have rabbits here. Rabbits like little tunnels. They make them themselves. If people make little tunnels for them, the rabbits move in and think, these are great, thank you very much for the accommodations. They moved into these tunnels. There is something that rabbits like more than anything else in this world; a good friend of mine keeps house rabbits, and what they love is the plastic insulation around cables. Rabbits ate my network. How am I going to control for that situation? Am I going to get down there with a shotgun? I’ve seen Watership Down. I was traumatized by that as a child. That isn’t happening.
Often the issues we face are maybe not as fun and might be a bit more transient than a rabbit. I make a call to a service. That call gets routed via a load balancer to an instance, which we think at that point is healthy, but when I establish that connection and send that request, something seems to be going a bit wrong. Maybe it’s not responding quickly, and maybe it starts timing out. Maybe what’s actually happened is that it’s got a memory leak, and the out-of-memory killer from Kubernetes has nuked the thing from orbit, and so my request failed. There are other healthy nodes there that could happily have taken my request, so if I tried again, it would get through to a working node. We know these things happen, and so retrying makes sense. If at first you don’t succeed, it might make a lot of sense to try again because it might just work. That then opens up some questions. How many retries? Think about that a little bit. How many retries do you want to make? Because we do need to give up.
If you retry too often, that can lead to a server being overloaded. Remember, sometimes when you send a request to a server and, for example, you time out, those resources may still be being used by the server to try and process that request. By giving up and trying again, you might be increasing the amount of load on a server, and especially if you’ve got a lot of clients for a given server instance, retrying too often can cause some significant problems. Having some kind of limit on that makes sense. I should also say that part of what we’re doing here, by having a limit on the retries on a client, is trying to get the clients to be well-behaved, to effectively apply some very simple rate limiting to reduce the load on a server. Servers themselves should also have mechanisms in place to shed excess load. You should have some load shedding in place at the server. Again, a conversation about load shedding and backpressure is for another time.
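As a minimal sketch of that idea (not the speaker’s code), a capped retry loop might look like this; delays and jitter between attempts come next.

```java
import java.util.concurrent.Callable;

public class CappedRetry {
    /** Tries the call at most maxAttempts times; rethrows the last failure once the budget is spent. */
    static <T> T callWithRetries(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // possibly a transient failure: if we still have budget, go around again
            }
        }
        throw last;       // budget spent: give up rather than keep hammering the server
    }
}
```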
Again, as with timeouts, make sure your retry limits can be specified in a configuration file that can be changed without a redeploy. Back in 2017, a company called Square had quite a large outage. They used a system internally called Multipass, which was used for their multi-factor authentication. Multipass needed to fetch information from a Redis service. There was a bit of an outage. Things were getting restarted. Multipass went to Redis to get the information it needed to start up, and of course, because this was being used for multi-factor authentication, it was needed across a lot of the system as a whole. Redis didn’t respond very quickly, so Multipass tried again, and then it tried again, and then it tried again, and then it tried again, and then it tried again. It turned out that it was hardcoded into the Multipass client code for Redis that it would retry 500 times with no gap between those retries. It was basically a self-inflicted denial-of-service attack. You had this vicious spiral. Redis was not in good shape.
Then Multipass goes, hello, and repeatedly pummels it. Then Multipass gave up, and the people thought, just restarting stuff normally works, so they restarted the Multipass nodes, which then proceeded to keep pummeling Redis. Unfortunately, those configuration values, that 500 retries and the lack of a gap between those retries, were hardcoded. That’s fixed now. I’m really thankful for companies like Square; they talk about this stuff publicly. They explained what happened. They put the details out there. They shared what they then did to fix it. This is how we learn. It’s all very well thinking, yes, you eejits, but I’ve made worse mistakes than that. I’ve just never done anything as successful as Square, so nobody noticed. I always love these things. If you do have a failure of your own, for the good of the industry, please tell people about it, because it’s cool, and we learn from it. Thank you very much to the people at Square for sharing this stuff.
There was something there about that delay. It wasn’t just that 500 retries was probably too many; at a certain point, you probably should give up, because this Redis server isn’t getting back up again. Having a delay is also useful, because, again, if the retry happens too quickly, a server that’s on its knees is not going to recover very quickly. You also want to make sure that your delays aren’t too regular. Otherwise, what can happen is, when you start getting into a failure mode, the clients can start lining up in waves, and their retries can all appear at the same time. They keep hitting you, and hitting you, and hitting you. In this example here, what we see is the load building up on the server.
At a certain point, the level of load has gone over what would normally be healthy for this particular server. As a result, a portion of the stuff above that red line, a portion of the requests above this red line, are now starting to have issues and are timing out. A small portion of these calls are timing out. Because they’re timing out, they’re going to have a little bit of a think about what they’ve done, and then they’re going to try again. They’re going to have a retry. In this example here, they’ve all got the same fixed retry mechanism. They all come back around the same time, which then just pushes the spike a little bit higher.
Some of those retries work, and some of those retries fail a second time because you’re building up that traffic spike again, and so a higher portion of those calls are failing. Then they have the same delay, and then you end up with another spike. What can start happening is things just get worse, and worse, and worse. The easy answer to this is actually to space out the retries a bit more to avoid this collision. If you all have a regular retry interval, you just have that slamming wave that happens.
One of the earliest and easiest things to do is to insert some artificial network jitter. In networking, jitter is a variation in latency. Historically, jitter would be something that is, again, a little bit outside of our control; it’s a variation in latency that we’re seeing. What we do here is insert some fake jitter. Adding just a little bit of randomness to how long you wait will give you some natural spacing out of those retries, and should smooth out these peaks a little bit and reduce the chance of these waves happening. This is something that Polly and resilience4j will do for you, but now when you see the word jitter, you know what it’s doing: it’s applying some randomness to the delay. Then you end up, hopefully, with smoother peaks instead. These graphs show exactly the same number of calls, just with the spacing out happening. You can see that we’re evening things out.
Actually, in general anyway, you don’t want too many peaks and troughs if they’re somehow under your control, because the troughs are unused compute and the peaks are too much demand on your compute. Ideally, you’d like to even it out a little bit. It’s all a bit about spacing things out: you’re evening out your resource consumption as well. Obviously, one of the other ways you can space things out is what people always describe as an exponential backoff. Be a little bit careful about fully exponential backoffs, because they’re going to lead to really long delays.
Say your initial gap is, I’m going to try, and if it fails, I’m going to wait 500 millis before I retry again. With an exponential backoff, the next wait is going to be a second, and the one after that 2 seconds. Very quickly, you can get out to really long gaps between these retries. Of course, if you’ve got an overall timeout budget, you’ve probably blown through that already. It’s quite common to have delays between your retries that are getting bigger but maybe not fully exponential. Again, resilience4j and Polly have configurable options for this if you want to play around with those a little bit.
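Here is a sketch of what that might look like with resilience4j’s retry configuration; the service name and the numbers are illustrative, and Polly has equivalent options on the .NET side.

```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class RetryWithBackoffAndJitter {
    static Retry paymentRetry() {
        RetryConfig config = RetryConfig.custom()
                // Three attempts in total, then give up.
                .maxAttempts(3)
                // Start around 500 ms, double each time, and randomize the wait so that
                // a crowd of clients doesn't retry in lockstep waves.
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0, 0.5))
                .build();
        return Retry.of("payment-service", config);
    }
}
```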
This then leads us to the tricky part and the third pillar of our fundamentals today. That’s answering this question: is it safe to retry? Because I think, fundamentally, making it safe to retry is a really important thing that you need to do to give your system the best chance of actually working at any given point in time. It does require work to make retries safe. What we’ve talked about so far is what the clients can do to make retrying safer: having a little spacing out in their retry logic, having a limit on how many retries they apply to keep our resource consumption low. There are also things we potentially need to do at the server. In this scenario here, when I said, pay Sam £100, if I don’t get a response from the server, there are two possibilities.
The first possibility is that when I sent that request, it wasn’t picked up and processed by the server. The reason we didn’t get a confirmation that, yes, you have paid Sam £100, is because the request was never processed in the first place. I send that request and nothing happens, because it never got there. Or it could have been that it did get there, and then halfway through the processing of it, the payment service fell over, and so the transaction was never written. Same deal. Fundamentally, the payment wasn’t made. That’s our first possibility.
The second possibility is a little bit more tricky, which is, we said, pay Sam £100, and the payment service obviously wants to pay me £100. I say, pay Sam £100, and it did pay Sam £100. It sent the response back, but for some reason that response was not received and acted on appropriately. If this is a messaging system, it could be that the response got lost because the message broker fell over. If this was a standard HTTP request-response based flow, it might be that the order service crashed before it received and processed the response coming back in.
Fundamentally, from the viewpoint of the client, which in this case is the order service, if I send a request and don’t get a response, each of these possibilities looks exactly the same, and I can’t know which one has actually occurred without additional information. The problem with that, of course, is you say, pay Sam £100, and the payment was made. You didn’t get the response, so you do it again because you really want to make sure Sam gets paid. Now, your expectation may have been to give me £100, but I’ve got £200. I’m quite happy with the situation, and I think we can leave it there. Of course, this is no way to live.
3. Idempotency
Now, of course, you say, no one would build payment systems like this, and yet they do. I could point you to multiple case studies of things like peer-to-peer payment systems draining people’s bank accounts because the acknowledgements were getting lost. This happens a lot more than you might think. The thing about this is that making retries safe is really easy to do if you design it in up front, and really painful if you have to retrofit it. I’m going to take you through both parts of that. The concept we’re looking at here is called idempotency. An idempotent operation is an operation that you can apply multiple times without changing the result. In our situation here, this operation is not idempotent, because when I retry the operation, I do change the result.
The result should have been that Sam has an extra £100, but what actually happens is Sam has £200. This operation here is not idempotent because we don’t get the same result. There are two ways we typically solve this problem. Request IDs: nice and simple, but you’ve got to do it up front. Or server-side fingerprinting, which is messy and often something you might have to consider for older legacy software. Request IDs are easy; do this if you possibly can. When you send a request, you use a unique ID to identify that specific request. This then allows the server to understand: I have seen this request before, I processed it before, and I can just give you the result that I gave you before. Pay Sam £100, here’s my payment ID. I did the processing. You’ve retried it for some reason, and I can just give you the same answer.
The net result is we only process it once. Sam only gets paid £100 once. Just to be clear, the payment service can still log the second request. I can still have it appearing in my metrics and everything else, but I’m not going to pay the £100 again. The issue is that if you don’t already have it in place, adding a request ID in to make operations idempotent will require a change in the client-server protocol. This may well be a backwards-incompatible change, depending on how you’re implementing it. This is why retrofitting it is problematic.
If you make these IDs optional, you only get idempotency when people give them to you. That might be ok in some situations, maybe as part of a migration, but the key thing to really get this working well is to make it a required field, and that’s where retrofitting gets problematic. If you look at the public APIs of people like Stripe and AWS and Azure, they all have request IDs on any write operations for this very reason. That’s built in as part of the protocol. It’s often hidden from you when you’re using the SDKs, but it is there.
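To make the server side of this concrete, here is a minimal sketch (mine, not from the talk) of honoring a client-supplied request ID; the in-memory map stands in for whatever durable store you would actually use.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class IdempotentPaymentHandler {
    record PaymentResult(String status, int amountPence) {}

    // In real life this would be a durable store, written in the same transaction as the payment.
    private final ConcurrentMap<String, PaymentResult> processed = new ConcurrentHashMap<>();

    PaymentResult pay(String requestId, String payee, int amountPence) {
        // The payment only runs the first time a given request ID is seen; a retried request
        // with the same ID just gets the stored result back. Duplicates can still be logged.
        return processed.computeIfAbsent(requestId, id -> doPayment(payee, amountPence));
    }

    private PaymentResult doPayment(String payee, int amountPence) {
        // Stand-in for the code that actually moves the money.
        return new PaymentResult("ACCEPTED", amountPence);
    }
}
```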
If you can’t retrofit it in, so you’ve got an older legacy system where maybe you’ve got clients that you’re not in control of, the next option is to look at server-side request fingerprinting. This is basically generating a fingerprint of the request so that you can identify, when you get that request coming in again, that this is a duplicate. This can be easy to implement in some situations and a nightmare in others. The basic idea goes like this. Here’s my request coming in. Again, I’ve been using HTTP in this example, but this applies to anything really; it could apply to the payload of a message. I’ve got some headers, some metadata, and my body. You take the body and you generate a fingerprint of it, in this example using an MD5 hash. The server, when it’s done that processing, can then store that hash and say, I have processed this, and here’s the response that was sent.
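A sketch of generating that fingerprint might look like this, using MD5 purely because that is the scheme in the example; the helper name is made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RequestFingerprint {
    /** Hashes the request body so the server can recognize an identical body arriving again. */
    static String fingerprint(String requestBody) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(requestBody.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // The server stores this hash alongside the response it sent; a later request with the
        // same hash can be treated as a duplicate and given the stored response.
        return hex.toString();
    }
}
```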
If I get that subsequent call, I can send back the same response. The issue with this is when you start getting protocols where the body changes when you retry. You shouldn’t do that, but it does happen. Again, we’re often talking about situations here where the software is legacy; we’ve learned things since then. In this example, we’ve got quite a trivial one, where the timestamp of when the request was sent is actually in the body. I retry it, but the client code updates the timestamp in that request body, and we end up with two completely different MD5 hashes. We look at this and say, this is a retry attempt, but the fingerprint doesn’t know that.
Then you start thinking, we’ve got situations like this, so how do we solve this issue? Again, in a situation where you’re having to retrofit this into an existing system, you end up having to apply a mask on top of the body: you say that only certain fields should be included in your fingerprint.
This is, again, manageable if those bodies are fairly structured, so you can pick out the certain fields you care about. You’re now getting into the world of it being a bit more fiddly, and if you don’t have a schema for the data you’re being sent, this can get a bit tricky. If it’s an unstructured payload, working out those fingerprints can be difficult as well. Obviously, the correct solution in this example is to identify that something like the timestamp is actually part of the metadata and should be moved up into the header. But if you could make that change in the client code, you could just use a request ID. You should probably do both things if you can change the client code here, because the timestamp should be in the header and the request ID should be there as well.
This solves our problem. Does it? You’ve paid Sam £100, but a week’s passed, and you’re thinking, something is missing in my life. What is it? I need to pay Sam another £100. You log in and you send Sam another £100, but the bodies of the two requests are the same, so the MD5 hashes are the same. It was your desire for me to have more money, £200. I need to stop this joke at some point. This is a problem now. What you wanted to happen was that there was another payment going out a week later, but with a fairly naive approach to our fingerprinting, we might identify this as a duplicate.
In this scenario, you might say the way you solve this is to have a period of validity for the fingerprint. You might say, if I see a duplicate within a minute, we assume it’s a retry; after that, we assume it’s a new thing. That might not work, though, in a DR scenario where you might have to replay a whole day’s worth of events. Then you might say, ok, within a day. Now I can’t make the same payment more than once in a day. What about a scenario where you are legitimately making the same call, or what looks like the same call, multiple times in a short period of time? For example, a for loop where you’re spinning up loads of the same VMs, a very common thing to do in infrastructure automation. Each of those would look like exactly the same request taken in isolation, but they are actually different requests.
Again, this is why AWS, for example, and Azure have request IDs. If that’s the kind of world you’re operating in, these fingerprints start looking more difficult to implement. If you’ve got cases where you really do want to do the same thing multiple times, and those are valid operations, again, these fingerprints are difficult.
One last thing that’s worth addressing here is, what should you do, or what response should you send, when you’re ignoring the retried operation, when a server receives an operation that has already been carried out? Do I go back and say, no, you did that before? The problem is that going back and saying you did that before isn’t helpful, because by definition, if the client had got the original response, it wouldn’t be asking me again. It needs the original response, and so it should get the original response.
If when they called before it was a 200, everything’s ok, you send it back. I sent you a 200, payment accepted, you didn’t get it, you’ve asked again, so I just send you back the same thing. I can put information in the headers here. I can put information in the metadata of that second response saying, look, you keep asking for this, this is the 15th time you’ve asked to do this. That can be really useful information. From a client-side point of view, when you’re trying to diagnose why these things are taking so long, why you’re having so many retries, that metadata coming back to the client telling you that this is effectively an already-carried-out operation can be really useful for the human operators, even if you’re not programmatically doing anything with it at that moment in time. This also applies if the first response was an error, unless it was an error that meant, I couldn’t process the request at all.
If I actually did try to process the request and something went wrong with processing it, maybe because you gave me some bad information, you weren’t allowed to do it, or there was no money in the account, you send back the same response in this situation. There’s nothing else that’s really appropriate to do. You don’t want to get into the business of re-operating on things that previously were an error, because you’re hiding a lot of important semantics then.
In general, request IDs are the way to go. If you can change the client-server protocol, using request IDs is a very simple way to give you the tools to make those operations that need to be idempotent actually idempotent. It’s not a lot of work if you build it in up front. It is painful to retrofit it. This is all about giving the clients the freedom to retry. If you give clients the freedom to retry, clients are much more likely to be successful.
If you can’t retrofit them, then look at server-side fingerprinting, but there are a lot of downsides. There are actually some niche situations in which I would consider using both of these together; the AWS APIs actually do use both together. Really, if I’m retrying an operation with the same ID, I shouldn’t be in the business of changing the body of the request. Sometimes people do, and that can lead to confusion. If I say, pay Sam £100, and you do it, but I didn’t get the response, so I retry it. I send it again with the same request ID, but now I change it to say, pay Sam £200. You send me the acknowledgement of the first request. I think you’ve paid £200 when you’ve actually paid £100. It can lead to all sorts of confusion.
Now, again, a client should not change the body when it resends the request, but it can be useful to pick up that they’ve done this, to tell them that they’re a naughty person. On the AWS APIs, for example, they do server-side fingerprinting. If you do change the body, it will actually constitute an error state, and it will come back and give you the appropriate slap on the wrist. This can be useful, maybe in a corporate environment where you might want to be doing some education around how these things are done. Again, once you’ve already got the request ID, also storing the MD5 hash of that request in the database table alongside the request ID is not a lot more work, so it might be something you want to consider. Of course, because that hash is linked to the request ID, all the concerns you have around things like periods of validity go out the window, because those request IDs will be unique across a longer scope than that. You might consider using them together.
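Putting the two together, a sketch of that combined check might look something like this; the names and the error handling are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class IdempotencyWithBodyCheck {
    record StoredResult(String bodyHash, String response) {}

    private final ConcurrentMap<String, StoredResult> processed = new ConcurrentHashMap<>();

    String handle(String requestId, String body) {
        String hash = fingerprint(body);
        StoredResult previous = processed.get(requestId);
        if (previous == null) {
            String response = process(body); // stand-in for doing the real work
            processed.put(requestId, new StoredResult(hash, response));
            return response;
        }
        if (!previous.bodyHash().equals(hash)) {
            // Same request ID but a different body: the client is misbehaving, so tell it.
            throw new IllegalArgumentException("Body changed on retry of request " + requestId);
        }
        return previous.response();          // a genuine retry: replay the stored response
    }

    private String process(String body) {
        return "ACCEPTED";
    }

    private static String fingerprint(String body) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(body.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```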
We’re in a situation where doing the same thing over and over again may be eminently sensible, to a point, but we can only do this if we can make those retries safe. This quote is clearly flawed: “The definition of insanity is doing the same thing over and over again and expecting a different result”. We’ve learned in the world of distributed systems that, no, this is actually pretty rational, which is also part of the thing we don’t like about distributed systems. We like our computers to be deterministic. Distributed systems aren’t deterministic, really; not in the way we’d like them to be. Of course, we’re embracing non-determinism now because everyone’s using AI, and it’s like, I asked you the same question two minutes ago and I’m getting a totally different answer, and this is driving me mad.
We’ve been doing this for 50 years in the world of distributed systems, because stuff happens outside of your control, so sometimes just asking again is the right thing. Do you know the other thing we keep doing over and over again that probably isn’t right? Attributing this quote to Einstein. He never said this. Einstein did not say that quote. Maybe the definition of insanity, at least in this context, is that we keep attributing this very flawed quote to a very smart person. Because the more we look at that quote, the more we think, that’s a bit of a dumb thing to say, isn’t it? It turned out he didn’t say it.
Recap
I talked about the three golden rules of distributed systems. You might consider it to be an overly reductive way of looking at the world, but I think when you peel through all the complexities of the why we do certain things, at the heart, at the bottom of all of it, we find these three golden rules of distributed systems, which is, we cannot beam information instantaneously between two points. Sometimes the thing you want to talk to isn’t there. Resources are not infinite. Behind all the complexity and all the flim-flam and everything else, these things remain true. Importantly, no tool can magically make these things go away.
You can use things like Polly or resilience4j to help you manage your timeouts and your retries. They aren’t going to automatically work out what those things should be, or how your application should react when you ultimately get to a point where you have to give up on something. There’s always a business implication for that. If I can’t check inventory to know if something’s in stock, do I still go ahead and make the sale? That is not a question that can be answered by a library. That’s a question you need to ask yourself.
Hopefully you at least get some understanding of some of the fundamental ideas here, so that you’re better armed when you pick up resilience4j and everything else to use these tools well. There’s more we could talk about here. We could start talking about the flaws of circuit breakers, leaky-bucket rate limiting in client-side code, and all this stuff. Again, these things are implemented in those libraries, and those are the next places you’d want to go. I would say your next stop, if you’re thinking about making things resilient, would be to look at load shedding and backpressure. That’s the next place I would go.
To recap the fundamentals of what we talked about today: sometimes you should give up. That’s appropriate. You need timeouts. Not having timeouts is dangerous in a system, because it is a very quick path to resource saturation and systems falling over. Balance your normal system behavior against user expectations to find that happy medium. If at first you don’t succeed, retry, to a point. Five hundred is probably too many retries. What’s the right number for you? Typically, I find the right number comes from that overall budget of how long a user’s going to wait, and you can use that as part of the decision-making process. If you are going to have delays between retries, and you should, you want a little bit of unevenness around them. Look at jitter and maybe exponential backoffs. If you do want to retry, making your operations idempotent is vital. Hopefully, I’ve given you a couple of tips about how you can go about doing that.
Resources
If you want to know more, you can go read an early access version of my book. It’s up at oreilly.com. You can get a free account. You can log in. I think you get 10 days or something. The first six chapters are available. I have written 7,000 words just on timeouts. It is the most fun I’ve had in the last 18 months. It’s been so much fun. I go a lot deeper into timeouts in that section as well; we go into things like passing timeout budgets downstream and all this other stuff. A lot of the content I covered today is in there, just in a lot more detail. You can also sign up to a mailing list there to get more information about the work that I’m doing and when updates to the book are available.
Questions and Answers
Participant 1: You mentioned timeouts, and as far as I understand, the approach you are proposing is using timeouts for two different things. One thing is to say, this system is unhealthy, we have to detect it, and we have to put that information back to a user or whoever called the API. Second is to limit resource usage.
Newman: Timeouts aren’t just about saying a service is unhealthy. All a timeout tells you is, I can’t talk to the server, so there could be a lot of other things going on there. A timeout in isolation doesn’t tell you a service is unhealthy. That’s an important thing to understand.
Participant 1: Yes, I agree with that. Then you also said that timeouts are not really hardcoded in certain ways, because they are dynamic. From that perspective, it means that it’s a dynamic SLA of a third party or subsystem.
Newman: Absolutely. If you’re in a situation where you already have defined SLOs with other teams inside your organization, if they’re giving you a clear SLO around, for example, their 99th percentile response times, you could absolutely take your cue from that as being what your timeout should be on a client side from that situation.
Participant 1: From that perspective, isn’t it actually better to say, I expect that this request will take 100 milliseconds in a bad case, multiply it by three, and if I really do hit a timeout, we are completely done, or there is a violation of a main concept?
Newman: You can’t infer that a server is unhealthy because a client is timing out, because there are lots of issues that can be local to a client.
Participant 1: Sure, but it means that the system, the total system is not working.
Newman: No.
Participant 2: You mentioned that server-side fingerprinting could be done using an MD5 hash. There is a slight possibility the hash could be duplicated for different messages. When that happens, maybe once in 30 years, that’s going to be a disaster, because it’s going to be very hard to detect that rare occurrence.
Newman: Firstly, it’s not impossible to detect a rare occurrence. What I would say is, if it is a 30-year span, the time-validity argument about the MD5 hash comes back into play. Secondly, I picked MD5 hashes just as a scheme that we all fairly well understand, and also, very importantly, because I could go to a website and say, here’s some text, give me an MD5 hash. There may be schemes out there that don’t have that collision problem. It is a theoretical possibility. I would argue, if in your situation that theoretical possibility is, as you say, an absolute disaster, then just put the request IDs in.
Participant 2: Because that’s going to happen so rarely, it’ll be very difficult to find out what happened. The payment was accepted by a system, but it never happened.
Newman: If you really are worried about that scenario happening, then I’d put the request IDs in.