Lessons On How To Get Timeouts, Retries And Idempotency Right From Sam Newman At QCon London

At QCon London, Sam Newman, the architect who has been credited with coining the term microservices, went back to basics to underline the three critical things to get right when working with distributed systems: timeouts, retries and idempotency. Throughout the talk, he described mechanisms that make distributed systems more robust.

He started his presentation by “poking at” the quote “Insanity is doing the same thing over and over again and expecting different results,” stating that in many situations, especially in distributed systems, doing the same thing again is advisable. Further, he underlined that developers don’t need to do complex analyses of Paxos vs Raft vs SWIM, or even debate the nuances of the CAP theorem; they just need to wrap their heads around timeouts (“knowing when to give up”), retries (“how many times should I try again”) and idempotency (making retries “a bit” safe).

Leslie Lamport: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your computer unusable.”

To further frame the context, he enumerated the three “golden rules” of distributed systems:

You can’t beam information between two points instantaneously

Sometimes, you can’t reach the thing you want to talk to

Resources are finite

Before going into detail on how to make distributed systems more robust, he stressed that these three rules, taken together, underpin all the complexity that distributed systems hide behind different abstractions.

Timeouts: A threshold after which a request will be terminated if not completed

The system uses computational resources (CPUs, threads, or memory) while waiting, regardless of whether the IO is blocking or non-blocking. Waiting “a lot” lets requests pile up until the system is overwhelmed, which translates into “stuff falling over.” The user experience may also degrade: how long will the customer wait for the action to finish?

It’s challenging to get the timeout right. To avoid timing out too quickly or waiting too long, you mainly need to understand two things: how long operations usually take in your system and what the user expects from it (“When are they starting to hit the refresh button of the page”). Besides finding the proper timeout value, it’s essential to make the system’s behaviour more consistent: allocating sufficient resources helps keep the duration of calls within a more compact time frame. Also, the system should allow timeout values to be changed without recompiling or redeploying.
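As an illustration of a timeout whose value can change while the system is running, here is a minimal Python sketch. The names `TimeoutSettings`, `call_with_timeout` and `slow_call` are hypothetical and not from the talk, and how the new value reaches the process (config service, admin endpoint) is left as an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class TimeoutSettings:
    """Holds the timeout so it can be changed while the process is running."""

    def __init__(self, seconds: float):
        self._seconds = seconds
        self._lock = threading.Lock()

    def get(self) -> float:
        with self._lock:
            return self._seconds

    def set(self, seconds: float) -> None:
        # Assumption: something external (config service, admin endpoint)
        # calls this; the talk summary does not prescribe a mechanism.
        with self._lock:
            self._seconds = seconds


_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(operation, settings: TimeoutSettings):
    """Wait for operation() only as long as the current timeout allows."""
    limit = settings.get()  # read on every call, so changes apply immediately
    future = _pool.submit(operation)
    try:
        return future.result(timeout=limit)
    except FutureTimeout:
        future.cancel()  # best effort; a task that already started keeps running
        raise TimeoutError(f"gave up after {limit}s") from None


def slow_call():
    time.sleep(0.2)  # stands in for a slow downstream dependency
    return "done"
```

With `TimeoutSettings(0.05)` the call to `slow_call` is abandoned; after `settings.set(1.0)` the same call succeeds, with no restart in between. Note that the caller stops waiting, but the worker thread still finishes its work, so a real client would also want the underlying network call to carry its own timeout.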

Newman: “Timeouts are about prioritising system health over the success of a single request”.

Retries

Like timeouts, choosing the proper number of retries is challenging. Too many retries amount to a self-inflicted DoS attack. To make systems more resilient, you should implement rate limiting on the client and server-side mechanisms to shed excess load. Also, introducing artificial jitter (random-valued delays between retries) gives your systems time to recover from failures. Newman warns against introducing exponential backoff, as that will put more pressure on your system rather than relieve it.
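A bounded retry loop with random jitter might look like the following Python sketch. The function name, the default values and the choice to retry only on `ConnectionError` are assumptions made for illustration, not details from the talk.

```python
import random
import time


def retry_with_jitter(operation, max_attempts=3, base_delay=0.1,
                      max_jitter=0.1, sleep=time.sleep):
    """Call operation() up to max_attempts times with a randomized pause between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError as error:  # retry only failures assumed to be transient
            last_error = error
            if attempt < max_attempts:
                # Jitter: a random extra delay so that many clients do not
                # all retry at the same instant.
                sleep(base_delay + random.uniform(0, max_jitter))
    raise last_error  # retry budget exhausted: give up and surface the failure
```

The cap on attempts is what keeps retries from turning into a self-inflicted DoS, and the `sleep` parameter is injectable only so the loop can be exercised in tests without real waiting.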

Idempotency: the property of an operation to be applied multiple times without changing the result.

The last fundamental pillar of distributed systems is ensuring that it’s safe to retry calls. While the first two pillars focus on what clients need to do to make systems safer, the last one is all about behaviour on the server side. According to Newman, there are two possibilities if a client doesn’t receive a response from a server:

The request didn’t go through. Hence, the server didn’t have anything to process. In this case, there is no problem.

The request was processed, but the response didn’t reach the client. The system already applied the change, but the client wasn’t notified, so a blind retry risks applying the change twice.

Idempotency is easy to implement upfront but harder to retrofit. He mentioned two ways of implementing it. The first is a request ID, an approach that multiple major cloud providers use but that requires changes on the client side.
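The request ID approach can be sketched in Python as follows. `PaymentService` and its `credit` method are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real server would use.

```python
class PaymentService:
    """Toy server that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self._responses = {}  # request ID -> response already returned

    def credit(self, request_id: str, amount: int) -> dict:
        # A retry carrying a known request ID gets the stored response back
        # and the change is not applied a second time.
        if request_id in self._responses:
            return self._responses[request_id]
        self.balance += amount
        response = {"balance": self.balance}
        self._responses[request_id] = response
        return response
```

The client has to generate the ID once (for example with `uuid.uuid4()`) and reuse it on every retry of the same logical request, which is the client-side change this approach requires.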

The alternative, fingerprinting the request, keeps the changes isolated on the server side. You need to ensure that the fingerprint is based on information that stays consistent between requests (avoid timestamps, which should be part of the header in the first place) and that it is time-bound. Another consideration is that you must notify the client that an earlier identical request was already processed, and a good place for that information is the response metadata.
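A minimal Python sketch of request fingerprinting follows, assuming the request body is JSON-serializable and that a SHA-256 hash of its canonical form is an acceptable fingerprint. The class name, the 60-second window and the `duplicate` metadata field are all hypothetical.

```python
import hashlib
import json
import time


class FingerprintDeduplicator:
    """Detects repeated requests by hashing the request body, within a time window."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # fingerprint -> (time stored, result)

    @staticmethod
    def fingerprint(body: dict) -> str:
        # Canonical JSON (sorted keys) so field order does not change the hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def handle(self, body: dict, process) -> dict:
        now = self._clock()
        key = self.fingerprint(body)
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Same fingerprint inside the window: return the earlier result
            # and flag the duplicate in the response metadata.
            return {"result": entry[1], "metadata": {"duplicate": True}}
        result = process(body)
        self._seen[key] = (now, result)
        return {"result": result, "metadata": {"duplicate": False}}
```

Because the fingerprint covers only the body, any timestamp placed there would make every retry look new. The sketch also never purges expired entries; a real store would need eviction.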

When the request’s body might change between retries, it is better to implement both mechanisms.

Newman closed the presentation by stating that, in distributed systems, doing the same thing repeatedly is eminently sensible, but only up to a point and only when you can make those retries safe, and by humorously pointing out that the quote is falsely attributed to Albert Einstein.
