Using the analogy of addressing the lunch rush in restaurants, Michael Haken, senior principal solutions architect at AWS, describes how Amazon builds both well-behaved clients and well-protected services through operational and architectural strategies. “Resilience lessons from the lunch rush” shares strategies used by the cloud provider for managing queue depth, implementing automated capacity forecasting, and employing load-shedding techniques.
The article stresses how automated capacity forecasting and auto-scaling can help ensure services stay ahead of demand by significant margins, and how Amazon favors being over-provisioned to provide a buffer in capacity. Explaining why solving problems of load in restaurants resembles how resilient systems are built in the cloud, Haken writes:
Restaurants need to manage customer demand (load) as well as service time (latency) to maintain the customer experience their patrons expect (…) Some of the approaches we used to manage demand and service time are operational. They are the ways the restaurant responds when load or latency starts to increase. Other restaurants I worked at were able to use architectural approaches. The restaurant’s systems and workflows were designed to help prevent overload from occurring.
The author argues that true overload scenarios are rare events in the cloud, but when one of these exceptional events does occur, there are different strategies to prevent impact on the customer experience. Some are operational strategies for reacting to overload after it has been detected by observability systems; others are architectural strategies built into the design of the services. There are also patterns for building well-behaved clients.
Three operational strategies are suggested: load shedding, auto-scaling, and fairness. Load shedding means temporarily and intentionally discarding work to protect a service from becoming overwhelmed during transient spikes in load. Haken warns:
The effects of load shedding are generally indiscriminate in their impact on customers. It’s a coarse-grained tool to control load. While it helps extend the runway the service has to maintain a high level of goodput, it also has an impact on the customer experience, so it’s not the only solution we want in place to handle the situation.
Source: AWS documentation
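The idea behind load shedding can be illustrated with a minimal sketch: when the backlog of in-flight work exceeds a fixed depth, new requests are rejected outright so that accepted requests keep meeting their latency target. The class name and threshold below are invented for this example, not taken from AWS.

```python
import queue

class LoadSheddingQueue:
    """Bounded work queue that sheds (rejects) new work when full."""

    def __init__(self, max_depth: int):
        self._queue = queue.Queue()
        self._max_depth = max_depth
        self.shed_count = 0

    def try_enqueue(self, request) -> bool:
        # Shed the request instead of letting the backlog grow unboundedly;
        # the caller gets an explicit rejection it can retry later.
        if self._queue.qsize() >= self._max_depth:
            self.shed_count += 1
            return False
        self._queue.put(request)
        return True
```

As the quote above notes, this rejection is indiscriminate: the queue has no notion of which customer's request it is dropping, which is why load shedding is only one of several complementary mechanisms.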
Fairness and quota management help deliver a consistent, single-tenant experience in a multi-tenant environment and usually imply rate-limiting customers who exceed their quota, relying on algorithms such as token bucket, leaky bucket, exponentially weighted moving average (EWMA), fixed window, or sliding window. Automated forecasting is instead considered the primary strategy for ensuring sufficient capacity.
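A token bucket, the first of the listed algorithms, can be sketched in a few lines: each customer gets a bucket that refills at a steady rate up to a burst capacity, and a request is throttled when the bucket is empty. The capacity and refill rate here are illustrative values, not AWS's.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter for one customer's quota."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.clock = clock              # injectable for testing
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller is rate-limited
```

In a multi-tenant service one such bucket would typically be kept per customer, so that a single tenant bursting past its quota cannot consume capacity intended for the others.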
The article recommends four architectural strategies to prevent overload: avoiding cold caches, managing queue depth, constant work, and putting the smaller service in control (appropriately using the control plane and data plane). Constant work means that a system does not scale up or down with load or stress, performing the same amount of work under almost all conditions and therefore producing a predictable load.
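The constant-work pattern can be sketched as a worker that fetches and applies the complete configuration on every tick instead of reacting to change events, so the work done is identical whether zero or a thousand entries changed. The function names `fetch_full_config` and `apply_config` are hypothetical stand-ins for this sketch.

```python
import time

def constant_work_loop(fetch_full_config, apply_config,
                       interval_s: float = 5.0, ticks=None):
    """Apply the full configuration snapshot on every cycle.

    The cost per cycle is constant regardless of how much (or whether)
    the configuration changed, so load on the system is predictable.
    """
    n = 0
    while ticks is None or n < ticks:
        config = fetch_full_config()  # always the full snapshot, never a delta
        apply_config(config)          # must be idempotent: reapplying is harmless
        n += 1
        if ticks is None or n < ticks:
            time.sleep(interval_s)
```

The trade-off is deliberate: the system pays a small, fixed cost all the time in exchange for not having a surge of extra work precisely when it is under stress.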
To avoid making the situation worse for a dependency that is under stress, AWS suggests two patterns for well-behaved clients: circuit breakers, preventing the sustained overload of a dependency, and retries, with the client retrying a request up to N times using exponential backoff with jitter between attempts. Haken warns that every approach has to be handled with care:
At Amazon, we choose where to use circuit breakers carefully, and we don’t treat dependencies uniformly. A dependency can experience failures in a single partition, fault boundary, host, or customer, among other dimensions. This means our circuit breakers are more granular than the “whole dependency” and are typically aligned to the dependency’s expected fault domains.
Source: AWS documentation
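The two client-side patterns can be combined in a sketch: retries with capped exponential backoff plus full jitter, gated by a circuit breaker that is keyed per fault domain (for example, per partition) rather than per whole dependency, in the spirit of the quote above. All names, thresholds, and the key scheme are assumptions for this illustration, not AWS's implementation.

```python
import random
import time

class CircuitBreaker:
    """Circuit breaker tracked per fault-domain key, not per dependency."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = {}   # key -> consecutive failure count
        self.opened_at = {}  # key -> time the circuit opened

    def allow(self, key) -> bool:
        opened = self.opened_at.get(key)
        if opened is None:
            return True
        if self.clock() - opened >= self.reset_after_s:
            # Half-open: let one request probe the fault domain again.
            del self.opened_at[key]
            self.failures[key] = 0
            return True
        return False

    def record(self, key, success: bool) -> None:
        if success:
            self.failures[key] = 0
            self.opened_at.pop(key, None)
        else:
            self.failures[key] = self.failures.get(key, 0) + 1
            if self.failures[key] >= self.failure_threshold:
                self.opened_at[key] = self.clock()

def call_with_retries(op, key, breaker, max_attempts=3, base_delay_s=0.1,
                      max_delay_s=2.0, sleep=time.sleep):
    """Call op(), retrying with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        if not breaker.allow(key):
            raise RuntimeError(f"circuit open for {key}")
        try:
            result = op()
            breaker.record(key, success=True)
            return result
        except Exception:
            breaker.record(key, success=False)
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential.
            sleep(random.uniform(0, min(max_delay_s,
                                        base_delay_s * 2 ** attempt)))
```

Because the breaker is keyed per fault domain, a failing partition stops receiving traffic while healthy partitions of the same dependency continue to be called, matching the granularity described in the quote.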
Werner Vogels, CTO at Amazon, summarizes:
Load management is all around us – from restaurants handling the lunch rush to cloud systems balancing millions of requests. Mike Haken’s latest Builders’ Library article shows how the principles never change: detect early, adapt quickly, degrade gracefully.
Manoj Chaudhary, CTO and SVP of Engineering at Jitterbit, comments:
Fantastic read! It’s all about detecting issues early and taking action. Load spikes, bugs, or unexpected incidents—like dropping a food plate—can create strain on services, whether in SaaS or the restaurant industry. The key is to detect these issues early, adapt, and ensure sufficient capacity to provide a smooth customer experience while preventing chaos in the workforce, regardless of the industry.
“Resilience lessons from the lunch rush” is now part of the Amazon Builders’ Library and includes references to the resources describing the different client and server patterns.