At the recent QCon San Francisco, Netflix Staff Software Engineers Anirudh Mendiratta and Benjamin Fedorka shared insights into the company’s reliability strategy, detailing the evolution of its load-shedding techniques toward service-level prioritized load shedding. The approach is designed to maintain a seamless viewing experience for millions of users, particularly during unpredictable traffic spikes that overwhelm reactive autoscaling and capacity buffers.
According to Mendiratta and Fedorka, the fundamental problem Netflix faces is sudden traffic spikes during major content launches, which often exceed provisioned server capacity. They state that relying on autoscaling alone is insufficient: it reacts too slowly to a sudden, massive spike, and proactively scaling for the theoretical maximum peak is prohibitively expensive.
To address this, Netflix introduced a conceptual model to quantify resilience using two key buffers:
- Success Buffer: The amount of traffic a service can handle above the baseline without latency degradation.
- Failure Buffer: The capacity reserved to gracefully reject excess requests, preventing cascading failure and allowing the service to maintain stability until the spike subsides.
The goal of effective load shedding is to utilize the Failure Buffer to gracefully degrade service, ensuring the system handles some requests rather than collapsing entirely.
The critical breakthrough was the realization that not all requests are equally valuable during an overload event. Previously, Netflix employed “equal opportunity” load shedding, which dropped traffic indiscriminately, regardless of its importance. The new approach drops low-priority requests first, preserving the system’s Success Buffer for high-priority, user-critical requests.
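The speakers did not present code, but the core idea can be sketched in a few lines of Java. In this hypothetical illustration, a fixed concurrency cap stands in for the Success Buffer, the headroom above it acts as the Failure Buffer, and all names and limits are assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: a concurrency cap stands in for the Success
// Buffer; requests beyond it hit the Failure Buffer and are rejected
// cheaply instead of queuing, with low-priority traffic shed first.
public class PrioritizedShedder {

    enum Priority { CRITICAL, NON_CRITICAL } // hypothetical priority buckets

    private final int successCapacity; // in-flight load served without latency degradation
    private final int hardCapacity;    // absolute cap; beyond this even critical requests are shed
    private final AtomicInteger inFlight = new AtomicInteger();

    public PrioritizedShedder(int successCapacity, int hardCapacity) {
        this.successCapacity = successCapacity;
        this.hardCapacity = hardCapacity;
    }

    /** Returns true if the request may proceed; callers must later call release(). */
    public boolean tryAcquire(Priority priority) {
        int current = inFlight.incrementAndGet();
        // Non-critical traffic is shed as soon as the Success Buffer is
        // exhausted; critical traffic may still use the remaining headroom.
        int limit = (priority == Priority.CRITICAL) ? hardCapacity : successCapacity;
        if (current > limit) {
            inFlight.decrementAndGet(); // fast, cheap rejection: the Failure Buffer at work
            return false;
        }
        return true;
    }

    public void release() {
        inFlight.decrementAndGet();
    }
}
```

In the real system the limits would be derived from measured utilization rather than fixed counts, but the ordering is the point: the Failure Buffer absorbs the excess, and low-priority traffic absorbs it first.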
Key Prioritization Scenarios:
| Priority Type | Example Request | Impact |
|---|---|---|
| High | User-initiated playback | Critical for user experience. Preserved under load. |
| Low | Prefetch requests, background tasks | Non-critical. Dropped first to free up capacity. |
| Data Gateways | Writes (over Reads) | Prevents data loss; reads are easily retryable. |
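The data-gateway row in particular maps naturally to a small classification rule. The sketch below is hypothetical (the talk did not specify an API); it only encodes the stated policy that writes outrank reads and background reads go first:

```java
// Hypothetical sketch: a data gateway sheds reads before writes, since a
// dropped read is cheap for the client to retry while a dropped write
// risks data loss. These names are illustrative, not Netflix's API.
final class GatewayPriorities {
    enum Priority { CRITICAL, NON_CRITICAL }
    enum Operation { READ, WRITE }

    static Priority priorityOf(Operation op, boolean userInitiated) {
        if (op == Operation.WRITE) return Priority.CRITICAL; // preserve writes
        return userInitiated ? Priority.CRITICAL             // user-facing reads kept
                             : Priority.NON_CRITICAL;        // prefetch/background reads shed first
    }
}
```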
Crucially, Netflix shifted the load-shedding decision from the centralized API gateway down to the individual service level. This allows critical requests to dynamically repurpose, or “steal,” non-critical capacity within an application instance, maximizing resource utilization under duress. This granular control is also effective for backend-to-backend and batch traffic, which bypasses the API gateway.
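One way to picture this per-instance “stealing” is two permit pools, where critical requests may overflow into the non-critical pool but never the reverse. This is a hypothetical mechanism, not the implementation described in the talk:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-instance capacity stealing: critical
// requests overflow into the non-critical pool when their own pool is
// exhausted; non-critical requests are confined to their own pool.
final class StealingLimiter {
    private final Semaphore criticalPool;
    private final Semaphore nonCriticalPool;

    StealingLimiter(int criticalPermits, int nonCriticalPermits) {
        this.criticalPool = new Semaphore(criticalPermits);
        this.nonCriticalPool = new Semaphore(nonCriticalPermits);
    }

    /** Returns a release handle for the acquired permit, or null if the request must be shed. */
    Runnable tryAcquire(boolean critical) {
        if (critical && criticalPool.tryAcquire()) {
            return criticalPool::release;
        }
        // Critical overflow (the "steal") and all non-critical requests
        // draw on the non-critical pool.
        if (nonCriticalPool.tryAcquire()) {
            return nonCriticalPool::release;
        }
        return null; // shed the request
    }
}
```

A caller would invoke the returned handle in a finally block once the request completes, returning the permit to whichever pool it came from.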
To manage load shedding across hundreds of microservices, Netflix developed an automated platform focusing on three pillars: Priority Assignment, Central Configuration, and Automated Validation.
- Priority Assignment: Request priority is determined early (e.g., via request headers) and is propagated downstream. The system is designed to prevent services from escalating priority but allows them to degrade it.
- Central Configuration: Utilization metrics (CPU, latency, concurrency) are aggregated, and a unique per-cluster load-shedding function is automatically generated, mapping utilization and priority to a rejection probability (see the sketch after this list). For instance, non-critical shedding may start at 60% CPU utilization, while critical shedding begins at 80%.
- Validation: The Chaos Automation Platform (CHAP) and failure injection testing are used to experiment with, validate, and safely roll out configurations. This ensures that every cluster has the requisite Success and Failure buffers before major content releases.
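As a rough illustration of the first two pillars, the sketch below clamps a propagated priority so it can be degraded but never escalated, and maps CPU utilization plus priority to a rejection probability. Only the 60% and 80% thresholds come from the talk’s example; the linear ramp and everything else are assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a generated load-shedding function plus the
// no-escalation rule for propagated priorities. Only the 60%/80%
// thresholds come from the talk; the rest is assumed for illustration.
final class ShedFunction {
    enum Priority { CRITICAL, NON_CRITICAL }

    /** Rejection probability ramps linearly from the priority's start
     *  threshold up to certain rejection at 100% utilization. */
    static double rejectionProbability(double cpuUtilization, Priority p) {
        double start = (p == Priority.CRITICAL) ? 0.80 : 0.60;
        if (cpuUtilization <= start) return 0.0;
        return Math.min(1.0, (cpuUtilization - start) / (1.0 - start));
    }

    static boolean shouldShed(double cpuUtilization, Priority p) {
        return ThreadLocalRandom.current().nextDouble()
                < rejectionProbability(cpuUtilization, p);
    }

    /** A service may degrade the priority propagated from upstream
     *  (e.g., via a request header) but may never escalate it. */
    static Priority clamp(Priority upstream, Priority local) {
        return (upstream == Priority.NON_CRITICAL) ? Priority.NON_CRITICAL : local;
    }
}
```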
To prevent the “thundering herd” problem caused by clients retrying shed requests, Netflix introduced prioritized retry strategies. The system scales back or halts retries while server-side shedding is active, permitting only high-priority requests to retry under heavy load. This prevents client behavior from amplifying the overload while ensuring critical requests have a chance to succeed once the system stabilizes.
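A client-side retry policy along these lines might look like the following sketch, where the shedding signal, attempt limit, and backoff values are all assumptions for illustration:

```java
// Hypothetical sketch of a priority-aware retry policy: while the server
// signals active shedding, only critical requests are retried, and with
// exponential backoff so retries cannot amplify the overload.
final class PriorityRetryPolicy {
    enum Priority { CRITICAL, NON_CRITICAL }

    private static final int MAX_ATTEMPTS = 3;       // assumed limit
    private static final long BASE_BACKOFF_MS = 200; // assumed base delay

    static boolean shouldRetry(Priority p, boolean serverIsShedding, int attempt) {
        if (attempt >= MAX_ATTEMPTS) return false;
        // While the server sheds load, non-critical retries are dropped outright.
        if (serverIsShedding && p != Priority.CRITICAL) return false;
        return true;
    }

    static long backoffMillis(int attempt) {
        return BASE_BACKOFF_MS << attempt; // 200ms, 400ms, 800ms, ...
    }
}
```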
At the end of the talk, Mendiratta and Fedorka shared the following key takeaways:
- Load Shedding is a Safety Buffer: It protects the system from total collapse by ensuring service degradation rather than failure.
- Prioritization is Paramount: By shedding low-priority requests, reliability is maximized for the user’s core experience (e.g., watching a show).
- Automation is Key to Scale: Centralized tooling automates configuration and validation of unique service-level load-shedding functions across a massive microservice fleet.
Lastly, Mendiratta and Fedorka shared a link to resources (including slides).
