Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
The company previously deployed load-shedding strategies at the API Gateway level but decided to enable service owners to implement their prioritization logic at the service level, focusing on the video streaming control plane and data plane. Netflix categorizes API requests into two types based on their criticality. User-initiated requests are considered critical as not handling these would impact user experience. On the other hand, pre-fetch requests made by browsers or apps anticipating the user’s intentions are considered non-critical and can be dropped without much impact on user experience.
The previous solution didn’t distinguish between user-initiated and prefetch requests and would reduce availability for both in case of large traffic spikes. Netflix considered separating different request categories into separate clusters but decided against such an approach due to higher compute costs and additional operational overhead.
Single Cluster with Prioritized Load Shedding (Source: Netflix Technology Blog)
Instead, the company implemented a concurrency limiter in its Play API that prioritizes user-initiated requests over prefetch requests using the open-source Java library. The limiter is configured as a pre-processing Servlet Filter using HTTP headers sent by devices without parsing the request body.
The team working on the solution could observe their efforts in preventing a secondary outage a few months after deploying their changes when an infrastructure outage caused a buildup of prefetch requests from Android devices. The limiter dropped the availability of prefetch requests to as low as 20%, while the availability of user-initialized requests remained high, above 99.4%.
Availability of Prefetch and User-initialized Requests (Source: Netflix Technology Blog)
After the rollout of load shedding for the Play API, the team created a generic internal library to enable service owners to configure their prioritization logic with multiple priority levels (critical, degraded, best-effort, bulk). Services can use the upstream client’s priority or map incoming requests to one of the preconfigured priority levels.
Anirudh Mendiratta, staff software engineer at Netflix and co-author of the blog post, explains how the service-level load-shedding solution works with CPU-based autoscaling:
Most services at Netflix autoscale on CPU utilization, so it is a natural measure of system load to tie into the prioritized load-shedding framework. Once a request is mapped to a priority bucket, services can determine when to shed traffic from a particular bucket based on CPU utilization. In order to maintain the signal to autoscaling that scaling is needed, prioritized shedding only starts shedding load after hitting the target CPU utilization, and as system load increases, more critical traffic is progressively shed in an attempt to maintain user experience.
The team ran a series of experiments to test load-shedding by generating a load profile surpassing the autoscale volume six times. As expected, the limiter shed first non-critical requests, and then critical requests and latency remained acceptable. Additionally, engineers have extended the library to work for IO-bound services by adding latency-based shedding.