Recently, Cloudflare introduced a new technique called “Shard and Conquer” for reducing cold starts in its serverless platform, Cloudflare Workers. It leverages a consistent hash ring to intentionally coalesce traffic for individual Workers onto a single “shard server” within a data center. With this new technique, the company has reduced the cold start rate by a factor of 10, so that 99.99% of requests now reach an already-warm Worker instance.
The technique marks the second significant evolution in cold start mitigation for the company, as its initial technique, pre-warming Workers during the TLS handshake, began to fall short for increasingly complex applications. In response to user demand for larger, more complex applications, Cloudflare had relaxed several platform limits, raising the maximum script size from 1MB to 10MB (for paying users) and the startup CPU time limit from 200ms to 400ms.
While these increases accommodated richer applications, they also lengthened Workers’ cold start durations, which began to frequently exceed the time of a modern TLS 1.3 handshake. The cold start could therefore no longer be hidden entirely from the end user, and a new approach was needed to minimize the frequency of cold starts themselves.
The core motivation for this optimization lies in the serverless value proposition, as one Hacker News commenter observed:

> Because attractiveness of Workers/Lambdas/Functions is whole ‘write simple amount of code and pay pennies to run it.’ Downside is cold starts, twisting yourself into knots you will do at scale to make them work, and vendor lock-in.
To solve the cold start frequency problem, Cloudflare borrowed a key technique from its own CDN HTTP cache: consistent hashing.
Previously, a request arriving on any server could trigger a redundant cold start, even if a warm Worker instance already existed on a nearby machine. This resulted in high cold start rates for low-volume Workers: with traffic scattered thinly across many servers, no single instance received enough requests to stay resident, so instances were frequently evicted.
The new architecture, sketched in code below, works as follows:
- A Worker’s script ID is mapped onto a consistent hash ring shared by all servers in a data center.
- This mapping designates a single, primary “shard server” that is responsible for running a specific Worker instance.
- As a result, all requests for that Worker are routed to the shard server, keeping the Worker instance warm indefinitely and reducing memory usage across the cluster by avoiding redundant instances.
(Source: Cloudflare blog post)
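To make the routing step concrete, the following is a minimal sketch of how such a ring lookup could work. Everything here, including the `ShardRing` class, the FNV-1a hash, and the virtual-node count, is an illustrative assumption rather than Cloudflare’s actual implementation; the key property is that every server computes the same owner for a given script ID without any coordination.

```typescript
// Illustrative consistent-hash-ring sketch; not Cloudflare's internals.

// FNV-1a: a simple, fast 32-bit hash chosen here for brevity.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class ShardRing {
  // Sorted ring positions, each owned by one server in the data center.
  private ring: { position: number; serverId: string }[] = [];

  constructor(serverIds: string[], virtualNodes = 128) {
    // Virtual nodes smooth out load distribution across servers.
    for (const serverId of serverIds) {
      for (let v = 0; v < virtualNodes; v++) {
        this.ring.push({ position: fnv1a(`${serverId}#${v}`), serverId });
      }
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  // Every server runs the same mapping, so all agree on which single
  // "shard server" owns a given Worker script, with no coordination.
  shardServerFor(scriptId: string): string {
    const h = fnv1a(scriptId);
    // Binary search for the first ring position >= h, wrapping around.
    let lo = 0, hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].position < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].serverId;
  }
}

// Usage: any server receiving a request looks up the shard server and
// forwards the request there (or handles it locally if it is the owner).
const ring = new ShardRing(["server-a", "server-b", "server-c"]);
console.log(ring.shardServerFor("worker-script-123"));
```

A side benefit of consistent hashing over a simple modulo mapping is that when servers join or leave the data center, only the ring positions they own move, so most Workers keep the same shard server across topology changes.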
A crucial engineering challenge for this sharding model is load shedding. An individual Worker instance could still be overwhelmed by a sudden traffic spike, requiring the system to scale horizontally and instantiate new Workers on other servers immediately. This must be done without incurring the latency of a pre-flight “may I send the request” check (like Expect: 100-continue).
Cloudflare achieved graceful, low-latency load shedding by integrating its cross-instance communication tool, Cap’n Proto RPC (sketched after this list):
- Optimistic Sending: The shard client (the server that initially received the request) optimistically sends the complete request to the shard server.
- Capability Passing: Critically, the client includes a Cap’n Proto capability (a handle to a lazily-loaded local Worker instance) within the request payload.
- Refusal and Redirect: If the shard server is overloaded, instead of simply returning a “go away” error, it returns the client’s own lazy capability.
- Short-Circuiting the Trombone: The client’s RPC system recognizes that the returned capability is local. It immediately stops proxying request bytes to the server, short-circuiting the request path and serving the Worker locally via a rapid cold start (since it now knows to skip the shard server).
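A rough sketch of that refusal-and-redirect flow is below. The interfaces (`WorkerHandle`, `RemoteShard`, `dispatch`) and the lazy cold start helper are hypothetical stand-ins for illustration; Cloudflare’s actual implementation uses Cap’n Proto RPC, whose capability passing lets the client’s RPC layer detect that a returned handle is local and stop proxying bytes automatically.

```typescript
// Hedged sketch of the optimistic-send / refusal-and-redirect flow.
// All names (WorkerHandle, RemoteShard, coldStartWorker, dispatch) are
// hypothetical stand-ins, not Cloudflare's Cap'n Proto API.

// A "capability": an invokable handle to a Worker instance.
interface WorkerHandle {
  handleRequest(req: string): Promise<string>;
}

// The shard server answers with either a finished response or a redirect
// carrying back the caller's own fallback capability.
type DispatchResult =
  | { kind: "handled"; response: string }
  | { kind: "overloaded"; redirectTo: WorkerHandle };

interface RemoteShard {
  dispatch(scriptId: string, req: string, fallback: WorkerHandle): Promise<DispatchResult>;
}

// Stub cold start for the sketch; the real runtime would compile and
// instantiate the Worker script here.
async function coldStartWorker(scriptId: string): Promise<WorkerHandle> {
  return { handleRequest: async (req) => `response(${scriptId}, ${req})` };
}

// Lazily-instantiated local Worker: the cold start only happens if the
// capability is actually invoked.
function makeLazyLocalHandle(scriptId: string): WorkerHandle {
  let instance: Promise<WorkerHandle> | null = null;
  return {
    async handleRequest(req) {
      instance ??= coldStartWorker(scriptId);
      return (await instance).handleRequest(req);
    },
  };
}

async function routeToShard(shard: RemoteShard, scriptId: string, req: string): Promise<string> {
  // Optimistic send: ship the whole request immediately, bundling a
  // capability to a lazy local instance; no pre-flight round trip.
  const fallback = makeLazyLocalHandle(scriptId);
  const result = await shard.dispatch(scriptId, req, fallback);

  if (result.kind === "handled") return result.response;

  // Refusal and redirect: the overloaded shard handed back our own lazy
  // capability. A capability-aware RPC layer would recognize it as local
  // and stop proxying bytes; here we simply invoke it, triggering the
  // rapid local cold start.
  return result.redirectTo.handleRequest(req);
}

// Usage: an overloaded shard simply refuses with the fallback capability.
const overloadedShard: RemoteShard = {
  async dispatch(_scriptId, _req, fallback) {
    return { kind: "overloaded", redirectTo: fallback };
  },
};
routeToShard(overloadedShard, "worker-123", "GET /").then(console.log);
```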
The mechanism elegantly offloads traffic and achieves horizontal scaling for burst loads without introducing additional round-trip latency. The technique also extends to complex invocation stacks, where Workers invoke other Workers via Service Bindings, by serializing and passing the entire invocation context stack between shard servers.
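For the Service Binding case, one can imagine the forwarded request carrying a serialized stack of invocation frames so that the receiving shard server sees the full calling context; the shape below is purely speculative, sketched to illustrate the idea rather than mirror Cloudflare’s wire format.

```typescript
// Purely speculative frame shape for a serialized invocation context;
// not Cloudflare's actual wire format.
interface InvocationFrame {
  scriptId: string;    // the Worker invoked at this hop
  viaBinding?: string; // the Service Binding name that produced the hop, if any
}

// When Worker A calls Worker B through a Service Binding and B lives on a
// different shard server, the whole stack travels with the request so the
// receiving server can reconstruct the complete calling context.
function forwardWithContext(stack: InvocationFrame[], next: InvocationFrame): InvocationFrame[] {
  return [...stack, next];
}

const context = forwardWithContext(
  [{ scriptId: "frontend-worker" }],
  { scriptId: "auth-worker", viaBinding: "AUTH" }
);
console.log(JSON.stringify(context));
```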