At QCon San Francisco 2025, Jimmy Morzaria, Staff Software Engineer at Stripe, presented the company’s Zero-Downtime Data Movement Platform, a system enabling petabyte-scale database migrations with traffic cutovers that typically complete in milliseconds. The platform underpins Stripe’s database infrastructure, which handles 5 million queries per second across 2,000-plus MongoDB-based shards while maintaining 99.9995% reliability for $1.4 trillion in annual transactions.
The platform’s migration process follows a six-phase blueprint designed around three principles: maintaining data consistency with downtime shorter than node failover events, minimizing performance impact on live queries, and accommodating shards ranging from small datasets to tens of terabytes.
Stripe’s DocDB zero-downtime data movement stages
A data migration starts with a “migration registration” step that updates the routing metadata service to register new target shards and their key ranges. This step establishes the intended destination for data before any movement occurs.
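As a rough illustration of what registering target shards and key ranges might look like, the sketch below models routing metadata as a sorted list of half-open key ranges. The `RoutingTable` class, its field names, and the shard names are hypothetical and not taken from the talk; the point is only that registration records the intended destination without changing where live queries go.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Hypothetical routing metadata: half-open key ranges [start, end) per shard."""
    # Active routes, assumed sorted by start key and non-overlapping.
    ranges: list = field(default_factory=list)   # (start, end, shard)
    # Registered migrations that are not yet serving traffic.
    pending: list = field(default_factory=list)  # (start, end, target_shard)

    def lookup(self, key):
        """Return the shard that currently serves `key`."""
        starts = [r[0] for r in self.ranges]
        i = bisect_right(starts, key) - 1
        if i >= 0 and self.ranges[i][0] <= key < self.ranges[i][1]:
            return self.ranges[i][2]
        raise KeyError(key)

    def register_migration(self, start, end, target_shard):
        """Record the intended destination; live routing is left untouched."""
        self.pending.append((start, end, target_shard))


table = RoutingTable(ranges=[("a", "m", "shard-1"), ("m", "{", "shard-2")])
table.register_migration("a", "g", "shard-3")
```

In this sketch, `table.lookup("c")` still resolves to `shard-1` after registration, since no data has moved yet.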
The bulk data import phase then transfers the primary dataset using an optimized import service. Morzaria explained that the team reordered inserts to align with MongoDB’s B-tree storage engine, sorting items by the most-used indexes in each shard, which improved write performance roughly tenfold over standard imports.
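The idea of reordering inserts can be sketched in a few lines: sort the documents by the fields of the chosen index before batching them, so that consecutive writes touch neighbouring B-tree pages. This is a minimal illustration, not Stripe’s import service; the function name, the field names, and the batch size are assumptions, and each yielded batch would be handed to whatever bulk-insert call the driver provides.

```python
from operator import itemgetter


def sorted_batches(docs, index_fields, batch_size):
    """Yield batches of docs ordered by the given index fields.

    `index_fields` is assumed to be a non-empty list of field names that
    exist in every document and match the index chosen for ordering.
    """
    ordered = sorted(docs, key=itemgetter(*index_fields))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]


docs = [
    {"merchant": "b", "ts": 2},
    {"merchant": "a", "ts": 5},
    {"merchant": "a", "ts": 1},
]
batches = list(sorted_batches(docs, ["merchant", "ts"], batch_size=2))
```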
Next, during async replication, a dedicated replication service maintains bidirectional synchronization between source and target shards. This phase captures ongoing changes to the source data and applies them to the target, while also replicating modifications made on the target back to the source shards. The bidirectional approach enables complete migration rollbacks if issues emerge, providing a critical safety mechanism for financial data.
Architecture overview of the Async Replication step in the zero-downtime migration
Following replication, a validation service performs comprehensive correctness checks comparing data between source and target shards before proceeding to traffic switching. This verification ensures data integrity across the migration boundary.
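The talk summary does not describe how the validation service compares data, so the following is only one plausible approach: hash each document’s canonical JSON form and compare digests by `_id`. The function names and the report structure are hypothetical.

```python
import hashlib
import json


def doc_digest(doc):
    """Digest of a document's canonical JSON form (keys sorted)."""
    payload = json.dumps(doc, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_shards(source_docs, target_docs):
    """Report ids missing from the target, extra on the target, or differing."""
    src = {d["_id"]: doc_digest(d) for d in source_docs}
    dst = {d["_id"]: doc_digest(d) for d in target_docs}
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "extra": sorted(dst.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


source = [{"_id": 1, "a": 1}, {"_id": 2, "a": 2}, {"_id": 3, "a": 3}]
target = [{"_id": 1, "a": 1}, {"_id": 2, "a": 9}, {"_id": 4, "a": 4}]
report = diff_shards(source, target)
```

A migration would proceed to the traffic switch only when all three lists in such a report are empty.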
The actual traffic switch (or cutover) step represents the platform’s most technically sophisticated phase. Based on what Morzaria termed “versioned gating,” the mechanism coordinates version updates across the database proxy service, coordinator, routing service, and replication service.
The traffic switch stage is based on “versioned gating”, allowing minimal downtime
The process begins with the client application querying through the proxy at version one, which routes to the source database. The coordinator then sets version two and verifies replication synchronization. Once confirmed, the proxy fetches new routes and begins querying with version two, directing traffic to the target database while the source shard receives updates to maintain rollback capability. The entire coordination typically completes in milliseconds and takes at most 2 seconds, keeping customer disruption imperceptible.
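The sequence above can be approximated with a small single-threaded model. It is a sketch under stated assumptions rather than Stripe’s implementation: the `Shard`, `Router`, `Proxy`, and `cutover` names are hypothetical, the replication check is reduced to a lag callback, and the retry-on-stale-version behaviour is one way a proxy could pick up new routes.

```python
class StaleVersion(Exception):
    """Raised when a request carries a version the shard no longer accepts."""


class Shard:
    def __init__(self, name, accepted_version):
        self.name = name
        self.accepted_version = accepted_version

    def query(self, version):
        if version != self.accepted_version:
            raise StaleVersion(f"{self.name} rejects version {version}")
        return self.name


class Router:
    """Hypothetical routing service: maps a version to the shard serving it."""

    def __init__(self, source, target):
        self.version = 1
        self.routes = {1: source, 2: target}

    def current(self):
        return self.version, self.routes[self.version]


class Proxy:
    def __init__(self, router):
        self.router = router
        self.version, self.shard = router.current()

    def query(self):
        try:
            return self.shard.query(self.version)
        except StaleVersion:
            # Cached route is gated off: fetch the new route and retry once.
            self.version, self.shard = self.router.current()
            return self.shard.query(self.version)


def cutover(router, source, replication_lag):
    """Gate the source at version 2, confirm replication, then flip routing."""
    source.accepted_version = 2       # source stops serving version-1 requests
    if replication_lag() != 0:
        source.accepted_version = 1   # not caught up: reopen the gate
        return False
    router.version = 2                # proxies now resolve to the target
    return True


source, target = Shard("source", 1), Shard("target", 2)
router = Router(source, target)
proxy = Proxy(router)
before = proxy.query()
switched = cutover(router, source, replication_lag=lambda: 0)
after = proxy.query()
```

The design choice illustrated here is that the gate closes before routing flips, so no request can be served by the source under the old version once the target has become authoritative.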
Migration deregistration concludes the process by cleaning up metadata and decommissioning the migration infrastructure.
Beyond horizontal scaling, Stripe uses the platform for shard merging, MongoDB version upgrades across multiple major releases, and tenancy model transitions. Morzaria noted that substantial foundational investments enable tools to serve a range of scenarios beyond their original designs.
Stripe built its DocDB platform internally rather than using managed services due to requirements around security policy enforcement, predictable performance, and multi-tenancy support with enforced quotas. As individual shards reached tens of terabytes by 2020, the company needed a systematic approach to data movement. Morzaria emphasized that 40% of customers abandon transactions after payment denials, making zero-downtime migrations essential rather than optional. Given the platform’s strategic importance, differentiated requirements, and security needs, he concluded that building rather than buying made sense for Stripe.
