Uber has published details on their approach to controlling rollouts of large-scale changes across monorepos that serve thousands of microservices, addressing one of the key challenges in continuous deployment at massive scale.
The ride-sharing giant’s engineering team faced a critical problem: when a single commit to their monorepos can affect thousands of services simultaneously—such as upgrading an RPC library used across virtually every Go service at Uber—how do you minimize the potential damage from a problematic change?
Uber’s engineering stack relies on a few monorepos, one per main programming language, that collectively host hundreds or thousands of services, all developed trunk-based and released from the main branch. This structure supports a high degree of code reuse and streamlined workflows, but it brings a significant risk: a single commit, say updating a core RPC library, can ripple through and impact vastly more services than anticipated.
By analyzing 500,000 commits in their Go monorepo, the team discovered that 1.4 percent of commits impacted more than 100 services, and 0.3 percent impacted over 1,000 services. While not inherently more dangerous in content, these large-scale changes carry a far greater potential for disruption, especially when automated CD pipelines immediately push changes to production. Uber’s earlier safety architecture focused on pre-land testing and service-level health monitoring during deployment. But as deployment automation expanded, those mechanisms alone couldn’t contain the fallout of sweeping commits.
In response, Uber introduced a cross-cutting service deployment orchestration layer. Instead of each service autonomously deciding when to deploy a change, orchestration adds a global gate: deployment decisions in one service now consider signals (both positive and negative) from other impacted services. Architecture-wise, this is implemented through a lightweight, asynchronous state machine. Periodic jobs track the deployment outcomes across all affected services. The system progresses the rollout based on success or failure thresholds at each stage, preventing uncontrolled propagation of failures.
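The state machine described above can be sketched minimally as follows. This is an illustrative reconstruction, not Uber's implementation: the stage names, counters, and threshold field are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Stage(Enum):
    ROLLING = auto()   # deployments still propagating across services
    BLOCKED = auto()   # halted; the author must fix or revert
    DONE = auto()      # all impacted services deployed successfully

@dataclass
class Rollout:
    total: int             # services impacted by the commit
    max_failures: int      # failure threshold before halting
    succeeded: int = 0
    failed: int = 0
    stage: Stage = Stage.ROLLING

    def tick(self) -> Stage:
        """What a periodic job would run: inspect accumulated
        per-service outcomes and advance or halt the rollout."""
        if self.failed > self.max_failures:
            self.stage = Stage.BLOCKED
        elif self.succeeded >= self.total:
            self.stage = Stage.DONE
        else:
            self.stage = Stage.ROLLING
        return self.stage

r = Rollout(total=1000, max_failures=5)
r.succeeded = 120
print(r.tick())   # Stage.ROLLING: keep deploying
r.failed = 6
print(r.tick())   # Stage.BLOCKED: stop uncontrolled propagation
```

Because the loop is driven by periodic jobs rather than callbacks from each service, the orchestrator stays lightweight and tolerates services reporting outcomes at very different times.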
Central to this orchestration method is service tiering. Services are classified into tiers from 0 (most critical) to 5 (least critical). Rollout proceeds in stages: a subset of less critical services is deployed first. Only when they succeed does the system unblock the next tier. If failures exceed a configured threshold, the rollout halts, and the author is notified to fix or revert the offending commit. This cohort-based rollout ensures that critical services aren’t exposed prematurely to potentially risky changes, and it provides a clear signal about when to proceed or when to abort. Initial parameter choices for cohort thresholds and pacing were intuitive yet overly cautious. The system often lagged; critical services could be blocked behind several deployments, delaying feature delivery.
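A tier gate of this kind might look like the following sketch, again an assumption-laden illustration rather than Uber's code: tiers deploy from least (5) to most (0) critical, and a tier is unblocked only once the previous cohort has fully succeeded within the failure threshold.

```python
def gate(results_by_tier, failure_threshold):
    """Return the tier currently allowed to deploy, or "halted" /
    "complete". results_by_tier maps tier -> (succeeded, failed, total);
    tier N is unblocked only after tier N+1 fully succeeds."""
    for tier in range(5, -1, -1):             # least critical first
        succeeded, failed, total = results_by_tier.get(tier, (0, 0, 0))
        if failed > failure_threshold:
            return "halted"                   # notify author to fix/revert
        if succeeded < total:
            return tier                       # this cohort is still deploying
    return "complete"

results = {
    5: (40, 0, 40),   # least critical cohort: done
    4: (7, 1, 20),    # currently deploying
    3: (0, 0, 60),    # blocked until tier 4 succeeds
}
print(gate(results, failure_threshold=2))   # 4: tier 4 still in flight
results[4] = (17, 3, 20)
print(gate(results, failure_threshold=2))   # halted: failures exceed threshold
```

The key property is that a single failure count, checked cohort by cohort, is enough both to pace the rollout and to abort it, which matches the article's description of one clear proceed-or-abort signal.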
To balance speed with safety, Uber established a 24-hour maximum window to unblock all cohorts. They built a simulator to replay orchestration with historical data and varying configurations. This allowed them to predict rollout duration based on commit timing, deployment windows, and cohort definitions. By adjusting thresholds and cohort groupings, they achieved a much flatter rollout curve and consistent completion within 24 hours, even for changes initiated mid-week. Once deployed, various large-scale changes validated the simulation’s accuracy—the system performed as predicted, reinforcing confidence in the approach.
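In spirit, such a simulator estimates rollout duration from cohort sizes and pacing, then lets engineers tune both until the curve fits the 24-hour window. The sketch below is a deliberately simplified model with hypothetical numbers; Uber's simulator replays real historical data and accounts for commit timing and deployment windows, which this omits.

```python
def simulate_rollout_hours(cohort_sizes, deploys_per_hour, bake_hours=1.0):
    """Estimate hours to unblock every cohort: cohorts deploy
    sequentially at a fixed pace, and each must fully deploy and
    then "bake" (accumulate health signal) before the next starts."""
    hours = 0.0
    for size in cohort_sizes:
        hours += size / deploys_per_hour + bake_hours
    return hours

# Vary pacing to see which configurations stay under the 24-hour target.
for pace in (30, 60, 120):
    est = simulate_rollout_hours([50, 200, 750], deploys_per_hour=pace)
    print(pace, round(est, 1), est <= 24.0)
```

Even this toy model shows the trade-off the article describes: small, cautious cohorts with slow pacing blow past the window, while flattening the curve with larger cohorts or faster pacing brings completion back under 24 hours.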
Today, this orchestration feature supports not only safety-critical large-scale changes but also other scenarios. One emerging use case is rolling out batch configurations across identical services, such as ML-serving endpoints with version rollouts contingent on successful deployment to lower-tier cohorts.
Major technology companies, including Google, Pinterest, and Airbnb, all operate huge monorepos with varying strategies for scaling, building systems, and managing version control of software. Uber’s approach contributes to the growing body of knowledge on managing large-scale deployments. The challenge Uber addresses is particularly relevant as more organizations adopt monorepo architectures for their ability to enable atomic changes across multiple services while maintaining code consistency and reducing integration complexity.