Uber has detailed its recent transition to running Ray-based machine learning workloads on Kubernetes, an evolution of its infrastructure aimed at improving scalability, efficiency, and developer experience. Uber Engineering published a two-part series delving into the motivations, challenges, and solutions encountered during the migration.
Initially, Uber’s machine learning workflows were managed by the Michelangelo Deep Learning Jobs (MADLJ) service, which used Apache Spark for ETL processes and Ray for model training.
Uber’s original machine learning infrastructure faced several challenges that hindered scalability and efficiency. One major issue was the manual nature of resource management—ML engineers had to determine the right compute resources themselves, taking into account GPU availability and current cluster capacity. This manual process often led to suboptimal choices and unnecessary delays. Compounding the problem were static configuration settings for resources and clusters, which were hardcoded into the system. This rigidity caused uneven load distribution and underutilization of resources, limiting the overall efficiency of the platform.
Additionally, the system’s inflexible capacity planning posed an obstacle: the platform either overprovisioned resources, wasting compute, or underprovisioned them, resulting in job failures and delays. These limitations collectively created an environment that was both inefficient and difficult to scale, prompting Uber to seek a more adaptable and automated solution through its migration to Kubernetes and Ray.
To address these issues, Uber migrated its ML workloads to Kubernetes, aiming for a more declarative and flexible infrastructure. This transition involved developing a unified platform where users could specify job types and resource requirements without delving into the complexities of the underlying infrastructure. The system would then automatically allocate the optimal resources based on current cluster loads and job specifications.
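As a rough illustration of this declarative model, a job specification might look like the sketch below. The field names and the client call are hypothetical, not Uber’s actual API; the point is that users describe what the job needs, and the platform decides where and how it runs.

```python
# Hypothetical declarative job spec: the user states the job type and resource
# needs; the platform picks the cluster, pool, and nodes based on current load.
job_spec = {
    "name": "ranking-model-training",
    "type": "ray-training",          # job type understood by the platform
    "resources": {
        "num_workers": 8,            # Ray worker replicas
        "gpus_per_worker": 1,        # set to 0 for CPU-only jobs
        "cpus_per_worker": 8,
        "memory_per_worker": "32Gi",
    },
    "entrypoint": "python train.py --epochs 10",
}

# The platform would translate this into Kubernetes resources (for example a
# KubeRay RayJob) and submit it; the client below is purely illustrative.
# client.submit_job(job_spec)
```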
As Uber migrated its machine learning workloads to Kubernetes, a key focus was improving resource utilization through elastic resource management. To achieve this, the team implemented a set of strategies that enabled more flexible and efficient use of compute resources across the organization. One such strategy was the introduction of hierarchical resource pools, where cluster resources were organized according to team or organizational boundaries. This structure gave teams more granular control over their allocated compute resources and improved visibility into usage patterns.
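A minimal sketch of such a pool hierarchy is shown below; the pool names and quota figures are hypothetical and only illustrate how capacity could be organized along team and organizational boundaries.

```python
# Hypothetical hierarchical resource pools: an organization-level pool with
# per-team child pools, each carrying its own guaranteed quota.
resource_pools = {
    "ml-platform": {                      # organization-level pool
        "quota": {"gpu": 400, "cpu": 20000},
        "children": {
            "recommendations": {"quota": {"gpu": 240, "cpu": 12000}},
            "maps-eta":        {"quota": {"gpu": 160, "cpu": 8000}},
        },
    },
}

def pool_quota(org: str, team: str) -> dict:
    """Look up the guaranteed quota for a team's pool within its organization."""
    return resource_pools[org]["children"][team]["quota"]

print(pool_quota("ml-platform", "recommendations"))  # {'gpu': 240, 'cpu': 12000}
```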
Another enhancement was elastic sharing across these resource pools. If one pool had idle resources, they could be temporarily borrowed by another, boosting overall utilization without permanently reallocating capacity. These borrowed resources were preemptible, meaning they could be reclaimed by the original pool when needed. To ensure fairness and avoid resource contention, resource entitlement was enforced using max-min fairness principles. This meant each pool retained a guaranteed share of resources while still being able to access additional capacity dynamically based on current demand. These mechanisms collectively allowed Uber to scale more efficiently and respond to the fluctuating demands of ML workloads.
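The max-min fairness principle can be illustrated with a simple water-filling computation: pools that demand less than their fair share are fully satisfied, and the capacity they leave unused is redistributed among the pools that still want more. The following is a simplified Python sketch of that idea, not Uber’s scheduler code.

```python
def max_min_fair_allocation(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Water-filling max-min fairness: satisfy small demands first, then split
    the remaining capacity evenly among pools that still want more."""
    allocation = {}
    remaining = dict(demands)
    cap_left = capacity
    while remaining:
        fair_share = cap_left / len(remaining)
        # Pools whose demand fits under the current fair share are satisfied in full.
        satisfied = {p: d for p, d in remaining.items() if d <= fair_share}
        if not satisfied:
            # Every remaining pool wants more than the fair share: split evenly.
            for p in remaining:
                allocation[p] = fair_share
            return allocation
        for p, d in satisfied.items():
            allocation[p] = d
            cap_left -= d
            del remaining[p]
    return allocation

# Example: 100 GPUs shared by three pools with uneven demand.
print(max_min_fair_allocation(100, {"ads": 20, "maps": 50, "eats": 60}))
# -> {'ads': 20, 'maps': 40.0, 'eats': 40.0}
```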

Additionally, Uber implemented strategies to optimize the use of heterogeneous hardware. Clusters were configured with both GPU-enabled and CPU-only nodes. Tasks not requiring GPUs, such as data loading and preprocessing, were assigned to CPU nodes, reserving GPU nodes for training tasks.
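In Ray, this split can be expressed directly through per-task resource requests: tasks that only ask for CPUs land on CPU-only nodes, while tasks that request a GPU can only be placed on GPU nodes. The sketch below uses placeholder task bodies to show the pattern; it is not Uber’s training code.

```python
import ray

ray.init()  # connect to the Ray cluster (address comes from the environment in a real deployment)

# Data loading and preprocessing request only CPUs, so Ray schedules them on
# CPU-only nodes and leaves GPU nodes free for training.
@ray.remote(num_cpus=2)
def preprocess(shard_path: str) -> list:
    # ... read and transform one shard of training data ...
    return [shard_path]

# The training task requests a GPU, so it can only be placed on a GPU node.
@ray.remote(num_gpus=1)
def train(batches: list) -> str:
    # ... run training on the GPU assigned by Ray ...
    return "model-checkpoint"

batches = ray.get([preprocess.remote(p) for p in ["shard-0", "shard-1"]])
result = ray.get(train.remote(batches))
```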
Uber also developed a GPU filter plugin to ensure that only GPU workloads were scheduled on GPU nodes. In addition, the Kubernetes scheduler was enhanced to distribute non-GPU pods using a load-aware strategy and to place GPU workloads using a bin-packing strategy that minimizes resource fragmentation.
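The actual filter and scoring logic would live in Kubernetes scheduler plugins (typically written in Go); the Python sketch below only illustrates the two strategies side by side: keeping GPU nodes for GPU pods, spreading non-GPU pods toward the least-loaded nodes, and packing GPU pods onto already-busy GPU nodes to limit fragmentation.

```python
def gpu_filter(pod_needs_gpu: bool, node_has_gpu: bool) -> bool:
    """Filter step: reserve GPU nodes exclusively for GPU workloads."""
    if node_has_gpu and not pod_needs_gpu:
        return False  # do not waste a GPU node on a CPU-only pod
    if pod_needs_gpu and not node_has_gpu:
        return False  # a GPU pod cannot run without a GPU
    return True

def score(node_utilization: float, pod_needs_gpu: bool) -> float:
    """Score step (0-100, higher wins), contrasting the two strategies."""
    if pod_needs_gpu:
        # Bin packing: prefer already-busy GPU nodes to reduce fragmentation.
        return 100 * node_utilization
    # Load-aware spreading: prefer the least-loaded nodes for non-GPU pods.
    return 100 * (1 - node_utilization)
```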
Through these changes, Uber has achieved a more efficient and flexible infrastructure for its machine learning workloads, enabling better resource utilization and scalability.