Uber’s Journey to Ray on Kubernetes

News Room | Published 8 May 2025, last updated 12:38 PM

Uber has detailed its recent transition to running Ray-based machine learning workloads on Kubernetes, an evolution of its infrastructure aimed at improving scalability, efficiency, and developer experience. Uber Engineering recently published a two-part series delving into the motivations, challenges, and solutions encountered during the migration.

Initially, Uber’s machine learning workflows were managed by the Michelangelo Deep Learning Jobs (MADLJ) service, which used Apache Spark for ETL and Ray for model training.
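To give a sense of the training side of that stack, the sketch below shows a minimal Ray program that fans a training function out across workers. It is an illustrative example only, not Uber's MADLJ code; the model and data are stand-ins.

```python
import ray

ray.init()  # connects to an existing Ray cluster, or starts a local one

@ray.remote
def train_shard(shard_id: int, epochs: int) -> float:
    """Stand-in for a real training step; returns a dummy loss for the shard."""
    loss = 1.0
    for _ in range(epochs):
        loss *= 0.9  # placeholder for an actual optimization step
    return loss

# Fan out one task per data shard and gather the results.
losses = ray.get([train_shard.remote(i, epochs=5) for i in range(8)])
print(sum(losses) / len(losses))
```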

Uber’s original machine learning infrastructure faced several challenges that hindered scalability and efficiency. One major issue was the manual nature of resource management—ML engineers had to determine the right compute resources themselves, taking into account GPU availability and current cluster capacity. This manual process often led to suboptimal choices and unnecessary delays. Compounding the problem were static configuration settings for resources and clusters, which were hardcoded into the system. This rigidity caused uneven load distribution and underutilization of resources, limiting the overall efficiency of the platform.

Additionally, the system’s inability to flexibly plan for capacity posed an obstacle. The platform either overprovisioned resources—wasting compute—or underprovisioned, resulting in job failures or delays. These limitations collectively created an environment that was both inefficient and difficult to scale, prompting Uber to seek a more adaptable and automated solution through its migration to Kubernetes and Ray.

To address these issues, Uber migrated its ML workloads to Kubernetes, aiming for a more declarative and flexible infrastructure. This transition involved developing a unified platform where users could specify job types and resource requirements without delving into the complexities of the underlying infrastructure. The system would then automatically allocate the optimal resources based on current cluster loads and job specifications.
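What such a declarative request might look like is sketched below. The field names are hypothetical, not Uber's actual API; the point is that the user states the job type and resource needs, and the platform decides where and how to run it.

```python
# Hypothetical job specification -- field names are illustrative, not Uber's API.
job_spec = {
    "job_type": "ray_training",
    "entrypoint": "python train.py --config prod.yaml",
    "resources": {
        "workers": 8,
        "cpus_per_worker": 4,
        "gpus_per_worker": 1,
    },
    "team": "ml-platform",  # used to map the job onto the team's resource pool
}
# The platform would translate a spec like this into Kubernetes pod specs and
# choose a cluster and node pool based on current load, rather than the user.
```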

Elastic resource sharing in Kubernetes.

As Uber migrated its machine learning workloads to Kubernetes, a key focus was improving resource utilization through elastic resource management. To achieve this, the team implemented a set of strategies that enabled more flexible and efficient use of compute resources across the organization. One such strategy was the introduction of hierarchical resource pools, where cluster resources were organized according to team or organizational boundaries. This structure gave teams more granular control over their allocated compute resources and improved visibility into usage patterns.

Another enhancement was elastic sharing across these resource pools. If one pool had idle resources, they could be temporarily borrowed by another, boosting overall utilization without permanently reallocating capacity. These borrowed resources were preemptible, meaning they could be reclaimed by the original pool when needed. To ensure fairness and avoid resource contention, resource entitlement was enforced using max-min fairness principles. This meant each pool retained a guaranteed share of resources while still being able to access additional capacity dynamically based on current demand. These mechanisms collectively allowed Uber to scale more efficiently and respond to the fluctuating demands of ML workloads.
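A small sketch of max-min fairness, under the simplifying assumption of a single resource dimension and known per-pool demands, is shown below; Uber's actual scheduler is more involved, but the allocation principle is the same.

```python
def max_min_fair(capacity: float, demands: dict) -> dict:
    """Split `capacity` across pools: no pool gets more than it asked for,
    and leftover capacity is shared as evenly as possible (max-min fairness)."""
    allocation = {pool: 0.0 for pool in demands}
    remaining = dict(demands)
    while capacity > 1e-9 and remaining:
        share = capacity / len(remaining)        # equal share of what is left
        satisfied = []
        for pool, need in remaining.items():
            grant = min(need, share)
            allocation[pool] += grant
            capacity -= grant
            remaining[pool] = need - grant
            if remaining[pool] <= 1e-9:
                satisfied.append(pool)
        if not satisfied:                        # every pool hit the equal share
            break
        for pool in satisfied:
            del remaining[pool]
    return allocation

# 100 GPUs, three pools: the small request is met in full, the rest split evenly.
shares = max_min_fair(100, {"ads": 20, "maps": 50, "eats": 80})
print({pool: round(gpus, 1) for pool, gpus in shares.items()})
# -> {'ads': 20.0, 'maps': 40.0, 'eats': 40.0}
```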

Filter plugin for GPU pods.

Additionally, Uber implemented strategies to optimize the use of heterogeneous hardware. Clusters were configured with both GPU-enabled and CPU-only nodes. Tasks not requiring GPUs, such as data loading and preprocessing, were assigned to CPU nodes, reserving GPU nodes for training tasks.
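In Ray, this split can be expressed directly on the tasks themselves; the sketch below is a generic illustration of that pattern, not Uber's pipeline code.

```python
import ray

ray.init()

@ray.remote(num_cpus=2)   # data work can land on CPU-only nodes
def load_and_preprocess(path: str) -> list:
    return [path]         # stand-in for real I/O and preprocessing

@ray.remote(num_gpus=1)   # training only schedules where a GPU is free
def train(batch: list) -> float:
    return 0.0            # stand-in for a real GPU training step

batch = load_and_preprocess.remote("s3://bucket/data")
loss = ray.get(train.remote(batch))
```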

Uber also developed a GPU filter plugin to ensure that only GPU workloads were scheduled on GPU nodes. The Kubernetes scheduler was further enhanced to distribute non-GPU pods using a load-aware strategy and to place GPU workloads using a bin-packing strategy that minimizes resource fragmentation.
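Conceptually, the combination of the filter plugin and the two scoring policies can be sketched as follows. This is an illustrative Python sketch, not the actual plugin (Kubernetes scheduler plugins are typically written in Go against the scheduler framework), and the node and pod fields are hypothetical.

```python
def gpu_filter(pod: dict, node: dict) -> bool:
    """Filter step: GPU nodes only admit pods that actually request a GPU."""
    wants_gpu = pod["requests"].get("gpu", 0) > 0
    has_gpu = node["capacity"].get("gpu", 0) > 0
    return wants_gpu if has_gpu else not wants_gpu

def score(pod: dict, node: dict) -> float:
    """Score step: bin-pack GPU pods, spread everything else."""
    if pod["requests"].get("gpu", 0) > 0:
        # Bin-packing: prefer GPU nodes that are already heavily used,
        # keeping whole nodes free and limiting GPU fragmentation.
        return node["used"]["gpu"] / node["capacity"]["gpu"]
    # Load-aware spreading: prefer the least CPU-loaded node.
    return 1.0 - node["used"]["cpu"] / node["capacity"]["cpu"]

def place(pod: dict, nodes: list) -> dict | None:
    feasible = [n for n in nodes if gpu_filter(pod, n)]
    return max(feasible, key=lambda n: score(pod, n)) if feasible else None
```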

Through these changes, Uber has achieved a more efficient and flexible infrastructure for its machine learning workloads, enabling better resource utilization and scalability.
