Nvidia Corp. has acquired SchedMD LLC, a low-profile company that maintains one of the most important open-source tools in the machine learning ecosystem.
The chipmaker announced the deal today. The financial terms were not disclosed.
SchedMD was founded in 2010 by the developers of Slurm, an open-source platform for managing server clusters. The company provides professional services that help organizations use the software in production. Nvidia disclosed today that SchedMD has several hundred customers including government agencies, banks and healthcare organizations.
Training a large language model on a single graphics card can be prohibitively time-consuming. As a result, companies spread their training workloads across a large number of GPUs, which makes it possible to perform calculations in parallel rather than one after another. That saves time, but creates a significant amount of complexity.
When a training workload runs across multiple GPUs, developers must decide which chip should perform which sub-task. Assigning a sub-task to a chip that is already busy can cause unnecessary training delays. There are other challenges as well, such as avoiding situations where some GPUs are left underutilized.
Slurm automates the task of determining which GPU should perform what task and when. Kubernetes, another popular open-source cluster management platform, provides a similar capability. But Slurm includes a number of specialized features that make it better suited to power artificial intelligence training workloads.
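To make the scheduling problem concrete, here is a minimal sketch of a greedy policy that hands each sub-task to whichever GPU becomes free first. It is an illustration only and not Slurm's actual algorithm; the function name `assign_tasks` and its inputs are assumptions made up for this example.

```python
import heapq


def assign_tasks(task_durations, num_gpus):
    """Greedily assign each task to the GPU that becomes free earliest.

    Illustrative only: this is NOT how Slurm schedules work.
    Returns (assignments, makespan), where assignments[i] is the
    GPU index chosen for task i and makespan is the finish time
    of the last GPU to complete its work.
    """
    # Min-heap of (time_when_gpu_is_free, gpu_index).
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)

    assignments = []
    for duration in task_durations:
        # The least-loaded GPU is at the top of the heap.
        free_at, gpu = heapq.heappop(heap)
        assignments.append(gpu)
        heapq.heappush(heap, (free_at + duration, gpu))

    makespan = max(free_at for free_at, _ in heap)
    return assignments, makespan


if __name__ == "__main__":
    assignments, makespan = assign_tasks([4, 2, 3, 1], num_gpus=2)
    print(assignments, makespan)
```

With four tasks and two GPUs, no chip receives new work while another sits idle, which is the kind of imbalance described above. A production scheduler also has to account for priorities, queues, memory and job dependencies, none of which are modeled here.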
One of Slurm’s differentiators is that it’s highly scalable: the platform can manage clusters with more than 100,000 GPUs. It also provides fine-grained customization options. If two workloads exchange data with one another on a regular basis, developers can have Slurm place them on adjacent servers to minimize the distance that data must travel. Kubernetes supports similar customization, but only if developers extend it with plug-ins.
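The placement idea can be sketched in a few lines. The example below picks, from a pool of free servers, the two that are closest together for a pair of workloads that exchange data. The `(rack, slot)` server labels and the `hop_distance` cost model are hypothetical simplifications chosen for illustration; they are not taken from Slurm or Kubernetes.

```python
from itertools import combinations


def hop_distance(a, b):
    """Hypothetical network cost between two servers.

    Each server is a (rack, slot) tuple. Servers in the same rack
    are one hop apart; crossing racks costs more, growing with the
    distance between rack numbers.
    """
    rack_a, _ = a
    rack_b, _ = b
    if rack_a == rack_b:
        return 1
    return 2 + abs(rack_a - rack_b)


def place_pair(free_servers):
    """Return the two free servers with the smallest hop distance."""
    if len(free_servers) < 2:
        raise ValueError("need at least two free servers")
    return min(
        combinations(free_servers, 2),
        key=lambda pair: hop_distance(*pair),
    )


if __name__ == "__main__":
    free = [(0, 1), (2, 0), (2, 3), (5, 1)]
    print(place_pair(free))
```

In this toy pool, the two servers in rack 2 are selected because they share a rack. Real clusters have far more complex network topologies, so this should be read as a sketch of the principle rather than a usable placement policy.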
SchedMD helps organizations set up Slurm and customize it for their requirements. Once a deployment is live, the company provides ongoing support services that assist customers with tasks such as installing updates.
The company maintains not only Slurm but also another open-source project called Slinky, which enables companies to run Slurm on Kubernetes. That removes the need to run the two open-source platforms on separate clusters, which simplifies day-to-day management. Additionally, consolidating servers into a single cluster can improve hardware utilization and thereby lower costs.
Nvidia said Slurm will remain an open-source project following the deal. The chipmaker will continue developing the project and provide professional services to SchedMD’s customers.
It also announced plans to “accelerate SchedMD’s access to new systems — allowing users of Nvidia’s accelerated computing platform to optimize workloads across their entire compute infrastructure.” That suggests the chipmaker may be planning to optimize Slurm for its upcoming Rubin graphics card series and Vera central processing units.
Slurm is used not only in AI training clusters but also in supercomputers. The software powers more than half of the world’s 100 fastest supercomputers, many of which also include Nvidia silicon. The talent that the chipmaker is gaining through the SchedMD acquisition may help it enhance its value proposition for supercomputer builders.
Photo: Nvidia
