As enterprises race to build high-performance computing clusters to serve expanding artificial intelligence demands, AI networking has evolved from a backend utility into the central nervous system of the modern data center.
This shift comes with significant engineering hurdles, as the heavy traffic inherent to AI workloads threatens to saturate traditional links. Meeting that demand requires a rethink of fabric architecture, moving toward lossless, high-velocity environments capable of handling massive data flows without adding latency, according to Saurabh Kapoor, director of product management, AI networking, at Dell Technologies Inc.
“When we look at AI workloads, networking characteristics are so different,” Kapoor said. “You’re looking at elephant flows, bursty traffic links that can get saturated in microseconds, low entropy use cases. It’s all about how you address those characteristics.”
Kapoor spoke with John Furrier and Jackie McGuire at SC25 for an exclusive interview on theCUBE, News Media’s livestreaming studio. They discussed the integration of open-source standards into high-performance computing and the expansion of the Dell AI Factory with Nvidia Corp. (* Disclosure below.)
Scaling AI networking with open source
To address the complexity of these new workloads, Dell is championing an open-source approach. There are parallels between the evolution of server operating systems and the current state of network infrastructure, Kapoor noted. Just as Linux became the standard for compute, Software for Open Networking in the Cloud, or SONiC, is poised to standardize the fabric, he explained.
“One common operating system running across multiple compute hardware ecosystems — the same thing is happening with networking,” Kapoor said. “Now with SONiC becoming the Linux of networking, it could have one common operating system that runs across multiple silicon and hardware architectures that gives scale agility.”
This open-source strategy offers a significant cost advantage. By decoupling hardware from software and relying on SONiC, organizations can potentially reduce the total cost of ownership by up to 50%, freeing capital that can be reinvested in expensive graphics processing units, according to Kapoor. This week, the company announced that its enterprise version of SONiC is now compliant with Nvidia’s Spectrum-X Ethernet platform. This integration allows enterprises to deploy the same high-performance architecture used by hyperscalers, but with the support and validation that corporate environments require.
“We’re helping drive that innovation,” Kapoor said. “Making sure that these infrastructures are highly-optimized for AI infrastructures, workload spanning training and inferencing, fine-tuning — you name it.”
Managing these complex environments demands intelligent observability to prevent “GPU starvation,” where expensive processors sit idle waiting for data. Dell is introducing predictive analytics to solve this, Kapoor told theCUBE. To further simplify deployment, Dell has introduced SmartFabric Manager, which uses validated blueprints to help customers stand up AI factories quickly.
“Because you’re running at peak capacity, you need highly-optimized infrastructures and next-gen fabric automation observability capabilities so that you’re looking at predictive capabilities,” Kapoor said. “You’re running that correlation across: What is my performance on compute? What is my correlation across networking bandwidth utilization? [Then you are] making predictive actions so that you’re better managing it.”
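The kind of correlation Kapoor describes can be illustrated with a small sketch. This is not Dell's implementation or any SmartFabric Manager API; the `Sample` type, the function names, the thresholds and the telemetry values are all hypothetical, chosen only to show how compute utilization and link utilization might be compared to spot likely GPU starvation.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Sample:
    """One telemetry interval (hypothetical schema, values from 0.0 to 1.0)."""
    gpu_util: float   # fraction of the interval the GPU was busy
    link_util: float  # fraction of link bandwidth in use


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def starved_intervals(samples, link_threshold=0.9, gpu_threshold=0.5):
    """Indices where the link is near saturation while the GPU is mostly idle.

    Both thresholds are illustrative assumptions, not vendor guidance.
    """
    return [
        i for i, s in enumerate(samples)
        if s.link_util >= link_threshold and s.gpu_util <= gpu_threshold
    ]


# Made-up telemetry: GPU utilization drops as the link saturates.
samples = [
    Sample(gpu_util=0.95, link_util=0.40),
    Sample(gpu_util=0.92, link_util=0.55),
    Sample(gpu_util=0.45, link_util=0.96),
    Sample(gpu_util=0.30, link_util=0.99),
    Sample(gpu_util=0.90, link_util=0.60),
]

correlation = pearson(
    [s.gpu_util for s in samples],
    [s.link_util for s in samples],
)
flagged = starved_intervals(samples)
print(f"GPU/link correlation: {correlation:.2f}")
print(f"Intervals that look starved: {flagged}")
```

A strongly negative correlation, together with intervals flagged as saturated-link and idle-GPU, would suggest the fabric rather than the compute is the bottleneck. A production system would presumably work from far richer telemetry and predictive models than this toy comparison.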
(* Disclosure: Dell Technologies and Nvidia Corp. sponsored this segment of theCUBE. Neither Dell and Nvidia nor other sponsors have editorial control over content on theCUBE or News.)
