Artificial intelligence and high-performance computing are blending into a single ecosystem where integrated, co-designed infrastructure and smarter memory handling are essential. As systems scale, the focus is shifting toward architectures that reduce latency and make more efficient use of accelerated compute.
Graphics processing unit utilization has emerged as a central focus in large-scale AI deployments, tightly linked to how efficiently systems can offload and manage memory. This points toward software improvements as a critical way to push performance even further, according to David Noy (pictured), vice president of product management at Dell Technologies Inc.
“When you’re working in inferencing and you’re keeping track of the memory of a conversation … let’s say [with] a conversational chatbot, there’s all this context that’s been built up over time,” Noy told theCUBE. “You don’t want to have to rebuild that context every time. There is some amount of that that can be kept in memory. You don’t want to reuse the GPU cycles just to recalculate context.”
Noy spoke to theCUBE’s John Furrier and Jackie McGuire at SC25, during an exclusive broadcast on theCUBE, News Media’s livestreaming studio. They discussed how accelerated computing is reinventing system architecture and how the growing demand for smarter memory handling is influencing the future of AI workloads. (* Disclosure below.)
Smarter memory handling seen as key
Keeping part of the accumulated context in memory prevents the GPU from wasting cycles on already-processed information, Noy explained. If that memory can be extended onto storage with direct access, the system can maintain far more context without slowing down the GPU.
“We’ve announced integration with vLLM and LMCache using the NIXL transport protocol,” Noy said. “This is a protocol that allows the GPU to speak directly to the storage to save all the previous context of your conversation so you can have longer memory and not have to constantly do recalculation. That accelerates your time to first token by 19x.”
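What Noy describes maps to a familiar caching pattern: keep recent conversation context (the model's key-value cache) in a fast memory tier and spill older context to storage instead of recomputing it on the GPU. The sketch below is a minimal, hypothetical illustration of that idea in Python; the class name, file layout and eviction policy are invented for clarity and are not the actual vLLM, LMCache or NIXL interfaces referenced in the interview.

```python
import os
import pickle
import tempfile

# Hypothetical two-tier context cache: hot entries stay in host memory,
# older entries spill to a storage path, so previously computed context
# never has to be rebuilt by the GPU.
class ContextCache:
    def __init__(self, spill_dir: str, max_hot_entries: int = 2):
        self.hot = {}                      # conversation_id -> accumulated context
        self.spill_dir = spill_dir         # stand-in for a direct-access storage tier
        self.max_hot_entries = max_hot_entries

    def _spill_path(self, conversation_id: str) -> str:
        return os.path.join(self.spill_dir, f"{conversation_id}.ctx")

    def put(self, conversation_id: str, context) -> None:
        self.hot[conversation_id] = context
        # Once "memory" is full, evict the oldest hot entry to storage.
        while len(self.hot) > self.max_hot_entries:
            victim_id, victim_ctx = next(iter(self.hot.items()))
            with open(self._spill_path(victim_id), "wb") as f:
                pickle.dump(victim_ctx, f)
            del self.hot[victim_id]

    def get(self, conversation_id: str):
        if conversation_id in self.hot:
            return self.hot[conversation_id]   # memory hit: nothing to recompute
        path = self._spill_path(conversation_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)          # storage hit: reload instead of recompute
        return None                            # miss: only here would context be rebuilt


def answer(cache: ContextCache, conversation_id: str, new_message: str) -> str:
    context = cache.get(conversation_id)
    if context is None:
        context = []                           # cache miss: context rebuilt from scratch
    context.append(new_message)
    cache.put(conversation_id, context)
    return f"({len(context)} messages of context reused)"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spill_dir:
        cache = ContextCache(spill_dir, max_hot_entries=2)
        for turn in ["hello", "tell me more", "summarize that"]:
            print(answer(cache, "chat-42", turn))
```

In the production integration Noy describes, the storage tier is reached by the GPU directly over the NIXL transport rather than through the host filesystem used in this toy example, which is what keeps GPU cycles from being spent recalculating context.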
As AI factories scale toward exabyte-class deployments, power and space constraints are becoming just as important as raw performance. That’s driving Dell’s emphasis on co-designed systems that squeeze more useful work out of every rack unit and watt consumed, according to Noy.
“We’re hyper-focused on collaboration across our teams to make sure we’re building the most energy-efficient and space-efficient solutions for AI infrastructure,” he said. “If we can deliver double the performance per watt or per rack versus a competitor, we’ve effectively taken a 5% power budget and turned it into 10%. That’s a big deal.”
Here’s the complete video interview, part of News’s and theCUBE’s coverage of SC25:
(* Disclosure: Dell and Nvidia Corp. sponsored this segment of theCUBE. Neither Dell, Nvidia nor any other sponsor has editorial control over content on theCUBE or News.)
Photo: News
Support our mission to keep content open and free by engaging with theCUBE community. Join theCUBE’s Alumni Trust Network, where technology leaders connect, share intelligence and create opportunities.
- 15M+ viewers of theCUBE videos, powering conversations across AI, cloud, cybersecurity and more
- 11.4k+ theCUBE alumni: connect with more than 11,400 tech and business leaders shaping the future through a unique trust-based network.
About News Media
Founded by tech visionaries John Furrier and Dave Vellante, News Media has built a dynamic ecosystem of industry-leading digital media brands that reach 15+ million elite tech professionals. Our new proprietary theCUBE AI Video Cloud is breaking ground in audience interaction, leveraging theCUBEai.com neural network to help technology companies make data-driven decisions and stay at the forefront of industry conversations.
