Nvidia Corp. today announced advances in artificial intelligence software and networking innovations aimed at accelerating AI infrastructure and model deployment.
The technology giant, which makes the graphics processing units that power much of the AI economy, unveiled Spectrum-XGS, or “giga-scale,” for its Spectrum-X Ethernet switching platform designed for AI workloads. Spectrum-X connects entire clusters within the data center, allowing massive datasets to stream across AI models. Spectrum-XGS extends this by providing orchestration and interconnection between data centers.
“So, you’ve heard us use terms like scale up and scale out. Now we’re introducing this new term, ‘scale across,’” said Dave Salvator, director of accelerated computing products at Nvidia. “These switches are basically purpose built to enable multi-site scale with different data centers able to communicate with each other and essentially act as one gigantic GPU.”
In terms of how this helps data centers, "scale up" means bigger machines and "scale out" means more machines in the same data center. However, many data centers are limited by the power they can draw or the heat they can dissipate before efficiency drops. This caps the number of machines, and therefore the amount of compute, that can feasibly be packed into a single location.
Salvator said the system minimizes jitter (the variability in packet arrival times) and latency (the delay between sending data and receiving a response). Both are critical in AI networking because they determine how much effective bandwidth can be achieved between GPUs spread across sites.
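To make the jitter metric concrete, here is a minimal sketch (not Nvidia code) that measures jitter as the spread of inter-packet arrival gaps, which is the quantity a fabric like Spectrum-XGS tries to keep small:

```python
import statistics

def jitter_ms(arrival_times_ms):
    """Jitter as the standard deviation of inter-packet arrival gaps.

    arrival_times_ms: packet arrival timestamps in milliseconds.
    """
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return statistics.stdev(gaps)

# Steady arrivals every 10 ms have zero jitter; bursty arrivals do not.
steady = [0, 10, 20, 30, 40]
bursty = [0, 5, 25, 30, 50]
print(jitter_ms(steady))  # 0.0
print(jitter_ms(bursty))  # positive: gaps alternate between 5 ms and 20 ms
```

High jitter forces receivers to buffer and re-synchronize, which is why it erodes usable bandwidth between GPUs even when the raw link speed is high.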
Comparatively, NVLink Fusion, a network fabric technology Nvidia unveiled in May, allows cloud providers to scale up their data centers to handle millions of GPUs at a time. Together, NVLink Fusion and Spectrum-XGS represent two layers of scaling AI infrastructure: one inside the data center, and one across multiple data centers.
Researching better methods to serve AI models
Dynamo is Nvidia’s inference serving framework: the software layer through which deployed models receive requests and generate responses.
Using this platform, Nvidia has been researching how to deploy models with a technique called disaggregated serving. This splits “prefill,” or context building, and “decode,” or token generation, across different GPUs or servers.
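The prefill/decode split can be sketched in a few lines. This is an illustrative toy, not Dynamo's API: the function names, the dictionary standing in for a KV cache, and the placeholder tokens are all assumptions made for the example.

```python
# Toy sketch of disaggregated serving: prefill (context processing) and
# decode (token generation) run as separate stages, as they would on
# separate GPU pools in a real deployment.

def prefill(prompt_tokens):
    """Process the full prompt once; return a stand-in for the KV cache.

    In practice this stage is compute-bound and suits one GPU pool.
    """
    return {"cache": list(prompt_tokens)}

def decode(kv_cache, max_new_tokens):
    """Generate tokens one at a time from the cache handed off by prefill.

    In practice this stage is memory-bandwidth-bound and suits another pool.
    """
    out = []
    for i in range(max_new_tokens):
        token = f"tok{i}"            # stand-in for sampling from the model
        kv_cache["cache"].append(token)
        out.append(token)
    return out

cache = prefill(["the", "quick", "brown"])   # stage 1: prefill workers
print(decode(cache, 3))                      # stage 2: decode workers
```

The point of the split is that the two stages have different hardware profiles, so scheduling them on different GPUs lets each pool be sized and utilized independently.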
This is important because inference, once considered secondary to model training, is becoming a serious challenge in the agentic AI era, where reasoning models generate far more tokens than older models did. Dynamo is Nvidia’s answer: a faster, more cost-efficient way of handling that load.
“If you look at both interactivity on a model like GPT OSS, OpenAI’s most recent community model they just released, we’re able to achieve, about a 4X increase in tokens per second,” said Salvator. “You look at DeepSeek, we’re also able to achieve really significant bumps there in terms of a 2.5X increase.”
Nvidia is also researching “speculative decoding,” which uses a second, smaller model to guess what the main model will output for a given prompt in an attempt to speed it up. “The way that this works is you have what’s called a draft model, which is a smaller model which attempts to sort of essentially generate potential next tokens,” said Salvator.
Because the smaller model is faster but less accurate, it can generate multiple guesses for the main model to verify.
“The ability here is that the more that that draft model can speculatively correctly guess what those next tokens need to be, the more performance you can pick up,” explained Salvator. “And we’ve already seen about a 35% performance gain using these techniques.”
According to Salvator, the main AI model does verification in parallel against its learned probability distribution. Only accepted tokens are committed, so rejected tokens are discarded. This keeps latency under 200 milliseconds, which he described as “snappy and interactive.”
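The draft-then-verify loop Salvator describes can be sketched as follows. This is a simplified greedy variant under stated assumptions (real systems verify the draft's proposals against the target model's probability distribution in one parallel forward pass); both "models" here are deterministic stand-ins, and every name is illustrative.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

def target_next(context):
    """Stand-in for the large, authoritative model's next-token choice."""
    return VOCAB[len(context) % 3]

def draft_next(context):
    """Stand-in for the small draft model: cheap, usually right, sometimes not."""
    return target_next(context) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then have the target model verify them.

    Accepted tokens are committed; everything after the first mismatch is
    discarded, and the target supplies the corrected token itself.
    """
    proposals, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposals.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposals:
        if tok == target_next(ctx):        # verification: does the target agree?
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                          # first mismatch: drop the rest
    if len(accepted) < k:
        accepted.append(target_next(ctx))  # target's own corrected token
    return accepted

print(speculative_step(["a"]))  # several tokens per expensive verification pass
```

The speedup comes from amortization: when the draft guesses well, one expensive target-model pass commits several tokens instead of one, which is the effect behind the roughly 35% gain Salvator cites.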