Nvidia Corp. today previewed an upcoming chip, the Rubin CPX, that will power artificial intelligence appliances with 8 exaflops of performance.
AI inference involves two main steps. First, the model ingests and analyzes the user’s prompt and any data it will draw on for its answer, a step known as the context, or prefill, phase. Once that analysis is complete, the model generates its response one token at a time in the generation, or decode, phase. Today, the two tasks are usually performed on the same hardware.
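As a rough illustration, here’s what the two phases look like with a Hugging Face Transformers-style decoder model (gpt2 below is only a stand-in, not anything Nvidia ships): the whole prompt is processed in one compute-heavy prefill pass, then tokens are generated one at a time from the cached keys and values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tok("Summarize this document:", return_tensors="pt").input_ids

# Phase 1 (context/prefill): one forward pass over the full prompt.
# This is the compute-bound step the Rubin CPX targets.
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past = out.past_key_values
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Phase 2 (generation/decode): one token per step, reusing the cache.
# This step is bound by memory bandwidth rather than raw compute.
generated = [next_id]
with torch.no_grad():
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```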
Nvidia plans to take a different approach with its future AI systems. Instead of performing both steps of the inference workflow using the same graphics card, it plans to assign each step to a different chip. The company calls this approach disaggregated inference.
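A toy sketch of that split, using entirely hypothetical class names rather than Nvidia’s actual software interfaces: one worker handles the compute-heavy context phase and hands its key/value cache to a second worker that streams out tokens.

```python
# Hypothetical names throughout -- an illustration of the concept only.
class ContextWorker:                      # stands in for a Rubin CPX-class chip
    def prefill(self, prompt_tokens):
        # Compute-bound pass over the entire prompt; yields the KV cache.
        return {"kv": list(prompt_tokens)}

class DecodeWorker:                       # stands in for a Rubin GPU
    def decode(self, kv_cache, max_new_tokens=4):
        # Bandwidth-bound loop: one token per step, reusing the cache.
        return [len(kv_cache["kv"]) + i for i in range(max_new_tokens)]

def run_inference(prompt_tokens):
    kv = ContextWorker().prefill(prompt_tokens)   # step 1 on one chip
    return DecodeWorker().decode(kv)              # step 2 on another

print(run_inference([101, 2023, 2003, 102]))
```

The appeal of the split is that each pool of chips can be provisioned for the work it actually does, rather than one GPU design having to be good at both.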
Nvidia’s upcoming Rubin CPX chip is optimized for the initial, so-called context phase of the two-step inference workflow. The company will use it to power a rack-scale system called the Vera Rubin NVL144 CPX (pictured). Each appliance will combine 144 Rubin CPX chips with 144 Rubin GPUs, upcoming processors geared toward the second, generation phase of the workflow. The accelerators will be supported by 36 central processing units.
The company says the upcoming system will provide 8 exaflops of computing capacity. One exaflop corresponds to a quintillion computing operations per second. That’s more than seven times the performance of the top-end GB300 NVL72 appliances currently sold by Nvidia.
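A quick back-of-envelope check of that comparison (the 7.5x ratio below is an assumption consistent with “more than seven times,” not a figure from the announcement):

```python
EXA = 10**18                 # one exaflop = a quintillion ops per second
rubin_cpx_rack = 8 * EXA     # Nvidia's stated figure for the new system
ratio = 7.5                  # assumed; the claim is "more than seven times"
print(f"Implied GB300 NVL72: {rubin_cpx_rack / ratio / EXA:.2f} exaflops")
```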
Under the hood, the Rubin CPX is based on a monolithic die design with 128 gigabytes of integrated GDDR7 memory. Nvidia also included components optimized to run the attention mechanism of large language models.
An LLM’s attention mechanism enables it to identify and prioritize the most important parts of the text snippet it’s processing. According to Nvidia, the Rubin CPX can perform the task three times faster than its current-generation silicon. “We’ve tripled down on the attention processing,” said Ian Buck, Nvidia’s vice president of hyperscale and high-performance computing.
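For reference, the core of that mechanism is scaled dot-product attention; a minimal PyTorch version is below (how the Rubin CPX implements it in silicon is not public).

```python
import math
import torch

def attention(q, k, v):
    # Score every query against every key, scaled for numerical stability.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Softmax turns the scores into weights that emphasize relevant tokens.
    weights = torch.softmax(scores, dim=-1)
    # Each output is a weighted blend of the value vectors.
    return weights @ v

# Shapes: (batch, heads, sequence length, head dimension).
q = k = v = torch.randn(1, 8, 1024, 64)
out = attention(q, k, v)   # the score matrix grows with the square of sequence length
```

That quadratic growth in the score matrix is why attention dominates the cost of long prompts, and why silicon tuned for it pays off in the context phase.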
Buck said video processing workloads will receive a speed boost as well. The Rubin CPX includes hardware-level support for video encoding and decoding: encoding compresses a clip to save bandwidth before it’s transmitted over the network, while decoding restores the original file on arrival.
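On today’s Nvidia GPUs that kind of codec offload is typically reached through ffmpeg’s NVDEC/NVENC integration; a hedged example of the round trip, with placeholder file names (this illustrates GPU-accelerated encode/decode generally, not Rubin CPX software):

```python
import subprocess

# Decode on the GPU (NVDEC via -hwaccel cuda), then re-encode on the
# GPU (NVENC via h264_nvenc). Requires an ffmpeg build with Nvidia support.
subprocess.run([
    "ffmpeg",
    "-hwaccel", "cuda",
    "-i", "input.mp4",        # placeholder input clip
    "-c:v", "h264_nvenc",
    "output.mp4",             # placeholder re-encoded output
], check=True)
```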
According to Nvidia, the Rubin CPX will enable AI models to process prompts with one million tokens’ worth of data. That corresponds to tens of thousands of lines of code or one hour of video. In many cases, increasing the amount of data an AI model can consider while generating a prompt response boosts its output quality.
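The rough arithmetic behind those equivalences, using common rules of thumb rather than Nvidia figures:

```python
CONTEXT_TOKENS = 1_000_000

tokens_per_line = 30                # assumed: code plus surrounding context
print(CONTEXT_TOKENS // tokens_per_line)                 # ~33,000 lines of code

video_tokens_per_second = 280       # assumed video-tokenizer rate
print(CONTEXT_TOKENS // video_tokens_per_second // 60)   # ~59 minutes of video
```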
Nvidia plans to start shipping the Rubin CPX at the end of 2026.
Image: Nvidia