Apple Collaborates With NVIDIA To Research Faster LLM Performance - 9to5Mac

Last updated: 2024/12/18 at 4:57 PM

News Room Published 18 December 2024

In a blog post today, Apple engineers have shared new details on a collaboration with NVIDIA to implement faster text generation performance with large language models.

Apple published and open sourced its Recurrent Drafter (ReDrafter) technique earlier this year. It represents a new method for generating text with LLMs that is significantly faster and “achieves state of the art performance.” It combines two techniques: beam search (to explore multiple possibilities) and dynamic tree attention (to efficiently handle choices).
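
To make the beam search half of that description concrete, here is a minimal toy sketch in Python. It is not Apple's ReDrafter code: the `toy_model` function, its vocabulary, and its probabilities are hypothetical stand-ins for a real draft model, chosen only to show how keeping several candidates alive can find a sequence that a one-best-token-at-a-time strategy would miss.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_token_logprobs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    steps: int,
) -> List[Tuple[Sequence, float]]:
    """Keep the `beam_width` highest-scoring sequences at every step."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Extend each surviving sequence with every possible next token.
            for token, logprob in next_token_logprobs(seq).items():
                candidates.append((seq + (token,), score + logprob))
        # Rank by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical next-token distribution (illustration only)."""
    if not prefix:
        return {"the": math.log(0.6), "a": math.log(0.4)}
    if prefix[-1] == "the":
        return {"cat": math.log(0.55), "dog": math.log(0.45)}
    return {"cat": math.log(0.9), "dog": math.log(0.1)}


beams = beam_search(toy_model, beam_width=2, steps=2)
for seq, score in beams:
    print(" ".join(seq), round(math.exp(score), 2))
```

In this toy setup the most likely first token is "the", yet the best two-token sequence starts with "a" (probability 0.36 versus 0.33). Note also that the surviving candidates share a last token; as we understand it, handling that kind of overlap between candidates efficiently is the job of the dynamic tree attention step, which this sketch does not attempt to model.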

While its research demonstrated strong results, Apple collaborated with NVIDIA to apply ReDrafter in production. As part of this collaboration, ReDrafter was integrated into NVIDIA TensorRT-LLM, a tool that helps run LLMs faster on NVIDIA GPUs.

Here are the results, as described in Apple’s blog post:

To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM’s capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.

In benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen 2.7x speed-up in generated tokens per second for greedy decoding. These benchmark results indicate this tech could significantly reduce latency users may experience, while also using fewer GPUs and consuming less power.
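
For a rough sense of what a 2.7x gain in tokens per second means for a single response, the short Python sketch below converts throughput into generation time. Only the 2.7x factor comes from the benchmark; the baseline rate and response length are made-up numbers for illustration, not Apple's or NVIDIA's measurements.

```python
SPEEDUP = 2.7  # tokens-per-second speed-up reported for greedy decoding


def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `num_tokens` at a steady decoding rate."""
    return num_tokens / tokens_per_second


# Hypothetical workload, chosen only to make the arithmetic easy to follow.
baseline_tps = 100.0
response_tokens = 500

baseline_time = generation_time_seconds(response_tokens, baseline_tps)
redrafter_time = generation_time_seconds(response_tokens, baseline_tps * SPEEDUP)
time_saved_fraction = 1 - 1 / SPEEDUP  # independent of the assumed baseline

print(f"baseline:  {baseline_time:.2f} s")
print(f"with 2.7x: {redrafter_time:.2f} s")
print(f"time saved: {time_saved_fraction:.0%}")
```

Under these assumed numbers, a 500-token response drops from 5 seconds to about 1.85 seconds, roughly 63% less time spent generating. The saved fraction depends only on the 2.7x factor, though real-world latency also includes work this sketch ignores, such as prompt processing and network overhead.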

“LLMs are increasingly being used to power production applications, and improving inference efficiency can both impact computational costs and reduce latency for users,” Apple’s machine learning researchers conclude. “With ReDrafter’s novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.”
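
For readers unfamiliar with speculative decoding, the Python sketch below shows the generic draft-and-verify idea in its simplest greedy form. It is not ReDrafter itself (there is no beam search or tree attention here) and it does not use any TensorRT-LLM API. The `target` and `draft` functions are hypothetical toy models: a small, cheap model guesses several tokens ahead, and the large model checks those guesses, keeping the ones it agrees with.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy_decode(
    target: Model, draft: Model, prompt: List[int],
    max_new_tokens: int, draft_len: int,
) -> Tuple[List[int], int]:
    """Draft-and-verify decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes `draft_len` tokens.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks the proposal. A real system would score all
        #    positions in one batched forward pass; here we call it per position.
        rounds += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(tokens + accepted)
            if token != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(token)
        else:
            accepted.append(target(tokens + accepted))  # all correct: bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], rounds


def target(prefix: List[int]) -> int:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft(prefix: List[int]) -> int:
    """Hypothetical 'small' model: agrees with target except after token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


reference = greedy_decode(target, [0], max_new_tokens=8)
output, rounds = speculative_greedy_decode(target, draft, [0], 8, draft_len=3)
print(output == reference, rounds)
```

In this toy run the speculative version produces exactly the same tokens as plain greedy decoding, but needs only three verification rounds instead of eight sequential steps from the large model. The saving in a real system depends on how often the draft model's guesses are accepted.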

You can learn more about this work on Apple’s website and in a blog post on NVIDIA’s website.
