Chinese artificial intelligence developer DeepSeek today open-sourced DeepSeek-V3, a new large language model with 671 billion parameters.
The LLM can generate text, craft software code and perform related tasks. DeepSeek says it outperforms two of the most advanced open-source LLMs on the market across more than a half-dozen benchmark tests.
DeepSeek-V3 is based on a so-called mixture of experts, or MoE, architecture. It comprises multiple neural networks that are each optimized for a different set of tasks. When DeepSeek-V3 receives a prompt, a component known as a router sends the request to the neural network best-equipped to answer it.
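The routing step can be illustrated with a toy sketch. It is not DeepSeek's implementation: the four "experts" are plain functions standing in for neural networks, and the names `route`, `moe_forward`, `router_scores` and `top_k` are hypothetical. The router scores each expert, keeps the best `top_k`, and only those experts run.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k=1):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_scores, top_k=1):
    """Run only the selected experts and blend their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route(router_scores, top_k))

# Hypothetical experts: simple functions standing in for neural networks.
experts = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0, lambda x: x * x]
# Hypothetical router scores for one input; expert 1 scores highest.
router_scores = [0.1, 2.0, -1.0, 0.5]

selected = route(router_scores, top_k=1)
output = moe_forward(3.0, experts, router_scores, top_k=1)
print(selected, output)
```

With `top_k=1` only one expert runs per input, which matches the description above. Setting `top_k` higher would blend several experts at the cost of more computation.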
The MoE architecture’s main benefit is that it reduces hardware costs. Sending a prompt to DeepSeek-V3 doesn’t activate the entire LLM, but only the portion of the model to which the request is routed. DeepSeek says about 37 billion of the model’s 671 billion parameters are activated for each token, which means it requires a relatively limited amount of infrastructure to run.
Alongside its benefits, the MoE architecture also introduces certain challenges. During the training process, some of a MoE model’s neural networks receive more training data than the others, which can create inconsistencies in the LLM’s output quality. DeepSeek says it has developed a new method of mitigating this challenge and implemented it in DeepSeek-V3.
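The imbalance problem can be shown with a small example. The article does not describe DeepSeek's method, so the sketch below is only one generic way to even out expert load and rests on assumptions: each expert gets a hypothetical `bias` that is nudged down when the expert receives more than its share of tokens and up when it receives less. The function names and the token scores are invented for illustration.

```python
def assign(scores_per_token, bias):
    """Count how many tokens each expert receives when every token
    goes to the expert with the highest biased score."""
    counts = [0] * len(bias)
    for scores in scores_per_token:
        best = max(range(len(bias)), key=lambda i: scores[i] + bias[i])
        counts[best] += 1
    return counts

def rebalance(scores_per_token, num_experts, step=0.1, rounds=10):
    """Nudge per-expert biases until the load is closer to even."""
    bias = [0.0] * num_experts
    target = len(scores_per_token) / num_experts
    for _ in range(rounds):
        counts = assign(scores_per_token, bias)
        for i, count in enumerate(counts):
            if count > target:
                bias[i] -= step   # overloaded expert becomes less attractive
            elif count < target:
                bias[i] += step   # underused expert becomes more attractive
    return bias

# Hypothetical router scores for 8 tokens and 2 experts; expert 0 always wins at first.
token_scores = [[1.0, 0.25 + 0.1 * t] for t in range(8)]

before = assign(token_scores, [0.0, 0.0])
bias = rebalance(token_scores, num_experts=2)
after = assign(token_scores, bias)
print(before, after)
```

In this toy case all eight tokens initially land on the first expert, and the bias adjustments split them evenly after a few rounds.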
The LLM was trained on 14.8 trillion tokens’ worth of information. One token corresponds to a few letters or numbers. The training process took 2.788 million graphics processing unit hours, a relatively small amount of compute for a model of this size. The industry’s most advanced AI clusters have tens of thousands of GPUs or more and can complete such a training project in a few days.
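A back-of-the-envelope calculation shows what that figure means in wall-clock time. The sketch below assumes perfect linear scaling across GPUs, which real clusters do not achieve, and the cluster sizes are hypothetical examples rather than figures from DeepSeek.

```python
GPU_HOURS = 2_788_000  # training cost reported for DeepSeek-V3

def wall_clock_days(gpu_hours, num_gpus):
    """Days of wall-clock time if the work splits evenly across num_gpus."""
    return gpu_hours / num_gpus / 24

# Hypothetical cluster sizes, chosen only for illustration.
for num_gpus in (2_000, 10_000, 50_000):
    days = wall_clock_days(GPU_HOURS, num_gpus)
    print(f"{num_gpus:>6} GPUs: about {days:.1f} days")
```

Under these assumptions, a 2,000-GPU cluster would need about two months, while a 50,000-GPU cluster would finish in a little over two days.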
Alongside its MoE architecture, DeepSeek-V3 is equipped with several optimizations designed to boost its output quality.
LLMs use a technique called attention to identify the most important details in a sentence. DeepSeek-V3 implements multihead latent attention, an improved version of the technique that allows it to extract key details from a text snippet several times rather than only once. This makes the LLM less likely to overlook important information.
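For readers unfamiliar with attention, the basic idea can be sketched in a few lines. This is standard scaled dot-product attention for a single query, not DeepSeek's multihead latent attention, and the vectors are made-up toy values. Each key is scored against the query, the scores become weights that sum to one, and the output is the weighted average of the values.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity between the query and each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy example: the first key points in the same direction as the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(query, keys, values)
print(weights, output)
```

The first key receives the largest weight because it matches the query most closely, so its value dominates the output. A multihead design runs several such computations in parallel, each with its own learned projections.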
DeepSeek-V3 also offers so-called multitoken prediction. Language models usually generate text one token at a time. DeepSeek-V3, in contrast, generates several at once, which speeds up inference.
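The potential speedup is simple arithmetic. The sketch below counts the generation steps needed for a response under the idealized assumption that every token predicted in a step is kept. The function name and the token counts are hypothetical, and real-world gains would be smaller.

```python
import math

def decoding_steps(num_tokens, tokens_per_step=1):
    """Number of generation steps needed if each step yields tokens_per_step tokens."""
    return math.ceil(num_tokens / tokens_per_step)

# Hypothetical 100-token response.
for tokens_per_step in (1, 2, 3):
    steps = decoding_steps(100, tokens_per_step)
    print(f"{tokens_per_step} token(s) per step: {steps} steps")
```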
DeepSeek put its algorithm to the test by comparing it with three other open-source LLMs: the previous-generation DeepSeek-V2, Llama 3.1 405B and Qwen2.5 72B. DeepSeek-V3 achieved higher scores across all nine of the coding and math benchmarks that were used in the evaluation. It also proved better at a range of text processing tasks.
The code for DeepSeek-V3 is available on Hugging Face.
Image: Unsplash