A Chinese startup called DeepSeek has just released DeepSeek V3, a gigantic LLM available under an “open” MIT license that lets developers download it from GitHub and modify it for a wide range of uses, including some commercial ones.
Promising performance. According to DeepSeek’s own testing, DeepSeek V3 outperforms both open source models and proprietary AI models that can only be used through an API. In benchmarks such as Codeforces, a competitive programming test, the Chinese model managed to beat Llama 3.1 405B, GPT-4o and Qwen 2.5 72B, although all of them have fewer parameters, and that can influence performance and comparisons. Only Claude 3.5 Sonnet seems to hold its own: it matched or outperformed the Chinese model in several tests.
Efficient and cheap, but voracious training. According to its creators, DeepSeek V3 “only” needed 2.788 million GPU hours of training on 2,048 NVIDIA H800s, the cut-down versions of the H100 designed for export to China. They put the training cost at just 5.5 million dollars, while OpenAI is estimated to have invested close to 80 million dollars to train GPT-4. The model was trained on a dataset of 14.8 trillion tokens, an equally enormous figure: one million tokens are roughly equivalent to 750,000 words. Andrej Karpathy, co-founder of OpenAI (who left the company months ago), was surprised by this efficiency and reduced training cost.
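For the curious, here is a quick back-of-the-envelope sketch in Python that reproduces those figures; the $2 per GPU-hour rental rate is an assumption chosen for illustration, not an official number:

```python
# Back-of-the-envelope check of the reported training figures.
# The $2 per GPU-hour rental rate is an assumption used only to
# reproduce the ~$5.5M estimate; the GPU-hour and token counts are
# the figures reported for DeepSeek V3.

GPU_HOURS = 2_788_000               # ~2.788 million H800 GPU hours
ASSUMED_USD_PER_GPU_HOUR = 2.0      # hypothetical rental price per GPU hour
TOKENS = 14_800_000_000_000         # 14.8 trillion training tokens
WORDS_PER_MILLION_TOKENS = 750_000  # rough tokens-to-words conversion

cost = GPU_HOURS * ASSUMED_USD_PER_GPU_HOUR
words = TOKENS / 1_000_000 * WORDS_PER_MILLION_TOKENS

print(f"Estimated training cost: ${cost / 1e6:.1f} million")  # ~$5.6 million
print(f"Approximate words in the dataset: {words:.2e}")       # ~1.1e13 words
```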
66% larger than Llama 3.1 405B. Until now, Meta had one of the largest AI models on the market, with 405 billion parameters (405B). The DeepSeek model reaches 671B, almost 66% more. The question, of course, is whether so many parameters are of any use.
The more parameters, (usually) the better. The number of parameters usually correlates strongly with a model’s capabilities. The AI models that run locally on our PCs or phones tend to be much smaller (3B, 7B or 14B are typical sizes), while those that run in data centers can be far larger and more capable in precision, features and power, as is the case with DeepSeek V3. But of course, the larger they are, the more computing resources they need to run with any fluidity.
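As a rough illustration of why size translates into hardware requirements, this minimal Python sketch estimates how much memory the weights alone would occupy at different precisions; the bytes-per-parameter values are standard rules of thumb, and real deployments need extra memory for activations and the KV cache:

```python
# Rough rule of thumb: weight memory ≈ parameter count × bytes per parameter.
# Real deployments also need memory for activations and the KV cache,
# so treat these numbers as illustrative lower bounds.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # billions of params × bytes/param ≈ GB

models = [("7B local model", 7), ("Llama 3.1 405B", 405), ("DeepSeek V3", 671)]
for name, size_b in models:
    fp16 = weight_memory_gb(size_b, 2.0)   # 16-bit weights
    int4 = weight_memory_gb(size_b, 0.5)   # aggressive 4-bit quantization
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at 4-bit")
```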
Two innovations to improve it. DeepSeek V3 uses a Mixture-of-Experts architecture that activates only a subset of its parameters for each token, processing different tasks efficiently. Its creators have introduced two striking improvements in this new model. The first is a load balancing strategy that monitors and adjusts the load on the “experts.” The second is a multi-token prediction system. Together they allow token generation to triple that of DeepSeek V2: it now reaches 60 tokens per second on the same hardware as its predecessor.
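To picture how a Mixture-of-Experts layer activates only part of the network per token, here is a deliberately tiny PyTorch sketch; it is not DeepSeek’s actual implementation, and the dimensions, the top-k routing and the bias-based balancing term are all assumptions made for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy Mixture-of-Experts layer: each token is routed to its top-k experts,
# so most parameters stay inactive for any given token. The per-expert bias
# is a simplified nod to load balancing; all sizes are made up for the example.

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(dim, n_experts, bias=False)
        # Bias per expert that a training loop could nudge up or down to keep
        # the experts evenly loaded (kept fixed in this sketch).
        self.balance_bias = nn.Parameter(torch.zeros(n_experts), requires_grad=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.router(x) + self.balance_bias       # routing score per expert
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([16, 64])
```

In this toy, each token only ever touches 2 of the 8 expert networks, which is the same principle that lets a 671B-parameter model like DeepSeek V3 keep most of its weights idle on any given token.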
China picks up the pace. This new “open” model is the latest demonstration of the progress China is making despite the obstacles of its trade war with the United States. DeepSeek already surprised us a little over a month ago with its DeepSeek-R1 model, capable of competing with OpenAI’s o1 in the field of AI “reasoning.” Other Chinese startups and large technology companies continue to work frenetically, and the results are visible and promising, with an open source approach that makes them especially interesting for researchers and academics.
Image | WorldOfSoftware with Freepik Picasso
In WorldOfSoftware | China was lagging behind in AI, but it continues to launch increasingly advanced models. And very socialist