A few weeks ago, DeepSeek’s announcement of its highly capable R1 model, which combines strong performance with low resource costs, thrilled the tech community and rattled the US stock market. R1 is part of a growing trend: AI models trained with a technique called distillation. In essence, distillation trains a smaller, faster model by letting it learn from a bigger, smarter one, so the smaller model retains most of that intelligence while running far more efficiently. Distillation itself, however, is not the focus here.
OpenAI and similar companies are trying to protect their intellectual property by limiting how their models can be used to train competitors. They may take countermeasures such as banning accounts and IP addresses, tightening rate limits on model requests, and contractually prohibiting the use of their models’ outputs to build rival systems.
Can a powerful model be built on a budget?
A recent experiment by researchers from Stanford and the University of Washington demonstrated that it is indeed possible.
TLDR: The researchers built a new model, s1, on top of Alibaba’s Qwen2.5, distilled reasoning traces from Gemini 2.0 Flash Thinking (free within rate limits), and fine-tuned on 16 NVIDIA H100 GPUs for 26 minutes at a cost of under $50. According to the paper, the result answers competition math questions up to 27% better than OpenAI’s o1-preview.
The s1 model demonstrates how AI systems can be trained efficiently through strategic data curation, supervised fine-tuning (SFT), and budget forcing. Rather than depending on costly, large-scale datasets, the researchers built s1K, a compact, high-quality dataset of 1,000 carefully curated reasoning questions paired with reasoning traces and answers distilled from the Gemini Thinking Experimental model, capturing complex problem-solving patterns without manual annotation. By fine-tuning Alibaba’s Qwen2.5-32B-Instruct on this dataset, they produced a highly capable model at a fraction of the usual cost.
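To make the dataset’s shape concrete, here is a minimal sketch of what one s1K-style record might look like once converted into a chat-format SFT sample. The field names, the example question, and the `<think>`/`</think>` delimiters are illustrative assumptions, not the paper’s exact schema.

```python
# Illustrative s1K-style record: a question paired with a teacher-generated
# reasoning trace and a final answer. Field names are assumptions for this sketch.
import json

record = {
    "question": "How many positive integers below 100 are divisible by both 6 and 8?",
    "thinking": (
        "A number divisible by both 6 and 8 must be divisible by lcm(6, 8) = 24. "
        "Multiples of 24 below 100: 24, 48, 72, 96, so there are four."
    ),
    "answer": "4",
}

def to_chat_sample(rec: dict) -> dict:
    """Turn one curated record into a chat-style SFT sample: the reasoning
    trace and the answer become the assistant's target output."""
    return {
        "messages": [
            {"role": "user", "content": rec["question"]},
            {
                "role": "assistant",
                "content": f"<think>{rec['thinking']}</think>\n{rec['answer']}",
            },
        ]
    }

# Write the sample in JSONL form, ready for fine-tuning.
with open("s1k_style.jsonl", "w") as f:
    f.write(json.dumps(to_chat_sample(record)) + "\n")
```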
The core of their training method was supervised fine-tuning: the model learned directly from Gemini’s written reasoning traces rather than from a teacher’s output distributions, as in classic knowledge distillation. The 26-minute fine-tuning run on 16 NVIDIA H100 GPUs cost less than $50, showing that fine-tuning a strong open-weight model on well-curated data can deliver significant performance gains.
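For orientation, here is a heavily simplified sketch of what such a supervised fine-tuning run could look like with the Hugging Face `transformers` library. It is not the authors’ training script: it ignores the multi-GPU sharding a 32B model actually needs, uses a batch size of one, and picks illustrative hyperparameters; the `s1k_style.jsonl` file refers to the toy sample above.

```python
# Minimal SFT sketch (not the authors' code): fine-tune an open-weight chat model
# on a small set of curated reasoning traces using plain next-token prediction.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B-Instruct"   # the base model named in the paper
DATA = "s1k_style.jsonl"              # chat-format samples from the sketch above

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.train()

def encode(sample: dict) -> dict:
    # Render the chat messages with the model's template and tokenize.
    # Setting labels = input_ids gives a standard language-modeling loss.
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()
    return enc

samples = [json.loads(line) for line in open(DATA)]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(5):                        # a handful of passes over ~1K samples
    for sample in samples:
        batch = {k: v.to(model.device) for k, v in encode(sample).items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```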
To optimize inference efficiency, the researchers implemented budget forcing, a technique that controls how long the model spends reasoning. If the reasoning ran past a set token budget, an end-of-thinking token was forced in, signalling the model to stop and deliver an answer. Conversely, appending the word “Wait” when the model tried to stop early prompted it to extend its reasoning, which often produced more accurate answers. This simple yet powerful adjustment boosted the model’s accuracy on the American Invitational Mathematics Examination (AIME) 2024 from 50% to 57%.
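The mechanism is easiest to see in code. Below is a hedged sketch of budget forcing at inference time: it assumes the model brackets its reasoning between `<think>` and `</think>` markers and that appending “Wait” pushes it into another round of reasoning. The real s1 implementation works on token ids and handles caching and stopping criteria more carefully.

```python
# Budget-forcing sketch: cap the reasoning length, optionally extend it with "Wait",
# then force the end-of-thinking delimiter so the model must commit to an answer.
# Delimiters and prompt format are assumptions for this sketch.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B-Instruct"   # stand-in; s1 is a fine-tune of this model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def budget_forced_answer(question: str, thinking_budget: int = 2048, num_waits: int = 1) -> str:
    reasoning = f"Question: {question}\n<think>"
    for i in range(num_waits + 1):
        ids = tok(reasoning, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=thinking_budget, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        # Keep only the reasoning produced so far; drop anything after the delimiter.
        reasoning = text.split("</think>")[0]
        if i < num_waits:
            # The model tried to stop (or hit the budget): append "Wait" to keep it thinking.
            reasoning += " Wait"
    # Budget spent: force the end-of-thinking delimiter so the model answers.
    prompt = reasoning + "\n</think>\nFinal answer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

print(budget_forced_answer("How many trailing zeros does 25! have?"))
```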
The s1-32B model surpassed OpenAI’s o1-preview by up to 27% on competition math benchmarks (MATH and AIME24), demonstrating that small, well-trained models can compete with those built using vast computational resources. This research challenges the notion that state-of-the-art AI requires billion-dollar training pipelines. Instead, it points to a future where strategic dataset design, fine-tuning, and inference optimization can democratize AI model training.
If someone wanted to own the hardware and run this process independently, one H100 GPU costs around $30,000, putting the total for 16 GPUs at $480,000. That’s roughly a $500,000 upfront investment if you buy the machines (the sub-$50 figure above assumes renting that compute for 26 minutes), versus the billions spent by major AI players, for comparable results on these benchmarks.
New LLM-based products are just around the corner
If AI can be trained this efficiently, what’s stopping individuals or small teams from building their own models? With the right expertise and a few hundred dollars, crafting a custom AI could soon be as accessible as, say, getting a dog. 🐶
Open-weight models like Mistral, Qwen, and Llama are closing the gap with proprietary ones like GPT, chipping away at Big Tech’s dominance. Distillation lets teams train high-quality models through API access instead of building from scratch, at a fraction of the cost. As a bonus, it reduces dependency on a single provider.
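As a rough illustration of what “distillation through API access” looks like in practice, here is a small sketch that queries a stronger teacher model for reasoning traces and stores them as fine-tuning data. The client is the OpenAI Python SDK used as a generic stand-in; the model name and prompts are placeholders, and any provider’s terms of service (see the restrictions mentioned earlier) still apply.

```python
# Sketch: collect teacher outputs over an API and save them as distillation data.
# "teacher-model-name" is a placeholder, not a real model identifier.
import json
from openai import OpenAI

client = OpenAI()  # expects an API key; any OpenAI-compatible endpoint works similarly

questions = [
    "Prove that the sum of two even integers is even.",
    "How many trailing zeros does 100! have?",
]

with open("distilled_traces.jsonl", "w") as f:
    for q in questions:
        resp = client.chat.completions.create(
            model="teacher-model-name",  # placeholder teacher model
            messages=[
                {"role": "system", "content": "Think step by step, then state the final answer."},
                {"role": "user", "content": q},
            ],
        )
        f.write(json.dumps({
            "question": q,
            "teacher_output": resp.choices[0].message.content,
        }) + "\n")
```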
If this trend continues, AI might follow the path of cloud computing: big companies still dominate the infrastructure, while smaller players gain leverage by optimising and customising models for specific needs, cost efficiency, and control.
The barriers to AI development are falling. What happens when anyone can train a high-performing AI assistant for the price of a laptop?