Phi-4 is a 14B-parameter model from Microsoft Research that aims to improve the state of the art for math reasoning. Previously available on Azure AI Foundry, Phi-4 has recently become available on Hugging Face under the MIT license.
According to Microsoft, Phi-4 outperforms comparable and larger models on math reasoning thanks to a number of innovations throughout the training process, including the use of synthetic data for pre-training and mid-training, curation and filtering of organic data, and a new post-training scheme. This approach, Microsoft says, produced a significant improvement over previous models in the Phi family:
While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation.
The use of synthetic data is not new for LLMs or Phi models in particular. Microsoft says that using synthetic data is not a cheap substitute for organic data but offers distinct advantages over the latter by providing a more gradual learning path and better alignment with inference contexts. For example, organic data from the Web could include the statement of a mathematical problem followed by the final solution, with the reasoning steps coming afterward. This makes it harder for an LLM to learn to generate the solution from the problem statement. In contrast, a synthetic description of the problem would lead the LLM step by step from the initial problem statement to the final solution.
Along with synthetic data, Microsoft also used curated organic data, including tens of millions of high-quality organic problems and solutions from public websites and external datasets. In cases where accurate solutions were not provided, they were generated synthetically using majority voting to increase accuracy. Academic papers, educational forums, and programming tutorials were also collected.
We found clean and correct natural data to be absolutely crucial for seeding synthetic data: minor errors can result in severe quality degradations for derived synthetic documents. We therefore invested heavily in the perfectionistic curation of our web data.
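The majority-voting step mentioned above, used when a collected problem lacks a trusted solution, can be illustrated with a minimal sketch. It assumes several candidate answers have already been sampled from a model. The function name `majority_vote`, the `min_agreement` threshold, and the simple string normalization are hypothetical choices for illustration, not details from Microsoft's pipeline.

```python
from collections import Counter


def majority_vote(candidates, min_agreement=0.5):
    """Return the most frequent answer if it wins a strict majority, else None.

    `min_agreement` is a hypothetical threshold: the winning answer must be
    given by more than this fraction of the candidates to be accepted.
    """
    if not candidates:
        return None
    # Trivial normalization so that " X = 4 " and "x = 4" count as one answer.
    normalized = [c.strip().lower() for c in candidates]
    answer, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) > min_agreement:
        return answer
    # No answer has enough support: treat the problem as unresolved.
    return None


# Five sampled final answers for the same problem (made-up data).
samples = ["x = 4", "x = 4", "x = 5", " X = 4 ", "x = 3"]
print(majority_vote(samples))  # prints "x = 4" (3 of 5 votes)
```

Returning `None` when no answer reaches the threshold lets a pipeline discard ambiguous problems instead of keeping a possibly wrong solution, which fits the emphasis on clean seed data in the quote above.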
The post-training phase for Phi-4 was aimed at transforming the pretrained model into a reliable AI assistant. First, Microsoft fine-tuned the model on data generated from high-quality sources across diverse domains, including math, coding, reasoning, conversation, model identity, and safety. Then, they ran two rounds of direct preference optimization (DPO) to better align the model with human preferences and suppress undesired behavior. In the first DPO round, Microsoft used a new technique, called Pivotal Token Search, to generate pairs of desired and undesired results; in the second, they relied on GPT-4o as a judge to label the responses in each pair as positive or negative.
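To make the DPO step more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, as defined in the original DPO formulation. It is not Microsoft's training code: the function names, the `beta` value, and the example log-probabilities are assumptions for illustration. The inputs are the summed token log-probabilities of the desired ("chosen") and undesired ("rejected") responses under the model being trained and under a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * (chosen_margin - rejected_margin)))
    where each margin is the policy log-prob minus the reference log-prob.
    `beta` (assumed 0.1 here) controls how far the policy may drift from
    the reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# Made-up log-probabilities: the policy already prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-20.0,
                ref_chosen_logp=-15.0, ref_rejected_logp=-16.0)
print(loss < math.log(2))  # prints True
```

When the policy and the reference agree exactly, the loss is log(2); it decreases as the policy raises the likelihood of the chosen response relative to the rejected one.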
Phi-4 was evaluated on a set of benchmarks using OpenAI’s SIMPLE-EVALS framework. It outperformed Llama-3.1-405B on several of them, and it also outperformed its teacher model GPT-4o on the GPQA (graduate-level STEM Q&A) and MATH (math competition) benchmarks.