
Phi-4 proves that a ‘data-first’ SFT methodology is the new differentiator

By Eric November 18, 2025

A significant shift is under way in the development of large language models (LLMs). AI engineers have traditionally pursued performance by scaling up model parameters and datasets, but smaller, more efficient models are gaining traction, as the Phi-4 fine-tuning methodology developed by Microsoft illustrates. The approach shows that a carefully curated dataset and a deliberate fine-tuning strategy can enable a 14-billion-parameter model like Phi-4 to compete with much larger models. The Phi-4 model was trained on just 1.4 million meticulously selected prompt-response pairs, focusing on “teachable” examples that push the model’s reasoning capabilities rather than relying on brute-force scaling.

The Phi-4 methodology is particularly noteworthy for its emphasis on data curation and quality over quantity. Trained on a dataset that covers STEM, coding, and safety, the Phi-4 model outperformed OpenAI’s o1-mini and DeepSeek’s 70B distilled model across most reasoning tasks. On the AIME 2024 math olympiad benchmark, for instance, Phi-4 scored 75.3% against o1-mini’s 63.6%, despite its small size. The Phi-4 team selected data systematically, filtering out examples that were either too easy or too difficult so that each training instance provided a useful learning signal. The methodology demonstrates the effectiveness of targeted data curation and serves as a practical blueprint for smaller enterprise teams that want to improve their models without extensive computational resources.

As the AI landscape continues to evolve, the Phi-4 reasoning model stands as a testament to the idea that bigger isn’t always better. By focusing on teachable data and iterative tuning, even a model with fewer parameters can achieve remarkable reasoning performance. The insights garnered from the Phi-4 project encourage AI engineers to adopt a more strategic approach to model training, emphasizing the importance of data quality and targeted learning. For teams operating with limited resources, this represents an opportunity to achieve significant advancements in AI capabilities without the need for massive infrastructure investments. Ultimately, the Phi-4 methodology invites a reevaluation of existing paradigms in AI development, highlighting the potential of smaller, well-crafted models to drive innovation and performance in the field of artificial intelligence.

AI engineers often chase performance by scaling up LLM parameters and data, but the trend toward smaller, more efficient, and better-focused models has accelerated.
The Phi-4 fine-tuning methodology is the cleanest public example of a training approach that smaller enterprise teams can copy. It shows how a carefully chosen dataset and fine-tuning strategy can make a 14B model compete with much larger ones.
The Phi-4 model was trained on just 1.4 million carefully chosen prompt-response pairs. Instead of brute force, the Microsoft Phi-4 research team focused on “teachable” examples at the edge of the model’s abilities and rigorous data curation.
The Phi-4 reasoning smart data playbook demonstrates how strategic data curation with replicable SFT and RL can elevate a 14B model beyond much larger counterparts.
Why Phi-4 stands apart
Smaller reasoning models, such as OpenAI’s o1-mini and Google’s Gemma, are becoming more common, and models like Alibaba’s Qwen3 (8B and 14B) are seeing wide adoption across use cases. That adoption is important, but it doesn’t displace the value of Phi-4 as an experimental proof: Phi-4 was designed as a testbed for a data-first training methodology, and its documentation reads like a smart data playbook for teams that want to replicate that approach.
The Phi-4 team has shared a repeatable SFT playbook that includes a 1.4-million prompt-response set. It’s built around “teachable” edge examples, questions that are neither too easy nor too difficult, chosen to push the model’s reasoning. Each topic, such as math or code, is tuned separately and then combined with synthetic rewrites that turn complex tasks into forms that can be checked automatically.
The paper outlines the data selection and filtering process in enough detail for smaller teams to reproduce it with open-source models and evaluators. For enterprise teams, that level of transparency turns a research result into a practical, copyable training recipe they can implement and measure quickly.
The data-first philosophy: Why less can be more
Traditional approaches to LLM reasoning have often relied on scaling datasets massively to encourage generalization. Phi-4 reasoning takes a different path, showing that carefully curated data can achieve similar or even better results with far less.
The team assembled a dataset covering STEM, coding, and safety. Despite its small size, it outperformed models trained on orders of magnitude more data.
In benchmarks, the 14B Phi-4 reasoning model outperformed OpenAI’s o1-mini and DeepSeek’s 70B distilled model across most reasoning tasks, and approached the full DeepSeek-R1 (671B) on challenging math (AIME) questions.
With just 14 billion parameters, Phi-4 reasoning delivers the following results when compared to other leading models:
Benchmark (task) | Phi-4 reasoning | Comparison model (size) | Comparison score | Date / Source
AIME 2024 (math olympiad) | 75.3% | o1-mini | 63.6% | Microsoft Phi-4 model card, April 2025 (Hugging Face)
AIME 2025 (math olympiad) | 62.9% | DeepSeek-R1-Distill-70B | 51.5% | Microsoft Phi-4 model card, April 2025 (Hugging Face)
OmniMath | 76.6% | DeepSeek-R1-Distill-70B | 63.4% | Microsoft Phi-4 model card, April 2025 (Hugging Face)
GPQA-Diamond (graduate-level science) | 65.8% | o1-mini | 60.0% | Microsoft Phi-4 model card, April 2025 (Hugging Face)
OmniMath (same benchmark, different comparison) | 76.6% | Claude-3.7-Sonnet | 54.6% | Microsoft Phi-4 model card, April 2025 (Hugging Face)
Table: Phi-4 reasoning performance across benchmarks compared to other models. Source: Microsoft
The key to this is filtering for quality over quantity. Much of the generic data is either too easy (the base model already knows it) or too hard (no learning signal). The Phi-4 team explicitly discards such examples. “Given the strong baseline reasoning capabilities of Phi-4, many initial seed questions are already handled competently,” they note. “To make further learning impactful, we specifically target seeds situated at the edge of Phi-4’s current abilities.”
In practice, they rely on LLM-based evaluation. For each candidate question, a strong reference model (like GPT-4) generates an “answer key,” and the answers from weaker models are compared against it. If a weaker model disagrees with the key often enough, that indicates a teachable gap. Those questions are retained, while trivially solved or utterly unsolvable questions are dropped.
For example, a simple arithmetic problem might be dropped (too easy), and an extremely obscure theorem proof might be dropped (too hard) as well. But a moderately challenging geometry problem that Phi-4 gets wrong is included.
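The filtering rule described above can be sketched in a few lines. This is a minimal illustration, not the Phi-4 team's pipeline: the `samples` and `reference` fields, the exact-match comparison, and the 0.1 to 0.7 pass-rate band are all assumptions chosen for the example, and in a real setup the samples would come from your base model and the reference from a stronger one.

```python
from typing import Iterable


def pass_rate(answers: Iterable[str], reference: str) -> float:
    """Fraction of sampled answers that exactly match the reference answer."""
    answers = list(answers)
    if not answers:
        return 0.0
    hits = sum(a.strip() == reference.strip() for a in answers)
    return hits / len(answers)


def select_teachable(candidates, low=0.1, high=0.7):
    """Keep prompts whose pass rate lies in [low, high].

    The band is a hypothetical choice: above `high` the prompt is treated
    as too easy, below `low` as too hard to give a learning signal.
    """
    kept = []
    for item in candidates:
        rate = pass_rate(item["samples"], item["reference"])
        if low <= rate <= high:
            kept.append({**item, "pass_rate": rate})
    return kept


# Toy data: answers are hard-coded stand-ins for base-model samples.
candidates = [
    {"prompt": "simple arithmetic", "reference": "4",
     "samples": ["4", "4", "4", "4"]},
    {"prompt": "moderate geometry", "reference": "12",
     "samples": ["12", "10", "9", "12"]},
    {"prompt": "obscure theorem", "reference": "yes",
     "samples": ["no", "unsure", "no", "maybe"]},
]

teachable = select_teachable(candidates)
print([c["prompt"] for c in teachable])
```

Exact string matching is the simplest possible grader; free-form answers would need a normalizer or an LLM judge in its place.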
This “sweet spot” approach ensures every example forces the model to stretch its reasoning. By focusing on multi-step problems rather than rote recall, they pack maximum learning into 1.4M examples.
As the authors explain, training on these carefully chosen seeds “leads to broad generalization across both reasoning-specific and general-purpose tasks.” In effect, Phi-4 reasoning demonstrates that intelligent data selection can outperform brute force scaling.
Independent domain optimization
Phi-4 reasoning’s data are grouped by domain (math, coding, puzzles, safety, etc.). Rather than blending everything at once, the team tunes each domain’s mix separately and then merges them.
This relies on an “additive property”: Optimizing math data in isolation and code data in isolation yields data-mixture weights that, when concatenated, still give gains in both areas. In practice, they first tuned the math dataset to saturation on math benchmarks, then did the same for code, and finally simply added the code data into the math recipe. The result was improved performance on both math and coding tasks, without retraining from scratch.
This modular approach offers clear practical advantages: a small team can first refine just the math dataset, achieve strong math performance, and then later add the coding data without redoing the math tuning.
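One way to picture the additive step is as a merge of per-domain recipes that were each tuned on their own. The sketch below assumes a recipe is a mapping from data source to a sampling multiplier; that representation, the source names, and the sizes are all hypothetical and are not taken from the Phi-4 paper.

```python
def merge_domain_mixes(*mixes):
    """Concatenate per-domain mixes without re-tuning any weight.

    Each mix maps a source name to a sampling multiplier that was tuned
    in isolation. Overlapping source names are rejected so that one
    domain cannot silently overwrite another domain's tuned weight.
    """
    merged = {}
    for mix in mixes:
        for source, weight in mix.items():
            if source in merged:
                raise ValueError(f"source {source!r} appears in two domain mixes")
            merged[source] = weight
    return merged


def expected_counts(mix, source_sizes):
    """Number of examples each source contributes under the given mix."""
    return {src: round(w * source_sizes[src]) for src, w in mix.items()}


# Hypothetical recipes, each frozen after its own tuning rounds.
math_mix = {"math_competition": 2.0, "math_textbook": 1.0}
code_mix = {"code_contests": 1.5}

joint_mix = merge_domain_mixes(math_mix, code_mix)
sizes = {"math_competition": 1000, "math_textbook": 4000, "code_contests": 2000}
print(expected_counts(joint_mix, sizes))
```

The point of the design is that the math weights are carried over untouched when code is added, which is what makes the search space small.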
However, the Phi-4 authors caution that scaling this method to many domains remains an open question. While the approach “worked very well” for their math+code mix, they note, “it is not known whether this method can scale to dozens or hundreds of domains,” a direction they acknowledge as a valuable area for future research. In short, the additive strategy is effective, but expanding into new domains must be approached carefully, as it may introduce unforeseen interactions.
Despite potential pitfalls, the additive strategy proved effective in Phi-4 reasoning. By treating each domain independently, the team avoided complex joint optimization and narrowed the search space for data mixtures. This approach allows incremental scaling of domains. Teams can begin by tuning the math SFT, then incorporate the code dataset, and later expand to additional specialized tasks, all while maintaining prior performance gains.
This is a practical advantage for resource-constrained teams. Instead of requiring a large group of experts to manage a complex, multi-domain dataset, a small team can focus on one data silo at a time.
Synthetic data transformation
Some reasoning problems, such as abstract proofs or creative tasks, are difficult to verify automatically. Yet automated verification (for RL reward shaping) is very valuable. Phi-4 reasoning tackled this by transforming hard prompts into easier-to-check forms.
For example, the team rewrote a subset of coding problems as word puzzles or converted some math problems to have concise numeric answers. These “synthetic seed data” preserve the underlying reasoning challenge but make correctness easier to test. Think of it as giving the model a simplified version of the riddle that still teaches the same logic.
This engineering hack enables downstream RL to use clear reward signals on tasks that would otherwise be too open-ended.
Here’s an example of synthetic data transformation:
Raw web data | Synthetic data
On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. Prove that △ABC is isosceles. | ABC is a triangle with AB=13 and BC=10. On the sides AB and BC of triangle ABC, points M and N are taken, respectively. It turns out that the perimeter of △AMC is equal to the perimeter of △CNA, and the perimeter of △ANB is equal to the perimeter of △CMB. What is AC?
Table: Rewriting seed data from the web (left) into verifiable synthetic questions for SFT and RL (right). Source: Microsoft
Note that by assigning numeric values (AB=13, BC=10) and asking “What is AC?”, the answer becomes a single number, which can be easily checked for correctness.
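Once a problem ends in a single number, the check can be fully automatic. The following is a rough sketch of such a checker, usable as a binary reward; the "last number in the response is the final answer" convention and the tolerance are assumptions made for the example, and the reference value shown is a placeholder rather than the solution to the triangle problem above.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None


def numeric_reward(response, reference, tol=1e-6):
    """Binary reward: 1.0 if the final number matches the reference, else 0.0."""
    predicted = extract_final_number(response)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - reference) <= tol else 0.0


# Placeholder reference value, used only to exercise the checker.
print(numeric_reward("After simplifying, the answer is 42.", 42.0))
print(numeric_reward("I first got 42, but the correct value is 41", 42.0))
print(numeric_reward("I could not solve this.", 42.0))
```

A production checker would more likely require an explicit answer format (for example a boxed final answer) so that stray numbers in the reasoning cannot be mistaken for the result.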
Other teams have applied similar domain-specific tricks. For example, chemistry LLMs like FutureHouse’s ether0 model generate molecules under strict pKa or structural constraints, using crafted reward functions to ensure valid chemistry.
In mathematics, the Kimina-Prover model by Numina translates natural-language theorems into the Lean formal system, so reinforcement learning can verify correct proofs. These examples highlight how synthetic augmentation, when paired with verifiable constraints, can push models to perform well in highly specialized domains.
In practical terms, engineers should embrace synthetic data but keep it grounded. Heuristics like “convert to numeric answers” or “decompose a proof into checkable steps” can make training safer and more efficient. At the same time, maintain a pipeline of real (organic) problems as well, to ensure breadth.
The key is balance. Use synthetic transformations to unlock difficult verification problems, since they give the model a clearly defined, discrete objective, but don’t rely on them exclusively. Real-world diversity still matters.
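A simple guard for that balance is to cap how many synthetic examples enter the training set relative to real ones. The sketch below is one hypothetical way to do it; the 0.5 synthetic-to-real ratio is an arbitrary placeholder, not a figure from the Phi-4 work.

```python
import math
import random


def mix_with_synthetic_cap(real, synthetic, max_ratio=0.5, seed=0):
    """Return all real examples plus at most max_ratio * len(real) synthetic ones.

    `max_ratio` is the allowed synthetic-to-real ratio (an assumed knob).
    Synthetic examples are sampled without replacement, then the combined
    set is shuffled with a fixed seed so runs are repeatable.
    """
    if max_ratio < 0:
        raise ValueError("max_ratio must be non-negative")
    budget = min(len(synthetic), math.floor(len(real) * max_ratio))
    rng = random.Random(seed)
    chosen = rng.sample(list(synthetic), budget)
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed


real = [f"real-{i}" for i in range(6)]
synthetic = [f"syn-{i}" for i in range(10)]
mixed = mix_with_synthetic_cap(real, synthetic)
print(len(mixed), sum(x.startswith("syn-") for x in mixed))
```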
Practical implementation for enterprises
AI teams looking to apply Phi-4 reasoning’s insights can follow a series of concrete steps to implement the approach effectively.
Identifying the model’s edge
Detect your model’s “edge” by identifying where the base LLM struggles. One way is to use its confidence or agreement scores. For example, generate several answers per prompt (using an inference engine such as vLLM for fast sampling) and see where consensus breaks. Those prompts at the margin of confidence are your teachable examples. By focusing on these low-confidence questions rather than the questions it already gets right, you ensure each new example is worth learning.
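When no answer key is available, agreement among the model's own samples can serve as a rough proxy. The sketch below scores each prompt by the share of samples that agree with the most common answer; the 0.6 threshold and the hard-coded samples are assumptions, and in practice the samples would come from your inference engine.

```python
from collections import Counter


def agreement(samples):
    """Return (most common answer, share of samples that gave it)."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def low_consensus_prompts(sampled, threshold=0.6):
    """List prompts whose agreement score falls below the threshold."""
    flagged = []
    for prompt, samples in sampled.items():
        _, score = agreement(samples)
        if score < threshold:
            flagged.append(prompt)
    return flagged


# Stand-in samples; each list represents several generations for one prompt.
sampled = {
    "easy prompt": ["4", "4", "4", "4", "4"],
    "edge prompt": ["12", "10", "12", "9", "11"],
}
print(low_consensus_prompts(sampled))
```

High agreement does not guarantee correctness, since a model can be consistently wrong, so this signal works best alongside a reference-based check where one is available.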
Isolating domains for targeted tuning
Tune one domain at a time rather than mixing all data genres upfront. Pick the highest-value domain for your app (math, code, legal, etc.) and craft a small SFT dataset for just that. Iterate on the mix (balancing difficulty, source types, etc.) until performance saturates on domain-specific benchmarks. Then freeze that mix and add the next domain. This modular tuning follows Phi-4 reasoning’s “additive” strategy. It avoids cross-talk since you preserve gains in domain A even as you improve domain B.
Expanding with synthetic augmentation
Leverage synthetic augmentation when gold-standard answers are scarce or unverifiable. For instance, if you need to teach a proof assistant but can’t autocheck proofs, transform them into arithmetic puzzles or shorter proofs that can be verified. Use your LLM to rewrite or generate these variants (Phi-4 used this to turn proof-style problems into ones with numeric answers).
Synthetic augmentation also lets you expand data cheaply. Once you have a validated small set, you can “multiply” it by having the LLM generate paraphrases, variations, or intermediate reasoning steps.
Scaling through a two-phase strategy
Use a two-phase training strategy that begins with exploration followed by scaling. In Phase 1 (exploration), run short fine-tuning experiments on a focused dataset (e.g., one domain) with limited compute. Track a few key metrics (benchmarks or held-out tasks) each run. Rapidly iterate hyperparameters and data mixes.
The Phi-4 paper demonstrates that this speeds up progress, as small experiments helped the team discover a robust recipe before scaling up. Only once you see consistent gains do you move to Phase 2 (scaling), where you combine your verified recipes across domains and train longer (in Phi-4’s case, ~16 billion tokens). Although this stage is more compute-intensive, the risk is significantly reduced by the prior experimentation.
Monitor for trigger points such as a significant uplift on validation tasks or stable metric trends. When those appear, it’s time to scale. If not, refine the recipe more first. This disciplined two-phase loop saves resources and keeps the team agile.
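The trigger described above can be written down as an explicit rule so the decision to scale is not made by eye. This is a sketch under assumed thresholds: a minimum uplift of 0.03 over the baseline and a spread of at most 0.01 across the last three runs are placeholder values to be tuned per benchmark.

```python
def ready_to_scale(baseline, scores, min_uplift=0.03, window=3, max_spread=0.01):
    """Decide whether Phase 1 results justify moving to Phase 2.

    Returns True only if the last `window` validation scores all beat the
    baseline by at least `min_uplift` and differ from each other by no
    more than `max_spread` (a crude stability check).
    """
    if len(scores) < window:
        return False
    recent = scores[-window:]
    uplift = min(recent) - baseline
    spread = max(recent) - min(recent)
    return uplift >= min_uplift and spread <= max_spread


# Validation accuracy per Phase 1 run (made-up numbers).
print(ready_to_scale(0.50, [0.51, 0.55, 0.554, 0.557]))  # stable uplift
print(ready_to_scale(0.50, [0.51, 0.58, 0.52]))           # noisy, keep refining
```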
In practice, many teams at Hugging Face and elsewhere have followed similar advice. For example, while developing the conversational model SmolLM2, the team noticed poor chat performance in Phase 1. They then generated ~500K synthetic multi-turn dialogues and re-trained, which “significantly improved both downstream performance and its overall ‘vibes,’” as one researcher reports. This represents a concrete win, achieved through a targeted synthetic data injection based on an initial feedback loop.
How to do this now
Here’s a simple checklist that you can follow to put these ideas into action.
1. Pick a target domain/task. Choose one area (e.g., math, coding, or a specific application) where you need better performance. This keeps the project focused.
2. Collect a small seed dataset. Gather, say, a few thousand prompt–answer pairs in that domain from existing sources (textbooks, GitHub, etc.).
3. Filter for edge-of-ability examples. Use a strong model (e.g., GPT-4) to create an answer key for each prompt. Run your base model on those prompts. Keep examples that the base model often misses, discard ones it already solves or is hopeless on. This yields “teachable” examples.
4. Fine-tune your model (Phase 1). Run a short SFT job on this curated data. Track performance on a held-out set or benchmark. Iterate: Refine the data mix, remove easy questions, add new teachable ones, until gains taper off.
5. Add synthetic examples if needed. If some concepts lack auto-verifiable answers (like long proofs), create simpler numeric or single-answer variants using your LLM. This gives clear rewards for RL. Keep a balance with real problems.
6. Expand to the next domain. Once one domain is tuned, “freeze” its dataset. Pick a second high-value domain and repeat steps 3 to 5 to tune that data mix. Finally, merge the data for both domains, and do a final longer training run (Phase 2).
7. Monitor benchmarks carefully. Use a consistent evaluation methodology (like majority-voting runs) to avoid misleading results. Only proceed to a full-scale training if small experiments show clear improvements.
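For the last item in the checklist, majority-vote scoring is straightforward to implement: sample several answers per question, take the most frequent one, and grade only that. The sketch below assumes answers are already normalized strings and uses made-up data; it is an illustration of the idea rather than the evaluation harness used for Phi-4.

```python
from collections import Counter


def majority_answer(samples):
    """Most frequent answer among the samples for one question."""
    return Counter(samples).most_common(1)[0][0]


def majority_vote_accuracy(runs, references):
    """Share of questions whose majority answer equals the reference.

    `runs` maps question id -> list of sampled answers,
    `references` maps question id -> correct answer.
    """
    correct = sum(
        majority_answer(runs[qid]) == ref for qid, ref in references.items()
    )
    return correct / len(references)


# Three sampled answers per question (hypothetical).
runs = {
    "q1": ["4", "4", "5"],
    "q2": ["7", "8", "8"],
    "q3": ["1", "2", "2"],
}
references = {"q1": "4", "q2": "8", "q3": "1"}

accuracy = majority_vote_accuracy(runs, references)
print(f"{accuracy:.3f}")
```

Using the same number of samples and the same voting rule for every checkpoint keeps comparisons between runs meaningful.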
Limits and trade-offs
Despite the effectiveness of the Phi-4 training method, several limitations and practical considerations remain. One key challenge is domain scaling. While Phi-4’s additive method worked well for math and code, it has yet to be proven across many domains. The authors acknowledge that it remains an open question whether this approach can scale smoothly to dozens of topics.
Another concern is the use of synthetic data. Relying too heavily on synthetic rewrites can reduce the diversity of the dataset, so it’s crucial to maintain a balance between real and synthetic examples to preserve the model’s ability to reason effectively.
Lastly, while the repeatable SFT method helps reduce computational costs, it doesn’t eliminate the need for thoughtful curation. Even though the approach is more efficient than brute-force scaling, it still requires careful data selection and iteration.
Lessons from Phi-4
The Phi-4 reasoning story is clear: Bigger isn’t always better for reasoning models. Instead of blindly scaling, the team asked where learning happens and engineered their data to hit that sweet spot. They show that “the benefit of careful data curation for supervised fine-tuning extends to reasoning models.” In other words, with a smart curriculum, you can squeeze surprising capability out of modest models.
For engineers, the takeaway is actionable. You don’t need a billion-dollar cluster or an endless internet crawl to improve reasoning. For resource-strapped teams, this is good news, as a careful data strategy lets you punch above your weight.
Phi-4 reasoning proves that methodical data and training design, not sheer parameter count, drives advanced reasoning. Focusing on teachable data and iterative tuning, even a 14B model surpassed much larger rivals. For AI teams today, this offers a practical blueprint. Refine the data, iterate fast, and scale only when the signals are right. These steps can unlock breakthrough reasoning performance without breaking the bank.
