
Google’s new AI training method helps small models tackle complex reasoning

By Eric November 18, 2025

Researchers from Google Cloud and UCLA have proposed a new reinforcement learning framework called Supervised Reinforcement Learning (SRL), which improves the ability of language models to learn complex multi-step reasoning tasks. Reasoning training for large language models (LLMs) often relies on reinforcement learning with verifiable rewards (RLVR), which rewards a model solely on the correctness of its final answer. This approach has a critical limitation: it provides no useful feedback when a model gets most of a multi-step problem right but makes a single mistake. Instead of learning from the partially correct attempt, the model receives a negative reward for the incorrect final answer, which creates a learning bottleneck. SRL addresses this by reformulating problem-solving as a sequence of logical actions, allowing smaller models to learn intricate reasoning patterns without the all-or-nothing feedback of outcome-based training.

SRL sits between outcome-based reinforcement learning and supervised fine-tuning (SFT). By breaking expert problem-solving down into a series of intermediate actions, SRL teaches models to reproduce the key steps of expert reasoning while developing their own internal reasoning style. This lets the model learn from expert demonstrations while reducing the overfitting that SFT is prone to, a problem made worse by the scarcity of high-quality training data. In the researchers' experiments, models trained with SRL outperformed baselines on challenging mathematical reasoning and agentic software engineering tasks. A model fine-tuned with SRL achieved a 3.0% average improvement over other methods on four competition-level math benchmarks, and on coding tasks it reached a 14.8% task resolve rate, a 74% relative improvement over an SFT-based baseline. The results suggest SRL can raise the reasoning ability of smaller, less expensive models.

The implications of SRL extend beyond benchmark gains to how AI systems are built for high-stakes applications. According to the researchers, SRL acts as a curriculum that teaches step-by-step reasoning, which stabilizes a later RL stage and makes reasoning more interpretable and generalizable. Using SRL first and then RLVR for refinement produced the paper's strongest results and could become a blueprint for building specialized AI. Challenges remain in scaling this pipeline, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. High-quality expert trajectories also remain important, and the researchers expect the next advances to come from automating their generation and filtering with strong teacher models or self-improving student models.

https://www.youtube.com/watch?v=NAEsl4T8HUs

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.
This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.
The limits of current LLM reasoning training
Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model’s ability to discover a correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models can’t try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach: rewards are sparse, and the model gets no granular feedback on which steps were right.
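The all-or-nothing behavior can be illustrated with a toy outcome-only reward. This is a minimal sketch, not the paper's implementation: the function name `outcome_reward`, the +1/-1 reward values, and the two example rollouts are assumptions chosen for illustration.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: only the final answer is checked.

    The +1.0 / -1.0 values are illustrative, not taken from the paper.
    """
    return 1.0 if final_answer.strip() == reference.strip() else -1.0


# Two hypothetical rollouts for a problem whose reference answer is "42".
# Each records which intermediate steps were right, plus the final answer.
rollouts = [
    {"steps_correct": [True, True, True, False], "final_answer": "40"},    # one late slip
    {"steps_correct": [False, False, False, False], "final_answer": "7"},  # wrong throughout
]

rewards = [outcome_reward(r["final_answer"], "42") for r in rollouts]

for rollout, reward in zip(rollouts, rewards):
    share = sum(rollout["steps_correct"]) / len(rollout["steps_correct"])
    print(f"correct steps: {share:.0%}, reward: {reward}")
```

Both rollouts receive the same reward even though one of them got three of its four steps right, so the training signal cannot distinguish a near miss from a complete failure.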
An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave “a critical gap for training small open-source models to effectively learn difficult problems.”
How supervised reinforcement learning works
SRL introduces a framework that reformulates problem-solving as a “sequential decision-making process,” striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.
In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
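One way to picture this decomposition is to turn a single expert trajectory into one training example per action. The sketch below is an illustration under stated assumptions: the `StepExample` type, the `decompose` helper, and the choice to build each context from the problem plus the preceding expert actions are hypothetical and may differ from how the paper constructs its data.

```python
from dataclasses import dataclass


@dataclass
class StepExample:
    context: str        # problem statement plus the expert actions taken so far
    target_action: str  # the next expert action the student should reproduce


def decompose(problem: str, expert_actions: list[str]) -> list[StepExample]:
    """Split one expert trajectory into one training example per action."""
    examples = []
    for k, action in enumerate(expert_actions):
        history = "\n".join(expert_actions[:k])
        context = problem if not history else f"{problem}\n{history}"
        examples.append(StepExample(context=context, target_action=action))
    return examples


# Hypothetical teacher-generated trajectory for a simple algebra problem.
trajectory = ["2x = 10 - 4", "2x = 6", "x = 3"]
examples = decompose("Solve 2x + 4 = 10.", trajectory)

for ex in examples:
    print(repr(ex.context), "->", repr(ex.target_action))
```

A three-action trajectory yields three examples here, so each expert demonstration supplies a learning target at every step and not only at the final answer.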
According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. “SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step,” Hsu told VentureBeat. “This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers.”
During training, the model first generates an “inner monologue” (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn’t perfect. This solves the sparse reward problem RLVR faces.
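A step-wise similarity reward of this kind could look roughly like the following. This is a hedged sketch and not the authors' code: the `<think>` tag name, the helper names, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: the inner monologue is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_action(model_output: str) -> str:
    """Drop the inner monologue and keep only the committed action."""
    return THINK_BLOCK.sub("", model_output).strip()


def step_reward(model_output: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the predicted and the expert action.

    Only the action is scored; the monologue itself is not compared.
    """
    predicted = extract_action(model_output)
    return SequenceMatcher(None, predicted, expert_action).ratio()


expert = "2x = 6"
exact = step_reward("<think>Subtract 4 from both sides.</think>2x = 6", expert)
close = step_reward("<think>Move the 4 over.</think>2x = 5", expert)
off = step_reward("<think>Guess.</think>y = 100", expert)

print(exact, close, off)  # an exact match scores 1.0; the others score lower
```

Because the reward is graded, a nearly correct action earns more than an unrelated one, which gives the model usable feedback at every step.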
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.
For enterprise leaders, performance gains are only valuable if they don’t come with runaway costs. Hsu clarifies that SRL-trained models are more efficient in their reasoning. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it.”
For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.
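To see why an outcome-based GRPO baseline can stall on very hard problems, it helps to look at how group-relative advantages behave. The sketch below uses the standard GRPO idea of normalizing each rollout's reward by the mean and standard deviation of its group. The function name, the `eps` stabilizer, the use of the population standard deviation, and the reward values are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout in the group solved the problem: the group carries a signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# No rollout solved the problem: every advantage collapses to zero.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

When every rollout in a group fails, all rewards are identical and the advantages are zero, so the update carries no information. This is the situation the article describes for problems the model rarely or never solves within its rollout budget.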
The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL’s ability to train more competent AI agents for complex, real-world programming tasks.
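The relative figure is easier to interpret next to the absolute one. The article does not state the SFT baseline's resolve rate, so the calculation below only derives an approximate implied value from the two rounded numbers it does report; the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_rate: float, relative_improvement: float) -> float:
    """Back out the baseline rate from a new rate and a relative improvement."""
    return new_rate / (1.0 + relative_improvement)


srl_rate = 14.8  # SRL task resolve rate in percent, as reported in the article
baseline = implied_baseline(srl_rate, 0.74)  # 74% relative improvement

print(f"implied SFT baseline: {baseline:.1f}%")
print(f"absolute gain: {srl_rate - baseline:.1f} points")
```

Under these assumptions the SFT baseline resolves roughly 8.5% of tasks, so the 74% relative improvement corresponds to an absolute gain of about 6.3 percentage points.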
A new standard for high-stakes AI?
The paper’s strongest results came from combining methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as a pre-training step and then applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.
This raises the question of whether this could become a new blueprint for building specialized AI.
“We view SRL as a strong foundation,” Hsu said. “In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications.”
Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. “While high-quality expert trajectories remain important,” he concluded, “we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data.”
