
Google’s new AI training method helps small models tackle complex reasoning

By Eric November 16, 2025

Researchers from Google Cloud and UCLA have introduced a reinforcement learning framework called Supervised Reinforcement Learning (SRL), which improves the ability of language models to learn complex multi-step reasoning tasks. Existing methods such as Reinforcement Learning with Verifiable Rewards (RLVR) reward only the final answer, so a model that rarely reaches a correct solution receives almost no learning signal on hard problems. SRL instead treats problem-solving as a sequence of logical actions, giving smaller models a richer learning signal during training. The approach performs well on mathematical reasoning benchmarks and also carries over to software engineering tasks, which suggests that smaller, less resource-intensive models can be trained to reason at a higher level.

The SRL framework addresses the limitations of existing training techniques by sitting between outcome-based reinforcement learning and supervised fine-tuning. By breaking expert problem-solving into a series of meaningful actions, SRL lets models learn from partially correct attempts as well as fully successful ones. In mathematical reasoning, an action might be a specific algebraic manipulation, while in software engineering it could be a command executed in a coding environment. This structure allows models to develop their own reasoning styles while still following expert-like decisions. The researchers report that SRL-trained models achieved a 3.0% average gain over other methods on four competition-level math benchmarks and a 74% relative improvement in task resolve rate over an SFT-based baseline on software engineering tasks.

Looking ahead, the implications of SRL are profound, potentially setting a new standard for high-stakes AI applications. The researchers suggest that combining SRL with RLVR could create a powerful curriculum learning strategy, where foundational reasoning skills are established before refining those abilities with outcome-based reinforcement learning. This dual approach not only stabilizes the learning process but also enhances the interpretability and generalizability of reasoning in AI systems. As the team continues to explore ways to automate the generation of high-quality training data, the future of AI development appears promising, with SRL poised to play a crucial role in training more competent and efficient AI agents capable of handling complex, real-world tasks.

https://www.youtube.com/watch?v=NAEsl4T8HUs

Researchers at Google Cloud and UCLA have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very challenging multi-step reasoning tasks. Supervised Reinforcement Learning (SRL) reformulates problem-solving as a sequence of logical “actions,” providing rich learning signals during the training process.
This approach enables smaller models to learn complex problems that were previously out of reach for other common training techniques. Experiments show that SRL not only excels on math reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller and less expensive models to higher reasoning abilities.
The limits of current LLM reasoning training
Recent advances in training large language models (LLMs) for reasoning have largely been driven by reinforcement learning with verifiable rewards (RLVR), a method where a model is rewarded based on the correctness of its final answer. By repeatedly trying to solve problems and getting feedback on the final outcome, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model’s ability to discover a correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models can’t try indefinitely. This method hits a wall when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. In many multi-step reasoning problems, a model might correctly solve several steps but get derailed by a single mistake, leading to an incorrect answer. With RLVR, this entire effort receives a negative reward, and the model learns nothing from its partially correct work. It’s an all-or-nothing approach whose rewards are sparse and carry no step-level feedback.
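To make the bottleneck concrete, here is a minimal sketch of why a limited rollout budget starves an outcome-based learner on hard problems. It assumes each rollout succeeds independently with the same probability, which is a simplification. The function names, the success rates, the budget of 8 rollouts, and the choice of 0.0 rather than a negative value for a wrong answer are all illustrative assumptions, not figures from the paper.

```python
def prob_any_success(p_correct: float, num_rollouts: int) -> float:
    """Chance that at least one of `num_rollouts` independent attempts is correct."""
    return 1.0 - (1.0 - p_correct) ** num_rollouts


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """All-or-nothing reward: only the final answer is checked (hypothetical scoring)."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


# Hypothetical hard problem: the model is right 0.1% of the time, 8 rollouts per update.
hard = prob_any_success(0.001, 8)
# Hypothetical easier problem: the model is right 30% of the time, same budget.
easy = prob_any_success(0.30, 8)
print(f"hard problem: {hard:.3f}, easier problem: {easy:.3f}")

# A rollout with sound early steps but a wrong final answer earns the same
# reward as a rollout that was wrong from the start.
print(outcome_reward("x = 7", "x = 5"))
```

Under these assumed numbers, the hard problem yields a correct rollout in under 1% of training updates, so nearly every update sees only failures and has nothing to reinforce.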
An alternative method is supervised fine-tuning (SFT), where the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting (the model simply learns to imitate the trajectories in the training data instead of learning to generalize to problems beyond the examples it has seen). This issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave “a critical gap for training small open-source models to effectively learn difficult problems.”
How supervised reinforcement learning works
SRL introduces a framework that reformulates problem-solving as a “sequential decision-making process,” striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to imitate an expert’s entire thought process, SRL teaches the model to reproduce a sequence of key actions that form the backbone of expert reasoning. This allows the model to learn to take actions similar to an expert while developing its own internal reasoning style.
In the SRL framework, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation. For a software engineering agent, it could be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
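One way to picture this decomposition is the sketch below, which turns a single expert solution into one training example per action. The `StepExample` type, the `split_trajectory` helper, and the choice to use the problem plus all earlier expert actions as the context for each step are assumptions made for illustration; the article does not describe the exact data format.

```python
from dataclasses import dataclass


@dataclass
class StepExample:
    context: str        # problem plus the expert actions taken so far (assumed format)
    target_action: str  # the expert's next action, which the student should match


def split_trajectory(problem: str, expert_actions: list[str]) -> list[StepExample]:
    """Turn one expert solution into one training example per action."""
    examples = []
    for i, action in enumerate(expert_actions):
        history = "\n".join(expert_actions[:i])
        context = problem if not history else f"{problem}\n{history}"
        examples.append(StepExample(context=context, target_action=action))
    return examples


# Toy math trajectory; in the article's setup a teacher model would produce these.
steps = split_trajectory(
    "Solve 2x + 3 = 11 for x.",
    ["Subtract 3 from both sides: 2x = 8", "Divide both sides by 2: x = 4"],
)
for ex in steps:
    print(repr(ex.context), "->", ex.target_action)
```

A trajectory with N actions yields N examples under this scheme, so even a small set of expert solutions provides many step-level training signals.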
According to I-Hung Hsu, a research scientist at Google and co-author of the paper, this middle-ground approach is key to its effectiveness in real-world scenarios. “SRL sits in the middle: It captures the structured flexibility of real-world problem solving, where there are multiple valid strategies but also clear notions of what ‘good reasoning’ looks like at each step,” Hsu told VentureBeat. “This makes SRL suitable for domains like data science automation or probably supply chain optimization — tasks that reward sound intermediate reasoning rather than mere final answers.”
During training, the model first generates an “inner monologue” (its internal reasoning process, enclosed in tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward system provides dense, fine-grained feedback, allowing the model to learn and improve even if its overall solution isn’t perfect. This solves the sparse reward problem RLVR faces.
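A step-wise reward of this kind can be sketched as follows. The article does not name the similarity measure, so this example assumes a plain text-similarity ratio from `difflib.SequenceMatcher` in Python's standard library. The `ACTION:` marker that separates the monologue from the action is also a hypothetical stand-in, since the article does not specify the tag format.

```python
from difflib import SequenceMatcher


def step_reward(predicted_action: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's action."""
    return SequenceMatcher(None, predicted_action, expert_action).ratio()


def split_monologue(output: str, marker: str = "ACTION:") -> tuple[str, str]:
    """Separate free-form reasoning from the committed action (hypothetical marker)."""
    reasoning, _, action = output.partition(marker)
    return reasoning.strip(), action.strip()


expert = "Subtract 3 from both sides: 2x = 8"
model_output = "I should isolate the x term first. ACTION: Subtract 3 from both sides: 2x = 8"

# Only the action is scored; the monologue is left unconstrained.
_, action = split_monologue(model_output)
exact = step_reward(action, expert)
partial = step_reward("Subtract 3 from both sides: 2x = 9", expert)
unrelated = step_reward("Multiply by zero", expert)
print(exact, partial, unrelated)
```

Because the score is graded rather than binary, a nearly correct action earns most of the reward, while the model's reasoning text is free to differ from the expert's.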
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines in both challenging mathematical reasoning and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns in models, such as interleaved planning and self-verification, which improve solution quality without just making the outputs longer.
For enterprise leaders, performance gains are only valuable if they don’t come with runaway costs. Hsu clarifies that SRL’s gains do not come from longer outputs. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… while SRL isn’t designed to reduce inference cost, it achieves stronger reasoning performance without increasing it.”
For the math tests, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm common in models like DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial 3.0% average performance boost over other methods.
The team extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was benchmarked against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a 14.8% task resolve rate, representing a 74% relative improvement over the SFT-based model. This shows SRL’s ability to train more competent AI agents for complex, real-world programming tasks.
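A relative improvement is easy to misread as an absolute one, so the short sketch below works through the arithmetic. The baseline resolve rate it prints is derived from the two figures quoted above (14.8% and 74%) and is not stated in the article; the function names are illustrative.

```python
def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Gain expressed as a fraction of the baseline."""
    return (new_rate - baseline_rate) / baseline_rate


def implied_baseline(new_rate: float, relative_gain: float) -> float:
    """Baseline rate that would produce `new_rate` after a given relative gain."""
    return new_rate / (1.0 + relative_gain)


srl_rate = 0.148  # resolve rate quoted in the article
baseline = implied_baseline(srl_rate, 0.74)  # derived, not reported in the article
print(f"implied SFT baseline resolve rate: {baseline:.1%}")
print(f"absolute gain: {(srl_rate - baseline) * 100:.1f} percentage points")
```

On these figures the SFT baseline works out to roughly 8.5%, so the 74% relative improvement corresponds to an absolute gain of about six percentage points.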
A new standard for high-stakes AI?
The paper’s strongest results came from combining methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. In their experiments, when the researchers used SRL as a pre-training step and then applied RLVR in post-training, they observed a 3.7% average increase, demonstrating a powerful curriculum learning strategy.
This raises the question of whether this could become a new blueprint for building specialized AI.
“We view SRL as a strong foundation,” Hsu said. “In a sense, SRL provides a curriculum — teaching models to think and act step by step — before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage but also makes reasoning more interpretable and generalizable, which is critical for high-stakes applications.”
Looking ahead, Hsu acknowledges that scaling this pipeline still faces challenges, particularly the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. “While high-quality expert trajectories remain important,” he concluded, “we think the next big leap will come from automating their generation and filtering — leveraging strong teacher models or even self-improving student models to bootstrap new data.”
