
Meta’s DreamGym framework trains AI agents in a simulated world to cut reinforcement learning costs

By Eric November 20, 2025

Researchers from Meta, the University of Chicago, and UC Berkeley have unveiled DreamGym, a framework designed to streamline the training of large language model (LLM) agents with reinforcement learning (RL). It targets the main obstacles to RL for agents: high costs, complex infrastructure, and unreliable feedback. DreamGym simulates an RL environment and dynamically adjusts task difficulty, so agents tackle progressively harder problems as they learn. This reduces the resources required for data gathering and environment interaction, making RL more accessible to enterprises that want to train agents for bespoke applications.

The DreamGym framework comprises three core components that work in unison to facilitate efficient agent training: a reasoning-based experience model, an experience replay buffer, and a curriculum task generator. The experience model translates the dynamics of a target environment into a textual format, enabling agents to interact with a simulated version of the environment rather than a costly real-world counterpart. This model generates consistent state transitions and feedback, which is crucial for effective learning. The experience replay buffer serves as a dynamic memory, continuously updating with new synthetic trajectories while providing essential context from offline data. Finally, the curriculum task generator identifies tasks where agent performance is mixed, creating progressively challenging variations to enhance learning. Together, these elements form a closed-loop system that addresses the persistent challenges of RL training, including cost, task diversity, and infrastructure demands.

DreamGym was evaluated on benchmarks including WebShop, ALFWorld, and WebArena. In environments such as WebArena, where large-scale RL infrastructure is difficult to set up, agents trained entirely within DreamGym achieved success rates over 30% higher than baseline methods. In environments where RL is supported but costly, they matched agents trained with established RL algorithms (GRPO and PPO) without any live interactions. A sim-to-real variant, DreamGym-S2R, first trains agents in the synthetic environment and then fine-tunes them on a small amount of real-world data; it yielded over a 40% performance improvement compared with training from scratch in the real environment while using less than 10% of the external data. The framework is still in its early stages, but the results suggest simulated environments can make RL-based agent training more feasible and less resource-intensive for enterprises.

Researchers at Meta, the University of Chicago, and UC Berkeley have developed a new framework that addresses the high costs, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, DreamGym, simulates an RL environment to train agents for complex applications. As training progresses, the framework dynamically adjusts task difficulty, ensuring the agent gradually learns to solve more challenging problems as it improves.
Experiments by the research team show that DreamGym substantially improves RL training in both fully synthetic settings and scenarios where the model must apply its simulated learning to the real world. In settings where RL is possible but expensive, it matches the performance of popular algorithms using only synthetic interactions, significantly cutting the costs of data gathering and environment interaction.
This approach could be vital for enterprises, allowing them to train agents for bespoke applications while avoiding the complexities of setting up and running live RL environments.
The challenge of training LLM agents
Reinforcement learning is a key technique for training LLMs to handle complex tasks in agentic environments, such as web navigation, tool use, and robotics. It allows models to learn from direct interaction and experience, moving beyond the static datasets used in pre-training.
However, RL for agent training remains difficult. Real-world applications often involve long action sequences with sparse signals, meaning the agent only receives a positive signal after a long and correct sequence of actions.
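A rough back-of-the-envelope illustration of why sparse signals are hard to learn from: if the agent only succeeds when every step in a sequence is correct, the chance of ever seeing a positive signal shrinks quickly as tasks get longer. The sketch below assumes each step is independently correct with a fixed probability, which is a simplification introduced here for illustration and not a model from the researchers.

```python
def success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Chance that all steps are correct, assuming independent steps (a simplification)."""
    return per_step_accuracy ** num_steps


# An agent that picks the right action 90% of the time at each step
for steps in (5, 20, 50):
    print(steps, round(success_probability(0.9, steps), 3))
```

Under this assumption, a 20-step task ends in a reward only about 12% of the time, so most rollouts give the agent no positive signal to learn from.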
Gathering enough diverse and validated data is also expensive, frequently requiring human experts to verify tasks and annotate outcomes. The infrastructure required to create live environments for large-scale RL training can be prohibitively complex and costly. Interacting with live systems also carries risks, as wrong actions (like deleting a file) can cause irreparable damage.
“These limitations make building general-purpose and scalable systems for training agents with RL an open and pressing challenge,” the researchers write.
DreamGym addresses these limitations by delivering comparable performance entirely in simulation. That removes the infrastructure burden that has kept most enterprises from adopting RL and gives teams a practical path to train agents without touching costly or risky live environments.
How DreamGym works
The researchers describe DreamGym as a “unified and scalable RL framework that synthesizes diverse experience data in an online manner to enable efficient and effective training of LLM agents.” It is built around three core components that work together to create a controlled and effective training loop.
The first component is a “reasoning-based experience model” that translates the dynamics of a target environment into a textual space. This model acts as the simulator of the application environment. Instead of interacting with a costly real environment, the agent interacts with this model, which generates consistent state transitions and feedback based on the agent’s actions.
The researchers argue that agent training doesn’t need perfectly realistic environments, but rather data that is “sufficiently diverse, informative, and causally grounded.” For example, in a web shopping task, the model synthesizes clean listings of on-page elements rather than processing raw HTML code. This abstract approach makes training the experience model highly efficient, requiring only a small amount of public data.
The second component is an “experience replay buffer,” which acts as a dynamic memory. At the beginning of the training process, the buffer is seeded with offline data to provide essential context and is continuously updated with new synthetic trajectories generated during training. This buffer helps guide the experience model’s predictions, ensuring the synthetic experiences remain diverse and factually grounded.
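As a minimal sketch of the buffer behavior described above, the following Python keeps a fixed set of offline seed trajectories alongside a bounded queue of synthetic ones, and retrieves the entries most similar to a new task. The class and field names, the capacity limit, and the word-overlap similarity are all assumptions made for illustration; the article does not describe how DreamGym stores or retrieves experiences.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str                                  # natural-language task description
    source: str                                # "offline" or "synthetic"
    steps: list = field(default_factory=list)  # (state, action, reward) tuples


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a stand-in for a real retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


class ExperienceReplayBuffer:
    """Hypothetical buffer: fixed offline seed plus a bounded queue of synthetic rollouts."""

    def __init__(self, offline_seed, synthetic_capacity=1000):
        self.offline = list(offline_seed)                   # seeded once, never evicted
        self.synthetic = deque(maxlen=synthetic_capacity)   # oldest rollouts drop out

    def add(self, trajectory: Trajectory) -> None:
        self.synthetic.append(trajectory)

    def __len__(self) -> int:
        return len(self.offline) + len(self.synthetic)

    def retrieve(self, task: str, k: int = 2) -> list:
        """Return up to k stored trajectories whose task text best matches `task`."""
        pool = self.offline + list(self.synthetic)
        return sorted(pool, key=lambda t: word_overlap(task, t.task), reverse=True)[:k]
```

In a setup like this, the retrieved trajectories would be passed to the experience model as context when it predicts the next state, which is one plausible way to keep synthetic experiences grounded in real examples.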
The third component, a “curriculum task generator,” works in tandem with the experience model to adaptively create new tasks that are progressively more challenging. It identifies tasks where the agent’s performance is mixed (signaling they are difficult but solvable) and generates variations to push the agent’s capabilities.
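One simple way to operationalize "mixed performance" is to score each task by the variance of its success outcomes across several rollouts: a task the agent always solves or always fails scores zero, while a task it solves about half the time scores highest. The sketch below takes that approach. The variance criterion, the function name, and the sample tasks are assumptions for illustration, not the selection rule reported by the researchers.

```python
from statistics import pvariance


def select_curriculum_seeds(outcomes_by_task: dict, top_k: int = 2) -> list:
    """Return up to top_k tasks whose binary outcomes (1 = success, 0 = failure) are most mixed."""
    scored = []
    for task, outcomes in outcomes_by_task.items():
        if not outcomes:
            continue  # no rollouts yet, nothing to score
        scored.append((pvariance(outcomes), task))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop tasks with zero variance: always solved or always failed
    return [task for variance, task in scored[:top_k] if variance > 0]


rollouts = {
    "find red shoes": [1, 1, 1, 1],        # already mastered
    "compare two laptops": [1, 0, 1, 0],   # difficult but solvable
    "checkout with coupon": [0, 0, 0, 0],  # currently out of reach
    "filter by price": [1, 0, 0, 0],       # occasionally solved
}
print(select_curriculum_seeds(rollouts))
```

The selected tasks would then serve as seeds from which the generator produces harder variations.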
Together, these components create a closed-loop system for scalable agent training. “By unifying interaction, memory, and adaptive online task generation, DreamGym addresses the persistent challenges that have limited RL for LLM agents training: prohibitive cost, scarcity of diverse tasks, unstable reward signals, and heavy infrastructure demands,” according to the researchers.
DreamGym in action
The researchers evaluated DreamGym across several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interaction). They used Llama 3 and Qwen 2.5 models as agent backbones and compared DreamGym against several traditional training strategies. These included offline methods like supervised fine-tuning (SFT) and direct preference optimization (DPO), as well as online RL algorithms like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which improve agents through live environment interaction.
DreamGym showed its most significant advantage in environments like WebArena, where setting up a large-scale RL infrastructure is difficult. Agents trained entirely inside DreamGym achieved success rates over 30% higher than baseline methods, which struggled with the sparse rewards and limited exploration in the real environment. The researchers said this shows DreamGym is a mechanism that makes RL training “feasible in domains that were previously intractable due to inherent task and engineering constraints.”
In environments where RL is supported but costly, agents trained with DreamGym performed on par with those trained using GRPO and PPO, but without any costly interactions with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, where an agent is first trained in the synthetic environment and then fine-tuned on a small amount of real-world data. This strategy yielded over a 40% performance improvement compared to training from scratch in the real environment while using less than 10% of the external data. This provides a scalable “warm-start” for training general-purpose agents.
Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain, such as WebShop, could successfully transfer its learned skills to another, like WebArena. The researchers suggest this is because DreamGym agents learn in an “abstract meta-representation space, enabling the agent to learn domain-agnostic behavioral priors rather than memorizing task-specific patterns.”
While still in its early stages, DreamGym shows that simulated environments can provide substantial gains in training agents. In practice, an enterprise could gather a small number of trajectories and descriptions for the tasks it wants to automate. It could then use this small seed to bootstrap the DreamGym framework for scalable and sample-efficient agent training.
