
Hosted by Enoch H. Kang · EN

We discuss Qwen-AgentWorld, a suite of language world models designed to simulate complex digital environments for artificial intelligence agents. By training on over 10 million trajectories across seven domains, including operating systems, web browsers, and software engineering sandboxes, these models learn to predict how an environment will respond to specific actions. This simulation capability allows agents to rehearse scenarios, refine their decision-making, and learn from a large, diverse set of interactions without needing constant access to live systems. The research details a three-stage training pipeline consisting of continual pre-training, supervised fine-tuning, and reinforcement learning to ensure high fidelity in these virtual environments. Furthermore, the paper presents AgentWorldBench, a new benchmark used to verify that these world models accurately mimic real-world dynamics. Ultimately, the authors demonstrate that integrating world modeling into agent frameworks significantly boosts performance by providing a foundation for predictive reasoning and planning.

This paper discusses a statistical framework for offline reinforcement learning using trajectory-level supervision, where only final outcomes or preferences are observed rather than step-by-step rewards. The authors introduce OPAC, a pessimistic actor-critic algorithm designed to learn from these aggregated signals by estimating latent rewards and applying pessimism to account for distribution shifts. Their analysis establishes that moving from process-level to outcome-level feedback incurs a quantifiable statistical cost, specifically an additional horizon factor in sample complexity. The research also explores generalized RL objectives, proving that non-linear outcomes like "all-success" criteria can lead to exponentially difficult learning problems. To address this, they identify specific structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, which determine when efficient learning remains possible. Ultimately, the paper provides a theoretical boundary for when sparse, trajectory-based data can successfully guide sequential decision-making.

SuperThoughts is a novel framework designed to accelerate the Chain-of-Thought (CoT) reasoning process in large language models by processing tokens in superposition. Unlike traditional models that generate tokens sequentially, this method uses a compressor to fuse pairs of consecutive tokens into single latent representations, effectively halving the number of required forward passes. To ensure accuracy is not sacrificed for speed, the system employs a Multi-Token Prediction (MTP) module and a confidence-based adaptive mechanism that reverts to standard decoding when the model is uncertain. Experimental results on complex mathematical and scientific benchmarks show that SuperThoughts reduces reasoning length by 20–35% while maintaining performance within a few percentage points of the original baseline. The research highlights that larger models are particularly adept at handling this compression, achieving significant wall-clock time reductions during inference. Ultimately, this approach offers a more efficient way to utilize test-time compute without losing the dense supervision provided by discrete token training.

This research paper introduces First-Explore Proximal Policy Optimization (FE-PPO), a new reinforcement learning algorithm designed to improve how agents discover rewards in complex, deceptive environments. While standard meta-learning methods often fail when immediate rewards are misleading, the FE-PPO framework trains agents specifically to gather information during exploration that will maximize success in later exploitation phases. By integrating a value function and bootstrapping into the original First-Explore objective, the authors significantly increase efficiency, achieving high performance with 10 to 40 times fewer samples. The study demonstrates that FE-PPO consistently outperforms the strong RL² baseline across various challenging benchmarks, including navigation tasks and bandit problems. Additionally, the authors provide a more competitive comparison by implementing a Transformer-XL architecture for their baselines. Ultimately, this work offers a practical, open-source foundation for future research into efficient meta-exploration strategies.

This research paper investigates self-distillation as a powerful regularization technique for pretraining language models when high-quality data is in short supply. By comparing various training strategies across different model scales and data scarcity levels, the authors demonstrate that self-distillation significantly outperforms both direct training and standard methods like weight decay or exponential moving averages. The study identifies a specific crossover threshold where distillation becomes superior, particularly when the available data is less than one-fourth of the amount prescribed by Chinchilla scaling laws. Practical results suggest that using larger models with natural teacher temperatures provides the most effective supervision, preventing the rapid overfitting typically seen in data-constrained environments. Ultimately, the work advocates for self-distillation as a robust alternative for improving model performance when compute resources outpace the available data pool.
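
As a rough illustration of the training signal involved, here is a minimal Python sketch of a distillation loss in which the student is trained to match a teacher's soft next-token distribution. This is not the authors' code: the function names are hypothetical, and "natural teacher temperature" is assumed here to mean a temperature of 1.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def softmax(logits, temperature=1.0):
    """Softmax built from the stable log-softmax above."""
    return [math.exp(lp) for lp in log_softmax(logits, temperature)]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's soft targets.

    Assumption: temperature=1.0 stands in for the "natural" teacher
    temperature; only the teacher's distribution is tempered here.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))
```

The loss is smallest when the student's distribution equals the teacher's, which is what lets the teacher's soft targets act as a regularizer compared with one-hot next-token labels.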

Meta-Harness is an advanced optimization system designed to improve how language-model agents process and compress long interaction histories into useful states. Unlike traditional methods that rely on manual engineering or simple feedback, this system uses a coding agent to search for and rewrite the "harness" code that manages an agent's memory and retrieval. By providing the proposer with direct filesystem access to raw execution traces and historical performance data, it avoids the information loss associated with summarized feedback. This approach allows the system to discover superior strategies for history summarization and adaptive retrieval across various complex tasks. Experimental results demonstrate that Meta-Harness achieves top-tier performance on benchmarks like TerminalBench-2 and improves accuracy in mathematical reasoning and text classification. Ultimately, the research suggests that the way agents construct their own internal state can be optimized as an embedded learning problem.

Exploratory RL (ExpRL) is an automated mid-training method designed to enhance the reasoning capabilities of large language models before they undergo standard reinforcement learning. While traditional reinforcement learning often struggles with sparse rewards on difficult problems, ExpRL uses human-written reference solutions as reward scaffolds to provide dense, informative feedback on partial progress. This approach employs an LLM judge to evaluate on-policy reasoning traces against specific rubrics, assigning rewards at both the outcome and process levels to reinforce productive intermediate steps. By shifting probability mass toward successful solution strategies, the method significantly improves pass@k performance and broadens the model’s coverage of complex reasoning paths. Experimental results demonstrate that ExpRL creates a superior initialization for subsequent training, outperforming supervised fine-tuning and standard distillation across challenging math and science benchmarks. Ultimately, this technique fosters sophisticated behaviors like self-correction and backtracking, which are essential for solving high-level reasoning tasks.

This paper introduces a statistical framework for making valid scientific discoveries using synthetic data, specifically addressing concerns that artificially generated data can be biased or noisy. The authors propose a new technical condition called task exchangeability, which allows researchers to calibrate synthetic results by comparing them to historical tasks where both real and synthetic data are available. By measuring the discrepancy between real and synthetic outcomes in these past cases, the method can adjust confidence intervals for new tasks where only synthetic data exists. The researchers demonstrate that this approach provides provable validity guarantees across various fields, including social science surveys and AI evaluation. Experiments show that while naive synthetic-only intervals are often severely biased and overconfident, the task-exchangeability method consistently covers the true values. Ultimately, this framework enables scientists to use LLM-generated "silicon samples" and automated raters to accelerate discovery without sacrificing statistical rigor.
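
One simple way to picture the calibration step is a conformal-style adjustment, sketched below in Python. This is an illustrative assumption, not the paper's procedure: the function name is hypothetical, each task is reduced to a single scalar estimate, and the real-versus-synthetic discrepancies are treated as exchangeable across tasks.

```python
import math

def calibrated_interval(historical_real, historical_synth, new_synth, alpha=0.1):
    """Widen a synthetic-only estimate using past real-vs-synthetic gaps.

    Assumption: the absolute gaps on historical tasks are exchangeable
    with the unknown gap on the new task, so a conformal quantile of
    the historical gaps gives a margin with roughly 1 - alpha coverage.
    """
    gaps = sorted(abs(r - s) for r, s in zip(historical_real, historical_synth))
    n = len(gaps)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few historical tasks to support this confidence level.
        return (-math.inf, math.inf)
    margin = gaps[k - 1]
    return (new_synth - margin, new_synth + margin)
```

The interval is centered on the synthetic estimate and is only as tight as the historical discrepancies allow. With few past tasks or a strict `alpha`, it becomes uninformative rather than overconfident.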

This paper establishes that Group Relative Policy Optimization (GRPO), while appearing to use only final outcome rewards, inherently functions as a Process Reward Model (PRM) through its implicit sub-trajectory credit assignment. By analyzing groups of trajectories that share identical prefixes, the authors prove that GRPO naturally computes step-level rewards using a Monte Carlo approach. However, this hidden structure reveals a flaw where imbalanced step frequencies can skew advantages, inadvertently suppressing high-reward paths and hindering efficient model training. To fix this, the researchers introduce $\lambda$-GRPO, a modified objective that scales token-level losses to neutralize these frequency imbalances. Empirical testing shows that $\lambda$-GRPO enables Large Language Models to achieve superior reasoning performance significantly faster than the standard algorithm. Ultimately, the work demonstrates that the built-in PRM structure of GRPO can be optimized to boost efficiency without the need for expensive, manual step-level annotations.
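
To make the shared-prefix observation concrete, here is a minimal Python sketch, not the authors' implementation. It computes the usual group-relative advantage from outcome rewards, then shows how averaging those rewards over trajectories with a common prefix yields a Monte Carlo estimate for each intermediate step. The function names are hypothetical, and the use of the population standard deviation is an assumption. The $\lambda$-GRPO rescaling is not shown, because its exact form is not given in this summary.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each outcome reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

def prefix_value_estimates(trajectories, rewards):
    """Monte Carlo value of every prefix seen in the group.

    Each trajectory is a sequence of hashable steps. A prefix's value is
    the mean outcome reward of all group members that start with it.
    """
    by_prefix = defaultdict(list)
    for steps, reward in zip(trajectories, rewards):
        for i in range(1, len(steps) + 1):
            by_prefix[tuple(steps[:i])].append(reward)
    return {prefix: mean(rs) for prefix, rs in by_prefix.items()}

# Toy group: two trajectories share the first step "a".
group = [("a", "b"), ("a", "c"), ("d", "e")]
outcome_rewards = [1.0, 0.0, 0.0]
advantages = group_relative_advantages(outcome_rewards)
step_values = prefix_value_estimates(group, outcome_rewards)
```

In this toy group the prefix `("a",)` appears in two trajectories and `("d",)` in only one, which is the kind of frequency imbalance the summary says can skew advantages.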

This paper explores how AI agents inherit and potentially amplify human heterogeneity when tasked with negotiating on behalf of individuals. By comparing agentic interactions to a human-to-human benchmark, the study reveals that instructional prompts act as carriers for the principal's personality, biases, and demographic traits. Remarkably, delegating decisions to machines leads to a greater dispersion of outcomes and a breakdown of traditional fairness norms, such as the 50/50 split. The authors introduce the concept of "machine fluency"—the unique skill of effectively aligning an AI's behavior with one’s own goals—as a new source of economic inequality. These findings suggest that the agentic economy will not be a standardized marketplace, but rather one shaped by specification hazards and the latent characteristics of the humans who design the agents. Ultimately, the transition to AI mediation appears to transform and intensify existing social disparities rather than eliminating them.