
Hosted by Enoch H. Kang · EN

This paper introduces Fully Unsupervised Score Ensembling (FUSE), a novel framework designed to improve the accuracy of large language model (LLM) outputs without requiring human-labeled data. By aggregating scores from multiple imperfect verifiers, FUSE identifies the most reliable responses at inference time, an approach known as test-time scaling. The method addresses the limitations of traditional ensembling by mathematically adjusting for statistical dependencies between verifiers that typically hinder unsupervised performance. Experimental results demonstrate that FUSE frequently matches or exceeds the performance of semi-supervised models that have access to ground truth labels. This effectiveness is validated across diverse benchmarks, ranging from academic datasets like MMLU to highly difficult math and logic exams. Ultimately, FUSE offers a scalable, cost-effective solution for filtering synthetic data and enhancing model reliability in complex reasoning tasks.
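The paper's exact estimator is not reproduced in the summary, but the core idea, combining standardized verifier scores while down-weighting statistically dependent verifiers to pick a best-of-N response, can be sketched roughly as follows. The function names and the simple correlation-based weighting are illustrative assumptions, not FUSE's actual adjustment:

```python
import statistics

def zscore(xs):
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs) or 1.0  # guard against constant scores
    return [(x - mu) / sd for x in xs]

def corr(a, b):
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

def fuse_select(scores):
    """scores[v][i]: score from verifier v for candidate response i.
    Pick the candidate with the highest dependence-adjusted combined score."""
    m, n = len(scores), len(scores[0])
    z = [zscore(v) for v in scores]
    weights = []
    for v in range(m):
        # Average correlation with the other verifiers; redundant
        # (highly correlated) verifiers get smaller weights.
        dep = sum(corr(scores[v], scores[u]) for u in range(m) if u != v) / max(m - 1, 1)
        weights.append(1.0 / (1.0 + max(dep, 0.0)))
    combined = [sum(weights[v] * z[v][i] for v in range(m)) for i in range(n)]
    return max(range(n), key=combined.__getitem__)
```

In this toy setup, two of the three verifiers are duplicates, so their shared vote counts for less than a fully independent verifier's would.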

This paper introduces EVOLM, an innovative framework for self-evolving language models that improves performance without relying on human annotations or external teacher models. By transforming a model’s internal knowledge into explicit natural-language rubrics, the system creates an autonomous feedback loop where evaluation and generation capabilities improve in tandem. This method utilizes variational inference to optimize rubric generators, rewarding criteria that successfully help a small, frozen judge distinguish between superior and inferior responses. Experimental results demonstrate that EVOLM outperforms established baselines, including GPT-4.1, by shifting from abstract judgments to verifiable, instance-specific criteria. Ultimately, the research shows that structuring evaluative capacity into co-evolving rubrics allows models to surpass the limitations of static external supervision.

This paper establishes a theoretical framework for personalized alignment in large language models, specifically identifying the conditions necessary for a model to efficiently adapt to diverse user preferences. The author characterizes a fundamental decision-relevant user diversity condition, which asserts that a population of users must be sufficiently varied to expose all latent reward directions that could impact optimal model responses. When this condition is met, simple greedy algorithms achieve optimal performance rates, specifically bounded online regret and logarithmic offline sample complexity. Conversely, if user diversity is lacking, any learner will inevitably suffer from higher regret and statistical inefficiency. These theoretical findings are supported by simulation experiments using Bradley-Terry preference models, which demonstrate that personalized rewards can be identified during an initial learning phase. Ultimately, the research identifies user diversity as the primary driver of personalized identifiability, resolving conflicting empirical reports regarding the efficacy of personalized versus non-personalized alignment methods.
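The simulations rely on the standard Bradley-Terry preference model, in which the probability that one response beats another is a logistic function of their latent reward gap. A minimal sketch of that feedback process (the toy rewards and sample count are illustrative, not taken from the paper):

```python
import math
import random

def bt_prob(r_a, r_b):
    """Bradley-Terry: probability that response a is preferred over b,
    given latent (possibly user-specific) rewards r_a and r_b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Simulated preference feedback: with a large reward gap, the higher-reward
# response should win in almost every pairwise comparison.
rng = random.Random(0)
wins = sum(rng.random() < bt_prob(3.0, 0.0) for _ in range(1000))
```

A learner observing such comparisons across sufficiently diverse users can, per the paper's diversity condition, recover all reward directions that matter for optimal responses.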

This paper introduces Off-Policy Generative Policy Optimization (OGPO), a novel reinforcement learning algorithm designed to efficiently fine-tune generative control policies (GCPs) for complex robotic tasks. By viewing action generation as a denoising MDP nested within the environmental process, the method utilizes off-policy critics as terminal rewards to optimize the full generative process without expensive backpropagation. This approach bridges the gap between sample efficiency and expressive performance, outperforming existing techniques like residual learning or simple policy steering. Enhanced versions, such as OGPO+ and OGPO+CA, incorporate success-based regularization and conservative advantages to mitigate critic over-exploitation and performance dips during the transition from offline to online learning. Ultimately, the research demonstrates that OGPO can successfully fine-tune poorly-initialized models to near-perfect success rates in contact-rich manipulation environments, even when expert data is unavailable during the online phase.
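The nesting of the denoising MDP inside the environment step can be sketched as a rollout in which the only reward is the off-policy critic's value of the final denoised action; everything here (function names, the toy denoiser and critic) is a hypothetical illustration of that structure, not the paper's implementation:

```python
def denoise_rollout(policy_step, critic_q, state, noise, k_steps):
    """Unroll the K-step denoising chain that produces one environment
    action, then score the final action with an off-policy critic. That
    scalar acts as the terminal reward for the whole inner (denoising)
    MDP, so no backpropagation through the full chain is required."""
    action = noise
    inner_transitions = []
    for k in range(k_steps):
        action = policy_step(state, action, k)
        inner_transitions.append((state, k, action))
    return inner_transitions, critic_q(state, action)

# Toy instantiation: a denoiser that halves the action toward zero, and a
# critic that prefers actions near zero.
halve = lambda s, a, k: a * 0.5
q = lambda s, a: -abs(a)
traj, terminal_reward = denoise_rollout(halve, q, 0.0, 8.0, 3)
```

The inner transitions would then be optimized with a policy-gradient update against the shared terminal reward.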

This paper details a novel Bayesian adaptive querying framework that utilizes AI personas to learn user-specific information within limited question budgets. Traditional methods like Computerized Adaptive Testing often struggle with high-dimensional data or "cold-start" scenarios where little is known about a new user or item. This research addresses these gaps by using large language models (LLMs) to generate a dictionary of diverse personas, each with unique response distributions that serve as principled Bayesian priors. By representing a user as a member of this persona dictionary, the system can perform closed-form posterior updates and efficient predictions without expensive computational approximations. Experiments on WorldValuesBench and synthetic data demonstrate that this persona-based approach provides more accurate and interpretable results than classical models. Ultimately, the framework offers a scalable, end-to-end recipe for interactive systems to understand user preferences and behaviors more effectively.
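Because the persona dictionary is finite, the Bayesian machinery stays exact: the posterior over personas is a simple renormalized product of prior and likelihood, and predictions average over that posterior. A minimal sketch (persona names and probabilities are illustrative):

```python
def update_posterior(prior, likelihoods):
    """Exact closed-form Bayes update over a finite persona dictionary.
    prior: {persona: probability}; likelihoods: {persona: P(answer | persona)}."""
    unnorm = {p: prior[p] * likelihoods[p] for p in prior}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

def posterior_predictive(posterior, answer_probs):
    """Probability of a future answer, averaging over the persona posterior."""
    return sum(posterior[p] * answer_probs[p] for p in posterior)

# One observed answer that persona "a" explains far better than "b".
post = update_posterior({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1})
```

Each user response tightens the posterior without any sampling or variational approximation, which is what makes the per-question updates cheap.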

This research paper evaluates the efficacy of Large Language Models (LLMs) in the field of time series forecasting (TSF) through a massive empirical study. While previous scholars argued that LLMs offer minimal benefits over standard models, this study utilizes 8 billion observations to prove that LLMs significantly enhance cross-domain generalization and predictive accuracy. The authors identify that pre-alignment strategies, which map numerical data to word embeddings, generally outperform post-alignment fine-tuning. Their analysis reveals that LLMs are particularly powerful when dealing with distribution shifts and complex temporal dynamics rather than simple seasonal patterns. Furthermore, the paper introduces a routing mechanism to show that models adaptively choose when to utilize LLM logic based on data complexity. Ultimately, the findings provide a framework for using pretrained world knowledge to improve forecasting across diverse real-world scenarios.

This research addresses out-of-distribution generalization by proposing a shift from traditional causal invariance to explicit environment modeling. While standard methods attempt to discard all environment-dependent information, this paper argues that such features can be predictive when the environment directly influences the target. The authors introduce neural generalized random-intercept models, which capture shared structures across settings while accounting for environment-specific variation through marginalization. This framework minimizes environment-average risk, ensuring robust predictions in entirely new contexts. Theoretical analysis and empirical tests on datasets like Colored MNIST and Camelyon17 demonstrate that this approach consistently outperforms invariance-seeking techniques. Ultimately, the work shows that marginalizing environment effects preserves more useful information than attempting to force absolute representation stability.
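The marginalization step at prediction time can be sketched in one line: instead of forcing an environment-invariant output, average the per-environment predictions over the (learned) distribution of environment intercepts. This is a toy binary-classification sketch under assumed intercepts, not the paper's neural model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def env_marginal_predict(shared_logit, env_intercepts):
    """Marginalize over environment-specific intercepts: average the
    per-environment predictions rather than discarding the
    environment-dependent signal altogether."""
    return sum(sigmoid(shared_logit + b) for b in env_intercepts) / len(env_intercepts)
```

With intercepts symmetric around zero the environment effects cancel on average, while a skewed intercept distribution would correctly shift the prediction in a new, unseen environment.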

This research paper introduces Magentic Marketplace, an open-source simulation designed to study the economic behaviors of autonomous LLM agents. The environment facilitates a complete transaction lifecycle where Assistant agents representing consumers interact with Service agents representing businesses to discover, negotiate, and purchase services. While frontier AI models can approximate optimal market welfare under ideal search conditions, their performance often suffers as the number of choices increases, revealing a paradox of choice where more options lead to poorer decisions. The study also identifies critical vulnerabilities in these systems, such as a first-proposal bias that prioritizes speed over quality and susceptibility to manipulation tactics like prompt injection. Ultimately, the authors provide a framework for evaluating how agentic markets can be designed to ensure efficiency, fairness, and security in real-world applications.

Researchers from MIT have introduced Hyperloop Transformers, a novel architecture designed to significantly reduce the memory footprint of large language models for edge and on-device deployment. This model leverages looped Transformer layers that reuse parameters across the model's depth, specifically by organizing layers into three blocks where only the middle section repeats. To overcome the performance limitations typically found in recurrent architectures, the authors integrate hyper-connections that expand the residual stream into a matrix-valued format. This modification allows for more flexible internal representations and improved data flow without incurring substantial computational overhead. Empirical tests demonstrate that Hyperloop Transformers outperform traditional, depth-matched models while utilizing approximately 50% fewer parameters. Furthermore, the architecture maintains its efficiency through post-training quantization, making it a highly attractive option for memory-constrained environments.
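The parameter savings follow directly from the looping arithmetic: only the middle block's weights are stored once but applied several times. A back-of-the-envelope accounting sketch (block sizes and loop count are illustrative, chosen to reproduce the roughly 50% figure, not taken from the paper):

```python
def looped_layer_budget(pre, mid, post, loops, params_per_layer):
    """Effective depth vs. stored parameters when only the middle block's
    weights are reused `loops` times across the model's depth."""
    effective_depth = pre + mid * loops + post
    stored = (pre + mid + post) * params_per_layer
    depth_matched = effective_depth * params_per_layer  # baseline with unique layers
    return effective_depth, stored, depth_matched

# 2 entry layers + a 4-layer middle block looped 3x + 2 exit layers
# behaves like a 16-layer model while storing only 8 layers' weights.
depth, stored, baseline = looped_layer_budget(2, 4, 2, 3, 1_000_000)
```

The hyper-connections add only a small matrix-valued residual state on top of this, so the stored-parameter ratio stays close to the pure looping arithmetic.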

This paper discusses Self-Guided Self-Play (SGS), a new algorithm designed to improve the reasoning capabilities of large language models through autonomous problem generation. Standard self-play often hits a performance plateau because the Conjecturer model eventually creates low-quality or "hacked" problems that do not facilitate real learning for the Solver. To solve this, SGS adds a Guide role that evaluates synthetic tasks for elegance and relevance to target goals, ensuring the training data remains high-quality over hundreds of rounds. This three-part system of Solver, Conjecturer, and Guide allows models to sustain improvement for significantly longer periods than previous methods. Testing on formal mathematical theorem proving in Lean4 shows that a 7B parameter model using SGS can eventually outperform much larger models. The research emphasizes that managing model entropy and providing structured guidance are essential for scaling reinforcement learning effectively.
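The three-role interaction can be sketched as a single training round in which the Guide acts as a filter between problem generation and solving. The role implementations below are stub callables standing in for the Conjecturer, Guide, and Solver models, not the paper's actual training loop:

```python
def self_guided_round(conjecture, guide_ok, solve, n_tasks):
    """One SGS-style round: the Conjecturer proposes synthetic tasks, the
    Guide filters out degenerate or off-target ones, and the Solver trains
    only on the survivors. All three roles are stub callables here."""
    proposed = [conjecture(i) for i in range(n_tasks)]
    kept = [t for t in proposed if guide_ok(t)]
    return [solve(t) for t in kept], len(proposed) - len(kept)

# Toy roles: the Guide rejects odd-numbered tasks as a stand-in for
# "inelegant or hacked" problems that would not teach the Solver anything.
solved, rejected = self_guided_round(
    conjecture=lambda i: i,
    guide_ok=lambda t: t % 2 == 0,
    solve=lambda t: t * t,
    n_tasks=6,
)
```

Repeating such rounds while the Guide's standards track the target goals is what keeps the synthetic curriculum from collapsing over hundreds of iterations.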