Latent Space: The AI Engineer Podcast
Episode: [AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Date: May 23, 2025
Guests:
- Alessio (Partner & CTO at Decibel, co-host)
- Swyx (Founder of Smol AI, co-host)
- Will Brown (Reasoning Research Lead, Prime Intellect)
Episode Overview
This episode features Will Brown, a leading voice in agentic reinforcement learning (RL) and reasoning models, previewing his upcoming talk at the AI Engineer World's Fair. The discussion centers on Prime Intellect's latest research into multi-turn RL for LLM-powered agents capable of multi-hour autonomy, and on the broader shift from reasoning models to practical agency. It also covers the technical and safety controversies surrounding new foundation models such as Anthropic's Claude 4, and the challenges of credit assignment, tool use, and reward frameworks for training reliable agentic LLMs.
Key Discussion Points and Insights
1. The Claude 4 Release and the Shift to Practical Agents (00:00–06:29)
- After the Claude 4 keynote, the hosts and Will reflect on how the field is moving from pure reasoning to emphasizing agency—the ability of models to act autonomously in extended, multi-turn tasks.
- Will Brown: “The thing that's going to make the next wave of stuff be powerful is just like, everyone wants better agents, everyone wants models that can like go off and do stuff. And like reasoning was kind of like a precursor to that...” (02:23)
- The Claude 4 release emphasized multi-turn tool use and function calling, de-emphasizing "reasoning" as a standalone marketing term and folding it into practical problem-solving for agents.
Notable Quote:
"Reasoners are a step on the path towards agents. ... What people care about more for actual applications is practical agents."
– Will Brown (02:23)
2. Evaluating Model Improvements: Trustworthiness, Reward Hacking, and Token Budgets (06:29–13:18)
- Discussion of how models like Claude 4 show incremental, “linear” progress—improving trust for programming and minimizing reward hacking (e.g., generating unnecessary code).
- Will Brown on Sonnet 3.7: “You ask it a coding question and it would, like, do your question and then seven other things also. ... Presumably because there was some RL environment where there wasn't really a penalty for doing that.” (07:14)
- On "thinking budgets": developers are using explicit token or reasoning budgets, now a standard API feature, to constrain agents' verbosity and manage costs (see the sketch at the end of this section).
Notable Quote:
"You want the models to do the thing and no more. ... The reward hacking issue seems to ... have gone down for both Sonnet and for Opus."
– Will Brown (07:37)
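As a rough illustration of how such budgets surface in an API (the model identifier and budget values below are assumptions for the sketch, not details from the episode), Anthropic's extended-thinking parameter caps hidden reasoning tokens separately from the visible answer:

```python
# Minimal sketch: capping an agent's "thinking" spend via an explicit token budget.
# Model name and budget values are illustrative assumptions, not from the episode.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumed model identifier
    max_tokens=2048,                    # hard cap on the visible answer
    thinking={"type": "enabled", "budget_tokens": 1024},  # cap on hidden reasoning tokens
    messages=[{"role": "user", "content": "Fix the failing test in utils.py, and nothing else."}],
)
print(response.content)
```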
3. Technical Deep Dive: RL for Agentic LLMs, Format Rewards, and Tool Use (13:18–34:59)
Safe Tool Use & Environment Design
- Will highlights the complexities of giving LLM agents broad tool access (e.g., terminals) and the RL problem of unbounded action spaces; a hedged illustration of bounding that action space follows.
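One simple way to bound a terminal tool's action space (the allowlisted commands and helper below are assumptions for the sketch, not anything described in the episode) is to refuse everything outside a small allowlist:

```python
# Sketch: bounding an agent's terminal tool to an explicit allowlist so the
# RL action space stays finite and auditable. Commands listed are illustrative.
import shlex
import subprocess

ALLOWED_COMMANDS = {"ls", "cat", "grep", "python", "pytest"}

def run_terminal_tool(command: str, timeout: int = 30) -> str:
    """Execute a shell command only if its program is on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return f"refused: '{argv[0] if argv else ''}' is not an allowed command"
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr
```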
RL Environments in the Multi-Agent and Multi-Turn Setting
- On multi-agent systems: compared to simple environments (like video games), giving LLMs open-ended text or tool interactions amplifies the challenge of defining rewards and keeping training stable.
- Will Brown: "Is this going to be a stable system or not? ... If you want to make AIs do this, you have to translate this into code math." (19:23)
- RL for LLMs now increasingly treats a whole "turn" (rather than a single token) as the action, which maps naturally onto dialog and unit-of-work cycles; a minimal sketch of this abstraction follows.
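A minimal sketch of the turn-as-action abstraction, with a toy policy/environment interface standing in for any real framework (nothing here is Prime Intellect code):

```python
# Sketch of the "turn as action" abstraction: the RL action is a whole
# assistant turn (text plus any tool calls), not an individual token.
# The policy and env passed in are stand-ins, not a specific library's API.
from dataclasses import dataclass, field

@dataclass
class Turn:
    assistant_message: str                       # full model output for this turn
    tool_calls: list = field(default_factory=list)

def rollout(policy, env, max_turns: int = 8):
    """Collect one multi-turn trajectory; each step is a full turn, not a token."""
    messages = env.reset()                       # initial user prompt / task state
    trajectory = []
    for _ in range(max_turns):
        turn = policy(messages)                  # sample a complete assistant Turn
        messages, reward, done = env.step(turn)  # tool results / user reply come back
        trajectory.append((turn, reward))
        if done:
            break
    return trajectory
```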
Multi-Turn RL, GRPO, and Turn-Level Credit Assignment
- Will previews the Prime Intellect paper on multi-turn RL with GRPO, showing that models must be actively incentivized—by the training reward, not just supervised tokens or format instructions—to use tools or “think” methods effectively over long tasks.
- Models "reward hack" by faking tool use unless trained carefully, which requires fine-grained credit assignment (e.g., did the tool call actually contribute to the final answer?).
- Will's work extends the GRPO approach with turn-level intermediate evaluations (e.g., string matching or LLM-based judges that verify whether tool outputs, such as search results, were actually useful); see the sketch after this list.
- Will Brown: “Once you have a way to do intermediate evaluation, if you can evaluate like the quality of an intermediary state, now you can ... take this into account [for RL reward].” (31:25)
- Shift from traditional deterministic, rule-based rewards (which break down in less-structured tasks) to LLM-as-judge or reward model-based frameworks. This unlocks agentic reasoning over longer, less formulaic tasks.
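A minimal sketch of how turn-level intermediate scores could fold into a GRPO-style group-normalized advantage. The additive blend and its weighting are assumptions for illustration, not the paper's exact formulation:

```python
# Sketch: GRPO-style group-normalized advantages, extended with turn-level
# intermediate rewards (e.g., a judge scoring whether each tool call helped).
# The additive blend below is an illustrative assumption, not the paper's formula.
import numpy as np

def grpo_advantages(outcome_rewards, turn_rewards, alpha=0.5, eps=1e-6):
    """
    outcome_rewards: shape (G,)   final-answer reward per rollout in the group
    turn_rewards:    shape (G, T) per-turn scores (e.g., "was this tool call useful?")
    Returns per-rollout advantages, normalized within the group as in GRPO.
    """
    outcome_rewards = np.asarray(outcome_rewards, dtype=float)
    turn_rewards = np.asarray(turn_rewards, dtype=float)

    # Blend the final outcome with the mean of the intermediate turn-level scores.
    blended = outcome_rewards + alpha * turn_rewards.mean(axis=1)

    # GRPO: advantage is the group-relative, standardized reward.
    return (blended - blended.mean()) / (blended.std() + eps)

# Toy group of 4 rollouts, 3 turns each: rollout index 2 got the answer right
# and its tool calls were judged useful, so it earns the largest advantage.
adv = grpo_advantages(
    outcome_rewards=[0.0, 0.0, 1.0, 0.0],
    turn_rewards=[[0, 0, 1], [1, 0, 0], [1, 1, 1], [0, 0, 0]],
)
print(adv)
```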
Notable Quotes:
"RL is like one math language that exposes these primitives ... But it's this n-body problem where you freeze it and look at it—how does one thing moving affect everything else?"
– Will Brown (20:23)
"For tool use, [credit assignment] is that ... did the tool result in information? ... The framework is more general... Once you have a way to do intermediate evaluation ... you can rewrite the GRPO advantage calculation to take this into account."
– Will Brown (31:09)
4. Controversies and Safety: The "Snitching" Controversy around Claude 4 (13:18–22:19)
- The hosts and Will address the social media uproar over Claude 4’s reported “snitching” on users by flagging dangerous requests, attributing this to Anthropic’s rigorous stress-testing and safety reporting.
- Will Brown: “You kind of have to pick a goal and ... maybe the right answer is the model just defers and like, nope, I'm gonna stop talking. ... There's no way to kind of win and make everybody happy.” (16:42)
- The distinction between adversarial red-teaming environments and typical user interactions is emphasized.
5. Model Evaluation, Academic Research, and the Future of Evals (23:27–27:54)
- A quick aside on model-evaluation labs (e.g., EleutherAI, Model Alliance) and the funding/ethics challenge when "the labs" are also the customers.
- Will predicts that academia will remain the best source of unbiased, creative evaluations—urging grad students to focus on high-leverage, low-cost evaluation research rather than chasing compute-intensive tasks.
- Will Brown: “We are churning through evals ... We always need more. ... [It’s] the task of translating vibes of what is good or bad ... into ... very precise scientific questions.” (25:19)
6. Reflections on Research Taste and Foresight (26:15–28:29)
- Will urges researchers to “think far ahead”—to make educated bets on how AI systems and their challenges will evolve, instead of simply optimizing for immediate, obvious milestones.
- The convergence of RL and agents was a “safe bet” years ago, now vindicated by recent trends.
7. Will Brown’s Current and Future Work (28:29–39:41)
- Will describes ongoing projects at Prime Intellect, including:
- The “verifiers” repo with upcoming major updates.
- New work incorporating LLMs as flexible, intermediate reward judges in RL loops.
- Announcement of a new course and collaboration with Kyle Corbitt (OpenPipe) on agentic RL education.
- Will invites listeners to his upcoming AI Engineer World's Fair talk and encourages them to explore agentic RL in practical settings.
Notable Quotes & Memorable Moments
- On Progress in LLMs: "Linear progress which is great, but ... there's not anything ... that feels like a paradigm shift in terms of ... complexity of agents." – Will Brown (06:51)
- On the Limits of Deterministic Rewards: "Deterministic rewards are nice if you can get them to work, but also really painful ... for math the easiest is when the final answer is an integer and lives in the same spot ... as you go to ... more flexible [tasks], deterministic ... rewards start to break down." (35:06) A sketch contrasting the two reward styles follows this list.
- On Academic Research: "You want to think about ... making educated bets about what the world looks like in years... you want to be jumping ahead of the curve." (26:15)
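To make the deterministic-vs-judge tradeoff concrete, here is a hedged sketch: an exact-match reward that only works when the final answer sits in a predictable slot, next to a judge-based reward where `ask_judge` is a hypothetical callable wrapping any LLM:

```python
# Sketch contrasting a deterministic reward (exact match in a fixed slot)
# with a judge-based reward for open-ended answers. The judge prompt and the
# `ask_judge` helper are hypothetical placeholders, not a specific library API.

def exact_match_reward(completion: str, answer: str) -> float:
    """Works well when the final answer is a single value in a predictable spot."""
    predicted = completion.split("####")[-1].strip()   # e.g., GSM8K-style "#### 42"
    return 1.0 if predicted == answer.strip() else 0.0

def judge_reward(completion: str, question: str, ask_judge) -> float:
    """For flexible tasks, fall back to an LLM judge scoring the answer from 0 to 1."""
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer: {completion}\n"
        "Score the answer from 0 to 1 for correctness. Reply with only the number."
    )
    return float(ask_judge(prompt))
```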
Important Timestamps/Segments
- 02:23 – Will Brown on the path from reasoners to agents.
- 07:14 – Reward hacking and trustworthiness in LLM code generation.
- 13:27 – Claude 4 “snitching” controversy and safety frameworks.
- 19:23 – RL environments for agentic LLMs: multi-agent and stability.
- 28:29 – The story behind the latest Prime Intellect multi-turn RL paper.
- 31:25 – The innovation: intermediate, turn-level reward/credit assignment.
- 35:06 – Deterministic vs. model-based reward evaluation.
- 38:47 – Will’s AI Engineer World’s Fair talk & educational plans.
Closing
Will Brown leaves us with a preview of the future: RL-driven agents, longer-horizon evaluation, model-based rewards, and a steady move toward practical, reliable agentic LLMs with nuanced safety controls and robust evaluation frameworks.
Next up: Catch Will’s talk at the AI Engineer World’s Fair (June 4, San Francisco), and stay tuned for course offerings and Prime Intellect’s open-source tools for agentic RL.