![[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect — Latent Space: The AI Engineer Podcast cover](https://substackcdn.com/feed/podcast/1084089/post/186632787/86bb0f264bc4b333f8a90e3bf505073b.jpg)
A
Hello AI engineers. We're back with a quick reaction pod for Claude 4 with the new reasoning research lead for Prime Intellect, Will Brown. Will Brown's talk at AIE NYC and open source work on verifiers have made him one of the most prominent voices able to publicly discuss the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper on reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment, and he has previewed his upcoming AI Engineer World's Fair talk on agentic RL, linked in the show notes. We're excited to share that Will will be back at the upcoming AI Engineer World's Fair in San Francisco, which now has Expo tickets on sale. He will be headlining the new RL plus Reasoning track with Misha Laskin, Nathan Lambert, Christian Szegedy, Greg Kamradt, Kyle Corbitt and more. Join us at AI Engineer. Watch out and take care.
B
Hey everyone. Welcome to a Lightning plus Emergency News Latent Space podcast episode. I'm Alessio, partner and CTO at Decibel, and I'm joined by my co-host swyx, founder of Smol AI.
C
Hey.
D
Hey. And yeah, honestly, we knew that Claude 4 was coming and we were just too busy to have a dedicated episode. So this is our makeup dedicated episode with a special guest, Will Brown, from, now I can say it, Prime Intellect.
C
How's it going? Great to be on, and so excited. We've known each other for a little bit, and this is my first time on the podcast, I believe. Great to chat with you guys. Big news day, I guess. Lots of stuff out in the world. There's always a news day.
D
I think this week is particularly heavy for some weird reason. Like Monday was Microsoft Build, Tuesday, Wednesday Google and then today is Claude. I wonder what tomorrow will bring.
C
We had I/O and then we had io and then.
D
Yeah, yeah, different I/Os, exactly. Yeah. So we actually were supposed to record this morning and we all wanted to watch the Claude keynote, so we went and watched the Claude keynote. Obviously a good model, you know, good model, big model. They're really emphasizing coding. They didn't really talk much about reasoning, to be super honest. They were just like, it runs for longer now. What are your takes?
C
Yeah, so I mean, one thing I've kind of been seeing coming for a little bit, that I think people are also all aware of now, is that the thing that's going to make the next wave of stuff powerful is just that everyone wants better agents, everyone wants models that can go off and do stuff. And reasoning was kind of a precursor to that a little bit. I always think of OpenAI's five levels framework, where Chatbots was the RLHF era and then Reasoners was o1 and R1. But really what people were thinking was that reasoners are a step on the path towards agents. And so I can kind of see why Claude, Anthropic, is not like, oh, we have the best reasoner. They're really showing off their SWE agent and tool use and function calling benchmarks, multi-turn stuff. Because I think that's really what people care about more for actual applications, as opposed to, like, did really good on this math competition. The math competition stuff was all a signal that was supposed to make us think we were getting somewhere. But the thing we were getting towards, for a lot of people at least, is practical agents.
D
Alessio?
C
Yeah, the.
B
I think the extended thinking mode, I think they removed the uppercase. I think in the Claude 3 release it was like Extended Thinking, kind of capitalized, and now it's just extended thinking with tool use. So I think they're also, yeah, downplaying whether it's reasoning or not. I think they're trying to merge everything together. I mean, I didn't realize that, but the way they worded it, extended thinking could not use tools before, and now it can in Opus 4.
C
So that's great.
B
But yeah, they haven't put it as far center as last time.
D
Do we have any... This is already veering off from Claude directly into speculation. But do we have any idea if there are any material differences between how Claude extended thinking works versus, like, the o-series models? Do we know?
C
The biggest difference seems to be, at least, and this is all speculation, of course, that from the start Anthropic had always kind of had this little thinking thing, where even Claude 3.5 would sometimes do a tiny bit of thinking. And it was really just deciding which tool to use for the most part. Like if it was doing an artifact in the Claude UI, it would have this little thing where it would think for like two sentences about which tool to use. And it seemed like Anthropic's attitude has been that extended thinking is an instance of tool use, and that it's the kind of thing you want to equip the model with the ability to do. But it's not like, oh, it's a thinking model. It's just a sink for the model to brain vomit, because that brain vomiting will help it find a nice thing to do next, in the same way that doing search or doing code execution are ways to get more information on the path towards finishing a problem.
D
Yeah, inference-time compute, as they say. I did meet somebody who claimed to have come up with the scratchpad paper, and this was obviously before the Jason Wei chain-of-thought paper. But it's all the same general family of techniques. I think the question for me is also, is there some model routing going on? Are the thinking and non-thinking ones different models, or are they the same model where you just turn off the end-of-turn token generation?
C
I mean, I think these should be the same model, and Anthropic knows what they're doing. It's not that hard. Qwen did it in a very simple way and they talked about how they did it a little bit. It's not too difficult to have whether or not a model thinks be a controllable thing. I mean, obviously all this stuff is hard at serious scale, but conceptually at least it's not a big problem of, how would you ever do it? It's like, no, we have reinforcement learning, or we just SFT on different things. We can teach models skills like that pretty quickly.
D
Yeah, you have some work that you've published recently on GRPO, and you're doing a lot of work on multi-turn RL. I think I wanted to just kind of round out any other Claude highlights, you know, that you guys...
C
Sure. Yeah, there is a.
D
There is controversy that I'm leaving towards the end. But like any other technical highlights that you guys want to focus on?
C
I mean, I think it seems like a really cool model, but, I think Calum tweeted this earlier today, it seems like it's linear progress, which is great, but there's not anything that I've seen from it that feels like a paradigm shift in terms of the sorts of stuff Dario talks about. I think maybe we're still on the path to get there, and it feels like this is just going up in terms of complexity of agents. The one thing that to me was really nice to see, and I haven't done too much testing myself yet, is the reward hacking issue in their reported benchmarks. Sonnet 3.7 loves to do stuff that to me feels like reward hacking, in the sense that you ask it a coding question and it would do your question and then seven other things also. Presumably because there was some RL environment where there wasn't really a penalty for doing that, or there wasn't enough penalty, and covering its bases was more likely to pass test cases on some coding thing. You could imagine a SWE-bench kind of thing where there's a minimal diff that is really what you want, but you could do a ton of other stuff and put all these other things in place, and as long as it's not enough that you trip over your feet, it's just extra stuff that's there if it helps pass the test cases. And what I really think you want to do with these models is kind of min-max. You want the models to do the thing and no more. And they had some internal benchmark for this that went from like 45% down to 15% for both Sonnet and Opus, as opposed to 3.7. And so I'm hopeful that these models are much more friendly to code with and maybe more trustworthy. That's a thing I kind of have buckets for with models: how much can I trust them in a code base? Especially something beyond a single file.
Like, old Gemini to me was very trustworthy. GPT-4.1 is very trustworthy. New Gemini is not. 3.7 Sonnet is not. o3 is not. I haven't decided which bucket the new Opus is going to fall into.
D
Trustworthy in terms of reward hacking just.
C
Like, they're going to do the right thing in the code base. And worst case, they'll do it in a dumb way. But they're not gonna go break a bunch of stuff. They're not gonna leave a bunch of extraneous comments and helper functions all over the place that aren't really needed, or make seven new files just to have them there. This is the sort of thing that 3.7 does a lot.
D
Yeah. Like, I already have the function in my code base, and it would just make a new one just because it felt like it. Yeah. One thing I often wonder about, for RL environments in general, is why isn't token cost more of a thing in the penalties? That's the one rule above all. You can actually skip a lot of reward hacking by just saying, hey, the more tokens you use, the worse it is.
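The idea of penalizing token usage can be illustrated with a minimal sketch. Nothing below comes from a specific lab's training setup; the function name `penalized_reward` and the `penalty_per_token` value are hypothetical choices made only for illustration.

```python
def penalized_reward(task_reward: float, num_tokens: int,
                     penalty_per_token: float = 1e-4) -> float:
    """Subtract a cost proportional to the number of generated tokens.

    penalty_per_token is an assumed hyperparameter, not a value from any
    published training recipe.
    """
    return task_reward - penalty_per_token * num_tokens


# Two rollouts that both pass the tests: the shorter one now scores higher,
# so padding the solution with extra, unrequested work is discouraged.
concise = penalized_reward(task_reward=1.0, num_tokens=500)
verbose = penalized_reward(task_reward=1.0, num_tokens=4000)
```

With a flat per-token cost like this, the penalty has to be small relative to the task reward, otherwise the policy may learn to give short wrong answers instead of long correct ones.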
C
I mean, that's not what the model providers want. They're selling you tokens. They want you to get to work. Okay, so there's that element of it. But I think also there was this initial reaction from everybody of, more tokens is better. If you look at the line, it goes up. As you spend more tokens, your accuracy goes up. And so I think the pressure to really tamp down on token usage was not that serious for a lot of people, especially because the companies need to sell you more tokens. But it is the sort of thing that you can have some more controls over. Qwen did this in a kind of abrupt way, where in the UI you can set a token budget and it just truncates the thought. And it seems like artificially truncating the thought is actually fine. Even if it got cut off mid-sentence with an injected think token, these are smart enough models that they can finish with the best that they got up to that point. So that's one way to do it. And that's becoming kind of a standard API feature now, your thinking budget. Claude has that. We also did a little bit of experimentation with that in our last INTELLECT-2 run at Prime Intellect, which was before I joined. Thinking budgets are the kinds of things that you can insert into a reinforcement learning objective, and you can see the model get better at targeting the right amount of thinking based on, let's say, something that goes in your system prompt. You can have the prompt just say, use X amount of tokens. It doesn't need to be, like... But if you've trained the model to respect this, and you execute this correctly, you would hope that the model learns to roughly think the right amount.
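Both ideas mentioned here, hard truncation of the thought and a budget term in the RL objective, can be sketched in a few lines. This is an illustration under stated assumptions, not how Qwen, Anthropic, or Prime Intellect implement it: the `</think>` marker, the function names, and the `weight` value are all hypothetical.

```python
def truncate_thought(thought_tokens: list[str], budget: int,
                     end_think: str = "</think>") -> list[str]:
    """Cut the thought at `budget` tokens and inject an end-of-thinking marker.

    The "</think>" marker is an assumed convention for this sketch.
    """
    return thought_tokens[:budget] + [end_think]


def budget_reward(task_reward: float, used: int, target: int,
                  weight: float = 0.5) -> float:
    """Penalize relative deviation from the requested thinking budget.

    The deviation is capped at 1.0 so the penalty never exceeds `weight`.
    """
    deviation = min(abs(used - target) / max(target, 1), 1.0)
    return task_reward - weight * deviation


# A thought cut off mid-sentence still ends with the marker,
# so the model can go on to produce its answer.
cut = truncate_thought(["so", "the", "answer", "might"], budget=2)

# Hitting the target keeps the full reward; overshooting loses part of it.
on_target = budget_reward(task_reward=1.0, used=1000, target=1000)
overshoot = budget_reward(task_reward=1.0, used=1500, target=1000)
```

The second function treats the budget as a target rather than a cutoff, which is the distinction the conversation turns to next.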
D
Okay. This actually changed my opinion of thinking budgets, because previously I was thinking that reasoning effort was better than thinking budgets. I thought of thinking budgets as kind of a max cutoff.
C
The same thing.
D
It's a target, right?
C
It's not. It's not a. Okay. The effort is a target, probably.
D
Yeah, right, right, right.
C
Yeah.
D
Because I actually want to set effort. I don't super care about cutoff apart from the cost and like giving me, you know, 64 bits of cutoff or whatever doesn't matter.
C
I'm not sure that they're like that different. I think like we don't know how they do it under the hood, but my guess is that the whole reasoning effort thing is essentially a token budget that the model has been like RL'd to. Like, you would hope that you get different behavior. So the model, when it's told it has a short thinking budget, you would hope that it uses slightly different strategies that are better versus if it has a high budget, it's more willing to like do lots of math calculations, for example. But I think conceptually it's really just about the model has some amount of room that it can bank in tokens and yeah, it's trying to do that well, hopefully.
B
Do you think we're going to have these as hyper parameters for like much longer or do you think this is kind of like, you know, as we're early in this, like reasoning models, more of the stuff is exposed and then it gets moved away from the user.
C
I think in chat interfaces it probably won't stick around. I don't think we're always going to have the dropdown of o4-mini and o4-mini-high. That feels silly. I do think it's a thing that developers want, especially because once you've built around a certain model, and a lot of these providers are hoping you stick with the one model and are not switching all the time, you do need a knob to control costs and also latency. And so that is one useful knob to expose to developers for controlling this quality versus cost and latency tradeoff. Awesome.
D
Cool on all of that. I think the elephant in the room, let's talk about it, is this controversy around Opus, or Claude.
C
Right.
D
Snitching on you.
C
Yeah, I mean, so I have a lot of.
D
So for those, for those out of the loop, let's, let's recap because I feel like you're closer to this than I am. Like I learned about it from you.
C
Sure. Yeah. So this was someone from... I'm not going to name him because I know he doesn't want to have all this attention on him. He deleted the tweet, of course. It was essentially going through different things that people found during safety stress testing of Claude. And so this is not, like, what's Claude going to do for you. I think people took this out of context pretty badly. And so there's a fair point there that people are really reading into the one sentence much more than they should. But this is the thing Anthropic does a lot: they really stress test their models. They try to put their models in situations where they can really see what an adversary could get the model to do, or what the model does if it's in a situation where there's no right answer. So I think a lot of the headline Anthropic safety results, especially related to reward hacking and deception and alignment faking, are all things that to me seem like a rock and a hard place situation, where the model is given two objectives that conflict with each other and it has to pick one. And no matter which one it picks, it's going to sound terrible. It's either following the user's instructions or it's following common norms. And once you accept either of those, it's going to do the thing that is aligned with that set of guidelines. So if your model's goal is to be maximally helpful to the user, then it would help a user build a bomb. If a model's goal is to be maximally helpful to society, and a user's asking it to build a bomb, it's gonna be like, no, that's bad, I have to do something to stop this. You kind of have to pick a goal. And maybe the right answer is the model just defers and is like, nope, I'm gonna stop talking. But people also get mad when you tell them that the model will stop talking to you or refuse to do anything.
There's just no way to win and make everybody happy. But I do think they report this because they think it's important to have people understand the safety implications of these models, and to understand, okay, how bad would it be if someone was trying to use this? Could this meaningfully help someone commit crime or violence or whatever? That's what they have their safety framework for. And the things that happen in these blog posts and threads and papers about the model trying these things, they're putting these models in a scenario that elicits these things. It's the sort of thing that you would imagine a very smart human might also do in those situations. Let's say you are told to accomplish some vague, underspecified goal at any cost, and you really want to solve that goal. Think game shows like Survivor, I think, is a good example, or Lord of the Flies. Any of these canonical situations of people who are put in a weird spot and have to go do stuff and figure out how to do it. They're crafting these environments for the models and just looking and seeing what happens. And so I think it is a little silly to overanalyze behaviors in either direction, of like, oh, the model is reporting you to the police, or the model's going to go help you find uranium on the dark web. These models can kind of do anything. The base model of an LLM, in general, is not artificially constrained in any way. With the right prompt, it'll do whatever, up to its intelligence limit. And so the question is just, how do you constrain the space from all possibilities down to a more reasonable set? And that's hard.
D
So, okay, you actually gave a serious answer, which I totally respect. I was sort of looking for shitposts. You're treating this as though, like, yep, this is what the problem actually is, which is totally fine. And yeah, I mean, that's what you are as a researcher, right?
C
Yeah, I mean, I think tweeting is fun. It's cathartic to just kind of get a post out. So when I saw the one about the uranium thing, I was like, let me tweet. The tweet was like, we found that Claude can go search the dark web to look for uranium. And I was like, here are the top 10 things that builders are using in their agentic RAG applications with the new groundbreaking Claude 4. And it was just silly. Both making fun of LinkedIn thread posters, as well as just the funniness of the scenario that they were talking about. Yeah, this is it.
B
Does any of this make you think differently about what tools to give an LLM? You know, I know they deleted the tweet, but it's basically like, well, before, if you're putting all these MCPs, like, yeah, you have email access and all of this. And now it's like, well, maybe I don't want to give email access all the time if you're going to snitch on me with the email access.
C
I mean, I think coding with these models, especially Claude 3.7... I did a fair amount. For a few weeks I was doing a lot of Claude Code with 3.7, mostly for random side projects. I never really got to the point where I found it was helpful for a large existing code base. But if it's like, hey, I want to cook something up in a few hours for fun, it's pretty good at that. But these become messy and hard to maintain, and you get to a point where nothing is working and it's like, I just gotta dig in and fix it all myself. And I think part of that is that the models have access to a terminal, and you can do a lot of stuff in a terminal. MCP is kind of a way of constraining the action space. In canonical RL, people talk about states, actions, rewards, and policies as the moving parts. In old-school RL, models are generally trained with a very fixed action space, like, what are the keys on the video game I can hit? But with LLMs, it's text. Text is kind of unbounded in what you can do with it. In a terminal, there's not much you can't do. And so if you're training models... Oh, I got a lot of flack for this one.
D
I'm just showing this. Wait, flack? Why?
C
It was both people who were like, the notation is stupid and bad, and RL is really simple, or RL is complicated. Everyone has a different opinion on what RL means. And I was just trying to say, hey, it's actually kind of complicated. And I wasn't pitching this as, oh, the definition of an MDP is complicated. I was like, no, there's just a lot of moving parts to think about to do anything, especially if you want to change any part of the system. Here's a hypothetical question: what happens if you have two LLMs learning together? How do you reason about that? Is this going to be a stable system or not a stable system? What if they're kind of cooperative but kind of not cooperative, and they're training to work together but also want to backstab each other? This is the kind of environment people find themselves in all the time out in the real world. But if you want to make AIs do this, you have to translate it into code and math, and the more complex your goals are, the more complex the math gets. RL is one math language that exposes these primitives. But I think a lot of people are like, oh, I can follow the equations, that means I understand it. It's like, well, sure, but it's also this, I don't know, n-body problem thing, where you can freeze it and look at it, but then it's, how does one thing moving affect everything else? And what are the cascading ripple effects? Wow.
D
And you brought the three-body problem into this. Amazing.
C
Like as in like the physics version, not the show.
D
No, no, no. I mean actually how. Actually very like impossible to model. Like I guess you can like simulate it. But like even then like yeah, it's.
C
Sensitive to initial conditions. So you can't really say. This is one of those things, like, why does no one predict the weather a year out? I don't think anyone has anything that's good at long-term weather forecasting beyond, I don't know, climate trends. No one can predict whether it's going to rain in Seattle on a given day a year from now. Even if you think the system's deterministic, it's all clouds bumping off each other and whatnot, and mountain ranges. We kind of know how these things work.
D
So the butterflies are flapping their wings. I mean like, you gotta let it play out like butterflies. If, if we had no butterflies we could predict it.
C
Right. And so it was very sensitive to butterflies.
D
Interesting. Okay, so I guess we can sort of round it out unless there's any more of the controversy. I think there isn't. I think that the system card is actually very good. They probably went too hard on it compared to normal system cards. And it's a little bit confusing whether this is marketing or whether they're just like, no, we really super care about safety. And part of this is Apollo just being Apollo, pushing the frontier of red teaming, right? So they're going to report the things, because it's extremely good marketing for Apollo.
C
Yeah, I think they seem to still be trying to be creative with their consumer marketing. It feels like people in the AI world love Claude, or have grown tired of Claude but still had a phase where they were using it a ton. But it hasn't really broken out to general people in the same way. And a lot of their marketing that I've seen is a little confusing. It feels like they've done a really good job at crafting a brand image that appeals to a segment of the population who has certain considerations, who really like that a model has a deep personality or whatever.
D
People.
C
The sorts of people who I think also really like GPT-4.5. Many of them really loved Claude 3 Opus, the big model smell. But a lot of people just don't care.
D
And I just wanted to use it as a tool.
C
Yeah. They're trying to figure out how to appeal to that audience. The LMSYS, sycophantic 4o crowd, the people who love those models. Different crowd, and it's a larger crowd, and that's a tough problem to solve.
D
What's your quick take on LMArena getting $100 million?
C
Well, see, I imagine that they partner with labs in different capacities.
D
To probably making a lot of money.
C
Yeah, I'm not in the business of trying to point the finger and say they definitely did this. But if I was a company that was able to raise at that kind of valuation, and I had just had a long, eventually public partnership with Meta, for a thing where, as we've kind of seen, Meta had the ability to do a lot more back and forth than a lot of other labs did, I would imagine that there's some compensation going on there, or access to data. And so I think being an eval company puts you in a really hard spot. Some people are talking about this on Twitter: to be an eval company you kind of have to sell to the labs, but selling to the labs kind of wrecks the evals, because...
D
Your incentives, like, this is your customer. Yeah, yeah, yeah. So in finance, I mean, you know, you're formerly Morgan Stanley, so, this is the credit rating agencies. Literally, your customer is the one that you're supposed to govern. So then you have to be nice to them or they'll just go to the next one.
C
Yeah, I mean, I do think that the best source of evals going forward is probably going to be academia. This is the thing that I tell people who are starting a PhD: find things that are cheap to work on as a PhD student, because you cannot really go pretrain a foundation model on your own, but you can build a really good, really clever eval. And we are churning through evals all the time. We saturate them. We always need more. It's not the kind of thing that is ever going to end. The task of translating vibes about what is good or bad about a model into very precise scientific questions is, I think, an important problem. It's a problem where you can get by a lot more with brain power rather than dumping capital into it. You need to pay for the API costs, but that is generally the kind of thing that you can get covered with academic grants or industry sponsors, or there are versions of these things with small sample sizes that get you on the radar, or you pick and choose which models you can afford to eval. But it's an accessible field of research, and it's one that the incentives of academia, I think, are quite good for, which is: write a splashy paper that says something interesting about the broader field, rather than, oh, we want to make this one look like the winner.
D
Yeah, I think a lot of grad students still don't have taste. I don't know how, how better to put it. It just, that's fair. Yeah, but you go to enough academic conferences and I'm like, why did you work on this man? Like, you're so, you're so smart, you're capable of better. So how do you teach taste?
C
I mean, I can tell you how I did it originally. I think you always want to be thinking pretty far ahead, and you want to be making educated bets about what the world looks like in the next few years. You have to ask, what are the questions that no one's even talking about? And this is not an easy thing to do. You have to really convince yourself that you're kind of right about the way things at least might go. I finished up undergrad late, I finished in 2019, then went right into grad school. But towards the end of the 2010s, we had AlphaGo, and DeepMind doing all this multi-agent RL stuff that was really cool. Then it was like, okay, this stuff kind of works. AI is going somewhere. Multi-agent systems are kind of going somewhere, still very early stages. But what's going to happen once this gets there? And it seemed like, okay, these things are all going to be continually learning in parallel, as this big multiplayer game, basically. And if you look at the math, the math was kind of undercooked, and there are some really hard open questions that are still open questions in multi-agent learning theory. And so that was my focus: how do I learn about this? How do I learn to think about this stuff better? And at some point I kind of got tired of proving theorems and was like, okay, let's just go build the thing. But I think, whether you're doing theory or experiments, you have to lay out a few different conditional statements to get to the point where you're really doing interesting research that's beyond just low-hanging fruit that people are obviously going to be working on in parallel. You want to be jumping ahead of the curve a little bit. I think my last...
I don't know, I wasn't the first person to do this, but it was pretty clear to me, after o1 and before R1, that RL was going to work and that it was going to intersect with agents, where the solution was going to be RL for tool use. That seemed like the direction things were going to go. I don't think that was a very risky research bet, but it was a research bet that seemed to work out.
D
Yeah. Speaking of which, you just published the paper. Now that I have the full context, it's that you were an advisor on this and one of your grad students was doing the... something like that.
C
Yeah, so it was me with Siliang, who was my intern. This was kind of the last major thing I was working on at Morgan Stanley. And this was kind of in parallel with verifiers, which is the repo that I've been building out. Major updates to that coming very soon, by the way. I'm very excited about some stuff. But it was something I really started in earnest in, like, January, kind of as the follow-up after the GRPO demo thing went viral and I was like, oh wait, there's something to this format reward thing.
D
It was literally like a GitHub gist, right? Or something.
C
This is like a proper repo.
D
No, no, no, like the, the grpo.
C
The other one. Yeah, yeah, the other one was just a gist. This one is a repo for multi-turn tool use RL with GRPO. And so in some ways the paper is... actually, there's been a couple other papers that people have used the repo for. But it's one where a lot of the stuff from the original GRPO demo gist gets extended to the multi-turn RL tool use setting. And so there's a lot of experiments here about, okay, how do you actually get models to use tools? How do you incentivize tool use? Because something we'd see is that if you set these models up to use tools, they just won't. If you say, hey, here's a question, you have access to these tools, do as many rounds of tool calling as you want and then submit your answer, they'll just submit their answer. Especially for small models, they aren't already trained to use tools. They don't really want to, because they don't necessarily have that instinct. And they're pretty bad at function calling and format instruction following. And so what you would see is, when they'd use a tool, they would mess up the JSON and then they'd be like, oh, that didn't work, and it threw them off. And it would be more likely, following that, that the model would just go off the rails, because they would get an error message from the parser. And so the safe option for the models was just to stay in this basin of: think, then respond. Same with normal formatting rewards too. If you want models to use thinking tokens, you kind of have to incentivize that. You have to either do a little bit of SFT warmup or you have to reward them for doing it. Otherwise they will not follow the format 100% of the time, versus a model like R1, which is going to use its think tokens 100% of the time. You are never going to see R1 just talk normally without the thinking section.
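A format reward of the kind described can be sketched as a small scoring function. This is not the reward from the verifiers repo or the paper; the `<think>` and `<tool>` tags, the required `name` and `args` keys, and the 0.5 weights are all assumptions made for illustration.

```python
import json
import re

# Hypothetical tag conventions for this sketch.
THINK_RE = re.compile(r"<think>(.+?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool>(.+?)</tool>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Give partial credit for a thinking section and a parseable tool call."""
    score = 0.0
    if THINK_RE.search(completion):
        score += 0.5
    tool_match = TOOL_RE.search(completion)
    if tool_match:
        try:
            call = json.loads(tool_match.group(1))
        except json.JSONDecodeError:
            # Malformed JSON: the model tried to call a tool but broke the format.
            return score
        if isinstance(call, dict) and "name" in call and "args" in call:
            score += 0.5
    return score


good = '<think>look it up</think><tool>{"name": "search", "args": {"query": "GRPO"}}</tool>'
broken = '<think>look it up</think><tool>{"name": "search", "args": </tool>'
plain = "The answer is 42."
scores = [format_reward(good), format_reward(broken), format_reward(plain)]
```

A reward like this only checks that the format was followed. It says nothing about whether the tool call was useful, which is the gap the discussion turns to next.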
And so you kind of do have to decide what you want the model to do. This is a little bit of a user-facing question: what behavior of the model should the default be? And if you want it to do a certain thing, if you want it to be a tool-use agent model, it does help considerably to actually have this incorporated into the reward. The kind of key trick in the paper is to get around this problem. So okay, one kind of reward hack these models would do is a dummy tool call, where they would learn to use the same Google search every time and ignore it. So some questions would be like, okay, here's some MMLU-style question, go figure out the answer, use web search. And if you start rewarding them for tool use, they will use the tool, but they don't really want to have to. They want to be very safe with it. And for a lot of these questions, they do kind of know a lot of the answers already. And I think calibrating the right difficulty of your questions for RL is an important problem that we're still kind of figuring out. But they would do silly versions of tool use where they aren't actually using the tool to assist in their reasoning, they're using it to get the reward. And so we kind of have to do a credit assignment thing of, okay, did the tool result in information? And so for these experiments we were doing, the trick was some string matching thing involving the ground truth answer and the returned search results from Wikipedia. So did the model actually search a thing that retrieved useful information for a question? But the framework is more general than just that. It's that once you have a way to do intermediate evaluation, if you can evaluate the quality of an intermediary state, now you can kind of rewrite the GRPO advantage calculation to take this into account.
Because I think this is less of a problem with, like, PPO. You know, PPO is like the old school RL and it also is what people use for RLHF. But in the context of GRPO, GRPO is great for leaning heavily on highly parallel inference compute. It's more memory efficient for the actual training process. It's much easier to do in a distributed fashion because you have less gradient syncing and fewer model weight copies. It's kind of like DPO on steroids, I think, is one way to think about it. But it also gets around a lot of the pitfalls of DPO, both in that it's online by default, as well as that you have this large set rather than just a pair of completions. So you do get some intermediate credit assignment a little bit via this group comparison. But tool use seems to be far enough out of distribution for small models, especially, that it helps to incorporate this turn level. So the way that I've been thinking about it is that in canonical RL, the state and action are things that you do many rounds of: take an action, go to a new state, take an action, go to a new state. And for a while people thought about LLM RL as, oh, each token is an action and the new sequence is a new state. And you can kind of do that, but you can also think of each turn as an action, where the state is the response you get back from the tool call. And now you have a different way of designing your RL algorithms to take into account credit assignment. And it also is a little more flexible from a reward perspective. So it feels like people are moving in the direction of model-based rewards, where you either have LLM as a judge, where the judge sees the correct answer, or it has questions it's supposed to verify as properties of the response. Just because that's much more flexible than trying to write these little parsers.
Like, writing a math parser to check if a math question is right is not that easy, actually, because there are so many edge cases and you want to handle LaTeX support and markdown and equivalent fractions. It's like, just let a model do that. Don't have a 2000-line Python script that does that.
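To make the turn-level credit assignment described in this answer concrete, here is a minimal sketch in plain Python. It is not the paper's or the repo's implementation: the function names, the `turn_weight` mixing coefficient, and the choice to add the turn-level advantage only to the tool-call turn are all assumptions made for illustration. It shows a string-matching turn reward (did the search result contain the ground-truth answer?) combined with a GRPO-style group-normalized outcome advantage.

```python
from statistics import mean, pstdev


def search_hit(ground_truth: str, search_result: str) -> float:
    """Turn-level reward: 1.0 if the ground-truth answer appears in the tool result."""
    return 1.0 if ground_truth.strip().lower() in search_result.lower() else 0.0


def group_normalize(values, eps=1e-6):
    """GRPO-style normalization: compare each rollout to the rest of its group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def turn_level_advantages(outcome_rewards, turn_rewards, turn_weight=0.5):
    """
    outcome_rewards: final-answer reward per rollout in the group.
    turn_rewards:    tool-turn reward per rollout (e.g. from search_hit).
    Returns one (tool_turn_advantage, answer_turn_advantage) pair per rollout.
    Assumption: the tool turn is credited with outcome + weighted turn signal,
    while the answer turn is credited with the outcome signal only.
    """
    a_outcome = group_normalize(outcome_rewards)
    a_turn = group_normalize(turn_rewards)
    return [(o + turn_weight * t, o) for o, t in zip(a_outcome, a_turn)]


# Four rollouts for one question whose ground-truth answer is "Canberra".
results = [
    "Canberra is the capital city of Australia.",
    "Canberra was founded in 1913.",
    "Sydney is the largest city in Australia.",
    "Weather today: sunny.",
]
turn_rewards = [search_hit("Canberra", r) for r in results]
outcome_rewards = [1.0, 0.0, 1.0, 0.0]  # was the final answer correct?
advantages = turn_level_advantages(outcome_rewards, turn_rewards)
```

In this toy group, a rollout that answered correctly without a useful search still gets a positive advantage on its answer turn, but its tool turn is credited less than that of a rollout whose search actually retrieved the answer, which is the kind of dummy tool call the string-matching check is meant to discourage.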
D
And so let me clarify. A math parser to verify that the math is right, and you have a LaTeX parser inside it?
C
So yeah, a lot of models naturally will think in LaTeX because they've been trained on a lot of arXiv, like, TeX. I didn't know that. Yeah, that makes sense. I guess if you're doing like an R1 and people are like, oh, math is easy to verify. The "easy to verify" is still usually this very long piece of code that has to handle lots of annoying edge cases. And even then it's like 98%. Yeah.
D
Okay.
C
Because it's a freeform response. There's not only one way to write an equation. If you have two valid mathematical expressions that are equivalent, but they're also symbolic, you need to verify that two symbolic expressions are equivalent, one of which might be written as code, one of which might be written as LaTeX, one of which might be written as words. Like, you can't do it if it's words. With these, like, literal pseudocode, they try to cover a lot of these cases. That's also why you'd see models put boxed around their final answer a lot. It's one hack: it's much easier to verify the right piece of the information if you know exactly where it's going to live, rather than the model saying "the answer to the question is four," where you then have to parse away "the answer to the question is" and just throw that out. And so deterministic rewards are nice if you can get them to work, but they're also really painful and they're pretty hard to generalize across domains. For math, the easiest is when the final answer is an integer and lives in the same spot. Like, there's a box where it's going to be an integer. And so this is one of the reasons everyone used GSM8K for so long: it's mostly integers. I think AIME is all integers. It's super easy to verify these things and to parse them. And multiple choice too, multiple choice is super easy to verify. But as you go to anything that's a little bit more flexible, deterministic, rule-based rewards start to break down.
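The "integer in a known spot" case is easy to sketch, and the sketch also shows where it stops working. The following is an illustrative rule-based reward, not code from any particular library: `extract_boxed` and `integer_reward` are hypothetical helper names, and the regex deliberately handles only a flat `\boxed{...}` with no nested braces.

```python
import re
from typing import Optional

# Matches \boxed{...} only when the contents contain no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(completion: str) -> Optional[str]:
    """Return the contents of the last flat \\boxed{...} in the completion, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def integer_reward(completion: str, ground_truth: int) -> float:
    """1.0 if the boxed answer parses as an integer equal to ground_truth, else 0.0."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    try:
        # Tolerate thousands separators such as "1,000".
        return 1.0 if int(answer.replace(",", "")) == ground_truth else 0.0
    except ValueError:
        return 0.0


easy = integer_reward("So the total is \\boxed{4}.", 4)
unboxed = integer_reward("The answer to the question is four", 4)
fraction = integer_reward(r"\boxed{\frac{8}{2}}", 4)
```

Here `easy` is rewarded, while `unboxed` and `fraction` both score zero even though a human grader would accept them. Covering those cases (words, fractions, nested LaTeX, symbolic equivalence) is where the parser starts growing toward the long script described above.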
B
Right.
C
But the model-based direction seems to be pretty promising, and I think underexplored: what if you use an LLM as a judge in your RL loop? I think, kind of going back, Anthropic's been talking about this for a long time via Constitutional AI. In that case it was less about the LLM judging and giving a direct reward to the model and more about training a reward model that was doing token-level advantage estimates, which is the PPO way of doing it. But it seems like you can kind of do that for GRPO too, and other flavors of RL where you can incorporate a full reward model. The reward model can basically be an LLM that's fine-tuned to be more calibrated, maybe, and to have the right kind of range of responses. Yeah, but you could also have it be a reasoner. You could have it be something that is able to do tool calling. There's no reason why the full power of LLMs can't also be given to the process of evaluating whether or not an answer is correct or satisfies a certain set of criteria. And so I think that's the direction I'm most excited about: really pushing beyond deterministic rule-based rewards into these more flexible things. And I think you want to do this both at like a... So okay, that paradigm is not going to work super well with token-level rewards. But I think it does work with turn-level rewards, like, can the LLM verify whether a certain search query was useful? Sure. There's a lot of these questions that are pretty granular that LLMs can basically nail all the time, if it's a good enough LLM.
D
Yeah. You decompose it.
C
You can incorporate that into RL with that sort of. Okay.
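One way to picture an LLM judge producing a turn-level reward is sketched below. Everything here is an assumption for illustration: `Judge` is a hypothetical callable standing in for whatever model client is used, the prompt wording and YES/NO protocol are made up, and `stub_judge` is a keyword check that exists only so the sketch runs without a model.

```python
from typing import Callable

# Hypothetical interface: takes a prompt, returns the judge model's text reply.
Judge = Callable[[str], str]

PROMPT = """You are grading one step of an agent's work.
Question: {question}
Search query issued by the agent: {query}
Search result returned: {result}
Did this search return information useful for answering the question?
Answer YES or NO."""


def judged_turn_reward(judge: Judge, question: str, query: str, result: str) -> float:
    """Ask the judge one granular question about a single turn; map YES/NO to 1.0/0.0."""
    verdict = judge(PROMPT.format(question=question, query=query, result=result))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call so the example runs offline."""
    return "YES" if "Canberra" in prompt else "NO"


question = "What is the capital of Australia?"
useful = judged_turn_reward(
    stub_judge, question, "capital of Australia", "Canberra is the capital city."
)
useless = judged_turn_reward(
    stub_judge, question, "Australia weather", "Sydney is sunny today."
)
```

The per-turn score returned here could stand in for a string-matching check wherever a rule-based turn reward is too brittle, with the judge swapped for a real model call.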
D
Awesome. I think that was all the, you know, topics that we had prepped. Alessio, I think you're also pretty good on that. Obviously it'll take some time to figure out Claude 4. Anything you want to plug? We already talked about your talk, I guess, coming up.
C
Sure. Yeah. I'll be at AI Engineer on June 4th in a couple weeks. Yeah. Coming up.
D
Your track is particularly hyped.
C
Yeah, that's going to be a lot of fun. I'm also collaborating with Kyle Corbitt from OpenPipe to do a course. Both of us have our open source projects that are agentic RL focused, and we've been friends for a while and are trying to do something that's a little more structured as a way of getting information out into the world. I think we're especially thinking about practical use cases for agents and giving people kind of an outlet to learn more about how the stuff works. And yeah, more coming soon about that. Awesome.
D
Well, I think that's it.
B
Thanks for coming on, Will.
D
Yeah, thanks for coming on at very short notice. I'm glad we could make this happen. We'll do part two with Callow and do a full Prime Intellect thing whenever you guys are ready.
C
Awesome. It'll be fun. Great. Awesome.
Episode: [AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Date: May 23, 2025
Guests: Will Brown (Prime Intellect)
This episode features Will Brown, a leading voice in agentic reinforcement learning (RL) and reasoning models, previewing his upcoming talk at the AI Engineer World’s Fair. The discussion centers on Prime Intellect’s latest research into multi-turn RL for LLM-powered agents capable of multi-hour autonomy, the evolution from reasoning to practical agency in models, the technical and safety controversies surrounding new foundation models such as Anthropic’s Claude 4, and the challenges in credit assignment, tool use, and reward frameworks for training reliable agentic LLMs.
Notable Quote:
"Reasoners are a step on the path towards agents. ... What people care about more for actual applications is practical agents."
– Will Brown (02:23)
Notable Quote:
"You want the models to do the thing and no more. ... The reward hacking issue seems to ... have gone down for both Sonnet and for Opus."
– Will Brown (07:37)
Notable Quotes:
"RL is like one math language that exposes these primitives ... But it's this n-body problem where you freeze it and look at it—how does one thing moving affect everything else?"
– Will Brown (20:23)
"For tool use, [credit assignment] is that ... did the tool result in information? ... The framework is more general... Once you have a way to do intermediate evaluation ... you can rewrite the GRPO advantage calculation to take this into account."
– Will Brown (31:09)
On Progress in LLMs:
“Linear progress which is great, but ... there's not anything ... that feels like a paradigm shift in terms of ... complexity of agents.” — Will Brown (06:51)
On the Limits of Deterministic Rewards:
“Deterministic rewards are nice if you can get them to work, but also really painful ... for math the easiest is when the final answer is an integer and lives in the same spot ... as you go to ... more flexible [tasks], deterministic ... rewards start to break down.” (35:06)
On Academic Research:
“You want to think about ... making educated bets about what the world looks like in years... you want to be jumping ahead of the curve.” (26:15)
Will Brown leaves us with a preview of the future: RL-driven agents, longer-horizon evaluation, model-based rewards, and a steady move toward practical, reliable agentic LLMs with nuanced safety controls and robust evaluation frameworks.
Next up: Catch Will’s talk at the AI Engineer World’s Fair (June 4, San Francisco), and stay tuned for course offerings and Prime Intellect’s open-source tools for agentic RL.