
Loading summary
A
Foreign. Welcome to the Linspace Podcast. This is Alessio, partner and CTO at Decibel. And I'm joined by Zwicks, founder of Small AI.
B
Hello.
C
Hello. And we're excited to welcome back Nathan Lambert from AI2. Welcome.
B
Thanks. Fun to be here.
C
I feel like I also have to say Interconnects and the Lex Friedman podcast and the AI World's Fair. You've just done a lot in the last year and a half.
B
Not that many. Still say no to plenty of things.
C
Your first episode with us was January 2024, when you just joined AI2. Then you released all the Omos. You joined us again at Neurips where you did the open models. Well, LUCA did and you supported and then you were more recently here in SF for aie. First of all, I wanted to congratulate you on winning the best speaker.
B
Oh yeah.
C
For the reasoning track. Here you go. I'm limited by emoji.
B
Oh, nice. AI generated. I look too zen. I look so zen in this AI generator.
C
So we had our track host, like take photos of you while you're speaking and turn into Ghibli photos, but this one, your eyes were closed.
B
It's funny.
C
Okay. We were trying to have Mochi the Reasoning Pomsky join us, but I think she's like very. Getting very anxious. Very restless too.
A
Crazy Mochi.
C
Very restless. Okay, true. Okay. So you've been doing really good work. And honestly, I think one of the things that we wanted to kind of establish was to loop and rlvr, I guess. Is that a good place to start?
B
Sure. It starts us in the recent journey. I think that we can recap kind of the story of what to do lo3 was aiming to be and then kind of how it got folded into what the new narrative is. What the goal is, is try to do the work. To compress what are complicated industry post training recipes into something somewhat tractable that you can modify on your own and do post training at a what is like actual state of the art level. I think what we do relative to Frontier Labs is that we probably have a smaller amount of tasks. I think our post training suite for to loo is probably like 10 to 15 tasks. But I would guess post training at OpenAI at all you have maybe hundreds of evals and adding more evals is more data work and more mixing work and making sure you have these things. But on core evals for our suite of models from I think 8, 7d and 4 or 5b is based on Llama at the time it's like it matches or beats Meta on these core vals. I think Meta has different priorities and their things for Llama 3.1 which is a great set of models at the time. And it's just like how do we distill what is very complicated post training explanations or diagrams from the like of this Llama 3.1 report where they have these complex feedback diagrams with many iterations and earlier signs of that from anthropic papers that have these multiple model variants and early constitutional AI things for multiple years. And it's like what does that look like when you're doing large scale instruction tuning into preference tuning and what else you might add? I think a lot of the core contributions of that before we talk about this reinforcement learning thing is like we showed how to scale up preference data. It's just like the academic community had been using this one data set since all the way back in the hugging face models of Zephyr Beta is when this ultra feedback dataset got popular. And still a year later is this state of the art dataset for open preference tuning. And it's just one of those obvious things that doesn't need to be the case. So it's a big trying to make more mature recipes available to people. And I mentioned this either on one I think I'm trying to talk with Jordan. I mentioned the origin of the RLVR thing which is like realistically when you work in the open, a lot of it is trying to match what industry has done and we're on a different path because our infrastructure is different. So some things that OpenAI does now that works really well for long context won't work that well for Olmo because we might not have enough flops in our base model. We might not have certain data sets for legal things, but directionally a lot of it is just trying to reproduce things. And I've long tried to get John Shulman on the pod of OpenAI, anthropic and now thinking machines. And at the time he had gotten approval to chat with me, what he said was confirming a lot of the things that I had said on instruction tuning and multitask and preference tuning. And he was like, oh yeah, everyone just does RL on the outputs. And that's how we got the RLVR idea and scale it into something that is a general method. There was a lot of reasonably or very similar works at the time like Vine, PPO and quietstar on doing these math and coding domains for getting verifiable rewards. I think the RLVR thing was about doing it in general recipes and the naming was something that stuck originally we had. I think it's especially like Costa Huang who's was a kind of lead RL engineer at AI2 who's doing some stealth startup now. You can hear more from him on that soon. I think he's founding engineer of something. And Hamish Iveson, who's still a student at UW leading most of the technical work on this. And the naming was going to be RL from ground truths. But then it's like the verifiable rewards is actually a more general notion because only math questions have a ground truth where code is verifiable, precise instruction following is verifiable. So I think it's a nice evolution of the name which makes sense as you look at more domains, which is now why catches on with people. Once Jensen started using it it was like okay, that's. That's set. That wasn't really our goal. But that's. That's.
C
You think that's where it took off?
B
No, that was like in it being taking off because it was after deep seq. But it's like when people like that have the acronym on the slides and that's. It's also very clear of like RLHF is four letters. It's like we want to evolve that and have a similar four letter acronym. It's not that much magic to it, but there's definitely intention on these. On these little things.
C
RLGT may not have worked as well. I don't know why but yeah.
B
That'S what these people all that were definitely thinking and they made that name change which works. Which is fun.
C
You did mention. So we'll show you kind of mostly quoted from the TOLU paper there but we'll show the RLVR chart. You did mention that you wanted to change it now. And we'll sort of preview a little bit of the agent's discussion.
B
Yeah, I think when you are introduced to rlvr there's just a function really that checks if you have a string outputted from the language model you have a relatively simple function that's like is this answer from the language model correct? And there's no real environment because you're just looking at the generation. And now I need to figure out the right way to communicate what either multi hop tool use looks like for this. Which is something people are definitely doing thinking what is the right diagram to encapsulate how O3 is trained which in action they take multiple actions because the next sequence depends on the feedback from the environment, which is some sort of information store. So when it's searching for a niche piece of information, you can't know what the next actions are without whatever feedback from Bing Search is, is what they say they use. That is a step that is very much happening. And then as people try to transition to more end to end, RL is a real strong notion of environment, which is that you're looking for a sparse signal from this multiple generations and that's what people want to do. I think it's debatable whether or not people are actually doing it now. I think the Deep Research blog post kind of hints that they do a bunch of small scale RL and then poof, the system works. Which I think is much more of what's happening is people train on a bunch of small things and they do some prompting and they see that when you put these pieces together or a couple different fine tunes of a model. So it seems like Deep Research has some fine tune of O3 in it as you do that with some different domains of rl, it works rather than deep research being trained on the outcome, which I think makes a lot of sense for it not working in deep research because doing outcome based RL for deep research would be RLHF. Again, because you have to have two humans and you're like which generated report is better? You can definitely do that. And the whole sick eventsy thing at OpenAI showed that they have so many different reward models and reward signals in their post training, but that's just one of them. And I think a lot of the progress in making it exist is doing RL and a bunch of information retrieval and editing and search tasks.
A
We talked with Noam Brown about this Deep Research and kind of like the verifiable rewards he mentioned. Obviously that's an example of like non verifiable thing having RL work on them. And in one of your recent posts you also talked about how the big labs have all this data that they can find long tail things to RL on and then kind of when you put them all together that fixes it. Do you feel like what we're able to verify is like a big bottleneck, that the verifications are only done in kind of like these smaller atomic things and so we cannot really scale that?
B
I think my comment was on making. So in this post I was reflecting mostly on the question of what will agent progress look like relative to modeling progress. So we've had almost three years of modeling progress and we're pretty used to the messaging on that. And it wasn't just about being with the RL on small things. But do any post training to fix a weird behavior. And RL is a very data efficient way if you can get the right signal. But you could also just say it does this weird non verifiable thing. Let's create 100 or 1000 instructions to include in post training so that the model does this types of information extraction correctly or soft extraction. It's a space that I want to flesh out more with more examples of tasks. It's just if you watch Claude code going it's like what is it doing in the background? It's a lot of reading files and even just the compressing context I don't think that's really a verifiable thing. But that being messed up that's a super crucial skill for long context actions and longer tasks is just compressing well and that's going to take some training novelty on how do you. You can effectively modify your training data Instead of having all the multi turn context you just insert the summary and you want to make the performance stay as well. Because it's also cost saving to have shorter context. There's just a lot of new domains like that.
A
But do you feel like you can figure out what these things are before you release or do you think the labs have a big advantage because they have so much user data that they can kind of like inspect this at inference?
B
I think mostly looking at real world data at this point. To the extent that there are clear benchmarks you can use them in the open. But I mean we see the industry consolidated around data in different forms and I think that's a real important touch point for people.
C
I'm curious who's still collecting reliable sources of open data that everyone uses?
B
There's a lot of action in the space, but hard to get traction.
C
Yeah.
B
So I think for a long time preference data has been something where people understand that it'd be very good to have large repositories of it. If you want that you can annoy me to try to release all for to we have a final data set but we have completions and ratings for more models. I'm talking to the student. Let's figure out how to mark this down because we just have so much completions and LLM is a judge AI feedback data that we don't know how to clean. That's one thing. The problem is I think a lot of it is task and model specific. So this notion of on policy to adopt an RL word for just this preference data and preference modeling, which is that you want the sequences that you're training this reward model on, the sequences of generations to look like the model that you're starting to fine tune. That is something that has made it hard to kind of grab off the box. And it's for example, this ultra feedback that I mentioned has a lot of models in it. So most models that people are fine tuning, there's some signal for it to improve on. And I don't know how long that lasts. And we still don't have the answered question on how important human is versus AI feedback. Every time I check in with people at Frontier Labs, they're like, yeah, we still use human preference data. And I'm like, okay, I don't have access to that and I don't know how to measure how, how much it gives you. Really. It might be most of the benefit is on the. What's the right adjective to describe Chatbot Arena? It's like people are down on Chatbot Arena. But it might be that the human data helps boost retention time and general preference a lot. Where most academics were doing multiscale and alpaca eval type things, which it's not as crucial to everybody's fighting in the attention economy.
C
The attention economy. I mean, since we're there. You mentioned sycophancy, you mentioned L.L. marina. That was one of your posts on Interconnects that I really enjoyed. Are they cooked? Is there a future for arenas? How does this play out? They got $100 million now. What are they going to do?
B
I don't know what the money does for them, but I think the eval is still valuable. Especially at the Frontier. People are very cynical. But in the compression race of how much cheap, like what is the cheapest model you can have that does pretty good at this is still so useful to a lot of people.
C
Chat is king.
B
Yeah, everyone chats with these things. It's. It's why I use GPT4.5 isn't as good on Chatbot Arena. I think it's. It's higher on like, Yup, which is a new competitor into this. It's like they have like a Vibe category which.
A
Sorry.
C
Yup.
B
Yeah, there's like, yup, AI. There's. You can look it up. It's a competitor, another startup. They have like a cat. All these companies have categories and one of their categories is vibes and GPT 4.5 is on the top. And I'm like, okay, there's some of those tracks. It's a frontier model. Yeah, it's just like that stuff intangibly is very nice. The leaderboard is established. People still should use it. It's kind of a focusing function for the community across different batches from industry to academia. Yeah, I'm not going to try to solve their monetization problems for them, but having clear norms and things that can be hill climbed forever is very good. Having this idea of an ELO linking.
C
Model that you cannot saturate.
B
Yeah, you just kind of cool. It's a great, like it's a great problem like what is.
C
But you can game it. So I think that's the, that's the issue.
B
Yeah, but everyone evaluates on multiple things.
C
Like Sarah Hooker, I've never seen her so public about any of her. Like she has gripes but she doesn't really go public like that.
B
Yeah.
C
Artificial analysis also has one which I think is kind of cool. The other thing I think is relevant to this discussion is a lot of the data actually is like single test, like a single round, like it's not, it's not multi turn. And I wonder how to create proper multi turn arenas because you have to switch to models. That's the whole premise of Elamarina.
B
It depends on how valuable the user data is. If the user beta keeps being equally equally or more valuable than the inference, there's going to be a platform to keep pushing this into more and more expensive things. So they're going to set up a deep, I mean they're probably setting up a deep research arena because that's the data that. I mean if I was OpenAI working on deep research, that's the data that I want and there are competitors and Elementsys is the entity that has the marketplace meant to set it up. Right. I mean it's almost like how I see scale. It's like scale kept climbing the edge of what AI data processes is. And because they're the name brand, they keep climbing the incremental evaluation game and a lot of them have longevity.
C
Yeah, yeah.
B
It's a network effect in some ways.
C
You mentioned skill, which is another hot topic. But like we'll put, we'll put all the sort of hot takes at the end. But I do want to focus, try to be technical upfront. You're still writing the RLHF book. Is it RLVR book now?
B
I can give my spiel on it. Ultimately RVR is not mature enough nor is it as interesting of a book. So those are the Two fronts of why I don't want to rebrand, and there's also some personal career strategy. But that should be independent on what is objectively a good book. Because RLVR is going to be changing so much in the next 18 months. We've already seen it. There's all these new algorithms, but I think there's a lot more under the hood on how you do the right pre training for it and what the data is, how tool use emerges. All of this stuff is core to what RLVR will be seen as. I'm watching to see if O3 is like a niche model or becomes the path that everybody needs to follow on its kind of different style of tool use that you see particularly with search. Okay. And we don't know how OpenAI did this. And these are the things that I think is kind of core to an RL VR book that we don't have. Whereas RLHF is a more interdisciplinary in the same way that Chatbot arena can never be saturated, RLHF can never be solved. And we kind of know these problems of alignment and over optimization and what the pipelines to getting data that people are using are. And yes, I can add more RL algorithms to the book, which is nice for me to study, but that's not really changing. It's not changing what reward modeling is and the different ways that people implement these today, whether it's a value function or reward model and stuff like this. So I think the breadth on RLHF is nice and I would tell a lot of academics that I think RLHF problems are going to be foundational and kind of just have a much more steady study rate where we're on this massive spike of rlvr, but it might just be solved and then it just goes back to zero academically. It's not. It's an embellishment. But there could just be a best practice for getting a hundred percent accuracy on any problem that you want. And then. And then it's solved. To where the debate on what is a preference is going to go on forever.
C
Yeah. Because it's verifiable. There is a right answer.
B
Yeah.
C
Sorry, what do you mean by like over the next 18 months there'll be a lot of changes. Like, what do you foresee? Actually? Let's just catch up. Like what's already happened, you know, in like the sort of recent history.
B
Yeah. So there's two categories of information that we have, which is what are the models doing and what are the researchers doing? I think the models provide a lot of inspiration in terms of what the actual frontier is. And that's things like O3, Gemini 2.5, Claude. These are a mix of just O3 I think is the most scaling RL approach. And then Claude and Gemini 2.5 are very similar with hybrid reasoning models that you can turn on and off. They rolled it out in different ways. So Gemini didn't have hybrid reasoning at launch, but they've brought it in and Claude had it at launch. One of the most important questions has got to be is, is the O3 path of just a reasoning model or hybrid reasoning models more useful? Do they diverge in their methods for training them? I think the Nvidia Llama Nevatron reasoning paper is probably the most detailed paper on a hybrid reasoning thing. And then deep seq R1 is still the canonical recipe on a reasoning only model. And those are very different approaches. And I don't know if one will win out or not. And then there's just a lot of work on data side and RL methods. I think there's a whole list of kind of GRPO complaints that are out there where the math doesn't make sense for certain things.
C
To me, every paper I see come out always has some fix to grpo. It's kind of cool that people are taking variations on it, but also, I don't know if Deepseak is going to come up with R2 and just blow away everyone with whatever is next.
B
Yeah, I definitely don't think the algorithm tends to be the most important thing. I think I had this in my AI engineer world fair talk, which is kind of a snarky of how do you train a reasoning model, which is like you get a starting dataset, you incrementally improve the dataset.
A
You.
B
You do that until you're running out of time or your performance starts going up. And then you try all of these switches from all the papers, or you turn all the. You do a whole bunch of binary tests of all these various algorithmic changes and you do a grid search and see what works.
C
Candidly, that's why I dismissed GRPO when it first came out, because it was sold as an efficiency thing. And I was like, okay, fine. But I've been trained to not care about efficiency because it's just a matter of resources.
B
Yeah. The GRPO advantage estimate is very well suited to verifiable rewards.
C
Right.
B
But the other thing is kind of a intangible works better on the infrastructure type argument. And when it came out for Deep SEQ math, which is well before the RLVR phase. So it was really marketed as that.
A
When you talk about hybrid models, how do you reconcile that with OpenAI saying they want to move away from the model selector to just having a unified interface? Do you feel like they feel pressure to like, hey look, when I have all these different classes, we want to route them to the right thing or do you think there's something else?
B
I would think that OpenAI wants to have a model that knows how hard the pressure is. I think that has to be the North Star for most people working on reasoning, which is the model will just spend the right amount of tokens on it. And if you look at a compute level discussion what inference time scaling means. I think in plenty of ways hybrid reasoners might just be aged out except for niche applications because quality is so much more important than having 100x less inference tokens. You just pay for it in compute and that'll get better. I think it's like really like that was something like Jensen said in his most recent. I think like strictechary highlighted it or had the interview with him and it was like, yeah, everything's going to be a reasoning model because it's going to get so cheap and they're better. And I was like, that's why it's like the hybrid reasoning thing is a little bit weird. And it's like I always just will turn reasoning on unless it's a really silly query like oh, what is this thing? So it's like okay, in two years that kind of tracks which I think O3 is also just burning money on us searches 80 websites for me asking what paper it is. That's a lot of tokens. But it seems directionally if that's the thing that works, that'll be the default.
C
Yeah.
B
At least in all of these high Most of the things, the people that we talk to, whether it's coding or very high end information economy, those things, the value is there.
C
I wanted to double click on something that you seem to be coming back to a lot. You seem to assert that O3 does something very different by using search a lot. Much more than basically everyone else. Do all models come with a search engine now? Is that like a must have?
B
It depends on your use case. If you're doing general information retrieval or understanding. Yeah. There's old papers that we can try to find the links. I don't know if Sam Alwin was talking about it, but there's this retro paper from DeepMind and other architectures that people have been pulling in the discussion again, which is like you have a very small model with a very big context length and a very big retrieval store, which I'm not one to bet against the transformer architecture and just figuring out long context and stuff like this, but those are ideas that people are bringing back, which is search. Search is better. You look at all the evals from reasoning models and one of the trends is that simple QA numbers all drop. It's like deep seq R1 to the new R1. It goes down. It's like all the new like quin 2.5 to quin 3 simpleqa goes down at least when you're evaluating these without tools. And SimpleQA is like a what is considered to be a very nice, fairly numerically robust long tail knowledge evaluation. And all of these, the raw models, they're all going down, but it just long tail information just have this search behavior makes a lot more sense.
C
Okay, the counter argument for this, I have been through this journey too of like, oh, why don't you make a model that doesn't know anything but search, right? You can search up anything that you want and learn just in time. But the problem is you need to know what the search terms are. You need some baseline intelligence to make all this work.
B
Yeah, that makes sense. That's a good way to put it.
C
I think it's important because there's this thesis of LLMs becoming just online LLMs permanently and it hasn't been super pursued. Perplexity was one of the first to put it on my radar as they were like, we'll attach the search engine to the LLM and that's what you get now. And I think more and more people are starting to offer it as part of their default services. Like Gemini has like a search grounding thing as well.
B
I mean it's what people say. A big limitation of Anthropic is because it uses Brave search, which returns a bunch more like SEO slop than is that proven?
C
Because I don't know, I thought they had their own index.
B
Okay, so I haven't done detailed looks dealing with rumors, but I think they'll all end up doing their own index. And it's one of these things that's like Google should have an advantage again, but who knows if they do. I also hinted at this in my post, but it's like Hamish had tried to set this up. The same student from RLVR of playing with search and RL model and it's very easy to get the model to do tools if you prompt it to. But it's very hard to get the RL model to learn that the tool is useful. That's why to go through these things where there's 80 failed tool uses and it still gets it or it stops or gets it on the 81st is just a RL behavior that feels emergent from having a very nice way of getting the model to learn to use the tool. And it's not like you can't sft this model to do this. It just really feels like they set up the environment right and it plugs into this deep research kind of line of work that they did, and they broke down the problem into these sub RL tasks, and then it kind of lets it do this thing.
C
Interesting.
B
I don't want to be an OpenAI shill all the time, but I just think I tell people to play with O3 all the time because it's weird.
C
It's excellent. I would say, the amount of work you're imputing on the deep research team, when, as far as I know, it's three people did it. It was Issa and the two other collaborators that she had. I don't know if they did that much. On top of O3, every indication I've had from OpenAI is that deep research is more or less a thin wrapper over, which is.03.
B
Yeah. It's probably one or two small things that it's. They're like, oh, we can make our. We can make deep research work by adding this small amount of data to the training thing. And then it just works.
C
That is. That would be how I describe it.
B
I mean. I mean, what is it? Guern, the anonymous person, he replied to my Q Star post on Twitter the other day, and he was like, why was this all wrong?
C
Yeah.
B
And it's obviously simple. Things don't scale. There's a lot of complexity because there's a lot of other exciting things in the AI field at the time. And OpenAI kind of sends out a lot of things that confuse people. But this would fit into that, which is Deep research is a minor change from an existing RL trajectory of what was like, oh, three. Probably they had already figured out that search was going to be better. And then we're like, okay, we can repackage this. And it's a. It's a simple thing that makes a big difference.
C
Yeah.
B
And most of the things are like that. Once you have traction. I think once trying to get the initial takeoff on the sigmoid is the hard Q thing. But then once it's like. Once it's like this, a lot of things in the middle feel obvious, which is why I describe one of the things that we work on for Olmo. It's like a lot of it is just having motivation to do things that feel somewhat obvious, but they're still hard. It's hard to get different recipes or it's hard to get a full reasoning recipe off the ground. It's just a huge change because you have all this inertia on this eval suite and then you have to figure out if you branch your recipe or do you start from. Do we just take open reason or zero and start from scratch? Which is like, it's a whole other headache of. Of things. It's just hard to move these projects that are anywhere above 5 to 10 people with inertia to get stuff done. But then once you're hill climbing, things can seem really obvious.
C
Yeah. Okay, you covered a lot there.
A
Before my next question, just to close the brave thing, our friend Simon Willison wrote a post that anthropic added brave search as one of the sub processor in their product.
C
Yes.
A
So that's where the thing came from. Now, to what extent it gets used to, we don't know.
C
We don't know. I would just kind of comment on a couple of things that he said and then we'll go on to your question. There's a very good post. Just on the retrospective of Q, there's a very good post that you had, which was that I want to send people to, which is was O1 as IOP. Right. That does imply the question of, like, if one was a PSYOP, what else could be psyops now?
B
Yeah, there's definitely psyops out there. I mean, the whole inference time scaling plot is such a psyop. You put these two things next to each other with an X axis and it just looks like it's easy to control. Whenever you see an X axis, you think it's easy to control it. Whereas for training on the left, one was training.
C
Yes.
B
And training makes a lot of sense. So if you haven't. Even if you go to really old RL papers, RL learning curves are a non log x axis usually. And they look like this. They look like these whatever logarithm or exponential rise. And then if you take one of these and you make it a log X axis, it's a straight line. So that side is like, oh, okay. We've seen this before with rl, but with inference time scaling, it Being an X axis is why people are like, oh, there's a knob. I could turn search up a lot. Which is like, what breeds all these weird ideas? The core of that article is just they're taking points from within training or there's a natural variance and then you line them up and if you line them up then you get this nice inference time scaling behavior which is. Now people, a lot of people have reproduced this plot on inference time scaling and it's, it's much clearer now. But at the time it's like, I see why I thought it was a knob. It's like, oh look, they called it inference time scaling. You control it.
A
I think the most interesting. Well, you have a lot of interesting things in your blogs, but one that stood out was about RL and tool use. You said that it's easy in our experiment to tell the model to try searching, but then if it doesn't get results with the tool, it's going to stop using the tool very rapidly. Can we impact that? So can there be a good tool that the model doesn't know how to use and then it kind of fails and then it stops using it? Can there be a bad tool that should be improved before giving up on it? How should people think about designing the tool, improving the model and kind of like where to intervene?
B
This is definitely on the newer side for my things that I want to work on or have worked on. I think particularly in 2026, especially in the open side, all the infrastructure models will caught up a lot. Where I want to go deeper on this in terms of deeper search style things or very inference heavy multiple calls. And to answer your question, there definitely can be bad tools and there definitely can be the model just using them wrong. And something that I would want to see in a model is kind of not necessarily creativity, but an openness that it doesn't know exactly what it'll get out of all of its tools. And this uncertainty to just try a few different things, which almost seems classical RL behavior, but if you think about what a language model does, they're not necessarily confident, but they have a path and a direction in their answer, whereas that's a big change in these reasoning tokens, is to have the notion of backtracking and things like that, which is some sort of openness to the tools having things that are unknown in it seems like a really nice thing for the model to have, which is like, oh, what if I try this? What does it get? Especially on the open model side, which is if this is Going to work where people want to use open models with tools, it's going to be because people have private data stores and stuff. So if you were to train an open model that is going to be a good reasoner like O3, but on private records of some sort that'll never get sent to the cloud, it needs to be thinking of like I can try some things with this to get a sense for it before saying that I have to give up. And if you look at tool use right now seems much more similar to code execution or it's just a part of a sequential path that you need to get to, which is like I have a plan and if it fails at a certain step I might have a backup. But it's not like this iterative of I need to fiddle with the environment in order to come up with my plan. It's just that it's something that people probably are going to have to train into these models, which is like, you might just tell it like you don't know what is in this, but your answer might be in it. Which is like a very odd prompt, but maybe it'll help.
A
Yeah. When we had Eric Schlons from Anthropic who worked on the cloud Agent before cloud code, he mentioned they spent basically like majority of the time on like the tool design to give to the model. And then you just kind of learn how to do it. Are you usually. Well, I don't know how much you worked on actual this stuff, but are you putting the tools one by one in the RL process? Do you think that helps or do you should give. Is it better to give all the tools and let the model explore?
B
I don't really know. We haven't gotten this to work. I would say it would probably depend on the model and your starting point. If your starting point is already good at tools, it can probably generalize more. But if you're doing this weird base model RL and you have to have this kind of curriculum, if you scale RL long enough, you're going to need a curriculum of things getting harder. That's pretty obvious. So in that case it might be tools get added when things become too hard for it to solve certain questions, which sounds very intuitive but also just really hard to manage in practice. Because what is your automated signal and your training run that it's time to do that?
A
That's why video games are so good, because they're designed to unlock things as you progress. But I think with things like search, it's like, you know, if you're given access to a small data store or you're given access to all knowledge on the Internet.
B
Good feedback for the Arc AGI people for the V3 benchmark is like, have things where the language model needs to learn to use new actuators in the world after a certain threshold.
C
That would be Archaeagi 4. Then.
B
Yeah, probably. They're cranking them out.
C
They're cranking them out. They're actually doing a launch party I think in a couple weeks. So I'm actually really. It's fun to play ARC AGI. I don't know if you tried.
B
Oh, I haven't.
C
It's pretty fun. These are IQ tests. I used to be like, oh, they weren't that relevant. But actually now that we have a gradient where LLMs are actually significantly climbing them now, it's actually really more interesting to compare your own intelligence.
B
I'm with Gnome on no harnesses.
C
No harnesses.
B
Yeah, yeah. I mean, harnesses are cool, but they're gonna, They're. They're a handicap that's changing the learning dynamic substantially. So it's good demos, but I feel like the core thrust has to be no harnesses.
C
Is it wrong to say that these are just inductive biases? Right. They're not in the model. Sure. But anything where you're just looking at the results contaminates.
B
This is a different task. I think I talked with Greg about this at rkgi, which I told him, do harness and to harness, you just.
C
Have both different categories.
B
Just like you're trying to be transparent and build targets for Frontier Labs. Just do both. Like, I don't think it dilutes that much. The no harness is going to obviously be harder, then you just get more bang for your buck on your benchmark.
C
Yeah, it's the same dataset. Staying on the topic of tools, while we're at it, you had a really good summary of like recent work in multitool rl, which had like loop and retool and toral and all these other things. And I think that this is just like an area that's super rich for research right now. I just wanted to give you the space to highlight what are your favorites? What do you think that people should explore?
B
I could share what my moderate ambition, what would be fun research project things is you want to create some sort of competitive dynamic or a val and it has to be so much narrower than what industry is doing. So I told you this at lunch, which is like deep research, but only archive papers, so you don't have to do a full index. You have a limited domain, you have to figure out how to measure it or something. Or I think it's good for academics to work on academic tools because they have very high domain expertise. They already know what's going and just figure out how to make that something that is either very useful to users if it's going to be good enough for that or something you can hold climb on. I don't know if this is brainstorming on the fly. Take related works out of papers. Just look at the text and break all the links and make an avow which is filling in hundreds of related works with archive links. That's a fun deep research style idea. See if you could do it with open models on a set data store with tools. AI too has gone through a lot of discussions with this which is if you're trying to have impact in AI right now as an academic, you have to level up out of papers to artifacts, which is models, datasets, evals. Datasets and evals are easier for people to have impact on. And then the next thing is what do people actually use in AI too? Especially in this semantic scholar team that's now working on information agents of different types. There's another thing that I'm distance in, so I don't have all the names but it's can we make open models do that side of things better? It's like can you make something that people actually care about and then that's a whole level of impact that's much higher if you have actual users. It's hard for academics in small institutions to do that, but if you're working on agents, dog feeding is viable. It's like can we make ourselves a good slack summary bot that we like or something. And just making these agents really tractable. I mean that's one direction. Another direction is just hill climb on humanity's last exam with tools. I just think it's kind of unlikely that we're going to win as an academic and a state of the art number because they're going to start spending millions of tokens per query. And it's just a lot of. It's a lot of compute burn beating that on the flop. Equivalence is going to be so hard. Unstructured thoughts is something that I'm mostly like, okay, I'll get to this. I have more things to figure out on the modeling and what I call skills level which is just how do you do reasoning to induce inference, time scaling and get high eval numbers. And once you Know you can do that. You can take your knowledge with you to do it in more specific domains.
C
There's skill and there's skill acquisition.
B
Right.
C
I think the ARC AGI definition of.
B
AGI, I quoted it. What is it? It's like efficient. Yeah. Skill acquisition, efficiency. Because he described it as three words.
C
Right?
B
Yeah.
C
Your emphasis on skills in your recent talks that you've done. Do you want to sort of reiterate that thesis for people to pick up on?
B
Yeah. So I've been thinking about. Mostly I'm trying to get ahead of what OpenAI, et cetera, are doing probably now, if it's not in their models and with all the agents, it seems that planning is a very critical task. So it's kind of how do you come up with the taxonomy for different types of things? You need to train into reasoning models for when it'll be a bottleneck. So I came up with four, and the foundational one was skills, which is what I would say that we have already done with O&R1, which is you do a lot of RL, you show that inference time scaling works and you get really high benchmark numbers. And then the next three are kind of what comes next. And most of them are around planning. So what I had is 3 and 4 on my list were abstraction and strategy, which is trying to not use planning, because planning is a word that people already use a lot. Strategy would be the direction the model should go in and technically what are the steps of its plan. And then abstraction is how does it break it down into things that can actually solve. And then the fourth last thing is calibration, which is just not wasting compute and knowing when to give up and ask the user things. Because overthinking is obviously a problem. It's easy to keep getting your eval scores to go higher by using more inference time scaling. But eventually that's not what people want in their models. They want a smarter training regime where the model is actually getting proportionately better for its training and not. There's a lot of papers on overthinking and stuff like this, which I think is. OpenAI wants it because they have to foot the GPU bill if O3 just infinite loops itself for a bunch of people, that's not good.
C
Does it actually?
B
I don't know, but it might. These reasoning methods definitely can make the models just kind of unstable and just. Yeah, but it's also the GPT5 idea, which is how do you get a model that just routes the question to the right, maybe not necessarily A router, but just knows if it needs to do a plan or if it can just answer. If you look at deep seq R1 and you ask it a hard math question, it's not like, here's my plan of attack. It just starts and having a model that knows when to be like, okay, here's my plan of attack. I might need to make myself a memory store. I might need to take a claude code approach for this query. I'm going to build a memory store and spin up some parallel searchers and then come back. Conceivably this is all something you can train into a model because the searches or the parallel models could be tools. In that case, the simple way to describe it is we have something like thinking tokens and then answer tokens. And it's the model should be able to optionally have planned tokens before thinking or before using tools. It's like, okay, here are the table stakes. I need to do these things and these sorts of tokens tasks will be harder versus easier. It seems more tractable than some far out ideas for AI. A language model can write a good plan. It just needs to be asked to do so, which I would bet that claude code and Deep Research are doing this. You get a user prompt and first the model is like, yeah, there's a plan tool in CLAUDE code and they break it down and it's like that is something they've trained into the models, like deep. I don't think deepseek doesn't have it built in, but it probably could do it. And just thinking about that interface between like if the model needs it to be able to do the task end to end on its own, can it do that sort of thing?
C
I think that my challenge with this whole reconciling this approach with the no harnesses thing is that I think a lot of the way that people, especially engineers, want to model it is that the plans and the memories are tools and there are no special plan tokens, there are no special memory tokens. It's just context or it's just whatever specifically for planning. Because then you can do fan out to other agents for tool calls and stuff. So it doesn't have to be sequential. But I'm just like, is this a fork in the road or do we have to make a real choice here as to do we outsource things to tools or do we keep it native within the model's tokens?
B
I don't think it's a subjective difference. I think mostly the planning idea is to make the point that people don't get things for free. And the planning improvements might be kind of mundane, which is like we were prompting Claude and its plans were bad in this way. Let's give some data where its plans are more detailed or break things down into more steps so that it's easier for them to do it because it's in a black box effectively. So if it hasn't been targeted, it's unclear of what the performance will be. Or on the open model side, it might just be the idea of having different models for different parts of it. Then you're really training a model to just be good at planning. And that's data that you need to come up with and you only use that model for that one part of it.
A
Does it feel like plants are much more reusable and should maybe not be generated every time? I feel like especially in coding for certain sets of tasks, you want to have plants, similar types of plans. So maybe it's not the right way to ask the model to regenerate a plan every time. There should almost be like plan blueprints as like tools, and then the model fills it in. Like, where do you think the balance should be?
B
I think they're reasonable. A plan is obviously an intermediate goal. I just, it seems likely that there's like failures on this kind of planning level. I mean, the same thing goes for these rubrics that are popular, where a lot of the technique that is popular for so called rubric things is you have a prompt and you have a language model generate a rubric for that prompt, which is a few specific things it needs to get right. And that's conceptually very similar to making a plan for every task. I think whether or not it's like grading is you're going to have a different type of abstraction than executing. But I think what people are seeing is that it's cheaper relative to the effectiveness to just generate it. So plans are not super long and they probably, they're not that many tokens. So it's probably just kind of like, okay, we do this. Like putting it in my taxonomy might be overselling it, where it just needs to be a prompt and you just need to make sure that your model's not too weird at that prompting stage.
C
I think your taxonomy is super useful, by the way. So skills, calibration, strategy, abstraction. I feel like maybe abstraction might be the most underrated one or hardest to solve. The way that you introduced it was different than how you wrote in your blog post. You said it was basically not to overthink. That's calibration.
B
Yeah.
C
Abstraction is about breaking things down.
B
Yeah. I think both of these strategy and abstraction make the most sense on the hardest tasks that we don't know if the model can do them. So if you're assigning a task to a model that you don't know if it can implement it, the strategy is very important because it needs to be very specific and narrow where if it's doing mundane code or deep research, the plan is actually not that interesting of a thing. But when you're at the frontier of if it can, I don't know, some GPU implementing thing, you could buy into the OpenAI and anthropic narrative, which is help me implement this research idea in our complex distributed GPU thing. It's like this is a task that's hard for a human and for an AI to come up with the right plan to debug and do this is very narrow path. So therefore the strategy is pretty important of does it start with certain tests and how does it actually build this out to complexity? It's obvious that I need to come up with more, better examples for this, but I think as you push it, it's more natural to see that there's only a few plans that actually get it done. And then abstraction is just important as your task becomes so big.
C
It's like a prompt engineering thing almost.
B
Yeah. And it's like you only have 100k tokens you can generate. Like you need to make sure the model breaks it down so it's not just spawning a ton of infinite processes under itself. Which I do agree that abstraction is an interesting one. Especially when you start to think about these models that could call in other models to do subtasks for it, or parts that can be parallelized with multiple searches or just more compute. I think that kind of folds into abstraction, which is just like, how do you approach a certain nugget of the problem? And I'd definitely say I don't have experience building this. It just feels like if you're going to visualize AI doing the hardest software or other tasks, it's something that humans are very good about. So it's like, how do you come up with a research plan in 10 weeks? There's a lot of how do you prioritize which experiments to do? There's a lot of inductive biases that go into that. A language model would not do well at that right now.
C
Probably memory would be helpful there. So you can just. The way we do this in Real life is we accumulate experience. One thing I did want to dive in on was just parallelism in general. There's one case where with O and the sort of Q ideas, there was one case where it was sort of overhyped in some sense, but now it's coming back. With OAN Pro and DeepThink, the theory is at least correct me if I'm wrong, basically they run o8 times and then you have a reward model rate it and then give you the best of the eight.
B
Yeah, something like that.
C
Something like that. Deepthink also the same. We don't know any details beyond that. I think there's a lot of people exploring that, at least on the infra provider side of how do we parallelize search and planning and all that. And I'm worried about getting too hyped about it. I think it makes a lot of logical sense and this is one of those things where MCTs also made a lot of logical sense and we were fooled.
B
Well, I don't think we're using parallel compute in a way to search over like low probability tokens. We're using it to get robustness. If you like O1 Pro is, it was so nice because it just had a very predictable depth to it. Even on niche topics where like sometimes models just fail.
C
Yeah, you had some numbers that went to like from like 10 to like 95% or something.
B
I don't remember the exact numbers, but that's what it feels like. It doesn't feel like you turn on O3 Pro to make it 10 times more likely to find some niche piece of information. Maybe it'll be a bit more likely. But we're not getting that type of searchy notion of getting more breadth or depth into our tree. So I think there's value to it. Where we want to use this parallelism on what are either the most important tokens that we're generating or like, okay, I know this part is crucial, let's just spend a bit more so that those tokens are better. But it's not a transformative thing. The part that's potentially interesting on the transformative side is if you can get much better verifiers. So I think of verifiers of changing the slope of inference. Time scaling. You spend more tokens at inference, the better verifier you have. If you're doing parallel, it can extract a rare occurrence. So right now if our verifiers are only good at human preference, it's like, okay, we don't need to. We don't need to crank that up very much. But if we are doing really diverse generations and your verifier is better, it'll do better. I think you could look at the extreme between a reward model and an oracle where it's like the oracle is the more you search, eventually it works. So the slope is good, but a reward model is like there's really a capped signal out of it, at least if you're doing this preference type of thing. So the slope is pretty minor and it kind of has diminishing returns. So I do think that, like, if you could fill that with more interesting verifiers, there's potentially more to get out of parallel compute. But I, I don't think it is like, as transformative right now on my outlook. It's more like parallel agents makes more sense of like if you could break down abstraction. Nice. Like as a throughput engine if our tasks are taking a long time, rather than a like at peak performance engine.
C
Okay.
B
Which is a kind of fits with the whole agent versus model thing where agents are much more about like getting it done at all, being robust and being fast. Where this model is one generation. It's like, can you get the answer right?
C
Yeah. We'll spend a little bit more time on this and I'm happy to move on. My pushback or counter to this is that it's a way to pull forward a hypothetical future model that you can then distill from.
A
Yeah.
C
Which is nice.
B
Well, I bet people, I mean, they surely will use these for synthetic data. It's just like the marginal gain on synthetic data is always very high. Or just like Amanda Asko will say, better prompting will effectively make it seem like you have the next generation model where most people don't put effort into their prompts.
A
Oh my God.
C
Okay.
B
Or she had said something of those lines in one of her anthropic interviews, which is just like if you can really figure out how to kind of get into the certain states of the model.
C
Yeah. Well, anyway, that's my pitch for why this is worth doing at all. And I have a science fiction story that I want to write about. Quantum models in a world where you could explore cheaply multiple universes, then sort of pull forward the right one. That would work. This sounds too science fiction. Y but I feel like in a world where we could control quantum computing well enough to explore this and scale it up enough, it could be kind of cool.
B
It also could be that parallel compute is grounds for interesting types of innovation. I don't know what does it mean to have parallel compute with diffusion language models that generate all their tokens at once? Does that meaningfully change some sort of application? I don't really know. I think the diffusion language model would be fun. If it works, you have much more control over inference time scaling. Gemini has one. It's hard to suss out what it changes, but once we have all these knobs, I'm hopeful that it helps build some interesting types of innovation because the parallel stuff is new architectures can change. We'll see.
A
I've been using the Codex vessel van thing and I feel like most of the generations are like, you know, 5%.
C
Different from each other because you use Ruby?
A
No, no, no. I had a JavaScript one, I have a JavaScript one, so it should be good at that. I don't know if it's like just how the RL encoding works. One thing that I've noticed, these models always want to do if statements when there's like a missing M variable so that it doesn't fail when it runs. And I feel like that to me that's just like a symptom of the rl. Yeah, the code is terrible. Like no, you should not write code. It shouldn't silently fail if there's missing variable. It should just raise an error. But I feel like the RL is like pushing the code in this direction and then all the generation have the same pattern. You know, I generate four things, all of them use the if statement just in different pieces.
B
Yeah, that was something I will definitely get over. That's just like the labs are trading off massive gains in performance or small detriments in usability and it's like, do you ship that model? Yeah, you just ship it and deal with it later. But I'm sure they can fix. I'm sure that's a fixable thing.
A
I think like, to me that's the question is like, you know, you talk about how you have gains in like pieces of the thing but not in the full trajectory sometimes. Do you feel like these are examples of that or do you feel like as we get better if we did a longer trajectory, where instead of just writing this piece of code, you have to think about how you're going to maintain it later and like how it's going to run, that's going to fix it or. It's hard for me to grasp.
B
Yeah, the software stuff is not easy because it's almost like maintainability. Almost feels like a human preference type issue again where somebody could look at it and be like, yeah, that's not as good. But adding the heuristic and trading seems very messy. Yeah, so. So maybe it, maybe it is. I don't know. There's a lot more to dig into that. I mean this is what Anthropic says they're doing. And just what are the actual frontiers in making. Like they said they're working on code only. And what does that actually mean? A bunch of it is going to be design trade offs and like how much autonomy the model has versus these potential side effects from training longer that we don't know how to get rid of. I mean that definitely could be the sort of a behavior like that is what I would say is like a simple thing to remove where it might just be obsessed with some code format that fails when you revisit it or something. Even if it's like everyone has seen it with just bypassing test cases. I think they'll be a bit more nuanced than that, but they could probably be super simple.
C
This topic has a similar semantic content address for me as over optimization, which is something that you've written about.
B
It is over optimization with a different reward function.
C
I know. Okay, well, I made that link. I want to verify that we are taking on the same wavelength. I just wanted to go over again specific topics on things that you've spent some time thinking about. You write that there are three types of over optimization. First was RL for control, second was RLHF. And third is rlvr. Third, they always happen. Obviously RL is no stranger to reward hacking, but maybe do you want to elaborate on how things are evolving in terms of how we're learning as an industry?
B
Yeah. So that three things breakdown is for people to put the pieces together for what has happened historically. All of these over optimizations are just the model optimizer is strong enough where it can manipulate the agent with respect to the environment or manipulate the environment in a useful a way that's useful to its target signal. Also for context, I think with what we're doing with language models in RL in general is that if there's something that can move its reward signal up, it'll move the easiest thing, the most direct things to move that signal up. So that's part of the story that I said on sycophancy, which is this reward model for user feedback was probably so obvious that humans just like to like stuff that is like people press that thumbs up button when they're filled.
C
Bullet points.
B
Yeah, like all those things have just been really easy for the model to extract. So once they added it, the model changed a lot and the score went up a lot. And it was easy for the RL to find that in control. The oldest rl, the environment is normally a simulator that is fixed. There's no feedback. So the over optimization looks like unphysical and nonsensical behaviors. There's the motorboat example going in circles. There's an example is a project I was middle author on was effectively over optimizing Half Cheetah, which is this Mojoco thing. Instead of running, it did cartwheels off into the sunset and got infinite numbers. It's like obviously not the intended purpose. It looks like a glitch. So it's just kind of manipulating the agent interface with the environment. RLHF is kind of a classic case where the model will just break down because the reward model is imperfect. The environment is really imperfect. In the RLHF case where it's so.
C
Sparse, it's very artificial.
B
Yeah, it's a very artificial environment. So it makes sense that these actions which are generated tokens will do things like reduce into just repeating one token over again. It'll be like. I think one of the early examples we had playing with this at hugging face was the model would just say JavaScript. It would be JavaScript. JavaScript, JavaScript. It was like some toy data set. And it's very obvious when you see it. It's probably harder to see when you're at the top and making decisions on when to stop training if you're doing a lot of RLHF. But that was kind of the phase that people have gone through. And now we're in the RLVR phase, which is we're giving the model reward when it does something quote unquote, right. For math, it's a bit harder to over optimize, I think, unless you have tools and the model learns to search and cheat instead of learning math, which I'm sure somebody could see that out in the world, which is like, oh, I'll just find the your training. It's like the model's like, oh, you're training me on Stanford's problem set for cs, whatever that it's seen a thousand times. So it's like I'll just go get the solution manual, which I'm sure there's somebody can find an example where that has really happened. But on code and maybe information retrieval, it's easier to fudge. So the code thing is like the easiest way to get a unit test to pass is just put a Pass in it. That is not too surprising that a model can learn how to do that. And there for code you need more reward design, which I think would be a nice for like a substantial academic work is like, what is reward design in code for balancing this sort of like understanding this over optimization of test cases or avoiding failures or something like this. I'm sure there's. It's not just going to be a controlled environment because these models are complicated, but I would guess you can reproduce that in some ways just to double click.
C
Reward design means, for example, giving credit. Partial credit for partially correct work.
B
Yes. Or giving the model a slight penalty for doing the unit test thing if you can detect it. Yeah. For cheating. Which adds a lot of complexity to training these models compared to math, which is just if the answers. I mean, you can look at the GRPO math and partial credit is weird in that because it's kind of normalized per batch. I don't know if I have a whole spiel ready on it for that. But it's also just, it becomes very complicated if you're mixing domains and it's like, is partial credit in code better than partial credit in math or all of these things? It's like reward design becomes very complicated and that's what you're incentivizing the models to do. Different things.
C
Yeah. Is there any literature or hypotheses about mixing these things? So let's say you have the one for code, you have the one for math, you have whatever other verifiers you can come up with and individually they work. Do they conflict?
B
I think part of the intuition of RLVR is that the model is good at knowing which prompt area it is. Which is why the models don't get worse on knowledge benchmarks if you're training on just math or precise instruction following. So the model just kind of develops an intuition for where the different prompts are in space. So the gradient updates will be different depending on your batches, which is partially why people just say do big batches. So a lot of the model is activated and you have a less noisy signal with rl. But a lot of the intuition is that the model just kind of handles that. And there's interesting questions on sequencing. Do you do large scale math? Encode RL to get the sequence length and then add in more general stuff. Yeah. Which DeepSeq mentioned. But that's one thing to go. The Deep Seq report is like math and code to more general RL. There's a question on where do you do tools if you're going to do like code execution and search within this. So I don't know if that's interweaved or if it's a second stage.
C
Got it. Yeah, I don't have comments there. It's just like it's surprising how much is not known and you just need a lot of compute for ablations.
B
The inference high inference length generations definitely just kind of breaks all infrastructure because there's just so many tokens more opportunity for out of memory or other things to go wrong. So it's like just on a default all of your trading jobs need way more GPUs for the memory of inference.
C
Sure.
B
Or just like training but it just makes it more of a pain.
C
Yeah, that's a cost thing. One of the maybe controversial takeaways from the Gnome prod which you listen to was that there's also just wall clock time of just getting feedback from the environment, whatever that is. Especially if it's like a real world thing. And I'm just like yeah, I mean there's some point at which your training runs cannot take longer than a human life. So to me that was the wall. He disagreed with that. But that was what I meant by it. At some point you long and friends you. You do want it to terminate within some reasonable amount of time regardless, just as a user.
B
Yeah.
C
We have to find a way to accelerate internally within the training time faster than the passage of time in the actual universe.
B
Yeah, I'm not worried about that problem but I agree with you in principle. Right.
C
I'm stretching this out too far. I get it.
A
As we kind of start wrapping up what are other interesting ideas that people should pursue? Like in your AIE talk you said what I'm thinking about for scaling rl you had big multi domain data sets, difficulty filtering, long run times. Is there anything specific that if there's people out there that are either doing research or they want to do a company or whatever, these are like interesting things that you don't want to do that you want other people to explore.
B
Most of them I think are not in the reasoning space which like the talks have been about reasoning. So I've been long talking about character training is something that I think is under indexed on and been advising a student that's at one level like personality training and how that different ways of changing the personality of the model from prompting activation or fine tuning like data engineering. So stuff that Joanne jiang does for OpenAI, how much does that matter? What are the fundamental research things hopefully can Share more that I've been advising a student on that. So I've been saying that for a while.
C
Just as a side note, do you like the model spec stuff that she's doing?
B
Yeah.
C
Okay.
B
That.
C
That trajectory.
B
Yeah. So I've, I've been an early fan of that. I mean, that's how she finally. That's how like, she noticed me. As was like the only person that covered it when they first released it. I think it was like over a year ago.
C
I would. I liked it. Yeah.
B
Not many people did.
C
Okay, all right, all right. You were first.
B
I don't know. I don't know. But like, that's what she said to me.
C
Well, we had a, you know, we had a model spec talk. Closed the whole conference. Right. Like, that was my sign of. Pay attention to this, guys.
B
But it's real because of what it sends to develop. It has the developer benefit of where your model's going. And then also just regulatory. I think it is very important to what is an intentional behavior versus just a training error. I think for model transparency, it's really fantastic. And I've said that the model spec is much more useful than a constitution. The constitution is an intermediate training artifact that you give to the training algorithm in order to get the model that you want. It is not necessarily like, what model did we. We don't write down our goals of the model in a constitution form.
C
By the way, have you looked at the constitution?
B
Not recently.
C
They talked about it. They put in like Apple's like design guidelines, but then also like the UN Declaration of.
B
So at this level, I've seen it. I didn't know if they've updated it. That's very odd. I hope that Anthropic would write a model spec. I'm not too optimistic, but they're the next domino to follow.
C
Well, so my take on that. Actually, I pushed for this too late because OpenAI already approved the talk and all that. But I was going to ask them to compare the OpenAI model spec to the Cloud 4 system prompt, which is their closest thing to the model spec.
B
It's. The system prompt is incomplete because OpenAI has things in the model spec that their model doesn't currently do or especially when they started. It's like we want to. When they first released it, it was like we want the model to be able to engage on sensitive subjects and maybe even NSFW was in their model specific, which is they're just signaling of what they wanted to do. And they say this is very hard to implement because there's all these obvious risks to doing this, but it's like in an ideal model where we can solve every problem, this is what we do. Which I think is good, as I said for many different stakeholders. Mostly my thing is there hasn't been a good foundational research paper on that. There's just a lot to do. It also runs into personalization and personality or similar, which is like if open models are to win, part of it could be just like everybody can have exactly the model they want. We're serving GPT 4.5. It's kind of its thing you can prompt it, but if fine tuning is more effective than prompting, everybody can have the model that they want. So it's a good. It's like an academic problem or an open ecosystem problem where people are fighting on the turf that it feels more likely to win, which is good.
C
Is this somewhere where you like as speaking as AI to omo you want to win or is this. You're just advising a grad student on it?
B
I don't think it's a differentiating factor yet, but I'm very open to working on it.
C
I think open models have a strong roleplay use case and character personalization, all that stuff. Right. Especially because people, they find their waifu, they want to keep their waifu and that's the derogatory term for it.
B
But I would say that we've definitely discussed it and I want to part of OLMO should be that it is a base model that's easy to take in directions that you want and we will have an opinion that is probably slightly conservative on personality. I mean I've gone through the OpenAI on model spec and it's like most of these we agree with and be conservative on anthropomorphization. I don't remember I did it a couple months ago. But a lot of it is openness or transparency which is like if we're training an open weight model personality, we're not going to withhold anything and we have a different hierarchy. So most of them are on that type of information exchange rather than be kind like OpenAI's model stack is pretty agreeable if you read through it and it's like treat the user with respect.
C
All these things raising kids that way. Just read the spec.
B
Yeah, it sounds kind of stupid but then the last thing is for people doing research it's like wacky model routing things where you figure out a bunch of different models off hugging face to route to because an open model tool thing could use way more Models more easily than any OpenAI product. Because OpenAI is restricted to OpenAI's models where maybe, I don't know, OpenRouters. I'm going to make a product out of this, which is a router. Openrouter actually does it and they're like our chat window knows the best model based on all this usage that we have for your query.
C
There's people that started other way, like Martian, not diamonds. I don't know who else is. He would know.
A
There's a bunch.
C
There's a bunch.
B
Yeah. So I don't know if that would work. Hugging face should work on it. It's a moonshot idea. You don't know when it'll work.
C
Given your hugging face background, how does hugging face make money? This is a very common meme question.
B
I think mostly enterprise deals.
C
That's what they say.
B
They're doing their thing, they're supporting.
C
I mean look, they're, they're great, they're big, they're profitable. It's just not that obvious to most people.
A
I like the router idea for media models. I feel like there's like so many. There's like a long tail of like a generator remover, like a style applier. Like that is actually hard to find. On the tech side I feel like just use the big model. Unless you're like under some like latency or price constraint, you should just use the best model. Even when we're doing thumbnails, I'm like, okay, I'm trying to remove a background of somebody and it's like I go and replicate and there's like 55 background remover.
B
Yeah, I just use Adobe because it's a website.
A
Well, but that doesn't work. Like the Photoshop model is bad on some things, but again it's like. Or I want to generate a diagram to like mimic something and it's like, well, which model is better? Diagrams, you know, it's like those are not easy to find because none of the benchmarks are.
B
Really part of the argument is that if distillation works really well, we could just keep making the target for distillation smaller and smaller. Which is you have models that are very narrow and they're mimicking these huge models on something that's pretty, I don't know, like reformatting tables. It's like can you do a table reformatter from markdown to latex in a 100 million parameter model? If you get it small enough that is really economically feasible because it's effectively free. It imprints and instantaneous My pushback on.
C
This is just if you're doing image editing, 4.0 should do it.
A
All of it. Well, yeah, but I think it does.
C
We're just not there yet. Give it five years, it'll do it.
B
Right.
C
So why work on a router at all? You just scale up 4.0, I guess.
B
Yeah.
C
Tell me where the logic is here. This is like a temporary thing.
B
On device.
C
On device.
B
The local modeling community I think is much smaller than people give it credit for because most of the use for open models is still in APIs. It's like deep sea API. It's convenient and it's like if there aren't that many models, somebody is going to host it for cheaper than most people doing it themselves. That's pretty realistic. But there is a small community that needs local. The best outcome is if open models can compete on not just long tail things. But that takes the most transformation.
C
Side note. So I resisted building my own like buying my own GPUs, building my own cluster for this reason. I'm like APIs will solve most of it. Like people are losing money to serve me models. Why am I having those? Except for the fact that 4090 prices have doubled in the last year. So actually you made money doing local models.
B
How does that make you money? Because your investment goes up. Yeah.
C
You sell the card and you make you through it.
B
So as used 4090s goes up. Interesting.
A
Should have bought a 4090.
B
I got a 4070.
C
What is this? Well then it puts me on till like should I buy 1590? If it ever is widely available.
A
GDC. They were doing the drops. Yeah, I know, it was crazy. We were like running to the camper to buy it.
C
Any other topics before I give closing question? Just generally your work. ROVR or topics of the day.
B
I think companies should keep considering releasing open models mostly for priority and onboarding. It seems like a way it's going if OpenAI is releasing it.
A
Are you excited about that? Do you feel like it's like a psyops again?
B
The OpenAI model will be good. I expect it to be.
C
No, they're pretty serious.
B
It'll be best in class for some sized category and some subset of tasks. That's like OpenAI only does things like that. You have to give them the respect.
C
They won't deserve it.
B
Yeah, that is a big open wins when more people are doing it. So that's a win and well, I.
C
Mean hopefully they are actually open about the techniques and not just the weights.
A
Do we think the size of the open model tells us anything about the hardware that they're going to build?
B
No.
C
What?
B
No, they're so secretive about this. That's why they haven't released GPT 3.5 or anything, because it's too revealing about internal stuff or plans.
C
Oh, okay. No. So you're talking about Stargate or what.
A
Kind of hardware, Johnny? I.
C
No, yeah, that's a different form factor. Yeah, that's it.
B
Yeah. Yeah, I think that thing will run on the cloud. I don't think that'll run local on your.
C
Well, okay, we have to talk about it. It seems like every podcast we talk about it. So apparently the news from today, which I think you were looking at, was that it was like an ear device that they got sued over or whatever. But I think the ear form factor is pretty good. I actually did get there with B in terms of where does this ultimately go? You want the AI to hear what you hear and where do you hear? Where you hear it on the Earth. Like, that's pretty much it. I don't know if you guys have like thoughts on wearables and where that goes.
A
I try to be. I think it just knows too much. That's my. That's really my.
C
But you want to give it context.
B
Yeah. I have false privacy hopes. I think like a lot of people, I mean, that's the whole thing is like people don't actually care about privacy.
C
It's just not taking. You know, it's just really good memory.
B
I think the meta Ray Ban form factor is good. I don't think it's as mass market. It's like if you get it in the airpod size form factor, it's a way bigger market for obvious reasons. But the sunglasses form factor is the thing that works, I think. Okay, I don't use them for AI, but they can fit the AI to work it.
C
Like. Yeah, empirically, it obviously works.
B
Yeah.
C
Cool. Well, the last question I was saving up was this whole, what is meta doing? You actually had a pretty interesting post back in. When was this? In april. You said llama 4. Did Meta just push the panic button? I feel like back then it didn't actually push the panic button, but now they really pushed the panic button.
B
That's fair. I think the panic button at the time was the whole LMSYS model not being the model that they released thing, along with a bunch of weirdities about the day of the week they released. But to be a model that claims to be open and then not release the model that is your Leading claim is just like a. That is like bad execution. Bad execution. Yeah, yeah. Which is fine. And then the recent stuff, I think mostly can be boiled down to talent is cheaper than GPUs by a dramatic margin. And at the end of the day it's like, okay, if we're spending this much, they go to the room and they stare in the mirror, they're like, wait, it might not actually be that ridiculous to spend this money on the top people. It's like, might as well try.
C
They already spend it on VR.
B
Somebody was bound to do this eventually. And it makes sense that it's like the. It's like if Apple some way somehow decide like we're going to do this, they're going to come in and do exactly what Meta is doing.
C
They need a founder mode CEO who's like, screw it, we'll take the L. The thought that occurred to me is Meta, instead of spending on VR, they should spend on rlvr. Well, I think the question is, I think some researchers, most people will take the payday and happily move to venture.
B
Everybody has a bribe number, right? Just the hummer. Really big. Yeah.
C
But I think some researchers are uncomfortable with the idea that this is sort of the great man theory of research, that you have to pay this much to get this level of talent.
B
The talent is definitely distributed. A lot of the people that they would be paying this much have the confidence to redo things or to just do some of the same things. And just whether you call it feeling the AGI or just drive to build things, or feeling the AGI is not that different than a lot of things that have existed in Silicon Valley lore in the past. Just people with the vision that are willing to execute on it and they see something coming and those people make a big difference. I think you have those people and you remove bare accuracy. Getting technical, talented researchers is actually something that Meta has a lot of or has the ability to get a lot of. So it's like, it's a lot of recycling, which is very hard on individuals and morale of an organization. But understand the approach.
C
Yeah, for sure. Cool.
A
That's all I have. Any parting thoughts on how you're going to build the American Deep Seq? That was a nice tweet.
B
Yeah, mostly. If I have to look at what my. If you were asking me what my 10 year goal is and it's like I only will have a two to five year goal where I think as models are shifting more towards agents, I think that like scaling is slowing. It's like There's a bit of a fixed cost and a fixed path to getting towards something like American Deepseek or mostly just I would say it doesn't have to be American. If it's fully open, if you have everything and you can modify it, which is like there's a few things that need to fall. A lot of it is just more resources. But it's like Olmo 32B is if you squint like original GPT4 level and fully open. And it's like there's a few levels that you need to go through. Like that's obviously a dense model. It needs to be taken to sparse moe and you need to scale it. You need to have a lot more GPUs and then you need to do like large scale reasoning. It's like that's the goal that I want to do. There's a lot like that's what I want to do. There's a lot of complexity and navigating like how to work with AI. What does AI to do to get there? It's very hard. I think that, I mean it's a nonprofit, it's hard to get the resources. And building a model is a lot of aligning a lot of different people. That's the deep seat story is they have great people. OpenAI has kept a lot of really good people for a long time. Anthropic has gotten a lot of good people right now. And it's like it's a lot of incremental, hard technical problems that you need to stack up. I think that's what I would like to do and make work in the next couple of years, but it's not easy to get there. So that's, that's the pitch is like AIT's best case scenario is AIT is going to do other things. Like you can't just run a nonprofit or a company that says our goal is in three years to have an American deep seat. Like no one's going to keep paying the bills on that because you have to tell a better story. But that's like what I would like to do in that. And I'm sure AI too will do many more interesting things along the way, like product stuff. I don't think it's necessarily product, but what are cutting edge things in AI that we can make a new architecture for certain things or what are demos of open models working better? Whether you have private data or something or just far out ideas that could take you off the transformer trajectory, I think that you still need to be doing these to lead an AI.
C
Thank you for working so hard on truly open source AI.
B
Yeah, it's fun. I think it's. I mean it makes it easy to align like values with what you're doing. It'd be better for the world if more things are open and therefore a lot of it is just willing it into existence. And I take seeing what OpenAI does or is saying they're going to do as hopefully a win coming soon. Deep sequels is the most unexpected win that made some other dominoes fall. I think that is the path forward and see what it takes.
C
Thank you so much.
A
Thanks for coming on.
Latent Space: The AI Engineer Podcast
Date: July 31, 2025
Host(s): Alessio (CTO, Decibel), Zwiecks (Founder, Small AI)
Guest: Nathan Lambert (Research Scientist at AI2, Founder of Interconnects.ai)
In this insightful episode, Nathan Lambert, a leading AI researcher at AI2 and founder of Interconnects.ai, returns to the Latent Space podcast to unpack the breakthrough concept of RLVR (Reinforcement Learning from Verifiable Rewards) and its intersection with the latest developments in open-source language models, reasoning architectures, agents, and industry trends in AI alignment and tool use. Nathan shares details from the front lines of open model research, reflects on current industry "psyops," and discusses the technical, infrastructural, and cultural challenges faced by open model communities in 2025.
Background: Nathan recaps the evolution from his TOLU (open-source post-training recipe) work to the RLVR paradigm.
Motivation: Compress and simplify complex industry-grade post-training recipes for broader accessibility and reproducibility in the open-source community—closing the gap between academia and the techniques used at frontier labs like OpenAI.
Quote:
"What the goal is, is to try to do the work to compress what are complicated industry post-training recipes into something somewhat tractable..."
— Nathan [01:36]
RLVR's Naming:
RLVR stands for Reinforcement Learning from Verifiable Rewards—a generalization beyond domains with strict ground truths (e.g., math/code) to any task with verifiable success criteria.
Influence from Industry:
"Everyone just does RL on the outputs. And that's how we got the RLVR idea..."
— Nathan [04:32]
Modern Eval Platforms:
Debate over the value and future of human-centered evaluation platforms such as Chatbot Arena and new entrants like Yup.
Network Effects:
Leaderboards and ELO rating systems are crucial as community focal points but bring gaming and single-round limitations.
Quote:
"Having clear norms and things that can be hill climbed forever is very good."
— Nathan [14:12]
Emergent Search Behaviors:
O3 stands out for aggressive, repeated search behaviors per query—a possible preview of a new baseline for LLM services.
Counterpoints:
LLMs must learn what and how to search—pure retrieval without sufficient generative capacity is insufficient.
"You need some baseline intelligence to make all this work."
— Host [23:47]
Tool Use in RL:
Models often fail to persist in using tools after early failures—training models require not only correct tools, but strategies to recover from errors and experiment.
Research Opportunities:
Skills and Taxonomy:
Nathan introduces four key bottlenecks for future agents:
"Planning is a word that people already use a lot. Strategy would be the direction the model should go in... Abstraction is how does it break it down into things it can actually solve... Calibration is not wasting compute and knowing when to give up."
— Nathan [39:13]
Over-Optimization in RL:
RL—whether in control, RLHF, or RLVR settings—always risks exploiting weaknesses in reward signals.
Examples:
Quote:
"All these over-optimizations are just the model optimizer is strong enough where it can manipulate the agent... to its target signal..."
— Nathan [55:07]
Parallelism:
Technologies like O1-Pro and DeepThink use parallel generations with reward models to identify best answers, mainly for robustness rather than increased depth.
On-Device/Local Models:
The future of open models includes hopes for on-device, highly specialized models, but most users still access through APIs due to infra costs.
Industry Dynamics:
Meta’s massive open model releases, hiring wars, and industry “psyops” are dissected.
On Naming RLVR:
"It's also very clear of like RLHF is four letters. It's like we want to evolve that and have a similar four letter acronym. It's not that much magic to it..." — Nathan [05:38]
On Frontier Human Data:
"We still don't have the answered question on how important human is versus AI feedback. Every time I check in with people at Frontier Labs, they're like, yeah, we still use human preference data. And I'm like, okay, I don't have access to that and I don't know how to measure how, how much it gives you." — Nathan [11:32]
On RLHF Book and Field Pace:
"RLVR is going to be changing so much in the next 18 months... We've already seen it. There's all these new algorithms..." — Nathan [15:47]
On Over-Optimization:
"All of these over-optimizations are just the model optimizer is strong enough where it can manipulate the agent with respect to the environment or manipulate the environment in a way that's useful to its target signal." — Nathan [55:07]
On OpenAI and Model Transparency:
"It's real because of what it sends to develop. It has the developer benefit of where your model's going. And then also just regulatory..." — Nathan [63:24]
Listen to the full episode and find show notes at Latent Space.