
Nathan Lambert
There was a big change in leadership at Meta, and Llama's future is unknown. So there's this big vacuum of influence, which has been absorbed by the likes of Qwen, DeepSeek, and Kimi Moonshot in terms of who's trying to build things with open models. And that's a big shift.
Luca Soldaini
We're launching the Olmo 3 family today. And just like every single model that we released before, we're not just releasing the final models, we're putting out all the details.
Nathan Lambert
It's the first fully open reasoning model where we show doing RL on base models and distilling from bigger thinking models. And there's a lot of discussion within the US that there's good reason that we should own the whole technological stack, and that includes open models. There are people that are really starting to wake up to this.
Matt Turck
Hi, I'm Matt Turck. Welcome to the MAD Podcast. Today we have a special episode with Nathan Lambert and Luca Soldaini from the Allen Institute for AI for the release of the Olmo 3 model family. At a time when most open source releases are just open weights, AI2 is going all in on real openness: models, data, recipes, and intermediate checkpoints. In this conversation we break down Olmo 3's architecture, the rise of thinking models, and the increasingly high-stakes race between US open source efforts and fast-advancing Chinese powerhouses like Qwen, DeepSeek, and Kimi. This is a rare, fully transparent look at how modern AI models actually work. Please enjoy this great episode with Nathan and Luca. Guys, welcome to the pod. A big announcement today and a big day for open source AI. Walk us through what it is that you're releasing today.
Luca Soldaini
Thanks for having us. Yeah, we're launching the Olmo 3 family today. So this is our latest family of open source models. We have a 7B model and a 32B model. We have models that can think, and models that can follow instructions and use tools. And just like every single model that we released before, we're not just releasing the final models, we're releasing the entire recipe we followed to get these models. So the data, the intermediate states, the evaluation frameworks, all the details, all the bits that people need to know to make models like Olmo.
Matt Turck
Specifically, there's Olmo 3 Base 7B and 32B. So what are those?
Luca Soldaini
We have, say, five flagship checkpoints that we're putting out. Two of them are base models. That means these are models before they get trained to respond to user instructions. So these are really good for folks who want to take the bulk of the compute that we spent in pre-training these models and then customize them for their use cases. So there are two base models. There's a smaller one that's more efficient and takes about one GPU to fine-tune for a use case. And then there's a larger 32B that takes about one box of eight GPUs to fine-tune. And then on top of that we have our fine-tuned, post-trained models for various use cases. There are a couple of models that are thinking models: Olmo 3 7B Think and Olmo 3 32B Think. These are models that, just like a lot of the reasoner or pro models out there, can spend compute at inference time to think through a problem, solve it, and then give you an answer at the end. And we are also releasing a 7B Instruct model. This is a more immediate model that gives you faster responses. So it's really good for bulk data processing or use cases where you want low-latency responses.
Nathan Lambert
I want to add more color to these things. I think Luca is underselling the base model. We're going to talk more about this, but over this year a lot more people have been releasing open models, especially large open models. But some people are starting to not release base models. We have a bunch of DeepSeek-sized giant MoE base models and a bunch of small base models. But, for example, Qwen 3, which everyone accepts as a research standard and an industry standard, doesn't have a 32B base model. So this base model is similar in quality to the best available, which is that Qwen 2.5 32B was still the best base model. The upside is that we have all the data, so people can actually do some sort of continued pre-training, and hopefully it's a bit easier to modify and understand the behavior. So that's exciting for us: the potentially best-in-class thing is also a fully open thing, which is not something we get to say a lot at AI2. Sometimes it's like, oh, we replicated this and now you can do it yourself. This is actually a good thing. And then 7B models, which Luca was saying used to be this huge industry standard where there were just so many of them: it's still a standard size category, but there aren't quite as many models there as there used to be. And the instruct models especially are less common. This is up there with the best in the world at that size category again. I think of this because Llama 3.1 8B is one of the most used models on Hugging Face of all time, and this should be better. In our measurements, we see it as being better than Llama 3.1 8B. Hope that holds up for people; we can release more and fix it. We might not be at frontier scale, but these are things that are still widely used in the world.
And then it's the first fully open reasoning model where we show doing RL on base models and distilling from bigger thinking models and all these things that people have seen a ton of times throughout the year. I think with another thinking model people are like, ooh, what is this one for? But we have all the data and we show people what to use it for. A lot of times, especially with our open post-training, the datasets become a standard. Our Tulu 3 dataset from last year, which we used for Olmo 2, is in the Thinking Machines Tinker API, and we want people to use this data, modify it how they need to, and look at the different training stages.
Matt Turck
Great, great, fantastic. So to the data point, talk about Dolma 3.
Luca Soldaini
Dolma 3 is the data that we use in pre-training for Olmo 3. So it's what we use to create the base model. It's really three parts. There's the pre-training pool. This is a pool of about 10 trillion tokens, from which we have an algorithm, also fully open source, to sample about 6 trillion tokens that we use during training. And we have some new techniques there. It's kind of interesting: instead of repeating documents at random to get more training tokens, we intelligently repeat the tokens that have the most value. So we have that part. There's a smaller subset that we use during the mid-training phase. This is a more focused dataset with a lot of math, high-quality code, and the sort of knowledge tidbits that you want a model to pick up. And then finally we have a set of documents that are particularly useful to make models able to work with long context. I'm really excited about this one because historically, in the data that is openly available for people to build a language model, you don't have a lot of long-document data. So these are documents that we crawled ourselves. They're PDFs, mostly science PDFs that are openly available on the Internet. After the crawl, we have a pipeline, which is also open source (everything's open source), to turn them into plain text. Web pages are kind of short: 95% of web pages are below 3,000 tokens. These are quite long. We have about 600 billion tokens that are in documents longer than 8,000 tokens. So these are really good for people to develop other ways for models to understand very long inputs, which is typically something that people are not able to do today in the open unless they are a big lab and can acquire data that's long enough to do this phase.
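The idea of repeating the most valuable data instead of repeating at random can be illustrated with a small Python sketch. Everything in it is hypothetical: the `Doc` type, the `quality` scores, and the `plan_repeats` helper are made up for illustration and are not the Dolma 3 sampling code. It only shows one simple way a token budget could be spent on the highest-scoring documents first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tokens: int      # document length in tokens
    quality: float   # hypothetical quality score, higher is better


def plan_repeats(docs, budget_tokens, max_repeats=4):
    """Return {doc_id: copies} using at most `budget_tokens` tokens.

    Each pass adds at most one copy per document, visiting documents
    from highest to lowest quality and skipping any that no longer fit.
    """
    copies = {d.doc_id: 0 for d in docs}
    remaining = budget_tokens
    ranked = sorted(docs, key=lambda d: d.quality, reverse=True)
    for _ in range(max_repeats):
        for d in ranked:
            if d.tokens <= remaining:
                copies[d.doc_id] += 1
                remaining -= d.tokens
    return copies


# Toy pool of three 100-token documents with assumed quality scores.
pool = [Doc("paper", 100, 0.9), Doc("forum", 100, 0.5), Doc("spam", 100, 0.1)]
print(plan_repeats(pool, budget_tokens=400))
```

With a 400-token budget and a 300-token pool, the one extra copy goes to the highest-scoring document. A real pipeline would likely sample probabilistically and cap repetition per source, which this toy version does not attempt.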
Matt Turck
Okay, great. Thank you. You alluded to some of this, but talk about performance and efficiency.
Luca Soldaini
It's very hard to measure the performance of a base model. For the instruct and the thinking models, Nathan will have more info about comparable benchmarks. But the base model is really good. As Nathan was saying, not that many people release base models, so we're kind of limited there in terms of comparison. But it's at the level of Qwen 2.5, Qwen 3, or Gemma 3. It's maybe a little bit better on some capabilities and they're better on some others. They're in the same ballpark. The absolute performance of a base model doesn't matter as much as that of an instruct model. In the end, you want to be in the right band, where the model is capable enough that your post-training team can then do magic on the checkpoint and make it really, really good.
Nathan Lambert
I would say in post-training we're the best models that don't start with Qwen 3. And it's reasonable to say that they are comparable to Qwen 3. On some benchmarks we beat them, but on some benchmarks they're way ahead. A lot of people are like, we don't know exactly what Qwen 3 puts in the training data, so we don't know if on some benchmarks they benchmaxxed a little harder than we did. I mean, we try to hill-climb on benchmarks to make our model good. I think there's always some level of this. But it's in the same ballpark on plenty of things, and we're hoping that there are use cases where people that use Qwen 3 8B or 32B are willing to switch over, get some value out of this, and maybe modify it for their own use cases. But Qwen also releases great models. So it's a never-ending uphill battle that motivates you to do better, to try to get close and compete with what they're doing. They released Qwen3-VL, their vision models, and on text-only benchmarks it's way better than the models they released in April. So it's like, okay, that's the new baseline, and most people don't know about it because they think it's just a vision model, but it's actually a much better text model. The bar's always rising. At 7B scale, Nvidia had Nemotron Nano V2, which is a 9B hybrid model, and I think our 7B model is pretty much equivalent to that. These are good models. There are not that many in these size bands that are really strong. So I think we're happy to be there and happy to point out other people are doing great work here. It's not like we can ignore Qwen. That's a losing strategy.
Matt Turck
Luca, just to drive it home: the concept of open source in AI has different flavors. Walk us through what that means and where you guys are at.
Luca Soldaini
That's a topic that always gets overlooked a little bit in discussion. But yeah, when it comes to models, there are different levels of what people consider open source. For the majority of models that get released, I think the best term to describe them is open weights. Your Qwen, your Gemma, your Llama, Kimi. What gets released is a set of weights that corresponds either to the final state of the model, that's the most common, or maybe the final state of the instruct model and the final state of the base model. And there are plenty of cases where that's enough. You can build great software on top of it. There is an equally large set of cases, from research to application, where that's just not enough. You want to have intermediate states of the model so that you can customize it better. You want to have access to the data so you can maybe redo a step of the training while infusing your own data. You might want to have access to the pre-training data because you have this incredible research project that is going to change how we think about language models, but you need to know what a language model is trained on. So we want to support those use cases. When it comes to Olmo, if we can release it, we will release it. We can't release, I don't know, our GPUs out to the world. That doesn't work. But when it comes to the data, the intermediate checkpoints, the benchmarks, the software, anything we can, we'll put it out. If people ask, hey, you described this part of your pipeline but you haven't put it out, we'll release that part as well.
Nathan Lambert
We've always gotten questions about intermediate checkpoints during SFT or other fine-tuning stages. And now we have intermediate checkpoints during our supervised fine-tuning for reasoning and for instruct, and then also for our multi-day RL runs. At the end of these we have intermediate checkpoints. A lot of people like to understand checkpoints and do research on them, but don't have the compute to train. And now it's like, okay, this is all there.
Matt Turck
Before we dive into the specifics, I'd love to take a step back. It's been a very intense year in the world of open source AI. The DeepSeek moment feels like it was years ago, but in reality that was at the end of January, so 10 months ago, and a lot has happened since. Nathan, could you help us recap the key events of 2025 so people understand what's happened?
Nathan Lambert
Yeah, if I try to make a list of actual models, I'm going to forget some because there are so many that are notable. I think starting with DeepSeek, as you mentioned, is definitely the important thing. If you talk to people building models in China, a lot of the consensus is that DeepSeek showed them that AI could be a big deal, and a lot of these companies were like, oh, we should do what they did. So there's just a ton of labs that have popped up over the year. Known players in addition to DeepSeek, like Qwen, Z.ai, and Kimi Moonshot, had already existed, and these really stepped up to be much more known names. Especially if you're following Western, SF-centric discourse, these are things that people are using and talking about, which is a big change. But there's just this huge mass of models coming from China. Ant Group is releasing trillion-parameter MoEs with really strong benchmarks. There's Meituan, which is like the Chinese equivalent of DoorDash, just another big tech company in China. The standard way of developing language models there has become to release them openly, and that whole ecosystem is going forward with this. At the same time, there was a big change in leadership at Meta, and the future of Llama, which was really the paradigmatic definition of open source AI, is unknown. That line of thought just ended. So there's this big vacuum of influence, which has been absorbed by the likes of Qwen, DeepSeek, and Kimi Moonshot in terms of who's trying to build things with open models. And that's a big shift.
And I think there's a lot of discussion within the US that there's good reason that we should at least have influence over the whole technological stack, and that includes open models, because realistically it's the big tech companies in the US that will capture the downstream value of that, from having the researchers be in close proximity, speaking the same language, and used to the infrastructure. I think this is something that we've seen for decades in the tech industry, so I don't think I need to explain it that much. And there are people that are really starting to wake up to this. I think June and July is when the Chinese model providers really reached the point where you could not ignore them. That's when we had Kimi K2 Instruct, Qwen was releasing a lot of their big models like Qwen Coder, and there was GLM 4.5 from Z.ai, and that's just continuing now. So as we're recording and releasing this podcast, there's a lot of interest in what the US companies are going to do to respond to this. I know that Nvidia is making a lot of noise here. They invested a lot of money in Reflection, and there are other players that are trying to get going, but there's urgency. We don't have a lot of compute at AI2, but we could make a dent in this at some model sizes that people actually use. We focus on researchers, and I think dense models are great for researchers because they take a little bit less compute and engineering resources to use. If you look back at this podcast in the coming months, I do think it's going to look like there are a lot more labs in the US participating. I mean, OpenAI has released some models. It just takes a long time for the norms to shift in the US, where they're established in a different way.
Matt Turck
And Qwen is widely used in a way that people may not have completely realized. As an anecdote, there was the example of Airbnb talking about using Qwen over ChatGPT a few weeks ago. But do you have any sort of stats or anecdotal evidence on the usage of Qwen?
Nathan Lambert
The other famous quote was a Martin Casado quote in the Economist where he said 80% of companies are building on Qwen. That has been corrected: it's 80% of companies building with open models that are using Qwen, which is like 16 to 24% of his portfolio, which is still a lot. A meaningful number of people are trying open models for things, and most of them are using Qwen. And then there's the likes of Cursor, which released their own model, Composer 2. It's accepted that it is built on a large Chinese MoE of some sort that was released openly. There are some obvious tells, like it switching to Chinese and things like this. That is the sort of company that doesn't want to pre-train their own models but gets immense value from specializing models for their use case, so it's just going to build on these great models, and I think they would want more options to choose from as they try to sell into more markets. Realistically, a lot of US companies don't want to deploy Chinese models. I think currently a lot of the stated reasons are unknown unknowns and things you can't prove. You can't prove that the models don't have certain backdoors, though I'm fairly certain they don't right now. But just because you can't prove it, you get this weird market dance, because yes, these are stochastic things that are kind of amorphous. I don't love being in the middle of this as a researcher. I would like to just provide information and good things that people actually want to use, and leave all of the geopolitical and other messaging to people that realistically have way more on the line than I do. We work at a nonprofit. I have my dog.
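The corrected figure involves a small piece of arithmetic that is easy to check. The sketch below takes the two numbers as quoted in the conversation (80% of open-model users on Qwen, equal to 16 to 24% of the portfolio) and backs out the implied share of portfolio companies using open models at all. The function name is made up for illustration, and the inputs are the spoken figures, not independently verified data.

```python
def implied_open_model_share(qwen_share_pct, qwen_among_open_pct=80):
    """Percent of all companies using open models, given the Qwen percentages."""
    # qwen_share = open_share * (qwen_among_open / 100), solved for open_share
    return qwen_share_pct * 100 / qwen_among_open_pct


low = implied_open_model_share(16)
high = implied_open_model_share(24)
print(f"{low:.0f}% to {high:.0f}% of the portfolio uses open models")
```

Under those quoted numbers, roughly a fifth to a third of the portfolio would be building with open models of any kind.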
Matt Turck
Why do you think this happened, that the ecosystems developed in this way, where the US was very commercial and closed source and China very open source?
Nathan Lambert
Historically, the US has a lot more willingness to pay for services. I hear anecdotes from people that know China a lot better than I do that are like, yeah, medium-to-large companies in China with billion-dollar-plus valuations will just pirate SaaS software. That sounds worse than it is. But I think the thing is that US companies are used to paying for services, and an API model of paying for tokens has been proven. Selling tokens is a good business in the US right now. I think there's a lot of debate over profitability, but the demand for and usefulness of these tokens is high. So I have a lot of belief that there can be profitable businesses from selling tokens, whereas I think that AI will be embedded in very different ways when it comes to Chinese companies. And I've talked to a few of these labs about selling into the US market, and they've said this: US companies will not pay for their services. So they don't expect enterprises to sign up for the Kimi coding plan en masse. But they're like, we have a chance that they'll use our models. That is a practical way to gain influence and get a piece of this pie. The people building these models in China know the same things about the different ecosystems. That's why I've enjoyed starting to talk to them. I was like, oh, these people see the same thing, they see the same constraints. It's not that complicated. They're smart enough to know that if they drop really good models, people in the US can't ignore it. And that's their way to have a part in this ecosystem. So there's a mix of the DeepSeek standard, and then they're kind of like, yeah, this is something that works for us, let's keep doing it. It's getting them a lot of mind share and some use in prominent ways. So I think it makes sense.
Matt Turck
And is there more of an emerging organized response in the US? I know you're involved in, or perhaps behind, the ATOM Project.
Nathan Lambert
I think you only see any concerted response when it actually is public. There's a lot of investment from different stakeholders and conversations that are happening, but that's not useful to point to. So I don't have the proof for you, but I do think the right people are talking about it and want to invest more, because realistically the cost is not that high relative to the trillion-dollar build-out of AI infrastructure. It's like, oh, if 0.01% gets us better, great open models, we should probably do that. I think that's actually not that complicated. It's just, how do you get the $100 million line item to the right people that have the talent to do it, with the right incentives? The Reflection news is like, okay, that's probably a good solution for a couple of years. They have enough money and they have a strong base of talent. That's a major checkbox. We need to have diversity there, because the Llama thing could happen again, or it could go away. It looks like a small snowball, but hopefully it grows in the coming months.
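The back-of-the-envelope numbers here line up: 0.01% of a one-trillion-dollar build-out is the $100 million line item mentioned a sentence later. A minimal check in Python, using integer arithmetic and the round figures from the conversation (the variable names are illustrative):

```python
BUILD_OUT_USD = 1_000_000_000_000  # "trillion dollar build out", as a round figure
BASIS_POINTS = 1                   # 0.01% is one basis point, i.e. 1 / 10_000

open_model_budget = BUILD_OUT_USD * BASIS_POINTS // 10_000
print(f"${open_model_budget:,}")
```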
Matt Turck
Today's release and your work are part of that American response to China's rise in open source AI?
Nathan Lambert
I would say I launched ATOM in July and thought it would get more visibility, but now I'm getting a crazy amount of media inbound and press, and everybody wants to share the plot. So it was like, okay, I guess I was just four months too early. Just today I saw Bloomberg published a post that pretty much had the same title as my Kimi post from July. I was like, okay, I'm glad that people are paying attention now. Better late than never.
Matt Turck
Congratulations. The best form of flattery. All right, switching tacks, and in an effort to make these conversations educational for a broad group of people: one of the key aspects of the release is the thinking model. Could you remind folks what a thinking model actually is, versus other forms of models or prior generations of models?
Nathan Lambert
A lot of people have heard about inference-time scaling, which is that if you spend more compute at inference time, you get a better answer. A thinking model is really a way to train the model to exploit that a lot. So you spend a lot of tokens, and the tokens are usually hidden from the user as a long chain of thought. And the model therefore has a kind of step change where it's way better at math tasks, coding tasks, and agentic tasks. Our future plans include adding more tool use to the model, so we're not talking a lot about agentic search or agentic code execution on the fly for this model. But building thinking models is the gateway to doing a lot more interesting things, like Claude Code. Maybe we'll have Olmo Code next year, and all these things that we want to do. The thinking model has just been the thing in 2025: use a lot more compute per answer, and the model gets way better at various things.
Luca Soldaini
I don't like thinking models, but it's fine.
Nathan Lambert
No, they're good, they're very useful.
Luca Soldaini
Thinking models are really like work mode, and regular instruct models are usually more fun to build. They can be more quirky. But yeah, I think thinking models are where it's at for like 90% of the cases. Especially in user-facing cases, folks are okay spending time waiting for the model to craft a better answer. There's still a space for models that can respond faster. You see the stats that Google released about adoption of Gemini Flash, and that's where non-thinking models that can at least give a good approximate first answer are really useful. They're also more fun to build. But yeah, thinking models are where the future is, especially when it comes to agent integration.
Matt Turck
Before we go into the pipeline of the Olmo family very specifically, because, as you alluded to, one of the amazing things about open source is that in a discussion like this we can truly understand how the model works, versus other conversations with commercial players. So before we get into the pipeline, I'd love to talk a little bit about you guys, your backgrounds, and AI2, which is a very important player in the ecosystem that people may or may not have heard about. So who wants to go first?
Luca Soldaini
I sort of stumbled into this role by just picking problems that are interesting. I'm originally from Italy and moved to the US for my PhD. My PhD is in information retrieval: how do you build a search engine, to simplify a lot. I slowly got more and more into natural language. After grad school I joined Amazon. I was working on Alexa, at the beginning on the search part of Alexa, and then I got into the actual part where the users talk to Alexa. The interesting part. So I was slowly moving towards that. I initially joined AI2 working on a project called Semantic Scholar, which is still active. It's a search engine for academic papers. And there the interesting bits were actually interacting with users, and less so the actual text of the papers that you were searching on. And then the way I got into LLMs and building language models is really intertwined with how AI2 got into building language models. It all started around November of 2022. This is around the same time ChatGPT got released. A bunch of researchers at AI2, individual contributors, there was no direction from the top, it was a very grassroots initiative, got really interested in building a model that would be fully open. AI2 had already built sort of proto language models around 2017, 2018, so a lot of the interest was in recapturing and expanding that line of work. So a bunch of us got together, started planning, and got in touch with a few companies who might be interested in supporting these initiatives. We got an initial grant from AMD. At the time, it was about 2 million GPU hours. So we had the idea, we had the researchers interested, we had the compute. So we went to leadership at the time and sort of told them, hey, we're gonna go do this thing, I hope you're okay with it. And one of the nice things about AI2 is that at heart we are a research lab. So everyone was like, sure, you figure everything out. Just have fun.
Matt Turck
Great. All right, Nathan, how about you? So you're a man of many talents. You do research, you write this very interesting blog newsletter called Interconnects. You do podcasts, you do a bunch of different things. So tell us about your journey.
Nathan Lambert
Yeah, I say I wear many hats to try to get the things that I want to do done. I showed up to Berkeley as a mostly EE PhD admit in 2017, and then I saw that AI was happening and decided that I wanted to try to do this. That started by going to all the names that people know, like Sergey Levine and Pieter Abbeel, and asking to be in their groups. And then they respectfully say no. And then starts the long process of learning how to actually do it without being directly embedded in these elite groups, which was a mix of robotics and reinforcement learning and finding my way there. So my PhD was mostly in model-based reinforcement learning. And then my first research job was to go join Hugging Face when they said they were going to make an open source version of DeepMind to do a bunch of research. Realistically, my job was not that impactful or useful at Hugging Face until ChatGPT came out. And then I was like, oh, I should maybe just learn about RLHF. And that got very immediate traction as somebody trying to work in public with the team there. Lewis Tunstall and other people at Hugging Face are still doing a great job on this, and we worked together for a while. And then mostly I was just getting burnt out on remote work, and I met Luca in Hawaii at a fun conference and was like, wow, I could have real-life friends. So I joined AI2 to work in person and tried to do the same thing, which is kind of an evolution of the Olmo story: I was very motivated to figure out these techniques, which were mostly reinforcement learning from human feedback at the time, and make versions of these post-training techniques public.
And then that evolved through both Olmo and our post-training method, which was named Tulu, where we spent a long time trying to replicate what we thought was close to Llama 3 post-training, with multiple stages and optimizers. That's the project that, with a bunch of people, came up with the name reinforcement learning with verifiable rewards. So it's kind of this evolving journey at AI2 and a search for impact, which is what we think people are actually doing. And then largely the opportunity that Luca and I and others at AI2 fill is that there's so much money in AI, and increasingly so, that the number of people who can talk about these things in public, educate, and get more people involved by spreading knowledge is ever smaller. So I describe a lot of my career journey as filling that vacuum and thinking about what's impactful there. It kind of pulls you in when there's such a void. It has a sort of gravity that makes it clear what you should be doing.
Matt Turck
You anticipated my question, which is sort of obvious in a world where we see hundreds of millions or even billion-dollar packages offered by some commercial AI labs to people just like you. I was curious about your intrinsic motivation to join AI2, which is a nonprofit. But impact is the short answer, correct?
Nathan Lambert
Yeah, I mean, I've been here for two years and I wasn't famous when I joined. So let that be told to people looking for new jobs: you want to find a job that you can grow into. And I think AI2 has been a really, really good place for that for many people, because you have independence and are encouraged to go forth and do things, and not be a cog in a broader grind-out-language-models machine, which is important, but where it's harder to get visibility.
Matt Turck
So we alluded to some of it, but maybe a few words about AI2. AI2 was started by Paul Allen, right? AI2 stands for Allen Institute for Artificial Intelligence. You mentioned some grant, Luca, and I think earlier in the conversation we talked about a recent grant as well. I saw that it was 152 million from NSF and Nvidia. So what is AI2? How did it start? How is it funded, at a high level?
Luca Soldaini
AI2 was founded around 2014 by the late Paul Allen. Initially AI2 was very focused on building machines that can do science and understand science problems. That's when Semantic Scholar started, as a repository of science papers. Slowly, one of the initiatives that started forming was more fundamental research around how language models work, or what at the time was called natural language processing. You had teams like AllenNLP doing great work. Since very early on, we always had this idea of not just releasing research artifacts but releasing the tools. Back in the day we had this very, very widely used library called AllenNLP that would allow you to build and customize these models.
Nathan Lambert
I'm going to jump in. It's cool because it's the namesake of our team name and has been for a long time at AI2. And it was the main competitor to Hugging Face Transformers, and they ultimately outcompeted AI2 as the thing that people use for that because they had a very different model and amount of support. But Luca can keep going. Luca was a lot more good thing.
Luca Soldeny
But we have been at it, open sourcing, for a while. I think it's something that folks here, and this was before my time, understood really early: that it was important both in pushing science and also in unlocking commercial use cases that, as a nonprofit, maybe we didn't anticipate. You release a tool, people pick it up and do amazing things with it. Yeah. And we moved into language modeling more and more recently. Right now AI2 has maybe three main projects. One is the Olmo model family, and there are variants of Olmo. There are some that I focus on, like the full pipeline. Some, like the robotics ones, focus more on processing images and video and audio.
Matt Turk
Is Molmo part of one of those variants?
Luca Soldeny
Yeah. Molmo is one of our projects that works on multimodal inputs. Recently we released another one called MolmoAct that is more focused on robotics: it receives multimodal input and then can act in space. And then we had a model that was able to do automatic speech recognition. Another model, focused more on document processing, could do OCR. So it's a nifty little family of models. We have a working group on agents for scientific tasks, harking back to our roots. This is the Asta family of initiatives. These are agents to help scientists do their work.
Matt Turk
And that just came out, right? Like August of this year?
Luca Soldeny
Yep. The team has been cooking since the middle of last year. But finally we had our first release this year. There are actually two releases. There was the main Asta release, and then recently we announced a partnership with CAIA, the Cancer AI Alliance, using some of the components in Asta to help researchers make progress on cancer research. Then there is a third branch on AI for the environment, building models that can understand and model Earth and can work with different signals to do prediction around the environment and so on. I'm being a little bit vague on this one because I don't know if it has been announced yet.
Matt Turk
It's a preview right here. The MAD podcast is making news. Okay, very cool. I'll summarize. That's great background. So we got Olmo, we got Tulu, we got Asta. Just maybe one last question. In terms of size, what are we talking about? How many of you guys are there?
Luca Soldeny
200 people between the research staff, engineering, comms and other support roles.
Matt Turk
That's fantastic background. Thank you very much. All right, as previewed, let's switch tacks and go into Olmo 3, Olmo thinking, Olmo reasoning, whatever you guys end up calling it. And I think it's the perfect opportunity to talk about how those reasoning models actually work. In prior episodes of this podcast we've had great conversations with folks at places like Anthropic or OpenAI, but not surprisingly, there's only so much they can talk about. And the beauty of what you guys do, which is the very essence of it all, is to make it open and accessible to everyone. So I'd love for us to talk about the whole pipeline from pre training to post training, the different parts, and make that super educational and explain in plain English what part does what. So could either of you start with just a high level architecture of what the various subcategories of the pipeline are, and then we'll go into those one by one?
Nathan Lambert
Sure. I recently gave a talk on this at the Conference on Language Modeling, so I have them at the top of my head. I think I could provide a personal motivation for this, which is that as researchers we're closely embedded in the community and we see that there are a lot of people who are starting to do this reinforcement learning research after DeepSeek R1. I think most of this happens on the family of Qwen models, like Qwen 2.5 and Qwen 3, at between 1 and 8B parameters. And I think something that particularly motivated a lot of the fine grained details that we might not have time for in this podcast is that there are some questions hanging over the data used for Qwen when doing this RL research. Specifically, there are two papers. One is "Spurious Rewards: Rethinking Training Signals in RLVR", which is one that I was on with a lot of people at UW and AI2, and then another one, which was "Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination".
Matt Turk
Yeah, actually, let's spend a couple of minutes on that. What does spurious rewards mean?
Nathan Lambert
I think the thing to know about this is that you can trigger this rant on the technical side later. There's a lot of background on understanding what these algorithms are, but essentially the question mark is, did Qwen include training data that is too close to the evaluation targets, so that the research is picking up on weird behaviors within the model rather than the fundamentals of what this reinforcement learning is doing?
Matt Turk
So in other words, did they teach to the test versus enabling true thinking?
Luca Soldeny
Yes.
Nathan Lambert
I don't think Qwen... it's a gray zone. I mean, I think all the frontier labs will do this to some extent. The way they're tasked is that you have a team member who's tasked with improving an evaluation, and then the easiest way to do this is to train on test. But they all have dignity as elite scientists, so they won't do this. And the next closest thing is you do some sort of paraphrasing of the test set to create new training data. So therefore you're not technically cheating, but where in the spectrum between scraping GitHub for math problems and paraphrasing the evaluation set do you draw the line on actually calling it cheating? Different people have different answers. But mostly I think there's a goal that we kept coming back to, because we understand what Olmo is not. You can look at the numbers, we're getting close to Qwen 3 with reasoning or without. But this is not a 600 billion parameter model that people are going to immediately download and run Olmo code on or anything. But we want to make sure that our core audience can do the research that we want to do with confidence and debate it. So we want to give people access to every stage, and you can then see how this impacts this new important area of research. So we're going to talk about six stages. One is large scale pre training, which is this training on all of the Internet predicting next tokens. Two is what we call mid training, which is debatable whether or not it actually should exist. What it technically is is you train on higher quality web data with a change in the learning rate. Three is long context extension, which is absolutely essential for these reasoning models because they generate so many intermediate tokens before sharing an answer with you. And Luca has a lot of battle stories from that.
And then we go into post training. In our case, those three building blocks of pre training are, I would say, more set and super essential. And then in post training, when you approach this, you have a bag of tools, which are optimizers, and you apply them in the order that suits your model, depending on size and the capabilities you want. So we'll talk about things that we did, which is instruction tuning, preference tuning, and then we did some reinforcement learning with verifiable rewards again. But if we were to train a model that was 10 times as big, all this post training stuff would change. But the pre training and mid training and long context, I think, would actually end up looking pretty similar. So it's kind of a difference across the two phases of training, where post training is a bit of an art and you have to do what is best for your specific use case. And that'll change. But we can go through these too.
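To make the "verifiable rewards" idea concrete, here is a minimal Python sketch of what such a reward function could look like for math problems. It is only an illustration, not the Olmo 3 implementation: the function names are hypothetical, and the assumption that the model writes its final answer inside `\boxed{...}` is a common convention chosen for the example, not something stated in this conversation.

```python
import re

def extract_final_answer(completion: str):
    """Return the contents of the last \\boxed{...} in the completion, or None.

    Assumption for this sketch: the final answer is wrapped in \\boxed{}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference exactly."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# The reward depends only on a programmatic check, not on a learned judge.
correct = verifiable_reward("Let me think step by step. So the result is \\boxed{42}", "42")
wrong = verifiable_reward("I believe the result is \\boxed{41}", "42")
missing = verifiable_reward("I am not sure.", "42")
print(correct, wrong, missing)
```

The point of the design is that the reward can be checked mechanically, so the reinforcement learning step optimizes against a signal that cannot be argued with, only satisfied or not.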
Matt Turk
Okay, great. And Luca, you're the pre training guy and Nathan, you're the post training guy. Right. Is that fair?
Luca Soldeny
One of many.
Matt Turk
One of many. But for purposes of this conversation, and before we dive into each step: this idea of pre training plus RL seems to be the key idea in terms of progress in the last year or so. I know the concept of it came up much before that, but in terms of implementation of it, what's the right way to think about it for somebody that's trying to learn about the space? Is one part better than the other, or do they need to exist together? Is RL currently delivering more gain than pre training? What's the overall kind of high level take?
Luca Soldeny
I think the way I like to think of it is the pre training phase. It's really like a very expensive initialization of the model.
Nathan Lambert
Right.
Luca Soldeny
When I think of, oh, what is a good final set of weights that I can pass to Nathan and the rest of the post training team, it's, well, I want a model that has great knowledge about the world, and where you can also start seeing sparks of the capabilities that you will want in a model that you then want to chat with. So it is a very expensive and very compute intensive way to create an initial model out of what is essentially random parameters. But it's all about, yeah, let's have this model have a lot of knowledge of world facts and information, and let's have it so that it can start behaving a little bit like a chat model, so that when we pass it to post training and you have this reinforcement learning, there is some behavior to reinforce and to give rewards on so the model can pick it up.
Nathan Lambert
I would say that generally the reason why discussions are hard right now on whether people should care about pre training or post training is that we optimized pre training for multiple years, and then there is a lot of untapped potential in this type of RL. What is said is that OpenAI figured out a whole bunch of tricks to get o1 to work, and then it kind of showed that this area was possible. And then this year has been a race to capture low hanging fruit on RL. I think that's kind of the biggest story: why we have all these crazy new models that appear, like o3 with this thinking and tool use, which are just downstream of, oh, we could do very different things because we have such a good platform in these malleable pre trained models that we've been iterating on for a long time. This RL stuff, we just kind of could have tapped into it much earlier, but there's a lot of potential. So yes, the rate of improvement right now in RL is higher, but at the end of the day it's going to be a dance between both of them, where you need a better base model. It's said very commonly that a better base model and a bigger base model is much easier to improve with RL. So if you take that as one of the core things of doing RL research, it's pretty obvious that pre training is very important to enabling that.
Matt Turk
As a quick detour, there's been that podcast with Richard Sutton that was effectively saying that RL was the way to go and that pre training and LLMs were a little bit of a flawed premise because it was sort of an imitation of reality, basically learning the way humans describe reality as opposed to being confronted with the actual reality through RL. Do you guys have any quick take on that?
Nathan Lambert
My take is that a lot of people are being exposed to Rich Sutton for the first time, and Rich is a font of wonderful ideas, but often not ones that are going to be immediately practical. This is how you get things like creating reinforcement learning, but not necessarily things that are going to impact what GPT-6 is. So I've been on the critiquing Rich life for many years before this, in terms of people trying to interpret his ideas as realistic. I think the one from 2021 or 2022 is his "Reward is Enough" paper, which essentially is an argument that a reward function is sufficient to get any intelligent agent that you want. So I think that this is actually, rather than a technical debate, more an entertainment of the whole community being nerd sniped by the latest distraction.
Luca Soldeny
Okay, the message is not that surprising. There is this fine line between the actual ideas and the engineering around them. A lot of making language models work is engineering. And not in a denigratory way, but in the sense that we have to figure out how to translate research, in the end, into practical things. And so pre training is just a good way to initialize one of these models. If a better idea comes out in the future, we can switch to that. No one is married to LLMs being the end all solution. There's a big difference between just describing the system in theory and then actually getting it to work. Otherwise there wouldn't have been two and a half years between the original GPT-3 and ChatGPT and GPT-3.5.
Matt Turk
Thanks for that. So let's take those six modules turn by turn. So let's talk about pre training. What did you guys do specifically for this model?
Luca Soldeny
Pre training is very interesting. A good piece of background to have is that with pre training, we have to be very methodical in how we do it, because first of all, it takes a long time to pre train. I think it's standard practice among the frontier labs to try to cap your big final pre training run to two months, not more than that. But to get to something that will not crash and burn during these two months, you have to do a lot of preparation around this. So everyone who works on pre training is fairly methodical, and just to sketch out how that works: usually you have a sense of, okay, the duration of this run is fixed, the number of GPUs I have available will be fixed, and therefore you write the fastest possible code to train this model. Once you have these three, you can figure out, okay, how much data can I show my model? In our case that number was like 6 trillion tokens. Given that number, then we go back and we figure out, okay, what are the best six trillion tokens out there? And the way you figure it out is a combination of what data you have access to. We want to eventually release the data, so we limit ourselves to data that is publicly available. So either Internet text, or PDF documents that you can find on the Internet, or code that you can find on the Internet. And then among this pool, our initial pool was closer to 300 trillion tokens. You shrink it down till you reach your target number. And hopefully as you shrink, you only keep the best part of it. So you remove duplicates. You have a way to judge: is this document better than this other document? We have a way to evaluate the capability of the model. So if your evaluations want, I don't know, medical documents because there's a medical test there, you figure out how to pick documents that have good medical information.
It may be at the expense of some other domains, but yes, it's this delicate balancing act to find this data. And after you commit to this initial run, you do your training run. There's a lot of making sure that the way you design the model means it doesn't suddenly start forgetting what it has learned. We call these spikes in the language model. But basically you don't want these events where, if they happen, you have to restart from scratch and you can't recover. So there's a lot of work on that. But after these months of training you get to a final model. And this final model will still lack some capabilities that I know Nathan's team cares about. So these are things like long context, or being able to solve some problems to start with. That's where things like long context extension or mid training happen.
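The budgeting logic Luca describes (fixed duration, fixed GPU count, fastest possible code, therefore a fixed number of tokens) is simple arithmetic, sketched below in Python. The GPU count and per-GPU throughput are hypothetical values picked only so the result lands near the 6 trillion tokens mentioned above; they are not the actual Olmo 3 cluster or throughput figures.

```python
def token_budget(num_gpus: int, tokens_per_gpu_per_second: float, days: float) -> float:
    """Total tokens a run can process: GPUs x per-GPU throughput x wall-clock seconds."""
    seconds = days * 24 * 60 * 60
    return num_gpus * tokens_per_gpu_per_second * seconds

# Hypothetical inputs (assumptions for illustration only).
budget = token_budget(num_gpus=1024, tokens_per_gpu_per_second=1130, days=60)
print(f"{budget / 1e12:.2f} trillion tokens")

# The data side of the same calculation: how aggressively the pool must be filtered.
initial_pool_tokens = 300e12  # "closer to 300 trillion tokens" in the conversation
keep_fraction = budget / initial_pool_tokens
print(f"keep roughly {keep_fraction:.1%} of the initial pool")
```

Under these assumed numbers only about two percent of the initial pool survives, which is why deduplication and quality ranking matter so much.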
Matt Turk
Yeah, let's get into that. So that's phase two, mid training. So again, a term I personally hadn't heard of before, and Nathan briefly described what it is, but just double click on that.
Luca Soldeny
I've heard that some labs, instead of mid training, call it tail patching, which I think is a much better term. The idea is that at the tail of training, the tail of pre training, you patch the model so that the things it hasn't learned in pre training, it learns in that phase. Of course, when you do that, you also need to make sure that the model doesn't forget stuff that it has seen during pre training. So that's why you mix in some of the best data from pre training. You do carryover.
Matt Turk
So you give it more like code data for example, or math data, that kind of stuff.
Luca Soldeny
If the model maybe cannot reason about certain math problems, you do it. That's like what Nathan mentioned earlier: sometimes there is some leakage of things that look like the test. During this phase there is an uncharitable way to describe it, which is, oh, someone is trying to cheat there by adding this data. But it's also so easy to accidentally leak your test data in there. We spend a lot of time making sure that doesn't happen. It's a really tricky balance, because you want the model to start being able to solve problems like the ones that you see during tests, but you really don't want that test data to accidentally leak in there. Otherwise you can't measure how well your model does.
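One common way to guard against the accidental leakage Luca describes is an n-gram overlap check between candidate training documents and benchmark items. The Python sketch below shows the general idea only; the function names, the whitespace tokenization, the n-gram size, and the threshold are all assumptions for illustration and not a description of the decontamination AI2 actually ran.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(train_doc: str, test_item: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training document."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)

def decontaminate(train_docs, test_items, n: int = 8, threshold: float = 0.5):
    """Keep only training documents that stay below the overlap threshold for every test item."""
    return [
        doc for doc in train_docs
        if all(overlap_fraction(doc, item, n) < threshold for item in test_items)
    ]

# Toy example with a short n-gram size so the overlap is visible.
test_items = ["what is the sum of the first ten prime numbers"]
train_docs = [
    "Homework: what is the sum of the first ten prime numbers? Answer below.",
    "The weather in Seattle is rainy in November.",
]
kept = decontaminate(train_docs, test_items, n=3)
print(kept)
```

A check like this catches verbatim or near-verbatim copies. Paraphrased test items, the gray zone discussed earlier in the conversation, would slip past it and need fuzzier matching.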
Matt Turk
And then you mentioned long context, which is the third stage in the six stage pipeline. So why the focus on long context? And I guess what does long context mean in the first place?
Luca Soldeny
You want these models to be able to work with very long sequences of text, both as input, imagine you want to give it, I don't know, a collection of documents, and you also want the model to be able to generate a lot of text in the output. Especially now that you have these reasoning traces, right, these thinking tokens. Why don't we train the model from the beginning to be able to do that? Because the longer the input that a model is trained on, the slower training gets, and the rate at which it gets slower grows faster than the context length: it's a quadratic slowdown. So we definitely don't want to do the entire pre training at this extremely long sequence length. But at some point we have to teach the model to actually work with these long sequences, and we save it for the very end so that we can do it in an efficient way.
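The quadratic slowdown can be seen with a back-of-the-envelope count of the attention score computation, where every token is compared with every other token. The Python sketch below is a rough illustration: the head count and head dimension are arbitrary assumptions, it counts only the attention score matrix of a single layer, and the two sequence lengths are round numbers in the range discussed in this conversation.

```python
def attention_score_flops(seq_len: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Approximate FLOPs for one layer's QK^T score matrix.

    There are seq_len x seq_len dot products per head, each of length head_dim,
    counted as 2 FLOPs per multiply-add.
    """
    return 2 * num_heads * seq_len * seq_len * head_dim

short_context = 8_192    # assumed pre training sequence length
long_context = 65_536    # assumed extended sequence length

length_ratio = long_context // short_context
cost_ratio = attention_score_flops(long_context) // attention_score_flops(short_context)
print(length_ratio, cost_ratio)  # prints: 8 64
```

Making the sequence 8 times longer makes this part of the computation 64 times more expensive per sequence, which is the reason to save long sequences for a short final stage.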
Matt Turk
And I think you mentioned somewhere that data doesn't matter for long context. What do you mean by that? And then what does matter?
Luca Soldeny
Ah, this is getting a little bit in the weeds.
Nathan Lambert
No, Luca loves data. Luca likes to be in a dark room grinding out tokens to train the models. That's the emotional backdrop for this very technical stuff.
Luca Soldeny
Do I use QK norm? Do I use GQA? Doesn't really matter. But there are technical decisions in how you set up your model such that you can have the best data in the world and your model will not be able to reason over many, many tokens. So data doesn't matter in the sense that, whether you train the model on bad data or you have the best data in the world, if you set up your model wrong, you're never going to recover from it. So sadly, I can't be the savior with the magic tokens that make the model good. We have to make the model with the right architecture.
Matt Turk
Okay, so that's stage three, long context. Maybe just to bring this to life, what's the difference between before and after? Like if you have a 40 page PDF that you fit into the window, will it just get faster results or better results? Or what happens, at the beginning you just can't do it?
Luca Soldeny
Like, you know, you pre train at something like 4,000 or 8,000 tokens. That's what we use for Olmo 3. That's what Llama used. That's about maybe eight pages if you use, you know, double spacing, Uline kind of thing. And after that we extend to about 65,000. In industry you have extensions to a million tokens. I think Gemini recently announced over a million tokens. At that point, a million tokens is like 10 books. So you can work with extremely large amounts of information. It's nice. If you're building an application with this language model, you don't have to think about, oh, of this amount of information, how the heck am I going to extract the pieces that I need to show the model? You can just give it all and the model will figure it out. So it really unlocks a lot of opportunities.
Matt Turk
All right, so that's the pre training world: pre training itself, mid training and long context. So now let's switch to the post training world, Nathan, if you will. So starting with SFT, which stands for supervised fine tuning.
Nathan Lambert
Yeah, I think one of the things, especially for a model like Olmo, where we're scrappy and putting everything together over time, is that one of the biggest changes is that when reasoning models become popular, the in-vogue evaluation suite of the industry shifts to add a whole bunch more new things in. So one of the things that happens at every stage, even if a lot of the data has overlap, is that you mix it in a different way. So I think Kyle Lo and Mayee Chen, that's another researcher and an intern, did this whole mixing procedure that we use across all these stages just to upweight the math, code and reasoning stuff, to make sure that what happens later in post training is much more tractable and that all this stuff is set up. So that's the type of thing that we have to do that's kind of baked into everything. And then post training. I think for this model, everything we're doing is operating on the assumption that this is about a 7B model. We are very narrowly focused, and therefore we're going to do what many people have done, which is called distilling from bigger teacher reasoning models. I think distillation is described as when you take the outputs from one model and then you fine tune on them later. I think there's been a lot of broader discussion on this in the community. And then this supervised fine tuning stage, or SFT, or instruction tuning, is all about just getting the best traces from reasoning models out there or from the community and then just teaching your recently trained base model to behave really, really closely to what is going on there. So in our case we took a mix of existing data sets like OpenThoughts3, which is from Bespoke AI Labs, a startup, and modified it. And then we also generated a whole bunch of new data. So we ended up using a mix of teachers: DeepSeek R1, DeepSeek R1-0528, which was their updated version, and then Qwen's reasoning model, QwQ. These tend to be pretty strong teachers.
Matt Turk
Why is that? So you have a pre trained model, but for supervised fine tuning you distill using a different model. Why is that, in simple terms?
Nathan Lambert
Essentially because our small model is not going to be able to output as strong of text. So there's kind of a fork in the process, where I'm talking about a small model. If we had a bigger model, what we would do is do a lot of reinforcement learning to start, and the model then would take time to learn these interesting behaviors and have strong performance. But with a smaller model, the ceiling on that is fairly low. It just doesn't have the capacity to learn from these harder math problems. So the common practice is you take the absolute best reasoning models you can get that are openly available with a good license, where you can just generate new data yourself and train on it and release it to the community, which is something we've been seeing a lot of this year. And the models that are closest to the frontier in performance with a good license all happened to be Chinese models throughout the year for this case. And I think in our case, even if GPT-OSS had existed, I don't think we would have used it for synthetic data in this, because that model is really designed for tool use, which is something that we did a bit of in this project, but not in the sense that that model is, which is this many hop agentic reasoning with search and stuff. So the DeepSeeks and Qwens of the world are just powerhouses at generating math and code answers and other things and being generally robust. So that's what we do: we have I think about 2.5 million reasoning traces, mostly on math and code and STEM, but also on chat and other general capabilities. And the model really absorbs a lot at this point. I think if you had told us last year, when working on Olmo 2, that we would have a similarly sized Olmo model that gets like 95 on MATH and 70 on AIME, on these crazy math evals, it would have been surprising. But this is just what you can get when you can extract data directly from these really powerful models and distill it down.
I think realistically a lot of companies are going to want to do this because you can do this for your domain. I think we threw a blanket on it: we want all of these evals, from instruction following on, and we want to make sure that you can actually talk to the model and not have it just become totally broken. But you can do this for any specific task you want, if DeepSeek has coverage on it, and it's very efficient. Then most of the process after that... that is the foundation. If you're training a small reasoning model, you need to do this. And then the other things after are, how do you extract more performance? And they quickly become more technical, or are done because we want to do them and are maybe not an efficient use of our time. So 90% of our time this summer was having great people battle reinforcement learning infrastructure. Because, as Luca was hinting at, when you generate a lot of tokens, the increase in time or compute and in memory is quadratic. Therefore you pretty much encounter every possible bug in your framework or every possible corner case that'll make your job grind to a halt. Most of the performance comes through this SFT and the preference tuning that comes after it. But the RL is like, we need to do this in order to build the infrastructure for many of the future Olmos that we want to build later this year, where they get bigger and they can do more interesting reasoning with tools and so on. So it's kind of a nuanced point about the model. It's like, yeah, we'll show you that we got a couple of points out of doing RL at the end, but really the RL tooling is something that's so crucial to doing the next models that come from here.
Matt Turk
Right, right. And thank you for that. And again, in an effort to make this interesting to a broad group of people who are curious to understand how AI works: SFT is not RL yet. That's supervised fine tuning. So that means that you basically show the model some gold copy of what good looks like and you train it based on that labeled data. Is that a good way to describe it?
Nathan Lambert
Yes. So it's the same loss function as pre training, which is that you're predicting the next token. In this case, what it looks like is a question could be, I don't know, an AIME-style, really hard math question, like, list all the prime numbers within some constraint of X and K. And it's this one sentence that is really hard. And then the model generates 30,000 tokens of, let me think about this and do this, and to test this, I'll have to use this theorem and hypothesis. We were talking about token intuitions for a bit, but 30,000 tokens to solve a math problem is pretty mind bending. If I were to sit there and read this, it would be hours of me just trying to read this one math solution. So these models are very unintelligible in many ways. I think the reasoning models sometimes will go into a bout of guess and check for hundreds of attempts before realizing that they can no longer guess and check. I mean, this is our reasoning model. I think the frontier models could have probably fixed this issue, but there are just really, really, really odd things in these tokens. But even so, doing this next token prediction is an incredible foundation of performance that many people use. So it's not matching any sort of human reasoning or things that people might want it to be doing, but it is teaching the models kind of their own language of breaking down problems step by step in order to reach a goal.
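As a small illustration of "the same loss function as pre training", here is a pure Python sketch of the next token cross-entropy loss over one sequence. The `loss_mask` argument, which restricts the loss to the response tokens and skips the prompt, is a common SFT practice added here as an assumption; the conversation does not say whether Olmo 3 masks the prompt. The tiny three-token vocabulary and the logits are made up.

```python
import math

def next_token_loss(logits, targets, loss_mask):
    """Average cross-entropy of the target token at each position where loss_mask is 1.

    logits:    one list of per-vocabulary scores for each position
    targets:   index of the true next token at each position
    loss_mask: 1 to include the position in the loss, 0 to skip it
    """
    total, count = 0.0, 0
    for position_logits, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        # log-sum-exp computed stably by subtracting the maximum logit
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[target]  # equals -log softmax(logits)[target]
        count += 1
    return total / count if count else 0.0

# Toy sequence: the first position belongs to the prompt and is masked out.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
targets = [1, 2, 0]
loss_mask = [0, 1, 1]
loss = next_token_loss(logits, targets, loss_mask)
print(round(loss, 4))
```

With all positions unmasked this is the pre training objective. The only thing SFT changes is the data: the targets are the teacher's reasoning trace and answer rather than arbitrary web text.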
Matt Turk
For this specific stage of SFT, do you want to talk about how you went about creating the data set for it, so precisely this representation of what good looks like for the model?
Nathan Lambert
Yeah, Luca, do you want to jump in? Do you have things too?
Luca Soldeny
The other thing I was going to mention is that something you don't see in the big announcements of the frontier labs is what Nathan was describing around having to do SFT to then do RL. It's very common. We are in a situation where this is uncharted technology, right? So you have nothing. You have to find ways to fix some components of your pipeline before you can build the rest of your pipeline and then go back to fixing the first part. So for us it's, okay, we want to do reinforcement learning on this larger model. Okay, we need our reinforcement learning code to actually be super fast, super reliable and useful. If we need to iterate on that part, we want to iterate with a smaller model first, because we can iterate faster. They take less compute to work, so we can do more things in parallel. Okay, smaller models, they cannot do RL directly. First you have to create the data. First you have to go through SFT. And then we're lucky enough that, okay, there are other great models that are open source that we can use to create this data. The alternative would be, I don't know, it's not even the money, it's spending a lot of time instructing humans to create the same volume of data, which slows things down. So it's a lot of this. You're building the tracks as the train is going down them at incredible speed, and you have to figure out ways to fix some parts of your pipeline so you can work on the rest.
Matt Turk
All right, so let's talk about the next stage in the pipeline. Stage 5, DPO and preference tuning. What is that? What does that do?
Nathan Lambert
Yeah, so this is one of the things that is thought of as, hey, let's try this, we're not sure if it'll work, kind of later in the process when you've spent a lot of time on other things. And it works very well. I think DPO, or direct preference optimization, is not exactly new. It's a way of optimizing for preferences. It's related to this whole RLHF thing that we mentioned. Technically speaking, in one sentence, it's an analytically derived loss function that is essentially applying stochastic gradient descent to the RLHF objective. So it becomes much easier to implement than other things. And we used this in the past with Olmo 2, with Tulu 3, with other Olmos. And the question was, can we apply this out of the box on top of a reasoning model? We knew that it works in many different situations, but we weren't sure what would happen with these long reasoning traces being included in the loss function and so on. So then essentially there's a student, Scott, who has been working on what he calls the delta learning hypothesis, which is an intuition for understanding DPO as being more about the contrast between your chosen and rejected examples. So the core of preference learning is that you have pairs or some grouping of completions to the same prompt. So you have one question with multiple completions, and his intuition and work is showing that this contrast is more important than the exact magnitude of goodness of the answer. So what he did is he spent a lot of time trying to come up with a good pairing of reasoning models which are open source or open weights, have a permissive license, and include the reasoning traces, because we kind of need this, and you need them to be sufficiently well spread apart. So he spent a bunch of time generating this data and doing some normal kind of, let's fiddle with the learning rate and small things. And it's kind of just like, yes, this works.
I think after we did it, we saw that Hugging Face did something similar with SmolLM. So they trained a fully open 3B model where they pre-trained it as well. And the funny thing is that we converged on using the same Qwen3 32B and Qwen3 0.6B. So the problem is that these small Qwen models and these small public reasoning models are actually so strong that getting a sufficient delta to another model to apply this preference learning technique was kind of hard. In our past techniques we had groups of models we sampled from, but as these open models are getting better, these samples become too homogeneous for the learning signal to exist. It's a cool experiment because it validates this hypothesis of the changing tides, where if you think about years ago with Alpaca and stuff, those models were so broken that having this group had enough variance and contrast in it that we could do a different type of preference learning, where now you have to look really closely at the completions and make sure that there's a learning signal for the models. And we did this and it gave us a boost across the board. I think sometimes things look very easy when you've done careful data work and kind of set up to understand your optimizers. I think Luca described pre-training as very scientific and post-training as like the wild west. I think there's many analogies. So it's like, I made this SFT dataset where we had a bunch of cloud credits and they were running out and we were behind, and I just generated as many completions as possible. So it's like a few billion completions from DeepSeek over the weekend, and you're like, oh, we'll mix it and filter it later. And I applied filtering and the answer was like, oh, we just include almost all of it. It's like we did very little. We would have liked to do more if we had resources for longer.
But it's just like sometimes there's low-hanging fruit and doing the obvious thing yields a lot of results. And this SFT and DPO stage in a lot of senses are that. And then this RL stage is extremely hard technical grinding, week in and week out, to make the tools even run at all. And the disparity in post-training kind of tracks to me. You just have all these checkpoints flying around and it seems like chaos, and then something that's extremely obvious gives you a massive gain. So the DPO gains are, and this is not an exact apples-to-apples comparison, but it would be like the difference between being at Qwen 2.5 level and almost Qwen 3 level. It's just that the thing that you apply to get there is sometimes really obvious. And I think the frontier labs are much further down this path, where they take these low-hanging fruits so fast. But as a smaller team that's trying to map to what the changing priorities of the field are, sometimes it's just turning the crank on a really straightforward thing.
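As an illustration of the "analytically derived loss function" and the chosen-versus-rejected contrast described in this answer, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the Olmo 3 training code: the function names, the `beta` value, and the log-probability numbers are assumptions chosen for the example, and each argument stands for a summed sequence log-probability under the policy or the frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of completions.

    beta=0.1 is an illustrative default, not a value from the episode.
    """
    # How much more likely the policy makes each completion than the reference.
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    # Only the difference between the two margins enters the loss.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities, for illustration only.
no_contrast = dpo_loss(-10.0, -10.0, -10.0, -10.0)
with_contrast = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(no_contrast, with_contrast)
```

The loss depends only on the gap between the chosen and rejected margins, which is one way to read the point above that the contrast within a pair matters more than how good either completion is in absolute terms.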
Luca Soldeny
I don't remember exactly when, but Dario from Anthropic was saying very plainly: look, what works here is 50 to 100 lines of code. He was saying it in the context of espionage and him being scared about some trade secret from Anthropic being exported out. But the solutions that everyone in the industry favors in the end are actually very simple. The problem is that there is a very large space of equally simple solutions, and all the work goes into: okay, how do you test these? How do you test them as fast as possible? You have to be extremely skeptical of any good result. So how do you convince yourself that these results really are good, and not just because there's a bug somewhere that causes something to be too good to be true? So yeah, a lot of it is less about the final solution. What matters is the speed at which you iterate and how robust your tools are, so that immediately after you see a good result, you know that it is a good result.
Matt Turk
All right. And to complete the journey, since we started talking about RL: RLVR, reinforcement learning with verifiable rewards. Let's spend a little bit of time on that sixth stage in particular. Nathan, I understand that's your baby, or you're one of the fathers of the baby. Do you want to walk us maybe a little bit through the history quickly?
Nathan Lambert
I mean, I think that I'm the person that got to bring it publicly to the world. It's well known that people across the industry have been doing this for years. And then the technique started to get far more impactful. It's broadly taking existing reinforcement learning algorithms, or downstream evolutions of proximal policy optimization, PPO, which is itself an evolution of REINFORCE, and then DeepSeek had their group relative policy optimization, GRPO. I always try to say group robust. I think it's group relative. All these algorithms are really quite similar, and you're training the models with whether or not they got the answers right or, in the case of code, whether or not the tests execute and don't fail. I think one of the famous examples is that doing too much of this, or racing to get the low-hanging fruit from this RL approach, is what makes all these code models do all these try-except things to avoid errors, because they accept all the errors. I think that is just because the gain that you get in the model being useful is so much higher than the annoyance of the fact that it also does these stupid things, and we'll fix the stupid things eventually. In the case of this Olmo model it's not anything crazy. We cast a wide net on RL math problems. We do some data comparisons to see which data we think is the best for teaching these models. We do mixing with code and precise instruction following. And this mixing is effectively when you tune the big set that you have to what you've learned from many experiments and to the specific model checkpoint that you're starting on. So if you have a really strong model and you show it really easy math problems, there's no learning signal. And if you have a really weak model and you show it really hard problems, it gets them all wrong. There's no learning signal. So the learning signal is all from the gradient of, like, you sometimes get it right and you sometimes get it wrong.
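The point about learning signal can be sketched with a group-relative advantage computation in the style of GRPO. This is a simplified illustration under assumptions, not the Olmo 3 implementation: `group_advantages` and the `eps` constant are hypothetical names and values, and rewards are 1.0 for a correct completion and 0.0 for an incorrect one.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps (an illustrative constant) avoids division by zero for uniform groups.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Strong model on an easy problem: every completion is correct.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
# Weak model on a hard problem: every completion is wrong.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
# Problem at the right difficulty: a mix of right and wrong answers.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```

In the first two cases every advantage is zero, so the policy gradient vanishes. Only the mixed group pushes the model toward the completion that got the answer right, which is why the data mix is tuned to the starting checkpoint.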
Matt Turk
You want to give the plain English definition of RLVR versus RLHF.
Nathan Lambert
RLVR: verifiable rewards is in the name, I think. Essentially the reward that you get from the environment, which is the completion or the grader, is whether or not you got the problem right. In RLHF, the reward is essentially a reward model, which is rating the quality of the response based on a proxy for what humans would like. The RLVR reward is much easier to understand, because these reward models tend to have a lot of problems, and you can over-optimize them much more easily because the reward models will pick up on features, maybe emojis or something like this, that you don't actually care about. Whereas RLVR is much better matched to performance characteristics rather than style.
Matt Turk
And you tweeted, I think, or said somewhere, that RL on long-context reasoning models is very hard. I don't know if that's RL in general or specifically this type of RL. What makes it hard, super hard, and very much the frontier of AI right now?
Nathan Lambert
So there's many ways that your tooling could fail. I think where most of these processes are set up right now is that you have a set of generation GPUs, which run something like vLLM, and you have a set of training GPUs, which run some distributed learning framework, which is where you actually have this RL update and loss function. And therefore you need some sort of system that orchestrates the two and passes information back and forth. And this kind of information passing back and forth is really annoying. It's a systems problem, because you have distributed error handling and things like this. So a common case, when you have the most basic approach, is that you'll have one generation, this one math problem, where the model is thinking and thinking and thinking and thinking. So you have all these GPUs working on one problem, and effectively your whole system is somewhat idle waiting for the answer. And there's many other small things like this. For example, this long-context generation just uses so much memory that you then need to introduce different types of parallelism and stuff to do the generation effectively. And there's just a lot of subtle numerical issues. So I think it's just kind of stress testing a lot of the post-training infrastructure that we have had, by turning up a lot of different things that could go wrong to the maximum. One of the things that the open community struggles with is that vLLM and Hugging Face use different kernels to do the actual internal computation of the model. These kernels are the things that make something like vLLM really fast. But this then results in subtle numerical differences between the completions that you're generating from the model and the log probs that the thing that's doing the loss function actually generates. And if you look at the math of these RL algorithms, it's assumed that those are from the same distribution.
This is a big root cause of a lot of numerical problems. And then if you look at what we're doing, and what a lot of labs have done throughout the years, they do different fixes to change these numerics. I think Thinking Machines had one of their first blog posts on deterministic vLLM, making it exactly deterministic. That is really useful, and people think it is key to their Tinker API and to doing other sorts of RL things where you just have complete control over sources of non-determinism, and that could just be numerical lack of robustness in RL. If we talk about trade secrets or whatever in RL, there's also these discussions on whether the open labs have worse algorithms than the closed labs. In reality, it seems like most people are using something like an evolved version of GRPO, which is a bit simpler than PPO. Some labs might be using a learned value function. The details are not that important, but what happens is that each lab finds the set of tweaks that they need to get really stable RL performance. And in the RL literature, historically there's a pretty low bar on the amount of changes that are needed to call it a quote-unquote new algorithm, but it's realistically like an implementation detail. So everyone finds their stable configuration for operating, and it's really dependent on the tool. So yes, you could say that they have a different algorithm, but it's also not really something that you could easily exfiltrate from a lab, because it's dependent on many layers of the stack and maybe what chips they're operating on and all sorts of things. So it's just one of these things where training these models is complex and the kind of quick quips could never reflect that.
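To make the sampler-versus-trainer mismatch concrete, here is a small diagnostic sketch. It assumes you already have per-token log-probabilities of the same sampled tokens from both the generation engine and the training framework; `mismatch_report` and its field names are hypothetical, and the numbers below are synthetic rather than measurements from any real system.

```python
import math


def mismatch_report(sampler_logprobs: list[float],
                    trainer_logprobs: list[float]) -> dict[str, float]:
    """Compare log-probs of the same tokens from two numerical backends."""
    diffs = [t - s for s, t in zip(sampler_logprobs, trainer_logprobs)]
    return {
        # Largest per-token disagreement between the two backends.
        "max_abs_logprob_diff": max(abs(d) for d in diffs),
        # Trainer/sampler probability ratio for the whole sequence.
        # This is 1.0 when both really are the same distribution.
        "sequence_ratio": math.exp(sum(diffs)),
    }


# Synthetic example: a tiny, consistent 0.01 offset per token.
short = mismatch_report([-1.0] * 10, [-0.99] * 10)
long_trace = mismatch_report([-1.0] * 1000, [-0.99] * 1000)
print(short)
print(long_trace)
```

The per-token gap is identical in both cases, but the sequence-level ratio drifts much further from 1.0 over the 1,000-token trace. That is one reason small kernel-level differences become a larger concern once reasoning traces get long.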
Luca Soldeny
The stack for post-training is also so new, software-wise. I feel like the big strides started happening in 2024, around, oh, you both train the model and run the model at the same time, and those have to happen at a certain cadence. Versus pre-training, where the seeds of distributed pre-training you already had in TensorFlow, which Google released in 2017. Right. So there is a much more mature stack versus what you need on the post-training side.
Matt Turk
All right, so maybe as the last part of this conversation: it's been really fascinating and illuminating, everything that you guys have described, because in particular it sort of highlights the complexity of the systems, like the multiple stages. And I love what you mentioned, Nathan, a few minutes ago when you said that pre-training is scientific and post-training, my words, not yours, but my interpretation of your words, is a lot of tinkering and putting things together in a way that you hope is going to work, and truly diving into how those models work. That's on the one hand. But on the other hand, each time you open a newspaper online or go on Twitter, everybody's talking about AGI and how we're almost there and how it's going to change everything. There's a little bit of a cognitive dissonance between the reality of trying to make those models work, with all the unbelievable progress that we've seen, of course, on the one hand, and the discourse on the other hand. Nathan, you've had a much more, I would say, tempered view of AI progress compared to some AI researchers. You had a great blog post very recently that you called Thoughts on The Curve. I'm curious what your latest thinking is, and Luca, obviously feel free to jump in any time, in terms of what you described in that essay and a prior one as complexity and the complexity tax, which again, in view of the pipeline you just described, one starts to understand, given the level of sheer complexity that's involved in all of this.
Nathan Lambert
Yeah, so ultimately I definitely describe myself as lightly AGI-pilled, and I think you have to be to appreciate the magnitude and gravity of the situation that we're in. But also I think that I'm very far from believing in any sort of singularity being possible, due to these things like complexity. On one hand, we talked about all these things which are low-hanging fruit to improve the models, and I don't doubt that. I mean, Sholto commented on this on the pod and other places, that at these labs they still see low-hanging fruit in improving the models. In many ways I don't think that their approach feels that different. It's just been refined relative to what we're doing. But at the same time, things get complex with tools and adding more layers to the stack, and you have to build a product to scaffold it. It's like, if the requirement to get the best out of Claude is to use Claude Code, which is some magic product and prompting relative to GitHub Copilot, this is one thing that you're going to need to get right in order to get AGI, along with all these tool uses and stuff. So as any system gets more complex, the pace of change is slower. I think any tech company has seen this. And then realistically there's going to be physical constraints on the amount of infrastructure that we could build. So this belief is simultaneously giving us these new data centers and, I personally think, hopefully new power generation. But there's a cap. All these things are plateauing, and then you 10x the compute and you get a big jump, but you can't do this forever. So realistically there's going to be some physical constraint that kicks in at some point. Balancing that with complex systems and the low-hanging fruit, I think these researchers are going to grind out improvements for multiple years, but never in a way that results in this kind of acceleration that we get drawn into.
So it's kind of, I don't know, in some ways it feels like I'm having my cake and eating it too. But it seems like the likely outcome if you look at other types of technology.
Matt Turk
Yeah. And that was in particular a reaction to AI 2027, which is a really interesting conversation. I'll let you summarize it, but the short version is that AI sort of builds itself and therefore accelerates.
Nathan Lambert
Yeah. And I think they have these milestones, like AI automates research engineering, and then AI automates AI research and development, where each of these is an incredible jump in performance. And I think what's more likely is this messy co-evolution. They deserve credit on their marketing and getting this impact, for sure. But even they are now like, oh, maybe we should have called it AI 2028 or AI 2029. So I think that is the reflection that there are these real constraints, but the progress is also going to be incredible.
Luca Soldeny
The growth in capability of these models is enormous, but having worked on one, I think it's very unlikely that we will see a discontinuity at any point. It has nothing to do with whether we'll get to a definition of AGI or superintelligence that people are happy with. We will get there. It seems unlikely that there's a single moment. It's going to be a looking-back kind of exercise of, oh, these were the important milestones and this is what really worked. Building this model is so much a collection of refinements to unlock the next stage that it's going to be this smooth trajectory. Whether it hits some point where we don't have more capacity to keep improving, or whether it forever accelerates, I don't know. People are going to be disappointed if they want to see a moment where one day they log into Twitter and AGI is there. It's messy, and it's fun working on it because it's messy, and it gives you a lot of satisfaction.
Matt Turk
So to play it back, you're both saying yes to AGI but no to discontinuity and singularity. One, is that fair? And two, if that's what you're saying, then is AGI reached using the current paradigm? Basically, does what we just described in the last hour, pre-training plus RL, get us there?
Nathan Lambert
I think the AGI word is actually pretty not useful. How I describe it is that Big Tech has all collectively realized that these language models plus scaffolding are going to unlock absolutely incredible value. And I have very high probability, barring extreme geopolitical situations, that Big Tech executes on this vision across the next two to five years to build 95 to 98% of the way there of what you can do with our physical power constraints and what an LLM's ability is. And I think that will be extreme. The transformation from that by 2030 is going to be so powerful across society. There's a bunch of long tail, and there's going to be mass societal readjustment to what the Internet and media and information mean, within five years. And that's mostly why I do this. And I think debating whether or not it's AGI is kind of secondary to the fact that this is coming and we want people to study and understand what is happening.
Matt Turk
And to that last point, what does that mean, study and prepare? What would you recommend people do? Although if people have made it all this way to this point of the podcast, they've already done a bunch of the work.
Nathan Lambert
So I think there's a lot of it. I mean, there's a lot of interest in AI outside of the CS majors of the world, where it's informing policymakers. I think it still takes a long time for information to diffuse, and there's often not that many people engaging in this that are doing it just purely for this kind of, you can call it alignment and concern. There's just a lot of general noise, and I mean, I worry about concentration of power and all sorts of things. And it's just trying to upskill people into understanding AI so they can be engaged listeners and think about how it affects their domain.
Luca Soldeny
The other part, maybe on a more positive note, is this: if the scaffolding is what really moves a lot of this from a raw-capability model to something that actually has meaningful impact, then that scaffolding is not something only the labs or the people who train models can do. The number of people who can contribute to that, both people with technical expertise and people with non-technical expertise, is much larger. If the scaffolding is what really moves capabilities, what gets us to this incredible technology being realized, then the people who can contribute to it are not just those who work at frontier labs. There's a tremendous amount of technical work to do, but also non-technical. As soon as you start integrating this technology in the lives of real people, as soon as you start working on high-stakes medical applications or other high-stakes domains, then a large part of the population can contribute to making this technology better and making it work for everyone, beyond just a base model. I feel like the number of people who can help make this technology really work for everyone is large. Everyone in society should feel like they can contribute.
Matt Turk
All right, well that feels like a wonderful place to leave it. Thank you so much, both, not just for this conversation, but for all the work that you're doing in open source frontier AI, which feels sorely needed and extremely important. So really appreciate it. Really appreciate the time and all the thoughts. Thank you so much. Hi, it's Matt Turk again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching this or listening to this episode from. This really helps us build a podcast and get great guests. Thanks and see you at the next episode.
Date: November 20, 2025
Guests: Nathan Lambert & Luca Soldaini (Allen Institute for AI)
Host: Matt Turck
In this episode, Matt Turck welcomes Nathan Lambert and Luca Soldaini from the Allen Institute for AI (AI2) to discuss the launch of the Olmo 3 family of open-source AI models. The panel explores the current landscape of open source AI, contrasting US and Chinese efforts, and gives an unprecedented walkthrough of Olmo 3’s full transparency: from open weights to open data, recipes, and intermediate checkpoints. The discussion dives deep into model architecture, training processes, the significance and inner workings of "thinking models," and the broader context of global AI competition.
(00:00–00:39)
Nathan Lambert notes a significant change in the "balance of power" in open source AI:
“There was a big change in leadership at Meta and Llama's future is unknown...there’s this big vacuum of influence which has been absorbed by the likes of Qwen, DeepSeek, Kimi, Moonshot...” (00:00)
The emergence and growing influence of Chinese open-source model labs.
(01:27–03:36)
Announcement:
“We’re launching Olmo 3 family today. So this is our latest family of open source models...we’re releasing the entire recipe we follow to get this model. So the data, the intermediate states, the evaluation frameworks, all the details...”
— Luca Soldaini (01:27)
Model lineup:
Transparency:
“It’s the first fully open reasoning model where we show doing RL, base models, and distilling from bigger thinking models. And there’s a lot of discussion within the US that there’s good reason we should own the whole technological stack and that includes open models.”
— Nathan Lambert (00:22)
(02:14–03:36)
“These are models that can spend compute at inference time to sort of think through a problem and solve it and then give you an answer at the end.”
— Luca Soldaini (02:14)
(03:36–05:48)
Benchmarking:
“This base model is similar in quality to the best available, which is like Qwen 2.5 32B... the upside is we have all the data so people can actually do some sort of continued pre training and... modify and understand the behavior.”
— Nathan Lambert (03:36)
Claims Olmo 3 7B outperforms Llama 3 8B on certain tasks, and stands among best worldwide in standard size categories.
All datasets and post-training recipes are open, enabling continued innovation and reproducible research.
(05:48–08:07)
“Historically of the data that is available openly out there for people to build a language model, you don’t have a lot of long document data...these are really good for people to develop other ways for models to understand very long inputs...”
— Luca Soldaini (05:53)
(08:07–10:25)
(10:25–12:51)
“If we can release it, we will release it...if people ask, hey, you described this part of your pipeline, but you haven’t put it out, we’ll release that part as well.”
— Luca Soldaini (10:38)
(12:51–18:32)
“A meaningful amount of people are trying open models for things and most of them are using Qwen.”
— Nathan Lambert (16:55)
(18:32–20:22)
(20:22–21:45)
(22:11–24:16)
“A thinking model is really a way to train the model to exploit [inference time scaling]... the model therefore kind of has a step change where it’s way better at math tasks, coding tasks, agentic tasks.”
— Nathan Lambert (22:33)
(24:16–35:55)
History: Paul Allen founded AI2 in 2014 to build AI systems for science.
Projects:
AI2’s size: Approx. 200 staff (research, engineering, comms, support).
Both guests drawn by the impact and openness of working at AI2 vs. closed, commercial labs.
(37:11–78:12)
“Sometimes there’s low hanging fruit and doing the obvious thing yields a lot of results...then this RL stage is extremely hard technical grinding.”
— Nathan Lambert (65:02)
“RLVR, verifiable rewards is in the name...the reward you get...is whether or not you got the problem right.”
— Nathan Lambert (73:14)
(78:12–87:26)
On openness:
“If we can release it, we will release it.”
— Luca Soldaini (10:38)
On the importance of open data and intermediate checkpoints:
“Now we have intermediate checkpoints during our supervised fine tuning for reasoning and for instruct and then also for our multi-day RL rounds. …this is all there.”
— Nathan Lambert (12:27)
On growing Chinese influence:
“A meaningful amount of people are trying open models for things and most of them are using Qwen.”
— Nathan Lambert (16:55)
On the challenging reality of building new AI capabilities:
“You’re building the tracks as the train is going down at incredible speeds, and you have to figure out ways to fix some parts of your pipeline so you can work on the rest.”
— Luca Soldaini (63:24)
On the future of AGI:
“I have very high probability, barring extreme geopolitical situations, that Big Tech executes on this vision across the two to five years to build 95 to 98% of the way there of what you can do with our physical power constraints and what an LLM’s ability is. ...Whether debating whether or not it’s AGI is kind of secondary to the fact that this is coming and we want people to study and understand what is happening.”
— Nathan Lambert (84:15)
| Timestamp | Topic |
|-----------|-------|
| 00:00 | Industry shift, Llama's uncertain future, rise of Chinese open models |
| 01:27 | Olmo 3 launch, openness beyond weights |
| 05:53 | Dolma 3 dataset, emphasis on long-context data |
| 08:07 | Model performance & benchmarking |
| 10:25 | What “open” really means in AI |
| 12:51 | Recap: 2025 and rise of China's open source surge |
| 16:37 | US/China ecosystem and business differences |
| 18:32 | Economic drivers for open vs. closed model release |
| 20:33 | Atom project and US response |
| 22:11 | What are "thinking models"? |
| 24:51 | AI2 history and mission |
| 37:11 | Full Olmo 3 pipeline: from pre-training to RLVR |
| 53:32 | Long context extension explained |
| 61:36 | Supervised fine tuning, tracing, and teacher models |
| 65:02 | DPO/preference tuning and its impact |
| 71:18 | Reinforcement learning with verifiable rewards (RLVR) and system challenges |
| 80:01 | The complexity tax of AI, reality vs. AGI hype |
| 83:51 | Gradual progress; why there may never be a sharp AGI "singularity" moment |
| 84:15 | Big Tech's trajectory vs. AGI definitions |
| 85:29 | Importance of broad public engagement, policy, and non-technical contributions to AI progress |
| 87:26 | Closing thoughts: open source, impact, and needing “many more hands” to shape the future |
The episode provides a rare, transparent look into the making and ambitions of a cutting-edge open-source AI project. Olmo 3 is more than just another model release—it embodies a commitment to full openness and scientific reproducibility, opening new possibilities for the global research community. The conversation also brings sober clarity on the realities and limits of AI progress, emphasizing the necessity of community, infrastructure, and open collaboration as the pace of innovation accelerates worldwide.