Summary

Latent Space: The AI Engineer Podcast

Episode: Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI
Date: June 19, 2025
Host(s): Alessio (Decibel), Books (Small AI)
Guest: Noam Brown (OpenAI)

Overview

This episode features Noam Brown of OpenAI, renowned for AI breakthroughs in games like poker and Diplomacy, and currently leading OpenAI’s multi-agent team. The conversation traverses the evolution of AI reasoning paradigms, scaling test time compute, multi-agent systems, the "thinking fast and slow" analogy, coding assistants, and prospects for civilization-scale AI. The episode is a must-listen for anyone interested in the bleeding edge of AI engineering, practical insights from foundational research, and the realities of rapid change in the field.

Main Discussion Points & Insights

1. From Games to Civilization: AI’s Evolving Competence

Noam’s Journey: Diplomacy & Cicero
- Built Cicero (which ranked in the top 10% of human Diplomacy players) and won the World Diplomacy Championship in 2025.
- Debugging AIs in games teaches as much about the game as about AI.
- "Sometimes it would do things that humans typically wouldn’t do, and that taught me about the game as well." — Noam [01:00]
On Centaur Systems:
- Noam was inspired by Cicero but didn’t use it during the championship.
- Human players now routinely question if an opponent might be a bot—especially as models improve (Cicero, GPT-4o, etc.).
- Language models’ quality has rapidly improved since Cicero’s 2.7B-parameter days.

2. Safety, Steerability, and AI Agents

AI Safety in Games:
- Cicero’s design (conditioning on concrete actions) aligns with safety goals:
  "It was a very controllable system...not just a language model running loose, but a reasoning system that steers the way the language model interacts with the human." — Noam [04:10]
- Researchers saw this as a pathway toward safe and steerable AI.

3. Reasoning Models and Scaling Test Time Compute

The o-Series and the Reasoning Paradigm:
- Progression from o1-preview to o1 to o3 marked by rapid scaling and new capabilities such as agentic behavior and web browsing (“mini deep research”).
- Resistance to the notion that AI is only competent in verifiable domains (math/coding) — deep research is cited as a success in fuzzy domains.
- "I think that's an existence proof that these models can succeed in tasks that don't have easily verifiable rewards." — Noam [07:18]
Quality Control and User Feedback:
- Users can still distinguish between mediocre and good AI research, critical for product improvement.

4. The "Thinking Fast and Slow" Analogy

Limitations and Applicability:
- The two-system analogy (System 1: fast, System 2: slow reasoning) partly fits, but isn’t perfect.
- Models need a threshold of System 1 competence to benefit from System 2 (deliberative) reasoning.
- "If you try to do the reasoning paradigm on top of GPT-2, I don’t think it would have gotten you almost anything." — Noam [09:22]
Emergence Caution:
- Noam is careful about labeling behaviors as emergent but notes clear qualitative shifts occur with scale.

5. Generalization, Harnesses, and Automation in Games

System 1 vs. System 2 in Games:
- Intuitive (fast) play must be strong for slow/conscious reasoning to help — both in humans and AI.
- "Harnesses" (additional scaffolding/tooling around the model) are a crutch and, with scale, will become unnecessary.
- "I think the ideal harness is no harness...I think harnesses are like a crutch that eventually we’re going to be able to move beyond." — Noam [14:05]

6. The End of Routers and Product Strategy in AI

Model Routers:
- Today, routers help switch between fast and slow models (System 1 & 2); eventually, scaling will obviate the need.
- Caution for developers: don't overinvest in complex scaffolding/routers likely to be made obsolete by scale.
- "I think that routers are going to eventually go away...the field is evolving very rapidly. Things are going to change in three months, let alone six months." — Noam [19:23]

7. Reinforcement Fine Tuning (RFT) and Data Collection

RFT Value:
- RFT remains relevant — fine-tuned data will retain value as models improve.
- "If we come out with future models that are even more capable, you could still fine tune them on your data." — Noam [21:26]

8. The Reasoning Paradigm’s Path: From Poker to Language Models

OpenAI’s Strategic Bets:
- The move from scaling pre-training to exploring reasoning via RL was contentious but farsighted.
- Early bets on scale at OpenAI provided a critical competitive edge; willingness to try “big experiments.”
- "OpenAI functions a lot like a startup...with this mission of building AGI and superintelligence...that helped them organize, collaborate, pool resources together." — Noam [32:47]

9. Coding with AI: AGI Moments, Shortcomings, and Practical Tips

AI Pair Programming:
- Noam uses coding assistants (Codex, Windsurf, Cursor), prioritizing them even for core tasks.
- On “feeling the AGI”:
  "You feel the AGI moments every few months ... then you get used to it very quickly." — Noam [34:43]
- Frustration: AI devs are like “geniuses on their first day”—very capable but lacking context/memory; product cycles lag behind model capabilities.
Scalability Limitations:
- Current blockers: code review, context limitations, lack of memory/accumulated experience in agents.

10. Remote Work, Virtual Assistants, and Alignment

Beyond Software:
- AI will impact any remote work—becoming essential in virtual assistant workflows.
- Principal-agent problem: With proper alignment, AI agents could surpass humans (in how jobs are done).

11. Multi-Agent Systems: Lessons from Human Civilization

Research Directions:
- Multi-agent at OpenAI includes scaling test time compute, collaborative and competitive AI societies.
- Civilization-scale: analogy to how humanity's advancement is due to large-scale competition and cooperation, not mere individual intelligence.
- "The AIs that we have today are kind of like the cavemen of AI. If you have billions of AIs cooperating and competing...the things that they would produce would be far beyond today's AIs." — Noam [43:51]
Field Critique:
- Historically, multi-agent AI research has been too reliant on heuristics, not scaling/bitter lesson–driven.

12. Game Theory, Exploitation, and Sample Efficiency

Poker Paradigms:
- GTO (game-theoretic optimal) AI dominates, but lacks adaptivity/human-level exploitation due to sample inefficiency.
- Techniques from Diplomacy AI (modeling other agents) might eventually solve this.
World Models:
- Brown leans toward world models and theory of mind emerging implicitly with scale, away from explicit modeling.

13. Self-Play, Open-Endedness, and Benchmarking

Self-Play Limitations:
- Self-play scaling works for zero-sum games (Go, Chess), less clear for general intelligence, open-ended domains, or math.
- Objective functions are hard to define for multi-agent, cooperative environments.
Benchmarking Constraints:
- Many benchmarks favor easily gradable hard problems; real value lies in messy, subjective, real-world tasks.

14. Applied Scaling, Bottlenecks, and the Wall Ahead

Test Time Compute Barriers:
- Scaling “thinking time” will hit cost and wall-clock bottlenecks (e.g., multi-hour/days compute runs slow down research cycles and iteration).
- "This is actually the strongest case for long timelines...a lot of this you have to run the experiment, complete it, and then see the results in order to decide on the next set of experiments." — Noam [69:31]

15. Robotics, Embodiment, and the Form Factor Debate

Noam’s Robotics Take:
- Progress is slower in robotics due to hardware iteration speed.
- Weakly prefers non-humanoid forms, inspired by drone utility and robotics startups.

16. Research Culture, Staying Updated, & the Danger of Hype

Academic Papers:
- Contrary to perceptions, academic work matters—if it replicates/scales.
- Internal channels and expert recommendations drive reading decisions.
- "A lot of people look at things like Twitter...it’s really unfortunate that we’ve reached this point where things have to get a lot of attention on social media." — Noam [65:34]

Notable Quotes & Memorable Moments (with Timestamps)

On AI Steerability:
"We conditioned Cicero on certain concrete actions and that gave it a lot of steerability to say, okay, well, it's going to pursue a behavior that we can very clearly interpret and very clearly define." — Noam [04:10]
On the “Reasoning Paradigm”:
"For me, o3, I've been using it a ton in my day to day life...it's kind of like a mini deep research that you can just get a response in three minutes." — Noam [05:48]
On Harnesses Becoming Obsolete:
"The ideal harness is no harness. I think harnesses are like a crutch that eventually we’re going to be able to move beyond." — Noam [14:05]
On the Limits of Self-Play:
"Self play outside of these two player zero sum games becomes a much more difficult, nuanced question...this is where the AlphaGo analogy breaks down." — Noam [54:33]
On Research Culture & Hype:
"I would tell the grad students I was working with that like, you need to post it on Twitter ... there's a real art to it and it does matter. And it's kind of the sad truth." — Noam [65:54]
On Civilization-Scale AI:
"The technology that we're seeing is the product of this civilization...I think the AIs that we have today are kind of like the cavemen of AI. And I think that if you're able to have them cooperate and compete with billions of AIs...the things that they would be able to produce...would be far beyond what is possible today." — Noam [43:51]

Timestamps for Key Segments

[00:53] — Impact of working on Cicero/Diplomacy on Noam’s play
[04:08] — Safety and steerability of Cicero
[07:18] — Deep research as an existence proof for non-verifiable domains
[09:22] — Why “thinking fast and slow” only partially applies
[14:05] — On tool scaffolding (‘harnesses’) and the future of model generality
[19:23] — Routers’ demise as models unify
[21:26] — RFT’s enduring value as models scale
[25:45] — OpenAI’s internal debate over reasoning paradigms
[34:43] — “Feeling the AGI” moments with coding assistants
[43:51] — Civilization as a cooperative/competitive multi-agent system
[54:33] — Why self-play breakthroughs in games may not map to general intelligence
[69:31] — Long iteration times as a soft cap on scaling
[72:11] — What to ask Greg Brockman about AGI’s future
[73:12] — “Blood on the Clocktower” as the new social game in Silicon Valley

Further Rapid Fire Insights & Recommendations

Favorite Social Games:
"Blood on the Clocktower" is replacing poker as the VC/game-night favorite [73:19]
Research Practices:
Internal paper curations, channels, and (sadly) Twitter threads are essential for keeping up.
Robotics:
Non-humanoid forms and novel applications (e.g., drones) are seen as high value.
Benchmarks Limitation:
Too much focus on testable, easily gradable problems stifles evaluation growth.

Closing Thoughts

Noam Brown’s session delivers hard-won insights from the front lines of AI research and engineering: a candid roadmap of how AI is racing beyond its training regimes, how the shift toward agentic and multi-agent intelligence redefines capability, and how the “bitter lesson” of scale continues to sweep aside hand-engineered shortcuts. The lessons—product, research, and philosophical—are invaluable for researchers, engineers, founders, and anyone charting their course through the fast-evolving AI landscape.


Transcript

[00:06]
Books
Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel. And I'm joined by my co-host, Books, founder of Small AI.
[00:13]
Alessio
Hello. Hello. And we're here recording on a holiday Monday with Noam Brown from OpenAI. Welcome.
[00:19]
Noam Brown
Thank you.
[00:19]
Alessio
So glad to have you finally join us. A lot of people have heard you. You've been rather generous with your time on podcasts, like Lex Fridman's. And you've done a TED Talk recently, just talking about the thinking paradigm. But I think perhaps your most interesting recent achievement is winning the World Diplomacy Championship in 2025. You built Cicero, which was in the top 10% of human players. I guess my opening question is, how has your Diplomacy playing changed since working on Cicero and now personally playing it?
[00:53]
Noam Brown
When you work on these games, you kind of have to understand the game well enough to be able to debug your bot. Because if the bot does something that's really radical and that humans typically wouldn't do, you're not sure if that's a mistake, like a bug in the system, or if it's actually just the bot being brilliant. When we were working on Diplomacy, I kind of did this deep dive trying to understand the game better. I played in tournaments. I watched a lot of tutorial videos and commentary videos on games, and over that process, I got better. And then also seeing the bot, like, the way it would behave in these games, sometimes it would do things that humans typically wouldn't do. And that taught me about the game as well. When we released Cicero, we announced it in, like, late 2022. I still found the game really fascinating, and so I kept up with it. I continued to play. And that led to me winning the World Championship in 2025. So just a couple months ago.
[01:47]
Alessio
There's always a question of, like, Centaur systems, where humans and machines work together. Like, was there an equivalent of what happened in Go, where you updated your playstyle?
[01:56]
Noam Brown
If you're asking if I used Cicero when I played in the tournament, the answer is no. But seeing the way the bot played and taking inspiration from that, I think, did help me in the tournament. Yeah.
[02:07]
Alessio
Do people now ask Turing questions every single time when they're playing Diplomacy?
[02:12]
Noam Brown
Ask, to try to tell if the person they're playing with is a bot or a human?
[02:17]
Alessio
Yeah, like, that's the one thing you were worried about when you started.
[02:20]
Noam Brown
It was really interesting when we were working on Cicero because, you know, we didn't have the best language models. We were really bottlenecked on the quality of the language models. And sometimes the bot would say bizarre things. Like, 99% of the time it was fine, but then every once in a while it would say this really bizarre thing. It would just hallucinate about something. Somebody would reference something that they said earlier in a conversation with the bot, and the bot would be like, I have no idea what you're talking about. I never said that. And then the person would be like, look, you could just scroll up in the chat. It's literally right there. And the bot would be like, no, you're lying. And when it did these kinds of things, people just kind of shrugged it off as, oh, the person's tired or they're drunk or whatever, or they're just trolling me. But I think that's because people weren't looking for a bot. They weren't expecting a bot to be in the games. We were actually really scared, because we were afraid that people would figure out at some point that there's a bot in these games, and then they would always be on the lookout for it. And if you're looking for it, you're able to spot it. That's the thing. So now that it's announced and people know to look for it, I think they would have an easier time spotting it. Now, that said, the language models have also gotten a lot better since 2022.
[03:27]
Alessio
It's adversarial.
[03:28]
Noam Brown
Yeah. So at this point, you know, the truth is that GPT-4o and o3, these models are, like, passing the Turing test. So I don't think they can really ask that many Turing-test questions that would actually make a difference.
[03:39]
Books
And Cicero was very small, like 2.7B, right?
[03:42]
Noam Brown
It was a very small language model. Yeah. This is one of the things we realized over the course of the project: you really benefit a lot from just having larger language models.
[03:51]
Books
Right. Yeah. How do you think about today's perception of AI and a lot of, like, maybe the safety discourse of, you know, you're going to build a bot that is really good at persuading people into helping them win a game? And I think maybe today labs want to say they don't work on that type of problem. How do you think about that dichotomy, so to speak, between the two?
[04:09]
Noam Brown
You know, honestly, like after we released Cicero, a lot of the AI safety community was really happy with the research and the way it worked because it was a very controllable system. We conditioned Cicero on certain concrete actions and that gave it a lot of steerability to say, okay, well, it's going to pursue a behavior that we can very clearly interpret and very clearly define. It's not just like, oh, it's a language model running loose and doing whatever it feels like. No, it's actually pretty steerable and there's this whole reasoning system that steers the way the language model interacts with the human. Actually, a lot of researchers reached out to me and said, we think this is potentially a really good way to achieve safety with these systems.
[04:51]
Alessio
I guess the last Diplomacy-related question that we might have is: have you updated or tested the o-series models on Diplomacy, and would you expect a lot more difference?
[05:02]
Noam Brown
I have not. I think I said on Twitter at one point that this would be a great benchmark. I would love to see all the leading bots play a game of Diplomacy with each other and see who does best. And I think a couple people have taken inspiration from that and are actually building out these benchmarks and evaluating the models. My understanding is that they don't do very well right now, but I think it really is a fascinating benchmark. Yeah, I think it'd be a really cool thing to try out.
[05:25]
Alessio
Well, we're going to go a little bit into the o-series now. I think the last time you did a lot of publicity, you were just launching o1. You did your TED talk and everything. How have the vibes changed? Just in general, you said you were very excited to learn from domain experts, like in chemistry, how they review the o-series models. How have you updated since, let's say, end of last year?
[05:49]
Noam Brown
I think the trajectory was pretty clear pretty early on in the development cycle. And I think that everything that's unfolded since then has been pretty on track for what I expected. So I wouldn't say that my perception of where things are going has honestly changed that much. I said before that we're going to see this paradigm continue to progress rapidly, and I think that's true even today. We saw that going from o1-preview to o1 to o3: consistent progress. And we're going to continue to see that going forward. And I think that we're going to see a broadening of what these models can do as well. You know, we're going to start seeing agentic behavior. We're already starting to see agentic behavior. Honestly, for me, o3, I've been using it a ton in my day-to-day life. I just find it so useful, especially the fact that it can now browse the web and do meaningful research on my behalf. It's kind of like a mini deep research that you can just get a response in three minutes. So yeah, I think it's just going to continue to become more and more useful and more powerful as time goes on, and pretty quickly.
[06:52]
Books
Yeah, and talking about deep research, you tweeted that if you need proof that we can do this in non-verifiable domains, deep research is kind of a great example. Can you maybe talk about whether there's something that people are missing? You know, I feel like I hear that repeated a lot. It's like, you know, it's easy to do in coding and math, but not in these other domains.
[07:08]
Noam Brown
I frequently get this question, including from pretty established AI researchers: okay, we're seeing these reasoning models succeed in math and coding and these easily verifiable domains, but are they ever going to succeed in domains where success is less well defined? I'm surprised that this is such a common perception because we've released deep research and people can try it out. People do use it. It's very popular, and that is very clearly a domain where you don't have an easily verifiable metric for success. It's very much, like, what is the best research report that you could generate? And yet these models are doing extremely well at this domain. So I think that's like an existence proof that these models can succeed in tasks that don't have as easily verifiable rewards.
[07:51]
Books
Is it because there's also not necessarily like a wrong answer. Like there's a spectrum of deep research quality. Right. You can have like a report that looks good, but the information is kind of so and so and then you have a great report. Do you think people have a hard time understanding the difference when they get the result?
[08:08]
Noam Brown
My impression is that people do understand the difference when they get a result, and I think that they're surprised at how good the deep research results are. Certainly it's not 100%. It could be better, and we're going to make it better. But I think people can tell the difference between a good report and a bad report, and certainly between a good report and a mediocre report.
[08:25]
Books
And that's enough to kind of feed the loop later to build the product and improve the model performance?
[08:30]
Noam Brown
I mean, I think if you're in a situation where people can't tell the difference between the outputs, then it doesn't really matter if you're, you know, hill climbing on progress. These models are going to get better at domains where there is a measure of success. Now, this idea that it has to be easily verifiable or something like that, I don't think that's true. I think that you can have these models do well even in domains where success is a very difficult-to-define thing, and could sometimes even be subjective.
[08:57]
Alessio
One thing people lean on a lot, and you've done so as well, is the thinking-fast-and-slow analogy for thinking models. And I think it's reasonably well diffused now, the idea that this is kind of the next scaling paradigm. All analogies are imperfect. What is one way in which thinking fast and slow, or System one, System two, kind of doesn't transfer to how we actually scale these things?
[09:22]
Noam Brown
One thing that I think is underappreciated is that the models, the pre-trained models, need a certain level of capability in order to really benefit from this extra thinking. This is kind of why you've seen the reasoning paradigm emerge around the time that it did. I think it could have happened earlier, but if you try to do the reasoning paradigm on top of GPT-2, I don't think it would have gotten you almost anything.
[09:44]
Alessio
Is this emergence?
[09:45]
Noam Brown
It's hard to say if it's emergence necessarily. I haven't done the, you know, the measurements to really define that clearly. But I think it's pretty clear. You know, people tried chain of thought with GPT, like, really small models, and they saw that it just didn't really do anything. Then you go to bigger models and it starts to give a lift. I think there's a lot of debate about the extent to which this kind of behavior is emergent. But clearly there is a difference. So it's not like there are these two independent paradigms. I think that they are related in the sense that you need a certain level of system one capability in your models in order to be able to benefit from system two.
[10:22]
Alessio
Yeah. I have tried to play amateur neuroscientist before and try to compare it to the evolution of the brain and how you have to evolve the cortex first before you evolve the other parts of the brain. And perhaps that is what we're doing here.
[10:36]
Noam Brown
Yeah. And you could argue that actually this is not that different from, I guess, the System one, System two paradigm. Because if you ask a pigeon to think really hard about playing chess, it's not going to get that far. It doesn't matter if it thinks for a thousand years, it's not going to be able to be better at playing chess. So maybe it's still the case with animals and humans that you need a certain level of intellectual ability, just in terms of System one, in order to benefit from System two as well. Yeah.
[11:01]
Alessio
Just as a side tangent, does this also apply to visual reasoning? So let's say now we have 4o, like, a natively omni model type of thing, and that also makes o3 really good at GeoGuessr. Does that apply to other modalities too?
[11:18]
Noam Brown
I think the evidence is yes. It depends on exactly the kinds of questions that you're asking. There are some questions that I think don't really benefit from system two. I think GeoGuessr is certainly one where you do benefit. Image recognition, if I had to guess, is one of those things where you probably benefit less from system two thinking.
[11:34]
Alessio
Because you know it or you don't.
[11:36]
Noam Brown
Yeah, exactly.
[11:37]
Alessio
There's no way.
[11:38]
Noam Brown
Yeah. And the thing I typically point to is just, like, information retrieval. If somebody asks you, when was this person born, and you don't have access to the web, then you either know it or you don't. And you can sit there and you can think about it for a long time. Maybe you can make an educated guess and say, well, this person probably lived around this time, and so this is a rough date. But you're not going to be able to get the date unless you actually just know it.
[12:01]
Alessio
But like spatial reasoning, like tic tac toe might be better because you have all the information there.
[12:06]
Noam Brown
Yeah, and I think it's true that with tic-tac-toe we see that GPT-4.5 falls over. You know, it plays decently well. I shouldn't say it falls over. It does reasonably well. You can draw the board, it can make legal moves, but it will make mistakes sometimes. And you really need that system two to enable it to play perfectly. Now it's possible that if you got to GPT-6 and you just did system one, it would also play perfectly. You know, I guess we'll know one day. But I think right now you need the System two to really do well.
[12:35]
Books
What do you think are the things that you need in system one? So obviously, general understanding of game rules. Do you also need to understand some sort of metagame, like, you know, how you usually value pieces in different games? How do you generalize in system one so that then in system two, you can kind of get to the gameplay, so to speak?
[12:54]
Noam Brown
I think the more that you have in the system one, the better. This is the same thing with humans. You know, when humans are playing a game like chess for the first time, they can apply a lot of system two thinking to it. And if you apply a ton of system two thinking to it, like if you just present a really smart person with a completely novel game and you tell them, okay, you're going to play this game against an AI or a human that's mastered this game, and you tell them to sit there and think for, like, three weeks about how to play this game, my guess is they could actually do pretty well. But it certainly helps to build up that system one thinking, to build up intuition about the game, because it will just make you so much faster.
[13:36]
Books
I think the Pokemon example is a good one: the system one kind of has maybe all this information about games, and then once you put it in the game, it still needs a lot of harnesses to work. And I'm trying to figure out how much we can take things from the harness and have them in system one, so that then system two is as harness-free as possible. But I guess that's like the question about generalizing games and AI.
[13:57]
Noam Brown
Yeah, I guess I view that as a different question. I think the question about harnesses, in my view is that the ideal harness is no harness.
[14:05]
Books
Right.
[14:06]
Noam Brown
I think harnesses are like a crutch that eventually we're going to be able to move beyond.
[14:11]
Alessio
So only tool calls.
[14:12]
Noam Brown
And you could ask. You could just ask O3. And actually it's interesting because when this playing Pokemon thing kind of emerged as this benchmark, I was actually pretty opposed to evaling this with our OpenAI models. Because my feeling is, okay, if we're going to do this eval, let's just do it with O3. How far does O3 get without any harness? How far does it get playing Pokemon? And the answer is not very far. And that's fine. I think it's fine to have an evaluation where the models do terribly. And I Don't think the answer to that should be like, well, let's build a really good harness so that now it can do well on this eval. I think the answer is like, okay, well, let's just improve the capabilities of our models so they can do well at everything. And then they also happen to make progress on this eval.
[14:57]
Books
Would you consider things like checking for a valid move a harness, or is this in the model? You know, with chess, you can either have the model learn in System 1 what moves are valid and what it can and cannot do, versus figuring it out in System 2.
[15:12]
Noam Brown
I think a lot of this is design questions. For me, I think you should give the model the ability to check if a move is legal, if you want. That could be an option in the environment: okay, here's a tool call you can make to see if an action is legal. If it wants to use that, it can. And then there's a design question of, well, what do you do if the model makes an illegal move? And I think it's totally reasonable to say, well, if they make an illegal move, then they lose the game. I don't know what happens when a human makes an illegal move in a game of chess. I actually don't know.
[15:41]
Alessio
If you're just not allowed to.
[15:42]
Noam Brown
Yeah. Do you just lose the game?
[15:44]
Books
I don't know.
[15:45]
Noam Brown
So if that's the case, then I think it's totally reasonable to say, yeah, we're going to have an eval where that's also the criteria for the AI models.
[15:52]
Alessio
Yeah, but I think maybe one way to interpret that in sort of researcher terms is: are you allowed to do search? And one of the famous findings from DeepSeek is that MCTS wasn't that useful to them. But I think there are a lot of engineers trying out search and spending a lot of tokens doing that, and maybe it's not worth it.
[16:09]
Noam Brown
Well, I'm making a distinction here: a tool call to check whether a move is legal or illegal is different from actually making that move and then seeing whether it ended up being legal or illegal. Right. So if that tool call is available, I think it's totally fine to make that tool call and check whether a move is legal or illegal. I think it's different to have the model say, oh, I'm making this move. And then it gets feedback that, oh, you made an illegal move. And so then it's like, oh, just kidding, I'm going to do something else now. So that's the distinction I'm drawing.
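Editor's note: the distinction drawn here can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's eval setup; it uses tic-tac-toe rather than chess to stay self-contained, and the class and method names (`TicTacToeEnv`, `is_legal`, `play`) are hypothetical. The legality check is a read-only query, while a committed move cannot be taken back and, under the rule floated earlier, an illegal one forfeits the game.

```python
from dataclasses import dataclass, field


@dataclass
class TicTacToeEnv:
    """Toy eval environment (hypothetical, for illustration only)."""
    board: list = field(default_factory=lambda: [" "] * 9)
    forfeited: bool = False

    def is_legal(self, cell: int) -> bool:
        # Tool call: a read-only query that never changes the game state.
        return not self.forfeited and 0 <= cell < 9 and self.board[cell] == " "

    def play(self, cell: int, mark: str) -> bool:
        # Committed action: an illegal move is not retried, it loses the game.
        if not self.is_legal(cell):
            self.forfeited = True
            return False
        self.board[cell] = mark
        return True


env = TicTacToeEnv()
env.play(4, "X")                    # legal: X takes the centre square
can_take_centre = env.is_legal(4)   # query only; the board is untouched
committed = env.play(4, "O")        # committing to an occupied square forfeits
```

In this sketch the model may call `is_legal` as often as it likes, but there is no "just kidding" path once `play` has been called.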
[16:40]
Alessio
Some people have tried to classify that second type of playing things out as test time compute. You would not classify that as test time compute.
[16:48]
Noam Brown
There's a lot of reasons why you would not want to rely on that paradigm. Imagine you have a robot, and your robot takes some action in the world and it breaks something. You can't say, oh, just kidding, I didn't mean to do that, I'm going to undo that action. The thing is broken. So if you want to simulate what would happen if you move the robot in this way, and then in your simulation you see that this thing broke and you decide not to do that action, that's totally fine. But you can't just undo actions that you've taken in the world.
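As a rough sketch of the "simulate first, then act" idea, assuming a toy world state stored in a dictionary: the `step` dynamics, the action names, and the `vase_intact` flag below are invented for illustration. The point is only that the candidate action is applied to a copy, so a bad outcome is discovered without touching the real state.

```python
import copy


def step(world, action):
    """Apply an action to a world state in place (hypothetical toy dynamics)."""
    if action == "swing_arm_left":
        world["vase_intact"] = False  # in this toy world, the arm knocks the vase over
    world["arm"] = action
    return world


def safe_actions(world, candidates):
    """Return the candidates whose simulated outcome keeps the vase intact."""
    safe = []
    for action in candidates:
        imagined = step(copy.deepcopy(world), action)  # simulate on a copy only
        if imagined["vase_intact"]:
            safe.append(action)
    return safe
```

Calling `step` on the real `world` is the irreversible part; `safe_actions` never does that, so the original state is unchanged after planning.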
[17:15]
Alessio
There's a couple more things I wanted to cover in this rough area. I actually had a question on the thinking fast and slow side, and I'm curious what you think about it. A lot of people are trying to put in what are effectively model router layers between the fast response model and the long thinking model. Anthropic is explicitly doing that. And I think there is always a question of whether you need a smart judge to route, or a dumb judge to route because it's fast. So when you have a model router passing requests between the System 1 side and the System 2 side, does the router need to be as smart as the smart model, or dumb to be fast?
[17:54]
Noam Brown
I think it's possible for a dumb model to recognize that a problem is really hard and that it won't be able to solve it and then route it to a more capable model.
[18:02]
Alessio
But it's also possible for a dumb model to be fooled or to be overconfident.
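The trade-off in this exchange can be sketched as a confidence-threshold router. Everything here is an assumption made for illustration: the `route` function, the stub `fast_model` and `slow_model`, and the idea that the fast model reports a confidence score are not taken from any product discussed in the episode.

```python
def route(question, fast_model, slow_model, threshold=0.8):
    """Answer with the fast model unless its self-reported confidence is low."""
    answer, confidence = fast_model(question)
    if confidence >= threshold:
        return answer, "fast"
    return slow_model(question), "slow"


# Stub models, for illustration only: short questions are treated as easy.
def fast_model(question):
    if len(question.split()) <= 4:
        return "quick answer", 0.95
    return "guess", 0.30


def slow_model(question):
    return "careful answer"
```

The weakness Alessio points at is visible in the sketch: if the fast model reports high confidence on a question it gets wrong, the router never escalates, so the whole scheme is only as good as that confidence signal.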
[18:06]
Noam Brown
I don't know. I think there's a real trade-off there. But I will say I think there are a lot of things that people are building right now that will eventually be washed away by scale. I think harnesses are a good example. And I think this actually happened with the reasoning models. Before the reasoning models emerged, there was all of this work that went into engineering these agentic systems that made a lot of calls to GPT-4o or these non-reasoning models to get reasoning behavior. And then it turns out, oh, we just created reasoning models and you don't need this complex behavior. In fact, in many ways it makes it worse. You just give the reasoning model the same question without any sort of scaffolding, and it just does it. People are still building scaffolding on top of the reasoning models right now, but I think in many ways those scaffolds will also just be replaced by the reasoning models, and models in general, becoming more capable. And similarly with these routers: we've said pretty openly that we want to move to a world where there is a single unified model. In that world, you shouldn't need a router on top of the model. So I think that the router issue will eventually be solved.
[19:18]
Alessio
Also, you're kind of building the router into the model weights themselves.
[19:23]
Noam Brown
I don't think there'll be a benefit for... I shouldn't say that, because I could be wrong about this. Certainly maybe there are reasons to route to different model providers or whatever, but I think that routers are going to eventually go away. And I can understand why it's worth doing in the short term, because the fact is it is beneficial right now. If you're building a product and you're getting a lift from it, then it's worth doing right now. One of the tricky things I'd imagine a lot of developers are facing is that you have to plan for where these models are going to be in six months and 12 months. That's very hard to do because things are progressing very quickly. You don't want to spend six months building something and then have it be totally washed away by scale. But I would encourage developers, when they're building things like scaffolds and routers, to keep in mind that the field is evolving very rapidly. Things are going to change in three months, let alone six months. That might require radically changing these things around or tossing them out completely. So don't spend six months building something that might get tossed out in six months.
[20:30]
Alessio
It's so hard, though. Everyone says this and then, like, no one has concrete suggestions on how.
[20:37]
Books
What about reinforcement fine-tuning? Obviously you just released it a month ago at OpenAI. Is this something people should spend time on right now, or maybe wait until the next jump?
[20:47]
Noam Brown
I think reinforcement fine-tuning is pretty cool, and I think it's worth looking into for developers because it's really about specializing the models for the data that you have. A lot of the time, we're not suddenly going to have that data baked into the raw model. So I think that's kind of a separate question.
[21:11]
Books
Yeah. So creating the environment and the reward model is the best thing people can do right now. I think the question that people have is like, should I rush to fine tune the model using RFT or should I build the harness to then RFT the models as they get better?
[21:26]
Noam Brown
I think the difference is that for reinforcement fine-tuning, you're collecting data that's going to be useful as the models improve as well. So if we come out with future models that are even more capable, you could still fine-tune them on your data. That's actually a good example where you're building something that's going to complement the models scaling and becoming more capable, rather than necessarily getting washed away by the scale.
[21:51]
Books
Yep.
[21:51]
Alessio
One last question on Ilya. You mentioned, I think on the Sarah and Elad podcast, that you had this conversation with Ilya a few years ago about more RL and reasoning in language models. Any speculation or thoughts on why, when he tried it, it didn't work or the timing wasn't right, and why the time is right now?
[22:15]
Noam Brown
I don't think I would frame it that way, that his attempt didn't work. In many ways it did. For me, I saw that in all of these domains that I'd worked on, in poker and Hanabi and Diplomacy, having the models think before acting made a huge difference in performance. Orders of magnitude of difference. It's the equivalent of a model that's a thousand to 100,000 times bigger. And in language models, you weren't really seeing that. The models would just respond instantly. Some people in the LLM field were convinced that, okay, we just keep scaling pre-training and we're going to get to superintelligence. I was kind of skeptical of that perspective. In late 2021, I was having a meal with Ilya. He asked me what my AGI timelines are, a very standard SF question. And I told him, look, I think it's actually quite far away because we're going to need to figure out this reasoning paradigm in a very general way. LLMs are very general, but they don't have a reasoning paradigm that's very general. And until they do, they're going to be limited in what they can do. We're going to scale these things up by a few more orders of magnitude. They're going to become more capable, but we're not going to see superintelligence from just that. And yes, if we had a quadrillion dollars to train these models, then maybe we would. But you're going to hit the limits of what's economically feasible before you get to superintelligence, unless you have a reasoning paradigm. And I was convinced, incorrectly, that the reasoning paradigm would take a long time to figure out because it's this big unanswered research question. Ilya agreed with me and he said, yeah, I think we need this additional paradigm. But his take was that maybe it's not that hard.
I didn't know it at the time, but he and others at OpenAI had also been thinking about this. They'd also been thinking about RL. They had been working on it, and I think they had some success. But with most research you have to iterate on things. You have to try out different ideas. And as the models become more capable, as they become faster, it becomes easier to iterate on experiments. I think that the work they did, even though it didn't result in a reasoning paradigm, it all builds on top of previous work. So they built a lot of things that over time led to this reasoning paradigm.
[24:33]
Alessio
For listeners, Noam can't talk about this, but the rumor is that that thing was codenamed GPT-Zero, if you want to search for that line of work. I think there was a time where RL basically went through a dark age, when everyone went all in on it and then nothing happened and they gave up, and now it's the golden age again. So that's what I'm trying to identify. What is it? It could just be that we have smarter base models and better data.
[24:57]
Noam Brown
I don't think it's just that we have smarter base models. We did end up getting a big success with reasoning, but I think it was in many ways a gradual thing. There were signs of life, and then we iterated and tried out some more things and got better signs of life. I think it was around October or November 2023 when I was convinced that we had very conclusive signs of life, that this was the paradigm and it was going to be a big deal. I think what OpenAI did well is that when we got those signs of life, they recognized it for what it was and invested heavily in scaling it up. And I think that's ultimately what led to reasoning models arriving when they did.
[25:46]
Books
Was there any disagreement internally? Especially because OpenAI kind of pioneered pre-training scaling and "compute is all you need," and then you're kind of saying, maybe that's not how we get there. Was it clear to everybody that this was going to work, or was it controversial?
[26:02]
Noam Brown
There's always different opinions about this stuff. I think there were some people that felt that pre-training was all we need, that we'd scale it up to infinity and we'd be there. I think a lot of the leadership at OpenAI actually recognized that there was another paradigm that was needed, and that was why they were investing all this research effort into this RL stuff. And I think that's also to the credit of OpenAI: yes, they figured out the pre-training paradigm and they were very focused on scaling it up. In fact, the vast majority of resources were focused on scaling that up. But they also recognized that something else was going to be needed, and that it was worth putting researcher effort into other directions to figure out what that extra paradigm was going to be. There was a lot of debate about, first of all, what that extra paradigm is. For a lot of the researchers, I think looking at reasoning and RL was not really about scaling test time compute. It was more about data efficiency, because the feeling was that we have tons and tons of compute, but we are actually more limited by data. So there's the data wall, and we're going to hit that before we hit limits on the compute. So how do we make these algorithms more data efficient? They are more data efficient, but I think they are also the equivalent of scaling up compute by a ton. That was interesting. There was a lot of debate around, okay, what exactly are we doing here? And even when we got the signs of life, I think there was a lot of debate about the significance of it: how much should we invest in scaling up this paradigm? Especially when you're in a small company. OpenAI in 2023 was not as big as it is today, and compute was more constrained than it is today. If you're investing resources in a direction, that's coming at the expense of something else.
And so if you look at these signs of life on reasoning and you're saying, okay, this looks promising, we're going to scale this up by a ton and invest a lot more resources into it, where are those resources coming from? You have to make that tough call about where to draw the resources from. That is a very controversial, very difficult call to make, and it makes some people unhappy. I think there was debate about whether we were focusing too much on this paradigm, whether it's really a big deal, whether we would see it generalize and do various things. I remember it was interesting that I talked to somebody who left OpenAI after we had discovered the reasoning paradigm, but before we announced O1, and they ended up going to a competing lab. I saw them after we announced O1, and they told me that at the time they really didn't think this reasoning thing, these O series, the Strawberry models, were that big of a deal. They thought we were making a bigger deal of it than it really deserved. And then when we announced O1, they saw the reaction of their coworkers at this competing lab, how everybody was like, oh crap, this is a big deal, and they pivoted the whole research agenda to focus on this. Then they realized, oh, actually this maybe is a big deal. A lot of this seems obvious in retrospect, but at the time it's actually not so obvious, and it can be quite difficult to recognize something for what it is.
[29:01]
Books
I mean, OpenAI has a great history of just making the right bet. The GPT models are kind of similar, right? It started with games and RL, and then it's like, maybe we can just scale these language models instead. I'm just impressed by the leadership and obviously the research team that keeps coming up with these insights.
[29:20]
Noam Brown
Looking back on it today, it might seem obvious that, of course, these models get better with scale, so you should just scale them up a ton and they'll get better. But it really is true that the best research is obvious in retrospect, and at the time it's not as obvious as it might seem today.
[29:36]
Alessio
A follow-up question on data efficiency. This is a pet topic of mine. It seems that our current methods of learning are still so inefficient compared to the existence proof of humans. We take five samples and we learn something. Machines need maybe 200 per whatever data point you might need. Is anyone doing anything interesting in data efficiency, or do you think there's just a fundamental inefficiency in machine learning that will always be there compared to humans?
[30:05]
Noam Brown
I think it's a good point, if you look at the amount of data these models are trained on and compare it to the amount of data that a human observes to get the same performance. With pre-training it's a little hard to make an apples-to-apples comparison, because I don't know how many tokens a baby actually absorbs when they're developing. But I think it's a fair statement to say that these models are less data efficient than humans. I think that's an unsolved research question, and probably one of the most important unsolved research questions.
[30:33]
Alessio
Maybe more important than algorithmic improvements, because we can increase the supply of data out of the existing set of the world and humans.
[30:43]
Noam Brown
I guess so. A couple of thoughts on that. One is that the answer might be an algorithmic improvement. Maybe algorithmic improvements do lead to greater data efficiency. The second is that it's not like humans learn from just reading the Internet. It's certainly easiest to learn from data that's on the Internet, but I don't think that's the limit of what data you could collect.
[31:06]
Alessio
The last follow up before we change topics to coding. Any other just anecdotes or insights from Ilya? Just in general because you've worked with him, so there's not that many people that we can talk to that have worked with him.
[31:18]
Noam Brown
I think I've just been very, very impressed with his vision. Especially when I joined and I saw the internal documents at OpenAI of what he had been thinking about back in 2021, 2022, even earlier, I was very impressed that he had a clear vision of where this was all going and what was needed.
[31:37]
Alessio
Some of his emails from 2016-17, when they were founding OpenAI, were published, and even then he was talking about how one big experiment is much more valuable than 100 small ones. That was a core insight that differentiated them from Brain, for example. It just seems very insightful, that he sees things much more clearly than others. And I just wonder what his production function is. How do you make a human like that, and how do you improve your own thinking to better model it?
[32:05]
Noam Brown
I think it is true that one of OpenAI's big successes was betting on the scaling paradigm. It is kind of odd, because they were not the biggest lab. It was difficult for them to scale. Back then, it was much more common to do a lot of small experiments, more academic style. People were trying to figure out these various algorithmic improvements. And OpenAI bet pretty early on large scale.
[32:28]
Alessio
We had David Luan on, who I think was VP Eng at the time of GPT-1 and 2, and he talked about how the difference between Brain and OpenAI was basically the cause of Google's inability to come out with a scaled model. Structurally, everyone had allocated compute, and you had to pool resources together to make bets, and you just couldn't.
[32:48]
Noam Brown
I think that's true, that OpenAI was structured differently, and I think that really helped them. OpenAI functions a lot like a startup, and other places tended to function more like universities or research labs as they traditionally existed. Operating more like a startup, with this mission of building AGI and superintelligence, helped them organize, collaborate, pool resources together, and make hard choices about how to allocate resources. And I think a lot of the other labs have now been trying to adopt setups more like that.
[33:23]
Books
Let's talk about maybe the killer use case, at least in my mind, of these models, which is coding. You released Codex recently, but I would love to talk through the Noam Brown coding stack. What models you use, how you interact with them.
[33:34]
Alessio
Cursor? Windsurf?
[33:36]
Noam Brown
Lately I've been using Windsurf and Codex. Actually a lot of Codex. I've been having a lot of fun. You just give it a task and it goes off and does it and comes back five minutes later with a pull request.
[33:46]
Alessio
And is it core research task or like side stuff that you don't super care about?
[33:50]
Noam Brown
I wouldn't say it's side stuff. I would say basically anything that I would normally try to code up, I try to do it with Codex first.
[34:02]
Alessio
Well, for you it's free, but yeah, for everybody it's free right now.
[34:05]
Noam Brown
And I think that's partly because it's the most effective way for me to do it. And also it's good for me to get experience working with this technology and then also seeing the shortcomings of it. It just helps me better understand, okay, this is the limits of these models and what we need to push on next.
[34:21]
Alessio
Have you felt the AGI?
[34:22]
Noam Brown
I felt the AGI multiple times, yes.
[34:26]
Alessio
How should people push Codex in the ways that you have? I think you see it before others because obviously you're closer to it.
[34:34]
Noam Brown
I think anybody can use Codex and feel the AGI. It's kind of funny how you feel the AGI and then you get used to it very quickly.
[34:44]
Alessio
So you're really dissatisfied with where it's lacking.
[34:47]
Noam Brown
Yeah, I know, it's magical. One day I was actually looking back at the old Sora videos when they were announced, because remember when Sora came out, it was just the biggest news ever. It was just magical. You look at that and you're like, it's really here. This is AGI. But you look at it now and it's kind of like, oh, the people don't move very organically and there's a lack of consistency in some ways and you see all these flaws in it now that you just didn't really notice when it first came out. And yeah, you get used to this technology very quickly, but I think what's cool about it is that because it's developing so quickly, you get those feel the AGI moments every few months. So something else is going to come out and just like it's magical to you and then you get used to it very quickly. Yeah.
[35:29]
Alessio
What are your Windsurf pro tips now that you're immersed in it?
[35:33]
Noam Brown
Maybe your audience is going to be more comfortable with reasoning models and use them more, but one thing I'm surprised by is how many people don't even know that O3 exists. I've been using it day to day. It's basically replaced Google search for me. I just use it all the time. And for things like coding, I tend to just use the reasoning models. My suggestion, if people have not tried the reasoning models yet: honestly, people that use them love them. Obviously a lot more people use GPT-4o, just the default on ChatGPT and that kind of stuff. I think it's worth trying the reasoning models. I think people would be surprised at what they can do.
[36:16]
Alessio
I use Windsurf daily and they still haven't actually enabled it as a default in Windsurf. I always have to dig it up, type in O3, and then it's like, oh yeah, that exists. It's weird. I would say my struggle with it has been that it takes so long to reason that I actually break out of flow.
[36:34]
Noam Brown
I think that is true, yes. And I think this is one of the advantages of Codex: you can give it a task that's kind of self-contained, and it can go off and do its thing and come back 10 minutes later. And I can see that if you're using this thing more like a pair programmer, then yeah, you want to use GPT-4.1 or something like that.
[36:52]
Books
What do you think is the most broken part of the development cycle with AI? In my mind it's pull request review. I use Codex all the time, and then I get all these pull requests and it's kind of hard to go through all of them. What other things would you like people to build to make this even more scalable?
[37:09]
Noam Brown
I think it's really on us to build a lot more stuff. These models are very limited in some ways. I find it frustrating that you ask them to do something and they spend 10 minutes doing it, and then you ask them to do something pretty similar and they go spend 10 minutes doing it again. I describe them as geniuses, but it's their first day on the job, and that's kind of annoying. Even the smartest person on earth, when it's their first day on the job, is not going to be as useful as you would like them to be. So I think being able to get more experience, and act like somebody that's actually been on the job for six months instead of one day, would make them a lot more useful. But that's really on us to build that capability.
[37:51]
Books
Do you think a lot of it is GPU constrained for you? If I think about Codex, why is it asking me to set up the environment myself? If I ask O3 to create an environment setup script for a repo, I'm sure it'll be able to do it. But today in the product I have to do it. So I'm wondering, in your mind, could these do a lot more if we just, again, put more test time compute on them, or do you think there's a fundamental model capability limitation today such that we still need a lot of human harnesses around it?
[38:20]
Noam Brown
I think that we're in an awkward state right now where progress is very fast. There are things where clearly we could do this and the models would be better, and we're going to get to it. You're just limited by how many hours there are in the day, so progress can only proceed so quickly. We're trying to get to everything as fast as we can. And I think that O3 is not where the technology will be in six months.
[38:42]
Alessio
I like that question. Overall there's a software development life cycle, not just generation of the code. From issue to PR is basically the typical commentary on that. And then there's the Windsurf side, which is inside your IDE. What else? Pull request review is something where there are startups built around it. It's not something that Codex does, and it could. So then, what else is there that is rate limiting the amount of software you could be iterating on? It's an open question. I don't know if there's an answer. Anything else on SWE in general? Where do you think this goes in form factors, or what will we be looking at this time next year in terms of what models are able to do that they're not able to do today?
[39:30]
Noam Brown
I don't think it's gonna be limited to SWE. I don't think it's gonna be limited to software engineering. I think it's gonna be able to do a lot of remote work kind of tasks.
[39:38]
Alessio
Yeah. Like SWE-Lancer type Upwork tasks.
[39:40]
Noam Brown
Yeah. Or even things that are not necessarily software engineering. The way that I think about it is, for anybody that's doing a remote work kind of job, I think it's valuable to become familiar with the technology and get a sense of what it can do, what it can't do, what it's good at, what it's not good at. Because I think the breadth of things that it's going to be able to do is going to expand over time as well.
[40:00]
Alessio
I feel like virtual assistants might be the next thing after SWE, then. You hire someone in the Philippines, someone who just looks through your email and all that. With that kind of work, you can intercept all the inputs and all the outputs and train on that. And maybe OpenAI just buys a virtual assistant company.
[40:20]
Noam Brown
Yeah. I think what I'm looking forward to is that for things like virtual Assistants, the models, if they're aligned well, they could end up being really preferable for that kind of work. There's always this principal agent problem where if you delegate a task to somebody, then are they really aligned with doing it as you would want it to be done and just as cheaply, as.
[40:43]
Alessio
Quickly as they can?
[40:44]
Noam Brown
And so if you have an AI model that's actually really aligned to you and your preferences, then that can end up doing a way better job than a human could. Well, not that it's doing a better job than a human could, but it's doing a better job than a human would.
[40:57]
Alessio
That word alignment, by the way. I think there's an interesting overloading or homomorphism between safety alignment and instruction-following alignment. And I wonder where they diverge.
[41:10]
Noam Brown
Okay, so I think where it diverges is: what do you want to align the models to? That's a difficult question. You could say you want to align it to the user. Okay, well, what happens if the user wants to build a novel virus that's going to wipe out half of humanity?
[41:22]
Alessio
That's safety alignment.
[41:23]
Noam Brown
Yeah. So I think they're related, and I think the big question is, what are you aligning towards?
[41:31]
Alessio
Yeah, there's like humanity goals and then there's your personal goals and everything in between.
[41:36]
Books
So that's kind of, I guess, the individual agent. And you announced that you're leading the multi-agent team at OpenAI. I haven't really seen many announcements, maybe I missed them, on what you've been working on, but what can you share about interesting research directions or anything from there?
[41:51]
Noam Brown
Yeah, there haven't really been announcements on this. I think we're working on cool stuff and I think we'll get to announce some cool stuff at some point. I think the team name in many ways is actually a misnomer, because we're working on more than just multi-agent. Multi-agent is one of the things we're working on. Another thing we're working on is being able to scale up test time compute by a ton. We get these models thinking for 15 minutes now. How do we get them to think for hours? How do we get them to think for days, or even longer, and be able to solve incredibly difficult problems? So that's one direction that we're pursuing. Multi-agent is another direction. And here I think there are a few different motivations. We're interested in both the collaborative and the competitive aspects of multi-agent. The way that I describe it is, people often say in AI circles that humans occupy this very narrow band of intelligence, and AIs are just going to quickly catch up and then surpass this band of intelligence. I actually don't think that the band of human intelligence is that narrow. I think it's actually quite broad, because if you compare anatomically identical humans from caveman times, they didn't get that far in terms of what we would consider intelligence today. They're not putting a man on the moon, they're not building semiconductors or nuclear reactors or anything like that. And we have those today even though we as humans are not anatomically different. So what's the difference? Well, I think the difference is that you have thousands of years and a lot of humans, billions of humans, cooperating and competing with each other, building up civilization over time. The technology that we're seeing is the product of this civilization. And I think similarly, the AIs that we have today are kind of like the cavemen of AI.
And I think that if you're able to have them cooperate and compete with billions of AIs over a long period of time and build up a civilization, essentially the things that they would be able to produce and answer would be far beyond what is possible today with the AIs that we have today.
[43:57]
Books
Do you see that being similar to maybe Jim Fan's Voyager skill library idea of saving these things, or is it just the models being retrained on this new knowledge? Because humans have a lot of it in the brain as they grow.
[44:11]
Noam Brown
I think I'm going to be evasive here and say that we're not going to say until we have something to announce, which I think we will in the not too distant future. I'm going to be a bit vague about exactly what we're doing, but I will say that the way we are approaching multi-agent, in the details and the way we're actually going about it, is I think very different from how it's been done historically and how it's being done today by other places. I've been in the multi-agent field for a long time. I've kind of felt like the multi-agent field has been a bit misguided in some ways, in the approaches that the field has taken. And so I think we're trying to take a very principled approach to multi-agent.
[44:52]
Alessio
Sorry, I gotta add: so you can't talk about what you're doing, but you can say what's misguided. What's misguided?
[44:58]
Noam Brown
I think that a lot of the approaches that have been taken have been very heuristic and haven't really been following like the bitter lesson approach to scaling and research.
[45:08]
Books
Okay, I think this might be a good spot. So obviously you've done a lot of amazing work in poker. As the reasoning models got better, I was talking to one of my friends who used to be a hardcore poker grinder, and I told them I was going to interview you. Their question was: at the table, you can get a lot of information from a small sample size about how a person plays, but today GTO is so prevalent that sometimes people forget that you can play exploitatively. As you think about multi-agent and competition, what do you think is the state of things? Is it always going to be about trying to find the optimal thing, or is a lot of it thinking more in the moment about how to exploit somebody?
[45:47]
Noam Brown
I'm guessing your audience is probably not super familiar with poker terminology, so I'll explain this a bit. A lot of people think that poker is just a luck game, and that's not true. There's actually a lot of strategy in poker, so you can win consistently if you're playing the right strategy. There are different approaches to poker. One is game theory optimal (GTO). This is where you're playing a strategy that is unbeatable in expectation. You're just unexploitable. It's kind of like in rock paper scissors: you can be unbeatable if you just randomly choose between rock, paper and scissors with equal probability, because no matter what the other guy does, they're not going to be able to exploit you. You're not going to lose in expectation. Now, a lot of people hear that and they think, well, that also means you're not going to win in expectation, because you're just playing totally randomly. But in poker, if you play the equilibrium strategy, it's actually really difficult for the opponents to figure out how to tie you, and they're going to end up making mistakes that will lead you to win over the long run. It might not be a massive win, but it is going to be a win. If you play enough hands for a long enough period of time, you're going to win in expectation. Now, there's also exploitative poker. The idea here is that you're trying to spot weaknesses in how the opponent plays. Maybe they're not bluffing enough, or maybe they fold too easily to a bluff. So you start adapting away from the game theory optimal balanced strategy, where you bluff sometimes and you don't bluff sometimes, to playing a very unbalanced strategy that's like, oh, I'm just going to bluff a ton against this person because they always fold whenever I bluff.
Now the key is that there's a trade-off here, because if you're taking this exploitative approach, then you're opening yourself up to exploitation as well. So you have to choose a balance between playing a defensive game theory optimal policy that guarantees you're not going to lose, but might not make you as much money as you potentially could, versus playing an exploitative strategy that can be much more profitable but also creates weaknesses that the opponents can take advantage of to trick you. And there's no way to perfectly balance the two. It's kind of like in rock paper scissors: if you notice somebody playing paper five times in a row, you might think, oh, they have a weakness in their strategy, I should just be throwing scissors and I'm going to take advantage of them. So on the sixth time you throw scissors, but actually that's the time when they throw rock. You never really know, so you always have this trade-off. On the poker AIs that have been extremely successful: my background is that I worked on AI for poker for several years during grad school and made the first superhuman no-limit poker AIs. The approach that we took was this game theory optimal approach, where the AIs would play this unbeatable strategy, and they would play against the world's best and beat them. Now, that also means they beat the world's worst. They would just beat anybody. But if they were up against a weak opponent, they might not beat them as severely as a human expert might, because the human expert would know how to adapt from the game theory optimal policy to exploit these weak players. So there's this kind of unanswered question of how you make an exploitative poker AI. A lot of people had pursued this research direction. I dabbled in it a little bit during grad school, and I think fundamentally it just comes down to AIs not being as sample efficient as humans.
We discussed earlier that if a human's playing poker, they're able to get a really good sense of the strengths and weaknesses of a player within a dozen hands. It's honestly really impressive. Back when we were working on AI for poker in the mid-2010s, these AIs would have to play 10,000 hands of poker to get a good profile of who this player is, how they're playing, and where their weaknesses are. Now, I think with more recent technology that has come down, but sample efficiency has still been a big challenge. What's interesting is that after working on poker, I worked on Diplomacy. I think we talked about this earlier. Diplomacy is a seven-player negotiation game, and when we started working on it, I took a very game-theoretic approach to the problem. I felt like, okay, it's kind of like poker: you compute this game theory optimal policy and you just play it, you're not going to lose in expectation, and you're going to win in practice. But that actually doesn't work in Diplomacy. Why it doesn't work is a question of how much of a rabbit hole we want to go down, but basically, when you're playing zero-sum games like poker, game theory optimal works really well. When you're playing a game like Diplomacy, where you need to both collaborate and compete, game theory optimal actually doesn't work that well, and you have to understand the players and adapt to them much better. So this ends up being very similar to the problem in poker of how you adapt to your opponents. In poker it's about adapting to their weaknesses and taking advantage of them. In Diplomacy it's about adapting to their play styles. It's kind of like if you're at a table and everybody's speaking French: you don't want to just keep talking in English, you want to adapt to them and speak French as well.
That's the realization I had with Diplomacy: that we need to shift away from this game theory optimal paradigm towards modeling the other players, understanding who they are, and then responding accordingly. So in many ways the techniques that we developed in Diplomacy are exploitative. Well, they're not really exploitative, they're just adapting to the other players at the table. But I think the same techniques could be used in AI for poker to make exploitative poker AIs. If I hadn't gotten AGI-pilled by the incredible progress we were seeing with language models and shifted my whole research agenda to focusing on general reasoning, probably what I would have worked on next was making these exploitative poker AIs. It would be a really fun research direction to go down. I think it's still there for anybody that wants to do it, and I think the key would be taking the techniques that we used in Diplomacy and applying them to things like poker.
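The rock-paper-scissors argument in this answer can be checked with a few lines of arithmetic. The sketch below is illustrative only: the payoff table is the standard one, but the opponent mixes (`PAPER_HEAVY`, `ALWAYS_ROCK`) and every name in it are assumptions made for the example, not anything stated in the episode. It shows that the uniform mix breaks even against any opponent, while a pure exploitative response earns more against a biased opponent and loses badly once that opponent adjusts.

```python
# Illustrative sketch; all names and numbers are assumptions, not from the episode.
# Payoff to "me" in rock-paper-scissors; index 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],   # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper: beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]


def expected_payoff(mine, theirs):
    """Expected payoff per round when both sides play mixed strategies."""
    return sum(
        mine[a] * theirs[b] * PAYOFF[a][b]
        for a in range(3)
        for b in range(3)
    )


UNIFORM = [1 / 3, 1 / 3, 1 / 3]    # the equilibrium ("GTO") mix
PAPER_HEAVY = [0.2, 0.6, 0.2]      # hypothetical opponent who overplays paper
ALWAYS_SCISSORS = [0.0, 0.0, 1.0]  # exploitative response to PAPER_HEAVY
ALWAYS_ROCK = [1.0, 0.0, 0.0]      # the same opponent after switching to rock

if __name__ == "__main__":
    print("uniform vs paper-heavy: ", expected_payoff(UNIFORM, PAPER_HEAVY))
    print("scissors vs paper-heavy:", expected_payoff(ALWAYS_SCISSORS, PAPER_HEAVY))
    print("scissors vs rock:       ", expected_payoff(ALWAYS_SCISSORS, ALWAYS_ROCK))
```

With these assumed numbers, the uniform mix earns zero in expectation, always throwing scissors earns about 0.4 per round against the paper-heavy opponent, and the same exploitative strategy loses a full point per round once the opponent switches to rock.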
[51:17]
Books
I think to me the core piece is when you play online you have a HUD which tells you all these stats about the other player and how much they participate pre flop, blah blah blah. And to me it's like a lot of these models from my understanding are not really leveraging the behavior of the other players at the table. They're just kind of looking at the board state and kind of working from there.
[51:36]
Noam Brown
That's correct. The way the poker AIs work today, they're just kind of sticking to their precomputed GTO strategy and they're not adapting to the other players at the table. You can do various kinds of hacky things to get them to adapt, but they're not very principled and they don't work super well.
[51:56]
Alessio
Okay, any grad students listening: if you want to work on that, I think that is a very, very reasonable research direction that will at least get in front of you and get some attention. The other thing that this conversation brings up for me is, well, one of the hypotheses for the next step after test-time compute is world models. Is world modeling an important or worthwhile research direction? Yann LeCun has been talking about this nonstop, but basically, LLMs have internal world models but not explicitly a world model.
[52:30]
Noam Brown
I think it's pretty clear that as these models get bigger they have a world model and that world model becomes better with scale. So they are implicitly developing a world model and I don't think it's something that you need to explicitly model. I could be wrong about that.
[52:48]
Alessio
When dealing with people or multi agents it might be because you have entities that are not the world and you're resolving hypotheses of which of the many types of entities you could be dealing with.
[53:01]
Noam Brown
There was this long debate in the multi-agent AI community, and it's still going on, about whether you need to explicitly model other agents, like other people, or whether they can be implicitly modeled as part of the environment. For a long time I took the perspective that of course you have to explicitly model these other agents, because they're behaving differently from the environment: they take actions, they're unpredictable, they have agency. But I've actually shifted over time to thinking that if these models become smart enough, they develop things like theory of mind. They develop an understanding that there are other agents that can take actions and have motives and all this stuff. And these models just develop that implicitly with scale and more capable behavior broadly. So that's the perspective I take these days.
[53:48]
Alessio
So what I just said was an example of a heuristic that is not bitter lesson pilled and it just goes away.
[53:53]
Noam Brown
Yeah, it really all comes back to the bitter lesson.
[53:56]
Alessio
Got to cite it on every AI podcast. So, one of the interesting and most consistent findings: I think you were at ICLR, and one of the hit talks there was about open-endedness. This guy Tim, who gave that talk, has been doing a bunch of research about multi-agent systems too. One of the most consistent findings is that it's better for AI to self-play and improve competitively, as opposed to humans training and guiding them. You find that with AlphaZero and R1-Zero, whatever that was. Do you think this will hold for multi-agent, that self-play improves models better than humans do?
[54:33]
Noam Brown
Yeah, so this is a great question, and I think it's worth expanding on. A lot of people today see self-play as the next step, and maybe the last step, that we need for superintelligence. If you look at something like AlphaGo and AlphaZero, we seem to be following a very similar trend. Right? The first step in AlphaGo was large-scale pre-training. In that case it was on human Go games. With LLMs, it's pre-training on tons of Internet data. That gets you a strong model, but it doesn't get you an extremely strong model. It doesn't get you a superhuman model. The next step in the AlphaGo paradigm is large-scale test-time compute, or large-scale inference compute, in that case with MCTS. Now we have reasoning models that also do this large-scale inference compute, and again that boosts the capabilities a ton. Finally, with AlphaGo and AlphaZero you have self-play, where the model plays against itself, learns from those games, gets better and better, and goes from something that's around human-level performance to way beyond human capability. These Go policies are now so strong that what they're doing is incomprehensible to humans. Same thing with chess. And we don't have that right now with language models. So it's really tempting to look at that and say, oh, well, we just need these AI models to now interact with each other and learn from each other, and then they're just going to get to superintelligence. The challenge, and I mentioned this a little bit when I was talking about Diplomacy, is that Go is a two-player zero-sum game. And two-player zero-sum games have this very nice property where, when you do self-play, you are converging to a minimax equilibrium.
I guess I should take a step back. Chess, Go, and even two-player poker are all two-player zero-sum games, and what you typically want in those games is what's called a minimax equilibrium. This is that GTO policy, the policy where you're guaranteeing that you're not going to lose to any opponent in expectation. I think in chess and Go that's pretty clearly what you want. Interestingly, when you look at poker, it's not as obvious. In a two-player zero-sum version of poker, you could play the GTO minimax policy, and that guarantees that you won't lose to any opponent on earth. But as I mentioned, against a weak player you're not going to make as much money off of them as you could if you instead played an exploitative policy. So there's this question of what you want. Do you want to make as much money as possible, or do you want to guarantee that you're not going to lose to any human alive? What all the AI developers in these games have decided is, well, we're going to choose the minimax policy. And conveniently, that's exactly what self-play converges to. If you have these AIs play against each other and learn from their mistakes, they converge over time to this minimax policy, guaranteed. But once you go outside of two-player zero-sum games, like in the case of Diplomacy, that's actually not a useful policy anymore. You don't want to just have this very defensive policy, and you're going to end up with really weird behavior if you start doing the same kind of self-play in things like math. For example, what does it mean to do self-play in math? You could fall into this trap of, well, I just want one model to pose really difficult questions and the other model to solve those questions. That's like a two-player zero-sum game.
The problem is that you could just pose really difficult questions that are not interesting. You could just ask it to do 30-digit multiplication. It's a very difficult problem for the AI models, but is that really making progress in the dimension that we want? Not really. So self-play outside of these two-player zero-sum games becomes a much more difficult, nuanced question. Tim basically said something similar in his talk: there are a lot of challenges in deciding what you're optimizing for when you start to talk about self-play outside of two-player zero-sum games. My point is that this is where the AlphaGo analogy breaks down. Well, it doesn't necessarily break down, but it's not going to be as easy as self-play was in AlphaGo.
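The claim that self-play converges to the minimax policy in two-player zero-sum games can be illustrated on rock-paper-scissors. The sketch below uses regret matching, a standard no-regret learning rule, purely as a stand-in: the episode does not say which algorithm any particular system used, and the function names, starting regrets, and iteration count are all assumptions made for the example. Both players update against each other, and the time-averaged strategy of each drifts toward the uniform equilibrium mix even from a lopsided start.

```python
# Illustrative sketch; algorithm choice and all names are assumptions.
# Payoff to the acting player in rock-paper-scissors (0 = rock, 1 = paper, 2 = scissors).
# The game is symmetric, so the same table serves both players.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def regret_matching(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def action_values(opponent_mix):
    """Expected payoff of each pure action against a mixed opponent."""
    return [
        sum(PAYOFF[a][b] * opponent_mix[b] for b in range(3))
        for a in range(3)
    ]


def self_play(iterations=20000):
    """Run simultaneous regret-matching updates; return each player's average mix."""
    regrets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # deliberately lopsided start
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        # Both players pick their mix from regrets accumulated so far.
        mixes = [regret_matching(r) for r in regrets]
        for player in (0, 1):
            values = action_values(mixes[1 - player])
            expected = sum(m * v for m, v in zip(mixes[player], values))
            for a in range(3):
                # Regret: how much better action a would have done than the mix played.
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += mixes[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


if __name__ == "__main__":
    for player, mix in enumerate(self_play()):
        print(player, [round(p, 3) for p in mix])
```

The guarantee applies to the average strategy, not to the mix played on any single iteration, and it depends on the game being two-player zero-sum. That is the property the answer says is lost in settings like Diplomacy or a question-poser versus solver setup in math.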
[58:39]
Alessio
What is the objective function then for that? What is the new objective function?
[58:44]
Noam Brown
Yeah, it's a good question. Yeah. And I think that that's something that a lot of people are thinking about.
[58:49]
Alessio
Yeah, I'm sure you are. In one of the last podcasts you did, you mentioned that you were very impressed by Sora. You don't work directly on Sora, but obviously it's part of OpenAI. I think the most recent update in that sort of generative media space is autoregressive ImageGen. Is that interesting or surprising in any way that you want to comment on?
[59:10]
Noam Brown
I don't work on ImageGen, so my ability to comment on this is kind of limited. But I will say I love it. I think it's super impressive. It's one of those things where you work on these reasoning models and you think, wow, we're going to be able to do all sorts of crazy stuff like advanced science, solving agent tasks, and software engineering. And then there's this whole other dimension of progress where you're like, oh, you're able to make images and videos now, and it's so much fun. That's getting a lot more of the attention, to be honest, especially from the general public, and it's probably driving a lot more of the subscription plans for ChatGPT, which is great. But I think it's just kind of funny that, like, yeah, I promise we're also working on superintelligence.
[59:51]
Alessio
But you can make everything Ghibli. I think the delta for me was that I was actually harboring this thesis that diffusion was over because of autoregressive image gen. There were rumors about this at the end of last year, and obviously now it's come out. Then Gemini comes out with text diffusion, and diffusion is so back. So there are these two directions, and it's very relevant for inference: autoregressive versus diffusion. Do we have both? Does one win?
[60:19]
Noam Brown
The beauty of research is that you get to pursue different directions, and it's not always going to be clear what the promising path is. I think it's great that people are looking into different directions and trying different things. There's a lot of value in that exploration, and I think we all benefit from seeing what works.
[60:39]
Alessio
Any potential in diffusion reasoning? Let's say you're probably going to answer that.
[60:44]
Books
So you did a master's in robotics too? We'd love to get your thoughts. You know, OpenAI kind of started with the pen-spinning trick and the robotic arm they wanted to build. Is it right to work on these humanoids? Or do you think that's kind of the wrong embodiment of AI? Outside of the usual, you know, how long until we get robots, blah, blah, blah, is there something that you think is fundamentally not being explored right now that people should really be doing in robotics?
[61:09]
Noam Brown
I did a master's in robotics years ago. First of all, I didn't actually work with robots that much. I was technically in a robotics program, and I played around with some LEGO robots my first week in the program, but then honestly I pretty quickly shifted to just working on AI for poker and was kind of nominally in the robotics master's. But my takeaway from interacting with all these roboticists and seeing their research was that I did not want to work on robots, because the research cycle is so much slower and so much more painful when you're dealing with physical hardware. Software goes so much more quickly. I think that's why we're seeing so much progress with language models and all these virtual-coworker kinds of tasks, but haven't seen as much progress in robotics: physical hardware is just much more painful to iterate on. On the question of humanoids, I don't have very strong opinions because this isn't what I'm working on. But I think there's a lot of value in non-humanoid robotics as well. Drones are a perfect example where there's clearly a lot of value. Is that a humanoid? No, but in many ways that's great. You don't want a humanoid for that kind of technology. So I think, weakly, that non-humanoids provide a lot of value.
[62:24]
Books
I was reading Richard Hamming's The Art of Doing Science and Engineering, and he talks about how, when you have a new technological shift, people try to take the old workloads and just replicate them in the new technology, when you actually have to change the way you do it. And when I see these videos of, you know, your humanoid in the house, it's like, well, the human shape has a lot of limitations that could actually be improved on. But I think people want what's familiar. You know, would you put a robot with 10 arms and five legs in your house? Or would that be eerie at night, when you get up and see that thing walking around? Is that why we use humanoids? So to me there's almost this local maximum of, you know, we've got to make it look like a human. But what's the best shape in the house?
[63:09]
Noam Brown
I'm terrible at product design, so I am not the person to ask on this. I think there is a question of whether it's better to make humanoids because they're more familiar to us, or worse to make humanoids because they're more similar to us but not quite identical. I don't know which one I would actually find creepier.
[63:26]
Alessio
Yeah. The thing that got me humanoid pilled a little bit was just the argument that most of the world is made for humans anyway, so if you want to replace human labor, you have to make a humanoid. I don't know if that's convincing.
[63:39]
Noam Brown
Again, I don't have very strong opinions in this field because I don't work in it. I was weakly in favor of humanoids, and I think what really persuaded me to be weakly in favor of non-humanoids was listening to the Physical Intelligence CEO and some of his pitches about why they're pursuing non-humanoid robotics. And conveniently, their office is actually very close to here. So if you wanted to...
[64:01]
Alessio
They're speaking at the conference and running.
[64:02]
Noam Brown
Okay, perfect. You know, I'd say, like, listen to his pitch and maybe he can convince you that non humanoid is the way to go.
[64:08]
Alessio
Awesome. The other one I would refer people to is Jim Fan, who recently did a talk on the Physical Turing Test at the Sequoia conference, which was very, very good. He's such a great educator and explainer of things, which is very hard, especially in that field. Cool. We're done asking you about things that you don't work on. So these are just more rapid-fire questions to explore some of your boundaries and get some quick hits. How do you, or top industry labs, keep on top of research? What are your tools and practices?
[64:40]
Noam Brown
It's really hard. I think a lot of people have this perception that academic research is irrelevant, and that's actually not the case. We do look at academic research. One of the challenges is that a lot of academic research shows promise in the paper but then doesn't actually work at scale, or doesn't even replicate. If we find interesting papers, we're going to try to reproduce them in house and see whether they still hold up and also scale well. But that is a big source of inspiration for us.
[65:10]
Alessio
Whatever hits arXiv, literally? Do you do the same as the rest of us, or do you have a special process?
[65:15]
Noam Brown
Especially if I get recommendations, we have an internal channel where people will post interesting papers. And I think that's a good source of, okay, well, this person that is more familiar with this area thinks that this paper is interesting, so therefore I should read it. And similarly, I'll keep track of things that are happening in my space that I think are interesting. And if I think it's really interesting, maybe I'll share it.
[65:34]
Alessio
For me, it's like WhatsApp and Signal, Google chats with researchers and that's it.
[65:38]
Noam Brown
Yeah, I think it is like, I mean, a lot of people look at things like Twitter and I think it's really unfortunate that we've reached this point where things need to get a lot of attention on social media for it to be paid attention to.
[65:50]
Alessio
That's what the grad students are trained for. They're taking classes to do this.
[65:54]
Noam Brown
I do recommend it. You know, I've worked with grad students. I work with fewer now because we don't publish as much. But when I was at FAIR publishing papers, I would tell the grad students I was working with that you need to post it on Twitter. We would go over the Twitter thread, how to present the work and everything. There's a real art to it, and it does matter. It's kind of the sad truth.
[66:15]
Books
I know when you were doing the ACPC, the AI poker competition, you mentioned that people were not doing search because they were limited to like 2 CPUs at inference. Do you see similar things today that are keeping interesting research from being done, maybe because it's not as popular or it doesn't get you into the top conferences? Are there some environmental limiters?
[66:37]
Noam Brown
Absolutely. I think one example is benchmarks. You look at things like Humanity's Last Exam: you have these incredibly difficult problems that are still very easily gradable. And I think that actually limits the scope of what you can evaluate these models on if you stick to that paradigm. It's very convenient because it's very easy to then score the models. But a lot of the things that we want to evaluate these models on are more fuzzy tasks that are not multiple-choice questions. Making benchmarks for those kinds of things is so much harder, and probably also a lot more expensive to evaluate. But I think those are really valuable things to work on.
[67:14]
Books
And that would fit the Sam moment: GPT-4.5 is like a high-taste model in a way. There are all these non-measurable things about a model that are really good, that maybe people are not...
[67:26]
Noam Brown
Well, I think there are things that are measurable but they're just much more difficult to measure. And I think that a lot of benchmarks have kind of stuck to this paradigm of posing really difficult problems that are really easy to measure.
[67:38]
Alessio
So let's say the pre-training scaling paradigm took about five years, from the discovery of GPT to scaling it up to GPT-4, and we give test-time compute five years as well. If test-time compute hit a wall by 2030, what would be the probable cause?
[67:55]
Noam Brown
It's very similar to pre-training, where you can push pre-training a lot further and it just becomes more expensive with each iteration. I think we're going to see something similar with test-time compute, where, okay, we're going to get them thinking for three hours instead of three minutes, and then three days, and then three weeks.
[68:10]
Alessio
Oh, you run out of human life.
[68:11]
Noam Brown
Well, there are two concerns. One is that it becomes much more expensive to get the models to think for that long. As you scale up test-time compute, you're spending more on test-time compute, which means that there's a limit to how much you can spend. That's one potential ceiling. Now, I should say that we're also becoming more efficient. These models are becoming more efficient in the way they think, so they're able to do more with the same amount of test-time compute. I think that's a very underappreciated point: it's not just that we're getting these models to think for longer. In fact, if you look at O3, it's thinking for longer than O1 Preview on some questions, but it's not a radical difference, and yet it's way better. Why? Because it's just becoming better at thinking. Anyway, you can only scale up test-time compute so much, and that becomes a soft barrier, in the same way that with pre-training it's becoming more and more expensive to train better and bigger pre-trained models. The second point is that as you have these models think for longer, you get bottlenecked by wall-clock time. It is really easy to iterate on experiments when these models respond instantly. It's actually much harder when they take three hours to respond. And what happens when they take three weeks? It takes you at least three weeks to do those evaluations and then iterate on that. You can parallelize experiments to some extent, but for a lot of it, you have to run the experiment, complete it, and then see the results in order to decide on the next set of experiments. I think this is actually the strongest case for long timelines: because the models have to do so much in serial time, we can only iterate so quickly.
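The wall-clock point is simple arithmetic. The sketch below is a back-of-the-envelope illustration under strong assumptions that are not from the episode: a one-year budget, strictly sequential experiments where each must finish before the next is chosen, and no time spent on anything except the model's thinking. The function name and the thinking-time table are hypothetical.

```python
# Back-of-the-envelope sketch; the one-year budget and strictly serial
# experiments are assumptions for illustration, not figures from the episode.
MINUTES_PER_YEAR = 365 * 24 * 60


def max_serial_rounds(minutes_per_run, budget_minutes=MINUTES_PER_YEAR):
    """Upper bound on run-then-decide rounds when each run must finish
    before the next experiment can be chosen."""
    return budget_minutes // minutes_per_run


THINKING_TIMES = {
    "3 minutes": 3,
    "3 hours": 3 * 60,
    "3 days": 3 * 24 * 60,
    "3 weeks": 3 * 7 * 24 * 60,
}

if __name__ == "__main__":
    for label, minutes in THINKING_TIMES.items():
        rounds = max_serial_rounds(minutes)
        print(f"{label:>9}: at most {rounds} serial rounds per year")
```

Under these assumptions, three-hour runs allow at most 2,920 sequential rounds in a year, while three-week runs allow at most 17. Parallel experiments raise throughput but do not shorten any chain of experiments in which each one depends on the previous result.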
[69:42]
Alessio
How would you overcome that wall?
[69:44]
Noam Brown
It's a challenge and I think it depends on the domain. So drug discovery I think is one domain where this could be a real bottleneck. I mean, if you want to see if something extends human life, it's going to take you a long time to figure out if this new drug that you developed actually extends human life and doesn't have terrible side effects along the way.
[70:00]
Alessio
Side note, do we not have perfect models of human chemistry and biology by now?
[70:04]
Noam Brown
Well, so this is, I think the thing, and again, I want to be cautious here because I'm not actually a biologist or a chemist. I know very little about these fields. Last time I took a biology class was 10th grade in high school. I don't think that there's a perfect simulator of human biology right now. And I think that that's something that could potentially help address this problem.
[70:22]
Alessio
That's the number one thing that we should all work on.
[70:25]
Noam Brown
Well, that's one of the things that we're hoping that these reasoning models will help us with.
[70:28]
Alessio
Yeah, how would you classify mid training versus post training today?
[70:33]
Noam Brown
All these definitions are so fuzzy, so I don't have a great answer there.
[70:39]
Alessio
It's a question people have, and OpenAI is explicitly hiring for mid-training. And everyone is like, what the hell is mid-training?
[70:46]
Noam Brown
I think mid-training is between pre-training and post-training. It's not post-training, it's not pre-training. It's adding more to the models, but after pre-training, like, I don't know.
[70:59]
Alessio
Interesting ways.
[71:01]
Noam Brown
Yeah.
[71:01]
Books
Okay.
[71:01]
Alessio
All right. Well, you know, I was trying to get some clarity.
[71:07]
Books
Is the pre-trained model now basically just the artifact that then spawns other models? It's almost like the core pre-trained model is never really exposed anymore. Mid-training is the new pre-training, and then there's the post-training once you have the models branched out.
[71:23]
Noam Brown
You never interact with an actual just like raw pre trained model. Like if you're going to interact with the model, it's going to go through mid training and post training. So you're seeing the final product.
[71:32]
Alessio
Well, you don't let us do it, but you know, we used to.
[71:34]
Noam Brown
Well, yeah, I guess there are open-source models where you can just interact with the raw pre-trained model. But OpenAI models go through a mid-training step and then a post-training step, and then they're released, and they're a lot more useful. Frankly, if you interacted with an only-pre-trained model, it would be super difficult to work with, and it would seem kind of dumb.
[71:51]
Books
Yeah.
[71:52]
Alessio
But it'd be useful in weird ways, you know, because there's a mode collapse when you post train it for, like, chat.
[71:59]
Noam Brown
Yeah. In some ways you want that mode collapse. Like, you want that collapse of, like...
[72:03]
Alessio
Yes. To be useful.
[72:04]
Noam Brown
Yeah, I get it.
[72:05]
Alessio
Yeah. We're interviewing Greg Brockman next. You've talked to him a lot. What would you ask him?
[72:11]
Noam Brown
What would I ask Greg? I mean, I get to ask Greg all the time. What should you ask Greg?
[72:15]
Alessio
Like, to evoke an interesting response, something that he doesn't get asked about enough but that he's passionate about, or that you just want his thoughts on.
[72:25]
Noam Brown
I think in general it's worth asking where this goes, you know, like what does the world actually look like in five years? What does the world look like in 10 years? What does that distribution of outcomes look like? And what could the world or individuals do to help steer things towards the good outcomes instead of the negative outcomes?
[72:46]
Alessio
Okay. Like an alignment question.
[72:48]
Noam Brown
I think people get very focused on what's going to happen in like one or two years. And I think it's also worth spending some time thinking about like, well, what happens in five or 10 years and what does that world look like?
[73:00]
Alessio
I mean, he doesn't have a crystal ball.
[73:01]
Noam Brown
But he certainly has thoughts. Yeah. So I think that's worth exploring.
[73:08]
Books
Okay, what are games that you recommend to people, especially socially?
[73:13]
Noam Brown
What are games that I recommend to people? I've been playing a lot of this game called Blood on the Clocktower lately.
[73:18]
Alessio
What is it?
[73:19]
Noam Brown
It's kind of like Mafia or Werewolf. It's become very popular in San Francisco.
[73:25]
Alessio
Oh, that's the one we played in your house?
[73:26]
Noam Brown
Yeah.
[73:26]
Alessio
Okay, got it.
[73:27]
Noam Brown
It's kind of funny, because I was talking to a couple of people who told me that it used to be that poker was the way that the VCs and tech founders and stuff would socialize with each other. And actually now it's shifting more towards Blood on the Clocktower. Like, that's the thing that people use to, you know, connect in the Bay Area. And I was actually told that a startup held a recruiting event that was a Blood on the Clocktower game.
[73:55]
Books
Wow.
[73:55]
Noam Brown
Yeah, so I guess it's like, it's really catching on. But it's a fun game and I guess you lose less money playing it than you do playing poker. So it's better for people that are not very good at these things. I think it's kind of a weird recruiting event, but certainly a fun game.
[74:09]
Alessio
What qualities make a winner here that are interesting to hire for?
[74:13]
Noam Brown
That's the thing. Like, okay, I guess you get the ability to lie, deception, and picking up on deception. Like, is that the best employee? I don't know.
[74:24]
Books
So my slight final pet topic is Magic the Gathering. We talked about some of these games, chess, Go, and they have perfect information. Then you have poker, which is imperfect information in a pretty limited universe. You only have a 52 card deck. And then you have these other games that have imperfect information and, like, a huge pool of possible options. Do you have any idea of how much harder that is? Like, how does the difficulty of this problem scale?
[74:49]
Noam Brown
I love that you asked that because I have this huge store of knowledge on AI for imperfect information games. This was my area of research for so long, and I know all these things, but I don't get to talk about it very often. We've made superhuman poker AIs for no-limit Texas Hold'em. One of the interesting things about that is that the amount of hidden information is actually pretty limited, because you have two hidden cards when you're playing Texas Hold'em. And so the number of possible states that you could be in is 1,326, when you're playing heads up at least. And, you know, that's multiplied by the number of other players that there are at the table, but it's still not a massive number.
And so the way these AI models work is they enumerate all the different states that you could be in. So if you're playing six-handed poker, there are five other players, five times 1,326. That's the number of states that you could be in. And then you assign a probability to each one, and then you feed those probabilities into your neural net and you get actions back for each of those states.
The problem is that as you scale the number of hidden possibilities, like the number of possible states you could be in, that approach breaks down. And there's still this very interesting unanswered question of what you do when the number of hidden states becomes extremely large. So if you go to Omaha poker, where you have four hidden cards, there are kind of heuristic things you could do to reduce the number of states. But actually it's still a very difficult question. And then if you go to a game like Stratego, where you have 40 pieces, so there are close to 40 factorial different states you could be in, then all these existing approaches that we used for poker kind of break down and you need different approaches. And there's a lot of active research going on about how you cope with that.
So for something like Magic the Gathering, the techniques that we used in poker would not work out of the box. And it's still an interesting research question of what you do. Now, I should say this becomes a problem when you're doing the kinds of search techniques that we used in poker. If you're just doing model-free RL, it's not a problem. And my guess is that if somebody put in the effort, they could probably make a superhuman bot for Magic the Gathering.
Now, yeah, there are still some unanswered research questions in that space. Are they the most important unanswered research questions? I'm inclined to say no. The problem is that the techniques that we used in poker to do this kind of search were pretty limited. And if you expand those techniques, maybe you get them to work on things like Stratego and Magic the Gathering, but they're still going to be limited. They're not going to get you to superhuman on Codeforces with language models. So I think it's more valuable to just focus on the very general reasoning techniques. And one day, as we improve those, I think we'll have a model that just plays Magic the Gathering at a superhuman level out of the box. And I think that's the more important and more impressive research direction.
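Editor's note: the counts Noam mentions can be checked with a few lines of Python. The sketch below is an illustration only, not code from the episode or from any poker AI. The function names (hidden_hand_count, uniform_belief, condition_on_visible) are hypothetical, cards are represented as the integers 0 to 51 purely for convenience, and the 40! figure is the rough bound quoted in the conversation rather than an exact count of Stratego positions.

```python
from itertools import combinations
from math import comb, factorial

DECK_SIZE = 52  # standard deck; cards are just the integers 0..51 here


def hidden_hand_count(hole_cards: int, deck_size: int = DECK_SIZE) -> int:
    """Number of distinct private hands one opponent could hold."""
    return comb(deck_size, hole_cards)


def uniform_belief(hole_cards: int = 2, deck_size: int = DECK_SIZE) -> dict:
    """Enumerate every possible hidden hand and give each equal probability."""
    hands = list(combinations(range(deck_size), hole_cards))
    p = 1.0 / len(hands)
    return {hand: p for hand in hands}


def condition_on_visible(belief: dict, visible_cards) -> dict:
    """Drop hands that contain a card we can already see, then renormalize."""
    visible = set(visible_cards)
    kept = {h: p for h, p in belief.items() if not visible.intersection(h)}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}


holdem_hands = hidden_hand_count(2)   # two hidden cards (Texas Hold'em)
omaha_hands = hidden_hand_count(4)    # four hidden cards (Omaha)
stratego_bound = factorial(40)        # rough bound quoted in the episode

belief = uniform_belief(2)
# Assumed example: we hold two cards and three board cards are face up.
posterior = condition_on_visible(belief, [0, 1, 2, 3, 4])

print(holdem_hands, omaha_hands, len(posterior), len(str(stratego_bound)))
```

Going from two to four hidden cards grows the per-opponent table from 1,326 to 270,725 entries, and 40! has 48 digits, which is consistent with the point that explicitly enumerating hidden states stops being practical as the hidden information grows.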
[77:31]
Books
Cool.
[77:31]
Alessio
Amazing.
[77:32]
Books
Yeah. Thanks very much for coming on, Noam.
[77:34]
Noam Brown
Yeah, thanks for your time. Yeah, thanks for having me.
