Latent Space: The AI Engineer Podcast
Episode: Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI
Date: June 19, 2025
Host(s): Alessio (Decibel), swyx (Smol AI)
Guest: Noam Brown (OpenAI)
Overview
This episode features Noam Brown of OpenAI, renowned for AI breakthroughs in games like poker and Diplomacy, and currently leading OpenAI’s multi-agent team. The conversation traverses the evolution of AI reasoning paradigms, scaling test time compute, multi-agent systems, the "thinking fast and slow" analogy, coding assistants, and prospects for civilization-scale AI. The episode is a must-listen for anyone interested in the bleeding edge of AI engineering, practical insights from foundational research, and the realities of rapid change in the field.
Main Discussion Points & Insights
1. From Games to Civilization: AI’s Evolving Competence
- Noam’s Journey: Diplomacy & Cicero
- Built Cicero (which placed in the top 10% of human Diplomacy players) and personally won the World Diplomacy Championship in 2025.
- Debugging AIs in games teaches as much about the game as about AI.
- "Sometimes it would do things that humans typically wouldn’t do, and that taught me about the game as well." — Noam [01:00]
- On Centaur Systems:
- Noam was inspired by Cicero but didn’t use it during the championship.
- Human players now routinely question if an opponent might be a bot—especially as models improve (Cicero, GPT-4o, etc.).
- Language models’ quality has rapidly improved since Cicero’s 2.7B-parameter days.
2. Safety, Steerability, and AI Agents
- AI Safety in Games:
- Cicero’s design (conditioning on concrete actions) aligns with safety goals:
"It was a very controllable system...not just a language model running loose, but a reasoning system that steers the way the language model interacts with the human." — Noam [04:10]
- Researchers saw this as a pathway toward safe and steerable AI.
3. Reasoning Models and Scaling Test Time Compute
- The o-series and the Reasoning Paradigm:
- Progression from o1 to o3 marked by rapid scaling and new capabilities such as agentic behavior and web browsing (“mini deep research”).
- Resistance to the notion that AI is only competent in verifiable domains (math/coding) — deep research is cited as a success in fuzzy domains.
- "I think that's an existence proof that these models can succeed in tasks that don't have easily verifiable rewards." — Noam [07:18]
- Quality Control and User Feedback:
- Users can still distinguish between mediocre and good AI research, critical for product improvement.
4. The "Thinking Fast and Slow" Analogy
- Limitations and Applicability:
- The two-system analogy (System 1: fast, System 2: slow reasoning) partly fits, but isn’t perfect.
- Models need a threshold of System 1 competence to benefit from System 2 (deliberative) reasoning.
- "If you try to do the reasoning paradigm on top of GPT-2, I don’t think it would have gotten you almost anything." — Noam [09:22]
- Emergence Caution:
- Noam is careful about labeling behaviors as emergent but notes clear qualitative shifts occur with scale.
5. Generalization, Harnesses, and Automation in Games
- System 1 vs. System 2 in Games:
- Intuitive (fast) play must be strong for slow/conscious reasoning to help — both in humans and AI.
- "Harnesses" (additional scaffolding/tooling around the model) are a crutch and, with scale, will become unnecessary.
- "I think the ideal harness is no harness...I think harnesses are like a crutch that eventually we’re going to be able to move beyond." — Noam [14:05]
6. The End of Routers and Product Strategy in AI
- Model Routers:
- Today, routers help switch between fast and slow models (System 1 & 2); eventually, scaling will obviate the need.
- Caution for developers: don't overinvest in complex scaffolding/routers likely to be made obsolete by scale.
- "I think that routers are going to eventually go away...the field is evolving very rapidly. Things are going to change in three months, let alone six months." — Noam [19:23]
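The router pattern Noam describes can be sketched in a few lines. The sketch below is purely illustrative: the model names and the "hardness" heuristic are hypothetical, not OpenAI's actual routing logic.

```python
# Hypothetical sketch of a model router: send quick queries to a fast
# (System 1) model and hard ones to a slower reasoning (System 2) model.
# The tier names and heuristic are illustrative assumptions.

def looks_hard(query: str) -> bool:
    """Toy heuristic: treat long or reasoning-flavored queries as 'hard'."""
    keywords = ("prove", "debug", "step by step", "optimize")
    return len(query) > 200 or any(k in query.lower() for k in keywords)

def route(query: str) -> str:
    """Return which model tier would handle the query."""
    return "slow-reasoning-model" if looks_hard(query) else "fast-model"

print(route("What's the capital of France?"))                    # fast-model
print(route("Prove that sqrt(2) is irrational, step by step."))  # slow-reasoning-model
```

Noam's point is that this whole layer is temporary scaffolding: as a single model learns to decide for itself how long to think, the routing decision moves inside the model and the external switch disappears.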
7. Reinforcement Fine Tuning (RFT) and Data Collection
- RFT Value:
- RFT remains relevant — fine-tuned data will retain value as models improve.
- "If we come out with future models that are even more capable, you could still fine tune them on your data." — Noam [21:26]
8. The Reasoning Paradigm’s Path: From Poker to Language Models
- OpenAI’s Strategic Bets:
- The move from scaling pre-training to exploring reasoning via RL was contentious but farsighted.
- Early bets on scale at OpenAI provided a critical competitive edge; willingness to try “big experiments.”
- "OpenAI functions a lot like a startup...with this mission of building AGI and superintelligence...that helped them organize, collaborate, pool resources together." — Noam [32:47]
9. Coding with AI: AGI Moments, Shortcomings, and Practical Tips
- AI Pair Programming:
- Noam uses coding assistants (Codex, Windsurf, Cursor), prioritizing them even for core tasks.
- On “feeling the AGI”:
"You feel the AGI moments every few months ... then you get used to it very quickly." — Noam [34:43]
- Frustration: AI devs are like “geniuses on their first day”: very capable but lacking context and memory; product cycles lag behind model capabilities.
- Scalability Limitations:
- Current blockers: code review, context limitations, lack of memory/accumulated experience in agents.
10. Remote Work, Virtual Assistants, and Alignment
- Beyond Software:
- AI will impact any remote work—becoming essential in virtual assistant workflows.
- Principal-agent problem: with proper alignment, AI agents could eventually perform this work better than the humans currently doing it.
11. Multi-Agent Systems: Lessons from Human Civilization
- Research Directions:
- Multi-agent at OpenAI includes scaling test time compute, collaborative and competitive AI societies.
- Civilization-scale: analogy to how humanity's advancement is due to large-scale competition and cooperation, not mere individual intelligence.
- "The AIs that we have today are kind of like the cavemen of AI. If you have billions of AIs cooperating and competing...the things that they would produce would be far beyond today's AIs." — Noam [43:51]
- Field Critique:
- Historically, multi-agent AI research has been too reliant on heuristics, not scaling/bitter lesson–driven.
12. Game Theory, Exploitation, and Sample Efficiency
- Poker Paradigms:
- GTO (game-theoretic optimal) AI dominates, but lacks adaptivity/human-level exploitation due to sample inefficiency.
- Techniques from Diplomacy AI (modeling other agents) might eventually solve this.
- World Models:
- Brown leans toward world models and theory of mind emerging implicitly with scale, away from explicit modeling.
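As background for the GTO discussion above, here is a minimal, self-contained sketch of regret matching, the update rule at the heart of the CFR family of algorithms used in GTO poker bots. One-shot rock-paper-scissors is a toy stand-in for real game trees, which these solvers actually traverse.

```python
# Regret-matching sketch: two players self-play one-shot rock-paper-scissors
# with exact expected-value updates; their average strategies converge toward
# the uniform Nash equilibrium. Real poker solvers work on huge game trees;
# this toy shows only the core mechanic.

ACTIONS = 3  # 0=rock, 1=paper, 2=scissors
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]  # PAYOFF[a][b]: payoff for playing a against b

def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train(iterations=50_000):
    # Start from asymmetric regrets so convergence is visible.
    regrets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    strat_sum = [[0.0] * ACTIONS, [0.0] * ACTIONS]
    for _ in range(iterations):
        strats = [strategy_from_regrets(r) for r in regrets]
        for p in range(2):
            opp = strats[1 - p]
            # Expected payoff of each action against the opponent's mix.
            evs = [sum(opp[b] * PAYOFF[a][b] for b in range(ACTIONS))
                   for a in range(ACTIONS)]
            node_ev = sum(strats[p][a] * evs[a] for a in range(ACTIONS))
            for a in range(ACTIONS):
                regrets[p][a] += evs[a] - node_ev
                strat_sum[p][a] += strats[p][a]
    return [[s / iterations for s in row] for row in strat_sum]

avg = train()
print(avg)  # both rows approach [1/3, 1/3, 1/3]
```

The sample-inefficiency point follows directly: this procedure computes an unexploitable fixed strategy, but nothing in the update adapts to a specific opponent's tendencies within a handful of hands, which is exactly the gap Noam suggests opponent-modeling techniques from Diplomacy might close.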
13. Self-Play, Open-Endedness, and Benchmarking
- Self-Play Limitations:
- Self-play scaling works for zero-sum games (Go, Chess), less clear for general intelligence, open-ended domains, or math.
- Objective functions are hard to define for multi-agent, cooperative environments.
- Benchmarking Constraints:
- Many benchmarks favor easily gradable hard problems; real value lies in messy, subjective, real-world tasks.
14. Applied Scaling, Bottlenecks, and the Wall Ahead
- Test Time Compute Barriers:
- Scaling “thinking time” will hit cost and wall-clock bottlenecks (e.g., multi-hour/days compute runs slow down research cycles and iteration).
- "This is actually the strongest case for long timelines...a lot of this you have to run the experiment, complete it, and then see the results in order to decide on the next set of experiments." — Noam [69:31]
15. Robotics, Embodiment, and the Form Factor Debate
- Noam’s Robotics Take:
- Progress is slower in robotics due to hardware iteration speed.
- Weakly prefers non-humanoid forms, inspired by drone utility and robotics startups.
16. Research Culture, Staying Updated, & the Danger of Hype
- Academic Papers:
- Contrary to perceptions, academic work matters—if it replicates/scales.
- Internal channels and expert recommendations drive reading decisions.
- "A lot of people look at things like Twitter...it’s really unfortunate that we’ve reached this point where things have to get a lot of attention on social media." — Noam [65:34]
Notable Quotes & Memorable Moments (with Timestamps)
- On AI Steerability:
"We conditioned Cicero on certain concrete actions and that gave it a lot of steerability to say, okay, well, it's going to pursue a behavior that we can very clearly interpret and very clearly define." — Noam [04:10]
- On the “Reasoning Paradigm”:
"For me, o3, I've been using it a ton in my day to day life...it's kind of like a mini deep research that you can just get a response in three minutes." — Noam [05:48]
- On Harnesses Becoming Obsolete:
"The ideal harness is no harness. I think harnesses are like a crutch that eventually we’re going to be able to move beyond." — Noam [14:05]
- On the Limits of Self-Play:
"Self play outside of these two player zero sum games becomes a much more difficult, nuanced question...this is where the AlphaGo analogy breaks down." — Noam [54:33]
- On Research Culture & Hype:
"I would tell the grad students I was working with that like, you need to post it on Twitter ... there's a real art to it and it does matter. And it's kind of the sad truth." — Noam [65:54]
- On Civilization-Scale AI:
"The technology that we're seeing is the product of this civilization...I think the AIs that we have today are kind of like the cavemen of AI. And I think that if you're able to have them cooperate and compete with billions of AIs...the things that they would be able to produce...would be far beyond what is possible today." — Noam [43:51]
Timestamps for Key Segments
- [00:53] — Impact of working on Cicero/Diplomacy on Noam’s play
- [04:08] — Safety and steerability of Cicero
- [07:18] — Deep research as an existence proof for non-verifiable domains
- [09:22] — Why “thinking fast and slow” only partially applies
- [14:05] — On tool scaffolding (‘harnesses’) and the future of model generality
- [19:23] — Routers’ demise as models unify
- [21:26] — RFT’s enduring value as models scale
- [25:45] — OpenAI’s internal debate over reasoning paradigms
- [34:43] — “Feeling the AGI” moments with coding assistants
- [43:51] — Civilization as a cooperative/competitive multi-agent system
- [54:33] — Why self-play breakthroughs in games may not map to general intelligence
- [69:31] — Long iteration times as a soft cap on scaling
- [72:11] — What to ask Greg Brockman about AGI’s future
- [73:12] — “Blood on the Clock Tower” as the new social game in Silicon Valley
Further Rapid Fire Insights & Recommendations
- Favorite Social Games:
"Blood on the Clock Tower" is replacing poker as the VC/game-night favorite [73:19]
- Research Practices:
Internal paper curations, channels, and (sadly) Twitter threads are essential for keeping up.
- Robotics:
Non-humanoid forms and novel applications (e.g., drones) are seen as high value.
- Benchmarks Limitation:
Too much focus on testable, easily gradable problems stifles evaluation growth.
Closing Thoughts
Noam Brown’s session delivers hard-won insights from the front lines of AI research and engineering: a candid roadmap of how AI is racing beyond its training regimes, how the shift toward agentic and multi-agent intelligence redefines capability, and how the “bitter lesson” of scale continues to sweep aside hand-engineered shortcuts. The lessons—product, research, and philosophical—are invaluable for researchers, engineers, founders, and anyone charting their course through the fast-evolving AI landscape.
