No Priors Podcast

Episode: O3 and the Next Leap in Reasoning with OpenAI’s Eric Mitchell and Brandon McKinzie
Date: May 1, 2025
Guests: Brandon McKinzie and Eric Mitchell (OpenAI)
Hosts: Elad Gil and Sarah Guo

Episode Overview

This episode of No Priors dives deep into OpenAI's recently released O3 model, exploring its breakthroughs in reasoning, multi-step tool use, and the implications for knowledge work, coding, and future AI research. Brandon McKinzie and Eric Mitchell, two core contributors, join to unpack what sets O3 apart, its training methodology, and how reasoning models evolve. The conversation also covers user experiences, tool-use in AI, evaluation challenges, and predictions for where reasoning AI is headed next.

Key Discussion Points & Insights

1. What is O3? The Next Leap in Reasoning

Smarter Through Deliberation: O3 belongs to OpenAI’s “O” series, designed for deliberate, multi-step reasoning rather than instant predictions.
- Eric Mitchell (00:41): “These models are … smarter than models that don’t think before they respond. Similarly to humans, it’s easier to be more accurate if you think before you respond.”
Enhanced Tool Use: O3 doesn't just solve problems; it figures out what tools it needs—web browsing, code execution, image manipulation—then uses them autonomously to tackle complex tasks.
- Eric Mitchell (00:41): “If it can’t browse the web and get up to date information, there’s just a limitation on how much useful stuff that model can do for you.”

2. Core Innovations & Training Differences

Reinforcement Learning for Long-Term Tasks: The main divergence from earlier models is the heavy use of reinforcement learning, which optimizes not for next-token prediction but for solving intricate, multi-step challenges.
- Brandon McKinzie (03:19): “Reinforcement learning is the biggest one. … Now we have a more focused goal of the model: solving very difficult tasks and taking as long as it needs to do to figure out the answers.”
Inference-Time Deliberation: O3 spends more time “thinking,” using extra compute during inference. This tends to improve answer accuracy but also makes responses slower.
- Brandon McKinzie (03:19): “The longer it thinks, like, I really get the impression that I’m going to get a better result.”
Paired with Tools for Real Productivity: The model’s ability to call tools (APIs, code execution, browsing) is pivotal in achieving coherent, multi-step workflows.

3. Tool Use, Deliberation, and Test-Time Scaling

Productivity Gain: Having access to tools allows O3 to reason through or offload tasks to specialized systems, drastically improving result quality and the reliability of its deliberation phase.
- Brandon McKinzie (09:10): “When you give it access to a tool, it’s like, okay, well I gotta figure something out. Let’s see if I can manipulate the image or crop around here … It’s a much more productive use of tokens.”
Efficient Compute Allocation: For tasks where code or quantitative analysis is required, O3 writes and executes programs directly instead of simulating logic in its own context.
- Eric Mitchell (10:04): “You could have the model … fit coefficients in its context, or you could literally just have it write the code … and just know what the actual answer is.”
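
To make the coefficient-fitting point concrete, here is a minimal sketch of the kind of program a model might write and run instead of estimating the fit in its context. The data and the `fit_line` helper are made up for illustration; the episode does not show any actual code O3 produces.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data (an assumption): points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Running the arithmetic in an interpreter gives an exact answer in a few lines of output, rather than spending many reasoning tokens approximating it.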

4. Product Split: Fast vs. Smart Models?

Unification vs. Specialization: There’s active debate on whether future models should be general all-in-one AIs, or whether users should pick from a suite of specialized AIs (fast/lightweight vs. deep/intensive).
- Eric Mitchell (05:24): “Are we going to have two models people pick between, or a zillion models … or do we put that decision inside the model?”
- Brandon McKinzie (07:05): “The ideal situation is it’s intuitive: you should have to wait as long as it takes for the model to … give you a correct answer … [raising] the steerability question.”

5. Application Domains for Reasoning Models

Research & Deep Analysis: O3 is already showing value as a research analyst, conducting web research, synthesis, analysis, and reporting.
- Eric Mitchell (11:46): “Browsing is one of the most natural places … anything that requires up-to-date information.”
Coding: Huge productivity leverage for both personal and collaborative software engineering tasks.
- Brandon McKinzie (14:15): “I think our models are getting a lot better very quickly at being actually useful … reaching some kind of inflection point where they are useful enough to want to reach out to and use multiple times a day.”
Next Frontiers: Expectations for future breakthroughs include general computer use (e.g., acting as a smart desktop assistant) and collaborative, multi-agent tasks.

6. Real-World Constraints and Multi-Agent Learning

Physical Real-Time Constraints: Some tasks (e.g., robotics, real-time interactions) challenge AI because “thinking” too long isn’t an option.
- Eric Mitchell (23:21): “In the real world you need to live on [its] frame rate … the ball is coming at you now and you have to catch it.”
Generalization over Standalone Models: Debate on whether specialized (e.g., robotic) foundation models will persist, or if broad general models will subsume those tasks as well.
- Brandon McKinzie (22:31): “I don’t see any reason why we couldn’t have these be the same model.”
Social & Multi-Agent Skills: Training models to interact with one another could be a gateway to more natural collaboration with humans.
- Brandon McKinzie (26:16): “Maybe a not too bad starting point is making it good with collaborating with other models.”

7. Evaluation, Data, and Model Robustness

Eval Scarcity: As models improve, existing evaluation benchmarks become obsolete—finding uncontaminated, high-quality test data becomes critical.
- Eric Mitchell (32:49): “Uncontaminated evals, always super valuable. … evaluating the capabilities of a general capable agent is really hard to do in a rigorous way.”
Data Quality Desire: The hypothetical “magic wand” for training would be large, richly annotated, multi-step data (e.g., coding tasks spanning weeks of collaboration).
- Brandon McKinzie (33:37): “A data set that’s a bunch of multi-turn user interactions in some code base … that would be awesome to have.”
Distributional Nature of Output: There isn’t always a single deterministically best answer; users should develop intuition for the “range” of model behaviors.
- Eric Mitchell (34:47): “Send the same prompt many, many, many times … and get an intuition for the distribution of responses you can get.”

8. User Experience and Future Features

Experimentation and Feature Desires: Power users want features like running the same prompt hundreds of times, then letting the model rank or synthesize outputs.
- Podcast Host (35:51): “I want to run the prompt automatically like 100 times … and then I want the model to rank them and give me the top one or two.”
- Brandon McKinzie (36:30): “Well, it’s expensive, but … it’s a great suggestion.”
Using AI as a Work Queue: Brandon habitually throws difficult tasks at O3, sometimes getting surprisingly good outcomes.
- Brandon McKinzie (36:51): “I use our model as almost like a background queue of work … sometimes those will stick and sometimes they won’t.”

9. Engineering & Research Challenges

Infrastructure Complexity: Async RL with tool use at scale brings major engineering challenges, especially in managing failures gracefully.
- Brandon McKinzie (37:57): “If your Python tool goes down in the middle … what do you do? … There’s been a lot of learnings … how you deal with like huge infrastructure.”
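
The episode does not describe how OpenAI handles a tool going down mid-run. As a generic illustration only, one common pattern is to retry the call with backoff and, if it still fails, return a structured error so the caller can decide whether to discard or resume the episode. The names `ToolUnavailable` and `call_tool_with_retry` are hypothetical.

```python
import time


class ToolUnavailable(Exception):
    """Raised by a tool when it cannot serve a request (hypothetical)."""


def call_tool_with_retry(tool, args, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call tool(args), retrying on ToolUnavailable with exponential backoff.

    Returns a dict describing the outcome instead of raising, so one failed
    tool call does not crash the surrounding loop.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": tool(args), "attempts": attempt}
        except ToolUnavailable as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Wait base_delay, then 2x, 4x, ... before the next try.
                sleep(base_delay * 2 ** (attempt - 1))
    return {"ok": False, "error": last_error, "attempts": max_attempts}


# Usage with a made-up tool that fails twice before succeeding.
_calls = {"n": 0}


def flaky_tool(args):
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ToolUnavailable("tool is restarting")
    return args * 2


outcome = call_tool_with_retry(flaky_tool, 21, sleep=lambda seconds: None)
print(outcome)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; a real system would also need to decide what the training loop does with an episode whose tool call ultimately failed.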

Notable Quotes & Moments

On O3’s Deliberation Power
- Eric Mitchell (00:41): “It gives you a higher level interface to doing some of these more complicated tasks.”
Productivity Impact
- Brandon McKinzie (14:15): “Our models are getting a lot better very quickly … useful enough to want to reach out to and use, like, multiple times a day.”
Evaluation Difficulty
- Eric Mitchell (32:49): “Evaluating the capabilities of a general capable agent is really hard to do in a rigorous way.”
Collaboration & Multi-Agent Learning
- Brandon McKinzie (26:16): “O3 in some sense is already kind of simulating what it’d be like for a single person to do something … no reason you can’t scale this up so models are trained to be really good at cooperating with each other.”
Real-World Constraints
- Eric Mitchell (23:21): “Gravity’s not going to wait for you.”
On Model Output Variability
- Eric Mitchell (34:47): “There is a distribution of behavior and I think people often don’t appreciate that.”

Timestamps for Important Segments

[00:41] - O3 Overview & Key Differences from Prior Models
[03:19] - Reinforcement Learning and Deliberative Reasoning
[09:10] - Why Tool Use Supercharges Test-Time Scaling
[14:15] - Biggest Application Areas: Coding and Research
[23:21] - Real-World Constraints and Robotics Discussion
[32:49] - Data, Evaluations and the Challenge of Measuring AI
[34:47] - Understanding the Distribution of Model Responses
[37:57] - Engineering Challenges with Async RL and Tools

Conclusion

This episode provides an authoritative inside look at the next generation of AI reasoning models—detailing how O3 integrates deep deliberation, sophisticated tool use, and reinforcement learning to power more intelligent, autonomous task completion. The discussion ranges from the model’s architecture and user experience to emerging engineering and research challenges, and tackles fundamental questions about the trajectory of general AI.

Listeners interested in AI research, AI productization, or the future of intelligent software will find this episode especially valuable for both its technical depth and candid, forward-looking insights.
