Summary

The MAD Podcast with Matt Turck

Episode: OpenAI's Dan Roberts: Why AI Can Now Make Discoveries

Guest: Dan Roberts, Lead of Foundations of Reinforcement Learning Team, OpenAI
Date: June 4, 2026
Host: Matt Turck

Episode Overview

This episode explores AI’s leap from pattern-matching to genuine scientific discovery, sparked by recent breakthroughs where models like those from OpenAI, DeepMind, and Anthropic tackled famously unsolved mathematical problems (notably, Erdos problems). Host Matt Turck speaks with Dan Roberts, a leading AI researcher at OpenAI with a background in theoretical physics, about the evolving role of AI in science, the intricacies of reinforcement learning (RL), the boundaries between pre-training and RL, and the philosophical and practical implications of models that can reason, explore, and even originate scientific ideas.

Key Discussion Points & Insights

1. The Era of AI Scientific Discovery

Breakthroughs: AI models are not just assisting but sometimes autonomously solving open scientific problems, exemplified by recent successes in mathematics (Erdos problems) ([00:23]-[01:17]).
Gradual Progress: There won’t be a clear breakpoint from AI-as-assistant to AI-as-discoverer; rather, progress is gradual and cumulative ([07:26]).

Quote:
“It seems like it’s just this really nice gradual process.”
— Dan Roberts ([07:26])

2. Contrasting Approaches: OpenAI vs. DeepMind in Solving Math Problems

DeepMind: Formalizes problems in Lean (a strict formal language), enabling airtight, machine-checkable proofs ([10:34]).
OpenAI: Tackles problems in informal (natural language) settings, mirroring human mathematicians—solutions are checked post hoc ([11:56]-[12:13]).
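
To give a flavor of what "machine-checkable" means in the formal approach, here is a deliberately trivial Lean 4 statement and proof. It is an illustration only and is not one of the problems discussed in the episode; the theorem name `add_comm_example` is made up for this sketch. Once a file like this compiles, the proof checker has verified every step, so nobody has to hunt for hidden assumptions.

```lean
-- A formal statement: addition of natural numbers is commutative.
-- `Nat.add_comm` is a lemma from Lean 4's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An informal proof of the same fact would be a sentence or two of English, which a human (or another model) then has to read and check.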

Quote:
“We have language models... taught them to reason at test time. And one of the applications... is reasoning in mathematics.”
— Dan Roberts ([11:56])

3. Foundations and Importance of Reinforcement Learning (RL)

Definition and Analogy: RL is learning through interaction, feedback, and reward. Roberts gives a vivid analogy: learning to play Mario Bros. by playing vs. just watching others ([12:35]-[15:10]).
Why RL Works: Because learning is more effective when it is interactive and tailored—feedback helps models (and humans) improve ([15:15]).
The Catch: RL struggles in “sparse reward” settings—where feedback is rare or delayed (e.g., only after many sequential actions) ([16:05]).
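
The interaction loop and the dense versus sparse reward distinction above can be sketched with a toy example. This is a minimal illustration under stated assumptions, not anything shown in the episode: the corridor environment, the `run_episode` helper, the two policies, and the reward values are all hypothetical. It shows only the loop and the reward signal, not a learning algorithm.

```python
import random

def run_episode(policy, length=5, sparse=True, max_steps=50, rng=None):
    """Walk a 1-D corridor from cell 0 toward the goal at cell `length`.

    Returns the list of per-step rewards the agent observed.
    """
    rng = rng or random.Random(0)
    position, rewards = 0, []
    for _ in range(max_steps):
        action = policy(position, rng)            # agent acts: -1 (left) or +1 (right)
        new_position = max(0, position + action)  # environment responds
        reached_goal = new_position == length
        if sparse:
            # Sparse: feedback arrives only when the whole task succeeds.
            reward = 1.0 if reached_goal else 0.0
        elif reached_goal:
            reward = 1.0
        else:
            # Dense: a small signal after every step, based on progress.
            reward = 0.1 if new_position > position else -0.1
        rewards.append(reward)
        position = new_position
        if reached_goal:
            break
    return rewards

def always_right(position, rng):
    """A hand-written policy that always steps toward the goal."""
    return 1

def random_walk(position, rng):
    """A policy that ignores feedback and steps at random."""
    return rng.choice([-1, 1])

sparse_rewards = run_episode(always_right, sparse=True)
dense_rewards = run_episode(always_right, sparse=False)
print(sparse_rewards)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print(dense_rewards)   # [0.1, 0.1, 0.1, 0.1, 1.0]
```

In the sparse run, the first four steps carry no information about whether they helped, which is the credit-assignment difficulty Roberts describes. Swapping in `random_walk` makes the problem more visible, since most steps then return zero reward.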

Notable Analogy:
“Reinforcement learning would be: Your dad’s like, here, why don’t you play...You play. Maybe the first thing you do is you run, you hit the first bad guy...But then the second time, you press a button and you jump.”
— Dan Roberts ([12:35])

4. RLHF (Reinforcement Learning From Human Feedback) and Its Evolution

Early Steps: RLHF emerged as an effective way to make language models more aligned, turning unsupervised next-word predictors into instruction-following, user-friendly agents ([17:17]-[18:48]).
Efficiency Question: Despite claims (e.g., “less than one bit of useful information per 10,000 tokens”), RL remains essential for pushing models into new capabilities, especially reasoning ([27:52]).
Exploration vs. Exploitation: RL can drive true discovery when models aren’t just optimizing known techniques but exploring contrarian or unconventional paths—key for scientific breakthroughs ([22:35]).

5. RL’s New Centrality in Modern LLMs

Shift in Thinking: RL is no longer “the cherry on top of the cake”—it has become the “cake” itself, central to capabilities like reasoning and long-form problem solving ([25:15]).
Why Now?: Advanced pre-trained models provide a strong foundation; adding RL unlocks their ability to “think” at test time ([25:46]-[27:29]).

Quote:
“RL is really exciting. That’s what I’m here talking about. And I think that when you have a lot of compute, you want to turn that compute into intelligence in a way that’s useful. And RL is one way of doing it.”
— Dan Roberts ([25:15])

6. The Interplay Between Pre-Training, RL, and Language

Pre-training Limits: Pure scaling of pre-training isn’t enough; the real advances come from combining pre-training (on vast language data) with RL-driven reasoning ([31:51]).
Why Language Matters: Models grounded in language absorb the sum total of human knowledge, making language the ideal prior for intelligence ([29:32]).

Quote:
“This whole idea of reinforcement learning...the sort of grounding that was needed...is through language...All of our scientific knowledge, all of our mathematical knowledge, all of the humanity...is represented...in language...having the model have a prior of language...clearly the right thing to do.”
— Dan Roberts ([29:32])

7. Test-Time Reasoning & Chain-of-Thought

Test-Time Compute: During inference, models generate “chains of thought”—token-by-token reasoning. The promising insight is that “thinking in language” extends a model’s effective compute and problem-solving ([33:03]-[34:50]).
RL’s Role in Thinking: RL imparts the ability to reason during test time, leading to more deliberate and creative solutions ([35:10]).

8. Generalization, Verification, and Future Domains

Verifiable Rewards: RL excels in domains with clear metrics (e.g., math, code), but many real-world tasks (law, consulting) lack such objective criteria ([36:04]).
Generalization Challenge: The ambition is to create generally intelligent models—training across distributions and ensuring robustness to new domains ([37:23]).

9. Lessons from Physics for AI

Scaling and Emergence: Roberts argues we should study scale-up phenomena by making large systems comprehensible (“big to small”), much like physicists create simplified but sufficient models of nature ([38:32]-[41:57]).
Toward “AI Thermodynamics”: Early scaling law research (Kaplan, McCandlish) is likened to thermodynamics—a compact theory predicting behavior from key parameters, even amid immense internal complexity ([42:08]).
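
For reference, the two kinds of compact law being compared can be written side by side. Neither equation is quoted in the episode; the second is the single-variable power-law form reported in early scaling-law work, where $N$ is the parameter count and $N_c$ and $\alpha_N$ are empirically fitted constants.

```latex
% Ideal gas law: pressure P, volume V, amount of gas n,
% gas constant R, temperature T
PV = nRT

% Power-law scaling of test loss L with parameter count N
% (N_c and \alpha_N are fitted constants)
L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N}
```

In both cases a handful of macroscopic quantities predicts behavior without tracking every molecule or every weight.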

10. How Far to Einstein-Level AI?

Foresight and Limits: Predicting when AI achieves true originality is hard; progress likely to be smooth, not abrupt. The field moves quickly: what seems uniquely human one year can soon be in AI’s grasp ([43:08]-[46:02]).
Signs of Genuine Discovery: AI’s contrarianism and exhaustive exploration—e.g., finding proofs by connecting disparate fields—are already hallmarks of original science ([45:22]).
Automating AI Research: Increasingly, AI assists (and even automates) parts of its own advancement, but humans still provide (for now) crucial research taste and framing ([46:02]).

Notable Moments & Quotes (with Timestamps)

On How AI Refuted a Mathematical Conjecture
“One of the things that ChatGPT was able to do was assume it was false. And when you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path.” — Dan Roberts ([09:05])
On RL's Place in Modern LLMs
“RL is really exciting. That’s what I’m here talking about. And I think that when you have a lot of compute, you want to turn that compute into intelligence in a way that’s useful. And RL is one way of doing it.” — Dan Roberts ([25:15])
On Why Language Is the Cornerstone of AI's Progress
“All of our scientific knowledge, all of our mathematical knowledge...is represented on the Internet in language. And so having the model have a prior of language and being able to think in language and then train on top of that—that seems like clearly the right thing to do.” — Dan Roberts ([29:32])
On the Physics Analogy for AI Scaling
“You don’t try to retreat to a setting that’s simple enough where you can calculate...You try to retreat to the setting that’s simple enough that contains the thing you care about...the same thing is true in AI.” — Dan Roberts ([39:27])

Timestamps for Key Segments

| Time  | Segment |
|-------|---------|
| 00:23 | Introduction: Math breakthroughs; AI as scientific discoverer |
| 03:08 | Dan’s background: From physics to OpenAI |
| 07:26 | How close is AI to scientific agency? Gradual progress discussion |
| 10:34 | OpenAI vs. DeepMind: Formal vs. informal math proof approaches |
| 12:35 | What is RL? Mario Bros. analogy |
| 17:17 | What is RLHF? From human feedback to automated reward models |
| 19:02 | RL, self-play, exploit vs. explore (Pokerbot story) |
| 22:35 | Balancing exploration and exploitation in research/discovery |
| 25:15 | RL’s centrality: “the cake,” not the “cherry” |
| 29:32 | Is language the foundation? The “grand debate” analogy |
| 33:03 | Test-time compute; chain-of-thought reasoning |
| 36:04 | Verifiable vs. non-verifiable rewards; RL in creative domains |
| 38:32 | Lessons from physics for understanding AI scaling |
| 42:08 | Toward compact “thermodynamic” theories of AI scaling |
| 43:08 | How far to Einstein-level AI? The fallacy of timelines |
| 45:22 | Has AI already done original science? Unit distance problem as evidence |
| 46:02 | Will AI automate its own research? Smooth advances, not sharp transitions |

Conclusion & Takeaways

Dan Roberts frames today as a watershed: AI systems—rooted in language, trained by RL, and powered by immense compute—are crossing the threshold from analytic assistants to autonomous discoverers. With a physicist’s clarity, he argues that understanding, not just scaling, will be the key to the coming era. The episode punctures simple narratives around RL, pre-training, scaling, or even “emergence,” calling instead for a scientifically rigorous, theory-driven approach to both building and interpreting the new AI.

Final Reflection:
“We will get to really answer a lot of fundamental questions in the fields of science that we care about with the aid—or maybe the models being the driving force. And so, yeah, that’s just really thrilling.”
— Dan Roberts ([46:59])


Transcript

[00:00]
A
One of the things that ChatGPT was able to do was assume it was false. When you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path. I feel really excited that we will get to really answer a lot of fundamental questions in the fields of science that we care about with the aid or the models being the driving force. And so that's just really thrilling.
[00:24]
B
Hi, I'm Matt Turck. Welcome to the MAD Podcast. It's been yet another extraordinary last few days in AI, with OpenAI, DeepMind and Anthropic cracking some of the most famous long-unsolved questions in mathematics, known
[00:36]
C
as the Erdos problems.
[00:38]
B
A moment many view as a stunning breakthrough and yet another signal that AI is moving from doing the work we ask of it to autonomously making deep scientific discoveries. To unpack the moment and the fundamental advances in model reasoning that make it possible, I'm excited to welcome Dan Roberts, a top AI researcher at OpenAI who comes from a deep background in theoretical physics and has a particular interest in the intersection of science and AI. In this conversation, we go deep on what reinforcement learning actually is, why it's the most important paradigm in AI right now, and what's ahead for AI and science. Please enjoy my conversation with Dan Roberts.
[01:18]
C
Hey Dan, excited to do this. Thanks for taking the time.
[01:20]
A
Of course. Very happy to be here.
[01:22]
C
You are the lead of the Foundations of Reinforcement Learning team at OpenAI. So what does that mean? What does the name mean?
[01:31]
A
The larger team that we're on is called Foundations, and we think about reinforcement learning. So, very boring: Foundations of Reinforcement Learning. But the team comes from a mandate of thinking about the science of reinforcement learning. A long time ago, which in AI-speak is like six months ago, maybe a year, I guess now two years, so before we released o1 and thinking, reasoning models, we were studying this internally. And one of the advantages to being first, or at least to being first and spending a lot of resources on scaling things up, is that you can empower a group of people to not just work on making the thing work, but work on understanding how it works. And then beyond that, how do we scale? How should we think about scaling reinforcement learning versus scaling pre-training? So what do scaling laws look like? But then going beyond that, what sort of things does this kind of training teach us? What doesn't it teach us? We're very interested in, at the frontier, for exploratory scenarios, how do we either improve or better understand what reinforcement learning is doing? We have all this compute that famously we are in the process of acquiring, and we would like to turn that compute into intelligence. And to do that we need to make thinking models, and somewhere along the way we interact with that process, usually at the earlier stage for models: not the next model, but things that are like the next model or the next-next model.
[03:08]
C
Great. And quickly, what was your path to OpenAI? So how did you go from studying physics to being where you are today?
[03:16]
A
I did a PhD in theoretical physics at MIT, thinking about the intersection of quantum gravity and quantum information. I thought a lot about black holes and quantum chaos, the kind of thing where, if you throw something into a black hole, what happens to the information? Does it come out? How? If we think about black holes as computers, how fast are they? I was very interested in this fundamental question in theoretical physics, which is how do you find a quantum theory of gravity? I also got very interested in this interplay between computation and the laws of physics. You know, any computer that exists in the universe behaves according to physical law. So the sort of computations you can do are bounded by the laws of physics. And there's some sort of interesting relationship there. Black holes are pretty interesting because they sort of saturate some conjectured bounds around processing of information. And from there I did a postdoc at the Institute for Advanced Study, and at around that time (I'm pretty old now, at least for this field), so that was about 2016, the DQN Atari paper from DeepMind had happened in 2015, and then AlphaGo was in 2016. And I got very excited about the possibility that machine learning, and then deep learning, was a statistical science that lived in a similar framework to the sort of frameworks that we use to study the rest of the universe. And so there's always this question of how does everything work? This three-year-old's question of "I'm curious about everything," and if you look outward and you care enough, you end up in philosophy. Maybe if you're quantitative, you end up in physics. Very crude characterization. An AI that works is very fascinating. Or AI systems that work are fascinating because they are simple examples that do things that humans do.
And then if it lives in the same framework that we use to understand everything else, then it's sort of like you can draw the parallels between how does the universe work and how do I work, or how does intelligence work? So I got extremely interested in AI and deep learning. Then I went to FAIR, Facebook's AI research lab, around 2017 to basically try and use the tools from theoretical physics to understand deep learning. Deep learning was supposed to be this really difficult thing that you couldn't understand. And I thought maybe the tools of physics could be helpful. This actually culminated in a book that I wrote with a collaborator, Sho Yaida, who is still a collaborator of mine, now at OpenAI, working on the same thing. We wrote this book, The Principles of Deep Learning Theory, that was a culmination of these sets of ideas: can we use the ideas for understanding statistical systems like the gas in the room? We can characterize them with some simple laws of thermodynamics like the ideal gas law, and maybe we can make similar progress in understanding deep networks. So that was sort of my transition. I also had a startup along the way and spent some time at Sequoia Capital as an entrepreneur in residence. So there's some tension between am I a scientist and am I an entrepreneur? But about two years ago, after thinking about whether I wanted to start another AI company, I realized that the thing that was most exciting right now was what was happening at the frontier, that there's some amazing scientific progress happening in AI. And to really get at the questions and understand what's going on, you need to be there and you need to participate. And that meant joining a lab. So I joined OpenAI two years ago.
[07:04]
C
Great, thank you for that. Where do you think we are in the evolution of AI being increasingly able to solve difficult scientific problems? I mean, it's certainly something that we've been talking about as an industry for a while now, but it seems to be accelerating, perhaps just like everything else in AI. But where do you think we are?
[07:27]
A
I think one of the interesting things is that this process is smooth. There's no sharp point, or I don't think there will be a sharp point, where we'll say that systems went from not being able to be useful for the scientific process to being fully fledged scientists. There'll be sort of a gradual shift. If you had to point to one moment, maybe it would be the release of o1 and the sort of paradigm of test-time compute and reasoning. But I'm sure if I tried to make that claim, you could go and look at GPT-4 and see that glimpses of that sort of useful behavior for the scientific process were already present. As a general point, the models are very good at certain types of things that clearly are amenable to making progress in math. They're not open-loop, fully fledged scientists in any domain, although neither am I. It seems like it's just this really nice gradual process.
[08:22]
C
So it feels like a particularly fun week to be having this conversation, because over the last few days there were a number of different announcements in the general field of AI and mathematics around the Erdos problems. OpenAI came out first with this progress, but almost within a few hours, Google DeepMind had a claim as well on different problems. Then Anthropic had some claims. However, from what I understand, the OpenAI approach and the DeepMind approach were very different. And that may be very interesting in terms of what that means for AI as a research scientist.
[09:06]
A
Everyone assumed this conjecture was true, but could not prove it. One of the things that ChatGPT was able to do was assume it was false. And when you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path, because there are a lot of choices that you can make along the path. And if you get any of those choices wrong, if your ideas don't work, then you find out that you didn't make any progress. And so you need this really strong persistence. And then you need expertise in this other field, which is algebraic number theory, some sort of generalization of number theory to things that generalize the integers and the real numbers. If you go down that path really far, you can refute this conjecture. So that was the big result. The big result was that this conjecture of this lower bound for the number of pairs that you can make is false. Not only is it false, it was false due to a really interesting connection to another field of mathematics. And so you would have to be somebody who is aware of this problem as interesting, which means your expertise is one thing, and then be an expert in something else, and then also be super contrarian and go down this really long path, and then you would have identified the solution.
[10:25]
C
The OpenAI approach and the DeepMind approach were very different. Do you want to compare and contrast the two approaches?
[10:34]
A
One of the approaches that GDM takes is to take problems, present them in a formal language called Lean, and then use methods to search for proofs in that language. For problems to be representable, there's this process called autoformalization, where you take the English version of the problem and you translate it into rigorous formal statements, and then you conduct your proofs there. It's designed so that the proofs can be airtight. No one has to go and check for some hidden assumption or some weird thing. I guess it's usually hidden assumptions or definitions that are not airtight. But in that setting, which is a setting that DeepMind has cared a lot about, they were able to formalize some problems and use their system to prove them. So that's one approach. Another approach is to just take the problem in English, with mathematical expressions as well, but just the English statement of it, which is informal, and understand what is meant by that and solve that in informal language, presenting a proof much like the way a human mathematician would, or a human mathematician who's not using Lean. And then you have to check it. The verification problem is harder because it's not something that auto-checks.
[11:55]
C
And that second approach was OpenAI.
[11:57]
A
Most of the results that we publicize, as far as I can think, are in the informal setting. We have language models that we've taught to reason at test time. And one of the applications or benchmarks for that is reasoning in mathematics.
[12:13]
C
Okay, great. All right, so let's get into reinforcement learning. To make this broadly accessible, let's start from the top. What is the one-, two-, or three-sentence definition of reinforcement learning? And perhaps give us a simple non-technical analogy for people to understand.
[12:35]
A
Maybe a simple thing to do would be to give you two examples of how you could try to learn something, you as an individual. And maybe we can take a game, or even, say, a video game. I'm old enough that I played the original 8-bit Mario Brothers, the Super Mario Brothers. And so here are two ways you could learn how to play. One way you could learn how to play is your dad takes it out and plugs it in and he boots up the game and then he plays for a few hours and then you just watch him play. That's all you do. So he's demonstrating how to play, and then at the end of that, he's not very nice, so he doesn't let you play. But then he goes and runs outside and does something else. And you sneak into his room, you plug it in and you try to play. How good are you going to be? Well, all you've done is try to memorize what he's done. You haven't gotten to push any of the buttons yourself. You haven't gotten to interact with the game yourself. This is sometimes called expert demonstrations. And you're just trying to memorize what someone else is doing. It's a version of supervised learning, the supervision being that you just watch what he does and accept that that's the true way of doing a thing. Reinforcement learning would be: your dad's like, here, why don't you play? Maybe he shows you once, or maybe he doesn't even need to show you, because the game is beautifully designed to take you from not knowing anything to being able to play expertly, something called a curriculum. But you play. Maybe the first thing you do is you run. You hit the first bad guy, and you probably (this example is dated) lose a life. But then the second time, you press a button and you jump. And so you're taking actions, there's an environment that's giving you feedback, and there's this close connection between the environment, the actions that you can take, and the responses that you're getting. And then the final part is there's a reward.
And the reward can be something that you get pretty often. For instance, every time you do something, there's some score that goes up. Or it could be just something that you get at the end. So you play a game of chess, and at the very end you get a reward, which is you won or you lost, but in the middle, you don't really know how you're doing until the very end. So this is called sparse rewards. But I think this is the basic idea, and there are obviously lots of variants here and ways to quibble with this. But it's this notion that you interact with an environment, you get a reward, and often it's in a way where you get this sort of feedback, as opposed to just trying to learn from data that you don't get to interact with.
[15:11]
C
And why does it work? And why is RL so powerful?
[15:15]
A
It works because of this ability to get feedback from the environment. You can go and learn. If you're doing it right, you can figure out how to learn the things that you don't know. And I also think it's powerful because of this fact that it's much easier to learn when you're learning at the right level for you. So if you want to learn addition, you shouldn't read a calculus textbook. You want to learn by being able to practice and learn at the right level. If I'm actually making the choices and learning from my own choices, whether they work or not, then I'm able to place it in a better context for the set of things that I understand.
[15:58]
C
Great. And then conversely, what's the catch, and how does RL break?
[16:05]
A
The setting where it's very difficult is the setting that I alluded to before, where you don't get much feedback from the environment. You have to take many, many, many, many actions, and then you get maybe "yes, that whole set of actions was good" or "no, it was bad." For instance, you're playing a game of chess and you don't know until you make all the moves. That has an opponent, so it's maybe complicated. Maybe you are trying to do a homework problem and it's research level. Or someone gives you a well-defined problem, like we give our language models, and it's a problem that requires days and days of thinking. There are so many choices that you can make along the way. And at the end, if you don't get any feedback at all, if you're just hidden in the woods by yourself, scribbling in notebooks, it's very hard to make progress that way, because you don't have any sense of how you're doing. If you get a yes at the end or you get a no at the end, you have no sense for which of the actions that you took, which of the things you did, were good or bad.
[17:03]
C
Okay, great. Now let's talk about how RL has been applied in the context of large language models. So was the first step historically RLHF?
[17:17]
A
Yeah, I think that's probably fair, at least in a broad sense, that the first kind of RL that was done on language models was part of this post training process to turn a model that just tries to predict the next word on the Internet into either something that will follow your instructions, be nice to you, or fit the form of a chatbot.
[17:42]
C
So do you want to define for people what RLHF is and how it works? Quickly?
[17:48]
A
The basic idea is that you could collect data from humans. So RLHF is reinforcement learning from human feedback. You collect data from humans and you train a value function. In the language model setting, you would show them, say, two different completions from a language model and ask them to say which is better. These sorts of comparisons can be used to train a value function, and then you can use that as a reward for a reinforcement learning process.
[18:19]
C
Great. And you do that initially with humans, but then you build that into a reward model.
[18:25]
A
Yeah. So you would train a model for this. Because during the training process, you can't just pause your training run to ask some humans for input. That would have way too much latency. So instead you need a proxy for what a human would say. So you train this model based on the human preference data and then you can optimize against it, or at least a little bit.
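
As an illustration of the step just described (turning pairwise human comparisons into a trainable proxy), here is a minimal sketch. It assumes a Bradley-Terry style preference loss, which is a common choice but is not specified in the episode, and it uses a linear scorer over two made-up features in place of a neural network. The function names, features, and data are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear stand-in for a reward model: one score per completion."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Fit weights so that human-preferred completions score higher.

    `comparisons` is a list of (preferred_features, rejected_features).
    Per-pair loss: -log sigmoid(reward(preferred) - reward(rejected)).
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient descent on the loss: step size shrinks as the
            # model becomes confident the preferred item scores higher.
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical features per completion: [is_polite, answers_question]
comparisons = [
    ([1.0, 1.0], [0.0, 1.0]),  # polite answer beats rude answer
    ([1.0, 1.0], [1.0, 0.0]),  # polite answer beats polite non-answer
    ([0.0, 1.0], [0.0, 0.0]),  # rude answer beats rude non-answer
]
weights = train_reward_model(comparisons, dim=2)
print(reward(weights, [1.0, 1.0]) > reward(weights, [0.0, 0.0]))  # True
```

The learned `reward` function can then be queried as often as needed during training, which is the latency point made above. Because it is only a proxy for human judgment, optimizing against it too hard can go wrong, hence "at least a little bit."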
[18:48]
C
One of the famous things in the history of RL is Move 37. How do you train a model to encourage it to do that kind of thing and come up with brand-new ways, while being efficient and exploiting known paths?
[19:02]
A
Yeah, so the great thing about Go is that you can just train it. It's a zero-sum two-player game. You can train it in what's called self-play. It plays itself and it can go from playing randomly to expert play, and it will find whatever the best strategies are. So if that means exploring, great. If that means exploiting... Actually, I have a funny story about this. So I met Noam Brown in grad school. He went to a different grad school than me, but he wanted to enter MIT's Pokerbot competition. And he had a poker bot that was the best in the world, but it wasn't something that would compete against humans yet; he had just won this research competition. He collaborated with me and another friend to enter MIT's Pokerbot competition. This was great for me, actually, because I learned about some really exciting work in AI and I got very excited about this while I was doing physics. We were playing essentially this kind of self-play equilibrium strategy. There are some nuances, but essentially we could not lose, assuming we did not have any bugs in our code. The way this thing worked was that it was a tournament where you would be paired with, say, another person and play them. And depending on the amount of points you got in some sort of round-robin setup, they would eliminate the bottom half, and they would keep going until you got to the final table, which would just be, say, you versus the other person. And so, the scores: there was the award ceremony and we didn't know what had happened, but they showed what everyone's scores over time looked like. And there were, say, 64, I think there were actually 32 people playing. So it was like a round-of-32 kind of tournament, and for 30 people, over time, their scores were all very negative and going down. And then there was one person whose score was pretty much straight up. And then there was another that was pretty good, but not with a crazy slope. Do you want to guess which one we were?
So we were the lower slope. And then there was this other guy that had this crazy slope and was just completely crushing all the other players. And then this happened for the round of 16, the round of eight, the round of four, and then in the round of two, it's heads up, us versus this guy who over the course of this tournament has won way more than us overall, taken more money from everyone else. And then we crushed him. Why? Because he was exploiting the weaknesses of everybody else. It had some theory of mind to try to figure out, oh, this guy does this when he bluffs. So I assume it was very good at exploiting everyone else, but we were just playing the best possible thing that you could do, given that the criterion was not to maximize the amount that you get from anyone else; it was don't lose. So it's the best response to anyone's strategy. And so at the end we had to win, assuming we did it right, and someone else playing the same strategy would tie.
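
The trade-off in this story, between a strategy that extracts the most from weak opponents and one that simply cannot be beaten, can be sketched with rock-paper-scissors. This is a toy stand-in chosen for brevity, not the poker setup from the competition, and all names and numbers below are hypothetical. One caveat: in rock-paper-scissors the equilibrium strategy only breaks even in expectation, whereas in poker an equilibrium strategy typically also profits from an opponent's mistakes.

```python
# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

def expected_payoff(mine, theirs):
    """Expected payoff of mixed strategy `mine` against `theirs`."""
    return sum(
        mine[i] * theirs[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )

def worst_case(mine):
    """Payoff when the opponent picks the best pure counter-strategy."""
    pure = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    return min(expected_payoff(mine, p) for p in pure)

equilibrium = [1 / 3, 1 / 3, 1 / 3]  # uniform mix: cannot lose in expectation
exploiter = [0.0, 1.0, 0.0]          # always paper: tuned to beat rock-heavy players
rock_heavy = [0.8, 0.1, 0.1]         # a weak opponent with an obvious tendency

# Against the weak opponent, the exploiter earns far more...
print(expected_payoff(exploiter, rock_heavy))
print(expected_payoff(equilibrium, rock_heavy))
# ...but its worst case is a guaranteed loss, while the equilibrium's is not.
print(worst_case(exploiter))    # -1.0
print(worst_case(equilibrium))
```

The exploiter climbs fastest early in a tournament full of weak players, but it is itself exploitable, which is what the final heads-up match exposed.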
[22:16]
C
Okay, fascinating. So just tying this back to the beginning of the conversation about the Erdos problems and solving unsolved math problems: presumably the instinct would be that you need a lot of exploration, not exploitation. So how does that work in the context of novel scientific discovery?
[22:35]
A
I think math research, or scientific research in general, has a lot of versions of both explore and exploit. To give the recent example, the OpenAI unit distance proof, I think, is very much in the explore setting, where the model was happy to be contrarian and try to disprove this thing that everyone believed. It was just looking; it has this huge repository of understanding of all of human math. And so it was spending a very long amount of time, I forget how many hours, but I think we published a rewritten version of this chain of thought, hours and hours trying different things. So it's clearly in the domain of exploration. A lot of times, though, you can ask these models to compute something that they understand very well, and then that has a different structure and might look a lot like exploit. There's a paper that came out recently, after the OpenAI result, on an unrelated Erdos problem that has something to do with, if you have a set and you try to add the set to itself, or you try to multiply the set with itself, so take the elements and add them all together, or take the elements individually and multiply them together, how many unique sums or products do you get? There's some conjecture around that, and this one was also disproved, and that was done by humans. And the core idea was, it's a totally different problem, but there was inspiration from the unit distance one: the idea that you can sort of generalize, pick a certain type of numbers that had a certain property that the OpenAI model figured out, and they realized that this applies in this setting. So that's very much an exploit thing. And so I think the process clearly is like the actual discovery process. I think normally when you talk about explore-exploit, maybe we're talking about, when training reinforcement learning models, how should we train them?
But I think there's this interesting point that in the scientific discovery process there's really this interplay between exploration and then exploitation in order to totally push the field forward.
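To make the "add the set to itself, multiply the set with itself" description concrete, here is a minimal Python sketch that counts distinct pairwise sums and products. The function names and the two example sets are illustrative assumptions, not taken from the paper or the conjecture mentioned above.

```python
from itertools import combinations_with_replacement


def sum_set(a):
    """Distinct values x + y with x and y both drawn from a."""
    return {x + y for x, y in combinations_with_replacement(a, 2)}


def product_set(a):
    """Distinct values x * y with x and y both drawn from a."""
    return {x * y for x, y in combinations_with_replacement(a, 2)}


# Hypothetical example sets, chosen only to show the contrast.
arithmetic = list(range(1, 11))          # 1, 2, ..., 10
geometric = [2 ** k for k in range(10)]  # 1, 2, 4, ..., 512

for name, a in [("arithmetic", arithmetic), ("geometric", geometric)]:
    print(name, "sums:", len(sum_set(a)), "products:", len(product_set(a)))
```

In this toy case the arithmetic progression has few distinct sums but many distinct products, and the geometric progression is the reverse, which is the kind of tension that questions about sums and products of a set are concerned with.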
[24:50]
C
Switching to RL in modern LLM systems. So there used to be a saying, which I think comes from Yann LeCun, that RL was the cherry on top of the cake. But I think you have argued that things have switched now and that RL is the main part, the cake. Do you want to just walk us through what you were thinking?
[25:16]
A
Yeah, I said that about a year and a half ago. Had to give a talk that was public and I couldn't say much. So I decided to invert this meme with this cake and the cherry. RL is really exciting. That's what I'm here talking about. And I think that when you have a lot of compute, you want to turn that compute into intelligence in a way that's useful. And RL is one way of doing it. And we just started doing it then and we're going to do a lot more of it now.
[25:47]
C
Why did RL start working? Well, it's not an entirely new concept. It's been tried for many years now. What is different now?
[25:58]
A
Yeah, I'm not sure, to be honest, what it actually means when people say it wasn't working. There was this 2016, 2017, maybe even into 2018 period, before the transformer period, where DeepMind was all in on RL, and OpenAI had Dota and the Rubik's Cube and some other exciting results as well. A lot of people were all in on RL, and then there were language models, and the obvious thing to do was scale up the thing that worked, which was pre-training. And I don't know what people tried for RL. As you pointed out, RLHF was a central thing that came pretty quickly. Originally it was developed in the context of game environments, trying to prevent reward hacking. I think the original paper was about using human feedback to control a character, or something like that. But there's an interesting thing to point out here, which is that there's this question of how you get models to think at test time and reason. And there was a reasoning effort at OpenAI that was quite early and spent some time and came up with some algorithms. I think maybe the simple thing to say is that if you have a powerful enough pre-trained model, then it can start to do well at RL. It can start to think, to use test-time compute to, for instance, solve math problems that it wouldn't otherwise be able to solve.
[27:30]
C
There was a viral analysis from earlier this year, February, I think, that claims that RL produces less than one bit of useful information per 10,000 tokens. And then Karpathy called it sucking supervision through a straw. What is your take on this and the overall efficiency of RL?
[27:53]
A
If you look at the DeepSeek algorithm, which is a public thing that we can talk about, then you train on sequences that are correct. So whether it's correct or not is maybe one bit of information. So I think you can see where that logic comes from. I think the question is, is this doing a kind of thing that you can't otherwise do? Maybe you would want to give more supervision, but how are you going to do that? I think it's very clear that these methods have led to a bunch of breakthroughs in terms of the explosion of what the models can do, both in coding and in science. I think broadly it's about getting models to think at test time, to use test-time compute and do reasoning. And clearly a lot of the pieces of the RL process are essential to making that work.
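The "one bit" logic can be sketched numerically. A pass/fail outcome carries at most the binary entropy of the success probability, and that signal is spread across every token in the rollout. The function names and the numbers below are illustrative assumptions; this is not the calculation from the analysis the host refers to.

```python
import math


def outcome_entropy_bits(p_correct):
    """Shannon entropy, in bits, of a pass/fail outcome with success rate p_correct."""
    if p_correct in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    q = 1.0 - p_correct
    return -(p_correct * math.log2(p_correct) + q * math.log2(q))


def bits_per_token(p_correct, tokens_per_rollout):
    """Upper bound on outcome information per generated token."""
    return outcome_entropy_bits(p_correct) / tokens_per_rollout


# Hypothetical rollout: 10,000 tokens, and the model is right half the time.
print(bits_per_token(0.5, 10_000))
# A problem the model almost always solves tells it even less per rollout.
print(outcome_entropy_bits(0.99))
```

The bound is largest when the model succeeds about half the time, which is one way to read the argument: the signal is thin, but as Roberts says, the open question is what denser supervision would even look like.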
[28:48]
C
What's your overall feeling in terms of how far we can go with that current sort of systems model where we have pre-training and then we have RL on top? Somewhat famously, last year there was a conversation with Rich Sutton on the Dwarkesh podcast where his claim, in my best attempt to paraphrase it, was that LLMs were not really intelligent and therefore RL was the only way to do it. And pure RL, not LLM plus RL. What is your take on this? I mean, obviously you're on an RL team at a company that does both pre-training and RL combined. So what's your take?
[29:33]
A
Let me tell another story. Before I did my PhD, I spent two years in the UK, and I was at Oxford for one of those years. I was at a pub, as one does, with two of my close friends; one was a cognitive scientist and one was a linguist. And so we had the sort of argument that you do in those situations when you're that age. I said something like, physics is the most fundamental of all the sciences, because it explains how the world works and everything is in the world. I said this earlier: my computer exists in the world, I exist in the world. We all follow the laws of physics. And then the cognitive scientist said something like, yes, but then you have to process it. So there's all sorts of cognitive biases about that, and the way you collect data and learn something, something. But then the linguist was like, blah, blah, blah, Wittgenstein, everything goes through language. That's the method of communication. The way words mean things is the central thing. And when we want to talk about the laws of physics, we have to use language. And I sort of feel like he and Wittgenstein, like, that was correct, right? Or at least the path through AI suggests that that is a correct path. I'm conceding now to Kyle, and if he's listening, he's now a linguistics professor. This whole idea of reinforcement learning that kicked off the previous decade's interest in AI, the sort of grounding that I think was needed to make things really work is through language. Because everything goes through language. All the Internet, it incorporates the grounding of the real world. All of our scientific knowledge, all of our mathematical knowledge, all of the humanities, the sum total basically of human work is represented on the Internet in language.
And so having the model have a prior of language, being able to think in language, and then training on top of that, that seems like clearly the right thing to do, and it also seems well grounded in a way that, even before all this, somebody might have argued would make sense. It's an amazing prior to start with for an intelligence, because it's very much based on us and our society. I have other disagreements with Rich Sutton, but if you want to poke at it,
[31:48]
C
Yes, just give us one or two quick ones.
[31:51]
A
I have a somewhat contrarian take on the bitter lesson: it's not that scale is all you need. You also need good ideas to guide the scaling. So there's a deeper interplay than just scaling things up. For instance, if you were just trying to scale pre-training, you wouldn't get anywhere near as far as also trying to scale RL on top of pre-training, which is what we do now. And our models are much more powerful for that very good idea and for investing in that good idea.
[32:18]
C
And that the good ideas come from humans.
[32:20]
A
Well, maybe they'll come from AI in the future, but before we had AI, they came from humans. Scaling was also a good idea that came from humans. But there's this interplay: you elicit new phenomena at scale, you try to understand them at that scale, and then that points you in new directions, and then you develop new ideas and try to apply scale to those ideas. So I think it's not just scale, scale, scale.
[32:41]
C
Since you mentioned test-time compute, I think that's something that still puzzles people, which is the whole chain-of-thought thing, which is so magical from a user perspective. Beyond whatever you can see, what actually happens during test-time compute that creates those artifacts? What does the model actually do?
[33:03]
A
I think it does what you see it do. We lightly rewrite it or summarize it, but it just produces tokens. And those tokens are like a running thought process, just like you might have. Or maybe it's more akin to, if you're solving a math problem, the scratch pad, the collection of notes that you have, but it just keeps generating. The cool thing about generating is that each token is a forward pass of the model, so we're using a bunch of computation. It's a way of leveraging a lot more computation on a problem than you would before. My colleague Noam Brown likes to talk about the Riemann Hypothesis a lot. Wouldn't you want to have a model that runs for years and can resolve that, prove that? If you present it with the problem and you want it to produce an answer, then it only has the number of FLOPs in a single forward pass to produce one token if it's forced to answer right away. But if it gets to answer after a long time, it can reuse its weights and produce a final answer that is a function of a much larger amount of computation. And the natural way it thinks is in language. It's a language model. So that's sort of the key insight: you can cause it to do better just by producing a thought process in token space, in language. And this was known before RL, the idea that if you gave a model examples of thinking things out, it would do this before it produced a final answer. Or if you just told it to, then it would do this sort of thing. Going back to this SFT, supervised learning versus reinforcement learning analogy that I gave earlier, there's a lot of examples on the Internet of people thinking for a long time. And so it's not completely useless. It can channel that a bit, but RL really brings that out.
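The compute argument above can be put into rough numbers. The sketch below assumes the common rule of thumb that one forward pass costs about 2 FLOPs per parameter per generated token (ignoring the cost of attention over the context); the parameter count and token budgets are hypothetical, not figures from the episode.

```python
def forward_flops_per_token(n_params):
    """Rule-of-thumb cost of one forward pass: about 2 FLOPs per parameter."""
    return 2 * n_params


def answer_flops(n_params, thinking_tokens):
    """Total forward-pass compute behind a final answer token.

    Every thinking token is one more forward pass that reuses the same
    weights; the +1 accounts for the answer token itself.
    """
    return forward_flops_per_token(n_params) * (thinking_tokens + 1)


N_PARAMS = 10**9  # hypothetical model size

immediate = answer_flops(N_PARAMS, thinking_tokens=0)
deliberate = answer_flops(N_PARAMS, thinking_tokens=9_999)

print(f"answer right away: {immediate:.2e} FLOPs")
print(f"answer after thinking: {deliberate:.2e} FLOPs")
print(f"ratio: {deliberate // immediate}x")
```

Under these assumptions the compute behind the answer grows linearly with the number of tokens generated before it, which is the sense in which the final answer becomes "a function of a much larger amount of computation."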
[34:51]
C
Is what happens during test-time compute related to RL, or created by it? Because that's effectively what you described earlier when you were defining RL. The model goes in one direction, decides maybe that's not a fruitful one, backtracks, tries something else. Is that correct or not?
[35:11]
A
I think maybe the result of the RL process is that the model can then think at test time. And that's why we have these dials; various companies and labs have reasoning effort dials. Right, so you've now created a model that will produce a bunch of tokens before it outputs a final answer. And causing that to be good is what RL is doing, or one of the things RL is doing. And so the output of doing RL training is a model that thinks.
[35:40]
C
One of the key questions in the field is whether you can expand and generalize the success that LLM systems have had, particularly in coding and now math. Those are domains where you can sort of verify whether what the model comes up with is correct or not. What is your view on that? And perhaps start by explaining what a verifiable reward is.
[36:04]
A
So a verifiable reward is, in principle, a reward that can't be hacked. If it's a math problem and the answer is an integer, you just string match the integer and then you verify that it solved the problem correctly. That abstraction has all sorts of problems with it. But take a problem that can't be verified: is this a good piece of creative writing? There's not something you can string match against. That involves questions of taste, and maybe different people answer differently. So maybe it's a distributional kind of thing. And so there's clearly a big gap between those two things.
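A minimal sketch of the integer string-match reward described above. The function names and the "take the last integer in the response" extraction rule are assumptions made for illustration; this is not a description of any lab's actual grader.

```python
import re


def extract_final_integer(response):
    """Return the last integer that appears in the response text, or None."""
    matches = re.findall(r"-?\d+", response)
    return matches[-1] if matches else None


def verifiable_reward(response, expected):
    """1.0 if the extracted integer equals the known answer, else 0.0."""
    answer = extract_final_integer(response)
    if answer is None:
        return 0.0
    return 1.0 if int(answer) == expected else 0.0


print(verifiable_reward("After simplifying, the answer is 42.", 42))
print(verifiable_reward("The answer is 41.", 42))
# A hedging response still earns full reward under this naive matcher,
# one example of the abstraction having problems.
print(verifiable_reward("It is either 41 or 42", 42))
```

Nothing comparable exists for the creative writing case: there is no single reference string to match, only a distribution of human judgments.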
[36:48]
C
So do you think there is a path for RL to be truly effective in domains without verifiable rewards? So consulting, banking, legal. I mean, clearly there's tremendous progress in those domains, but what is happening?
[37:01]
A
I definitely think OpenAI will have amazing products that will be relevant in those domains and some amount of RL will play a role in there.
[37:09]
C
Does RL generalize, meaning that as you train it against more and more domains, it becomes disproportionately good at learning the next domain?
[37:24]
A
I mean, we want to make a model that is generally intelligent and push that intelligence as far as possible. And to do that, we want to make everything part of the distribution. And then we also want to make it robust in cases where it encounters things that were not in the distribution. And if RL is part of that process, then... But I think there's a vague sense, as I was trying to say earlier, that a lot of things are very fuzzy. But clearly the question of generalization in AI is an important, central one. And there's a bunch of examples, I think, that support that the processes can do this.
[38:00]
C
So going back to your physics roots, a lot of what we just described about this interplay between pre-training and RL and all the various bits that we described, those are clearly pretty complex systems. And you were trained in a discipline that is all about studying complex systems. What can physics teach us about how to understand those AI systems that we're currently building?
[38:33]
A
I think there's a lot of angles to answering that question. I think maybe the most interesting one, or the most relevant one to how we work currently, and maybe this is a contrarian take, is that the way to think about scaling, and say scaling laws, is not small to big, but big to small. And I'll get to why physics really matters for this in a second. You have some really big AI system, and some weird things happen that didn't happen at the small scale. And so we say, oh, whatever, it emerged at scale. Sometimes people use the word grokking. There's something discontinuous about the scaling sequence, or the scaling law is broken. These are things that people might say, but I think I reject that entirely. I think it means that you didn't understand something about what you were scaling up. Maybe even going back to the reasoning thing. I don't know if this is true. This is a cartoon. I wasn't at OpenAI at the time. But if you imagine trying to get small models to reason, GPT-1, GPT-2, GPT-3 and then GPT-4, then, again as a cartoon, you might say, oh, this emerged at scale and it doesn't happen for the small models. I reject that. Instead, there's some phenomenon that's really exciting that we discovered, like reasoning, or maybe something bad, like your model blew up and your earlier models didn't blow up. And your job is then to figure out how to restore smoothness to the scaling sequence: go back and make smaller and simpler models, or simpler toy examples, such that the whole thing is smooth. And if you can do that, if you can figure out what to put into the small thing, then you understand the thing and then you can move forward. This is exactly what we do in theoretical physics. There's the Standard Model. I have a textbook behind me. The description of all the forces except gravity would take, even in compact notation, the entire page. It's completely gross. A lot of different particles. Why? Who knows?
Some of them there's reasons for, but they're doing all sorts of different things. Different things cancel whatever. Or this just happens to be the universe that we live in. But you don't need all of that to study pieces of it, to study electromagnetism, you forget about everything else. Or if you want to study the Higgs phenomena, which gives mass to some particles, you can study a simplified version of that. And so what we do, and I think one of the key moves, at least in my training in physics, is to take really complicated systems. This often gets talked about as physicists just study spherical cows. And I think that kind of misses the point. If the spherical cow is sufficient to describe the thing that you care about, then you did a good job, and if not, you did a bad job. You don't try to retreat to a setting that's simple enough where you can calculate something. You try to retreat to the setting that's simple enough that contains the thing that you care about, and then you have no idea whether you can make progress there or not. But once you did, you sort of understand what the problem is. And that's a lot of the work in physics. And the same thing is true in AI. You have these crazy huge systems that have all sorts of interesting phenomena, and if you think about it the right way, they don't grok. There's just this nice continuity.
[41:58]
C
Do you think there could be an equivalent in AI to thermodynamics, meaning a compact theory that predicts behavior without tracking every individual bit?
[42:08]
A
Yeah. The Kaplan, McCandlish OpenAI scaling laws work is originally a version of this, where all you know about the network is how many parameters it has and how much data you've trained it on, and you can predict the final loss. I think the missing piece is going from all the individual weights and biases: how does that add up to the scaling law? I have some very initial work, and there's some other initial work, trying to bridge that connection. But I think that's the missing piece, the statistical mechanics to thermodynamics of how these things emerge. But there's definitely a lot of useful, effective descriptions of how these systems behave. I think the other part of your question is whether that is enough to characterize everything that we care about. There's a lot that we care about other than just the final loss. And so there's more thermodynamics to be worked out, in addition to how the thermodynamics arises from the microscopic description.
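The "thermodynamic" flavor of a scaling law can be illustrated with a toy fit. The sketch below assumes a pure power law in parameter count only, loss = a * N^(-alpha), which is a simplification of the parameters-and-data form mentioned above. The constants and model sizes are hypothetical synthetic values, not published results.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


def predict_loss(a, alpha, size):
    """Loss predicted by the fitted power law at a given parameter count."""
    return a * size ** (-alpha)


# Hypothetical "small model" measurements generated from a known law.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
small_sizes = [1e6, 1e7, 1e8]
small_losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
extrapolated = predict_loss(a, alpha, 1e10)
print(f"fitted alpha: {alpha:.4f}, predicted loss at 1e10 params: {extrapolated:.4f}")
```

The fit knows nothing about individual weights or biases, only an aggregate size, which is the sense in which it resembles a thermodynamic description rather than a microscopic one.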
[43:09]
C
So at that conference a year ago, you jokingly predicted nine years to Einstein level AI. Where do you think, all jokes aside, we are on that spectrum of just AI creating scientific discovery. I mean, that's where we started the conversation and curious about where this is going.
[43:27]
A
The joke. Maybe it's helpful to deconstruct a joke, as it always is. But the joke was taking the doubling time for the amount of work a system can do autonomously and figuring out how long it would take us to get to a system that can think for eight years on its own, because Einstein spent eight years discovering general relativity. I projected that out and it was, like, nine years from last year. I hate making predictions, but I'm pretty sure something will break before that. In general, we're not just going to set up a system and let it think autonomously for eight years, if anything because the systems eight years later will be so much more powerful that it probably doesn't make sense to let a system think beyond a certain amount of time. There's an amount of time it takes for the system to improve, and then there's the amount of time it's thinking. And probably when those cross, all these scaling laws are going to break in certain ways. I do think that the kind of thing I was trying to talk about, how we as physicists approach problems, the structure and flavor of that is maybe different than here's a very well-defined thing, go and do a calculation, which is what these Erdos problems are. I think probably we'll need to have some ideas to bridge from one to the other. It's not obvious whether it has to be a discontinuous thing or a smooth thing, but there's part of the scientific process, I think, that the models haven't been imbued with yet. And I'm sure people are thinking about how to do that, trying to get to what is the right question, as opposed to here's a well-defined thing, go calculate. And some of that involves research taste. That's not an easily verifiable thing.
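The shape of that extrapolation can be sketched as follows. Roberts does not state the starting horizon or the doubling time he used, so both inputs below are hypothetical placeholders, and the function name is illustrative.

```python
import math


def years_until_horizon(current_hours, target_hours, doubling_time_years):
    """Years needed for an autonomous-work horizon to grow from
    current_hours to target_hours if it doubles every doubling_time_years."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_time_years


EIGHT_YEARS_IN_HOURS = 8 * 365 * 24

# Hypothetical inputs: a 1-hour horizon today, doubling every half year.
estimate = years_until_horizon(
    current_hours=1.0,
    target_hours=EIGHT_YEARS_IN_HOURS,
    doubling_time_years=0.5,
)
print(f"{estimate:.1f} years under these assumed inputs")
```

Because the answer depends on the logarithm of the gap, even a very large target horizon is only a modest number of doublings away, which is what makes the projection land within a decade and also why Roberts expects something in the trend to break first.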
[45:16]
C
Is that what would convince you that AI is doing genuine original science?
[45:22]
A
No, I'm convinced. I think the unit distance problem is a great example. And also just being able to take a position that is contrary and think for an extremely long amount of time, explore lots of different options, and bring to bear the full weight of disparate fields, where it's very unlikely to find a human that has the exact set of skills to solve some of these problems. That's a huge thing.
[45:51]
C
How far do you think we are from AI research actually automating itself? Not just AI researchers using AI, but AI autonomously building AI?
[46:03]
A
Yeah, I think it's, again, one of these smooth things where it's already doing pieces of it now. It'll do more in the future. And I know there's strong versions of this that people like to think about, but I'm not sure that we'll see a really sharp phase transition versus just more and more pieces. Right now, a lot of coding that would take people weeks can be done very efficiently with models. So, some of these math discovery problems; there's also versions of this for engineering, where the models are playing a more central role. And so I think there will just be more of that. I think that there's a kind of scientific thinking that humans still seem to be very useful for doing. And I don't want to make specific predictions about when or how. I can imagine you don't want to be caught on record saying the models won't be good at something, because you'll definitely be wrong. Or maybe I should say that, and then the models will be good at that immediately. And so I should pick the things that I want the models to do and say that they'll never do that. I think it's also just hard to make predictions, because when people made predictions before, the actual way things shook out was often not in that direction. And so it's another sort of credit assignment thing. If you have this long chain of things that has to happen for whatever to happen, then anything that breaks that chain means your prediction is just way off. But I can make a very long-distance prediction: for the next six months, I think we'll see more of these sorts of math and science breakthroughs, and obviously we'll turn this sort of thing on AI itself and the models will get a lot more powerful. And that'll be fun. You could think about that. You could do science of AI and have it feel like doing physics. And that's true.
Another really exciting thing is that I entered physics thinking, when you first start learning a field and maybe you want to commit to it, at least the perspective I had is that, oh, by the time I get to the end, I'll know all the answers to all the fundamental questions. Obviously this is a journey, and at the end of the journey it'll resolve. And then, I don't know, maybe it was in grad school or maybe when I switched to AI, I realized, oh, some of these questions will stay open maybe forever. Maybe I'll never get to learn the answers. I was watching older colleagues as well start to retire and realize that they might not get to learn the answers. But I feel really excited that we will get to really answer a lot of fundamental questions in the fields of science that we care about, with the aid of the models, or maybe with the models being the driving force. And so, yeah, that's just really thrilling.
[48:36]
C
Well, that feels like a wonderful place to leave it. Dan, you gave us plenty to ponder. Really appreciate you spending time with us today. Thank you.
[48:43]
A
Thanks for inviting me. It was a pleasure.
[48:46]
C
Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from. This really helps us build the podcast and get great guests. Thanks and see you at the next episode.