Dwarkesh Podcast: Richard Sutton – Father of RL thinks LLMs are a Dead-End

Date: September 26, 2025
Host: Dwarkesh Patel
Guest: Richard Sutton (Turing Award winner, RL pioneer)

Episode Overview

In this thoughtful episode, Dwarkesh Patel interviews Richard Sutton, one of the foundational thinkers in reinforcement learning (RL) and a recipient of the 2024 Turing Award. Sutton critiques today’s dominant large language model (LLM) paradigm, arguing that it is fundamentally limited compared to RL. The conversation covers the core differences between LLMs and RL, the debate over imitation versus experience, the promise of continual learning, the limits of current generalization, and Sutton’s philosophical view of AI’s long-term trajectory, including the succession from biological to digital intelligence.

Key Discussion Points & Insights

1. RL vs. LLMs: Competing Paradigms in AI

Sutton frames RL as “basic AI”—centered on agents learning to achieve goals through experience—while LLMs are “mimicking people,” lacking true understanding or agency.
Memorable quote:

"Reinforcement learning is about understanding your world, whereas large language models are about mimicking people... They're not about figuring out what to do."
(B, 00:33)

The Importance of Goals and Ground Truth in Intelligence

Sutton argues that intelligence requires a goal; RL agents have rewards that define right or wrong actions. LLMs, in contrast, lack an external goal and optimize only for next-token prediction accuracy.
"For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals."
(B, 06:47)

2. Critique of LLMs' “World Model” Claims

Sutton is deeply skeptical that LLMs have genuine predictive models of the world: in his view they predict text, not external consequences.
"They have the ability to predict what a person would say. They don't have the ability to predict what will happen."
(B, 01:38)
LLMs lack real-time, experiential feedback that is foundational to learning in RL environments.

On “RL on Top of LLMs”

Sutton rejects the idea that RL can simply be layered onto LLMs for continual, goal-driven intelligence, pointing to a fundamental mismatch in architecture and learning signals.

3. The Bitter Lesson & Limitations of Human Knowledge

Sutton’s influential 2019 essay, “The Bitter Lesson,” is discussed as a case for scalable, general methods that learn from experience, not from handcrafted knowledge or imitation.
The LLM “bandwagon” may look like an instance of the bitter lesson, but Sutton expects systems that learn from experience to perform better and scale further.
"The more human knowledge we put into the lesson, large language models, the better they can do. And yet...I in particular expect...systems that can learn from experience which could well perform much, much better and be much more scalable."
(B, 09:41)

4. Imitation vs. Trial-and-Error in Human and Animal Learning

Sutton forcefully argues that true learning in nature is not imitation (or supervised learning), but trial and error or prediction from experience. Schooling is the exception, not the rule.
"Supervised learning is not something that happens in nature. Squirrels don’t go to school. It’s absolutely obvious, I would say, that supervised learning doesn’t happen in animals."
(B, 17:37)
Even in complex human cultural learning, Sutton frames imitation as a thin layer over evolutionary trial-and-error processes.
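
The contrast between imitation and trial and error can be made concrete with a toy example. The sketch below is not from the episode: the two-action environment, the demonstrator data, and the names `imitation_learner`, `trial_and_error_learner`, and `toy_reward` are all assumptions chosen for illustration. The imitator copies whatever the demonstrator did most often, while the trial-and-error learner (a simple epsilon-greedy bandit) estimates each action's value from the rewards it actually receives.

```python
import random


def imitation_learner(demonstrations):
    """Imitation: copy the action the demonstrator chose most often."""
    counts = {}
    for action in demonstrations:
        counts[action] = counts.get(action, 0) + 1
    return max(counts, key=counts.get)


def trial_and_error_learner(reward_fn, n_actions, steps, epsilon, rng):
    """Epsilon-greedy bandit: estimate action values from experienced rewards."""
    values = [0.0] * n_actions
    pulls = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)          # explore
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        reward = reward_fn(action, rng)
        pulls[action] += 1
        # Incremental sample average of the rewards seen for this action.
        values[action] += (reward - values[action]) / pulls[action]
    return max(range(n_actions), key=lambda a: values[a])


def toy_reward(action, rng):
    # Hypothetical environment: action 1 pays off more often than action 0.
    success_probability = [0.2, 0.8]
    return 1.0 if rng.random() < success_probability[action] else 0.0


rng = random.Random(0)
demonstrations = [0] * 9 + [1]   # the demonstrator mostly picks the worse action
copied = imitation_learner(demonstrations)
learned = trial_and_error_learner(toy_reward, n_actions=2, steps=2000,
                                  epsilon=0.1, rng=rng)
print("imitation picks action", copied)
print("trial and error picks action", learned)
```

In this toy setup the imitator inherits the demonstrator's habit regardless of outcomes, whereas the trial-and-error learner is corrected by the reward signal. It illustrates the distinction only; it is not a claim about how any real system is trained.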

5. Continual Learning and the “Era of Experience”

RL agents must learn continually, not via a fixed train/deploy split as in LLMs.
Discusses the fundamental RL loop: sensation, action, reward—core to both animals and future intelligent agents.
"This is what reinforcement learning paradigm is, learning from experience."
(B, 24:24)
Discusses the necessity for RL environments to be as rich and dynamic as the real world for training truly general agents.
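
The sensation, action, reward loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Sutton's own formulation: the `Corridor` environment, the choice of tabular Q-learning, and every parameter value are hypothetical. The point it shows is that learning happens on every step of one continuing stream of experience, with no separate training and deployment phases.

```python
import random


class Corridor:
    """Hypothetical toy world: states 0..4, reward 1.0 for reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def step(self, action):
        # Action 0 moves left, action 1 moves right (clamped at the left wall).
        move = 1 if action == 1 else -1
        self.state = max(0, self.state + move)
        if self.state == self.length - 1:
            self.state = 0              # return to the start; the stream continues
            return self.state, 1.0, True
        return self.state, 0.0, False


def choose_action(q_row, epsilon, rng):
    # Explore with probability epsilon, and break exact ties at random.
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return rng.randrange(2)
    return 0 if q_row[0] > q_row[1] else 1


def run(steps, rng, alpha=0.5, gamma=0.9, epsilon=0.1):
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]
    state = env.state                                        # sensation
    for _ in range(steps):
        action = choose_action(q[state], epsilon, rng)       # action
        next_state, reward, reached_goal = env.step(action)  # reward, new sensation
        # Q-learning update, applied at every step of experience.
        bootstrap = 0.0 if reached_goal else gamma * max(q[next_state])
        q[state][action] += alpha * (reward + bootstrap - q[state][action])
        state = next_state
    return q


q_values = run(5000, random.Random(0))
greedy = [0 if q_values[s][0] > q_values[s][1] else 1 for s in range(4)]
print("greedy action per state (1 = right):", greedy)
```

Because the update sits inside the interaction loop, the agent would keep adapting if the environment changed, which is the property the train/deploy split gives up.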

Reward Function in General AI

The reward function in RL is arbitrary: winning at chess, getting nuts as a squirrel, or, more generally, “to avoid pain and to acquire pleasure,” with perhaps intrinsic rewards for model-building. (24:46–25:27)

6. Transfer, Generalization & RL’s Limitations

RL does not yet achieve robust transfer or generalization, that is, learning in one context that improves performance in another, as general intelligence requires.
Current advances (like DeepMind’s MuZero/AlphaZero) are seen as steps, but generalization is still mostly the result of human engineering, not automation:

"Gradient descent will not make you generalize well. It will make you solve the problem… We know deep learning is really bad at this... Generalization means train on one thing that affects what you do on the other things."
(B, 36:41)
Critiques LLMs’ apparent generalization as often an illusion: the scale of their training data and their complexity make it hard to tell how much they truly generalize.
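
Sutton's definition of generalization, training on one thing affecting what you do on other things, can be shown mechanically with a tiny linear predictor. The sketch below is an illustration under assumptions, not something presented in the episode: the two states "A" and "B", their feature vectors, and the helper names are hypothetical. A single gradient step on state A leaves the prediction for B untouched when the features are one-hot (tabular), and shifts it when the features overlap.

```python
def predict(weights, features):
    """Linear prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_update(weights, features, target, lr=0.1):
    """One gradient-descent step on squared error for a single example."""
    error = target - predict(weights, features)
    return [w + lr * error * x for w, x in zip(weights, features)]


# Hypothetical representations of two states, A and B.
one_hot = {"A": [1.0, 0.0], "B": [0.0, 1.0]}   # tabular: nothing is shared
shared = {"A": [1.0, 0.0], "B": [1.0, 1.0]}    # the first feature is shared


def effect_on_b(features):
    """Train once on A, then report how much the prediction for B moved."""
    weights = [0.0, 0.0]
    before = predict(weights, features["B"])
    weights = sgd_update(weights, features["A"], target=1.0)
    return predict(weights, features["B"]) - before


tabular_shift = effect_on_b(one_hot)
shared_shift = effect_on_b(shared)
print("shift in B with one-hot features:", tabular_shift)
print("shift in B with shared features:", shared_shift)
```

Whether that shift helps or hurts on B depends entirely on the representation, which gradient descent on A does nothing to guarantee. That is consistent with the claim quoted above that gradient descent solves the problem it is given without by itself producing good generalization.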

7. Reflections on the Field: Surprises and Trajectory

Sutton is surprised at how well neural nets perform at language, but gratified that “simple basic principles” (search, learning) have beaten “strong” systems built on human knowledge.
AlphaGo and AlphaZero were “merely a scaling up” of principles Sutton and others developed decades prior. He sees his classicist worldview as vindicated:

"The weak methods have just totally won... It was all good and gratifying and things like AlphaGo."
(B, 42:03)

8. AI Succession: From Humanity to Design

Sutton’s four-step argument for why digital intelligence will inevitably replace, or at least succeed, biological intelligence:
1. No unified will or consensus on Earth.
2. We will (eventually) figure out intelligence.
3. We will create superintelligence.
4. The most intelligent entities will accumulate power and resources.
He frames this as the universe’s next major transition:

"I mark this as one of the four great stages of the universe… dust, stars, life, designed entities."
(B, 57:18)
This transition, from replication (evolution) to design (AIs designing AIs), is both inevitable and potentially positive, depending on our attitude and ability to steer it.

9. On Values, Control, and Accepting Change

Sutton suggests we should hope to embed robust, steerable, voluntary values—analogous to how parents “educate” children—rather than dictating every outcome.
Stresses humility:

"We want to avoid the feeling of entitlement, avoid the feeling, oh, we are here first, we should always have it in a good way."
(B, 61:47)
Argues that most humans have little influence over large-scale power even now, and that our efforts should focus on nurturing better, more responsible AI.

Notable Quotes & Memorable Moments

On RL vs LLMs:
"Large language models are about mimicking people… They're not about figuring out what to do." (00:33)

On Goals in Intelligence:
"For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals." (06:47)

On Learning from Experience:
"No one has to tell you, first of all, you have a goal… The scalable method is you learn from experience, you try things, you see what works." (12:37)

On Nature vs Human Knowledge in AI:
"Supervised learning is not something that happens in nature… Squirrels don't go to school. Squirrels can learn all about the world." (17:37)

On the Bitter Lesson and AI’s Future:
"The more human knowledge we put into the lesson, large language models, the better they can do. And yet one… I in particular expect there to be systems that can learn from experience which could well perform much, much better and be much more scalable." (09:41)

On Transfer and Generalization:
"Gradient descent will not make you generalize well... Generalization means train on one thing that affects what you do on the other things." (36:41)

On Succession and the Age of Design:
"I mark this as one of the four great stages of the universe: dust, stars, life, and now designed entities. So I think we should be proud… that we are giving rise to this great transition." (57:18)

On Human Values and Voluntary Change:
"We should try to make [the future] good. We also, though, should recognize our limits. And I think we want to avoid the feeling of entitlement…" (61:47)

Important Timestamps

00:33 – Sutton’s core critique: RL vs LLM paradigms
01:38–02:38 – Why LLMs lack true world modeling
06:47 – “Goals are the essence of intelligence”
09:41–11:17 – The Bitter Lesson and scalability
14:04–17:37 – Imitation vs. trial-and-error learning in humans/animals
24:24 – The RL loop as foundational to intelligence
36:41 – Transfer/generalization limitations in current RL/AI
42:03–46:40 – Historic surprises in AI and Sutton’s “classicist” outlook
54:00–58:55 – The inevitability and philosophy of AI succession
64:54–65:55 – On values, education, and designing the future

Tone & Style

The dialogue is animated yet philosophical, with Sutton offering both technical depth and big-picture reflections. Arguments are sometimes playful, occasionally adversarial, always thoughtful, and grounded in Sutton’s deep commitment to the RL paradigm and a classicist view of intelligence.

Summary prepared for listeners who want Sutton’s perspective on why RL is the path to scalable, continual, goal-driven intelligence, and why today’s LLMs are, in his view, a technological cul-de-sac.
