Dwarkesh Podcast: Richard Sutton – Father of RL thinks LLMs are a Dead-End
Date: September 26, 2025
Host: Dwarkesh Patel
Guest: Richard Sutton (Turing Award winner, RL pioneer)
Episode Overview
In this deeply thoughtful episode, Dwarkesh Patel interviews Richard Sutton, one of the foundational thinkers in reinforcement learning (RL) and a 2024 Turing Award laureate (shared with Andrew Barto). Sutton critiques today's dominant large language model (LLM) paradigm, arguing it is fundamentally limited compared to RL. The conversation covers the core differences between LLMs and RL, debates about imitation vs. experience, the promise of continual learning, the limitations of current generalization, and Sutton's philosophical view of AI's long-term trajectory and the succession from biological to digital intelligence.
Key Discussion Points & Insights
1. RL vs. LLMs: Competing Paradigms in AI
- Sutton frames RL as “basic AI”—centered on agents learning to achieve goals through experience—while LLMs are “mimicking people,” lacking true understanding or agency.
- Memorable quote:
"Reinforcement learning is about understanding your world, whereas large language models are about mimicking people... They're not about figuring out what to do."
(B, 00:33)
The Importance of Goals and Ground Truth in Intelligence
- Sutton argues that intelligence requires a goal; RL agents have rewards that define right or wrong actions. LLMs, in contrast, lack an external goal and only optimize for internal prediction accuracy.
- "For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals." (06:47)
2. Critique of LLMs' “World Model” Claims
- Sutton is deeply skeptical that LLMs have genuine predictive models of the world; they predict text, not external consequences.
- "They have the ability to predict what a person would say. They don't have the ability to predict what will happen." (01:38)
- LLMs lack the real-time, experiential feedback that is foundational to learning in RL environments.
On “RL on Top of LLMs”
- Sutton rebuffs the idea that RL can simply be layered onto LLMs for continual, goal-driven intelligence, pointing to a fundamental mismatch in architecture and learning signals.
3. The Bitter Lesson & Limitations of Human Knowledge
- Sutton’s influential 2019 essay, “The Bitter Lesson,” is discussed as a case for scalable, general methods that learn from experience, not from handcrafted knowledge or imitation.
- The LLM "bandwagon" may look like the bitter lesson in action, but Sutton expects experience-driven agents to outscale LLMs before long.
- "The more human knowledge we put into the lesson, large language models, the better they can do. And yet...I in particular expect...systems that can learn from experience which could well perform much, much better and be much more scalable." (09:41)
4. Imitation vs. Trial-and-Error in Human and Animal Learning
- Sutton forcefully argues that true learning in nature is not imitation (or supervised learning), but trial and error or prediction from experience. Schooling is the exception, not the rule.
- "Supervised learning is not something that happens in nature. Squirrels don't go to school. It's absolutely obvious, I would say, that supervised learning doesn't happen in animals." (17:37)
- Even in complex human cultural learning, Sutton frames imitation as a thin layer over evolutionary trial-and-error processes.
5. Continual Learning and the “Era of Experience”
- RL agents must learn continually, not via a fixed train/deploy split as in LLMs.
- Discusses the fundamental RL loop of sensation, action, and reward, core to both animals and future intelligent agents (a minimal sketch of this loop follows this subsection).
- "This is what [the] reinforcement learning paradigm is, learning from experience." (24:24)
- Discusses the necessity for RL environments to be as rich and dynamic as the real world for training truly general agents.
Reward Function in General AI
- The reward function in RL is arbitrary: winning at chess, getting nuts as a squirrel, or, more generally, "to avoid pain and to acquire pleasure," with perhaps intrinsic rewards for model-building (24:46-25:27). The sketch below treats the reward signal as exactly such a swappable component.
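To make the loop concrete, here is a minimal, illustrative sketch in Python of the sensation-action-reward cycle with continual learning and an arbitrary, swappable reward signal. It is not from the episode: the BanditEnvironment and EpsilonGreedyAgent names, the payoff numbers, and the epsilon-greedy update are assumptions chosen for brevity, standing in for the far richer agents and environments Sutton has in mind.

```python
import random


class BanditEnvironment:
    """Toy environment: each action has an unknown expected payoff, hidden from the agent."""

    def __init__(self, payoffs):
        self.payoffs = payoffs  # true mean reward per action

    def step(self, action):
        # The only "sensation" here is a noisy reward for the chosen action.
        return self.payoffs[action] + random.gauss(0, 1)


class EpsilonGreedyAgent:
    """Learns action values incrementally, from its own experience alone."""

    def __init__(self, n_actions, epsilon=0.1, step_size=0.1):
        self.q = [0.0] * n_actions   # current value estimates
        self.epsilon = epsilon       # exploration rate (trial and error)
        self.step_size = step_size   # constant step size, so the agent keeps adapting

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))                  # explore
        return max(range(len(self.q)), key=self.q.__getitem__)    # exploit

    def learn(self, action, reward):
        # Incremental update toward the observed reward: learning never stops.
        self.q[action] += self.step_size * (reward - self.q[action])


if __name__ == "__main__":
    # The goal is defined entirely by the reward numbers; swap them and the
    # same agent pursues a different goal (chess wins, nuts, "pleasure", ...).
    env = BanditEnvironment(payoffs=[0.2, 1.0, 0.5])
    agent = EpsilonGreedyAgent(n_actions=3)
    for t in range(10_000):          # one endless loop, no separate train/deploy phases
        action = agent.act()
        reward = env.step(action)
        agent.learn(action, reward)
    print("learned action values:", [round(v, 2) for v in agent.q])
```

Because the step size is constant and the loop never hands off to a frozen deployment phase, the agent keeps adapting if the payoffs drift, which is the contrast Sutton draws with the fixed train/deploy split of LLMs.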
6. Transfer, Generalization & RL’s Limitations
- RL does not yet achieve robust transfer and generalization, that is, learning in one context that benefits another, which general intelligence requires.
- Current advances (like DeepMind's MuZero/AlphaZero) are seen as steps forward, but generalization is still mostly the result of human engineering, not automation:
"Gradient descent will not make you generalize well. It will make you solve the problem… We know deep learning is really bad at this... Generalization means train on one thing that affects what you do on the other things." (36:41)
- Critiques LLMs' generalization as often an illusion: the sheer scale of their training data and their complexity obscure how much genuine generalization is actually happening.
7. Reflections on the Field: Surprises and Trajectory
- Sutton is surprised at how well neural nets perform at language, but gratified that “simple basic principles” (search, learning) have beaten “strong” human knowledge-defined systems.
- AlphaGo/AlphaZero was “merely a scaling up” of principles Sutton and others developed decades prior. He sees his classicist worldview as vindicated:
"The weak methods have just totally won... It was all good and gratifying and things like AlphaGo."
(B, 42:03)
8. AI Succession: From Humanity to Design
- Sutton's four-step argument for why digital intelligence will inevitably replace, or at least succeed, biological intelligence:
- No unified will or consensus on Earth.
- We will (eventually) figure out intelligence.
- We will create superintelligence.
- The most intelligent entities will accumulate power and resources.
- He frames this as the universe's next major transition:
"I mark this as one of the four great stages of the universe… dust, stars, life, designed entities." (57:18)
- This transition, from replication (evolution) to design (AIs designing AIs), is both inevitable and potentially positive, depending on our attitude and ability to steer it.
9. On Values, Control, and Accepting Change
- Sutton suggests we should hope to embed robust, steerable, voluntary values (analogous to how parents "educate" children) rather than dictating every outcome.
- Stresses humility:
"We want to avoid the feeling of entitlement, avoid the feeling, oh, we are here first, we should always have it in a good way." (61:47)
- Argues that most humans have little influence over large-scale power even now, and that our efforts should go toward nurturing better, more responsible AI.
Notable Quotes & Memorable Moments
- On RL vs LLMs: "Large language models are about mimicking people… They're not about figuring out what to do." (00:33)
- On Goals in Intelligence: "For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals." (06:47)
- On Learning from Experience: "No one has to tell you, first of all, you have a goal… The scalable method is you learn from experience, you try things, you see what works." (12:37)
- On Nature vs Human Knowledge in AI: "Supervised learning is not something that happens in nature… Squirrels don't go to school. Squirrels can learn all about the world." (17:37)
- On the Bitter Lesson and AI's Future: "The more human knowledge we put into the lesson, large language models, the better they can do. And yet one… I in particular expect there to be systems that can learn from experience which could well perform much, much better and be much more scalable." (09:41)
- On Transfer and Generalization: "Gradient descent will not make you generalize well... Generalization means train on one thing that affects what you do on the other things." (36:41)
- On Succession and the Age of Design: "I mark this as one of the four great stages of the universe: dust, stars, life, and now designed entities. So I think we should be proud… that we are giving rise to this great transition." (57:18)
- On Human Values and Voluntary Change: "We should try to make [the future] good. We also, though, should recognize our limits. And I think we want to avoid the feeling of entitlement…" (61:47)
Important Timestamps
- 00:33 – Sutton’s core critique: RL vs LLM paradigms
- 01:38–02:38 – Why LLMs lack true world modeling
- 06:47 – “Goals are the essence of intelligence”
- 09:41–11:17 – The Bitter Lesson and scalability
- 14:04–17:37 – Imitation vs. trial-and-error learning in humans/animals
- 24:24 – The RL loop as foundational to intelligence
- 36:41 – Transfer/generalization limitations in current RL/AI
- 42:03–46:40 – Historic surprises in AI and Sutton’s “classicist” outlook
- 54:00–58:55 – The inevitability and philosophy of AI succession
- 64:54–65:55 – On values, education, and designing the future
Tone & Style
The dialogue is animated yet philosophical, with Sutton offering both technical depth and big-picture reflections. Arguments are sometimes playful, occasionally adversarial, always thoughtful, and grounded in Sutton’s deep commitment to the RL paradigm and a classicist view of intelligence.
Summary prepared for listeners who want Sutton’s perspective on why RL is the path to scalable, continual, goal-driven intelligence, and why today’s LLMs are, in his view, a technological cul-de-sac.
