Summary

CoRecursive: Coding Stories

Episode: The Bitter Lesson – The History of Reinforcement Learning

Host: Adam Gordon Bell
Guest: Don McKay
Date: June 13, 2026

Overview

In this episode, Adam Gordon Bell and recurring guest Don McKay dive deep into the history, science, and meaningful insights behind reinforcement learning (RL) in artificial intelligence—anchored around the “bitter lesson,” a famous reflection by RL pioneer Richard Sutton. The conversation weaves through the evolution of AI from expert systems to today's learning agents, exploring key breakthroughs, unexpected setbacks, and what Sutton's work tells us about intelligence (both natural and artificial), with a focus on the concept that “reward is enough” to drive learning.

Key Topics & Discussion Points

1. Setting the Stage: Understanding "Reward is Enough" [00:00–03:00]

Adam’s journey to understand if intelligence can emerge from simple feedback mechanisms.
Discussion around the influential paper: Reward is Enough (Silver, Singh, Precup, Sutton, 2021).
The radical claim: "Intelligence and its associated abilities could be understood as subserving the maximization of reward."
- Adam and Don challenge and reflect on the reductiveness of this view.

"As long as you have a reward structure, you can create intelligence, which I think is a little bit reductive." — Don [01:20]

2. Roots of Reinforcement Learning: Behaviorism and Sutton’s Early Work [03:01–07:04]

Introduction to Sutton’s academic roots with Andrew Barto, influenced by behaviorist B.F. Skinner.
Colorful side stories:
- Skinner training pigeons for wartime missile guidance systems and industrial QA.
The contrast between 1980s-90s "expert systems" (rule-based AI) and the emerging RL paradigm.

3. Sutton’s Temporal Difference Learning: Tic Tac Toe Explained [07:05–09:56]

Don and Adam break down Sutton's seminal tic tac toe RL example:
- Assigning values to every possible game state; updating values by “working backwards” from outcomes using temporal difference (TD) learning.
Discussion on how this mimics superstition, assigning value to “how we got here” after each win or loss.

"You just need to be able to play it many times so you can start building it, like a model of scores." — Don [09:10]

4. Scaling Up: From Tic Tac Toe to Backgammon – The Birth of Superhuman Play [09:57–15:56]

Gerald Tesauro at IBM Watson applies Sutton’s method to Backgammon (TD-Gammon, 1992):
- Too many possible states for a lookup table, so he uses a small neural network to generalize.
- The system outplays human experts and innovates strategies never seen before.
- Human champions like Bill Robertie learn from the AI, updating their own strategy books.

"It beats them because it’s unconventional. Rather than trying to imitate humans, it develops its own sense of positional judgment by learning from experience and playing against itself." — Don [15:04]

5. The Dormant Revolution: Ignored RL and the Era of Deep Blue [16:04–21:56]

Despite success in backgammon, RL is ignored for 20 years in favor of expert systems (e.g., Deep Blue in chess).
Deep Blue’s approach:
- Giant curated databases and 8,000+ handcrafted evaluation rules.
- Supercomputing brute-force, searching 200 million positions/sec.
Sutton’s frustration: his ideas left by the wayside.

"He’s probably pissed because he's like, I did this stuff like 25 years ago and nobody gave a crap." — Don [18:21]

6. Reinvention: RL, Deep Learning, and the Atari Breakthrough [21:57–28:43]

DeepMind’s revival of RL: Combining Sutton’s update rules, deep neural networks, and Monte Carlo tree search.
The famous Atari experiments:
- Input: Downsampled Atari screen pixels; output: joystick directions.
- The agent re-discovers strategies (e.g., “tunneling” in Breakout) entirely from reward feedback and pixel data.

"It can't see the game. It can just see, like, you know, 0, 0, 11, 1. ... It has to learn that because all it really knows is the score." — Adam [26:29; 27:29]

7. The Go Milestone: AlphaGo and Tabula Rasa Learning [28:44–44:31]

AlphaGo’s leap: RL plus deep networks plus tree search used to master Go—a game too vast for brute-force or hand-coded rules.
AlphaGo's move 37 (shoulder hit on the 5th line) shocks human grandmasters by innovating a non-human strategy, causing world champion Lee Sedol to leave the stage in shock.
- "The professional commentators almost unanimously said that not a single human player would have chosen move 37." — David Silver (AlphaGo architect) [41:33]
After the initial AlphaGo (which saw some human game data), DeepMind builds AlphaGo Zero:
- "Tabula rasa"—starting from nothing, learning purely by self-play.
- AlphaGo Zero outperforms its predecessor, beating it 100-0 after three days of training.
Extended to other games (chess, shogi), AlphaZero and MuZero crush traditional engines, proving the generality of Sutton’s ideas.

"When a simpler search-based approach...proved vastly more effective, these human knowledge-based chess researchers were not good losers." — Sutton, “The Bitter Lesson” [49:46]

8. The "Bitter Lesson" Manifesto: Beyond Human Cleverness [44:32–51:09]

Sutton’s 2019 essay "The Bitter Lesson" claims:
- The largest long-run gains in AI have come not from encoding human cleverness, but from general methods (search and learning) that scale with computation.
- In Sutton's examples (chess, Go, speech recognition, computer vision), systems that search and learn at scale eventually overtook those carefully hand-designed by experts.

"Stop being clever. Just throw compute and reward at it and it will be guaranteed better than whatever you can come up with." — Adam [51:22]

Implications: Hand-crafted knowledge and expert systems get surpassed; the best AI comes from blank-slate self-play and learning.

9. The Present & Debate: Rewards, LLMs, and What’s Next [51:10–59:39]

Sutton and AI researchers debate the scope of “reward is enough”—can it drive social, perceptual, and even linguistic intelligence?
Don pushes back: aren't some human pursuits rewardless? Not all fits so neatly.
Adam draws a sharp distinction: current LLMs (Large Language Models) mimic humans, but true blank-slate RL models (like AlphaZero) go beyond imitation.
The “dark side”: If everything with a score or benchmark becomes a computer’s domain, humans may always be outperformed—unless we keep inventing new “boxes” and reward structures.

"The dark version of 'reward is enough' is that reward is enough for a computer to crush you at what you do. And that’s the bitter pill." — Adam [56:24]

Notable Quotes & Memorable Moments

On the core reduction of intelligence to reward:
- "It's saying that everything about our intelligence is just serving some kind of reward structure." — Don [01:30]
On unconventional AI play:
- "It hasn't read those books...And so it does things, playing them, that they're like...And then it beats them. And they're like, oh, wait." — Adam [14:59]
On Deep Blue's expert system:
- "They just dropped it for 20 years." — Don [17:00], on RL being set aside during the Deep Blue era
- "It had 8,000 rules for how to evaluate a chess move." — Adam [21:54]
On AlphaGo’s historic Go match:
- "AlphaGo plays a shoulder hit on the fifth line...not a single human player would have chosen move 37." — David Silver [41:33]
On The Bitter Lesson’s key thesis:
- "The biggest lesson that we can read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective. And by a large margin." — Sutton [47:41]
On the dilemma for human expertise:
- "It’s a bitter pill to swallow to say, like, whatever cleverness you can come up with, a human will not be as good as us just figuring out where the reward signal is. And giving it unlimited compute." — Adam [51:22]
On the limits of current LLMs:
- "Large language models are about mimicking people, doing what people say you should do...They don't have the ability to predict what will happen." — Sutton [54:35]

Key Timestamps

00:51 – Introduction to “Reward is Enough” and Sutton’s claim
03:26 – Behaviorism and the pigeon-guided missile story
07:04 – Sutton’s temporal difference learning in tic tac toe
10:16 – Scaling RL up: TD-Gammon and neural nets
13:58 – Computer beats top backgammon players, changes human strategy
16:14 – RL ignored for 20 years, Deep Blue’s rule-based approach
18:15 – Sutton’s frustration: "25 years ago!"
24:24 – DeepMind bets on RL+deep learning, Atari experiments
28:59 – AlphaGo, the challenge of Go, innovation beyond human strategy
41:33 – AlphaGo’s move 37, move probability "1 in 10,000" for humans
44:31 – AlphaGo Zero and MuZero: Tabula rasa learning
47:41 – The Bitter Lesson: "General methods...are the most effective."
51:04 – "Don’t ascribe your human conceptual models onto the AI" – on why it’s bitter
54:35 – Sutton critiques LLMs for not being true blank-slate learners
56:24 – The dark side of rewards: computers dominating any "scoreboard" task
59:39 – Concluding reflections; humans redefine the game

Conclusion & Takeaways

The "bitter lesson": Decades of AI research reveal that generic, scalable methods leveraging computational brute force ultimately outperform systems built on hand-crafted human cleverness.
Reward is enough: With the right reward signal and sufficient computational resources, AI can exceed human ability in any bounded task, inventing new strategies beyond human imagination.
The shifting role of humans: As computers master well-defined domains, it’s up to humans to "make new boxes," push the boundaries, and invent new challenges.
Implications for AI’s future: The episode closes with an open question—what happens when AI can set its own goals, create new "boxes," and what that will mean for human ingenuity.

Final Thoughts:
This wide-ranging, story-rich episode is both a celebration of RL's real-world impact and a philosophical meditation on what intelligence, learning, and human creativity mean in an age of machines that learn faster and arguably “better” than we do—provided they have the right rewards.


Transcript

[00:00]
A
Hi, I'm Adam Gordon Bell and this is CoRecursive. And today I have here again Don.
[00:05]
B
Hi, I'm Don McKay and I'm here again.
[00:07]
A
So I've been texting you again.
[00:09]
B
Yes.
[00:09]
A
Yeah.
[00:10]
B
And it's always late, too. It's never like, you know, it's 10 o'clock at night or something.
[00:13]
A
I know, sorry.
[00:14]
B
I mean, when you're like an old man like me, like, that's late, it's too much.
[00:18]
A
Yeah. You're like, I eat at 4, I go to bed at 8. Yeah. Yeah. So I've been, I've been trying to understand, I guess, like, machine learning. And that's why I sent you some things. So. Yeah. What did I send you?
[00:29]
B
First you told me you didn't have whiskey anymore. And I was like, that's a crime. And then you said, okay, I've been continuing going deep on AI and ML stuff. I'm trying to understand it at a level of like, could I write a program that's just normal ifs and loops and whatever, but yet it can learn? And I think I found the key. It's this paper, Reward is Enough. Silver, Singh, Precup, Sutton, 2021. I want to explain it to you. Sort of simple and simple.
[00:52]
A
My neat quotes and. Yeah, because like, okay, machine learning is fancy and cool and like, we're software developers, we should be able to understand it. But it feels like, magical.
[01:01]
B
Yeah, it's like existential. Right? You're like, oh, I don't know, I don't know how it works. So, yeah. Oh, okay, here's the important line. Intelligence and its associated abilities could be understood as subserving the maximization of reward.
[01:13]
A
What do you think? Subserving is such a weird word.
[01:15]
B
Yeah. You haven't heard of subservient individuals?
[01:18]
A
Well, I guess. Okay, maybe in that context it seems
[01:21]
B
like he's saying that as long as you have a reward structure, you can create intelligence, which I think is a little bit reductive, in my opinion.
[01:27]
A
I mean, it's incredibly reductive. Right? Yeah. He's saying it all reduces.
[01:31]
B
It all reduces down to the maximization of reward. Right. And furthermore, saying intelligence and its associated abilities can be understood as subserving reward, which means that everything about our intelligence is just serving some kind of reward structure. And I think that that's kind of, I don't know. Yeah, it's very reductive. Once you reduce something to that level, you lose the fidelity of the statement. Right.
[01:58]
A
So it reduces everything. But as a cool title, right. Because that sentence is very complex. But the. The title of the paper, right. Reward is Enough. You kind of get what he's saying, right?
[02:07]
B
If you wanted to create an artificial intelligence, reward is enough. That's all. That's the only structure that you require.
[02:13]
A
Everything. Yeah. That's all you need is just that one thing, Right. Like, I've reduced the entirety of intelligence, of intellect, of culture, of. I mean, I guess he didn't say culture, but everything to do with what makes something intelligent and smart and able to take action in the world is just reward. Right. Press the lever, get a piece of cheese, basically. Okay, so, yeah, Sutton, he wrote this book on reinforcement learning, and, like, his. Him and his advisor, like, kind of invented this field. Yeah. So I just want to pull it apart. Right. That's what I want to talk about today. I was like, this is a great thing for our kind of stack trace format, which. Which I asserted before and maybe still needs a better name that, like, we take something and kind of, like, back up the stack frames until, like, you get. Right, we have this really grand statement like, can we pull it apart? Can we get to somewhere?
[03:01]
B
Null pointer exception.
[03:03]
A
Yeah. Turns out it was all. Intelligence is a null pointer exception. Gone the bonkers. Okay, so it's 1988, and Richard Sutton is finishing his PhD at the University of Massachusetts, and his advisor is named Andrew Barto. And Barto, he actually comes from, B.F. Skinner was his background as an academic. Do you know B.F. Skinner?
[03:27]
B
I do not.
[03:27]
A
So interesting enough, right? B.F. Skinner, he taught pigeons things where they would, like, press a lever and they would get, like, a little piece of thing.
[03:34]
B
Okay.
[03:35]
A
He's the person who, like, operationalized how to, like, train animals to do things based on reward. And, like, his theory was also that all of human behavior is based on, you know, going after rewards, like, the same way your dog wants a treat. It's kind of a weird background. Like, Sutton is, like, a computer scientist, but he's working under this guy whose background is, like, this behaviorist and who, like, taught pigeons, basically. And side note, Skinner is super interesting. Probably not a topic for the podcast, but Skinner in the. In World War II, he built things for the US military. Any guesses what they would be?
[04:14]
B
Homing pigeons.
[04:15]
A
Yeah. So he built a missile guidance system where they just put the pigeons.
[04:20]
B
They put the pigeons in the missile. I did actually see. See, like, a video about that. Yeah.
[04:25]
A
It's insane, right? Like, he had a very reductive view, as you would say, like, he reduced everything to like, well, yeah, let's just. We need to hit this target. Why don't we just train a pigeon? Yeah, they.
[04:37]
B
They were good at it.
[04:37]
A
And like, he also had or like one of his students built. I wrote this down. This guy who worked for him named Tom Verhave had a pigeon product or project at the company E. Is it Elly. Is that how you say it? The drug company E. Lily.
[04:54]
B
I think it's. I think it's Eli Lilly.
[04:56]
A
He had this project at this company where basically they just had pigeons do QA. So like all the parts are going down the aisle and they just have pigeons there who, like, are trained to look for the broken ones and peck them out of there. So they tried to. Basically they replaced human workers with pigeons and like they got the whole thing working. And then they canceled it because they thought it would be like a PR disaster. Although I did hear a story before that they canceled it because it was very demoralizing. Like, imagine you're at your job and they like, they fire the guy like further down the chain from you, like further down the.
[05:29]
B
And just replace him with.
[05:30]
A
Replace him with a pigeon. Like he. His door. So it didn't take off, but same idea, right? You just give a simple reward. The pigeon learns like, oh, this is a defective part or this is a defective pill and knocks it off. Anyway, this is behaviorism. Anyways, that's where this Barto guy came from and Sutton's working under him. But the whole computer science thing at this time in 1988 was very different, right? It was AI, but a different approach to AI. Expert systems. So expert systems, they sometimes call it good old fashioned AI. Expert systems, like, was this idea of AI that was different than today. It was if you could interview a doctor and figure out how he assesses a patient and you can just like write down all those rules and then ask people about their health. And then it's like a doctor in a box because it's like a. Just like a flowchart that you can travel through. I mean, it seems kind of basic right now, but they had this idea, you know, if we can get all the rules for all of intelligence written down, right? Here's all the doctor rules, here's all whatever we can replace. We can have intelligence just in sort of if statements and whatever. So Sutton's not that, right. He works for this behaviorist. Different approach, different. And he publishes a paper called Learning to Predict by the Methods of Temporal Differences. And it's a super. It's a super cool idea. I'm going to show you because this is how I started to understand machine learning. Right, so this is. Yeah, this is tic tac toe, right? And this is how he explains the thing, right? So if we. If we play tic tac toe, you go first. So you're. You're X. You can pick which position you want. But this is all the possible states of the entirety of the game in this tree.
[07:04]
B
It's a. It's. It's just a giant chart with a bunch of X's and O grids and what the current state of them would be very similar to, like chess moves. Kind of like mapping.
[07:15]
A
It's kind of like a tree, right? Or a flowchart. So it's like his idea is, okay, there's tic tac toe. He figured it out. There's like. I think it's like 725 possible states that you can be in, right? And then, like, at the very end, you have a result, right? So, like, if we played this game where you get three X's across and I do nothing to stop you because I don't understand the game, then he. He gives that a value.
[07:36]
B
So, yeah, he's assigning values to win conditions so that, you know that's what you want to get. That's the. That's the ideal state.
[07:43]
A
It's like, there's no magic. It's like an array of 725 values the game. And like, you put an X and I put an O. And so it's a tree going down, and at the end I win, right? So here's this trick. This is. This is like the whole of the thing. And then all he does is this, right? But this one must be negative one, because Adam won. And then. So this one, he goes backwards and he's like, okay, this one is like negative 0.1, right? And then this one is negative 0.01 and so on going backwards and saying like, well, those must have all been pretty good because you won. They're not as good as the part that you won, but they're. They're kind of good. So it's basically like walking backwards. It's like how people develop superstitions in a way, right?
[08:23]
B
So you trace back the path to how you got there, and you assign values according to how much you want to incentivize that path based on the end result or decentivize the. The path based on a losing condition. So at the end of the day, when Every possible move has a value. You'll know how to respond according to always going towards the one that will give you more of a score.
[08:42]
A
Exactly. This is this whole paper. I think you just nailed it, right?
[08:45]
B
Yeah. So whatever choice you make, it will then evaluate all of the other choices that it's learned and then it will have a score and it will go with the one with the highest score.
[08:53]
A
So if I go play. So that was play one game, right? If we go play 10 games, like every time it goes through the tree, it decides if it wins. So it's, it's like, hey, play a whole bunch of tic tac toe games and see where you won and where you lost. You don't even have to know how the game works. You don't even need to know what's good and what's bad.
[09:10]
B
You just need to be able to play it many times so you can start building it. Like a, like a model of scores.
[09:16]
A
Yeah. And if you do, like enough times, you start to know what the values are. Like what's a good move and what's a bad move. This is his whole thing, right. This is like his early paper. And I love it because it just explains this simple version of machine learning. Right. It's learning to play the game just by doing this working backwards step. And like there's no crazy stuff, right. It's just a bunch of array values and you're like updating. It's like if, if at the end of the day you get told you did good or bad, then you just like figure out during the day what were the things and work backwards.
[09:46]
B
Yeah, that's, that's amazing.
[09:48]
A
And like, that's his, that's, that's part of his, like, reward is enough, right? Like, all you need to do is
[09:53]
B
assign a score to desirable or undesirable outcomes.
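The "walk backwards" idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Sutton's code: the board encoding (a 9-character string), the function names, the step size `alpha`, and the exploration rate `epsilon` are all assumptions made for the example. It uses a simple temporal-difference style backup, where each earlier position's value is nudged toward the value of the position that followed it.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_and_learn(values, alpha=0.5, epsilon=0.1, rng=random):
    """Play one self-play game, then back the result up through the visited states.

    values maps a board string to how good that position is for X (-1 to 1).
    alpha and epsilon are illustrative choices, not values from the paper.
    """
    board, player, visited = " " * 9, "X", []
    while winner(board) is None and " " in board:
        moves = [i for i, cell in enumerate(board) if cell == " "]
        options = [board[:i] + player + board[i + 1:] for i in moves]
        if rng.random() < epsilon:
            board = rng.choice(options)          # occasionally try something new
        else:
            pick = max if player == "X" else min  # X wants high values, O wants low
            board = pick(options, key=lambda s: values.get(s, 0.0))
        visited.append(board)
        player = "O" if player == "X" else "X"

    # The only feedback is the final result: +1 X won, -1 O won, 0 draw.
    result = {"X": 1.0, "O": -1.0, None: 0.0}[winner(board)]
    values[visited[-1]] = result

    # Walk backwards: move each earlier state's value toward its successor's.
    for earlier, later in zip(reversed(visited[:-1]), reversed(visited[1:])):
        old = values.get(earlier, 0.0)
        values[earlier] = old + alpha * (values[later] - old)
    return result

values = {}
rng = random.Random(0)
for _ in range(2000):
    play_and_learn(values, rng=rng)
print(len(values), "positions have a learned value")
```

Nothing in the loop knows what a good move is. The table fills in only because final results keep getting pushed back through the positions that led to them.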
[09:56]
A
So what happens next?
[09:57]
B
He moves up to chess?
[09:59]
A
Not yet, but we're going to get there. Four years later, 1992, somebody picks up his paper and it's this guy at IBM Watson Lab, and he does backgammon, which was like kind of an interesting. Backgammon. People are super into backgammon. But like, I don't think it has the allure of like chess. Right.
[10:16]
B
It doesn't have the pedigree of chess.
[10:18]
A
You know, backgammon involves gambling often too. Like you're betting on it. Like, I don't know, like old British men play it, I guess. I don't know who.
[10:25]
B
I know my mom played backgammon all
[10:27]
A
the time, but it is like a complicated game, right. And it. And it has some interesting. So the guy who decides to work on it, his name is Gerald Tesauro. He's at IBM Watson. And he just takes that rule, right. Which Sutton calls the temporal difference update. But there's a problem with backgammon versus tic tac toe. Maybe a problem versus every single game in the world versus tic tac toe. Do you know what it might be? They're just bigger. Like, tic tac toe is so small.
[10:53]
B
Oh, you mean like the possible outcomes.
[10:55]
A
The possible outcomes is just like astronomically larger. Right. So like tic tac toe. It's great for his paper because there's like less than 800 states, but. Yeah. Do you want to guess on the backgammon number?
[11:06]
B
I have no idea.
[11:06]
A
So this says it's a hundred quintillion possible states, which are. So that would be possible game states. Right? It's. If you played all possible games, if you needed to make this array to hold all the values of every possible
[11:19]
B
position in every possible game.
[11:21]
A
Yeah. So it would be. No, that's exactly what it is. It would be as many grains of sand on the beach or whatever. So you can't even enumerate it. Right. You can't make an array or a hash table.
[11:31]
B
You'll need a lot of memory.
[11:32]
A
Yeah. It's just not possible. And then like, to play that game forward and backward and update those all. It would take too long. Right. So he does something different. He just. He uses a neural network. So you can imagine that we have like in our tic tac toe, we have like a function, say, that takes in the like nine values, like your state of like X's and O's, and then it returns that number. He does that, but he doesn't actually, you know, come up with that much memory. He just puts this neural network in the middle that somehow returns a value. Right. And that way he can shrink the problem. So the network that he makes is tiny. Three layers in a neural network. A neural network has like kind of floats in it and then some conditions that make it either return, you know, a high value or a low value. I don't super know how it works, but he's using way less values than there are possible states. Right. And the idea there is it's kind of like compression, Right. If you need to store all the values of the game state, but instead of having just a bazillion values, you say like, oh, we only have this many, then it has to figure out like what the duplicate parts are. It has to, like pattern match. It's kind of like a JPEG compression, right? It's. Oh, all these games are similar. We can update this little spot at once.
[12:46]
B
It starts to see winning patterns from losing ones.
[12:48]
A
It just doesn't have that much memory, much like we don't. So it has to find ways like, oh, this is similar to that other time. Maybe this is like that. But he does the same thing, right? Have this thing play backgammon against itself and then when it gets a good result or a bad result, you know, update the value in the neural network. So what happens? Like, how can something actually get better when it's just playing against itself? You know what I mean? Like, it doesn't know backgammon. Like, how is it going to get good at it if it's.
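The swap from a lookup table to a small network can be sketched as follows. This is a toy under stated assumptions, not TD-Gammon's architecture: the class name `TinyValueNet`, the 9-cell tic-tac-toe encoding, the hidden size, the `tanh` units, and the learning rate are all invented for illustration. The point is only that a fixed, small set of weights replaces one stored value per position, and that the same "move toward the later value" update is applied to the weights instead of to a table entry.

```python
import math
import random

class TinyValueNet:
    """One hidden layer mapping a board encoding to a value in (-1, 1)."""

    def __init__(self, n_in=9, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                  for row in self.w1]
        out = math.tanh(sum(w * h for w, h in zip(self.w2, hidden)))
        return out, hidden

    def value(self, x):
        return self.forward(x)[0]

    def nudge_toward(self, x, target, lr=0.1):
        """One gradient step on 0.5 * (value(x) - target) ** 2."""
        out, hidden = self.forward(x)
        d_out = (out - target) * (1 - out * out)
        for j, h in enumerate(hidden):
            d_hidden = d_out * self.w2[j] * (1 - h * h)  # uses w2 before its update
            self.w2[j] -= lr * d_out * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hidden * xi

def encode(board):
    """Hypothetical encoding: X -> 1, O -> -1, empty -> 0."""
    return [{"X": 1.0, "O": -1.0, " ": 0.0}[cell] for cell in board]

net = TinyValueNet()
won_by_x = encode("XXX" + "OO " + "   ")
# During self-play the target would be the value of the next position,
# or the final result at the end of the game; here the target is a win for X.
for _ in range(200):
    net.nudge_toward(won_by_x, target=1.0)
print(round(net.value(won_by_x), 3))
```

Because every position shares the same 80 weights, an update for one position also shifts the estimates for similar positions, which is the compression effect Adam describes.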
[13:13]
B
Well, you get to know all the moves, right? If you haven't played it before and you play against yourself, but you know the rules, like, you have to at least know what the rules are to the game. And when you play against yourself, you start to recognize what to do in response to certain moves.
[13:27]
A
So this is a side project bit. He plays and he has this thing, like, it's 1992. Computers are still pretty slow, but he's got it off in the corner playing backgammon against itself. At this time, like, IBM Watson is this prestige lab of the AI era, like, for these expert systems of, like, writing down all the important rules. Like, this is the preeminent place in the world. But this guy's like doing his other thing, right? He's taking the Sutton guy's idea, having it run while they're all busy off working on more important expert projects. I think it's. You can guess what happens, like after
[13:58]
B
they let it play with it, like play against itself for a number of days, months. Like, how long did they spend.
[14:04]
A
Hundreds of thousands of games against itself.
[14:06]
B
Hundreds of thousands of games. Imagine they would have moved on to something else.
[14:10]
A
Well, I mean, maybe they did, but I mean, he's a researcher, so he wanted to write a paper about it and whatever. But anyway, so he brings in. He brings in backgammon players, he brings in experts, and they play it and they lose. So this thing has gotten better at backgammon than the real world players. And backgammon does have chance in it. So it's not totally definitive, given enough time, because, I mean, this is that guy's point. Reward is enough. Like, you get to the end and, oh, I lost this round against myself. Well, what did The. My. You know, what was the difference? And how can I. Like, it's the same. It's just a bigger scale of that tic tac toe thing. Right. It also just did interesting things. Much like chess, there's this whole theory behind backgammon and, like, when you should do this and when you should do that and what you should do in these situations. And it does different things. It hasn't read those books. Like, it's just the computer. Computer programmer.
[14:57]
B
It only knows the data that it's acquired by playing itself.
[15:00]
A
Yeah. And so it does things, playing them that they're like. And then it beats them. And they're like, oh, wait.
[15:05]
B
So it beats them because it's unconventional. Rather than trying to imitate humans, it develops its own sense of positional judgment by learning from experience and playing against itself.
[15:13]
A
And that's why it was so surprising to these backgammon folk, right. They're like, wow, this thing plays unusually and it's beating us. Right. So they brought in this guy, Bill Robertie, two-time world champion, top three player alive, who had written books on backgammon. And he's like. He says, like, yeah, this thing is better than we are. And we were wrong about things because it beat me in these ways. And so he's written all these books about how to play backgammon, and so he starts updating his book, Right? He's, like, playing against this thing, learning new moves. He's like, oh, yeah, we gotta change that. Cause it just beat me this way. So in its unique experience of, like, not being exposed, it's almost like, as you said, not being exposed to humans has actually benefited it. Right?
[15:55]
B
Yeah.
[15:56]
A
It's learned extra things.
[15:57]
B
It didn't. It didn't learn the game framed in any kind of way. It just kind of learned it through its own experience.
[16:04]
A
Yeah. So that was cool. They beat backgammon, and then, like, you would think the next thing that happens is this takes the world by storm. But it didn't. They just forgot about this idea.
[16:14]
B
They just forgot.
[16:15]
A
Yeah. Guess for how long?
[16:17]
B
I'm guessing they forgot because, like, nobody plays backgammon. So it didn't grab headlines to, like, your. Your layperson, but for how long? I. I mean, if it's. It's a successful project, I wouldn't imagine it would stay dormant for that long.
[16:30]
A
Yeah. So 20 years.
[16:32]
B
20. Okay. That was way out.
[16:33]
A
I mean, they had a much bigger project they were working on, which was Deep Blue was like beating the chess expert and using these expert systems and, like, the researchers who were building this amazing chess thing. They were like, I don't know, that's just backgammon. It's weird. Like, of course it works with that method. Like, because it's. I don't know, who cares? Like, yeah, it's. It's like saying some. It's like a result they didn't care about. Right. They're like, we're not interested in this way of solving things and we're not interested in this game.
[17:00]
B
So they, they just dropped it for 20 years. Yeah. Like those researchers went on to other projects.
[17:05]
A
Yeah. I mean, I'm sure some people kicked this idea around, but this wasn't like the hotness. Right. It's like there's probably somebody out there still really into, like, Macromedia ColdFusion. But like, that's not, that's not the thing. Right. Like that maybe it had some cool ideas, but like, who knows? And so DeepMind, they, they published this paper in Nature. Yeah. This is from their paper.
[17:25]
B
Perhaps the best known success story of reinforcement learning is TD-Gammon, a backgammon-playing program which learned entirely by reinforcement learning and self-play and achieved a superhuman level of play. However, early attempts to follow up on TD-Gammon were less successful. So it's not that they were successful and then they got shut down, it's that they were less successful when they tried to expand beyond backgammon. And they're like, this is a dead end.
[17:47]
A
I guess so. And everybody gave up on it. So Sutton himself in 2017 gave a talk where he was kind of saying like, wtf?
[17:54]
B
Yeah. Temporal difference learning. Temporal difference learning. Temporal difference learning. It's a method for learning to predict. It's basically the center, the core of many methods. You know about Q-learning, Sarsa, TD lambda, deep Q-networks. TD-Gammon, the world champion backgammon player using deep reinforcement learning from 25 years ago. Oh my God. 25 years ago. Deep reinforcement learning, 1992.
[18:15]
A
I don't think you gave it very good emphasis, but like, he's pissed off, right? It's like he's the guy. You get where he's coming from, right. He's like, I had this thing.
[18:22]
B
Yeah. I mean, if he's. If he, he's probably pissed because he's like, I did this stuff like 25 years ago and nobody gave a crap. I mean, how often does that happen? Like on a very much smaller scale to a lot of, a lot of programmers in the field, right?
[18:35]
A
Like, listen, our database, like if we follow this trend, it's going to get too big. There'll be no instance size that can hold it. And, like, it'll destroy our business. And I've worked on a solution. It's just like, if you give me three days, like, it'll save us in 18 months.
[18:49]
B
And they're like, three days, though now. You could be working on this other thing for three days. Yeah, forget about that.
[18:54]
A
And then 18 months later, they're like, so we got this issue.
[18:58]
B
Can you, like, whip that up, like, right now?
[19:00]
A
And that's. That's when this guy's like, temporal difference learning. Like, I'm telling you.
[19:03]
B
I've been telling you this for 25.
[19:06]
A
Yeah. So then. Because what's happening at the same time, right, is this Deep Blue thing. Deep Blue. Do you remember the Deep Blue story?
[19:13]
B
I do, yeah.
[19:13]
A
I remember what happened with Deep Blue. Give me the. The summary.
[19:16]
B
That was the one that played chess, right? Yeah. And it started winning against grandmasters.
[19:21]
A
It was this big project to beat the world champion at chess as, like, publicity for. For IBM. And it was, like, part, like, oh, they had these very big supercomputers, but a big part of it was these expert systems, right?
[19:35]
B
Do you play chess?
[19:35]
A
My dad always wanted me to play chess, and he, like, taught me, but I would just lose to him. And then I think once I left, like, after university, I learned. And then occasionally, like, when I'd go home and visit him, I would play chess against him. And one time, I beat him, but I think he let me win. But that's like my experience. Like, from my perspective, like, chess has a lot of ranges. From my perspective, he was just way better than me. But chess has so many levels, right? That I'm sure, like, there's tons of people, like, a grandmaster, like, nobody can beat them. When I learned chess, there was this idea of, you know, certain pieces have certain values, right? So it's like you're sitting there and you're playing chess, and it's like, my dad's gonna do this move where my pawn takes his pawn, and then I could take it back with my knight, but then his other pawn could take my knight, right? And, you know, like, oh, the knight's more valuable. It has a value of three. And, like, the pawn only has a value of one. So, like, I shouldn't do that, because then, like, you're basically, like, adding up the value of each side. And so, like, a big part of playing chess is, like, this idea. It's like, you have these pieces and you know what's valuable, and it's like, okay, if I make this move, then what'll he do? And what are the values? So the interesting thing is, like, they've looked into this, right? Grandmasters actually don't play that many moves ahead, so they might play like three or four, right? But the thing is that the moves that they consider are always just the best moves. Just interesting.
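The piece-counting idea described above can be sketched in a few lines of Python. This is only an illustration: the knight and pawn values come from the conversation, the remaining values are the usual beginner numbers and are an assumption here, and the names `PIECE_VALUES` and `exchange_result` are made up for this sketch.

```python
# Beginner piece values. Knight = 3 and pawn = 1 are from the episode;
# the other entries are the common textbook values (an assumption here).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}


def exchange_result(captured, lost):
    """Net material for a sequence of trades, from my point of view.

    captured: pieces I take from the opponent
    lost:     pieces the opponent takes from me
    """
    gained = sum(PIECE_VALUES[p] for p in captured)
    given_up = sum(PIECE_VALUES[p] for p in lost)
    return gained - given_up


# Winning a pawn but giving up a knight for it leaves me two points down.
net = exchange_result(captured=["P"], lost=["N"])
print(net)
```

A negative result is the "I shouldn't do that" case from the conversation: the trade costs more material than it wins.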
[20:54]
B
Yeah. Like, they have names for all those, like, different openings.
[20:57]
A
Yeah. And they have the opening books, which is like, yeah. All these people who've played forward, like, the first whatever, like four or five moves on a chessboard, like, they've all been exhaustively done, and they know, like, which one is better and which one isn't. And so Deep Blue used this strategy, right? So it had this giant opening book of all, you know, the good ways to move forward. It had this giant book of end games. Like, if it's just down to these couple pieces, how exactly do you win it? Right? And then instead of what I was saying of, like, I just add up the pieces and determine the value. I mean, I guess not instead of like, it had something like that, but, like, IBM was throwing all this money at it. They brought in all these grandmasters, and they came up with, like, the most accurate way that they could come up with to, like, assess at this point in this game, right? What. What value? Like, how good am I doing? Right? And so what I described as, like, whatever, three or four rules for adding a point. Like, guess what theirs was. 20? It was 8,000.
[21:54]
B
Okay?
[21:55]
A
Like 8,000 rules.
[21:56]
B
I don't know why you ask these people these, like, arbitrary questions, like, different answers.
[22:00]
A
Well, no, because it's good, because you said 20. And I think the thing about 8,000 is it's absurd, right? But this was this expert system idea, right?
[22:10]
B
Which, 8,000 moves in advance?
[22:12]
A
Well, so the 8,000 isn't how many moves in advance. That's just how they figure out for any given game position, how. How valuable it is. Right? It's like you get plus one if you have this piece there. And like. Like just 8,000 different rules to say, like the equivalent of our number on tic tac toe. That's like, this is a 0.7 scoring system. It's the scoring system for evaluating any specific chess position, right? And then, like, as you said, right, like, you know, we play this and we think through, like, okay, if I move here, you move there, right? And then. For that state we'll use all those rules to figure out if this is advantageous. But then the other thing. Yes. Was like, Deep Blue was just like, I can look through a bazillion of these, right? So I have my 8,000 rules, and then I'll just play it out really quick, right? This was in 1997. Computers weren't as fast, but it would evaluate 200 million positions per second. And, like, it was this big boon for. For IBM, but also, like, for AI, they're like, ah, we hit the pinnacle, right? But in the meantime, like, Sutton, the. The tic tac toe guy, right? So he's working on a tech.
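The two pieces described above, a hand-written scoring function plus a fast look-ahead, fit together roughly as in the sketch below. It is not Deep Blue's code: the search is a plain depth-limited negamax, and the game is a hypothetical stick-taking toy (take 1 or 2 sticks, whoever takes the last stick wins) chosen only so the example runs on its own.

```python
def negamax(state, depth, legal_moves, apply_move, evaluate):
    """Score `state` for the player to move by looking `depth` plies ahead.

    When the look-ahead runs out (or the game is over), fall back on the
    hand-written `evaluate` function, the role the 8,000 rules played.
    """
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    # My best outcome is the worst outcome for my opponent, hence the minus.
    return max(
        -negamax(apply_move(state, m), depth - 1, legal_moves, apply_move, evaluate)
        for m in moves
    )


def best_move(state, depth, legal_moves, apply_move, evaluate):
    """Pick the move whose resulting position is worst for the opponent."""
    return max(
        legal_moves(state),
        key=lambda m: -negamax(
            apply_move(state, m), depth - 1, legal_moves, apply_move, evaluate
        ),
    )


# --- Hypothetical toy game, used only to exercise the search ---
def stick_moves(sticks):
    return [n for n in (1, 2) if n <= sticks]


def take(sticks, n):
    return sticks - n


def stick_eval(sticks):
    # No sticks left means the previous player took the last one,
    # so the player to move has lost. Otherwise: no opinion.
    return -1 if sticks == 0 else 0


print(best_move(4, 10, stick_moves, take, stick_eval))
```

In a real chess engine the `evaluate` function is where all the expert knowledge lives; the search itself stays generic.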
[23:12]
B
He's.
[23:12]
A
Yeah, he just keeps saying that whenever you talk to him, he's like, temporal. What is it? Temporal?
[23:16]
B
It was temporal difference learning. He's just screaming about that in the background.
[23:21]
A
So he writes a textbook, right? Him and Barto, who's like, the behaviorist with the chickens or the pigeons.
[23:26]
B
Pigeons.
[23:27]
A
Pigeons. So Reinforcement Learning: An Introduction comes out in 1998. And now we would say it's a very influential book. People use it and it's important and whatever. But then it was like, what? Whatever. Like, IBM is the thing and Deep Blue, and nobody cared, right? Everybody's working on something else, right? It's still like, he's like the ColdFusion guy, like, doing his Macromedia thing. Everybody's like, dude, what? What are you doing? But there's. There's somebody paying attention, right? The thing that IBM really figured out was this searching forward, right? It's like we're moving really fast through this game, and we can figure out, you know, chase down the tree of possible moves and find the best one, where the Sutton guy was like, you know, just play it to the end, and then we'll figure out who won and work backwards. But so somebody looks into this. In 2015, Demis Hassabis, and he. He founded the company DeepMind. So this is years later in 2024, he said this.
[24:25]
B
We bet on generality and learning. So those were always at the core of any techniques we would use. That's why we triangulated on reinforcement learning and search and deep learning as three types of algorithms that would scale, be very general, and not require a lot of handcrafted human priors. Oh, okay.
[24:42]
A
I mean, he's an expert in the field, and I'm some guy, but yeah. So DeepMind is this company, and they're gonna turn like, they're like, we think we can reinvigorate this, right? Like, we think this is important. Um, and so they come up with these Deep Q networks. Deep Q Network is just the same as the Sutton rule. It's this idea of, like, working backwards and applying these values. Right. And so it's the same ingredients. There is Sutton's update rule, and then there's a neural network. And then there's just a programming method, right. To like, play games and put together data so that it can learn. Network with like three layers, very small. But in the meantime, right. While all this stuff was happening, we talked about it in the LLM one actually, like, these dudes like Hinton and stuff at U of T, they figured out how to make really big neural networks and learn all this stuff. And so they decide to do Atari. Basically, we want to build an AI that can play Atari games. So they. They have 49 Atari games. And. But they have this problem, right, where you think of, like, if we do tic tac toe, it's like our state, you know, into our neural network or array is easy, right? It's like we have nine fields, like, either filled out or not. Right. But how do you send in, like where we are in a video game, like, into something, right? Like, it's hard to make a function call that's like the current state of my video game.
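The "update rule" ingredient mentioned above can be written down compactly. The sketch below is the tabular Q-learning form of a temporal difference update, not DeepMind's code: a Deep Q Network swaps the lookup table for a neural network and learns from stored past transitions, and the step size `alpha` and discount `gamma` defaults here are arbitrary assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.99):
    """One temporal difference (Q-learning) step on a lookup table.

    q maps (state, action) -> estimated future score.
    The estimate is nudged toward: reward now + discounted best estimate
    from the next state. That "work backwards" step is the whole trick.
    """
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)

# A move that ends the game with reward 1: its value moves toward 1.
q_update(q, "almost_won", "finish", 1.0, "game_over", [], alpha=0.5)

# An earlier move with no reward of its own inherits value from what follows.
q_update(q, "midgame", "setup", 0.0, "almost_won", ["finish"],
         alpha=0.5, gamma=1.0)

print(q[("almost_won", "finish")], q[("midgame", "setup")])
```

The state and action names are hypothetical labels; in the Atari setting the state would be the processed screen and the actions would be joystick inputs.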
[26:00]
B
Yeah. Because there's too many factors.
[26:02]
A
Yeah. But just do the simplest thing.
[26:04]
B
Well, I mean, like, you could do something like X, Y, Z for like, coordinates or.
[26:08]
A
Yeah. So they did. They took whatever the resolution of an Atari screen is and they blocked all the pixels. So they wouldn't send in every pixel value, but they would send in like, say like a five by five block, every five by five block. And they put it to grayscale. They took out the colors because, like, that doesn't actually matter, usually for.
[26:26]
B
Okay. So they like reduced the resolution and then sent the pixels in.
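A rough sketch of that preprocessing step, in plain Python so it runs without extra libraries. The 5 by 5 block size follows the figure used in the conversation and is not necessarily what the paper used; averaging the three color channels for grayscale and averaging each block are both simplifying assumptions.

```python
def to_grayscale(frame):
    """frame: rows of (r, g, b) tuples -> rows of single brightness values."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in frame]


def downsample(gray, block=5):
    """Replace every block x block tile with its average brightness."""
    height, width = len(gray), len(gray[0])
    out = []
    for top in range(0, height - block + 1, block):
        row = []
        for left in range(0, width - block + 1, block):
            tile = [
                gray[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            ]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out


# A hypothetical 10 x 10 frame of one flat color shrinks to a 2 x 2 grid.
frame = [[(30, 60, 90)] * 10 for _ in range(10)]
small = downsample(to_grayscale(frame), block=5)
print(len(small), len(small[0]))
```

The resulting grid of numbers is all the learner ever receives; nothing in it says "paddle" or "ball".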
[26:30]
A
It's like asking it to learn. It would just like. But it can't see the game. It can just see, like, you know, 0, 0, 1, 1, 1. Like, so the software they built, they made it play like a whole bunch of Atari games. And so its reward thing, right, is just the score. And what its input is is like all of the pixels. And then Atari has a joystick. So then it's just like, it can decide. It gets in all the pixels, and then based on its joystick, it can just, like, decide. Right, I'm gonna push this way or push that way. So this is their. This is like. This was pretty famous, I think, when it came out. So this is one of their videos.
[27:04]
B
It's Block Breaker. Like a. Like a very early Atari version of Block Breaker. So you've Got your, you've got your paddle at the bottom. You've got. You've got your, your ball in the middle, and you've got rows of blocks at the top, each of them a different color.
[27:17]
A
And so this is after it's played a hundred times. Right. So it's played a hundred times and learned some stuff. And it's just trying to.
[27:23]
B
Yeah, so the, the goal of it is always just to prevent the. The ball from slipping past you. So you've just got to intercept it.
[27:30]
A
Yeah, but it has to learn that because all it really knows is the score. It got something. It's basically randomly moving the stick around. Okay, here's 200. It's more like how I would play. Occasionally it, like, randomly moves the stick. Okay, now we're at 400.
[27:43]
B
400. It's getting a little bit more accurate.
[27:46]
A
It's catching it. It's getting it every time.
[27:48]
B
It's moving it to intercept.
[27:49]
A
Yeah. Okay. So it feels like it knows what it's doing. Like it's learned and. Okay, then the. What is the. The overlay here says at this point, the agent finds and exploits the best strategy of tunneling and then hitting the ball behind the wall. It's learned something after 600, it's.
[28:03]
B
It's learned a strategy of trying to get in behind all the blocks. Because then. So it must be so. Yeah, because it's counting the score. Right. So it's like, if I can do this, then the score will go up more without my intervention.
[28:15]
A
Yeah. So it's figured out this idea of like drilling a hole through all the blocks and then. Then it doesn't have to do anything. Right. The ball just bounces around back there and clears things. That was like a huge deal. Like there was a paper in Nature and it wasn't just that game. Right after that, DeepMind gets acquired by Google, which is nice because Google, much like IBM, has just like a bazillion computers. The thing about this type of training is you need something that can, like, play the games over and over. Right.
[28:43]
B
You need the hardware.
[28:44]
A
You need the hardware. And so that is when they decide to tackle. Right. Because I feel like if they were like, cool, we have this method, everybody said it was no good, but we made it work. And like, we're going to tackle chess, nobody would care. Like, chess has already been vanquished with this other tech.
[29:00]
B
Nobody wants to hear about that anymore. That's old news.
[29:01]
A
Yeah, we've moved on from that. So chess is already won. So they decided to do Go, and the thing about Go is it's incredibly hard. So at that point, there were no good AIs at playing Go, or very few. The reason that Go's interesting is because of, like, this kind of exponential explosion. When you play chess, like, there's maximum, like, 35 moves you can make each turn. And as we said, with Go there's much more, 250 possible moves at any point. Which means if you're trying to, like, play forward and build that tree, it's just astronomical. It's a very large board. You have a lot of pieces that you can place in lots of places. And then it takes a long time to play, and it's not till the end that it kind of all resolves and it's like, did you win or lose?
[29:42]
B
Human players probably haven't even played those moves, right? Because there's so many.
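The "exponential explosion" is easy to see with the two branching figures quoted above, roughly 35 choices per turn for chess and 250 for Go. The short calculation below just raises each to the number of moves looked ahead; the function name is made up for this sketch and the look-ahead depths are arbitrary.

```python
def positions_after(branching_factor, plies):
    """Rough size of the game tree after looking `plies` moves ahead."""
    return branching_factor ** plies


for plies in (2, 4, 8):
    chess = positions_after(35, plies)
    go = positions_after(250, plies)
    print(f"{plies} moves ahead: chess ~{chess:.2e}, Go ~{go:.2e}, "
          f"Go is ~{go / chess:.0f}x larger")
```

Each extra move of look-ahead multiplies the gap again, which is why the brute-force search that worked for chess did not carry over to Go.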
[29:46]
A
But humans are good at pattern recognition and somehow, you know, they learn to play this game and there are champions and, and they're good at it. But, like, it doesn't seem to fall to our normal techniques of like, oh, let's map this all out. These reinforcement learning guys came in who had been listening to Sutton with like, his temporal difference. Yeah, what is it again? Temporal difference learning. Temporal difference learning. They listen to him, right? And so they built their machine. But there was this guy, Rémi Coulom. He was an AI expert, and he had built the best Go program in the world. I think I have a quote from him.
[30:17]
B
So the quote is, "I think maybe 10 years, but I do not like to make predictions."
[30:21]
A
So he said it would be 10 years, but then 22 months later, they. They had beat this, right? AlphaGo beats the best Go players. That's awesome. But like, what is it? Right? Because I'm a software developer.
[30:33]
B
Yeah, you explained, like, the problems with, like, we can't. We can't play the game to the end. Like, we don't know what a good state is. So, like, how does it make the decisions then? On what? How does it quantify it?
[30:43]
A
Exactly. This is the question, right? And so I built. I built a version of Wordle. You know Wordle?
[30:50]
B
Yes, I think everybody knows Wordle.
[30:52]
A
Yeah, I never really played it, but it's my example. I built like Alpha Wordle, basically.
[30:57]
B
So if you haven't played the game, it's a five-letter word that has been chosen that you don't know what it is and you're trying to guess it. And you can put in a word and it will evaluate your guess and tell you, for each letter, whether that letter is in the word and in the right spot, it's in the word but in the wrong spot, or it's not in the word at all. And given that feedback, you have to then pick a new word, and you only get a finite amount of guesses. I think in this one, there's like six. And then after six guesses, if you haven't figured out the word, then you lose the game.
[31:27]
A
Here's a. Here's a Wordle game we're looking at. Let's guess. So I don't know. What should we get? So I'm going to pick the word slate. When you play Wordle, you can do the same thing as in chess. You can say, like, okay, if I play this move, there's not an opponent, but I can get back a score. Right. And so, like, if I get back and none of them match, then what? I've eliminated not very many positions, but some.
[31:50]
B
Because, I mean, there's a finite amount of. It's that it's quantifiable now.
[31:55]
A
Yeah. So you can kind of play it forward in the same way you would with chess. Right. So we're looking at a Wordle game. Right. And when I pick, I'm going to pick slate as my guess, and it was completely wrong, so it scored them.
[32:07]
B
And none of those letters are in the word.
[32:09]
A
Yeah. And so there's only a finite number of Wordle words, like, in their set, I guess there's 2,315. So, like, I. I played slate and nothing matched. But actually, that eliminates a lot. So this is saying in my little app I made here, before this, there was 2,315 possible words.
[32:25]
B
Yeah. Because you eliminated A and E. Right. Which is contained in a lot of English words.
[32:30]
A
Yeah. And then I can play this forward. Right. So, so say, to go back to Go. Right. At any given state in the game, I can tell how good I'm doing just based on how many possible words are left. And so that's like the one component that's hard in Go. Right. The other component is the playing forward. So now that we have the slate position, I can pick a new word. Like, let's say if I pick this birch, but before I select it. Right. So for any given word, I can do the same as in chess. Right. I can search forward. So I know, like, if I played the word birch.
[33:01]
B
Oh, I see. Then you just be like, oh, if I get this result, then yeah, that's because the. The result here has a finite amount of possibilities.
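The two Wordle ideas in this exchange, scoring a position by how many words are still possible and playing a guess forward over every result it could produce, can be sketched as below. This is not the host's app: the function names are made up, the four-word list is a tiny hypothetical stand-in for the real 2,315-word set, and "average words left" is one simple way to rate a guess among several.

```python
from collections import Counter


def feedback(guess, answer):
    """Per-letter result: g = right spot, y = in word but wrong spot, . = absent."""
    result = ["."] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "." and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)


def filter_candidates(candidates, guess, observed):
    """Keep only the words that would have produced the feedback we saw."""
    return [w for w in candidates if feedback(guess, w) == observed]


def expected_remaining(candidates, guess):
    """Play `guess` forward against every possible answer and average
    how many candidates would be left afterwards (lower is better)."""
    buckets = Counter(feedback(guess, w) for w in candidates)
    return sum(n * n for n in buckets.values()) / len(candidates)


words = ["birch", "slate", "crane", "pound"]  # hypothetical mini word list
left = filter_candidates(words, "slate", ".....")
print(left, expected_remaining(left, "birch"))
```

Because each guess has only a finite set of possible feedback patterns, the look-ahead is a bounded enumeration, which is the point Don makes above.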
[33:08]
A
Yeah, yeah. Okay, so this is kind of, this is my dumbed down version of how the AlphaGo works, right? So it has this neural network, right? And in the neural network, the, the weights in the neural network are basically from playing the game through all the way to the end, the. The same as tic tac toe. And it figured out what won and so it updated things, right? And so for the game, for. For its playing, it thinks this word arose is like its best guess. Probably it played that word at some point and won. And so it's like, oh, this is awesome, right? So this is its best plays. And then it has this value which it calls position strength. So for Wordle, it's saying my position strength is 3.644, which is basically. It thinks that it will win in about.
[33:52]
B
Yeah, about three. Yeah, about three guesses. Or, or four. Three or four.
[33:56]
A
So this is, these are like the two important parts of the machine learning, right? It's like it needs to know how good it's doing. This is kind of like the chess points, right? And then this is kind of like what move to make. And so it thinks that we should do arose. How do we do arose? But then the other important thing is we can, we can play it forward in our head. So before we submit our move, we can kind of, okay, if I pick that, what are the possible options, right? So it thought this arose was best based on its original playing, but then when it plays forward from that position, it decides like, no, irate is a better starting word. Like, I'm going to pick irate. So basically, AlphaGo, they've combined these two ideas, right? So Deep Blue was this one idea, sort of let's play forward into the future and figure out the values, right? Like anytime it's our position, we can kind of spread out and try to go quickly through all these rules and figure out what the best thing is. With this, which is like, then once we get to the end of the game, you know, we'll learn from that and we'll update these, which is kind of like Sutton's, like, tic tac toe. Does that make sense? It's like they came up with this idea that we'll have a neural network that tells us how good each position is. But how will it know that? Because we said, oh, in Go, it's super hard to know actually if this is a good position or not. And so they said like, well, we'll just, we'll do the Sutton trick. We'll play a game of Go. And when we get to the end we'll say like, hey, if we won, all the places that that happened are good.
[35:23]
B
So is this what sets them off? And he's, he's like
[35:29]
A
learning, but like that would take forever, right? Like, to do that, because there's just so many Go games.
[35:35]
B
So you have to constrain how many games you play maybe.
[35:38]
A
Well, so you, you combine it with this other idea which is the search, right. So when you're playing your game forward, right? So I played 10 games, right. And I have. So I think like these 10 random moves I did are good, right. And then I'm going to play my 11th game and before deciding what move to do, I'll play forward.
[35:57]
B
Yeah, you play it, quote, "in your head."
[35:59]
A
But we're doing this two stage learning, right? It's like when we get all the way to the end, we say all these things must be good. And then when we're playing in the future, we don't wait until the end to decide if it's a good move. We play it forward. And if it looks like any of the other ones that we did before, then that also must be good.
[36:16]
B
Like if a, like if a piece of it matches one of the winning scenarios and you're like, oh, this is
[36:21]
A
a good one, this must be good, right? We're learning faster. It's like we don't have to play all the way to the end of the go. So it's this double layer thing. It took me a long time to understand this. Yeah, I mean obviously it hasn't learned everything, right? Obviously it's playing like things that are not great, but it can play this situation forward, right. And you can keep playing it. Right. It can keep learning and it will get better. And in fact, like you can do this whole thing.
[36:44]
B
Yeah, you're like increasing the resolution of the prediction.
[36:49]
A
Wordle's not that hard. You can actually just figure out the perfect Wordle move by walking through every game. Like it's somewhat like tic tac toe and then it has a bottom. So now it's running a whole bunch of games and it's learning more. Hey, we won.
[36:59]
B
Yeah, you got it in the fourth guess.
[37:01]
A
Not bad, not bad. Alpha Wordle. This is their idea, right. And it took me a while to understand this. It's like these two things. Like one is playing against yourself or in the case of Wordle, there's not another competitor. It's just like they're just playing. But it's still the same idea. Right. It's like thinking through the moves and updating it and then this idea of searching forward. So they use, it's called Monte Carlo tree search. Like, but basically they're randomly playing forward in this tree of words. And which moves they decide to play forward in are basically the ones they think are like, oh, this seems like a good move. Let's play it forward, couple moves and see if it is.
[37:40]
B
It's like limited foresight with some randomness
[37:42]
A
because it's like you don't just want to play the moves you think are good forward and see if they are. Because, like, you start off knowing very little and you're probably wrong. So you have to kind of like,
[37:50]
B
you have to fail a lot.
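The balance described here, mostly following moves that look good while still trying ones that have barely been explored, is what the selection step of Monte Carlo tree search handles. The sketch below uses the classic UCB1 formula as a stand-in; AlphaGo's published selection rule is a different variant guided by its neural network, and the constant `c` and the move names are assumptions for illustration.

```python
import math


def ucb_score(total_value, visits, parent_visits, c=1.4):
    """Average result so far plus a bonus that shrinks as a move gets tried."""
    if visits == 0:
        return math.inf  # never tried: always worth one look
    average = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return average + bonus


def select_move(stats, c=1.4):
    """stats maps move -> (total_value, visits). Return the move to explore next."""
    parent_visits = max(1, sum(visits for _, visits in stats.values()))
    return max(
        stats,
        key=lambda m: ucb_score(stats[m][0], stats[m][1], parent_visits, c),
    )


# A move that has looked good 9 times out of 10 versus one never tried:
print(select_move({"arose": (9.0, 10), "irate": (0.0, 0)}))

# Once both have been tried equally, the better average wins.
print(select_move({"arose": (9.0, 10), "irate": (1.0, 10)}))
```

Early on the bonus term dominates, which is the "you start off knowing very little and you're probably wrong" case; as visit counts grow, the search settles on the moves with the best track record.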
[37:50]
A
Yeah. So they do this and they train and then, yeah, they're going to play this big game against this Lee Sedol, who's the champion. Yeah. So they win the first game against him. At this time, it's inconceivable that they win at Go, right? Like, obviously we know that they won, but I was part of this thing called the Human Judgment Project at the time. And it's like they tried to train experts to bet on outcomes.
[38:12]
B
Yeah, I remember you talking about that. It was because we, we both read that book about super predicting.
[38:18]
A
Yeah, super predicting. Right. So I tried to join that guy's project and like you could bet on these things. And this was one I bet like, I was like, there's no way, like they've determined it will be like a million years before they're good at this. So I bet like, no, Google will lose. And then they won the first game, like they crushed them. And then during the second game, I don't know what's happening. Like they're playing at the moves and I don't super understand things well, but AlphaGo is playing white and Lee Sedol is playing black. And it's pretty early in the game. The position's pretty open and the commentators, because it's like a live broadcast thing, they say that Lee Sedol is, is doing quite well, maybe even a bit ahead, but it's still anyone's game and it's only game two, so they don't really know. So then AlphaGo plays a shoulder hit on the fifth line, which I guess
[39:01]
B
is like, oh my God, a shoulder
[39:02]
A
hit on the fifth line.
[39:03]
B
No.
[39:04]
A
Yeah. And so people don't really know what to do. Yeah. So it doesn't mean anything to me. Like, I don't know what a shoulder hit on the fifth line is, but Go is like a 19 by 19 board and people put their stones down around, yeah. And the. The basic rule you learn, right. Like, it's just like a basic rule is in the opening, in the middle, you play on the third and fourth lines. You don't play on the fifth line. And so AlphaGo played on the fifth line, which is away from the center of the action. And I guess that's for the end game.
[39:30]
B
Yeah. I mean, we'd have to look up Go terminology, but, like, if. If. If it was doing that, there would have to be a. There would be a factor there that would throw off the human competitor because they've never played against somebody who did that.
[39:43]
A
The same as the backgammon. When the.
[39:44]
B
Yeah. It's like, what's going on?
[39:46]
A
What are you doing? So Lee Sedol, who's like, he's the best at this, but. And so he only plays other, like, amazing. The best people. Yeah. What do you think he does?
[39:55]
B
I mean, I think that he would probably, like, he doesn't have a strategy for that because he hasn't played against anybody who's made those moves. So it would. He would have to train. There's no counter for it.
[40:03]
A
It's funny because we know nothing about Go. So it's like. But I don't understand the rules, but I assume they each have so much time to make a move. And what Lee Sedol does is he. He just gets up and leaves.
[40:13]
B
He just left.
[40:14]
A
Yeah.
[40:14]
B
He didn't finish the game.
[40:15]
A
He just walked. Like it's a televised thing.
[40:17]
B
And he's like, like, why did he get out?
[40:18]
A
I mean, I'm sure he was just shocked. He didn't know what, you know, wouldn't
[40:21]
B
you just keep playing? Like, if I was playing chess against somebody and, you know, they made a weird, unconventional move, I just like, well, I mean, I guess. And then I would just be playing with my own strategy.
[40:30]
A
Yeah. And it's a live broadcast thing. And he left for 15 minutes. He just walks off. And it's not like they have AlphaGo. Why don't you entertain us? Like, no, that's a computer. Right. So the commentators on the live broadcast, they don't really know what to say. Like, they're basically like, I think this is a mistake. I think AlphaGo is off its rails. What should we do? Um, and so an important thing, right, is AlphaGo has these two systems, right? It has the one that tells it the move it should make. Right. And then the other one that we saw that kind of, like, plays forward, what's going to happen?
[41:02]
B
Right.
[41:02]
A
And so this. This weird move that it made, the thing that said, like, what you should do next? Basically said, you should never do that. Right? But then they played forward, the rules, and all of a sudden it looked really good. And so it was a rule. It was like you, like, it knew the same as we did, or Lee Sedol knew. Like, this is not a conventional move. But then it played forward and it was like, interesting.
[41:24]
B
This is a good strategy.
[41:25]
A
Okay, and here is the architect of AlphaGo, who is actually David Silver. He was running the team at DeepMind. Here's what he said was happening at that moment.
[41:34]
B
The professional commentators almost unanimously said that not a single human player would have chosen move 37. And then we found out that AlphaGo said that there was a 1 in 10,000 probability that a human would have played that move. So it went beyond its human guide.
[41:48]
A
Then Lee Sedol comes back and he. He sits down, I guess he isn't gone. And they start playing and like, yeah, the Go games are long. Like, this was considered early game, but it's move 37. Right? But over the next 50 moves, everybody starts seeing what AlphaGo's doing, right? Like this move starts to make sense and it starts to be like a dominant factor and they're like, oh, I get it now. Right? It starts crystallizing in everybody's mind who understands Go.
[42:15]
B
Like, so is that like a move now? Like, now people like, play to the outside or what?
[42:20]
A
So he loses the game. Right? And yes. I mean, I don't know for this specific game, but I know that, much like backgammon, the rules shifted. People start being able to play against these machines that have new techniques and then they learn from them. Right. It's like we were based on our own strategies that we learned over time when we had these biases. But the computer's out playing random games and discovers new things that we didn't know. Right. AlphaGo wins the match. 4:1. So Lee Sedol won one of them. Right?
[42:48]
B
4:1's pretty decisive though.
[42:50]
A
Yeah. And that was March 2016. But people aren't willing to give it up. Right. And so this fact comes out that, okay, AlphaGo did actually have some human input into it. So there is this online thing called the KGS Go Server. It's been around since the 90s. And people play Go on it. Yeah. How many was it? Millions of games of humans playing that it had trained off. So people were like, well, you said the machine has beat us, but maybe it's just remembering, you know, user 5639 played move 37 on a Tuesday in 1993. And it's just remembering it, right? You see the same things with LLMs, right? They came up with a strategy to overcome this, which was let's do this again, let's build this again. But we won't include the human, like throw it out, right? We'll start again. So it's like the same machine as Wordle, right? It's like if in my Wordle I originally played forward and I played a whole bunch of games and then I start training it and they're like, oh, maybe it learned from me. So then they do this again. They call it AlphaGo Zero. It's a good name, like Zero because it's like zero human knowledge. So now they play their AlphaGo Zero against the one that beat him and it. And it crushes it. So they built an even better one. And so this, the whole thing playing all these games forward, it only took three days because they're Google and they can just run so much in parallel and play all these millions of games at once, which is, it's kind of crushing, right? Because you could dedicate your life to being good at Go. And we're like, well, we took this thing, didn't even know what Go was.
[44:17]
B
That pigeon.
[44:20]
A
Yeah. No, it's just like we're giving it and it's like dawn in a box. So it's like running at the super fast speed in parallel and it learns. Here is how DeepMind announced it in the Nature cover in October 2017.
[44:32]
B
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Starting tabula rasa, the new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo. Tabula Rasa was also a video game that Richard Garriott tried to make, the guy who made Ultima Online. And it didn't really get off the ground and then got scrapped.
[44:56]
A
Do you know what it means?
[44:57]
B
Plain slate? I think?
[44:58]
A
Yeah, yeah, yeah. Blank slate.
[45:00]
B
Blank slate.
[45:01]
A
But yeah, it's like because it knew nothing, right? It like it started from absolutely nothing. Okay, so they removed the human games and it worked. But then they're like, hey, we should remove more things, right? Like, what more can we remove? What do you think?
[45:16]
B
What more can they remove?
[45:17]
A
So they did the Atari move. They basically removed the rules of Go. It doesn't know how to play the game, much like my Wordle guy.
[45:25]
B
It doesn't know that you can't play the same letter that you've already discounted.
[45:28]
A
So they're like, what if all it gets is, like, whether it won or lost, a score?
[45:33]
B
Yeah.
[45:33]
A
And then it's like, figures it out the same way the Atari thing did.
[45:36]
B
Yeah. So it just needs a win condition.
[45:38]
A
Yeah. And then once they built that version, they had it play a whole bunch of games. So they train it on Go, they train it on shogi, which is some sort of Japanese chess-like game. They train it on chess. And so then, once they built their chess one, they have it play against Stockfish. And Stockfish is like the most powerful chess thing at that point, that, like, grandmasters learn against. And it's got all this custom stuff. And then, like, yeah, it crushes Stockfish. And, like, they're making a point, right? Like, they're doing the Sutton thing. I mean, I think they're also drumming up publicity, but they're like, this thing knows nothing. All it knows is whether it won or lost the game. And you can build the most complicated thing in the world, and you give me some time on Google's compute things, in a couple days, like, we'll crush you. They're making a very clear point that they can just learn it. All right. Yeah. So that one was called MuZero, like mu because it doesn't know any rules. Like, they've removed even more. But meantime, since all this has happened, Sutton comes out and he writes kind of a manifesto. I feel like it's always good when you can say, I'm writing a manifesto.
[46:43]
B
Sometimes they're good, sometimes they're bad.
[46:45]
A
Like, he could go Unabomber.
[46:46]
B
Yeah, it's. It's risky.
[46:49]
A
So he writes this short essay, March 2019, that he posts on his personal website, and it's three pages long. And he calls it The Bitter Lesson. And so for Deep Blue, right, the problem was that they had to have that evaluation function, right? They needed to be able to say, like, how good is this chess position? You will hear the people who are big onto AI right now, and who are building LLMs and all that, you'll hear them talk about the bitter lesson all the time, and they'll say things like, are we sufficiently bitter-lesson-pilled? Basically, like, you know, in The Matrix, where it's like, take the two pills. It's like, have we sufficiently taken the bitter lesson pill? Right. Like, the bitter lesson that this Sutton guy wrote has become, like, the important rule that all these people believe in. Right. And so here's how his three pages, that becomes this big thing on his website, opens.
[47:42]
B
"The biggest lesson that we can read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
[47:52]
A
General methods, right, that leverage computation. That's his whole point. I mean, it's a larger essay, but he's saying none of your clever methods. Right? We don't need 8,000 rules for how to evaluate a chess thing. We just do this tic-tac-toe thing. Just let it go. Like, it has a reward, right? Like, let it go. But when he says a large margin, he means, like, extremely large. Like, whatever you got, however clever you can be, like, building a complicated system, it will be beat if we can just figure out a way to get a reward signal and just turn the crank. It's basically saying, stop being clever.
[48:27]
B
Yeah, because like computation has reached such a level that if you just can provide it with a reward condition, then it can play out massive amounts of scenarios and create its own kind of guide, right? So you don't need to actually provide it with any rules. Your, your rules might actually be constrained,
[48:45]
A
move 37 or whatever it was, or the backgammon thing that people are like, that's not the way you do it. But they were right, because, like, Deep Blue had all these rules, but then, you know, Stockfish had less and was stronger. And then AlphaZero had, like, no rules and, like, crushed it. Because it can learn all the rules itself, right? It's just simple algorithms, but they're these learning, like, meta algorithms. They win, right? And so he was salty about this. He calls them sore losers, and he wins a Turing Award. And then he starts using this method everywhere, right? He uses it to get good at speech recognition. So there was, like, 30 years of building amazing speech recognition systems, you know, learning phonetic rules, and then people just come up with this deep learning method, just like, hey, we learned how speech works by just throwing lots of compute at it. And then right in the middle, in a paragraph about Go, he says this thing that kind of summarizes this whole episode, right? This thing that I'm trying to get at. So read the next quote.
[49:46]
B
"When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that 'brute force' search may have won this time, but it was not a general strategy, and anyway it was not how people played chess." So, "learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are two of the most important classes of techniques for utilizing massive amounts of computation in AI research."
[50:14]
A
Yeah, because he's saying, like, a program can make its own data. Like, we talked about this when we did the LLM episode. Like, oh, they're like, we're running out of data. But, like, here it makes its own data. You just, like, let it run as long as it can figure out whether it's won or lost.
[50:28]
B
But you still have to. Yeah, you still have to provide it with what's good and what's bad, right? Like, what's a positive and what's a negative condition, so that it knows how to evaluate.
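[Editor's sketch] The two points above, that a program can generate its own training data through self-play and that it still needs a win condition, can be illustrated with a toy example. This is a minimal sketch, not DeepMind's or Sutton's code. It assumes a simple take-away game (Nim: players alternately take 1 to 3 stones, and whoever takes the last stone wins) in place of Go or tic-tac-toe, and the names `train_nim` and `best_move` are hypothetical. The only thing the program is told is that running out of stones on your turn means you lost; every other value is learned by nudging each position's estimate toward the value of the position that followed it, a tabular temporal-difference update.

```python
import random


def train_nim(max_pile=12, episodes=20000, alpha=0.1, epsilon=0.2, seed=0):
    """Learn position values for Nim purely from self-play.

    value[n] estimates the chance that the player about to move wins
    when n stones remain. The only built-in knowledge is the win
    condition: value[0] = 0, because facing an empty pile means the
    other player just took the last stone.
    """
    rng = random.Random(seed)
    value = [0.5] * (max_pile + 1)  # start with no opinion about any position
    value[0] = 0.0                  # the reward signal: you lost

    for _ in range(episodes):
        pile = rng.randint(1, max_pile)  # random start so every position gets visited
        while pile > 0:
            moves = [k for k in (1, 2, 3) if k <= pile]
            # Greedy move: leave the opponent in the position they like least.
            greedy = min(moves, key=lambda k: value[pile - k])
            explore = rng.random() < epsilon
            take = rng.choice(moves) if explore else greedy
            if not explore:
                # Temporal-difference update: my chances here should match
                # one minus my opponent's chances in the position I leave.
                target = 1.0 - value[pile - take]
                value[pile] += alpha * (target - value[pile])
            pile -= take  # the same table now plays the other side
    return value


def best_move(value, pile):
    """Return the number of stones the learned table prefers to take."""
    moves = [k for k in (1, 2, 3) if k <= pile]
    return min(moves, key=lambda k: value[pile - k])


if __name__ == "__main__":
    table = train_nim()
    for pile in (5, 6, 7):
        print(pile, "stones -> take", best_move(table, pile))
```

Nobody tells the program the known strategy for this game, which is to leave your opponent a multiple of four stones. If the sketch behaves as intended, that pattern shows up in the learned table on its own, from nothing but the win condition and repeated play against itself.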
[50:36]
A
And that idea, right, that came from Sutton. Like, they were taking his tic-tac-toe idea. I mean, you have to squint and say, like, well, maybe math. Like, they're like, well, math could be like a tic-tac-toe thing, right? We'll just randomly guess and we'll learn backwards. But this is his idea, right? He's finally, he's in. His brilliance has been recognized. There's, like, 20 years of objections to this idea of his, of self-play, right? But the bitter lesson, the reason he called it The Bitter Lesson, actually, you have the quote. Tell me why it's called The Bitter Lesson.
[51:05]
B
We have to learn the bitter lesson that building in how we think we think does not work in the long
[51:09]
A
run, which is super confusing: building in how we think we think. But what he's saying is, like, don't try to, like, reflect.
[51:15]
B
Don't ascribe your human conceptual models onto the AI because it doesn't think the way that a human brain does or
[51:22]
A
it doesn't need to. Right? You don't need to record the way the best chess player plays chess. You can just let it figure it out on its own, right? Like, don't give it how we think we think. Just let it run. And it's bitter because, well, you know, you're the greatest AI person or software developer and you build this complicated system and all these rules and it's getting better and better. And he's saying, like, no, actually, don't be clever. Just throw compute and reward at it and it will be guaranteed better than whatever you can come up with. Like, it's a bitter pill to swallow to say, like, whatever cleverness you, a human, can come up with will not be as good as us just figuring out where the reward signal is and giving it unlimited compute. Right. It's kind of saying, like, let the machine win. Like, the machine is better at this than you will ever be. Stop trying to teach it things. Just let it learn. So then, October 2021, in the journal Artificial Intelligence, he publishes this thing that I texted you, right? Reward is Enough. And so one of the people on this one is Sutton, right? David Silver is another person on it. He was the guy, the lead architect of AlphaGo and AlphaZero and MuZero, I think. But, yeah, the last guy is Sutton, and it's the same Sutton that's all the way through. And in Reward is Enough, they make this claim, like, even larger. Do you want to read that?
[52:39]
B
Yeah. "Reward is enough to drive behavior that exhibits abilities studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language, generalization, and imitation."
[52:49]
A
Now he's definitively saying, like, reward is not just enough for beating somebody at chess. Reward is enough for knowledge, for learning, for perception, for social intelligence, for language, for imitation. He's basically saying that's it. Like, all of intelligence is contained in feeding something a reward and seeing what it learns.
[53:09]
B
Yeah. I mean, again, it's. It's still reductive because we do undertake some endeavors not for rewards. Like, I don't think there's a tangible reward to figuring out what dark energy is, but it's something we haven't discovered. How would you ascribe a win condition to that?
[53:21]
A
And Skinner. So Skinner, you know, was a psychologist, and his methods got left behind a lot, although they were powerful, because, you know, you think of Freud and, like, wondering why people do things. Like, Skinner never wondered about that. He was like, no, you press the button, you get a piece of grain. Right.
[53:37]
B
It's enough for a pigeon. Yeah.
[53:38]
A
But he had nothing to say about individuals and how they worked, because people to him were like a black box. Right. You give a reward and then the pigeon does the thing.
[53:46]
B
And people aren't pigeons. Right. They're a little bit more complex.
[53:48]
A
But the interesting thing is these AIs are black boxes as well. Right. Like, we don't know what's going on inside the neural net. Nobody cares. So it's very interesting how it actually follows from this guy. He's like, I don't care what the pigeon thinks about it. It's like, I trained it to do the bomb guiding. It's like, I don't care what the neural network does now. It's just better at chess than you will be. And, I don't know, it's many years later, and he won the Turing Award, and he wrote this, and it's affected the whole world. But he went on this podcast about AI recently, 2025, he went on the Dwarkesh Patel podcast.
[54:22]
B
Yeah, I know.
[54:23]
A
Yeah. And so he interviewed him because he's like, oh, my God, look at AI so big. And you finally got your comeuppance. You know, what do you think? Like, we've all finally learned your bitter lesson, and we're building these things. Will you read it?
[54:35]
B
Large language models are about mimicking people, doing what people say you should do. They're not about figuring out what to do. They have the ability to predict what a person would say. They don't have the ability to predict what will happen.
[54:47]
A
So they thought they learned all his bitter lesson, but he's saying, like, no, man, you built an LLM by consuming all of human knowledge. That's not AlphaGo Zero. Right. You actually just built something that's pretending to be human. Like, his way would be, like, it starts with nothing. By his thinking, then, we would be at the Deep Blue era of AI. It's really good at doing all these things, but it's encoded all this human stuff, and the next step should be, like, can we get rid of all the human stuff? Because right now, the LLM isn't going to do the move 37. Right. It's trained on human knowledge.
[55:21]
B
Yeah. It's also trained on, like, what it thinks you want. Like, I think that he's on the right track for, like, a general artificial intelligence. Because he's talking more about, like, don't teach it the content. Teach it how to discover the content, to learn.
[55:36]
A
Teach it how to learn. This guy who worked with him and who worked on the AlphaGo team, Julian Schrittwieser, he's at Anthropic working on LLMs, building large language models. And he said, after that statement, he said, like, oh, no, you're actually wrong, Sutton, because as we discussed the other time, these LLMs, they have these RL loops in them now, where they're, like, teaching them reward based on math and based on whatever. So he said, no, actually, Sutton, you're wrong. These LLMs are doing exactly what you're saying. I guess they're built on human knowledge, but we're still giving them this reward loop and they're learning. Okay, but now I'm gonna go dark on this whole thing, right? Because, like, I feel like this goes further than all of that. Right. Because he's saying, you know, reward is enough. Well, enough for what? He's saying, well, it's enough for everything.
[56:25]
B
Like, I think that there's room for nuance there. I mean, most of the things we do for reward, some things we don't have a discernible reward, right? Like, we eat to stay alive, we work to make money so we can be alive. Right. But there's like, pursuits that don't have any kind of discernible reward or reason why someone would do them.
[56:40]
A
True. But I feel like what he's saying is, like, if you can come up with the scoreboard, which is basically what the reward is, right? It's like you need a way to score something, then the computers will beat you. Right. And so it was easy to come up with a score for chess, and now the computers beat us. It used to be this ImageNet. It was in the last episode. It was like identifying animals in images, and computers couldn't do it. But then they built this big data set and they were like, it has to guess and say, like, is this a cat? Is this a dog? Once they had that scoreboard, computers just learned and got better at it. And so they're way better at identifying things in images than humans are. But now there's, like, SWE-bench, right? It's like a benchmark for programming tasks. It's like you pull down this task and it's like an open source bug and you have to solve it, and then it scores you on whether you're good or bad. Problem is, that's a benchmark that now a computer can do its reward game on. And so, like, the larger bitter lesson is, like, we're in trouble if there's a reward signal. Like SWE-bench: like, a computer will be able to dominate us at it. The dark version of "reward is enough" is that reward is enough for a computer to crush you at what you do. And that's the bitter pill. So for 70 years, the job was to be clever, right? Like, I'm the best plumber at solving a problem, or I'm the best software developer. But I feel like when I try to extend his analogy, it's like Sutton's whole life says that these things are actually clever without us, now that we've built them. Like, the advantage of being clever is kind of escaping us, that what's valuable isn't what you know anymore. Like, it feels like the world's changing. I don't know. It feels bleak, right?
[58:21]
B
I mean, it could be, yeah. I mean, rewards are also dynamic. They change a lot sometimes, you know, like, what is today's reward might be different tomorrow by being able to redefine it. I think the thing is, like, when we work on something, we define the rewards. It comes from within our own kind of minds. It has yet to determine its own rewards.
[58:44]
A
So that was the show. I know I exhausted Don. So you walked in not knowing, actually. So I walked in not knowing what subserving meant. Like, I had just never heard that term. Oh, yeah, subservient I had heard, but I think we figured it all out. Whether we agree with it or not, I think, is a little bit undecided. But yeah. Thank you, Don, for letting me go through this all with you. Here's what I would say about Sutton. I think that he's right. If you can come up with these rewards in this box system, the computer will win. But as humans, we just come up with new boxes. And so this guy who did the backgammon, right, after the computer beat him at backgammon, like, he put out a new book. Right. Or you invent a new game. And the humans, you know, the way it is now, like, we're setting the limits of what these boxes are. Right. And so I think, like, you just have to keep learning and growing. If you try to compete head on head at a limited game with the AIs that we have right now, you will be beat.
[59:39]
B
Yeah. Don't worry about it. When it can create its own box.
[59:42]
A
Yeah, yeah. You just make a new box. Right. I'm not going to try to write the most performant SQL anymore. Like, I'll just ask Claude Code. It'll be better than me, but I'm building the things.
[59:52]
B
Yeah.
[59:53]
A
Yeah. So until next time. Yeah. Stay curious. Learn new things. Thank you so much for listening.