Summary11 min read

Dwarkesh Podcast: Eric Jang – Building AlphaGo from Scratch

Date: May 15, 2026
Host: Dwarkesh Patel
Guest: Eric Jang (ex-VP of AI, 1x Technologies; ex-DeepMind Robotics)
Theme: Reverse-engineering and rebuilding AlphaGo, what it reveals about AI research, and broader lessons for ML, RL, and future automated scientific discovery.

Episode Overview

This episode is a deep-dive "blackboard lecture" with Eric Jang, exploring the core ideas behind AlphaGo—the landmark AI system that mastered the game of Go—and their relevance to both current AI advancements and the future of research automation. Eric describes his hands-on project of building a new AlphaGo from scratch during his sabbatical, sharing not only technical insights and reflections but also drawing parallels between Go, LLMs, and automated research agents. The discussion unpacks how AlphaGo works from first principles—covering game theory, Monte Carlo Tree Search (MCTS), deep learning architectures, reinforcement learning (RL) strategies, and scientific lessons for both human and automated researchers.

Key Discussion Points & Insights

1. Why Build AlphaGo from Scratch?

[00:00–02:14]

Eric's motivation: AlphaGo's success in Go was "profound" for its demonstration that deep learning could "amortize" extremely deep search processes in a way previously thought computationally intractable.
Training AlphaGo was once the domain of huge teams and compute clusters; now, open-source implementations (e.g., KataGo) can be trained "for a few thousand dollars" thanks to improved hardware and tooling.
Go remains a uniquely rich problem for exploring fundamental AI questions—search, learning, decision-making under complexity, and scaling laws.

"When I saw the kind of early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was just profound to see how smart AI systems could become and the kind of computational complexity class that they could tackle with deep learning." — Eric Jang [00:33]

2. The Rules & Beauty of Go

[02:28–08:14]

Go basics: Two players (black and white), aiming to surround and control territory; capturing stones by surrounding them; different rule sets (Chinese, Japanese, Tromp-Taylor).
Subtleties for AI: Human scoring can be ambiguous; Tromp-Taylor rules are unambiguous and thus standard for computer Go.
Endgame scoring, passing, and the subtleties in how territory is agreed upon — echoes the idea of value functions later in neural networks.

"This is what makes Go a very beautiful game, is that you can kind of lose the battle, but win the war. And as the board size increases, the complexity of these kind of like micro versus macro dynamics gets more interesting." — Eric Jang [04:52]

3. The Core Challenge: Combinatorial Explosion in Go

[08:17–14:14]

The game tree: A 19x19 Go board yields an astronomically vast tree (effectively, 361^~300 possibilities).
Search problem: Brute-force search infeasible; even pruning by symmetries or move order, the space is intimidating.
Reducing search: Merges, symmetries, and child node equivalence reduce the tree, but not enough for tractable exhaustive search.

"This is a very, very, very large tree. And this is also why computer scientists for many years thought that Go was not a tractable problem." — Eric [11:50]

4. Monte Carlo Tree Search (MCTS) and Exploration–Exploitation

[11:21–21:21]

Classic tree search: UCB1 (Upper Confidence Bound) bandits inspired action selection; selects actions as argmax(Q + exploration_bonus).
PUCT variant: Used in AlphaGo (“predicted upper confidence with trees”) adds policy prior to guide search, balancing visited/novel nodes.
Probabilities in a deterministic game?: Probability arises in MC search rollouts, not in the game's inherent rules. Explains how MCTS can assign expected values by sampling over possible futures.
Backing up values: Terminal leaf nodes are given win/loss values, recursively backing up means to parent nodes.

"The motivation for UCB was to... bound your regret in terms of how wrong you can possibly be... The exploration term rewards taking actions that you haven't taken before." — Eric [19:45]

5. The Value Function and Human Intuition

[24:34–27:12]

Value function: Predicts win probability from a given board state; acts as a shortcut for deep simulation. Humans "amortize a huge number of possible game playouts and take that average” immediately—a useful analogy for NNs.
Why is AlphaGo possible: Because neural networks can be trained to approximate complex value judgments that once required searching massive trees.

"A neural network in a human can somehow do all of this simulation at a glance... this gives us a hint that in games like Go, there are ways to radically speed up the search process. And this is one of the fundamental intuitions behind why AlphaGo works." — Eric [25:00; 26:41]

6. Neural Network Architecture: Policy and Value Heads

[32:03–37:56]

Two heads:
- Policy network: Predicts distribution over possible moves.
- Value network: Predicts win probability for the board state.
Architectures:
- ResNets outperform transformers in low-data regimes for Go; transformers may help in global feature aggregation, but not yet surpass ResNets.
- Encoding is similar to multi-channel images: one channel per player and possibly an "empty" or mask channel.
Training regime:
- Pre-training on expert human games is quicker, more stable; tabula rasa (AlphaZero) self-play is harder but possible.
- Even a simple (10-layer) neural network can play well with the right initialization.

"Transformers start to outperform residual convolutional networks when you want more global context. ...I've tried very hard to make Transformers work for this problem... but I actually haven't figured out a way to make Transformers better than ResNets for now." — Eric [33:23/35:01]

7. Putting It Together: AlphaGo’s MCTS+NN Loop

[41:42–57:38]

Per move, MCTS builds a partial tree from the current state, guided by the policy prior and value estimates, to select the best action.
Four-step MCTS loop:
- Selection: Pick nodes using PUCT.
- Expansion: Expand new nodes given by policy.
- Evaluation: Score using the value network (early AlphaGo also averaged with rollout playout, since dropped).
- Backup: Recursively propagate value estimates up the tree.
Training: After each game, use MCTS-improved actions as new targets for the policy network; the policy learns to imitate the improved move distribution, so it starts each new game closer to optimal play—requiring less search per move for the same strength.

"The beauty of how AlphaGo trains itself is that it actually can take this final search process... and tell the policy network: hey, instead of having MCTS do all this legwork... why don't you just predict that from the get go?" — Eric [62:28]

8. Distillation, Test-Time Scaling, and the Dagger Analogy

[64:46–73:18]

Distillation: After improving a move with search, distill MCTS’ outputs into the policy head—"amortizing" search into the network.
Dagger analogy: MCTS acts like an imitation teacher, always proposing strictly better actions at every state (given a good value function).
Data bootstrapping: Value function is easy to train at endgames; harder in the midgame. Bootstrapping with smaller boards or expert data is key for robust initial training.
Test-time scaling law: There’s a tradeoff between test-time compute (number of MCTS simulations) and training (stronger policy nets) to achieve equivalent performance.

"MCTS should get more confident about specific actions than others... After applying MCTS process, your policy recommended distribution looks like this. It’s a bit more peaky than the previous one..." — Eric [62:28]
"You want to first make sure that [your value function] is good before you invest a lot of cycles doing MCTS." — Eric [71:32]

9. Comparing RL Paradigms: AlphaGo’s Supervised Signal vs. Policy Gradients

[74:59–97:08]

AlphaGo’s learning: No TD error, Q-learning, or explicit dynamic programming; just supervised learning on improved policy/value labels. This gives much lower variance and better sample efficiency, sidestepping the exploration problem of classic RL.
Naive RL in LLMs: "Sucking supervision through a straw"—policy gradient RL has high variance, low bits-per-sample, particularly early in training. Learning per trajectory (as in LLM code RL) is vastly slower than per-action (as in AlphaGo).
Soft labels: Distilling the MCTS distribution (not just the chosen action) gives richer, higher-entropy supervision per sample.
AlphaGo’s elegance: Never needs to bootstrap from near-zero performance—every step yields a stable, incremental improvement.

"Karpathy, when he was on the podcast, called it like sucking supervision through a straw... AlphaGo is quite beautiful. In AlphaGo you don’t train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution." — Eric [88:31/139:16]
"Why is AlphaGo an elegant RL algorithm? Because you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate. …every step of the way, this supervision signal is very clean." — Eric [140:26]

10. Self-Play, Q-learning, Neural Fictitious Self-Play, and Robustness

[97:19–138:17]

Self-Play: AlphaGo: use search as a teacher; StarCraft/Dota: use best-response RL (train one agent against frozen others, relabeling actions post hoc).
Replay buffers: OCCASIONALLY useful in AlphaGo/Go, but need care—can drift off-support and waste compute.
On-policy vs. Off-policy RL: On-policy is more stable for deep exploration, though off-policy data is still fruitful for robustness (e.g., recovering from rare states).
Contrast with LLM RL: In language, huge action space, sparse reward, and ambiguous “good action” assignment make these neat tricks less effective than in Go.

"Monte Carlo tree search is doing something very fundamentally different, which is it’s not trying to do credit assignment on wins, it’s trying to improve the label for any given action you took." — Eric [97:19]

11. Scaling Laws, Compute Efficiency, and Modern Go Bots

[111:15–121:11]

Scaling law lessons: More search at test-time can substitute for more policy net training; the compute frontier is always moving, and tricks may become obsolete as hardware improves.
Compute story: AlphaGo Zero’s compute (2017) dwarfed all then-existing AI models; today, Eric trained a strong bot for ~$10k using off-the-shelf hardware and clever tricks (e.g., pre-training, smaller boards, best-response to KataGo).
Modern perspective: Once the “frontier” experience exists, climbing it gets ever cheaper and simpler via distillation, architectural simplification, better hardware.

"The compute required to be the first to do something is always much larger than the compute it takes to catch up... And it's the same story playing out in LLMs." — Eric [116:22]

12. Automated Research and LLM Assistants in Practice

[142:17–147:09]

AI-for-AI research: Eric used Claude Opus as a code and experiment assistant, helping automate code changes, run experiments, and even plot/interpret results. Hyperparameter search, code refactoring, and automated experiment execution are already highly effective.
Weaknesses: LLMs currently struggle with lateral thinking, identifying bugs vs. architectural flaws, or switching research tracks tactically—still require human research taste and intervention.
GO as a proving ground: Because Go is quick to simulate and robust as a “non-reward-hackable” benchmark, it’s a fertile testbed for developing and verifying automated scientist agents.

"Automated scientific research is one of the most exciting skills that Frontier Labs are developing right now ... the [LLM] models can do a very good job of doing hyperparameter optimization... What is it not doing so well today? ... They don't seem to be able to step back and do lateral thinking." — Eric [142:54]

13. Bottlenecks & Lessons for Future Research Automation

[147:54–156:54]

Research taste still essential: Stackable, local improvements can sometimes interfere—research vision and “correctness intuitions” matter for progress.
Games as a stepping stone: Many labs (e.g., DeepMind) used games as their outer loop before moving to LLMs and real-world tasks. Transfer is possible, but not guaranteed or straightforward.
Outer-loop verification challenge: Even for Go, quantifying “real” improvement is not always clear beyond win rate. For general AI/Economic utility, much harder.
Reducing cost and sharing results: Eric’s blog post and open-source code let others fork and extend his results; he encourages further exploration of Go as a substrate for reasoning and research automation.

"I encourage the audience to think about the relationship between thinking and Go via MCTS and search and how it relates to LLMs. I think there's something quite profound there and probably under explored." — Eric [156:54]

Notable Quotes & Memorable Moments

On AlphaGo's deep insight:

"Ten steps of a somewhat small neural network can somehow capture what feels like an NP class problem into a single problem." — Eric [76:51]
On supervised RL with soft labels:

"If you have access to the logits, not just the one hot… there's way more information and bits per sample in a soft label. So that's why distillation is so effective." — Eric [139:16]
On LLM RL being slow:

"Karpathy ... called it like sucking supervision through a straw. ... This is what happens most of the time, so you just learn very little." — Dwarkesh & Eric [88:31/135:49]
On the progress of automated coding/research:

"What works is that the models can do a very good job of doing hyperparameter optimization... But... they don't seem to be able to step back and do the lateral thinking..." — Eric [142:54]
On future prospects:

"Many of these algorithmic tricks to speed up convergence just don't matter so much compared to something else. And I think that's a matter of taste in the present time." — Eric [151:07]

Suggested Viewing/Reading/Resources

Eric’s repo & blog: evjang.com; autogo GitHub
KataGo: KataGo project
Scaling Laws for Board Games: Andy Jones, 2021
Further reading: Eric’s post "As Rocks May Think"
AlphaGo papers: AlphaGo Lee, AlphaGo Zero
On LLM RL: Karpathy’s podcast, John Schulman’s GAE paper, and the Dagger algorithm

Timestamps for Reference

Introduction & Motivation: [00:00–02:14]
Go rules & scoring: [02:28–08:14]
Combinatorial game tree: [08:17–14:14]
Bandit/MCTS algorithms: [11:21–18:45]
Human value intuition: [24:34–26:41]
NN architectures & training: [32:03–37:56]
AlphaGo’s MCTS loop: [41:42–57:38]
Distillation & scaling: [62:28–75:57]
RL paradigm contrast: [74:59–97:08]
Self-play & Q-learning connection: [97:19–135:49]
Automated research experiences: [142:17–156:54]
Outro & resources: [156:20–157:25]

Summary Takeaway

Eric Jang’s reconstruction of AlphaGo serves as a microcosm of contemporary AI: a highly engineered but conceptually elegant interplay between search, learning, and structure, with broad implications for LLMs, RL, and future automated science. The leap from brute force to neural amortization, and from human-coded heuristics to learned strategies, isn’t just a story about mastering Go—it’s about the future of how we build, train, and perhaps soon automate agents that perform complex reasoning and discovery. For aspiring AI researchers and practitioners keen to understand not just how AlphaGo works, but what it tells us about building advanced AI (or even automating research itself), this episode is essential listening.

Loading summary

Transcript242 lines

[00:00]
A
Today I'm here with Eric Zhang, who was most recently vice president of AI at 1x Technologies. Before that, senior research scientist at what is now Google DeepMind Robotics. And you've been on sabbatical for the last few months. One of the things you've been doing is rebuilding and improving and hacking on AlphaGo. And so today what we're going to do is you're going to explain building AlphaGo from scratch and what it tells us about the future of AI research and development. But before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just hanging out at the beach?
[00:33]
B
Sure, yeah, I like making things and AlphaGo and Go AI is one of those things that really got me into the field. When I saw the kind of early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was just profound to see how smart AI systems could become and the kind of computational complexity class that they could tackle with deep learning. This is a problem that has long been understood to be kind of intractable for a search, and yet it was solved through deep learning. And so that was quite mysterious to me and I've always wanted to understand that phenomenon a little bit better. My training is often in deep neural nets for robotics where the decisions made by the neural networks are a bit more intuitive. But AlphaGo is a sort of problem where the, the decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how like a 10 layer network can sort of amortize the simulation of something so deep in the game tree.
[01:35]
A
Yeah, interesting.
[01:36]
B
So if you plot out how much compute it took to build various iterations of strong GoBots over the years, you can see that in 2020 there was a open source project called Katago by David Wu from Jane street, who, who basically achieved a 40x reduction in compute needed to train a really strong GoBot tablet. Rasa. I'm not certain if it's stronger than AlphaGo Zero or AlphaZero or MU Zero, but it's very, very strong. And this is what most GO practitioners today train against when they're playing an AI. And thanks to LLM coding, what took a whole team of research scientists at DeepMind and millions of dollars of research and compute can now be done for a few thousand dollars of rented computer.
[02:14]
A
By the way, if you're listening to this on an audio platform, this is a blackboard lecture. So I highly recommend switching over to a video platform like YouTube if you can, to look at the math and the graphs and the Go board. Okay. I guess we should first discuss how Go works.
[02:29]
B
Great.
[02:29]
A
So, yeah, how does the game work?
[02:31]
B
So the game of Go is a very simple one that can be implemented quickly and easily in a computer. The objective of the game is basically to put down black and white stones and try to occupy as much territory in the game as possible. So I might start by putting down a black. Black stone. Black always goes first. Go ahead. And so the way you capture an opponent's stones is that for every intersection, if you can surround all four of its neighbors with your stones, then this one is sort of cut off from oxygen, if you will, and then it is a dead stone. So then now I control these four stones as well as this empty intersection here. So there's, like, slight variations between Chinese, Japanese, and what is called Tromp Taylor rules. Tromp Taylor rules are designed to be completely unambiguous. Is for Go. So this is what all go AIs train against and resolve against. So in typical go, like humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. In Trump Taylor, it's actually fine. You put it down, and then it immediately resolves to death. So the outcome is sort of the same. Let's go ahead and start over and play a few stones, and then I'll explain some more. So I'll just start there.
[03:45]
A
All right. I'm, like, basically playing randomly here, but I'm trying to get around your stones and see if I can get one of them.
[03:50]
B
Yeah. So this move basically exposes one empty neighbor for your whitestone. And it's very akin to a check in chess, where if you don't respond immediately by putting one here, then I can immediately capture this.
[04:09]
A
I see.
[04:09]
B
Okay.
[04:10]
A
Because it is sort of the diagonals that determine whether you're grounded in the
[04:13]
B
cross section, not the diagonals. So this one is surrounded on three sides.
[04:18]
A
Yep.
[04:18]
B
And so you're at threat of losing that stone if you don't play one immediately there. Now you can see that I'm starting to pressure you, because by putting a stone here now, you are forced to put one here.
[04:32]
A
Otherwise, you would have this to block to yourself.
[04:35]
B
And then if you think through, like, what happens if you were to respond here, you can probably search into the future and deduce what I'll do in response once you.
[04:44]
A
You have a lot of confidence in my abilities. But I'm guessing you'd put the black here.
[04:47]
B
That's right. And then I would capture all three of these stones.
[04:50]
A
So I should just assume that this is gone, this little block is gone.
[04:53]
B
Yes. So in Go, it's actually okay to let an opponent capture some stones. If, for example, it allows you to position to capture more stones in somewhere else on the board. And this is what makes Go a very beautiful game, is that you can kind of lose the battle, but win the war. And as the board size increases, the complexity of these kind of like micro versus macro dynamics gets more interesting.
[05:15]
A
Presumably you'd put one here.
[05:18]
B
And so now I would capture this entire group and this would be mine. Okay, there's one more case that I want to demonstrate, which actually I had a bug in my code recently, which is the following situation. So let's consider a formation like this, right? And then, you know, we have other pieces on the board in play or whatever. And so let's talk a little bit about how the game ends right in this territory. Who controls these areas? Is it white or is it black?
[05:49]
A
White.
[05:50]
B
It's actually black because I have actually surrounded this whole area and it's very. Assuming I have other black stones here, it's actually very hard for you to break this out of the control of these stones.
[06:02]
A
So when the final score is tallied, would these ones also count as being in?
[06:07]
B
Yeah, great question. So this is where different rule sets have different ways of scoring. And so we should talk a little bit about how you resolve scores between humans and how you resolve scores between computer code, because there's actually some ambiguity in how humans evaluate this. So most humans would look at this board configuration and conclude that black has kind of totally surrounded white and so white has no chance of life. We could play out more here, but then at the end I would capture everything. However, if you have a way of breaking this formation and connecting Y to something outside of it, then it can flip. Right. And so this is where it's a little bit hard for a computer to decide these kind of things. Right. So how do humans do it? It's worth thinking a little bit about how humans resolve this because this will actually map later to how we think about the deep neural network. Humans basically say, I think the game is done. And then you have to also say, I think the game is done. And then we'll say, I think these are milestones. And then you have to agree. If you don't agree, then we keep playing. So essentially once two humans, their so called value function, agree on a consensus, then the Chinese rules resolve that.
[07:13]
A
Yeah, interesting.
[07:14]
B
So in Trump Taylor Scoring, it's perfectly unambiguous. So it can be decided algorithmically by a computer. So let's say you have this at the end game. The way you score this is that you first count how many stones you control, and that's unambiguous. Then you count how many empty intersections that are not touched by your opponent's stones. So these intersections would not count for either player because all of these intersections are connected to both white stones and black stones. If this were like this, then white would get three points. Now, this is a little odd because a human would know that white is actually losing these points, but Trump Taylor's scoring would consider white to have all of these points as well as these points.
[07:59]
A
Got it. Okay.
[08:00]
B
So that is a very big difference in how computer Go scores things and how humans score things.
[08:06]
A
How does the game end?
[08:07]
B
The game ends when either a player chooses to resign or both players pass consecutively. Cool. Yep. So that's the rules.
[08:15]
A
Nice. All right, now help me crack this with AI.
[08:18]
B
Great.
[08:19]
A
Let's understand how AlphaGo actually works and how somebody in the audience might be able to implement it.
[08:25]
B
Great. Yeah. Let's start with kind of an intuition about the underlying search process used to make moves, and we'll layer on ideas from deep learning to make it much more efficient and tractable. So Go is a game where there's just two players. We're going to draw a person here and we're going to draw an AI here. And let's say this person is playing black, so they go first. So we're going to draw here. And then now the AI is going to make a move based on what it sees here. So there's a question of, like, how you encode these inputs into the AI. Maybe you could use ones and zeros, but you want to represent black, white and empty. So you would need at least three different values here. Right. So maybe you can use 0, ones and twos or something. So the AI might see something like 0, 0, 001. Great. So this is the input to the AI on its turn. So the AI can choose. Let's just pick three possible random moves that I can go. And I just drew these at random. And so which move is best here? Right. Well, we don't know until the game ends. Go does not have any kind of local reward of which move here is good. And this is what makes Go a very difficult game, is that you don't actually know who won until you really get to the end of the game. So how deep is this tree? Right. Well, in a 19x19 go board, there are roughly to the order of 361 moves on any given move. And of course as it fills up you have less moves and, and the number of steps in the game can be somewhere from 250 to 300 moves. And maybe experts might decide to end the game well before that. But under Trump Taylor scoring, you actually have to play things all the way to the end. So this could be like 300 moves or something, right? So like 300 depth of the tree. So if you keep on expanding possible moves here. So in this move the AI is going, and then here the human would go, and then, you know, there's some. And so forth. You can find that like essentially what you end up with is an enormous explosion in the possible game outcomes originating from just this one state. So this is something to the order of like, you know, 361 to 300 power of 300, which is far more than the number of atoms in the universe. And of course actually there are redundancies and symmetries, so it's not actually 300, but that's sort of the. If you were to do a naive tree where there were no merging of children, then actually you end up with a tree about this big.
[11:21]
A
What do you mean by merging of children?
[11:23]
B
Right, let me use this board here. So if we start here and then you play here, and then I play here, and then you play here. That is equivalent to I start here, you play here, I play here, and then you play here, right? Yeah. So both of them arrived at the same spot but through different paths. So this child node can be thought about as a shared access drive.
[11:46]
A
And I guess it's not 360. It starts at 361, but it decreases by one each time.
[11:50]
B
And the branching factor decreases by one each time. Yeah, but in any case, this is a very, very, very large tree. And this is also why computer scientists for many years thought that Go was not a tractable problem. This cent the amount of compute you would need to exhaustively search every possible possibility is just too large. If you could go is actually a deterministic game. So on any given state you can actually compute what the best possible strategy you can make is in order to win the game. You can search all the possible futures where you win and then just make sure you always stay in that set of futures. So AlphaGo's kind of core conceptual breakthrough was using neural nets to make this search problem tractable. So before we get into how neural networks are involved, let's talk a little bit about how we can, assuming we have a very powerful enough computer, search this tree to find the best move. So in the beginning, you're not going to build out the whole tree because storing that tree would be very expensive. Instead, you might do something like interactively figure out which leaves of this tree are worthy of exploring and expanding into the future to see, you know, what else is there. So there are some early algorithms in bandit literature, like UCB1, which is not exactly appropriate for a sequential game like Go, but very much inspired the action selection algorithm used in alphago. So UCB1 looks like on every move, we're going to take the best action, or the ARGMAX over A that maximizes, you know, the Q of A. And I'll explain what Q of A is in a moment, plus some sort of exploration bonus. So on every node, we're going to track a few quantities. So, so let's, you know, consider each of these a node. This is, this is the, the root node where you're making decisions from. And these are the children of the root node. And we're going to say each node is basically a data structure that stores a visit count of this node, this
[14:14]
A
child node, how often the parent visited this node.
[14:18]
B
Yes. And we'll call this an action. So one thing that is easy to trip on is like, if you come from robotics or other kinds of reinforcement learning, is like, where are the actions? Right. I'm only talking about nodes. Nodes here represent states, and because this is a perfectly deterministic game with no randomness, you can actually just infer the action based on the child. So if I go here, that implies an action, and this is the state that we resolve. So the LLMs, if you ask to vibe code a MCTS implementation, it'll most likely design the right data structure here, but it's sort of a chef's choice. You can actually rewrite the tree structure however you like. This was what Claude 4.6 wrote for me when I asked it, and it was a very reasonable choice. So then Q represents the mean action value of this action. And I'll use a subscript A to denote that this kind of corresponds to taking a specific action to get here from the root node. So if we have root, basically taking A gets us to this node here, and then we're going to also store the probability of taking this action
[15:35]
A
again from the parent.
[15:36]
B
From the parent, yes. What are the odds that we sample this one and this will become relevant later? We've talked about a deterministic tree for now. So, well, I'll bring probabilities into this later. And then finally we have a sort of dictionary of children which is just like more of these nodes in a sort of classic linked list style reference tree. So this is the basic data structure to implement a tree. And in AlphaGo they use a slightly different action selection criteria called pucked and it's short for predicted upper confidence with trees. And this is basically when you select which child to take, you do argmax A of Q of SA constant. So the equation forms are actually pretty similar. These are both scoring criteria, right? Like you want to argmax this quantity and you want to argmax this quantity to determine which action to take. So let's break down the intuition of like how you select actions here. This is the mean action value. So how good is a given child on average? And if you actually knew the whole tree, then this is all you need to select the best action. You don't really need to do more than that. But if you're interactively building this tree, as you're figuring out what the Q values should be, then what you have to do is occasionally try some other actions as a sort of explore versus exploit trade off. So in both UCB and Pucked, there is this term here that basically rewards taking actions that you haven't taken before. So as we mentioned before, each node stores the vis account of taking that specific action, right? So everything is initialized to zero. And so for a given action, let's just say call it action A. Initially it's zero. And so as N is increasing, if, let's say We've already made 10 action selections from that root node, but we haven't picked A yet, then this term actually starts to become quite large for A. And conversely, if we have chosen a 10 times out of 10, then now this term is quite small, it diminishes very quickly. And the same thing is actually true here.
[18:05]
A
Just to make sure I'm understanding it, maybe I can put it in my own words. Let's just focus on ucb. What we're saying here. You can think of it conceptually as two different things. The Q and then this exploration term. Let's just be clear about what Q is. Q is basically saying, hey, once we do these rollouts, so you're actually running all these simulations, you go down the tree and then you figure out, okay, if I end up at the terminal value of this tree, do I win this game or not? And then you average whether I win this game or not across all the leafs of this tree, starting from this node. That average you put in Q.
[18:45]
B
Correct.
[18:46]
A
And so you're saying the Q is basically representing will I win this game or not? What is probability that I'll win this game starting at this node. That is your sort of exploit. That is like saying, I've run these simulations, I think this is a good move. Or not. And then this other term is saying, have I explored this branch enough yet relative to the other actions I could be exploring? Or I have already explored, if I haven't explored this branch yet, maybe I think it has a low score, but I just haven't explored that many leaves down this leaves down this node in this tree. So I should maybe try this even though the queue. This sort of exploit is telling me that this is not that valuable because ln of N grows slower than N. Basically over time you will move from the Argmax being dominated by this exploration term, which is the second term here, to the Argmax being dominated by the Q term, which is like, okay, I've done enough simulations, I'm quite confident that this is the branch to go down.
[19:45]
B
Yes, that's right. So the motivation for UCB was to come up with an algorithm where if you don't know the payoff of the arms, the different actions, you can select to begin with this strategy, basically, given some exploration term here, bounds your regret in terms of how wrong you can possibly be. I don't know the proof. I don't also know if this one is proved to have a logarithmically or square root bounded regret or anything, but I think the algorithm was just derived to look something like this. And you can tell that these terms grow a little bit differently. And this is actually just to account for the fact that GO has many more actions in every given move compared to your standard bandit problem. So one small clarification to make is that you talked a little about simulations on probabilities and so forth. We should remember that Go fundamentally is a deterministic game. So the notion of, like, where does the notion of probability come from here, right? If you had a very powerful computer, there is no probabilities. You just, you can just compute the true average of what the mean action value is. So where does the probability come in? Well, it turns out that as in computer Go before AlphaGo, we've always done some sort of Monte Carlo method where we take the expected Q value averaged over a randomly selected tree, and that randomly selected tree is where probabilities come in. So the interpretation of Q is what is the expected action value under the random distribution induced by some random search process.
[21:22]
A
Makes sense.
[21:22]
B
And so where does the random search process come in? That's where p of action comes in. So if we assume a very naive algorithm where you have a uniform probability of taking any valid action, then this would just be one over the number of valid moves in this setup and you would be kind of taking this average over this very diffuse tree. Right. And this is a valid integral you can take, but it's very slow because you're going to consider a lot of trees that have very low value. And it's essentially almost like a importance sampling problem where you want to. There's only a few actions and sort of paths that can contribute high value, and almost everything else is low value. So that's sort of a tricky problem here. Okay, so this is the action selection criteria for how you decide which moves to move down. Now, as you move down in tree search, you will eventually run into a node where it's quite clear you've won or lost. Right at the very, very end of the game, when there are no valid moves to play. Left under trump Taylor scoring, you can decide whether you won or lost. So you either win or you lost. And so this is basically the final return of the whole game. And so the question here is we can assign a value U to a terminal leaf node of the tree, but how do we assign the values for nodes prior to that? The parents? And it turns out what you simply do is you just take the. Your mean action value is essentially your average. So let's suppose these were leaf nodes. Sorry, these were all leaf nodes. The mean action value of this node, you know, this action here is just the average of whether you won or lost at the leaf nodes. And correspondingly, you can kind of walk up the chain and say like, well, the mean action value of this node, let's call this like qb, and this is action B is just the average of a weighted average of these ones here. And the weighted average could be dependent on if you have a different sampling distribution or not. But the basic intuition is that you want to resolve the game where you have a deterministic win or lose. And then you can kind of go backwards. This is called the backup step. And assign values to these nodes or actions corresponding to the averaged over the final terminal leaf. Okay, so if you were to do this without neural networks, it would still be intractable. You would have trouble finding which actions to sample. A lot of the actions would contribute very low value, especially if you're like trying to fight your way out of a losing position and only a few actions give you high value. So the search in practice is still very, very expensive. But the idea is that if you can, because GO follows a tree structure, you can actually inform a very good estimate of the value of this node based on the values of downstream, assuming they're all correct and assuming you've searched deep enough.
[24:34]
A
Your explanation earlier about the sorts of states where it's obvious to a human who's going to win, but it's not obvious to or deterministically you still had to play it out actually drove from the intuition of why the value function both is trainable and two, why it's necessary in order to actually be able to learn this game effectively. And maybe it's worth defining value in the first place, but.
[24:58]
B
Sounds good. Yeah, yeah. So we talked about this U value being your final resolution of whether you want or loss. And this is a terminal leaf node condition. Now, humans don't play all the way to the sort of edges of the tree, the leaves of the tree, right? They kind of stop some dozens of moves before maybe even 100 moves before in sort of high level play. So how do they know? Right. Like you can think about humans as implicitly having a neural network called a value function that basically takes in a board state and then it kind of evaluates key win. And so the human glances at the board and they know like I'm probably going to lose, right. And they're essentially running a neural network that looks at a board and implicitly they are amortizing a huge number of possible game playouts and taking that average and then deciding whether the board is winnable or not and then whether they should concede or keep playing or not. And this is remarkable if you think about the beauty of something like this. It's like a neural network in a human can somehow do all of this simulation at a glance and then just know within a few seconds without actually playing every single game logically based on just kind of like crystallized knowledge and experience, that they can do this. And so this gives us a hint that in games like Go, there are ways to basically radically speed up the search process. And this is one of the fundamental intuitions behind why AlphaGo works, is that you can train a value function to look at a board and quickly resolve the game without playing out all of these trees into a very deep search depth.
[26:41]
A
Yep, makes sense. I will say for the audience, I sort of found for previous episodes when I was Prepping, and it seems somewhat relevant to understand how AlphaGo works. I would find it very, very confusing, but it's the kind of thing where once you understand the problem in this way, and then you'll build the next few pieces, it is actually much more understandable and it will make a lot of sense. And it's okay to be confused right now, but it's. It's probably simpler to understand by the end of this lecture than you anticipate. So I'll just make that note for the audience.
[27:13]
B
Yeah, the important intuition at a high level, just to step back about where we're going with all this, is that classically for games like Go, you could build a tree, but we don't have computers powerful enough for that. And estimating the value of every action that you could possibly take is also hard because you don't know until the end of the game. You could take averages by playing them to the end, but that's also hard because you don't know which actions to take to sample these averages. So conceptually, there's kind of two problems. There's the breadth of the tree and then there's the depth of the tree. And AlphaGo gives us a way to basically shrink both of those to be very tractable. That's essentially the kind of core idea behind it. Okay, so we take this idea that humans can glance at a board and instantly predict what, whether we win. And maybe that gives us the opportunity to really truncate how deep we search. And then we also know that humans can look at a board and decide what boards intuitively, at a glance, what moves might be good on a good board. So these are kind of two things that we can use deep neural networks for to accelerate this search process. Let's go back before we talked about neural nets. Let's just go back to how this play out works. So we've only talked about making one move, right? So the AI looks at this encoded Go board, it has a tree. It searches deeply into the tree to find out which of its actions might be the best. And then it takes that action. And then now it goes back to the human. So maybe now the human sees a Go board that looks like this, and then they, they make their move. So maybe they put, they put their stone here. And then now we, we go back to the AI, which now looks at a new encoded board. So I've used two to denote the AIs playing as white, and one to denote the human playing as black and zero as empty. And then now on the AIs turn, it does the MCTS tree search all over again from scratch, right? So it throws away this old tree that it searched last round and now there's a new root node and it begins to search anew. And then so and so forth. So MCTS is basically, you can think about it like a search algorithm that is deciding what moves to play best, aided by neural networks. And it's done on every move. Okay, great. So let's talk about the neural network
[29:57]
A
part of this while you're erasing. Another sort of thing that was important for me to understand was this MCTS data structure with nodes and children of nodes and whatever this is done per move and reinstantiated once a move is made. So a human makes a move, then the AI looks at this and is trying to basically run a bunch of simulations to figure out, okay, what move should I make next? And those simulations, just a simulation, is basically like exploring one more node in this MCTS tree. And at the end once all this, you run a thousand simulations that informs then this, I guess as you'll explain this probability of what move to make next, that's what you store. You choose the best move, given those probabilities, you discard all of that, then the next player makes a move, and then you restart this process at the beginning of every move.
[30:53]
B
Correct. One small addendum, you don't discard all of that. You keep one thing behind that we'll use later.
[31:00]
A
Just like I did for Reiner, I wanted to make flashcards for this episode so that people could retain these concepts. And ideally an LLM could generate some candidates for me to then refine. But to actually get high quality suggestions, I needed to design a whole pipeline where the AI could take and ingest screenshots of the blackboard at the right timestamps and then make SVG diagrams in case visuals were helpful, and then run their writing and drawing through a critic and then revise the card in response to this feedback. It's very hard to accomplish this just by stacking. LLM calls this sort of step by step recipe works much better if you have a durable agent that's been engaging with the task across all the previous stages. So I used the cursor SDK to spin up an agent for each card. The cursor hardness saved me a bunch of work in designing some custom context scaffold or figuring out how to design tool calls for taking screenshots or making animations. These agents all run in the cloud, so I don't have to worry about leaving my laptop open. I just get an email when I have candidates to review. You can check out my cards@flashcards.dwarKash.com you can start building with the agent's SDK@cursor.com
[32:04]
B
Zorkash okay, so now we have a basic intuition of how moves are made. With search, we're going to talk about how neural networks can speed this up by providing an analog to the human intuition. So there's two networks. There is the value network, which takes in a state and it predicts am I going to win or lose? It's a binary classification problem. Then we're going to have a policy network which induces a distribution over good actions to take. So I'm going to draw a one dimensional flattened move distribution. But this is really like, you know, a square kind of grid. Right. So maybe like it thinks actions are like these are the kind of probability distribution over good actions. And both of these are categorical classification problems. Right. So you can train this like any classifier with deep learning, cross entropy loss, that kind of stuff. So the specific architecture does not actually matter too much. I tried a few different architectures. Transformers work. Resnets work for small data regimes. My experience is that resnets still kind of outperform transformers and kind of give you more bang for the buck at lower budgets. But this may not be true.
[33:22]
A
Why is that?
[33:24]
B
They provide the inductive bias of local convolutions. And generally transformers start to outperform residual convolutional networks when you want more global context.
[33:33]
A
I see.
[33:34]
B
So one interesting finding from the Katago paper was that they found it actually quite useful to pool together global features together and aggregate global features throughout the network, to kind of give the network a global sense of how to connect value from one side of the board to another side of the board.
[33:53]
A
What does it mean to aggregate global features?
[33:54]
B
Yeah. So if you have a very large 19 by 19 go board and you've got some sort of battles going on here, and you've got some battles going on here. When you pass this through a convolutional neural network, the receptive fields of the convolutional network are going to be good at computing local things and making that invariant. But they won't be able to kind of connect these two features easily. They need to sort of be pooled together and attend to each other somehow. So the argument about why transformers are good for computer vision tasks, like with vision transformers and so forth, is that because they have sort of global attention across the Whole thing, they can more easily draw these kind of predictions. But you do need more data there so that you can kind of learn through data the sort of invariant, local, local features. I've tried very hard to make Transformers work for this problem because I was kind of curious if Transformers would present some sort of breakthrough in GO and just remove a lot of those tricks. But try as I might, I actually haven't figured out a way to make Transformers better than Resonance for now.
[35:01]
A
So. Sorry, one more tangential question. It makes sense why transportmers, with their global pooling of information would be better if you need to consider information that is not just spatially or. Yeah, CNNs give you a sort of bias that the things which are next to you are especially irrelevant and then
[35:21]
B
they're sort of aggregated up.
[35:23]
A
Yeah, exactly.
[35:23]
B
Yes.
[35:24]
A
Okay, so for games where it isn't that relevant what is happening locally, you just kind of have to consider the whole thing. You're saying Transformers would work better. How about games where. So I'm talking about the spatial dimension. How about the temporal dimension, where right now we're only considering the previous move because it is a deterministic full information game. But what if it was something like poker or diplomacy, where really a bluff they made a while back is sort of relevant to understanding now and isolating to decide to make your next move? And so you need to consider all those previous states. Would that then change the consideration of what inductive bias is most relevant and what architecture is most relevant?
[36:02]
B
Right, great question. So GO is a perfect information game. And in perfect information games there does exist a Nash equilibrium strategy for which you can do no worse than any other strategy. So if you know that your opponent has a particular bias, like they love to play aggressively, you can actually in principle counter that specific strategy better than a Nash equilibrium policy. But to counter any given strategy, there does exist a single Nash equilibrium that can be decided solely using the current state. So that is a design choice that most go agents AlphaGo chose to do, which in hindsight turned out to work very well. Because the Nash equilibrium seems to be superhuman. No human strategy seems to be able to beat it. Now, there are variations of this where you would actually need to consider temporal history. And this is a very exciting research area that I would encourage people to kind of fork my repo and try these things out, which is if you were to play, let's say 2v2 go, then you actually need to model your partner's behavior. And you may not have information on how they play. So you need to aggregate some information on how they play so that you can respond accordingly. These are situations where it's no longer a perfect information game. And then in those cases, in games of imperfect information or partial observability, then you do need some context to build a model. And I think that's a place where things can get very, very exciting in terms of self play or diplomacy style.
[37:31]
A
Yeah, interesting.
[37:33]
B
Okay, so returning back to the neural network, the architecture again is not super important. You can get it to work with transformers, you can get it to work with resnets. I found that for low budget experiments, resnets work a little better. You can also use kind of a karpathy style, auto research, hyperparameter tuning to make your architecture pretty good. And so you don't have to worry too much about that. You just need to sort of set up the problem so that you have a sort of target optimization.
[37:56]
A
Yeah.
[37:57]
B
Okay, so we're going to pick just a somewhat arbitrary architecture that worked for what I did. But again, this part is not super important. You have your encoded board state and we're going to just choose to, let's say, do similar to an rgb. We're going to have three kind of channels. One channel to encode black, one channel to encode white, and then one channel maybe to encode empties or maybe like a masked region. If you want to train on multiple board sizes. I'm actually not going to talk about multiple board sizes for now. That's a little bit too complicated. So we'll just say like, you know, we've got this two or three channel RGB like image and then we go into a, you know, a resnet and then we have two branching heads. One head predicts the, the value function and this is like a single logit. So this is like R1 and then we have the policy which is R361. So this is the architecture and we're going to basically train this to predict the outcomes of games given the board state. And we're also going to train this to predict what are good moves. So the OG AlphaGo paper or called AlphaGo Lee, initialized this network with a supervised learning data set of expert human play. Later they removed this restriction by having the model teach itself how to play well. But I find it actually from a matter of implementation for your audience super, super nice to always kind of initialize your experiments to something that's easy and then get the problem working before trying to bite off the whole thing and learn a tabular resonant. You generally want to kind of initialize. Just as in deep learning, initialization is everything, right? You always want to initialize your research project to something as close to success as possible, especially if you're doing something new that you haven't done before. Like, always pick something that works and then get it to do something better, rather than start from something that doesn't work at all and then try to make it work. So under that philosophy, it's a great idea to start from something that has a good initialization. So we're going to take human expert plays and train this model to predict good actions. So we're going to take all of the winning games, all the moves in which a human won, sorry, an expert won, and then predict those actions. And then regardless of board state, whether you won or lost, you're going to predict the outcome. So you might be wondering, okay, well, some of the early boards where basically only one stone has been put down, how could you possibly know who the winner of this game is? Well, if you have hundreds of thousands of games, then on average you'll probably see that boards that start like this have sort of half of the games that branch off from this will win, and half of the games that branch off of this will lose. So that'll actually be fine. When you train this model to predict those, the logit will sort of converge to 0.5 for these things. It's sort of expected that once you train the model, a starting board state will look like 0.5, and then as you progress towards the end of the game, it'll actually look something like if this is 0.5, the win probability will sort of either go like this or it'll go like this. And this is sort of your move number. And so as you get hundreds of steps into the game, it becomes much more clear like who's more likely to win or who's more likely to lose under your expert data distribution.
[41:37]
A
I didn't understand the significance of why this way of thinking about value is especially relevant to the expert data.
[41:43]
B
It is not relevant to the expert data. It's true for any data that you train it on. Yeah. So if you were to learn tabular rasa, you would also expect this to fall out. So if you just do this. So imagine you're vibe coding alphago and you gather some expert data sets from how to go online, or you have a data set of human players and you train this model. Actually, it turns out this model is already a pretty good go player. It'll most likely beat most human players right so if you just take this policy recommendation and take the Argmax over, if this is the probabilities, if you take the Argmax and you just take this action as your go play, it'll be a very, very fast go player that doesn't think in terms of reasoning steps. It just kind of shoots from the hip and it'll be a very strong go player, which is already quite miraculous if you think about like 10 neural network layers, maybe under like 3 million parameters can already do something that impressive. And so you can start this way. And it's important when implementing this to kind of just verify that this is probably true. It's good to verify that your go rules are implemented correctly, that you can run these simulations relatively quickly and just as almost like sort of a checkpoint that you want to make sure that you can actually do this basic step before you try to layer on more complex things like search. But yeah, we can do a lot better than taking the raw neural network and playing the moves. And this is how we can apply it to Monte Carlo tree search. So let's apply the neural network to, to improve Monte Carlo tree search. So we start with our root node. And we now have a four step iterative process to do mcts. So this tripped me up when I was first reading the paper and trying to understand it. But essentially what we're going to do is we're going to choose a number of simulations. So like, you know, NUM simulations. And this number varies. This can be, you know, somewhere between 200 to 2048. I believe in the AlphaGo Lee match, they use tens of thousands of simulations per move because they really wanted to boost the strength of the model as much as possible. Yeah, but in training you don't actually need too many. And Katago, I think uses something on this order as well.
[44:11]
A
Do you know if they used, if you watch a documentary, they had a laptop out during the game. They didn't use the laptop itself. It was like on some, it was
[44:17]
B
on some TPU pod, I think. Yeah. But now, honestly, kind of unfair.
[44:22]
A
Well, Lee is not using like 1e22 flops to do a move.
[44:27]
B
You know, fair enough. Interestingly enough, modern go bots don't need that much compute at test time. And what we'll actually find out as we talk about how the MCTS policy improvement works is that over time, the RAW network actually takes all of the burden of that big TPU pod and just pushes it into the network. And you can do all of that work with one neural network. Pass but the TPU pod will always add the extra oomph on top. And so that's what they wanted for the match. So we're going to pick this kind of like num simulations thing. And for every simulation we're going to basically do several things simultaneously. We're going to see which moves are the best in the current tree. We're going to add extra leaves to the tree if we get to a point where we need to add a leaf and we're going to update the action values for the tree. So that's what every simulation involves, these kind of like four step process. So the four step process is basically selection, expansion, evaluation and backup. So at the beginning of our Monte Carlo tree search, our tree is very basic. It only has the root node or our current board that our AI wants to play at. And so we're going to basically select the best action for this. So when this root node is created, we also know that we can evaluate this under our neural network and get the quantities v theta as well as our probability over actions. And I'm going to say root. So for all of the actions here, we can create a bunch of children, right? So this one has, well, in this case I'm drawing a three by three board with one board missing. So basically there are eight possible children associated with this root node. So like, And each of these has an associated probability of taking that action, right? So, so there's P8, P1, P2, etc. Okay, so at the beginning of our Monte Carlo tree search, we have our root node and we can initialize it with some children because we know the policy network evaluated on the root node gives us on a three by three board with one existing stone placed eight possible children that this AI could take. So with each of the children, their policy network also gives us the probability of selecting that child. So the first step is to do this selection of the tree. And again, this is a very shallow tree. All we have so far is a tree of depth one, essentially. Right, so our first move is to select by maximizing or argmaxing the pucked criteria, which is Basically, you know, QSA +C pucked times P of A divided by N over 1 NA. So for each of these we're going to NA is zero for all the actions initially N is zero. And so we're going to basically just pick according to this initially what is going to be the chosen action here is most likely going to be biased towards, you know, the highest likelihood action here. Right? Because these are sort of uniform for everybody. So let's suppose P1 was the highest probability node. So you selected this one here. Now you got to this node and you realize that it's not a leaf node, right? There are more, it's not a terminal game, so you cannot resolve the final resolution. So the next step that you do is expansion. So you, you will then run this node, this board state through the policy network. Note that this is the AI's move, right? Like AI is making this move. And so when we expand this tree, we're now thinking about what the human might do or any opponent might do, right? So this is like your opponent. The tree expansion process actually is completely. So when we evaluate the, the node here, we're going to now evaluate the node from the perspective of this player. So then this one has possible actions that we could take and we expand basically the leaf nodes here. So for each of these nodes that we could arrive at, we're going to now check how good those nodes are, right? So maybe from here like the human could play here, the human could play here, or human could play here. And we're going to store essentially the v theta for each of these things. So v theta of node1 or like node1 prime, v theta, node1 prime. And so we're basically using our neural network to make an intuitive guess of how good is this board from the perspective of this player. And fortunately, because it's a zero sum game, it's easy to deduce that the value for this player at this step is just one minus the value from this perspective. So it's easy to flip the search process depending on which player you're at. And so this is the expansion step. You've taken a non leaf node and expanded it and evaluated the value. And this is essentially a quick guess as to like if I were to play to the end, am I going to win or not, right? So you can almost think about the v theta as a shortcut for searching to the end of the tree for any given simulation. And this is essentially the evaluation step. We're evaluating the quality of each of these boards. In original Alphago League they actually did something kind of interesting which is that they took this value and they averaged it with the value of a real go playout. So they actually played a real game from here all the way to the end. So I'm just going to draw this squiggly line to indicate some path and they kind of like play this all the way to trump Taylor resolution of a full board. And so this is like a 0 or 1, right? And so they took this value and they just averaged it with this one here. So the formula they did was like, you know, alpha times v theta of like, you know, some node plus sort of like one minus alpha of a true randomly sampled playout. And you might be wondering, okay, well, how do they play this out? Right? It would be very, very costly to do another search on this playout, like, almost like a tree within a tree. So they don't do this. Instead they just take the policy network and play it against itself. So they just take this as both players and they just play it all the way to the end. And this is something that helps ground the, the estimates here in reality, because you can get a single sample estimate of whether you win or not, you can think about in the end game where the board is almost resolved that this one actually becomes quite useful because the play according to the policy will most likely decide a pretty reasonable guess of the game. And so you're not facing a problem where this one kind of becomes untethered from reality. It turns out this is totally unnecessary. So in all subsequent papers after AlphaGo Li, they just got rid of this. In my implementation, I also did the same. And it speeds things up a lot because you don't have to roll these games out on every single simulation.
[52:43]
A
Okay, so again, just to reinforce my own understanding and just to re explain it for the items, by the way, in case it's not obvious, the P there in the select that is the probability coming from the network in this case, correct.
[52:57]
B
The policy network here.
[52:58]
A
Yeah. Okay. So fundamentally a simulation, just think of it as like rolling out one more node in the search process almost.
[53:11]
B
So a simulation is easy to think about when the whole tree already exists. Right? You just walk down the tree using the pucked selection criteria and then you keep going. Now, in AlphaGo, the data structure is such that we begin with a tree that has basically only depth one, which is its only children. And you want to iteratively build out the tree as you're also selecting actions down the tree. So that's the kind of core thing here, is that because Go is such a combinatorily complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree.
[53:49]
A
Right?
[53:49]
B
Okay, so let me just finish up with actually the last step, which is the backup. So once you've scored these things, you basically take the mean. The Q value assigned to the node here for taking this action is now just the average across your evaluated Values. Here you take a running mean over all of the simulations that you've taken and they average the values of the children nodes. So that's what is known as the backup step. And once you evaluate this, you can actually kind of recursively go back. So if you know the action value of this node, you can then take the average on its parent and so on and so forth. So you have this kind of four step process where you are choosing the best action that you know of so far. Then you may run into a node where you haven't been to before. So you need to grow the tree a bit and then you run it through the network to guess whether you're going to win or not. And then you walk all the way back up to the root node to update your values on what the best moves are. So as you do this iteratively, this selection criteria will cause you to visit the. Because you're always selecting according to this criteria, you're always going to be selecting the best action you think at any given branch. So the final visit counts of how often you chose these things will reflect your correct policy distribution as induced through this search process. And so the visit count that we store in the node earlier actually becomes the sort of vote for which way we should finally select an action here.
[55:23]
A
Yeah.
[55:24]
B
So as a sort of test of understanding, it's worth thinking a little bit about whether we could make this even simpler, right? Like, could we actually maybe even get rid of this one and still make the thing work. Um, so recall that, you know, when you do an expansion and then an evaluation at, let's say this node, you, you are checking the sort of wind probability of each of the child nodes, right? And so if this one is, you know, like one, and these are zero, you do kind of know something about which action might be better to take. And so why would you need still need this, right? Like why not just normalize this one into some distribution and call that your, your policy distribution? This is fine, you can do this, and this probably does work. But in practice, having a single forward pass that gives you a pretty good guess is how the breadth is pruned out. There is a sort of duality here. It would be weird if, let's say the policy recommended an action that disagreed with the value. If, let's say the policy said this was very high probability, but this one said it was low value, then there's actually something kind of fundamentally wrong between your policy head and your value head. So they are linked and you probably could get rid of this if you came up with a different way to recover this from just the value evaluations.
[56:44]
A
Right. But just to make sure I understand, the reason you don't do that is so that you don't have to do 360 independent forward passes to like, hey, here's the value of everything. Let's argue max over it. Right. Instead, you can just do one forward pass and get the probabilities of all them.
[57:00]
B
You can usually batch these somewhat efficiently, so it probably is not a huge computational burden in practice. But yes, you would have to pass up to 361 boards into a single mini batch update to evaluate all the values here, then normalize them. Now, there's actually a more important reason why we still do this, which is how Monte Carlo tree search is used to feedback on itself and sort of recursively improve its own predictions and search capabilities. And that's where this one, having this as an explicit entity you're modeling rather than an implicit normalization over your value, is a good idea.
[57:39]
A
Makes sense. Okay.
[57:40]
B
Okay. So we talked about the simulations and basically what you end up with as you roll out the number of simulations is a tree that kind of looks like I'm drawing a very low dimensional version of this. Of course, in the real game it's much more high dimensional, but you'll end up with basically a tree structure that has a lot of leaves that terminate and are not visited again because their value is deemed to be too low. But then along one path, there will be a set of actions with very, very high visit counts that gravitate towards that one set of decisions as you increase n. So this is kind of like the mental picture of what the tree in Monte Carlo tree search looks like. And you should contrast this with an exhaustive tree like in Tic Tac Toe, where you could say there's nine actions and then eight and then seven and six. And so it's a sort of like nine factorial sized tree. The Monte Carlo tree search in GO is very, very sparse. Right. It only considers the paths that you've expanded children nodes on. Okay, so now that we have the search algorithm that applies the value function as well as the policy function, we can now talk about how the Monte Carlo tree search algorithm can actually act as a improvement operator on top of these guys here.
[59:06]
A
20 years ago, Jane Street's data center fit in the corner of an office. Ron Minsky, who co leads the tech group there, told me about how it all got started. One of our compute clusters we called the Hive, and I remember the first version of the Hive was literally like six Dell boxes stacked on top of each other at the end of the row. And the trading systems themselves we also had there because we actually wanted the ability to make sure we could turn the damn thing off. I mean, there were ups and downs. Like, literally at some point, you know, one of the people who was cleaning the office unplugged one of the trading systems in the middle of the day as they were vacuuming. So, you know, in the end, it is, in fact, better to have it all in a data center. Jane Street's data centers have come a long way since those six Dells. And I got to tour one of them in Texas with Ron and Dan Pontecorcorvo, who leads Jane Street's physical engineering team. You know, These cabinets, these GB300 cabinets,
[59:53]
B
consume at peak about 140 kilowatts.
[59:55]
A
Compare that to traditional air cooled, you're
[59:57]
B
talking about 10 to 40 kilowatts. So a lot more.
[60:00]
A
We got deep into the details of running one of these data centers, things that I had never considered before. It's filled with a liquid, a mix of distilled or deionized water and propylene glycol. 25% of propylene glycol.
[60:12]
B
That's to inhibit any bacteria or algae growth.
[60:15]
A
I don't love the world where we have to worry about bacteria growing in our servers. I got to see way more of what actually happens in a data center than I've ever seen before. Jane Suite was willing to literally pull up the floorboards and take out the racks and take me to the back where all the chillers are. You can check all of this out@janestreet.com ThorCache where we posted the full tour.
[60:35]
B
Okay, so we now talk about the RL part of how this thing gets stronger by playing itself. Right? Let's say we play a game where at the AI, so you make a move, AI will kind of compute the search. And then this is this sort of visit count distribution. Let's say this is your policy, your policy, initial policy recommendation at this node. And then after mcts, it gets more confident about one of these actions. Right? And so maybe the distribution looks a bit more peaky like this based on the search. Now, of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. Mcts should get more confident about specific actions than others. But it, of course, might place a lot of weight on other actions initially. And then as you increase the Number of sims, it should converge to a very peaky distribution. So this is your new, let's call this like PI. Let's wrap this in like a MCTS operator of a given S. Right? So after applying MCTS process, your policy recommended distribution looks like this. It's a bit more peaky than the previous one. And so then you take the Argmax or maybe you just sample from this, it doesn't have to be the Argmax. And then you make your move and then you throw away the tree and then you begin anew on the next move, right? So again like you compute a new distribution. So initially maybe your guess looks like this, and then you refine it through mcts,
[62:28]
A
there should be one more X on the board, right?
[62:30]
B
I'm sorry, that's correct. Yes, to something that looks like. So on every move you have your initial guess from your policy network. And then the search process that combines your policy network and your value network arrives at a more confident action that you take. And then so on and so forth, and then the game ends and one person wins and one person loses. So the beauty of how AlphaGo trains itself is that it actually can take this final search process, the outcome of the search process, and tell the policy network, hey, instead of having mcts do all this legwork to arrive here, why don't you just predict that from the get go? Why don't you not use this guess and just predict this to begin with? And if you have this guess to begin with in your policy network, then MCTS has to do a lot less work to get things to work. And so if we draw a sort of test time scaling plot, so let's say this is like number of simulations, let's say at zero simulations, your sort of implicit win rate is like, I don't know here. And then without any simulator, if you just take this raw action, this is what your win rate is. And let's say as we increase the number of sims, maybe you kind of have a win rate that looks like this, right? So when you search for, let's say 1,000 simulation steps, that gets you to a policy here, that gets you to here, which is great. But if you were to distill this MCTS policy network back into your sort of shoot from the HIP policy network, then you could actually start here. If let's say this was zero by distillation, then if you spend another 1000 SIM steps, then you actually kind of get to here. It's almost like if you could just amortize the first 1000 steps actually into the policy network instead of the search process, then you could begin at a much better starting point and then get a much better result for the number of sims that you put the sigmoid
[64:47]
A
type nature of test time scaling, as the number of simulations increases, the increase in win rate is smaller. Is that true even for the distilled network? That is to say, is there some gain of like, okay, we start from the distilled, we get these early gains again, or is that just inherent to the nature of mcts?
[65:04]
B
To be honest, I actually don't know the test time scaling behavior of MCTS simulations, and I believe it might actually be quite sensitive to how strong this one is. In practice, I'm just drawing a monotonically increasing function that gets to one. So don't pay too much attention to the shape of the curve, just know that it's monotonic with respect to sim. Okay, so the idea of MCTS is very brilliant, which is like we got something better by applying search and we're going to now, on our next iteration of updating this network, just train this to approximate the outcome of 1000 steps of search. And so instead of starting here, we get to now have a neural network start here and then, and then the play gets stronger once we then apply another thousand steps on top of it and you can keep going. Right, so the training algorithm for AlphaGo is to basically take the games where you've applied the search on every move that the policy encountered, whether you won or lost. And that's quite important. And you're just going to train the model to imitate the search process. So there's an analogy to robotics actually, which is the dagger algorithm. First I'm going to draw like a schematic of like, let's say, you know, the states, right? So s0, s1, s2, s3. So let's say, you know, we took a series of actions in an MDP to get a trajectory, and these actions may be suboptimal, right? Maybe we lost at the end of this game. So there is a family of algorithms that basically take trajectories and relabel the actions to better trajectories. So maybe a better action here would have been to take, you know, a zero prime. A better action here would have been to take a one prime and yet another one, like a two prime, a three prime. So what MCTS is doing is basically saying like, you play this game where you eventually lost, but on every single action I'm going to give you a strictly better Action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data so that you retrain your policy network to predict these ones instead of these ones, you're going to do better. And this is very related to dagger in robotics and imitation learning where you want to collect a intervention here. And even if you're in a not great state, for example, like a self driving car that veers off the side of the road, there is still a valid action that kind of corrects you and brings you back.
[67:43]
A
Yeah, okay, so pedantic question, but is there a guarantee that MCTs must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value network early on in training, when the value network hasn't been well trained on finished games, that MCTS is worse than sort of randomly exercised policy. So is it just a heuristic that MCTS is better than policy or is there some guarantee?
[68:11]
B
Right? In practice it is a heuristic and it does work also in practice. But let me illustrate an example where mcts can give you a worse distribution than your policy network. And this can often happen if your self play algorithm has trained to a good point, but then somehow it collapses because it's not trained on diverse data or something. So let's say we have a board state where the policy recommendations here are very good. So like PI of as is like great. But somehow, because maybe we're playing on a lot of games where the bots just resign instead of playing all the way to the Trump Taylor resolution, they kind of forget how to evaluate those kind of late stage plays. In the case that we showed with the Corner play, maybe 100% of our training data in our replay buffer has lost examples of how to evaluate the value function at those states. So you might end up in a scenario where your terminal value is very bad. And if the terminal values of the leaves are not good, then this will actually propagate all the way up and cause your pucked selection criteria and your backups to be off. And then you end up visiting a very, very different distribution than what your policy initially recommended. Also if your number of sims is low, then you might also have a variance issue where you just don't explore enough. It's only guaranteed to converge when you kind of take N to infinity. So variance in your search process as well as inaccuracies in your evaluation can definitely screw with the quality of your policy network recommendation. And so that's why it's not a guarantee to improve. And that is why I think, I suspect why AlphaGo Lee had the playouts to the end in their training algorithm so they could ground this thing in real playouts in practice. What you could also do is just like for 10% of the games you prevent the bots from resigning and you just say resolve it to the end so you get some training data in your replay buffer to really resolve those kind of late stage playouts that normal human players would kind of not play to. Yeah, so this is why mcts kind of, if you assume that the value functions are correct, why it gives you a better policy is because, and it's a very critical chain of assumptions, assuming that this is accurate, then your search process should give you a better recommendation than your initial guess.
[70:36]
A
Right. Okay, so if you have a cold started policy, if you have an alphazero type thing, really what's happening for the first few epochs is the policy is kind of useless and what you're really just doing is hey, let's play full games. And once we have played full games for the preceding moves, we'll have labeled who won, who didn't win and the loss for AlphaZero has two components which is like how good is the policy relative to mcts and how good is the value prediction relative to who actually won the game from this move? And this sort of like you can think of this being applied to every single action or every single move.
[71:14]
B
Correct.
[71:14]
A
And really what's happening in the beginning of AlphaZero training is just like we're trying to get the value function to actually predict who will win the game. If you find yourself in this state and you're this player and functionally that's all that's happening. And later on, once that's well trained, now the policy is also improving.
[71:31]
B
Correct.
[71:32]
A
Okay.
[71:32]
B
One trick I did find to be pretty useful and this is not a peer reviewed claim, so just take this with a grain of salt. I found it useful in my own implementation to do the following. You want to first make sure that this is good before you invest a lot of cycles doing mcts. Right. It doesn't really make a lot of sense to do search on garbage value predictions. So you want to kind of start at a good place where this works. AlphaGo Lead does a very good thing where it just takes human games and then you train on it and it just works, right? Totally works. You can also take an open source GoBot, play it against itself, generate data, also works. So if you have some offline data set that has realistic good play, you can easily learn the late stage value functions pretty well. And that's what you kind of need to start the search process.
[72:21]
A
Sorry, can you just read this?
[72:23]
B
Sure. So it's quite easy to evaluate a late stage go game, like when almost all the pieces are on the board. It's almost like a decidable problem. Right. Because there's lower and lower uncertainty as to the depth of the tree. So most games played to the end by reasonable people will be good training data to train a good value function at terminal parts of the tree.
[72:44]
A
Got it. Okay.
[72:44]
B
Then as you play more games, the search will back up good values into the sort of intermediate nodes of the tree. And then as you increase the amount of data, your value head gets a good intuition of what is a healthy board state versus a not healthy board state. Those are much more subtle to judge in the mid game than the beginning or the end. So the most difficult part to score is not the beginning or the end because the beginning is just obviously 0.5. And then at the end it's like pretty obvious who's winning. So the hard part that you want to learn in the value function is like who is winning in the middle.
[73:16]
A
And so this is actually very analogous to TD learning.
[73:19]
B
Yes. And there's a beautiful connection to TD learning that we can talk about in a bit as opposed to contrasting with Monte Carlo tree research. So you first want to get good value functions and expert data can kind of give you a quick shortcut. I recommend for practitioners, just do that first, just to initialize to a good starting point. And then if you want to do the alpha zero thing or, or Katago kind of tabula rasa learning, then what you can try to do is on a small board, play random games. Just take a random agent. And if you play like 50,000 games, you'll actually learn a pretty good value function as well. Because on a 9x9 board you can see enough of the common patterns with random play. And then if you train a model that kind of can train on both 9x9 and 19x9 data and Katago proposed one of these architectures, then there's some pretty good transfer learning from the value head evaluated at 9x9 to the 19x9.
[74:12]
A
Right. Because this, unlike other games, has very much a sense of there's not a new kind of piece that is introduced when you increase the size or something.
[74:20]
B
If we take it to its limit and consider a very tiny 4x4 go board, if you play 50,000 games, you're going to have a lot of end states that look like human play. It's just like tic tac toe at that point. So if you broaden this a little bit to five by five or nine by nine, it's not unrealistic to imagine that purely random play will actually generate pretty reasonable looking boards and then so you can score those pretty easily. And so that is what gives you the bootstrapping to be able to then improve your policy with search. But it's very, very critical that MCTS has accurate value estimates and you need to ground the value. Ultimately MCTS will fall apart if you don't have a grounding function for, for the value.
[74:59]
A
I'd be curious how much compute you save by training the value and policy on the same network because they share the same representations. How much more efficient learning is because that would be interesting if they're basically kind of, we've just talked about how they're kind of making similar predictions or they should be in line with each other. And so I'd be curious if actually you're halving the amount of compute you had to do by keeping the Same network right.
[75:22]
B
AlphaGo Lee the original AlphaGo paper had two separate networks and then in all subsequent papers they merge them into two heads and presumably this saves compute. But answering that question in a very rigorous scientific way is actually, it's a simple question but in practice actually takes like, if you really want to chase that question down to its limit, it takes quite a bit of work to really resolve that. But intuitively, yes, they share a lot of representations and as we mentioned there is a sort of like your policy network and your value network when doing evaluation should kind of agree. Right. So there really should be this sort of consistency between them.
[75:58]
A
Yeah. Tell me if this is the wrong way to think about it. I feel like when I learn how an LLM works and how simple RLVR is, at least as an algorithm, how simple it is, I'm sort of stunned by the kinds of things it can do that it can learn how to build very complicated code repositories and whatever simply from getting a yes, no. And here I feel like if you understand it more deeply of just predicting mcts and it actually seems awful, go seems less impressive in retrospect the more you understand it because you're like, oh, you're putting in a lot of bias by just saying how much you'd, you're like telling it how we should titrate exploration as things go on. You're building this very explicit tree search for it. And so I don't know if you share that intuition where actually the more you understand, the less impressive the accomplishment in 2017 seems.
[76:51]
B
I personally disagree. I think they're profound for different reasons. And I don't understand the LMRL enough to kind of comment on your podcast about it. But I think AlphaGo, why is it a profound accomplishment? I think maybe it's worth stepping back a little bit and just like it is different than modern RL and we can talk a little bit about some of the algorithmic choices there. But I think the most profound thing here is that a 10 layer neural network pass. So basically 10 steps of reasoning. And of course the reasoning is not just one trail of thought. It could be like the distributed representations and a lot of thoughts going on at the same time. But by construction, let's say a 10 layer neural network can only do 10 sequential steps of thinking, right? 10 steps of neural network paralyzed distributed representation thinking is able to amortize and approximate to a very, very high fidelity, a nearly intractable search problem. So this was a breakthrough that I think most people don't even understand today fully comprehend how profound that accomplishment is. And this is what also girds alphafold, for example, where you have a very, very difficult physical simulation process that you would need to roll out so many microscale simulations. And yet 10 steps of a somewhat small neural network can somehow capture what feels like an NP class problem into a single problem. And so it actually makes me wonder if our understanding of problems like P equals NP or these very fundamental computational hardness problems are incomplete. Right? It's not like, obviously this is not proof of PMP or anything, but there's something to it that kind of is very disturbing, where what felt like a very hard problem can fall to a very, very simple macroscopic solution.
[78:53]
A
That is a very interesting insight that a lot of problems which are proven to be NP hard. I don't know if GO is proven to be NP hard, but protein folding, et cetera, have been like neural networks can solve them because they're NP hard in the worst case. But we're not dealing with the worst. We're usually not concerned about the worst case. These problems have a lot of structure to them usually.
[79:13]
B
Yeah, I think that the kind of question we should be asking ourselves is like, we've been formulating solutions to NP hard problems, as in kind of worst case complexity. And I wouldn't say this solves go, right? It doesn't give us an exact solution of the optimum. But in practice it is extremely useful. And the same thing has been shown in alpha tensor alphafold, where, yes, there is a very hard problem that in the worst case seems intractable and yet we're able to make almost arbitrary amounts of progress. So here's a sort of like in the limit. What might this look like? Right. Well, if you want to simulate something very complex like weather or predict the future, like do we live in a simulation or not, the computing resources you need to build a very complex simulation might be much smaller than you think, based on our ability to amortize a lot of that computation into the forward pass of a single network.
[80:09]
A
Interesting.
[80:10]
B
So to me, yeah, AlphaGo was the first paper that kind of really showed this profound level of simulation being compressed into a small amount of.
[80:20]
A
I feel totally not at all qualified on the computational complexity of the math to comment on this, but I wonder if there's an important role of chaos here where. What is the problem with weather? And why does it take 10x the amount of resources to predict whether a day out and continually so for every more day out is because it's a chaotic system and so small perturbations can totally change the final estimate as time goes on. And I guess it's interesting. Well, I guess you would expect that for GO and protein folding as well.
[80:55]
B
So here's an analogy to weather that might be relevant in go. So the problem of like, here's our current board state, Given what we know about both players, what is the board state in the future? What is the exact board state in the future? Right. This is, this is extremely sensitive to initial conditions. Like a single stone place here can kind of disrupt the entire prediction. So this is hard. This is kind of intuitively the chaotic problem. And yet somehow. So this is hard. Somehow we can predict who's going to win. And this captures a lot of possibilities here. And so there's this more macroscopic quantity that we really care about, which is the average or expectation or some sort of global macrostructure over a lot of possible futures.
[81:48]
A
That's an interesting way to think about it.
[81:50]
B
In weather, it could be the same thing. Right. We don't exactly care what the velocity of wind 6,000ft above a specific latitude longitude is. We kind of care like where's the hurricane? Or things like that. And I would say in chaos, there's a classic Lorenz attractor which kind of looks like this, right? Yes. If you start anywhere on the Lorenz attractor, you don't know where you're going to end up. But you do know that the thing looks like this, right? And so there's this kind of beauty of, like, sometimes we don't necessarily care about the microscale things. We actually care about the macroscopic structure. And these things can be predictable.
[82:27]
A
And contrast that, say, to something like a hash function, which is also incredibly dependent on initial conditions, but doesn't have a macrostructure, at least, hopefully, if the work.
[82:37]
B
Yes, one would hope.
[82:38]
A
And so there's no equivalent of a value function or broadly, how's the weather going to be? That is interesting there. It's really just about what is the move. What is a board going to look like 100 moves from now? Exactly.
[82:51]
B
Yes. Intuitively that seems correct. And then again, this is also out of my area of expertise, but I find it interesting that cryptography has not been able to. The tools of cryptography and hashing have also not been able to prove that you cannot come up with fast approximations. You cannot come up with fast approximations. If they were able to do that, then you could prove P is not equal to np.
[83:18]
A
In fact, we know that there's structure in many cryptographic protocols. Obviously, like RSA cryptography, there is structure, and that structure is what quantum computers exploit to break them. Right, I see. Reiner has a very interesting blog post which we talked about in the episode, where he talks about how if you look at. At a high level, what cryptographic protocols look like and what neural networks look like, it's extremely similar where you have sequential layers of jumbling information together. And it's because there's this convergent evolution in the algorithms where in cryptography you want the final state to be incredibly sensitive to initial conditions, so that it can come out sort of looking jumbled based on if you change anything. And then neural networks, you similarly want everything to be dependent on all the information, because you want to process all the information and consider how it relates to itself.
[84:09]
B
Yeah, you have the maximum power of a neural network at the edge of chaos. I think there's some research papers from Joshua Struldicht on this. There's something kind of quite fundamental about chaos that is it's not just like hopeless noise. It's like there's something kind of useful in chaotic systems, at least at that boundary. But yeah, this is just my think about this as a philosophy. I don't actually know the math well enough to comment on it. Anyway. If we go back to. We'll talk about LLMRL a little bit because there's some connections There. But let's just go back to the mcts. What is it doing? Crucially, it is not saying we're going to increase the probability of winning directly. It's not going to say we're going to upweight all actions that won and downweight all actions that didn't win. Importantly, what it is doing is saying for every action we took, we did a pretty exhaustive search on MCTS to see if we could do better. And we're just going to make every action that we took better by having the policy network predict that outcome instead. And so this is a very, very nice idea because you have one supervision target for every single action. So the variance of your learning signal is very low compared to the alternative naive RL thing. So let's actually consider what, let's consider a very naive algorithm that looks a lot more like modern LLM RL today where we do something like let's take the winner of a self play game and encourage it to do more of that. Okay, so it's worth kind of thinking a little bit about like, okay, what are some alternatives that we could do to train self play agents instead of MCTs? Right? Like, you know, we use a lot of LLM style RL these days. Like is that relevant? Could we do that instead? So let's think through this a little bit. Let's suppose we have a very naive algorithm where we take a league of agents of different checkpoints and we play them against each other. And for the games where a single player wins, we're going to reinforce those actions up and retrain the policy network to imitate those guys instead of the MCTS objective. So what ends up happening is let's say you have a chain of actions that led to a win and you have a matchup between two agents that are basically the same. So in fact, let's just assume that like, you know, policy A and policy B are like evenly matched, right? So their true win rate is like 50%. So let's say you play 100 games and then each game, let's say, lasts, you know, 300 moves and you're doing some sort of evolution strategy or some way to perturb these things to get them to do different things. Or maybe you don't and you just play them against each other. And you see, occasionally this one might actually have a better strategy than this one, right? So let's say 51 games, policy A wins, and then 49 games, policy B wins. And this is just due to random luck or maybe you perturbed Policy A in some way that let it do this. And just to have a very, very simple model, let's pretend that for like, 49 of the games they played exactly equally. I'm sorry, for 50 of the games they played exactly equally. Right. And on that one game where this one won, it played it slightly differently. It made one critical move that normally it would have done differently, but due to some exploration or some random noise, it just happened to make a smarter move than it did previously. So you have one supervision signal, like one true supervision signal for your policy network, and then you have 99 games times 300 moves for which imitating those actions gives you exactly the same policy you had before. And so the scale of your variance is actually very bad because it's like you only have one label out of this enormous data set of actions of supervision actions where you want. Actually, sorry, let me clarify a little bit.
[88:32]
A
Okay, so we were just talking about how the good move data distribution move is a small fraction of all the moves that are played across all the games on which you'd want to train. And this, of course, reminds me of how LLMs are trained with policy gradient methods. Karpathy, when he was on the podcast, called it like sucking supervision through a straw. And so, yeah, it's interesting that this thing you're saying, which would be intractable and prevents you from actually getting beyond a certain level in Go, is just by default how LLMs are trained.
[89:03]
B
Right. So in this case, this is not to say it doesn't work. If you imagine increasing the number of games to millions of samples, you actually can get some meaningful supervision samples, so long as you find a way to sort of mask out the supervision from these guys. And then this is where things start to get pretty related to RL in terms of advantage and baselines and so forth. So let's look at the gradient variance of a very naive approach like this, where I'm just going to call it, like, gradient rl and it's basically the sum of rewards.
[89:50]
A
All right, I see what you're saying.
[90:04]
B
So the sum of rewards is the return, Right? So in our naive setup here, we only have an indicator variable for the return where either you won or lost. So in the case where you lost, well, you just don't train on. Your gradient is zero. You don't train on those examples. And when you won, you try to predict those things. Right? So you can think about this setup as a special case of this general formula here. The trouble here is that this is very high variance, because when you Multiply these terms out when you try to compute the variance of this. And so variance of the gradient is equal to expectation of squared minus. And just for simplicity, we can pretend this is like on average zero or something if you're centering it at no signal. And the variance here basically means that you're taking the square of this product term and so you end up with a term that kind of grows quadratically with T. So variance, when you have a setup like this, this thing acts as a coupling effect on top of these terms here. So let's actually map this to an LLM case and we can answer like why do LLMs only do one step RL instead of a multi step RL scenario? In LLMs you have a decoder that might predict some words like hello world. And so in current lmrl they treat this entire sequence as a single action just at and big T is just one. Right? And so yes, it is true that because of how transformers are formulated through the product of conditional probabilities, we do have, probability of this sequence is equal to the sort of sum of log probability of the whole sequence is equal to the sum of the probabilities of individual tokens. Right? So in this case I would say something like, you know, log hell plus log low plus log world. So this is true. And if this term were one, then they would be the same thing. However, in sampling things, if you have a reward term assigned to every specific token, now you have these interaction effects between the cross multiplication of these terms and these terms. And so the problem becomes how do you ascribe the credit associated with every episode to all these different terms here?
[93:03]
A
I guess the thing I'm confused on is what would that even look like to do it that way in lms, in LLMs, because you only do get a reward at the end of the episode.
[93:12]
B
So you could imagine a reward that says like I'm going to give you some process supervision where you get a reward for each of these actions on every step.
[93:23]
A
Okay, so you're saying instead of doing it that way, where you. Well, I guess the way you've written it, it would be a sum at the end anyways, so they wouldn't have to be multiplied. But you're saying instead of doing it that way, you would just add up this process rewards at the end and then treat that as one single reward
[93:40]
B
signal correct for one single log action.
[93:42]
A
But isn't that how it's written to begin with anyways, like the sum of the rewards?
[93:48]
B
So the thing that's A little bit hidden here in the math is that we're assuming that when you decompose the problem to a multi step problem, that you're now introducing kind of correlations between your actions through the computation of the sky. And so if you separate these things out, then there will be. This will magnify the variance of this one. So in the case where you don't separate it out, if you just have t equals 1, you just have a single estimate of logprop and a single estimate of reward. Now this term still shows up in so in LLMs, it looks a little bit more like the naive reinforced estimator. Looks a bit like return of the single action plus times. You know, It looks kind of like this. This is sort of the very basic form here, but this is still a contributor to variance. So you want to make sure that like you don't similar to how in this case we were training on a lot of neutral labels. You want to make sure that you're sort of penalizing the labels that don't help and only rewarding the ones that actually make you better. So intuitively the analogy here is like, can we find a term in our training objective such that it's actually kind of discouraged from doing this? Or these don't have any effect on the gradient, and this has an effect on the gradient.
[95:13]
A
Right. I guess if you apply that there, the only thing you could do is eliminate 49 of the games. So at least the way you have it written there, it would be 51 times.
[95:26]
B
Actually, the optimal case is to pull out, discard all of these moves and only get a gradient on that single move that you got better.
[95:35]
A
Yeah, but how would you do that?
[95:37]
B
Right, so this is a pretty tricky problem in practice. And so this is where advantage estimation happens in reinforcement learning. So you want to subtract a term from. Your multiplier. Instead of an indicator function of like one and zero, you want something that kind of behaves like a zero for all of these guys and then a one for all these ones.
[96:04]
A
Yeah, but you could do that if you can say, hey, I won in this game. So this is slightly above baseline performance.
[96:13]
B
Well, you won on a lot of games.
[96:14]
A
Exactly.
[96:15]
B
But you don't know which ones let you win because they were truly better versus winning on accident.
[96:20]
A
How would you design a baseline where it's truly better?
[96:22]
B
Yeah, so this is where in RL people use things like TD learning to better approximate the quality function, the queue that we mentioned earlier. So you can try to subtract that from your return. I see. So ideally what you really want to do is in rl, you want to push up the actions that make you better than the average and push down the actions that make you worse than the average. And they call this advantage. There are multiple ways to compute it. I highly recommend John Shulman's General Advantage estimation paper as a good treatment on how to think about various ways to compute it. But at the end of the day, you want to reduce variance by trying to make this smaller and so that it doesn't magnify the variance.
[97:09]
A
So, but this requires you to have a very good estimate of what average performance from a statement would look like. And this gets us back to the value function thing we were talking about earlier.
[97:19]
B
Right. And so keep in mind that in this case, this model free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different, which is it's not trying to do credit assignment on wins, it's trying to improve the label for any given action you took. And so we can actually think about a completely different algorithm called neural fictitious self play, which was used to great effect in systems like AlphaStar and OpenAI's Dota. So let me talk a little bit about how you can kind of unify some of these RL ideas in the model free setting as well as the self play setting. Okay, so what happens if you don't have the ability to easily search a tree? Right. Like in Go, it's a perfectly observable game. You can easily construct a pretty deep tree that completely captures the game state. In a game like starcraft, where you don't have really complete control over the binary, it's a little bit hard to do this. And I'm not even sure if it's a deterministic game. So that makes this kind of difficult from a data structures perspective. So what is done instead is that the basic idea of supervising your actions with a better teacher is still there. Right. So if in a given neural fictitious. So we're going to talk a little bit about how neural fictitious self play works. Same idea. We're going to come up with better labels for each of the actions we took, just like in mcts. But how do we derive the better labels? In mcts, we perform search, and assuming we have a good value function, the search will kind of give us a better result than our initial guess. In a game where you can't easily simulate a search process, what they do instead is train what is known as a best response policy. So you fix your opponent. So let's say you're currently training PI A against a strong opponent, PI B in starcraft maybe like, you know, these are the zergs and you're playing Protoss or something. So you fix your opponent and you treat this as a classic model free RL algorithm where your goal is just to beat this guy. And so here you use your standard TD learning style tricks or use PPO or any actually like, you know, model free RL algorithm to try to hill climb against winning this player. And so you train basically you have a reward function that's like, you know, return is like, you know, one if wins against PI B. So this is no longer a self play kind of problem, right? This is just like a fixed opponent and you're just solving, trying to maximize a score against that and then you know, zero otherwise. And so you have a sort of fixed environment where all you care about is just beating this guy. And once you have a good policy that you train with, you know, pick your favorite model free RL algorithm, PPO or SAC or any kind of mixture or VMPO or whatever, you now have a good policy that gives you a good label for what this one should do when playing against that player. And when you train multiple best response policies, you can basically then distill the RL algorithms into the labels for a given opponent. So you might have, let's say a best response policy against PI B. And then maybe you have a league of, you know, of opponents like pyb, pyc pyd and you're going to take the best response policy that you train against each of these fixed opponents and for this one you're going to supervise them with the label that this one would provide. So it's kind of like this is almost like a proxy for your MCTS teacher, right? Instead of McSteacher, you use a model free RL algorithm to find the best search action that you could do to to kind of beat your opponent. And then finally you're distilling the policy here into what is known as a mixed strategy where it's trying to basically average across all possible opponents it could play against. And this is what gives you something that can do no worse than an average selected opponent from the league. And so this gets around the problem of having to derive a teaching signal from mcts, but it still fundamentally is about relabeling your states with better actions so that they improve your policy and
[102:07]
A
just make sure you understand this is like if you win again against this other policy, you sort of reinforce all the actions on that trajectory.
[102:16]
B
Yes. So here you can use a number of algorithms like ppo, vmpo, Q learning, even if you want, the specific algorithm here can be. It's usually a model free thing because you don't have search. But there's an interesting connection from MCTS&Q learning that I want to bring up. So in mcts you do something where you have a tree, and through the resolution of your value function at the leaves of the tree or your approximate leaves of the tree, you can kind of back up through the sequence of many sequences and then obtain some sort of mean value estimate. Right. Your Q is kind of derived from the average of a bunch of simulations. In model free algorithms, there is often a component of estimating a Q value. And so. And Q values are often learned through TD learning, although in ppo, the way that they do advantage estimation is not necessarily through a Bellman backup. But in Q learning there's this kind of very cool trick where you do QSA is backed up as R plus some discount factor times the max AQ of your next step. So intuitively how this works is like if you have a MDP and then this is like terminal. What this is sort of saying is that the best action you can take at this state is equal to the reward you take for taking this action plus the best that you can do at the next state. So there's a sort of recursive and dynamic programming property of mdps. And you can train neural networks to basically try to enforce this consistency. So you can say like, well, once I know the Q value of this action, so I can then use that to kind of compute something about the Q value.
[104:21]
A
So when earlier I was like, hey, why are we training policy? Why don't we just train the value alone? That is what this is.
[104:28]
B
This is a algorithm for recovering value estimates of intermediate steps when you don't have the ability to do forward search. So you must collect a trajectory first of n steps before you're able to do this trick. But the intuition is kind of the same, which is that knowing something about the Q value here can tell you something about the Q value here. And indeed you can recover a policy from a Q value. So you don't need to explicitly model the policy distribution. You can actually recover the policy distribution by doing argmax over your Q values. So Q learning or this kind of like approximate dynamic programming kind of propagates what you know about the Future queues backward like this. And you can see that there's a sort of similar structure that goes on here where in this case you're planning over trajectories your agent hasn't actually been to yet, whereas in this case you're planning over trajectories your agent has visited. Importantly, why was Q learning a big deal? Right? It's because historically we just haven't had the ability to do search on fairly high dimensional problems like robotics or whatever. So for a long time we kind of make the assumption that, okay, well, if we can't model the dynamics with a world model or something, we're going to instead just collect trajectories and then plan with respect to the only number that really matters, which is reward.
[105:48]
A
Okay, so this is very interesting. And then to unify this with our discussion of LLMs. So with LLMs, you're doing something you don't have Q values, but you're doing this sort of backwards learning where hey, let's find the trajectories which pass some unit test in some coding environment and then let's reinforce those trajectories. And then there's a huge difference between that and this forward approach with mcts. And the reason you can do mcts and it's much more preferable to do mcts because you can do it per move and make each move better, rather than having to learn per trajectory and hope, as Karpathi said, hope to learn
[106:23]
B
this through a straw.
[106:24]
A
Yeah, so you get the supervision through a straw, basically just upgrade all the tokens in a trajectory that might or might not have been relevant to getting the answer right. The reason you can do this much more sort of sample efficient, much more favorable thing with GO is that because MCTS works in go, you basically know that, hey, if I just do search locally here, and this search is sort of truncated at the end by this value function, that works, even if I haven't unfolded my whole trajectory, I can just say this is my new policy and I can improve in a more iterative local way rather than having to unfold all these trajectories.
[107:07]
B
So there was some research I think from Google in 2023, 2024, where they did try to apply tree structures to reasoning. And I think it's the jury's still out as to whether this can ever work. So I would say we probably will see revisiting of this idea of forward search in the Future. But there's two things that make MCTs very simple for Go, which is that value estimation is kind of concrete and you can determine it for real and then you can kind of sort of use it to truncate depth as you said, and then the breadth is also determined. And what's kind of critical is that the action selection algorithm where you iteratively visit and grow the tree is well suited for the size of problem that GO is and the depth of the problem. But for something like LLM reasoning, Pucked might actually not be a good enough heuristic. It might be too greedy with local tokens and it might do something like oh, only give you sort of obvious thoughts that are correct but not really solve your final problem. So I would say the jury is probably still out on how what the final instantiation of reasoning for LLMs would look like. And I wouldn't rule out that this stuff could come back. It's been hard.
[108:22]
A
Don't LLMs sort of natively learn to do MCTs where they'll try an approach and be like oh, that doesn't work, let's back up, let's try this other thing and then go in the direction that proves to be more fruitful?
[108:34]
B
Yeah, certainly. I think the LLMs manage to do something that looks like real human reasoning without having to do an explicit tree structure. That being said, I think the idea of doing forward search and simulation to get a better sense of what is valuable might make a comeback, even though not exactly in the same instantiation as alpha.
[108:55]
A
But just to make sure I understand the crux of it, the breadth from the number of legal actions being wider and the depth from being able to not being able to train a value function as easily because so here's an
[109:10]
B
example where LLMs break down the CPUCH rule involves square root of n over 1 plus NA. In an LLM you're most likely never going to sample the same child more than once. So if you have, let's say multi steps of thinking, because language is so broad and open ended, it's a sort of discrete set of actions is not really an appropriate choice for an LLM, even though they're discrete tokens. It's just such a large number that this type of exploration heuristic is probably not the right thing to do to guide how to search down a tree, right?
[109:43]
A
But I guess the crux comes down to the fact that in GO you know that the mcts is almost certainly better than your current policy, even though you haven't gotten even though you haven't explored the end of any trajectory.
[109:55]
B
Correct.
[109:56]
A
And then in normal reasoning for LLMs robotics there's no way to just locally evaluate and improve your next move in a way that doesn't result, in a way that's independent of actually solving the problem.
[110:11]
B
No way is a strong word. I think lots of people have thought about how to try to apply MCTS or its kind of successors like MU0 to continuous control spaces. And I'm sure very cool research work is still ongoing to try to crack that problem. But yes, the seeming challenge right now is that most problems in much higher dimensional action spaces, or something that's combinatorially much bigger, like language, they don't seem as amenable to the kind of discrete action selection heuristics as well as kind of game evaluation type stuff that GO does. But that's not to say the idea of thinking into the future along multiple parallel tracks might not give you some information about which way to search. If you think about mathematics, I think mathematics often occupies a little bit more of like a logical search kind of procedure where you kind of can back up, you can kind of see which paths seem good or not. There's more of a rigid structure there, whereas maybe in a business negotiation or something, it's less of a tree and maybe something a bit different.
[111:15]
A
Okay, so we're now seated so I can ask you some more questions about AlphaGo and about AI research more generally. In 2021, Andy Jones had a paper called Scaling Scaling Loss for Board Games and he basically anticipated inference compute or inference scaling by showing that you can trade off test time compute and training compute. That is to say that you can spend more compute on the searching through the mcts. And if you do that, you can get the equivalent performance as having spent more time training the model. And so if you see this pattern, you might think, okay, well with LLMs you might do something like that in the future. And in fact that's what ended up happening. Okay, so what is a kind of fun exploration one could do now to explore other axes of scaling in toy settings, which will be important to understanding what AI development might be like in a few years?
[112:07]
B
Sure, yeah. I think that indeed test time scaling and reasoning and how it interacts with model size are quite profound when it comes to how much needs to be actually done as explicit search versus how much can be packed into the forward pass of a neural network? And how does a forward pass of a neural network learn how to do something that should be a sort of sequential and recursive step. That's quite interesting. So the Andy Jones Scaling Laws for Board Games paper is quite cool. There's another really nice result from that paper where he showed that not only can you predict scaling loss of the sort of LLM variety where as you increase parameters, you can decrease the amount of compute for search or vice versa, he also showed that you can actually predict how much compute is needed to solve a larger version of the board game, for example. And so with Go, which can scale from 3 by 3 to infinitely sized Go board, you might actually be able to sort of revisit this question and try to reproduce whether this shows up. I actually started this project with this sort of a motivation that does the bitter lesson or does our knowledge of scaling laws allow us to kind of execute a lot better on a sort of compute optimal GoBot? And can we kind of build a strong GoBot without all of the katago tricks? Right, just by really focusing on the bitter less than the scaling laws? I have not been successful so far, but I think it's sort of a fact that like, usually when you want scaling loss to work, you want to be in the regime where the recipe already works and the data sets are good, rather than trying to kind of figure out how to do scaling while also trying to figure out what the right data set are. Okay, so this is like the scientific understanding component in research often follows a step where you get something to work first and then you use that system to collect data that then helps you build a mental model of how things work, such as scaling. And so usually actually, if you want to build a strong GoBot using scaling laws, you actually have to make a strong GoBot first and then use the scaling laws to kind of extrapolate a bit farther into the future.
[114:08]
A
Say more just so I understand. First of all, you're saying scaling laws did not work or you could not. There was no scaling loss pattern that you could see in your GoBot.
[114:17]
B
Yeah, so a mistake I made initially when I had some bugs around how MCTS labeling was working was I would collect a bunch of data with an expert policy and then treat it as a supervised learning problem and try to identify scaling laws with expert data sets, you can indeed plot things that look kind of like this. But if you're in a regime where your policy is not working well, you might be just studying scaling laws on bad data. Right. So just like one important implementation detail is that if you want to study a scaling laws problem, you kind of have to have a problem for which the data is good, the architecture is good, and there's no bugs, and then you solve it. There's X and T. I Wasn't able to apply scaling laws to direct what to look at until I had the rest of the system working. And this sounds obvious to researchers, of course you want to have a working bug free system before you study scaling. But just as a sort of advice for practitioners on where I actually tripped up when I started this project was you don't necessarily want to kind of jump into the science of studying your man made artifact before your man made artifact is interesting enough to be studied.
[115:21]
A
Speaking of compute, so you can look at these charts of compute used to train the best AI model in the world over time going back 10 years. And it's a very smooth line in log space that is exponentially growing year over year. Except there's this huge aberration and that aberration is AlphaGo Zero, which is trained on way more compute than any other AI model. At the time it was like three E23 flops. That's sort of comparable to a Frontier LLM. I mean orders of magnitude off, but still. And so yeah, the question is, especially with you being able to get something off and did you train it on your own?
[116:02]
B
I got a donation from prime intellect for about 10k and then I spent maybe the first 4k doing kind of exploratory research. And then about 3k on the kind of final run. Yeah. And then some of it remaining for serving the model. Cool.
[116:19]
A
Yeah. Is your sense that they just did a bad job training it? If you can do it in 10k
[116:23]
B
now, the compute required to be the first to do something is always much larger than the compute it takes to catch up. And it's the same story playing out in LLMs once someone else has done it. You could use tricks like distillation, you could use all sorts of crutches to kind of bootstrap your way to success. So with my own bot that I've hosted online, I actually used sort of best response training against the Katago models to kind of get a strong level of performance. And as a time of recording, I'm validating whether this can be. I can kind of do that first step, which is to do the tabula rasa play. But importantly for research, you often want to start from a good init. So the kind of simple thing I did first was train best response agents against Katago alphazero team. They did not have any policy that they could train against. Right. Because they were trying to do everything tabular rasa. And being the first to do it means that you're prioritizing, getting the thing working rather Than let's say the most compute efficient possible implementation. So this actually plays out in robotics as well. If you look at the kind of frontier of large models trained for robotics, the scatter plot is all over the place and there isn't a very clean line the way that there is for Frontier LLMs. And that is because the folks training these models often are not at the scale where every flop counts and they need to squeeze out the performance of every single flop as the dominating deciding factor in pre training. Instead their focus is more like we want a certain capability to show up, so we optimize the training setup to kind of make it easy to derive that capability. And once you have that capability, well invariably if you scale up the compute, you are forced to kind of make it compute efficient because this is like hundreds of millions of dollars we're talking about. But in the past when when compute for experiments was kind of more plentiful or not accounted in a way that the researcher was really responsible for, then you kind of end up with people optimizing for things besides kind of being on the compute optimal prino frontier.
[118:15]
A
I see like speed or something.
[118:17]
B
Yeah, like time to result or just getting to work. I think the first Alphago probably they had lots of compute and they didn't need to worry too much about making it the most compute optimal thing.
[118:27]
A
And how much of the improvements to compute efficiency are methods that did not exist as of 2017 versus things which they could have done in 2017. But yeah, great question.
[118:37]
B
So going into this project I kind of knew in the back of my mind that things always get easier to do over time and I wanted to see where is go at. Given that it didn't seem like there has been any major open source strong bot after Katago in 2020. And then reading the Katago paper, there's a lot of clever ideas. I was kind of wondering, okay, let's see if the bitter lesson has happened where a lot of these kind of tricks just sort of go away because Nvidia made faster GPUs. Right. And so roughly where are we on that? So again, this is not a peer reviewed claim. So this is just my preliminary vibe guess on what I've seen based on my own experiments. But it seems like architecture choices don't matter that much. Transformer versus resnet. We're at the sort of speed of GPU where the size of the model is not so big that this really matters. You can actually simplify this setup quite a lot. So instead of doing a distributed asynchronous RL setup with replay buffers and pushers and collectors, you can kind of do a dumb synchronous thing where you collect, you just train a supervised learning model and then you collect again. And so there's like opportunities to simplify infrastructure. Nvidia GPUs have indeed got faster. So whereas Katago was trained on V1 hundreds, you can train on half the number of desktop Blackwell GPUs and it still works. And some of the kind of auxiliary supervision objectives that Katago developed aren't really necessary if you have a strong initialization. So if you're initializing against best response training against Katago itself, then your own model actually needs none of the tricks that Katago needs. So then the core thing is how can you get as quickly as possible to some strong opponents? And that matters a lot more than the specific architectural innovations. But there are still some nice compute multipliers. So I found that training on 9x9 boards was very nice for resolving endgame value functions. And then if you can co train that on an architecture that can transfer between 9x9 and 19x19, then you can really cut down the warm start time to learn that from scratch. I think AlphaGo Zero, their plot was first 30 hours or so are spent basically catching up to the supervised learning baseline. And you can cut down that time a lot by kind of pre training on a small board and then warm starting that into your 19x19 board play. There was some other stuff like varying the number of sims between episodes. This turns out to be not that sensitive actually. You can kind of fix it or increase. It doesn't matter too much. But so anyway, it's kind of just nice from a scientific perspective, just revisiting an old paper and seeing what really matters.
[121:11]
A
This is a tangential question. Why is it okay to have a buffer in AlphaGo? Because every time I talk to an AI researcher, they're telling me about how bad it is to be off policy. But then the way a naive implementation of APOGO zero would work is that most of the moves in a given backward step or in a batch of backward steps would be not among the ones that were made by the most recently trained model. So why is that?
[121:38]
B
Okay, great question. Yeah, and this gets into the sort of fundamental off policy versus on policy reinforcement learning kind of questions. So as you recall, in mcts, you take actions that you took and you relabel them to take different actions on the same states. So the off policy part here comes where, what if you're relabeling states that your new policy would never visit. What's the point? You're kind of wasting capacity. And in the extreme limit, imagine your distribution of states in your training buffer are all states that you would never visit. Then you're basically supervising them to take good actions on states you would never achieve. And therefore your policy can get really bad. So this is where off policy can really hurt. AlphaGo. However, if you interpret this sort of from the dagger perspective, which is basically saying a way to kind of correct yourself back to the optimal trajectory given some data. What you kind of want in an algorithm like this is to have mostly states that you would visit, but then you have a small percentage or maybe a reasonable percentage of states in this kind of high dimensional tube around your optimal trajectories. And any of those states are given a supervision target to kind of sort of funnel you back into your optimal trajectory. So maybe I can just draw quickly here. So in sort of a dagger style setup, what your kind of optimal training data distribution is is that here is your optimal states and actions. So this is like you want to be in this state, you want to be in this state, you want to be in this state, and then you win here, and then these are your optimal policy actions. So these are the things that you definitely want to train on. But to make it robust to disturbances, you want to make sure that if you happen to drift off into some other states, you can kind of funnel yourself back into.
[123:30]
A
But why isn't this a fully general argument for OP policy training?
[123:33]
B
This is actually why you want to do off policy training sometimes is that you don't want to have a compounding error where if you make a mistake, you don't have the data of how to return back to your optimal distribution. And so optimal control does not really say too much about how to not accidentally get here because it's sort of making the assumption that once you learn the policy, you're going to get here. But in applications like robotics, like, I don't know, a gust of wind blows you slightly off and then now you need to correct. Right? Or the friction on one of your tires is kind of a little bit lower than the other wheel and then now your car's drifting and you gotta correct it. So these kind of things in more real environments often happen where like actually there's a funny quote about chess and also go is like the problem with Go and chess is that the other player is always trying to do some shit. So things can kind of drift off and you always want to be able to correct back to your waiting condition. So your replay buffer really should have the states that your policy would visit, plus some distribution of states that you might drift to and then how to return back to your optimal states. Now if you take this to the extreme and you say like, well, we don't have any of this data. And we're going to just be labeling with mcts states that are so far away from our optimal behavior. Like this bag of states over here. Well, now, yeah, I mean, each of them gets mcts label and your policy learns how to take sort of the best possible action here, but you never get here. So you're training your model on states you would never reach. Like this is not there. So then this is a problem and this is where off policy can really hurt. So actually, as part of this project, I did try an experiment where I took a bunch of trajectories and to try to saturate the GPU as much as possible, what I did was I took random states from the data set and reran mcts on just those states. So instead of playing a whole game where I'm doing mcts on every move, I just ignore the sort of causality of moves and just pick random board states and I just label those with my current network and I might revisit old states that I've labeled before and relabel them again with my current network. And so in practice, this actually does work. You can actually say like, let's take some states that are reasonable and constantly be relabeling them while we're training. And so this actually starts to converge on a very robotics like setup, which is very common, which is you have your, your data set of trajectories and then you have something like a replay buffer pusher. And these are off policy offline trajectories, right? So your replay buffer pusher pushes transition tuples to the replay buffer and then you have some job that's kind of continuously replanning what the best action you should have done instead of taking this action is, right? And so in robotics it's actually very common to use the sort of minimized TD error. So like your bellman updater constantly is pulling things from here and trying to satisfy, you know, the qsa. So, so, and then, and then from here you have your trainer which is trying to fit the S to A or, or, or fit the, you know, Q to the Q target. So, so here you can think about this as a sort of planner, right? You visit old states that you've been to and you take your current model and you rethink like, what could I have done better if I visited this? And so this is actually how like kind of off policy robotic learning systems are usually trained these days. There's a sort of simpler recipe, but like, you know, in the Google qtop days, we kind of did things like this.
[127:48]
A
So what is the trainer?
[127:49]
B
Oh yeah, the trainer is you try to Minimize QSA and QTarget.
[127:58]
A
Can you explain the whole setup again, like at the high level?
[128:01]
B
Yeah. So you have your off policy data that came from various policies. You're constantly pushing transitions that you saw before to a replay buffer. And then you've got this thing called a Bellman updater, which basically replans instead of this action, what action should I have taken at S to have a better value? And the way you enforce that is you try to minimize the TD error. So actually, given this, you have S prime, right? You compute Q of S prime and you find the action that should go with S prime that makes this Q value as high as possible and then you add that to the reward here and that gives you your actual target. Right. So for this current S and A, your Q target is this. So now you have a. Now you send back the Q target to this transition. So with this tuple, you pair with that a Q target. And then here on the trainer, you simply just use supervised learning and you minimize your current network's QSA with its target. Got it.
[129:01]
A
Okay. So in the background you're just like, hey, let me basically think through, how valuable were all these actions?
[129:07]
B
Actually, yeah, in a more optimal policy where you're trying to maximize this, what is the CU target of this transition?
[129:13]
A
It's sort of like basically daydreaming.
[129:15]
B
Exactly. Yeah. You can think about it. You're kind of going back in hindsight and being like, given what I've seen in historical buffer, was there a better action I could have taken? Now the connection to go here that I tried and it was moderately successful, but too complex to open source, was you replace this with a MCTS relabeler where instead of doing this kind of target network computation, you run MCTS on your transition. So in this case, you have your state, your action, and then whether you want or not at the game, and actually you can just toss these two. You don't care about these ones, you just take your state and you just plan mcts to get your best policy PI on your current network, not the network that took this action, but your current best policy network. You just rerun your search offline on these transitions, and if these are transitions that your policy can get to, then this actually acts as a very nice stabilizing effect. And also one other benefit is that you can kind of fully saturate your GPU better because you're not blocking on the go game to kind of give you board states. You just simply search across all board states at any depth in peril.
[130:35]
A
Yeah.
[130:36]
B
And then here the trainer would be just predict the MCTS label as possible. So again, this kind of works, and this is quite relevant in robotics where you just have a lot of offline data and you can't simulate things like mcts. But in practice it does run into the problem where if the current model is looking at states that it would never reach, then it's kind of wasting capacity. And so you have to be a little bit careful here. So the on policy thing, and also much of RL has kind of converged to a much more on policy setup where they don't really try to directly train on OFF policy data. At best, they use off policy data as a way to reduce variance, but not directly influence the objective.
[131:14]
A
I'm sorry, why have they converged to that?
[131:16]
B
It's just more stable. You might use the off policy Q as a way to do advantage computation like Q minus sum of Q, that's your. Or, sorry, sum of there's N actions. And then, yeah, so this is your value and then this is your kind of current Q value. So your advantage for that action is the average value minus your current one. So people can try to estimate Q in an off policy way and then just use advantage here. And then if there's a problem in these dynamics, it doesn't blow up your loss as much. And so in robotics there's a kind of convergence towards more like a using off policy data to just shape your rewards, but not actually be directly your.
[132:04]
A
I'm reminded now of our earlier conversation of why MCTS is so favorable as compared to the kind of reinforce or policy gradient kind of thing LLMs do. And this might be totally wrong, but I wrote a blog post a few months ago about how rl, at least policy gradient RL is even more inefficient than you might think. And so the inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory in order to get any learning signal at all. And so as these trajectories become longer and longer, as an agent has to, instead of just previously complete the next word in the sentence, it has to go instead to, hey, do two days worth of work to figure out if you even did this project correctly. The amount of information per flop has been decreasing as you had to unroll two days worth of thinking in order to see if you even did something correctly to implement this feature. The amount of samples per flop has been decreasing. But so you can think of, you're trying to maximize as you're learning bits per flop, right? And you can think of bits per flop as samples per flop times bits per sample. And what I just mentioned a second ago is that the samples per flop go down as RL becomes more and more long horizon. But at least this kind of naive RL is also terrible from a bits per sample perspective. Here's what I mean, at least compared to supervised learning. So, early on in training, let's say you have a vocabulary size for an LLM that is 100k long, so there's 100k possible tokens that one could answer. And you have a totally untrained model and you have a prompt like the sky is with supervised learning. What would happen is that the model would have some probability distribution over all the things it could say. There's a label that says, actually the term here is blue. And it would learn basically through cross entropy loss exactly how far its distribution is from correctly saying blue. Now, if you were doing this through rl, you would say the model would try. The sky is, nope, that's wrong. The sky is told, nope, that's wrong. This is a totally untrained model, right? And so you would have to do this on the order of 100,000 times in order to just stumble on blue, then get some learning signal off of that. So if you're in the supervised learning regime and you have your distribution of probabilities, you get told that it's blue and you figure out how far off you are. The amount you learn is a function of your pass rate. So the further away you are from blue, the more you've learned to go towards blue using cross entropy loss. And so you can think of it as your pass rate, your prior probability of having said blue, and as a function of that, in supervised learning through cross entropy loss, you would learn negative log p p being pass rate bits once you get this label. Whereas in rl, if you're just randomly guessing shit and seeing if it works or not, that's just basically going to be the entropy of a binary random variable, which is,
[135:37]
B
and what's also tough here is that actually the distribution that you're sampling under is your policy's distribution. So it's like if your policy has no chance of sampling blue, then you will never get a signal.
[135:49]
A
Exactly right. So that's being modeled by the fact that your probability of sampling blue is extremely low. If you do sample it, you do learn as much as you would have learned in a supervised learning. In all other cases, like 99.999% in an untrained model, you're just learning incredibly little from seeing halicon is not the correct word or told is not the correct word. And that's what happens most of the time. So you just learn very little. So if you try to graph, if you put on the x axis your pass rate, and here you put the bits you're learning from a sample, if you have like 0% here, 50% here and 100% here, so the end of trading, you're here. If you have supervised learning, negative log pass rate would look something like this, and then the binary random variable would look like this. And this is depending on whether you're doing nats or bits. Yeah, if you do bits, it's like one right here at the peak. This is like a coin flip. You learn the most from a coin flip. This is supervised learning. This is rl. However, the problem is you spend most of training in this regime, right? Like in the low pass rate regime. And in fact, if how fast you're learning is a function of how many bits per sample you're getting, and you're getting very little signal here. If you chart the pass rate on a log scale, so you put the x axis on a log scale where at the beginning of training, with a vocab size of 100k, the pass rate is 1 over 100,000, then 1 over 10,000, 1 1000, 1100 and then 100. Okay, what this graph looks like here, where supervised learning would look like this, And then RL, if you just basically crunch what I just showed there, it would look like that.
[138:11]
B
Yeah. And arguably you spend all your time here, potentially never even getting a single success, Right?
[138:17]
A
Exactly.
[138:18]
B
So it's a sort of depressing plot in the sense that once you're here, it's not at all obvious how you get to here. Once you're here, you have something, but you actually in many RL problems spend all the time here. So there's a sort of question of how do you initialize so you're at least not at zero, but at a non zero pass rate. One more thing I'd like to add about bits per sample. That's very relevant to any kind of machine learning problem is that and there's a connection to soft targets and distillation where if you have access to the logits, not just the one hot, like this is a sort of one hot token answer. If you have access to the soft targets, the entropy of this distribution is far, far higher than the one hot. So there's actually way more, there's way more information and bits per sample in a soft label. So that's why distillation is so effective per sample is that it's actually giving you way more information per sample.
[139:17]
A
Yeah, well, I wonder what the equation would be. But obviously it would just be the
[139:20]
B
entropy of this distribution. So the entropy of this is zero. The entropy of this is like the entropy equation. And this is also why AlphaGo is quite beautiful. In AlphaGo you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution.
[139:37]
A
Interesting.
[139:38]
B
But both of these are actually valid. And if you wanted to do a scientific experiment of how important are this soft label dark knowledge distillation, you can run an experiment where you retrain the policy network on the action mcts selected rather than the software.
[139:52]
A
Interesting. Earlier I was sort of stumbling around intuitively. Why is this ability to do iterative search where you don't necessarily need to be able to win the game in the beginning, you just need to be able to improve your current policy. Why is that so powerful a capability in learning as compared to how LLMs currently learn RL? And yeah, it's exactly this thing of this is considering your pass rate of the entire trajectory. I actually don't know a formal way to think about this. Maybe you should help me out here.
[140:26]
B
Why is AlphaGo an elegant RL algorithm? So the major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of how to get a non zero success rate. And this is what allows you to hill climb this beautiful supervised learning signal. And if you look at the actual implementation of AlphaGo every step of the way, there's actually no TD error learning or dynamic programming at least explicitly. It's just supervised learning on a value classification as well as a policy KL minimization. So it's just a supervised learning problem on improved labels. And so the training is very stable. You can train as big of a network as you want. You can kind of retrain this on the data set. Everything will just go stably. The infrastructure is very simple to implement as well. You don't need a complex distributed system to kind of keep everything on policy. At the end of the day, you're just saying, like, I have some improved labels. Let's retrain my supervised model on these targets. And so you're always in this beautiful regime where you're just trying to improve the policy rather than escape this kind of local minima where every signal is flat all around you. So one way to draw the curve is like if you draw the sort of win rate of an MCTS policy versus the RAW network. Let's say this dotted line is the RAW network. The MCTS policy kind of looks like this. And so every step of the way, this supervision signal is very clean.
[141:55]
A
Right.
[141:56]
B
You're never in a situation where the MCTS is kind of like giving you no signal unless your MCTS distribution converges to exactly what your policy network predicts.
[142:06]
A
Yeah, yeah, yeah. Okay. That's a great way to explain it. Cool. Okay, maybe we sit down and I ask some questions about automated research.
[142:17]
B
Sounds good.
[142:18]
A
One thing I really wanted to talk to you about is that you did a bunch of the research for this project through this kind of automated LLM coding assistant loop. And there's an idea that if you fully automated AI research, you could have some sort of singularity. Obviously, we're not there yet, but to the extent that we have early indications of what this process might look like, I am curious what your observations about what the AI is good at, what it's not good at, what you think about this scenario's likelihood eventually. What thoughts do you have about this in general?
[142:55]
B
For sure, yeah. I think automated scientific research is one of the most exciting skills that Frontier Labs are developing right now. And I think it's important for everyone who's doing any kind of research to get a good intuition of what it can do now and what it can't. And how might the sort of science process work in the future Once we're having AIs automating a lot of this investigation? So, in brief, I mostly use Opus 4.6 and 4.7 throughout working on this. And what works is that the models can do a very good job of doing hyperparameter optimization. So in the past, people would kind of come up with a search base of hyperparameters, like learning rate and weight decay and maybe how many layers are in your network. And they would just kind of do a grid search or a sort of Bayesian hyperparameter optimization approach, and then it would find some tuned parameters. The kind of really cool thing that automated coding can do now is that it can search a much more open ended set of problems, right? It can say like, well I've identified that the gradients are kind of small in this layer. So let me change it up here, let me rewrite the code. So the data loader has a new augmentation I came up with. Let's sort of try to find the best way to kind of fit the constraints of the optimization problem and you end up with this much more flexible and kind of high level, almost like grad student like ability to just grind a performance metric. And so this can squeeze out quite a lot of performance on a fixed data set with a fixed time budget, improve perplexity by quite a lot on a sort of classification problem like LLMs or go. And it is also fantastic now at basically executing any experiment, right? So I have a Claude skill that I wrote called Experiment where I give it a description of what I wanted to plot. And like I just described, here's the x axis I want, here's the y axis. Answer this question for me and it'll go run off and do all the experiments, compile the plot, make a report and suggest what might have caused it or so forth. So that's what works quite well today. And I think we can expect that these abilities get better in the future. But it's also kind of useful to know what is it not doing so well today. So on my blog version of this tutorial I have a plot of basically all the kind of experiments I did grouped in a sort of tree where every node kind of represents a failed, successful or sort of mixed experimental result. And then from there it branches off into a child where it's like the follow on experiment. Occasionally I'll kind of rabbit hole down a track like this off policy mcts, relabeling, do a few experiments and then realize it's probably not worth it. So then I'll kind of jump to a completely different track and I call these kind of things like rows. What I find is that current closed models that we can access, the public can access today, they don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to kind of step back and do the lateral thinking of like wait a minute, this track doesn't really make sense. Let's go back to sort of first principles and think about what the bottleneck might be or what are we trying to achieve. And so often I had to catch infra bugs myself. By prompting the right question to Claude to investigate what is causing this discrepancy. And then it'll answer the question. I think with Mythos class models or Mythos models coming online, maybe this just completely changes and these problems just fall to just improve scaling. But at the same time, I think there's a lot of rich opportunity to develop RL environments that might incentivize this kind of lateral thinking. And so one of the motivations for setting up this Go environment was that I think that Go captures a lot of very interesting research problems, often overlapping with LLMs or robotics. And yet it's very quick to verify the outer loop is ultimately like, does the agent do what I think it does? And you can kind of check the outcome of a Go game quite easily. And then the inner loop involves all this kind of like, you know, research engineering around distributed systems, predicting whether an idea is going to work or not, predicting the, you know, the difference a particular modification to your training algorithm might make. And I think there's a rich library of subtasks and sub environments that you can kind of train an automated scientist to work on with Go as a sort of outer verification loop that then once you acquire these skills, maybe you can apply them to other domains like biosciences or robotics, or automating AI research.
[147:09]
A
Or automating AI research, which is the real crux or the scary incredible thing for just making AIs, making future versions of AIs. And you're suggesting the outer loop here could just be your win rate against Katago?
[147:22]
B
Basically, that's one of them. I think there's a lot of deeper questions that one could tackle. Right. So for example, let's say you have an idea on how to improve a scaling law compute multiplier. The outcome isn't necessarily like, I achieved the best GoBot ever. The outcome might just be like, can I predict what the win rate of my GoBot will be? Or can I predict the scaling law plots that emerge from my idea? But then you can verify that you haven't kind of reward hacked anything by using a very verifiable game like Go on the outer Loop.
[147:54]
A
I think there's a couple of interesting follow on questions. There's questions on the inner loop and the outer loop. On the inner loop, there's a question of how locally verifiable any modification you might make is. That is to say, would you know whether something is actually improvement or a degradation, some idea you try out? Would you know that if something isn't working as a result of a bug or is it the result of the idea itself being wrong? Ilya was talking about why having one of the reasons he thinks he's a good researcher is he is a good researcher. One of the things he thinks makes him a good researcher is that he has intuition about, he has strong belief in what the correct idea is. And he is able to persevere through bugs and know which things are bugs versus mistakes in the fundamental idea based on his high level belief about this idea should work. So therefore it's there has to be bug versus the other way around. Why don't we start with that question actually? Yeah. How locally verifiable are things which are good ideas?
[148:56]
B
Yeah, I think as in the case of the success story for deep learning, you can think about this as like a decades long idea that took a lot of faith to get it to work. And so this presents a very challenging long horizon RL problem where every step of the way you have a committee telling you that this is a bad idea and then ultimately you break it through.
[149:17]
A
Right.
[149:18]
B
And so how do you design RL environments that maybe give you some feedback earlier? And I think this is a very tough, open question that I don't have an answer to. But ultimately, to play a very strong Gobot, you probably did need to discover deep learning. And so I think that having a challenging game that cannot be cheated easily on the outer loop could be used as a sort of outer loop signal for something like discovering the principles of deep learning. Now, of course, to make it tractable, and this is where research taste really matters, you have to come up with ways to initialize your problem so that you don't solve a sort of very intractable problem. Maybe you can leverage LLMs as a sort of universal grammar in the middle to kind of give you some sort of local feedback. The fact that LLMs are universal grammar means that they can kind of move at almost any level of the stack, right. They can think very locally as well as step back and think in very broad steps. And I think that's where a lot of the lateral thinking ability of humans kind of come from. How to know if the track that you're pursuing or the objective that you're pursuing is not right. And you should be asking a different question.
[150:31]
A
The other question is how stackable local improvements are in the attempt to get to a better result on the outer loop. I've heard rumors that at some AI labs the thing that has gone wrong is that people will individually pursue good ideas, but those don't end up Stacking well. And so the training run falls because of some weird interaction between two seemingly good ideas. And having a single top down vision of how things should work is very important. Having worked at different AI labs and also playing around with, I guess, parallel agents trying different ideas, what is your sense of how parallelizable AI innovation is?
[151:08]
B
Yeah, great question. I think the research taste for executing well on the bitter lesson is that you need to know how much the bitter lesson can buy you and how much is too much to ask for at any given moment. Of course, in the fullness of time, compute kind of is the single most important determinant on how things work. And it's almost inevitable that as you scale up energy and compute and parameters, intelligence will just fall out of that. And that's super beautiful, super profound. No algorithmic detail really matters beyond that. But in present day we don't have infinite compute and parameters and arbitrarily good initialization. So we have to come up with heuristics that kind of give us that. But these heuristics are probably somewhat redundant. So that's probably why you see this effect where a lot of these compute multipliers don't necessarily stack, is that they might have some correlated benefit. And then three years down the line when the Nvidia GPUs have gotten even stronger, maybe they stack even less. Well, maybe at any given point in time, the sort of benefit of any given compute multiplier is transitory, which is what I sort of suspected with the Katago paper. There was many algorithmic ideas kind of applied. And then you can see that with modern Blackwell GPUs and ADA class GPUs that are much better than the V100 grade GPUs that paper used. You can see that some of these algorithmic tricks to speed up convergence just don't matter so much compared to something else. And I think that's a matter of taste in the present time.
[152:42]
A
Yeah, interesting. How about the outer loop? How verifiable for making AI smarter? With Go, you do have this outer loop of win rate against the best open source model out there. And even there, as you were saying, there are other outer loops of. Did you discover a new phenomenon which is actually very hard to. If you didn't know scaling laws were important? If you're back in. When was chinchilla or Kaplan scaling laws released? Like 2019, 2018? Yeah. So if you're back in 2015, would you. There's not an automated procedure one can easily imagine of knowing which paper is the scaling loss paper versus which is just like another random plot. And so that even in the go case is a hard to verify outer loop. And the whole idea of an outer loop is to have some backstop on improvement, but let alone for general AGI, where of course we have a bunch of these benchmarks. But there's a problem that we know the things we can measure and we improve on the things we can measure. But we care about this broader ability to do economically useful work, which is at least until you automate everything, not super easy to measure. So yeah, there's a question of okay, how good is the outer verification loop for AI self improvement and does that matter?
[154:02]
B
Yeah, I'm going to give a non rigorous argument, but one that I kind of intuitively believe which is that DeepMind, the AI research lab, they started as a sort of focus on games, right? They kind of used games as their outer loop and then their researchers learned from experience of solving games and then now they're working on LLMs and presumably there was some positive transfer from their time working on games and like Atari and Go and Starcraft that now helps them make good lms. I assume that there's positive transfer in some regard, whether it's coding or general research ability or project management, all these things kind of probably help them do well. And so if that's the case, why wouldn't it also be true for automated AI researchers? They should be able to positively transfer experience, tackling quick to verify, quick to iterate on environments to something more ambitious and economically useful like automating drug discovery or so forth.
[155:00]
A
I mean, I don't know, isn't the issue with historically until Gemini through or whatever a couple of years ago, people were saying look, Google isn't catching up in LLMs because they're too tied to the old approach. And yeah, there's gains, but there's also ways in it which actively hinders you. So it's actually not obvious to me
[155:26]
B
that there's like the jury's still out, right? I think who knows if the, let's say currently Google's doing quite well. Who knows if the initialization on training on games is ultimately going to hobble their ability to be the winner in the long term. It's hard to say for sure. And likewise, who knows if the seeming late start was really just them kind of pre training for longer on how to scale up TPUs. They invested all their tech tree in getting TPUs to be good, which seemed not that useful in the short term, but then long term it becomes maybe like a. So it's even hard for humans to reason about what the optimal research strategy should be, even with the data we have today.
[156:07]
A
Yeah. Cool. Okay. We should let people know how they can find out more about this project, Whether to fork it themselves, whether to check out your blog post where you do an excellent job explaining many of these ideas. Where do people go next?
[156:21]
B
Great. Yeah. So my website is evgenian.com there's a blog post that kind of links to an interactive version of this tutorial. And on my GitHub, which is the username is just Eric Jang, there's an autogo repo that people can fork and reproduce the training results.
[156:36]
A
And I also highly recommend people check out this blog post as Rocks May Think, which we touched on some of the ideas in this conversation, but it's this grander thesis of what happens when you have thinking as a primitive in computer science. Exactly. So I highly recommend people check out that block list as well.
[156:55]
B
Yeah. And I encourage the audience to think about the relationship between thinking and GO via mcts and search and how it relates to LLMs. I think there's something quite profound there and probably under explored just because GO has been relatively under explored compared to the boom in LLMs. It's not to say that I think we should have trees in our LLMs, but there is some very interesting duality between them. And you can actually do a lot of research on GO mcts and reasoning with very small budgets. That's very exciting.
[157:26]
A
Cool. Awesome, Eric. Thanks for doing this.
[157:28]
B
It's an honor to be on the podcast.