Summary

Y Combinator Startup Podcast — "Beyond Bigger Models: Recursion As The Next Scaling Law In AI"

Date: May 1, 2026
Host: YC (A)
Guest: Francois Chaubard (B), YC Visiting Partner

Overview: The Evolution of Recursion in AI Reasoning

This episode dives deep into new frontiers in AI scaling—moving beyond simply enlarging model sizes toward leveraging recursion at inference time. Host and YC visiting partner Francois Chaubard explore two pivotal research papers from 2025: the Hierarchical Reasoning Model (HRM) and Tiny Recursive Model (TRM). The discussion elucidates how recursion, inspired by both neural and algorithmic principles, is reshaping the landscape of model architecture, efficiency, and reasoning capabilities.

Key Discussion Points and Insights

1. Limitations of Scaling Large Language Models (LLMs)

Historical context: The conversation revisits the evolution from RNNs and LSTMs (pre-2016) to attention-based transformers. RNNs had inbuilt recursion but suffered from vanishing/exploding gradients and prohibitive memory costs when scaling up (01:29 – 02:29).
LLM trade-offs: Transformers circumvent backprop-through-time issues by executing all timesteps in one forward pass, but lack powerful latent (hidden state) reasoning and compressive capabilities inherent to RNNs (02:29 – 03:46).
Fundamental constraint: LLMs are limited in their ability to perform certain types of reasoning or computation, notably tasks requiring iterative, stepwise logic like sort algorithms or Sudoku puzzles (04:08 – 05:57).

"If I have a list that's 31 characters... my transformer is 30, I run out of steps to do comparisons... In HRM and TRM they use Sudoku as an incompressible problem." — Francois Chaubard (04:54)

2. The Power of Recursion and Memory

Recursion as a "memory tape": Effective recursion in models is likened to Turing machines with access to memory, enabling handling of problems which cannot be solved by mere token-by-token prediction or "chain of thought" LLM outputs (05:29 – 06:30).
The challenge with traditional approaches: Attempts to enable advanced reasoning with LLMs using chain-of-thought (COT) or external tool use can only reach as far as human-provided dataset examples or functions, unable to discover new reasoning strategies independently (17:27 – 19:46).

"Both hacks to solve this in COT and tool use, you're bounded by the bounds of human knowledge. In the event it's outside the set of human knowledge, then you're kind of sol."
— Francois Chaubard (19:46)
Discrete vs. latent space recursion: LLMs are limited to operating in discrete token space, while recursive RNNs can perform reasoning in much more expressive, continuous latent spaces—if only they could be effectively trained (19:46 – 20:41).

3. Paper Deep Dives: HRM and TRM

A. Hierarchical Reasoning Model (HRM)

Biological inspiration: HRM draws from neuroscience—different brain regions operate at various frequencies/hierarchies, mimicked by nested loops in neural architecture (07:43 – 08:23).
Three levels of recursion: HRM features a low-level iterative module, a higher-level module, and outer refinement steps—all weight-shared to enable recursion (09:03 – 09:22).
Groundbreaking results: Achieved state-of-the-art performance on ARC Prize challenges with only 27M parameters and no pretraining (09:50 – 10:05).

"This was only a 27 million parameter model ... no pre training at all. This starts from literally tabula rasa... It got something like 70% on ARC Prize 1." — Francois Chaubard (09:50)
Novel training trick: Instead of back-propagating through all recursion steps (as in classical RNNs), HRM utilizes a "deep equilibrium learning" (DEQ)-inspired fixed-point iteration, only back-propagating through the main modules and treating successive hidden states as if they were distinct mini-batches (11:22 – 13:19).
Key insight: The "outer refinement loop" is the core driver of performance gains—more so than details of the inner hierarchies or even deep recursion (20:52 – 21:46).

B. Tiny Recursive Model (TRM)

Ablations and simplification: TRM further streamlines HRM by collapsing the double "low/high" reasoning modules into a single, weight-shared module, showing that networks can learn to separate features internally (23:25 – 24:09).
Even smaller, even stronger: With only 7M parameters, TRM reached 87% on ARC Prize 1 and outperformed vastly larger LLMs on specific reasoning benchmarks (33:25).
Optimization details: Instead of full unrolling, TRM relies on truncated back-prop-through-time (T=1), showing that minimal backward steps plus sufficient test-time iteration was enough for strong performance (22:05 – 23:07).
Expectation-Maximization flavor: Both HRM and TRM resemble EM algorithms—alternating updates to local and global hidden states to iteratively refine candidate solutions/memory (24:23 – 27:17).

"The most important part... is it actually is able to discover things without being teacher forced via a chain of thought." — Francois Chaubard (27:19)
Code breakdown: The hosts describe the models’ logic in code form, emphasizing the simplicity and elegance of alternating "local" and "global" state updates, outer and inner loop structure, and key decisions regarding gradient flow and recursion (27:38 – 32:45).

4. Scaling Laws: Beyond Model Size

Challenging “bigger is always better”: The HRM/TRM results show that recursion (temporal depth) rivals or surpasses sheer parameter count in solving certain types of algorithmic or compositional tasks (33:05 – 34:39).
Possibility for hybrid architectures: The most exciting future direction is combining the generalization and representation power of huge LLMs with task-specific, recursive reasoning modules for efficiency and deeper algorithmic understanding (34:39 – 37:10).

"When you take the benefit of both these TRMs and these giant models and you actually slam them together ... I think that it's just going to take off and it's going to be really huge." — Francois Chaubard (34:39)

Notable Quotes & Memorable Moments

On the "memory bottleneck" in transformers:
"Every single decode that I do, I still have to retain the entire Shakespeare novel just to decode a little bit." (B, 02:55)
On bio-plausibility vs. practical performance:
"We use bio plausibility to inspire us... but we end up veering away from the bio plausible to something... that seems to work better. I think it runs better on a GPU." (A/B, 16:18 – 16:20)
On the future of efficient, general-purpose AI:
"A 7 million parameter [TRM] wins. The right answer is to take the amazingness here and ... slam them together [with large LLMs]... I think it's going to be really huge." (B, 34:39)

Timestamps for Key Segments

[00:00 – 02:29] — Evolution from RNNs to LLMs; problems with scaling through backpropagation
[04:48 – 06:30] — Limits of LLM's reasoning vs. recursive architectures; Turing analogy
[07:43 – 10:05] — HRM architecture, hierarchical loops, and major results
[11:22 – 13:19] — Backpropagation innovations: DEQ and fixed-point learning
[17:27 – 20:41] — COT/tool use vs. true recursive learning; continuous vs. discrete reasoning
[20:52 – 21:46] — HRM’s key finding: importance of outer refinement loop
[23:25 – 25:54] — Architectural simplification and performance boost in TRM
[27:38 – 33:05] — HRM/TRM code walk-through & iterative logic
[33:05 – 34:39] — Implications for scaling laws and future hybrid models
[34:39 – 37:39] — Future of recursion in AI, integration with large models, concluding thoughts

Conclusion: Recursion as the Next Scaling Law

The episode’s closing reflections position recursion not only as a conceptual link back to the earliest neural networks, but now, as the frontier for practical progress. HRM and TRM prove that carefully designed recursion can produce models that are both more efficient and more capable at logical, stepwise reasoning—often outperforming vastly larger, purely feedforward LLMs in these domains. The likely paradigm of the future: hybrid systems combining the vast semantic knowledge of LLMs with the powerful, iterative reasoning of recursive architectures—pointing AI scaling toward smarter, not just bigger, models.

Final thoughts from Francois Chaubard:
"Recursion is important and it's not going away. When you take the benefit of both these TRMs and these giant models... it's going to be really huge." (34:39 – 34:57)

For more:
Check out Constantin’s YouTube scaling ablations and read Melanie Mitchell’s work on recursive reasoning limits. Stay tuned for future deep dives into hybrid and multi-agent architectures as recursion becomes central to AI’s next scaling law.

Transcript

[00:00]
A
Welcome back to another episode of Decoded. Today I'm back with YC visiting partner Francois Chaubard to talk about one of the most interesting recent trends in AI research: recursion. Specifically, we're going to talk about how we can improve a model's reasoning performance by using recursion at inference time rather than by just making the model bigger and bigger. There were two papers that made the power of this approach really clear in 2025: one on hierarchical reasoning models, or HRM, and another on tiny recursive models, or TRM. Francois, thanks for joining us. Can you tell us a little bit about these two models and what was so interesting about them?
[00:43]
B
Sure. I guess to set up a little bit of a foundation: you already did an amazing lecture on RNNs and LLMs in one of the previous videos, so I won't overdo it. But just to give the CliffsNotes, an RNN is just a model that you recursively call again and again and again on itself. And we were very much in the belief that this was required to get to AGI. Peak RNN use was probably until 2016, with Alex Graves's NeurIPS keynote, which is just fantastic, and all his adaptive compute time work.
[01:16]
A
So this is about 10 years ago people were working on these models. This was in the era of LSTMs and LSTMs with attention. Yeah.
[01:22]
B
And depending which professors you talked to before attention was invented.
[01:26]
A
Yes, yes, totally.
[01:29]
B
And I think what really was the limiting step on RNNs in general was this thing called backprop through time, where you roll out the model and then, to update the weights, you need to approximate the gradient, and you step back, back, back and you keep rolling out. And as the model gets bigger and bigger and as you roll out for more and more steps, you have all this accumulation of errors and the gradient gets noisier and noisier and then it just kind of stops working.
[01:54]
A
So you have these like vanishing or exploding gradient problems. And it's because if you have an input with 20 steps you're like multiplying these matrices 20 times and that causes,
[02:01]
B
and we're talking about doing context lengths of like a million or like a billion. And so it's not even just 20, it's like a billion. And even worse, you have to retain the activations at every single step. And so if this were happening in your brain, you would need a million copies of your brain, one at every single activation, so that I can backprop through it. There are tricks around this: you can do gradient checkpointing and things like that to reduce that issue. But then you're just trading off memory for wall clock time and compute.
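A minimal sketch of the two costs described above, assuming a toy scalar recurrence h_t = weight * h_{t-1} rather than any real RNN. The function names and the numbers are illustrative assumptions, not taken from the episode.

```python
def gradient_scale(weight: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy linear recurrence h_t = weight * h_{t-1}."""
    scale = 1.0
    for _ in range(steps):
        # Backprop through one unrolled step multiplies by the same weight again.
        scale *= weight
    return scale


def stored_activations(steps: int, hidden_size: int) -> int:
    """Hidden values that full backprop through time keeps around (no checkpointing)."""
    return steps * hidden_size


if __name__ == "__main__":
    for weight in (0.9, 1.1):
        # Below 1 the gradient shrinks toward zero, above 1 it blows up.
        print(weight, [gradient_scale(weight, t) for t in (20, 200)])
    print(stored_activations(steps=1_000_000, hidden_size=1024))
```

With a weight slightly under 1 the product vanishes as the rollout grows, and slightly over 1 it explodes, while the activation count grows linearly with the number of steps.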
[02:30]
A
Right. So now if you contrast that with LLMs, the ones that people are widely using: while at face value they appear to be similar, at training time they're basically doing this one-shot feed-forward process for every input. The LLM, the transformer block, can take all the inputs in parallel. It's not actually iteratively going over them one at a time at train time. So you don't have this problem of needing to store tons of activations, or this giant vanishing gradients problem, with them.
[02:55]
B
Yeah, exactly. It's actually all happening in time in one shot, magically. And that was the tril, or lower-triangle, trick that kind of happens, this causal mask that occurs. And so you actually do all time steps in one shot: you forward pass a feed-forward model on all time steps in one shot and you backward pass in one shot. And it's amazing for train time in terms of wall clock. It requires a lot of FLOPs and it still requires a lot of the memory, you still need it there, but you don't have the vanishing gradient issue. And what you actually pay for that, what you have to give up, is this latent reasoning thing and this compression in the time direction. There is no compression in LLMs. Every single decode that I do, I still have to retain the entire Shakespeare novel just to decode a little bit. And in RNNs you don't have to do that. It's all compressed in this hidden state that you kind of roll out.
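The lower-triangle causal mask mentioned above can be sketched in a few lines. This is an illustrative toy, assuming uniform attention weights instead of learned ones; `causal_mask` and `causal_average` are hypothetical helper names.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Lower-triangular mask: position i may look at positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def causal_average(values: list[float]) -> list[float]:
    """Compute every time step at once; each output averages only visible inputs."""
    n = len(values)
    mask = causal_mask(n)
    outputs = []
    for i in range(n):
        visible = [values[j] for j in range(n) if mask[i][j]]
        outputs.append(sum(visible) / len(visible))
    return outputs


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    print(causal_average([2.0, 4.0, 6.0]))
```

Every output row depends on the full list of earlier inputs, which is the "retain the entire Shakespeare novel" cost: nothing is compressed into a fixed-size state.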
[03:47]
A
Okay, so let's talk about that in a little bit more detail. You refer to this inherent reasoning ability. Many people think about LLMs as doing reasoning, and we're going to talk about that a little bit later. But help me understand where you see the biggest limitations in LLMs' reasoning ability, in terms of what the model does in an actual forward pass.
[04:08]
B
Yeah. And so I guess we go back to GPT-2. GPT-2 was this landmark architecture and paper that basically was just next token, next token, next token. And it kind of worked. And we just watched val loss go down, perplexity go down. The model just is more performant, looks better, starts to make some Shakespeare that actually sounds somewhat plausible. And then we have to get these things to reason and to actually solve some really hard problems. And I've done extensive experiments on this. But if you take, for example, sort: you have infinite amounts of unsorted lists and you give it sorted lists, you keep feeding it to the model, it should work.
[04:48]
A
Right.
[04:50]
B
It's actually impossible for the model to map from unsorted list to sorted list.
[04:55]
A
If I have in a one shot,
[04:56]
B
Basically on a one-shot basis. We literally know a theoretical lower bound: for comparison sort, you can't do better than n log n steps. And if I have a list that's 31 characters or elements long and my transformer is 30, I run out of steps to do comparisons. It's not possible for me to do all the steps that need to be done. In HRM and TRM they use Sudoku as an incompressible problem. Similarly, mazes are incompressible problems. Rolling sum: incompressible problem.
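The comparison-sort bound can be made concrete with the information-theoretic count ceil(log2(n!)). The sketch below treats each layer as one sequential comparison step, which is a simplification made here to mirror the speaker's framing (a real layer does many operations in parallel); `min_comparisons` and `fits_in_budget` are hypothetical names.

```python
import math


def min_comparisons(n: int) -> int:
    """Worst-case comparisons any comparison sort needs for n items: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))


def fits_in_budget(n: int, sequential_steps: int) -> bool:
    """True if the lower bound fits within a fixed number of sequential steps."""
    return min_comparisons(n) <= sequential_steps


if __name__ == "__main__":
    for n in (4, 8, 16, 31):
        print(n, min_comparisons(n), fits_in_budget(n, sequential_steps=30))
```

A fixed-depth network has a fixed step budget, while the bound keeps growing with the list length, so some input size always exceeds it.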
[05:29]
A
So when you mention the sorting algorithm, when I think back to my algorithms class from college, the one way you could get faster than N log n in a sorting algorithm is if you had some access to an external memory cache. If you had some tape you could write to, then you can actually do faster than N log n by basically selectively putting things onto this memory. And I suspect that's a key limitation of these LLMs in that because there's no external memory tape inbuilt into the model, you lose certain performance possibilities in terms of how fast you can go.
[05:57]
B
That's right. And so I guess Radix sort would be the most common one. Depending on the number of buckets that you have, you can kind of get from n log n to order N. You can't get less than N. You have to touch all the elements. Sorry, you have to do that. And if you run out of layers, transformer layers in your neural network, then you ran out of chances to do that.
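For reference, a minimal least-significant-digit radix sort, assuming non-negative integers. It uses buckets as the extra memory and never compares two elements directly, which is how it sidesteps the comparison-sort bound; it still touches every element on each pass.

```python
def radix_sort(values: list[int], base: int = 10) -> list[int]:
    """LSD radix sort for non-negative integers using `base` buckets per pass."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative integers")
    result = list(values)
    if not result:
        return result
    largest = max(result)
    place = 1
    while place <= largest:
        buckets: list[list[int]] = [[] for _ in range(base)]
        for v in result:
            # Scatter each value into the bucket for its current digit.
            buckets[(v // place) % base].append(v)
        # Gather buckets in order; appending keeps each pass stable.
        result = [v for bucket in buckets for v in bucket]
        place *= base
    return result


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

The number of passes depends on the number of digits in the largest key, not on comparisons between elements.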
[06:20]
A
So this is just like a Turing. This is like going back to Alan Turing now. And like a Turing machine. Right. Like what's the analogy there exactly. That we should think about in terms of LLMs. I guess not quite satisfying how you think about a Turing machine.
[06:30]
B
Yeah. So if we. Let's just talk about GPT-2, the original. No bells and whistles, it's just a feed-forward model. And so just forward passing one step,
[06:42]
A
taking an input, creating a bunch of outputs.
[06:44]
B
In the Sudoku case, if I have 50 different provable that I can only do one, given this information and I have this many layers, then that's all I can do. And the cheat is the chain of thought. And so it's completely true that at test time they are Turing complete, and you can simulate all Turing-computable functions at test time. But how do you get it to learn it? You need to train it. And that's where. Unless you're training it on human-labeled traces. And for a lot of problems, like the Millennium Prize Problems, we don't have the trace. We'd love to have a trace for it. It just doesn't exist.
[07:24]
A
Totally makes sense. Okay, so with that context in mind, now let's talk about these two papers, because I think that sets up a lot of the contrast we're going to draw between these papers and the models that people are maybe more used to. So let's talk about HRMs first. Walk me through a little bit about how this model works and some of the intuition behind it.
[07:44]
B
Sure. So this is directly in the lineage of RNNs. There's not that much novel from the RNN standpoint, at least in my opinion. They do have this idea, inspired by the brain, where there are different parts of the brain that operate on different frequencies. There are some that operate at a really high frequency, which is the low level of the hierarchy, and some that operate at a really low frequency, which is the higher level of the hierarchy. And the interplay between those things is really interesting.
[08:14]
A
So this is like literally in the human brain. There's some like, bio inspiration here, which is that, like, you have like different waves running at different frequencies at different parts of the brain or something like that.
[08:23]
B
And I guess that's one interpretation of it, of the way that they're talking about classifying these hierarchies of frequencies. And the most interesting part, at least for me, is the way that they train the neural network. You take in some X, some input, whether it's an incomplete Sudoku puzzle, a maze, or an ARC Prize challenge. You do TL steps with the lower-level module, then you go to H, you do that TH times, and then you have N_sup outer refinement steps.
[09:03]
A
Yeah. So you basically are running through the input with a given matrix with a given transformation repeatedly on it. And you're doing that through two levels of refinement and then basically running that process several times.
[09:16]
B
Yes. So there's exactly three levels of recursion occurring here. There's the low level, there's the high level, and then there's the outer refinement steps.
[09:23]
A
And we're calling it recursion because it's the same weights that are being applied repeatedly we're not changing the weights in between these steps.
[09:28]
B
Exactly right. You get to recurse on the LNET TL times, you recurse on this whole looped recursion TH times, and then you do this whole outer refinement step N_sup times.
[09:42]
A
Cool. And so what's the basic intuition for why that works? Why does that produce an effective paper result? And what even were the results that this paper showed?
[09:51]
B
Yeah, and so I mean this got state of the art on ARC Prize 1 and 2. This was only a 27 million parameter model that was only trained on ARC Prize.
[10:04]
A
So it's like 1000 inputs or something like that.
[10:06]
B
Like puzzles, basically, literally a thousand tasks, which is extremely small. There is no pretraining at all. This starts from literally tabula rasa weights. And it can outperform, at that time, if we go back, we had o3, if you remember way back when. And o3 gets zero, literally zero. And this got something like 70% on ARC Prize 1, at least at the time, which was just a huge breakthrough. And so the way you can think of this is variable scoping. If I have three nested functions, I guess the lowest-level function has scoped variables, which they'll call ZL, which is the carry that inits the
[10:48]
A
zero, a latent variable, Latent variable.
[10:51]
B
In traditional RNN literature they would call this the hidden state, the low-level hidden state. And I get to recurse, recurse, recurse, and then I pass that ZL back to the outer scoped function, the higher-level one. I let that one do one iter, it goes back and calls the lower level again. It does this whole thing in a third outer loop, which is called the outer refinement step.
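The three nested levels described above can be sketched as plain loops. This is only a structural illustration: `l_net` and `h_net` are hypothetical scalar stand-ins with made-up coefficients, not the transformer modules from the HRM paper, and no training happens here.

```python
import math


def l_net(z_l: float, z_h: float, x: float) -> float:
    """Toy low-level update (hypothetical); sees the input and both carries."""
    return math.tanh(0.5 * z_l + 0.3 * z_h + x)


def h_net(z_h: float, z_l: float) -> float:
    """Toy high-level update (hypothetical); sees only the two carries."""
    return math.tanh(0.5 * z_h + 0.7 * z_l)


def hrm_forward(x: float, t_l: int, t_h: int, n_sup: int):
    z_l, z_h = 0.0, 0.0                    # both carries start at zero
    calls = {"l_net": 0, "h_net": 0}
    for _ in range(n_sup):                 # outer refinement steps
        for _ in range(t_h):               # high-level recursion
            for _ in range(t_l):           # low-level recursion
                z_l = l_net(z_l, z_h, x)   # same weights on every call
                calls["l_net"] += 1
            z_h = h_net(z_h, z_l)
            calls["h_net"] += 1
    return z_h, calls


if __name__ == "__main__":
    print(hrm_forward(0.2, t_l=2, t_h=3, n_sup=4))
```

The carries `z_l` and `z_h` persist across all three loops, and the same two functions are reused at every step, which is what makes this recursion rather than depth from extra layers.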
[11:12]
A
But when you describe it like that, it seems like it would have the same backprop through time problem that you would have at RNNs. And I think they came up with a clever trick to basically get around that. So what was that trick that they
[11:22]
B
figured out, and this is really the crux of the paper that differentiates it, in my opinion, in the literature. In all of his papers, from neural Turing machines to adaptive compute time to differentiable neural computers, Alex Graves always backpropped through all of the recursion steps, and he was limited by backprop through time. So you can only make the model so big; you have all these issues, vanishing gradients, et cetera, et cetera. What they do instead is they kind of have this DEQ method of doing fixed-point iteration.
[11:57]
A
Sounds like deep equilibrium learning.
[11:59]
B
Yeah, deep equilibrium learning. And this is completely counterintuitive as a computer vision person, because you'd never do this, but it actually does make sense and I'll explain why. If I take a batch of ImageNet or CIFAR-10 and I forward pass through the model and I get some loss and I backprop and I update the weights, I would go get a different batch for the next one. But what they do instead is they actually do that 16 times on the same batch. And as you do that, you can actually see the change in your residuals get less and less and less. Why it actually makes sense is that, in the RNN case, ZL and ZH, which are the carry, the hidden states, start out at zeros. Then we go through this whole loopy recursion, at least the two lower loops, the TL and TH steps, and then I backprop just through the two modules just once and I don't recurse all the way back. I do a stop grad, I stop right there. Then there's a huge residual, and then I don't reset ZL and ZH; I do it again at a different point in the carry, or hidden variable, space. So one can actually look at it as a different batch every time, even though it's the same exact X's.
[13:19]
A
Yeah, the way I think about it is the 16 or whatever that you're recursing over, it's like constructing a mini-batch not from different inputs but from different memory states. Basically it's like across this hidden or carry memory axis.
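The training loop described here can be sketched with a one-parameter toy model and a hand-derived gradient. Everything below is an assumption for illustration: the update rule z_next = tanh(w * (z + x)), the squared-error loss, the learning rate and the 16 steps are stand-ins, not the HRM implementation.

```python
import math


def train_on_one_example(x: float, y: float, steps: int = 16,
                         lr: float = 0.5, w: float = 0.1):
    """Repeat update steps on the SAME (x, y) while the carry z keeps moving."""
    z = 0.0                     # carry starts at zero and is never reset
    losses = []
    for _ in range(steps):
        z_next = math.tanh(w * (z + x))
        loss = (z_next - y) ** 2
        # One-step gradient: z is treated as a constant (the stop-gradient),
        # so nothing is backpropagated into earlier steps.
        grad_w = 2.0 * (z_next - y) * (1.0 - z_next ** 2) * (z + x)
        w -= lr * grad_w
        z = z_next              # next step starts from a new point in carry space
        losses.append(loss)
    return w, z, losses


if __name__ == "__main__":
    w, z, losses = train_on_one_example(x=1.0, y=0.5)
    print(round(losses[0], 4), losses[-1])
```

Each pass sees the same input but a different carry value, which is the sense in which the steps behave like a mini-batch drawn across memory states.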
[13:38]
B
Basically that math holds and it works. It follows DEQ directly in the event that the delta in ZL and the delta in ZH go to zero, which they actually don't. So we'll get to TRM, but Alexia basically shows that it's just not the case, and you can't actually apply this math to explain why it's working. That's not sufficient support for why it's working. We actually don't know why it's really working. She figures out that you actually can backprop all the way through the deep recursion, which we're going to get into with TRM in a second. And that actually improves performance much, much more.
[14:18]
A
Interesting. Okay, so before we get into TRM. Yeah. On this paper, I think there's a bunch of different ways people have looked at this in terms of how they came up with it, and then why this may or may not be working. One is a sort of bio plausibility argument. Now, as you know, I'm usually not super keen on these. I think machine learning tends to have a long history of people starting with bio plausible arguments and then realizing that there's some variant of them that seems highly bio implausible that actually works better. I think you have an example along these lines right here.
[14:46]
B
Yeah, a classic. The first deep learning paper that started this whole craziness is AlexNet. In AlexNet, there's actually this funny little thing called local receptive activation or depression or something like that, where once this activation fires, then I have this refractory region or something like that. Actually, that didn't work at all. It didn't work and you didn't need that. And then VGG came out and said, get rid of all that. Just go deep, with three-by-three convs. And it actually just outperforms dramatically. And so this is. Maybe you need to do it to get accepted into neuropsychology.
[15:19]
A
Yeah, sure, totally.
[15:20]
B
You're definitely the expert here. But what do you consider to be bio plausible and what's not?
[15:24]
A
Well, I think that a lot of machine learning literature has overlapped a lot with people working in neuroscience. I think it is very natural for us to ask questions about how our brain works, because our brain is an incredible instrument that does a ton of computing, obviously, and does it in a shockingly efficient manner, it seems like. And so a lot of machine learning research has for a long time sought analogies from how we understand our brain to work and tried to encode that in various machine learning systems. So from the very basic concept of what a neural network is: it's called a neural network because we think it's some basic model for what a neuron is. Certain activation functions are meant to be inspired by certain biological premises. The thing about them is that often we use bio plausibility to inspire us to come up with ideas, but we end up veering away from the bio plausible to something adjacent to them that is likely bio implausible. But that seems to work better.
[16:19]
B
I think it runs better on a GPU.
[16:20]
A
Exactly. It runs better on a GPU. It's more efficient in some capacity that is relevant to how we actually encode it in a computational system. So I find thinking about bio plausibility fun and interesting, and it's definitely a great way to inspire us to think about new things. But I tend to not be bounded by bio plausibility when I think about what machine learning systems we should prioritize working on or think are particularly exciting, other than as an interesting scientific launching point for a deeper exploration. I think the version of this that I find more compelling is actually that original discussion we were having around automata theory, basically, and honestly just fundamental data structures and algorithms theory, which is that if you're running a complex algorithm, having access to a memory cache is actually very useful for being able to run that algorithm efficiently. And I kind of think of this set of hidden states, or carry, as akin to a Turing machine tape, or akin to the radix sort memory bank, where you can basically train a model to use this memory cache in an intelligent way in a single forward pass, so that you can get a more efficient operation that would otherwise require some sort of more complicated reasoning.
[17:28]
B
Yeah, I think a point I wanted to make earlier is that we did this COT stuff and this tool use thing as ways to get beyond the limitations of GPT-2. I've actually done this experiment. If you give me infinite amounts of unsorted lists and sorted lists, and I can do chain of thought and teach it to do every single step, then I can actually get it to do sort and become a Turing machine at test time. Similarly, an even cheaper one that is much easier to do is you teach it and you say, hey, there's this Python function called sort.
[18:10]
A
Just call the function.
[18:11]
B
Just call the function. That's the easiest thing to do and you don't need back prop at all. So those are the two hacks now. Well, Francois, this is solved, we're done. No, because I needed to know what sort was. What happens if we didn't know what?
[18:26]
A
The chain of thought is not going to inherently discover sorting from first principles, it's finding it from our historical knowledge of everything. That's true. Yeah.
[18:33]
B
I mean, Demis had this whole thing about how the ultimate test is the Einstein test. Go back to 1911 and then have it rebuild all the physics up until now. Similarly, let's just pretend that we only had bubble sort. We knew no other sorting algorithm. If you chain-of-thought it on all the bubble sort inputs and outputs, it will only do bubble sort. In fact, it won't even do bubble sort that well. So this is the best situation. And then with tool use, of course, it can only know bubble sort. I want to get to merge sort. How do I discover merge sort?
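To make the chain-of-thought point concrete, here is a sketch of what a step-by-step supervision trace for bubble sort might look like. The trace format (one tuple per swap) is an assumption chosen for illustration, not a format used by any lab.

```python
def bubble_sort_trace(values: list[int]) -> list[tuple[int, ...]]:
    """Return every intermediate list state, one entry per swap, as a trace."""
    state = list(values)
    trace = [tuple(state)]                      # step 0: the unsorted input
    for end in range(len(state) - 1, 0, -1):
        for i in range(end):
            if state[i] > state[i + 1]:
                state[i], state[i + 1] = state[i + 1], state[i]
                trace.append(tuple(state))      # one supervised "thought" per swap
    return trace


if __name__ == "__main__":
    for step in bubble_sort_trace([3, 1, 2]):
        print(step)
```

A model trained to imitate traces like these is being teacher-forced through bubble sort specifically; nothing in the data points toward merge sort.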
[19:01]
A
I think the interesting thing just to emphasize here, because it may not have been extremely clear is there already exists some type of recursion that people are used to in LLMs, which is chain of thought we mentioned earlier. But that is a recursion that's happening in the token space of the model's outputs, not inherent to the model itself. That's the fundamental limitation is that the model can only do a feed forward one shot output. And then we basically just have this hack that if you keep letting it output things then it can read its outputs and do somewhat intelligent seeming things with it. But it seems to sort of be upper bounded by the data that we feed it that the labs are very hungrily buying right now and not this sort of inherent underlying recursive reasoning.
[19:47]
B
Yeah. So in both cases, both hacks to solve this in COT and tool use, you're bounded by the bounds of human knowledge. In the event it's outside the set of human knowledge, then you're kind of sol. And so that's one. You make a great point about discrete versus latent space reasoning. The carry, in the case of LLMs, has to be snapped back to some discrete token space. And in the case of RNNs in general, they remain in this continuous latent space, which is much higher dimensional. If you give me a tape that's this long and you cut it up into 10 buckets versus all the possible values.
[20:30]
A
Right, exactly.
[20:31]
B
It's much more expressive to be in continuous space. But we can't train it that way, because you're inhibited by backprop through time, largely. And this is why this paper is so exciting.
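The "10 buckets versus all the possible values" picture can be sketched as a quantizer. This is a toy analogy under the assumption that the carry is a single number in [0, 1]; real token vocabularies and latent states are far larger.

```python
def snap_to_bucket(value: float, buckets: int = 10) -> float:
    """Quantize a value in [0, 1] to the centre of one of `buckets` equal bins."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    index = min(int(value * buckets), buckets - 1)   # clamp 1.0 into the last bin
    return (index + 0.5) / buckets


if __name__ == "__main__":
    for carry in (0.31, 0.39, 0.72):
        print(carry, snap_to_bucket(carry))
```

Two different continuous carries that land in the same bucket become indistinguishable after snapping, which is the information a discrete token space throws away at every step.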
[20:41]
A
Ok, so before we then go over to the TRM paper, let's just summarize here what matters most from the HRM paper that we should take away before we transition and contrast it with the TRM paper.
[20:52]
B
Yeah, I think that the number one piece to take away is this outer refinement loop. The outer refinement loop scales, and there's a great breakdown. Basically the Sapient authors, who deserve huge kudos because there are so many innovations in this paper, didn't really do scaling ablations on every single one of the inputs. But this guy Constantin at Francois Chollet's company Ndea actually did, and it's this amazing breakdown that he posted on YouTube that you can go check out. But basically the main takeaway is that the outer refinement loop is the main reason why these things work so well, which Alexia, I think, found in parallel. She scales it up and shows that you can get rid of a lot of all this other stuff.
[21:47]
A
So like a lot of machine learning, the follow on paper is basically delete 75% of the first paper, as we've often done in videos here, and keep the magic, basically. Okay, so what's the magic then? What's the part that actually matters in terms of what stays in the TRM paper? And let's now contrast the core architectural differences between these two papers.
[22:05]
B
Yeah, so I guess if I break it down into two major things: this outer refinement loop is really great and works really well. And this truncated backprop through time, which is backprop through time except I truncate at some earlier point, called T: T equals 1 is actually completely sufficient. That's very counterintuitive, and it's what HRM found. TRM goes a little bit further. Rather than going back through just one call to the HNET and the LNET, it actually goes back through one full recursion loop. So if I do it 16 times, I just go back through one time, and that is kind of sufficient. If you do it with this fixed-point iteration thing, this pseudo-fixed-point iteration thing, where you keep hitting it with gradient at every single step, it weirdly works. And this batch size across the carry space actually works.
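A small sketch of what truncation means for the gradient, assuming a toy scalar recurrence h_t = tanh(w * h_{t-1} + x) with a hand-derived derivative. The function name and the values of w and x are illustrative assumptions; the point is only that truncating at T = 1 keeps one step of the chain rule and drops the rest.

```python
import math
from typing import Optional


def grad_wrt_w(w: float, x: float, steps: int,
               truncate: Optional[int] = None) -> float:
    """d h_steps / d w for h_t = tanh(w * h_{t-1} + x), optionally truncated."""
    states = [0.0]
    for _ in range(steps):                       # forward rollout
        states.append(math.tanh(w * states[-1] + x))
    start = 0 if truncate is None else max(0, steps - truncate)
    dh = 0.0      # the state at `start` is treated as a constant (stop-gradient)
    for t in range(start + 1, steps + 1):
        dh = (1.0 - states[t] ** 2) * (states[t - 1] + w * dh)
    return dh


if __name__ == "__main__":
    print(grad_wrt_w(0.5, 0.3, steps=16))              # full backprop through time
    print(grad_wrt_w(0.5, 0.3, steps=16, truncate=1))  # T = 1
```

The truncated value is a biased estimate of the full gradient, but it only needs the activations of the last step, which is what removes the memory cost of a long rollout.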
[23:08]
A
So that part is also kept between these two models. It seemed like another thing that changed was having this sort of double layer of higher order thinking and lower order thinking. It seems like it collapsed that down into just a single one. What's the intuition there? And how does that actually work in the TRM paper?
[23:25]
B
Yeah, so it's interesting: she actually ablates having two separate networks versus just having one. I guess the more important thing is the variable scope: you should have low-level features and high-level features, but the same network. And so in the best-performing setup, the same
[23:38]
A
network can extract both. Basically.
[23:40]
B
Yeah, you weight-share between the LNET and the HNET, and it's just called net. You use just one transformer layer versus the four that Sapient uses, so you whittle it down to one and do more recursion. But you keep zL and zH distinct and separate. She calls them x and y, which I find very confusing: x, y, z. zH and zL is just cleaner.
[24:04]
A
So if you read the paper, Y is actually latent space. It's like Z basically.
[24:09]
B
And it is not a label, which
[24:11]
A
really threw me at first.
[24:13]
B
But anyway, we'll go through some code here and I'll walk you through it. I've replaced all of her nomenclature and used the Sapient notation, which is much cleaner and more straightforward, to me at least.
[24:24]
A
Okay, cool. Now, before we dive into the code, in terms of how these TRMs actually work, it's pretty interesting, because recursion gives you a bunch of advantages over transformers. Rather than having 500 or 1,000 or a million transformer layers and tons and tons of parameters, you get compute depth without parameter depth. And the optimization process looks more like an iterative, expectation-maximization kind of algorithm. Do you want to talk about how that worked in the TRM paper? I thought that was also pretty interesting.
[24:58]
B
So both of them had the same EM-feeling thing, where we update zL conditioned on the input x and the last zH, zH at t minus 1, let's say. We keep updating zL over and over, and then we update zH conditioned on zL. Actually it's just zL, not even x. The way to think about zL and zH is that zL is like your locally scoped variables that are just being overwritten and updated again and again. And zH, and Alexia makes this point, is a candidate answer: a proposed latent answer that is just an embedding space away, one MLP lookup away, from the true answer.
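The alternating update described here can be sketched in a few lines. This is only a toy illustration, not the authors' code: `l_net`, `h_net`, the scalar states, and the `t_low` default are hypothetical stand-ins for the learned networks, the latent tensors, and the inner step count.

```python
# Toy sketch of the alternating zL / zH update (assumed names, scalar states).

def latent_cycle(x, z_h, z_l, l_net, h_net, t_low=2):
    """Refine the scratchpad z_l several times, then update the answer z_h once."""
    for _ in range(t_low):
        # z_l is conditioned on the input x and the previous z_h
        z_l = l_net(z_l, z_h, x)
    # z_h is conditioned on z_l only; x is not passed in
    z_h = h_net(z_h, z_l)
    return z_h, z_l


# Hypothetical stand-ins for the learned networks.
def toy_l_net(z_l, z_h, x):
    return 0.5 * z_l + 0.25 * z_h + x


def toy_h_net(z_h, z_l):
    return 0.5 * z_h + 0.5 * z_l


z_h, z_l = latent_cycle(x=1.0, z_h=0.0, z_l=0.0, l_net=toy_l_net, h_net=toy_h_net)
print(z_h, z_l)  # 0.75 1.5
```

The point of the sketch is the variable scope: `z_l` is overwritten on every inner step, while `z_h` changes only once per cycle and never sees `x` directly.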
[25:54]
A
So you're kind of EM-ing. Just to zoom out a little bit, you're maximizing the probability of the correct information stored in your memory conditioned on a given output, and maximizing the right output conditioned on the information stored in your memory, quote unquote, in parallel. And that optimization algorithm leads to you ultimately learning a recursive method that stores the right information to this local memory, basically, and then outputs the right thing.
[26:26]
B
Really, if we think of Sudoku, it's a really natural way to think about what's actually happening under the hood. Sudoku is an incomplete puzzle. You can't guess every cell at any one time. It's designed so that you can only guess one or two cells based on the available information. So it's an incompressible problem. You can't do it unless you're just randomly guessing and guessing, which is a very high combinatorial space. And so what zL is doing is some type of "let me try this, try that": do some computation, think about local things. Then it proposes something it may have found and sends it to zH, and zH fills it in. And now we have a little bit more of a filled-in Sudoku puzzle.
[27:07]
A
And the training process is training the algorithm to know to do that. It's maximizing that. It's like, oh, this strategy for what you save tends to lead to correct outputs.
[27:18]
B
Without chain of thought.
[27:19]
A
Without chain of thought.
[27:20]
B
That's the most important part. If we had Sudoku and we didn't know how to solve it, because we were just dumb Homo sapiens that didn't know how to solve Sudoku, it would still have solved it. That's why it's cool: it's actually able to discover things without being teacher-forced via a chain of thought.
[27:37]
A
Right. Interesting. Yeah. Should we look at some code?
[27:39]
B
Let's do it.
[27:40]
A
Okay, let's dive in. I would love to see what these papers or models look like, just distilled down to their core essence. I know there are lots of details on how you train them, but let's look at the core training algorithm. And it'd be great to contrast the two methods.
[27:52]
B
Yeah. So they're remarkably similar, and largely, learning one is learning the other. Basically you start out with some zH and zL that are just zeros. You have some input embedding: we go from x raw to x, which is the initial maze state or whatever it is. And then, with no grad, you don't pass any gradients back through this.
[28:17]
A
This is the trick basically to not backprop through time.
[28:20]
B
Here are two of the three recursion levels. They do this just for simplicity. I hit zL T_low times, and then once every T_low steps, that's the modulo T_low, I hit zH. And I do it again and again. And like you said, I'm updating zL conditioned on zH and x.
[28:43]
A
Right.
[28:43]
B
And then I update zH conditioned on zL.
[28:46]
A
So this is the expectation maximization style.
[28:48]
B
Exactly. Then you don't really need this. This is just for cleanliness, to show clearly that there are no gradients occurring above this line.
[28:57]
A
Just freezing the weights past that.
[28:58]
B
Exactly. And then I hit LNET and HNET one more time, which is the same
[29:02]
A
thing as up above. So this is just. Literally just the no grad thing running one more time.
[29:06]
B
Exactly. Cool. Yeah. And just make it really clear. And then there you go. And that's your HRM model.
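The call schedule walked through above can be sketched as a trace of which network calls sit inside the no-grad region and which ones receive gradient. This is a schematic under stated assumptions, not the Sapient code: the function name, the `n_high` and `t_low` parameter names, and the boolean "tracked" flag (standing in for a real autograd no-grad context) are all hypothetical.

```python
# Schematic trace of an HRM-style forward pass (assumed names; no real autograd).

def hrm_forward_trace(n_high=2, t_low=2):
    """Return the ordered list of (network, tracked_by_gradient) calls."""
    trace = []
    # Phase 1: the "no grad" region. All but the last low-level step run here.
    for i in range(n_high * t_low - 1):
        trace.append(("L_net", False))
        # Once every t_low low-level steps, the high-level state is updated.
        if (i + 1) % t_low == 0:
            trace.append(("H_net", False))
    # Phase 2: one final L_net call and one final H_net call carry gradient.
    trace.append(("L_net", True))
    trace.append(("H_net", True))
    return trace


trace = hrm_forward_trace()
tracked = [name for name, has_grad in trace if has_grad]
print(len(trace), tracked)  # 6 ['L_net', 'H_net']
```

With both settings at 2, six network calls run in total but only the last two are backpropagated through, which is the T equals 1 truncation.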
[29:11]
A
Cool. Quite simple.
[29:14]
B
2 and 2 is completely sufficient. If you go much higher, Konstantin showed very clearly that it doesn't actually help.
[29:23]
A
So that's two of the three recursions you said. The third happens in the actual training.
[29:26]
B
The third is in the train loop. And in the test loop they both have this M_test or N_supervision, which Alexia calls deep supervision. They call it outer refinement steps. It's whatever you want to call it. Call it N_sup.
[29:40]
A
So you do this N_sup times during training, and then during test time there's a different hyperparameter for how many times it recurses over each model, which is M_test.
[29:50]
B
Basically they're actually the same, so we can probably just call this and this the same thing. And Konstantin does a good job of showing this: if you train on 16 and test on only 1, you get seven-eighths of the performance, or almost all the performance. So it's quite interesting that this is just too much compute and it doesn't help you all that much. So setting this to one is actually.
[30:22]
A
But presumably for more complicated problems, having more test-time compute is still useful, which is the reason you would set it up this way.
[30:28]
B
For sure. So we call our HRM, we get some loss, we backprop through just those two little parts here, and then we step and zero out the gradient. But we do not update zH and zL here. These are still the same init. So that's the really important detail. As we go back around, we pass in the zH and the zL from the previous step. So now this is actually not the same batch, because we have updated zH and zL, so it's in a different part of the latent space.
[31:00]
A
Cool. That's the key. Mini batch construction through memory space concept. Yeah. Cool. Exactly.
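The train loop just described (take a gradient step at every supervision step, and carry the latent forward as a constant) can be illustrated with a one-parameter toy. This is a minimal sketch, not either paper's training code: the scalar model `z_new = tanh(w * (z_prev + x))`, the squared-error loss, and every hyperparameter value are assumptions chosen only so the gradient can be written by hand.

```python
import math

# Toy deep-supervision loop with a one-step (T = 1) gradient.
# The scalar model and all values below are illustrative assumptions.

def train_deep_supervision(x=1.0, y=0.5, w=0.1, lr=0.1, n_sup=16, epochs=50):
    losses = []
    for _ in range(epochs):
        z = 0.0  # fresh latent carry for a new example
        for _ in range(n_sup):
            z_prev = z  # treated as a constant, i.e. "detached"
            z_new = math.tanh(w * (z_prev + x))
            err = z_new - y
            losses.append(err * err)
            # One-step gradient dL/dw, ignoring how z_prev depended on w.
            grad = 2.0 * err * (1.0 - z_new * z_new) * (z_prev + x)
            w -= lr * grad  # optimizer step at every supervision step
            z = z_new  # the updated latent is carried into the next step
    return w, losses


w_final, losses = train_deep_supervision()
print(losses[0], losses[-1])
```

Because `z` is carried forward, each supervision step sees the same input at a different point in latent space, which is the "mini batch through memory space" idea from the conversation.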
[31:06]
B
Then at test time it's simply the three loops. So there's your outer refinement loop, which, it turns out, mostly matters only at train time. Train-time recursion was important, but test-time recursion was actually not that important, which is kind of counterintuitive. And then the HRM inside that has your two other loops.
[31:23]
A
Makes sense.
[31:24]
B
And that's it. So pretty simple. Now, for the TRM, the two main changes are these. First, they collapse LNET and HNET into just net.
[31:34]
A
Great.
[31:35]
B
And it's an important detail: these are four transformer layers, this is four transformer layers, and this is just one transformer layer. And Alexia actually shows that going deeper didn't help.
[31:44]
A
Yeah. And actually on some tasks, just the feed-forward net worked just as well as the transformer, right? It was like that on Sudoku.
[31:50]
B
I think on Sudoku, the MLP actually outperformed attention. But the MLP scored 0 on the maze. So it's not obvious that one is always better than the other. So there's the weight sharing. And then, instead of backpropping through just these two calls, you backprop all the way through one latent recursion step. So let me walk through this a little bit. We have the same starting point here. We're doing this six times, and then we go one more time here, and then we do our deep recursion. This is the outer loop, N_sup times. Again we have the no_grad, we have the detach, and then this is where it's different: I am calling this latent recursion after the detach.
[32:45]
A
Yeah. So it's one full recursive loop is happening here.
[32:50]
B
So that's the main difference in the optimization. Otherwise it's effectively the same. Then it outputs, and you're good to go, and you train it exactly the same way as before. And then at test time, it's the same thing again. So largely the same.
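The TRM difference described above (gradient flows through one full latent recursion rather than through two single calls) can be sketched with the same kind of call trace. Again this is a schematic, not Alexia's code: the function and parameter names are hypothetical, the boolean flag stands in for a real no-grad context, and `t_cycles=3` is an assumed value; only the "six times, then one more" inner count comes from the walkthrough.

```python
# Schematic trace of a TRM-style deep recursion (assumed names; no real autograd).

def trm_deep_recursion_trace(n_latent=6, t_cycles=3):
    """Return the ordered list of (call, tracked_by_gradient) pairs."""
    trace = []

    def latent_recursion(tracked):
        # One shared network: update z_l n_latent times, then z_h once.
        for _ in range(n_latent):
            trace.append(("net: update z_l", tracked))
        trace.append(("net: update z_h", tracked))

    # All but the last recursion run without gradient.
    for _ in range(t_cycles - 1):
        latent_recursion(tracked=False)
    # The last full recursion is backpropagated through.
    latent_recursion(tracked=True)
    return trace


trm_trace = trm_deep_recursion_trace()
n_tracked = sum(1 for _, has_grad in trm_trace if has_grad)
print(len(trm_trace), n_tracked)  # 21 7
```

Under these assumed settings, seven of the 21 calls carry gradient, compared with two calls in the HRM schedule, which is why the backward pass costs more memory per step.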
[33:06]
A
Cool. So in many ways it's sort of a simplification. You're collapsing certain parts of it, you're simplifying this net architecture. It's slightly more complicated in the backprop through time part, because you're actually backpropping through more than you did before. But it's like taking a bunch of lessons from the first one and basically simplifying most of it, which is actually
[33:26]
B
why, I think, she needs to make the model smaller. So it's a 28 million parameter model for HRM, and she brings it down to a 7 million parameter model. It actually gets from 70% to 87% on ARC Prize 1 and does quite well on ARC Prize 2 as well. So she makes the model three to four times smaller, but because it has that recursion, it actually outperforms. And there is this researcher named Melanie Mitchell who wrote a book talking about this very phenomenon: it is sufficient, not necessary, to go bigger and get better performance, and it is sufficient, not necessary, to add more recursion. So where I'm really excited is what happens if you do both. You're still limited by backprop through time. Even Alexia is limited by that last step from a memory perspective, for sure. So if you can make the model really big, with lots of recursion, and do something other than backprop through time, then you can get all the benefits of this and all the benefits of the giant LLMs, and then you can get some crazy stuff.
[34:40]
A
So now to wrap up, why don't we talk a little bit about the bigger picture? What does this mean for the field of AI research? How should people think about where these models fit into the current span of research happening? Especially given that it seems like a bit of a departure from a lot of the methods that people are used to hearing about and increasingly seeing products that people use?
[34:58]
B
Well, I think for one, from the arguments that Schmidhuber makes and that we've talked about today, recursion is important and it's not going away. Clearly there is a benefit to adding recursion into models. You've seen things like the recursion language models out of Google that are pretty powerful and cool. So that's definitely one piece that I don't think is going away anytime soon. The next one is this outer refinement loop with truncated backprop through time, T equals 1. I think that is a really powerful idea. Given how well it works, we have yet to really explore it and deeply understand what's happening there. And the third is this idea: okay, we know that recursion works. We have these tiny recursive models with 7 million parameters that can solve problems that a 100 billion, probably trillion, parameter model trained on the entire Internet can't solve. A 7 million parameter model wins. The right answer is to take the amazingness here and take the amazingness there, which is probably already in Gemini or some of these models, at least in some part. When you take the benefits of both these TRMs and these giant models and you actually slam them together, I think it's just going to take off and it's going to be really huge.
[36:19]
A
Yeah. One of the things that's really interesting about these TRMs and HRMs is that they're not general-purpose models. These were task-specific models. The model trained to do Sudoku cannot inherently do ARC Prize. It has to be trained on the ARC Prize set to do so. The LLMs that are used on these tasks, by contrast, are general-purpose models that maybe get some additional fine-tuning data or in-context learning data on those tasks. So I think that's where the interesting overlap might come: if you can make these more general-purpose agents that are general purpose in the way that the next-token prediction algorithm has given us, and that also do more complex reasoning, it seems like you can have really efficient architectures for scaled-up reasoning.
[36:58]
B
Right. A lot of the view of what these LLMs are doing is finding really amazing embedding representation spaces.
[37:05]
A
Yes.
[37:05]
B
But reasoning inside that space is actually not done all that much.
[37:09]
A
Yeah, it's always through the token space.
[37:11]
B
Always through the token space. So what you can imagine is that we've found a mapping from token space, or from vision, from pixels, to some really cool latent space where things are nicely semantically separated, which makes downstream tasks really easy. But now, in that space, use these tiny reasoning models, use some type of recursion inside it, and train a small model on that reasoning space. I think that's really going to work.
[37:40]
A
Francois, thanks so much for breaking it all down for us. See you all in the next episode of Decoded. Thank you.