
Podcast Summary: The MAD Podcast with Matt Turck

Episode: State of LLMs 2026: RLVR, GRPO, Inference Scaling — Sebastian Raschka
Date: January 29, 2026
Host: Matt Turck
Guest: Sebastian Raschka, AI Researcher, Educator, Author

Overview

This episode is a deep dive into the current state and cutting-edge developments in large language models (LLMs) as of 2026. Host Matt Turck speaks with Sebastian Raschka—a noted researcher, educator, and author—about breakthroughs in model architectures, alternative approaches, advances in post-training, inference scaling, benchmarking, and emerging trends. The conversation maintains a balanced tone between technical depth and accessible explanation, with Raschka offering both academic insight and “insider” views from active work in the field.

Key Discussion Points & Insights

1. Transformer Architecture and Alternatives

[01:05–04:04]

Transformers Still Dominate: Despite being nearly nine years old, transformer-based LLMs remain state of the art.
- "There's nothing really better in terms of state of the art performance... My short answer is I would say right now if I were to build a state of the art model that would be still a transformer based model." — Sebastian [03:49]
Alternatives Emerging: Newer architectures tackling issues of cost and scale, such as:
- Mixture-of-Experts (MoE): Increases usable parameters without linear cost scaling.
- Linear attention variants: Reduce cost, especially for long-sequence processing.
- Diffusion models and state space models: Offer efficiency but trade off quality (see next section).
No Free Lunch: Efforts to make things cheaper often force a trade-off with performance or flexibility.
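The Mixture-of-Experts idea listed above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `moe_forward` and `make_toy_expert` are hypothetical, the experts are simple scaling functions standing in for feed-forward blocks, and renormalizing the gate weights over only the selected experts is one common convention assumed here.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_expert(scale):
    """Hypothetical expert: scales its input (a real expert would be an MLP)."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route x to the top_k highest-scoring experts and mix their outputs."""
    # One router logit per expert: dot product of that expert's router row with x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate weights are a softmax over the selected experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](x)  # only the chosen experts ever run
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


experts = [make_toy_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
output, chosen = moe_forward([1.0, 0.0], router_weights, experts, top_k=2)
print(chosen)  # [2, 3]
```

Four experts exist, but only two run per input, which is the sense in which total parameters can grow faster than per-token compute.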

Notable Quote:
"You can actually...take a GPT-1 or GPT-2 model and with a few lines of code, almost, you can transform it into the latest...DeepSeek V3.2 architecture. It’s not like a big leap, it’s still the same scaffold." — Sebastian [03:25]

2. World Models, Recursive Models, and Benchmarks

[04:04–09:45]

World Models: LLMs that learn an internal representation of the environment, improving on tasks that require context tracking or variable state prediction (especially promising for code and potentially for robotics).
Tiny Recursive Models & Hierarchical Reasoning: Small, specialized models are showing strong results on benchmarks like ARC, designed for logic and reasoning. They use recursion to refine answers, performing surprisingly well compared to much larger, generalist LLMs.
Use Cases: While generalist models like GPT-4 or Gemini are versatile, specialized models can dramatically cut costs for focused tasks.

Notable Quote:
"It made a lot of waves also because like, oh, we don't need these big ChatGPT Gemini type of models to solve complex problems.... But each one was a different model. It was not like one model that could do all three things." — Sebastian [08:10]

3. Diffusion Models for Text

[09:45–13:22]

Image to Text: Inspired by the success of diffusion models in image generation, researchers are exploring their use for text, generating all tokens in parallel and refining results in denoising steps.
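Trade-Offs: They can be faster for certain tasks but often lack the quality or flexibility of autoregressive transformers, especially for sequential work such as reasoning and tool use. Google DeepMind is experimenting with Gemini Diffusion, but positions it as a cheaper, faster option rather than its state-of-the-art model.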

Notable Quote:
"They are not putting it out there as their state of the art model. It's more like a cheaper model...But [it] is not, I would say, the replacement at the state of the art." — Sebastian [12:14]

4. Architectural Tweaks, MoE, and Progress

[13:22–17:26]

Smaller Gains from Architecture: Most current progress comes from incremental tweaks—normalization placement, sparse attention, etc.—rather than major breakthroughs.
MoE Becomes Mainstream: Mixture-of-Experts transitioned from niche to nearly standard for large models in 2025–2026, aiding scalability and efficiency.
Shift in Focus: The low-hanging fruit has moved away from pre-training toward post-training and inference improvements.

Notable Quote:
"Improvement is not so much coming from the architecture anymore. It is basically the post training." — Sebastian [13:49]
"Pre training is not dead, but pre training is boring. It's not where the low hanging fruit is anymore." — Sebastian [17:31]

5. Post-Training: RLHF, RLVR, GRPO

[18:03–28:19]

New Techniques Timeline:
- 2022: RLHF (Reinforcement Learning from Human Feedback) enables sharp jump in conversational ability.
- 2025: RLVR (Reinforcement Learning with Verifiable Rewards) and GRPO (Group Relative Policy Optimization) bring dramatic efficiency and accuracy leaps, e.g., DeepSeek R1 model.
How RLVR & GRPO Work:
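- RLHF relies on humans (or a reward model trained on their preferences) to rank outputs; RLVR replaces that reward model with objectively verifiable rewards (e.g., math or code correctness).
- GRPO further simplifies the process by dropping the separate value model used with PPO: responses sampled for the same prompt are scored relative to each other, which cuts the number of large models held in memory.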
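The two ideas above can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not DeepSeek's training code: the function names are hypothetical, the "Answer: <number>" output convention is invented for the example, and the population standard deviation is used for normalization. It shows only the reward and advantage computation, not the policy update.

```python
import re
import statistics


def extract_final_answer(response):
    """Assumption: the model is prompted to finish with 'Answer: <number>'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None


def verifiable_reward(response, reference):
    """RLVR-style reward: 1.0 if the parsed final answer matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to the same prompt ("What is 2 * 21?").
responses = [
    "2 * 21 = 42. Answer: 42",
    "I think it is 41. Answer: 41",
    "Not sure.",
    "Answer: 42.0",
]
rewards = [verifiable_reward(r, "42") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 0.0, 1.0]
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, without a learned reward model or a value model in the loop.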
Unlocking Reasoning:
- "In my experience...for example, I took the Qwen 3 model...trained it just for 50 steps with RLVR and it goes from 1.5% accuracy on MATH-500 to 50%...by only doing 50 reinforcement learning steps." — Sebastian [24:17]
Process Reward Models:
- Training LLMs to improve not just answers, but quality of internal explanations/“chain of thought”—still a tricky, nonstandard area; reward hacking is an issue.

6. Challenges and Scaling RL

[28:19–35:17]

RL’s Scaling Pain Points:
- Expensive and tricky to implement due to instability, frequent need to “babysit” training, hyperparameter tuning.
- Numerous “tips and tricks” have accumulated to stabilize RL post-training, moving toward maturity.
Meta-Lesson on AI Progress:
- "There's no magic lever, no magic, I guess, bullet that gives you everything. It is kind of tweaking things here and there and, and making things more robust." — Sebastian [35:39]
- Key: Progress is built from many small innovations, not one big breakthrough.

7. Benchmarks and Real-World Evaluation

[37:08–43:11]

Benchmaxing:
- Over-optimizing for benchmarks leads to models that may not reflect real-world ability.
- “Leaderboards” can reward style over substance.
- While models might “overfit” to benchmark datasets, rankings across different systems tend to remain meaningful.
Need for New Evaluation Metrics:
- As benchmarks saturate, there’s a need to transition to more agentic, task-based evaluation, measuring ability to complete real-world tasks over multiple steps.

8. Inference Scaling and Tool Use

[43:11–50:10]

Inference Scaling as a Major Driver:
- Instead of revising model weights, you increase compute at inference time—e.g., generate more tokens, majority-vote, or run multiple refinement cycles.
- Parallel sampling, response ranking, and iterative “self-refinement” all deliver perceived quality gains but at higher runtime cost.
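Majority voting over parallel samples, one of the inference-scaling techniques named above, can be sketched as follows. The `toy_generate` function is a hypothetical stand-in for a real LLM call, and its 60% hit rate is an arbitrary assumption chosen for illustration only.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def sample_answers(generate, prompt, n, seed=0):
    """Draw n independent answers; cost grows linearly with n."""
    rng = random.Random(seed)
    return [generate(prompt, rng) for _ in range(n)]


def toy_generate(prompt, rng):
    """Hypothetical stand-in for an LLM call: returns "12" about 60% of the time."""
    return "12" if rng.random() < 0.6 else rng.choice(["10", "11", "13"])


answers = sample_answers(toy_generate, "What is 3 * 4?", n=8)
final_answer, agreement = majority_vote(answers)
print(final_answer, agreement)
```

The model weights never change here. The extra quality, if any, is bought purely with more generation at inference time, which is why runtime cost rises with the number of samples.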
Tool Use:
- LLMs calling APIs (e.g., web search, code execution) reduce hallucinations and boost performance.
- Tool integration emerged as a key feature on ChatGPT, GPT-OSS, and is anticipated to spread as privacy, sandboxing, and local execution develop.

Notable Quotes:
"I think one of the biggest drivers this year has been also the inference scaling." — Sebastian [43:37]
"Tool calling means that the LLM can call a web search or...code interpreters...that is very, very powerful because...you can outsource a lot of things that are hard to tools. So like we humans do." — Sebastian [46:11]

9. The Next Frontier: Edge, Private Data, Industrial Models

[50:10–55:04]

Horizontal Parity, Vertical Niche:
- Major LLMs from OpenAI, Google, Anthropic, etc., all feel similar at the “top line”.
- Real competitive advantage for businesses will come from combining LLMs (open-source or proprietary) with private, domain-specific data (finance, medical records, etc.).
Return to Private Training:
- Big companies now train and fine-tune their own large LLMs in-house for proprietary use—potentially marking a return to the “edge”.
Open Source’s Evolving Role:
- Open-source weights and research facilitate broader experimentation and education but often trail the scale and performance of proprietary models outside of “big” organizations.

10. Continual Learning: Hype and Reality

[55:04–59:16]

Highly discussed at conferences, but genuinely robust continual learning (self-improving LLMs) is not feasible this year, with 2027 or later more likely.
Bottlenecks: catastrophic forgetting, high resource cost, risk in continuous updating, and lack of user-specific fine-tuning.

Notable Quote:
"Continual learning is an interesting topic because...it sounds attractive if you have an LLM that self improves or like an agent that does something, fails and learns. I don't think anything like that is feasible this year." — Sebastian [55:44]

11. Raschka’s Workflow, Book Writing, LLM Use, and Reflections

[59:16–67:17]

Staying Current: Driven by excitement and personal interest; writes only about topics he genuinely finds engaging.
Blog vs. Books:
- Blog for covering new research rapidly, especially architecture comparisons.
- Book for pedagogical clarity—focus on code, fundamentals, and reproducibility.
- "Code basically doesn't lie—if it either works or it doesn't work, you know. And I think that's a very useful way to learn." — Sebastian [61:10]
Using LLMs:
- Helpful for proofreading, editing, and small clarifications—not for automating research or writing fully.
- "I would say first. Also, the thing is, it's not super satisfying if you just ask the LLM to do it. It's, you know, like cheating at homework." — Sebastian [65:02]
LLM Burnout & Creative Process:
- Delegating too much to LLMs can feel “empty”; values hands-on work for learning and pride in accomplishment.

Notable Quotes & Moments

"There's no magic lever, no magic...bullet that gives you everything. It is kind of tweaking things here and there and, and making things more robust." — Sebastian [35:39]
"Inference scaling...is one of the biggest drivers this year." — Sebastian [43:37]
"I think that's the challenge we have right now. How do you put that into words to communicate the progress and I think that will be in the upcoming years. The more difficult problem to solve how to actually evaluate what you're using." — Sebastian [42:24]
"I get very excited about things and then when I'm excited about something, it goes very easy and very fast...there's like a lucky coincidence that...I honestly write only about things I find interesting." — Sebastian [60:21]
"If you use only LLMs to just generate everything, and I wouldn't say useless, but I would feel maybe empty...Pride. Oh, I did something that worked and it's cool and you're proud of this." — Sebastian [65:20]

Timeline of Key Topics

| Timestamp | Topic |
|-----------|-------|
| 01:05 | Transformer architecture: status, alternatives |
| 04:04 | World models, tiny recursive models, benchmarks |
| 09:45 | Diffusion models for text |
| 13:22 | Architecture tweaks, rise of MoE, small gains |
| 18:03 | RLVR, GRPO, RLHF post-training explained |
| 24:42 | Process reward models, explanation quality |
| 28:19 | RLVR’s application to domains beyond math/code |
| 30:46 | Pain points in scaling RL research |
| 35:39 | Meta-lesson: Progress as incremental, multi-factor |
| 38:40 | Benchmarks & "benchmaxing" |
| 43:35 | Inference scaling, tool use, interface tricks |
| 50:10 | Open source, private training, edge models |
| 55:33 | Continual learning: challenges and outlook |
| 59:16 | Raschka’s work habits, book writing, using (and not overusing) LLMs |
| 65:02 | Reflections on LLM-driven workflows and burnout |

Closing Thoughts

Sebastian Raschka offers a grounded and nuanced view of LLM progress: the biggest leaps in 2026 are coming from a patchwork of small but effective improvements in inference, post-training, and practical engineering, rather than from entirely new architectures or wave-making breakthroughs. Progress depends on both collective incremental innovation and interdisciplinary effort, with private, domain-focused LLMs set to become increasingly important. Throughout, Raschka emphasizes hands-on learning and curiosity—both key for anyone seeking to stay on the leading edge of AI.

For more of Raschka’s work and practical guides, see his technical blog, books, and substack—all recommended for anyone interested in the details behind today’s LLM revolution.


Transcript

[00:00]
A
Pre training is not dead, but pre training is boring. It's not where the low hanging fruit is anymore. Improvement is not so much coming from the architecture anymore. It is basically the post training. I think one of the biggest drivers this year has been the inference scaling. It goes from 1.5% accuracy to 50% by only doing 50 reinforcement learning steps. There's no one thing that fixes it all. It's a lot of little tips and tricks all over the place. If you add them up, that will give you the progress, but there's no magic bullet that gives you everything.
[00:29]
B
Hi, I'm Matt Turck from FirstMark. Welcome to the MAD Podcast. Today my guest is Sebastian Raschka, an AI researcher and one of the best educators in the field, well known for his in-depth technical blog posts and his book entitled Build a Large Language Model from Scratch. In this episode we go deep on the state of LLMs in 2026: architectures, post training, scaling, benchmarks, tool use, and what it all means for the next wave of AI. Please enjoy this in-depth but very approachable conversation with Sebastian. Hey Sebastian, welcome.
[01:00]
A
Thanks for inviting me on your podcast today. I'm excited to talk about anything AI, I guess.
[01:05]
B
Wonderful. So we are going to go in the state of LLMs in 2026 in depth, including very much post training and reinforcement learning. But I wanted to start the conversation with the transformer architecture itself. Obviously the backbone of the entire generative AI revolution, but also over eight years old at this point. And for all the tremendous progress in LLM based systems over the last year, it also seems that there have been some interesting developments in terms of alternative architectures. So has anything caught your eye? And do you think that's a world where the days of the transformer architecture could finally be numbered?
[01:46]
A
Yeah, that is actually a very interesting question to start with. I mean it's starting at the very beginning with the transformer architecture. You said eight years. I think it's almost nine years because it's 2026 and it came out in 2017, quite a long time. And I think the question you raised is if it's like, you know, the final architecture. I think people probably ask that every year. Is this the thing we should be betting on going forward, let's say in 2026? I would say right now, yes, because it's still the state of the art. So there is nothing really better in terms of state of the art performance, getting better quality results. What we have seen so far though is alternatives that make it cheaper. So they have tricks to make the architecture itself cheaper, like linear attention variants that are like a building block in the transformer architecture. A big one was mixture of experts, which is essentially making the model bigger without necessarily making it more expensive to use in inference, like keeping that reasonable while expanding the size. You see all kinds of, I would say, levers, tips and tricks, hacks around that architecture. But it's still kind of like the same architecture in the core. And you can actually in fact take a GPT-1 or GPT-2 model and with a few, I mean few lines of code almost, you can transform it into the latest, let's say DeepSeek version 3.2 architecture. It's not like a big leap, it's still the same scaffold. At the same time you have other alternatives popping up, like you know, diffusion models, text diffusion in particular, or Mamba models, state space models and so forth. They all try to address a problem that the transformer has, namely that it is expensive and big and yeah, expensive to run and train. But then of course there's no free lunch. These have other trade offs. They are like cheaper to run in certain instances if you take a look at diffusion models or text diffusion models. 
But then you don't get the same, let's say, quality out of it. And if you want to get the same quality out of it, in this particular case, you have to crank up the denoising steps and then you end up with something very expensive. So right now I think we are at that point where there is no free lunch. We are still, you know, like we are trying to figure out what is the best next architecture. Right now, there's nothing on the horizon that would replace that. So my short answer is I would say right now if I were to build a state of the art model that would be still a transformer based model.
[04:05]
B
Great. What do you make of world models?
[04:08]
A
Yeah, world models are also an interesting hot topic. So there is the whole world model aspect for more like images and physics and that stuff. So world models are basically models that have like an internal model of the world. So they kind of simulate internally something you have externally. For example, if you have like a chess playing model, it has like an internal chess simulator built inside so it can kind of make better predictions or predict the next states. I think that's particularly interesting for robotics. But coming also back to LLMs, there was also a paper by Meta, it looked very promising to me as refinement or a next step for code based LLMs. LLMs for coding are still next token predictors. But in addition to that, what they did is they also tried to predict the internal states of the variables. It's like that was like during training an objective to, if you have a Python code to say, okay, this, at this iteration, when I, if someone would step through the code, this variable would have that and that value. And so this is in a sense giving the model more context, more information about the training data. And it forces the model also to kind of like in quotation marks, understand training data better. So it's like instead of just, you know, brute force, just what is the most likely next token, it has kind of like an understanding of what it is right now. I think that's also how humans work. When I, for example, as a human read through code, I'm also trying to visualize or verbalize or write down what are the states of the variables in my for loop. For example, at this first iteration, second iteration, back then when I learned coding, I actually had like a paper notebook and was like writing down things with a pencil, basically, like these types of iterations. It's kind of like also this approach, but for LLMs essentially. 
And I do think that is something that is maybe more expensive to do, but it is also something that might push the state of the art a little bit.
[06:01]
B
And what about small recursive models? What does recursive mean in this context?
[06:08]
A
Yeah, so there was also a big topic in 2025, there was the hierarchical reasoning model. And from that we had also another paper, tiny reasoning models. And so they are interesting because they were getting very good performance for their very small size on the ARC benchmark. So ARC is like a benchmark, almost like an IQ test, like a logic puzzle where there are like different symbols and you have to, you see, let's say an array of different symbols and you have to predict like let's say, what's the missing thing here in the bottom corner. And it's kind of like going a bit beyond text and beyond things usually on the Internet. So in that sense, I think the motivation behind this ARC benchmark was to have something that really tests the capabilities of that model on something new that hasn't been shown during training, like a new task and how well the model can take some examples from that benchmark and generalize to new tricky problems. There are also different iterations of this ARC benchmark to make it harder and harder and harder. The hierarchical reasoning model became like popular because it performed relatively well on that benchmark compared to very expensive models like Gemini, ChatGPT, and so forth, and it is a transformer architecture. And then there's the tiny reasoning model that is I think even simpler than the hierarchical reasoning model. And so the idea is that you recurse so you have like latent, let's say storage vector or something like that where you refine the answer over multiple iterations. Instead of just doing a one shot, you let's say write an intermediate answer and the model looks at that, looks, is this correct or not? And takes it another round and another round and refines that answer. It is not cheap either, but the model itself is much cheaper. It made a lot of waves also because like, oh, we don't need these big ChatGPT Gemini type of models to solve complex problems. 
I think this is to some extent true, but I think also that kind of underestimates the appeal of ChatGPT, Gemini, Claude and so forth. And the appeal there is it's like one model that can do it all. It's very general purpose. You don't have to even really teach people that much how to use that model. I can ask anyone like, hey, you know, here's ChatGPT, the interface and people who have never used it before, they will be able to figure it out just typing some prompt and I can drag and drop an image there, I can ask it a code problem. You know, it's doing a lot of things very well at the same time. It's also a downside because it's this gigantic model which is very expensive. So if you have a simple task, it is very expensive to run such a big model and at such a big scale. And so there is then this appeal to develop these special purpose models. So tiny reasoning models, hierarchical reasoning models, they are very specific to a particular task. For example, in the paper they had pathfinding, like finding the path through a maze or something like that, like a toy problem and then the ARC benchmark. But each one was a different model. It was not like one model that could do all three things. In that sense it is, I think, really hard to compare to something like Gemini or ChatGPT. At the same time, I do think this is a very interesting and promising direction because even though, let's say ChatGPT can do everything, it's not always the cheapest thing. If you have a business problem, you are maybe manufacturing something, maybe you can start with a generalist model, but then once you know exactly what the task is and you want to hone in on it, maybe it makes sense to replace that expensive thing by something like that, that is cheaper, you know, like a module that you can plug in and you can even have an LLM like ChatGPT or Gemini use those as tools. 
And so I think that it's a great development, but I don't see it as quite a fair comparison to, you know, state of the art LLMs, basically.
[09:46]
B
I want to come back to something you mentioned a few minutes ago. Diffusion models, especially for text. Last year, I believe Google DeepMind announced one called Gemini Diffusion. So what are those and how different are they from transformers?
[10:01]
A
There's of course the big field of diffusion models coming from image models. Like not too long ago, maybe two, three years ago, there was like the big hype around Stable Diffusion, which was based on a research paper where they had like a model that replaced, going back, generative adversarial networks, which were an idea for generating images. And so the diffusion models were essentially, instead of having a generator and discriminator setup, like two networks competing against each other, it was like a pipeline that was denoising, starting with random noise, denoising an image and coming up basically with realistic looking images. And you could also have a text prompt and basically guide it, so it's not like a random image. You can basically guide what you want to generate. It's basically the modern generative AI image AI that we see out there. People were wondering, okay, can we do the same thing for text? So can we use this, you know, pipeline, like this denoising pipeline to generate text instead of using transformers or I mean I'm saying instead of transformers. Diffusion models can be transformers, and often are, because the transformer is the architecture. So LLMs nowadays, it's specifically autoregressive transformers, which means these are LLMs that are generating one token at a time. So like where the next token always depends on the previous tokens. And so with diffusion models you don't have that. You generate everything at once in parallel. But it might be very messy. And then you have multiple iterations. You take that whole thing and denoise it, basically refine it. What's nice about it is, well, it is fast because it's like one iteration generates something and then you have a few steps that refine that, which might be cheaper than using an LLM to generate a long response because then you have a lot of sequential steps. 
So let's say 16 denoising steps is fewer steps than having 2,000 tokens, 2,000 steps that you generate something with. The downside is, well, you have everything in parallel, and there are nowadays a lot of tasks that require sequential processing. For example, if you think about reasoning models or for example tool use, when you have a reasoning model and you ask or in general you ask a model to answer a question and the model maybe does a web search as part of its answer. And so you have to kind of interrupt the generation. I think the diffusion models, they have these downsides. But like what you mentioned is Gemini. I remember seeing the Gemini Diffusion website where they are saying something like coming soon. And they compared their diffusion model to their latest, I think they call it Flash, the cheapest model. And so as an alternative to the Flash model being even like I would say faster at the same performance level. But they are not putting it out there as their state of the art model. It's more like a cheaper model, maybe for everyday use, maybe for like the free tier or something like that. So it is an interesting direction to go into these diffusion models as alternative to the autoregressive transformers. But it is not, I would say the replacement at the state of the art. I think one company will launch a big diffusion model this year. So there are diffusion models out there that you can use already. But I haven't seen anything at, you know, like Gemini, ChatGPT, or Anthropic Claude scale I think. But this year maybe we will see something like that.
[13:23]
B
Yeah, great. Super interesting. In the world of LLMs you mentioned MoE and that triggers a question which is what I think a lot of people are wondering which is like are we seeing real architecture breakthroughs within the LLM world or are we effectively at this point polishing what we already have within the LLM world? What are you seeing that's moving the needle in terms of architecture improvement or optimization?
[13:50]
A
Improvement is not so much coming from the architecture anymore. It is basically the post training. But like coming back to the architecture, I think it's still an interesting question because there are so many different architectures and almost no one uses the same one. Like they're all very similar, but they are not identical. I think a lot of it is coincidental where there are some tweaks and if you look at the loss in some cases on some training data and some training pipelines, maybe like moving the normalization, the RMSNorm, before or after makes a small difference. I mean, there are theoretical justifications for it. But also, for example, OLMo 3, which is very transparent, they moved the RMSNorm placement. And then Gemma had a post and pre norm. They had both on both ends. And so there is some justification where okay, ablation studies show this stabilizes the training, but while assuming a stable training, it's not gonna, I think, make your model magically perform better. I mean this is just like people tune their cars a little bit by, you know, putting in different air filters and something like that. So I think it's on that level where you make small tweaks, but it's not really changing the engine itself. The one thing though, what we've seen is a lot of large architectures now using MoE. I think that's a new 2025 thing. Of course, MoE was not invented in 2025. That was, I think going even back to the Google Pathways paper in, I don't know, 2022, three, something like that. And then Mixtral had a big MoE, I think it was 2024. Then I think it was pretty quiet around MoEs. I mean, there was only, I think, the ChatGPT model, which was rumored to be an MoE. But now this year really almost everyone has an MoE out there, like every open weight developer. I would say DeepSeek kind of like restarted that trend in 2024, in December with DeepSeek version 3. 
They had an MoE model before, but I think this is like the one that everyone looked at because that made such a big splash that people were like, oh, what they are doing is maybe sufficient. It's the right thing. Let's not, you know, try something crazy. Let's like iterate on that. So there were a lot of companies adopting straight up the DeepSeek architecture. So there was like, I think Kimi had the DeepSeek architecture and scaled it up, I think, from 670 billion to 1 trillion parameters. And then even the European Mistral AI company used the DeepSeek version 3 architecture for their new Mistral 3 model. It is something that is working well. But then DeepSeek itself, they iterated on that too. So they have DeepSeek version 3.2 where they changed the attention mechanism. They had a multi head latent attention, which is already a nice tweak, and they added a sparse attention. Sparse attention, again, is not new, but they had their own flavor of it to make it cheaper. The idea I think is to get better modeling performance through the training pipeline while tweaking the architecture. So that benefits can be of course absorbed by the architecture but then also at the same time to bring down the cost of running the architecture. Because we've seen with GPT-4.5, which was also just 2025, GPT-4.5 was rumored to be a bigger model, a bigger version of GPT-4, but it was not very popular because it was too big, too expensive. And so they kind of abandoned it and went a different direction with GPT-5. And so I do think, well, I wouldn't expect bigger architectures, I would expect more efficient architectures, tweaks, getting the same modeling performance for less compute because then you can have more tokens for the same cost. And the tokens, they give you better performance, like inference scaling and so forth.
[17:26]
B
But you see room for progress there. Like you're not in the pre training is dead camp.
[17:32]
A
I would say pre training is not dead, but pre training is boring. So it's not where the low hanging fruit is anymore. I think the low hanging fruit used to be in pre training, but now you need really good pre training still. But it is, I think harder. I mean I wouldn't say harder, but you can get better bang for the buck elsewhere almost. I would say pre training. I don't think it's dead, it's just not, let's say the most popular thing to spend money on right now. I think it would make more sense to put a lot of that budget into post training right now.
[18:03]
B
Okay, let's go into post training. So in your blog post you mentioned that 2025 was the year of RLVR and GRPO. So you had like a nice timeline where you said 2022 was RLHF, which gave us ChatGPT, plus PPO, 2023 was LoRA SFT, 2024 is a year of mid training and 2025 the year of RLVR and GRPO. So we'd love it if you could walk us through those techniques. So fair to say so both of those belong to the world of post training. Let's pick RLVR and let's start with the definition. What does RLVR mean versus regular RL?
[18:46]
A
I would say RLHF is the biggest leap in LLMs we have seen in a long time, because that was taking GPT from GPT to ChatGPT, you know, the RLHF, the reinforcement learning with human feedback. And in that sense RLVR, which is reinforcement learning with verifiable rewards, took that other leap, basically from a simple chat model to a reasoning model. Both RLHF and RLVR have the RL in them, so both are based on reinforcement learning. But this reinforcement learning is a bit different from the reinforcement learning that plays Go. It's almost like a special and simpler thing in the context of LLMs. The idea is that instead of doing next token prediction, just predicting what's the next token, it's more like looking at the full answer, and then based on that answer you give a reward. In RLHF, you have multiple answers and then you say, which do you prefer? In the case of RLVR, you look at the full answer and then, let's say it's a math problem, you say the final answer of the math problem is correct or incorrect. That's the main difference between next token prediction in pre training and the RL here. So RLVR was kind of popularized by DeepSeek R1, which was based on DeepSeek version 3. R1 came out, yeah, January 2025. And with that they also introduced the GRPO algorithm you mentioned. They go well together because GRPO makes the whole thing more efficient, but it doesn't have to be that way. You could technically do RLVR with the PPO algorithm that was used back in RLHF. Now, why I think it's such a powerful combination is, well, it just makes things more efficient. With RLHF, you had to have people ranking answers, because the goal was essentially to train a model that prefers one style over the other. So for example, for safety, like reducing swear words, if there are two answers, use the one with fewer swear words.
Or if you have an explanation, maybe use the explanation that is simpler to read, and these types of things. But you always have to have someone who compares these answers and says, okay, this answer is better than the other answer. What you do then, though, during RLHF, is you train a reward model, another LLM that provides this information for you. So at that point you can replace humans looking at these answers. You have this other model that does it automatically as part of your loop. It's more expensive. Now you have two models, essentially. And then there's also a value model. The value model is internally kind of like a reward model, but it also gets updated to make some predictions as part of the reinforcement learning signal. And so you have basically three models in memory. And if you have ChatGPT-style training with large models, or even something like DeepSeek version 3 with 600 billion parameters, you have 3 times 600 billion parameters, and you have to keep them all in memory. It's very expensive. And so in RLVR with GRPO, you replace two of these models. You have three models for RLHF with PPO. You replace the reward model with verifiable rewards. So instead of having someone say, oh, I prefer this answer over the other, or using an LLM for that, you now have tasks that can be automatically verified. For example, for math, you can have a math parser. It could be something like Wolfram Alpha. You have the correct solution and you have the LLM solution, and you just parse out the part that you can compare algorithmically, and then based on the correctness you can give a reward for the reinforcement learning. So you already eliminate one big LLM that you would have to train and have in the loop, and the other one you also eliminate. There's the value model that assigns a value to each of the responses during the training. And in GRPO, so that's the GRPO part, you just compare them relative to each other.
That's where the R in GRPO comes from, the group relative policy optimization. And so, yeah, this makes it much more feasible to train, it's just cheaper. And they show that it is actually really powerful. You can take a base model, even skipping supervised fine tuning and RLHF, and just do this RLVR, and you get a really good reasoning model out of it, the DeepSeek R1 model. You can still do supervised fine tuning and RLHF, and it's recommended to do it, but the reasoning behavior comes from that RLVR. There are of course papers showing that the base model already has reasoning capabilities, and I think this is actually partly true, but it is hard to say for sure because, yeah, you don't really know what's in the pre training data anymore. There's also a lot of reasoning data in there. Reasoning data is essentially just data that has this chain of thought format, which means that the model writes intermediate steps, like it explains its own answer. A lot of the pre training data already has that style of data in it, and then it's hard to say, does the reasoning behavior come from the pre training corpus or is it from the RLVR? In my experience, I think a little bit of both. So for example, I took the Qwen 3 model as part of my book, the Reasoning from Scratch book, and I trained it for just 50 steps with RLVR, and it goes from about 15% accuracy on MATH-500 to 50% on MATH-500. So that's roughly a threefold leap in accuracy from only 50 reinforcement learning steps. And I think it's not really learning that much in these 50 steps in terms of how to do math better. I mean, yeah, it does, but it's not learning new knowledge about math. The knowledge is already there from the pre training, and this just unlocks it. It's a step that maybe shows the model how to use its own knowledge, basically. So that's how I think about it.
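A minimal sketch of the two pieces described in this answer: a verifiable reward that replaces the learned reward model, and a group-relative advantage that replaces the value model. This is an illustration under stated assumptions, not DeepSeek's implementation. The function names, the exact-match check, and the `eps` constant are hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the parsed final answer matches the reference, else 0.0.

    Assumption: a plain string comparison stands in for a real math parser.
    """
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sampled answer relative to the other answers for the same prompt.

    The group mean plays the role of the baseline that a value model would
    otherwise provide. Dividing by the standard deviation is the normalization
    term that some later variants drop.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42".
answers = ["42", "41", "42", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, without keeping a second or third large model in memory.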
[24:42]
B
Yeah, fascinating. Just to unpack some of this, on understanding the reasoning steps that lead to the answer: you mentioned in some of your writing process reward models, PRMs, and the fact that this is not successful yet. Can you unpack that part?
[25:00]
A
So there's an outcome reward and a process reward, and the outcome reward is mainly, is the final answer correct or not? But then there's the whole explanation of the reasoning model that leads to the answer. And so there's also research asking, hey, why should we throw out everything the model generates and only look at the final answer? Can we get something useful out of this intermediate explanation? And the intermediate explanation is useful for several reasons. One is that it has been shown that this helps the model to generate the correct answer, whether the explanation itself is correct or not, which is a different aspect. Just the fact that it generates these intermediate steps is correlated with a more accurate answer. Then the hypothesis is, if we can improve that explanation, maybe it gives an even better answer, maybe it even drives the accuracy higher. If you want to learn something, it's not enough to just see the final answer. You want to see the steps that lead to the final answer. Process reward models are focused on rewarding the model based on that explanation. And so my statement that it is not so, let's say, promising or useful was mainly based on the R1 paper, where they had a final paragraph at the bottom. I mean, this is already a year old, but they had a paragraph at the bottom where I think the headline was something like "Unsuccessful Attempts." They tried it and they found it wasn't worthwhile because of reward hacking. Usually you need another model to grade the responses, and those models can be susceptible to reward hacking. It's hard to train that model and it's not reliable, and so that whole thing is not really worth it, according to their experiments. There are a lot of people who try to make it work, and I think it is promising and we will see it working at some point. It's just that right now it's still tricky to make it work.
But I'm quite sure we'll see it as part of the standard repertoire at some point. For example, at the end of last year there was the DeepSeek Math version 2 paper, where they actually had a nice study. They had a second model that was checking the answers and explanations of the first model. And it's almost like turtles all the way down, because they had yet another model. So they had three models: one model generating the answer, one model to grade the answer and the intermediate steps, and one model to grade the grader, basically to say, oh, is the grader actually doing a good job? So there were three models in a row. It sounds a bit excessive, but that whole setup performed really well. They were also cranking up the self refinement steps and iterations, and they got gold level performance on some of the math benchmarks. One could say, okay, maybe that's cheating, the data was public or whatever. I don't know, that's a different question. Whether it's really gold level performance is a different question too. But the fact that this performed better than the model without it tells me that it is doing better than just the plain model. So it is actually adding usefulness to the whole process, and I think we will see more of it. It's just expensive, because now you have to have more models, more training, more stuff. But that's what I meant earlier: that's where you make the bigger gains, rather than by scaling the model size. I think that's one of those things where you will see more progress coming from.
[28:20]
B
Yeah, and speaking of math, I think that's one of the key questions going forward for RL, whether you can expand this beyond math and coding to other domains. What's your take on that?
[28:32]
A
Yeah, I think what's so attractive about RLVR is that you don't have to have, let's say, humans checking the solutions. You have a verifier that deterministically checks, for math, is the answer correct? Given two fractions, are the fractions the same? Given two numbers with decimal points, are they the same if I round them? And so it's very easy to check programmatically, algorithmically, and the same goes for code. You have code problems, and in that case you can compile the code. If it compiles, or if you have unit tests and they pass, it works. It's very nice to check. There's no subjective aspect. It's very objective. You can say, okay, it compiles, it doesn't compile, it's very clear cut. The question you had is, what happens now in general with other fields? Is it specific to math or code? And I think we will also see expansions of that to other fields. I'm personally not an expert in other fields, so I don't know what that would look like for medicine. I mean, I have a computational biology background. I know a little bit about the drug development pipeline and so forth. But I think it's not quite as clear what the reward looks like. But you can also be more creative. It doesn't have to be strictly verifiable through an algorithm. It can be verifiable maybe through an LLM. So for example, I can see, I don't know, for research, maybe training a model to give correct citations, and you can maybe check the citations. You could have another model that goes to the URL and says, oh, this is indeed the correct paper, with the correct title, it's a correct URL, and stuff like that. So I think there are lots of these things that we can expand RLVR to and train on. I think we will see a lot of that.
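The deterministic checks mentioned in this answer (are two fractions the same, are two decimals the same after rounding) can be sketched with Python's standard `fractions` module. The helper names `parse_number` and `answers_match` and the `decimals` parameter are hypothetical, chosen for this illustration. A real verifier would need a much more tolerant parser.

```python
from fractions import Fraction
from typing import Optional


def parse_number(text: str) -> Optional[Fraction]:
    """Parse strings such as '3/6', '0.5', or '1,000' into an exact Fraction."""
    cleaned = text.strip().replace(",", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None  # not a number we know how to verify


def answers_match(model_answer: str, reference: str,
                  decimals: Optional[int] = None) -> bool:
    """Return True if both answers denote the same number.

    If `decimals` is given, both values are rounded to that many decimal
    places before comparison (Python rounds exact ties to the even digit).
    """
    a, b = parse_number(model_answer), parse_number(reference)
    if a is None or b is None:
        return False
    if decimals is not None:
        a, b = round(a, decimals), round(b, decimals)
    return a == b
```

For example, `answers_match("3/6", "1/2")` treats the two fractions as equal, and `answers_match("3.14159", "3.14", decimals=2)` accepts the answer after rounding. There is no subjective judgment involved, which is what makes the reward verifiable.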
[30:22]
B
The thought that crosses my mind as you describe this is that I've heard people say that RL is very difficult to scale. It's very finicky. And hearing what you describe about different techniques put together, I'm starting to get a sense for why that is, why it's complicated. Is it hard to scale because it's basically a bunch of different things and models talking to one another, or is there another reason?
[30:46]
A
Well, just before we recorded this, I implemented GRPO with RLVR from scratch in a Jupyter notebook. I wrote up a chapter, 39 pages. So it is not super complicated. You can fit it into a Jupyter notebook, it works, and it trains fine. What I'm trying to say is, if you can figure out pre training, the scale of pre training, you can figure out this. Because if you look at the numbers of how much it costs just in GPU hours, for DeepSeek version 3 they had about a $5 million price tag, given the, I think, $2 per GPU hour they assumed. Whether that's a correct assumption or not is a different question. But if you compare it to the cost of R1, I think R1 was about $300,000 when they trained it. They had a number in the Nature version of the paper. So it's basically more than 10 times cheaper than pre training. And it's the same infrastructure. I think the complexity comes from making sure that GPUs don't crash, and if they crash, that you can resume, and so forth. Also, during pre training you might have bad losses where you want to roll back to an old checkpoint, and the same things apply here. You have multiple models. But yeah, you are right, there is a bit more, I would say, trial and error in RLVR, just due to the nature of the updates and so forth. That's what I observed when I was training my models. Not that often, but every so many hundreds or thousands of steps, the model gets bad. The model works totally fine, you train longer and longer, and suddenly the model is really bad. And so you just go back to the previous checkpoint. But that's not a new thing, it happens all the time in pre training as well. I do think vanilla GRPO, the original algorithm, is pretty flaky. You have to babysit it.
Over the course of the year many people had these tips and tricks. Some people were saying, remove the KL divergence term, if you just drop it, it performs better for math. Remove the standard deviation, the normalization term. Or if all the rewards look the same, you can skip them to make it faster. There are a lot of these tricks of the trade that make it more stable. And I think if you apply all of them together, it is actually a pretty stable, okayish algorithm. Just last week NVIDIA also had a paper on GDPO, I think, yeah, GDPO. They were also focused on algorithmic improvements, with respect to multiple rewards. If you have more than one reward, it could be something as simple as this: you have the accuracy reward, but you usually also have a format reward, because you want the model to put the final answer in a particular format. You don't have to, but you can put that into these think tags. So there's a think token. It's honestly more for stylistic purposes. I think the advantage is that some people develop hybrid models that are capable of a normal mode and a thinking, reasoning mode. So thinking stands for reasoning. The appeal here is that you don't always want to use reasoning mode, because it's expensive, it uses a lot of tokens, and sometimes you have a simple answer and you don't want to spend 2,000 tokens on it. And so with these think tokens you can steer it a bit. For example, in Qwen 3, you can add empty think tokens. You have an opening token and a closing token, and if you add them with nothing in between, the model will not generate any reasoning chain of thought. Long story short, during training you can teach the model to adhere to these different formats. So then you suddenly have a second reward. One reward is correctness: is the answer correct?
The second reward is, does the model output something that fits my formatting here? And then you have two rewards, and the question is how you combine them. Originally you just add them up. But then there were some, yeah, downsides in terms of GRPO instability, and so GDPO had some algorithmic improvements to improve the stability. There have been lots of these little tricks over the year to make RLVR more stable. But it is a newer paradigm, so it just takes a few iterations to find the canonical one. It's similar to optimizers, you know, with Adam. Right now there's AdamW. There was SGD and RMSProp and all the others, however they were called, and they kind of all converged to AdamW by adding more and more tricks. And I think it's the same right now with RLVR and GRPO, where we are adding more tricks to get to something that is pretty stable across a lot of different scenarios.
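The two-reward setup described in this answer (a correctness reward plus a format reward, simply added together) and the trick of skipping groups whose rewards all look the same can be sketched as follows. This is a sketch under assumptions: the literal `<think>` and `</think>` strings, the regular expression, and all function names are hypothetical, and this shows the simple summation only, not the GDPO method.

```python
import re

# Assumed format: reasoning inside think tags, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed think-tag format."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """Return 1.0 if the text after the think block equals the reference."""
    match = THINK_PATTERN.match(response.strip())
    final = match.group(2).strip() if match else response.strip()
    return 1.0 if final == reference else 0.0


def total_reward(response: str, reference: str, format_weight: float = 1.0) -> float:
    """Combine both rewards by adding them up."""
    return accuracy_reward(response, reference) + format_weight * format_reward(response)


def is_uninformative(group_rewards: list[float]) -> bool:
    """True if every sample in the group got the same reward.

    Such a group has zero relative advantage everywhere, so it can be skipped.
    """
    return max(group_rewards) == min(group_rewards)
```

An empty think block, such as `"<think></think> 4"`, still satisfies the format check here, which mirrors the idea of switching off the reasoning trace while keeping the output format.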
[35:17]
B
It's fascinating to hear you talk about tips and tricks and different techniques. That triggers a thought: you had a nice way of putting it in your blog post. Taking a step back for a second from the weeds, you talked about a meta lesson for all the things in 2025 and where progress actually comes from. Do you want to get into that? I think that'd be interesting.
[35:40]
A
Yeah. And so meta lesson would be essentially here that. Well, the whole. I think we are talking right now for half an hour about different things. So I think the theme would be, well, there's no one thing that fixes it all. It's a lot of little tricks and tips and tricks all over the place. And if you add them up, that will give you the progress. But yeah, I think there's no magic lever, no magic, I guess, bullet that gives you everything. It is kind of tweaking things here and there and, and making things more robust. I think the tweak was the transformer architecture back then and now it's essentially let's, you know, make it even better, I guess, refining it. And a little bit of post training here, a little bit maybe improving the quality and pre training, maybe some architecture tweaks, algorithmic tweaks. It's all a little bit of everything basically. It's also that I would say you don't have to know all these things because like in practice it's like a big team at a big company and everyone has a specialty. Like everyone is either like on the post training team, on the pre training team. It's not that one person has to know everything and all the tricks because that would be really impossible. And so I think it's also just due to the nature of work because it's so much work. It's a lot of work to train these big models that you kind of like separate these roles and then everyone is working on, can work on everything at the same time, which is also nice. And then you bring back together all these improvements into the model. Yeah.
[37:02]
B
And you're confident in the industry's ability to keep coming up with tricks and tips going forward.
[37:09]
A
Yeah, that is a good question. I mean, if I look at DeepSeek, for example, and I'm always picking DeepSeek in this podcast because I think they have a really nice trajectory of models. I wish I could also talk more about Gemini and ChatGPT, but they don't really release the techniques, so it's hard to talk about them. So, picking on DeepSeek here, if you look at version 3 and then R1, and then they had the version 3.2 model with the sparse attention mechanism, and then also this Math version 2 with self refinement and everything. So right now they still have a track record of improving things, and they are rumored to release a new model in February, DeepSeek version 4. So far I think we are still on that trajectory where we haven't run out of ideas. I think the only thing we are really running out of is benchmarks. The improvement on benchmarks is kind of harder to measure. And maybe it's not the one-shot problem anymore, where it's not really about answering a knowledge question or solving a math problem in one iteration of the benchmark. It is maybe more like the agentic cycle, where you have an objective that is not, let's say, answer the question, but more like design something, blah, blah, blah. And then it goes off, and the question is how long it needs or how long it can run until the problem is solved. I think how we measure progress is maybe moving more towards that, rather than whether we get 90 or 95 or 97% on a benchmark.
[38:40]
B
Yeah, yeah, you had this nice expression benchmaxing in some of your posts. What do you mean by that?
[38:48]
A
Yeah, so benchmaxxing. I'm often reading things on X because that's where a lot of the AI community is, and I think benchmaxxing is one of the terms that came up in 2025, a newer generation term. Loosely, what it means is essentially that it's almost like exploiting the benchmarks. People train models to do well on the benchmarks, but it doesn't really translate to real world performance. A popular example was the Llama 4 model. Based on rumors I heard, they had a separate model just for the benchmarks, the leaderboards. But even leaving that aside, if someone trains a model for leaderboard performance, it doesn't mean the model necessarily performs better in real life, because leaderboards are also susceptible to style. With leaderboards, the tricky part is that humans compare which model they prefer. Let's say I have a very complicated math problem, I ask an LLM, and I don't even know the answer, and it should help me with my tax report or something like that. And there's one LLM that gives me a really nice explanation. Maybe the result is wrong, but the explanation is really nice, easy to follow. I would probably like that one. And so I would probably give it a thumbs up because, oh, it's understandable, it's reasonable, because I don't know what's correct, I'm not a tax expert. And I think that's one problem with leaderboards. They reward the style more than the correctness, because there is no correctness check. You as the user would have to be the expert who knows whether it's correct or not. And so for LLM developers, too, when you're training the LLM, it kind of gets biased to follow a certain style, the style of people who use those leaderboards. And in that sense you end up with models that have, let's say, been benchmaxxed. They have been getting really good benchmark scores, but they might not do better than previous models.
And then it's kind of like a tricky thing. It's hard to measure progress this way.
[40:40]
B
Do you think people do that just out of largely economic incentives? The companies need to raise more money and people need to have successful careers and therefore they want to look good. Is that the driver?
[40:53]
A
I mean, I don't want to accuse anyone. I don't know for sure. I only know what is known on the Internet. I read on, let's say, Reddit a few times that the Llama 4 on the leaderboard was a separate model. So there might have been, I don't know, some company leadership decisions that led to that. I honestly don't know. And maybe the incentives are getting good headlines and that stuff. But I think the open weight community is a pretty smart community, so I think it's not worth risking something like that, and I think most people don't risk it. It's just implicit, it just happens. If you iterate too many times on the same data, it's a classic deep learning problem or machine learning problem. But the beautiful thing here was that it's actually not a big concern, because it happened to all the models. All of the models performed like 5, 10% worse on this new data, and it was pretty consistent. So if you were to rank those models, the ranking would still be the same. In LLM terms, let's say ChatGPT and Gemini cheat on the benchmarks and the models are 10% worse. If the ranking is still the same, let's say Gemini is still better than GPT, then it's not a problem if both of them do that. So I think we have that right now in LLMs, where I wouldn't say they are cheating, they are just using the data a lot. And from using the data a lot, well, the data leaks in a sense. So you're kind of biased. But if they're all biased, then it's fine again, because the ranking is still the same. But I think, yeah, the problem still remains. The benchmarks are saturated and it's hard to demonstrate or detect or have any type of notion of progress. Right now, honestly, personally, I stopped looking at the benchmark numbers. I just use the model for a few days and I see if it's better or not. I can't say, okay, this is better by so and so many percent. It's more like, oh, I use it and I feel like it's doing a better job.
Right. I can't even put it into words, and I think that's the challenge we have right now. How do you put that into words to communicate the progress? I think in the upcoming years that will be the more difficult problem to solve: how to actually evaluate what you're using. I mean, the power of LLMs is that they are so free form. But that's also the downside for evaluations, because if you want evaluations to be numeric and precise, well, free form is not so easy to deal with, basically. Super interesting.
[43:12]
B
So going back to tips and tricks to make sure that we cover the state of LLMs in I guess early 2026. So we talked about post training. How much in the recent progress in the last year or so do you think comes from non architectural and non post training stuff? And I'm thinking in particular inference scaling and then tool use.
[43:36]
A
I do think a lot. I mentioned post training previously, but honestly I think one of the biggest drivers this year has also been inference scaling. Inference scaling essentially means you don't change the weights of the model, you just expend more compute while using the model, when the consumer, the user, uses the model. A beautiful example, or chart, was back in October 2024 when OpenAI o1 came out. They had this chart with two subgraphs. One was for scaling the training and one was for scaling the inference. And you could see that both were going up with a similar increase. So you can basically invest more money during training, which is a one-time cost, and then you have the fixed model and you never have to pay again later on. But that breaks a bit, in a sense, because reasoning models generate more tokens, so they are also expensive to use during inference. Inference scaling includes that. It includes models that generate more tokens, because if you generate twice as many tokens it's twice as expensive, since now you have twice as many steps. That is one form of inference scaling, and it can lead to more accurate answers. The other one is parallel sampling. You just ask the model multiple times and take something like a majority vote. Most people do that if you look at the benchmarks. It's called best-of-something, like best-of-five, or I think they write it with an at sign, like @5, meaning running it five times and then selecting the answer. So you can do a majority vote. But it's five times more expensive now, because you have to run the model five times. There are also methods where people have a judge model that judges these results. If you can't do majority voting, you can have a score and then pick the highest scoring answer. It's a bit brittle because, well, that model can also make mistakes.
But there are all these types of tricks, or self refinement, where you have multiple iterations. Self refinement is basically these iterations where you have one LLM write the answer and then you say, okay, take a look at this answer. And then it goes, oh, I made a mistake here, and it self refines. I mean, reasoning models sometimes do that internally as part of the chain of thought. But you can also have an explicit version of that. Another really cool paper that came out in January was RLMs. What they do is they take the prompt, and instead of processing it all at once, they chunk it up into several smaller prompts, or rather the LLM decides how it should chunk it up in code, and then runs a prompt on each of those. So basically turning one prompt into smaller prompts and then having multiple requests. And this, I would say, is also a form of inference scaling. Well, it depends, but I do think it can be more expensive because now you have more LLM calls. And if you want to go deep on each one, you can end up spending more tokens, not more tokens in one request, but in the sum of all the requests. I think all these things are underappreciated in a sense. I mean, in this case it was a popular paper, but often inference scaling is not talked about that much. But I think it is a big driver of making LLMs perform well. And the reason I think that is, let's say you use a local LLM, like DeepSeek, and you use ChatGPT. ChatGPT is of course a really good model, but I do think the leading open weight models are not that far behind. Yet if you use them locally, they don't feel as good. And I think that's because ChatGPT has a really good interface. The platform they have is not just running the LLM, it's maybe cleaning up your prompt.
Again, this is, I would say, a hypothesis, I'm guessing here. Of course the LLM can and does learn how to deal with misspelled words, but instead of relying on that, you can also just clean up the input in certain cases where that might improve the accuracy. And I think all these little engineering tricks, not just inference scaling, but cleaning up the prompt, how to manage the context, the history and everything, all contribute to a lot of the progress that is felt by the user. Yeah, and then another example would be tool calling. I think that was also a big one in 2024. I don't remember when ChatGPT introduced tool calling, it might have been early 2025 or 2024, but there's gpt-oss, the open weight model they released in, let's say, summer 2025, and gpt-oss has tool calling support. Tool calling means that the LLM can call a web search, or it can call code interpreters and so forth. And that is very, very powerful, because I think this is one of the ways you can, not totally mitigate, but let's say reduce hallucinations, because the LLM suddenly doesn't have to remember everything anymore. You can outsource a lot of things that are hard to tools. That's what we humans do, right? We use calculators, we use web search, we don't try to memorize everything. And so I think in that sense it is really a big unlock. The only problem is, well, you have to trust the LLM to run on your computer. Which is why right now I think it's more confined to these proprietary LLMs like Gemini and ChatGPT, because, well, it's not your computer it runs on. If it goes somewhere, executes some code and messes it up, well, not your problem. But I think we will see more of that in the upcoming years, when the open source tooling gets more robust and people have more trust in running that on their own computer. Maybe in a Docker container still. But yeah, something like that.
I mean, right now a lot of people already run code agents on their computer. They're usually kind of restricted. But I think people are more and more trusting these agents to do things they wouldn't have trusted them to do a year ago, and giving them permissions. As LLMs get better, people develop more trust, or they run it in a secure virtual environment. And I think that is where a lot of the other leaps we will see come from, even though the LLM doesn't get bigger or anything like that. You can actually go to the gpt-oss release blog, and they did have benchmarks showing the performance of the same model with tool calling enabled and disabled. And you can see, I mean, it's not two times the performance, it's maybe 1.2 times the performance. But it is definitely a jump in capabilities just from allowing the model to use tools, basically.
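The parallel-sampling form of inference scaling described in this answer (ask the model several times, then take a majority vote, or let a judge score the candidates) can be sketched as below. The sampled answers are hard-coded stand-ins for real model outputs, and the function names and the `score` callback are hypothetical. In practice `score` would be a call to a judge model.

```python
from collections import Counter
from typing import Callable


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer


def best_of_n(answers: list[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scoring function rates highest."""
    return max(answers, key=score)


# Five samples for the same prompt: five times the inference cost of one call.
samples = ["12", "15", "12", "12", "9"]
voted = majority_vote(samples)
```

Majority voting only works when answers can be compared for equality, as with a final number. For free-form answers the judge-based variant is the fallback, and as noted in the answer it is more brittle because the judge can make mistakes too.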
[49:59]
B
And that's part of where you see the world go, right? This combination of open source models and private data, I think you call that the edge in your blog post.
[50:11]
A
What distinguishes different LLMs? Right now they're all kind of similarly good, I would say. Like open weight LLMs, there are a lot of LLMs that are similarly good. Personally, I don't use all of them all the time, I usually use one LLM at a time. But if you use or compare ChatGPT, Gemini, Claude, Grok, I think they're all pretty much on the same level. And I think that's because they're trying to do everything. They're generalist models for a general person to do a lot of things. I mean, Claude is a bit more specialized to code now, but the other ones are more like general models, and I wouldn't say one is significantly better than the other. They have small differences and so forth. So if you want to really distinguish them and make them better in certain industries, I do think, yeah, the private data is what helps. Like all the treasure troves of data that a finance company has collected over the years, over 100 years or 50 years, or medical data, like medical records from patients. I think JGP had like a contract now to process them to make it secure and private. But I honestly think these companies don't want to just give away that data. First, they can't. I mean, it also makes sense. As a patient or customer, I would feel really bad if someone gave my health data to some other company and I didn't agree with this sharing. But then also, the companies don't want to just give away all that data, because once they do, well, then maybe they become kind of obsolete, in the sense that all the treasure, basically what makes them different from other companies, is now taken. And so over the last month, I mean, people reached out to me also. I know for a fact that big companies are now training LLMs in house.
Really, big companies who have the financial means to train a ChatGPT-like model are hiring people who train LLMs. And I think that is also what we will see: instead of going to a big LLM provider and giving them the data, people will try to make their own LLMs for their own company and private data.
[52:20]
B
It's fascinating, right? It's back to the future because initially people thought that they were going to train the models and then they kind of like gave up. But what you're saying is that you're seeing people going back maybe with a better state of open source LLMs that they can build on as a building block. That's what you're saying, right?
[52:37]
A
Yes and no. I think you're right. I mean, no, you bring up a good point. Open weight and open source models were very, very popular a few years ago and are still very popular, and I love working with them. But I think, well, there's still a gap between a ChatGPT model and an open source, open weight model. Maybe now with DeepSeek version 3, not so much, but that's almost a different community, like the tinkerer community, like me, with small systems. Well, DeepSeek version 3 would be way too expensive for me to run every day. I would have to spend thousands of dollars on just hosting costs every day, every week. And so I toy around with smaller special purpose models. But what I meant is, first, yeah, the open source community in that sense will maybe have a comeback at these companies, but I even mean a step further, that they actually develop models from scratch, like really big models. And what's different from, I would say, the regular open source here is that it is really large scale. It's like a ChatGPT-scale, data-center-large LLM. It's not something you run on your computer. Basically it's really a big data-center-style LLM. So I know that there are, I mean, I can't say any names, but I know people are interested in that. They are exploring that. Whether it will work out or not, I don't know. But I think, I mean, right now if you are in college, you are learning about LLMs. That's the big thing. So you start with open source, you start training small LLMs and you work your way up, and then at some point, well, you probably want to get hired either by Gemini, ChatGPT or so forth and do the big model development there. But not everyone can; they can't have 100,000 people doing that. So there will be people distributed across different companies who will do something similar. And also, on the other hand, people who are, I think, at OpenAI and Gemini at some point.
I mean, big finance companies have deep pockets. They will make it attractive to do something similar at their company too. So I think we will see. Right now it's very concentrated at these companies, but I think we will see the knowledge being a bit more spread out. Where other companies develop models too, we will probably not hear about it, you know, like in the news or anything. I mean, the news, of course, but there won't be papers, there won't be big announcements, because it won't be this general audience model. So it's more something they will do internally. Right.
[55:04]
B
And especially if you're a big hedge fund or a defense company, I assume that's the kind of companies we're talking about. Yeah, those have a very secretive DNA. Okay, fascinating, fascinating. What else do you see happening this coming year? I was interested to read your thoughts on continual learning, which was sort of the talk of the town at NeurIPS, but you view it as a 2027 thing, not necessarily a 2026 thing.
[55:33]
A
Yeah, continual learning is an interesting one. I think it was discussed there very heavily. But also in general, if you went to social media, AI related topics, continual learning was there all the time, everywhere. And to be honest with you, I don't know why exactly it was such a hot topic in '25, because I don't think there was a big breakthrough. I mean, maybe it's the hope for a breakthrough, or like, hey, nothing has changed, so maybe we have to focus more on that to force some change. But yeah, continual learning I think is an interesting topic because it sounds attractive if you have an LLM that self improves, or an agent that does something, fails and learns. I don't think anything like that is feasible this year. I mean, right now, well, you could technically do continual learning if you wanted to with the data you have. Like when you think even of RLVR. So you could, technically, it's just updating the model, but the problem is still catastrophic forgetting. Also, you don't want to train the model on garbage data. So what people kind of do, I think, I mean, I would call it continual learning but in a more controlled setting, where, well, instead of just letting the model update itself, they collect failure cases and data, then construct the data set, and then do it in a more controlled manner. But you can see based on the model releases that it happens more frequently than it used to. I mean, back then it was GPT-1, GPT-2, GPT-3, and now it's GPT-4, 4.1, 4 point, I think, 2 or 5, then 5, 5.1, 5.2. And all the models are iterating now, even the same with DeepSeek. So it's the same architecture, you iterate multiple times, but still in a more controlled way. And I think it makes sense because it's such an expensive thing to do. I don't even know how you would do continual learning if you have this model hosted in a data center. You can't just update it. It's a very expensive model.
You can just update it, but good luck, you know. So it's a risky thing to do, and you can't do it even as a single person. You have to be really careful, monitoring a lot of things. I don't know how that would work, because right now we are still in this era where everyone uses the same model. People don't have custom models. When I go to ChatGPT, I have the same model as you do. And yeah, the prompt is a bit different, like the memory and everything, but it's all in the prompt. It's the same model weights, and as long as we have that, I don't think we will see anything like continual learning. I mean, there are companies, I guess, like, I mean, Tinker API, which is democratizing the training a bit, where through an API you can now train your model more cheaply, or instead of you having the hardware, it's in their data centers and people have their own copy. But you know, I think it's very, very far from continual learning. It's just making available what other companies have in terms of training on a large cloud instance without you having to set it up. But I don't see anything in 2026 that really makes continual learning the big breakthrough in efficiency or, I don't know. And so even 2027 is ambitious. So I don't know. Maybe we'll see something there in 2027, given that it is such a big topic, an important topic. There are a lot of smart people thinking about this and working on it. So we'll probably see something, or at least, I don't know, some ideas that are fresh, or prototypes that work or are interesting. But it's really hard to say anything concrete without having, you know, seen anything. So it's a prediction that I just put out there. Maybe we'll see more continual learning stuff in 2027. But with a grain of salt.
[59:17]
B
Yeah, maybe it becomes a self-fulfilling prophecy, right? If you have enough smart people that decide it's the thing, then maybe it does happen. All right, so maybe to close the conversation, let's talk about you and how you do your work. So you published a book this past year, in 2025, on how to build an LLM from scratch. I believe that you are currently writing the sequel, How to Build Reasoning Models from Scratch. Is that the title? And you produce an incredible amount of work. So people should find you on your website, on your Substack newsletter. How do you absorb all of this knowledge? To what extent are LLMs part of your workflow? Just curious how you work these days.
[60:12]
A
Good question. So I must say, well, I don't have a magic approach or anything. I think the thing I have, maybe, is that I get very excited about things, and when I'm excited about something, it goes very easy and very fast. I don't know, if you notice, maybe I write only about certain topics. I don't cover image models at the moment, for example, because I am just very excited about LLMs. And then, I don't know, I can't help it. I get very excited, read all the things about it, write about it, and, you know, I mainly go by intuition, basically. And I'm kind of lucky in that sense that, with my blog, what I find interesting, other people also find interesting right now. So I think it's a lucky coincidence that I honestly write only about things I find interesting. I'm not trying to force myself, oh, I have to cover XYZ because, well, it's something that should be covered. It's more like, oh, how does this, let's say, recursive language model work? Let's just read the paper, and then I write about it. You know, more like getting excited about things. And yeah, the book writing is, well, also a bit different, because my blog is more research paper focused. When I get excited about something, I read about it and put it in there. For the book, I'm similarly excited, but that's more like a coding book where it's about the fundamentals. Because I think, to be honest, the best way to understand something is to see it actually working. It's not any hand-wavy figures. I mean, there are a lot of figures. Like right now for chapter six, which I just finished the other day, there are 21 figures. They take the most work. Maybe one day LLMs can help me with that. But figures help to explain the code and everything. But code basically doesn't lie: it either works or it doesn't work, you know. And I think that's a very useful way to learn also.
And it's also just a lot of fun for me. When I write code and I see it working, it's very satisfying, and you have something that actually works. So I should say I'm not building any production level systems in that book. It's called Build an LLM from Scratch, or, I mean, Build a Large Language Model from Scratch. But it's not an LLM that you would use in production. Well, if you make a few tweaks you could use it in production. But the goal is code readability and teaching LLMs, basically. Because I think to actually see how do I format my training data, how does it get processed, what is the loss function, what gets updated, I think this explains so much more than if I say, oh, so it does next token prediction, and then hand-wavy, hand-wavy here and there. You can actually literally see how it does that and what feeds in. And the same with RLVR. We had a, hopefully not too bad, explanation of GRPO at the beginning of this podcast. But if you actually see the diagram, I have the numbers in there. The numbers could be wrong, though, if you have a figure and you draw arrows and everything. But then if you implement that in code and you get exactly the same results, and the model trains and you get 50% accuracy, it's a nice thing where you can say, oh, it's actually working. It's not just made up numbers, it actually works, you know. And that's also how I learn about LLM architectures. So I have this blog post, The Big LLM Architecture Comparison, with now, I think, 13,000 words, because I keep extending it. I read the paper, I look at the architecture of an LLM. Do I really understand it? So I draw the architecture. But then, do I really understand it? And often, unless it's like a 1 trillion parameter model, I code the model. I still have my GPT-2 architecture, and they are relatively similar. And so if there's a new architecture, I take the most similar one and make a few changes to it.
But then the beautiful thing here is that someone already implemented that in the Hugging Face transformers library. So I have a reference model I can run. I run my model and I can see, do I get the exact same logits, the same numbers, if I have the same prompt? And so with that, you can actually check yourself. Did I implement everything correctly? Are the results correct? And I think that's just a lot of fun. It's a lot of work, but it's a lot of fun. And it doesn't lie. It gives you the correct answer. Perfect on some prompts. Someone else extended that and I found a bug. And now I have a better understanding of how they implemented the YaRN scaling. That is something you would never understand by just reading the paper. You have to really, I don't know, look at the code and toy around with that. And so yeah, that's basically how I try to work. I try to combine, you know, reading and coding. And LLMs I also use, but for blog writing or book writing, I would say, not so much, because honestly, for fun, I tried it out. It generates okay text, but, I don't know. I can ask it to generate text like me, but then I don't like it and end up editing it, and then it's almost faster for me to just write it out the way I want it.
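The self-check described above, comparing your own implementation's logits against a reference model on the same prompt, can be sketched roughly as follows. This is a simplified stand-in, not the workflow actually used in the episode: the helper name `compare_logits`, the tolerance, and the example numbers are all made up for illustration, and in practice both lists would come from running the two models rather than being typed in by hand.

```python
def compare_logits(own, reference, abs_tol=1e-4):
    """Return (matches, max_abs_diff) for two equally long lists of logits.

    `own` would come from your from-scratch model and `reference` from a
    trusted implementation run on the same prompt (hypothetical setup).
    """
    if len(own) != len(reference):
        raise ValueError("logit vectors must have the same length")
    # Largest element-wise deviation; 0.0 if both lists are empty.
    max_diff = max((abs(a - b) for a, b in zip(own, reference)), default=0.0)
    return max_diff <= abs_tol, max_diff


if __name__ == "__main__":
    # Made-up placeholder values, not real model outputs.
    reference_logits = [2.3101, -0.4520, 0.0873]
    my_logits = [2.3101, -0.4521, 0.0873]

    ok, diff = compare_logits(my_logits, reference_logits, abs_tol=1e-3)
    print(f"match={ok}, max_abs_diff={diff:.6f}")
```

A small absolute tolerance is used instead of exact equality because two correct implementations can still differ slightly in floating-point rounding. A large deviation, on the other hand, usually points to a real implementation difference worth tracking down.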
[64:56]
B
And you had interesting thoughts on LLM burnout, how using LLMs tends to deplete energy.
[65:02]
A
I would say, first, the thing is also that it's not super satisfying if you just ask the LLM to do it. It's, you know, like cheating at homework. Honestly, I understand there are jobs and people where it just matters how much you get done and how fast you get something done. And then it makes sense to use an LLM to do the job for you. But I think there are different types of people who enjoy different things. For example, I enjoy doing the work more than managing. When I was a professor, well, I did research, but I also had to, you know, supervise students. And I liked working with students, but I noticed I actually like doing the research myself more than telling other people how to do research, or managing. And so I think if you use only LLMs to just generate everything, I wouldn't say useless, but I would feel maybe empty, like, okay, you lose that pride, I think. Pride, like, oh, I did something that worked and it's cool and you're proud of this. And so what I try to do generally when I use LLMs is to make my work better. Not necessarily to make more or make it faster. I mean, to some extent I do, but then I try to think, how can I make what I do better? So what I use LLMs for is more like, hey, I actually use GPT-5 Pro when I've written an article and put it in there: hey, can you find any mistakes or typos? Often I have mislabeled figures. I go like figure 11, 12, 15, 16, and things like that. I can check myself, but it's just faster for an LLM to find all these things, like how to make things better, or are there any things that are unclear? I mean, I'm not a native speaker, so sometimes I have problems with a sentence. I'm tired, I just can't get it right, and then it suggests something and, oh yeah, that is maybe not a bad way to say it. And I would then take that sentence, for example. Things like that, where, yeah, I'm trying to make, let's say, the work better without fully replacing myself.
I mean, maybe that's short-sighted because LLMs will eventually be able to do everything. But I kind of enjoy the work too much to just give everything away, if that makes sense. And I have the luxury that this still works for me. So I know there are some businesses where, yeah, it is really important to execute faster and, you know, so yeah.
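The figure numbering check mentioned above (going "figure 11, 12, 15, 16") is the kind of mechanical task that can also be scripted without an LLM. The following is only an illustrative sketch: the function name `find_figure_gaps` and the assumption that figures are referenced as "Figure N" in plain text are hypothetical, not a description of the tooling used by the guest.

```python
import re


def find_figure_gaps(text: str) -> list[int]:
    """Return figure numbers missing between the lowest and highest reference.

    Assumes figures are referenced as 'Figure N' (case-insensitive); this
    naming convention is an assumption made for the example.
    """
    numbers = {int(n) for n in re.findall(r"\bfigure\s+(\d+)", text, flags=re.IGNORECASE)}
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]


if __name__ == "__main__":
    draft = "See Figure 11 and Figure 12. Later, Figure 15 and figure 16 show more."
    print(find_figure_gaps(draft))
```

A script like this only catches skipped numbers. Unclear sentences or wrong captions are the kind of issue where asking an LLM to review the text is more useful.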
[67:18]
B
Thank you very much for your work. Very popular for the quality of your writing, all your tweets, you have like a big X following, and your name keeps coming back in conversations about where people go to learn. Thank you for doing this part. Really appreciate your time. That was super fascinating and really educational, really insightful. So really appreciate it. Thank you so much, Sebastian.
[67:41]
A
Thank you for inviting me. I had a lot of fun. I think it was maybe one and a half hours just talking about LLMs and AI. I mean, this is, well, that's the dream, right? Thanks for having me.
[67:52]
B
Thank you. Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. This really helps us build the podcast and get great guests. Thanks and see you at the next episode.