
A
There is this thing that's happening in AI, and in AI every week now, a lot is happening. Fundamentally, if you look at AI progress, it's been a very smooth exponential increase in capabilities. This is the overarching trend. It's not like pre-training fizzled out. It's just that we found a new paradigm that, at the same price, gives us much more amazing development. And this paradigm is still very new. I think one of the biggest things that people on the inside know and others don't is that already, right now, it's not about the progress. There are so many things ChatGPT or Gemini, any LLM, can do for you that people just don't realize. You can take a photo of something broken and ask how to repair it. It may tell you. You can give it college-level homework and it will do it for you.
B
Hi, I'm Matt Turck, welcome to the MAD Podcast. My guest today is Lukasz Kaiser, one of the key architects of modern AI, who has quite literally shaped the history of the field. Lukasz was one of the co-authors of the "Attention Is All You Need" paper, meaning he's one of the inventors of the Transformer architecture that powers almost all the AI that we use today. He's now a leading research scientist at OpenAI, helping drive the second major paradigm shift, towards reasoning models like the ones behind GPT-5.1. This episode is a deep exploration of the AI frontier: why the AI slowdown narrative is wrong, the logic puzzles that still stump the world's smartest models, how scaling is being redefined, and what all of that tells us about where AI is heading next. Please enjoy this fantastic conversation with Lukasz. Lukasz, welcome.
A
Thank you very much.
B
There was a narrative, at least in some circles, maybe outside of San Francisco, throughout the year that AI progress was slowing down, that we had maxed out pre-training, that scaling laws were hitting a wall. Yet we are recording this at the end of a huge week or couple of weeks, with the release of GPT-5.1, GPT-5.1 Codex Max and GPT-5.1 Pro, as well as Gemini Nano Banana Pro, Grok 4.1 and Olmo 3. So this feels like a major violation of that narrative. What is it that people in AI labs know about AI progress that at least parts of the rest of the world seem to not understand?
A
I think there is a lot to unpack there, so I want to go a little slower. There is this thing that's happening in AI, and in AI every week now a lot is happening: new models, coding, doing slides, self-driving cars, images, videos. It's a nice field that doesn't let you be bored for long. But through all of this it's sometimes hard to see the fundamental things that are happening. And fundamentally, if you look at AI progress, it's been a very smooth exponential increase in capabilities. This is the overarching trend, and there has never been much to make me, at least, and I think my colleagues in the labs, believe that this trend is not happening. It's a little bit like Moore's Law, right? Moore's Law held through decades and decades, and arguably it's still very much going on, if not speeding up, with the GPUs. But of course it did not happen because one technology carried you there for 40 years. There was one technology and then another and another and another, and this went on for decades, right? So from the outside you see a smooth trend. But from the inside, of course, progress is made through new developments, in addition to the increase of computing power and better engineering. All of these things come together. In terms of language models, I think there was a big pivotal point. I mean, one point was of course the Transformer, when it started. But the other point was reasoning models, and that happened, I think o1-preview was about a year and a month ago or something like that. We started working on it maybe three years ago, but it's very recent. If you think of it as a paradigm, that's a very recent thing. So it's always like these S-curves, right? It starts, then it gives you amazing growth, and then it flattens out a little bit. We'll get to pre-training: I feel pre-training is in some sense on the upper part of the S. But it's not like scaling laws for pre-training don't work. They totally work.
What the scaling laws say is that your loss will decrease log-linearly with your compute. We totally see that, and clearly Google sees that, and all the other labs. The problem is how much money you need to put into that versus the gains you get. It's just a lot of money, and people are putting it in. But with the new paradigm of reasoning, you can get much more gain for the same amount of money, because it's on the lower part of its S-curve. There are just discoveries to be made, and these discoveries unlock insane capabilities. So it's not like pre-training fizzled out. It's just that we found a new paradigm that, at the same price, gives us much more amazing development. And this paradigm is still very new. It happens so fast that I think if you blink, you may miss it. Basically, you had ChatGPT 3.5, right? GPT-3.5 in ChatGPT. It would give you answers, and it used no tools, no reasoning. It would just answer you something. Now you have ChatGPT, and if you were not into it, you may have blinked. It also gives you answers, and you may say, okay, it's more or less the same, except that ChatGPT now will go look at some websites, reason about it, and give you the right answer instead of something it memorized in its weights. I very much like this example: what time does the SF Zoo open tomorrow? The old ChatGPT would totally hallucinate, from its memory, an hour that it had read the zoo opens at, probably on the zoo's website from five years ago. And it didn't know what day is today or tomorrow, so it would just assume it's a weekday. ChatGPT now knows what today is because it's in the system prompt. It goes to the zoo website, reads it, extracts the information, and if it's ambiguous, probably checks three other websites just to confirm, and then gives you the answer. If you blink, you may think it's the same, but no, it's dramatically better.
And as a consequence, since it can read all the websites in the world, it can give you answers about things it wouldn't have been able to even touch before. So there is tremendous progress, right? And it happened so fast that it may even have been missed. I think one of the biggest things that people on the inside know and others don't is that already, right now, it's not about the progress. There are so many things ChatGPT or Gemini, any LLM, can do for you that people just don't realize. You can take a photo of something broken and ask how to repair it. It may tell you. You can give it college-level homework and it will probably do it for you. So that's absolutely amazing.
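A note on the scaling-law remark above: here is a minimal sketch, assuming the loss follows a simple power law in compute, which appears as a straight line on log-log axes. The constants `a` and `b`, the compute range, and the helper names are illustrative assumptions, not figures from any lab. The sketch generates synthetic losses from the assumed law and recovers the slope with an ordinary least-squares fit in log space.

```python
import math

def power_law_loss(compute, a=10.0, b=0.05):
    """Assumed toy scaling law: loss = a * compute ** (-b)."""
    return a * compute ** (-b)

def fit_log_log_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    var = sum((x - mean_x) ** 2 for x in lx)
    return cov / var

# Hypothetical compute budgets, each 10x larger than the previous one.
compute_budgets = [10.0 ** k for k in range(18, 25)]
losses = [power_law_loss(c) for c in compute_budgets]

# The fitted slope recovers -b: the loss is linear in log-compute.
slope = fit_log_log_slope(compute_budgets, losses)
```

Under these assumed constants, every 10x increase in compute multiplies the loss by the same factor, 10 ** -0.05, which is about 0.89. That is the economic point in the answer: the curve keeps going down smoothly, but each further step costs ten times more.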
B
So there is an education gap to some extent.
A
Well, it just happened. I mean, you mentioned Codex, right? Programmers are a little bit conservative. I still use Emacs from time to time. With the old coding tools it was like, okay, it will complete one line for me. But people were very much like, this is my editor, I write code here. Now people are like, no, this is Codex. I ask it to do stuff, I will fix it later, right? I think it's in the recent few months that the transition happened, from people using it sometimes but rarely to this now basically being how a lot of people work in coding. That's quite big. I'm not sure everyone's aware of it. But if you don't do programming, why would you be aware of it? I do believe, though, that this will come to more and more domains.
B
To the point of all of this being very new and somewhat sudden: something that I hear from time to time when talking to people is that part of the reason why people are so optimistic is that there is a lot of low-hanging fruit, very obvious things to improve for those models in the next few months. First of all, do you agree? And second, can you give us some examples of obvious things that you need to fix next and that the industry will fix?
A
Yes, there is a ton of extremely obvious things to fix. A large part of this ton is just hard to talk about on a podcast because it is in the engineering part. Every lab has its own infra and its own bugs in the code. Machine learning is beautifully forgiving in some sense, in contrast to old software engineering, which would just yell at you when you made a mistake. Our Python code will generally still run, except much slower, and give you worse results if you got it wrong. So you realize, oh no, it was wrong, and you improve it and the results get better. These are huge distributed computing systems, and they're very complex to run. So there is a huge amount to improve and fix and understand in the process, just about how to train your model and how to do RL. Because RL is more finicky than pre-training, it's harder to do really right. This is our day-to-day work. On top of that there is data. We used to train on just Common Crawl, basically. It's a big repository of the Internet that people just scraped without regard for what was in it. Some things came in, some didn't, it was a mess. So now of course every larger company has a team that tries to filter this and improve the quality. But it's a lot of work to really extract better data. Now synthetic data is becoming a thing, but when you generate synthetic data, it really matters how you do it and with what model. In the engineering aspects of everything, it's such a new domain that it was done somehow, and it works, it's beautiful. But there is just so much to do better that I don't think people have any doubts that there is a lot there. And on top of that there are the big things like multimodal. I mean, language models are now, as I'm sure you know and most people realize, actually vision-language models, and they can also do audio. So they're multimodal models to some extent.
But the multimodal part still lags behind the text part to a large extent. So that's one big area where obviously you need to do better. And it's not a huge secret how you can do better. There are some methods that maybe will make it amazingly better, but there are also some very simple methods for just doing better. But this maybe requires retraining your whole base model from scratch, and that takes a few months and it's a huge investment. So we need to organize it. So there is a lot of just work that will undoubtedly make things better. I think the big question that people have in their minds is how much better it will make them.
B
So I'd love to do a little bit of a deep-dive educational part on the whole reasoning model aspect because, as you just mentioned, since it's so new, some people truly understand how those work and many people don't. At a very simplistic level, what is a reasoning model and how is that different from your sort of base LLM?
A
So a reasoning model is like your base LLM, but before giving you the answer, it thinks in what people call the chain of thought, meaning it generates some tokens, some text that's meant not for you to read but for the model, so it can give you a better answer. And while it does this, these days it is also allowed to use tools, so it can, for example, in its so-called thinking process, go and browse the web to give you a better answer. So that's the superficial part of the thinking models. Now the deep part is that you start treating this thinking process as part of the model, basically. So it's not something the model generates as an output for you, it's something you want to train, right? You want to tell the model: you should think well, you should think so that the answer after this is good, in whatever way. And this leads you to a very different way of training the model. Because models were usually trained with just gradient descent, the way deep neural networks are trained, meaning you say predict the next word and you take a gradient: you differentiate the function computed by the model (it's not fully differentiable, but you approximate it), and you train your weights to do that. And it was quite amazing that doing just that, you could make a chat. But with the reasoning model you can't do that, because there is this reasoning part and you can't differentiate through it. So we train this with reinforcement learning, and reinforcement learning basically says, okay, there is just this reward, and you need to do a bunch of tries and reinforce, meaning push the model towards doing more of the things that lead to better answers. And this kind of training has more restrictions than the training we used before. With the training we used before, you took all of the Internet and put it in, and even if you didn't filter it very well, it would mostly work.
With reinforcement learning, you need to be careful, you need to tune a lot of things, and you also need to prepare your data very carefully. Currently, at least in the most basic ways we use it, it needs to be fairly verifiable. So there is the question: is your answer correct or not? You prepare data for that. You can do that very well in mathematics and coding. You can do this in science to some extent, right? You can have test questions that are correct or not. But when it comes to something like writing poems, is this poem good or not? For now the reasoning models really shine in domains like science. They've brought some improvements to non-science domains, but it's maybe not quite as huge yet as it could be, at least compared to mathematics and coding. Then there is the multimodal question. How do you do reasoning in multimodal? I think this is starting. I saw some Gemini models creating images in the reasoning part. That's quite exciting. But it's very, very young.
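To make "do a bunch of tries and reinforce" concrete, here is a minimal sketch, assuming a toy setting with three possible answers, one of which is verifiably correct, and a plain REINFORCE-style update on the logits. The setup, names, and hyperparameters are illustrative assumptions and say nothing about how any lab actually trains its models.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(steps=200, learning_rate=0.5, seed=0):
    """REINFORCE on a 3-answer toy task with a verifiable 0/1 reward."""
    rng = random.Random(seed)
    correct_answer = 2            # hypothetical ground truth used by the verifier
    logits = [0.0, 0.0, 0.0]      # the "model": one score per candidate answer
    for _ in range(steps):
        probs = softmax(logits)
        # One "try": sample an answer from the current policy.
        answer = rng.choices(range(3), weights=probs)[0]
        # Verifiable reward: 1 if the answer checks out, else 0.
        reward = 1.0 if answer == correct_answer else 0.0
        # Gradient of log-probability of the sampled answer w.r.t. each logit.
        for i in range(3):
            indicator = 1.0 if i == answer else 0.0
            logits[i] += learning_rate * reward * (indicator - probs[i])
    return softmax(logits)

final_probs = train_toy_policy()
```

Nothing is differentiated through the sampling step. The update only needs the reward and the log-probability of the answer that was tried, which is why this style of training works even when the path to the answer is a sequence of discrete sampled tokens.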
B
The training and reinforcement learning part is particularly interesting from an educational point of view. Again, because it seems that people have come to the conclusion that there is the pre-training world and then the post-training world, and the post-training world is mostly reinforcement learning. But this idea that there is reinforcement learning in the pre-training, I don't think is as understood by everyone.
A
At the beginning of ChatGPT, let's say, there was pre-training, and people did not do RL right then, but then you couldn't really chat with it. So ChatGPT was RLHF applied to a pre-trained model. But RLHF was a different kind of RL, of sorts. It was very small, and it was human preference that was telling you what is better. That's what the HF is: human feedback. You showed people pairs of answers. You learned a model that says, well, people seem to prefer this as an answer. You trained with that, and the policy would very quickly do what we today call hacking this model. If you trained with RLHF too long, it would start giving things that satisfy this model that's supposed to model human preferences. So it was a bit of a brittle technique, but it was a bit of RL that was extremely crucial to making the models chat. These days I think most people have moved towards this big RL. It's still not as big as pre-training in scope. But it says you have a model that says whether this is correct or not, or if it's a preference, it's a very strong model that analyzes things and says you should prefer that. And you have data that's restricted to a domain where you can do this well, and then you can also put some human preferences on top, but you make sure that you can run a little bit longer without making this whole grading fall apart. But again, this is RL today. I do believe the RL of tomorrow will be broader. It will work on general data, and maybe then it will expand to domains that go beyond where it shines today. Now, will it shine? That is a different question. Do you really need to think very much when doing some of these things? Maybe not, but maybe yes. Maybe we do more thinking and reasoning than what we consciously call thinking.
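The "pairs of answers" step can be sketched as follows. This is a minimal illustration, assuming the common Bradley-Terry formulation of a pairwise preference loss; the transcript does not say which loss was actually used, and the function name and scalar reward inputs are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    It is small when the reward model scores the human-preferred answer
    higher than the rejected one, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# The reward model agrees with the human label: low loss.
loss_when_agreeing = preference_loss(2.0, 0.0)
# The reward model disagrees with the human label: high loss.
loss_when_disagreeing = preference_loss(0.0, 2.0)
# The reward model cannot tell the two answers apart: loss is log(2).
loss_when_tied = preference_loss(0.0, 0.0)
```

A reward model trained this way is only a learned proxy for what people prefer, which is one way to see why optimizing against it for too long leads to the hacking described above.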
B
What would it take for RL to generalize? Is that better evaluations? You guys released GDPval a few weeks or months ago to measure performance across broad economic sectors. Is that part of what the system needs?
A
I think this is a small part of it. I think that's one part. But if you think of economic tasks, making slides is important there, following instructions, doing calculations. It's not math, but it's still very verifiable, right? What I'm thinking about is that when you do pre-training, you take the Internet and you just ask, what's the next word? You could think before you ask, what's the next word? Obviously you don't want to think before every word, probably. But I don't know if you've ever looked at the training data for a real pre-training run, because I think people mostly don't realize how bad it is. Hotels.com is a great website compared with the average chunk of 2,000 words from the Internet. It's a mess, right? And it's also a miracle that from this, the pre-training process gets you something reasonable. So imagine you have a hotel website telling you it's a beautiful vacation. You don't necessarily want to have a very long chain of thought before that, right? But if it was written by a person, there was probably some kind of thinking that went into it. Maybe not as elaborate as the math and coding thinking, but maybe there was something going on. So maybe you want a little bit of thought before at least some of the text, and that our models can't do very well yet. I think they're starting. There is a lot of generalization in this reasoning. If you learn to think for math, some strategies transfer very well, like looking things up on the web, seeing what they say, and using that information. So some of these things are very generic and they start to transfer. I feel like some maybe do not yet. Especially thinking in the visual domains is very undertrained, I believe. But, you know, we're working on it, so we will try to push for more of that.
B
Going back to chain of thought, how does that actually work? How does the model decide to create that chain of thought? And what we see, the little intermediary steps that we see on the screen as users, the chain of thought that's exposed to us, is that what's actually being processed by the model? Or is there a deeper, longer, broader chain of thought that happens behind the scenes?
A
So in the current ChatGPT, you will see a summary of the chain of thought on the side. There is another model that takes the full chain of thought and shows you a summary, because the full ones are usually not very nice to read. They more or less say the same thing, just in messier words. So it's better to have a more readable summary. When you start with a chain of thought, as in the first paper about chains of thought, you basically just ask the model, please think step by step. And it would think. So if you just pre-train a model on the Internet and ask it to think step by step, it will give you some chain of thought. The interesting and most important point is that you don't stop there. You start with some way of thinking, and then you say, sometimes this leads to a correct answer and sometimes it leads to a wrong answer. So now I have some training examples. You will think 100 times, and say 30 lead you to the correct answer. Then I'll train you on these 30 examples and say this is the way you should be thinking. That's the reinforcement learning part of training. It changes dramatically how the models think. We see this for math and coding, but the big hope is that it could also change how the models think in many other domains. Even for math and coding, you start seeing that the models start correcting their own mistakes. Earlier, if the model made a mistake, it'd generally just tell you what it did and insist that the mistake was right, or something like that. With the thinking, it's like, oh, I often make mistakes, so I need to verify and correct myself to give the correct answer. This just emerges from the reinforcement learning, which is beautiful, right? It's clearly a good thinking strategy to verify what you want to say, and if you think it may be an error, to think again. That's what the model learns on the most abstract level.
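The "think 100 times, keep the 30 that were right" loop can be sketched like this. It is a toy illustration under stated assumptions: `sample_chain` is a hypothetical stand-in for a model that reaches the right answer about 30% of the time, and the verifier is a simple exact-match check on a made-up arithmetic question. A real pipeline would sample from an actual model and then train on the kept chains.

```python
import random

QUESTION = "What is 17 + 25?"   # hypothetical training example
CORRECT_ANSWER = 42             # known in advance, so the answer is verifiable

def sample_chain(rng):
    """Stand-in for the model: returns (chain_of_thought, final_answer)."""
    if rng.random() < 0.3:
        return ("17 + 25: 17 + 20 = 37, then 37 + 5 = 42", 42)
    wrong = rng.choice([32, 41, 43, 52])
    return ("17 + 25 is roughly " + str(wrong), wrong)

def collect_correct_chains(num_samples=100, seed=0):
    """Sample many chains and keep only those the verifier accepts."""
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        chain, answer = sample_chain(rng)
        if answer == CORRECT_ANSWER:   # the verifiable check
            kept.append((QUESTION, chain, answer))
    return kept

# These filtered chains are what the model would be trained on next.
training_examples = collect_correct_chains()
```

The filter never inspects how the chain reasons, only whether the final answer verifies. Strategies such as double-checking are reinforced indirectly, because chains that use them end up correct more often.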
B
Great. Thank you for this. All right, as a quick detour, and then we'll go back to more frontier AI topics, I'd love to talk a little bit about your story. You have the incredible distinction of having been at the forefront of this industry twice: the Transformer paper, which was the birth of one paradigm, and now you're very much leading the charge on reasoning models, which is another paradigm. So in this just incredible story, how did you become an AI researcher?
A
I was a mathematician and a computer scientist, but in theoretical computer science.
B
And that started in high school as a kid.
A
Yes. I was definitely very into math in high school, and into computers also, later in high school. I did my studies in Poland. I went for a PhD in Germany. It was a theoretical computer science and mathematics PhD, so I very much am a mathematician. I was always fascinated by how this thinking works. What is intelligence? As a child, I always wanted to emulate the brain. Then I thought, well, okay, maybe higher-level explanations are more interesting. I did research in logic, and did a little programming. But then there was this opportunity to join Google just as deep learning was taking off. I already had my tenured position in France, and the French system has this beautiful thing that you can take a leave of 10 years and still return anytime you want. So it's no risk.
B
So at some point when you've solved AGI, you may return back to France and be a professor.
A
Well, if you solve AGI, they may take you anyway. The nice part about the leave is that they will take you back even if you don't. But it's actually very important. I think there are a number of Nobel Prize winners who took this leave to just try something more risky. Sometimes it works, sometimes it doesn't. There's a lot of luck in science and research, but it's very good to have this opportunity. So I came to Google.
B
And that was Google Brain at the time, you said, right?
A
I came to Ray Kurzweil's group. He was my first manager. He interviewed me and was very inspiring. My first interview was to join the YouTube UI team, and I was like, okay, I'm not going. And then I had an interview with Ray, and I knew him of course from his books, and he's a very inspiring person. So it was like, okay, let's go. The team was separate from Google Brain at that time. Then I moved to Google Brain and worked with Ilya Sutskever, another very inspiring person. There's an amazing number of great people in AI in the Bay Area in general.
B
I have to ask you at this point about the Transformer paper story, how it all came about. The eight of you, right? Seven or eight of you. How did you all get together?
A
Well, we never got together.
B
You never got together? Okay.
A
I recently got a photo on Twitter of a photo session of all eight of us. It wasn't saying it was fake, but I knew it was fake because I don't think all eight of us were ever in the same physical room. These ideas developed from many sides, before and after. Jakob and Illia Polosukhin worked on attention, like self-attention. Of course, attention was there from the encoder-decoder side.
B
And maybe one minute for the broad public on what attention actually means, since it's such a fundamental concept.
A
So attention is the mechanism that tells the model, as you're doing the next thing, look into your past and find the things that are most similar to what you are seeing right now. It came from the machine translation times, when people wanted to align words in one language with words in another. They were like, okay, so this word, where in the previous sentence would it be? It's an analog of alignment for deep learning. It's now called attention. In AI it just says: think of what comes to your mind as you are here now in this environment, what things from the past are similar to this. And this mechanism was already used in deep learning translation before, but there was one encoder model, and the decoder would be looking at the encoder but never at its own states. The main novelty of the Transformer was self-attention. But the Transformer is more than just this idea. I think that's important. I think that's the beauty of how these eight weird people somehow came together, even though not physically, to do it: we all approached it from different sides. So there were people working on the attention idea. You need to put this in a network that needs to have a lot of knowledge, so there is the feed-forward layer that expands and then contracts. Noam was working on this. And nowadays we use mixtures of experts, which actually came before the Transformer. So how do you store knowledge in neural networks is another important question, and it's part of this model too. And then in deep learning, people joke that ideas are cheap and making them work is the hard part. So how do you write the systems and the code and the baselines to actually make this train? This is funny to say now, because nowadays you can take any deep learning framework and say, x equals transformer, x.train, and it will basically work, but back then it totally did not. So you need things like learning rate warmup or tweaks to the optimizer that just work.
And I did a lot of coding at that time. I was working on TensorFlow and parts of the framework. And I remember distinctly that people were like, so you want to use the same model for a few different tasks. Why do you even do that if you have a different task? If you do translation, you train one model. If you do parsing, you train another. If you do image recognition, you train a third. You never train the same model for three different tasks. Why do you even write APIs to do multiple tasks on one model? And I was like, no, no, we're going to do all tasks in one model. And people were like, no.
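The "look into your past and find the most similar things" description can be sketched as causal scaled dot-product self-attention. This is a deliberately stripped-down version in plain Python: it assumes the queries, keys, and values are given directly and omits the learned projection matrices, multiple heads, and feed-forward layers of the real architecture. All names are illustrative.

```python
import math

def softmax(scores):
    """Turn similarity scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Each position attends to itself and to earlier positions only."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Similarity of the current position to every position so far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys[: t + 1]
        ]
        weights = softmax(scores)
        # Weighted average of the past values, most similar ones count most.
        output = [
            sum(w * value[j] for w, value in zip(weights, values[: t + 1]))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
    return outputs

# Three toy token vectors; the third repeats the first.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = causal_self_attention(tokens, tokens, tokens)
```

In this toy input the third token matches the first, so its output leans towards that shared direction rather than towards the second token. The sequence attends to its own earlier states, which is the self-attention part, as opposed to a decoder that looks only at a separate encoder.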
B
So there was a lot of pushback against the idea?
A
Not against the idea. Google was also an amazing place at that time in that they would very happily let you work on whatever you wanted. But I don't think there was a widespread belief in doing multiple tasks with the same model. Not to mention this idea that you take basically the same model as the Transformer. Now there are a bunch of changes to it, but you could in principle take the same architecture as the decoder from the paper, train it on all of the Internet, and it will basically start chatting with you. Back then that would have definitely sounded like a worthy dream, something we maybe had as a dream, but not a reality you expect five years later. It's very lucky that it actually works so well, right?
B
Talk about the transition from Google to OpenAI and perhaps how those two cultures are different.
A
So Ilya Sutskever was my manager at Brain. Then he went on to found OpenAI. He asked me a number of times over the years whether I would like to join. I found it a little bit too edgy at the time. Then Transformers came, so we had a lot of work with that. And then Covid came. Covid was a tough time for the whole world, right? But Google totally closed, and Google was reopening extremely slowly. I find it very hard to do remote work. I much prefer to work with people directly. That was one reason. But the other was Google Brain itself. When I joined, it was a few dozen people, maybe 40, something like that. When I left, it was 3,000 or 4,000 people spread across multiple offices. It's very different to work in a small group and to work in a huge company. So with all this, Ilya was like, you know, OpenAI is now in a much more stable state. We're doing language models. You know something about this. That may be a good match. And I was like, okay, let me try. I'd never worked at any company other than Google before, other than the university. So it was quite a change to the small startup group. But I like working in smaller groups. It has its pleasures. It has a little bit of a different intensity sometimes. In general, I found it very nice. On the other hand, Google has merged and made Gemini, and I hear it's also a very nice place. I think in general, the tech labs are more similar to each other than people think. There are some differences. But if I look at it from the outside world, from the university in France, the difference between this university and any of the tech labs is much larger than between one lab and another.
B
How are the research teams organized within OpenAI?
A
They're organized. They're not very organized. We do organize them. Some have managers and sometimes talk to them. No, but mostly people find projects. There are things to do, right? Improve your multimodal models, improve your reasoning, improve your pre-training, improve whatever part of the infrastructure it is, and people work on it. As we go through these parts, right, there is infrastructure, pre-training, reasoning. I think the parts are the same for most of the labs. So there will be teams doing these things. And then sometimes people change teams, sometimes new things emerge. There are always some smaller teams doing more adventurous stuff, like diffusion models at times. Then some of the more adventurous stuff, like video models, gets big, and then maybe those teams need to grow.
B
Do people compete for GPU access?
A
I don't think it's so much people that compete. I think it's more projects that compete for GPU access. There's definitely some of that. On the other hand, in the big picture of GPU access, a lot of this is just determined by how the technology works, right? Currently, pre-training just uses the most GPUs of all the parts, so it needs the most GPUs. RL is growing in its use now. Video models, of course, use a lot of GPUs too. So you need to split them like this. Then of course people will say, oh, but my thing would be so much better if I had more GPUs, and I've certainly said that a number of times too. So then you kind of push it: you know, I really need more. And then some people may say, well, there's only so much. There are never enough GPUs for everyone. So there is some competition, but the big part is just decided by how the technology currently works.
B
Great. What is next for pre-training? We talked about data, we talked about engineering, and there's the big GPU compute aspect to this. What happens to pre-training in the next year or two?
A
Pre-training, as I said, I think has reached the upper part of the S-curve in terms of science, but it can scale smoothly, meaning if you put in more compute, you will get better losses, if you do things right, which is extremely hard. And that's valuable. You don't get the same payoff as pushing RL, but it generally just makes the model more capable, and that's certainly something you want to do. I think what people underestimate a little bit in the big narrative is that OpenAI, three or four years ago when I joined, and even before that, was a small research lab with a product called the API. But it was not such a big thing. There was no GPU constraint on the product side, for example. All GPUs were just used for training. So it was a very easy decision for people to say, we're going to train GPT-4, this will be the smartest and largest model ever, and what do we care about small models? I mean, we care about them to debug the training of the big model, but that's it. So GPT-4 was the smartest model and it was great, right? But then it turned out, oh, there is this chat, and now we have a billion users, and people want to chat with it a lot every day, and you need GPUs. So you train the next huge model and it turns out you cannot satisfy this. People will not want to pay you enough to chat with the bigger model, so you just economically need the smaller model. And this happened of course to all the labs, because the moment the economy arrived and it became a product, you had to start thinking about price much more carefully than before. So I think this caused the fact that instead of just training the largest thing you can for the money you have, we said, well, no, we're going to train the same quality but smaller, cheaper. The pressure to give the same quality for less money is very large. In some sense, as a researcher, it almost makes me a little sad.
I have a big love for these huge models. People say the human brain has 100 trillion synapses. The orders of magnitude, of course, are not that exactly calculated, but our models don't have 100 trillion parameters yet. So maybe we should reach it. I would certainly love it. But then you need to pay for it. So I think this may be why people kind of think that pre training has paused: because a lot of effort went into training smaller models. Now, on the side, people kind of rediscovered how amazing distillation is. Distillation means you can train a big model and then put the same knowledge from the big model into a little model. The big model is a teacher to the little model. People knew about distillation. It's from a paper a long time ago. But somehow, at least for OpenAI, I think maybe it was more in Google's DNA, with Oriol there, people kind of rediscovered how important that is for the economics. But now it also means that, oh, training this huge model is actually good, because you distill all the little ones from it. So now maybe there is a bit more of a return to large models. It's also a matter of, once you realize you have the billion users and you need the GPUs, you need to invest in them. And of course everyone sees this, there's a huge investment, but the GPUs are not online yet. So when they come online, and I think this may play into what people call the resurgence of pre training, we both understand that you can distill this amazing big model and there are now enough GPUs to actually train it. So it's resurging. But all of this fundamentally happens on the same scaling curve, right? It's not like we didn't know that you could do this. It's more like the different requirements of different months have sometimes changed the priorities. But I think it's good to step back from it and think of the big picture, which is that pre training has always worked. And the beautiful thing is it even stacks with RL.
So if you run this thinking RL process on top of a better model, it works even better than if you run it on top of a smaller model.
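The teacher and student relationship described in this answer can be sketched in a few lines of Python. This is only an illustration of the general idea behind distillation, not a description of how OpenAI or any other lab implements it: the function names, the temperature value, and the toy logits are all assumptions made for the example. The student is scored on how closely its softened output distribution matches the teacher's.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy logits over three possible next tokens (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, close_student))  # small: the student agrees with the teacher
print(distillation_loss(teacher, far_student))    # larger: the student disagrees
```

In a real training run this loss would be minimized over the small model's parameters across a large dataset; the sketch only shows the quantity being minimized.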
B
One question that I find fascinating as I hear you speak: the evolution of modern AI systems has been this combination of LLMs plus RL plus a lot of other things going on. It used to be at some point, and maybe that was back in the deep learning days, that people would routinely say that they understood how AI worked at a micro level, like the matrix multiplication aspect, but didn't fully understand, once you had everything together, what really happened at the end of the day in the model. And I know there's been tons of work done on interpretability over the last couple of years in particular. But particularly for those very complex systems, is it increasingly clear what the models do, or is there some element of black box that persists?
A
I would say both. There is huge progress in understanding models fundamentally. I mean, think of the model that is Chat. It talks to a billion people about all kinds of topics. It gets this knowledge from reading all of the Internet. Obviously, I cannot understand everything that's going on in there. I don't know the whole Internet. What we can identify... there was a beautiful paper just, I think, last week from OpenAI showing that if you tell the model that lots of its weights should be zeros, that it should be very sparse, then you can really trace, when it's thinking about one particular thing, what it's actually doing. So if you, say, limit yourself to this and really study it inside the model, then you can get a lot of understanding. And there are circuits in the model. Anthropic had great papers on that. So the understanding of what the models are doing on a higher level has progressed a lot. But then it's still an understanding of what smaller models do, not the biggest ones. It's not so much that these patterns don't apply to bigger models, they do. It's just that the bigger models do so many things at the same time that there is some limit to what you can understand. But I think this limit is a bit more fundamental than people think. It's like every very complex system: you can only understand so many things and then you don't, right?
B
Thank you for all of this. I'd love now to talk about 5.1 and do a little bit of a deep dive on all the latest stuff that you guys have released in the last couple of weeks, which has been very impressive. In particular, as a user, I think that the 5.1 moniker doesn't do justice to the evolution between 5 and 5.1. It feels like a much larger improvement than the number would indicate from, again, my perspective as a user. Walk us maybe through the evolution from GPT 4 to 5 to 5.1. What has actually changed?
A
That's a very tough question. I think less than you think. No, I mean, from GPT 4 to 5, I think the biggest thing that changed is reasoning, meaning RL and synthetic data. As I told you, the pre training part in that timeframe was mostly about making things cheaper, not making things better. So of course the price has changed dramatically too, right? A thousand times, I think, or some orders of magnitude like that. The main improvement from 4 to 5 is adding reasoning with reinforcement learning, and this allowed generating synthetic data, which also improves the model. So that's the big picture. In addition to that, ChatGPT is now a product used by a lot of people. So the post training team has learned a tremendous number of lessons and has, you know, clearly experimented with things. They wanted the model to be very nice to you, and then it turned out to be too nice. And now, when a lot of people use it, you need to be really careful about safety, right? There may be people in distress using the model. The model needs to do something reasonable in these cases. It was not trained for it before, now it is, and it makes the model much better. But, you know, at the same time you don't want to refuse to answer any question that has any sign of anything. So as you work on these things, you make the model much better in use, not just for the people in distress, but for everyone who wants questions answered and the answers to be reasonable. And these things called hallucinations: they're still with us to some extent, but dramatically less than two years ago. Some part of that is because reinforcement learning can now use tools and gather data, and it also encourages the model to verify what it's doing. So that's an emergent thing from this reinforcement learning of reasoning. But also you just add data, because you realize sometimes the model should say, I don't know. So you add this to the post training data. You say, we really need to give it a thought, how the model should answer people in various situations. 5 to 5.1 is mostly this kind of thing. It's mostly a post training improvement.
B
Yeah. So to double click on this, because it is super interesting: indeed, as part of 5.1 there's the ability to choose different kinds of styles, from nerdy to professional. And that's, I guess, in reaction to the fact that some people were missing the sycophantic aspects of earlier models when GPT5 came out. So adding more tones, that's all post training stuff. So do you tell the model, those are examples of how you should be responding, which is more like a sort of supervised fine-tuning kind of paradigm? Or is that RL, right or wrong, with rewards? How does that work?
A
I don't work on post training, and it certainly has a lot of quirks. But I think the main part is indeed RL, where you say, okay, is this response cynical? Is this response like that? And you say, okay, if you were told to be cynical, this is how you should respond. If you were told to be funny, try this one. So I do think RL is a big part of it.
B
In between models, or different versions of the models, are the releases aligned with pre training efforts? Or sometimes do you have one big pre training effort and several models that come out based on it?
A
There used to be a time, not that long ago, maybe half a year in the distant past, where the models did have an alignment with technical stuff, right? So they would align either with RL runs or with pre training runs. That's why you had a beautiful model called 4o, which was aligned with a pre training run, which was obviously worse than o3, aligned with an RL run, that was the follow-up to o1, naturally, because you couldn't use the name o2. But it was slightly better than o4-mini, because that one was mini. And, you know, we had this beautiful model picker, and people kind of thought this was not the best naming, for whatever reason. So no, I mean, it was fairly obvious that this was very confusing, right? So now the naming is by capability, right? GPT5 is a capable model, 5.1 is a more capable model. Mini is the smaller model that's slightly less capable but faster and cheaper. And the thinking models are the ones that do more research. In that sense the naming is detached from anything technical. In particular, 5.1 may be just a pre training, sorry, post training thing, but maybe 5.2 is the newly pre trained model, or maybe not. But the naming has detached from the technology, which also gives some... you know, as OpenAI has grown, there are a number of projects, right? There is RL and pre training, and there may be, you know, something just to make slides better or whatnot. And with distillation you have the ability to put a number of projects into one model. It's kind of nice that you don't need to wait on all of them to complete at the same time, and that you can periodically put this together, actually make sure that there's a product that's nice to the users and good, and do this separately from waiting on the new full pre training run that takes months and so on. So I feel like even though a little tear in my eye goes for the times where it was that pre trained model number that was the number.
As it's a product serving a billion users, it's maybe inevitable that you should name it by what the user should expect from it rather than...
B
In 5.1, you have additional granularity in terms of telling the model how long it should think. By default, how does the model decide how long it should think?
A
So the model sees the task, and it will decide on its own, a little bit, how long it should think. But it's trained with additional information that can tell it to think even harder, and then it will think longer. So you now have the ability to steer that. I still think it is important to realize that this is the fundamental change that came with reasoning models: using more tokens to think increases your capability, and it increases it, for the given computation, way faster than pre training, right? So if you give GPT5 the ability to think for long, it can solve tasks that are... We had this gold medal at the Mathematical Olympiad and the Computer Science Olympiad. So these are amazing abilities. At the same time, the fundamental training method of reasoning is very limited to science data. So it's not as broad as pre training. Pre trained models felt kind of almost uniformly good or bad at things. I mean, this was still not uniform, because it's not like teaching humans, right? But the reasoning models are even more what people call jagged, right? They have amazing abilities somewhere and then, close by, not so much. And that can be very confusing. It's something I always love, that it's weird, because you can say the model is amazing at the Mathematical Olympiad, and at the same time, I have a daughter in the first grade, she's five years old. I took one exercise from her math book and none of the frontier models is able to solve it, and you would be able to solve it in 10 seconds. So that's something to keep in mind. Models are amazing, and there are also tasks that they cannot do very well. I can show you this as an example. I think it's quite interesting to keep in mind. Let me start with Gemini 3, just to blame the competitors.
B
Yes, please.
A
So you see two groups of dots, on both sides, and the question is, is the number of dots even or odd? And if you look at it, you see, oh, they're like two identical things, so that would be even. That's what the five year old is supposed to learn. But there is one dot that's shared, so now it must be odd. For this simple one, which has, you know, I don't know, 20 dots or so, Gemini 3 actually does it right. It finds out that it's an odd number of dots and it says that. And that's great. And then you have another puzzle which is very similar, except now there are two mountains of dots, and there's also one dot shared, at the bottom now. And in context, right after that, you ask, okay, how about this one? And then it does some thinking and it just totally misses that there is a shared dot, and it says the number is even. And it's in context, where you've seen this first example. How would you ever miss that? And here is the exact same prompt for GPT 5.1 Thinking. It also solves the first one: it sees the dot, it says it's odd. And then it sees the mountains and somehow it doesn't see the dot, and it says it's even. The nice thing is, if you let it think longer, or if you just let it think again, it will see it. So if you use GPT5 Pro, it takes 15 minutes. So, you know, the human five year old takes 15 seconds. GPT 5.1 Pro will run Python code to extract these dots from the image and then it will count them in a loop. So that's not quite...
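The parity argument in this puzzle is simple arithmetic: two identical groups of n dots that share one dot contain 2n - 1 dots in total, which is always odd. The short Python sketch below checks this on a made-up layout. The coordinates are hypothetical and are not the dots from the actual exercise, which is not reproduced in the transcript.

```python
def mirror(points):
    """Reflect a set of (x, y) dots across the vertical axis x = 0."""
    return {(-x, y) for (x, y) in points}

# Hypothetical left-hand group; the dot at (0, 0) sits on the mirror axis.
left = {(0, 0), (-1, 0), (-2, 0), (-1, 1), (-2, 1), (-2, 2)}
right = mirror(left)

shared = left & right    # dots that belong to both groups
all_dots = left | right  # a shared dot is counted only once
total = len(left) + len(right) - len(shared)

assert total == len(all_dots)
print(total, "odd" if total % 2 else "even")  # prints: 11 odd
```

Without the shared dot the total would be 2n and therefore even, which is the answer the models give when they miss it.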
B
And why is that? What trips up the model?
A
I think this is mostly the multimodal part. The models are just starting. Like, you see, the first example they manage, so they've clearly made some progress, but they have not yet learned to do good reasoning in multimodal domains, and they have not yet learned to use one piece of reasoning in context to do the next piece of reasoning. Learning from what is written in context happens, but learning from reasoning in context is still not very strong. All of these, though, are things that are very well known, and the models are just not trained enough to do this. It's just something we know we need to add into training. So I think these are things that will generally improve. I do think there is a deeper question. So, you know, multimodal will improve, this will improve. We keep finding these examples. So the frontier will certainly move forward. Some things will smoothen. But the question is, will there still be just other things, things that you don't need to teach a human? Okay, now you know how to use a spoon and a fork, but now if the fork has four instead of three ends, then you need to learn anew. That would be a failure of machine learning. I am fascinated by generalization. I think that's the most important topic. I always thought this was the key topic in machine learning in general and in understanding intelligence. Pre training is a little different, right, because it increases the data together with your increase in model size. So it doesn't necessarily increase generalization. It just uses more knowledge. I do believe that reasoning actually increases generalization, but for now we train it on such narrow domains that it may still be hard to see. But I think the big question in all of AI is, is reasoning enough to increase generalization, or do you need more general methods? I think the first step is to make reasoning more general. As we talked about before, that's my passion. That's also what I work on.
There is still something there. We push the models. They learn things that are around what we teach them. They still have limitations because they don't live in the physical world, because they're not very good at multimodal, because reasoning is very young and there are still a lot of bugs in how we do it. But once we fix that, there will be this big question: is that enough? Or is there something else big needed to make models generalize better, so we don't need to teach them every particular thing in the training data, so that they just learn and generalize? I think that's the most fascinating question. But I also think a good way to approach a question like that is to first solve everything that leads up to it. You cannot know whether there is a wall or not until you come close to it, because AI is moving very fast. Someone said it's like driving fast in a fog. You never know how far or close you are. So we're moving. We are learning a lot.
B
And does that mean... So that central question of basically learning with very little data the way a child would, and the fact that a child is able to do things that even the most powerful model cannot do. So, as you said, to unpack this: making progress on reasoning and showing how far we can get into generalization with reasoning. And then the separate question is, as you said, whether we need an entirely different architecture. And that's where we get into, for example, Yann LeCun's work. Do you see promising fundamental architectural changes outside of transformers that have caught your attention and feel like they could be a serious path to explore in the future?
A
I think there is a lot of beautiful work that people are trying out. You know, the ARC challenges inspired one set of people. There are models now that are very small and solve them very well, but with methods that I'm not sure are actually general. We need to see. Yann LeCun has been pushing for other methods. I feel like his approach is more towards the multimodal part. But maybe if you solve multimodal, right, maybe if you do JEPA, it also helps your other understanding. There are a lot of people still pushing fundamental science. It's maybe not so much in the news as, you know, the things that push. But whatever you do, you know, it will probably run on some GPU. If you get a trillion dollars of new GPUs, the old GPUs will be much easier to get, also. So I think this growth in LLM AI, on the more traditional side, is also helping people have an easier time running more experimental research projects on various things. So I think there is a lot of exploration, a lot of ideas. It's still a little hard to implement them at a higher scale. The engineering part is the biggest bottleneck. I mean, GPUs are a bottleneck too when you really scale up. But implementing something that's larger than one machine, when it's an experimental research project, so you don't have a team to do that, I think that's still harder than it should be. But, you know, Codex, or some coder, may get there. This is the thing where AI researchers have great hope to help themselves and also other researchers: if you could just say, hey Codex, this is the idea, and it's fairly clear what I'm saying, please just implement it so it runs fast on this eight machine setup or 100 machine setup, that would be amazing. It's not quite capable of doing that yet, but, you know, it's capable of doing this more and more. I think that's what OpenAI means when we say we'd like an AI intern by the end of next year. That's how I understand this.
You know, can someone help us?
B
And is part of the path for Codex to be able to do some of this... does that revolve around how long it can run? The context behind the question being that, again, like two days ago as we record this, you guys released GPT 5.1 Codex Max, described as a frontier agentic coding model trained on real world software engineering tasks, designed for long running workflows, and using compaction to operate across multiple context windows, over millions of tokens. So I'd be interested in unpacking some of this. What does that mean, to run for a very long time? Is that an engineering problem or a model problem? And then maybe a word on compaction.
A
So it is both an engineering and a model problem. You want to do some engineering task, like you have some machine learning idea, you want Codex to implement it for you, test it on some simple thing, find the bugs. So it needs to run this thing. This is not something you would do in an hour, right? It's something you'd spend a week on. So the model needs to spend a considerable amount of time, because it needs to run things, wait for the results, then fix them. It's not like the model is going to come up with the correct code out of thin air, right? It's just like us. It needs to go through the process. And oftentimes in the process, since it was not trained on anything very long, there was nothing in its training, or maybe very few things, but certainly nothing that went on for a week, it can get lost, it can start doing loops or doing something weird. That's of course not something you want. So we try to train in a way that makes it not happen, but it does. So how can you make the model actually run a process that requires this larger feedback loop without tripping up? And the other thing is, transformers have this thing called context. So they remember all the things that they have seen in the current run, and that can just exceed the memory available for your run. And the attention matrices are N by N, where N is this length, so they can get huge. So instead of keeping everything, you say, well, I'm going to just ask the model on the side to summarize the most important things from the past, put that in context anew, and forget some part of it, right? So compaction is a very basic form of forgetting, right? And that allows you to run for much longer if you do this repeatedly. But again, you need to train the model to do that. And when you do it, it works to some extent. I don't think it works well enough to replace an AI researcher yet. It has made a fair bit of progress.
I think another part of progress that's a little understated on the research side, but is very important, is allowing the model to connect to all of these things. So models now use tools like web search and Python routinely. But to run on a GPU, to have access to a cluster, it's hard to train models with that, because then you need to dedicate these resources for the model to use, and that has security problems. And this thing, how do models connect with the external world, is a fundamentally very hard problem, because when you connect in an unlimited way, you can break things in the real world. We don't like models to break things for us. So that's a part where people work a lot. It overlaps with security, right? You need to have very good security to allow models to go on and train on the things they need to train on.
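The two points made in this answer about long runs, that an N by N attention matrix grows quadratically with context length and that compaction keeps the context bounded by summarizing and forgetting, can be sketched as a toy Python loop. Everything here is an assumption for illustration: `summarize` is a hypothetical stand-in that simply keeps the last few words, whereas in the approach described the model itself is asked to write the summary, and tokens are approximated by whitespace-separated words.

```python
def attention_entries(n_tokens):
    """Number of entries in one N-by-N attention matrix."""
    return n_tokens * n_tokens

def count_tokens(messages):
    """Crude token count: whitespace-separated words."""
    return sum(len(m.split()) for m in messages)

def summarize(messages, max_words=12):
    """Hypothetical stand-in for asking the model to summarize: keep the last few words."""
    words = " ".join(messages).split()
    return "[summary] " + " ".join(words[-max_words:])

def compact(context, budget, keep_recent=2):
    """If the context exceeds the budget, replace older messages with one summary."""
    if count_tokens(context) <= budget or len(context) <= keep_recent:
        return context
    old, recent = context[:-keep_recent], context[-keep_recent:]
    return [summarize(old)] + recent

budget = 40
context = []
for step in range(1, 21):
    # Each simulated step of a long-running task appends a log line to the context.
    context.append(f"step {step}: ran tests, saw failure, patched file")
    context = compact(context, budget)

print(count_tokens(context), "tokens kept after 20 steps")
print(attention_entries(2000) // attention_entries(1000), "x the entries for 2x the length")
```

The design choice worth noting is that the most recent messages are kept verbatim while only older history is compressed, so the run can continue indefinitely at the cost of losing detail about the past.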
B
One theme that people like me, VCs and founders and startups, think about a lot as we see all the progress at OpenAI is that the models keep getting more general, with more agentic capabilities, the ability to run for a very long time, going into areas like science and math. And recently it was reported that there were some investment bankers hired to help improve the model's capability to do grunt investment banking work. All of that taken into account, is there a world where basically models, or maybe just one model, does everything? And I don't know if that's AGI, let's not necessarily go into that debate, but what's left for people that build products that sit on top of models?
A
I just showed you a five year old exercise that the model doesn't do. I think we need to keep that in mind.
B
So you're saying there's hope.
A
There is hope. The next model will do it, right?
B
That hope. Okay.
A
Well, for me, yes. I still think we have some way to go on the models. Progress has been rapid, so there is good hope there will be less and less of this. But on the other hand, for now you don't need to do a deep search to find things where you'd really want a human to do the task, because the model's not super good. On the other hand, the transformer paper started with translation. I recently went to a translation industry conference. The translation industry has grown considerably since then. It has not shrunk. There are more translations to be done, and translators are paid more. The question is, why would you even want a translator if the model's so good? In most of the cases, the answer is this. Imagine you do a listing for a newspaper, but in a language you don't know. GPT5 will almost certainly translate it correctly for you, if it's into Spanish or French or any high resource language. Would you still publish it without having a human who speaks that language look at it? Would you publish it if it's the UI of ChatGPT that a billion people are going to see? It's a question of trust, probably, right? But if you have a million users, a billion users, maybe you will pay the $50 for someone to just look over it before you publish it. So this is in an industry that fundamentally is totally automated, right? There is still the question of trust, and I think that's a question we will grapple with for a long time. There are also just things you want a person to do. I don't think we will have no things to do. But that doesn't mean that some things we do may not dramatically change, and, you know, that can be very painful for people who do these things. So this is a serious topic that, happily, people are engaging with. But I don't think there will be this global lack of anything for people to do.
B
And maybe as a last question, to help us get a sense for what people at the frontier of AI are currently thinking about or working on: some of the topics that one may see are things like continual learning, world models, robotics, embodied intelligence. In addition to what you mentioned upfront, multimodal, what do you personally find really interesting as a researcher?
A
Well, you know, I find this general data reinforcement learning... it's my pet peeve and what I work on. For example, robotics is probably just an illustration that we are not doing that well in multimodal and that we're not doing that well in general RL, in general reasoning, yet. The moment we do really well in multimodal and we manage to generalize reasoning to the physical domains that the robot needs, I think it will see amazing progress. I have a feeling, given that, you know, a lot of companies are launching hardware that's kind of teleoperated or glove operated or something... so my suspicion is, by the moment we make this progress, which, you know, maybe will be next year, maybe will be in a few more years, the hardware may be there. And having a robot in the home may be a big visible change, maybe more visible than Chat. Given how quickly we got used to the self driving cars in San Francisco, maybe it will be only visible for the first two days and then it will be like, yeah, sure, the robot's there. It's always been cleaning, since I can remember, the last three months. It's stunning to me how quickly we get used to these things. The self driving cars in San Francisco are something people got used to so quickly. So maybe this will happen for robots too. Nevertheless, I do think it will be quite dramatic in our perception of the world when it happens. Hardware is hard, though, right? Robots may have accidents in the house. You need to be very careful. So maybe it will take longer to deploy them and actually make it a scalable business. We'll see. It is amazing that we are at this point where we can start thinking, yes, maybe that will come soon.
B
Lukasz, it's been absolutely wonderful. Thank you so much for spending time with us today.
A
Thank you very much, Matt. Thank you for the invitation. Great to talk to you.
B
Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from. This really helps us build the podcast and get great guests. Thanks, and see you at the next episode.
Date: November 26, 2025
Host: Matt Turck
Guest: Łukasz Kaiser (Co-author of Transformers, Research Scientist at OpenAI)
This episode features a deep, insightful conversation between AI investor Matt Turck and Łukasz Kaiser, renowned for co-authoring the seminal "Attention is All You Need" paper that introduced the transformer architecture. Now at OpenAI, Łukasz is at the forefront of the next paradigm shift in AI: reasoning models. The discussion tackles the “AI slowdown” narrative, the ongoing exponential growth of AI capabilities, why reasoning and reinforcement learning are game changers, and where the next frontiers lie—including challenges in generalization, the limitations of current models, and the future of robotics and multimodal systems.
⏰ [01:29 – 08:04]
“If you look at AI progress, it's been a very smooth exponential increase in capabilities... It's not like pre training fizzled out. It's just we've found a new paradigm.”
—Łukasz Kaiser [02:16]
⏰ [11:22 – 21:37]
“With the reasoning model you... want to tell the model, you should think well, you should think so that the answer after this is good.”
—Łukasz Kaiser [11:47]
⏰ [08:32 – 11:22], [32:49 – 37:29]
“There is a ton of extremely obvious things to fix...and on top of that, there are the big things like multimodal...”
—Łukasz Kaiser [08:32]
⏰ [19:22 – 21:37], [37:29 – 39:41]
“With thinking, it's like, oh, I often make mistakes, but I need to verify and correct myself to give the correct answer...”
—Łukasz Kaiser [19:49]
⏰ [21:37 – 28:42]
“How do you store knowledge in neural networks is another important question. And it's part of this model too...”
—Łukasz Kaiser [25:07]
⏰ [31:48 – 37:29]
⏰ [39:41 – 46:17]
“Main improvements from 4 to 5 is adding reasoning with reinforcement learning and this allowed to generate synthetic data which also improves the model. So that's the big picture.”
—Łukasz Kaiser [40:21]
⏰ [46:30 – 53:17]
“Models are both amazing and there are tasks that they cannot do very well... I always love that it's weird.”
—Łukasz Kaiser [46:30]
⏰ [53:17 – 56:07]
⏰ [56:07 – 59:35]
⏰ [59:35 – 62:29]
“There are also just things you want a person to do. Like, I don't think we will have no things to do.”
—Łukasz Kaiser [61:00]
⏰ [62:29 – 64:54]
The conversation underscores just how fast, jagged, and surprising modern AI’s progress has been, with paradigm shifts sometimes occurring “between blinks.” Łukasz Kaiser’s dual perspective as both a creator of transformative architectures and a leading researcher in reasoning-based models gives listeners a unique window into technical, organizational, and philosophical issues at AI’s bleeding edge—while practical challenges from system engineering, multimodal learning, and trust remain front and center for the next wave.