
A
Most people don't realize that this is already happening, especially over the past few months. In almost every lab, the new generation of models is built heavily using the previous generation of models. What is missing right now is long horizon and full automation, and we are moving in that direction super, super fast. The moment we have this full automation, we can close the loop of self-improvement. We will have just gotten rid of the human bottleneck for improving these models, and I expect to see a huge jump again from such a development.
B
Hi, I'm Matt Turck. Welcome to the MAD Podcast. Today my guest is Mostafa Dehghani, a top AI researcher at Google DeepMind and a core contributor to some of the most influential architectural breakthroughs of the last decade, including Universal Transformers, the Vision Transformer, and the natively multimodal Gemini family. In this episode we unpack what's hot in frontier AI right now, including what it actually means for AI to think in loops and the immediate timeline for recursive self-improvement, where AI autonomously builds the next generation of AI. We also dive into the technical evolution of image generation with Nano Banana 2 and why continual learning could completely disrupt how enterprise data pipelines and RAG systems are built today. Please enjoy this fantastic deep dive with Mostafa Dehghani. One of the hottest concepts in AI research right now seems to be the concept of loops. So I thought it'd be a fun place to start: this idea that models are going to improve not by being bigger, but by thinking recursively. What does that mean exactly?
A
It's definitely one of the hottest active areas, one that almost every lab is investing in. And looping operates at different levels. The one at the micro level is basically the looping that we use in the architecture, or at inference time for test-time compute and stuff like that. And then at a higher level there is the loop that we have over the development of these models, which we refer to as self-improvement. Let's talk about self-improvement as this general concept. If I want to put it very simply, it is really just the continuation of the trend that we've been riding for decades. Think about it: in classical machine learning, humans had to sit down and manually engineer the features, and you had to decide what the model actually pays attention to. Then deep learning and neural networks came along and said, okay, let's just remove that and let the model figure out the representation itself. That was actually a huge deal. We removed a massive human bottleneck and human bias. Then, further on, instead of just designing architectures, we started learning them too. Instead of curating every piece of training signal, we scaled to data-driven approaches and let the data speak. Self-improvement, this loop in development, is just the next step in the same direction. The whole point of it is that you're removing the human bottleneck and bias from improving these models. Right? Now you say that not only does a human not have to handcraft features anymore, but we also don't want the human to sit in the loop every time the model has to get better. That's basically the development side. So it's not radically new; it's the same story, just a new chapter. I think every time we removed human judgment from this process, we got over a bottleneck.
I would say self-improvement, looping over the development, is doing that at the highest level, which is improving these models themselves. If you want to go to a more detailed level of looping, we can talk about ways of increasing test-time compute for these models and how we let these models loop over their process within a specific problem, to refine it, to think about it. I think the most familiar form is chain of thought, letting the model think with extra tokens. Beyond that, you can think about different ideas that let the model increase the compute for any specific problem. Like, what if I have dummy tokens that the model can use as a read-and-write tape, to verify what it has done, go through the solution or the process over different steps, and understand what has been done wrong and what has to be done next? Or even negative sparsity, which is basically reusing part of the model multiple times. This sort of looping has also been shown to be super helpful, mostly because you just let the model throw more compute at a difficult problem.
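As a rough illustration of the weight-reuse idea described above (one block applied several times, with the loop count as the compute knob), here is a minimal Python sketch. It is an analogy only, not a neural network and not how any production model works: `shared_block` is a hypothetical stand-in (a Newton update for a square root), and `looped_solve`, `max_loops`, and `tol` are names invented for this example.

```python
from typing import Tuple


def shared_block(state: float, problem: float) -> float:
    """One pass through the single reused block.

    Assumption for illustration: the block is a Newton update that
    refines an estimate of sqrt(problem). A real model would apply
    the same learned layer weights here instead.
    """
    return 0.5 * (state + problem / state)


def looped_solve(problem: float, max_loops: int, tol: float = 0.0) -> Tuple[float, int]:
    """Apply the same block up to max_loops times.

    Halts early once an update changes the state by at most tol, so
    easy inputs use fewer passes than hard ones.
    Returns (final_state, passes_used).
    """
    state = 1.0  # fixed starting guess for every problem
    used = 0
    for _ in range(max_loops):
        new_state = shared_block(state, problem)
        used += 1
        converged = abs(new_state - state) <= tol
        state = new_state
        if converged:
            break
    return state, used


if __name__ == "__main__":
    for problem in (1.0, 2.0, 1e6):
        answer, passes = looped_solve(problem, max_loops=50, tol=1e-9)
        print(f"problem={problem}: estimate={answer} after {passes} passes")
```

With this toy block, an easy input such as 1.0 halts after a single pass, while a harder one such as 1e6 takes more passes through the same block before the updates settle. No parameters are added; only the number of times the block is reused changes.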
B
So that's self-improvement at inference time. I think you alluded to this earlier: there is also a bigger concept that's maybe, I guess, more science fiction, except it seems to be becoming a reality very quickly, which is this concept of recursive self-improvement, or RSI. That seems to be what a lot of people are talking about. I think ICLR is coming up in a few weeks and there's a bunch of papers focused on that. So what is that? What is recursive self-improvement as a concept?
A
It's actually interesting that you referred to it as something that looked like a bit of a sci-fi situation, where these models are actually improving themselves. And that's true, because a few years ago, when you wanted to talk about this, you could just write a perspective paper at a conference and talk about it at a super high level. But if we go and check what is happening right now, it is, to a really good extent, already happening, and somehow most people don't realize that. Especially over the past few months, in almost every lab, the new generation of models is built heavily using the previous generation of models. I think that's basically the case everywhere. It's not fully automatic yet, but the direction is super clear, and it's easy to imagine that we're going to get to a situation with full automation. These models are going to improve themselves and keep learning from the world. And again, it relates to other concepts, like continual learning, where we are not yet at the most advanced point. But if someone comes and says, oh, you know, I have an idea to get a model to calculate the gradient and update its weights on the fly, it just feels very normal. It's not something where you go, wow, this is such an amazing idea. I think what is missing right now is long horizon and full automation, and we are moving in that direction super, super fast. The moment we have this full automation, I would say we can close the loop of self-improvement, and then the problem becomes mostly one of providing compute for these models to actually do what they want to do. And, as I said, at that moment we will have just gotten rid of the human bottleneck for improving these models, and I expect to see a huge jump again from such a development.
B
So people may have seen or heard about Karpathy's auto research project a few weeks ago. Is that an example, presumably reasonably narrow to make it work, of a self-recursive loop?
A
It definitely is. And I think that was one of the early examples of seeing these models actually doing something super sensible on the research side. We've been seeing them do a lot of good work on improving the engineering part of the development loop. But the research side is where you think, okay, maybe some sort of gut feeling or intuition is needed, and a researcher with a long history of playing with these models can do this, but not necessarily a model. I think we're seeing the sign that maybe that golden part of a successful recipe, the part that mostly comes from the intuition of a good researcher, is coming to these development loops through these models. And it's a bit hard to think about: does it mean that we can replace every genius researcher with these models very soon? Maybe, and I don't know how soon, but this is definitely a sign of something that we doubted a few years ago. We couldn't believe that this was going to happen this early, which is very exciting.
B
I want to play it back just to make sure that people listening to this understand we're talking about AI building AI. And I think a few months ago, if you talk to researchers, people would say, oh yeah, we already use AI to build AI, but that really meant that we use AI tools and reasoning models to come up with ideas and thoughts about building models. But here what we're talking about is AI automatically updating itself, updating its weights in a recursive manner, leading to potentially a dramatic acceleration in progress. And what you're saying is that this is largely upon us and a question of longer horizon and basically more compute. Is that fair?
A
I think so. This is one part. And the other is that I'm not going to say that soon we're going to have these models fully automated; there are actually many problems that we have to solve. But directionally I can see how this can happen. It's not something that I would look at as super hard. It's hard, but very possible.
B
Okay, so what are the roadblocks? So you talked about compute. Is evaluation one of them? Because presumably the model needs to understand what is right and what is wrong in terms of the quality of the answer. Is that one of the issues?
A
100%. At the end of the day, you can only improve what you can measure, right? And getting evaluation right is just hard. At the end of the day it becomes almost a philosophical problem, not just a technical one. This is actually a very interesting observation: if you have a team of super competent people, most of the time they can make massive progress on a problem if there is some concrete eval to hill-climb on. But if there's no eval, it's just really hard to make progress. And the fact is that we don't have evals, or even a way of defining evals, that can measure how close we are to the point where we can actually get a self-improvement loop. We just don't have that, and it makes it much harder to measure progress in that direction. But there are proxies, and there are definitely some evals: maybe we can evaluate every step of the model toward this direction, maybe we can evaluate up to this many turns of the model, or maybe we can evaluate the model helping itself to improve in a specific framework and a specific setup. This is also a part of machine learning that needs iteration. It's quite interesting, because beyond the difficulty of building the eval, the infrastructure that you need to reliably run evals that are super complicated is also super hard. It's quite funny, but sometimes the hard part is figuring out how to create an environment for a model that operates safely within Google, right? One that does all the jobs that an RE or RS, a research engineer or research scientist, can do, in a safe setup. Because right now we are definitely not confident about them doing the right things all the time, and measuring how much they can push and how long they can push on a task is very difficult.
And connecting all these points into an environment that these models operate in, then getting them to run efficiently, and bringing diversity to the evals is definitely one of the bottlenecks to making progress in this direction.
B
A couple of weeks ago we had a fun conversation with Carina Hong of Axiom Math and we talked verification. Is something like formal verification a promising area from your perspective? What would enable you to make sure that the improvement loop keeps continuing?
A
In my opinion, formal verification is one of the most powerful keys to enabling self-improvement. But it's not the key. If you think about it, for math, code, and logic, it's great. You can run a proof; it either checks out or it doesn't. But other domains are a little bit messier. For example, you cannot write a formal proof of whether a doctor's recommendation is good, right? So it's not easy to extend formal verification to all the domains in the real world. But one interesting question, which is very relevant to formal verification, is how we can look at these methods and build that kind of tight and honest feedback loop for the messy part of the work. I think it's very inspiring to build on top of these formal verification methods and extend them to domains that are not easy to verify, but where you need some sort of clean and tight feedback loop to be able to make progress.
B
So the same problem as reinforcement learning, right? The second you start veering away from math and code, you start getting into a messy territory. Is model collapse one of the issues to think about, or is that orthogonal?
A
Model collapse is definitely a risk, right? I would say model collapse mainly happens when you have a loop that is completely closed. If you don't have any outside signal and the model is, for example, just talking to itself or operating in a very restricted environment, there's a good chance that your model collapses. But if you have a strong verifier or some sort of real reward signal that anchors the signal coming from AI-generated data, for example, it can be quite powerful. I think the key here is to stay grounded in something real, and then you can most likely avoid things like model collapse. But yeah, again, it's a risk, though definitely not a major one.
B
And perhaps, to make this accessible to everyone, can you define what model collapse is in the first place?
A
So basically, you have some sort of data and environment that these models are interacting with, but those environments and data are designed, for example, by another model. This is just one example. Then the model becomes really, really good at this specific part, and suddenly it loses generalization to anything beyond that. This is one definition, or one of the cases, of what model collapse results in.
B
So you mentioned losing generalization. Is that a particular worry in the context of RSI: that either you have those self-reinforcing loops, but they need to be fairly narrow, or you have more general models, but then you kind of have the loops?
A
This is an interesting question again: generalization versus specialization. Let me go a few steps back. We have had this discussion many, many times: how should we trade off generalization and specialization when we are developing these models? I think long term you want a model that knows everything and knows when to go deep versus wide, right? Imagine you have an agentic coder. If your agent is super strong at every step of operation, like a really, really good programmer, it's amazing. It's super specialized. But for many problems, like coding problems, you need some sort of planning: understanding what's going on, collecting information, and deciding what to do based on the context. After you define the steps, then your super strong specialization kicks in. Before that, being a generalist is super useful. Generalization is definitely one of the things that you need to get to the ultimate goal of AGI. But short term, I would say building a specialist model is probably the fastest way to learn what is actually possible. And in many cases these specialized models become stepping stones toward a generalist model, which is super valuable, right? So you can imagine saying, if I'm thinking about self-improvement, maybe I need to make sure that I can build it in a very specific area. Maybe I focus on coding, and if it works out, then I work out how to widen that and how to bring more into this specialized setup. One thing that I always say is that people don't care what category their problem falls into, right? If a human calls something a problem, then AI should be able to solve it. And I think that's fundamentally a generalist need. So at the end of the day you need generalization.
And moving along this spectrum between a super generalized model and a super specialized model is more about long term versus short term, and how to take advantage of each side during this process.
B
What's a specialized model today? Is that a separate model or is that a broad general model that's trained in a specific way, including in particular through RL?
A
Okay, so here's the point. We used to have constraints like compute, and if we wanted to push a model to be SOTA, we would choose specific dimensions. We would say, okay, we want to allocate the compute that we have to those, and make this model look really good at this, something that is extremely expert at it. So that was basically the trade-off that we were trying to make, given the compute budget that we had. As we go through this phase of compute becoming more available and cheaper, maybe we're constrained by other stuff, like data. One of the other trade-offs that pops up, especially in post-training, is this game of whack-a-mole: sometimes it's really hard to get your model to be good across the board. You try to make it good at something like multimodality, and somehow you see some regression on coding. You make it good at coding and multimodality, and it becomes slightly worse than a model that you had at math and reasoning. So it's hard to find a balance. Part of it is because post-training does a little bit of overfitting. At the end of the day, when you post-train a model, you are trying to overfit it to the best local optimum you have. When the recipe becomes, how can I find the best local optimum, the problem becomes that there's no local optimum that is good for everything. So I need to choose, right? Seeing this, you end up making some decisions along the way and saying, okay, maybe for me at this stage, because of the need that I have in my organization with respect to the competition that is going on, I need to choose this specific axis. For example, some companies have a very strong focus on coding, which means, okay, I make my job super easy, or not super easy, but much easier than competitors that want to ship a model that is good across the board.
I think short term it's very effective, because first of all, during development you care less about all the dimensions, so maybe it's just faster to iterate. You free up some space in the minds of your researchers and engineers: okay, let's just push this to the max. The other one is that you don't hit the trade-off immediately. And the decision to say, okay, I'm going to pick this specific axis and make the model look really, really good at it, is sometimes, again, based on where you are organizationally, your competitors, and stuff like that.
B
Great. You said something a few minutes ago that I thought was so intriguing, which is this idea that the Karpathys of the world, and the yous of the world, could be automated. What happens if the brightest minds in the world get automated and the AI creates itself? At some point, does just no one know how the AI works? Is that an actual possible future?
A
This part is very philosophical. I don't know. Well, let me give you one quick thing that I thought about a few days ago. I have a daughter. She is one and a half years old. I've been impressed over the past few years. Very interestingly, I've been proven wrong multiple times about the timelines that I had in mind. For example, sometimes I'd say, oh, this is going to happen in six months, and it never happened. Sometimes I'd say, oh, this is just so hard, within the next 10 years there's absolutely no chance of solving it. And then boom, in two or three months, someone had a brilliant idea and they solved it. So it's really hard to predict the future. And I was thinking, okay, so you're talking about Karpathy and other researchers. But I'm thinking, what about the next generation? If my daughter at some point comes to me and asks, what should I do? What do you recommend I study? What major, what branch of science or research should I dig into and become an expert in? I really don't have a good answer. It almost doesn't exist, and it's just really hard to predict the future. What I know is that there are a few skills that are probably key to being able to make an impact in this world and to staying relevant. One of them is being strategic and having all the parameters on the table when you're making a decision. And becoming an absolute expert in one very specific subject is most likely not going to be useful in the near future. I think the brilliance of Karpathy is not that he's a good programmer, or, and he definitely is, a good teacher. I'm saying these are not the most impressive parts. The most impressive part for me is that he has a really good overall view of what is happening.
By putting himself in the stream of information, he can make a decision about what the next most impactful thing to do is. And the things that he does now to make an impact are very different from the things that he used to do five years ago. I think he will be able to continue doing that. What are the things that he's going to be doing in five years? I don't know, but I know he's smart enough to figure it out and keep making an impact on the world.
B
So AI researchers are not researching their way out of a job just yet.
A
Hopefully we are smart enough to.
B
All right, maybe this is more of a macro question, as I think about where the value lands in this ecosystem: if AI just keeps creating itself, is data still needed in that equation, or is that all compute?
A
The concept of data is a little bit broader than just tokens, right? If you think about data as whatever the model can get signal from, whether it is predicting the next token in raw text, which we use in pre-training, or a super complex environment that the model interacts with and gets signal from, all of this is something that we can refer to as data. Right? And it's not like data, or the value of having good data or working on data, is going to disappear and compute is going to become the only thing. At the end of the day, I think the work that we're doing on the data side is most likely going to shift toward building environments, or making sure that these models can interact with the physical world. Then it becomes more of a problem of, okay, how can I provide more grounding for these models? They are good at improving themselves, but only as long as I expose them to real-world data and real-world environments. So providing data becomes more about how I can give this specific model access to something that it never had. For example, something came to my mind which is, again, a little bit sci-fi: how can I make smell accessible to these models? Right now there's no good way. But then data becomes information, anything that is really easy for us because of all the senses that we have. Right now I'm sitting here, and I know how hard my chair is and what the temperature of this room is. All this sensory information is coming to me, and every word that I'm saying is based on all this input, right? Providing this to a model that does self-improvement is already a really hard problem.
So I would say that the work on data would shift toward making this sensory information more available to these models, in a way that enables them to really improve themselves more effectively given all this information.
B
Yeah, interesting. There seems to be a big trend towards sensors as a service. We're seeing startups emerge in that field. Okay, super, super interesting. Zooming out from self-improvement for a second: the big theme of the last year has been the acceleration of post-training in addition to pre-training, so the whole reinforcement learning aspect of things. Where do you expect gains to come from in the next few months or year? Is that more post-training? Is that more pre-training? Is that both? Is that something else?
A
The answer to this question really depends on when you ask it. It's obvious that we're going to have a bit of a swing back and forth between pre-training and post-training. At the end of the day, I want to say that pre-training is still the foundation, and you can never post-train your way out of a weak base model. But right now the return on post-training is really strong. I started working on post-training myself a few months ago, Gemini post-training, mostly coding and agentic. I can see how a brilliant small idea can make a model, for example, 10x better in terms of behavior at a fraction of the cost of pre-training. Right? So we can see how post-training is the place to make a lot of impact and improve these models. But on the other hand, and I know it's also the case at other companies, at least at GDM a lot of exciting research work is going into the pre-training side: new recipes, new ideas. And I would say the work that we're doing on pre-training is going to unlock a lot of downstream possibilities. Post-training is just a different mode of operation. It's also super interesting for me, because I'm a little bit new to this side of the operation. But at the end of the day I always expect a cycle going on between post-training and pre-training.
B
Your comments on pre-training sort of go against that narrative that appeared a few months ago that pre-training was dead. That's not your take at all.
A
Right. I think everyone has ideas on the pre-training side. At the end of the day, going for an idea is a function of its complexity and the expected gain, right? Sometimes you feel that there are low-hanging fruits, so instead of bringing this complex recipe to pre-training, you say, the one that I have is simple, elegant, and super scalable, so I'm going to push this and move the effort to post-training. Then at some point the base model becomes the bottleneck, and you're happy to take the complex recipe, bring it to pre-training, and keep pushing it. Is pre-training dead? I would say maybe the old way is. It's also a little bit difficult to talk about old and new, because the time frame is so short that when I say old, maybe I'm referring to two weeks ago or something. But for the way that we used to do pre-training a year or two ago, maybe the diminishing returns are obvious. But I can see how new ideas are bringing fresh energy into pre-training and suddenly opening a door toward something exotic that might actually drastically change base model capability over time.
B
So exciting stuff for Gemini 4 whenever it comes out. You mentioned continual learning earlier and that's another one of those hot topics that people have been talking about. Can you define continual learning for us so that this conversation is educational for a broad group of people? Maybe compare and contrast that with the self-improvement loop. Those are two different things, but help us understand the difference.
A
Definitely. They're related, but they're distinct, right? Self-improvement is about a model getting smarter over time and improving its capability, with the model itself doing it. Continual learning is mostly about a model staying current. Think about a doctor who keeps reading new research, refreshing their knowledge and trying to make sure that it doesn't go stale. The shared enemy of self-improvement and continual learning is a model whose weights stay frozen while the world keeps going, right? If you have a model that is just frozen and the world is moving, then you get neither self-improvement nor continual learning. But continual learning is mostly focused on making sure that if there's fresh knowledge in the world, the model's knowledge cutoff is not in the past. So it's constantly updated: for example, overnight, all the news, everything that is happening in the world, is folded in. So if you ask the model a question today, that super fresh knowledge is already in the weights of the model, and it doesn't have to depend on an external source to bring it in. And it's hard. It's really, really hard. One of the big problems is catastrophic forgetting, where you get your model to learn new information after you're done training it, and suddenly you see regression in the knowledge that it already learned in the main training phase. It's a very active area of research right now.
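To make the catastrophic forgetting point concrete, here is a minimal Python sketch under heavy simplifying assumptions: a two-weight linear model, one "old fact" and one "new fact" as single training examples, and plain gradient descent. All names and numbers (`old_fact`, `new_fact`, the learning rate, the step count) are invented for illustration and do not describe any real training recipe.

```python
def predict(w, x):
    """Linear model with two shared weights."""
    return w[0] * x[0] + w[1] * x[1]


def squared_error(w, x, y):
    return (predict(w, x) - y) ** 2


def train(w, x, y, lr=0.1, steps=200):
    """Plain gradient descent on a single (x, y) example.

    Returns a new weight list; the input weights are not modified.
    """
    w = list(w)
    for _ in range(steps):
        err = predict(w, x) - y
        w[0] -= lr * 2 * err * x[0]
        w[1] -= lr * 2 * err * x[1]
    return w


# Hypothetical data: both facts depend on the same two weights.
old_fact = ((1.0, 1.0), 2.0)  # learned during "main training"
new_fact = ((1.0, 0.5), 0.0)  # learned afterwards, with no replay of old data

w_main = train([0.0, 0.0], *old_fact)
loss_old_before = squared_error(w_main, *old_fact)

w_updated = train(w_main, *new_fact)
loss_old_after = squared_error(w_updated, *old_fact)
loss_new_after = squared_error(w_updated, *new_fact)

if __name__ == "__main__":
    print("old fact, before update:", loss_old_before)
    print("old fact, after update: ", loss_old_after)
    print("new fact, after update: ", loss_new_after)
```

In this toy setup the update fits the new fact almost perfectly, while the error on the old fact rises from about zero to roughly 3.24, because the update moved weights that the old fact also relied on. Real models have vastly more parameters, but the same kind of interference is what the regression described above refers to.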
B
And what's the reality of continual learning as of now? Is that built into existing systems, not at all, or about to be?
A
There are two sides to it. One side is that I think the research is not yet at a point where you think, oh, this is the recipe, I just need to exploit it and push for productionization. Right? Basically, every time you have a new problem that is key, you have this phase of exploration where people try different ideas and jump from one idea to another that could be very different. Then, when you're confident that something works to some extent, you go into exploitation mode and say, oh, let me just make it as good as it can be. This is the way to push it: let's scale it, let's develop infra for it, make it super fast, productionize it, and see what happens. I think we are not there yet. The other side is, again, as I said, that because we've never had a super confident recipe for continual learning, building infra for it and investing in making it fast is hard. That said, I've seen very impressive progress on this within GDM. It's kind of interesting, because it is one of those things that can be heavily theoretical. I've seen people who do a lot of theory work get into this problem, and they're having a lot of fun and also making a lot of impact. It's impressive how much progress we've made on this. But I don't think we yet have any idea where everyone says, oh, this is it, let's just do it, let's push this.
B
Great. I'd love to talk about you and your background. Tell us your story in a few minutes. How did you come to do this work and what was your journey to AI and then your journey to Google DeepMind?
A
So I did my PhD at the University of Amsterdam on machine learning, mostly on the language model side, and text, search, and retrieval. What pushed me toward trying really hard to be in the mainstream, and to be part of this group of people hustling to make really good progress, was that I did a few internships back in 2016 and 2017. And the funny story is that I did an internship at Google Brain in early 2017, and it was amazing. I went to this team that was working on LSTMs for summarization. Summarization was actually one of the most interesting problems at that time. I was amazed. I was like, this is so good, I just want to keep doing this for the rest of my life. This is it. And then I got a return offer to go back and do another internship at the end of the same year. The recruiter told me, oh, there's this team that just published a paper, maybe you've heard about it, the Transformer, and they're looking for an intern. I remember I had a chat with Lukasz Kaiser, and Lukasz was saying, yeah, we have this idea of building a Kolmogorov machine based on the Transformer. He was so excited about this. And then we finished up the conversation and I sent a message to the recruiter and said, I don't know if I want to go with this team. They're doing something random. Everybody's doing LSTMs. Why should I go and work with a group of people who are working on this random architecture, the Transformer? It's going to die. And then he tried and couldn't find any other team for me to join. So I joined this team as an intern, and that changed my life.
Being among these super brilliant, super smart people who believed in a vision and direction while almost everyone else was excited about something else was very inspiring. And then we worked on this Kolmogorov machine idea, which turned into the Universal Transformer paper, and recurrence in depth and reusing parameters came out of it. And this is still making a lot of impact after almost 10 years.
B
Tell us about that quickly. So that was in 2019 I believe, and you were a co author of that paper and that was very much that idea that we started with at the beginning of this conversation of loops and recursive stuff.
A
So Universal Transformer, we wrote that paper in 2018, and I think it was also rejected once from one conference and accepted in 2019. I don't remember exactly, but I think it was accepted at ICLR and rejected from NeurIPS or something. The whole intuition was that there is something about reusing parameters and a model going through its own output another time. So basically you generate something, then you pass it into the model again, and the model gets another chance at it. I remember Lukasz had this algorithmic data set, which he used to call Algorithmic Tasks. It was part of this TensorFlow-based code base, Tensor2Tensor was the name, and the code is still there. I can even find my pull request into it for pushing the Universal Transformer code. And we saw that there are some problems, like copying an input to the output or doing something algorithmic with a super long input, which are super easy, but a normal Transformer was failing awfully at them. And we saw that with looping the model does them perfectly. At that point, I remember we also had the bAbI dataset from Meta, and it was doing great on that. And then the idea of test-time compute came to our minds, where you train with a fixed amount of compute but at test time you unleash your model to do more computation, throwing more FLOPs at the input, and we were super excited about this. We ended up introducing an adaptive computation mechanism into this, which was again inspired by Alex Graves's paper on LSTMs. And it was a very interesting ride because we were pushing for something that sounded exciting at that time.
And I have a guess that at that time the whole field was perhaps a bit too focused on using adaptive computation to decrease the cost on simple problems. But now we know that we can also use adaptive computation to increase the cost for hard problems. It's the other side of the same coin, right? At that time we were maybe resource constrained, so we were really thinking about why we were spending so many FLOPs going through all the layers for the dot at the end of the sentence, with that token really going through all 24 layers, and how we could decrease that. But now we have a different perspective, which is: how can we increase this for a physics problem where we want to run inference for maybe two weeks? So it was really fun to work on that with these brilliant people. And then there's recursion in depth and reusing the parameters, which I've later seen some people frame as negative sparsity. That's a great way of connecting it to mixture of experts: in mixture of experts you have FLOP-free parameters, parameters that don't bring any FLOPs, and in looping you have parameter-free FLOPs, where you don't have extra parameters for the extra FLOPs you are throwing at the problem. So it goes in the opposite direction from sparsity, it's quite effective, and I think people are picking it up. We're seeing a lot of excitement in this direction.
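The "parameter-free FLOPs" versus "FLOP-free parameters" framing in the answer above can be made concrete with a toy accounting sketch. Everything here is an illustrative assumption rather than anything from the Universal Transformer paper: each "layer" is reduced to a single `d_model x d_model` matrix multiply, router cost in the mixture-of-experts case is ignored, and the function names are hypothetical.

```python
# Toy accounting only: a "layer" is one d_model x d_model matmul,
# costing roughly 2 * d_model^2 FLOPs per token (one multiply-add per weight).

def _layer_cost(d_model):
    params = d_model * d_model
    flops_per_token = 2 * d_model * d_model
    return params, flops_per_token


def dense_stack(d_model, n_layers):
    """Ordinary stack: every layer owns its weights."""
    params, flops = _layer_cost(d_model)
    return {"params": n_layers * params, "flops_per_token": n_layers * flops}


def looped_stack(d_model, n_loops):
    """One shared layer applied n_loops times (recurrence in depth).

    Extra loops add FLOPs but no parameters: parameter-free FLOPs.
    """
    params, flops = _layer_cost(d_model)
    return {"params": params, "flops_per_token": n_loops * flops}


def moe_layer(d_model, n_experts, top_k):
    """One mixture-of-experts layer where each token uses top_k experts.

    Extra experts add parameters but no per-token FLOPs: FLOP-free parameters.
    Router cost is ignored in this sketch.
    """
    params, flops = _layer_cost(d_model)
    return {"params": n_experts * params, "flops_per_token": top_k * flops}


if __name__ == "__main__":
    print("dense 24 layers:", dense_stack(512, 24))
    print("looped 24 times:", looped_stack(512, 24))
    print("MoE 8 experts, top-2:", moe_layer(512, 8, 2))
```

Under these assumptions, the looped stack spends the same per-token compute as the 24-layer dense stack with 1/24 of the parameters, and `n_loops` can be raised at test time without touching the weights, which is the test-time compute idea described earlier in the answer.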
B
Fascinating. Another fundamentally important contribution to the field that you made was the Vision Transformer paper in 2022. The paper was called An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Do you want to walk us through what that was?
A
That's also a funny story. I got into vision and multimodality with that paper. I had never worked on any vision problem before. It was mostly because my desk was next to people who were working on vision, and that was the reason I got interested, because I was just talking to them and thinking, oh, this is actually interesting. And I remember that at that time I was working on what we externally call the PaLM paper, with Aakanksha and other folks. And I was like, why do we have 400 billion parameter language models, but the biggest model we have on the vision side is maybe 100 million, like a ResNet? Why is there no benefit from scaling? So I started looking into this with folks, thinking, okay, maybe there's something in the Transformer that makes it scalable, and maybe we can move away from convolution and try this. At the end of the day, I don't want to say that's the only way of scaling. Maybe if a group spent enough time on convolutions, they could also make them scalable and just as good. But there was also a benefit in doing it this way, simply because the rest of the machine learning field, which was working on language, was using this architecture. So they were building infra for it and making it faster, and sometimes the hardware is designed around this architecture, at least in the short term. So we started pushing, and I remember we had a bunch of ideas, like, okay, what if each pixel is a token? But the cost was going up and the context was getting super long. We had a lot of back and forth. And it's also quite funny because we started thinking about this problem from a very complicated point of view. We were trying to mimic convolutions to get this working.
And it ended up that a bunch of colleagues in Zurich started trying the simple idea of, what if we just divide the image into patches of pixels, 16 by 16, treat each patch as a token, and forget about overlapping patches or windows and stuff like that? That's it. Chop up the image, feed it to a Transformer, and then scale: go with a lot of data, and start with something discriminative to train this model. And it worked. It was also a bit of a surprise for us, because we were all thinking about something fancy and very complicated, maybe integrating convolutions and so on. But what worked was the simple idea: patchify, feed it to a Transformer, scale it up, and then boom, you had a really, really good model for representation learning.
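The "patchify" step described above is simple enough to sketch in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the image is a nested list of pixels, the function name `patchify` is hypothetical, and the steps that follow in a real model (projecting each flattened patch to an embedding and adding position information) are left out.

```python
def patchify(image, patch_size=16):
    """Split an image into non-overlapping square patches.

    image: list of H rows, each a list of W pixels (a pixel can be any value,
    for example an (r, g, b) tuple). H and W must be multiples of patch_size.
    Returns the patches in row-major order, each flattened to a list of
    patch_size * patch_size pixels. Each patch plays the role of one token.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # A dummy 224 x 224 image whose "pixels" are just their coordinates.
    image = [[(row, col) for col in range(224)] for row in range(224)]
    tokens = patchify(image, patch_size=16)
    print("tokens with 16x16 patches:", len(tokens))
    print("tokens if each pixel were a token:", 224 * 224)
```

For a 224 by 224 input, this gives 14 x 14 = 196 patch tokens instead of 50,176 pixel tokens, which is the context-length problem with "each pixel is a token" that the answer mentions.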
B
Yeah. And to play it back at the highest level, that basically meant that you could apply a Transformer architecture to images, whereas in the past you had two different families: the CNN world and the Transformer world for text. And your breakthrough was to prove that the Transformer could scale equally well to images, which basically paved the way to Gemini 3 today, which is a natively multimodal model. Is that fair?
A
Okay, yeah, that is true. With that, we took a step toward adapting Transformers to video and to audio as well. So again, even if this is not the only architecture that could work for a multimodal model, it made it really simple to train these models natively, because you have a single architecture and can have all the modalities during training.
B
Great. So that's a perfect transition into your work on Nano Banana and the future of image AI. You are part of the Nano Banana team, which must have been so much fun when this came out and went completely viral. What an incredible product. Since then there have been a couple of releases: Nano Banana Pro in November of 2025, and then just a few weeks ago, at the end of February, Nano Banana 2, aka Gemini 3.1 Flash Image. A lot of people assume that image generation works as a translator, meaning that the AI reads the text of the prompt, translates it into picture instructions, and then draws it. But as we were saying, Gemini is natively multimodal. So how does that work? How does a model actually process the text and the pixels at the same time to build the image?
A
I think the reason that maybe I got into generation... okay, by the way, there's also one thing: I'm not an expert in image generation. When I started working on this, I remember I had meetings with people who were talking about computer graphics and all the old ideas and intuitions, and I had zero idea what was going on. I was like, I know how to train a Transformer and scale it, and if that helps, I can contribute. But again, it was fun because I worked with a group of super smart, brilliant people with really good intuition. And the reason I was excited about this, which is maybe not super relevant to Nano Banana itself but worth mentioning, is that I was excited about the idea of positive transfer across modalities. When you think about native multimodality, one part of it is that you're adding capability to your model: the model can understand images, videos, audio, and text, but can also generate all these modalities. So you have one model that does all of these together. This is for sure exciting from the product point of view. You have a model that is great at generating all these different outputs, and users are finding it very useful and interesting. But the most exciting part for me was: can I see a glimpse of transfer between these modalities? For example, if I train a model to become good at generating images, does it also become better at generating text? There are different intuitions for why this should happen. There's something very old in the literature on the linguistics side that they call reporting bias. For example, say you visit your friend's place and you see that they have a banana-shaped sofa.
When you go home, the chance of talking about that sofa is much higher than for a normal sofa. You might tell your friends or partner later, oh, I went there and their sofa was in the shape of a banana, which was really fun. But if it was normal, it's almost weird to come back and say, oh, by the way, I went to my friend's place and they had a sofa which was super normal. So this is reporting bias in language: language doesn't talk about things that are in the middle of the distribution. But if you have an image, or vision input from anything in the world, you have that information. There's no need for reporting, it's just there. Because of that, picking up a lot of knowledge about the world through language is just not really efficient. I don't want to say that it's impossible, but it's not efficient. Take learning about gravity. If you train your model on videos, it's much easier to get the model to learn about gravity, because it just happens in a video, than by training your model on all the textbooks to learn the concept of gravity or what gravity actually is.
B
Is that a concept of world model that's built into the image representation?
A
Exactly, exactly. So basically, you want these models to also be world models. You want these models to know about the world. There is a good chance that you can teach your model about the world just by presenting text to it, but it's just not efficient. A good shortcut would be to bring multimodality into this. And the best way of learning about a modality is learning how to generate it, right? So we got to this point: we've had Gemini generating images since Gemini 1. Gemini was multimodal from day one. And the reason we first released image generation at 2.5, instead of Gemini 1, Gemini 1.5, or Gemini 2, was that it was not great. It really needed a push. And then we figured out how to push this without introducing any regression in the other capabilities the model has, and bring all of this natively into the model. That was one side that was super interesting for me. It's not sad news, but it's really hard to see positive transfer. It turned out to be a really, really good model, but it was really hard to see that, wow, I train on images and then text perplexity goes down. That was hard to see. The fact that you train a native model and it's good across all the capabilities is already impressive. But my hope is that multimodality and world models are the way to really push multimodal training to enable positive transfer across modalities. I've also worked with people who were experts on this. For example, I remember that at the beginning they were talking about visual quality, and I would think, oh, this model is a great model. I'd send it to them, and they'd say, no, this is not a good model. I was like, what do you mean?
And they started showing me two images that to my eyes looked the same, but they were saying, no, this one is way better. I was like, no, they're the same. So they had good taste for grasping the visual quality of images. Working with them was really interesting, to understand that, okay, there are these dimensions. And by the way, their intuition was what actually made Nano Banana a success in terms of being a good product. But then it was like, okay, what if we push this toward something beyond traditional image generation? So instead of a translator, as you said, a text-to-image model, it becomes a thinking machine about images. For example, you enable interleaved text-image generation, where the model can think not only in text tokens but also in pixel space. So it generates text, then an image, then more text, then another image. And you can leverage that for different problems. One of them is if you have some sort of a story: text of the story, an image related to that, more text of the story, like a children's storybook. Another one, which I was actually really excited about, was this incremental generation. Let me give you an example. Take DALL-E or Imagen or any standalone image model. If you ask these models to generate an image of a scene with 50 details, they might fit them. And then someone can say, okay, I can build a better model that does up to 55 details. And then you say, okay, what about 60? And I say, okay, let me go back and train it and come back to you to cover your case. But at the end of the day, there's a threshold on how many details from the text these models can capture when following the instruction.
But if you have incremental generation, so text and then an image, then text and an image, you can get your model to generate these details one by one. You never expect your model to generate a perfect image in the first shot. You expect your model to plan the generation. So it says, oh, let me start with the big objects, because later I'm going to have a hard time if I've put in small objects and the big objects don't fit. So let me do that first, and then in the next turn I go with medium objects, and then smaller ones. This is super smart. And you're never bottlenecked by the capability of single-shot image generation, because you did planning and you tuned the difficulty of every step to match what your model can generate in one shot. So that was also one of the things where Nano Banana and native, interleaved generation brought a completely new perspective to image generation work, which is a little bit far from just translating text into an image.
B
Does part of this contribute to efficiency? Especially with Nano Banana 2, you have the Flash aspect of this. So you're able to create amazing images very fast and seemingly very efficiently. What's behind the scenes? Is it what you just described? Is that MoE? How are you able to do that?
A
First of all, I was involved in the original Nano Banana and Nano Banana Pro, but not in the last version, because I jumped to post-training and coding agents, which I find exciting. This one the team actually shipped. But if I want to say at a super high level what exactly makes the model faster and more efficient: part of it is just the size of the model. Nano Banana was Pro size and this one is just Flash. So definitely the parameter size, the configuration of the MoE, and so on. The other part was that people spent quite a lot of time figuring out and nailing down the distillation recipe, both on the knowledge side and for the other things you need to distill into a process that is lighter than the full process. And, surprisingly, a lot of infra work for serving. We have really, really brilliant people who are serving engineers. It's kind of impressive when you sit at your desk and they come and say, oh, by the way, casually, I made the model 10x faster. And they just say it in a very casual way. It's like, wow, this is impressive. We also had a lot of work on optimizing how to serve these models, because these models operate differently from a normal language model. They're not necessarily the same as next-token prediction. This is definitely something where a good serving engineer can figure out, okay, I can think of a different way of doing that. And we got a lot of improvement on the efficiency side from their work.
B
All right, so as we get towards the end of this conversation, I thought it'd be fun to end with a few hot takes if you're ready for them.
A
Yeah, simply.
B
All right. What is one thing the AI field is getting wrong right now?
A
It's not easy to pinpoint specific things. And again, this is just my personal opinion, and maybe I have colleagues or other people who share it. But I think we're underestimating how hard jagged intelligence is to fix, and we're underestimating how much it matters. People almost laugh about it: if you have a model that does a very difficult math proof but has a difficult time counting letters in a word, people just laugh and move on. But I think it's actually pointing at something deep and unresolved about these systems, about the way these systems represent and process knowledge. And it's not a bug that you can patch. We definitely see this happening: sometimes we have these problems where something is awfully, really bad, and then you say, oh, let me just patch it by adding something to the system instruction or the opener instruction. But it's a bit of a structural property of how these models actually learn. So I would say this is probably one of the things that we're not getting super right at this point.
B
Great. What is one idea in AI research right now that is underrated?
A
Something that is underrated. You mentioned continual learning. I think this is definitely underrated. As I said, sometimes a problem stays in exploration mode until we are confident about something, and then it goes to exploitation mode. I think we are past the time when we really had to push this to exploitation. Foundation models are essentially frozen in time right now, when the training ends. And then everything is built on top of this frozen model: RAG pipelines, fine-tuning workflows, retrieval systems. All this elaborate infrastructure is based on the assumption that these models are frozen, and it's a bit too strong of an assumption to make. I think we are going to get to the point that we need to change these assumptions, and maybe we need to think about it a little more actively and push it toward productionization. So maybe continual learning is a little bit underrated right now.
B
So you think RAG goes away over time?
A
It's not going to look like it does today, and it's going to be different. But saying that it's going to go away completely, I'm not sure about that. One of the reasons I say that is that RAG is not just about bringing fresh information to the model when it wants to solve a problem about the current state of things. It also has this in-context learning aspect. And there is a difference between the information that you have in the context of the model and the information that you have in the weights of the model. Continual learning and RAG are doing different things to bring in this fresh information. Maybe it changes in a way that the model doesn't need to trigger RAG for everything. But I'm pretty sure there's going to be some tail of the distribution that we're still going to do RAG for. What's the time?
B
All right, last couple of hot takes. What do you think people are too confident about?
A
So people think that pushing the technical side is sufficient, that if we just get a model that is smarter, everything else is going to follow. In my opinion, a version of AI that is really brilliant at technical problems but has a blind spot about everything else is not going to be able to create meaningful progress in the world. People are confident that everything else is going to follow, or that everything else is just a small list, and I don't think that's right. We have governance, we have regulation, we have social trust, we have, for example, the distribution of access to and benefit from this technology in the world, and even the institutional capacity to absorb and adapt to this technology. This is something we maybe don't pay enough attention to. And these are not solved problems. They're really hard, if not harder than the technical part. The pace of technical progress is definitely running ahead of the world's capacity to develop these kinds of mechanisms, and this gap is getting bigger and bigger. What I'm saying is basically that the field needs to hold both things at once. So maybe that's one of the things.
B
All right, and last one. And I don't know if that's hot take or maybe just advice for anybody entering the field today. If you were going to start from scratch today, what would you work on?
A
I don't want to start from scratch. It's hard to start. I can tell you there are two things that I think would be nice to spend more time on, and there's one thing that I'm very excited about. I'll start with the thing that I'm really excited about. In the short term it's really exciting to push it, and I'm actually trying to contribute to this direction myself. That's full automation of super long horizon tasks, things where you have a machine working for maybe two weeks or a month. The agents today are very impressive and the demos are remarkable. But there is this compounding reliability problem that doesn't get talked about enough. For example, imagine an agent has to take 100 sequential steps to complete a task, and imagine each step has a 95% success rate, which is great. Given the models that we have today, 95% is really good. The probability of completing the whole task without a single failure is 0.95 to the power of 100, which is less than 1%. This math is brutal. And 95% per step, as I said, is very, very optimistic. Long horizon automation definitely isn't impossible, but it requires a level of per-step reliability and error recovery that current systems maybe don't have. And if we want social trust and people really using it: at the end of the day, people don't experience the average performance of these models. They experience the failures. If your model makes a dumb mistake, the damage to trust is bigger than the benefit of getting 100 things right. So reliability in these long horizon tasks is something that we definitely need.
As I said, there are also two more philosophical, high-level things. I would definitely work on the grounding problem and how we can build AI systems that are robust and connected to the physical world. As I said, soon the question of data, of how to enable these models to be very good at self-improvement, becomes: how can I ground these models in the real world? So this is definitely something that would be the bottleneck of self-improvement if we don't actively think about it. We should definitely move away from just statistical patterns in text and pixels. And the other thing, which is maybe related, is thinking about a better definition of intelligence itself. It's a little bit philosophical, but it's definitely a practical question. The whole field, us included, is building more and more of something that we haven't really defined. We're trying to make these models smarter and more intelligent, but the definition of intelligence is so hand-wavy and fuzzy that it's hard to actually measure meaningful progress. This is related to your question about evaluations. It's good that we have proxies, benchmarks, scores, capabilities, and even vibes, which I find super useful. But at the end of the day, we really need a systematic way of defining intelligence. That is hard. And again, making progress based on what we have right now is good, but at some point it becomes more important to really pinpoint what the target is and what the goal is, and then push toward that with maximum speed.
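The compounding-reliability arithmetic from this answer (100 sequential steps at 95% each) can be checked with a short sketch. It assumes the simplest possible model, which is an assumption of this example rather than something stated in the conversation: steps succeed or fail independently, and there is no error recovery, so one failed step fails the whole task. The function names are hypothetical.

```python
def task_success_probability(per_step_success, n_steps):
    """Chance that all n_steps succeed, assuming independent steps
    and no error recovery."""
    return per_step_success ** n_steps


def required_per_step_success(target_task_success, n_steps):
    """Per-step success rate needed to hit a target whole-task success rate
    under the same independence assumption."""
    return target_task_success ** (1.0 / n_steps)


if __name__ == "__main__":
    p = task_success_probability(0.95, 100)
    print(f"100 steps at 95% each: {p:.4%} chance of a clean run")

    needed = required_per_step_success(0.90, 100)
    print(f"per-step rate needed for a 90% clean run: {needed:.4%}")
```

With these inputs the whole-task success rate comes out at roughly 0.6%, consistent with the "less than 1%" figure quoted above, and reaching a 90% clean-run rate over 100 steps would need each step to succeed about 99.9% of the time. Error recovery changes the picture because a failed step no longer ends the task, which is why the answer pairs per-step reliability with recovery.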
B
All right, Mostafa, it's been an absolutely fantastic conversation. Thank you so much for spending time with us. Really enjoyed it. Really appreciate it.
A
Thank you. Yeah, thank you so much for having me. It was, like, fun to chat. And thanks for the invite.
B
Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. This really helps us build the podcast and get great guests. Thanks, and see you at the next episode.
Date: April 2, 2026
Guest: Mostafa Dehghani, Google DeepMind
Host: Matt Turck
This episode dives into the rapidly evolving landscape of artificial intelligence building artificial intelligence, focusing on the concepts of recursive self-improvement, looping architectures, continual learning, and the transition from human-driven to autonomous model development. Mostafa Dehghani, a leading researcher at Google DeepMind and a core contributor to Universal Transformers, Vision Transformer, and Gemini, provides expert insights into the current frontier of AI research and development. The conversation covers key technical innovations, philosophical implications, model evaluation, data, and the future of enterprise AI.
Current State:
What’s Missing:
Looping Defined:
Technical Mechanisms:
From Vision to Practice:
Roadblocks:
Definition:
Generalization vs. Specialization:
Distinction:
Research Status:
Trends:
Insight:
NanoBanana & Visual Transformers:
Technical Approach:
Efficiency gains:
"Every time that we removed human judgment from this process, we kind of got over a bottleneck… I would say the self improvement and looping over the development is kind of like doing that at the highest level."
— Mostafa (03:23)
"If I want to put it very simply, it is really just the continuation of the trend that we've been riding for decades… it's the same story, just a new chapter of the same story."
— Mostafa (02:19)
"100%, at the end of the day, you can only improve what you can measure. And then getting evaluation is just hard."
— Mostafa on evaluation as a bottleneck (10:14)
"Model collapse mainly happens when you have a loop that is completely closed, right? … There's a good chance that your model collapses. But if you have a strong verifier or some sort of a real reward signal that anchor this kind of signals… it can be quite powerful."
— Mostafa (14:11)
"Short term, I would say building a specialist model is probably the fastest way to learn what is actually possible. And in many cases these specialized models are becoming stepping stone toward a generalist model, which is super valuable, right?"
— Mostafa (16:40)
"Sometimes I say like, 'Oh, this is going to happen in six months.' Never happened. Sometimes like, 'Oh, this is just so hard… absolutely no chance to solve it.' And then boom, in two months, three months, someone had a brilliant idea and they solve it. So it's like really hard to predict the future."
— Mostafa (21:20)
"Data becomes more about, okay, how can I give access to this specific model to something that we never had? For example… How can I make like smell accessible to these models?"
— Mostafa (24:50)
"Pre training isn’t dead... the way that we used to do pre training maybe like a year ago or two years ago, maybe diminishing return is obvious. But I can see how new ideas are bringing fresh energy into the pre training and suddenly just open a door toward… something exotic."
— Mostafa (28:14)
Recursive Self-Improvement Explained
[05:29–09:38]
Model Collapse and Generalization
[15:01–17:45]
Continual Learning and Its Industry Impact
[30:06–33:42]
Technical Origins Stories: Universal Transformer & Vision Transformer
[33:55–43:46]
NanoBanana/Gemini and Image Generation Breakthroughs
[43:46–53:04]
Hot Takes and the Future
[54:51–63:57]
| Time | Segment/Theme |
|-------|-------------------------------------------------------------------|
| 00:00 | AI models are already building AI; closing the self-improvement loop |
| 01:33 | What "thinking in loops" means in AI |
| 05:29 | Recursive self-improvement as an emerging reality |
| 07:32 | Karpathy's Auto-Research and AI in research |
| 10:01 | Roadblocks: compute, evaluation, philosophical challenges |
| 14:11 | Model collapse: risks and mitigations |
| 16:40 | Spectrum of generalization vs. specialization |
| 24:09 | Data as signals and environment; future of interaction |
| 26:50 | Pre-training vs. post-training; short vs. long-term gains |
| 30:06 | Continual learning explained and contrasted with self-improvement |
| 33:55 | Mostafa's personal journey: internships, Universal Transformer |
| 40:13 | Vision Transformer (ViT): scaling images with transformers |
| 43:46 | NanoBanana/Gemini: how natively multimodal models work |
| 53:04 | Speed and efficiency innovations in modern image models |
| 54:51 | Hot takes: what the field misses, underrated and overrated trends |
| 60:09 | The hardest and most exciting problems ahead |
"At the end of the day, we really need a systematic way of maybe defining intelligence. That is hard. And again, making progress based on what we have right now is good, but at some point that becomes a little bit more important to really pinpoint what is the target and what is the goal and then push toward that with maximum speed."
— Mostafa (63:26)