
A
If I'm being honest with myself, I think we're ahead of where I thought we could go. We're not really building a model anymore; I think we're really building a system at this point. What might be happening instead is a shift in paradigm, where before we were scaling in the data-unlimited regime, and we're shifting more to a data-limited regime, which actually changes a lot of the research and how we think about problems. I don't really see an end in sight for that kind of line of work to continue giving us progress.
B
Hi, I'm Matt Turck. Welcome to the MAD Podcast. My guest today is Sebastian Borgeaud, pre-training lead on Gemini at Google DeepMind. Sebastian is one of the top AI researchers in the world and a member of the Metis list, and this is a particularly special episode because it's his first podcast ever. We talked about how Gemini 3 is built under the hood, the shift from an infinite-data world to a data-limited regime, how research teams at DeepMind are organized, and what's next for AI. Please enjoy this great conversation with Sebastian. Sebastian, welcome.
A
Thank you. Hi Matt.
B
So I was hoping to start this conversation with this tweet from Oriol Vinyals, who's the VP of Research and Deep Learning at Google DeepMind and a Gemini co-lead, who said when Gemini 3 came out that the secret behind the model was remarkably simple: better pre-training and better post-training. Which, when you think about the leap that Gemini 3 represented over the prior state of the art, sounds remarkably modest. So I was curious about your perspective. Is it as simple, in some ways, as that?
A
Yeah, I'm not sure it's a big secret. At least from my perspective, this seems quite normal. I think people sometimes have the expectation that from one Gemini version to another, there's a big thing that changes and that really makes a big difference. In my experience, there's maybe one or two of those things that make a larger difference than other things, but it's really a culmination of many, many changes and many, many things from a very large team that actually makes Gemini 3 so much better than the previous generations of Gemini. And I think this is probably a theme that will recur later, but it's really a large team effort that comes together in a release like Gemini 3.
B
What does that tell us in terms of where we are in AI progress? What sounds from afar, as in sort of turning some knobs gives us such a leap? What does that mean in terms of what we can expect going forward?
A
There's two things. The first one is it's still remarkable how much progress we're able to achieve in this way. And it's not really slowing down. There's so many of these knobs and so many improvements that we find, almost on a day-to-day basis, that make the model better. So that's the first point. The second point is we're not really building a model anymore. I think we're really building a system at this point. People have sometimes disputed that: we're just training a neural network architecture and that's it. But it's really the entire system around the network as well that we're building collectively. And so that's the second part.
B
The big question on everybody's mind is what does that mean in terms of actual progress towards intelligence? And we don't need necessarily to go into the whole AGI thing, because who knows what that means? But is the right way to think about this kind of model progress as an actual path towards intelligence versus trying to succeed on this benchmark or that other benchmark? What gives you confidence that the core model is getting smarter?
A
The benchmarks definitely keep improving. And if you look at the prompts and how the benchmarks are set up, they are becoming increasingly difficult. Even for me, who has a background in computer science, some of the questions the model answers would take me a significant amount of time to answer. This is just one view, the benchmark view. We evaluate those frequently, we're being very careful about holding out the test set, but still there's often some fear of overfitting to those, of just "benchmaxing," as people call it. That's one aspect; I don't think those fears are very founded. But the second aspect, and that's the one that really fills me with confidence, is the amount of time people spend using the model to make themselves more productive internally is increasing over time. With every new generation of models it's pretty clear the model can do new things and help us in our research and our day-to-day engineering work, much more so than the previous generation of models. So that aspect should give us confidence as well that the models are becoming more capable and actually doing very useful things.
B
I'm always curious, as an AI researcher who's so deep into the very heart of all of this: if you zoom out, are you still surprised by where we are? From your perspective, are we well ahead of where you thought we would be a few years ago? Are we on track? Are we behind, possibly?
A
I think it's easy to say we're on track in hindsight. If I'm being honest with myself, I think we're ahead of where I thought we could go. I started working on LLMs in 2019 or 2020, and it's kind of hard to believe the scale of everything we're doing, but also just what the models are capable of doing today. If you looked at scaling laws back then, they were definitely pointing in that direction, and some people really believed those deeply. I'm not sure I would have bet a lot on that actually materializing and being where we are today. So one interesting question that follows from this is where does that take us? If we assume the same kind of progress we've seen in the last five years, I think what's going to happen in the next few years is going to be very, very cool.
B
What do you think on that front? Does that mean AI comes up with novel scientific discovery, wins the Nobel Prize? Like where do you think we are going in the short term? Like two to three years?
A
I think yeah, that's part of it. On the science side, I think DeepMind historically has done a lot of work and for sure there's a lot of work in that direction as well. I think we will be able to make some large scientific discoveries in the next few years. That's one side, I think on the other side, in my day to day work as well, both research and engineering, I'm very excited about how we can use those models to make more progress, but also to better understand the systems we're building and develop our own understanding and research further.
B
Yeah, there's this big theme in the industry about automation of AI research and engineering, which if you extrapolate it leads into AI 2027 kind of scenarios where there's a discontinuity moment. Just at a very pragmatic level. What does that mean using AI for your own work today and what do you think that's going to mean in a couple of years?
A
I think it's not so much about automation, but more about making us go faster and spending more of our time on the research part, at a slightly higher level, maybe. In a lot of the day-to-day work in research on language models, we're dealing with quite complex and large systems at the infrastructure level. So quite a bit of time is dedicated to running experiments, babysitting experiments, analyzing a lot of data, collecting results. And then the interesting part is forming hypotheses and designing new experiments. Those last two parts, I think, are something we'll stay very much involved in. The first part, I think, especially in the next year, with more and more agentic workflows being enabled, that should be able to really accelerate our work.
B
Is your sentiment that the various frontier AI labs are effectively all working in the same direction, sort of doing the same thing? One fantastic, but in some ways perplexing, thing that we all experience as industry participant-observers is this obvious phenomenon where every week, or every other week, or every month there seems to be another fantastic model, and we're completely spoiled. So Gemini 3 just came out, and at the same time, like two hours ago, literally before we were recording this, GPT 5.2 came out. What do you make of that from your perspective, and how do you think that plays out? Is anybody going to break out, or is the industry effectively going to continue with the handful of top labs, plus some neo-labs that are appearing?
A
For the first question, there's definitely similarities between what the different labs work on. I think the base technologies are kind of similar. I'd be surprised if we weren't all training transformer-like models, for example, in terms of the architecture side. But then there's definitely specialization happening on top of that, and different branches in the tree of research that are being explored and exploited by the different companies. Historically, for example, DeepMind has been really strong on the vision and multimodal side, and that continues to be the case today, and it shows both in how people use the model and in the benchmarks, of course. And then things like reasoning, et cetera: OpenAI came up with the first model, but we also had a strand of research on that. So there's similarities, but it's not exactly the same, I would say. For the second question, I don't know if I have a good answer. One thing that's clear is that to make progress on a model like Gemini today, you do need a very large team and a lot of resources. Now, that doesn't necessarily mean that what we're doing today is optimal in any form, and some disruptive research could definitely come along and allow a smaller team to actually take over in some form. This is one of the reasons why I actually enjoy being at Google so much. Google has this history of doing more explorative research, and a really high breadth of that research, and that continues to be the case, mostly in parallel to Gemini. But we're definitely able to also utilize that and bring some of those advances into Gemini.
B
Are there other groups, whether at DeepMind or elsewhere in the industry, that are working in semi-secret or complete secret on post-Transformer architectures, where one day something will come out and we'll all be surprised? Are there groups like that in the industry?
A
I believe so. There's groups doing research on the model architecture side for sure, within Google and within DeepMind. Whether that research will pan out, it's hard to say, right? It is research, so very few research ideas end up helping.
B
And so in the meantime, the core advantage that one company may have over the other is just the quality of people. In the case of Google, I guess, the vertical integration: that tweet from Oriol that I was mentioning was quote-tweeted by Demis Hassabis, and he was saying that the real secret was a combination of research and engineering and infra. So is that the secret sauce at Google, the fact that you guys do the whole stack?
A
It definitely helps. I think it's an important part. Research versus engineering is also interesting. I think over time that boundary has blurred quite a lot because we're working on these very large systems now. Research really looks like engineering and vice versa. And I think that's a mindset that has really evolved over the last few years at DeepMind, especially where maybe there was a bit more of the traditional research mindset before. And now with Gemini, it's really more about research engineering. The infrastructure part is also very important. We are building these super complex systems, so having infrastructure that's reliable, that works, that's scalable, is key in terms of not slowing the research engineering down.
B
And Gemini 3 was trained on TPUs, right? Not on Nvidia chips. So it's truly, truly integrated. Okay, so I'd love to do a deep dive on Gemini 3, but before we do that, let's talk about you a little bit. So you are the pre training lead on Gemini 3. What does that mean? And then let's go into your background and your story.
A
I am one of the Gemini pre-training leads. What this entails is a mix of different things. Part of my job is actual research, so trying to make the models better. But these days it's less running experiments myself, and more helping design experiments and then reviewing the results with people on the team. So that's the first part. The second part, which is quite fun, is more of the coordination and integration. It's a fairly large team at this point. It's a bit hard to quantify exactly, but maybe 150 to 200 people work day to day on the pre-training side, between data, model, infrastructure, and evals. And so coordinating the work of all of these people into something that we can build together is actually quite complicated and takes quite a bit of time, especially to do well. To me this is super important, because actually being able to get progress out of everyone is really what makes us make the most progress, rather than enabling one or two people, or a small group of 10, to run ahead of everyone else. That might work for a short period of time, but over longer periods of time, what's really been successful for us is being able to integrate the work from many, many people.
B
So in terms of your personal background, I'm always curious: where did you grow up? What kind of kid and teenager were you? I'm trying to reverse engineer those top AI researchers. Where do they come from, and how did you become who you are?
A
I grew up a bit all over the place in Europe; I moved around quite a bit. I was actually born in the Netherlands, and I moved when I was seven to Switzerland. My dad is from Switzerland and my mom is from Germany. So I did most of my school and the beginning of my high school in Switzerland, mostly in French and also partly in German. And then at age 15, I think, I moved to Italy, where I finished my high school until I was around 19. At that point I was going to go to ETH in Zurich to do my studies. But just by random events, one morning I looked up the top universities in some kind of ranking, and I saw Cambridge was at the top. So I thought, I'll just apply, why not? And a few months later I got the acceptance letter, so I decided to move to Cambridge, where I did my undergrad and master's in the Computer Lab.
B
And growing up, you were just a super math-strong kind of kid, a computer science kind of kid?
A
My dad has a technical background, so I remember, when I was 10 or 11, starting to program a bit with him and learning, and I always liked that. And math and science always came easily to me at school. I remember never having to really study for math exams but always doing quite well. That definitely changed at university, but that was my high school experience.
B
Great. And what was your path from school into where you are today?
A
Yeah, so again there's a bit of a lucky moment, I would say. One of the lecturers we had in my master's was someone who was also a researcher at DeepMind. And I just remember, at the end of the last lecture, I was packing my stuff and I thought, oh, you know what, I'll just ask him for a referral. What's the risk, right? He might just say no, but whatever. And so I took the courage and went up to him and asked if he would give me a referral. And sure enough, he was like, sure, send me your CV and I'll see what I can do. And that's how I got my interview at DeepMind. This was in 2018. And so I joined DeepMind, at the time just DeepMind, not Google DeepMind, as a research engineer after university.
B
And what did you do at first and how did that evolve to being one of the pre training leads on Gemini 3?
A
Yeah, so at the beginning, having joined DeepMind, and DeepMind being known for RL, the first project I decided to work on was something on the RL side. Specifically, we were training an unsupervised network to learn key points in Atari environments and trying to get the agent to play Atari. I did this for about six months. Maybe it wasn't enough for me, in the sense that I didn't like the synthetic aspect of it. I always wanted to work more on real-world data and have more of a real-world effect. In general I like to build things, and build things that work; I don't really like the academic, pure-research part. And so that drove me to start working on representation learning: training neural networks that have good representations for doing different tasks. One funny anecdote here, something I tell a lot of the people on my team: the first effort I joined on this was called Representation Learning from Real World Data. At the time we had to add this "from real world data" to the name of the project, because otherwise people would assume it was synthetic environments or synthetic data. And that has shifted completely since then. So that was my first project on that side. And specifically on LLMs and Transformers, we were looking at architectures like the Transformer and models like BERT and XLNet that were learning these representations, and trying to improve those representations and do research on that side.
B
Great. And then you worked on Retro, right? Do you want to talk about that?
A
Yeah. So after that we started working on scaling up LLMs, and LLMs in general. We started this work first with Gopher, which is, I think, the first DeepMind LLM paper that was published. So already at that point there was a team of maybe 10 to 12 people, and already at that point it was pretty clear you couldn't just do that research on your own. This is really where I started doing pre-training, and pre-training at scale, and developed my research taste, but also discovered what I enjoy about this. So we trained the first dense transformer model, I think it was 280 billion parameters and 300 billion tokens at that time. We would definitely not do things today the way we were doing them back then, but it was great and a very fun learning experience. After that, there were two projects that emerged: the first one was Chinchilla and the second one Retro. In Chinchilla we were re-examining how you should scale the model size and how you should scale the data, especially from a training-compute-optimal perspective. So the question is: you have a fixed amount of training compute, how do you train the best possible model? Should you increase your model size or should you increase your data size? There was some previous work in this domain, from OpenAI specifically, that we re-examined, and we actually found that you want to scale the data side much more quickly than what was thought before, rather than scaling the model side. Funnily enough, this is still really relevant in our day-to-day work today, especially because it has a lot of implications for the serving cost and how expensive it is to use the models once they're trained. So that was one side. The other line of work was Retro, and this is more on the architectural-innovation side of things. Here we were looking at how you can improve models by giving them the ability to retrieve from a large corpus of text.
So rather than having the model learn and store all the knowledge in its parameters, you give the ability to the model to look up specific things during training, but also during inference.
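The Chinchilla-style trade-off Sebastian describes can be sketched with a toy calculation. This is only an illustration: it assumes the common approximation that training compute is C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rough "about 20 tokens per parameter" rule of thumb, not the paper's actual fitted coefficients.

```python
# Toy illustration of the compute-optimal question: given a fixed training
# compute budget C, how should you split it between parameters N and tokens D?
# Assumes C ~= 6 * N * D and a D = k * N rule of thumb with k ~ 20.

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Return (params, tokens) spending `compute_flops` with D = k * N."""
    # C = 6 * N * D and D = k * N  =>  N = sqrt(C / (6 * k)), D = k * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly a Gopher-scale budget: 280B params * 300B tokens * 6 FLOPs/token.
    c = 6 * 280e9 * 300e9
    n, d = compute_optimal_split(c)
    print(f"params ~{n / 1e9:.0f}B, tokens ~{d / 1e9:.0f}B")
```

Plugging in a Gopher-scale budget (280B parameters, 300B tokens) yields roughly 65B parameters and 1.3T tokens: the same budget spent on a much smaller model trained on much more data, which is the direction of the Chinchilla finding.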
B
You used the word research taste, which I think is super interesting. What does that mean? How would you define that and how important is that for a researcher?
A
Yeah, it's very important these days, and it's quite hard to quantify. But there are a few things that matter. The first one, maybe, is that your research is not standalone. This is what I was mentioning before: your research has to play well with everyone else's research, and it has to integrate. Let's say I have some improvement on the model, but it makes the model 5% harder to use for everyone else. This is probably not a good trade-off, right? Because you're going to slow down everyone else and their research, which would then cumulatively slow down the overall research progress. That's the first thing. The second thing is being allergic to complexity. Complexity is quite subjective, in terms of what people are familiar with, but still, we have a certain budget of complexity we can use, a certain amount of, almost, research risk we can accumulate before things go bad. And so being aware of that and managing it is very important. So oftentimes we don't necessarily want to use the best-performing version of a research idea; we'd rather trade off some of the performance for a slightly lower-complexity version, because we think that will allow us to make more progress in the future. So those are the main two things, I think, around research taste.
B
That's fascinating. And then presumably a part of it has to do with having an intuitive sense for what may work and not work. Given there's only so much compute you can use, is that fair?
A
Yeah, definitely. That's also an important part. I think some people have that much more than others, and a lot of experience really helps. But for sure we are bottlenecked on the research side by compute. If we had a lot more compute, I think we'd make a lot more progress a lot quicker. And so you have to guess to some extent, first, which part of the research tree you want to explore, and then, within that, what are the right experiments. But then also, knowing research, most research ideas fail. And so you need to figure out: at what point have I done enough in this direction to know to move on to something else, or should I keep pushing? And then the other interesting thing, especially in deep learning, is that a negative result doesn't mean something doesn't work. It often means you haven't made it work yet. And so being aware of that as well is quite tricky.
B
Since we're on this topic of research and how to organize a research team to be successful, let's double-click on some of this. You mentioned trade-offs. Presumably one kind of trade-off is short term versus long term. How does that work? How do you all think about this?
A
This is part of what I spend a lot of time thinking about as well. There's always critical-path things to be done, or this part of the model needs improving, or we know this part of the model is suboptimal, so we invest quite a lot in just fixing those immediate things. There's a few reasons for that. The first one is we know this will make the model better, so it's a fairly safe bet. But also we know that things that don't look quite right or quite perfect often tend to have issues later, either when you scale up or when the model just becomes more powerful. And so really being very diligent about tackling those and fixing those is important. So that's the first part. The second part is slightly more exploratory research: ideas that could land in the next version of Gemini, or the version after that, that have maybe a bit of a bigger effect on the model performance but aren't quite validated. How we balance these, I don't think I have a very clear answer. It's also a bit periodical. When we're doing a scale-up, for example, there's often slightly more exploratory research, because there's nothing right now that needs to be fixed in parallel. But just before we are ready to scale up a new architecture or a new model, it's very much: let's de-risk the last pieces. It's very execution-focused.
B
How does that work, in a little bit the same vein, with the tension between research and product? As we were discussing earlier, you all are in this constant race with other labs. So is there maybe some pressure, like, oh no, we need to have a better score, or win IMO, or whatever it is? A very pragmatic, immediate product goal versus stuff that we know is going to improve the model over time. How does that work? I guess it's just a variation of the same theme.
A
This is why I like Google as well. There's actually very little of that, I think, because all of the leadership has a research background. They're very much aware that, yes, to some extent you can force and accelerate specific benchmarks or certain goals, but in the end the progress and making the research work is really what matters. So personally, at least on a day-to-day basis, I never really feel that pressure.
B
How is a team at DeepMind organized? You mentioned pre-training has a couple hundred people, if I heard correctly. Is there then a post-training team? Is there an alignment team? How does everyone work together, at a super high level?
A
So we have a pre-training team and a post-training team. On the pre-training side we have people working on the model, on the data, the infrastructure, and evals as well, which is very important. I think people often underestimate the importance of evals research, and it's actually quite hard to do well. And then, yes, there's a post-training team, and of course there's a large team working on infrastructure and services.
B
All right, thank you for that. Let's switch tacks a little bit and, as promised, go fairly deep into Gemini 3, if you will. So, Gemini 3 under the hood: the architecture, Deep Think, pre-training, data scaling, all those good things. Starting at a high level on the architecture: Gemini 3, as a devoted user, feels very different from 2.5. Was there a big architectural decision that explains the difference? And how would you describe that architecture at a high level?
A
I don't think the architecture has changed that much compared to the previous one. It's more of what I was saying before, where a few different things come together to give a large improvement. At a high level, though, it's a mixture-of-experts architecture, transformer-based. So from that perspective, if you squint enough, you will recognize a lot of the original transformer paper's pieces in it.
B
Yep. Can you describe, to make this educational for people, what an MoE architecture is at a high level?
A
The transformer has two kinds of blocks. There's an attention block, which is responsible for mixing the information across time, so across different tokens. And then there's the feedforward block, which is more about giving the model the memory, but also the compute power, to make its inferences. Those operate on a single token at a time, so they operate in parallel. In the original transformer architecture, this is just a single hidden layer in a neural network. So it's a dense computation, where the input gets linearly transformed into a hidden dimension, you apply some activation function, and that gets linearly transformed again into the output of the dense block. That's the original paper. And then there's a lot of work, from before transformers as well, on mixture of experts. Here the idea is you decouple the amount of compute you use from how many parameters you have. You dynamically route to whichever expert you want the computational power to be used on, rather than having those coupled.
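A minimal sketch of that routing idea, assuming top-1 ("switch"-style) routing. This is a toy with made-up sizes; real MoE layers add load balancing, capacity limits, and shard experts across devices, and none of this reflects Gemini's actual implementation.

```python
import numpy as np

# Toy mixture-of-experts feedforward layer with top-1 routing.
rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4

# Each expert is an ordinary two-layer feedforward block.
w_in = rng.normal(size=(n_experts, d_model, d_hidden)) * 0.1
w_out = rng.normal(size=(n_experts, d_hidden, d_model)) * 0.1
w_router = rng.normal(size=(d_model, n_experts)) * 0.1

def moe_forward(x):
    """x: (tokens, d_model) -> (tokens, d_model). Each token runs ONE expert."""
    logits = x @ w_router                         # router score per expert
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    chosen = logits.argmax(axis=-1)               # top-1 expert per token
    gate = np.take_along_axis(probs, chosen[:, None], axis=-1)  # winner's weight
    out = np.zeros_like(x)
    for e in range(n_experts):                    # only the chosen expert computes
        mask = chosen == e
        if mask.any():
            h = np.maximum(x[mask] @ w_in[e], 0.0)   # ReLU hidden layer
            out[mask] = (h @ w_out[e]) * gate[mask]
    return out

tokens = rng.normal(size=(5, d_model))
print(moe_forward(tokens).shape)  # (5, 8)
```

The key point is the decoupling: total parameters grow with the number of experts, while per-token compute stays roughly that of a single expert.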
B
Gemini is natively multimodal. In practical terms, what does that actually mean for the model to think about text, images, or videos?
A
Yeah, what this means is that there's no specific model trained to handle images and a different model trained to handle audio, a different model trained to handle text. It's the same model, the same neural network that processes all these different modalities together.
B
Presumably there is a cost aspect to this. Does being natively multimodal mean you're more expensive from a token perspective?
A
Yeah, this is a really good question. There's kind of two costs to this. I would say the benefits largely outweigh the costs here, and this is why we train these models. But the first cost is maybe less obvious to people: it's a complexity cost, this research taste bit I was talking about. Because you're doing a lot more things, and especially because the different modalities interact in some ways, this can interact with different parts of the research, and it has a complexity cost. So we have to spend time thinking about these things. The second cost is, yes, images are often larger in terms of input size than pure text, and so the actual computational cost, if you do it naively, is higher. But of course, then there's interesting research to be done on how you make these things efficient.
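A back-of-the-envelope sketch of why images are larger in input size, assuming a ViT-style patchifier that turns an H×W image into one token per P×P patch. The resolution and patch size here are illustrative assumptions, not Gemini's actual values.

```python
# Naive image tokenization: one token per patch x patch tile of the image.
def image_token_count(height: int, width: int, patch: int = 16) -> int:
    """Tokens produced by splitting an image into non-overlapping patches."""
    return (height // patch) * (width // patch)

if __name__ == "__main__":
    # A single 896x896 image at 16-pixel patches yields thousands of tokens,
    # versus a few dozen tokens for a typical short text prompt.
    print(image_token_count(896, 896))  # 3136
```

Since attention cost grows with sequence length, this is where the "naive" computational cost comes from, and why research on compressing or subsampling visual tokens matters.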
B
All right, let's talk about pre-training, since it's the area that you cover in particular. Starting with the high-level question: we mentioned the term scaling laws towards the beginning of this conversation, and we talked about Chinchilla a few minutes ago as well. In 2025 there was this much-discussed theme of the death of scaling laws, particularly for pre-training. Is Gemini 3 the answer that shows that all of this is not true, and that indeed the scaling laws are continuing?
A
Yeah, those discussions always seemed slightly strange to me, because my experience didn't match them. What we've seen is that scale is a very important aspect of pre-training specifically, and of how we make models better. What's been the case, though, is that people overvalued that aspect. It is a very important aspect, but it's not the only one. Scale will help make your model better, and what's nice about scale is that it does so fairly predictably. That's what the scaling laws tell us: as you scale the model, how much better will the model actually be? But this is only one part. The other parts, architecture and data innovation, also play a really, really important role in the performance of pre-training, probably even more so than pure scale these days. But scaling is still an important factor as well.
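The "fairly predictably" part is what makes scaling laws useful in practice: if loss follows a power law in scale, you can fit small runs and extrapolate to bigger ones. A toy sketch with synthetic numbers (the exponent and constant are made up, not real measurements):

```python
import math

# Synthetic power-law loss curve: L(N) = a * N**(-b). In log-log space this
# is a straight line, so a least-squares fit on small runs recovers b and
# lets you extrapolate to larger model sizes.
a, b = 10.0, 0.07
sizes = [1e7, 1e8, 1e9]                    # "small" training runs
losses = [a * n ** (-b) for n in sizes]

xs = [math.log(n) for n in sizes]
ys = [math.log(l) for l in losses]
k = len(xs)
slope = (k * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
        (k * sum(x * x for x in xs) - sum(xs) ** 2)
intercept = (sum(ys) - slope * sum(xs)) / k

# Extrapolate the fitted line to a 10B-parameter run.
pred_10b = math.exp(intercept) * (1e10) ** slope
print(f"fitted exponent {-slope:.3f}, predicted loss at 10B: {pred_10b:.3f}")
```

Real scaling-law fits are messier (noise, multiple variables like data and compute, curvature terms), but the workflow is the same: measure cheap runs, fit, extrapolate.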
B
Right. And we're talking about pre-training specifically, right? Because this year we seem to have scaled RL in post-training, and scaled test-time compute, all the things. But for pre-training, you're saying that not only are scaling laws not slowing down, but you see some acceleration, due to data and different architectures. Do I understand this correctly?
A
I think the way to put this is that these all compound. Scale is one axis, but the model and data work also makes the actual performance better. And yes, sometimes the innovation part outweighs the benefits of scaling more, and sometimes just raw scaling is the right answer to make the model better. So that's on the pre-training side. And yes, on the RL and RL-scaling side, I think we're seeing a lot of the same things we saw in pre-training. What's interesting here is that because we have the experience of pre-training, a lot of the lessons apply and we can reapply some of that knowledge to RL scaling as well.
B
Speaking of data, what is the pre-training data mix on Gemini 3? I think you guys had a model card out that talked about some of this. So what went into it?
A
Yeah, it's a mix of different things. So the data is multimodal from the ground up. And yeah, there's many different sources that go into this.
B
Another classic question in this whole discussion is: are we about to run out of data? There's always the question of whether we have enough compute, and the other question is whether we have enough data. Clearly there's been a rise in the usage of synthetic data this year. In your day-to-day work, or perhaps in general, where do you think synthetic data helps, and where does it not help?
A
Yeah, so synthetic data is interesting. You have to be very careful in how you use it, because it's quite easy to use it in the wrong way. What's often the case with synthetic data is that you use a strong model to generate the synthetic data, and then you run smaller-scale ablations to validate the effect of the synthetic data. But one of the really interesting questions is: can you actually generate synthetic data such that a model you want to train in the future will actually be better than the model that generated the synthetic data in the first place? Can you actually make that one better as well? We spend a lot of time thinking about this and doing research in this direction. The other part of your question: are we running out of data? I don't think so. We are definitely working on that as well. But more than that, I think what might be happening instead is a shift in paradigm. Before, we were scaling in the data-unlimited regime, where data would scale as much as you would like, and we're shifting more to a data-limited regime, which actually changes a lot of the research and how we think about problems. One good analogy: before LLMs, a lot of people were working on ImageNet and other benchmarks, and that was a very, very data-limited regime as well. So a lot of techniques from that time start to become interesting again.
B
And perhaps that's one of those. I don't know to what extent you can talk about it; if not, talk about it in general. But there is this concept throughout the industry of training models on reasoning traces: basically forcing the model to show its work, how it got to a certain outcome, and then taking that to train the next model. Is that something that you do, or that you think is interesting, or a future direction? What is your perspective?
A
Yeah, unfortunately I can't comment on the specifics.
B
This is how I know I'm asking the right questions. But maybe in general, is that something that's in there?
A
I believe so. And this also falls into the previous question around synthetic data you were asking; our approach to that is similar.
B
Perhaps with that, taking this into a more futuristic conversation: another big question and theme seems to be, indeed, how can models learn from less data? Which I think is what you were alluding to, talking about a data limited regime. Again, whether at DeepMind or in general, are you seeing interesting approaches? To use the famous analogy, can a model learn like a child does?
A
Just to maybe clarify what I said earlier: in a data limited regime, I didn't necessarily mean with less data, but rather with a finite amount of data. So the paradigm shift is more from "we have infinite data" to "we have a finite amount of data." The second point is that, in some sense, model architecture research is exactly what you mentioned. When you make an improvement on the model architecture side, what it typically means is you get a better result if you use the same amount of data to train the model; but equivalently, you could get the same result as the previous model by training on less data. So that's kind of the first aspect of that. But it is true that the volume of data needed today is still orders of magnitude higher than what a human has available. Of course there's the whole evolution process as well, and I find these high level discussions quite hard to follow, because you have to make so many assumptions to convert that amount of data into what today's pre training data is. But at least at first order, it does seem like we're using a lot more data than humans do.
B
What other directions in overall pre training progress are you excited about throughout the industry?
A
Yeah, I think the one thing is, in Gemini 1.5 we had a really good leap in the long context capabilities of the model, and I think that's really enabling the ability of models and agents today to do this work where you have maybe a code base and you do a lot of work on it, so your context length really grows. I think there's going to be a lot more innovation on that side in the next year or so, to make long context more efficient, but also just to extend the context length of models themselves. So on the capabilities front, I think that is something where pre training specifically has a lot to offer, and it is very interesting. Relatedly, for us at least on the attention side, we've made some really interesting discoveries recently that I think will shape a lot of the research we do in the next few months, and I'm personally very excited about that. Again, I want to emphasize the point I made towards the beginning: the way things work is that it's really a culmination of many different things. There's a lot of small and medium sized things that we can already see coming up, where I think we fixed this issue, we fixed this bug, this is interesting research that shows promising things, and all of these things coupled, I think, will drive a lot of the progress.
B
Again, it's interesting thinking about Retro, which we talked about a bit earlier. You're a co author of Retro, which was about efficiency and smaller models doing more, and now you are in the world of Gemini 3, which is massive amounts of data, training, and very long context windows. Do you think that this paradigm of having larger models and large context windows effectively obviates the need for RAG and search, and that everything gets folded into the model? Obviously there's a corporate data part, but in general there are some interesting questions here.
A
So first of all, I think Retro was really about retrieving information rather than storing it, not necessarily about making models smaller. It's about how we can use the model to do more reasoning, in a pre training sense of reasoning, rather than just store the knowledge. So this is still very much the case today. The interesting part is that the iteration cycle of pre training used to be a lot slower than that of post training, until fairly recently. Making these large changes on the pre training side is quite costly in terms of risk and how long it takes. And then you have approaches like RAG or search, which you can do during post training and iterate much more quickly on, and which give very strong performance as well. Deep down, I do believe that the long term answer is to learn this in a differentiable, end to end way, which probably means during pre training, or whatever that looks like in the future: learn to retrieve as part of training, and learn how to do search as part of the larger part of training. I think RL scaling maybe starts that process, but there's a lot more to do, also on the architecture side; this is something we'll see in the next few years and not immediately, I would say. The one thing I want to highlight is that people often talk about model architecture, and that's definitely one part of what makes pre training better. But there are other parts as well, like infra, data, and evals specifically, that don't always get the same mention. Evals specifically are extremely hard, and even harder in pre training, I would say, because there are these two gaps you need to close. On the one side, the models we train regularly are much smaller and less powerful than when we scale up. That means the eval has to be predictive of the large model's performance; it has to still work for the large model and point in the right direction. So it has to be a good proxy on that side.
And then there's a second gap as well, which is that when we evaluate pre trained models, there's a post training gap. The models don't just get used after pre training; there's more training happening after. And so the evals we use on pre trained models have to be good proxies of what happens after as well. So making progress on evals is really important and quite hard, and it has also driven a lot of the progress we have in terms of being able to measure what an actual improvement is on the model or on the data side.
B
And evals at DeepMind, that's all internally built? Like you have your own set of evals?
A
Yes, to a large extent, and more and more so, because what we found is that with external benchmarks, you can use them for a little while, but very quickly they become contaminated. They start to be replicated on different forums or different parts of the web, and then if we end up training on those, it's really hard to detect leaked evals. So the only way you really have to protect against cheating yourself, and thinking you're doing better than you are, is by actually creating held-out evals and really keeping them held out.
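The contamination problem described here is commonly attacked with simple n-gram overlap checks between training documents and eval items. A minimal sketch follows; the function names and the 8-word window are illustrative choices, not a description of DeepMind's actual decontamination pipeline.

```python
def ngrams(text, n=8):
    # Word-level n-grams; 8-grams are a common decontamination unit.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc, eval_item, n=8):
    # Flag an eval item if any n-gram of it appears verbatim in training data.
    return bool(ngrams(train_doc, n) & ngrams(eval_item, n))

train = "the quick brown fox jumps over the lazy dog near the river bank today"
eval_q = "quick brown fox jumps over the lazy dog near the river"
print(is_contaminated(train, eval_q))   # True: verbatim 8-gram overlap
print(is_contaminated("completely unrelated training text about cooking pasta at home tonight", eval_q))  # False
```

In practice this runs at web scale with hashing and fuzzier matching, which is part of why, as Bourgeau notes, paraphrased leaks on forums remain hard to catch; held-out internal evals sidestep the problem entirely.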
B
In the same vein, is alignment a part of what you all think a lot about at the pre training level, or is that more of a post training kind of conversation, or both?
A
It's a majority post training, I would say, but there are definitely some parts of it which are relevant to pre training. I can't go into too many details here, but some parts are relevant to pre training, and we do think about that as well.
B
And at a very simplistic level, I always wonder, again in the context of Gemini or otherwise: if the core data set is the Internet, there are a lot of terrible things on the Internet. Is alignment 101 that there's stuff you just do not include in the model?
A
This is an interesting question, and I don't think I have a definitive answer. You don't want the model to do these terrible things, but at a fundamental level you do need the model to know about those things. So you have to train at least a bit on those, so that it knows what those things are and knows to stay away from them. Right? Otherwise, when a user mentioned something terrible, the model wouldn't even know what it's talking about and might not be able to say this is something terrible.
B
Right. Let's talk about Deep Think, the thinking model that was released a few days after Gemini 3. First of all, is that a different model, or is that part of the same model? How should one think about it?
A
I'm not allowed to; I can't comment too much on specifics.
B
What happens when the model thinks and you wait for 10 seconds or 20 seconds or whatever time? What happens behind the scenes?
A
Yes. I think this has been covered quite a bit in some of your previous podcasts as well. It's about generating thoughts rather than just doing compute in the depth of the model; you also do compute and allow the model to think more on the sequence length side of things. So the model actually starts to form hypotheses, test hypotheses, invoke some tools to validate the hypotheses, do search calls, et cetera, and then at the end is able to review the thought process to provide a definite answer to the user.
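The hypothesize-then-verify loop described here can be caricatured in a few lines. This is purely illustrative: the hard-coded "thought" and the toy calculator tool are invented for the sketch, and no production thinking model works this simply.

```python
def toy_calculator(expr):
    # Stand-in for a tool the model can invoke to validate a hypothesis.
    # Toy only; never eval untrusted input in real code.
    return eval(expr, {"__builtins__": {}})

def answer_with_thinking(question):
    # Hypothetical deliberation loop: form a hypothesis, test it with a
    # tool call, then distill the thought process into a final answer.
    thoughts = []
    thoughts.append("hypothesis: 17 * 23 should be around 400")
    result = toy_calculator("17 * 23")   # tool call to check the hypothesis
    thoughts.append(f"tool result: {result}")
    return thoughts, f"17 * 23 = {result}"  # summarized answer for the user

thoughts, final = answer_with_thinking("What is 17 * 23?")
print(final)  # 17 * 23 = 391
```

The point of the sketch is the shape of the computation: spending extra sequence-length compute on intermediate thoughts and tool calls before committing to an answer, rather than answering in a single forward pass.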
B
The industry has normalized around that paradigm of chain of thought.
A
That's for sure, yeah.
B
Can you talk a little bit about the agentic part of this and Google Antigravity? What do you find interesting about it? What should people know about it?
A
Yeah, this is, I guess, what I was mentioning before around my own work especially; I think that's interesting. A lot of the work we do on a day to day basis is more execution based: babysitting experiments, et cetera. And I think this is where I at least see the most impact from those. Bringing it back to the topic of pre training, I think the perception and vision side is very important for this, because now you're asking models to interact with computer screens. So being able to do screen understanding really, really well is critical, and that's an important part on the pre training side at least.
B
And in Antigravity there's a whole vibe coding aspect, truly vibes, in that you don't even really see what happens when you ask. Same question: is that a pre training thing? Is that just a post training thing? How do you build vibes into a model?
A
Yeah, this is interesting. I think you can probably ask five different researchers and you'll get five different answers. There's also this notion of a "large model feel," as people call it; GPT 4.5 historically had some of this, presumably, where larger models maybe feel different. I wouldn't put it in those terms specifically, but I think vibes comes down to this, and actually pre training probably plays a larger role than post training today in some of that, in how the model feels in general. For vibe coding specifically, I think that's maybe more of an RL scaling and post training thing, where you can actually get quite a lot of data and train the model to do that really well.
B
So zooming out a little bit, maybe for the last part of this conversation, I'm curious about where things are going in general. There was a key theme discussed at NeurIPS this year around continual learning, and I'm curious about your perspective, especially from a pre training perspective, because we are in this paradigm where every few months or years we, and by we I mean you, train a very large new base model. First of all, what is continual learning? And two, how does that impact pre training, if continual learning becomes a thing?
A
Yeah, I guess continual learning is about updating the model with new knowledge as new knowledge is discovered. Let's say a new scientific breakthrough is made tomorrow; the base model we trained yesterday wouldn't actually know about it from its pre training. First, I think a lot of progress has been made on this front in the last few years, mostly around post training, around search: models use search tools and make search calls, and then they have access to that new information in some sense. This is also what Retro, which we talked about, was doing: retrieving data and trying to externalize the knowledge corpus from the reasoning part. The second part, on the pre training side specifically, is what I was mentioning around long context as well. One way of doing this is, if you can keep expanding the context of the user, the model keeps getting more and more information in that context, and so you kind of have this continual learning aspect as part of that. But then of course there's more of a paradigm shift, and maybe this is what people discuss: can you change the training algorithm such that you can continuously train models on a stream of data coming from the world?
B
Basically, beyond continual learning, what do you think is hot, interesting, or intriguing in current research today?
A
Yeah, again, there's a lot of small things right now that accumulate; that's kind of the first thought that comes to my mind, and that historically has really driven progress, so I wouldn't bet against that continuing to drive progress. The things I mentioned before around long context architecture and long context research are one aspect, on the attention mechanism as well on the pre training side. And then this paradigm shift from infinite data to the limited or finite data regime is something, as well, where I think a lot of things will change, and there's a lot of interesting research on the pre training side alone. The other side, which is quite interesting today, is that the number of people using these models is growing quite rapidly. So more and more, what we have to think about on the pre training side as well is how expensive the model is to use, to serve, to have really deployed at a large scale, and what things on the pre training side specifically we can do to make this model have better quality and maybe be cheaper to serve and consume fewer resources during inference.
B
For any student, or PhD student, listening to this: if they want to become you in a few years, what problems do you think they should think about or focus on? Not just a year or two out, but more interesting, sort of a few years out.
A
One thing that's becoming increasingly important is being able to do research while being aware of the systems side of things. We are building fairly complicated systems now, so being able to understand how the stack works all the way down, from TPUs to research, is kind of a superpower, because then you're able to find these gaps between different layers that other people weren't necessarily able to see, but also to reason through the implications of your research idea all the way down to the TPU stack. People that can do that well, I think, have a lot of impact in general. So in terms of specialization, it's really thinking about the research engineering and systems aspects of model research, and not just pure model architecture research. That's one. Personally, I still have a lot of interest in this retrieval research as well, which we started with Retro, and I think it wasn't quite ripe until now. But things are changing, and I just think it's not unreasonable to think that in the next few years something like that might actually become viable for a leading model like Gemini.
B
And why was it not ripe and why may that change?
A
I think that's around the complexity side of things I was mentioning, and also the fact that, for all the capabilities it brings, you can iterate much more quickly in post training. What I was saying with search and post training: you can give very similar capabilities to the model in a much simpler way. And as post training grows and RL scaling grows as well, maybe that shifts again towards more on the pre training side.
B
Do you think there are areas of AI right now that are over invested in where there's a disconnect between what makes sense and where the industry is actually going and investing dollars in?
A
I think it's gotten a lot better. Maybe two years ago, what I was seeing is that people were still trying very much to create specialized models to solve tasks that were maybe within half a year or a year of reach of generalist models. I think people have caught up to that much more, and now kind of believe that for generalist tasks, or tasks which don't require extremely specialized models, trying to use a generalist model, maybe not the current version but the next version, might be able to do that. What that means is that research in terms of how you use models, the harness, et cetera, is becoming increasingly important, and also how you make models and these harnesses more robust to making errors and able to recover from such errors.
B
Yeah. In that vein, do you have any advice or recommendations for startups? Seen from the perspective of a founder, or the VCs who love them, there is this feeling that the base models are becoming ever more powerful and are trained on more and more data sets. It used to be that the model was able to converse, but now it's able to do financial work and cap tables and that kind of thing, which seems to shrink the area of possibility for startups. Do you have thoughts on that?
A
Yeah, I think so. Maybe look at what models were able to do a year or a year and a half ago, then look at what more models are able to do today, and try to extrapolate that. The areas where the models are improving, I think, will continue to improve. And then there are maybe some areas where there's not been that much progress, and those might be more interesting areas to do research. I don't really have a specific example in mind right now, but that would be the general advice.
B
What are you excited about for the next year or two in terms of your personal journey?
A
What I like very much about my day to day is working with many people and being able to learn from a lot of researchers. That's what drives me to a large extent: every day I come to work and I talk to really, really brilliant people, and they teach me things that I didn't know before. So I really like that part of my job. It's what I've said multiple times at this point, but there are just so many different things that will compound, and different things where there's headroom to improve. I'm really curious, because right now I don't really see an end in sight for that kind of line of work to continue giving us progress. So actually being able to see this through and see how far this can take us is really interesting, at least for the next year or so. I don't see this slowing down in any way.
B
Great. Well, that feels like a wonderful place to leave it. Sebastian, thank you so much for being on the pod. Really appreciate it. That was fantastic. Thank you.
A
Thank you, Matt.
B
Hi, it's Matt Turk again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from. This really helps us build the podcast and get great guests. Thanks, and see you at the next episode.
Episode: DeepMind Gemini 3 Lead: What Comes After "Infinite Data"
Date: December 18, 2025
Guest: Sebastien Bourgeau, Pretraining Lead for Gemini @ Google DeepMind
Host: Matt Turck
This episode dives deeply into the making and implications of Google DeepMind’s Gemini 3, one of the most advanced AI models to date. Sebastien Bourgeau, pretraining lead for Gemini 3, provides inside perspectives on team structure, research trends, the shift from “infinite data” to “data limited” regimes, architectural decisions, synthetic data, future research directions, and his own storied journey into the heart of AI innovation. The conversation is both technical and accessible, reflecting on what makes leading-edge AI work—and where it’s headed next.
Timestamp: 00:58 – 03:04
Not One Trick, But Many "Knobs":
Bourgeau emphasizes that Gemini 3’s improvement over previous models isn’t the result of a single radical change, but instead comes from the aggregation of numerous small and medium advancements, all delivered by a large, coordinated team.
“It’s really a culmination of many, many changes and many, many things from a very large team that actually makes Gemini 3 so much better than the previous generations…” (01:31)
From ‘Model’ to ‘System’:
The distinction is made between building AI models and now constructing AI systems, reflecting the complexity of modern state-of-the-art work.
Timestamp: 03:04 – 04:36
“...the amount of time people spend using the model to make themselves more productive internally is increasing over time. Every new generation...can do new things and help us in our research...much more so than the previous generation.” (03:35)
Timestamp: 04:36 – 06:35
Faster Than Expected:
Bourgeau openly admits that the current progress has exceeded his own and peers’ expectations from a few years ago.
“If I’m being honest with myself, I think we’re ahead of where I thought we could go.” (04:56)
Looking Forward:
He anticipates “large scientific discoveries” by AI models within the next few years and is excited about models aiding and accelerating both research and engineering.
Timestamp: 06:35 – 07:45
Timestamp: 07:45 – 10:19
Common and Divergent Paths:
Labs train on similar architectures (e.g., Transformers), but specialize in different research branches. DeepMind, for instance, excels in multimodal and vision capabilities.
Team Scale:
Building a leading model now requires hundreds of researchers, not just small teams; nevertheless, there's potential for surprising disruption from smaller teams if innovation reduces the resource demand.
Timestamp: 10:51 – 12:01
Research ↔ Engineering Blur:
At Google DeepMind, the traditional boundaries between research and engineering have dissolved, with large-scale, reliable infrastructure now inseparable from cutting-edge research.
Own Chips, Own Infra:
Gemini 3 was trained on TPUs, not Nvidia chips, reflecting true vertical integration.
Timestamp: 12:01 – 13:39 & 25:51 – 26:13
Large, Highly Coordinated Teams:
150–200 people work on Gemini pretraining alone, across data, models, infrastructure, and evaluation. Bourgeau’s job is as much about integration and team enablement as it is about research design.
Team Structure:
Organization into pretraining, post-training, and alignment teams; internal evaluations are crucial and growing in sophistication.
Timestamp: 13:39 – 18:06
Cross-European Upbringing, Serendipitous DeepMind Entry:
Moved from the Netherlands to Switzerland and Italy; chose Cambridge on a whim; joined DeepMind via a fortuitous referral after university.
Early Projects:
Started in reinforcement learning but quickly pivoted to work with real-world data and large language models (LLMs)—motivated by practicality and impact.
Representation Learning & Big Model Scaling:
Participation in Gopher, Chinchilla (the critical finding that data scale is more important than previously thought), and Retro (architectural innovation enabling retrieval-augmented models).
Timestamp: 20:29 – 22:05
Integration Over Standalone Success:
Research ideas must play well with others; robust progress is driven not by maximizing individual benchmarks but by system cohesion and simplicity.
“Your research has to play well with everyone else’s research and has to integrate.” (20:40)
Allergic to Complexity:
Preference for lower-complexity, maintainable improvements even at a small cost to immediate performance, anticipating long-term progress.
Timestamp: 33:07 – 36:16
Synthetic Data’s Limits & Shifting Paradigm:
While synthetic data is researched heavily, its effectiveness is nuanced. The big change is cultural: models are now leaving the “unlimited data” world for a “finite data” regime, affecting research strategies.
“...kind of a shift in paradigm, where before we were kind of scaling in the data unlimited regime and we’re kind of shifting more to a data limited regime, which actually changes a lot of the research...” (33:35)
Model Advances Now Drive Efficient Learning:
Model architecture now aims to either “do more with less” or get the same quality with less data; nonetheless, there’s still a gap between the data hunger of LLMs and how children learn.
Timestamp: 26:50 – 30:03
MoE at the Core:
Gemini 3 has a transformer-based mixture-of-experts architecture, dynamically routing computation across experts for efficiency and power.
Native Multimodality:
All modalities (text, images, audio, video) are processed by the same model. This increases complexity and compute cost, but the unified approach is deemed critical for advanced capabilities.
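The top-k expert routing summarized above can be illustrated with a toy sketch: a router scores the experts, only the k highest-scoring ones run, and their outputs are mixed by renormalized gate weights. All names and numbers here are invented for illustration; real MoE layers operate on token tensors with a learned gating network.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, router_scores, k=2):
    # Top-k routing: pick the k experts with the highest router score,
    # run only those, and mix outputs by renormalized gate weights.
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    gates = softmax([router_scores[i] for i in top])
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy experts: each is just a scalar function of the input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x]
# Router scores would come from a learned gating network; fixed here.
out = moe_forward(3.0, experts, router_scores=[0.1, 2.0, 1.0], k=2)
print(out)  # weighted mix of experts 1 and 2 only; expert 0 never runs
```

The efficiency claim in the summary comes from exactly this property: compute scales with k, not with the total number of experts, so total parameter count can grow without proportionally growing per-token FLOPs.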
Timestamp: 30:03 – 32:42
“Scale is a very important aspect ... but...architecture and data innovation...probably even more so than pure scale these days.” (30:42)
Timestamp: 33:07 – 35:39 & 39:21 – 41:48
Synthetic Data Usefulness and Traps:
Using strong models to generate synthetic data for smaller models is common, but generating synthetic data that improves a future stronger model is an open challenge.
Internal Evals are Crucial:
External benchmarks quickly become “contaminated” and must be replaced with carefully held-out, internally created evals to maintain reliable measurement.
Timestamp: 37:18 – 38:40 & 48:17 – 49:32
Long Context Windows:
Increasing model context length allows for handling more complex tasks (e.g., multi-file codebases).
Attention Mechanism Advances:
Recent breakthroughs on the attention side are anticipated to shape coming research.
End-to-End Retrieval and Search:
The future likely lies in making retrieval part of the differentiable pretraining process, not just a post-training add-on—“learning search” within the model, though this is still an emerging area.
Timestamp: 42:28 – 43:32
“At a fundamental level you do need the model to know about those things. So you have to train a bit, at least on those, so that it knows ... to stay away from those.” (43:07)
Timestamp: 43:45 – 46:26
Model “Thinking” Now Explicit:
New research (e.g., Deep Think, agentic systems) allows models to explicitly generate "thoughts," hypotheses, and tool calls before answering, moving beyond simple token streaming to a more deliberative, agentic mode.
Vibes & Subjective Feel:
The elusive concept of “Vibes” (model “feel” or personality) is influenced more by pretraining than post-training, though opinions differ widely.
Timestamp: 46:26 – 49:32
Continual Learning:
Desire to update models incrementally as world knowledge shifts—currently best approached via retrieval and long-context, with true continual/streaming pretraining as a research frontier.
Cost & Efficiency at Deployment:
With usage exploding, serving costs and inference efficiency are chief concerns for pretraining teams as well.
Timestamp: 49:32 – 53:35
System-Minded Researchers Have a 'Superpower':
The biggest skill gap is blending research acumen with systems (hardware, infra) knowledge.
Beware Niche Models... Generalist Models Are Catching Up:
Many tasks that required specialized models are rapidly being absorbed by ever more capable generalist models.
Timestamp: 53:40 – 54:28
On Team Collaboration:
“Being able to get progress out of everyone is really what makes us make the most progress, rather than enabling maybe one or two or a small group of 10 people to run ahead of everyone else...” (12:26)
On Research Taste:
“Being allergic to complexity...we have a certain budget of complexity we can use...so oftentimes, we don’t necessarily want to use the best performance version...but rather trade off some of the performance for a slightly lower complexity version because we think that will allow us to do more and more progress in the future.” (20:40)
On AI Acceleration:
“If we had a lot more compute, I think we’d make a lot more progress a lot quicker.” (22:05)
On Fast-Changing Industry:
“Now, kind of believe that for generalist task or tasks which...don’t require extremely specialized models, trying to use a generalist model...the next version might be able to do that.” (51:48)
On the Continuing Journey:
“There are just so many different things that will compound and different things where there’s headroom to improve. I’m really curious because right now I don’t really see an end in sight...” (53:40)
This is a must-listen episode for anyone interested in how cutting-edge AI models are truly built. Bourgeau provides rare, grounded insights into the realities of large-team, system-level research, the direction of the broader AI field, and the enduring importance of research taste, simplicity, and systems engineering. Gemini 3 is presented not as a stroke of genius, but as a testament to relentless, cumulative progress, a reminder that in AI, the future will likely be built one carefully considered tradeoff at a time.