
A
You need to reach this level of reliability to really make any of these AI tools very useful. And I think we just crossed that probably December last year, at least at OpenAI. Now we can trust these models to do a lot of the work that we are doing. The last few months have been pretty wild. We moved from competitions to usefulness, to users, and that's what we are feeling right now. I think most of the time the bottleneck is the last mile. There will always be a lot of space left for this last mile in different verticals, and I would highly encourage people to continue working on that.
B
Hi, I'm Matt Turk, welcome to the MAD podcast. My guest today is Jan Dubois, who co-leads the Post-Training Frontiers team at OpenAI. The recent release of GPT 5.5 was yet another major milestone in AI, and Jan's team helped build it alongside OpenAI's prior top reasoning models, including o3 and GPT 5 Thinking. Before OpenAI, Jan was at Stanford, where he co-authored Stanford Alpaca, the landmark project that kicked off much of the modern post-training research community. In this episode, we go deep on what's actually new in GPT 5.5, why reinforcement learning is moving from math and coding competitions into messy real-world work, why AI progress can feel like a sudden step function, and why continual learning remains one of the big unsolved problems in AI three years after ChatGPT. Please enjoy this fantastic conversation with Jan Dubois. Hey Jan, welcome.
A
Hi Matt, thanks for having me.
B
It's been another wild few weeks in the world of AI, with the release of GPT 5.5 and of the Claude Mythos preview. So it feels like we have unlocked yet another step function in progress, particularly in cybersecurity and agentic coding. What's the best way to think about this from your perspective? Are things accelerating? What is happening?
A
Yeah, the last few months have been pretty wild. Internally, we also really feel it, and I think anyone who's coding, basically, is really feeling it right now. I think that's really because of three reasons. The first one is, even though in my mind the progress is actually pretty continuous, you need to reach this level of reliability to really make any of these AI tools very useful. And I think we just crossed that probably December last year, at least at OpenAI. That's where I thought we really crossed that threshold where now we can trust these models to do a lot of the work that we are doing. So it feels like a step function, even though I think in terms of capability it's actually pretty continuous. So that's the first thing. The second reason is, once you start having models that are really good, you accelerate yourself, especially in terms of coding, given that we all code internally. You accelerate yourself both by having these models train the other models and by having them build the tooling that we need as researchers to do our job. And all this acceleration, I think, means that we saw these last few months going faster and faster. The third thing that I think we are feeling is that all of last year, we really built these reasoning models and we were really pushing a lot on reinforcement learning. And initially, when we had o1, o1-preview, even o3, these models were still optimized for what we call verifiable rewards: things where we actually have access to ground truth and it's easy to test whether you're correct or not. That is, for example, the case in math questions or coding competitions. And what I think we are realizing now is that we were able to take many of the tools that we built for these verifiable reward cases and use them more generally for reinforcement learning on real use cases. And I think that's really why we're feeling that right now in just real-world coding rather than competition.
So we moved from competitions to usefulness, to users. And that's what we are feeling right now.
B
Okay, fascinating. So we're going to unpack a lot of this, particularly on the RL side. On the first thing that you mentioned, reliability: is that engineering, or is that the models? What makes a model reliable in the way you meant it?
A
It's a little bit of everything. But in general, given that these are agentic models, you can think about it like this: every two minutes there's a certain probability that they are wrong. The longer they run, the higher the probability that the final answer is going to be wrong. So it's just something inherent in agentic models. And what we've been pushing a lot on is decreasing this probability of being wrong every two minutes. That's purely from a model point of view. Of course, there's a lot of reliability work that is also happening on the applied side, and the team at OpenAI has been doing an amazing job on that. But I'm talking only about the reliability of our models, and making sure that we decrease the probability of being wrong.
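The compounding effect described above can be sketched numerically. This is a minimal illustration, assuming each two-minute interval fails independently with a fixed probability; the function name `success_probability` and all the numbers are hypothetical and are not OpenAI figures.

```python
def success_probability(per_step_error: float, minutes: float, step_minutes: float = 2.0) -> float:
    """Probability that a run finishes with no error.

    Assumption: every `step_minutes` interval fails independently
    with probability `per_step_error`.
    """
    steps = int(minutes // step_minutes)
    return (1.0 - per_step_error) ** steps


# Compare two hypothetical per-step error rates over runs of different lengths.
for error in (0.05, 0.01):
    for minutes in (10, 60, 240):
        p = success_probability(error, minutes)
        print(f"error/step={error:.2f}  run={minutes:>3} min  ->  success={p:.3f}")
```

Under these assumptions, a small drop in the per-step error rate matters much more for long runs than for short ones, which is one way to read the "threshold" feeling described earlier in the conversation.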
B
Great. So 5.5, formerly known as Spud, was, as mentioned, a big deal, is a big deal. And I'm just curious from the inside, what are you guys the most proud of? What did you find the most challenging? Give us some color on how you all felt releasing this.
A
We're all really excited by 5.5, to be honest. It is one of those models that everyone in the company was extremely involved in building, and I think that we really feel it now. We got a lot of attention because of 5.5, and it seems like all the stars were aligned. That doesn't always happen. And it was just a great model for this. So we did feel it. It's kind of funny, because in general, with every model that is looking really good early on, we all get really excited about it. And then there's tons of doubts that start coming up, because it's like, oh, everyone is hyping this thing internally, but actually it's bad at all these other things. And then there's another wave where people start under-hyping it, and it kind of goes through waves. And it depends when we actually ship it, how people feel about it internally. But that's true of most models that we have. So 5.5 was not that different in this case, but it definitely maybe had a higher amplitude of the wave. So people were very excited, then not as excited, and then we shipped it and people were happy externally.
B
How long does that process take, including the waves of going up and down and of excitement? I guess it depends on the release and the importance of each release. But is that a few weeks? Is it a few months?
A
It really depends. I can't talk exactly about what went into 5.5, but it kind of depends on which part of the pipeline is training which part of the model. So we really have different sub-teams: you have pre-training, you have the mid-training stage, and you have post-training. And usually the closer you get to the product, with post-training being the last one, the faster the iteration cycle is, and the more upstream you are, the slower the iteration cycle is. So it could go from, let's say, months to days, basically.
B
5.5 was particularly good on agentic coding, computer use, knowledge work, and early scientific research. How does that work internally? Do different people focus on those different parts? How do you get to that result?
A
Yeah, we definitely have different teams that are working on specific use cases and are pushing on these use cases. My team specifically is actually the one that is kind of taking all these vertical improvements and trying to put them together in the final model. You could see it as a team that is doing two things. One is kind of the smoothing function: you have all these improvements, but you need to make sure that the model doesn't feel too spiky, doesn't feel different on different verticals. And you also need to have some teams that are working on all the horizontal improvements, and that's basically what my team is doing. So there are many things like instruction following, function calling, or thinking about how long a model should think for on different problems. Those are very horizontal, and they kind of impact all these use cases. So we have both these more vertical teams and these more horizontal ones, and both are very important to improve the model. And the good thing is that these things can kind of be improved orthogonally. So you might have multiple different teams that are working on certain verticals. And maybe for one model only half of these teams made integrations in the last run and improved the model on these capabilities, and maybe for the next model it'll be the other half. So that's kind of how it works at a high level. One thing I will say, because you also asked about the things that we are really proud of for this model: I would say two things. Number one is the efficiency of the model. We really, really improved the efficiency of the model, and most tasks can now be performed, I would say, 2x faster with this model. So that's great.
And the other one that I already mentioned before, but it's kind of this alignment of the company and making sure that everyone is working towards the same goal and that really takes the entire company working towards this north star of building one good model in specific timelines. So very proud of how that happened.
B
Great. And then speaking of efficiency, how do you optimize for that? We're talking about efficiency per token. Are we also talking about latency in serving the model? What part is AI research versus engineering?
A
So that's what I mean when I say it's the entire company: it really comes from everywhere. It has to come from inference optimizations. It has to come from the model being more efficient in its thinking time. Basically, the usual plot that you should be looking at has the number of tokens that you think for on the x-axis and the performance on the y-axis. These are the test-time scaling curves that we look at. And research basically tries to move this curve to the left: think less to reach the same level, or be more correct for the same amount of thinking. And then inference also deals with this x-axis, but switches it from number of tokens to actual latency. And the final thing that people care about is latency on the x-axis and performance on the y-axis. This is where everything comes together, and this is really what happened with 5.5. So yeah, that's why I always say I'm really proud of the company for this one.
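The two axes described above (thinking tokens versus wall-clock latency) can be related with a small sketch. It assumes latency is simply thinking tokens divided by decoding throughput, which ignores prompt processing, tool calls, and batching. The function names and every number below are hypothetical, chosen only to show how a research gain and an inference gain multiply.

```python
def latency_seconds(thinking_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock thinking time, assuming latency = tokens / throughput."""
    return thinking_tokens / tokens_per_second


def tokens_after_research_gain(thinking_tokens: int, token_reduction: float) -> int:
    """Tokens needed for the same score once thinking is `token_reduction` times shorter."""
    return round(thinking_tokens / token_reduction)


# Hypothetical baseline: 8000 thinking tokens decoded at 100 tokens/second.
baseline = latency_seconds(8000, 100.0)

# Research moves the token curve left (1.25x fewer tokens for the same score),
# inference serves tokens faster (1.6x throughput).
fewer_tokens = tokens_after_research_gain(8000, 1.25)
improved = latency_seconds(fewer_tokens, 160.0)

print(f"baseline: {baseline:.0f}s, improved: {improved:.0f}s, speedup: {baseline / improved:.1f}x")
```

In this toy setup the two gains compose multiplicatively, which is why a combined speedup can be larger than either team's individual contribution.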
B
Okay, great. Let's talk about you for a minute. So you are on the Post-Training Frontiers team, a team you described as horizontal. What does the team do in general?
A
Yeah, I would say there's three things that we do. In a broad sense, we are in the post-training org, and my team is the Post-Training Frontiers one. So there are three things that my team does. Number one is we kind of decide what goes into the final run. As we talked about before, there are many verticals, and someone needs to decide what can go in and what cannot, and also provide the science experiments for people to iterate on something that's going to be representative of the final run. So this is the first thing that my team does. The second thing that my team does is bringing everything together and actually doing the big run. As you might imagine, we train on a good amount of GPUs, so there's a lot of infra work that is needed, but there's also a lot of ML work that is needed in putting everything together and making sure things work well together. And then the third thing that my team does is horizontal improvements to the models. Basically, there are some things that these vertical teams will not usually look too much at. For example, the thinking time, as I said before, so how long the model should think for on certain answers, or instruction following, function calling, things like memory, and general improvements to the model that are really across the stack. So that's what the Post-Training Frontiers team does, and I'm leading that team.
B
Okay, great. And what was your journey into OpenAI?
A
It's a long story, but I'll try to keep it really short. Basically, I did my undergrad in biomedical engineering in Switzerland. I'm from Switzerland. And then I went on an exchange in Canada and I learned about word2vec. I don't know if you've heard about this algorithm, but it basically takes words, which are something discrete, and puts them in a vector space. A way to think about it is a plane where words that are more similar to one another will be closer to one another. So it brings these discrete words into some continuous space that is semantically meaningful. And I was absolutely blown away by that algorithm. That's when I decided that I wanted to work on natural language processing and just understanding language. At that time, I was very wrong, but I thought that English NLP was basically solved or close to being solved. That was in 2017, so that was right when Transformers started, actually right before Transformers. So I was very wrong, but I decided that I wanted to work on under-researched languages, and basically I wanted to improve NLP on languages where we don't have that much data. So I went to work for Grab in Singapore, and I was basically building the natural language processing pipeline for them, working with Khmer, with Bahasa, with Thai, Vietnamese, and all these different languages. And then, I'm skipping a little bit, I did more academic types of work in different countries, and I ended up at Stanford, did my PhD there, and after this had a small stint in startups and then went to OpenAI.
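The idea that similar words end up close together in a vector space can be illustrated with cosine similarity. The sketch below does not train word2vec. The three-dimensional vectors are made up by hand purely for illustration, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (an assumption, not the output of a trained model).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("cat vs dog:", round(cosine_similarity(embeddings["cat"], embeddings["dog"]), 3))
print("cat vs car:", round(cosine_similarity(embeddings["cat"], embeddings["car"]), 3))
```

In a real embedding model the vectors are learned from co-occurrence statistics in text rather than written by hand, and typically have hundreds of dimensions.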
B
Yes, and I remember seeing on your blog or your page a note for quant firms to not reach out to you because you are not interested in hedge fund work.
A
Yeah, I always think it's very important for me to think about the positive impact that I'm having in the world, or at least that I'm trying to have. So that's why this thought is there.
B
Yes. And as we were saying just before we started recording, people may have seen you in the GPT5 video announcement and you did this very funny demonstration of an app that was built on the fly to teach your partner how to speak French. So like, people should go check that out.
A
Exactly. That was a fun one. That was a fun one. GPT5 was not that reliable, so I was a little bit stressed that it wouldn't work, but it ended up working.
B
So this was truly live. And oh yeah, it was presumably very rehearsed, but truly live.
A
Actually, right before we did that, in the last rehearsal, it did not work. So I got slightly stressed about that. But yeah, seems like live ended up working well.
B
Yeah, no pressure, but yeah, that landed perfectly. Okay. Very cool. All right, so let's unpack some of the things we alluded to in the intro. So we started effectively talking about reasoning. And I'm curious what reasoning means in 2026 that's any different from a conversation we could have had about o1 or o3. In particular, one of the claims of 5.5, and also my experience as a user, is that it's particularly good with messy data, which seems to imply that it needs to reason through ambiguity more. What has changed?
A
What I would say is that o1 and o1-preview were really, really breakthroughs in the research community about having models that can think, where the longer they think, the higher the likelihood of being correct. So that was really a breakthrough. But initially, and if you look at old blog posts, you would mostly see math evals and also maybe coding competitions, things where it's really easy to test whether you're correct or not. And that also gives you some suggestion about how we were training some of these models. How I see maybe all of last year, and especially the end of last year and the beginning of this year, is that we were able to take these algorithms that work with verifiable rewards, things where we can say you're correct or you're not, to the messy real world, and really optimize for the utility that we provide to users and for making them more productive. So I think that's what really changed.
B
Okay, so it's the post training, reinforcement learning part largely.
A
Yeah, I would say that's it. I mean, there's also another big part of it. The first thing is that, of course, when you develop a new method, the method is kind of fragile, it's not that reliable, and it's hard to productionize. So this part also improved a lot. But then it's also really that we had a tool that we could start optimizing for different things. Initially, when we were developing this tool, we were making a lot of simplifying assumptions about the real world. And now we are removing these simplifying assumptions, and, at least in post-training, we are really able to optimize user utility and make sure that these models are useful and that the tasks we are looking at are useful. And that's also why current evals look much more realistic. I mean, if you think about GDPval, or even if you look at SWE-bench Pro or SWE-bench, these look way more realistic than, let's say, Codeforces or the coding competitions that we were looking at with o1.
B
And still on the topic of reasoning, what's ultimately the difference between 5.5 thinking versus 5.5 pro? Is that just more test time compute, more tokens and more time invested in solving a problem?
A
Yes, basically it's just a question of how much test-time compute we pour into the model, or into this entire system that we're shipping. We've seen again and again that the longer the model thinks for, the better the answers we will get. The problem is that these curves that we're talking about are definitely not linear. There's some plateauing effect, and they kind of look logarithmic in some sense, depending on which evals. So you can pour in two times more compute and actually only get small performance gains. I personally don't use Pro that much because I really don't like to wait. I'm pretty impatient, so I don't like waiting for that long. And I know that the probability of being correct definitely improves, but it doesn't improve enough for me to use it. But there are some people who use Pro and who really love it, especially for academic research. I know a lot of mathematicians in particular who are using it, and that's because they kind of just have this running in the background for maybe one hour, two hours, and they don't really need to iterate really quickly with the model. And Pro is really good for that.
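The diminishing returns described above can be made concrete with a toy curve. The sketch assumes performance grows exactly with the logarithm of test-time compute, which is a simplification of "kind of logarithmic". The function `score` and the constants `base_score` and `gain_per_doubling` are hypothetical and are not measured from any model.

```python
import math


def score(compute: float, base_score: float = 50.0, gain_per_doubling: float = 3.0) -> float:
    """Toy test-time scaling curve: each doubling of compute adds a fixed gain."""
    return base_score + gain_per_doubling * math.log2(compute)


# Relative compute budgets: 1x, 2x, 4x, ... 64x.
for compute in (1, 2, 4, 8, 16, 32, 64):
    print(f"{compute:>2}x compute -> score {score(compute):.1f}")
```

Under this assumption, going from 32x to 64x compute costs as much as everything before it combined, yet buys the same small gain as going from 1x to 2x. That is the trade-off behind choosing between a faster mode and a longer-thinking one.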
B
I'd love to reconcile this with what you were mentioning about efficiency earlier, per token. So is the idea that you would be able to think longer but also be more efficient, therefore solve the task better? Like how do the time aspect and the efficiency, latency aspect sort of interact?
A
Yes. So if you go back to the plot that I was talking about, on the x-axis we have latency, and on the y-axis we have performance. When we say that we improve efficiency, we're basically moving this curve more and more to the left, so we're becoming more efficient, or we spend less time to achieve the same performance. But what Pro does is that it extends this curve. It says, I'm going to think for much longer, but I will have a higher likelihood of being correct. And every iteration of the Pro model also moves to the left, so it also becomes more and more efficient. The important part is that there will always be tasks where you just want to maximize the probability of correctness and you don't really care about latency. For example, if I start a job before going to sleep, I mean, the model has eight hours. It should just think for as long as it can. And this is kind of what Pro gives you.
B
And in layman's terms, what does that mean practically or how does that work? Practically, if the model goes in the wrong direction, then it would interrupt itself earlier. Is that one of the axes?
A
Okay, so there's two things. Are you asking for the efficiency? What does it mean?
B
Yeah, for the efficiency. Yes, largely for the efficiency. I'm just curious how reasoning gets more powerful.
A
Yes, that's a good question. Let me give you maybe a metaphor from humans. If you have someone who's an expert in a certain domain and you compare them to some undergrad who is starting in that domain, the undergrad doing that task will probably take one day, two days, and will have to think through a lot of the possibilities and investigate, because they have never done that kind of problem. Someone who's an expert in that field will usually just know what direction to take. They will not spend time investigating 10 different directions, because they know that there's one that is more likely to be correct. So this is the type of efficiency that we're talking about. It's basically models that we optimize more on real-world problems, and that as a result were trained to figure out, with a higher likelihood, which paths of reasoning are more likely to be correct. So this is one part of efficiency. There's also what you suggested, which is that part of it is the model knowing when it's going down the wrong path. This is also something that the model can be trained for with reinforcement learning: knowing, okay, that doesn't seem like a great path, let me backtrack and test something else. And if you train the model less, it might realize it's on the wrong path much later.
B
Okay. All right. So it seems like a lot of this goes back to reinforcement learning and post-training. So let's talk about how the different components of modern AI systems work. Let's talk about pre-training, mid-training, and post-training, and spend more time on post-training since it's so important. Starting with pre-training at a high level, and realizing that you may or may not be able to talk about how things are done or what happened in the context of 5.5 specifically: a big narrative of last year was that pre-training was hitting a wall and was not going to yield much progress. That seems to not be the case at all in 2026. Can you walk us through some ideas for what is happening in pre-training and why it's progressing now in a way that people hadn't predicted last year?
A
For pre-training, I can't talk in a lot of detail about what is happening internally, besides that the team has been doing a lot of good work and our models are really getting better and better. One thing that I do want to highlight, since we were talking about efficiency: if you have larger models, the amount of thinking time, so the number of tokens that the model will think for, will usually decrease. The way that you can think about it is that, metaphorically, the model already thinks through its weights when it generates a certain token. So you can decrease the number of tokens that it needs to generate for thinking by increasing the size of the model that you are training. So oftentimes, if you just increase the model size, if you basically pre-train larger models, you will get better efficiency. And the good thing with larger models is that they can be parallelized better at inference time. You might think, okay, you generated fewer tokens, but with a larger model, so you might actually decrease the overall efficiency of the system. This is not true, because the larger the model is, the more chances you have to optimize for inference on GPUs. So you will be able to make the overall system more efficient. That's one thing I wanted to say: larger models are actually giving you a lot of efficiency. Otherwise, in terms of pre-training, I think it's very interesting. I actually also thought, maybe two years ago, that pre-training was kind of hitting a wall. And if we talk just about Anthropic, for example, Mythos seems like clearly just a much bigger model when you look at the cost of the model. That's usually how you know, by the way, if it's a bigger model: you just look at the cost per token. And clearly they are getting very good performance just by increasing the size of the model.
So I think the field was very, at least part of the field was surprised about that. There were a lot of conversations about hitting data walls and it seems like we did not quite hit it. So the larger the model is, the more data it needs to ingest to be trained. And it seems like different companies kind of found different ways to overcome the fact that we don't have that much data on the Internet.
B
Is the next frontier or the current frontier for data, multimodal data? Is it synthetic data?
A
I think synthetic data can probably work well in a data-limited regime. I think multimodal is an interesting one. I definitely cannot talk about what we do internally, but I used to work on multimodal representation learning back in the day, and I always thought that it would really help your reasoning abilities if you have a lot of multimodal data. And I still think this, but for example, if you look at Anthropic models, they tend to not be that good on multimodal and they are still really smart. So it seems that it's not as necessary as I would have thought in the past. I still believe that once we go to embodied agents, embodied AI, you will learn a lot about the world, and you will kind of improve general intelligence and usefulness to users by learning how the world interacts with itself. But at least looking at Anthropic models, for example, it seems that they don't need that much multimodal data to have strong models.
B
And by embodied intelligence, you mean potentially robotics. And so if you use a video that shows how gravity works and how a robot evolves in space, then presumably that would be more useful. Is that the thought?
A
Yes. The intuition that I think many people had and I definitely felt for a long time is that it's hard to understand the world only through text. And it's hard to understand what physics is without really seeing what. For example, you can't understand gravity without really seeing things falling. And when you look at our models, they kind of understand gravity without having seen that, but it still seems not obvious. It still seems like they would get it more. And they are still kind of missing some common sense aspects. So I do feel like we will improve the common sense of our model by having them interact in the real world. But we're still pretty far from that, I think. And by we, I mean just generally the academic community and the AI community seems pretty far from that.
B
Yeah. And while we're on the topic, as a quick detour, that leads us to the concept of world models. So taking your OpenAI hat off, are you bullish on world models?
A
World models in the sense that you can try to replicate or simulate things, basically work in an environment that is simulated, yes. The problem is simulations are always going to be really hard and not going to be fully truthful. So I think there will always need to be a little bit of training that happens in the real world, to make sure that the model realizes these mismatches between the simulated world and the real world. And I think we as a field have a tendency to optimize something that is simulated or not quite realistic past the point where this is useful. So that's something that I think we should always be careful with: we spend a lot of time and effort on optimizing something simulated and not quite realistic. It's great at the beginning, but at some point, once you start optimizing too much for something, it's no longer representative of the real world. And people continue doing that just because that's what they've been doing for a long time. So I just think people need to realize when to stop. I don't work with these types of synthetic environments as much, just because I don't work on embodied AI, so I don't know if we're there yet.
B
Okay, great. All right, so going back to pre training, mid training, post training, let's talk about mid training. It's maybe something that people have heard about a bit less, the term comes up a bit less. What is it and why is it important?
A
Mid-training, as you might realize from the name, is just this idea of something that sits between pre-training and the post-training part of the pipeline. And really the idea is that if you have high-quality data that is more representative of what you really want in your final model, you should over-train on that data. So taking a step back here: pre-training, what is it? Pre-training is basically trying to learn everything about the world by learning everything from the Internet, at a high level. The problem is that most things on the Internet are not really useful. If you think, for example, about Wikipedia, or GitHub, which is coding data, it just seems like there's way more information in there than in some random forums that maybe don't have that much information. Or, for example, ads. There are also lots of ads on the Internet, and you probably don't want to train too much on that. But in pre-training we train on everything, and in mid-training we basically overweight this type of high-quality data that we think is more useful for training the final model. I can't talk about what's happening at OpenAI, but this is something that is definitely happening in the academic community right now, and all the open source models have this stage of mid-training.
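The overweighting idea described above can be sketched as a change in sampling weights over data sources. This is only an illustration: the source names come from the conversation, but the token counts, the multipliers, and the helper `mixture_weights` are invented for the example and do not describe any lab's actual recipe.

```python
def mixture_weights(token_counts, quality_multiplier):
    """Sampling probability per source: raw size times a quality multiplier, normalized."""
    raw = {
        source: count * quality_multiplier.get(source, 1.0)
        for source, count in token_counts.items()
    }
    total = sum(raw.values())
    return {source: weight / total for source, weight in raw.items()}


# Hypothetical corpus sizes (arbitrary units).
token_counts = {"forums": 700, "ads": 200, "wikipedia": 50, "github": 50}

# "Pre-training" here: sample every source in proportion to its size.
pretrain_mix = mixture_weights(token_counts, {})

# "Mid-training" here: overweight high-quality sources and drop ads.
midtrain_mix = mixture_weights(token_counts, {"wikipedia": 10.0, "github": 10.0, "ads": 0.0})

for source in token_counts:
    print(f"{source:>9}: pretrain={pretrain_mix[source]:.2f}  midtrain={midtrain_mix[source]:.2f}")
```

In practice the quality signal would come from classifiers or heuristics rather than a hand-written table, but the normalization step is the same idea.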
B
Great. Post training, let's start at a high level by defining what that is. So there's reinforcement learning, but that's not the only part of post training. What else is there?
A
It kind of depends how you define the term and where you put the boundaries. In my mind, post-training, and I'll take it in a very broad sense, which includes all the reinforcement learning and the training for our reasoning models, is just the idea of going from something that knows everything about the world to something that is useful to people. For pre-training, the metaphor that I like giving is that you go into a library and you have a lot of books about everything. In theory you can find all the information that you want in the library, but it's much more useful to talk to an expert who has learned these books, whom you can ask questions, and who can answer and understand what you're actually looking for. So this is kind of the goal of post-training at a very high level: making something that is useful to users and is easier to interact with. So there are multiple stages. I'll talk only about things that are happening outside of OpenAI, and kind of the usual stages. There's usually some SFT that is happening
B
which is supervised fine tuning.
A
Supervised fine-tuning, yes. And early on, most of the models that were post-trained were only doing supervised fine-tuning. The idea is that if you have humans that can give you the desired final answer, so if you have humans that can give you the gold answer, you can basically clone the behavior of the human. This is what we call behavior cloning. The problem with this is that you will never get better than what your ground truth gives you. And humans are actually pretty limited in many senses. So you will never overcome the human labelers that you're working with. The reinforcement learning stage goes from behavior cloning to really optimizing rewards. So the idea is: I don't know what the ground truth is, I don't know what the perfect answer is, but here's how I would say whether the answer is correct or not, and here are the things that I want in the answer. And what you do is you start optimizing. You start having a model that tries to get more reward, that basically optimizes this reward function, as we call it. And it goes beyond what you currently have, beyond what humans can do, or at least what the humans that you're working with can do. So these, I would say, are the two big stages. Then within reinforcement learning, it depends on which models are being trained. At least in the open source community, it seems that there are different ways of doing it. You have reinforcement learning with verifiable rewards, where it's really easy to say whether something is correct or not, and you can really kind of have a binary reward for this. That goes back to how we talked about o1 and o1-preview earlier. And then you have reinforcement learning without verifiable rewards, where maybe I could do pairwise comparisons. I can say this answer is better than this other one, but I cannot quite say this is the perfect answer.
So of course it's a continuum and there's everything in between. But I would say these are the three high level things to think about when you think about post training in general and how people are usually doing it in the open source world is that they take SFT they clone the behavior that you can collect online or from humans. And then once it's already at a pretty good level, they just do this reinforcement to go beyond what we currently have. Because if you just started from reinforcement learning, it would be very inefficient. Because the problem with reinforcement learning is that you have to stumble across the right answer. Basically, because how reinforcement learning works is you sample many times essentially from the model that you're training and you say, this one is correct, this one is not, and you say do more of the one that is correct. So you have to stumble across the right solution. So you're much better off first getting as close as possible to the best you can do. And this is this behavior cloning and then doing reinforcement learning.
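As an illustration of the "sample, check, do more of what was correct" loop described above, here is a toy sketch. It is not any lab's training code: the policy is a softmax over a handful of candidate answers, the update is plain REINFORCE with a binary reward, and every name (`train`, `correct_index`, the learning rate) is an assumption made for the example. It shows why a warm start from behavior cloning helps: if the correct answer is almost never sampled, there is nothing to reinforce.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(logits, correct_index, steps=200, samples_per_step=8, lr=0.5, seed=0):
    """Toy REINFORCE loop with a binary (verifiable) reward.

    Returns the final probability of the correct answer.
    """
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        grad = [0.0] * n
        for _ in range(samples_per_step):
            # Sample one answer from the current policy.
            a = rng.choices(range(n), weights=probs)[0]
            reward = 1.0 if a == correct_index else 0.0
            # Gradient of log pi(a) w.r.t. the logits is onehot(a) - probs.
            for j in range(n):
                grad[j] += reward * ((1.0 if j == a else 0.0) - probs[j])
        # Incorrect samples have reward 0 and contribute nothing.
        logits = [l + lr * g / samples_per_step for l, g in zip(logits, grad)]
    return softmax(logits)[correct_index]

# Warm start (as if SFT already made the right answer likely) vs. cold start.
warm = train([0.0, 0.0, 0.0, 2.0], correct_index=3)
cold = train([10.0, 0.0, 0.0, -10.0], correct_index=3)
```

In the cold-start case the correct answer is essentially never sampled, so no reward arrives and the policy barely moves. The warm-started policy keeps increasing the probability of the correct answer.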
B
Does reinforcement learning create new capabilities or does it make the model better at existing capabilities?
A
It's really hard to say because pre training, when the model is trained on all of the Internet, arguably already has all capabilities in it. So it would be hard to even answer this question scientifically, because arguably everything is already there. What I would say is that if you look at models that we were post-training two years ago in the open source world, for example, I worked on one of them, Alpaca, where we used 50,000 examples for SFT. And now when you look at reinforcement learning for models like Kimi or the DeepSeek models, it seems that they are closer to 1 million data points. So people definitely scaled up the reinforcement learning stage a lot. And from this it seems that the models have learned new capabilities like this reasoning aspect, this fact that you can check your answer and try to improve it, so you can really think for longer to get a more correct answer. So all this to say that arguably everything is already in pre training, but in the last year and a half, even in the open source world, we were definitely able to get more capabilities after reinforcement learning than we used to.
B
I heard several times that reinforcement learning is pretty finicky and hard to scale. And part of the reason why we as an industry didn't do reinforcement learning as part of the initial LLM progress curve was precisely that: it was hard to make work. What is hard about scaling RL? Is that a question of data sets, knowing where the rewards are, or something else?
A
I would say most people who did not work in reinforcement learning in the academic and research community up to two years ago probably thought reinforcement learning just doesn't work and is too finicky to work with. I used to be that type of person. And actually, when ChatGPT came out, they had this blog. I was not at OpenAI at the time. I saw this blog that said that they used reinforcement learning. And my first thought was, I can do the same without reinforcement learning, because this is just an overcomplicated method. And the project that we started working on with Alpaca was exactly that: let's try to reproduce it using only SFT, just by doing this behavior cloning. And for example, Yann LeCun famously gives this metaphor that reinforcement learning is like the cherry on top. So I think that was really the intuition that most people had. It seems that after crossing a certain scale, with models that know basically everything about the world, that have what we call good priors about the world, reinforcement learning just started to work. And this is not only with LLMs. Robotics seems to be entering the same stage, where they're realizing that it used to be very finicky, but now that we use models that already know everything about the world, it actually learns pretty well. Now, to answer your question about what is still complicated with reinforcement learning: one is an infra aspect, so just systems in general. In reinforcement learning, at a very high level, you basically have to sample many answers, as I said before, and say what is correct and what is not. And this sampling is just very expensive and you have to do it at scale. The other issue that people are seeing right now, also in the open source world, is that when we are training more agentic systems, you only know whether you're correct at the end of your very long rollout.
So you get very little information per token of whether you were correct or not. And it's hard to say. It's hard to basically do attribution. It's hard to say what part of your entire answer was the one that led you to being correct. So that's more of an issue on the machine learning side. The ideal world in machine learning is when I can say exactly like, this thing was good. Do more of that. And the problem again with these agentic systems and reinforcement with agentic systems is that you don't really know which part was good or not until you arrive at the end. That's another big issue for reinforcement learning.
B
What's the current frontier of reinforcement learning? It seems like there's a jungle of acronyms like GRPO and other techniques. What are you using? What are you excited about? What do you think is promising?
A
So I can't talk about what we're using. But for example, in the open source world, GRPO seems to be working very well. People used to have different methods like PPO and DPO, and they seem to have really converged on this one. The big difference with other methods is that, again, you do this simple thing that I told you about: you sample as many answers as possible and you say which ones are correct. So in some way, GRPO is a very simplistic method. And in general, we saw over and over again in machine learning that the simplest method that you can scale up in terms of compute is usually the one that ends up working the best. And that is kind of what is happening here, at least in the open source world.
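To make the "sample many answers and say which ones are correct" idea concrete, here is a minimal sketch of the group-relative advantage used in open-source descriptions of GRPO: each sampled answer's reward is compared with the mean of its own group and scaled by the group's standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration; this says nothing about what any particular lab runs.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other samples for the same prompt.

    rewards: one scalar reward per sampled answer (e.g. 1.0 if correct, else 0.0).
    eps: small constant (an assumption here) to avoid dividing by zero
         when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct samples get a positive advantage and incorrect ones a negative advantage, without training a separate value model. That simplicity is part of why the method is easy to scale.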
B
As you describe some of the challenges, a question crossed my mind. You often hear that AI systems are not built, they're grown. Is that how you'd characterize it as well? What part is science versus craft, trying multiple things and then just keeping what works best, in your day to day life?
A
Yeah, that's a great question. I think how it usually works is that it starts as craft. People just try out many things and they start building a mental model of what works and what doesn't. And over time we move from this craft land to more science. Scientific approaches are rarely the ones that end up working first. It's very rare that you take a really scientific approach and you say this is the optimal thing to do, and you do it and it just works. There's some sense of alchemy. People just have a good flair for something and they make it work. And then other people, or that same person, start trying to improve what we are doing by being very scientific. And I would say this happens over and over in machine learning. So first craft, then science. Both are really important, but they are different stages of the pipeline. In terms of engineering, this is definitely something that is always necessary. So I would say most researchers have moved to being, I wouldn't say good engineers, but at least good at working in complex systems and figuring out what they need to try out. And the systems and the infra that we have have become more and more complicated. So the work required definitely changed over time.
B
Fascinating. All right, so still in reinforcement learning and circling back to some of the things you said at the beginning. If I want to make my model better at computer use or agentic coding or whatever domain, then I would spend a particular amount of time doing reinforcement learning specifically for computer use, putting together a data set and then coming up with rewards. Is that how it works? You just pick one problem and you do reinforcement learning specifically for it?
A
To be clear, I talk more about reinforcement learning because this is the part I know the best and this is what I've worked on for a long time. We talked about mid training before. All these things are also extremely important and you can improve the model in different parts of the pipeline. As I said before, the closer you are to the final stage of the model, usually the smaller the scale of the training becomes. So you can iterate fast on that, because now you can iterate in terms of days rather than in terms of months. So usually people start from this fast iteration loop and then they go deeper and they make bigger changes across the entire stack. So this is not to say that only reinforcement learning matters, I'm really not saying that. It's just that that's where people will start making changes, and then that will permeate and we will go deeper into the stack. So this is how it works, and in the open source world it's very much like that too. I think you see way more post trained models than you see new pre trained bases, and you see way more improvements in the algorithm. And that's why we talked about GRPO, DPO, PPO. There are so many XPOs, and that's because people can iterate really quickly on this final stage of the pipeline.
B
And the jagged nature of those models, does that come from this approach of picking this problem and that problem, so that the model is going to be excellent at those problems but not as good at other problems? Or is that a more fundamental characteristic of AI models?
A
There's definitely some of that for sure. If you optimize more on specific types of problems, you'll be better in that setting. I would say, at least, my intuition is that it's less about the exact problems that you're optimizing on and more about the class of problems that you're optimizing on. So for example, if your model is really good at math competitions, it will probably be pretty good at coding competitions. So it's not about the domain, it's more about the skills that are necessary, the way to think, and the horizontal capabilities that you need for performing these tasks. And that's what I think you're usually seeing: when some model is really bad at something, it's actually bad at that in any domain, in any language. So you have to think about this class of problems and how it generalizes, not necessarily about per-domain capability.
B
So speaking of generalization, there's been that clear evolution from math and coding success to now starting to cover different areas. So that's the whole GDPval thing, where different areas across the economy are being evaluated in terms of model performance. Sort of the same question: is that the result of overall model progress, or is that deliberate, as in, okay, now we're going to take this part of the economy and build a data set for it and do mid training and do post training? How does that progress work from those very specific domains to generalizing to the rest of the world?
A
It's definitely something that we actively push on. I think people are realizing, I mean us and also other companies, that we are moving towards this world where we want to really make products that are useful, that improve people's productivity and help people in their day to day life. So I think there's a very active move towards deciding which domains we should be prioritizing, now that we know we have an algorithm that we can apply in different places. What we are constrained by is more collecting the right data and having people who really care about a certain problem work on that problem. But there are not that many people who can do these things, so you really need to prioritize. So yeah, it's a very active, proactive kind of approach here. And in general, I would say the performance of the model really depends on the number of people who care about the final output of the model and who are looking at that model. So if they start looking more at specific verticals, these verticals will improve really quickly. But again, we don't have that many people who can do these things.
B
But to unpack something that you alluded to, I think, a minute ago: do models actually generalize more now, especially from a reinforcement learning perspective? So is making a model very good at domain A or B likely to make the model better at C, regardless of the amount of effort you put into developing rewards for domain C?
A
So I think there are different axes of generalization. One is algorithmic generalization, and that's really: can I take the algorithm that I developed, or this black box that I developed, for domain A and use it for domain B? And again, even talking about the open source world, it really seems that people are able to do that. They take GRPO, they apply it in many different places, and it just works. So that generalization seems to be relatively good, which is why we're seeing a lot of progress; otherwise it would be hard to make progress. Then there's the generalization of the model that is trained on one particular data set. And this is what I was alluding to before: at least my mental model is that the generalization happens in terms of capability. If the capability is the same, you will see generalization across domains. Again, take different languages in coding. You can get a good C model with very little RL in C, partly because the pre trained model has seen all of C and so it already kind of understands the basics of that language. So that type of generalization definitely happens. The generalization that I think is harder is when we don't have these horizontal capabilities. I'll give you one concrete example. Say my model is very intelligent in terms of being correct on competitions. I usually take that example because it's somewhat contrived. With math competitions or coding competitions, from a human perspective, people that are good at these things are usually just smart. And someone might think that if they are smart, they can actually do other things too. But that type of generalization is really not true, because in many of the things where we need humans working in expert domains, the world is very messy.
And these coding competitions and math competitions are extremely well specified. You need to have the capability of understanding underspecified tasks, understanding how to deal with the messy world, and understanding what resources you even need to answer the question. If you look at a math competition, you usually have everything in the prompt. You have five lines or maybe 15 lines, and it's all the information that you need to answer the question. In the real world, if I'm a consultant, if I work in finance, I need to go on the Internet, I need to find and extract different information before doing any of the reasoning, just to be able to do that reasoning. And this type of horizontal capability is the thing. Usually you generalize if you have that horizontal capability, but in many cases we don't have it. Hallucination is an example. With LLMs, if a model is really bad at saying that it doesn't know, that usually happens in every single domain. You won't have one domain where the model is extremely calibrated about its knowledge and another domain where it's not.
B
And as a quick Detour. Is hallucination also a reinforcement learning problem where you reward the behavior to say, I don't know when it occurs?
A
John Schulman has a great presentation about that, I think from one or two years ago, where he was saying that if you do behavior cloning, so this SFT that we talked about before, you will basically reward and optimize for hallucination, or at least you could. What will happen is this: your model doesn't know about something, but now you say that the right answer is to state that thing. I'll be very concrete. Say the model doesn't know about a paper. And now, in a ground truth answer given by a human, you say, here's where I got the information, and then you cite that paper. What you're actually optimizing the model to do is cite something that doesn't exist, because it doesn't know that that paper exists. And so John Schulman had this great presentation saying that SFT is going to force hallucination, whereas in reinforcement learning, given that, as I said, you sample from the model in the first place, it's extremely unlikely that it samples something that it doesn't know and that is correct. So you will never reward that behavior. You will only sample things that it doesn't know and that are incorrect, and then you will kill that behavior. So at least the intuition that people have is that hallucination can come, for example, from SFT, and it can come from this post-training pipeline. But if you have a good reinforcement learning pipeline, that shouldn't happen too often.
B
And going back to generalization as well, are there examples where actually getting better at one domain makes the model worse at the rest? A little bit like how some people are very good at math, some people are very good at English, and pretty often they're not the same people.
A
Usually not. What will happen, though, is that you will make decisions about which domain to optimize for. And if you optimize for one domain, you will be able to optimize less for another one. So it's not necessarily that optimizing for one thing will make the other one worse. It's just that as a result, you can optimize less for the other one, because you're compute constrained, you're data constrained, and you're also human bottlenecked in terms of that work. What does happen is that you can have negative generalization, or negative transfer, more for these horizontal aspects of the model. So I'll give you a very concrete example: explicit instruction following versus implicit instruction following. We often hear, for example, that OpenAI models tend to be really good if you tell them exactly what you want. But as a result, sometimes we also hear that they're less good if you are not as specific about what you wanted. For example, if I say change this file and I make a typo in the file name, a model that is extremely good at explicit instruction following will change the wrong file, the one that matches the typo. But humans would probably realize that you made a typo. And as a result there are cases where this explicit instruction following goes against this implicit instruction following. So you will have cases where these horizontal capabilities basically go against each other.
B
And maybe to close on this whole reinforcement learning conversation: as we progress from being excellent at coding and excellent at math and move to the rest of the economy, do you think that the rest of the economy is a tractable problem? Do you think we can get to the same level of performance? Ultimately, yes.
A
I would say, yes, we can. I don't think there's anything really deeply special about these domains, anything that means we cannot optimize and get the same with other domains. The bottleneck is there for at least two reasons. The first one is that most of the people working on these models are pretty good at coding and they really care about coding, because that's what they use day to day as daily drivers. And there's nothing better than the user also being the one who trains the model, because then they understand the issue. For me, for example, it's very hard to really understand what we should change on the legal aspects of the model if I don't understand anything about the legal domain. So that's one thing. The other thing that you will often hear about, and I also mentioned it briefly before, is these verifiable rewards. There are domains where it's easier to say whether something is correct or not. For example, in the case of cyber, you mentioned before that the cyber capabilities of models have been improving a lot. And this is because in cyber it's extremely easy to say whether you are correct. Is the cyber issue that you found a real issue or not? It's very easy to test. So there are domains where reinforcement learning is just easier to apply. But there's nothing, I would say, in the capacity of the model that prevents it from being as good at legal and medical and other domains. So the short answer is that we know less about these domains, and there are definitely some domains that are easier to optimize for in reinforcement learning.
B
Great. Let's talk about evals for a minute. That's a hugely important topic. Maybe to start: why is it so hard to evaluate a model in the first place?
A
Evaluation has become harder and harder as models become better. And that's because the tasks that we ask of the model become more and more general and more and more open ended. So now I maybe just say, build me a website that does X. Whereas in the past I would just be like, hey, is there a specific bug in this implementation that you have? And it's much easier to say whether there's a bug, because I can have a human that says here are all the bugs that you have, and then you can apply that automatically. With the website, it's very hard to know what the optimal answer is, because there are many good answers, there are many good ways of building a certain website. This open ended nature of the tasks really makes evals harder. Another issue is that models, on specific axes, are becoming better than the majority of humans. And so we have fewer and fewer humans that can actually evaluate these models on particular axes. So that's definitely a constraint. Another one, to be honest, is kind of cultural. Most people want to improve the model and they think that the best way to do that is training the model, when in reality, finding issues and making sure that we can quantify improvements is just as important, if not more important. But there's always this cultural gap. That was especially true, I would say, in the academic world up to two years ago, when evals were always fixed, benchmarks were always fixed, and even data sets were kind of always fixed. Maybe four years ago there was a mentality shift of, okay, data is actually critical. And now there are a lot of people working on data, but I think evals are still not quite there. Everyone knows that it's important, but people don't really understand how impactful it could be to work on evals.
So actually, for my first project at OpenAI, I just came in and I was like, I want to work on data and evals, because I know that this is the thing that no one is working on. And as a result I know that it's super impactful to work on that. And yeah, the tide is shifting, but not fast enough.
B
And is the pace of progress in model as a judge, AI evaluating AI, moving as fast? Is that a distinct part of research, or is that fundamentally the same idea or the same techniques?
A
It's really fundamentally the same method. Also, most of the things that we do in evals, especially now that we have reinforcement learning, can just be applied nearly exactly as is during training. So that's actually another reason why evals are so complicated: every time you build an eval, you actually build a way to build training data sets. So now you're going to optimize on that training data set. Even if it's not that exact eval, it's going to be the same type of data. And now you're going to do super well, because we have this generalization of capabilities that I was telling you about. You will learn on that other data set, and now you'll become really good at that eval, and that eval will become obsolete really quickly. So that's also an issue with evals. But yeah, to come back to your question, the model as a judge is really important, and I think it's probably one of the most important things, because as we get better models, we have this self reinforcing loop, this capability flywheel, where better models become better teachers for other models. And this is really important for training, but then you can also do the same thing for evaluation. So a lot of my team works on that, and I think it's really critical to work on this model as a judge kind of framework.
B
Okay, fantastic. All right, so as we get towards the end of this conversation, I'd love to zoom out a bit and get your sense for where things might be heading. Obviously it's incredibly hard to make predictions on AI years out, but let's call it the next 12, 18, maybe 24 months. Is your sense that things are going to continue progressing or are we heading towards something that could feel more like a discontinuity in terms of progress?
A
As I was saying before, I think it's always continuous. Now, the feeling of discontinuity will happen. It did happen three or four months ago with coding. And I think that will happen now in every other domain. Most people in other fields are not feeling the capability and the usefulness of our models the same way coding and software engineering is feeling it right now. So this will definitely permeate, I think, through many other verticals. Now, in terms of a capability bump in, let's say, the verticals that we're already looking at, I think it will be more continuous, and there will never be big discontinuities. Most of them are local discontinuities, but you zoom out and it always just feels pretty smooth. It's not always like this, but that has been the case most of the time, and I definitely cannot predict when the next big discontinuity is.
B
What is your sentiment on this general concept of accelerating loops in AI? So whether that's continual learning to make models more current and able to learn faster to this broader concept of AI building AI like in an increasingly automated way, fact versus fiction. And what are you excited about?
A
I'm extremely excited about continual learning. I think we haven't quite cracked it. I mean, we have Codex memories and that is helpful, but it's definitely not the end state. I have a friend who always tells me about, again, another type of plot that we should be looking at, which is X axis time, Y axis the utility that you provide to users, or basically the usefulness of the models. And right now, actually, if you just drop most models into a company at day zero, arguably they are more useful than most new employees. So they start higher at T0, but then across time they are mostly constant, because they don't really learn company knowledge, and they don't really learn to be more efficient over time at the things that they are doing. Humans, meanwhile, learn really quickly, and what is important is the integral, the area under these curves. And as a result I think humans are still more useful in many cases. And that's why what we will need from continual learning is to make this curve monotonically increasing over time, and basically make models more and more useful the longer they work in a certain environment. So I'm extremely excited about it. I'm actually surprised that we're not quite there yet. Three years ago when ChatGPT came out, I remember I was doing a startup with friends and we were thinking about working on continual learning and personalization and memories in general. And we were like, ah, OpenAI is going to do that in the next six months. They have all the data, they're going to figure it out, they have all the users, and the models are going to learn super quickly from users. And three years later, I don't think we're there yet.
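The "area under the curve" argument above can be made concrete with a small sketch. All numbers below are invented for illustration (a flat utility of 0.6 for a model, and a new employee who starts at 0.2 and improves by 0.1 per month); only the shape of the comparison comes from the conversation.

```python
def area_under_curve(utilities, dt=1.0):
    """Trapezoidal integral of utility sampled at evenly spaced times."""
    return sum((a + b) / 2 * dt for a, b in zip(utilities, utilities[1:]))

months = range(13)  # month 0 through month 12

# Hypothetical curves: the model starts higher but stays flat,
# the human starts lower but keeps learning.
model_utility = [0.6 for _ in months]
human_utility = [0.2 + 0.1 * m for m in months]

model_total = area_under_curve(model_utility)
human_total = area_under_curve(human_utility)
```

Under these made-up numbers the model is ahead at month 0, yet the human accumulates more total utility over the year. Continual learning would aim to make the model's curve rise over time instead of staying flat.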
B
And quickly, in layman's terms, what is the fundamental difficulty?
A
It's a good question. I actually don't quite know, to be completely honest with you. I don't quite know why it's taking us that long to figure it out. It's this type of domain that I think if we really put enough resources behind it, we would figure it out. Of course, especially when we talk about this memory inside of a company, there's big questions about permissions and there's a lot of questions about privacy and what you can share and what you cannot across models, across users. But for a single user, even for a single user, we're not quite there. And I don't quite know why, at least at the high level that I can talk about. I don't know why.
B
Yeah, what you bring up is, I think, really interesting for AI builders and investors and startups, which is this question of the models getting increasingly smarter. Within an enterprise in particular, there's this whole tension between what the models are able to do and what a lot of people have built around the model. So, you know, a year or two ago it was RAG. These days it's all about harnesses for agents, and a lot of people are wondering whether the models are going to end up eating the harness, whether the harness is just a temporary thing. From your perspective, what do you think happens?
A
Yeah, I think harnesses can really improve the capability of a model right now. But given that we're seeing this really fast progress in terms of capability, I personally wouldn't push that much on the harness unless the harness is for a very concrete goal that you're trying to achieve right now. So certain companies, if they are focused on a specific vertical, want to go from maybe 80% reliability to maybe 85%, and a harness will give them that. And I think that's very important. But they need to do it while knowing that they will have to retune that harness in the future. And I think that's totally fine. If you try to have a general harness that will be sustained over time, I don't think that will work. Harnesses for specific domains are a short term thing that you need to do. I think there will always be so much you can do in harnesses, and if anything, I think everyone should do more of that if they have a specific problem in mind, because we're leaving so much on the table without a good harness. Arguably, if we froze the models that we have right now and really worked on the harness, and maybe also spent more time training with a great harness, I think people would really feel the AGI in every single domain, or could already feel that in every single domain. But given that we're not freezing the models and we're going to continue training better and better ones, I think we don't really understand what the final harness will be, and it will always change.
B
Same question about applications. So we alluded to your progress in different verticals, and there was, you know, GDPval in general, but also Tau2-bench Telecom, which does complex customer service workflows, and then progress against finance agents automating 88.5% of internal investment banking modeling tasks, and then 51.1% on office QA Pro. So bit by bit you're doing more and more of this. So do you think people should be building applications anymore, or is all of this ultimately, as we get closer to AGI, going to be part of the model capabilities?
A
There's so much space for external companies or startups pushing on specific verticals. The reason why is because a lot of people think about intelligence, in quotations, and raw capability as being the real bottleneck, but I don't think that's true. I think most of the time the bottleneck is the last mile. It's making sure that the model has the right permissions, or has access to the right connectors, and things like this. And we are going to be very focused on the general aspect. And I think there are other companies that should be focused more on the verticals and on getting maximum value out of what we currently have. So I think there will always be a lot of space left for this last mile in different verticals and I would highly encourage people to continue working on that. And maybe one day when we stop making horizontal progress, which I don't think is anytime soon, maybe we will start focusing on that. But yeah, that's not what we're doing now.
B
Okay, well, that feels like a very optimistic note, at least for the startup ecosystem, to end on. Thank you so much, Yann. This was terrific. Really enjoyed it. Thank you so much for spending time with us.
A
Great, thanks, Matt.
B
Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from. This really helps us build the podcast and get great guests. Thanks and see you at the next episode.
Podcast Summary: The MAD Podcast with Matt Turck
Episode: OpenAI’s Yann Dubois: Why AI Progress Suddenly Feels Real
Date: May 21, 2026
Host: Matt Turck
Guest: Yann Dubois, Co-lead of the Post Training Frontiers Team, OpenAI
This episode dives deep into the state of AI and large language models with Yann Dubois from OpenAI, focusing on the recent release of GPT 5.5. The discussion covers why recent AI advancements "suddenly feel real," shifts from competition benchmarks to practical deployments, the centrality of reinforcement learning (RL) and post-training methods, generalization, evaluation challenges, and the pressing unsolved issue of continual learning. The conversation is rich with insider context, candid assessments, and practical advice for builders and researchers.
“You need to reach this level of reliability to really make any of these AI tools very useful. And I think we just crossed that probably December last year at least at OpenAI.” – Yann [00:00]
“We moved from competitions to usefulness, to users, and that's what we are feeling right now.” – Yann [01:00]
“With every model that is looking really good early on, we have a model, we all get really excited about it. And then there's tons of doubts that start coming up because it's like, oh, everyone is hyping this thing internally, but actually it's bad at all these other things...” – Yann [05:34]
“You could see it as a team that is doing both kind of the smoothing function... And also you need to have some teams that are working...on all the horizontal improvements.” – Yann [07:48]
“We really, really improved the efficiency of the model. And most of the tasks can be basically performed, I would say 2x faster now with this model.” – Yann [07:48]
“It seems that after crossing a certain scale of models...reinforcement learning just started to work... even with LLMs.” – Yann [39:20]
“The idea of having something that knows everything about the world to making something that is useful to people…” – Yann [32:52]
“We were able to take...tools that we built for these verifiable reward cases and we were able to use them more generally for reinforcement on real use cases.” – Yann [01:57]
"If you have someone who's an expert in certain domain...will not spend the time on investigating ten different directions because it knows that there’s one that is more likely to be correct." – Yann [21:54]
“The performance of the model really depends on the number of people who care about the final output of the model who are looking at that model.” – Yann [49:04]
“SFT is going to force hallucination, while in reinforcement learning...you will only sample things that it doesn't know and being incorrect, and then you will kill that behavior.” – Yann [54:30]
“Evaluation has been harder and harder as models become better. And that's because the tasks that we ask to the model become more and more general and more and more open ended.” – Yann [60:30]
“I think we haven't quite cracked it...three years later, I don't think we're there yet.” – Yann [66:09, 67:00]
“I think there will always be a lot of space left for this last mile in different verticals and I would highly encourage people to continue working on that.” – Yann [72:09]
“Even though in my mind the progress is actually pretty continuous, you need to reach this level of reliability to really make any of these AI tools very useful... It feels like a step function, even though I think actually in terms of capability it's pretty continuous.” – Yann [01:57]
“With every model that is looking really good early on... then there's another wave where people start under hyping it and it kind of goes through waves.” – Yann [05:34]
“Pre training, it's basically trying to learn everything from the world by learning everything from Internet at a high level... post training is making something that is useful to users and is easier to interact with.” – Yann [32:52]
“I would say most people who did not work in reinforcement learning... probably thought reinforcement learning just doesn't work and is too finicky to work with... it seems that after crossing a certain scale of models... reinforcement learning just started to work.” – Yann [39:20]
“I think most of the time the bottleneck is the last mile... and I would highly encourage people to continue working on that.” – Yann [72:09]
Recommended for:
AI practitioners, enterprise AI strategists, founders in AI tooling or application spaces, ML researchers, and anyone wanting a tangible sense of where state-of-the-art AI research and deployment is heading next.