A
Thank you so much.
B
Hey everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host, swyx, founder of Smol AI.
C
Hey. And today we have a returning guest as well as a new friend. Welcome, Michelle and Josh.
A
Hey there.
C
Both of you work on, I guess, Michelle, I think I used to introduce you as manager on the API team. It seems like you've changed your role since we last talked on the podcast.
A
Yeah, now I lead a team on the research side, specifically in post training.
C
Yeah. And Josh, you are also on post training.
D
Yep. I'm a researcher on Michelle's team.
C
Yeah. And I just found an interesting commonality you guys have. You're also both from Waterloo, continuing the tradition of extremely cracked engineers.
A
Oh yeah, we talked about that last time. That's right.
C
Okay, so we're gathering to talk about GPT 4.1. You launched it. I mean, we got a little preview and it was a little bit rumored, right? It was pre-released, I guess, with OpenRouter as Quasar Alpha, and then there was also an Optimus version. And I think people are trying to figure out why we are going back from 4.5 to 4.1. There's a whole bunch of other things, but what are the headline facts, I guess, that you guys want to emphasize about 4.1?
A
Yeah, I'll just say we released three new models today: GPT 4.1, GPT 4.1 Mini and GPT 4.1 Nano. And the real focus on these was just making models that are great for developers. So we improved instruction following and coding, and shipped our first 1 million context models.
C
Josh, anything to add? I don't know if there's anything else that people should really know, things that are sort of in the fine print.
D
No, I think the only thing that I would touch on maybe twice is that there's actually a new model in the lineup, Nano, which is even faster for developers that are making, you know, low latency applications.
B
And cheaper. Any fun story behind the codenames? You know, I got the Strawberry hat as another fun time in the lore of OpenAI.
A
Yeah, yeah. We really wanted to get as much developer feedback as possible on this model to make sure it worked well in the real world. And so we tested it kind of through OpenRouter and it was super cool to see people latch onto the names and, and get the theories going. But the feedback we got from there was super helpful.
C
Yeah, it's not even the name, it's more about the API shape. Once we saw, like, chatcmpl, it was very obviously OpenAI.
A
Yeah, it's a good note.
C
Yeah. But like, I mean, okay, is there, like, an emphasis on stars? What inference were we supposed to draw from? Quote unquote supermassive black holes?
D
I don't think there's anything really to draw from there. I think they're just cool.
A
Just cool names, you know, they make you think of cool concepts.
B
The vibes are good.
A
The vibes are good things, you know? Yeah, yeah.
C
The other thing about the examples, and we're just mining for lore here, right? An interesting animal comes up a few times on the live stream and in the blog posts. What's up with tapirs? Who likes tapirs here?
A
Yeah, our team is just a super big fan of tapirs, so they just happen to work their way into a lot of our content.
C
Okay, cool.
B
Awesome.
C
Go ahead.
B
Yeah, go ahead. I think the first thing that we just want to run through is obviously the 4.1 to 4.5. I think that's the first thing that everybody was maybe confused about. And I know you're deprecating 4.5. It sounds like 4.1 is just a kickass model, and the 4.5 size maybe is not as good of a fit. That was just a research preview, so maybe. Yeah, I don't know. Whatever you want to say to address that. I think it's something we've seen come up also in the Discord.
A
Yeah, totally. Okay. Naming is really hard, and we've tried to make this as unconfusing as we can, but, you know, nothing's perfect. Basically, the way we got here is that GPT 4.1 is a pretty big improvement over the 4o line, and we really wanted to signify that. However, it's a model that's much smaller and cheaper than GPT 4.5, and as a result, you know, it doesn't achieve the same scores on AIME or other intelligence evals. So it doesn't beat 4.5 on all of the evals. And so we didn't think it made sense to increment beyond 4.5, but we do think most developers can kind of replace a lot of their 4.5 usage with 4.1, and then the.
B
Mini is strictly better than 4o mini.
A
Yeah, yeah.
C
With the Nano. But we don't know if 4.1 is a distillation of 4.5 or there's no relationship there. What can we say about the shared lineage.
A
Yeah. What I'll say there is, we're always using various research techniques to improve our models and distillation is something we talked about before. It's really meaningful, especially for the small models. And we've kind of pulled out some of the things that made 4.5 really good. Like it has a lot of the instruction following greatness and also rolled that into 4.1.
C
Awesome. I strongly remember on the 4o launch that the communication was that we were kind of moving to a new model architecture, that is, the omni model.
B
Right.
C
That's the o in 4o. And then 4.1 is part of this subsequent trend of trying to merge everything, like the reasoning model, the omni model, everything. And I think that there's just doubt about 4.1. I think it's basically being sold as a strict replacement for 4o, but I don't know, is it going to be a full omni model? Is it roughly the same architecture that we think 4o has?
A
So we already have different slugs on the Realtime API and the Responses API. So they're already, you know, somewhat different checkpoints. We don't have any current plans to release 4.1 in the Realtime API. But, you know, things may change.
C
Yeah. And then there's ImageGen and all that. Right. Like so. And as far as we know, no plans maybe, but nothing announced.
A
Not. Not right now. The focus for 4.1 was kind of these three core capabilities for developers.
D
Yeah, we.
C
Our Discord actually also did a launch watch party for the recent 4.5 podcast that Sam Altman did, where I think for the first time it was basically confirmed, something that people already knew, like Andrej Karpathy was already talking about this, that 4.5 was like 10x the size of 4. And I think there's a question about whether we do the linear interpolation, that 4.1 is like zero point, I don't know, 2x the size or something.
D
That's not really how we think about naming the models. There's a whole bunch of different parts that go into the recipe, and so it doesn't really reflect on just the pre training recipe or version numbering. But I think the 4.1 is just because of the large jump that we have in coding capabilities, long context and so on. It's more about what it's like for the end user than anything about the training recipe.
A
We can go a little under the hood on training though. And we'll say that, you know, Nano is obviously a new pre train. We also have a new pre train for Mini, and then the larger version is a new mid train. But we find that actually a significant amount of the gains come from new post training techniques. And so I think in the past the narrative was that you need to pre train these larger and larger models to get better performance. And we're finding that we're able to squeeze a lot more out of post training now.
B
Talking about how big a model is, the other side of it is the context window. You have a 1 million context. I know that Sam at Dev Day last year said that 1 million was like months away. So right on time. Can you talk about how hard that was to get to 1 million, and then maybe where the end game is in your mind? Is it ten million? A hundred million? Infinite? What really matters as you start to scale this?
A
Yeah, Josh worked a lot on long context, so he's the right person to ask.
D
Definitely. So I think the first thing that I thought was really interesting when we were going to long context is actually some of the evals that you see as headlines on maybe other blogs, where it's needle in a haystack. Actually, most of the models do really well right out of the box. But then we had to first get a lot of measurement on the longer context for long context reasoning. You know, we actually just open sourced two new evaluations that are about using the context in a more complex way. So, you know, in one of them you have to reason a lot about ordering, and the other is actually walking through graphs. So there's a lot of reasoning that you have to do in those data sets, and that's where doing long context is actually much harder. But single needle in a haystack we were able to saturate pretty easily, and then most of the work came at these harder tasks.
B
Yeah, I was going to ask how you think about length of context in terms of consuming documents versus active thinking and planning. I think there's obviously a whole part on the prompting side around building agentic workflows. Do you think that people maybe still think too much about needle in a haystack kind of document retrieval versus traversing very long plans and iterations in context?
D
Yeah, I think the mental model that I have maybe actually has some more variables in it. So there's the needle in a haystack, where you have some amount of distractors and some needles that you're trying to find. And I think that it's more so about how dense a use of the context you need. So with summarization, you're actually using the entirety of the context, whereas needle in a haystack is very sparse. And then I also generally think about orderedness. If you're going to make some sort of inference on this, are you just looking sort of front to back, or do you need to move around in the context in order to generate a good answer while the model's sampling?
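The sparse end of that spectrum can be made concrete with a small Python sketch. This is only an illustration of the idea, not how OpenAI builds its evals: the helper names `build_haystack` and `density`, the filler sentence, and the example needles are all assumptions made up for this example.

```python
import random


def build_haystack(needles, distractors, filler, n_filler, seed=0):
    """Scatter needles and look-alike distractors among filler lines.

    Hypothetical helper: the real evals are not described at this level.
    """
    rng = random.Random(seed)
    lines = [filler] * n_filler
    for text in list(needles) + list(distractors):
        # Insert each special line at a random position in the context.
        lines.insert(rng.randrange(len(lines) + 1), text)
    return "\n".join(lines)


def density(relevant_spans, context):
    """Fraction of context characters the answer actually depends on."""
    return sum(len(span) for span in relevant_spans) / len(context)


needles = ["The secret code for project tapir is 7421."]
distractors = ["The secret code for project otter is 1138."]
context = build_haystack(needles, distractors, "The sky was grey that day.", 200)

# Needle retrieval touches well under 5% of the context, while a
# summarization task depends on all of it (density 1.0).
sparse = density(needles, context)
dense = density([context], context)
```

Orderedness is a separate axis that this sketch does not capture: it only measures how much of the context matters, not whether it must be read out of order.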
C
Yeah. Is that something that you worked on with graph walks? Is that the thing?
D
Yeah, that was sort of the most synthetic and clean way to measure the model. And then, you know, we worked on a lot of other training techniques and data to sort of test the model's ability, and train in the model's ability, to reason throughout the context in a sort of shuffled way.
C
Yeah, you know, I like to give people a little bit of visual aid with these things. So I actually went into your Hugging Face release and got an example of the graph task. And so there's a few versions of this, right? There's like the BFS and DFS version. And also I guess it's very character specific. So I don't know, maybe could you tell us, you know, design choices around this? Like, what was surprisingly hard, anything like that?
D
Yeah, so the idea here is you take a graph and you encode it into the context by looking at the edge list and just putting that into the context, and then asking the model to do an operation. And then, under the hood, we're actually just executing the real operation and using that to evaluate the model's ability to work. One of the things that I found surprising at first was what the model would do when it wasn't sure how to use its context. You know, early versions of the model would just sort of loop, saying, like, oh, no, I can't find this edge that I think should be there. And yeah, I think I was actually very surprised how all models seem to have more difficulty than I would have expected on a task that, you know, we would find very simple. Or like, you know, maybe an undergrad could write a Python script to run in a couple of minutes.
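The kind of Python script Josh alludes to might look roughly like the sketch below: encode a random directed graph as an edge list, ask for a breadth-first operation, and grade against the real result. The prompt wording, node naming, and function names are assumptions for illustration and are not taken from the released dataset.

```python
import random
from collections import deque


def random_edge_list(n_nodes, n_edges, seed=0):
    """Sample distinct directed edges between nodes named n0, n1, ..."""
    if n_edges > n_nodes * (n_nodes - 1):
        raise ValueError("too many edges for this many nodes")
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(n_nodes)]
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        edges.add((a, b))
    return sorted(edges)


def bfs_frontier(edges, start, depth):
    """Nodes whose shortest directed distance from start is exactly depth."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # no need to expand past the requested depth
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return {n for n, d in dist.items() if d == depth}


def build_prompt(edges, start, depth):
    """Encode the edge list into the context, followed by the question."""
    edge_text = "\n".join(f"{a} -> {b}" for a, b in edges)
    return (
        f"Edges:\n{edge_text}\n\n"
        f"List every node exactly {depth} steps from {start} in a "
        "breadth-first search, as a comma-separated list. "
        "Leave the answer empty if there are none."
    )


def grade(model_answer, expected):
    """Exact set match; an empty answer is correct when expected is empty."""
    predicted = {tok.strip() for tok in model_answer.split(",") if tok.strip()}
    return predicted == expected
```

Because the ground truth comes from actually running the traversal, randomly sampled graphs naturally produce some questions whose correct answer is empty, which also checks whether a model invents nodes.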
C
Yeah, right. Okay, so what is the real life task that this is meant to model? I guess I feel like the other one, MRCR, seems a little bit more intuitive, where you have four different stories and you pick out the second one. And that's a real task that people have. But people don't really traverse graphs. This is a bit more theoretical, but was there any sort of correlation study done?
D
Yeah, this is actually meant to be sort of the idealized version of like a multi hop reasoning benchmark. So we have a lot of things where, you know, you're putting hundreds of documents into the context and then you might ask a question that you actually have to traverse 10 documents for. But there the edges are, they're implicit. Right. Like there is some underlying graph that's connecting all of these documents that you need to traverse in order to answer the question. But they're actually much harder to traverse because the edge isn't actually given to you. And so the question there was like, okay, if I actually just give you all of the IDs of these things that you need to traverse, can the model even do that? Where it's like a. It's actually just a lower bound on how well the model can do? And I think it's. That's actually somewhat well reflected in some of the internal benchmarks we have that are using more natural data.
A
You can imagine something like a tax return, right. Where you like upload the entire tax code. Like to figure out what to put into this box, you'll need to reference all of these boxes. And so this is a similar level of multi hop reasoning, but again, like Josh said, all of the references are implicit.
C
Yeah, I think that some kind of backtracking, if it's needed, is also super interesting, especially for agent work. For listeners who've been listening to us for a while, we actually covered this paper at NeurIPS last year, two years ago, CogEval, where they actually modeled graph traversals for agent planning. And it reminds me closely of that. It's just that they never came up with this exact format that you have here, which basically is the same thing. I also like that you included blank answers, because sometimes people do hallucinate, or models do hallucinate answers, and you have a fair amount of blank ones.
D
Thank the random sampling over graphs I did, I guess. Yeah.
B
Is this tied also to the file search API that you released recently? Like how should people think about how everything kind of comes together in the API?
A
Yeah, I think oftentimes with retrieval you might be using RAG to fill the context. And a lot of this is to get around the limitation of a short context window. So we do expect a lot of developers to start uploading their full context more directly to the model. So for smaller tasks, you maybe don't need the whole vector store, but we do anticipate this to play well with that paradigm as well. Like maybe you can just insert way more chunks into the context. So we think it'll play nice.
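As a rough illustration of "just insert way more chunks", here is a Python sketch that greedily packs retrieved chunks into a token budget. It assumes the chunks arrive already ranked by relevance and uses a whitespace word count as a stand-in for a real tokenizer; `pack_chunks` is a made-up helper, not part of any OpenAI API.

```python
def pack_chunks(ranked_chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit in the budget.

    count_tokens defaults to a crude whitespace count (an assumption);
    swap in a real tokenizer for accurate budgeting.
    """
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip this chunk, a smaller one later may still fit
        packed.append(chunk)
        used += cost
    return packed, used


chunks = ["a b c", "d e f g h", "i j"]
small_window = pack_chunks(chunks, budget_tokens=5)
large_window = pack_chunks(chunks, budget_tokens=1_000_000)
```

With a small budget the retriever's ranking decides what the model sees. With a much larger budget, everything fits and ranking matters far less for small corpora.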
C
Yeah. Any relationship to the memory upgrades in ChatGPT that we recently got? Is long context just directly usable for memory, or should we just always have a separate memory system?
A
Yeah, it's a good question. So right now, the dreaming feature, we kind of have some of these memories embedded in the context, but they are separate features. So 4.1 is powering the API, whereas the enhanced memory is ChatGPT only.
B
Yeah.
C
Awesome. Yeah, I think that's interesting. I guess one last thing I'll call out on long context, which is kind of unintuitive, or maybe there's an explanation: you had two needle for MRCR, and then we had four and eight, and everything kind of just regresses to some kind of baseline of, let's say, 30% or 20%. But it's interesting to see where the smaller models sometimes match or outperform the larger models. I was wondering if there's anything unusual there, or do you think it was like a bad roll of the dice?
D
I think it's probably just a bad roll of the dice. I think I would probably look more so at the larger ones. These things regress as you increase the number of needles, because there's sort of more complex reasoning that the model has to do about the order of different things in its context.
C
Awesome. Yeah, cool. Happy to move on from there. We have a whole bunch of other evals that we can go over, so I had in my notes that we could talk over anything that you want. There was also COLLIE from Shunyu Yao, who we've had on the podcast, for instruction following. And I realized that, you know, he joined OpenAI, and I wonder if he had a role to play in that one.
A
No, we did not collab a ton on it. Honestly, I think it's best when eval authors and model developers don't collab too much because you keep things as objective as possible. Not trying to game any evals.
C
Yeah. And then I think there was also, for the first time, the announcement or shout out of the internal instruction following benchmark from API data. People have had the ability to opt in to share data for a while. Actually, I posted a tweet because I found it in the dashboard that you can just opt in, and there's basically 16 days left for this program where you can just get free inference. So I'm just kind of curious what you found from that kind of IFEval that might be different from the normal IFEval that people have.
A
Yeah, totally. A lot of the instruction following evals that are open sourced are crafted in a way that is easy to craft. So for example, Graph Walks is somewhat easy to craft. Like you can create this graph and verify it easily, but it is not exactly aligned with what the users are doing. And this is true for some of the instruction following evals where you ask the model to output exactly four words or three paragraphs or stuff like that, things that you can verify easily in code. And these are useful instructions, but we find that many of the really interesting instructions are actually challenging to grade, and so the open source evals often don't have them. And so getting this real world diverse set of data actually helps us find what are the commonalities in what developers are doing, what is a really good example of a negative instruction, and then we can go from there and figure out how to evaluate it.
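The "easy to verify in code" instructions Michelle mentions can be checked with a few lines of Python. The checkers below are illustrative only: the function names and the simple banned-phrase version of a negative instruction are assumptions, not part of any published eval.

```python
import re


def exactly_n_words(text, n):
    """True if the response has exactly n whitespace-separated words."""
    return len(text.split()) == n


def exactly_n_paragraphs(text, n):
    """True if the response has exactly n blank-line-separated paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    return len(paragraphs) == n


def avoids_phrases(text, banned):
    """A crude negative-instruction check: none of the banned phrases appear."""
    lowered = text.lower()
    return not any(phrase.lower() in lowered for phrase in banned)
```

The last checker shows why the interesting cases are hard: a substring test catches "never say X" but not an instruction like "don't give medical advice", which needs a judge rather than a string match.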
C
Yeah, I think that's also an interesting question of what domains do people use you on? And I wonder if there's a way to tell you because sometimes it can be very confusing. Especially because maybe I'm building an app and letting people use my key, but other people are building apps on top of me. So you have just a lot of chaos of multiple degrees of abstraction where you just have to parse through the prompts.
A
Yeah, it's true. Well, I will say we do use our own products internally where we can, and so we're not manually reading every prompt by hand. After they're anonymized and we scrub them of any identifying data, we use our models to take passes to categorize them. And so if we get feedback that we're not doing well on ordered instructions, then we can kind of do a pass over all of our data and find some good examples of those.
B
So there's the instruction following section, and this great prompting guide for GPT 4.1 models. I think maybe we can go through some of these examples. The first one that caught my eye is that it's not necessary to use all caps and other incentives like bribes or tips, but developers can experiment with this for extra emphasis. So I think that second part leaves me confused. Are you saying that people should still try and do this and sometimes the model responds positively to it? Do you feel like it's still just part of the lore? I'm curious why. I would have loved for you to say either yes, it works, or like, no, you should stop, it looks silly. I guess the truth is somewhere in the middle.
A
The truth is always messy. The reality is that our models have gotten a lot better at following instructions that are just stated once and clearly. But we find, honestly, developers often become the best experts at prompting our models because, you know, you're building your livelihood on this thing and get to know the details of it really intimately. So I will say stuff like that won't hurt the performance of the model, but we kind of always want to leave it open to people to figure out what works best.
D
Yep.
B
Yeah. And then you had to always start with a "Response Rules" or "Instructions" section. Are those keywords meant to be taken verbatim? Like, those are the tokens that work the best? Or is it just an example?
A
More of an example, yeah.
B
Okay, cool. Yeah, this is great. We did an episode with the Prompt Report on all these prompting techniques, but then it's also unclear which ones work best for which model. So it's super useful. And then in the agentic workflows one, you have a persistence thing. It's like, please keep going. And I think I read that it improves SWE-bench by like 20%, just by having the persistence, keep going.
A
It's not that this one prompt improves SWE-bench by 20%, it's that we found this is the most effective harness for our model. And combined with all of the post training improvements, it results in the big improvement. But yeah, the model is trying a lot to be helpful, and often it wants to check back in with the user and be like, you know, should I keep doing this? Am I on the right track? And so a prompt like this makes sure it keeps going, doesn't bother you again, and just gets the task done.
C
Yep. Yeah. I think there's this interesting trade off between persistence and yielding back to the user. The more agentic a model wants to be, the more persistent it should be. But then sometimes it just goes off the rails, and I wonder how you solve this trade off, because sometimes it just goes too far. There's been criticisms of Claude Sonnet trying to rewrite too many files at once when I just wanted to make one thing, for example, and that's a form of bad persistence. What are the axes here in which you think about it?
A
Yeah. I think one interesting thing that comes to mind here is that we had an extraneous edits eval, where you ask the model to make an edit and then classify: were all of its changes related to what it was asked to do, or did it go off and do a little too much? And we found that 4o got 9%, which is pretty crazy. Making an extraneous edit 9% of the time is a lot. 4.1 is at 2%, so it's a pretty big improvement. So, yeah, I will just say, we've heard feedback about this, we made an eval, and we made sure to track it and improve it during training too.
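The headline number from an eval like that is just a rate over samples. A minimal Python sketch, assuming some classifier has already labeled each changed hunk as related or unrelated to the request (that labeling step, and the name `extraneous_edit_rate`, are assumptions and not details from the conversation):

```python
def extraneous_edit_rate(samples):
    """Fraction of edits that contain at least one unrelated change.

    samples: one list per model edit, holding one boolean per changed
    hunk (True = related to the request). The labels are assumed to come
    from a separate classifier, which is not shown here.
    """
    if not samples:
        return 0.0
    flagged = sum(1 for hunks in samples if not all(hunks))
    return flagged / len(samples)


# 100 hypothetical edits, 9 of which touch something unrelated.
clean = [[True, True]] * 91
noisy = [[True, False]] * 9
rate = extraneous_edit_rate(clean + noisy)
```

Note that the metric counts an edit as extraneous if any single hunk is unrelated, so one stray change in an otherwise correct diff still counts against the model.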
C
Yeah, yeah. I mean, everything comes down to the evals, which is no surprise to anybody.
A
That's true.
C
There's another interesting eval that I think is causing some noise for the first time. And you, being the master of structured outputs, should know that JSON is bad now and we should all use XML.
A
I wouldn't say that. I don't know which eval you're talking about.
C
But it's in the prompt guide, which maybe you guys didn't write, so we're kind of springing this on you.
A
Yeah, Noah and Julian on our team wrote the prompt guide and did a great job. I do think XML is very helpful for structuring prompts, whereas for parsing outputs. Maybe the story is a bit different. Like sometimes it's really useful to get outputs in JSON so you can plug them directly into your application. But I do think the models work particularly well with XML as inputs, but. Curious. Anything to add? No, no.
C
Cool. I mean, I think people always care a lot about code, tool calls and structured outputs, as you well know. And so any updates to instructions over there are good. People also are interested in this concept that, apparently, putting the instructions and user query at both the top and the bottom, so duplicating it in the context, is better than putting it top only and much better than putting it bottom only. Again, this is from the prompt guide, so I don't know how aware you guys are of this.
D
Yeah, I think part of that was just like, you know, empirical. We, we tried all three for when we were evaluating the model and having that redundancy is definitely the best. But then using those, the instructions at the beginning, the model's going to be able to then take that into account as it does processing.
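A hedged Python sketch of that layout: instructions and query duplicated before and after the documents, with XML used to structure the input. The tag names (`instructions`, `query`, `documents`, `doc`) and the function name are assumptions chosen for this example, not required keywords.

```python
from xml.sax.saxutils import escape, quoteattr


def build_long_context_prompt(instructions, documents, query):
    """Place instructions and query at both the top and the bottom."""
    doc_blocks = "\n".join(
        f"<doc id={quoteattr(str(i))}>{escape(text)}</doc>"
        for i, text in enumerate(documents, start=1)
    )
    head = (
        f"<instructions>\n{escape(instructions)}\n</instructions>\n"
        f"<query>\n{escape(query)}\n</query>"
    )
    # The same head block is repeated after the long document section.
    return f"{head}\n<documents>\n{doc_blocks}\n</documents>\n{head}"


prompt = build_long_context_prompt(
    "Answer using only the documents.",
    ["Tapirs & friends", "Second doc"],
    "What eats leaves?",
)
```

Escaping the document text keeps stray `<` or `&` characters in the data from being mistaken for structure.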
C
Yeah, I think a lot of people would See this as running counter to prompt caching because obviously you want to put the things that change a lot at the bottom basically. Is this fixable in post training? Can we just tell models to take instructions or user queries only at the bottom because we want to optimize for prompt caching?
A
When we figure it out, we will do that.
C
I mean, it seems doable. It seems like a post training thing. I don't know, maybe my mental model of post training is wrong.
D
So I think actually having things at the beginning of the prompt you would still get prompt caching there. If you're putting in for example like a big needle in haystack and you have the data changing each time like per user, there's still different ways that you can be putting the prompt at the beginning and getting a lot of the cache hits. It sort of just depends on your use case.
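To see why instruction placement interacts with caching, the Python sketch below measures how much of two requests is shared from the start. It assumes caching keys on an exact shared prefix, which is a simplification: real caches also have minimum lengths and other rules that are not modeled here, and the helper name is made up.

```python
def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


instructions = "You are a support agent. Follow the response rules.\n"
data_alice = "alice: order 17 arrived damaged\n"
data_bob = "bob: cannot reset password\n"

# Static instructions first: the whole instruction block is a shared prefix.
top = shared_prefix_len(instructions + data_alice, instructions + data_bob)

# Per-user data first: the requests diverge at the very first character.
bottom = shared_prefix_len(data_alice + instructions, data_bob + instructions)
```

With top-and-bottom duplication, the leading copy of the instructions still forms a shared prefix, and only the trailing copy after the changing data falls outside it.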
C
Yeah, awesome.
B
The other thing I noticed, and I know you made a note of this too, Shawn, is on chain of thought and reasoning, and how people should think about this model versus the reasoning models. Should I just use 4.1 and prompt it to do a chain of thought? Should I use o1 and make a plan and then use 4.1 to implement the plan? How should people think about composability?
A
Yeah, it's a great question. We have found that 4.1 is a lot better at doing planning and thinking through its steps in CoT when prompted than our previous non reasoning models. But our reasoning models are designed to have kind of more coherent plans and be able to reason over longer horizons than these non reasoning models. And you can see that reflected in things like intelligence benchmarks. So AIME, GPQA, stuff like that. You'll see the reasoning models do much better. So in general I would say the question you're really getting at is, I'm a developer, which model should I be using? And I think the answer is always going to be the fastest model that accomplishes your task. So maybe you start prompting 4.1 as a starting point. If it does your task super well, then maybe you could drop down to 4.1 mini and save latency, or even nano. Whereas if 4.1 is struggling a little bit, maybe needs more coherent reasoning over longer time horizons, then maybe you upgrade to a reasoning model.
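"The fastest model that accomplishes your task" can be written down as a tiny selection loop in Python. The model labels, the scores, and the `passes_eval` callback below are placeholders: in practice the callback would run your own task eval, which is not shown.

```python
def pick_model(candidates, passes_eval):
    """Return the first (fastest) candidate that passes, else None.

    candidates must be ordered fastest to slowest; passes_eval is a
    hypothetical callback that runs your task eval for one model.
    """
    for name in candidates:
        if passes_eval(name):
            return name
    return None


# Illustrative ordering and made-up scores, not measured results.
fastest_first = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1", "reasoning-model"]
scores = {
    "gpt-4.1-nano": 0.71,
    "gpt-4.1-mini": 0.93,
    "gpt-4.1": 0.96,
    "reasoning-model": 0.98,
}
choice = pick_model(fastest_first, lambda m: scores[m] >= 0.9)
```

Michelle describes starting in the middle with 4.1 and moving down or up from there. Scanning from the fastest model lands on the same answer when the eval is cheap enough to run on every candidate.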
B
Is there a quick way to get through these heuristics? I know one thing that a lot of people do is they use o1 for a plan, and then they put that plan in Cursor and then have the plan applied to their code base. It sounds like there's maybe not a rule for when to do which, it's just task dependent.
A
Yeah. I would say we're all kind of figuring out the best way to use these models together. And so I do think reasoning models for planning and using kind of more targeted models to execute is definitely a good architecture.
C
Cool. If there's nothing else on that side, I'd love to go into the coding, which is something that we're emphasizing a lot. It's doing super well. It's better than o1 on SWE-bench. Was that expected?
A
Not really.
C
Yeah. What's the story there? There's also SWE-Lancer, which is a newer one which attaches a money value to things. And basically, what should people understand is going on here? Is it a better coding base model or just a coding agent model? And I think there's also a question about how important coding is if I'm not using the coding use case.
A
Yeah. So I'll start by saying we just set out to make a model that was great at coding, both in your terminal or in your editor or wherever you want to use it. And so we kind of broke that down into the problems that encompasses. So like developers want the model to produce better diffs, for example, or they want the model to explore the code base correctly or they want to produce code that compiles or produce code that writes tests. And so our approach was kind of teaching the model all of these various facets. There's kind of just a bunch of work streams that all coalesced around GPT 4.1.
D
Yeah. I think much improved post training all over makes for a better coding model.
C
Yeah, I think there's different kinds of coding, right? It's interesting for me to observe that, for example, so I'm just going to pull it up on the chart here because I always like to show people visuals. You're 55 on SWE-bench and o1 gets like a 41. But then on, oh, I don't think I have the others, but on Aider it is not at o1 level. And so I think I struggle to get some kind of intuition of what the different elements of coding are. I guess there is single file edits, whether it's a diff or a whole file. And then there is entire project edits. Is that a reasonable split? Is there more to this?
A
Yeah, that's one way to think about it. Basically, where GPT 4.1 can kind of explore, go through a repo, it's been trained to do that particularly well. Whereas to just get some code and produce a change, a reasoning model might do better because it can kind of reason over the entire file. And so that's one good way to think about it.
C
Yeah, yeah, that's fair. Any understanding of like the smaller ones, the smaller models, like basically for coding I should only use 4.1 and forget the rest.
A
You might want to use the smaller models if you have an IDE where you need an autocomplete feature, for example, or if you want something super fast. If you're building, I don't know, a text to SQL thing, you might want the first version to populate instantly. So you can see 4.1 mini is actually quite significantly better than 4o mini and not that far away from the old 4o. So I do think that model will find use cases in a bunch of these coding niches.
B
And I know you might not be able to talk about this, but the clip of the OpenAI CFO talking about the agentic SWE has been going viral, I think, today. It seems like every lab is putting a lot of emphasis into coding. So yeah, I'm just curious if there's anything you can share about how people should think about OpenAI in coding. You know, obviously today you don't have, you know, a Claude Code, you don't have anything related to coding. And I think with the Windsurf partnership today, they're giving 4.1 for free for a couple of weeks. It's maybe one of the first OpenAI endorsements, I guess, on the live stream. But yeah, I know there might not be an answer that the PR team might approve, but I'm curious if you have any takes and thoughts.
D
I think just stay tuned.
A
Yeah, I think coding is an important use case for our users, and so that's why we focused on it a lot for 4.1. We also love to use our own products internally, and so making 4.1 good at coding selfishly helps us move faster as a company. And so that's where the real focus has been for this model.
B
Do you track what percentage of code is written by 4.1 internally?
A
Now, we do have some metrics like that. I don't have it off the top of my head, but I was actually just talking to one of the researchers on the team who worked on something over the weekend, and he said that this model, GPT 4.1, was able to get 49 out of 50 of its commits on this massive PR done. So we were pretty happy to hear that.
B
I'm excited to use it. Awesome. Yeah, I think on the. Yeah, I think coding is a super exciting use case and I think like OpenAI has always been very developer first as you've been too, Michelle, so it's great to see the convergence.
D
Yeah.
C
The other, I think the last capability that I kind of vectored in on was vision, or just multimodality in general. It is a lot better, basically. I really like these niche benchmarks like MathVista and CharXiv. Yeah, just any extra color on the vision side that you wanted to talk about, but maybe you couldn't fit into the blog post.
A
Yeah, yeah, go ahead.
D
Oh, I was gonna say, I think one maybe small nugget there is actually I think the 4.1 mini is really exciting on that front. As we were talking about, it's a different pre training base and I think that really shows up in some of the vision evals.
A
And yeah, we talked about, like, coding, instruction following, long context. A lot of gains coming from post training, but in particular for multimodal, like basically everything you're seeing, the gains are there from pre training. So kudos to the pre training teams there. They've done incredible work on perception and multimodal.
C
Yeah, totally. So something that we've been exploring on the podcast for a while and I'm curious if there's any takes on your side. Is there a strong split between what I call screen vision versus embodied vision? Right. Like are you taking pictures of. Are you training on snapshots of a computer for computer use or, you know, and anything with charts, anything on a PDF is very similar to that. Or pictures from the real world, which is more embodied.
B
Right.
C
Like where a robot might be able to use that. People have argued back and forth. I'm curious where the movement is or the emphasis is.
D
First off, I think that 4.1 is better at both of those things, regardless of how it was actually trained. I think I would probably somewhat defer to the pre training team when it comes to which one you should be using, or using, you know, a mixture of both. But we've improved our results across evals on both.
C
Awesome. Yeah, that's something that I think people should definitely explore, the more embodied stuff as well, because the benchmarks tend to focus on the screen vision stuff. You know, more chat.
A
It's always an eval that is easy to grade.
C
Yeah, exactly.
A
Those are the things that get looked at the most for sure, yeah.
D
I think one of the things that was really funny with both the 4.1 Mini and Nano is we had some strange internal eval results. And it turns out that, with these new vision capabilities, they were able to read, like, you know, signs in the background and stuff, which was actually changing some of the validity of our results. And so we were, you know, just running into different eval problems as you actually improve the models.
C
Is there a future of a 4.1 ImageGen, or is that like a completely different part of this vision? Like, you know, in some sense vision is image to text and the other way around is ImageGen. Is it that simple or is it something else?
A
It is not. No plans right now to get 4.1 ImageGen.
C
Well, you know, it's very, very popular.
A
We like images too.
C
It's like melting your GPUs. I mean, talking about GPUs. Right. Part of this whole deprecation of 4.5 and moving people to 4.1 is to get back your GPUs. That's a message that both Shaky and Kevin Weil have mentioned. But you are running all these models concurrently for the next three months. I don't know if you get back your GPUs. I think you just grow the usage even more.
A
Yeah, I do think people get the message on deprecation and start moving over. So as developers use this model a little less, we can kind of reclaim that compute. But you're right, it takes a while. And the trade off there is really our commitment to developers. Like if we have something in the API, we won't take it away without sufficient notice. So that's the trade off that is right for us.
C
Okay, awesome. Then a couple other smaller announcements. Fine tuning available day one, which is, I think, new for OpenAI. Usually you have to wait like a month or two for the fine tuning capability. 4.1 and Mini only, and Nano in future. Any specific callouts for fine tuning? I guess fine tuning is a general discipline that always applies. But any wins that you guys can talk about?
A
So first off, yeah, shout out to the fine tuning team. They've worked really hard to get this ready on day one. One thing I will say is that I think people have slept on the preference fine tuning offering or the. I think that's what we call the product. Yeah, so SFT is. People know it pretty well. It's the original fine tuning we had. Whereas this preference fine tuning is super helpful for steering in a particular style. And so I think not enough people are using that.
C
Isn't that only for reasoning models or is that for everything?
A
No, that's reinforcement fine tuning, which is only for reasoning models.
C
Right?
A
Preference fine tuning, the DPO stuff, where you offer the pairs. Yeah, exactly.
C
Yeah. And I thought it was in alpha. This is why I haven't looked into it.
A
I thought.
D
I think it's RFT that's still in alpha.
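To illustrate the "pairs" idea behind preference fine tuning mentioned above, here is a minimal sketch of the standard published DPO objective for a single preference pair. The episode does not describe how OpenAI's product is implemented, so this is only the textbook formula; the function name `dpo_loss`, the `beta` default, and the example log-probabilities are assumptions made up for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Textbook DPO loss for one (preferred, non-preferred) response pair.

    Each argument is the summed log-probability of a full response under
    either the model being tuned (policy) or the frozen reference model.
    beta is a hypothetical default that scales the preference margin.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative numbers only: the policy favors the preferred response more
# than the reference does, so the loss falls below log(2).
example_loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
print(example_loss < math.log(2))
```

The loss shrinks as the tuned model raises the preferred response relative to the non-preferred one, which is why this style of tuning suits steering toward a particular style.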
C
Okay, well that's a lot of confusion that we just cleared up. Yeah, I'm doing my conference again in June and I think we're going to do a workshop on just generally all the fine tuning options, and I think that will clear up a lot of things, which is good. Okay, new models. I know that we can't talk a lot about a lot of them. Noam Brown from your reasoning team just said that there should be a follow up on reasoning models soon. What can we say about that? Sounds like he's.
A
We're not the right people to ask, but stay tuned for.
C
Yeah, but like 4.1 is a good basis for like whatever comes next, right?
A
Yeah. Not all of our models kind of build on each other necessarily, but we think 4.1 is a great standalone offering for developers and we also think, you know, reasoning models are a good tool in the toolbox.
C
Yeah, more just generally. I always want to explore the relationship between non reasoners and reasoners and then also how we merge them. Are we doing routing, anything of that sort? Obviously you have a lot of secret sauce.
B
Cool.
C
And then I think the other thing that a lot of people are demanding or asking about is the creative writing model. Will that ever see the light of day?
A
We're working on incorporating kind of those improvements into the models more generally.
C
Not a separate release.
A
What people loved about 4.5 is, like, the humor, the greentext, the nuance. So we've heard that feedback and I know, yeah, there's lots of folks working on that and trying to bring it into our next models.
C
Awesome. Alessio. Anything else?
B
No, this was great. Any requests for the developer community? Things that you want them to try out that maybe people are not doing, things you want them to build for you using the new APIs?
D
I feel like first off, send us feedback. It was really useful to look at different partners and customers who are using our models and to get this, like, nice wrapped feedback from them. It allows us to iterate a lot faster.
A
And on that vein, you know, opt in to data sharing. This just helps us make the model better for you. And one kind of slept on way to do this is the Evals product. So you can upload an evaluation and opt in such that we'll pay for the inference costs if we can also use the eval. And this is just another great way, like we'll use those Evals to make sure our models are getting better for people over time.
C
Yeah, I think the Evals is permanent. There's no end date announced, but the opt in in the API is at least until April 30th. I think a lot of people still don't know about it. We might want to extend that so that people can do more.
A
Good flag. I'll raise it with the team.
C
Yeah. Awesome. And I think the last question I had was just on pricing. I think pricing, it's basically just generally cheaper than 4o, not a ton, but cheaper. And then you're also introducing this concept of blended pricing for the first time that I've seen it. But maybe it's just been out there for a while because you have caching and all that. Just generally, what is the cached to non-cached ratio that we should be thinking about when thinking about workloads? Like, is there a general rule of thumb?
A
So one clarification, which is that GPT 4.1 mini is not cheaper than GPT-4o mini, so it's not just like a blanket decrease in all the models. However, 4.1 mini is cheaper than 4.1. Also not sure if this is widely reported, but we've increased our prompt caching discount from 50% to 75% on these models.
C
Yeah, I saw that.
A
So that's a big input, you know, into figuring out what kind of application you build. And then your question was on, like...
C
What to think about, yeah, blended pricing. Right. Like, I think there's this question of comparability of prices across models and across providers. Because, you know, some people are three to one in terms of context to output, and then some part of that is cached. Selfishly, I make a chart that just plots all the model labs versus all the prices. And I'm sure you guys have seen it, and I don't know what numbers to plug in there. So what are people seeing in real life? What's the median, you know, caching rate?
A
I don't think we have that off the top. The blended pricing is more to just make it easier to compare. Like, so you can say something like GPT 4.1 is 25% cheaper than GPT-4o.
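The blended-price question above comes down to a weighted average, which can be sketched as follows. This is only an illustration: the function name, the per-million-token prices, and the 50% cached fraction are placeholder assumptions, not figures from the episode. Only the 3:1 context-to-output ratio and the 75% caching discount are mentioned in the conversation, and, as noted, the real median caching rate is not given.

```python
def blended_price(input_price, output_price, cached_discount,
                  input_to_output_ratio, cached_fraction):
    """Blended price per 1M tokens for a workload mix (hypothetical helper).

    input_price / output_price: list price per 1M tokens.
    cached_discount: fraction taken off cached input tokens (0.75 = 75% off).
    input_to_output_ratio: input tokens per output token (3.0 means 3:1).
    cached_fraction: share of input tokens served from the prompt cache.
    """
    # Average price of an input token once cached tokens are discounted.
    effective_input = input_price * (1.0 - cached_fraction * cached_discount)
    # Share of all tokens that are input versus output.
    input_share = input_to_output_ratio / (input_to_output_ratio + 1.0)
    output_share = 1.0 - input_share
    return input_share * effective_input + output_share * output_price


# Placeholder prices in dollars per 1M tokens, chosen only for the arithmetic.
no_cache = blended_price(2.00, 8.00, 0.75, 3.0, 0.0)
half_cached = blended_price(2.00, 8.00, 0.75, 3.0, 0.5)
print(no_cache, half_cached)
```

Plugging in a provider's real list prices and an observed caching rate gives one comparable number per model, which is the point of quoting a blended price.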
C
Yeah. You want one number?
A
Yeah, yeah, yeah, no.
C
All right, we'll all have to figure it out, but thank you so much. That was. That was fantastic. Thanks for all the work. I think people are very excited to get to work, testing this out, giving you feedback, and I'm sure we'll be back again for the next one. Probably the reasoner.
A
Nice. Thank you, guys.
B
Thank you.
Episode: ⚡️GPT 4.1: The New OpenAI Workhorse
Date: April 15, 2025
This episode dives deep into the launch and technical details of OpenAI's new GPT 4.1 line of models, with expert guests Michelle and Josh (OpenAI research team, post-training) joining hosts Alessio (CTO at Decibel) and swyx (Founder of Smol AI). The panel covers not only headline features, such as the million-token context window and developer focus, but also model lineage, evaluation strategies, instruction-following insights, coding capabilities, multimodality advances, and practical deployment advice for AI engineers. The conversation blends engineering rigor with community-driven feedback, making it essential listening for anyone building with or evaluating cutting-edge foundation models.
“Nano, which is even faster for developers that are making, you know, low latency applications.” — Josh [01:45]
“…for most developers, they can kind of replace a lot of their 4.5 usage with 4.1” — Michelle [04:13]
“…we're able to squeeze a lot more out of post training now.” — Michelle [07:09]
Notable Quote:
"We actually just open sourced two new evaluations... one of them you have to reason a lot about ordering and the other is actually walking through graphs." — Josh [08:12]
"The truth is always messy. Reality is that our models have gotten a lot better at following instructions just stated once and clearly." — Michelle [19:40]
“…the answer is always going to be the fastest model that accomplishes your task.” — Michelle [26:22]
“...this model, GPT 4.1, was able to get 49 out of 50 of its commits on this massive PR done, so we were pretty happy to hear that.” — Michelle [31:26]
"Send us feedback. It was really useful to look at different partners and customers... It allows us to iterate a lot faster." — Josh [38:40]
On Model Naming:
“Naming is really hard, and we've tried to make this as less confusing as we can, but, you know, nothing's perfect.” — Michelle [03:50]
On Instruction Following Lore:
“The truth is always messy. Our models have gotten a lot better at following instructions just stated once and clearly.” — Michelle [19:40]
On Coding Impact:
“This model, GPT 4.1, was able to get 49 out of 50 of its commits on this massive PR done, so we were pretty happy to hear that.” — Michelle [31:26]
On Vision Model Surprises:
“...these new vision capabilities, they were able to read like, you know, signs in the background and stuff, which was actually changing some of the validity of our [internal] results.” — Josh [34:13]
On Post-Training Innovations:
“We're finding that we're able to squeeze a lot more out of post training now.” — Michelle [07:09]
For full show notes, references, and more, visit latent.space.