Transcript
A (0:00)
OpenAI has just rolled out ChatGPT 5.4. There are actually a couple of cool features in here that I'm really excited about, things I've been wishing ChatGPT could do in the past, and they finally launched them. And of course, if you look at all of their marketing, it's basically them saying this is our most capable model yet. Of course it's the most capable model; if it wasn't, what would they even be making an update for? So I'm going to get past all of the hype and buzz from their launch and tell you some really interesting use cases and some ways I actually think you should use GPT-5.4. Before we get into all of that, if you want to try all of the latest models, go check out my startup, AI Box AI. We have the latest models from the top 15 different AI companies, everything from Grok to Gemini to Anthropic to OpenAI, plus ElevenLabs for audio and tons of cool image generation models. I think there are over 50 models on the platform total. You can try all of them side by side, and it's only $8.99 a month. So much cheaper than ChatGPT, but you get way more models, and of course you can also use it to automatically create AI workflows that complete automated tasks for you. So there's a ton of cool stuff going on. Go check out AI Box AI if you want access to all of the top models for only $8.99 a month, and it's 20% off if you get an annual plan as well. All right, let's get into what's going on. The first thing I want to mention here is that this is called GPT-5.4 Thinking, and they have a higher-performance variant known as GPT-5.4 Pro. Both of these together are designed to handle everything from complex analysis to coding to long-running workflows across a lot of different professional software tools. They're kind of dubbing this as their professional work tool.
They're trying to get it into the hands of more working professionals. And this is coming right on the back of them signing a whole bunch of deals with different consulting firms that are allegedly going to get ChatGPT into more businesses and professional environments. At the same time, they're locked in a battle; even Google's in this right now, but really it's with Anthropic and Claude Code. With their Codex tool, they're really trying to push forward how software is built with AI models and how computer use is going. So this is where they're really focusing. One of the biggest changes here is the scale. In the API, GPT-5.4 has a context window of up to a million tokens, which basically lets it work with huge documents, really long conversations, and big data sets. If you think about it, a huge benefit is going to be coding, where it can look at bigger code bases to actually work with. This was something Anthropic was really crushing at, and now OpenAI is trying to get into it. OpenAI also says the model is more token efficient, which is one thing I'm actually excited about. It can basically solve the same problems using a lot fewer tokens than GPT-5.2, so your costs come down. It's actually kind of cool if you already had 5.2 running in some software, and even if you don't, a lot of the software you use will. The costs come down a lot, and it also gets a lot faster, so costs go down and speed goes up. So yeah, for me this is something I'm genuinely excited about. As far as how the benchmarks look, I'm not trying to sit here and nitpick benchmark percentages, but I did want to talk about some interesting use cases and reasons why they're good.
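To make that token-efficiency point concrete, here's a quick back-of-the-envelope sketch. All of the per-token prices and token counts below are hypothetical placeholders for illustration, not OpenAI's actual pricing or actual model usage; the only claim from the episode is the general shape: fewer output tokens for the same job means a lower bill.

```python
# Rough sketch of why token efficiency matters for API costs.
# Every number here is a hypothetical placeholder, not real OpenAI pricing.

def job_cost(input_tokens: int, output_tokens: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one API job, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical: an older model burns 40k output/reasoning tokens on a task,
# while a more token-efficient model does the same task in 25k.
old = job_cost(input_tokens=100_000, output_tokens=40_000,
               price_in_per_m=1.25, price_out_per_m=10.00)
new = job_cost(input_tokens=100_000, output_tokens=25_000,
               price_in_per_m=1.25, price_out_per_m=10.00)

savings = 1 - new / old
print(f"old=${old:.3f}  new=${new:.3f}  savings={savings:.0%}")
```

With these made-up numbers the same job gets roughly 29% cheaper, and since fewer tokens also means less generation time, the speed win comes along for free.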
Specifically, it's leading on a bunch of the better-known benchmarks. One of those is coding, and of course we know why that's important right now. But there's also computer use, and this is something I'm excited about. I feel like Anthropic is really crushing it with computer use right now: basically, the model can look at everything on your screen, go click on stuff, and get things done for you. This is a use case I've been using a lot with Anthropic's Claude for Chrome browser extension. It's a button you click that opens a side chat bar, and I use it on really complex UIs or complex websites. I'm not a developer, but recently I had to do some stuff on Google Cloud to set up a tool I was vibe building on Lovable, and I needed to beef up my back end so it could do some extra fancy stuff. I didn't really understand anything Lovable was telling me I needed to do. So I opened up the Claude sidebar and told it, look, I'm on my Google Cloud account, go, and here are the instructions from Lovable. And it clicked around and set stuff up for me. Now, should I have a real developer look over this? We're going to throw caution to the wind for the time being, and I can hear all the developers screaming into their headphones right now. But at the end of the day, it got it done, my software is now functioning, and I did not have to watch a whole bunch of long YouTube tutorials on how to set up something complex. Complex for me, anyway, because I have no idea how to do Google Cloud stuff. So this is a really incredible use case for a lot of reasons, and I think OpenAI beefing up its capabilities in computer use is really exciting, because they're going to start competing more directly with Anthropic. It's not like Anthropic is the only one working on this; OpenAI has been doing it for a long time with agents, but it feels like it's getting a lot better. Okay, the other one I'm excited for is that they're getting a lot better at knowledge work. These are the kinds of things I think everybody uses it for, so this is something where we'll see some incremental improvements. On OpenAI's GDPval benchmark, which checks tasks across 44 different occupations, so it kind of shows how different professionals can use this, it is exceeding industry professionals in 83% of comparisons. They're saying, look, these are the tasks that people in all of these professional industries are doing, and it's beating what an industry professional might give you in 83% of those cases. That's a really big jump from the roughly 71% GPT-5.2 was getting. So upgrading from GPT-5.2 to GPT-5.4 takes us from 71% to 83%, and it's basically just going to be a lot better for knowledge work. And by a lot better, I mean we're seeing a 12-percentage-point jump, which is pretty significant. On some of the coding benchmarks, like SWE-Bench Pro, a software engineering benchmark, the model is getting slightly better than the last version. That's good, but beyond getting slightly better, it's actually quite a bit faster. If anybody has used a lot of these software tools, specifically, we use Claude Code at AI Box, and my developer sends me screenshots of these really long, elaborate tasks it's doing on our back end, our code base. I swear it's like a goal for him to see how long he can get Claude Code to run continuously without stopping on a project he gives it.
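One way to see why that GDPval-style jump matters more than it looks: the 71% and 83% figures come straight from the discussion above, and everything else here is just arithmetic. Going from winning 71% of comparisons to 83% means the rate at which the model loses to the professional drops from 29% to 17%, which is over 40% of the remaining losses eliminated.

```python
# Back-of-the-envelope on the win rates quoted above.
# 0.71 and 0.83 are the figures from the episode; the rest is arithmetic.

old_win, new_win = 0.71, 0.83

absolute_gain = new_win - old_win            # 12 percentage points
old_loss = 1 - old_win                       # 29% of comparisons lost before
new_loss = 1 - new_win                       # 17% lost now
loss_reduction = 1 - new_loss / old_loss     # share of losses eliminated

print(f"absolute gain: {absolute_gain:.0%}")
print(f"losses: {old_loss:.0%} -> {new_loss:.0%} ({loss_reduction:.0%} fewer)")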
It's funny, because I'm vibe coding stuff on Lovable and I usually get a Lovable response back in a minute or two, while he has Claude Code go for like three and a half hours on a task. So when this model gets faster, I'm excited, because hopefully that three and a half hours gets cut down on some of the stuff we're working on. I think one of the things it's also very good at is real computer interaction. There's a benchmark called OSWorld-Verified that evaluates how well an AI can operate a desktop environment. It pretty much just takes a screenshot and then uses keyboard and mouse commands to go click stuff. Right now it has about a 75% success rate. I've used ChatGPT agents; they're not perfect. They're actually not my go-to. I don't use them that much, though I wish I could use them more. I think Anthropic is doing better here, but a 75% success rate means they are improving. Their success rate is up a bit and it's better than GPT-5.2, though I still don't think it's the best. There's a major focus on how it's being used professionally. OpenAI says the model is now significantly better at producing the kinds of deliverables people use in real work: spreadsheets, presentations, financial models, legal analysis, all of those. They ran a bunch of different tasks, and on one performed by a junior investment banking analyst, it got 87% compared to the 68% GPT-5.2 got. Human evaluators also preferred it about 68% of the time, saying it had better visuals and better structure. So there's some cool stuff. Okay, on to cool features you might actually use today. This is the one I'm very excited about. It has what they're calling steerability. Basically, when you're talking to ChatGPT, and this is available in the API too, which I think is crazy, you can kind of see its reasoning, right?
It's thinking through some stuff, and after it puts a couple of steps down, you realize it's going in the wrong direction. Maybe you said, hey, I'm trying to find the best beach for surfing, and it says, okay, looking at beaches in Kauai, and you're like, oh crap, I'm in California, I don't want to see Kauai. Then you can type a message, like, specifically in California, mid-response, and it actually takes into account what you just said. That's steerability: it incorporates your message into what it's looking at and into its reasoning, and gives you an updated response. So basically you can send mid-response prompts and it will take them into account and give you a better answer without starting over. It's kind of interesting, because I think they did a couple of clever things here. One of them is that when you ask a question, you have to wait for it to think and wait for it to reply. You sit there and you wait, and we all hate waiting. If, in the middle of waiting, we're reading its line of reasoning and giving it more input and more feedback, it feels like we did a lot less waiting. We're really just reading and throwing something in, and it can get the job done faster and better, rather than having to wait for it to spit out the whole thing and then saying, okay, this is wrong, here's why it's wrong, and here's what you should do instead. You can do all that in the middle of the response, which is really cool in my opinion. Something else they've focused on is online research. Apparently it can search across a greater number of sources on the web. So instead of just, okay, we're looking at this website and some data, now we're going to look at this website.
It's going to search a ton of sources at the same time across the web, and then it's going to follow leads across different pages. It might get an idea from something it's reading in one article, follow that to another article, and bounce around a lot more. So it's doing, I know we've had deep research for a while, but deeper research, if that's a thing. And it's going to combine all the information it gets into one coherent answer. Basically, this is going to be more useful for the more complex questions where the information is scattered across a lot of different sites instead of sitting in one place. Not every question you ask will need this, but sometimes when you have a complex question, it's going to be able to get you a more coherent answer quicker. So this is great. They also have all this, I don't know, fluff in their launch about how it hallucinates less and makes fewer factual errors and all that kind of stuff. I don't think that's super important. One thing we also heard is that it's going to turn you down less. So if you ask a question, it's less likely, allegedly, according to Sam Altman, to refuse to answer. However, our good friend Connor Grennan, who hosts the AI Applied podcast with me, was testing this. I saw a post he made on LinkedIn where he asked it, is it true that an air bubble inside an IV could kill me? Apparently it typed out the whole response to him, and then, just like we saw with DeepSeek, the Chinese censored model, where if you ask anything about Tiananmen Square it types the answer out and then it disappears with a sorry, I can't help with that, ChatGPT apparently did the exact same thing.
This is also kind of a tricky moment, because New York right now is trying to pass legislation basically saying AI models can't answer questions about medical help, legal advice, all of these different areas. I think they're even trying to put hair stylists in there. It's basically all of the different industries with regulatory capture; they just don't want people to be able to get the answers for free. So I'm, I don't know, kind of bummed about that legislation and the fact that people are seriously considering it. Anyway, it doesn't seem like the model is that much better about refusals, but maybe it's moving in a good direction. I'm not 100% sure. It still feels like there are other models that are more of the adult in the room, but you also get pros and cons with those models. Grok famously will answer basically any question you have about any of those topics, but there might be some other cons with Grok, so pros and cons to all of the models. Thank you so much for tuning into the podcast today, guys. If you enjoyed the episode, it would really help the show a ton if you left a rating or review wherever you listen to your podcasts. Just drop me a note: say if you enjoyed it, say where you're from, say what topics are interesting to you. I read all the reviews and all the comments, and it helps a ton. Also, make sure you go check out AI Box AI if you want access to all of these latest models in one place, so you don't have to pay a $20 subscription to 10 different platforms. It's $8.99 a month and you get access to over 40 different AI models. So go check it out, link in the description. AI Box AI. I'll catch you guys all in the next episode.
