Summary (7 min read)

Everyday AI Podcast – Ep 728: GPT-5.4 Released: 7 Takeaways You Need to Know About OpenAI’s New Model

Host: Jordan Wilson
Date: March 6, 2026

Episode Overview

This episode dives into OpenAI’s latest large language model release: GPT-5.4, available as GPT-5.4 "Thinking" and GPT-5.4 Pro. Jordan Wilson breaks down why this release matters, how it shifts the direction of AI from being “just a chatbot” to a true work system, and highlights seven key, sometimes overlooked, takeaways for professionals, developers, and everyday users. He brings clarity to new features, tackles confusing model naming, and explores what makes this update a potential game-changer in the workplace.

Key Discussion Points & Insights

1. The Basics: What’s New with GPT-5.4?

(02:00–05:40)

GPT-5.4 is available now to paid ChatGPT subscribers (not yet to free users) in ChatGPT, the API, and Codex.
Major improvements in:
- Factual Accuracy: OpenAI reports 33% fewer hallucinations.
- Real-World Use: Excels at spreadsheets, docs, and presentations, now with higher reliability.
- Developer Focus: Native computer-use capabilities for desktop/browser tasks, token-efficient tool search, and improved integration with coding/data tools.
Million-Token Context Window: Available via the API and, in some use cases, Codex (not inside ChatGPT itself).
New safety and monitoring controls for professional deployment.
GPT-5.2 (prior model) sunsets in June.

Notable Quote:

“This wasn't just a model update from OpenAI. It was a flex. GPT-5.4 feels like OpenAI is making a direct play at developers, researchers, and anyone building serious AI workflows...the gap between chatbot and work system is officially dead with this release.”
— Jordan Wilson (01:30)

2. Benchmarks and Real-World Impact

(05:40–08:30)

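OpenAI's published chart puts GPT-5.4 ahead on all but one of the benchmarks shown, compared against Anthropic’s Claude Opus 4.6 and Google’s Gemini 3.1 Pro. Jordan cautions that vendors cherry-pick which benchmarks they display.
Notable advancements in:
- OSWorld-Verified (computer use tasks)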
- Tool use efficiency and faster, more reliable tool searching
- Browser-based automation

User Reality:

“People are going to be a little confused because they were maybe using GPT-5.2 yesterday and then it's GPT-5.4 today.”
— Jordan Wilson (08:10)

Seven Key Takeaways

1. Model Naming Is a Mess

(08:40–11:55)

OpenAI’s complex naming system is confusing users and businesses.
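Example: Multiple versions (5, 5.1, 5.2, 5.3, 5.4, etc.) are often available only on specific platforms or for short periods. GPT-5.3 Codex lived only in Codex, and GPT-5.3 Instant was the latest ChatGPT model for less than 24 hours.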
The vast majority of users don’t know which model they’re actually using, causing inconsistent experiences.

Notable Quote:

“It's bad. It's actually confusing consumers, probably hundreds of millions of them.”
— Jordan Wilson (09:35)

2. Direct Competitive Play Against Anthropic

(11:55–16:20)

The release appears timed as a response to perceived competitive moves from Anthropic (incl. marketing jabs and product releases).
OpenAI is directly targeting Anthropic’s strengths:
- Tool usage efficiency
- Long-context processing without breaking
Jordan sees OpenAI as “going for Anthropic’s throat,” aggressively competing on core technical and performance features.

Notable Quote:

“OpenAI with 5.4 improved token consumption, improving their ability to call those tools as well...really making a harder play in long horizon tasks.”
— Jordan Wilson (15:05)

3. Codex Becoming a Requirement

(17:30–21:55)

Codex (OpenAI’s multi-purpose desktop and coding tool) now central for both technical and non-technical users:
- 1M Token Context: Essential for heavy workflows.
- Windows & Mac: Now available cross-platform.
- New “Playwright Interactive”: Empowers browser automation and complex cross-app workflows.
- Beyond coding: Useful for all sorts of work, not just programming.

Notable Quote:

"I'm doing so many non-technical, non-coding tasks in Codex...At that point, it's much more than a coding tool. It is a co-worker."
— Jordan Wilson (20:27)

4. Coming for Data & Analyst Roles

(21:55–25:20)

Jordan argues the question is no longer whether the model is an AI tool or a junior researcher, but whether it works at a junior or a senior researcher/analyst level:
- Native spreadsheet and document generation
- Dedicated Excel integration (with Google Sheets integration coming soon)
The line between AI as a helper and AI as a replacement for white-collar tasks is rapidly blurring.

Memorable Moment:

“The original ChatGPT couldn’t add, right? And even models a year ago couldn’t edit a spreadsheet. Now...they can do all those things...by default, they’re agentic.”
— Jordan Wilson (22:55)

5. Thinking Models Are Now More Than Just ‘Chat’

(25:20–29:10)

“Thinking” modes now feel like full systems (not just smart chatbots).
Features:
- More “agentic” behavior, meaning multi-step research, tool calling, and context expansion.
- “Deep web research” built into the “thinking” tier.
- Ability to “steer” the model mid-task (interrupt and change direction), previously limited to the Pro model.
- Number of “thinking” levels depends on plan: four on the Pro plan, two on the $20-a-month plan.
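
The tool calling mentioned above can be sketched in a few lines. This is a toy illustration, not OpenAI's implementation or API: the function names (`run_python_tool`, `web_search_tool`, `choose_tool`, `handle`) are hypothetical, the search tool returns a placeholder instead of touching the network, and the routing rule only mirrors Jordan's plain-language description from the episode (throw a lot of numbers at it and it uses Python; otherwise it uses web search).

```python
import statistics

def run_python_tool(numbers):
    """Hypothetical stand-in for a code-execution tool: summarize numbers."""
    return {"mean": statistics.mean(numbers), "max": max(numbers)}

def web_search_tool(query):
    """Hypothetical stand-in for a web-search tool; makes no network call."""
    return f"[search results for: {query}]"

# Registry of tools the toy "model" is allowed to call.
TOOLS = {"python": run_python_tool, "web_search": web_search_tool}

def choose_tool(request):
    """Toy router: a non-empty list of numbers goes to Python, the rest to search."""
    is_numeric_list = (
        isinstance(request, list)
        and len(request) > 0
        and all(isinstance(x, (int, float)) for x in request)
    )
    return "python" if is_numeric_list else "web_search"

def handle(request):
    """Pick a tool, call it, and return (tool name, tool result)."""
    name = choose_tool(request)
    return name, TOOLS[name](request)

if __name__ == "__main__":
    print(handle([120, 340, 95]))
    print(handle("quarterly revenue benchmarks"))
```

In a real system the model itself decides which tool to call and with what arguments; the hard-coded `choose_tool` rule here only stands in for that decision.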

Notable Quote:

“Now when you’re using Thinking, it feels like a system, not a smarter chat.”
— Jordan Wilson (26:45)

6. State of the Art (SOTA) Computer Use Agent

(29:15–33:20)

GPT-5.4 introduces native, default computer-use agents for complex workflows across multiple apps—a leap ahead of both Google and Anthropic in technical benchmarks.
In Codex/API, you can orchestrate browser interactions, automate applications, and direct agentic behaviors at scale.
This is a builder’s dream but will soon shape mainstream work tools as well.

Memorable Moment:

“This just made agents much smarter. But it also gave us, non-technical humans, the potential capabilities to direct agents...faster, better, and more token efficient.”
— Jordan Wilson (31:30)

7. Benchmarks Don’t Matter, Except GDPval Does

(33:20–39:15)

Most AI benchmarks are more about marketing than practical use.
The GDPval benchmark is the exception: it tests deliverable creation (spreadsheets, docs, slides), judged by experts, across 44 occupations.
Massive leap: GPT-5.4 Pro now matches or beats human experts 82% of the time on GDPval tasks, a dramatic rise from GPT-4o’s 12% less than a year ago.
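
The arithmetic behind these figures is simple enough to check. The sketch below assumes the two percentages exactly as quoted in this summary (82% for GPT-5.4 Pro, 12% for GPT-4o) and assumes ties count for the model, which is how Jordan arrives at his "18% chance" framing; the function name is hypothetical.

```python
def gdpval_comparison(model_rate_pct, baseline_rate_pct):
    """Derive the expert-win share and the gain over a baseline win-or-tie rate."""
    expert_wins_pct = 100 - model_rate_pct          # expert output preferred outright
    point_gain = model_rate_pct - baseline_rate_pct  # gain in percentage points
    ratio = model_rate_pct / baseline_rate_pct       # relative improvement
    return expert_wins_pct, point_gain, ratio

if __name__ == "__main__":
    expert_wins, gain, ratio = gdpval_comparison(82, 12)
    print(f"Expert preferred outright: {expert_wins}%")
    print(f"Gain over baseline: {gain} percentage points ({ratio:.1f}x)")
```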

Notable Quote:

“Benchmarks don’t pay bills. Benchmarks don’t help us do work better...GDPval measures how good a model is at creating economically valuable, viable work on its own, which is what matters.”
— Jordan Wilson (35:00)

The Big, Overlooked Takeaway:

“If you’ve been in an industry 10 or 15 years...You only have an 18% chance to beat GPT-5.4 Pro. So not wild, right? ...Where we’ve come in the past year is not normal...now, they’re better than almost all experts.”
— Jordan Wilson (38:25)

Notable Quotes & Timestamps

“The gap between chatbot and work system is officially dead with this release.” – Jordan Wilson (02:00)
“OpenAI is going for Anthropic’s throat with this one.” – Jordan Wilson (11:57)
“I think Codex does both [what Claude Code and Claude Cowork do]...it is a co-worker.” – Jordan Wilson (20:30)
“The world runs on Excel...and now ChatGPT just did that.” – Jordan Wilson (23:48)
“Thinking, again, I think the older versions were just smart chats. Now it feels like you are working with an agentic system.” – Jordan Wilson (27:10)
“This just made agents much smarter...gave us non-technical humans the capabilities to direct smarter agentic browsers at scale.” – Jordan Wilson (31:30)
“Benchmarks don’t pay bills...GDPval measures how good a model is at creating economically valuable, viable work.” – Jordan Wilson (35:00)
“Now, they’re better than almost all experts...maybe GPT-5.4 starts that conversation away from ChatGPT as a chatbot to ‘where work gets done.’” – Jordan Wilson (38:25)

Important Timestamps

00:16 – Introduction to GPT-5.4, purpose of the episode
02:00 – Model availability and new features
05:40 – Benchmarks, context, and performance
08:40 – Takeaway 1: Model naming confusion
11:55 – Takeaway 2: Competitive landscape (“Anthropic’s throat”)
17:30 – Takeaway 3: Importance of Codex for all users
21:55 – Takeaway 4: AI’s role in analytical/data tasks
25:20 – Takeaway 5: Evolution of “thinking” models
29:15 – Takeaway 6: SOTA computer-use agents
33:20 – Takeaway 7: The benchmark that matters, GDPval
38:25 – The overlooked leap: AI as a work system outperforming human experts

Overall Tone & Language

Jordan Wilson maintains a conversational, unscripted, practical tone—balancing technical depth with big-picture, business-focused clarity. He draws on both personal experience (corporate trainings, everyday user perspectives) and competitive market analysis. The language is direct, occasionally cheeky, and focused on actionable insights.

Final Thought

Jordan closes with a call to listeners:

“Maybe, just maybe, GPT-5.4 might be the model...that moves away from ChatGPT as a chatbot to ChatGPT as the place where work gets done.” (38:59)

For more:

Visit youreverydayai.com for the daily newsletter and episode links.
For foundational AI knowledge, check the “Start Here” series (episodes 691+).


Transcript

[00:01]
A
This is the Everyday AI show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business and everyday life.
[00:16]
B
OpenAI just dropped their latest model in GPT-5.4 Thinking and Pro, and it's obviously state of the art in performance and benchmarks. Yeah, we knew that. In technical talk, that means it's just a really, really good model. But this wasn't just a model update from OpenAI. It was a flex. GPT-5.4 feels like OpenAI is making a direct play at developers, researchers and anyone building serious AI workflows. And if that kind of sounds like a flex against Anthropic, you're right. But let's leave behind the competitive side of this GPT-5.4 release, which comes on the heels of impressive updates from both Anthropic and Google. But between the benchmarks and features and tokens from the announcement, I think there's something much bigger in terms of takeaways. There's two stories here. The public story is, okay, impressive new model, but the real story I think is, okay, bold new direction. GPT-5.4 shows OpenAI is leaning harder into long-running work, better tools, deeper research and computer use. That means the gap between chatbot and work system is officially dead with this release. So today I want to unpack why this launch feels like a step towards systems that work for you, not just smart models that you talk to. So on today's show, here's what you're going to learn and what we're going to go over: quickly separate what's important from what's just marketing when it comes to these new models, GPT-5.4 Thinking and GPT-5.4 Pro. We're going to break down the meaningful benchmarks in simple language and how they'll impact your use. Then we're going to dish out what I think are seven of the more important or not-talked-about takeaways that you need to know about this latest update. And then I'm going to reveal at the end the one most important takeaway that no one is really talking about. All right, let's do this thing. Welcome to Everyday AI. My name is Jordan Wilson. If you're new here, well, you can guess by the name. We do this every day.
It's an unedited, unscripted, live-streaming podcast helping everyday business leaders make sense of the non-stop updates. Because yeah, they're non-stop. I tell you what matters, what doesn't, and you take that information to be the smartest person in AI at your company. So it starts here. But if you really want the goodies, that's on our website, youreverydayai.com. Make sure you go sign up for the free daily newsletter. And hey, while you're there, also just go subscribe to our podcast, right? You can always go find the links on there.
So let's talk about the basics here. What's new in GPT-5.4? If you're a little confused by the names and everything, well, let's start there. Well, actually, let's not start there. Let's give you the basics first and then I'll give you some of my takeaways. So on paper, I mean, this is OpenAI's most capable and efficient model, and right now it is available in ChatGPT, the API and Codex platforms if you are a paid subscriber. All right, if you're a free subscriber, you're not going to see GPT-5.4, at least not for now, and maybe not for a while. We'll see. GPT-5.4 Pro offers maximum performance for complex tasks and professional use cases, and it integrates top coding abilities, tool use and document handling improvements. Let's talk a little bit about real-world accuracy and just work. Well, one of the things that it's great at now is excelling at spreadsheets, presentations and creating documents with higher factual accuracy. So cutting hallucinations down, according to OpenAI, by 33% and just improving greatly, both visually and in different benchmarks, on just doing real work, right? Literally creating presentations, documents and spreadsheets, all with a model, right? Not even having to go into a different mode. A lot of people still don't know that about ChatGPT and its base models.
But yeah, it does all that out of the box, and right now it does outperform previous versions in knowledge work across 44 occupations and industries, and it significantly reduces errors and hallucinations for more reliable outputs compared to its previous models. But I mean, some of the biggest advancements are on the computer and tool use side, and we will go over these benchmarks here in a second. So this is OpenAI's first model with native computer use capabilities for desktop and browser tasks. So yeah, if you are needing some of that, right, whether you're a developer or wanting to use it on the front end, right. Recently it was Anthropic. Very, very recently, I think, Google Gemini and their 3.1 Pro came along, and now, you know, OpenAI is kind of leading with this, right? I was actually surprised, at least from their marketing and messaging, that they leaned so heavily into tool search, right? They talked multiple times about improved tool search, you know, cutting down on the amount of tokens that it takes to even grab tools. So a much more technical angle in this release, but really pumping up browser use, computer use and just being able to handle a larger tool ecosystem with lower token usage and faster responses. And the big number: a million. That is a million-token context window, but not inside ChatGPT. That would be great. I don't think we're going to get that really from any provider anytime soon, but it is available in the API and in some use cases in Codex as well. That's one of the reasons I'm using Codex all the time. But OpenAI does say that there's enhanced safety, cybersecurity and reasoning monitoring for professional deployment. So like I said, you do have to be on a paid plan to use this right now. And if you are using the older model, which is actually GPT-5.2, yeah, confusing, I know, that is going to sunset in June. Let's take a quick look at the benchmarks, and I will get into these a little bit more here in a minute.
But I mean, really good, right? You don't have to be on the live stream and see the benchmark screenshot from OpenAI. So obviously, as with any new release, you have to keep in mind they're always going to cherry-pick what they're showing, what they're not showing. You know, sometimes there's different versions of benchmarks. So I mean, usually when you see this from a company, you know, they're not going to put every single benchmark, even the ones that they, you know, maybe aren't the leaders in. So in this one, for the most part, you know, across OSWorld-Verified, WebArena-Verified, GDPval, BrowseComp, SWE-Bench Pro, GPQA Diamond, FrontierMath, right, all of those, you know, I think all of them except one, OpenAI is winning against their competitors. In this instance they're showing it against Claude Opus 4.6 and Gemini 3.1 Pro, which are the most up-to-date and latest frontier offerings from their main competitors in Anthropic and Google. So some pretty big jumps across the board. And the thing that we have to realize, and that's, you know, good to talk about now, because at least on the chart, if you're reading this left to right, right, we have GPT-5.4 Thinking, Pro, and then we have GPT-5.3 Codex, right? But here's the thing. Almost every single ChatGPT user, right, all 900 million of us, no one was using that because it wasn't available inside of ChatGPT, right? So for the most part, people are going to be a little confused because they were maybe using GPT-5.2 yesterday and then it's GPT-5.4 today.
All right, but let's get into the seven takeaways, because we're going to start there where I just ended. Because takeaway number one, OpenAI has still not solved the model naming problem, right? Back last summer, OpenAI CEO Sam Altman said, yeah, we're going to solve the naming problem. And at first it seemed like maybe they were right, because at the time you had models like GPT-4, GPT-4.1, GPT-4.5, you had o3, o3 high, o1, right?
It was super confusing because you had two different classes of models. So Sam Altman said, all right, well, we're going to come out with GPT-5. It's going to have this smart, you know, model router and, you know, that's it. And you're just going to go in there and it's going to route you to the right model you need. Obviously that didn't work. GPT-5 auto or GPT-5.2 auto is still an option, so we'll see what happens with that between, you know, free users, paid users. We'll get to that. But takeaway number one, it's about more than OpenAI being confused with their model naming. It's bad. It's actually confusing consumers, probably hundreds of millions of them, right? I do a lot of trainings, right? Corporate trainings, you know, virtual, whatever. I don't think I've ever met anyone that really knows what model is what. And that's probably a bad thing, right? And I think this is something where Google has done probably a much better job. You know, they have 3.1. You know, go use 3.1, right? I think, unfortunately, you know, OpenAI is probably the furthest behind in this. Anthropic is not, you know, that much better, although they did get a little better by bringing all their latest models to 4.6 two weeks ago. But before that you had a couple different tiers as well. So not only is it confusing having multiple, but even the last like week or so and the last two months have also been confusing because, like I said, we just jumped from 5.2 to 5.4. So most people, for the last couple of months, they've been using GPT-5.2, but there's all this talk of GPT-5.3, and everyone's like, where is it, right? Everyone's like, I don't see this GPT-5.3 Codex model. Well, that's because it was only in Codex. And then to make that even more confusing, earlier this week OpenAI teased the GPT-5.4 release. So we obviously knew this was coming. And then they released GPT-5.3 Instant the same day, which is really only for free users, right?
And that's all who should be using GPT-5.3 Instant. So there was technically a GPT-5.3 in ChatGPT that was the quote-unquote latest model, but like for less than 24 hours. So takeaway number one, this is confusing for consumers, for small businesses, enterprise, across the whole AI landscape because, like I said, I talk to so many people and, number one, most people don't even know, right? The overwhelming majority of people use the default model. And that's different depending on what plan you're on. And that's usually a very bad thing because you should always be using a thinking variety, right? I tell people humans are impatient, right? Because sometimes people say, oh, I don't want to use thinking or a high version of thinking, you know, I just want the answer. Okay, well, you're going to get an answer and then you're going to spend five times that trying to make it better, right? Because a default, especially the GPT default models, not good. They're not, right? Yes, 5.3 Instant's a little better, but compared to what you have, they're very bad, right? And that makes the whole lineup seem even more messy and confusing.
All right, takeaway number two. OpenAI is going for Anthropic's throat with this one. All right, that's the first thing as I'm looking both at the benchmarks and in the marketing and how they're angling this, you know, like showing some of their use cases. Yeah, you know what? I don't think the OpenAI team has taken the recent kind of pseudo-rivalry with Anthropic too lightly. Is it, you know, coincidental timing that we got this release like 24 hours after the report of Anthropic's CEO, you know, in a leaked memo, kind of taking some shots at OpenAI? Obviously, the Super Bowl commercial taking a shot at OpenAI, which is something Anthropic hasn't really been known for, right? Anthropic, up until the last like five weeks, they've kind of been the good guy, right? The good guy.
That's concerned about safety. And now it doesn't seem like that. Now it seems like they're trying to bully their way into relevance. Yes, I do love the Anthropic models. Especially over the last five or six months, I think they've gotten much better. You unfortunately have to pay $200 a month to get any utility out of them. That's beside the point. But they've really, I think, changed in their persona, right? Maybe it's intentional. I don't know, maybe they need to be the loudest AI lab in the room because outside of, you know, us, quote-unquote us, right? If you're listening to this podcast, you obviously know Claude, right? Everyone knows Anthropic, but no one else does, right? They had these studies, like speaking of the Super Bowl commercial, I think it was like they had 7% recognition. No one knows. Outside of the AI bubble that many of us choose to live in, no one knows Anthropic, right? Most people know Google Gemini, most people know Copilot. Just about everyone knows, you know, OpenAI, ChatGPT. It's become synonymous with AI, right? No one knows Anthropic. So Anthropic trying to, you know, bully their way maybe into a little more relevance probably wasn't the right move. And I do think that this model, GPT-5.4, is going straight for their throat because everything that Anthropic has hung their hat on over the past 18 months is exactly what OpenAI updated in GPT-5.4, right? Just around tool usage, efficiency in that tool usage. My gosh. Like everyone always goes crazy. Which I get it, the models are great. But I paid $200 a month for Claude, right? I also paid $200 a month for ChatGPT. I was doing some side-by-side testing. There's certain prompts, right? Everyone's like, oh, you know, Claude agentic, blah, blah, blah, right? A single prompt, I kid you not, and it crashes every single time because it runs over the context window. The compaction inside Claude breaks it, right?
So for all this stuff about, oh, you know, Claude's, you know, the context window and tokens and the tool usage. Okay, good. But long context, OpenAI is in a league of their own, right? Especially when it comes to long context with transparency in tool use and not breaking. So not just that, but OpenAI with 5.4 improved token consumption, improving their ability to call those tools as well. So just really making a harder play in long-horizon tasks.
[16:25]
B
Moves too fast to follow. But you're expected to keep up. Otherwise your career or company might lag behind while AI-native competitors leap ahead. But you don't have 10 hours a day to understand it all. That's what I do for you. But after 700-plus episodes of Everyday AI, the most common question I get is, where do I start? That's why we created the Start Here series, an ongoing podcast series of more than a dozen episodes you can listen to in order. It covers the AI basics for beginners and sharpens the skills of AI champions pushing their companies forward. In the ongoing series, we explain complex trends in simple language that you can turn into action. There's three ways to jump in. Number one, go scroll back to the first one in episode 691. Number two, tap the link in your show notes at any time for the Start Here series. Or you can just go to starthereseries.com, which also gives you free access to our Inner Circle community where you can connect with other business leaders doing the same. The Start Here series will slow down the pace of AI so you can get ahead.
So let's go to takeaway number three. Codex is becoming a requirement. All right, so not only is that 1 million token context window something that you can take advantage of in Codex, which is great, I think that takes the tool from, you know, a nice, you know, desktop app, right? They just released it for Windows this week after last month releasing a dedicated Mac app for Codex, right? So I think it's gone from, okay, this is a software development tool, right, an IDE, to, you know, great for vibe coding. So now it's like, you know, able to refactor entire code bases. But I think Codex is becoming a requirement for non-technical people, even just for the, right. Because at least right now if you have a ChatGPT paid account, you have Codex, right? And I think there's still a couple weeks where the limits are double, right? I've yet to hit limits. And I know you're tired of me saying this.
Yes, I'm always running Codex. Like I'm running Codex right now. Every single time I'm recording, Codex is running 24/7. I've never hit limits. It's absolutely wild. Yes, I'm on the $200 a month plan. There's double limits. But regardless, I'm doing so many non-technical, non-coding tasks in Codex, right? I was actually chatting with someone at OpenAI and I'm like, you guys need to push this, right? Like sometimes I just give advice, sometimes companies ask me things, and I'm like, you guys need to push this because Claude is really pushing Cowork, right? Anthropic has their Claude Code and Claude Cowork, and I think Codex does both, right? But a lot of people are looking at Codex like it is just, you know, their version of Claude Code, and it's so much more than that. And I think the new updates in 5.4 really emphasize that, just with the computer use agent capabilities, right? So right now, if you want to take advantage of that, you're not seeing that inside of ChatGPT. You know, I don't know if agent mode is going to. Hopefully, eventually it'll be updated with some of these, you know, the 5.4 model and some of those computer use capabilities. But right now, if you want to use that, it's either on the API or inside Codex, right? Playwright Interactive, I'll probably do an entire show on it. You know, it's new. So they just came out with Playwright Interactive, but they've had the Playwright CLI command line tool, right? But that's essentially a browser. People don't understand that, right? Like, yes, Codex has a browser that it can control. It can access your machine. So, I mean, at that point, it's much more than a coding tool. It is a co-worker, right? Too bad Claude got to that name first. Great name, by the way. But the new Playwright Interactive kind of plugin for Codex is great. So it pushes Codex into a more, I think, serious testing and debugging and execution workflow.
All right, takeaway number four, ChatGPT is coming for data and analyst roles. All right, so analyst roles, data roles, not data analysts, but that's one of them, too. Let's be honest. The original ChatGPT couldn't add, right? And even models a year ago couldn't edit a spreadsheet. All right? And you really had to know your way around prompt engineering to get it to save a spreadsheet. So the models now, by default, they're agentic, and they can do all those things. People don't know. You don't got to click a button. You can just be like, yo, GPT-5.4, go do all this research for me. Put it through my own personal, you know, lens via my memory and what you know about me, and go create spreadsheets and documents and PowerPoints. And it will literally do that. And they work and they're there, right? But they now also just released a dedicated Excel integration, right? So another thing that Claude, right, when Claude announced this, or Anthropic announced this, and you know, they've been coming out with these plugins and these skills. Yeah, Anthropic's been shipping great stuff, but it's been moving the markets, right? You've had some legacy, you know, per-seat software providers, you know, some legacy financial institutions that have seen their stocks crash right when Anthropic is releasing some of these plugins and, you know, some of these like their Excel integration. Okay, well, ChatGPT just did that. I'm not quite so sure about the timing of this. I don't know, if it was me, I would have saved that for maybe not the same day, because that's actually a big freaking deal that, again, no one's really talking about. So there is a dedicated app for the Excel integration with ChatGPT. That's huge, right, because the world runs on Excel. But they are also coming out with one soon for Google Sheets as well, which I'm looking forward to.
Also, Gemini has gotten so much better, right? Late 2025 and early 2026, like Gemini in Sheets is actually great, but I'm still going to use ChatGPT in Sheets as well. And I think what we're seeing here is it's shifting the GPT models, especially with GPT-5.4, it's shifting from like, oh, is this an AI tool or a junior researcher, right? Like that was kind of the, maybe with 5.1 and 5.2, maybe that's the conversation people were having, right? Like, oh, at what point is this AI tool a junior researcher? Now I think we're way past that. I think it's like, okay, is this a junior researcher or a senior researcher, or a junior analyst or a senior analyst, right? Another reason why I've been saying the consulting industry is going to get absolutely smashed because of this, right?
All right, takeaway five: thinking models are much more than chat, right? And I think with 5.4, I've only had a couple of hours, you know, to play with the model. I unfortunately didn't get early access like some people. So this is my takeaway from just a couple of hours. And it depends on what plan you're on. And let me explain that. If you're on the Pro plan, there's four tiers of thinking. If you're on the regular $20 a month plan, there's two tiers. If you're on the free plan, you can click the little light bulb icon, and I think you get like one of those a week or something like that, right? But for the most part, if you're on the paid plan, the base $20 a month plan, which I think is what most people are on, you know, there's two thinking levels, and I think that using those in the older models, right, 5.1, 5.2, it just felt like a smarter chat, right? 5.4, it feels much more than that, right? Now when you're using Thinking, it feels like a system, not a smarter chat. And there's a couple reasons for that. But it seems like OpenAI put a lot of emphasis on the Thinking versions, right? So using 5.1 and 5.2 Thinking versus 5.1 and 5.2 Pro, the gap was enormous, right?
I don't feel that gap is as big. I mean, we'll see as more and more benchmarks come out. But using, especially when you're on the Pro version and you have four different tiers of thinking, you know, using the heaviest thinking tier on the Pro. So not, okay, this is confusing, right? Not using GPT-5.4 Pro, but using the highest version of Thinking on the Pro plan, because it's an extra-higher version, right? It felt like the premier model, but it was a thinking model. So I think that gap between the Thinking tier and the Pro tier actually closed. And I'm not saying that means that GPT-5.4 Pro didn't get much better. It did. But the Thinking, again, I think the older versions were just smart chats. Now it feels like you are working with an agentic system. Couple updates on that. Number one, you know, OpenAI did specifically say that they made some improvements. So GPT-5.4 Thinking also now includes deep web research. So it seems like a specialized or a mini version of Deep Research, that kind of technology, which I think uses like dual models, that's available in the Thinking mode. So, you know, they've definitely put a little bit more technology just into the Thinking side, which is huge. Another update to Thinking is you can steer it, which is really nice. Before, that was something you could only do in the Pro version. So what that means, especially if you're, you know, giving it a lot of data and it's going to do a lot of research and a lot of tool calling, which again, we need to get more comfortable with this, non-technical people. All tool calling means is, oh, it's going to use Python, right? If you throw a lot of numbers at it, it's going to use Python and write some code, or it's going to, you know, use web search, right? That's all that means. You know, the harness and all that, you know, people use all the fancy words. I just try to simplify it for you, right? But it can take longer.
Now, the GPT-5.4 Thinking modes take longer because they have this new deep search built in. So it's actually nice that you can steer it. So if normally it might take 10, 15 minutes and you're like, ah, frick, I'm reading the chain of thought, I'm reading how it's thinking, and I see it's going in the wrong direction. Normally with the thinking models you'd have to just wait, or just say, all right, I'm going to click cancel, or there's an "answer now" button. Now you can steer it, which is really nice.
All right, takeaway number six. I was kind of having fun with this one, being a little cheeky, but I said 5.4 is SOTA CUA: state-of-the-art computer-using agent. Right. 5.4 pushes computer use agents into more complex cross-application workflows. So like I was saying earlier, OpenAI said this is their first model with state-of-the-art computer-using agent abilities built in by default. So I kind of view this like, if you remember back to GPT-4 and then you went to GPT-4o. Technically, GPT-4 used three different models to give you responses, and in GPT-4o, the "o", which stood for omni, meant it was all one model. So that's kind of what I'm seeing and feeling now with how OpenAI is starting to integrate computer use into their model. So yeah, unfortunately we may not actually get to realize that in the old chatgpt.com, maybe not until they update agent mode. Please, OpenAI, update agent mode. So much, right? Just so much potential there. It just needs some love, right? But by default, if we're talking about Codex, if we're talking about the API side, that's important. So you may not touch this directly, right? You may not touch their new built-in computer use agent. Like I said, they have a demo of it. You can go play with it on GitHub, download the repo, you can do it that way, or you can use it in Codex. But ultimately this is for builders.
But this is going to be so much of the technology that we all use, right? So that's what I'm excited for: all of a sudden, number one, this just made agents much smarter. But it also gave us non-technical humans, I think, the potential capabilities to direct agents or direct smarter agentic browsers at scale, because it is now going to be faster, better, and more token efficient. So, huge win there. Now, OpenAI again going after Anthropic's lunch money, right? OpenAI said, oh, okay, Anthropic, you want to come after our decision to run ads with a not super truthful Super Bowl ad? Ironic, right? Okay, we're going to come for your lunch money. So yeah, pretty impressive. All right. And speaking of state-of-the-art computer use, some of those benchmarks, yeah, just absolutely crushing it. And the noteworthy thing here, I think, is some of the jumps from 5.2 to 5.4, because like I said, hardly anyone used 5.3 because it was in Codex. Right? So now these are state of the art. These are topping the charts. So in GDPval, which I'll talk about here in a second. SWE-Bench Pro, right? That's how good the model is at fixing real software bugs. State of the art. OSWorld-Verified, that's how well the model can use a computer like a person would. State of the art. Toolathlon, that's how well the model uses tools correctly. Right? State of the art. BrowseComp, I think that might be one where they're like 0.1 points behind someone else, but essentially state of the art. That's how well the model finds and uses information on the web. Right. So essentially anything related to computer use or tool calling, I mean, OpenAI is just crazy, right? And at least in my opinion, when you're looking at the three-way race over the last year and a half, this is not something that OpenAI has been known for. So again, I think this is the bold new direction.
And then last but not least, and this is both takeaway seven and the one thing I teased for the end, the thing that I think most people are overlooking. I know you're going to get tired of this if you're an avid listener. I'm sorry, I'm going to talk about GDPval one more time. And here's the reason why. I hate benchmarks. Everyone cares about them, everyone talks about them, everyone in the labs. Guess what? In the real world, here in Chicago, Illinois, where I'm from, right, I think this is just a Silicon Valley thing. Maybe, I don't know. But that drives the narrative. Everyone's so concerned about these 50 benchmarks. So I've got to talk about them, because if I don't, I'm going to get 50 emails about them. But my seventh takeaway, my last one here, is most benchmarks don't matter anymore, but I think one matters even more. And there's a couple reasons for that, right? I think GDPval is the one that matters more. Number one, it's harder to game. I think most frontier labs are just benchmaxxing, right? All they're doing is playing the game, tweaking things just to get really good scores on certain benchmarks, right? And guess what? Benchmarks don't pay bills. Benchmarks don't help us do work better, right? In theory, A leads to B, B leads to C. But if the end goal is C, that's GDPval. That's the work getting done. Not just, oh, here's this random test, ARC-AGI, Humanity's Last Exam, right? All these tests that are set up that are more like, oh, here's a bunch of random things that are very hard that just aren't in training data, right? That's not how we should be judging models and how good they are. We need to be looking at how good they are at creating economically valuable, viable work on their own, which is what GDPval measures, right?
It measures how good a model is at creating deliverables in the same way a human would. I've talked about this a little bit, but let me just tell you how absolutely bonkers this is, right? And I'm actually going to go to my next slide first. All right, GPT-4o. So remember GPT-4o? I'm trying to do the math here. Yeah, ten months ago, nine or ten months ago, this was the best model, right? The best general-purpose model. No one used the o models. I did. I love the o models, right? o1, o3, no one used them. Everyone used GPT-4o. If you're looking at GDPval, stick with me here, right? It's a benchmark where the model has to go in and do real work like a human would. The whole thing, front to back. One task, but it's creating spreadsheets, it's doing multi-step research and creating something on the back end. And then it's judged by experts in the field, right? So this is across 44 different real-world jobs. And then it's judged by a panel of experts. So essentially the 50% mark, that is parity with an industry expert, right? And there's groups of experts that judge both the expert human who submits the deliverable and the AI model, right? So GPT-4o, which was the best model 10 months ago, got a 12% on that, right? Not very good. And even GPT-5 high, still not very good: 38%. Okay, so why is this the benchmark that matters? And why is GPT-5.4 such a huge step? Number one, I was not expecting this, right? Because GPT-5.2 had a 70% win-or-tie rate. So 70% of the time it either won or tied against the expert human, blindly judged by expert judges. In my 2026 AI prediction and roadmap series, I thought I was being kind of bold by saying, yeah, I think we'll get to 80%. Guess what? We already got there, because GPT-5.4 Pro got 82%.
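To make the win-or-tie number concrete, here is a minimal sketch of how a rate like that could be tallied from blind judge verdicts. It is an illustration only, not the benchmark's actual scoring code: the function name `win_or_tie_rate`, the `"win"`/`"tie"`/`"loss"` labels, and the 50 verdicts are all made up, with the counts chosen so the result matches the 82% figure quoted above.

```python
from collections import Counter


def win_or_tie_rate(outcomes):
    """Return the fraction of comparisons the model won or tied.

    `outcomes` is a list of hypothetical verdict labels, one per task,
    from the model's point of view: "win", "tie", or "loss".
    """
    if not outcomes:
        raise ValueError("need at least one judged comparison")
    counts = Counter(outcomes)
    return (counts["win"] + counts["tie"]) / len(outcomes)


# Made-up verdicts for 50 tasks (not real benchmark data).
outcomes = ["win"] * 30 + ["tie"] * 11 + ["loss"] * 9

rate = win_or_tie_rate(outcomes)
print(f"model wins or ties: {rate:.0%}")
print(f"expert wins outright: {1 - rate:.0%}")
```

Under this reading, a 50% rate means the model and the expert are at parity, and whatever is left over after the win-or-tie rate is the share of tasks where the human expert's deliverable was preferred outright.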
I've read this benchmark, the study that came along with it, multiple times. And the more I read about it and think about it and revisit it when new models come out, to me it just speaks to the sheer gap. It's a knowledge gap, it's an educational gap. And I'm going to end with this, which I know is a weird way to end a recap show about GPT-5.4. But I do think this ties into what is OpenAI's bold new direction. It is getting work done, right? And that's what these models, and that's what GPT, sorry, GDPval, the benchmark, shows. 82% of the time this new model wins or ties against a human. That's wild. And if that doesn't change, right? So think. If you've been in an industry 10 or 15 years, you're an expert, and you sit down and you have a project, right? You have to do some research, you have to use your smart brain, and then you have to create something of value, right? A document, a spreadsheet, a PowerPoint presentation, and then a group of people are going to judge it. You only have an 18% chance to beat GPT-5.4 Pro. Is that not wild? Everyone's like, oh, these AI models are so dumb. That's what I'm saying. Where we've come in the past year is not normal, right? Going from, a year ago, models couldn't edit spreadsheets, to now, they're better than almost all experts. And I think the GPT-5.4 model maybe, just maybe, might be the model, at least from OpenAI, that starts that conversation and moves it away from "ChatGPT is a chatbot" to "oh, ChatGPT is the place where work gets done." All right, I hope this show was helpful. If it was, let me know. Should we do a more hands-on version of this on Wednesdays? Right? On Wednesdays we do our AI at Work on Wednesdays. So let me know if you actually want to go under the hood with GPT-5.4 and test some of these things out. I've been having fun in the little time I've had testing so far, so I hope this was helpful.
Make sure, if you haven't already, to go listen to our 2026 AI prediction and roadmap series. That's episodes 712 and 713. I get a lot of people always writing in, asking questions, emails every day, asking me questions that would take me a long time to answer. And I feel like a jerk sometimes, but I'm like, go listen to this episode, right? I cover so much in there. If you listen to that and then go read the newsletters that come along with it, I guarantee you're going to be the smartest person in AI in your company, right? For the most part. Unless you're working at Google or OpenAI. Right? And then when you're done doing that, make sure you go to youreverydayai.com and sign up for the free daily newsletter. Thanks for tuning in. Hope to see you back later for more Everyday AI. Thanks, y'all.
[39:31]
A
And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going for a little more AI magic. Visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.