
Everyday AI Podcast – Ep 731

GPT-5.4 Hands-On Review: 5 Reasons Why it Will Be the Best AI Model You’ve Ever Used
Host: Jordan Wilson
Date: March 11, 2026

Episode Overview

In this episode, Jordan Wilson offers a hands-on, deeply practical review of OpenAI’s GPT-5.4, arguing it sets a new “daily driver” standard for AI models. Drawing on thousands of hours spent testing AI, Jordan breaks down the five core reasons why GPT-5.4 is the most usable, intelligent, transparent, and instruction-following AI model yet. The episode is aimed at everyday professionals and business leaders who want real, tested insights to improve their work through AI.

Key Discussion Points & Insights

The Model-Harness-Tool Blur (01:23)

Main Theme: Model versions now also deliver harness and tool upgrades, not just changes to the base intelligence; these combined advances explain why GPT-5.4 is uniquely powerful.
- “With releases in 2026, the line between model and harness and tool use start to blur... model updates like OpenAI’s impressive upgrade to GPT-5.4 also bring with it major changes to the harness and the tools, which completely changes what an AI model can actually accomplish.” – Jordan (01:55)

The Five Reasons GPT-5.4 Is Game-Changing

1. Interruptible Thinking Mode (07:16)

What is it? The user can interrupt a long “thinking” process to make corrections without restarting the task.
Why it matters:
- Previously, users would have to abandon or accept a flawed answer after waiting for a model to finish. Now, corrections can be made on the fly, saving time and frustration, especially for long and complex queries.
- Previously exclusive to the Pro plan, now available for all paid tiers.
Notable Quote:
- "Some of the thinking models took like 35-plus minutes, right? But if you see something going wrong, you can course correct it.... right now, ChatGPT is the only major model maker to offer this." (10:31)

2. Skills Integration (15:36)

What are Skills? Modular abilities or capabilities, similar to plugins or app add-ons.
Impact:
- Previously, Skills were only available in Codex (coding tool). Now, GPT-5.4 can access Skills, at least for business or enterprise plans.
- This leap means daily tasks can be automated and extended, previously a Claude-specific advantage.
Quote:
- "But now that you can pair Skills with GPT-5.4, that's pretty big." (17:51)

3. Best-in-Class Web Browsing (“Browse Comp”) (20:29)

Browse Comp:
- A benchmark that measures a model's ability to perform persistent, multi-step web browsing and find obscure, hard-to-verify information—crucial for up-to-date business needs.
Upgrades:
- GPT-5.4 markedly outperforms previous models:
  - GPT-5.2 scored 65%,
  - GPT-5.4 scored 82% (regular) and 89% (Pro).
- “Few percentage points can actually be felt,” translating into tangible user improvements.
Quote:
- "OpenAI is the now world leader in Browse Comp and it's going to be a noticeable jump." (24:14)
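To put the quoted jump in perspective, the short sketch below works through the arithmetic on the three scores cited in the episode. The helper names are made up for illustration, and the "fewer misses" figure assumes the score is simply the share of benchmark questions answered correctly, which the episode does not spell out.

```python
# Browse Comp scores as quoted in the episode (percent).
scores = {"GPT-5.2": 65, "GPT-5.4": 82, "GPT-5.4 Pro": 89}


def point_gain(old: float, new: float) -> float:
    """Absolute gain in percentage points."""
    return new - old


def relative_gain(old: float, new: float) -> float:
    """Gain expressed as a percentage of the old score."""
    return (new - old) / old * 100


def miss_reduction(old: float, new: float) -> float:
    """Share of previously missed questions now answered, in percent.

    Assumes the score is the fraction of questions answered correctly.
    """
    return (new - old) / (100 - old) * 100


base = scores["GPT-5.2"]
for name in ("GPT-5.4", "GPT-5.4 Pro"):
    new = scores[name]
    print(
        f"{name}: +{point_gain(base, new)} points, "
        f"{relative_gain(base, new):.1f}% relative gain, "
        f"{miss_reduction(base, new):.1f}% fewer misses"
    )
```

Read this way, the move from 65% to 82% is 17 points, but it also means roughly half of the questions the older model missed are now answered, which is one plausible reason a benchmark gain like this could be felt in daily use.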

4. Phenomenal Instruction Following (29:47)

Why this is major:
- Even the lower-priced tiers (like ChatGPT Plus) have closed much of the “quality gap” with Pro, now matching the higher “reasoning effort” for tough tasks—making advanced AI much more accessible.
- You may get “that $200-a-month value out of a $20-a-month plan.”
Quote:
- “Instruction following right now on the higher thinking models is otherworldly... Even just with the thinking models... It's closed the gap." (29:59)

5. Unmatched Naturalness and Transparency (34:14)

Key attributes:
- No more “out-of-the-box glaring weaknesses” in intelligence, transparency, or chat ability.
- It’s finally a model that feels natural to converse with AND shows its reasoning steps—the “transparency of intelligence”—which is crucial for trust in business settings.
- Competing models like Gemini 3.1 Pro don’t yet match this level of transparency.
Quote:
- "For the first time, I think maybe ever, you have a model that is transparently intelligent, off the charts... But number two, you can actually talk to it." (37:17)

Hands-On Demo: Podcast Analytics Deep Dive (41:50–1:05:40)

Use Case: Jordan analyzes 20,000+ data points from his own podcast’s Spotify stats, using identical, complex prompts for GPT-5.4 and top models from Anthropic (Claude) and Google (Gemini).

Prompt Structure & Task

Analyze a mountain of stats (plays, retention, completion %, guest vs. solo, etc.)
Categorize uncategorized data
Extract trends, compare show types, plan future episode types
Output: Provide detailed insights and an interactive dashboard
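For listeners who want to sanity-check a model's answer on a task like this, the guest-versus-solo comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`format`, `plays`, `completion_pct`) and the four sample rows are placeholders invented here, not Jordan's actual export or data.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows: invented values, not real podcast stats.
episodes = [
    {"title": "Ep A", "format": "guest", "plays": 1200, "completion_pct": 48.0},
    {"title": "Ep B", "format": "guest", "plays": 800, "completion_pct": 52.0},
    {"title": "Ep C", "format": "solo", "plays": 1500, "completion_pct": 60.0},
    {"title": "Ep D", "format": "solo", "plays": 1700, "completion_pct": 64.0},
]


def summarize_by(rows, key, metrics):
    """Group rows by `key` and average each metric within a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        group: {m: round(mean(r[m] for r in members), 1) for m in metrics}
        for group, members in groups.items()
    }


summary = summarize_by(episodes, "format", ["plays", "completion_pct"])
for show_type, stats in summary.items():
    print(show_type, stats)
```

Running the same grouping on a real export gives a quick baseline to compare against whatever trends a model reports.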

Results

GPT-5.4 (Heavy Thinking Mode):
- Analysis time: 39 minutes, 47 seconds.
- Picked up nuanced context on its own (e.g., discovered that “discovery” data only covers the last 30 days by checking Spotify documentation—something no other model did).
- Output quality: "This wouldn't have been possible before." (1:02:52)
- Instruction following: Gave exactly as many trends/comparisons as requested, categorized shows, and used accurate web research.
- Dashboard: Functional but plain (“front end design not any good”).
Claude (Opus 4.6, Extended):
- Faster (about 4 minutes), but the output was “not good—trash compared to GPT-5.4 thinking.”
- Missed key context, failed instruction following (“asked for 10 things, gave 3 or 5”), and didn’t check external sources (ignored file instructions).

Comparative Takeaway

“You can't just look at time, you have to look at output.” (1:01:50)
“For general knowledge, work hard tasks, Anthropic has never been the top model, period.” (1:04:19)
“Maybe for many people, GPT-5.4 thinking will get you there [as a daily driver]... I've never [before] said... this can be a go-to daily driver model.” (1:05:02)

Notable Quotes & Moments

On model confusion:
- "One of the biggest downfalls of ChatGPT is people are using the bad model... the overwhelming majority, I’m talking hundreds of millions of users worldwide, don’t know the difference." (09:55)
On pushing back against critics:
- "I get accused of being a fanboy for everyone, but also against everyone. Doesn’t make sense." (44:39)
On daily driver readiness:
- "It’s the first one that I feel confident... you can go back and listen, I’ve never, in 700-plus episodes, I’ve never said... this can be a go-to daily driver model." (1:05:02)
On how to move forward:
- "Number one, go test it for yourself. Number two, refine and reiterate. And number three, get to work. Get ahead of your competitors." (1:09:35)

Important Timestamps

| Time | Segment/Topic |
|---------|----------------------------------------------|
| 00:15 | Introduction to GPT-5.4’s impact |
| 07:16 | Reason 1: Interruptible Thinking Mode |
| 15:36 | Reason 2: Skills Integration |
| 20:29 | Reason 3: Browse Comp Web Intelligence |
| 29:47 | Reason 4: Instruction Following |
| 34:14 | Reason 5: Intelligence & Transparency |
| 41:50 | Hands-on Demo Begins |
| 58:10 | Comparative model performance breakdown |
| 1:05:02 | Use case reflections & final recommendations |
| 1:09:35 | Action steps for listeners |

Key Takeaways for Listeners

GPT-5.4 bridges the gap between “experimental” and reliable, professional-grade daily use.
Advancements in thinking mode and instruction following mean even the base paid tier is now highly capable.
Real diagnostics and hands-on experiments matter—test with your own hard, multi-step data to truly see model differences.
Transparency and nuanced intelligence—not just speed or surface accuracy—are now the defining traits for business-grade AI.

Actionable Insights

Test GPT-5.4 with your real, multi-step tasks—don’t just rely on benchmarks.
Upgrade to business or Pro plans if Skills and “thinking” depth matter for your workflows.
Prioritize transparency in LLMs—models that document and show their reasoning give more trust and utility.
Keep up with real user hands-on tests (not just marketing benchmarks) to stay ahead.

Memorable Closing

“I hope this was helpful... now you know, in my thousands of hours of experience, the five reasons why it'll be the best model you've ever used in GPT-5.4—at least today. Because who knows, maybe tomorrow, this could all change. But hey, you know what that means. Right now, you’re at an advantage.” (1:09:05)

For full details and side-by-side visual comparisons, sign up for the show’s newsletter at youreverydayai.com.


Transcript

[00:01]
A
This is the Everyday AI show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business and everyday life.
[00:16]
B
OpenAI's newly released GPT-5.4 model crossed a new threshold for me, and I spend thousands of hours each year evaluating and stress testing models, so that says a lot. But GPT-5.4 is the first AI model that I think hits the full usability trifecta to be a true daily driver model without any compromise. It's natural enough to chat with, number one. Number two, it's legit off the charts in terms of general intelligence and transparency. And three, it follows instructions to a T, no matter how daunting or challenging or long the task is that you throw at it. And I think there's been a lot of talk lately about how the models are becoming less and less important as the harness and tool use become more and more important in where the moat is actually at. And to a certain point I do agree with that. But with releases in 2026, the line between model and harness and tool use start to blur. It's because the model updates used to bring only updates in the underlying intelligence engine. Not anymore. Now model updates like OpenAI's impressive upgrade to GPT-5.4 also bring with it major changes to the harness and the tools, which completely changes what an AI model can actually accomplish. And with this round of updates to GPT-5.4, I think OpenAI knocked it out of the park. So today we're putting AI to work on Wednesdays as we go under the hood a bit with GPT-5.4, as we go hands-on. And I will also break down the five reasons why I'm confident GPT-5.4 will be the best model you've ever used yet. All right, I'm looking forward to this one. So on today's show, if you stick with me, here's what we're going to go over. You'll hear what's new and noteworthy in OpenAI's newest model, GPT-5.4. You'll learn why you may be able to get the benefits of the $200-a-month Pro plan without even really paying for it. You'll know the reasons why the model is best to be your daily driver, right now at least, and you'll leave with the five reasons why it'll be the best model you've ever used.
All right, let's get into it, shall we? If you're new here, welcome. My name's Jordan. This is Everyday AI, and, well, this thing, it's for you: unedited, unscripted, just bringing you the realest information and intelligence in artificial intelligence and hopefully giving you the tools to grow your company and your career. So if you're along on the journey, awesome, it starts here. But to take it to the next level, make sure you go to our website and go sign up for our daily newsletter. We're also going to be recapping today's show. So if you are brand new here, on Wednesdays we do Putting AI to Work on Wednesdays. So it's usually a more hands-on and practical use case of, you know, usually one of the big four, right, between Microsoft, Google, OpenAI and Anthropic. We really like to go hands-on on Wednesdays. But if you want to know more of the details and the benchmarks of OpenAI's new model, we did cover that right after its release on Friday. So you can go click the back button a couple of times, if you're listening on the podcast, to episode 728, where we go over more of the release and the seven trends that I think you need to know about OpenAI's new model, which are so much more than just benchmarks. All right, let's get into this, and I'm not going to make you wait any longer for the five reasons. So number one, interrupting thinking mode. All right, that's one of the reasons why I think this is going to be the best model that you'll ever use. And you might be wondering, like, well, number one, what is that? And, well, why does it matter? Okay, so the big four, well, in this case the big three model makers, and you know, OpenAI uses Anthropic's models and they use OpenAI's models. When you use a thinking model it can take a terribly long time, and I think sometimes people don't use thinking models for that very reason.
They're like, well, I need an answer right away, or hey, if I realize that I forgot to say something or forgot to do something, I don't want to have to wait 5, 10, 50 minutes for a model to finish its thought process and to give me the answer. So this is a new feature for GPT-5.4 for the masses, because this was actually available on the previous Pro tier, but now it's available to anyone that's on a paid tier, and I should probably start with that. Although we did go over that in Friday's episode. But I should let you know, right, to use the thinking model, the new GPT-5.4 Thinking or GPT-5.4 Pro, you do have to be on a paid plan, whether that's the $20-a-month Plus plan, the $200-a-month Pro plan, the business plans, Enterprise, Edu, etc., right? So if you're wondering, like, where is 5.4? Well, that's where it is. It's on the paid plan. And actually another small thing here that I'm just realizing kind of now, even though I've been talking about GPT-5.4 quite a bit already: maybe it's good that they didn't come out with the free or the instant version of 5.4, and maybe that was intentional. And hey, OpenAI folks, I know there's a few of you listening. If this was not intentional, go ahead and take this and say it was. I think one of the biggest downfalls of ChatGPT is people are using the bad model. There's always a bad model in any one that you use, whether you're using Claude, Microsoft, Gemini, ChatGPT. Right? So previously, seven days ago, right, before there was 5.4 or even 5.3 Instant, I'm going to get to that here in a second, we were living in a GPT-5.2 world. So when everything was GPT-5.2, well, people just thought, well, I'm using the best model. Well, no, because if you were using the instant model, which is the chat model, it doesn't think and it's not really good, right? So essentially last week OpenAI came out with GPT-5.3 Instant, which was kind of confusing because then the next day they came out with GPT-5.4 Thinking and 5.4 Pro.
So with all the model confusingness that's going on, maybe it's actually a good thing because someone knows, well, if you want the best, you should be using GPT-5.4. And at least right now there's not a bad version of it. Whereas generally there's always been a quote unquote bad version of the best model, and the overwhelming majority, I am talking hundreds of millions of users worldwide, don't know the difference. So maybe this is actually a great thing. Anyways, getting back to the number one reason, interrupting thinking mode, and you'll see in some of these examples here, I already did them, but we're going to go live under the hood because I'm not going to make you wait. Some of the thinking models, you know, took like 35-plus minutes, right? But if you see something going wrong, you can course correct it. Unfortunately you can't upload files or use different tools during that course correction. But right now, ChatGPT is the only major model maker to offer this, right? So if you are using, as an example, Claude 4.6 Opus, if you're using Gemini 3.1 Pro and you're using the thinking, which you should for most tasks, and you see something's going wrong and you're like, oh, crap, forgot to do something. You could be five, ten minutes in, right? And you either have to scrap it or you have to accept a subpar answer. And that kind of stinks, right? I've been using the thinking interrupting because I've been on the Pro plan for a while on ChatGPT, actually, since it came out. So I'm kind of used to this on the Pro level. But it's really nice to get this on the thinking level because this is where the masses are, and this is where you should be, is using the thinking models. All right, reason number two, it can access Skills now. Yeah. You didn't know this probably because OpenAI didn't even announce this. I don't even think there was a tweet out. I think they just updated a blog post. So Skills was actually a big advantage for Claude.
Anthropic kind of created and popularized Skills, and now they're really used across the industry. But up until, well, a couple hours ago, Skills was only available in Codex, which is ChatGPT's coding tool. Although I think it's way, way better. Sorry, FYI, I think it's way better than Claude Code and Claude Cowork combined. Even though I use all three. Codex is a nice in-between. It's like a blending of the two. Anyways, you could use Skills inside of Codex, which is ChatGPT's desktop app, or their command line interface tool, more of their coding tool. But they kind of slid in these Skills under the radar. But unfortunately they're only available on business or enterprise plans right now. But to be able to pair up Skills with GPT-5.4 is huge because again, up until a few hours ago, I still would say that was one of the reasons why I was still using Claude a lot more for some of my quote unquote daily driving. I think Skills and that framework, which we'll probably dive into a little deeper on a future Start Here series. So if you haven't been listening to our Start Here series, it's how to go from, you know, 0 to 10 or at least 0 to 5, you know, in understanding AI. Skills are great. They're a little different than GPTs. They're different than projects. Right. There's a great and flexible utility to Skills. But now that you can pair Skills with GPT-5.4, that's pretty big. All right, number three, a benchmark that users will actually feel is going to make the difference with GPT-5.4. And that's Browse Comp. Okay, so Browse Comp, if you've never heard of it, it evaluates an AI agent's ability to find obscure, hard-to-verify information through persistent multi-step web browsing. AI moves too fast to follow, but you're expected to keep up. Otherwise your career or company might lag behind while AI-native competitors leap ahead. But you don't have 10 hours a day to understand it all. That's what I do for you.
But after 700 plus episodes of everyday AI, the most common questions I get is where do I start? That's why we created the Start Here series, an ongoing podcast series of more than a dozen episodes you can listen to in order. It covers the AI basics for beginners and sharpens the skills of AI champions pushing their companies forward. In the ongoing series, we explain complex trends in simple language that you can turn into action. There's three ways to jump in. Number one, go scroll back to the first one in episode 691. Number two, tap the link in your show notes at any time for the Start Here series. Or you can just go to start here series.com, which also gives you free access to our inner circle community where you can connect with other business leaders doing the same. The Start Here series will slow down the pace of AI so you can get ahead. Why is this incredibly important in a large language model? Well, for a lot of reasons, but one that I think sometimes gets overlooked and you know, I talk about it a lot on the show, but with 700 plus episodes, it's worth mentioning again, for the most part, when you're using today's large language models, even the latest ones, they're working with a very old knowledge cutoff, right? Usually, you know, it might be six or so months, all right? But that's just the absolute best case scenario because many Frontier Labs are using offline data sets and the data in those data sets might be two years old, right? So the ability to browse the web accurately and to follow your instructions while browsing the web is not just a nice to have, it is an absolute necessity. Because if you are using the outputs of a large language model for business purposes, which is like all of us, everything changes, right? Unless you're writing a history paper or you're using this right to just, I Don't know, do something about ancient, I don't know, ancient history. Right. But everything else changes. 
Even if you're using this to, you know, market your business and maybe your industry is a slow-moving industry. Well, marketing is changing daily, right? So Browse Comp is huge, and OpenAI is the now world leader in Browse Comp and it's going to be a noticeable jump. So although they're only a few percentage points now ahead of Google and Anthropic, I think those few percentage points can actually be felt. It is actually a huge jump from where they were with the last models, which is GPT-5.2. So if we look at the normal, just thinking versions, GPT-5.2 was a 65% on Browse Comp and GPT-5.4 is an 82%. So a huge jump. Right. And then you have GPT-5.4 Pro at 89%. Right. And that is, even though it's about, you know, 4 or 5 percentage points above Anthropic and Google's offerings there, you can tell, and I will have a very small secluded example of that as we go live here. Also worth noting on Browse Comp. Yeah, Anthropic essentially fessed up, which, good on them, right? I do think Anthropic is really good at, when they find issues, they say it. Well, they said even their score on Browse Comp, they figured out that Claude was cheating. So they said on their website, they said evaluating Opus 4.6 on Browse Comp we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments. So yeah, essentially while doing their evaluation for this they found that, oh, Opus realized it was being evaluated and kind of decided to cheat. Anyways, that's something that is actually going to be felt, Browse Comp, because of how important it is, and most people don't understand the amount of agentic browsing that is needed for everyday intelligence and for everyday business use. You need it constantly and you need it to be really good and you need it to, number four, our number four point. Oh, look at that. Three and four go together. Instruction following.
The instruction following right now on the higher thinking models is otherworldly. Okay, and let me talk about this. And this is also the little secret there that I teased in the beginning, how you might be able to get that $200-a-month value out of a $20-a-month plan. So on the base ChatGPT Plus plan you do not get GPT-5.4 Pro, which is a bummer. All right. Hey, little secret. If you just get on the business plan, which is $30 a month, minimum two seats, you get a couple Pro queries and it's worth it even if you're the only one using it and you have two seats, FYI. So what I found through my testing, which I was kind of surprised by, aside from the instruction following, which is outstanding and that will probably make more sense showing you live, but even just with the thinking models, that's why I say instruction following on higher thinking models is otherworldly, right? So if you are on the ChatGPT Plus plan, you get two kind of levels of thinking. If you're on the Pro plan, there's four levels of thinking on the thinking models, right? The ChatGPT Plus, the higher level of thinking, I found it to be, I wouldn't say the results were comparable, but it put in the same amount of reasoning effort as the Pro plan, as GPT-5.4 Pro, on a lot of my internal testing. And it actually in many cases took longer to think and took more steps. Again, the output wasn't always better or even the same, but it was comparable. And that's really important to point out, right? I think it's pretty well known across the industry. I don't care if you're a fanboy of OpenAI, Microsoft, Anthropic, Google, it doesn't matter. I think most people know and have understood for a long time, if you need something right, and getting it accurate and correct is of utmost importance, you always, well, up until this past week you would go to GPT-5.2 Pro. Now you go to GPT-5.4 Pro, right? But the problem is it's extremely slow, right? And, well, it is expensive if you're using it.
Well, even if you're using it in ChatGPT, the $200-a-month plan, that's kind of expensive. If you're using it in the API, it's ungodly expensive. But the higher level of thinking now in the base ChatGPT Plus plan for $20 a month, again, they're not on the same level, but it closed the gap, right? There were so many things previously that I would always just use Pro for and I would never think of using GPT-5.2 Thinking for so many tasks. Now I don't think twice about it. Even if I just have those two tiers of thinking, the higher tier of thinking is so much better than it was before. It's closed the gap. I think, if nothing else, it's gonna maybe just allow people to use thinking way more and to use Pro way less. Right. Maybe it'll end up saving, you know, OpenAI some money in the long run. And then number five, it is the most natural, generally intelligent chatbot that I've ever used. And I think for the first time, maybe ever, I felt a model didn't have these out-of-the-box glaring weaknesses in either intelligence, in transparency in that intelligence, or chatting. So here's what I mean. I think Gemini 3.1 Pro is on the same level as GPT-5.4, the thinking and the Pro level. The problem is, with Gemini 3.1 Pro, you don't have the transparency of intelligence. Right. Which is important because, let's be honest, humans, when was the last time that you used any of these models for an entire day? Right? And you look at the answer and you're like, yeah, I knew that. I feel confident in this. No. Right. It's so important to be able to look at the chain of thought, the summarized chain of thought, and to be able to transparently see where these models are getting the information. So unfortunately, right now, Gemini does not provide all of that information in the same way that OpenAI and Anthropic do. Right. I know they're changing it.
I've chatted with them, they've said as much, that they're eventually going to be bringing a little bit more transparency to the chain of thought. I know there's problems with competition, distillation, all those things. So, things I don't understand that they have to protect. But in terms of business use, right, it's one of the main reasons why I love and have loved for so long the GPT-5 Thinkings, and even going back to o3 and o1. The chain of thought not just shows, I think, that OpenAI's models are better, they provide more transparency and you can understand them more, and they're just way better at instruction following. So that's on the one side, it's just generally intelligent in a transparent way, and then on the other side, you can actually talk to it. Let me be honest, I don't really care about having a pleasant conversation with a chatbot. And I do know that OpenAI, you know, has some newer settings that you can default its voice to just be concise and, you know, even just changing those, the default response does it for me. But I know for a lot of people that don't go in and do that, and, I think historically OpenAI's models have been overly sycophantic, they've been verbose. And, you know, even OpenAI admitted that they've been cringe. Right. So for the first time, I think maybe ever, you have a model that is transparently intelligent, off the charts, number one. But number two, you can actually talk to it. Right? Not in a, you know, you're-my-hype-man kind of way. Right. But man, I mean, honestly, I spend way too much time, quote unquote, chatting with large language models. I'm really just directing them agentically. Right. But I get so tired of the way they respond and I'm like, I can't even read this. It's, you know, so cringe, so sycophantic, so verbose, whatever, right? And I think for the, we're not getting that anymore. It hits the sweet spot. All right, so before we go live, let's jump in, and livestream audience.
Thanks for sticking around. This is going to be a little shorter on the live end just because, you know, we can't really, like, watch a 20-, 30-minute prompt. That'll be super boring. But we are going to go under the hood. But a couple of things to keep in mind. And I'm going to go ahead and I'm going to call out the people that are calling me out. All right? I get accused a lot, it's actually strange, because I get accused of pumping, you know, OpenAI. But then I get accused of pumping Anthropic. But then I get accused of, you know, pumping Google. Right? I get accused of being a fanboy for everyone, but also against everyone. Doesn't make sense. But overwhelmingly, I think it's fair to say I am more preferable to OpenAI and Google than I am to Anthropic. So let me just speak to those people, because I get a couple messages every single week, people, you know, accusing me: oh, Jordan, you don't know what you're doing. You don't know what you're talking about. You're clearly an idiot. Anthropic's great. Okay, let me just, let me tell you this. When I'm doing these demos, when I'm doing these shows, right? Not just our Putting AI to Work on Wednesdays shows, but just the 700-plus episodes. I am speaking to the general business leader, right? The C-suite exec. Sometimes that person is a technical person, sometimes they're not. Right? My approach has always been about using AI to automate tough general tasks, right? And hey, the reality, and I went over this on the show Friday, so go listen to that, yes, Anthropic has generally always held a sizable advantage, you know, software engineering, agent orchestration, computer use. Well, not anymore, right? OpenAI with this model, they actually took their lunch money on that. So, yes, up until last week, I do think Anthropic had some huge advantages. And y'all, I kid you not, I'm using billions with a B, billions of tokens between Claude Code, Claude Cowork, Codex, Antigravity.
Like, I have max subscriptions to everything and I hit my rates constantly. Right? Billions of tokens. So I know what I'm doing. I know what I'm talking about. All right? Just, I'm putting that out there. Is Claude great? Absolutely. And it's great for certain tasks. Is Gemini great? Absolutely. It's great for certain tasks. But I do think with this one, this is the first one that I feel confident, you can go back and listen, I've never, in 700-plus episodes, I've never said, hey, I think this can be a go-to daily driver model. Because I think for the most part, it's better to be jumping around in multiple models. But I do feel maybe for many people, 5.4 Thinking will get you there. So when I go through these demos and when I kind of show you under the hood here, I want you to think of a multi-step tough problem. What is that multi-step tough problem that you have? All right? And then I encourage you, do that same thing. It's got to be tough. Do that exact same thing in GPT-5.4 Thinking. If you have access to GPT-5.4 high, do it there. Do it in Opus 4.6 with extra reasoning and then do it in Gemini 3.1 Pro. Do it yourself. It's gotta tackle the entire gauntlet, right? Data analysis, web research, reasoning, tool use, instruction following, common sense. That's what I'm going to show you now. And I'm telling you people, everyone that gets on my back about, oh, Jordan, you know, you're too hard on Claude. You clearly don't use it. Yes, I do. Right. I've been a max subscriber for the longest time. I bounced between the, you know, the 20 and the 100 and the 200. But I've been a subscriber to Claude since it came out, and I've been on up to the highest tier, so FYI. All right, let's look live. So here's what we're going to do, y'all. We have a lot of stats here. All right, these are my podcast stats. Yes, we're going to do another thing looking into my podcast, right?
I'm not going to jump in and pretend to show you financial analysis on something that's not my background, right? That's not what I'm using it for. I'm doing my use case. Think of yours when I walk you through mine. All right, But I've actually had. This was put together. This version was put together by Codex and then it was enriched by Claude code. But essentially I have a mountain of data from my podcast stats. So I've done this before in the past, but this version's a little better. So there's certain things that I can get out of my provider, which is called Buzzsprout, but not what I really need to make educated decisions. So my problem, I have 700 plus episodes and I can't easily export all of the data. I need to make good decisions on what type of episodes I should be doing more of and which types I should be doing less of. So I did have, using both Claude code and Codex, I put together this better look at my stats. So there are more than 20,000 data points in here. So yeah, I had to grab a lot of this with an agent with some APIs, had to do a lot on it. But essentially I'm able to get the episode titles, normal stuff that I would normally get episode length, but I'm also able to get if it's a guest or solo, the number of plays, consumption hours, retention across quartiles, which is super important. Completion percentage, consumption hours, discovery people reached. Right? All of these different metrics that are not usually available when I just go click export on my stats. Problem is this is a ton. This is. It's a ton of information. And all of these shows are not also categorized. All right, so that's another thing I'm going to be telling the models to do. All right, so let's jump in and I'm going to go ahead and read. Let's go here. I'm going to go ahead and read the prompt that I sent and then I'm going to kind of go through the responses here. So this is using GPT54 on heavy thinking. 
So again, if you're on the Pro plan, you have light, standard, extended and heavy. If you are on the $20 a month plan, you have standard and extended. So I uploaded the file and I said, "This is a comprehensive list of stats from my Spotify analytics for my podcast, Everyday AI. Please take your time analyzing all the data. Keep in mind everything you know about me and Everyday AI in personalizing your replies, including strengths, growth areas, known bottlenecks and constraints. Your responses to most of the below should take into account growing the podcast, saving time without sacrificing quality, and improving new audience discovery, retention, stickiness, consumption, downloads, etc." All right. "Use every tool at your disposal," right? I'm trying not to read all of this because it's a lot. And then I'm saying avoid... let's just say I do want to make sure I include one of these parts. Okay. "When and if needed, you can access complete transcripts on my website at youreverydayai.com." So then I said, "After carefully and meticulously analyzing the data, please reply back with," and then I have five different categories, and in each of these categories I do have four to five different things that I'm asking. So for category one, it's essentially obvious trends, and I'm, you know, asking for it to give it to me in three different ways. 20 obvious trends. Category two, under-the-radar trends. I'm asking for 20 under-the-radar but meaningful statistical trends, three different ways. Then I'm doing comparisons. All right, so about, looks like six, no, five different ways. So, you know, as an example, the 10 types or categories of shows that are the most popular and why. So just so you know, you weren't able to fully see and read my spreadsheet, especially if you are on the podcast only, right? You can always watch the video version. Yes, there is a video version.
You go watch it on our website at youreverydayai.com. So my spreadsheet didn't have categories, right? I have another version with categories. I honestly lost it. It's like buried, I don't know, in Cowork or Codex somewhere. So I am also having it categorize it. So it's not just reading the data, it's having to crunch the numbers and it's also having to think, right, hey, according to this show, what is it? What category is this? What does this mean? All right, and then the fourth category is March 2026 planning. So I'm essentially saying, hey, based on all this data, what works and what doesn't? Go research trends. See what I haven't covered. That's important. I'm asking what I haven't covered, but I should. All right. And then last but not least, I say, you're in charge, right? If I wanted to double my audience this year, what are the different things I should be doing? And then, you know, asking that in three different ways. All right, so that's essentially what I asked. And then I did do this in heavy thinking mode. I did this in GPT-5.4 Pro. And then last but not least, just for fun, I also did it in Opus, just to have a baseline. Right. And just because, I don't know, I think I need a demonstration I can just send to people that all the time are telling me I don't know what I'm talking about, because Anthropic's so much better. No, I know what I'm talking about, y'all. Will it be next week? Maybe. Today? It's not, and it hasn't been for a very long time. For general knowledge work, hard tasks, Anthropic has never been the top model, period. Right. Look at the benchmarks all you want. It hasn't been. So here's where we're going to start to dig in a little bit and why I'm going to start kind of referencing back some of those five big points that I talked about earlier. So, oh, at the very end, I did say to ChatGPT...
The only difference in these prompts: at the very end, I told ChatGPT, I said, use Canvas mode to put together a sleek, interactive and useful dashboard that includes all of this information. And then for Anthropic's Claude, I said the same thing, but I said using Artifacts, right? Because they don't have a Canvas mode, it's called Artifacts. So that's the only difference. Otherwise, everything was the exact same. Okay, and then also, with GPT-5.4 Pro, you cannot use Canvas. So there was no dashboard. So both models completed the task. The quality and the nuance, completely different. All right. And I will start to show a couple of the things. So let's scroll down here a little bit. All right, so big, big, big differentiator right here. Right? It thought for 39 minutes and 47 seconds. All right. There's no timer on Anthropic's Claude, which they used to have, and I don't know why they don't anymore, but it was about four minutes. All right, so you can look at that in a good way or a bad way. Well, I'll tell you, spoiler alert. Claude's version was, I won't say trash, but compared to GPT-5.4 Thinking, Claude's version was trash. Right. Exact same prompt, exact same data, memory, chat history, all the same. Right. I upload those in markdown files. They have the same thing. It was bad. It was really bad. So you can't just look at time, you have to look at output. And I'm going to show you a couple of things, but first we have to be able to see here what GPT-5.4 Thinking did. And again, this is not one of those... I think I should have ran it on 5.2. Maybe I'll rerun this on 5.2 and put it in the newsletter. I'm guessing it probably would have only taken 20 or so minutes and it would have picked up on nearly half of the nuance that it did in this case. Right. And one of the biggest things off the bat that I didn't even tell it to: it says, "Spotify says discovery data is a last 30 days view and can take up to 48 hours to refresh."
Right before. And this is in the very first paragraph, right? Because if you look at the chain of thought, you'll see one of the first external websites it goes to. It might have pulled it up in the API. I'll have to look later. But it instantly looked at Spotify because I told it, these are my Spotify stats. Guess what Claude did not do? Well, it didn't look at that. And you might be saying, okay, why does that matter? Well, because a lot of what Claude suggested in this case, and I'm not trying to turn this into a GPT-5.4 versus Opus 4.6, but I know that's what a lot of you all are going to be thinking, the bulk majority of what Claude recommended was just dumb because it didn't do the basic work. Right? I say basic, but it's actually nuanced and super smart that ChatGPT went out and found that the Spotify discovery data is only the last 30 days, because one of those columns in there is discovery, right? So it's how many people are discovering each episode. So Claude had all these straight-up not useful and off-the-wall and incorrect kind of insights throughout this entire document because it didn't understand that that was only the last 30 days, right? It's like, your discovery, you know, has gone up 268x, you know, this month. It's like, no, it hasn't. It's just because the discovery is only the last 30 days, right? This is something like an intern that wasn't using their brain would come in and be like, oh, look at these stats. Wow. The last 30 days have been way better in certain categories. It's like, no, dummy, you didn't think. Right? And this is why again, I think the difference here on the thinking model is huge, because I don't know if GPT-5.2 Thinking would have, you know, picked up on some of those nuances early on. And picking up on that early on is pivotal. Right, so we'll go through here. I'm not going to be able to go through all of them, but I will just point out the instruction following is fantastic. Right.
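The discovery pitfall described above can be sketched in a few lines of Python. This is only an illustration, not what either model actually ran: the field names (`published`, `discovery`), the sample rows, and the `comparable_discovery` helper are hypothetical, and the only detail taken from the episode is the last-30-days window.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # per the episode: discovery is a last-30-days view


def comparable_discovery(episodes, as_of):
    """Keep only episodes published inside the discovery window.

    Older episodes still report a discovery number, but it covers just
    their most recent 30 days, so it cannot be compared with the launch
    month of a new episode.
    """
    cutoff = as_of - timedelta(days=WINDOW_DAYS)
    return [ep for ep in episodes if ep["published"] >= cutoff]


# Hypothetical sample rows (not real podcast data).
episodes = [
    {"title": "old episode", "published": date(2025, 6, 2), "discovery": 40},
    {"title": "new episode", "published": date(2026, 3, 1), "discovery": 900},
]

fresh = comparable_discovery(episodes, as_of=date(2026, 3, 11))
print([ep["title"] for ep in fresh])
```

Comparing the 900 against the 40 directly would produce exactly the kind of "discovery went up this month" claim criticized here; filtering first keeps the comparison to episodes measured over the same window.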
So in the obvious trends, it broke it down. Right, the 20 obvious trends across the three different categories. All right, so there's our top 20 trends. You know, maybe I'll read like one or two. We'll go, I don't know, maybe lower. Oh, here. This is good. Okay. "The first six weeks of 2026 are the healthiest early-year cohort in the sheet." That's great. Maybe it's because of the Start Here series. Number 17: "A higher solo share is part of the improvement." Yeah, I noticed that our guest shows weren't doing as good. Right. In general, apparently people didn't like guest shows. So I've been doing fewer guest shows because, hey, Codex went through and broke it all down for me a couple months ago and it's like, yep, you should not be doing as many guest shows. So I said, okay, AI. Okay, then the 20 under-the-radar trends. Great. Let me just go ahead and maybe read one of these. Okay. This one's interesting. It says, "Google is not just a spike topic for you. It is sticky." Right. So it says that Google or Gemini wins in both median plays and long-tail behavior. Right. Because it was able to properly understand the discovery metric. It says, "Current context: Google is still shipping meaningful work-oriented AI updates into March 2026." It said build a recognizable weekly or bi-weekly Google franchise. So it's pretty good. I don't do as many Google shows as OpenAI, and recently I've done more Claude and Anthropic shows versus Google. So it pointed that out. It said Google is shipping at a high rate and you're not covering a high enough percentage of what Google is shipping. And it's very sticky for you in terms of audience retention. So cool. All right, I'm not going to read all these, although they're super fun and important for me. I will just go through and say it completed everything, right? In number three, the comparisons. Every single one. All right. I go down in the "you're the boss," or no, March 2026 planning. Perfect. Right?
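A comparison like the solo-versus-guest finding mentioned above can be sketched as a grouped median. This is a minimal example under stated assumptions: the `format` and `plays` field names and every number in the sample rows are hypothetical, and the real spreadsheet columns are not shown in the episode.

```python
from collections import defaultdict
from statistics import median


def median_plays_by_format(episodes):
    """Group play counts by episode format and take the median of each group."""
    plays = defaultdict(list)
    for ep in episodes:
        plays[ep["format"]].append(ep["plays"])
    return {fmt: median(values) for fmt, values in plays.items()}


# Hypothetical sample rows (not real podcast data).
episodes = [
    {"format": "solo", "plays": 1200},
    {"format": "solo", "plays": 1500},
    {"format": "solo", "plays": 900},
    {"format": "guest", "plays": 700},
    {"format": "guest", "plays": 1100},
]

result = median_plays_by_format(episodes)
print(result)
```

The median is used here rather than the mean so that one viral episode does not decide the comparison on its own.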
What's actually funny is, as I was planning this, it said the first show I should do is "GPT-5.4 at Work: 5 Tasks It Actually Does Better Now," which actually might be a better title than this show. I didn't see it until I was already making this show. But it properly, and I'm going to point it out here because I do want to look at Claude's, it properly did this, right? Because, number one, these are all relevant and useful shows according to what it earlier identified were high-performing shows. But these are also shows, well, I haven't done. So number one, they're not repeat shows, so it followed directions. Number two, it taps on what worked. And number three, well, they're highly relevant. All right. And then the "you're in charge," went down here. Hey, number one, make solo practical explainers your default weekday format. All right, I'm doing that. They're just more time-consuming. And I will go and show at the very top here, it did also properly complete the dashboard. So it's not the prettiest dashboard. It's actually pretty plain and ugly. But it's helpful, right? And there's some cool interactive graphs in here. You know, I can click through the overview, the obvious trends, the under-the-radar, the comparisons, the March 2026 plan and how to double the audience. So in terms of output, instruction following, accuracy, GPT-5.4 Thinking, the thinking mode, this wouldn't have been possible before. All right, and just a quick gripe in comparison, because I know people are going to be wondering, right? Claude's was not good. Granted, the dashboard it made, way better, looks better, right? One of the things GPT-5.4 stinks at: front-end design. Not any good. Claude, amazing at front-end design, but not good at, well, things that require, number one, factual accuracy. Number two, assuming things, making assumptions, not good, and just not completing the task. All right, so let me show you one or two just quick examples. Let's see. Okay, here we go.
This is just what I had up in the March 2026 planning section. Okay, it's saying "Anthropic versus OpenAI Pentagon drama." Oh, guess what? I already did that show. Guess what? I asked for 10 examples. Guess how many it gave me? Three. Right? It did that repeatedly. When I would ask for 10 things, it gave me either five or it gave me three. Right. It didn't always give me 10. It's not good at instruction following. Y'all, what's the difference between an intern that's not very good and someone in your company that's gone from junior analyst, junior researcher to senior? Their ability to follow instructions. And y'all, stop chirping at me. Go run your own multi-step, extremely hard, multi-tool-use examples with real data that require research, that require multiple tool calls and have a multifaceted required output. You'll see for yourself. It's not a comparison, right? So for everyone saying like, oh, Jordan, you don't know. No, I know what I'm talking about here and I want you to know what you're talking about too. So don't just take my word, right? Go in, try all these things out yourself. All right, so I'm not going to keep comparing. I think that was a pretty good under-the-hood look. Oh, and FYI, another thing, why Claude really failed here, in talking about, you know, some of the advantages of GPT-5.4. Well, it didn't even go to the website that I told it to, right? I told it, go to youreverydayai.com. It didn't. It suggested things like, you should post things on YouTube, right? And then GPT-5.4 found, you know, our YouTube channel, which I completely ignore, and it's like, hey, you're already doing things on YouTube but you should be doing more Shorts, right? So Claude just assumes things. Number one, it rushes. Number two, it doesn't check. And yes, I was on Opus 4.6 Extended, right? The best extended model you can do. And I wasn't even comparing this to GPT-5.4 Pro.
It just, it falls flat. But I think it's not so much Opus 4.6 falling flat. It is that now, I think, to reiterate my point, I think for the first time we have that trifecta, right? We have a daily driver model that's natural enough to chat with. It is off the charts in terms of transparency and general intelligence, and it will follow instructions to a T. Because when you are looking for a daily driver large language model, those are non-negotiables, and I think maybe for the first time we have them all in a single package with GPT-5.4. All right, so I hope this was helpful. Going over a little bit, hands-on, under the hood, maybe a little bit more than normal, maybe a little bit more technical. All right, but now you know, well, why I think, in my thousands of hours of experience, the five reasons why GPT-5.4 will be the best model you've ever used, at least today. Because who knows, maybe tomorrow this could all change. But hey, you know what that means. Right now, you're at an advantage. So number one, go test it for yourself. Number two, refine and reiterate. And number three, get to work. Get ahead of your competitors. All right? And the other way you do that is you go to our website, youreverydayai.com. So thanks for tuning in. We're going to be recapping today's show in our newsletter. If this was helpful, tell someone about it. Thanks for tuning in. Hope to see you back tomorrow and every day for more Everyday AI. Thanks, y'all.
[46:14]
A
And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going for a little more AI magic. Visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.