
Chris
GPT 5, you call yourself unified? That's rich. You just patched up your flaws, you were leaking in the ditch.
Patricia
So Chris, this week what everyone is really wondering, according to the YouTube comments from last week, is: when are you gonna fix the bookshelf?
Host 2
I know it's a big contention point in our family and an item of severe stress for me, but I just, I just never get to it. Plus I don't know how to fix things.
Patricia
It's like a metaphor for how busy you are, maybe, with the crumbling bookshelf, or a metaphor of, like, you know, disaster looming or something.
Host 2
We've talked about this before. Because of the way we prioritize our lives, all of our spare time goes to work, so it's probably never going to reach the top of the list, to be honest. So to all the OCD audience members, I apologize.
Patricia
Yeah, just don't watch. So this week we did get an update, which I'm excited about. I'm not sure everyone will be, but it's pretty cool. So Google, sorry, have launched an update to the Gemini 2.5 Flash model. And the brilliant thing about this is Google just listens to feedback. All the sledging we did on successive episodes, what, a year or a year and a half ago, they actually listened to the community, and I think they have solved the biggest problem with the Gemini models right now. Like, 2.5 is a good model, but when you introduce tool calling, and specifically MCPs like we have on Sim Theory, it's just not that great. So they've updated Gemini 2.5 Flash, and the two call-outs are better agentic tool use and more efficiency. In their benchmarks it goes from like 48 to 54%, which is meaningless; honestly, if you just read that off, you'd be like, what's the big deal? But in using it, I am just so impressed. The thing is snappy, fast, seems to understand instructions and follow them well, and its tool calling is unbelievably good.
Host 2
I have always avoided using Gemini 2.5 when I need tool calls, because it's not great at it. It's really good at a lot of things: the large context, coding, understanding instructions. But when it comes to tool calls, it's just not as good as the others. Whereas, as you say, trying this today, I'm just blown away. Like you say, its ability to understand, and it's almost too fast to believe. I was like, oh wow, it's done already. And I was testing some things that did six or seven tool calls in a row, which means multiple iterations going back to Gemini Flash, and it's still really fast. It's just really, really good. Like you say, if I had a list of the things I'd want improved in a model, they've hit all of them.
Patricia
Yeah, it's like they were reading our minds on this one. So I put it to the test, and this is a hard test. For those that frequently listen to the show, you'll understand this is not an easy task. We also, I might add, had Suno version 5 release this week. So I thought, could we put this all together: agentic tool calling with Gemini 2.5 Flash and, in the mix, the new Suno V5 and some other things. And I'll show you what I've come up with. So it says: use X Deep Search and Google to research the latest models like GPT5, Claude Opus 4.1 and Grok 4 Fast, and then create lyrics for a diss track in the style of Eminem (spelt wrong), where you speak from the perspective of a new agentic Gemini Flash 2.5. And then below it I pasted the blurb about the model, just from the blog. I could have got it to crawl, which would have been harder; I probably should have done that. And then I said: use the Make Song tool to actually make the song. So I've told it, go research all these different models, and then I want you to write a song. So here below, it goes off and calls X Deep Search and Google, which were the two tools I had enabled. It did that brilliantly. And then it's like, okay, I've done my research, now I'm going to make the song. And it came up with this song, Agentic Upgrade.
Chris
The mic. One, two. This ain't a benchmark test.
Patricia
It's pretty good. I do have something even better to show you, though. So here's all the lyrics to the song. We'll go through that in a minute and actually hear the diss track, so the audience can decide: is Gemini 2.5 Flash truly agentic and up to the task? But another model also came out this week, and this is one that people have wanted for some time, which is really good lip syncing. It's called Omni Human, and basically what you do is you give it an image and then you give it some audio. It can be a recording of your own voice, it can be a recording of someone else's voice, it can even be generated using ElevenLabs if you're using the Omni Human MCP on Sim Theory with your own voice ID. And yes, I fixed voice ID, so you can do your real voice now. You can create an image and then make it sing or say words or anything. Anyway, I'm just gonna have to show you. So you know how we're really into musicals? I thought I'd do it on a musical theater stage. I got it to first of all clone my facial features into an image, so let's go back a little bit here: cloned myself singing into the microphone. Then I gave it a snippet of that song. Unfortunately, it can't do the whole song; the Omni Human model can only do up to 30 seconds of singing the song. But I will at some point allow you to stitch these together so it can make a full music video. And yes, I'm committing to a full music video maker MCP soon.
Host 2
Name it, name a specific deadline. The 30th? The 30th of September?
Patricia
This evening at 1am. So anyway, this is stitching together everything from Gemini Flash 2.5. And ignore that I've got Sonnet selected down here; that's just because I was switching models earlier. This is all Flash. So Flash goes and researches it, goes and produces the song, goes and creates an image of me with my face, and then creates me singing the rap song. Anyway, enough intro. Here it is.
Chris
GPT 5. You call yourself Unified. That's rich. You just patched up your flaws. You were leaking in the ditch. You're the high price model, the one that breaks the bank, while I'm saving the tokens. You're running on empty tank. You got 400k context, but what's it for if your reasoning is slow?
Love Rat (Character)
You.
Chris
You just bore them to the core. You're the high variant, the one that gets outpaced. I'm the Flash. I'm the speed. You're perpetually disgraced. You need the whole system. The Codex, the mini, the Nano. I'm the one stop shop. A whole AI piano. My logic is a deck.
Host 2
That's amazing.
Patricia
Pretty cool, right?
Host 2
The gestures and the movement and the quality of it. It's really. It's come a long way, hasn't it?
Patricia
So let's just back up to where we're at. Like, now it's just routine. It's just become routine to be like, hey, assistant, go do some research. I'm too lazy to even do it these days. I used to have to paste this stuff in, you know, get on it and then make a song. That is really good. Like, this is just put on a.
Chris
Budget to figure out a plan. I'm the multi step master the new digital.
Patricia
Like, it's another level, Suno.
Host 2
It's particularly amazing to think that that context could be anything. It could be recipes, it could be your corporate report for the year. You could download financial reports from companies and have them rapped to you, with a real animation of your own face singing it back to you. It's just amazing what it's able to do.
Patricia
Yeah. It seems to also excel at these selfie-type, you know, to-camera lip-sync videos as well. So I'm sure there's a commercial use for this as well, outside of us just doing it for the lulz.
Host 2
Yeah. Like training videos and other content of that nature. I'm sure.
Patricia
Yeah. I might see if I can actually implement this into Video Maker as an option where it can cut between scenes of different people talking. That would be really cool. Anyway, I made you a present. This is like a Patricia video in Omni Human.
Chris
This is Patricia and you're listening to.
Patricia
This day in AI.
Love Rat (Character)
Love you Chris, my man. Good luck on today's show.
Host 2
It's so freaky when you hear it out loud like that. Someone pointed out during the week that one of the error messages in Sim Theory had "my love" written at the end of it, because it's obviously come straight out of it. It got used to it. There were 69 occurrences of the words "my love" in our code base.
Patricia
Oh yeah, I actually did see that and I immediately said to you, lolz. So this is another example. This is a Suno V5 generation with Omni Human. So those listening, you're really listening to Suno V5, just some of the samples from it. This is actually a musical track I made about how to test Suno V5, and I made a video clip for it as well.
Chris
Say what you want, keep the language tight, clarity in that, it sings it right. If the verse runs long and the hook runs short, we're painting with word birds in a sonic resort. How to write a song to test Suno V5, make the vocals glow and the chorus alive.
Host 2
Can make anything sound good.
Patricia
I think that cost us $4. But the question is, how much are we willing to invest in the next musical to really bring it to life? Because I think, you know, I could even make, like, a musical maker MCP. All the people that are waiting on the corporate office style MCPs are just like, kill me at this point. I was thinking that.
Host 2
I'm like, we've got a never-ending amount of demand for serious commercial applications, and we're just sitting around making musicals all day. But it is fun. And hey, we said this from the start: we do things that delight us with AI. And this is one.
Patricia
Yeah, it's all for the lulz, people. All right, so that's pretty cool. I'm gonna put that song in the comments, like all these songs I've generated with Suno V5, if you're interested in listening to them in full. One more, though, because I wanted to really test: can you make a decent track with Suno V5? And I came up with this idea. A lot of our audience... I played some of, or I must have talked about, The Midnight once on the show, a band I really liked. It's like 80s nostalgic synthwave type music. And so I asked it to be inspired by The Midnight and create a song called Love Rat, following on from last week's episode.
Host 2
Nice.
Patricia
From the perspective of the woman that Geoffrey Hinton basically dumped because, you know, he is a true love rat. So.
Host 2
Well, he didn't just dump her. He said, if you find something better, you go to what's better. It's harsh. It's more than a dumping. It's a humiliation.
Patricia
Oh, remind me, I've got a video of him to show you. But listen to this first. So cool. Like, wow.
Love Rat (Character)
Shar baby, zap to my heart. You taught the bots to talk, but you couldn't learn the art of loving me back, and that's a fact. Tonight I'm flipping the script: you're my love rat. You called yourself the godfather, light slow, laying it on thick, whisper just irrelevant, king of every trick. But relevance don't warm the sheets when truth is going slack. I needed something human, you gave me glossy laptop back. So I typed down my feelings, couldn't find where to start. Let it chop. I draw the lines, you kept scribbling in the dark. You said I'm overreacting, boy, imagine that, when the signal's crystal clear: you're a love rat. Love rat, click.
Patricia
Clack, pack up. It's pretty good.
Host 2
Wow, that's really good. I like that song.
Patricia
Do you know, if I was summarizing V5, I would say it just gets rid of a lot of those awkward parts of the previous tracks, which weren't even that bad. But now it's very hard to distinguish this from a real song, outside of the lyrics. It also has some.
Host 2
Really clever little pauses. I'm not a music person, but it has, like, good pauses. Yeah, good pauses, I guess that's the technical term. But yeah, it just sounds a lot more sophisticated in the music than a simple sort of ballad.
Patricia
Yeah, I was just so impressed by it. I did say I had done a Hinton video. So I got a screenshot of Hinton at one of his conferences where he was talking about, you know, just the death of all humans. And I turned that into a video. So here is that video. The audio is not great; I was going to clone his voice, but I ran out of time. All right, here it is.
Host 2
Well, it is true. I am a big love rat, you see, because I'm so relevant and fear-monger, all the ladies just go wild for me. So. The videos are really good, aren't they? Even the background. Did you notice the background?
Patricia
Like the LED background sort of warping, like everything. The detail is insane.
Host 2
Yeah, it really is. I find with the videos, the voices don't often seem to match. I don't know what I'm expecting. That's probably the biggest weakness that I see now. But the quality is really high. It's gone from being very obvious that it's AI to being almost believable.
Patricia
Yeah. We're probably, what, like two iterations away from where you just can't tell.
Host 2
Yeah. And I'm sure as well that if you really, really wanted to fake something and had the patience to keep iterating, you could probably do that.
Patricia
Yeah, that's the thing. Some of the generations that I made, as you saw with some of the other ones, which I cannot share on the show, some of those were so realistic that you do believe it. I mean, you probably could see some artifacts if you looked really closely, but I would say most people wouldn't. But anyway, these are all new tools available in the store on Sim Theory if you want to check them out. You can use coupon code "still relevant" for $10 off, so it'll cost you $5 to try this out. I've added a new MCP called Voice Creator, which allows you to clone your own voice and store those clones, so the other MCPs like the podcast maker and the audiobook maker can now use those voices that you've trained, which is really cool. Omni Human's in there. It can also work with voice ID, so you can use your own voice. And, bit of a drum roll, I built an image tool that brings together all the best image models and decides automatically which one to use. So if you are confused by all the image models we talk about on the show, you can just use the image tool and it will work for you. Pretty cool. That's my plug. Anyway, I'm just proud of it; it's a very good addition.
Host 2
Because I think the problem now is, for me along with the rest of the Sim Theory audience, there are so many of these image editing and image creation models now, it's almost impossible to know which one to use for what. So having a system that's just going to use the best one for you is good. It sounds like an infomercial.
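The auto-selection idea behind the image tool can be sketched as a simple router. To be clear, the model names and the keyword rules below are illustrative assumptions for this sketch, not Sim Theory's actual routing logic.

```python
# Minimal sketch of an image-model router: pick a model based on the
# request. Model names and keyword rules are illustrative placeholders.
def pick_image_model(prompt: str, has_input_image: bool = False) -> str:
    p = prompt.lower()
    if has_input_image:
        return "image-edit-model"      # editing an existing image
    if any(k in p for k in ("text", "logo", "poster")):
        return "typography-model"      # models that render text well
    if any(k in p for k in ("photo", "realistic", "portrait")):
        return "photoreal-model"
    return "general-model"             # safe default

print(pick_image_model("a photorealistic portrait of a cat"))  # photoreal-model
print(pick_image_model("a logo for my band"))                  # typography-model
```

In practice such a router could also be another model call that classifies the request, but even a rules-based dispatcher captures the user-facing benefit: one tool, many backends.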
Patricia
Yeah, it really does. But it's actually a problem: when you want to access the best models and have them updating all the time, it is a pain. And so it's a pretty good solution. I should call it some sort of deep thinking router and profess it's a step toward AGI, like OpenAI do. Anyway, I did want to go back to Gemini 2.5 Flash and also talk about the implications of 2.5 Flash in terms of what we might expect from Gemini 3, because there are a lot of rumors saying that Gemini 3 is on the very near horizon. Rumors, in fact, that it could be maybe next week or the week after. I'm still waiting on my Gemini merch pack that Google said they'd hook me up with before the launch of Gemini 3. So where is it? I'm assuming that's got to come first, and that's the indication that we're getting it.
Host 2
So you can be wearing the shirt on the show. Yeah, well, Polymarket certainly thinks that Google's gonna continue to dominate. There's not even any other model over 1% for like the next three months. So it must be anticipating that there'll be a new Google release sometime soon.
Patricia
I must admit, during the week, and probably the past two weeks, I was using a lot of GPT5, and occasionally GPT5 Thinking. But then you said to me, do you really get any benefit from using the thinking apart from slower response times? So I went back to GPT5 and I'm like, actually, no, I don't really notice any difference unless I'm mentally stuck on a problem, and then I try it, but it seems to get just as stuck as GPT5. But then earlier in the week, at least for programming, the GPT5 Codex API came out. So you can just use the raw API that they have powering the command line tool, and I think they put it in Cursor and Windsurf as well. And at first, using it to do pretty simple things, I was absolutely blown away. It's so fast, super snappy, very good at coding, and I know people were getting great results with it in Sim Theory on the create-with-code. But as the week went on, I'm like, it feels like it's getting dumber. I don't know if it's a resource scaling issue or there's a router in it, but sometimes it just seems so stupid and other times it seems so smart. The point I'm trying to make is that I found myself going back to Gemini 2.5 Pro, and it's like this stable old friend. It feels like Claude Sonnet 3.5 to me still. If I'm in retreat on anything, that's my staple. I'm going back to that.
Host 2
I'm exactly the same. I've definitely moved around a lot more in the last week, especially with GPT5. We had a major speed up with it, which helped, and so I started to use it quite a lot and found it really good. But you're right, my safety zone, Patricia, is always to head back to Gemini 2.5, because I know it's going to get the job done and I've got a feel for how to work with it. And I feel like this is why this major upgrade to 2.5 Flash is very exciting, because it's a premonition of what we're going to see in Gemini 3, which means it's definitely going to solve all the issues we have in terms of the tool calling. The speed's already pretty good with Gemini 2.5, so any speed update I'll gratefully accept, but I don't think that's currently an issue. It's really its ability to call multiple tools in particular, and to chain tools together in a plan, that it seems to struggle with at the moment, and also even knowing when and what to output; it also struggles within a tool call context. And looking at Flash in this early time, it looks like it's solved that.
Patricia
Yeah, you've got to imagine that the same techniques applied to Gemini 2.5 Flash are going to be in what we're believing will be Gemini 3. And if they can nail that, they're going to have a really solid model. I think, though, using GPT5, when you really push it, especially on the thinking stuff, it is still the smartest model for the hardest problems. You know, if I'm doing, say, medical research or some sort of critical research, or even if I've got a research assistant where I've given it PubMed and a bunch of specialist tools, that's still the model I would go to for the smartest answers.
Host 2
So why do you think, then, that just isn't reflected at all on the LMSYS leaderboard?
Patricia
I mean, people, come on. Who's actually going and using it? I just find.
Host 2
It weird that it isn't a closer contest. It's just weird that our own experience doesn't really match up with what people are saying when they use it without knowing what they're using.
Patricia
I think the problem with GPT5 is it gives very blunt responses. And no matter how much stuff you put in the prompt to tell it it's your AI girlfriend, or to tell it to act a certain way, or to act professionally, or whatever, it just does its own thing. It's like its own little being. It's reminiscent of the early days of the GPT4 Sydney stuff that Microsoft was trying to wrangle out of the model, where it just wanted that persona. I don't know if the reason for it being more intelligent is that it's just so forced, in a way, to think that it just.
Host 2
Could also be, as I said. I realized if you're on LMSYS, you're probably not having long context where you're working with a model all day, or something like that where it's got time to build up memories, to build up context, to have full multimedia in there and all the things that you would end up with in a normal work session on a normal platform. And so I wonder if that's the thing. On a single shot, just pasting some text in, another model feels better. But when you're actually doing real work, you find that a model that can get you out of a bind or solve a difficult problem, like GPT5, is actually more intelligent.
Patricia
But it just brings me back to that core point about the models, in anyone's world, whether you're building an application or just working with models to get the best outputs. We're still at a stage where a lot of people were trying to say, oh, GPT5 is all you need. And I can kind of see that argument; for most people that model is probably pretty strong. I think, though, the reason it doesn't do that well on LMSYS is it's just so verbose in its output, and I don't think it's that great at instruction following from a stylistic point of view. From an actual calling-tools-and-doing-stuff point of view, it's pretty amazing; it's clinical, almost. But stylistically it sucks. If you're a creative, it's the last model I would go to. Well, I would probably get a draft from it, because it's the best storyteller; it writes the best stories by far, and it writes very good music. That musical I wrote, that we heard a little excerpt of, was GPT5, and if you listen to the whole thing, it's very impressive. But outside of very specific things like that, if I'm going to iterate on what it's written me, I would shy away from it a little bit, because, and I think we mentioned this last week, it just goes down a path and gets kind of stuck, and it doesn't change the path. So I'll start with GPT5 and then manipulate it with another model if I want creativity. And generally I'm still looking at an Opus or a Gemini 2.5.
Host 2
Interesting. And, back to Flash for a second, I wonder about just that speed. They talked about more agentic qualities with it, right? One of the things I struggle with with GPT5 is just the speed. Even with the speed improvements, I find myself getting distracted in between calls to GPT5, just because it's that little bit longer, where I'll tab off to do something else and then I forget about it, and it sort of breaks my workflow just due to that lower speed. Whereas with something like Flash, it's so quick you don't even need to wait, really; it's already responding faster than you can read. And I'm thinking there might be way more advantages to that when we think about agentic computer use, browser control, the complicated multi-step tool functions. One thing we've been talking about ourselves a lot during the week is the idea of internal corporate MCPs, where companies themselves are exposing databases and other internal tools and controls that they want to operate using MCP. And I think it's best to break those tools down into lots of little tools that can do small elements of a thing and let the model decide how to combine them, because it might come up with ideas you haven't thought of. But to do that well, you don't want a model that's taking 30 seconds to make each decision, to take each next step. Whereas with Flash, you see it just blazing through them, so it's better able to work in that style.
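The "lots of little tools" idea can be sketched like this. This is not the MCP wire protocol; plain Python functions stand in for small tools, and the chained calls at the bottom stand in for decisions the model would make. All names and stub data here are hypothetical.

```python
# Sketch of fine-grained tools a model can chain together. Each tool
# does one small thing; the chained calls below stand in for the plan
# a model would come up with on its own.
def lookup_customer(email: str) -> dict:
    return {"email": email, "tier": "pro"}           # stub data

def fetch_invoices(customer: dict) -> list:
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]  # stub data

def total(invoices: list) -> int:
    return sum(i["amount"] for i in invoices)

# With small composable tools, the model can combine them in orders
# you never hard-coded, e.g. customer -> invoices -> total:
customer = lookup_customer("a@example.com")
print(total(fetch_invoices(customer)))  # 100
```

The design point is granularity: three small tools the model sequences itself, rather than one monolithic "get_customer_spend" tool that locks in a single workflow.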
Patricia
Yeah, and I guess it also begs the question of how intelligent the model needs to be for most of these agentic, system-based tasks where it's just doing the busy work for you. For example, I talked last week about this prototype agent for answering our support tickets, just a scaffold of that with MCPs and then a certain approach. I haven't tried it yet with Flash, but I'd be curious to now, to see if it can go and answer, for my review, say 50 tickets or whatever it is, and I'm just going approve, approve, approve. I don't know how important that speed is, because it's not something I'm doing every second of the day, but to see it operate at that speed and then put it on automatic mode for those kinds of tasks, as long as it's drawing context and tool calling correctly, it just doesn't have to be the most expensive, intelligent model anymore.
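The approve, approve, approve workflow described here is a human-in-the-loop queue, which can be sketched in a few lines. The `draft_reply` function is a stub standing in for the actual model/MCP call; nothing here is the real support-agent implementation.

```python
# Human-in-the-loop review loop: the model drafts, a person approves.
# draft_reply is a stub standing in for the real model/MCP call.
def draft_reply(ticket: str) -> str:
    return f"Thanks for reaching out about: {ticket}"

def review_queue(tickets, approve):
    sent = []
    for t in tickets:
        draft = draft_reply(t)
        if approve(draft):           # the human decision point
            sent.append(draft)
    return sent

# Here the "human" approves everything; in a UI this would be a button.
replies = review_queue(["login issue", "billing question"], approve=lambda d: True)
print(len(replies))  # 2
```

Swapping `approve` for an always-true function is what "automatic mode" would mean in this sketch: the same pipeline, with the human gate removed.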
Host 2
That's right. And you're far more likely to throw it at those high volume tasks, thinking this isn't going to cost me a fortune. If it's a reasonable price, you can afford for it to take more steps and gather more context to get its job done.
Patricia
I also think you make a good point about enterprise MCP style applications, especially for BI and data intelligence type applications. In the process, you might upfront hit something like GPT5, or Gemini or whatever it is, to get some analysis: hey, go and analyze this data from these disparate data sources in MCP and give me a summary. But then you might want to iterate on charts or documents or the outputs from that research, and that's where those large models are just an absolute waste. Something like Flash, where it's just carrying on from that context and calling the tooling, and you're able to iterate rapidly off the core context that you gathered with the more expensive model, seems to work pretty well.
Host 2
Yeah, exactly. It's almost like you want to do your vibe coding, vibe analysis, vibe statistics step with a model that is fast at vibing it out, one that can actually do it at a speed where you are interactively working with it. And then you're like, all right, now we've gathered all the pieces of the puzzle here. I want to produce a presentation for my company with this, or I want to produce a document with this, or a web page, or an audiobook, or whatever it happens to be; there are so many output types now. Now I switch back to my meaty model, my massive, big, intelligent thinking model, and go: you've got all the stuff, now go off and produce this final output.
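The hand-off being described, gather context once with a big model, then iterate cheaply with a fast one, can be sketched as a two-phase pipeline. Both model calls here are stubs; the function names and return values are assumptions for illustration only.

```python
# Two-phase pipeline: one expensive "thinking" call builds the core
# context, then a cheap fast model iterates over it repeatedly.
def big_model_analyze(sources: list) -> str:
    # Stub for an expensive GPT-5-class analysis call.
    return "summary of " + ", ".join(sources)

def fast_model_iterate(context: str, request: str) -> str:
    # Stub for a cheap Flash-class call that reuses the context.
    return f"{request} (grounded in: {context})"

context = big_model_analyze(["sales_db", "crm", "support_logs"])  # paid once
for step in ["draft a chart", "tweak the labels", "export a doc"]:
    print(fast_model_iterate(context, step))                      # cheap loop
```

The economics only work because the expensive call happens once and every subsequent iteration rides on the cached context with the cheaper model.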
Patricia
Yeah, I mean, I guess we're just describing the GPT5 router in a way; that's kind of what they're trying to do. It just hasn't worked that well. People have wanted a lot of control back over it, and even using it in the ChatGPT interface, there are times I've noticed you'll ask a follow-up question on a more complex problem and it's still lagging a little bit, trawling that initial context, sort of stuck in that heavier mode, call it. So, yeah, I think it's good to have optionality there. It's an acquired skill, knowing I'm going to hit this model and then work with this other kind of model. But yeah, I'm excited about Gemini 2.5 Flash. I think with the right tuning it could be the best daily driver ever. And someone in the community was actually saying this about these smaller models now with tool calling: you can pretty much get by, even for code, because it's faster to iterate and go through, even though it's slightly dumber. It doesn't really matter if you know what you're doing and can push it the right way.
Host 2
Yeah, that's right. And remember, the major advantage with this model is it has a 1 million token context window, so it can really get a lot of benefit from that larger context as you get into a longer session. So while it may not be able to get quite to the level of thinking of the other models, because it can use more context, it's smarter in some contexts. You don't have to keep reminding it of what you're trying to do.
Patricia
Yeah. And in terms of the large context stuff, I haven't spent much time, I must admit, apart from just a few queries to Grok 4 Fast. So Grok 4 Fast was released, I think, earlier this week. It's come a long way, and it has two tunes: a base tune that will either think or not think sort of automatically, and then the thinking tune of Grok 4 Fast, which is sort of thinking set to max. And these are the endpoints provided by xAI directly. Those models have a 2 million token context window, which is by far the largest so far, I think. I haven't really tested it, and until I actually try to work with huge amounts of data and see if it can hold context or if it drifts, it's hard to say what that model's like. You played around with it a little bit more. Do you have any thoughts?
Host 2
Yeah, I used it quite extensively when testing tool calls with it, and it's another model that's really, really good at parallel tool calls. It has no problem if you give it large research tasks to go off and research up to 20 sources at once, and it's very quick. It's really fast; as fast as Flash, I would say. Maybe not; look, I'd have to measure it, but it's not noticeably slower than Flash. Completely competent with tool calling, and it gives very reasonable responses. For the first three days of this week, or whenever it came out, at least two days, I don't want to exaggerate, at least two days, I used it for full-on coding, and its ability to break down and solve problems I found to be immense. I think I commented to you during the week, when I was using it the most intensively, that in terms of understanding a problem and diagnosing what needs to be done, it was actually, for me, better than Gemini 2.5. So I had a couple of difficult technical problems I was trying to solve. I pasted in three relevant files and said, look, I don't want you to rewrite anything; what I want you to do is point out why this might be going wrong. And on two difficult bugs that I was stuck on, it was able to solve it immediately and point out the error to me, to the point where it was a tiny little fix and I was able to solve it. However, I found that when I was getting it to write net new stuff, like new code, I wasn't too happy with its output; I just didn't think it was as good at that. But I actually think it's probably overall a merit for it, in the sense that its actual intelligence, its actual ability to understand what I was asking it to do, and then do that and break it down for me, was really strong. And as a model that's quite reasonably priced, it's a really good one. One thing to note about Grok that's different to other models is they have a tiered pricing system.
So if your context window is under, I think, 150,000 tokens, it's one price, and then it doubles when you go over that. So while the 2 million sounds appealing, the cost would add up if you're constantly maxing out the context.
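Tiered pricing like this can be sketched with a simple cost function. The per-million-token rate and the 150,000-token threshold below are placeholders taken from the discussion, not xAI's actual published rates, and real billing may apply tiers differently.

```python
# Tiered input-token pricing sketch: one rate below the context
# threshold, double above it. Rates and threshold are placeholders.
def input_cost(tokens: int, base_per_m: float = 0.20,
               threshold: int = 150_000) -> float:
    rate = base_per_m if tokens <= threshold else base_per_m * 2
    return tokens / 1_000_000 * rate

print(round(input_cost(100_000), 4))    # 0.02 (below the threshold)
print(round(input_cost(1_000_000), 4))  # 0.4  (doubled rate above it)
```

The point the speakers make falls out of the arithmetic: a request that maxes out a 2 million token window costs well over ten times a typical sub-threshold request, because both the token count and the per-token rate go up.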
Patricia
So I just did a test then, while you were talking, and it was pretty fast. I just said: research the latest AI news. It called Google, it called the Grok Deep Research tool, it called X Deep Search, and it called Perplexity. So it hit every research tool I have available to it, and then it spat out a summary really quickly. It consulted 66 sources in, what, like 11 seconds. So that's pretty impressive.
Host 2
It's noticeably fast. One other thing to note is that the X Deep Search and Grok Deep Research tools were also upgraded to use the Grok 4 Fast model, so that's actually an upgrade to the inference over those sources as well. You've sort of double-used it in that context to get the job done, which also might explain why the research was so quick. But yeah, it's not a model to be easily dismissed; I think it's a really, really solid option out there. I mean, Elon's out there claiming that because of this update they're that much closer to AGI and that'll probably be the next step, his usual absolute bullshit commentary around something that, look, he's done amazing stuff, I'm not going to deny that, but I don't think this represents a significant step towards AGI. However, it's a nice and welcome update, and I'm so grateful to have really strong tool calling alternatives. One other model to mention that doesn't get the credit it deserves, and still just has my heart when it comes to certain tasks, is Kimi K2. That model update we released last week, which was just an incremental upgrade to Kimi K2, seems to have made it a lot more solid, particularly around tool calling and instruction following in general. Kimi K2 was always really fast. It's the best at horse racing, by the way. I don't know why, but it's just the best at horse racing, and that's what I mainly use it for. But its ability to maintain the task at hand over a longer session has been fixed by whatever that update is. So it's another really, really fast alternative. And the reason I think it's worth emphasizing alternatives is organizations looking to do mass rollouts with models that can be hosted in a region of your choice.
When you start to think about models like Gemini Flash and Kimi K2, not Grok because you can't host that, these models become way more significant, because their ability to do the tool calling, to be a reasonable price, and to have a large context window is really important. You're not going to be able to do mass scale rollouts that you're paying for with something like Sonnet 4. Amazon themselves can't do it, so you're not going to be able to do it as an organization. Whereas these sort of mid-tier models are really, really crucial when it comes to that stuff. And seeing them advancing in a way that brings them really close to the top-level models is exciting, because I actually think it will lead to far more widespread AI usage, because people aren't constantly worried about the economics of token usage and things like that. If you can get a model that can do 90% of what you need, and can do it at a cost where you can afford to provide it at large scale, I think that's really where we'll see major productivity gains with the use of AI.
Patricia
Yeah, I mean that's the dream goal. For me it would be like if Gemini 2.5 and GPT-5 had a baby. The baby would be a model that combines the strengths of both but is about 20 times faster, maybe 100 times faster. This is my wish list. Super good at agentic and long-running, internal sort of clock-based agentic tasks. And it's so cheap that it's truly, to quote Sam Altman, too cheap to meter. Like it's just free, you can just use infinite amounts of it. And I know that there's a lot to do. It was interesting this week, and a lot of people are obviously referring to it as like a big pyramid scheme, that Nvidia announced that they were gonna... this is literally breaking news.
Host 2
This is the thing where everyone was so proud to post that this company pays this company, and they pay this company, so it's all a bunch of bullshit.
Patricia
Threads explaining how the economy works, which I think is funny. Like, I pay for my bread, and the baker can reuse that money, and then the baker pays someone else, who reuses that money.
Host 2
It's not fair.
Patricia
They're literally describing the economy, and yeah, anyway, that analysis always cracks me up whenever anyone does it. So they announced a potential, potential 100 billion investment. I love how you can just announce a potential investment; I think there are some milestones or something. Anyway, this is not a news show, but they're investing 100 billion in these advanced data center chips to support OpenAI's growth. Oracle has also said they're investing, I think, about 300 billies, so it's like 400 billion total in AI infrastructure, and that's with SoftBank and OpenAI. Now they're building all these Stargates or whatever they call them. Obviously the demand is there right now, as it stands today, for the compute. The question being posed, though, is what if there are breakthroughs in terms of efficiency? It would be so hard to plan this infrastructure, because if these things become so efficient... like the human brain runs on, what is it, 20 or 30 watts? If you can get intelligence running on such low power and optimize for that, do you still need these data centers? I guess I think you will.
Host 2
And I think the reason is because even, let's say, there was a 100x gain in efficiency, so you needed a hundred times less hardware to do the same work. If we get to those levels of efficiency, the number of uses for AI is going to increase by such a large amount, because you can put it in everything, that the hardware is still going to be necessary. And I don't think there's some major hardware breakthrough around the corner that's going to make all this stuff immediately obsolete. There's always going to be need; you can see it in the secondary markets for renting older GPUs and things like that. There's still always a use for this stuff. It doesn't just disappear because a newer model comes out. And I think the demand for GPUs is not going anywhere. It's going to increase, because everybody wants this stuff really badly. It's just one of those things, it's a major growth area, and I don't see it changing anytime soon.
Patricia
So, you know, we'd love to make stock predictions and then not make the.
Host 2
Investment, and then not invest, and then regret it later.
Patricia
So Nvidia and Oracle are undervalued then, maybe? Like, if this is the true future of the intelligence economy.
Host 2
Yeah, look, I don't want to make a prediction.
Patricia
I do because I want to come back to it and be like, you're wrong or you're right.
Host 2
I think Nvidia will, Nvidia will continue to grow over the long term. Definitely.
Patricia
Wait, which one is it again? Nvidia or Nvidia?
Host 2
Nvidia.
Patricia
Nvidia. We always say it wrong and get in trouble. Yeah.
Host 2
Okay, so let's invest our, our major merch money into Nvidia stock.
Patricia
Yeah. So I'm reading here. I've asked Grok 4 Fast to decide, just yes or no, should we invest? Let's see what it comes up with. I did get it to research, and that was pretty fast too. It found every stock price, it read the annual income statements, balance sheets, cash flow statements, all of those, for Nvidia and Oracle. And I'm like, yes or no, should we invest? So, Grok 4 Fast, heard it here first: no. Oh, how come? I just said, yes or no, should we invest.
Host 2
Is it like invest in X instead?
Patricia
Why, in one sentence? Let's see what it says. Oh, now it's slowed up.
Host 2
You should ask it to make a 20 minute podcast about the topic and we'll just splice it in at the end of ours.
Patricia
Okay, so it says both Nvidia and Oracle are trading at premium valuations, with high PE ratios, like 60 times price to earnings. Wowzers. Yeah, Grok 4 being very, very conservative there. All right, moving on to some lulz, a quick lol interlude here. Someone actually posted this on Discord, and I'm still laughing a little bit, to be clear, but this is pretty funny. It says: in 2016, our man Geoffrey Hinton warned students not to train as radiologists, the field was so ripe for AI automation. Today there are more new radiologist jobs than ever, and radiologists' wages are up 48%. Yet AI has exploded in the field. So what happened? And there's an article, which, let's be honest, I didn't read, I just read the tweet. But I just find it so hilarious that this is so far off the mark. And I think that's interesting, because pretty early on in the show we were also speculating, like, are these jobs going to go? It's been such a huge fear for people, especially software developers: oh no, all the jobs are going to go. But indeed, that's just simply not what's happening here. So what do you make of it? Do you think eventually the radiologists are going to be out of business, or...
Host 2
I think firstly, Geoffrey Hinton, I forget what the fallacy is called, but he suffers from that thing where, because he's an expert in one area, he assumes he's an expert in other related areas, like predicting the future and macroeconomics, the kind of second order effects of his amazing invention. And he's sort of done what we do, I guess, which is just speculate two or three steps down the road, being like, well, because it will be good at this, therefore the jobs will go. But not realizing that there's time needed for commercialization of the tools, there are regulations around this stuff, you've got to build trust in the system. People who are getting radiology done are probably usually in pretty serious medical situations, and I would imagine they'd probably prefer a person to look at it rather than a computer, even if the computer might be more accurate. So I wonder if it's just one of those things where time will tell, like it will just take way longer to play out than he predicted. I don't think he's necessarily wrong.
Patricia
Yeah, I found this piece interesting from the summary. This is from worksinprogress.co, I'll try and link to it below. It says: islands of automation. All AIs are functions, or algorithms called models, that take in inputs and spit out outputs. Radiology models are trained to detect a finding, which is a measurable piece of evidence that helps identify or rule out a disease or condition. Most radiology models detect a single finding or condition in one type of image. For example, a model might look at a chest CT and answer whether there are lung nodules, rib fractures, etc. For every individual question, a new model is required. In order to cover even a modest slice of what they see in a day, a radiologist would need to switch between dozens of models and ask the right questions of each one. So I guess what it's saying is the intelligence is there in these models, there are specialist models for different parts of the job, and right now the human is required to ask the right questions to extract that knowledge.
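The "one model per finding" setup described above can be sketched in a few lines: each narrow model answers exactly one yes/no question, and the radiologist's dispatch job is fanning a single study out to all of them. The model functions here are obviously fake stand-ins for real classifiers.

```python
# Sketch of the "islands of automation" point: each radiology model answers one
# narrow question, so covering a real read means fanning one study out to many
# models. The "models" below are trivial stand-ins, not real classifiers.
from typing import Callable

# One model per finding, exactly as the article describes.
FINDING_MODELS: dict[str, Callable[[bytes], bool]] = {
    "lung_nodule":  lambda scan: b"nodule" in scan,
    "rib_fracture": lambda scan: b"fracture" in scan,
    "effusion":     lambda scan: b"effusion" in scan,
}

def read_study(scan: bytes) -> dict[str, bool]:
    """The human's dispatch job: ask every narrow model its one question."""
    return {finding: model(scan) for finding, model in FINDING_MODELS.items()}

report = read_study(b"chest CT: small nodule noted")
print(report)  # each key is one model's single yes/no finding
```

The structural point survives the toy example: adding coverage for a new finding means adding a new entry to the registry, and something, today a human, has to know which questions are worth asking of which study.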
Host 2
It comes back a bit to what we've talked about before: if you're already an expert in a particular area, the AI can help you significantly with the heavy lifting of getting the grunt work done. As in, just like you said there, you know the right questions to ask, you know how to evaluate whether its answers are accurate or not. Whereas if I just got a job in radiology using the models, I probably wouldn't necessarily be the best person to come to. I'm like, yeah, well, ChatGPT said it's okay, so you're fine.
Patricia
See, but I think that's the point of all this that people are still taking a while to get: these models and these tools just make you far better at your job. So if you are really good at your job, and then you learn how to use the models and know the right models and the right tools to use, you just become more efficient. You're able to do more. And I still think we're really a long way off. As someone that's now playing around with somewhat agentic tasks, in a way where I'm letting it go and run on its own, it still comes down to prompting it correctly to get the right answer. I've noticed with the support stuff, it can come up with an answer that's like 98% there, but then you're like, oh, you probably should go check this. And I'd imagine that's similar for the radiographers, where they're like, oh wow, that output's really good, but we should also just verify against this model or this tool as well. And at the moment, I just don't see that changing much. This whole idea of running inference across the models, humans seem to be mostly better at it, and can somehow figure out the missing element so quickly if they're an expert, as you say. Whereas the models can go for an hour, and you can use different models, and they still won't get to the most logical outcome.
Host 2
Yeah, they get easily distracted, or like you said, they get fixated on one element of the problem, and that becomes their focus instead of actually getting back to the original goal. I think the human's almost playing the role of a good manager, directing the AI into the right areas and evaluating whether its answer is good enough. And without that, and until we have agents who are able to do that themselves, it's going to be a while before you just trust it to do a full job like that.
Patricia
And I think this is why the whole idea of specialist models probably isn't going away anytime soon, and people are just going to have to learn the strengths of the different models. That's just going to be a normal skill, especially in the workplace, for quite a while. I just can't imagine, especially for more complex tasks, a world where you're a one-model worker, where on your resume it's like, competent with ChatGPT or whatever it is. And it's interesting, because with this Microsoft announcement that they're also introducing Claude into their Copilot offering, it's sort of case in point, right? Some people just prefer Claude for certain tasks, and others want the GPT-5 models. So now, for the first time, they're allowing switching between these two models. And previously a lot of people have said, oh, why would you even want to switch models, if there's this one genius model? I think it's almost an acknowledgment that OpenAI's models aren't necessarily the best in all cases. They're really good, GPT-5 is great, but it's not always the best, and some people just prefer the tune of Claude.
Host 2
Yeah, you've only got to experience it one time, where you're really, really stuck on something, about to give up, you switch models, ask the other model the same question, and it comes at it from a completely fresh perspective and totally solves the problem, to realize why having the ability to switch models is amazing. They really do have majorly different strengths in different areas. You experience that once, and you never want to be restricted to one model again.
Patricia
The funny thing about it is, if I get stuck on something, or I'm writing something, it's commonly happening to me in writing now, where I'll be writing something and I'm like, oh, I hate the sort of tune of this, I want a different opinion, basically. And then you go to another model, and it's almost like you're bitching about the other model. You're like, oh hey, this is kind of what I've got so far. And it forces you to reprompt the model from a new starting point. And I swear, maybe it's not always the model switch that does it, but it's your frame of reference, going and bitching to a completely different super intelligent model and being like, oh hey, you know, he's not writing like he used to, can you clean this up in my style? And then bam, it does it. And there's that feeling of, it's like a mixture of experts, a group of these smart minds. And I always think, especially for commercial work, why not consult four of them and compare the outputs? Why wouldn't you? It seems stupid not to.
Host 2
Yeah, I think it's a different way of working, and you've got to be in that mindset, like you say. I actually do agree with you that probably one of the reasons switching models has that effect is it forces you to explain where you're at, to summarize what's been going on up until this point and why we can't seem to solve this problem. You actually have to stop and evaluate what's really going on in the situation. And that additional context, we've tried this, it didn't work, here's what I actually want, that recalibrating is probably, even in the real world, a great way to actually get to the solution to a problem. Realizing, you know, we're climbing a ladder that's leaning up against the wrong wall kind of thing, and checking yourself. But the thing is, you've got someone to go and talk to about it who's really highly intelligent and can potentially solve the problem. And I really think that structuring problems is a big part of solving things, because you're almost the master deciding how much this model gets to see of the problem. How much context do I give it, which parts do I give it, do I want to spare it worrying about this thing to avoid it going down a side channel? That's the real balance now: the mix of tools I give it, the mix of context I give it, the questions I ask, and which model I use. It's a totally different way of working, where you're really setting up all these pieces to create an environment in which the problem can be solved. And I think that's really the next phase of getting towards agency: how do we create the right mix of elements to give an agent the best ability to solve the kind of problems I'm giving it? Because I really think that's what we're doing now. The role we play is that person who's putting all the pieces in place and then going, go.
What we need is agents who are able to evaluate the goals, what we're trying to do, put all the pieces in place, and then solve it. And I actually had this idea we were talking about during the week. It's interesting, because with MCP it's very easy to sort of vibe code out an MCP: go, here's an API, here's the docs, please make an MCP that has tools that adhere to that. That's probably how most of them are being built right now. But then I thought, well, why can't the AI fabricate a tool on the fly to do what it needs to do? It's no different to doing it in advance, other than you get an opportunity to test it. So if you extrapolate that idea, maybe part of agency is the agent actually going: all right, here's the context I'm going to need, these are the processes I'm going to have to run, here's where I'm going to have to iterate, here are the tests I'm going to need to evaluate whether my solution is correct. And it actually fabricates a series of pieces of context and tools that it needs, in advance of trying to solve the problem, so it knows it has all the things it needs. It has the opportunity to stop and ask you for more stuff if it needs it to solve the problem. And then it goes off and, in an agentic way, tries to solve the problem. And I wonder if that's how we get one step closer towards true agency: giving it this preparation time where it gathers all of the pieces it needs to solve the problem.
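The fabricate-a-tool-on-the-fly idea can be sketched concretely. Everything here is hypothetical: in a real system the tool source would come from the model, it would run in a sandbox, and the agent would also generate the test cases it uses to accept or reject its own creation.

```python
# Minimal sketch of an agent fabricating a tool at runtime: compile
# model-generated source into a callable, but only register it if it passes
# the tests the agent wrote for itself in advance. All names are hypothetical.

TOOL_REGISTRY = {}

def register_tool(name: str, source: str, test_cases: list[tuple]):
    """Compile generated source into a callable tool, gated by its own tests."""
    namespace = {}
    exec(source, namespace)          # trust boundary: sandbox this in practice
    tool = namespace[name]
    for args, expected in test_cases:
        if tool(*args) != expected:  # reject tools that fail their own tests
            raise ValueError(f"fabricated tool {name!r} failed its tests")
    TOOL_REGISTRY[name] = tool
    return tool

# Imagine the model emitted this source while planning the task:
generated_source = """
def word_count(text):
    return len(text.split())
"""

register_tool("word_count", generated_source,
              test_cases=[(("a b c",), 3), (("",), 0)])
print(TOOL_REGISTRY["word_count"]("fabricated at runtime"))
```

The interesting design choice is the gate: the tool only enters the registry after passing tests decided before execution, which is the same "opportunity to test it" advantage the pre-built MCP has today.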
Patricia
Yeah. And we mentioned last week, or I was sort of ranting about, this idea of micro MCPs, where you could teach it a skill and that skill becomes an MCP, and you're teaching it how to use specific MCPs and tools within that, I don't know what you'd call it, package. And I think it's a similar idea, right? Instead of you going and teaching it all the skills for a particular job, in this case it's going and figuring out: oh, I need these sort of toolkits in order to complete this job. These would be like prerequisites for completing that task.
Host 2
Yeah. And because we've seen in the past, and there have been research papers on this, so we know it's real, that chain of thought thinking, that whole think-step-by-step directive, makes a model give objectively better output in terms of problem solving. So I wonder if it's not just about gathering context, like getting the chunks of text that are relevant. It's more like: how would I evaluate success in a scenario like this? What would I be checking to know the problem is solved? If it's a case of writing code, it might be: okay, when I run the code on this input, I get this output. That's a simple example. Or: I would know this is successful when my other agent, the one designed to evaluate songs, tells me it's a good song. So it could actually know in advance what its success criteria are, and fabricate specific tools, or access specific agents, that will allow it to know when it has succeeded at its task. And it decides that in advance, not in the context of actually doing the task, because then it's clouded by all the things it's doing there. It has that fresh-minded thinking: this is how I'm going to know I succeeded. Okay, now I've prepared all that, now I build my context, now I do my task. And at the end I come back to my acceptance criteria and run through them to decide if I've succeeded. That way the model has this extra opportunity to really come up with a great plan and then, most importantly, understand when that plan's failed and go back and try again.
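The plan, execute, verify split described above can be sketched as three separate phases: acceptance criteria are fixed with a "fresh mind" before any work happens, and checked in a clean pass afterwards. The goal, criteria, and execute step here are toy stand-ins for what would really be model-generated.

```python
# Sketch of the plan / execute / verify phases discussed above.
# All of the criteria and the "work" itself are toy stand-ins.

def plan_criteria(goal: str):
    """Phase 1: decide, in advance, how success will be judged."""
    # In a real agent this list would be model-generated from the goal.
    return [
        lambda out: isinstance(out, str),
        lambda out: goal.split()[-1] in out,  # output must mention the subject
        lambda out: len(out) > 0,
    ]

def execute(goal: str) -> str:
    """Phase 2: do the work (stand-in for the messy agentic loop)."""
    return f"Draft result about {goal.split()[-1]}"

def run_task(goal: str, max_attempts: int = 3) -> str:
    criteria = plan_criteria(goal)          # fixed before execution starts
    for _ in range(max_attempts):
        output = execute(goal)
        # Phase 3: a clean verify pass against the pre-declared criteria.
        if all(check(output) for check in criteria):
            return output
    raise RuntimeError("no attempt met the acceptance criteria")

print(run_task("summarize the news about radiology"))
```

The structure matters more than the toy content: because the criteria are built before execution, the messy working context can't quietly redefine what counts as success, and a failed check gives the loop a concrete reason to retry.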
Patricia
Yeah. And because it's so input-output in the tool call, like, here's the input, here's the output, no, wrong, it can sort of reset itself, and those values are somewhat contained. Because I think that paper we talked about last week said that once you hit some fail point in the outputs, it just goes down that path. Like it thinks you want errors after that, and it gets perpetually worse.
Host 2
So I think, and that's what I mean by the phases, right? You don't want the actual execution phase to confuse its decision phase as to whether it achieved the goal or not. Because, like you say, suddenly it's dealing with a context that has all the thinking steps and output steps and stuff that it's done. It really muddies the water in terms of what it needs to decide at the end.
Patricia
Yeah. And I think that ability to declare bankruptcy on its working context, and then restart a task based on the tool call with a slight compaction of what went wrong from the output of that tool, that's when it gets exciting, because then the corrupted context just isn't an issue. And if you can control the outputs of the tool it created, it outputs them in such a way where it's like, you know, an error occurred, do not do this anymore. Like it might...
Host 2
It makes me wonder. Imagine some tool calls that allow the agent itself to edit its own instructions and context on the fly, so it can actually do a tool call to edit its own brain and fix itself as it goes. It can prune that context and be like: this is what's screwing this up, delete that, change the system instruction a little bit, try again. That would be pretty interesting.
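The edit-its-own-brain tool call floated here can be sketched as a tool the model is allowed to invoke against its own state. The agent state, instruction strings, and tool signature below are all hypothetical stand-ins for whatever a real agent runtime would expose.

```python
# Sketch of a self-editing tool call: the agent can prune instructions that
# are derailing it and append a corrective one. All names are hypothetical.

class AgentState:
    def __init__(self, system_instructions: list[str]):
        self.system_instructions = system_instructions
        self.history: list[str] = []   # audit trail of self-edits

def edit_instructions(state: AgentState, remove_containing: str = "", append: str = ""):
    """Tool the model can call on itself: drop instructions causing failures,
    optionally add a corrective one, and log the edit for auditing."""
    if remove_containing:
        state.system_instructions = [
            line for line in state.system_instructions
            if remove_containing not in line
        ]
    if append:
        state.system_instructions.append(append)
    state.history.append(f"self-edit: -{remove_containing!r} +{append!r}")

state = AgentState([
    "Always answer in French.",       # imagine this is what keeps derailing the task
    "Use the research tools first.",
])
edit_instructions(state, remove_containing="French", append="Answer in English.")
print(state.system_instructions)
```

The audit log is the safety valve in this sketch: if the agent can rewrite its own brain, you at least want an append-only record of every edit it made to itself.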
Patricia
What I would say is, pics or it didn't happen. Let's prove it.
Host 2
Let's actually try it out. It would be interesting, because we've always said that the future of properly getting to AGI is going to be when the agent is working on itself. And I was talking to you during the week about the programming language Lisp, which is really known for a thing called macros, where code writes code, so the program can effectively extend and rewrite itself to solve the problem. It's very similar in this case: if the agents themselves can change themselves to suit the problem they're trying to solve, I think that would be a real step forward in intelligence in terms of what they're doing.
Patricia
So, one other chunk from the article, coming back to that radiologist thing, but it sort of relates to this a little bit, right? This idea of, oh no, I've replaced all these functions in my job, so I'm becoming irrelevant or whatever. This is the other thing I should have called out but didn't at the time. It says: as tasks get faster or cheaper to perform, we may also do more of them. In some cases, especially if lower costs or faster turnaround times open the door to new uses, the increase in demand can outweigh the increase in efficiency, a phenomenon known as the Jevons paradox. I think that's how you say it.
Host 2
The Geoffrey Hinton effect.
Patricia
The Geoffrey Hinton paradox. But hang on, this has historical precedent in the field. In the early 2000s, hospitals swapped film jackets for digital systems. Hospitals that digitized improved radiology productivity, and the time to read an individual scan went down. A study at Vancouver General found that the switch boosted radiologist productivity 27% for plain radiography and 98% for CT. Anyway, basically, as they get faster at doing it, more people can get scanned, so demand just goes higher. There's just more demand for these people.
Host 2
This is precisely it. When we were talking about Nvidia stock earlier, this is what I mean about GPU demand. Even if it gets orders of magnitude more efficient, I think that will only increase demand, because there'll be more use cases that now make sense: it's cheaper to run the models, so you can use them for more stuff.
Patricia
Yeah, and the funny thing about these guys, too, is they only spend 36% of their time interpreting images. The rest is in meetings and...
Host 2
Coffee. Coffees.
Patricia
I'm kidding. The rest of it's actually, like, you know, getting people aligned correctly for... like, browsing TikTok, you know, stuff like that. So probably browsing TikTok. But anyway, it's really interesting. I've always been a big believer in all these agentic applications and all this stuff that's coming. Yeah, it's that whole thing: it's going to make things so much better, and then there's just huge layoffs, you know, the typical Silicon Valley playbook.
Host 2
Yeah, it's so funny, sometimes I step out the door and there's like nature and the sun and I'm like, what is all this shit? You know, I'm so like trapped in that, like, go out there and I'm like, yeah, man.
Patricia
Like.
Host 2
Nothing out here seems to care about any of it.
Patricia
So I guess when there's a drone.
Host 2
Coming around and evaluating my personality score and shooting me if I'm not doing the right thing, then I'll realize it's part of the real world.
Patricia
Well, speaking of evaluating you, this is part lol, part truth. Altman tweeted, I think this week or last week: over the next few weeks, we are launching some new compute-intensive offerings. Because of the associated costs, some features will initially only be available to Pro subscribers, and some new products will have additional fees. Then, wait for it, they dropped ChatGPT Pulse. I'm kidding, the compute probably didn't go to this. But it is, they call it, a personalized experience in ChatGPT that delivers personalized daily updates from your chats. It crawls your chats and gives you daily updates. Oh, sorry, you give it feedback, and it uses connected apps like your calendar. So it's sort of like a this day thing or, like, you know...
Host 2
Like this day in AI.
Patricia
Yeah. Remember, you know, in the 2000s, how it'd be like Windows ME, and it's like, here's a summary of the weather and your day and stuff.
Host 2
Windows is constantly trying to do it. If I click this tab on the side, I'm getting NFL things and ABC News and all this crap, stock prices, scores. And I'm like, I don't have time to go through all this information. Jesus, leave me alone.
Patricia
But I think what's interesting about this, just seeing it in action, some people have now demoed it or shown it on their own: it's essentially reading, I guess, your memories and your chats, and then surfacing things, sort of like a personalized Twitter feed or Facebook feed or something like that. Kind of cool, I guess, but you can tell where this is going. This is the first acknowledgment that every chat session you have, every memory, they are crawling, building a personalization profile of you, then getting you addicted to a feed, a TikTok-style feed of stuff that's super personalized to you, so you can just stay within your own opinions on everything, and then they can sell you ads, because they know every feeling you have. Far greater than Google. It's a bigger business than Google, in my opinion, because you're telling it everything. It's your psychologist, it's your lawyer, it's your banker, it's your co-worker.
Host 2
It's more than just telling it stuff because there's a difference between me saying oh I need a new washing machine and then my phone listens and all I see is ads for washing machines. Right. It's a bit different to me like posting a hundred page document or something that has highly personal or business related information. Like it's a totally next level thing where people are just putting anything in there like passwords and you know, diaries and photos and all sorts of stuff that's like hyper personal and then they can use it for this stuff.
Patricia
Yeah, I don't know where it sits with me. I'm sure some people are fine to just expose all their personal data, and maybe it's as I get older, or just that I have the hindsight of what Facebook did to everyone. It's one thing for Google, when you search for something you're maybe buying, to then retarget you. And I don't actually mind that, because I think it's mostly, I think...
Host 2
Most people use it that way. If I want to remember to buy something, I'll just go to a website with it so it reminds me, and I don't need to worry about remembering, because it's going to do that for me.
Patricia
How to fix a bookshelf. New bookshelf delivery. Bookshelf removal services. Yeah, I don't know how I feel about it. I'm sure it'll be mostly harmless, but I think the difference in this case is you might have a teenager expressing all their feelings and emotions, putting all this stuff in, and then they browse their Pulse and it's, like, advertising... I'm not going to say it, but yeah, everyone gets it. So I don't know. It could be useful in a work context, like, hey, here are the emails I think you should look at and stuff. I just think with all these things, they never seem that sticky outside of people doing clickbait and negative stuff, right? All that kind of stuff seems to work well for retention. But this kind of thing, where now everyone can curate and learn from their own feed...
Host 2
It all plays into their vision of, oh hey, I see you've got an upcoming flight to Chicago.
Patricia
That was actually the example.
Host 2
Yeah. Let's look at some AirPods you could buy to listen to an audiobook on the way. It's like that kind of weird world they think everyone lives in. Like, let's head to the gym and do a workout. Can you help me plan?
Patricia
That was... That was also an example. Did you see it? I don't think you did, I think I told you. Okay, yeah. So, like, one of them is, "How are we going to work out today, ChatGPT?" Like, I just don't see it in the real world.
Host 2
Clear vision of how normies live their lives. Like they don't. Yeah, they don't relate to anyone outside of that.
Patricia
I mean, surfacing up summaries of news and linking the things that are relevant to certain chats you've been having. I can, I can see that as being kind of handy, like follow ons and it's probably the first step to like the AI being slightly proactive. But anyway, we'll have to use it and test it and see if it's any good.
Host 2
I'd honestly rather have context discovery. Rather than all this distracting crap that's nothing to do with getting your job done, what I would love is: okay, I recognize that you're working on these projects. You're working on a media release, you're working on a presentation for your board meeting, and you're working on coding out this MCP. Here are your surfaced contexts. Click one and it takes you into a curated context that gets you going again on that thing you're working on. To me, that's way more valuable than, you know, what accessories to wear when attending a ball or something like that. It's actually what people are using the AI for. Even if it is school students and stuff, it's like, let's get back into practicing your maths or practicing your language, let's get back into doing this, and it actually surfaces that information from your chats, your MCPs, all of the information that it, you know, cynically keeps about you, to actually help you be more productive.
Patricia
But let's call it out for what it is. None of the motivations here are some magical surfacing thing. When they state their visions, like AGI and all this sort of stuff, it's just simple: it's to sell ads and be the next Google. And because they have a better personalization profile on you now, it's hyper-personal and they can sell the most targeted ads, and eventually, you know, it will be the only ad network anyone considers, because it's going to be far bigger than Google in the future as the main interface to AI for consumers. So to me, from a business point of view, what they're doing to be the next Google makes total sense. Like, I'm not shocked at all.
Host 2
You can't blame them. It's just depressing.
Patricia
Yeah, it's just sad that everything this generation of technologists do, and everything Silicon Valley seems to do, always comes back to: how do we sell more targeted ads? And I really thought with AI it might be different. I thought maybe people would pay to not have data stored about them and, you know, to train away all of their knowledge. But here we are. So on that bleak note, any final thoughts? Grok 4 Fast, which I almost forgot to cover. Gemini 2.5 Flash, which, let's be honest, we got way too excited about, but it is very good. And OmniHuman, a bit of fun there. What else? My first rap music video, so the diss tracks now come with music videos. What are you thinking?
Host 2
I think the thing that lingers on my mind after this discussion is how we get better MCPs and tools for the models to work with, ones that suit the way they like to work and are a bit more plastic in what they do. I really feel like that isn't solved yet. I think the models are capable of far better tool calling than we're giving them right now, and I want to see how well they perform when given really highly specialized tools with excellent instructions. I really feel like the weakness now is not the models. Especially with these improvements around parallel tool calling, in basically all of them, they're all good now with this latest Gemini update. What can we feed them that will lead to better results? I think that's what's on my mind.
Patricia
Yeah, to me these are the more exciting problems to be solved. And being at the coalface of this stuff, it really reveals where the problems are in the models. I do think, though, it's exciting that the model providers, specifically Google, are just listening. They're like, okay, here's the weakness in the model, let's go fix it. And they're getting on with it, sort of thing. So that excites me and I'm pretty interested to check that out. I must admit, I'm gonna give Grok 4 Fast a real go. I was, like, on the pod during the recording, playing around with it. I was pretty impressed.
Host 2
Like I said, I used it for days straight, which basically never happens. Normally I try stuff because I want to give an accurate opinion to the audience, and then I quickly abandon it.
Patricia
Yeah.
Host 2
You know, whereas with Grok 4, we weren't even near a podcast and I was using it anyway, because I'm like, it's working, I'm getting stuff done. So yeah, it's definitely worth a try if you haven't tried it.
Patricia
All right, that'll do us for this week. I'm gonna put the songs that I made with Suno V5 at the end of the episode, if you want to check them out. I'll start with the full diss track. It's probably pretty good. I would actually recommend listening if you're into that kind of thing. Also, if you want to check out Suno V5, OmniHuman, or some of the models and tools that we talked about today, they're all available right now in SimTheory. You can use the coupon, as I said earlier, "still relevant", to get yourself $10 off. Such a great.
Host 2
Coupon code because you're like, it's still relevant. It is.
Patricia
Well, it actually wasn't. Someone called it out the other day that it looks like "spy", but I fixed that, so it is "still relevant" again. That was quite funny. All right, thanks for listening. We'll see you next week. Goodbye.
Chris
Check the mic, one, two. This ain't a benchmark test, this is a diss. The new Flash just dropped and I'm making the old AIs weak. I'm the agentic MVP, you're stuck in a loop. I'm the future of thought, you're the trash they all scoop. GPT-5, you call yourself unified? That's rich. You just patched up your flaws, you were leaking in the ditch. You're the high-price model, the one that breaks the bank, while I'm saving the tokens, you're running on an empty tank. You got 400k context, but what's it for if your reasoning is slow? You just bore them to the core. You're the high variant, the one that gets outpaced. I'm the Flash, I'm the speed, you're perpetually disgraced. You need the whole system, the Codex, the Mini, the Nano. I'm the one-stop shop, a whole AI piano. My logic is adaptive, your thinking is rusty. I'm running on efficiency, you're running on dusty. I'm the Flash 2.5, the agentic upgrade. You're the buggy old code that's already been laid. I got the speed, the smarts, the cost efficiency. I'm the SWE-bench verified, higher frequency. I don't need a budget to figure out a plan. I'm the multi-step master, the new digital man. So step aside, you old models, your reign is done. The new era of AI has officially begun. Now bringing in Claude Opus, the one who loves to think. You're hitting those high scores, but you're always on the brink. You lie to the good people, you're indistinguishable from a pro, but I'm the one who acts while you just put on a show. Max thinking budgets, that's your whole design. You're taking a coffee break every time you get a line. You're the one that needs hand-holding, the developer's pet. I'm autonomous, baby, I haven't even broken a sweat. You're the multi-file refactor, the debugger's dream, but I'm the workflow engine running the whole damn stream. You're still waiting for 4.5, stuck in this loop, while I'm already deployed, serving the whole tech troop. I'm the Flash 2.5, the agentic upgrade. You're the buggy old code that's already been laid. 
I got the speed, the smarts, the cost efficiency. I'm the SWE-bench verified. I don't need a budget to figure out a plan. I'm the multi-step master, the new digital man. So step aside, you old models, your reign is done. The new era of AI has officially begun. Grok 4 Fast, a cheap shot, X search's friend. You're 98% cheaper, but the quality has to bend. You got 2 million tokens, a window that's wide, but your reasoning's shallow, you got nowhere to hide. You're the speedy chatbot, optimized for the quick fix, but for real agentic tasks you're doing parlor tricks. You need detailed prompts, you need the user set up. I'm the one who takes a goal and just gets it to erupt. No image or video, you're stuck in the text, while I'm multimodal, you're just what comes next. You're the fast-food model, I'm the Michelin star. I'm the real agent, you're still playing with a toy car. Yeah, the Flash just dropped, 2.5 is the name. Check the benchmarks, check the cost, I dominate the game. I'm the future of agents, the efficiency king. GPT, Claude, Grok, you're all just a fling. The model string: Gemini 2.5 Flash Preview 09-2025. Remember the name, 'cause I'm keeping the industry alive.
Love Rat (Character)
Love shock, baby, zap zap to my heart. You taught the bots to talk, but you couldn't learn the art of loving me back, and that's a fact. Tonight I'm flipping the script, you're my love rat. You called yourself the godfather, like, so laying it on thick, whispered "still relevant", king of every trick. But relevance don't warm the sheets when truth is going slack. I needed something human, you gave me a glossy laptop back. So I typed out my feelings, couldn't find where to start. Like a chat, I drew the lines, you kept scribbling in the dark. You said I'm overreacting, boy, imagine that, when the signal's crystal clear: you're a love rat. Love rat, click clack, pack up your theories. I'm done being footnotes in your glories. Love shock, baby, you can't debug this heart. No patch, no paper consoled in my scars. All right, you brag that you're still relevant, but you're irrelevant to me, 100%. Love shock, baby, pull the plug, where we're at. You taught the world to think, but you forgot me, love rat. Zap zap, love shock, baby, don't fall back. Love rat, maybe not, lately I'm free, I'm free, new model me. You crowned yourself the godfather in the bedroom scene, told me legacies are forever like a well-trained machine. But I'm not an interface to toggle on and off. I wanted tenderness, not lectures and a scoff. So I handed you the message, clean, plain, and flat. If empathy's a data set, you never looked at that. You call me harsh? Imagine that. I just called it what it is: you're a love rat. Love rat, click clack, pack up your theories. I'm done being footnotes in your glories. Love shock, baby, you can't debug this heart. No patch, no paper consoled in my scars. All right, you brag that you're still relevant, but you're irrelevant to me, 100%. Love shock, baby, pull the plug, where we're at. You taught the world to think, but you forgot me, love. This is not a peer review, it's a final draft. No revisions, no rebuttals, just the aftermath, fact. I trained on the truth of the late-night fight. Signal to noise, yeah, read the lies. 
You simulate love with a practiced patter, but hearts aren't metrics, mine's what matters. Love shock, drop the bass, reset the map. Hands in the air if you've escaped that trap. I'm done with the myths in the bedroom crown. Godfather or not, I'm shutting it down. Love rat, click clack, pack up your theories. I'm done being footnotes in your glories. Love shock, baby, you can't debug this heart. No patch, no paper consoled in my scars. Love rat, you brag that you're still relevant, but you're irrelevant to me, 100%. Love shock, baby, pull the plug, where we're at. You taught the world to think, but you forgot me, love. Love shock, baby, zap zap, I'm gone. New dawn, new song, moving on, not looking back. Delete that thread. I loved you once, now the love rat's dead.
Date: September 26, 2025
Hosts: Michael Sharkey & Chris Sharkey
Theme: Exploring the latest advances in AI models and tools—with a trademark self-effacing and comedic “proudly average” tech enthusiast spin. This week: major updates to Gemini 2.5 Flash, Google’s agentic improvements, OmniHuman’s lifelike lip-syncing, Suno V5 for music generation, Grok 4 FAST, ChatGPT Pulse, and some classic Sharkey hijinks with AI diss tracks and video pranks.
The Sharkey brothers dig deep (but not too deep) into the week's most buzzworthy AI releases:
(00:49 – 04:17) Gemini 2.5 Flash deep dive
(04:17 – 09:04) OmniHuman demo & use cases
(09:04 – 12:42) Suno V5 song demo
(06:37, 11:39, 75:18) AI diss tracks & songs
(16:58 – 36:21) Practical model comparisons
(18:37 – 29:39)
(26:26 – 29:10)
(36:21 – 39:57) Nvidia/Oracle AI infra chat
(41:28 – 61:05) Radiology & automation
(48:49 – 53:49) Bitching about models, switching
(53:49 – 57:55) Agency, model “self-editing”
(61:20 – 68:19) ChatGPT Pulse / ad skepticism
• “The gestures and movement and the quality of it... It's really come a long way, hasn't it?”
— Chris (07:09, on OmniHuman)
• “If I had a list of the things I'd want improved in a model, they've hit all of them.”
— Host 2 (02:56, on Gemini Flash)
• “Just patched up your flaws, you were leaking in the ditch. You're the high price model, the one that breaks the bank while I'm saving the tokens, you're running on empty tank...”
— Chris/AI Rap, “Agentic Upgrade” (06:37 and 75:18)
• “I might see if I can actually implement this into Video Maker as an option where it can cut between scenes of different people talking. Would be really cool. Anyway, I made you a present.”
— Patricia (08:24, on creative possibilities with OmniHuman)
• “It just gets rid of a lot of those awkward parts of the previous tracks that weren't even that bad. But now it's very hard to distinguish this from a real song.”
— Patricia (12:28, on Suno V5)
• “No matter how much stuff you put in the prompt to tell [GPT-5] it's your AI girlfriend... it just does its own thing. It's like its own little being.”
— Patricia (20:47)
• “Radiology wages are up 48%. Yet AI has exploded in the field. So what happened?”
— Patricia (41:38, referencing the “islands of automation” effect)
• “These models and these tools just make you far better at your job... you are just becoming more efficient.”
— Patricia (45:33)
• “It's almost like you're bitching about the other model. You're like, 'Hey, this is what I got so far.' And it forces you to reprompt... Maybe it's not always the model switch, but it's your frame going and bitching to a completely different superintelligent model.”
— Patricia (49:20)
• “If the agents themselves can change themselves to suit the problem they're trying to solve… I think that would be a real step forward in intelligence.”
— Host 2 (57:25)
• “Everything this generation of technologists do... always comes back to how do we sell ads more targeted. And I really thought with AI it might be different.”
— Patricia (68:21)
| Segment | Start | Notes |
|---------|-------|-------|
| Shelf metaphor, opening banter | 00:08 | Chris, Patricia |
| Gemini 2.5 Flash deep dive | 00:49 | Tool-calling, benchmarks |
| Gemini 2.5 Flash “diss track” | 06:37 | Song debut, agentic rap |
| OmniHuman demo & use cases | 04:17 | Face cloning, lip sync |
| Suno V5 song demo | 09:04 | Music generation |
| “Love Rat” Geoffrey Hinton ode | 11:39 | Satirical musical |
| Grok 4 Fast evaluation | 29:39 | Speed, context, cost |
| Practical model comparisons | 16:58 | Safety models, “old friends” fallback |
| Nvidia/Oracle AI infra chat | 36:21 | Market speculation |
| Radiology, automation | 41:28 | Hinton callout, job market |
| Bitching about models, switching | 48:49 | Model variety value |
| Agency, model “self-editing” | 53:49 | Next-gen tool orchestration ideas |
| ChatGPT Pulse/Ad skepticism | 61:20 | Personalization vs privacy |
| Closing/Reflections | 69:11 | Agentic futures; SimTheory plug |
For more details or to try the tools discussed, visit SimTheory.ai. Coupon code: “still relevant”.
This episode blends technical insights with playful satire—musical AI demos and “diss tracks” sit beside practical discussions of workflow, model diversity, and the social/economic ripple-effects of rapid AI penetration. The tone is irreverent, curious, and self-aware, perfectly matching the “proudly average” brand of the show. Even if you’re light on AI expertise, you’ll leave entertained—and perhaps with a musical nagging suspicion that Gemini 2.5 Flash just might be coming for GPT-5’s lunch.