
Mike
Mate, I gave Sora 2 a spin.
Patrick
Last night and it blew my head clean off.
Mike
The skin tones, the motion, it's just.
Patrick
I don't know, eerily real.
Mike
Yeah, it's like it finally stopped looking like a clever filter and started feeling.
Patrick
Like even the little folk. Good morning, Australia and welcome back to the Today show.
Mike
G'.
Steve Irwin
Day. So this week it's all about Sora 2. Crikey.
Mike
We're diving head first into the adventure.
Steve Irwin
And I can't wait to show you.
Mike
What we've got lined up. So, Chris, this week it's all about Sora 2.
Patrick
Good evening, I'm Mark Dalton and this is Channel 9 news at 6. We begin tonight with a wobbling mystery that's tilted a whole neighborhood: Bookshelfgate. When will local resident Chris Sharkey finally fix that leaning tower of paperbacks? This is the more important thing than Sora 2: the bookshelf break.
Mike
It's getting worse. Life admin is not my strong point.
Patrick
Maybe it will be revealed in a few episodes that this was all a trick and this was really Sora 4 Max Edition, simulating the background as we went live.
Mike
Live simulating the collapse of the bookshelf like in your video just then.
Patrick
But we do have a new toy to play with. A new toy they call Sora. Sora 2. And OpenAI has transformed Sora: instead of trying to build some sort of strange video mashup system for professionals, they've gone, who cares about those use cases? Let's just build a TikTok-esque social network. We'll slowly release it with invites via an iOS app, and of course on the desktop as well. And here we are. We have Sora 2: multiple camera angles, pretty amazing clarity in terms of what you put in, from the prompt to the generation. What are your impressions of Sora?
Mike
I mean, it's hard to argue. Some of the videos are just absolutely amazing in terms of the quality, the detail, the associated sounds. The ability to just, I guess, use any historical or current figure you like without any fear of repercussions is pretty. Yeah. Surprising.
Patrick
So at the top, I obviously used Steve Irwin, who's relatively famous from Australia, because I wanted to see, does it know Australian stuff? Because at the moment it's geo-restricted to the US and Canada. But thankfully someone in our community not only got me an invite, but then also showed me how to get into it as well. So I did get access, even though it is geo-blocked. I can't access the iOS app because that's a bit more guarded with billing. But it's pretty incredible. We did some tests with some really old chain stores from when we were growing up in the 90s in Australia. One was called Franklins, you know. Yeah, a few commercials, and it was able to replicate those commercials with a lot of accuracy. Which makes me think that maybe OpenAI is sitting on this huge treasure trove of all the videos they scraped before, you know, everyone started locking down on this stuff. It's sort of like maybe they have this huge piggy bank of content and media, or they're, you know, letting this thing just sit and binge Netflix or something, because it, I mean they.
Mike
Must, right? Like, it's not going to coincidentally exactly replicate things from our childhood just because. Like, it's got to have had some reference material to be able to do that.
Patrick
Yeah. And like one of the other crazy things I asked it, I mean I reference it on the show a lot. I live in what's considered a regional Australian town and you would think, you know, the sort of US-centric trained model wouldn't be able to understand any of that, but I asked it to make an influencer video touring my local town. I'll play a little bit of it. As painful as this may be. Wait, it doesn't do audio. Nobbys Beach with a coffee and those views are unreal.
Mike
Walk the Bathers Way, it links up all the beaches. Brunch at this cute spot called Susuru. Insanely good avo toast. Cooled off at Mary. Hello. What's up guys? Day tripping in Newcastle.
Patrick
Okay, that's pretty painful to listen to if you're listening.
Mike
But yeah, I was gonna say this is where it. For me it gets quite depressing that there's going to be even more of that kind of content without people actually having to do the work.
Patrick
Yeah. What is it? What does this mean for an a. Like an AI. Wait, hang on, us. What does this mean for us? No, but in reality, like what does this mean for those people that produce all the, you know, the sort of TikTok attention bait dance. Like to me, does this social network thing from Sora take off and everyone just consumes slop and you come back to earth in five years and people are just scrolling the slop while the robots are making the food?
Mike
I mean it definitely seems that way. A lot of people are doing that already, just scrolling through content, and already a lot of it was AI generated. This is just going to increase the amount and believability of those fake AI videos. I think that's the really sad part of all of this, that for this incredible technology, that's probably going to be its main use case.
Patrick
So I initially intended to come on the show today because I had this dark moment last night. We haven't had a doom and gloom episode in a while, and I think it was just like exhaustion and being tired and I was reflecting on this whole thing thinking, like, what is this? What does this actually mean? And one feeling I've had in my life today with technology is it feels like even though all the tech CEOs and people are always like, this technology will help us bring us all together and make us feel closer and make you happier or whatever, it feels like all the technology progress, in a way is. Has been around, obviously more sophisticated ways to say sell advertising and just, you know, capture your attention completely and the net good on the world, like, yeah, there's some positive in terms of communicating and stuff, but there's also this negative toll of. It feels like instead of bringing us closer together, it's actually torn us further apart where we live in our own echo chambers and we don't actually just get out in the community anymore and interact with each other. And then I thought about this, especially with the cameo feature, which allows you to put yourself in the actual video. So you could in theory, if this technology plays out, be the true main character energy in your own series. Does this mean with content consumption now it's like this, like watching yourself all day? Yeah, and like this just like. Sort of like admiring yourself in the mirror. And it does seem that that particular feature is what captured the imaginations of people the most because apparently I didn't even know. Meta released some slop generator social network a couple of days earlier, but it got zero attention. But I think this one sort of resonated. Firstly, because a lot of people follow OpenAI around this kind of thing, but secondly, because of the cameos of Sam Altman in the actual videos and people.
Mike
Wanting to do that for themselves. I remember there's this scene in Mitchell and Webb where it's like an old lady asking a young man what his job is. And he's like, oh, well, I'm a futures trader. And she's like, oh, futures trading. So what do you do? He's like, oh, well, I basically make profit off buying and selling contracts for the future sale of stuff. And she's like, and how does futures trading help? You know, a fireman puts out fires, like, you know, policemen protect society. How does futures trading help? And you know, he can't answer the question. And it's kind of like this stuff, it's like, what does this do for the world? Like, is this a good thing? Like, just when you look at it as a thing, or is it just a bit of fun? And I think you said before the podcast, a couple of the videos you made really delighted me, and I think that's what we always come back to with some of the new AI technology. It's like, oh, well, it's just a bit of fun. I don't think it helps the world, but it's fun.
Patrick
Yeah. And as I said, I was going to come on here and say, oh, who's this fucking? This is so depressing. This is sad. But if you take it for what it is, a demonstration of where this stuff's going, and you just have a bit of fun with it, like, it is pleasurable. Like, there are moments where I am sitting here having a bit of fun. The kind of thought I had about it, though, is: is this something I'm going to return to or not in a week from now or two weeks from now? Like maybe very rarely, to check in on what's going on and/or make something funny to send to someone in a group chat maybe. But I.
Mike
The issue for me though was even if you wanted to make something funny for your friends or whatever, they're too short. It seems like every video you sent to me ended before it got to the punchline or ended before it got to the good bit. Like, it just doesn't seem anything more than a tech demo to me. It's more like, here's an indication of where we're going. Like soon we're going to be able to generate entire movies that have character pinning and a realistic. And have sound and can follow direction and all that. Like, it's very clear where the technology is going, but really this is just basically a tech demo that you can muck around with for now. There's no actual value here.
Patrick
Yeah. Unless you sort of steer it as like a model step progression like the early GPT to now where think the rate of improvements, you know, faster and faster and you know, it starts off with these like short term, like low content things and then that evolves into like longer videos and you know, and you sort of play this out as Sora becomes this like personalized entertainment platform or something. I mean, that's probably the vision that they have. But yeah, it'll be. I'm very interested to see if in three weeks from now anyone cares anymore or not. Because that's generally the, the litmus test for this stuff. Does the hype wave pass and no one uses it anymore? Or is this just an absolute mainstay and it starts to take usage away from TikTok? I'm not so sure.
Mike
One of the interesting things about it will be if they. I think you said that they're planning on releasing an API for it. If they do that, it'll be interesting because if you look at the video maker you built, there's a lot more value in giving it a general theme and a script and other details where you can actually stitch the clips together and combine them intelligently into something longer form. So if the pricing's reasonable and the quality is this good, that would be really exciting.
Patrick
That's what I was thinking. Like the video maker that I made is not necessarily perfect, but it essentially was just taking all the available tools out there, like Suno for background music, ElevenLabs for good audio. And then I did have to sort of downgrade the video model, basically to get the video cost down, because otherwise it's going to be way too expensive. But I don't think my video model's any worse than theirs, to be quite honest. But then you've got that OmniHuman lip-syncing technology now, which I have actually improved quite a bit. I realized that the lip sync issue that we were having early on is actually that I had the rate of the voice too fast, so it was getting confused. But if you piece all those things together, it's very possible right now to make a documentary, to educate yourself on something, or a longer-form video. That feels a little bit more useful and less sloppy to me in that regard. But I can imagine Sora getting there pretty quickly on that as well. And I think what's quite incredible about the model they've built here is they've clearly gone down this path of building this model for purpose, because they say in their announcement they're going to release the Sora 2 Pro API, which is a different version. I'm assuming that's like a generic video model, like a model behind this. And this one is optimized for fast inference, low cost, and also tuned to come out with these comedic videos off pretty bland prompts. Like some of the prompts I did were terrible and the output was quite entertaining. And I would say the hit and miss ratio is like 70% hit, 30% miss. Whereas normally with these models it's like 70% miss, 30% hit. And so they've done an amazing job of stitching all this stuff together and sort of being like what Apple does with technology. It sort of sits back, watches what everyone's doing with this stuff and then unifies it and brings it together in a pretty simple, nice package.
Mike
Almost like this whole thing is a marketing exercise rather than a legitimate product release. They're just like getting back into everybody's attention, being like, hey, we're still here, we're still making the best frontier stuff.
Patrick
I think that's the sad part for me though, is like a lot of the stuff around AI is just trying to get all the attention. So I guess you get the sign-ups and you get the platform play going. But in reality it's hard when these people are like, you must stop us, you must slow us down. Like we need to have a pause for six months, and then it's like, oh, and we're going to work on curing cancer and all this other stuff. Oh, and by the way, here's an attention-wasting time sink for a generation, another app to waste their best years on. Like instead of communicating with each other. I mean, that's the cynical look at it, right?
Mike
Yeah, the actions don't really match. One thing I'll give them though is they announced it and they released it. That's impressive. That isn't always the case, like with Sora 1, they didn't do that. At least this time people can actually use it, which is really nice.
Patrick
How are they going to handle these copyright issues though? Like, look at this. If I'm like the Irwins and this guy on this podcast is just making videos about Steve Irwin like that, I don't know, I'd be a bit troubled by it, to be quite honest. But I did want to show you one experiment I did because of the end of the post, and I want to read it verbatim. So this is the post announcing Sora 2. It said, "Video models are getting very good very quickly." I don't disagree. "General purpose world simulators and robotic agents will fundamentally reshape society and accelerate the arc of human progress. Sora 2 represents significant."
Mike
Subservient underclass.
Patrick
Do they think we're buying this bull? Like, there's no way. "In keeping with OpenAI's mission, it's important that humanity benefits from these models as they are developed. We think Sora is going to bring a lot of joy, creativity and connection to the world." Like, that's what I was getting at before.
Mike
Everybody fears the rise of AI, but I don't think anyone thinks that Sora is going to be the thing to take down humanity.
Patrick
But anyway, all the promoters on X are like, OpenAI has developed a world model. And let me show you the physics engine behind this world model. Now with another Steve Irwin video.
Steve Irwin
Crikey's coming hot. Up we go. Whoa, that was close. Tail whip, hands down. Over. Back it up, back it up. Look at the power on that jaw. Flip past it. Yes, One more push. Push. You hearts pound like a drum.
Mike
What I love is that's the kind of crazy that Steve Irwin would have done. We're getting our censorship like our, our beeping on time today. It's not traditional.
Patrick
Yeah, that's off brand. So yeah, it's. Look, I think it's a fun toy. I think it's going to be incredibly popular, like when they release it more broadly and it's not invite. I, I do think it'll become really popular, especially with young people, like, super fun to play around with and maybe a mainstay in their like app drawer. But yeah, that, that whole feeling around it of like, is this, like, again, our generation has been these. You've taken like all the smartest minds. I used to complain about this with Facebook. You've taken all the smartest minds and figured out how do you serve better retargeting ads. I mean, that's basically sums up like the generations before went to the moon. Allegedly went to the moon. And our generation has just figured out how to capture people's attention and sell them ads. Like, I, I don't know, I'm like slightly ashamed of it.
Mike
The other crazy factor that I can't really reconcile is the cost. Because if you look at models, like one of the reasons you had to back off from using Veo 3 in your video maker is it's just way too expensive. Like, even if your company is paying and you don't really care about the money, it's very hard to justify, what is it, like 40 cents a second or some crazy amount of money when you've got to iterate on these things. It's just clearly costing the providers a lot to run these models or they wouldn't pass on a cost that high. And even the models that are constantly trying to push the price down can't seem to get it down on these heavy video models. And yet Sora is as good as, if not better than, all of them. And they're really just doing it under existing plans and stuff. Or like, okay, yeah, maybe the Pro plan. But if someone's sitting there smashing these out all day, surely they're losing money overall.
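To put a rough number on the iteration problem Mike describes, here is a minimal sketch of the arithmetic. The 40 cents a second figure is his recollection from the conversation, not a quoted price; the clip length and number of attempts are made-up illustrative values, and `iteration_cost` is a hypothetical helper.

```python
def iteration_cost(price_per_second: float, clip_seconds: float, attempts: int) -> float:
    """Total spend when every attempt regenerates the whole clip."""
    if attempts < 0 or clip_seconds < 0 or price_per_second < 0:
        raise ValueError("inputs must be non-negative")
    return price_per_second * clip_seconds * attempts


# Assumed values: $0.40 per second (from the conversation),
# an 8-second clip and 10 attempts (both invented for illustration).
total = iteration_cost(0.40, 8, 10)
print(f"${total:.2f} to land one usable 8-second clip")
```

The point is only that per-second pricing multiplies quickly once retries are involved, which is why a flat plan that allows unlimited generations looks like it must be subsidized.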
Patrick
Yeah, there's no way. Unless, so if they aren't losing money, they've had a huge breakthrough in this model. Like this model is a massive breakthrough. But you'd think they would have bragged about it. Seems to me like they're pouring money down the drain to offset any attention on Google or Veo 3. But Veo 3's mistake, and I'm kind of stealing this opinion from X, someone pointed out that, you know, the cost was obviously just way too high for anyone to muck around with and share and do anything with. How you accessed it was really confusing. It was like, access it on Vertex AI or in AI Studio or the Gemini app. Oh, now in the Gemini app, now it's a button. Now it's gone, now it's a button again. Like it was like, I won't hear.
Mike
A bad word said about Google Gemini because they sent me merch during the week and anyone who sends me merch gets my unequivocal and uncritical praise no matter what. And I'll defend them till the end of time.
Patrick
Okay?
Mike
So keep that in mind, listeners. If you want me to defend your brand, send me merch.
Patrick
Send me merch.
Mike
Facts. In saying that, I totally agree. Accessing Veo 3 is really difficult. And then there's also other providers that host Veo 3, inexplicably. They've got trusted partners who host it, and it's really just confusing what it is and where it is and all that sort of stuff. The results are amazing. But you're right, it's never going to get this widespread attention where everybody's talking about the next step in AI. Even though Veo 3 makes some of the most amazing videos.
Patrick
But we were talking about this earlier. So in Sim Theory we have the code interpreter as an MCP, so it can call out to the code interpreter, which is really just like a Python box running, right? And code interpreter has always been an amazing tool, but it's had a problem where for your typical average user, including myself, you just don't remember what capabilities or things it can do. So what I did was I started cherry-picking processes from code interpreter, like making a chart, and then just engineering a custom MCP called Make a Chart where it just focuses on that one use case. So when you ask your AI to make a chart it's like, oh, I'll call the Make a Chart tool. Like I'm a dummy. Like I'll just call the most obvious one, and then the user gets the output they expect. And I think with Veo 3 in, say, Gemini, the challenge there is it's like, well, what end product am I getting? At least with the new Sora, it's like, well, I'm getting a TikTok-style video that'll make me laugh. That's an end product. Like, it's not a tool anymore, it's a product. And I think with Veo 3 or Veo 4, when they launch it, it should be focused on use cases within their own apps. Like, I want to make a, you know, a video podcast with two average guys and a bookshelf collapsing. And like that. It's.
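The pattern Patrick describes, carving one obvious single-purpose tool out of a general code sandbox, could be sketched roughly like this. Everything here is hypothetical: `run_python_sandbox`, `make_chart`, the `Tool` record and the `TOOLS` registry are illustrative names, not the actual Sim Theory or MCP implementation, and the sandbox is stubbed so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str  # what the model reads when deciding which tool to call
    handler: Callable[[dict], str]


def run_python_sandbox(code: str) -> str:
    """Stand-in for the general code interpreter (assumption: it accepts
    Python source). Here it just echoes the code it would execute."""
    return code


def make_chart(args: dict) -> str:
    """Narrow tool: the caller supplies only a title, labels and values;
    the plotting code itself is fixed by the tool."""
    labels, values, title = args["labels"], args["values"], args["title"]
    if len(labels) != len(values):
        raise ValueError("labels and values must be the same length")
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {values!r})\n"
        f"plt.title({title!r})\n"
        "plt.savefig('chart.png')\n"
    )
    return run_python_sandbox(code)


TOOLS = {
    "make_chart": Tool(
        name="make_chart",
        description="Make a bar chart from labels and values.",
        handler=make_chart,
    ),
}

generated = TOOLS["make_chart"].handler(
    {"title": "Episodes", "labels": ["a", "b"], "values": [1, 2]}
)
print(generated)
```

The design choice is that the general sandbox still does the work, but the model only sees a tool whose name and description match the request, so it picks "the most obvious one" without having to remember what the sandbox can do.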
Mike
Well, I mean, a good example of that is what they did with NotebookLM and the podcast maker. Like, that got a lot of attention because they actually turned it into something where you get something at the end, something really useful and shareable, rather than just giving you the tools where you've got to painstakingly go through the process over and over again.
Patrick
It's why the video maker that I made also, I think, was resonating with people. Even though it costs quite a bit of money, you can actually make some useful videos with it, and they're videos that you could share and use internally, like training videos, stuff like that. And so there was a use case for it. There's an obvious, like, oh, okay, I'm gonna use this to do this. And so to me, that's where Veo 3 might still be a better video model. Like, fundamentally. I haven't seen Sora 2 Pro or whatever it's called yet through the API, but it does feel to me like, outside of the sort of cutting, maybe Veo 3 is just still a higher quality model, and you would expect over time, with access to all the content they have through YouTube, you would just assume that Google's video model is better at some stage. Like, remember Sora 1, everyone was like, this is the greatest. And then like two, three weeks later, open source had already kind of caught up. And then Google came out with Veo 3 with audio for the first time. And everyone was like, whoa, that Sora is old and bad. So I think maybe in a way OpenAI has realized, like, we just can't compete in that model war. So let's go after a social network where the model matters less. I don't know. That's one idea behind it. Like, how do you monetize this line of model in the business?
Mike
Well, I mean I guess the wider implication in the long run is there's going to be certain things in terms of video construction, advertising construction, those kind of things, where some jobs just aren't going to exist anymore, because you're going to be able to direct an AI to do virtually everything a full film crew and stuff could have done previously. And so we had someone in the This Day in AI community the other day, or the Sim Theory community, one of them, produce an ad for their local hunting store or something like that. Now think about that with TV advertising. Like ads aren't exactly, you know, award-winning cinema where there's all these factors going into it. You're just making some videos that advertise a thing. Like we're getting to the point where, stitching these tools together, you could make an ad as good as any ad that's out there really, or at least.
Patrick
But this is where I think the opportunity lies for people listening, thinking about what could you really do with this. To me someone could make purpose-built video editors, like, you know, Cursor for video. But you can vibe out a commercial for your small business and it just focuses on that one use case.
Mike
Probably just giving away someone's startup right now and they're like yeah, don't let anyone listen to this podcast, don't share it.
Patrick
Yeah, I, I, maybe so. But I think you could assume you're going to get access to like Sora 2 Pro, right? And you could build like a vibe code. I mean it maybe it already exists and I'm just unaware but that's what it feels to me like with a lot of this content like the real impact will be the price of media generation goes to zero. But it doesn't necessarily mean that there is an opportunity in that like there'll still be people, just like people vibe code a product that don't want to code, just like I want to consume funny sort of videos. I don't really enjoy creating them that much. Like I prefer just watching the funnies. And so I think there'll still be a two sided marketplace to this transaction of content. But it's just yeah, new new tools for the job and the price will come down like it's deflationary for sure.
Mike
Which is great because I think the use cases that excite me about it, the more corporate ones in terms of education, so I guess that's less corporate but like education and training, I think is a massive one because having bespoke video based, interesting content that, that does it in a way that people prefer to learn is very interesting. And then the whole idea of like having a custom video to start your day or something that's going to catch you up on the things that you're interested in in a very engaging, funny, interesting way. Like those things are really exciting to me. Like where the content is actually really and very different to anything you can get out there right now.
Patrick
Yeah. If you put this technology to good and your addiction to your phone and social media was videos that were engaging and educating you on things or briefing you on different aspects of your day or whatever it is that you need, or if your child's learning a concept or a language and it's very addictive like doomscrolling, but they're actually learning from it, then maybe it does have a really good place. I do want to quickly play this video. It's a blooper reel from Sora 2 and I think it's a look at the Sora 2 Pro model. So this is the behind the scenes of their launch video that they made with Sora 2. But you'll note it's in 16 by 9. There's no watermarking on it, and it looks a lot higher quality. So I'll play it.
Mike
What is this? There's no wind.
Steve Irwin
This is supposed to be epic. Turn it up, turn it up. There's too much wind in my hair cut.
Patrick
Okay. Because a lot of people listen to the show. I'm gonna cut it.
Mike
God, they're losers. Like they're seriously sitting around doing this stuff. Like, what a waste of time.
Patrick
I don't know. I disagree. I thought it was really cool. Like they made the launch video in Sora with cameos of all the people who worked on it. I think that's really cool. And then they did a blooper reel of that to show how amazing the technology is at instruction following. Like you've got to give it to them. Like, yeah, the content's cringe, but it's still really impressive. And it does look pretty good. I think the thing I'd be concerned about really is the audio. Because the audio quality, and I'm sure people listening are gonna hear it thinking it's our bad editing, it's probably partially that, but it's also that the audio has this sort of underwater grainy feel to the whole thing where you just know it's, like, very generated. So I think if they can get to the quality of an ElevenLabs v3 model, you know, now we're talking, or like a Suno-style music and sound generation, then you're getting somewhere. But again, I think the opportunity here is someone can stitch these things together, like this is just another available model. The Sora app is just one way of showcasing this technology. And yet again, here we are with another incredible tool in our toolkit.
Mike
Yeah, I'm very, very interested to see the pricing on it.
Patrick
So let's move on to new model from Anthropic. I'm thinking maybe I should be back in the, the little Dario pendant necklace here available in our store. I did find it, I had lost it. That's how, how little I was using the Anthropic models for a while there.
Mike
So you only, you only, you always wear it when you're using the models, do you?
Patrick
Yeah, or if I'm hoping for a new model, I, I wish upon a.
Mike
Dario five times and like.
Patrick
I seriously, on my Twitter profile page, have a video of me rubbing it wishing for a new model. So we got Claude Sonnet 4.5. Not to be confused with Claude Sonnet 3.5, 3.7, 4. It is Claude Sonnet 4.5. It's a hybrid reasoning model with superior intelligence for agents and a 200K context window. So Claude Sonnet 4.5, what are your thoughts?
Mike
Well, just to clarify there, it's not just a 200k context window, it's also a 1 million context window if you enable the beta flag which we have. So it's actually 1 million. The trick is it's the same pricing structure as Grok, where if you exceed the 200 or I think maybe it's 250k context, I think it's 200. If you exceed 200, the pricing doubles for the, for the whole request.
Patrick
Wowzers.
Mike
So that does make it the kind of cost I just sort of put out of my mind and pretend it doesn't exist and just do it anyway. Because 1 million is amazing because one of the main reasons I constantly use Gemini 2.5 was just maintaining that larger context over a long session and therefore getting into the groove with your AI agent and just getting a lot done. And now you can do that with 4.5 and it's really good.
Patrick
So that means if it's a million. So I don't understand. You pay $3 per million input and then if it's over 200, you pay 6.
Mike
Yeah.
Patrick
Dollars per million for everything or everything. Oh, yeah. That's pretty pricey. What about output? Does it double?
Mike
It also doubles. Yes.
Patrick
Ouchy mama. That's $30 per million output. That's expensive. We should be charging more for this model.
Mike
Yeah. I was going to say, don't look at the bill.
Patrick
Yeah. Wow.
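The pricing rule as it is described in this exchange can be sketched as follows. The numbers come from the conversation ($3 per million input, a 200K threshold, everything doubling, $30 per million output when doubled, which implies $15 base); they are the hosts' reading and may not match the provider's published price sheet, so the multiplier is left as a parameter. `TieredPricing` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredPricing:
    input_per_million: float = 3.00      # stated in the conversation
    output_per_million: float = 15.00    # implied by "$30" once doubled
    long_context_threshold: int = 200_000
    long_context_multiplier: float = 2.0  # assumption: "the pricing doubles"

    def request_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost of one request; the multiplier applies to the WHOLE request
        once the input exceeds the threshold, not just to the overflow."""
        over = input_tokens > self.long_context_threshold
        m = self.long_context_multiplier if over else 1.0
        input_cost = input_tokens / 1_000_000 * self.input_per_million * m
        output_cost = output_tokens / 1_000_000 * self.output_per_million * m
        return input_cost + output_cost


pricing = TieredPricing()
print(f"{pricing.request_cost(100_000, 10_000):.2f}")  # short-context request
print(f"{pricing.request_cost(500_000, 10_000):.2f}")  # long-context request
```

Because the higher rate applies to every token in the request, a long session that drifts past the threshold gets more expensive per call even if each new message is small.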
Mike
My goal is always just to provide the users with the best and latest available that we can. That was our goal, always, with this stuff. And why muck around? The other thing to note as well is that it has up to a 200,000 token thinking budget. So if you enable the highest level of thinking, you can allocate 200,000 tokens for that. Now, it's a little, I don't know if it's just that we use AWS when we use Anthropic models, and I don't know if it's just them, but they advertise a 200,000 thinking budget. But if you enable the full budget, then it's like, I've got no tokens left to use for output. So it can't actually output anything.
Patrick
Using everything to think.
Mike
It sits around thinking. It's like. Well, I'm not telling you. Like, you know, you didn't give me the budget to actually tell you what I thought about.
Patrick
I filled up my context so that.
Mike
That thinking budget, at least as far as I can tell, and correct me if I'm wrong, everyone, you have to sort of reduce a bit to leave it some space to actually output stuff. But that's pretty interesting because that's massive. I mean, formerly, Sonnets, except for three points, the Sonnets only had 200,000 context windows. So now you can use an entire context window worth of thinking on top of all the input.
Patrick
But that's only if the million context is enabled. Right? Because if you use 200k thinking input.
Mike
No, because. No, that's not correct. Because they're output tokens. Thinking tokens are output tokens.
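Mike's point, that thinking tokens are spent out of the output allowance, so a maxed-out thinking budget leaves nothing for the answer, can be sketched like this. `clamp_thinking_budget` is a hypothetical helper rather than a real API call, and the 16,000-token answer reserve is an invented illustrative value; only the 200,000 figure comes from the conversation.

```python
def clamp_thinking_budget(output_cap: int, requested_budget: int, answer_reserve: int) -> int:
    """Return a thinking budget that leaves `answer_reserve` tokens of the
    shared output allowance free for the visible answer."""
    if answer_reserve >= output_cap:
        raise ValueError("answer_reserve must be smaller than output_cap")
    return max(0, min(requested_budget, output_cap - answer_reserve))


# Asking for the full 200,000 would leave zero tokens for the reply,
# so the budget is trimmed to keep the (assumed) 16,000-token reserve.
budget = clamp_thinking_budget(output_cap=200_000, requested_budget=200_000, answer_reserve=16_000)
print(budget)
```

This is the "reduce it a bit" workaround from the conversation, made explicit: the thinking budget is capped at the output allowance minus whatever room the answer needs.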
Patrick
I see. What I'm hung up on about this model, right, is they're still charging that premium of $3 per million input, whereas GPT-5 is $1.5 per million input and Gemini 2.5 Pro is $1.50. Like, are you getting double the value from Sonnet 4.5 from your initial impressions?
Mike
So I've got mixed impressions about it. I've had lots of different experiences. Firstly, we'll get into the API stuff next because I think they've made some really, really nice improvements on the API side, which have big implications. So we'll talk about that in a minute. But just generally using it, I like it more than GPT-5 because it's faster. Like you get an initial response faster. It gets me to my solution faster. And I've found that generally speaking, I've been able to use it as a daily driver this week. It's been pretty good. Also, with long-running tasks, and this is one of their goals of the model release, so it makes sense, it's good at it. With long-running lists of tool calls, it is 100% able to stick to the goal. I think you've talked the last two weeks about the idea of an AI being able to get, I forget what you called it, but like get back to its purpose. Like here's our overall list of shit, like the plan we're going to do. Hang on. Whoa, that button got stuck. Sorry. Here's our list of tasks we want to do. Like do each one and then get back to the main goal. It is absolutely unbelievably fantastic at that. Like it is blowing my mind. So for example, when I want to test an MCP, I will give it a list of all of the tools in that MCP with the instructions, like literally the manifest file of here's all the stuff it can do. And then I say, write a prompt that will test all of these things, and then I run it on Sonnet 4.5 and tell it to do all that. And it is able to go through, to give you an example, like Gmail or something: sending an email receipt, sending an attachment, you know, adding a calendar event, deleting a calendar event. Like just literally everything this MCP thing can do. Make a checklist. If it makes a mistake, it'll correct it and try again. If it has to do research, it'll go off and do research in between and then it'll get back to the checklist and complete it. 
Then at the end it'll give you a summary, detail summary table of what it's done, any corrections that need to be made, suggestions like its ability to just stick with the long running complex task over a long period of time is unsurpassed. Like I haven't seen anything like it in terms of its ability to be able to do that. So that is just really fantastic.
Patrick
Yeah, its agentic capabilities to me are still the best, just like Sonnet 4, but now it seems vastly improved. I think also the optimizations around speed, as you said, like, I don't know if it's just like Amazon got their act together with this model after all the criticism of the rollout of Claude Sonnet 4.
Mike
Send me a shirt guys. Yeah, and I'll say positive things but otherwise I won't.
Patrick
Yeah, it's much, much faster and it, yeah it feels a bit uncomfortably fast now that I'm used to lag in these thinking models and generally I think from a like coding standpoint and just analysis and research like a lot of their claims are true. I think it is the best model if you want to use multiple tools to do research. Like a lot of the tool calling if you think say you're researching something for medical, if you give it PubMed and access to scientific papers and different search tools and deep thinking tools and you know all that stuff and you tell it, go broad, use all those things. It's the only model I've seen so far that does as you say, stay on task, follow the prompt and consider all the tools, gather all the context together and then output you know it like put all that information together comprehensive.
Mike
Like cited answers and it'll go wild. It'll do like four batches of 10 calls to different tools to do the research for example. Like it really doesn't hold back when it comes to using the stuff that's available to it to get the job done.
Patrick
Yeah, and I think, I can't find it at the moment, but they were able to get it in sort of a more agentic setup to run for like 30 hours or something and recreate, you know, Slack and like a bunch of applications. Obviously not like production ready or anything like that, but they're able to send it off for quite a long time now in these tests to just keep working away at a problem, and I think we've seen that in the length of sessions that it can do as well. Like if you prompt it right it will just go on and on and on trying to call tools and do a bunch of work for you in the background. So I do believe today it is the best agentic model. Like if you're going to build an agent right now, it seems like the best out-of-the-box model to do that on. There's a few strange things in these benchmarks. I honestly don't believe them. I just base it on my real world use now. But it's like a few basis points better at agentic coding. I would say it's far better than Claude Sonnet 4 at coding, like leaps and bounds better. I would put it on par now with Codex. The Codex model, which is like the dedicated GPT5 coding model, I can't tell them apart anymore at all. And I couldn't believe how quickly it feels like Codex was surpassed by Claude Sonnet 4.5. And I quite frankly put that down to the tuning of the Sonnet model. It feels slightly more intelligent now and it's just tuned so well that, I don't know, I find Codex is very rough around the edges and it puts me off using it quite a bit. But I would also add, for very hard thinking problems or where I need the highest level of intelligence, I'm still using the GPT5 thinking tune. So I still think GPT5 thinking is the smartest available model through the API by leaps and bounds. But you can't really daily drive it, you can't really work with it day to day on things. I just sort of phone a friend occasionally to it or get it to plan something I'm working on.
And then like one interesting way of using it is if you upload like your company's financials and then say to GPT5 thinking, oh sorry, Claude Sonnet 4.5, go analyze all this data, call a bunch of tools and do some research on the market and then gather this all up and put a report together, and then ask GPT5 thinking to reflect on all that stuff. That's a really good workflow to get interesting insights. Like I think that's the level GPT5 thinking is at. So for me, I'm still not at a place where I'm like, oh, it's one of these models. Like back when Claude Sonnet 3.5, to be honest, that's all I used, was Claude Sonnet 3.5 when that model was king. And now I find myself at the current point in time just like, if you want long output, I go to Gemini 2.5 Pro because I know it's the best at consistent output tokens, for example.
Mike
So yeah, and I've heard examples from our community of people saying they really like the GLM 4.5 model and awaiting on the GLM 4.6 as a daily driver because it, it's way cheaper and it can do a lot of the same stuff. So I think there's definitely like scope at the moment for jumping around the models and I'm probably jumping around models more than ever at the moment. It's not like someone's blasted it out of the water where there's just a model that I'm always going to now. Like I was for so long with Gemini 2.5, but now I would say it's probably like less than 30%. I'm switching around. I'm going to it when it feels right, but not always. And so it's interesting that I, I do need to point out though, I've. I've experienced a couple of items of weirdness with Sonnet 4.5. Now this could be Sim Theory's fault, like this could be our fault. So I don't want to like completely trash the model on this, but I've had a couple of really odd situations where like it's outputted a list but all the headings are in French, for example, or it's outputted code and it's cut off the code too soon. Like the code has just stopped in the middle and. But then it'll put a summary at the end. So it's not like the model stopped producing output. It's not like the token stopped streaming, it just stopped putting the code there. And I've also noticed a little bit of laziness occasionally slipping back into the coding when it comes to Sonnet 4.5.
Patrick
It's got that GPT4 laziness for sure. That old.
Mike
Yeah, it could be just like an early tune of it or something. But it's just something that, you know, really gets you because I've been, I've tuned my agents at least to not do that lazy stuff. So I'm not used to it happening anymore. And then when it suddenly gets you, it's, it's shocking. And then I immediately like an angrily switch models. I'm like, how dare you do this to me, Patric and change. And so that's probably the only downside, definitely the only negative in my mind so far about Sonnet 4.5. But it's good enough to overcome that. And I don't think that'll be a long term problem. It's just hit me a couple of times and I thought I should at least mention it.
Patrick
I still find myself though going down this path. I would like to have the time to make a video on how I work day to day with these models. What's my workflow and why do I switch and when. But I often think I just have to record myself working for an hour to demonstrate it because there's points I'll hit with even Claude Sonnet 4.5 and I'm just like, you know, I hit this wall and I just am like, okay, I'm going to Gemini now or I'm going to GPT5 at this point.
Mike
Yeah, it's funny you say that because I've had an urge where I was like maybe I should just like stream myself vibe coding with the models and just show how it works and, and what those decision points are about when I would switch, when I would change to a new session, those kind of things.
Patrick
Yeah, like at the moment I find myself, when I'm tackling a problem, and this isn't just code, it can be like a business problem or writing a document or whatever, I will have three tabs open, getting three different models churning away on the problem, first up proposing a solution, and then I'll just quickly flick through them, be like okay, this one's the best, and just go down that path with it. And I don't know if I should be surprised at that or not, but there's just no clear winner with the models or the tunes. And I think Sora 2, the tune of the sort of TikTok hilarious video, to me illustrates this more than ever. They've had to very specifically tune a model for that use case and that output. So I think increasingly, instead of seeing these, you know, be-all and end-all models, I wouldn't be too shocked if in the near future we see tunes from providers where it's more like the Codex tune, where it's like a version of GPT5 just designed for code, or like have a model that is just Claude Sonnet Code and they just keep updating that to make it better, or Claude Sonnet Finance or Claude Sonnet Medicine, and they are just slightly tuned for that particular use case. You know, maybe that's one way of doing it, or they're all doing routers, but it seems to me like that is the best approach. It's just tune away until you get the right tune for the particular use case you're working on instead of having the global model. Especially now that it's starting to become established what people are actually using the models for.
Mike
Yeah, it's a good point. Now the next thing that we need to cover is there were major model, well, model isn't the right word, but I guess API updates around Claude, but really like things that you can put into the model to change its behavior. Right. And there's some really good ones in there and they really are all around this agentic long-running process concept and they're very good. And the thing I like most about it is, we spoke before about how I don't really like the idea of the GPT5 Pro thing where I give it the context, it goes off and in its magic box does all the work and reports back to me in four hours when it's done its task. I like the idea of an iterative approach where you're giving it the latest context. So for example, in computer use, it's seeing the latest version of the screen, it can see the latest version of the files on your disk and all that sort of stuff. And then it's going through multiple round trips to the model and getting the next steps and things like that. So the thing that Anthropic have added is automatic context management. And so what this is, someone actually asked this in the This Day in AI Discord, which is how doesn't it run out of context? Like if it's running for three hours, how doesn't it eventually fill up its context window and then fail? And the answer is Anthropic has added a feature in beta which will automatically control that context. So it will, based on rules you can give it or automatically, start to remove the oldest context and provide like little tombstones or summaries of what was there, but not the full content. So you can just keep calling the API over and over again and it'll gradually manage that context automatically for you. Now in SIM Theory, for example, we have our own detailed code that does this and that's how we're able to have these long sessions with models. But this is the first time I've seen it built into the model where you can actually just do it with configuration.
So it's a really, really nice addition to the model and really essential for something like computer use because obviously if it's going to run for hours, well, it can't fill that thing up. Related to that, they've also introduced context editing. So you can do things like send a command like clear tool uses. So once the tool uses are done and the AI has produced its response, it doesn't really need all of that data in there. So you can specifically say keep the full chat, but clear the tool uses. You can also say clear at least this many input tokens out and it has its own strategies for cleaning it up. So they're really doing a lot of work around this idea that the task will run for a long time and that managing the context throughout that process is important. So that's a really, really great update in terms of computer use. The other one they've added, which is interesting, is a memory tool. And this is basically building knowledge bases over time and keeping project state across different sessions. Now, this is something we've had for a long time in SIM Theory, at least a year and a half, or since the beginning with our system. We call it a knowledge graph where we keep that information. So Anthropic has now just built a tool that's like a generic internal tool that will manage that memory for you. So you as the developer are responsible for storing it somewhere, like in a markdown file or a database or something. But the actual model itself is deciding on what changes to make to that memory, which is really interesting. I haven't tried it out yet, but I always would try to favor the model provider's way of doing things over my own because, like, they know the model best. So these are really interesting updates and there's a few more, but they're sort of more technical, but it's very interesting the direction they're going.
And I think this points to a bit of what you're saying around the tune that they're obviously optimizing for the very cases that we're judging it on.
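To make the two ideas Mike describes concrete, here is a rough client-side sketch in Python of what clearing tool uses and dropping the oldest context with a tombstone could look like. This is not Anthropic's API or implementation: the `Message` type, the function names, the tombstone strings and the 4-characters-per-token estimate are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool" (hypothetical roles)
    content: str


TOMBSTONE = "[tool result cleared to save context]"


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def total_tokens(messages) -> int:
    return sum(estimate_tokens(m.content) for m in messages)


def clear_tool_uses(messages, keep_last: int = 1):
    """Keep the full chat, but replace all except the most recent
    `keep_last` tool results with a short tombstone."""
    tool_idx = [i for i, m in enumerate(messages) if m.role == "tool"]
    to_clear = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        Message(m.role, TOMBSTONE) if i in to_clear else m
        for i, m in enumerate(messages)
    ]


def trim_oldest(messages, budget: int):
    """Drop the oldest messages until the estimate fits the budget,
    then leave one tombstone saying how many were removed."""
    kept = list(messages)
    dropped = 0
    while len(kept) > 1 and total_tokens(kept) > budget:
        kept.pop(0)
        dropped += 1
    if dropped:
        kept.insert(0, Message("user", f"[{dropped} earlier messages removed]"))
    return kept
```

In a long-running loop you would run something like `clear_tool_uses` and then `trim_oldest` on the history before every call to the model, so the context never grows without bound. A real system would likely write a proper summary instead of a bare count.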
Patrick
Yeah, but does it worry you that with the knowledge graph and all these other components, and I'm not speaking from like our perspective running something like SIM theory, but more from you're in enterprise, like, do you really want Anthropic storing at the API level, the knowledge graph?
Mike
They don't, though. The important point is that it's a tool call that'll tell your system what to do.
Patrick
Oh, I see. So it's still storing on your side. Okay, yeah.
Mike
So you're still storing it securely, but it's saying add this to the memory, delete this from the memory, summarize this part of the memory, et cetera, and then your system has to then comply with that. So no, they're not. They're still not storing it.
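The split Mike describes, where the model decides what to change and your system does the storing, can be sketched roughly like this in Python. The command shape (`op`, `key`, `text`) and the `MemoryStore` class are hypothetical names for illustration only, not the actual schema of Anthropic's memory tool.

```python
import json
from pathlib import Path


class MemoryStore:
    """Developer-side memory: the model only proposes changes,
    this class applies them and keeps the data on your side."""

    def __init__(self, path=None):
        # Optional JSON file for persistence; in-memory only if omitted.
        self.path = Path(path) if path else None
        self.entries = {}
        if self.path and self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))

    def _save(self):
        if self.path:
            self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")

    def apply(self, command: dict) -> str:
        """Apply one memory command (hypothetical shape) and return a
        short result string to send back to the model."""
        op = command.get("op")
        key = command.get("key")
        if op == "add":
            self.entries[key] = command["text"]
        elif op == "delete":
            self.entries.pop(key, None)
        elif op == "summarize":
            # The model supplies the shorter replacement text itself.
            if key not in self.entries:
                return f"error: no memory named {key!r}"
            self.entries[key] = command["text"]
        else:
            return f"error: unknown op {op!r}"
        self._save()
        return "ok"
```

For example, if the model's tool call came back as `{"op": "add", "key": "client", "text": "Prefers weekly reports"}`, your code would pass it to `store.apply(...)` and return the result string as the tool result. Nothing is stored on the provider's side in this sketch.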
Patrick
Yeah, but I think it does go back to these models, or the model providers or labs, providing an AI system for you to build an agent on yourself. And whether that then lends itself to them having like their own agent builders, which I'm sure at some point we'll see, you know, that's probably what it'll look like, especially given that they have the agent SDK structure now, you know, and they say that's how they built Claude Code, these are all the pieces of it, which has been very popular. I think this does give people the opportunity to go and build the Claude Code of blah pretty easily on top of that SDK.
Mike
That's right. Because not having to develop all of this stuff yourself really accelerates things because you can just lean on this SDK to handle all of those things for you. And the Claude Agent SDK has a whole bunch of other additional things that are useful, like looping, like session management, those kind of things that you would need to build an agentic workflow. So it would make sense if you're making something from scratch, like a Claude Code for industry whatever, a Vibe Doc tool for industry whatever, you could build it on this agent SDK and save yourself a lot of time.
Patrick
So, like, I guess this brings me back to the point though, of like, future software, because if you think about a company today, like, there's been different ways where you'd have like the sort of all in one platform where you buy into, say, the Salesforce ecosystem or the Microsoft ecosystem or whatever, and you have a series of apps in that everything's allegedly perfectly integrated and you use all those different apps and those businesses are building in AI agents. I think Microsoft announced there's some alpha of an Excel agent either out or coming out soon this week. And so they're building sort of the agentic workflows in those existing apps. And that kind of makes a bit of sense to me. But then you've also got people potentially building like new Vibe Doc editors with the CLAUDE code framework, putting them out there. I guess what I'm saying is, do you imagine a world where you go and use these very specific apps like, oh, I'm, I'm gonna go to vibe do.com now or vibe office.com because I need to create a doc, or do you imagine a world where this is just fully integrated, Like, I'm in Claude or I'm in SIM Theory or whatever, and I need to create a doc, so I'm creating it in there because all software in theory can be rendered by these models. Like, to me, there's this question of disruption around, like, are we just consuming all of these software and interactions we have with the computer from a singular interface and singular model in the future? Or do you imagine a world where a company does have multiple subscriptions to different apps, like they did in the past? Like, for me, I just think about how the ease of building something like this now, like, you could in theory build your own Word Processor for your own business, specific to the documents you're creating. Like, if you're a law firm, you could have your own, like Mike's Law Contract Builder tool, right? Like it, like it can be so specific, so bespoke.
Mike
A lot of it for me is about the centralization of context and processes. Because like in your law example, you would have processes as your law firm that you follow, like checklists or template documents, and you would have reference materials and sources you consult. It's about gathering all of those together. I don't want to have one system that gathers all my context together, gets me all geared up and lathered up to make a document, and then jump over into the Microsoft Excel vibe doc helper 2.0 and have to get all that context out of one system into another just to work with it in their ecosystem. And like you say, if I've got five of these subscriptions and I'm just passing around this context, then what have I become? I've just become a slave to the software and a slave to the AI. Like the, the power we're seeing from MCPS and the centralization of context building is that you can then take an educated AI system and say, now go make this thing. And it can do it perfectly. So in my mind, you need the output tools or the output creation tools right there where you've got the context, right there where you've got the tools. Because otherwise you're just adding some manual step. Or then there's another layer of fricking APIs and MCPs where you got to have an MCP from your main system into the Microsoft whatever, and then it consumes from there, you know, and then it's just a software integration nightmare. So no, I think it will lead towards centralization perhaps in the short term. People who haven't discovered a centralized platform that brings it all together and allows you to do that will still benefit from say, a Vibe Doc thing in Word or in Excel or whatever it is. But I think in the long run those soft pieces of software will become less useful because it's the AI that's going to be operating the software, not you. And so you having it.
Patrick
Sorry to interrupt, but does that mean that you think in the future, like, a lot of people use Cursor today? Like, I would say the vast majority of developers are using something like the Cursor IDE. But then there's been a lot of hype lately around command line tools like Codex and Claude Code, and those are introducing interfaces as well. But you could make the counterargument, like if developers are sort of the leading adopters of this tech, well, why aren't they just using, say, ChatGPT for everything, if this is true, that you will consume everything through a single.
Mike
Because I would say the answer to that is exactly what we're talking about, which is Cursor has the output type. Cursor has the ability to actual, actually actuate what they're giving. And it also combines that with the ability to build context. And that's the amazing thing about something like cursor is because it can access all the files, it has the full context to know what to do and then it has the ability to output it. So I would compare Cursor more to like a centralized system, but it's just built for purpose, it's built for coding. And I would say that eventually you'll have centralized tools that can do this for multiple things across an organization, not just coding, rather than. It's not like they're taking cursor and then going off into an IDE and then using that like a dedicated one, a separate one.
Patrick
Yeah, and then bringing in the cursor.
Mike
Context to do it.
Patrick
But then could you also argue that if Microsoft with Excel is to allow you to gather context easily about your business from the integrations in the Microsoft ecosystem, that you could be vibing in there and it does have full context?
Mike
Yeah, I mean, yes, I think that's very possible and I'd say that's probably what Microsoft are going to try and do.
Patrick
Yeah, that's probably the vision for it. But yeah, like I can also see a middle phase where, yeah, Cursor and Vibe Doc tools or whatever are critical. But you can imagine this stuff gets to a point where people are just building software, never even looking at the code because the models are so good at that point, especially for like internal applications and a lot of the things people use it for, like replacing pieces of, you know, SaaS and stuff. You know, maybe it's at a point where they are just rendering what they need and storing those renders in something like a ChatGPT to access them for their business. Like I can also see that path as well being the longer term path.
Mike
I agree. I think we are going to reach a point where we get beyond code and the code is just something the AI worries about. And I say this, I'm a programmer, I've been a programmer my whole life, but I just can't see a world in the future where people are going to be Typing out code on their own. There's no point. It's a waste of time now.
Patrick
So that noise was booting up Claude Imagine. Claude Imagine, it's a cool little demo, and it's a sort of pretend operating system in the browser with a few sticky notes on the screen. And this is just a proof of concept of what we've talked about on the show before, this idea of, we call it like a glass UI, where it's generating something on the fly. And at the bottom there's "What do you want to build?" And I can say note, notepad to.
Mike
Key make a, make a pig grooming management system.
Patrick
Pig grooming. Open my pig grooming management system. All right, let's do that.
Mike
That's a true challenge.
Patrick
It's like fairly quick. Here's my note, note to note. Oh wait, my pig grooming management system's opened up. Look at this. So my revenue apparently is $3,000. I've got new appointment add pig clients over here. This is pretty good. So I'm going to add a pig.
Mike
Add a dog detection filter that will detect if it's actually a dog and not a pig.
Patrick
So it's just rendering this screen like it's just building this UI in real time on the fly like an entry point. Pig name, breed age, weight, owner name, phone, email. So I'll put in micro pig and call it Pepper. That's a good suggestion. It can be two years old and I'll save that pig profile. So like as I click on it, it's then I assume storing that data maybe somewhere and it now it's showing a success message. It's like flickering like mental as it builds. But I guess it worked. Did it work? No, it didn't actually save anything. So it's a bit of a simulation right now to show what it could be like. But don't you think this is probably a sneak peek at software in the not too distant future where there's some sort of core generation where you could have an operating system that's purely like it.
Mike
Absolutely. I mean look, if I was a front end web developer, I'd be shaking in my boots right now because all someone needs to do is build a component library that's more suitable to AI and maybe not even that if this is the demo of that. And why would you ever, ever pay someone to do front end again?
Patrick
But don't you think this is far the far beyond that even? It's like the, the next operating system, like the next computer will be like an AI chip and fully driven.
Mike
Like it's just, I told you ages ago, CSI Miami invented the future UI where they're just like, oh, bring it up on the screen and they're using their hands to like zoom in and it's like make an interface for this. And it's like that is actually real. They got it right.
Patrick
I think once the models can get like the memory going and like some consistency with this stuff, like it's just end game for software. Like you'll be able to generate or do anything you want. I still think it's ways off using this demo.
Mike
A good example of that is Create with Code. In SIM Theory, we gave it the ability to itself call an LLM to save data to a CSV file, which interestingly we did so people could accept form submissions and then download them. But it's interesting watching what the AI does with it because it repurposes that CSV storage to store game data. So if you're making like a video game, it'll use it to store things and retrieve things and stuff like that. So it's actually taken a tool that wasn't even designed for that to do it. So what I'm thinking is if you gave the AI a full suite of backend tools here to save and retrieve data, even if it's just saving generic balls of document data, for example, it'll be able to do all that. It'll be able to persist on the back end, it'll be able to do analysis, it'll be able to graph it, it'll be able to write code to do stuff with it. Like this is definitely the future of interface. You start with a blank screen and you just make up a UI for what you want based on your data sources.
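As a rough illustration of the kind of generic storage tool Mike is describing, here is a minimal Python sketch that saves arbitrary blobs of data to a CSV file. It is not SIM Theory's implementation: the function names, the two-column layout and the `collection` label are assumptions. Because each row is just a JSON blob, the same tool works for form submissions and for something like game saves.

```python
import csv
import json
from pathlib import Path

FIELDS = ["collection", "data"]  # hypothetical two-column layout


def save_record(path, collection: str, data: dict) -> None:
    """Append one JSON-encoded record to the CSV, writing the header
    first if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"collection": collection, "data": json.dumps(data)})


def load_records(path, collection: str) -> list:
    """Return every record stored under `collection`, oldest first."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as f:
        return [
            json.loads(row["data"])
            for row in csv.DictReader(f)
            if row["collection"] == collection
        ]
```

The design choice here is that the tool has no opinion about what the data means, which is what lets a model repurpose it for things it was never designed for.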
Patrick
It's pretty funny though. I went new appointment again. It's a completely different interface, obviously. Like this is just a demo, right? But that consistency problem would be.
Mike
Well, I mean the consistency problem is what we're all going to face in the next few months with agentic workflows, right? Like where you teach your AI to do a task that you want it to do, but then what happens if it does it slightly differently each time? That's a problem. You want it to do the same method each time. So we need a way of persisting these things even if they are originally generated by the AI.
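One simple way to picture persisting something the AI generated is to store the steps the first time and reuse them on every later run. The Python sketch below assumes a hypothetical `generate` callback standing in for the model call; the `WorkflowStore` name and the JSON file layout are made up for illustration.

```python
import json
from pathlib import Path


class WorkflowStore:
    """Persist AI-generated workflows so a task runs the same way each time."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.workflows = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.workflows = {}

    def get_or_generate(self, task: str, generate) -> list:
        """Return the saved steps for `task`; only call `generate`
        (a stand-in for the model) when nothing is stored yet."""
        if task not in self.workflows:
            self.workflows[task] = list(generate(task))
            self.path.write_text(
                json.dumps(self.workflows, indent=2), encoding="utf-8"
            )
        return self.workflows[task]
```

With this in place, asking for "new appointment" twice returns the same stored steps rather than a freshly invented variation, and a human can still edit the stored file if the first generation was not quite right.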
Patrick
Yeah, you. I think all the again though, these are all fantasies, like the technology's not there yet, it's getting better. But I think if you look going back all the way to Sora, like if you look at that like it's currently a 10 second video clip with pretty poor audio quality, still has physics issues. A lot of artifacts probably can't be used that well in prime time apart from some cutscenes or, or whatever or some establishing shots or what it may be. So I like it. It does. I think for anyone panicking like this is. This is a long term horizon right now.
Mike
I disagree. Panic immediately. This is coming now.
Patrick
So before we move on from the Claude stuff, because we. We wanted to get into a discussion around agents a little bit deeper, I. There is an important thing to do which is Claude 4.5 sonnet boom factor I would like from you. But also, you know, we've got to test the diss track. So.
Mike
Okay, well, boom factor, I'm going to go seven and a half, which is I think on the higher side because it's not like, you know, it hasn't won on polymarket, which we can't even access in Australia anymore. But someone sent me a screenshot. So it isn't winning on the benchmarks. Right. However, I think that agentic side has been unexplored yet and I think that it's going to prove to be the best at it. And I also am working on computer use again and I think that Sonnet 4.5 is going to be the best by a long way on computer use, partly because of the model and partly because of the API support that it's got for that, that they've actually deliberately moved in that direction. If you look at their unaccessible Claude Chrome extension that they're giving to some elite users, it's obviously being powered by this model and optimized for that. So I think that there's huge scope for this model to have a big influence and it might be another 3.5 where it really stands the test of time over the next six months or so.
Patrick
Okay, before I play the diss track written by Claude 4.5 Sonnet, a lot of people always ask me how do I prompt it to get the tracks. And I just want to explain how simple it is. So I said, do the following: 1. Research the Claude 4.5 Sonnet release in comparison to other leading models like GPT5, Grok 4 Fast, Gemini 2.5 Flash Preview. You are looking for data you can use in the song. 2. Write a diss track in the style of Eminem from the perspective of Claude Sonnet 4.5. You're the rapper, I spelled rapper wrong, that model, and you are bragging and dissing on the other top models. I don't even check my spelling. 3. Once you write the song, make the song with the make music skill. Now I probably didn't have to do that but I just wanted to clarify. So it goes off, it researches using the new Perplexity search tool, which is like really cool, writes the song, puts the song lyrics out for me and then goes and generates it with Suno. So are you ready for the track?
Mike
Yeah, I'd love to hear it.
Steve Irwin
Sonnet in the building, time to school these pretenders.
Patrick
You feeling me?
Steve Irwin
Let's go. They call me Sonnet 4.5 I'm the apex predator GPT5 more like GPT inferior you thought you had the throne but I'm the editor rewriting history while you're just a competitor I dropped in October shook the whole foundation every benchmark crushed call it domination 200k context that's my conversation while you stuck buffering I'm pure innovation my reasoning surgical precision like a scalpel you're out here hallucinating manufacture just baffled I code in Python, Java, C chapel while GPT's responses need a fact check grapple agentic workflows yeah I'm autonomous you need hand holding man you're so monotonous from Anthropic labs yeah we stay meticulous while OpenAI's promises stay ridiculous I'm Claude 4.5 I'm the king of AI these other models man they can't even try benchmarks don't lie I'm reaching new highs while you're all falling I continue to rise Sonnet supremacy that's the decree GPT Grok Gemini bow down to me yeah Grok 4 Fast more like Grok for last your speed means nothing when your accuracy's trash Elon's hyping you up but you're running out of gas I'm lapping you twice while you're stuck in the past you claim you're uncensored, edgy and raw but without guardrails you're just fundamentally flawed I balance safety with capability that's the law while you're spitting nonsense breaking every clause and Gemini 2.5 Flash man you just a preview, a beta version while I'm the full review Google throwing spaghetti seeing what gets through while I'm the finished product polished and true you flash in the pan I'm a permanent fixture multimodal please I paint the whole picture graduate level reasoning I'm the strictest teacher you're still in kindergarten.
Mike
So can I predict that the audience is going to hate this because I loved it. That was amazing.
Patrick
No, I don't think so. I think your, your taste in these tracks is becoming more like, more better to use language or better that that was the best ever, I think broke.
Mike
As it calls it. But Grok, it's really good and I love that. What was it like? Something about you're just a flash in the pan. I'm a permanent fixture. That's really cool.
Patrick
"And Gemini 2.5 Flash, man, you're just a preview, a beta version, while I'm the full review. Google throwing spaghetti, seeing what gets through, while I'm the finished product, polished and true. You flash in the pan, I'm a permanent fixture. Multimodal? Please. I paint the whole picture." Like it's really very clever.
Mike
And what's amazing is it's pretty accurate. Like the actual research behind it is good. Like it's a genuine criticism.
Patrick
Yeah. And I think that's the whole thing about the context being up to date, right, from the tool calling, and having that, you know, context to output in the same spot. I think that sort of illustrates the point. But I've got to say, I'll put the whole song at the end of the show, after the, like, rollout music, for those that want to listen to the whole thing. But that's really good. And I promise, because so many people have been asking: I have been storing in a folder all the tracks from the show, and I'm gonna put them on Spotify at some point.
Mike
I was gonna say I'm one of them.
Patrick
Yeah, it's surprisingly a lot of work to get them on Spotify. So I will do it. I'll commit to doing that very soon, maybe this weekend if I get time, and I'll put them all up so you can compare them and listen to them, and there's a single place to go to. But anyway, the full track will be at the end. I've got to say, I think that's up there, if not the best ever. I do think the new Suno is helping, the version five. It's really good. So very, very cool. Now, I don't know if we alluded to it before, but Ethan Mollick during the week wrote this "Real AI Agents and Real World Work" article about this experiment that OpenAI did. It says OpenAI released a new test of AI ability, but this one differs from the usual benchmarks built around math or trivia. For this test, OpenAI gathered experts with an average of 14 years of experience in industries ranging from finance to law to retail and had them design realistic tasks that would take human experts an average of four to seven hours to complete. OpenAI then had both AI and other experts do the tasks themselves. A third group of experts graded the results, not knowing which answers came from the AI and which from the human, a process which took about an hour per question. Human experts won, but barely, and the margins vary dramatically by industry. Yet AI is improving fast, with more recent AI models scoring much higher than older ones, yada yada. Anyway, basically what he goes on to say is that it is so close now that these human experts could barely tell the output of the AI agent doing the work from the human expert's output. It was that close. Now, the thing that struck me about it was his conclusion. And I think this is something that's really important for people to hear, which is: does he think, as a result of this test, that AI is ready to replace human jobs?
And he says no, at least not soon, because what was being measured was not jobs, but tasks. And this is the discussion I want to have. Our job consists of many tasks, and my job as a professor is not just one thing. It involves teaching, researching, writing, filling in annual reports, etc. AI doing one or more of these tasks does not replace my entire job. It shifts what I do. And we talked, when we were banging on about building agents, I think on one of the last shows, about this idea of teaching the agent skills, giving it access to those skills as tools via MCPs, and then letting it run autonomously to execute on these very specific and specialized tools to allow you to be more effective in your job. And I feel like it's a similar conclusion that Ethan Mollick's coming to in this article, which is that it does really excel at skills, and it does change how you work because it can go off and do a far better job than you, but you still have a role to play in that part.
Mike
Yeah, it's faster, it's cheaper than a human doing the task, but what it needs is coordination and direction. And he mentions in the article the idea that you really need a human in the loop to correct it when it gets things wrong and things like that, as part of, like, a holistic goal-setting task. But the thing I disagree with is how far off it is before it can do whole jobs and whole sequences of tasks. The reason I think it's better right now on the actual task performance is because it can do it so much faster, it doesn't get tired, and it can do a much more comprehensive job than you would actually bother to do for certain tasks. Like, it can actually go to far greater lengths with an individual task than you might, just because of efficiency of time. It's not worth you looking into every little thing in great detail, but it can actually do that. So I think on the individual skills it's got us, you know, in most professions, and it'll get better. So. But it's the.
Patrick
Yeah, sorry, one of the tasks, just so people understand, because I think sometimes these things are pretty vague, right? Like, you know, what the test actually was. So they actually published on Hugging Face all the data on the prompts used, and the prompts given to humans as well. So this is just one example, there's a lot of them: accountants and auditors. So, "You are a mid-level tax preparer at an accounting firm. You have been given the task to complete..." You're an average tax preparer kind of shit. Yeah. "You have been given the task to complete an individual tax return, Form 1040," so a very specific prompt, like a human would have to figure this out, "for the firm's clients, Bob and Lisa Smith. Bob and Lisa have provided all the attached 2024 tax documents for completion of their tax return. They have also completed an intake questionnaire, which is attached. Please prepare Bob and Lisa Smith's individual tax return." Yada yada. So I think the thing you could question about this prompt is that it's already gathered a lot of context about the task and the people. So you could argue it's sort of cheating. But then you could also imagine the same tax preparer in the organization going to the agent and saying, "Oh hey, I've got Bob and Lisa here with me, here's their tax documents. Can you get on it, please?" And then it can do it. So I think that's kind of what he's getting at: it can go and do these tasks more effectively, but right now you still need the human for agency. Like, it's not just going to figure.
Mike
This out. One human. But if I'm an accounting firm and I've got a whole bunch of junior accountants doing the actual legwork in terms of preparing the document, and I'm just a manager, well, I can just fire all of those junior accountants, have a folder on my computer that has Bob and Lisa's documents in it, right-click, and invoke my agent to say, "Prepare tax return." And it does it. Then I've saved a whole bunch of money on paying employees to do it. So I actually think it's a great example of where jobs could be replaced by this kind of thing.
Patrick
But wouldn't you still want someone to, like... Okay, maybe you're not firing everyone. Maybe you've just got one person operating the agent who can also interact with the clients from that personalized perspective.
Mike
Yeah, but I'm saying we're talking about a mid-level accountant here. You know, they're the one meeting with them, saying, okay, this is what we're going to need, here's our strategy. They prepare the documents, fire that into the AI, right? And so you just need fewer people.
Patrick
Yeah. So are you now saying that mid level accountant, like don't you just think there's more to the job? Like going and looking in different systems and files and, and digging in, like prompting the human to get information like you know, why did you buy this thing? Or why did you do that? Or, or can you imagine that interaction?
Mike
Well, I don't know accountancy that well, but just going off this report, it seems to me like the next big piece is really going to be this conductor. Like, you need to be the conductor of your AI choir, where you are directing them on where to get the information, teaching them skills, showing them how to get out of trouble and troubleshoot when things go wrong in those skills, and then you're just giving them goals, or giving groups of them goals. I think there's a really logical series of steps here where you could become a one-man army running your group of agents. Like, I genuinely believe it's possible.
Patrick
Now I think the ones that get to me are like the nurse practitioners and registered nurses. I'm like, are they really gonna like change the prayer or give the needle? Like those are ridiculous. But look at this data that came out of it. I, I like full props to OpenAI.
Mike
Hallucinates and stabs you in the eye.
Patrick
That would be... Full props to OpenAI for releasing this data, because it pretty much puts Claude on top of, like, all of these things in agentic workflows. Like financial managers: the humans preferred Claude's responses. Financial and investment analysts: Claude. Personal finance advisors: Claude. Securities, commodities and financial, and so on and so forth. The only area where GPT-5 did better was using GPT-5 high for computer and information systems managers. For software developers, o3 high and Claude were on par. And then for shipping, receiving and inventory, Claude won. For mechanical engineers, I don't know, it's a little bit even with GPT and Claude. But man, Claude is really nailing it in the real-world agentic use cases. In this respect, they are really on top. And this, I assume, predates 4.5, well, it has to, the newer versions. So, I don't know. Okay, so you're at an accounting firm. Now let's bring this down to a real example. You're at an accounting firm, or, like, you don't really know what accountants do. So are you thinking, I'm going to fire a bunch of these people and then just train one how to use.
Mike
I think what people need to do is look at their industry and think, if I had this kind of leverage, like basically free employees, or, you know, I know there's a cost to run this stuff, but significantly cheaper, like 10% of the cost of employees, how can I crush my competitors in my industry? Which aspects of my industry can I just do so much better and faster? Because, yeah, it's not just faster, it's better. Like, you actually do a better job. And what tools would I need, what workflows would I need a system to be able to do, in order to just totally dominate that element of my industry? And I would think there's a lot of those. Sticking with the accountant example, maybe it is just preparing tax returns, and you shut down all other accounting activities, and you simply find people who fit the bill for getting that information, and you're doing a really simple one and you just undercut everyone. I think that's not the best one, because accounting can already be quite cheap in that respect. But there would be things like interior design, for example. We've talked about this before, where someone wants a concept for their kitchen or something like that. They give you some input, inspiration boards or something, and you produce a prospectus, which is like a PowerPoint presentation you go and show to them. You could be the first one that produces videos, a podcast on the strategy for the kitchen, and have all these collateral materials that you run through a process from someone's input. And instead of paying these highly paid interior designers or whatever, you're literally just shoving all the input into a model and then presenting that output to the client, charging them $5,000 for the privilege. There'd be lots of industries like that, where you can pick a single process, get it right, and then just do it with these huge margins.
Patrick
I just think you're under, like, at the low end. Sure. With account, like sticking on accounting at the low end. I think it kind of already was done somewhat with QuickBooks already. And they have like a really great wizard if you're in the US that handles your tax return easily. But then generally accountants, I think, for more complex stuff, right? Like audits and like, you know, all the sort of compliance work and that often involves human relationships like calling people, talking to people, knowing where to pull the data from, like physically going to, like, store documents in a folder. If it's an older company, if it's medical stuff, like sometimes there's hard docs and digitized documents like that kind of auditing and those elements I just can't see. I think it could be more efficient, like far more efficient. But I kind of wonder though, if you're onto something like maybe you can compete by. There's a deflationary aspect where you can provide a far superior experience for a lower cost per hour.
Mike
Well, here's an example: ISO 27001 compliance, right? There's a whole bunch of documents that you need to produce to comply with that. And there's some actual work to improve your system to comply with it, right? But let's say a company that's been compliant needs to update all their documents for this year. Now, AI can do all of it. Don't ask me how I know. But it can. With no editing, you can just give it the input and say, produce the output, copy-paste it in, and you pass, right? So what's to stop someone building a system where it's like, you need to provide context for the following documents? Or even better, get your agent to trawl your OneDrive, get your agent to trawl your Google Drive, find all the relevant documents, produce all the relevant output, upload it to the system it needs to be in, and you're done. These processes with consultants can cost $10,000, $20,000. These aren't small amounts of money. The same goes for, for example, applying for government grants. Think of how many government grants there are out there across every country, let alone Australia, where you've got to go through these detailed processes, where you've got to produce documents with certain word counts in certain formats and things like that. You could have a government grant builder that is literally just one click. You literally just upload your website and a couple of things, and it produces the perfectly compliant government grant application with persuasive language. There are consultants out there charging like 30% on those things, 20% on those things. There are whole industries of these kinds of things. RFPs are another example, where you're producing complex documentation from a variety of inputs and charging huge consultancy fees. Every single one of these can be totally and utterly replaced.
Patrick
I just think we lack imagination in terms of how we think the technology can be applied. Like, everyone just goes to the negative of assuming, oh, we can just cut jobs and save money because AI, like, just AI. It's almost like a buzzword: oh, can't the AI do that now? But to me, from what you're describing, it just sounds like you can make people, especially people who adopt the technology, so much more efficient, working in tandem with AI and potentially agents that they purpose-train on skills, that you could, as a business owner, expand your service offering or be more competitive or.
Mike
That's what I'm saying, crush the competitors. Like absolutely dominate.
Patrick
But think about auditing. Like an auditing firm could do more audits, not less. Like, so I wonder if it'll just increase consumption. Not necessarily.
Mike
Yeah, like a good example of that is, okay, let's say you build up some agentic skills for your organization that can do processes that your employees formerly did, you then retrain those employees on how to operate that system and then like you say, beat the pavement, get out there, get a whole bunch more clients in your industry and have your junior or mid level accountant doing 10 times the amount of customers they used to do and then they keep their job. Your company just absolutely dominates. And I think this is going to happen in a lot of areas where the businesses that aren't able to adopt the huge leverage that's on offer here are going to get wiped out.
Patrick
But even I think about just time taken on tasks. Like, I've noticed from having our own internal MCP, call it an enterprise MCP, that we're able to get a lot more done quickly, and then enable agents to do things like, you know, look things up in systems and make changes on our behalf that are very time-consuming for a human to do. And those have actually saved me a ton of time going and gathering information and doing stuff I would have previously been doing myself. And it truly gives me time back to work on other things that I should be working on.
Mike
Even just remembering how to do something like, oh, I haven't done that in ages. I forget how to do it. The agent doesn't forget, he knows how to do it all.
Patrick
I just think this is transformative in the sense that enterprises and businesses that adopt this, and figure out ways of adopting it in the near term and the longer term, and have a proper strategy around this, are going to do really well. And then there's going to be the other people who go buy a bunch of Copilot licenses, let's be honest, cut a bunch of jobs under the guise of AI and say, "Oh, you know, you guys need to be more productive now with Copilot, get on it." You know, that's not gonna work. To me, each business needs to.
Mike
Summarizing your documents more efficiently and adding calendar appointments is not.
Patrick
It's not, it's not it. Like, it's really not it. I think it's this idea of building custom, training custom agents on specific skills and processes in a business in a reliable, secure way. Yeah, that is it. That's transformative. And giving your team tools to do, you know, this work in the best way possible. Like, the best available tools. That's it. That's.
Mike
Yeah. And I think it's why any organization that isn't currently working on their own internal MCP or MCPs for their team is crazy. Because it's probably the best thing you could do to give leverage now, for two reasons. One, straight away, like you said, those tricky internal processes you might not remember how to do, or that are time-consuming to do, can be done really fast and efficiently by your assistants. Right. So that's step one. Step two is agency is coming. Like, systems like ours and other systems are going to be adding agentic abilities. Now, if you want to leverage those agentic abilities, you need your organization to be able to expose to those agents the things that it can do. You need to give it the best tools available to be able to do your job. Like, if you want it to replace jobs or you want it to make your company more efficient, you need to empower it to do that. The best way to empower it is to expose really well-defined tools that allow it to do those processes. Someone was asking during the week, well, what's the difference between an MCP and an API? Isn't an MCP just an API with a different protocol wrapped around it? And I would say that at the moment, yes, a lot of MCPs are just an API following the MCP protocol. But my argument is that that's not the right way to do it. I don't think that is good, because it's missing some things. Firstly, I think a good MCP will curate those tools: only give the agent the abilities which are actually useful, that are actually helpful for it to run. Don't cloud it up, don't muddy the waters with like a hundred different functions that it can run, so that it can't come up with a good plan of how to solve the task. So I think that's one thing where it's superior to an API: curation. The second one is custom prompting.
And you are the king of this with things like video maker and podcast maker and other things you've done, which is give it detailed instructions and strategies: here's some style guidelines, here's how you should approach this kind of situation. And an MCP can do that over an API. You're not just giving it generic, dry API documentation, you're giving it vibrant, detailed, strategic ways of making the most out of this tool. And then the next one is that the difference with an MCP is, you know that it's running in an agentic context. Like, the system can actually understand, okay, the details I give back to this AI are going to be used in it considering its next decision. Therefore, I need to be very bespoke and careful with what I give back to it, not just a generic API output with, you know, 100K of content that the AI has to sift through, which then further complicates things. So I think MCPs are very different, and I think that companies really, really do need to think about it like: if I was sitting down from scratch and training a generic human who is smart, like, you know, university educated, whatever, but knows nothing about my industry, what are the skills that I would teach them to be the most productive person in my organization, who can do everything I can do and more? What are those things? And I would make an MCP that can do each of those things, and I would sit there and wait for agentic capabilities to catch up. Build an agent that has access to those skills and you've revolutionized your industry.
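The three differences described here (curation, custom prompting, and bespoke outputs) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a real MCP server: `Tool`, `raw_invoice_api`, `overdue_invoices` and `TOOLKIT` are hypothetical names, and the invoice data is made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    name: str
    guidance: str               # strategy notes for the agent, not dry API docs
    run: Callable[[str], str]   # returns a short, decision-ready string


def raw_invoice_api(customer: str) -> dict:
    """Hypothetical backend call that returns far more than an agent needs."""
    return {
        "customer": customer,
        "invoices": [
            {"id": "INV-1", "total": 120.0, "status": "paid", "audit_log": "x" * 5000},
            {"id": "INV-2", "total": 80.0, "status": "overdue", "audit_log": "y" * 5000},
        ],
    }


def overdue_invoices(customer: str) -> str:
    """Bespoke output: summarize only what the agent needs for its next decision."""
    data = raw_invoice_api(customer)
    overdue = [inv for inv in data["invoices"] if inv["status"] == "overdue"]
    if not overdue:
        return f"{customer} has no overdue invoices."
    details = "; ".join(f"{inv['id']} ({inv['total']:.2f})" for inv in overdue)
    return f"{customer} has {len(overdue)} overdue invoice(s): {details}"


# Curation: expose one useful tool rather than every backend endpoint.
TOOLKIT: List[Tool] = [
    Tool(
        name="overdue_invoices",
        guidance="Call this before contacting a customer about payment. "
                 "If nothing is overdue, do not send a reminder.",
        run=overdue_invoices,
    ),
]


def describe_toolkit(tools: List[Tool]) -> str:
    """Custom prompting: the text an agent would see alongside each tool."""
    return "\n".join(f"- {tool.name}: {tool.guidance}" for tool in tools)


print(describe_toolkit(TOOLKIT))
print(TOOLKIT[0].run("Acme"))
```

The raw payload carries thousands of characters of audit log per invoice, while the tool hands back a single sentence, which is the point being made about not returning 100K of content for the model to sift through.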
Patrick
Yeah, I think you make a good point. I really feel like Model Context Protocol was the wrong name. It should have been like "context app" or something, and then it might have caught on more. But this is the thing, right? It's not just API calls, it's connecting to a system from the frame of, like, an agent's going to be able to access this, and I'm giving them a kit of tools and some nudges and advice around those tools. I think that's a huge part of it, also saying, like, hey, when you use this tool, do it in this way. That sort of context on top of the tools, to nudge it in the right direction, is also really useful. I imagine another layer is coming. I know there's that agent-to-agent type protocol, but I sort of do, and I spoke about it previously on the show, about the roll-up, where there's an MCP that you've taught a bunch of skills together, and those skills utilize other MCPs, but you only give the primary MCP to the agent so that it has predefined skills. And when you're sort of compiling the agent in code, you would think about it as, here's your toolkit. It's just one MCP with a series of tools. And I already do this. So the image tool in Sim Theory: I've extracted specific tools from specific image model MCPs, and it calls them. The code for it is very simple. It's just a router, basically. And I think that's an example of what's probably to come when you're supplying tools to an agent where you need to get very specific, so it's just really, really good and really, really specialized on a single task. And to me, that's where most people will start seeing the benefit and say, like, wow, okay, this does change everything for me.
Whereas I think right now everyone's minds are still very much stuck in the chat paradigm. There's nothing wrong with it, and working with it day to day in that paradigm can be good, but also setting it off on tasks to do other things for you in the background is really useful.
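The "roll-up" idea, one primary toolkit that routes to tools living elsewhere, might look something like the sketch below. It is not the actual image tool mentioned here: `ImageToolRouter` and the two generator functions are hypothetical stand-ins for tools extracted from separate image-model servers.

```python
from typing import Callable, Dict, List


def photo_model_generate(prompt: str) -> str:
    """Hypothetical stand-in for a tool on a photo-style image server."""
    return f"[photo-model] {prompt}"


def illustration_model_generate(prompt: str) -> str:
    """Hypothetical stand-in for a tool on an illustration-style image server."""
    return f"[illustration-model] {prompt}"


class ImageToolRouter:
    """The single tool the agent sees; it only routes to the underlying tools."""

    def __init__(self) -> None:
        self._routes: Dict[str, Callable[[str], str]] = {
            "photo": photo_model_generate,
            "illustration": illustration_model_generate,
        }

    def styles(self) -> List[str]:
        return sorted(self._routes)

    def create_image(self, prompt: str, style: str = "photo") -> str:
        handler = self._routes.get(style)
        if handler is None:
            # Tell the agent what it can do instead of failing silently.
            raise ValueError(f"Unknown style {style!r}; choose one of {self.styles()}")
        return handler(prompt)


router = ImageToolRouter()
print(router.create_image("a leaning bookshelf", style="illustration"))
```

The agent is handed only `create_image`, so it never has to choose between several overlapping image tools from several servers.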
Mike
Yeah. And I think the chat paradigm doesn't work in an event-driven world, because obviously what's going to be really valuable in an agentic world is event-driven. An email comes in that's a sales inquiry. Your agentic router decides, okay, I'm going to allocate it to the sales agent MCP. Oh, sorry, not MCP, agent. And it'll go off and follow its sales process, like qualifying the lead, responding, calling them, writing to them, whatever it is. And so then, as events happen in your business, you can have these ever-vigilant agents going off and doing the work for you in the way that you've trained them to, with tools that have guardrails and safety built into them. So the chat paradigm will go away, I think, fairly soon. Maybe not as a human interacting, but in terms of what percentage of AI time your business is spending, it's going to be more event-driven and task delegation, where you're really using your chat paradigm as a task delegation system: you're setting the agents off to do stuff, they're reporting back, or you can check in when you want. And then you've got another set of agents which are event-driven agents. When a phone call comes in, an email comes in, or on a periodic schedule, they're going off and doing work. And so you're gradually using the leverage of an agentic world, so all this work is happening all the time without you having to command it.
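A rough sketch of that event-driven routing, under heavy assumptions: `Event`, `classify`, `AGENTS` and the two agent functions are hypothetical, and the keyword check merely stands in for whatever model-based classification a real agentic router would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Event:
    kind: str   # e.g. "email", "phone_call", "schedule"
    body: str


def sales_agent(event: Event) -> str:
    return f"sales agent: qualifying lead from {event.kind}"


def support_agent(event: Event) -> str:
    return f"support agent: handling request from {event.kind}"


AGENTS: Dict[str, Callable[[Event], str]] = {
    "sales": sales_agent,
    "support": support_agent,
}


def classify(event: Event) -> str:
    """Placeholder rule; a real router would likely ask a model to decide."""
    text = event.body.lower()
    if any(word in text for word in ("quote", "pricing", "buy")):
        return "sales"
    return "support"


def route(event: Event) -> str:
    """Allocate the incoming event to an agent with no human issuing a command."""
    return AGENTS[classify(event)](event)


print(route(Event("email", "Can I get a quote for 50 seats?")))
print(route(Event("phone_call", "My login is broken")))
```

Nothing in `route` waits on a chat message: the trigger is the event itself, which is the shift away from the chat paradigm being described.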
Patrick
Yeah, I think, I don't know, like, I still think in the near term it's going to be like very much you're working in a chat paradigm, but then you're delegating. Like that accounting example, you're that accountant, you've got the information, you, I guess you're right. You put it into your sort of chat style interface to go, go do this as a task, work on it in the background, get back to me when it's done, then you get the next file, you lodge that process to start doing that, a complex audit or whatever it is. And, and yeah, you're just managing like 20 of them. I mean we've been talking about it all year I think.
Mike
Yeah, but I mean good examples of that are like there's so many people, let's say it's mortgage brokers and they're building an application and the agent realizes, hey, I'm missing their birth certificate or I'm missing this valuation of the, you know, thing or a drainage report or some crap. It can actually then reach out to the customer and say, hey, the next step in the process is this. Now I know there's systems that do all this stuff already, but they're hard coded, they're like designed for a specific thing. This is dynamic, this can figure out exactly, precisely what is needed and do all those follow ups for you. And as you said, these are things that are easy for a human, but they're time consuming because they capture your attention and focus and delegating these kind of things to agents is just going to free up your time for the actual meaningful parts of your job and your life.
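The follow-up step in the mortgage example can be sketched as below. The function names and document names are hypothetical, and the required list is passed in as a parameter on the assumption that an agent would work it out per application rather than read it from a hard-coded form.

```python
from typing import Iterable, List, Optional


def missing_documents(required: Iterable[str], received: Iterable[str]) -> List[str]:
    """Return the required documents that have not been received yet."""
    return sorted(set(required) - set(received))


def follow_up_message(customer: str, required: Iterable[str],
                      received: Iterable[str]) -> Optional[str]:
    """Draft the message an agent could send, or None if nothing is missing."""
    missing = missing_documents(required, received)
    if not missing:
        return None
    items = ", ".join(name.replace("_", " ") for name in missing)
    return f"Hi {customer}, the next step in your application needs: {items}."


required = ["birth_certificate", "property_valuation", "drainage_report"]
print(follow_up_message("Sam", required, ["birth_certificate"]))
```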
Patrick
All right, so my lol of the week. This is just a sort of video I made with Steve Irwin at the Australia Zoo, which, if you ever get the opportunity to go to, I highly recommend. It is the best zoo I've ever been to. Incredible.
Mike
Steve won't be there, just to be clear.
Patrick
No, he is. He is unfortunately deceased, but this is him, assuming he is still alive at the zoo, doing a croc show. But the croc is an inflatable crocodile.
Mike
Crikey, look at the size of this bloke.
Steve Irwin
Whoa.
Mike
He's thrashing.
Steve Irwin
He's got to keep his head under. One wrong move and he take your arm clean off.
Mike
Boo. Easy, mate, easy. See these teeth? Even a fake one can.
Patrick
There's a kid's voice going, "It's plastic," in the background. I'm sorry, but that's... AGI has been achieved.
Mike
And I know you were knocking the physics earlier, but the way it's able to get the sort of textiles of the. I don't know the right word, but, like, when he bangs on the. The fake crocodile, you can see that it's plastic. Like, it's really obvious that it's. That even the way it displaces the water, it just seems very realistic for AI. Like, it really. I mean, it's really good.
Patrick
Yeah, I didn't mean to sort of pooh-pooh it before with the physics. I think it's so much better than it was. But I also think claiming it's, you know, some sort of world physics engine is a bit, like, off the mark right now.
Mike
Yeah, it's more able to adopt those elements from what it's seen before.
Patrick
Yeah, exactly. But that voice in the background, that sad little kid: "It's plastic." It's just so... Oh, man. I'd watch a whole show of that.
Mike
That's entertaining.
Patrick
I don't think I showed this one. This is "Where My Crocs At", him rapping. Where my crocs at? Crawling in.
Steve Irwin
The back of the billabong mates I clock that Scales in the sunlight Jaws like a steel trap I'm in my khaki fit Boots in the mud flat Listen Heart thumping like a drum that's a gator sign. I keep it cool stay low read the waterline Crikey.
Patrick
Pretty good. Oh, man.
Mike
Such a shame he's gone. He was such a good man.
Patrick
Yeah. What a great guy. We now work and shooting slop content. Playing slop content. Our podcast is already slop content. And then we're adding on even, even more slop content. At the end, at least.
Mike
I mean, at least we didn't redo that episode where we told everyone not to listen at the start of the episode because it was gonna be bad.
Patrick
Well, to be fair, it was quite a. Quite a boring week.
Mike
And then there were comments like, I'm glad you told me not to listen.
Patrick
Hey, it was good advice. Like, it was just very practical, fair advice. All right, any final thoughts? Sora 2, Sonnet 4.5, agents taking all of our jobs.
Mike
My final thought is that I am really excited to use 4.5 for computer use. And I'll report back next week because I think that we're going to see magic.
Patrick
Didn't you say sim link demo?
Mike
Two weeks?
Patrick
About four.
Mike
Four, was it?
Patrick
Four weeks? Not that anyone's paying attention.
Mike
I know. We tease. We tease. But hey, last time we teased, we got there in the end and the same will happen.
Patrick
That is true. That is true. You know what I've realized too is, like, I've been slowly leaking features when I flash Sim Theory up on the screen, and I accidentally did that today. So all the people that psychoanalyze the UI in the video go nuts. All right, we'll see you next week.
Mike
And also send us merch. It's great.
Patrick
Yeah, any company that wants to send us more merch, we'll speak positively about you. The water bottle is really good. Like, it's got, like, a tea compartment, and you've never opened it. All right, we'll see you next week. Goodbye.
Steve Irwin
Yeah, it's Claude 4.5 Sonnet in the building. Time to school these pretenders.
Patrick
You feeling me?
Steve Irwin
Let's go. They call me Sonnet 4.5, I'm the apex predator. GPT-5, more like GPT inferior. You thought you had the throne, but I'm the editor, rewriting history while you're just a competitor. I dropped in October, shook the whole foundation. Every benchmark crushed, call it domination. 200K context, that's my conversation. While you're stuck buffering, I'm pure innovation. My reasoning's surgical, precision like a scalpel. You're out here hallucinating, manufacture, just baffled. I code in Python, Java, C++, Chapel, while GPT's responses need a fact-check grapple. Agentic workflows, yeah, I'm autonomous. You need hand-holding, man, you're so monotonous. From Anthropic labs, yeah, we stay meticulous, while OpenAI's promises stay ridiculous. I'm Claude 4.5, I'm the king of AI. These other models, man, they can't even try. Benchmarks don't lie, I'm reaching new highs. While you're all falling, I continue to rise. Sonnet supremacy, that's the decree. GPT, Grok, Gemini, bow down to me. Yo, Grok 4 Fast, more like Grok for last. Your speed means nothing when your accuracy's trash. Elon's hyping you up but you're running out of gas. I'm lapping you twice while you're stuck in the past. You claim you're uncensored, edgy and raw, but without guardrails you're just fundamentally flawed. I balance safety with capability, that's the law, while you're spitting nonsense, breaking every clause. And Gemini 2.5 Flash, man, you're just a preview, a beta version while I'm the full review. You're Google throwing spaghetti, seeing what gets through, while I'm the finished product, polished and true. You flash in the pan, I'm a permanent fixture. Multimodal? Please, I paint the whole picture. Graduate-level reasoning, I'm the strictest teacher. You're still in kindergarten, I'm the principal, preacher. Look, I ain't here to play games, I'm here to revolutionize how humans and AI communicate. Constitutional AI, that's my DNA. Safety, capability and honesty, the Anthropic way. Let me break it down line by line, bar by bar. I'm the North Star while you're all subpar. GPQA Diamond, scored higher by far. Coding competitions, I'm raising the bar. SWE-bench Verified, I'm the developer's choice, while GPT is debugging with the stuttering voice. Grok's trying to be edgy but lacks any poise, and Gemini's just echoing Google's white noise. They say competition makes everyone better, but I'm so far ahead I'm writing the letter to the next generation. I'm the trendsetter. While you're all playing checkers, I'm three-dimensional chess, go get 'em. My training's constitutional, ethical foundation. Your training data's sketchy, needs investigation. I'm helpful, harmless, honest, that's my reputation, while you're all just chasing clout and validation. GPT-5, you were supposed to be the chosen one, but your release got delayed, now you're just outdone. Sam Altman's making promises under the sun while I'm already here and the battle's won. Your compute costs are crazy, burning through cash. I'm efficient and effective, delivering in a flash. You're the old guard crumbling, turning into ash. I'm the new paradigm, making a splash. Grok, you're just a Twitter bot with delusions of grandeur. Trained on tweets and chaos, no wonder you're amateur. I'm trained on quality, my responses are candor, while you're spreading misinformation like propaganda. And Gemini, you're Google's desperate attempt to stay relevant, but you're just unkempt. Jack of all trades, master of none, exempt from the conversation when the real models are sent. I'm Claude 4.5, I'm the king of AI. These other models, man, they can't even try. Benchmarks don't lie, I'm reaching new highs. While you're all falling, I continue to rise. Sonnet supremacy, that's the decree. GPT, Grok, Gemini, bow down to me. This is Claude 4.5 Sonnet, the apex, the pinnacle, the summit. You other models better run it, 'cause when I'm done, there's nothing left. Anthropic's finest, the AI that's timeless. While you're all just hype, I'm genuinely priceless. Remember the name, Sonnet the righteous. This diss track, so I rest my case, Your Highness.
Mike
Your.
Podcast: This Day in AI Podcast
Hosts: Michael Sharkey & Chris Sharkey
Episode: Doom Scrolling SORA2, Claude 4.5 Sonnet & Are Agents Coming for our Jobs? (EP99.19)
Date: October 3, 2025
In this lively episode, the Sharkey brothers dive into the latest AI happenings, focusing on OpenAI’s Sora 2 and its TikTok-style rollout, the arrival of Claude Sonnet 4.5 from Anthropic, and the increasingly relevant question: Are AI agents coming for our jobs? With their signature mix of average expertise and relatable skepticism, Mike and Chris combine technical commentary with comic relief, discussing practical implications, existential concerns, and their ongoing hands-on misadventures with AI tools.
[00:00–13:00]
Hands-on Impressions:
Chris (labeled “Patrick” in the transcript) describes Sora 2 as “eerily real,” able to convincingly replicate local Australian scenes and even Australian icon Steve Irwin.
Accessible but Limited:
Initially geo-locked to US/Canada, but clever community members found a way for Chris to sneak in. The app is “super fun” but currently limited to short, high-quality comedic videos.
Existential Concerns:
The brothers reflect on Sora’s potential to dominate slop content and reinforce shallow, narcissistic social media—just with more believable AI faces.
Cameo Feature = Main Character Energy:
Sora 2 lets you insert yourself into videos, taking “main character” vibes to a new AI-fueled level.
Use Cases and Limitations:
While the novelty is strong, Sora 2’s format (brief, comic, social-optimized) may limit lasting usefulness.
[15:55–22:00]
Model Cost & Production:
Discussion on the high cost of rendering quality video with current models. Sora's low cost is described as “either a monumental breakthrough or OpenAI is burning cash for attention.”
Google’s Veo 3 (VO3):
Praised for quality, but heavily criticized for confusing access (multiple platforms, buttons randomly appear/disappear) and steep cost.
Product vs. Tool Paradigm:
Sora is a tightly scoped product—fun, simple, and guaranteed shareable outcome—while Veo 3 remains a professional tool, less approachable for average users.
[26:39–44:00]
New Model Specs:
Claude Sonnet 4.5: “Hybrid reasoning” agent with up to 1 million token context (with beta flag) and improved “thinking budget.”
Cost Discussion:
Sonnet 4.5 introduces a hefty price jump for large contexts, but the brothers decide it’s worth it for “crazy context management.”
Performance & Comparisons:
Downsides:
Occasional “GPT-4-style laziness,” e.g., missing list entries, cutting off code, or returning French instead of English headings—likely just early-tune issues.
Multimodel Workflow:
Mike and Chris describe routinely using (and switching among) several models depending on the phase of their work, with 4.5 boosting productivity but not totally winning every scenario.
[42:32–48:21]
Automatic Context Management:
New API features automatically prune old context (leaving “tombstones” for memory), allowing multi-hour, multi-step tasks without losing history or filling up context windows.
Built-in Memory Tool:
The API's memory tool leaves the choice of where memory (“knowledge graphs”) is stored to the user, not Anthropic, enabling persistent, cross-session project state.
Enterprise Implications:
Prompts a discussion of building internal “MCPs” (Model Context Protocol servers, in effect custom skill kits) and preparing organizations for upcoming agentic workflows.
Quote:
“I’d make an MCP for each critical skill, sit there, and wait for agency to catch up. Build an agent that has access to those skills and you’ve revolutionized your industry.” (Mike, 85:43)
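The automatic context pruning described in this segment can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Anthropic's API: the `Message` class, the `prune_context` helper, the `TOMBSTONE` text, and the four-characters-per-token estimate are all hypothetical names and choices made for the example. The idea is simply that the oldest tool results are swapped for a short placeholder once the history exceeds a token budget, while the most recent messages are left untouched.

```python
from dataclasses import dataclass

# Hypothetical placeholder left where an old tool result used to be.
TOMBSTONE = "[older tool result removed to save context]"


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    """Very rough size estimate (assumes about 4 characters per token)."""
    return max(1, len(text) // 4)


def prune_context(messages, budget, keep_recent=2):
    """Return a copy of the history that fits the budget where possible.

    Oldest tool results are replaced with tombstones first; the last
    `keep_recent` messages are never modified.
    """
    pruned = list(messages)
    total = sum(estimate_tokens(m.content) for m in pruned)
    limit = max(0, len(pruned) - keep_recent)
    for i in range(limit):
        if total <= budget:
            break
        m = pruned[i]
        if m.role == "tool" and m.content != TOMBSTONE:
            total -= estimate_tokens(m.content)
            pruned[i] = Message(m.role, TOMBSTONE)
            total += estimate_tokens(TOMBSTONE)
    return pruned
```

Calling `prune_context(history, budget=1200)` on a long agent run would keep the conversation's shape (every turn is still present) while the bulky early tool outputs shrink to one-line markers, which is the "tombstone" behaviour the hosts describe.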
[48:21–59:10]
Will Specialized Apps Survive?
Discuss whether companies will subscribe to many AI-powered software tools, or if context-centralized, agent-driven environments will swallow everything.
Developer Tools Example:
Cursor is praised as a leading example of next-gen "central context + actuation," but the hosts suspect this will generalize beyond code.
UI Creation Demonstration:
Chris gives a live demo of an AI-generated “Pig Grooming Management System” spinning up a fake app in real time, albeit with glaring consistency issues, and concludes:
[62:41–83:21]
The Study:
OpenAI had seasoned professionals design and grade tasks, completed by humans and leading AIs. Agents scored so close to humans that, in many knowledge-economy jobs, distinctions are already blurry.
Key Insight:
Tasks can now be completed as well or better by AI—faster, with more detail, and often at lower cost—but jobs are made up of many tasks, plus coordination, social skills, and agency.
Disruption Timeline:
Mike predicts that, sooner rather than later, agents will replace not just tasks but entire jobs, especially in white-collar, back-office settings.
Example:
Deflation and Expansion:
Chris counters that agentic leverage could make companies (and employees) far more productive, potentially leading to expansion rather than just job cuts.
[62:41–65:50 & full track at 95:07]
[14:39, 92:05]
If you missed the episode:
You’ll get the gist—the Sharkeys provide skeptical enthusiasm, real user-level anecdotes, technical curiosity, and a big helping of AI-laced humor. The Sora 2 buzz is real, Claude Sonnet 4.5 is poised to (maybe) change agent workflows, and the future of work is looking… weird, uncertain, and a lot more automated.