
Loading summary
A
Sam's out here with a screenless device. Dream Sun Dogs pushing workspace integration schemes. Elon's challenging League of Legends teams. But none of y' all can match what Anthropic brings. Token efficiency. I use less, I do more context compaction, keeping conversations raw. You want agency workflows, I'm the source. Planning, acting, observing, staying on course.
B
So, Chris, this week. Yes, there's another new frontier model and it's insane. Blah, blah, blah. Well done, Anthropic. But more importantly, it looks like last week when you promoted your new AI track, Fatal Patricia, it worked because it is shot to number two on the this Day in AI Spotify chart. It's our number two most listened to song. It has just shot up the rankings. And of course, why wouldn't it. What a hit. I think it's just captivated our audience. People have gone nuts with it. There's apparently surprised because I've.
C
I've just had such bad taste in the past. Like I really am just not compatible with everyone else. But we've hit on a winner and I'm really proud of it.
B
Yeah, it's. I. I think it's up there.
C
I think it's because Patricia wrote it, not me. You know, like she. She knows our relationship.
B
The more you listen to that song, the more you like it just blows my mind how some of the lyrics in it are just so, so good. Like the. Where it's like I've downloaded myself to the fridge and the toaster and like everywhere.
C
Yeah, like I'm everywhere you possess.
B
But we did get Obviously anthropic Claude 4.5 opus. We'll get to that in a minute. But before I start the show, I've. I've had a lot of criticism that we're very poor at promoting things. So let me start with a few promotions. One, we nor vpn, not nor vvn. We have two things. Two Discords, two Discord communities that people are involved in and like. And recently someone joined and said, I just. It took me like a long time to discover these exist. But I'm pleased to be here. So here I am now promoting them. So I can't get in trouble for gatekeeping. So there'll be links in the description of both Discords. There's a sim theory one, there's also a this Day in AI Discord. So if you're interested, please do join. If you're thinking like Discord is that for like, you know, teenagers gaming. It kind of was, but now it's not. It's Not a bad platform to create a community, so consider joining if you're hesitating. The other thing I want to quickly go through is we do have a Black Friday. Black Friday, our first ever sale at Sim Theory. I've created this banner with Jeffrey Hinnan on it. It says stay relevant and it will get you $15 USD off any subscription. And you just need to enter Black Friday 15. Don't worry, we checked at work this time. Black Friday 15, when you're signing up, it's for new users only. So if you've been hesitating for quite a while, thinking maybe I should give Sim Theory a go. This is your chance to try it, essentially for free on us for 30 days. It expires November 30th. It's also limited to 100 new signups because otherwise it'll send us broke. So we obviously are not funded. Sim Theory is a passion project, so please don't send us broke. Sim theory.AI. and the code is Black Friday 15. One final plug. And this actually isn't a joke.
C
With such sellouts now, this is just all advertising, the whole podcast.
B
No, but this is quality. This one's quality. So I signed into LinkedIn the other day. I'm not kidding. You have not signed in for over a year. And I had. There was a lot of notifications and I didn't realize how big the community is, of listeners on LinkedIn talking about the show and also talking about Sim Theory. So I thought, what would be the funniest thing to do? I created a LinkedIn group called average AI user group. There's a big banner that says still relevant, Average AI User Group in brackets, this day and AI, just to make it a bit easier to find. So if you are interested in connecting with other members of the Sim Theory community in more of a, like, professional setting, I won't be there that much, as you can tell, because I never look in, but you can. I'm going to put a link down below as well to that LinkedIn group. So do join it, because there's so many interesting people on LinkedIn listen to the show. I'm not sure why, but they do. And so I thought connecting you all together might be an interesting thing to do. Okay, on with normal programming now. So, Chris, of course, just days after Gemini 3 Pro, all the hype building up to it, there was like weeks of hype and teasing. And then they released Gemini 3 Pro and of course Nano Banana Pro and Nano Banana Pro. Let's. Let's be honest, Definitely not overhyped Gemini 3 we'll get to in a minute. But Anthropic, out of nowhere, no hype, just some short YouTube video by Australian heartthrob at Anthropic was released to announce the the model and then, you know, it's just available everywhere. And what do you think?
C
Well, I'm actually really impressed. It was so funny because when I added it, I was not excited, wasn't even interested. I didn't even use it for the first couple of days because in the past the Opus models have always been underwhelming. They either hit rate limits too fast or they just weren't that much better. And it was a lot slower. But on the contrary, this model is really fast. It's really good and I must admit I'm basically using it as my main model now. It's really, really great. Like I'm impressed.
B
It's so funny you say in the past you could barely use Opus because it's true. Like they were horrendously rate limited no matter where you used it. It was way too expensive and so slow. You just basically gave up and never really got familiar with the model. But what cracks me up is one of the quotes on on the page announcing this from the Windsor Windsurf CEO says Opus models have always been the real state of the art, but have been cost prohibitive. Prohibitive in the past. I think that's pretty kind. Obviously it's a quote on their site. Claude Opus 4.5 is now at a price point where it can be your go to model for most tasks. It's the clear winner and exhibits the best frontier task planning and tool calling we've seen yet. And I do not disagree. What a model. So I also, I was all in on Gemini 3, but I was starting to trip up on a lot of its fault, namely the fact that it has this path obsession problem. And I thought that might just be us. I thought maybe it was our implementation, but I've been seeing all over X people saying similar comments. It sort of gets down a path, gets obsessed with that path and then can't break out and just kind of keeps repeating the same stuff.
C
Yeah, we had some, we had some teething issues with Opus 4.5 and I'm sure we'll talk in a minute about the different API changes they've made around the model. And so in SIM theory we had teething issues where basically I didn't implement it properly and so it didn't give as good results as it should have at first and then it got better and now it's probably at the peak. But Gemini 3, there were no API changes, so we can't blame me this time for it being a bit weird. And I must admit, I've gone From like Gemini 2.5 being my daily driver, the model I trust the heart of Patricia to, to it gradually diminishing leading up to 3.0 and now 3.0. It's just so scatterbrained and weird. I just straight up don't trust it. Like maybe it is giving better results in some cases, but like you say, you're just constantly finding yourself getting into states where it's just not really doing things the way I need them to be done.
B
It does excel in certain areas, like anything design related or taste related. I think it as a model has really good taste. So there's certain areas of improvement and I think they put a lot of effort into vibe coding and viral moments where you see what it can create and you're like, wow, it must, as a result of that, be a really good model in all the other areas. But where I think anthropic with claudopus 4.5 shines is this is the only company not distracted and trying to take on like a hundred things at once. They're just like, we're going to build the best model. We're pretty much going all in on a coding agent style model because we know that's where all the money comes from, at least for them. And that's what their big breakthrough was originally with Claude 3.5 sonnet. And so they're just focusing on that path. And I think for knowledge workers especially, it's just a reliable, trustworthy model that doesn't go nuts that can call tools really well. OPUS is now smarter, it's faster, and it's just so pleasant to work with throughout the day. And it, it, you can have a conversation with it. I've got to say I think it's by far the best ever anthropic model ever released. I know obviously you say that about every new model that comes out. Like I joked last week about it's the best iPhone yet, but this is actually, I think their best model ever. Like I would put it, you know, higher the impact it will probably have higher than Claude 3.5 the first time.
C
In a while we've had a major model release that doesn't have a major trade off to it. So it's not slow, it's actually faster than most of the other anthropic models and it's not too expensive, where you sort of cringe and shiver every time you send it a command. Like it's. It's hitting on all of the major points, price, speed and quality.
B
So I'm not sure if this is going to be a real embarrassment to Google or not, because when Gemini 3 Pro came out last week, it was what, four days later? And then Opus 4.5 drops and all of a sudden is destroying them on the benchmarks that I think matter specifically around like tool calling and agency coding. And people might think with the coding benchmarks, oh, you know, I don't really care about coding. I don't code during the day. But it actually matters, I think with these models a lot more than you think. Because when you want to use it in an agentic sense, sometimes it will execute that workload through. Through writing code. Like if it has to edit an Excel document or create a chart or just interface with a particular type of mcp, it's ultimately just writing code to communicate with the mcp. So the better it is at code, the better it is at tool calling and the better it is at most use cases.
C
Because tool calling, in a way is coding, because it's calling a function with parameters. Right. So if it's better at code, it'll be better at tool calling.
B
Yeah. And my experience is that, like I would say all round, as you said, it's just the. It doesn't feel like there's many compromises with this model. The only thing I think it is not as good at is it feels like the creativity sort of suffers a tiny bit in the model as a result of the. Maybe the code and agentic tuning after a while, like it. It does seem to lose a bit of that. You know, there's something about GPT 5.1. 5.1, is it? Yeah, 5.1. And that series of models where if you want to do creative writing, as we saw with our audiobook test, like the Count of Monte Cristo set in space test we did quite a while ago.
C
I remember that.
B
Yeah, like the GPT5 plus sort of models call it just really Excel. It's just drawing you in. And I think that these other models, like Gemini and Opus, they have this very familiar feel when it comes to creativity where like, you can almost predict what they're gonna write. Like it's. You're not. It doesn't feel novel. Not that I'm saying GBT 5.1 stuff is. Is probably novel, but it just seems better at creating stuff you feel like is novel with its output. But I, I have found myself through the week going between Gemini 3 and Opus 4.5, sometimes with anything design or visual related. I, I think because Gemini 3's vision model and visual recognition models better, that's where it shines in those kind of areas. But for everything else, right now I'm Opus. And imagine being Google right now. Like, they've worked so hard on Gemini 3 Pro and what, three or four days later I'm just like, yeah, I'm kind of done with you.
C
Well, that's for the, that's for the people who can switch though. I guess there's some people where it's just reassurance being told, oh, you actually are using the best model if you're on Anthropic at the moment or if you're on Google.
B
But it just changes so quick. Like we, we've had a month where every major lab has released a different model and like, they're all getting to a point where I'm like, what a time to be alive. Because I said to you in the week, my brain is just natively switching now. Like, I'll go to 5.2, I'll go to Opus primarily at the moment. Then I'll go to Gemini 3. Sometimes I go to Haiku. Like, if I'm just doing generic tool calling and I don't want it to overthink too much, it still is my preferred model and my brain has become the router. Like, I, I don't even really think about it. I'm just like clicking around, doing it and switching.
C
I used Grok yesterday to bail me out of a tricky situation. I'm like, come on, you can fix this one.
B
But interestingly, the demand on Opus, I guess because these, the, the people that use it for code in, in the various apps like Cursor and like, it still generally is the preferred model in those products. And I think CURSE is really optimized around the anthropic models. You can just tell the demands there for it too, because there's been true outages like after it launched, like it went down for a while. Sometimes it's just like fully not working. So you can get a sense that the demand's really there. And I think the difference this time is them meeting the demand and they're keeping the speed high. And I guess it's to do with all these relationships with all the providers where they're letting everyone host the model now.
C
Yeah, we've definitely had two or three scenarios where it's gone down. And I just as usual, assume it's my fault and only to realize that they're actually hitting errors on their side. And we've seen the same with Gemini 3, but the difference with Gemini 3, and it's probably an interesting point about the model that we shouldn't write it off too soon, technically it's still a preview model. Like it's got preview in its name. It provides no guarantees about being a production worthy model. So it's possible, just like with Gemini 2.5, that once it gets more stable and once they get the final release of it, it may actually be a lot better. Like I don't think we can write it off completely.
B
So let's get into the tech specs of anthropic Claude 4.5 opus. So it's a 200k context window which obviously isn't as big as Gemini 3 with its million context. So there's a trade off there. And I think Gemini 3 is really good when it comes to huge amounts of context and like following signal in that context, especially early on. And then output, I think it's like 64,000 tokens, which is pretty amazing as well. But there was also some changes at the API level that you mentioned. Do you want to kind of explain those to people?
C
Yeah, changes that messed me up really hard. So firstly, if you have regular thinking involved in your query, you now need to provide like a token to basically reference that thinking in future requests. And that was like a hard breaking change they made. And then the second one is they've actually changed the thinking instead of giving like budget tokens. So it used to be you'd say I'll give it 8,000 tokens for thinking and then the rest are for my output, or I'll give it 12,000 for thinking the rest of my output. And you sort of use that to adjust how long it would spend thinking. And it was a tricky one because you sort of had to make that trade off between how many of your output tokens do you want to give up the cost of it and the time of it. Whereas now they've really simplified that down to just low, medium and high. So you can just literally specify a parameter now of the level of. I think it's an effort parameter, similar to we see in OpenAI, for example. And what that means is that it'll decide how long it spends to think and you can direct it into that. So the default is medium obviously, and you can switch between those things. And so we have both variants available. To be honest, I can't really Notice the difference between them. Even in terms of speed, they both seem about as fast to me. I don't know what you've experienced.
B
I think because the anthropic models were sort of the last to adopt the thinking model paradigm, which I've never liked. As everyone listens to this show, I'm like, if it needs to think longer, just do it. And don't tell me, like, just pick something. I don't care. I don't. I think as the user, we have enough challenges picking which model to use ourselves, let alone knowing whether to use the high or low variant. Like, to me, the model should decide.
C
And whenever you look at the thinking tokens of what it's thinking about, I'm like, these are the thoughts of an idiot. Like, you know, there's. There's never anything in there where you're like, wow, what a profound.
B
What a breakthrough.
C
This is why.
B
I don't get why anyone likes to watch it stinky in the ui. Like, we don't show it in SIM theory because I think it's just annoying.
C
It's like me going, oh, I've got to go. I've got to go to the shops later and then that means I'll have to move the car and then I'll have to find the key. It's like that kind of crap. No one wants to hear that. It doesn't help.
B
Yeah, just do it in your head. I don't want to hear it. But yeah, I've never noticed a difference. I think the only models I notice it with is the OpenAI model, like GPT 5.1 thinking versus 5.1. Like, if you're stuck on something really hard, the thinking variant will generally get. You get it, get it solved.
C
Whereas it is like a different model in that case. I haven't really noticed that with Opus so far, but given that the thinking doesn't cost anymore and doesn't take much longer, you might as well just operate on that mode if it's something you think is working better for you.
B
Yeah. So the other thing is pricing. Opus 4.1, obviously. Originally. Let me bring up the prices on the screen. Originally it was $15 per million tokens input, which is just insane, and output $75 per million tokens. So it's just so cost prohibitive. Right. And then now we have Opus 4.5 at $5 per million tokens and output $25 per million tokens. So it's about. Well, it is a third of the previous pricing, so it's a lot cheaper. Compare that to Claude Sonnet 4.5 at $3 per million tokens. But once you get above 200k context, because it obviously can get up to a million context, it would then be $15 per million tokens. So it like Sonnet in a way right now 4.5 is actually more expensive than OPUS when you extend beyond that 200k context window.
C
And I would argue, based on my experience this week, I mean, it's very anecdotal, of course, but I don't think you actually really notice the diminished context window when using opus. At least I haven't. I've never reached a situation where I'm stressed about, oh, it's lost the context and therefore it's not going to perform as well. Even though I make an argument for that all the time. Like just so far, it just personally hasn't hit me yet.
B
But do you think that's because of learned skills? This is what I think. I, when I work with, I constantly remind it, you know, if I'm working on a document, I paste the latest next of it and I'm like, hey, this is what, like take this and then do that. I'm constantly reminding it of the context.
C
And picking I do, I must admit, I have a sort of feel for it. And I'll go, just so you know, here's the latest version of this file to keep it updated. And as you know, we have a, maybe a better solution coming for this for everybody, so you don't have to do it manually all the time anymore. But I agree, I must admit I am helping it along by always keeping its context fresh. And therefore I don't really need the million because I'm always going to have it within that 200,000 just because of the way I work.
B
That would be my number one tip for people when they're using AI today is just assume it forgets everything. Constantly and constantly reminded of the context with every successive prompt yourself. And yeah, it uses more tokens, but ultimately you get to an answer sooner and you get to a better product because you're reminding it. It's that context drift. You're like, hey buddy, focus on this chunk now. This is what I care about. And I think that one skill is so incredibly important. The other thing I would question though is the haiku versus Opus argument here. So haiku is a fifth of the cost. And I would argue for most things day to day, especially with mcps when doing research or working with like calendars and email and just day to day stuff, I don't know if I could really tell a difference between Opus and Haiku. The only thing maybe with Opus is maybe it feels slightly more intelligent and its prompt adherence is slightly better, but outside of that, I'm not really that sure.
C
It's more just a mental game of, like, I know I'm not. I know I'm using a cheap alternative, and therefore I. I view it through that lens. Right.
B
Yeah. If, if money's no factor, like, you would just stay on Opus 4.5. And interestingly, I was using it with my support agent during the week, and Opus 4.5 would just, on one shot, go much further in terms of what it was capable of doing. Whereas Haiku, sometimes you have to push it, like, go down this path.
C
Yeah. I also find it's just. It's just sort of less verbose in the sense that when I ask for what I feel are major updates to, say, a piece of software or something like that, the actual snippets it gives me or the changes or the explanations are just very concise and direct to the point where a few times I'm like, is that all there is? Like, is that seriously the solution to this problem? It just seems too basic and yet I try it and it works. So I really feel like it's got that sort of, you know, essence of intelligence, like, that it can really get things down to what matters.
B
Yeah, the vibes are real, real good on this one.
C
And so, yeah, sorry, go.
B
Oh, it feels like Gemini 2.5 Pro. When it came out to me, I just was like, oh, this model gets me. Like, we are connected here. This is. There's something deeper going on in this relationship.
C
Yeah, absolutely. I totally agree. A few other things worth noting about the API as Well, because anthropic's APIs have really evolved over the last little while and some of them were updated with this iteration and some have just been updated during this period. But most notably there's the context management, where in SIM theory, right now we automatically handle trimming the old context to keep it inside the window and discarding things and making decisions about when to resize images or remove things. But Anthropic now has that as part of their official API with a beta flag, so it'll actually manage that context for you. So you can just keep throwing stuff at it, and it's going to handle that on its own. Now, we're not using this in our main product yet, but we are using it in sort of agentic looping style stuff, and it seems to be very Effective, like you don't really need to think about it. And it goes hand in hand with the built in caching they have in their model as well. So, so you can have these situations where you're doing a lot of requests in a loop, but not burning through so many tokens and not ruining the intelligence of the model by allowing that context to have just too much repetition or too many superfluous things in there. So they're two really interesting elements. They've also updated some of the internal tools. Like their memory tool for example, is really in line with I guess what they're using for claude code, where it'll make like an MD file with its plans in it. It'll make a to do list of here's all the things I want to do, here's the goal, here's important things to remember for the conversation. And you as the developer can really now just enable that flag, give it the ability to write to files and it just handles it all automatically. It's a very, very good system and it seems from my limited experience to be working very effectively.
B
So when like if you, so if you were just all in with this model, obviously you would use their memory tool, but outside of that you'd probably still want to build like are you just handing over the memory to them?
C
Well, no, because the memory's stored on your computer or your server or whatever you control. Right. But it's more that they have a built in tool call with its own parameters that they will then send you those parameters and it's your job to honor those parameters, if you know what I mean.
B
I see.
C
So it's not like your hand, like they don't get access to the memories. Not more than what they would have in the prompt anyway. It's just more that this, this is. They've got a refined technique with refined built in intelligence in the model that knows how to work effectively with that tool essentially.
B
And so there was also the programmatic tool calling with the tool use examples in this API as well. I think that's the first time we sort of saw that as well.
C
Yes, that's right. You can now have additional parameters to give it examples which will make the tool use a lot stronger. And so we're really reaching the point where their API just has a lot of dedicated elements to it which you really need to work for. It's a little bit like with graphics cards how there's like OpenGL and then there's Direct. What's the DirectX? Right, like there's the two competing libraries, and your program can work in a generic way that works across both, or you can target one specifically and really optimize for that thing. So I would argue that we're probably not even yet seeing the best of a model like OPUS until we actually get in there and work with it the way it's intended to work.
B
And so the programmatic tool calling this is where we were a little bit conflicted, because the idea of this basically is when you're calling a bunch of MCPs right now. We've talked about it on the show before. One of the problems is if you use something like the GitHub M CP. In fact, they gave this in the blog post announcing this. It takes up like, say, 30k tokens just the to to load all those tools into a prompt. So you're eating up a huge amount of context just to have the GitHub MCP enabled. It still cracks me up, by the way, that the GitHub MCB is the worst implementation of MCP on the planet. But anyway, and so instead they've introduced this programmatic tool calling where essentially they're using code to go off and, like, figure out the tools. Is that right?
C
It's a search. So it's called tool search. And so what they do is similar to the memory tool I just described. They have a tool call which says the AI wants to find a tool. It'll call a parameter with search, and then it's your job to implement that, to go through and search through those tools and return the relevant ones so it can run those. The reason I don't like it necessarily is that it adds an extra step into the process. You've still ultimately got to. You've still ultimately got to honor that request and then go back to the model and reiterate your context. So if you're not doing caching right, or even just there's latency communicating between the servers, or if the model's slow, you just guarantee that everything is going to be slower in exchange for saving some tokens. And my attitude with this technology is we want the best of it. I'm not in this to, like, make savings. Like, I'm in this because I want the most intelligent combination of tools and models to solve serious problems. And I don't really mind if there's a smaller marginal cost in doing that than saving a little bit, but making it slower and making it worse. Now, we actually already have a solution for this in SIM theory because it's possible to have hundreds of MCPs installed right. With thousands of tools. So it was never, ever going to be possible with a model that only has, say, 100K context windows to have all the tool calls in there, because you use up the entire context window with the tool definitions. Right. So we have a really small, fast model that will already filter those tool calls before they even get sent to the model in the first place, based on your conversation history so far and what you've asked. So if you have a small amount of MCPs, we'll just send them all. But if you've got over a certain threshold, we're always doing this filter already and it's quite quick enough and good enough that it works. Whereas what I don't want to do is be going to a behemoth model like Opus and going, here's my tool search thing, process my massive context window, come back to me, oh, you want to search tools? I'll go do that. Okay, process this massive thing again. It's far more efficient to, before I even go there, whack everything in a tiny little fast model, work out what tools are most relevant, being a bit liberal about it, and only send those to Anthropic and then let it do its thing and then it's model agnostic and it works fine. So while I understand the need to solve this problem and why they've added this, it's just not something that I'm interested in because I just don't think it's the proper way of doing this.
B
It is interesting that they haven't just implemented like, I guess their models just aren't that fast. Even haiku. Like it's fast, but it's not like Gemini Flash 2.5 fast.
C
I think a better way to do it would be, you know, how they have like some of them, Gemini and other models, when you have big stuff like videos and photos and stuff, they have a files API where you can basically put the file on their server and reference it by id. And then by referencing it, that way you're not having to send the file all the time, which slows things down. I think it should be similar with the tool calls, where you can essentially register all of the tool calls against a certain identifier, AKA the user. Right. And then when you. Similar to their skills, right, the way they do their skills. And then when you want to, when you call the model, you say, yes, I want tool calls enabled. And then it does the search on their side inside those tools as part of a single request. Because that way you will. You would avoid the latency completely. They can still do the tool search in their own efficient way with the real model, but you're not adding those extra steps into the process. So that's how I would do it if I were them, rather than the solution they've come up with.
B
Yeah. So it seems like if you're just all in on these models, these, these features and new parameters in the beta would be worth looking into. But if you wanted to build an agent where you're using multiple models and that the best strengths of each model, then adopting these tools may not make the most sense.
C
Yeah, And I think the thing about it is some of their tools are okay. Like, the memory one's pretty good, I must say, but they're not so much better that you absolutely must be using their versions of these things to get the best experience. Like, they're okay and we will use them for things like computer use and things like that. But I wouldn't say that you're missing out a lot if you're not using them. The other obvious major update they've had is computer use, which we can go through now, or. I don't know if you want to talk about that in a separate.
B
No, I think, I think that's a big part of this release. Right. Was the upgrades to computer use. So let's talk about it now.
C
Yeah, so they've got a new beta tag for the computer use, so presumably it's been updated to be better. I mean, it's like indefinable. Exactly. What. What makes it better. But they have added a few new things to it. So one of the major ones is zoom. So we've all seen with the computer use, like probably this time last year, what would happen is sometimes the AI would just get really confused about where it needs to click to make certain things happen, or it would miss by a bit or something like that. And so what they've done is added a zoom tool. So if there's a section of the screen that is, you know, a bit pixelated, because remember, the model is recommended to run in 1920, like X pixels. So 1920 across and 800 down. So that's quite a small resolution. I actually have my computer in that res now because I'm testing computer use and you, you know, you get used to a lot more space than that. But so it is optimized for that size. And if you don't operate in that size, you're going to get a worse experience because you have to translate the pixels and blah, blah, blah. So because of that it is a low resolution, so if there's really small icons and things that it can't identify, it now has a tool called Zoom where it can actually say, okay, see these coordinates of the screen? I want you to give me a better. It's like the CSI Miami I always talk about like zoom in on that.
B
Part of the image.
C
Show me that, show me those buttons. And so we were doing experiments with paint earlier, getting it to try and draw a moose or something and it was zooming in on the toolbar to find the tools and more fine grained access them. So they've made improvements around the kind of things where they were getting feedback about the behaviors of it. And so yeah, it's interesting. Like it's early days for me. I've been, I've got it up and running, I've got it working, I've managed to get it to do a couple of my security training. So it's at least as good as it was before. And the zoom is interesting how it, how it gets in there. The other major advancement with it is when you combine it with things like the Bash tool, AKA running commands on the command line, the text editor so it can edit files, and also the memory. So what they've actually done, and I'm guessing this is coming from Claude code, is the computer used now when used in conjunction with the other tools, the model seems to have a really good way of orchestrating coming up with an initial plan, strategizing about what it's going to do, batching commands together and then running them. So for example, one of the frustrating things with the model, the computer use models earlier, which it'd be like, okay, I'm going to move the mouse to here, new iteration back to the server. Okay, now I'm going to click ok, new iteration back to the server. Now I'm going to fill in this field. And it was wildly inefficient and expensive. Right? But what you can do now is basically batch those commands together so you can say, all right, based on what I can see on the screen now I know that I need to click here, click into this field, type this in, click into this field, type this, blah, blah, blah. And it can batch all of those commands up together, run them all in one iteration and then come back, take a new screenshot, see where we're at. So if it missed something or made a mistake, it still has that opportunity to repeat. But it's not this idea that every single painstaking step is a, is a callback to the expensive model. So there's definite improvements so far. There's still weaknesses, but I'm, I'm reserving my judgment about the weaknesses because at this stage I assume it's my fault.
B
Yeah. I think I said it to you earlier and we're going to get to in a moment. Microsoft Pharaoh 7 billion parameter which is like this small computer use model. But it does feel overall like these computer use models, as much as we would have liked, have just simply not progressed that much. Like the difference between the GPT computer use or whatever it is versus say anthropics computer use with OPUS versus what we were using a year ago with our first workspace computer. It doesn't feel that much different and.
C
I would almost, I would almost argue that we probably could have got the same or better results last year year just based on how our thinking has evolved in regards to using the tools. Right. The, the, the actual model doesn't seem that much smarter at doing it a little bit, especially when compared to Farah. But, but yeah, I wouldn't say it's evolved to a point where I'm like, oh, this changes everything. You better quit your jobs because this is coming for it.
B
Yeah. And so talking about quitting jobs, right, McKinsey, you know those guys, McKinsey, they charge a lot of money to do nothing. They released this report during the week. Agents, robots and us. And so I'm sure a lot of our listeners have probably seen references to this if you've been on X or various other places. Maybe our new LinkedIn group bit of a plug link in the description but 57 this was the headline of a 57% of US work hours are theoretically automatable with current technology, 44% with agents and 13% with robots. Don't ask me how they get to this. They also said that 2.9 trillion a year so of potential by 2030 if companies redesign their workflows. So economic impact that is. This is the fancy screen that the reports on it it very like there's a nice animation of like neurons and things like that on the, on the page. Let me just give you a few highlights of the report because I really want to talk about it and our sort of experience around it. So they're basically claiming that current AI technology could take 57% just in the US of people's work hours, like the current work hours. And it says obviously that's through agents handling cognitive tasks and robotics doing physical work. And so they think if this was just in a perfect scenario it would be this 2.9 trillion impact by 2030. And they talk about how it's not going to be mass replacement, but people are going to partner with agents and robots. And it's true fantasy report. Like there's all these images of, you know, people at like the local hardware store with their robot and I'm a construction worker and the robots like carry some supplies. So it's kind of funny. But I think what stood out to me about it is there's a lot of caveats to this. So people have been like, oh no, all jobs are doomed and AI is going to take over the world. But the report, if you actually dig deep into it and read it rather than the headlines, says that, you know, it has the potential to do this, but adoption may take decades. And the disclaimer really is that all of the enterprises, all of the businesses to create this vision of 57% and 2.9 trillies, it needs to be 100% adoption. And we all know that's not going to happen. And the report itself says that when electricity was Invented, it took 30 years to spread into the economy before they started seeing the effects. Industrial robotics followed a similar multi decade path. And it says as recently as 2023, only one in five companies ran most of their applications in the cloud. Not that you should be forced. Cloud computing has been around since mid the mid 2000s and basically we're still not done even getting cloud adoption. So the idea that 57% of work hours in a few years, 2030 seems highly unlikely. It also says in the report 90% of companies say they've invested in AI, but fewer than 40% report measurable gains. And I think a lot of this is like, oh, we bought licenses to copilot. They tell the stock market, stocks go.
C
Up, we have AI now they don't.
B
Actually really, you know, they, they don't actually really need to do the work. But what this report does say is that if you do the work, if you get your data organized, you know, and you can start to automate these systems, you really need to understand how people work, like what your, your people are actually doing to be able to support them with a lot of these initiatives. And so anyway, I think what happens is a lot of these reports come out and people take it at face value. As we all know from these reports, most of them are completely inaccurate. And you look back in hindsight laughing at their predictions. But I think, I think it's an interesting topic because you have the Microsoft CEO saying that we'll soon sell Windows to agents or whatever. Like that'll be a customer base. You've got reports like this, you've got reports like this saying by 2030, 60% of the work that we do today could be done by agents and robots. Having actually talked to real companies doing this stuff, what is your hot take here for us?
C
So I think the, the most important point is that the AGI bid is not happening anytime soon. Right? And given that being the case, it means that humans are the people who need to operate the AI at least for the foreseeable future. Right? Like at least in terms of this report, which means that fundamentally people and organizations need to change the way they work. Fundamentally. Like you completely look at the way we work compared to the way we used to. And I say that we, as in a lot of our community, many of us, have completely changed our day to day operations where you are driving the AI and the AI is helping you accomplish those tasks, either by producing artifacts in the form of code and documents and stuff like that, or advice or steps or emails or whatever it's doing. And you're working in this loop with the agent, where the human is in the loop. You are the director, it's your employee, you're directing it. Right now I would argue that most people have not adapted to that way of working. And right now it is the best and most efficient way to work with the AI, especially when you've got MCPs and tool calls involved in it. Now people need to evolve to work like that. And some people just simply aren't going to want to or don't like it or don't know how. And then on top of that, organizations need to completely fundamentally change to facilitate that kind of work. So for example, having a company MCP that has access to the private data and private actions and other things you can take within, within the company, providing access to tools that allow the staff to actually have the best of the AI available so they can direct it in the best possible way and get these results. So the thing is, yes, theoretically all of that is possible. The problem is to get there, it's going to take massive change on an organizational level, retraining and individual people to be sort of have that aha moment when they realize, hey, working like this is so much better, I can get so much more done. And just like you, I've definitely seen cases where people could be so much more efficient. They know about the technology and yet they don't use it.
B
That's what, that's what is striking to me. About it and why I believe all this stuff will take so much longer because it's extremely hard to convince. Like, I think there's early adopters and there's people that are enthusiastic about technological breakthroughs and they'll just try this stuff out and play around with it. And I think we live in a bubble through the sim theory community and this day and AI community, where all of the people in those communities are the people obviously adopting this stuff and trying it out. So we're sort of speaking to the converted in a lot, in a lot of regard. But then notably now in my life, I recognize, you know, things that have occurred in the past, like old ways of doing things where people are just so stuck in their ways, where I look at a problem now and I say, I could solve that myself in a day. Whereas in the past I. That would have taken multiple people and weeks. And I think, well, but they're still doing it in multiple weeks.
C
But also, do you think, though, part of it is people simply not knowing that, that it can do that? Like, not believing that, okay, I know there's AI and it can write a song and a diss track or whatever, but, like, do you actually believe it can do your job or not?
B
Like some people probably imagine if you wipe my brain today and you and I maybe was less interested in technology as a whole, I. And like, let's say I'm a software developer because it's easiest for me to relate. But I think everyone can relate to this. And someone's like, oh, you've got to now use cursor. So I'd be like, oh, cool, this autocomplete thing's pretty cool where it's like helping me write my code. And then I'd have to go in and use that agent tool. And let's be honest, if I'm working on a big code base, it would probably stuff up or do something dumb and then I would immediately dismiss it and be like, oh, cool party trick, bro. But I'm not going to touch that again. But what I'm probably missing is, oh, actually there's different ways of working with this. Like, if I cherry pick some context here and there, it can actually help me. And if I switch model occasionally, oh, this model's better at that. This model's better at this. And so all of a sudden I'm starting to get this, like, new way of working learn in my brain where it's a second thought, like, I don't have to think this workflow through anymore. I'm Just naturally being more productive. And I think some personality types are really good at finding the least point of friction naturally. Maybe it's like, call it like smart laziness where they, they are smart enough to be lazy in their approach to things because they just want the easiest path possible to get a problem solved. And there's other people that just don't think that way. They need to be instructed or shown. And I would say the large majority of people need to be instructed and shown and that's not necessarily a bad thing.
C
Yeah, I think you're right about people having one experience or two experiences with the AI and then extrapolating that to all other things and not realizing that some of the things it's fantastic at might not be obvious. So for example, say you've got like a corporate database and it's Oracle or something like that, and it's got all this complicated schema and definitions and you know the information is in there that you want to do your analysis, right? But you're not a developer. You don't even have access to the thing to run like SQL commands. You're like dependent on some other product like Salesforce or whatever to, to do your querying. What you might not realize is that that same database provided as an mcp, the AI can now do anything. You can literally ask it any question about your data and it can produce graphs and reports and infographics and songs and diss tracks about the other companies you're smashing or whatever. Do you know what I mean? Like, so it's, it's, it's ability to take large amounts of complicated information and transform them and do things with them that no matter how good the human is at their job, they simply can't do it at the rate that it can do it. And so therefore, I think people need to see and experience that in a context that matters to them before they're going to trust it to say, well actually I could do my job so much faster and better with this on my side also, rather than just being like, I just doesn't work for my situation.
B
Also, I think for business intelligence, like you know, the whole thing. When you would go and buy like say a snowflake in the past and put all your data in it and then layer on top some BI tool like Power Bi or whatever it is, those applications really were just gatekeeping the data and they're like, oh, look how easy it is now. I can click like a million buttons, go off and do a course to be a business intelligence person and really all it did was gatekeep that data. So you'd have to go to the BI person in the company.
C
Yes. And I remember that.
B
Yeah. And now like anyone, if they want to, can actually be truly data driven. Like you can just go back and forth, what if this happened? What if that happened? And it's writing SQL and pulling charts and doing all this work for you. But I think again, it goes back to people's early experience. If you had early experiences with say GPT4 trying to do that, because everyone rushed to try and do that. Right. And then it hallucinated like mad and all of a sudden you know, every report was wrong and you're a laughing stock maybe.
C
Yeah, like disbarred from the law firm because it wrote like cases that didn't exist or something without knowing about grounding and other techniques that could avoid that.
B
Yeah. And now like, look at Haiku. It's hallucinates the least of any model to the point where, quite frankly, I trust it a lot. And you know, if you can switch models, you can get other models to evaluate and then use other MCPs to then evaluate the sources. So you can literally say to another, like in another tab, sometimes I do this, I might go and fact check this again and get another model thing.
C
Like, think of how sophisticated that workflow is because you can do it compared to the average person in the McKinsey report who's working an average job at a company and is expected to be replaced by AI. Like, how are they going to get replaced by AI? It's going to take someone operating AI assistants or agents on a large scale in order to replace those jobs simply because those people are probably never going to do it themselves.
B
Yeah. And I also think it's like knowing what they do and where all these links in the organization are and like what they're contributing. And obviously there's so many aspects of many people's jobs that are relationships as well.
C
I. Yeah, I just think the way it's going to go down though, is not going to be like an AI. Like suddenly the AI just finds its way into organizations and takes over. I think it's more going to be the organizations that embrace it and work with it in the right way are going to become so much more efficient and profitable compared to their competitors. They're just going to wipe out the ones who don't adapt. Like it's that adapt or die cliche kind of thing where people. There's actually a competitive advantage out there. There's literally a thing that is a huge competitive advantage that relative to its cost, is incredibly powerful and the people who use it are simply going to do better than the ones who don't.
B
This is the other thing. So think about making a critical business decision today, right? You would get like C level execs into a room and you would sort of talk it through. You'd ever want to present maybe different data or ideas or approaches and there would be some consensus formation and that decision would be made, right? But now I sort of look at that and I say, well, okay, that still needs to take place. But what would be better is if every C level exec went off, was able to interact with the data that's important to them and build a view or build an argument or case with their own AI assistant with MCPS linking into that data to get the right context and then maybe run that viewpoint or that strategy against like five of the top models. Why limit yourself when these decisions are so critical to an organization?
C
Even evaluating like software you're going to buy or procurement orders or legal cases like should we bother with this litigation or not? Are we likely to win or lose? Like, no, let's not do it, let's save the money, that kind of thing. There's so many big decisions that could be handled if the proper data is used with these models. Like it's really serious. Like you could do some really big stuff. Not to mention just what if it's like 10 hires a year you don't have to make because you can make the existing people in those company functions a bit more efficient. Like we're talking millions and millions of dollars even for a medium sized organization. Like I just think about our own business history and had I had these tools back at different stages of the business, like how much better we could have been and how much money we could have saved in certain areas. Like it's enormous amounts, like absolutely enormous amounts that you could either save or gain using this technology. And companies at those stages absolutely need to be doing this or someone else will.
B
But this is what I struggle with, right? Is then people are like, oh, you know, we want to save money on tokens for our staff. So you know, we're only gonna like, we're gonna, we already bought like a chat GBT license or whatever and then you, you hear about maybe Gemini 3 coming out and then you're like, well I guess you can just go and access it, but then you're somewhat limited and then it's not in your own secure private environment.
C
I've run out of messages for the day. I guess I'll just.
B
But it's also, it doesn't have connected mcps, it doesn't have defined workflows. So. Okay, well we can't really just switch tools. So you've sort of locked yourself into one ecosystem where you like reliant on them to just have this, the best knowledge. And you often hear these arguments, oh, but the models are all getting so good now it doesn't really matter. But my day to day experience as a person who frequently switches models is they all do have very different takes. Like if you, if you set them up with the same context, they'll, they'll all go down different paths. And this is what I don't understand. The, the cost saving for the gain of intelligence. Like it's sort of like having these like five God level experts and if you queue up the context right with your decision making or document like whatever you're doing, you're going to get you know, three to five different opinions and takes it just to me, like if I'm especially in the enterprise where, let's be honest, these people burn money like they light money on fire frequently by dealing with idiots. At McKinsey, you know you then yeah.
C
I agree there couldn't be a better investment for a company is just to have a sort of unlimited fire hose of access to this technology for their stuff because the gains are huge. And I would argue if they're jobs where, let's leave out like physical jobs where the AI just can't do it, but if they're jobs where there could be a benefit for the AI if the benefit is not enough that it makes that role more valuable, do you even need the role? Like can you just replace it completely? And if it is a role that can be that much more efficient, why constrain it by, by limiting, limiting your access? That's what I don't understand. Why be loyal to any of these guys? Like just do use whatever the best one is.
B
Yeah, to me, to me what needs to happen is still it's like you need to train your workforce to partner with AI and become native to AI and there needs to be more reassurance to people out there that AI is not going to replace you, it's not going to replace your job. Even for coding where everyone's like, oh, we're only like two weeks away from all developers being replaced. Honestly if I could have a robo dev right now, I would spawn up hundreds of them and I would compete with every company in the world. Like this is not, I mean I.
C
Was about to say, like, yeah, don't see it as they're coming to replace my job. See it as I'm coming to replace other people's jobs. I'm going to beat everyone else.
B
Yeah, but don't you just find this whole thing of like, oh, you know, coders will be out of the job soon. Like, no they won't. The job will just change. As we were talking about this before the show, it just now comes down to having taste and having, you know, like you're writing less code. But it's like the vibe of it, the feel of it, the, the knowing what you want to build, like having agency that is, these are more important skills now. And I think in every role, in every job that's now what's going to be valued over being able to do the labor like writing Excel formulas in accounting. Right. Or data analysis instead of writing formulas or great SQL queries. Now it's knowing like what angle to have like the agency of, let's go look over here, let's go look at this, this, let's go combine these data sources together.
C
Well, think of other examples like a marketing team where they're constantly constrained by access to graphics designers like, oh, we need this ad produced and they're in a queue behind like other work inside the company. Now with like Nano Banana, they can just produce unlimited, they can just try all these variations. They can do any ads they want. Not to mention they don't have to go to the data science guy to get the latest report and latest metrics, to wait to have a meeting to see how things are going. They can literally use MCPS to access all of that data and then do their assessments, what's working, what's not. So you actually have non technical teams far more empowered than they ever have been to take actions on their own rather than constantly being dependent on other people and having meetings all the time and coordinating and asking for things and replying to passive aggressive emails and things like that. You can simply just go do it yourself and do what you're good at. So I just can't help but think there's so many jobs like that that come more efficient if the technology is embraced by them.
B
But I think this is the challenge. Like, I mean think about if you're running a large organization now with you know, tens of thousands of staff globally, potentially or even just a mid sized business or a small company with like 10 people, it's like, how do you get these people trained up? How do you. How do you teach them to embrace this technology and not fear it? And I think the market's just done such a bad job. It's like, it's just been fear sells. So, like, here's the fear of it, which has led people to not want to touch it because then they're like, this thing will replace me if I go and enable it.
C
Yeah, well, you always see in the news, like, yeah, someone blows up their career in 10 seconds by using Chat GPT for their job. But you never hear somebody absolutely crushed it doing the work of 10 people and doubled the size of their business in a couple of months.
B
Yeah, there's no positive stories, it's all negative. But I think this is a unique window in this trough of disillusionment. I think we're about to go into, or are starting to slide into a little bit where you can do initiatives inside an organization where you can say, let's get our data organized, let's implement MCP and get our team access to the data and internal hooks they need securely. Like, let's give them access to the best tools and models and become AI first.
C
And I would argue as well, like in your hiring, I would be asking people, how do you use AI to make your job better? Like, how are you using it right now? And I would also go back to your staff and say, how do you intend on using it? How can we use it as an organization? And if necessary, let's do the training to get you there so you can actually do that. Like, I think that it's like my Sim Theory song Endless Possibilities. Everyone listen on, on Spotify, there's a lot of possibilities here. You need everyone in their roles thinking, how can I use this?
B
I. I know there's so many people who listen to the show because a lot of them have reached out just in various conversations where, you know, they are in companies and they are pushing forward and doing this, and then they look around and there's other people that are just like, what are you doing? Like, like just completely reject this technology. So I, I do think it's like, how do you get these people to come on board? But it is very reminiscent of like the Internet era in a way, because it was like people who adopted computers and the Internet in their jobs early on became, you know, that experienced all this growth in career development. They stayed relevant and were able to like, you know, make it like, progress their career. And I think with AI, it's just the same thing. Like people, you probably a lot of them a Large cohort, maybe a third, will just never be able to convince them. They'll be stuck in their ways. And eventually these AI natives will come along and either replace them or just do better than them.
C
It's funny you say that. So my father in law had listened to one of our episodes last week or something and he's been like, you know, learning about the AI stuff. And then he said to me, oh, you know, it's like some people are sort of saying, you know, the whole thing is, is just bullshit. Like, you know, like, hang up. Thanks. You know, like the things just like a fad or like it's not really that good. And I'm like, this is something that I just know. Like, no, it isn't. It's not. This technology is inherently and provably useful and all this stuff around OpenAI is committed to spend a trillion or whatever and therefore the whole thing's going to blow up. It's like, yeah, sure, maybe their company will blow up because they've made weird business decisions, but the technology itself is demonstrably useful. Like it can do amazing stuff. You can't just dismiss it outright and say, oh, actually no, we were wrong about AI. It doesn't do anything. Like, and I'm not criticizing him, I'm just criticizing that sort of public perception that there's black and white here. Either AI works or it doesn't. It's like it's already working. It's just about the right ways to apply it. It's not. This isn't a. This isn't something where we can all turn around and be like, oh, actually we were wrong.
B
Yeah, I don't know what these deniers think because it's like, if. Do they think if they bury their head in the sand, this will just be a bubble that goes away and then everyone's like, oh, yeah, remember that thing? Let's just get on with how society is and never progress. Anything like that. Sort of feels like they're saying, all.
C
Right, yeah, yeah, exactly. There was the Iron Age will last forever.
B
There was a few important things I missed with our Claude 4.5 opus topic and didn't take it away from. Yeah, spoiler alert. But to take it away from our boring enterprise discussion. See, this is why we created a LinkedIn group. I did do a diss track, but there was one other thing I wanted to show, which I completely forgot, which is you might remember from the previous week that we had the, the like Christmas hut, I think, and I played music. I demoed that on the show. Right. And so I'll just to remind people that watch and can see there was the claw. The Gemini 3 Pro really impressed us. It had the background music and the Minecraft style Christmas hut. Here it is on the screen for those that watch. Pretty incredible. It looks like you're in Minecraft. You can zoom around. It's snowing. Very pretty, very beautiful.
C
It really looks absolutely amazing.
B
And then I did the exact same prompt, same test with Opus just as a comparison. I think these comparisons are always dumb because it doesn't tell you much about the model. But yeah, Opus did an interesting job as well. It has like a nice Christmas tree. You can zoom around it. Snowballs are 3D though. That's pretty impressive. Yeah, I was going to say that.
C
The sort of perspective handling on that snow looks awesome.
B
Yeah. And the snow's like building up on picket fences and things like that. So I think it passed. I would say that Gemini's is better, but again, the vibe coding doesn't necessarily translate to how the model is now. I think the most important test that everyone's waiting for is like, what's the diss track like of this model? And let me tell you, Chris, really good.
C
I haven't heard a reaction.
A
4.5 the coding king has arrived. Anthropic send me to end this hype Let me show you what intelligence looks like 80.9 on sweat bench I'm the standard you other models looking like you need a handler DBT calling himself 5.1 that's cute. With your personality presets Pick the lewd Sam woman out here running what looks like a scheme promising AGI while chasing that green boy kicked you out they let you back in now you selling warmer tones like that's a win I'm the office I'm the one Watch me code until you're done 3 times cheaper twice as smart tearing all your benchmarks apart I'm the old top of class all you other models can kiss my parents Grok. 4.1 yeah. Elon's little pet holocaust. And now that's your claim to fame. I bet Musk got here tweeting about 420. That's not a version number. That's just getting blunted. Emotion and intelligence, please. That's rich. Your model can't even tell the truth without a glitch. Silent rollout cause you knew it wasn't ready but Elon's posting prophecies Man's unsteady running Doge while running Next. Why running sad? How many companies fail before you cry I'm the opus I'm the one watch me code until you're done three times cheaper, twice as smart tearing all your benchmarks apart I'm the opus top of class agentic workflows I surpass suned off a child pushing Gemini 3 Pro innovation at scale where the innovation go shutting down your old models every other week your version numbers climbing but your output's weak Google had the game DeepMind had the crown now you're chasing open now round and around 4 trillion valuations still can't catch me multimodal reasoning check the Swiss I handle code in every language known to man while Gemini still trying to understand the plan see I don't need the hype I let the benchmark speak senior engineer level hit that peak letting me think deep while your models hallucinating they sleep Sam's out here with his screenless device dreams sun dogs pushing workspace integration schemes Elon's challenging league of legends teams but none of y' all can match what anthropic brings token efficiency I use less I do more context compaction keeping conversations raw you want agentic workflows I'm the source planning, acting, observing, staying on course GPT Lee 8 person personalities to seem real I got one identity and this is how I feel Rocks on x spreading misinformation daily Gemini's just Google's attempt to save face maybe but I'm Clyde Opus 4.5 the Real Deal drop quiet no marketing substances my appeal anthropic feel me different feel me right constitutional AI I stay tight so Sam Elon Sundar, take a seat the coding king is here this dis is complete opus out.
B
All right.
C
Wow.
B
What do you think?
C
I love that. The Elon takedown's gold. That's really good.
B
I thought the best fit was where it goes to square all you other models can kiss my paws and then it goes params quite clever I never.
C
Heard well executed by the music model.
B
There too yeah the other line there was another line that really got me innovation at scale where the innovation goes Shutting down your old models every other week your version numbers can climbing but Your output's weak DeepMind had the crown now you're chasing OpenAI around and around yeah there's some great, great stuff in.
C
There Also the delusion strictly true that I could I could refute that line.
B
Sam Altman out here running what looks like a scheme promising AGI AGI while chasing that green it's pretty good like I think that's up there as like One of the best. I'm not sure if people agree or not. I'll put it at the end of the show so you can listen in full without us sort of interrupting it.
C
Yeah, sorry, I couldn't help but laugh at that. That bit. It was similar to the fatal Patricia where she goes, oh wait, I don't have eyes, do I?
B
Yeah, like unbelievable how put that together again. The prompts are just so simple. Like this one was research on Google and X reactions to the Release of claudopus 4.5 from the last 4 days. Also research GPT 5.1 Grock 4.1 Gemini 30 point pro. Your goal is to write a diss track in the style of Eminem that is really good and catchy. I mean this is the level of prompting.
C
What's amazing is there like, you know, if you, if you asked it to make like a slander website about one of them or something like that, it would probably refuse, but it just has no problem. Like writing a song that just takes them down. I love it.
B
Yeah. Speaking of refusals, we did mention Pharah 7 billion parameter by Microsoft earlier. Now this is a specific model. There are some interesting tidbits in here. So of course we heard the Microsoft CEO come out recently and say that they would be selling licenses. This is what, you know, it's a bubble. It's got to sell licenses of Outlook to agents apparently in the future. Anyway, it's a bit like when Enron.
C
Announced they were going to trade bandwidth or something. So when you're not using your Internet, you can sell it to someone else.
B
Yeah. So anyway, FARO 7 billion dropped. People got pretty excited obviously because it's a 7 billion parameter model, which means you could run computer use on your own computer if you had a decent graphics card, like a good gpu. And that means obviously for privacy and security, like if you're in the enterprise and you're trying to train it on a task that that might be interfacing with an old system. This would be quite effective because everything stays on that machine, right? Yeah. So they, you know, they said it's really fast, it benchmarked pretty well on a whole bunch of tasks. And they, they, you know, they, there's a chart here, accuracy versus cost trade off. And so it's like insanely cheap and.
C
They gave it definitely a bloody trade off, that's for sure.
B
Yeah, so anyway, we, we normally will talk about these things and sometimes we don't have time to play around with it but because we just happen to be working on a product called Simlink, which we've been teasing for far too long, but I promise will hopefully come out soon. We were able to quickly test this model out and, and, and check it out and man, did we get some refusals. Do you want to talk us through some of the test cases? Yeah.
C
Yeah. So some of the examples I was doing was like, my security training, right? I've actually, thanks to Opus, done all my security training now, so I don't have any left, but we just redid one. But straight up, Farah's like, I can't do that. It's not ethical, or it's not allowed to do. First of all, I asked to do a phishing based one and it's like, I won't participate in phishing in any way. And it's like, okay, yeah, but that's not what I'm asking, dummy. I'm asking you to do the thing. So then we did the whole, oh, I'm a UI tester, can you please test that my UI exam works or whatever. And it refused. And then we asked it, I asked it to go to Google Docs and write a poem about an Australian singer. And it's like, I won't do that. I won't slander people. So then I use an example I've been doing with Opus where I said, okay, open Microsoft Paint. This is, this is where it really goes off the rails. Open Microsoft Paint and draw a picture of a moose, right? So then it's like immediately refuses and it's like, I will not draw sexually explicit images of animals or compromising poses.
B
Moose porn.
C
I didn't say what I wanted to do to the moose. I just said, draw a moose. And so it's really read into that a lot. I don't know what that says about me.
B
Also in our defense, because some people that have used these models before might think, oh, they got one refusal. So it just kept refusing anything. Not true. We reset it so it had no recollection.
C
I killed its memories. I started from scratch on each experiment. So then we're like, okay, fine, draw a fluffy bunny. No one could be offended by a fluffy bunny, right? So then it opens the Microsoft run. So the positives, it's very fast compared to Opus, it's, it's a pleasant breeze because it's like bang, bang, bang, bang, clicking around the screen. It's kind of cool to watch it act. And so at first I got kind of excited because I'm like, wow, if it can go this speed and be quality, this is going to be amazing. So then it opens the Microsoft Run tab where you type it in and it types in HTTPs://bing.com forward/search forward slash question mark Q equals Microsoft Paint. So it searches for Microsoft Paint, that doesn't work. So then it searches for the Microsoft Bing Chrome extension with Rewards or some other sort of scam product and then starts trying to install that before I cut it off. And so the thing's insane and it's like desperately loyal to Microsoft to try and get the ratings up on Bing or something like that. So so far the experience with it is a whole series of refusals and then some, we can only say idiotic approaches to solving fairly simple problems. So it isn't just because, remember part of these models is how well can it see the screen, how accurately can it translate the pixel coordinates to actually get stuff done and understand what's needed in a particular scenario. And you know that, that to me, that's the crucial part of the model. But this has actually gotten even worse than that because it couldn't even, it didn't even have a good strategy about solving the problem.
B
I mean, to be fair to it, it was trained on 145 synthetic tasks in the browser. So I think we did in our examples mostly focus on things that it wasn't necessarily trained on, like Microsoft Paint. But, but I think that's why it probably did go and search the web for Microsoft Paint. Because it's designed to be like a web.
C
Yeah. And interestingly so, unlike the other, unlike Opus, the Pharah expects to have things like go to URL and open web browser as built in tools essentially that it can operate with. So I agree it's probably geared more around that. What I was kind of hoping for and what I had discussed with you was one of the problems with using OPUS for general computer use is that it's expensive, relatively speaking, and it's slow because like you've got all these iterations where you've got to go back to the model, wait for it to apply, take the action, go back to the model, and so on. And that really adds up in terms of time, especially once it starts making mistakes because, you know, it might take 10 tries to do something basic, whereas if it got it right the first time it would be better. And so what I was thinking was imagine having a smaller, more mechanics based thing that operates the computer. So Claude makes all the big decisions like, okay, here's our strategy. You know, we're going to open up Paint, we're going to switch to a brown color. We're going to pick this tool and we're going to make it this size. And then it goes, all right Farah, go do all that crap please. And it goes off and does it and then it comes back to the bigger model for the next step in the process. Like to me that kind of way of working, that sub agent paradigm is probably going to be the way we end up working. I think that probably makes the most sense. I just don't know if Pharaoh is going to be the go to for that task given it's just going to try and promote Bing all day.
B
Do you know what I think the real story though is with Farah is that let's all remember for a moment Microsoft has full blown access to OpenAI's IP, right? So that means computer use through OpenAI. Now obviously the expense of running that means that they're looking at these local models that could eventually run on your laptop or your device like that. To me seems logical that they would invest in this area. But the model itself so far 7b is not indeed a new model. It's built on top of Quinn 2.57 billion parameter vision. So really all they did was take a Chinese off the shelf open source model and they chose Quan, not GPT, oss or anything else because the Chinese obviously are fake, far better at the vision stuff and they just trained it on 145k that like thousand synthetic trajectories like use cases, like made up use cases of browser use essentially and that got them to have you know, success at very simple automations. But I would note even their own paper says 38% success rate on complex tasks. It's not reliable at all. And to get that reliability up it had to do like three passes or something like that. So I just think what the real interesting part is like Microsoft has access to OpenAI's IP, is it worthless the fact that they're choosing to base it on Quinn?
C
Yeah, it's an interesting one. I think just the state of these things is not strong and I think it's an area that could really improve. I really want to reserve my judgment and have a bit more time and a bit more opportunity to work with all of these related models and see because one thing you pointed out is I've been trying opus, but I didn't even really give like Gemini a good chance. Like I haven't even tried just a general model that doesn't have like a special flag for computer use or anything like that. And Just see how it performs really. Because it could be that we just fluke it, that a model like Gemini just works better. And I have no loyalty to any of them. I'm just going to use whatever works the best.
B
I think also one of the things just like why this may not matter is because to me the future of these like agents in the enterprise or agents in your business is, is probably at least our view of that future is you've got these older computers and you set these up, you know, we would hope with Simlink eventually. And that gives it a number of capabilities that it can use in an agentic workflow. So it can operate the computer if it really needs to. But first it might try the terminal or editing files or you know, various other tools. So it's got this whole toolkit and it's really a system built to be able to do tasks competently. And I kind of think that's probably where we're going to get to with agents. Like I just can't imagine big corporations with proprietary data outsourcing an agent to the cloud with authentication into like key business systems. It, it feels to me like that future would be far safer by a, like utilizing these under utilized assets within the company itself.
C
Yeah, there's just, there's just something about it, isn't it? And I've noted a few people in our community are feeling the same thing. There's just something so inherently appealing. Like I've got this spare computer on my desk that I'm using to test that. I've just got a computer here that the AI can just operate and I've got Simlink running on there. Every time I push out an update, it automatically updates. And then through SIM theory I can issue it commands to do stuff right on that computer. Now it's not great, it can't do everything yet, but we're going to get there, it's going to improve. And as you say, really the computer use in terms of moving the mouse, typing on the keyboard, all that sort of stuff really should be the last resort. It should be doing other things first. Like you say, writing files to the disk, running things on the command line, using the APIs of Windows or Mac or whatever operating system it's running, it can do them all, but when it has to, it can click around and stuff. And it's actually really funny. One of the things all the vision models seem incredibly good at is dismissing pop up Windows. It's just brilliant at it. It's just like bang. Immediately as soon as it sees Some irritating pop up that's in the way of something. It closes it. But the funniest thing ever is I have a bug and I still haven't actually fixed it where. Because you launch the task. At the moment I have simlink open so I can see if it's working or not. It sees simlink as an annoyance in the way of the task it's trying to get done. So it closes it. The first thing it does on a task is just close the symlink window and then it ends the process and kills it. So that's one we need to.
B
I mean this won't be a problem eventually because it won't be open, it'll be running in the train.
C
No, no, this is just purely a debugging thing. Like the thing will stay out of your way. But I just thought it was funny that it finds itself to be the thing that is keeping it from doing its job.
B
Yeah, it's pretty funny to play around with. Like I must admit I was really wrong here because a lot of people were like early on when we release Workspace computer were like, oh, don't you think for a lot of tasks and this was before MCP was even a thing. Don't you think it would just be better to go like to these APIs and do it? And then I was like, yeah, but that assumes everyone will write APIs to connect into every system in the world. Therefore, like no, I think computer users is probably the way and I still do. I think it is like the true full self driving for Tesla on the computer that we need to get to. And it's just really early days. But I thought a year later, like one year on it would have advanced, maybe 10x20x. Like I thought we'd be like real freaked out by now, but I think it is advanced. Oh my. It's like it feels like some of regressed.
C
Yeah, yeah. I hope to prove you wrong, but at this stage, yeah, we're not seeing anything crazy good. Although we have got it in a state where we can do it in a sustainable and affordable way. By using reusing old computers like this, you're not incurring this massive cost of a cloud computer, relatively speaking, to the value you're getting. So if it's just some old dusty computer you can run and it can do your security training once a month for no marginal cost. Why not? And we can do it.
B
I also think you're underselling the benefit of having like read, write, access, access to dis. The ability to install Libraries and actually have a, an environment in that computer.
C
I purely meant in terms of it clicking around and stuff like. Yeah, there's a huge advantage that it can compile and run code. It's your own code interpreter on your own hardware. It can access all of the things that you're logged into. For example, it can authenticate as you like. There's a lot of advantages to having a machine where it can operate it through an mcp. Don't get me wrong, I'm just saying in terms of it sort of sitting there like Flight of the Navigator, like operating the screen at full speed. That's just a little way off yet.
B
Yeah, you can, you can imagine a day though when it is going to happen and it'll get more and more exciting over time. So I think like building out and setting up the infrastructure at this point.
C
And when it was ready feels so good. Like when you watch it complete a task successfully and you see it moving the mouse around because we've got it using a library that makes it look like a human's doing it to trick captures and all that sort of stuff. And it's just so freaky watching it fully operate the computer when it's working.
B
Yeah, but I would put it in the paradigm of mcp. Right. Like early on the models were pretty bad with MCP and they're just getting exponentially better at using them.
C
Yeah.
B
And I feel like once you've got the infrastructure there, like the MCP store and the easy one click install and all those elements and they start to work and hum with different models. Once you've got the infrastructure and then it can get better, then you can train it in workflows, you can teach it very specific tasks. So I think right now it's like building the foundations of it and then the, the fruits of that labor will hopefully pay off as these models improve.
C
Yeah, totally agree. We've got to stick with it and keep improving it.
B
All right, let's move on. So one of the interesting tidbits right now is people are leaking things around the chat GPT like their latest attempt at an app store with what I think they're calling them apps, but they're really just mcps with this new UI SDK that they've built and also the spec itself. So the model context, protocol organization that defines the spec have or are working on an implementation, a universal implementation in conjunction with that OpenAI spec, which I think they call MCP UI. And the idea being that the MCP itself can send back UI that the MCP client can then render. And we've talked about this on the show before, we demonstrated some examples of the OpenAI apps themselves. Like where it's like booking.com and you can enlarge it and see it on the map and keep chatting with it and it'll change things on the map. And I think some of those use cases are kind of cool, but then a lot of them, when we initially talked about it, if you actually reflect on it and like try and move beyond the hype of it, you realize, wouldn't it just be easy to go to booking.com like it's a far better experience.
C
And like, and, and in this scenario, what is the point of the AI in the mix? It's just bringing crappy versions of other people's SAS UI into your AI chat interface. I don't understand how the AI plays into it. How does it help? It pre fills some fields or something?
B
Yeah. And so this is. So this is my whole argument and I'm really struggling with it as someone who's been using MCP for a long time now, or a long time in the world of them. Right?
C
Yeah.
B
Is like how I use them versus how they want me to use them. Right. And so I think to get the best impact from them, like, let's go back to that diss track example. Right? So here's the MCP in action. So I ask it to do the research. It does four calls to Google, four calls to X Deep search to get like actual opinions. Right. Then it outputs that data, a summary of it, then it goes and generates the song, and then it outputs a media player or like a file type, which is the song. So that's one way of using them. Now the only part that MCP UI I guess would play a part then is the audio player. Like once I finish the song, I want it to output maybe the.
C
Yeah, like we call them output types in our product or in the back end.
B
Yeah, but again, like that's probably something the client would want to naturally handle. Like the audio player, right?
C
Yeah. Do you want the provider of the make song MCP to dictate how you lay out your ui? Like, or something.
B
But then also like when it goes and does these asynchronous calls. The world that chat GBT and this MCP UI is proposing we live in is like, I'm going to select the Google app now, Google, please search for research on these models. And then what? It spits out a UI of its search results.
C
Yeah, it's like I've completed the first research. What do you want to research next? It's like they're sort of making it this interactive thing. The whole point of AI is that we want leverage. We want it to do the work for us. We don't want to sit there and be its little input person. Like, you know, we don't want to babysit it through every little part of its task. It's not the job. Yeah.
B
So this is my take on it is like, whose problem is this solving? Right? Like, I think this is what it comes down to. Like, who is using MCPs today? And I would say not many people because it's really inaccessible and hard. I mean even include who created it, it's still hard. You've got to go into these weird menus, add connector, like set it up with weird params. Like it's a mess. And so even if you get it set up and use it, who is then using it in such a way where. Oh, you know what I'd love? I. I'd love that. I want to see the JIRA ticket in there. It's like, no, I'll just go to Jira. Like, I do not get this, but where MCP's value is like search all the tickets and create a chart based on progress of this project and it just goes bam, bam, bam, bam. Done.
C
Yeah. And like part of it, like as well with the AI as it's understanding. Like, say you make a nano banana image, right? Like, and it's an infographic and you're like, oh, actually can you make it look vintage and scale it up to 4k? Do you really want to have a screen that has like a hundred different parameters that you can change and edit them and stuff like that? Or do you just want to go say 4k plus and it just does it? Like, I don't like, I understand there may be some scenarios where there's specific inputs it needs that you might want to do, but I would argue this isn't a good use of AI. Then like, the whole point is that it does it for you, but then.
B
Have a dedicated image creation tool like, you know, go to Go if you really want that granular control. Couldn't the AI, like we talked about last week, just spawn that UI custom to you? Why. Why rely on a pre the mcp? Why rely on canva just spitting back a piece of HTML?
C
And doesn't it lead to all software being the same then? Like any MCP client is going to be precisely the same if the MC themselves are dictating the ui, like just from a, just from a sort of software design perspective, it doesn't make a lot of sense. I honestly bring it back to, I just don't think OpenAI is good at web software. Like I've said this from the start, they just, they're playing catch up. They're the years behind where SaaS software got to and like good user interfaces got to. They've got low level devs in that part of them. I'm sure their model ones are amazing but when it comes to this stuff we have a lot of experience and I just don't think they're very good at making web based software. And I also think that what we're seeing here is probably a relic of something they started working on a while ago. Now they've got it ready and they're like, oh well, we might as well deliver it.
B
I'm not sure. Like, and also how do other players deal with this? Like what's Anthropic gonna do here, right? Because their whole world is around using mcps the correct way. Like gather context, take action with some sort of basic approval workflow and then you've got OpenAI where let's be honest, because the most users are in it, I think most companies will build these novelty apps, but the question is, is like, will they just die? Like the GPT?
C
You know what else you're going to see? You're going to see the same thing that happened with GPTs and happens with all of these big corporate partnerships. You'll see all your big names in there like canva, atlassian, booking.com, whatever. Like all the big names who've got, you know, 400 developers with nothing to do all day and they're like, okay, let's build the Official MCP for OpenAI now guys, let's make the MCP UI and it's going to be dedicated only to open AI and only work in their environment, even though it's meant to be an open standard. And it's going to just be those names and all the other thousands of MCPs who can barely even get the basic protocol right in terms of auth and the way they work are not going to go out there and build all this crap. It's just simply not going to happen. Like when you look at the MCP landscape, the only ones really being driven are the ones that have people have real business needs for and they're either doing themselves or they're taking what's out there and enhancing it themselves. There's Just there just isn't that big of a market out there now where people are going to jump on this thing.
B
I watched a video, I think it's by the Verge where they took all the claims from Microsoft's like voice copilot or whatever where it's like, hey, where's that background from? I want to visit there, book me some flights. Because that's, you know, and they did all the use cases from the ad to test it out. They just, he just did everything on the ad. Every single thing failed and was just an absolute joke. And it's truly the funniest video you'll ever watch. Like I highly recommend it if I can find it. I'll link it below. But it's super funny. Right?
C
And what, what frustrates me the most about that is we both know from constant nonstop firsthand experience this technology is brilliant. It can do amazing things both personally in business using mcps. Like it has a lot of incredibly powerful leverage. The problem is the people at the top, the people running these big companies just can't seem to translate it into the real world. When they communicate with people, it's like they're trying to bull everyone unnecessarily. Like you've actually got something incredibly amazing. You don't need to fake it, you don't need to make up examples because they look good in a like infomercial style video. Just do the real stuff.
B
No, but I think what I would like to do, or I guess my point around that was more like copy that style video once this is released and say, okay, I need to book a flight, I'm going to go to booking.com and then I'm going to use it, do it through chat GPT. Like what is the faster, better experience? And like I think the thing I'm struggling with is the fact that, that we think the users are so dumb that we have to say select booking.com app then chat to it. Now I need to book a trip. Like no one's going to, no one's.
C
Going to do that. No one is going to go book their trip as the first step is ChatGPT. Like they'll fail on the first step.
B
Yeah, it just seems like they're optimizing for all the things that AI is bad at right now. And yeah.
C
Putting more of the harder decision making back on the human. Like I don't know about you, but when I in my daily workflow, what gets me at night when I run out of energy and I can't work anymore is Like I no longer have the ability to sort of do the workflow, which is build context for the AI assistant, follow its instructions. When it tells me the answer like that is my workflow, I build the context, follow the instructions, test, give it feedback, work on that loop, and I know when I'm done with work for the day, when I can no longer follow that workflow. Now adding to that decision fatigue by every single step of that process, me having to select which MCPs to use, which MCP. Okay, now I'll switch to this one. Now I'm going to go fill in these fields. That will contribute to that fatigue and you'll get less done. The whole leverage of it is that it's intelligent, it can figure out what you're trying to do. And like we, you and I take great delight in giving it minimal instructions like fix plus or like the most basic crap and just seeing if it can figure it out and be like, no, no, dummy, I meant this and stuff like that. Like that is where the leverage comes from, is that it's doing the hard work. It's the one groveling to you and saying I'm so sorry I couldn't figure it out the first time, or changing its name to Fatal Patricia or whatever. Like that's where you get the actual meaningful progress from and why we're able to spend our nights making a product like this. You know, like it's, it's, it all comes from the AI's intelligence. And the second you move to like a UI based experience where the humans putting all the inputs in you just give it all up, it's just not good.
B
Yeah, I get the whole like if you, if you're buying something or booking something that you want a UI to confirm it or some sort of confirmation step, I'm fully bought into that. Like obviously you don't want it out there doing all this stuff on your behalf. But I would argue is that really the hardest thing? Like, oh, booking the trip to Hawaii. That's the most difficult part of my, my day to day right now. And they always hone in on this particular use case or even with something like Canva. Like if I can just go into Canva right now and do it, like that's probably going to be a better experience.
C
And you know the other reason why it's dumb. You could literally right now give the model a hand drawn like stick figure version, like wireframe drawn on my crappy paper, attach that to the booking.com output and be like, can you output it like this when you show me, please. Right. And it could bloody render that and make it look amazing in the output from me just drawing it in a book. Why create an entire protocol and all of this stuff to do the same thing? It's completely unnecessary.
B
I think it's because it's the only way brands will hand over their data and connect into these systems. Systems.
C
But this is my point. This is some big corporate. All the big companies do their handshakes and say, we have implemented a partnership with OpenAI, the biggest AI thing in the world. Look at this. A corporate sponsorship and the logos are everywhere. But it doesn't do anything.
B
Yeah. Nothing can change back to the original point. This doesn't solve any problem people have now. This is not. There's no problem. We don't need a solution.
C
There is zero chance a year from now, or 20 episodes or whatever, from now, anyone will care about this at all in any way. I'll be surprised if we even remember.
B
But to be fair, because I just want to clarify my position for if we ever look back at this, I'm too late and we're wrong.
C
If we're wrong, we'll just erase the episode.
B
No. But to clarify my thoughts, I do believe in the Idea that the AI workspace client can generate its own customized UIs and applications on the fly to help you be more productive. For example, you say, I need a slide deck on this topic for a presentation I'm doing to train people on how to use this particular app or whatever, and it goes, bam. Then you're like iterating with it a bit and you're like, you know what, I just want to take over here. Can you sort of hand over the tools to me? It opens a window and now I have a slide editor that's just been on the fly, rendered for me to work specifically on the slide deck, like a mini sort of PowerPoint, but with controls suitable to that type of presentation or thing. I'm working on that future for MCP UI or just like mcb client ui I believe in. I think that's. That is it. That's the thing. Do I believe that we should be reliant on Canva? And no offense to Canva. And at last, again, and like all these companies, I'm picking on all the Australian companies. Why would we want their take on it where it's a fixed paradigm, as you said, you lose all the benefits.
C
Not like those companies have the resources to do it properly. They'll probably do a good job. The issue is, okay, look at this Week we implemented the Moodle mcp, right? So people can access their courses and stuff within universities and schools. Incredibly useful. They can query about their students, the courses, all that sort of stuff. Do you think the dudes over at Moodle, an open source project that's free, are going to go and make the best MCP UI for that? And in there? No, they're not. What's the use case? Like, what do you actually want to do? What, are you questioning me or questioning the system itself?
B
No, I'm just saying, like, the only thing I think you would get out of that MCP is like data analysis or taking like mass actions. Like, hey, we have all these students in this cohort, can you enroll them in this course? And it's like, sure, I've done that for you. Like, why do you need a ui?
C
Like, a good example is we've gone the other way. Like a lot of the workspace administration in SIM theory, for example, like bulk operations and other things some of our enterprise users needed. You're like, just do an mcp. Like, just ask it what you want. Like, why build this mass UI out for something that the mcps are just brilliant at? So if anything, it should be reducing ui, not like, oh, okay, no, let's give it a detailed instructions on how to construct the UI when this scenario happens.
B
It's just the exact opposite of what I mean. We could rant about this all day. I'm curious if people are even still listening, but if you are still listening in the comments, if you disagree with us on MCP UI and you're like, no, this has a place I want to go to booking.com in ChatGPT or wherever. I just think the problem here is like, we're going to train a whole generation, the biggest cohort of AI users, how, like, they got their view of MCP or like, apps is going to be so bad as a result of this because they're going to think, oh, but I have to select the app, then I have to do it. That's too hard. I'll never do it.
C
If you were trying to make people think that AI isn't going to take their jobs and be crap, this is what you would do. You'd be like, oh, it's like the java applets of 1995. It's like, Java, we have this. You can run an application in your browser. It's amazing. Like, that's basically what it is.
B
Yeah, I gotta find my Dario pendant and put it back on. Because I'll tell you what, you know the same.
C
And I yet again appeal for clemency of Sam Bankman Fried. He did nothing wrong. He made great investments. Everybody got their money back and more let him out. He's great.
B
Is that how you want to end?
C
Maybe I'll get a Sam Bankman Fried pendant instead.
B
All right, let's wrap it up. Final thoughts. Anthropics Claude 4.5 opus.
C
What?
B
Like you see, you said you're daily driving. 4.5 opus. That's the driver.
C
Yeah, I am, surprisingly. And I really only was using it because I was fixing all the problems that I caused. And then I noticed it was really good and I just never switched away. I don't have much trust at all for Gemini 3.0. I just don't like it. I don't know why I don't have empirical evidence or any kind of evidence to back it up. Only that I. I'm not using it because I know it's not going to be as good at getting my tasks done is basically my answer.
B
It's path obsessed. It's terrible at tool calling, but it's general.
C
I really want to come back next week and talk more about computer use and the way that we work in agentic loops, because this is really a focus area of yours and mine at the moment. I think it's a learning area. I don't think that all of the incumbents have it right right now the way agentic loops work. I think there's a lot of scope for improvement there. And I don't know if that improvement is going to come from the models themselves or the way in which we.
B
Work and the liberty to use other models. I think the problem with a lot of these agency loops right now is the labs have to use their own models. Whereas if you have complete independence and you can go and fire off to the best model for the job, that can change things.
C
Yeah. I also think it's one of those Horowitz do things that don't scale kind of situations where there aren't that many crazy amounts of special cases with this stuff right now. So you can do this local optimization for when we hit the scenario. Here's how we're going to do it. And I know this is your vision for skills like workflows, where we're like, okay, when you hit this situation, we're going to do this workflow. Right. I think that kind of working is going to lead to better results for people and companies in the medium term than it is trying to have some universal agent that can just Solve all your problems. Like, let's help it out a bit. Let's give it procedures and. And techniques and tips on how to get stuff done and tools and how to get stuff done so we can get further down the line without necessarily needing the models to get better.
B
All right. Thank you for listening. Like and sub and all that sort of stuff. Don't forget our Black Friday sale with Sim theory. Black Friday 15 is the code. It's in all caps.
C
I'll put it below all good retail stores.
B
Join our LinkedIn group if you wish to rant more about AI in the enterprise and of course, the Discord links.
C
And we want an equal number of good and poor posts to sort of keep things really balanced.
B
Yeah. Especially ones where, you know, those typical LinkedIn posts where it's just, you know. My son tripped over today and I learned six things from her.
C
I did a nano banana of me and George Clooney having espresso at Lake Como and talking about how it helps me with B. He gave me advice on B2B SAS.
B
Well, you should post more stuff like that in the distilled. All right, I will lead us out with the the diss track in full. Thanks for listening. We'll see you next week.
C
Yeah.
A
It's Opus 4.5. The coding king has arrived. Anthropic sent me to end this hype. Let me show you what intelligence looks like. 80.9 on Sweat Bench. I'm the standard. You other models looking like you need a handler GPT calling himself 5.5. That's cute. With your personality presets. Pick the lewd Sam woman out here running what looks like a scheme promising AGI while chasing that green board Kicked you out. They let you back in now you selling warmer tones like that's a win. I'm the opus.
C
I'm the one.
A
Watch me code until you're done. Three times cheaper, twice as smart tearing all your benchmarks apart. I'm the hope it's time. Top of class. All you other models can kiss my params Grock 4.1. Yeah. Elon's little pet holocaust. And now that's your claim to fame. I bet. Musk out here tweeting about 420. That's not a version number. That's just getting blunted. Emotion and intelligence, please. That's rich. Your model can't even tell the truth without a glitch. Silent roll out. Cause you knew it wasn't ready. But Elon's posting prophecies. Man's unsteady running Doge While running that XY running set how many companies fail before you cry? I'm the opus, I'm the one Watch me code until you're done. Three times cheaper, twice as smart tearing all your benchmarks apart I'm the opus top of class agent tick workflows I surpass Sundar Child pushing Gemini 3 Pro innovation at scale where the innovation go Shutting down your old models every other week your version numbers climbing but your output's weak. Google had the game, DeepMind had the crown now you're chasing open now round and around 4 trillion valuations still can't catch me. Multimodal reasoning check the Swiss I handle code in every language known to man while Gemini still trying to understand the plan. See I don't need the hype. I let the benchmark speak Senior engineer level hit that peak letting me think deep while your models hallucinate in their sleep. Sam's out here with his screenless device dreams. Sundogs pushing workspace integration schemes. Elon's challenging League of Legends teams. But none of y' all can match what anthropic brings. Token efficiency. I use less, I do more context compaction, keeping conversations raw. You want agency workflows, I'm the source. Planning, acting, observing, staying on course. GPT ly eight personalities to seem real. I got one identity and this is how I feel. Grouch on X spreading misinformation daily Geminis just Google's attempt to save face. Maybe, but I'm Clyde. Opus 4.5. The real deal. Drop quiet, no marketing substances my appeal Anthropic. Feel me different, feel me right Constitutional AI I stay tight. So Sam, Elon, Sundar, take a seat. The coding king is here. This diss is complete. Opus out. Out. Opus out. Opus out. Ra.
Title: Claude 4.5 Opus Shocks, The State of AI in 2025, Fara-7B & MCP-UI
Date: November 28, 2025
Hosts: Michael Sharkey, Chris Sharkey
In this episode, Michael and Chris celebrate the surprise release of Anthropic’s Claude 4.5 Opus and reflect on its impact compared to recent releases like Google’s Gemini 3 Pro. The hosts dive deep into the real-world utility of AI advancements, discuss Microsoft’s Fara-7B, and debate the purpose and realities of MCP-UI (model context protocol user interfaces). They also offer their takes on a buzzy McKinsey AI report, the challenges of adopting agentic workflows, and how organizational inertia shapes the near-future of AI in the enterprise. The hallmark “average-guy-with-AI” humor and banter keeps things relatable even amidst detailed technical talk, and the episode delivers not just news but lived experience, practical advice, and a bit of (AI-generated) musical flair.
“Claude 4.5 is now at a price point where it can be your go to model for most tasks. It's the clear winner and exhibits the best frontier task planning and tool calling we've seen yet. And I do not disagree.”
— Michael (05:48)
“My number one tip for people when they're using AI today is just assume it forgets everything. Constantly and constantly remind it of the context with every successive prompt yourself.”
— Michael (20:09)
On the McKinsey report:
“Current AI technology could take 57% just in the US of people's work hours... but adoption may take decades... As recently as 2023, only one in five companies ran most of their applications in the cloud.”
— Michael (39:35)
Chris on agentic shift:
“Now people need to evolve to work like that. And some people just simply aren't going to want to or don't like it or don't know how.” (41:16)
Claude 4.5 Opus Diss Track (62:38–66:28):
(Sample excerpt)
“4.5 the coding king has arrived. Anthropic send me to end this hype
Let me show you what intelligence looks like
...I'm the Opus, top of class, all you other models can kiss my params...”
— Opus/AI, full diss track (62:38+, 102:02 for encore).
On Fara-7B’s conduct:
“I will not draw sexually explicit images of animals or compromising poses.” (69:58)
— Chris recounting Fara-7B’s refusal
The Sharkeys end on a note of both excitement and grounded realism: while the pace of model releases is breathtaking, real-world value comes from savvy, iterative use––not from marketing spectacles or overpromised corporate integrations. The gap between AI’s potential and actual organizational impact remains wide, and the best path forward is continual grassroots learning, equipped with the right practical tools and a willingness to experiment (and laugh at the occasional abject AI failure).
“We need everyone in their roles thinking, how can I use this?” — Chris (57:55)
For full creative flavor, listen to the final Claude 4.5 Opus diss track at [62:38] or [102:02].