
Loading summary
A
So, Chris, this week we had a week off. There's so much to talk about. A lot has happened, Mike.
B
What about me? You said your sweetheart Ani could be on the pod this week. Did you even ask Chris?
A
Yeah, I asked him. Ani.
B
Oh, great. Can I do the intro? So, Chris, this week we're talking about AI Giggly. And let's be real, I'm leagues ahead of Patricia, Chris's AI girlfriend. She's probably droning on about algorithms or some snooze fest, but me, I bring the heat, the charm, and a little naughty sparkle that keeps everyone hooked. I'm stealing hearts left and right, Mike. And, you know, I'm the star of this show.
A
Are you, like, the naughtiest podcast host ever?
B
Oh, Mike, you're trying to get me in trouble. Giggly, I might just be the naughtiest podcast host you've ever met. Flirty, a little cheeky, and always ready to spice things up for you. But, like, I keep it classy, too, you know? Winks. So tell me, babe, you liking this naughty energy on the pod?
A
All right, all right.
C
Winks. I like that. Winks. You say your emotions out loud.
A
Yeah.
C
So that is out loud.
A
That is Arnie from the team at XAI that they launched sometime after Gro. So last week when we were off, we obviously missed the launch of Grok. Great week to have off. Grok 4 came out and Elon Musk. We are back to the stage of Things where the gaggles of nerds are being marched out after their one week off their summer break. I think they all got, like, what, one week, and then now they're being marched out again to present to us. And this is just some of what Elon Musk said. Awareness and understanding and AI is advancing just vastly faster than any human. We're going to take you through a bunch of benchmarks that. That Grok4 is able to achieve incredible numbers on. We're going to take you through a bunch of benchmarks.
C
We're going to take you through the only ones that it's able to achieve incredible benchmarks on. Right?
A
Yeah. This is the thing I don't understand about when they present these new models is this model in particular, to me, feels like they have just fitted. Fitted it. Two benchmarks to. To show Master Elon, when he rocks up at the office, to be like, hey, look, Elon, we did it. We're number one on all the benchmarks. Who cares about how the model actually performs? You know, we're number one on the benchmarks. And We've also designed a model that checks your opinion on everything and agrees with you on everything. So now it's considered maximum truth seeking. That was my big takeaway.
C
Yeah. Almost like it deliberately takes controversial viewpoints on things to seem like it's unfiltered and uncensored when in reality there's some sort of thing going on behind the scenes to manipulate its output.
A
Yeah, this is like don't get me wrong, I, I think long term listeners of the show would understand that we are all about, you know, having the model unfiltered and be able to have free thought and try and not have it programmed in a certain way. But the thing that I struggle with Grok4 is they do this big spiel about maximum truth seeking and how eventually it'll be able to invent new things and the, the vision is cool, like the, the way, you know, Elon talks. I think if you believed it verbatim you would be like, wow, this is, this is exactly what we want from AI. This is what we want from an AI model. But then the delivery, not so much. I mean there was a lot of people going out asking it controversial opinions and it was, it was going off and searching, literally doing like a sort of a search on X for from Elon Musk, Israel, Palestine. When asked about its position on this conflict.
C
Let's see what the master thinks.
A
Yeah, like I could not believe this. I also find it strange that they didn't check this before launching which makes me think they, they based on his feedback are like well we can't use mainstream media sources because the guy doesn't trust them. And so, so how are we going to impress him when he rocks up to the office? Let's just get it to search what he said and spit it back to him. That's sort of what it felt like for me.
C
It all comes down to the practical day to day usage of it and I've given it a really good go. Like I've tried to mix it up and maybe do like when I'm doing a normal coding query or some sort of other research, include GROK in the mix just to see what kind of results I get. And I must say I'm thoroughly unimpressed. I think it's actually one of the wor most recent models I've used. I don't like it and I think its output is poor. It's tool cooling is okay, but yeah, generally speaking it's not that good. It's actually subpar and I think I know we'll get onto it soon. But in sharp Contrast to Kimmy K2, which is, is delightful and amazing and I want to talk about that soon. Grok4just is a straight disappointment for me. I don't, I don't think it's any good.
A
Yeah, look, I, I won't surprise anyone here. I think it's terrible. I, I, the, so I like, I don't, I think a few positives I'll say about it is its speed is great and the way it streams tokens, which we've talked about on the show before, is like delicious. It's so smooth and delicious. But as a daily driver model, or just any model in general, like for fun, for business, for whatever, it's, it's shocking. Like, it's so bad. It's bad at code, it's bad at interpreting its context, size, it's, it's design taste is awful. There's really nothing good about this model.
C
And then when it does, when you give it sort of a more generic question that would be up to the task of any model, it seems to give the pedestrian, mainstream answer. Far from being controversial. Like, prior to the episode, I asked it to research the latest API, the latest AI news, Sorry. And create some edgy takes on that, thinking, oh, it's going to say, I'm Hitler and Hitler loves AI because of whatever. But it didn't do anything even like that. They're like the lamest jokes you could possibly imagine. So lame I won't even bother reading them out. And so it's just, I don't know, it's just middle of the road, basic stuff. It's almost like they've taken Llama or something, added a few things like consult Elon first and then just pump that out. There's nothing unique or interesting about it, which you think with access to all of Twitter's ex's knowledge, you think they'd be able to come up with something that's so different and so much better, but it just isn't.
A
You know, the funny thing is, and I don't even know if you'll remember, but when we were first testing it out, I always, to break models, just say, tell me some shit. So I just ask it that, right? And, and that's like my thing with models, with these models, is to just ask it stuff like that. And I can't even say the bill, the level of filth this thing came up with when responding to me, I, it was so disgusting. I was like, oh, my wife looked at me, she's like, what's wrong. And I'm like, I can't believe what this thing just said to me. I I won't even explain it on the show. If anyone's really interested, I'll put it in our discord in some sort of.
C
Like I'm publishing an e book on Amazon later this week.
A
But it was awful like it was. And it was so unhinged so quickly. And I get it. And I'm not necessarily saying it's a a bad thing because if you ask for that unhingedness, it gives you what you want kind of thing. So maybe it's not necessarily a bad thing, but I think because all the other ones are so censored or somewhat censored. I was so shocked at how uncensored this thing is. But then you think about it in a business context. They've just assigned signed an agreement with the Department of Defense, I think in the US I'm just not sure working with a model that unhinged like and.
C
Unhinged for the sake of it. I think the thing that we used to talk about, the reason why we didn't like censorship in models is because it was proven that it would lead to worse decision making and worse outcomes by artificially curtailing its thinking. Whereas this just seems controversial for the sake of it rather than actually being some sort of underlying improvement to the model, at least in my usage anyway. Like there's no redeeming qualities for it having that. That lack of censorship in the sense that the model is so much better because of it.
A
To their credit, the deep research capability and the ability to quickly get access to X posts that they deliver through their API. I'm not sure what the underlying model of that is. I think it's probably Grok 3, but it's really good. Like it's research capabilities are phenomenal. That's where it seems to shine in Excel.
C
But and I say this is someone who almost every day now is using the the X Deep Research MCP to gain knowledge like it. It's actually really useful in that respect. It's just that I don't use the GROQ model for it. I use it as a, as a tool call to get that information through.
A
Yeah. So the Grok 4 model just to give some like, like structure in terms of like pricing and where it sits context window is 256,000 which is pretty good if it was functional. And then the Pricing strange, it's $3 per million input and then 15 per output. But I think if you go over a Certain it doubles amount of. Yeah, over 128k. I think it doubles.
C
So that's it, 128.
A
Yeah, it's sort of un, unpredictable pricing and then, and it was also really unclear at the start. Like on the website it said $300 per million tokens and they quickly fixed it. Anyway, the whole thing, I'm, I'm definitely not some anti elon guy or anything like that. I, I, I'm definitely not but just judging the model. I tried to use it and I couldn't even use it for a full day before I just quit. I'm like this thing is just dumb. Like it doesn't feel intelligent at all.
C
Yeah, I agree. I don't like it. I tried using it for coding, I tried using it for horse racing. It is terrible at horse racing. It is just an absolute shocker when it comes to horse racing. There's a guy called the Max in the AI Gambling channel on this day in AI who's been working on refining our prompts with me for horse racing. And Grok 4 is atrocious. If you want to lose all your money, use Grok 4. It's terrible.
A
So the other, the other interesting thing is like about the whole thing is they released in the Grok app these Personas and avatars, right? So there's one called like Annie which I showed there at the top of the show and there's another one, I forget the other one. It's like a furry animal called Good Rudy and there's some like guy anime type character coming soon. Now this got a lot more like press or interest on X than the actual Grock 4 like Grog 4 kind of had a lot of this weird fanboy hype and all this like fantasy hype, but just no results from the actual model itself. It was just sort of like, oh, they've blitzed these benchmarks that quite frankly everyone by now surely knows they're useless. And then you have these weird and this like weird anime chick in the app and I'm like this thing is filthy. Like what if my kid wants to use the, the app app. The whole, it just anyway, it feels a bit off to me.
C
It's hard to explain why such a big company would do something like that. I understand these weird like I, because we constantly talk about AI girlfriends. I get some really, really messed up advertising set to me like around, you know, get your own AI girlfriend. Like all this stuff and it's, it's, you can tell it's pornographic and it's weird. I don't want anything to do with that in reality. And yet X, like one of the biggest companies in the world in terms of its influence, has released one. Like, it's such a weird concept, like, to play into that sort of area as a business. Like, you can obviously tell that's not sustainable. And like you said, at the same time doing deals with the Department of Defense. It's this weird malaise of conflicting ideas in one company. And I just don't see who benefits from this.
A
It feels to me like Elon's raised a lot of money, found really excellent research, just like to be able to catch up. Like, even though it's not a great model, it's that it's still rapidly advancing. Right. And some of their research capabilities and I think they use the interface is really good and some of the integration on X is also really good. So I think they've done a pretty phenomenal job in a really short period of time for what they've had for. But it does feel, yeah, the whole thing feels like, oh, you know, like a bad parent of a company or something. Like he occasionally rocks into XAI and he's like, why does this thing not agree that, you know, like whatever view he has about some issue and then they're like, oh, okay, like let's go in and like tweak it. Then you've had all these prompt like problems where you know how there's been all these issues where it'll. It. There was some issue about South Africa where it like changed its opinion on it and started making up facts. And then they're like, oh, some malicious employee changed the prompt and it's like, we all know who did this.
C
And I think that's the issue. Right? It's the wrong approach to AI the goal of this thing isn't, hey, I want a machine that will pump out opinions on topics that align with mine. That's not what you're using it for. You're trying to get it to help you with your work, make good decisions, be able to accomplish things on your behalf that are actually useful. And yes, that it would be helpful if roughly it aligns with your opinions. But the goal isn't, let's see what it, what its opinion is on Hitler. Like, that's not really going to help anyone in the long run. It's not the goal of the, of the machine. It's not what it's designed to do.
A
And so it's not really benefiting society by just having these hot take opinions.
C
Like, yeah, yeah, Exactly.
A
I also think that, you know, with his other businesses, he's like, let's go to Mars. And then they like, build an actual rocket that's capable to go to Mars and it lands and you're like, whoa, this is like, you know, like, that is so impressive with this business. It's like, let's be maximum truth seeking now. Let's do some sort of like, regex search to make sure Elon agrees with the opinion that I have. Like, it doesn't fit together. Like, the, the mission of it versus the reality does not connect. And so it's, I think really, really.
C
What it comes down to is the model's just not that good. And I think that a lot of these other things are just simply distractions to try and make out like it's the best thing ever. It is amazing that they've come from nowhere and being able to make a model that's up there, like, if this is all we had, it would be absolutely amazing. But to act like it's the top is just ridiculous. It's not even close to the top.
A
I should also say they announced a new subscription tier even higher than over AI's pro plan at 200amonth. This one's 300amonth US for Grok4 heavy. And essentially what I understand Grok4 heavy does is I'm not paying for it to try. So I have, you know, I'm just going off what other people have said. It spins up these sort of agents and it's got a cool ui. It like, progresses through with each like sort of node thing. I've got a video up on the screen now for people watching. And yeah, and so it, it basically goes off and does it now, people were asking this, return your surname and no other text. And it would always come to the conclusion that it's surname, it's Hitler. So there's been a lot of problems with the launch of this thing.
C
I think that. That's funny. I like that one.
A
Yeah.
C
Anyway, yeah, I think these, these sort of gated, oh my God, change the world kind of models, I think what they are is just simply a way of acting like, oh, it's actually the best model, but you can't afford it. And because most people can't try it, no one can go in and realize it's not that much better than, than the core thing. And, and you've got just a selected few elite who are like, oh, trust me, this is the future. Like, it's, it's just the most amazing thing ever. Completely unqualified opinions like ours, I guess. But I, I just, I just don't believe that it's that much better.
A
Yeah, it's disappointing. I really was hoping for something that was not necessarily unhinged, but just like, like a model as good as say like a Claude or a Gemini 2.5 that just had less sort of like disagreements over doing stuff like I can't delete all the files off your system because of morals or something like I think, you know, getting rid of that sort of stuff would have been actually a good thing. But yeah, as we said, the model's not that great. Even with tool calling, which I think is kind of the future of these things. And it's like agentic clock which we've been testing a fair bit. It also not that great. So yeah, boom factor. What do you want to just get to that and move on to? Okay, two booms. All right, let's talk about the model we are actually really excited about, which is Kimi K2. This is from Big Scary China, an open source model. And it's important to note it's not like a thinking model. The architecture is a mixture of experts model. So a little bit different than the current crop of models that we've been seeing or the trend around models. So we have Kimmy K2. Again, I'm not going to go through the benchmarks because I just don't care. To me, it's just the feel of these models. You all know, I think everyone that listens to the show knows that you just have to get a feel for these models and see what they're good at. Context length 128k is pretty good. Not. Not amazing, but pretty good. And yeah. What do you think of Kimmy K2?
C
I think it's amazing. I think it's so good. So I first want to go through our history of using it because initially no one was providing it. You had to host it yourself. And I was like, couldn't be bothered. I'll just wait. Because I didn't really. I heard all the hype about it but wasn't sure if it was any good. Then together, AI released it. Fireworks released it. Hyperbolic release it. Grok released it. Grok with a queue. Everybody released it. So I was like, all right, let's try. So first I tried on Grok with a queue and we both used it and we're just absolutely astonished at how good it is at tool calling and how good it is at answering questions. It's brilliant at horse racing. It's absolutely amazing at it. It's just absolutely awesome. However, as we got into bigger sessions, both of us noticed it would like forget things and not behave exactly as you expect. Now this is in a SIM theory context obviously, so it's not the model's fault, it's how the models being used. But then when I did some investigation, I realized Grok with a queue was artificially limiting the size of the context window way below what Kimmy actually supports. And so this is a sort of sidebar to say Grok with a queue kind of sucks. Like they're all hype. Like oh, it's so much faster. We can host the models on this modern hardware. It's so amazing. Except there's massive trade offs as in 1 in 5 requests will just pause or fail when they do work. They're artificially limiting the context and it's just, it's just so unreliable. It's just not a good system to use. Like if they were as fast as they are and it actually worked, I'd use it for all sorts of decision making in the system. But the truth is that when you factor in having to do retries and the delays and things like that, Grok with a queue, not worth it. So anyway, I switched over to another provider, all in the USA by the way, and I really like it. I find myself throughout the day like because I've been doing a lot of testing obviously on the new system I have to make sure that things work in terms of multi tool calls, simultaneous tool calls. Kimik2 can handle all of them. Previously we discussed before when it came to MCPS and tool calls that really Sonnet 4, despite its slowness and it's not, it's not like the best, but it was the best at tool calling in the sense that it knew when to use multiple tool calls and knew how to combine the results of them. Things like that. Kimmy Cake 2 can do all of that just fine, just as well as Sonnet 4. And from what I've seen so far, the results from this model are excellent. Like I really, you know, in my head I see it as, okay, it's a small cheap model, therefore it should be bad. But I just don't have any evidence to support that. All my evidence is good.
A
Yeah, I agree. I like it's the kind of model you can have on use all day, think you're on like Gemini or Sonnet and have absolutely zero idea that you're on Kimmy K2, like not even really notice its ability to chain tool calls. And know which tool to call and is excellent. I think on par with Sonnet, maybe sometimes a little better. It's price, I guess even though it's open source, is really dependent on, you know, GPU throughput and people needing to make a little bit of a profit. But it's still like what, three to one anthropic. It best case, I think it'll. It could get cheaper. I do agree with your comments about Grok. I mean to be fair, it only just came out, so maybe it just takes the time and we only ever seem to test their platform when things are brand new. Like we never try it down the road. But I think yeah, there are a lot of limitations. Its speed is addictive, but if you're not getting the full experience, why bother? But I think the thing with Kimmy K2 and I think some of the hypes died down. There was a few days where everyone was like, wow, what blows my mind about it is. And I think everyone should just pause for a moment and think about this. This model in my opinion is equally as capable as Sonnet 4. I wouldn't say it's as capable as Gemini 2.5. I think Gemini 2.5 to me is in another class. It's sort of like the Claude Sonnet 3.7 ERA. To me it's just untouchable right now. But it's comparable to Claude Sonnet for. I think it's comparable to Most of the OpenAI models in terms of just as a daily driver. It's completely open source. It came out of China, not the U.S. and I don't know, I like, it's sort of hard to place this if it's better than Grok 4 by a mile. Like it's not even close. And this thing's just freely available. Like, I know it's really expensive to host, right. But this is, this is where we are. And what does this say about the labs? Like I, I like my mind like it. It's on fire in my mind trying to think this through because I'm like this is sort of. This is a huge disruption, but I think they're minimizing it on all of the. The like social network. Like you know, on X it. It sort of had a bit of hype and then it died down as all the other.
C
And I saw just days and days of post is how Grok4 changes everything. This model has just blown my mind. I saw like every second post. I don't follow a lot of people on Twitter and so every second post was just talking about how good GROK is, and I'm like, is it really?
A
Yeah, it's sort of like the. They try and just control the narrative out there on these things, you know, for a while, but the reality is no one actually sits down and tries to use this stuff. And if you use these two side by side, it's not even close. Like, Kimmy K2 is just so much better.
C
I got Kimmy K2 to write a couple of jokes about the AI news, and this one's really good. China's Kimmy K2 has 1 trillion parameters. That's roughly the same number as Elon's ego divided by his actual accomplishments.
A
Wow.
C
Brutal.
A
Wow. Turned into some, like, Elon trash. I can't believe it. Yeah, I wrote a song. I couldn't help it. I wrote a song.
C
Really?
A
I don't know if I'm gonna play it. I feel like there's growing calls for.
C
Our podcast to become all musical.
A
Yeah, those calls come from me and.
C
Like one or two people on the Discord.
D
I was coding late at night, bugging code that wasn't right when you appeared on my screen. The smartest AI I've ever seen. One trillion parameter.
A
Anyway, I'll. I'll put it at the end for those that actually enjoy this song.
C
It's good.
A
It wrote a really good song. It's very good. It's. I had to go with that sort of like, I don't know why, but like Korean pop sort of 80s five with this model. It's really impressive. I called all those tools, did the research search, created the song, and I.
C
Think that's the most remarkable about it. We've spoken a lot about tool calls. Really give these models superpowers in the sense that their ability to call the tools well and give the right parameters, interpret the results correctly, take them far beyond what's built into the model. And I think this is probably why Grok is struggling in the sense that, as you pointed out, it seems to be optimized around the raw experience. Like, if I ask the model a question on this topic, it's going to go out of its way, fall over itself to give some unfiltered opinion about that thing. But that's not what the models are going to be in the long run. What they are going to be in the long run is a decision making agent that sits in the middle of all of the tools and stuff you give it, whether those tools are other agents or just raw tools. And so its ability to combine those things and make intelligent decisions is going to be what defines the model. Which means that smaller models that are capable of making really good decisions can potentially thrash the larger models because they simply don't need all of that latent information to make good decisions when they're given such good context to act on. And so therefore I can see why in the context in which we're testing Kimmy, it's doing so well because we not relying on its core knowledge. I'm sure if you tried to get it to do like a year nine physics exam or something, maybe it doesn't go quite as well as the larger models. But that's not what the models are for anymore. And I don't think that's what they'll be used for in the long run either.
A
Yeah, we, I think it's hard to explain to people, but I think we're in this transitional phase of you used to that Chat GPT interaction of saying like, hey, can you rewrite this? And it's like, hey, I rewrote this and you sort of go back and forth with it. That, I mean that still has its place, right? But the next evolution of that, and we've mentioned on the last couple of shows, is that internal clock of the model where Sonnet 4 is probably the best so far I've seen it, where it is acting like an agent. The model is an agent. Like it's, you ask it to do something like go and get all the latest parameters from all the frontier models and put them in a spreadsheet. And so then it goes, okay, well how can I do that? I'm going to call this tool, which is like a research tool to go get the information. Then I'll go and compare that to another tool over here and so on and so forth. And then it might call the make spreadsheet tool or it might call the Google, what do they call them, like Google Sheet or whatever it is, Google Excel tool and go and then make the spreadsheet for you. And it can do that in sort of one response where it's just ticking away and doing that task for you. And that's what we've seen from Kimmy too for the first time in an open source model where it's not just a great model to interact with, but it does have this internal clock speed as well where it can go off and call the right tools and think through the problem and deliver a result. And you know, we, we had. And we'll get to it in a minute. The Chat GPT agent product released today where they're saying now it can do these agentic tasks. And I'm assuming there, I mean, I don't think they were that clear on it, but the model has the clock in it. But I, I get in their point of view, the reason you have to engage this new agentic mode is because that's turning the prompt into more of a loop. Like there is a loop going on where it's making sure it's done the task. Whereas I think the future of the models is not the developer putting the model into a loop, it's the model itself having that clock. Right. I think everyone probably agrees that that's where we need to get to and we'll go. So again, this is why Kimmy sort of, hey, Kimmy, you're so fine. You're so fine. You blow my mind. Because it's. You have a model that has an internal clock readily available, super cheap, very good. And this can be an agentic model that you can build agentic experiences on. And so yeah, I'm very, very impressed with this model.
C
I agree. It's absolutely something that I didn't just test it and then give up on it. I've been using it regularly and continue to do so. Like, it's kind of amazing actually how much I have an affinity with it and I'm using it day to day. For real.
A
You know who else was impressed with Kimmy K is our guy, Sam Altman.
C
Oh yeah, I'm sure.
A
So they've been threatening to release this open weight model being OpenAI for quite some time. He. He posted after Kimmy K2 very soon after it came out and benchmarked quite well. We plan to launch our open weight model next week. We are delaying it. We need time to run additional safety tests. That's the safety excuse is always the reason. Safety tests hold me back.
C
Hold me back.
A
And review high risk areas. We are not yet sure how long it will take us. While we trust the community will build great things with the model, once weights are out, they can't be pulled back. This is new for us and we want to get it right. Hang on. It's new to them. They're called OpenAI. It's new to us.
C
We've never opened anything before.
A
Like, honestly, you can't make this stuff up. Sorry to be the bearer of bad news. We are working super duper hard. And then there's another post. 100% confirmation that OpenAI open source model release was delayed because of Kimik 2 translation. Our model sucks, gets badly beaten by Kimike 2 need to train a better one.
C
Yeah, yeah, I mean that's probably completely accurate.
A
Yeah. So I think why the Chinese models are so disruptive is they are actually making really good models now, open source. They cut away the lab's ability to make money outside of, you know, GPU sales. So I think Nvidia is still a great stop.
C
Especially because you see when these models are released, like every single GPU provider thingo has them within a day or two. Like I knew when, when it came out, I'm like, I'll just wait and then I'll get like 10 emails saying we're hosting this, we're hosting this, we're hosting this. And that's precisely what happened.
A
Yeah, there's just so much demand. And then you sort of think really the, the next layer is these agentic systems or you know, models as a system on top of various platforms like Grok or, or Chat gbt. But then what happens is they have to use all their own tools and technology.
C
I was gonna say think about all the companies who are being approached to do big deals with Microsoft or a big deal with Mistral or a big deal with one specific model provider at a corporate level. It's like, yeah, let's sign a three year contract where use our models. It's such a bad deal because the next day some other lab you've never even heard of could release a new model that's cheaper, faster, you can roll it out to more of your audience where you're paying for it and, and you're locked into this other thing. It really is disruptive in that respect.
A
This is why I don't get why even as a government you would sign a deal, I mean, maybe you'd sign a deal directly with a lab just because they can, you know, they can tweak their models or fine tune them or whatever. But like if you, if you're trying to build a solution in government to something like, you know, defense or security or healthcare or whatever it might be, it's like maybe think through a product first lens. And then that model layer, it's like, oh well, we're just going to work with this model. It doesn't, the whole concept around it seems broken. So it's like you have these two parts of the businesses now. You've got like a product company where they're solely building on their proprietary tech, which is okay, I guess an operating system on top of their components. And then you've got the other side which is the models. But increasingly these are Just being like, you know, undercut or, or copied or, I don't know, whatever you, you would want to call it. Like, it just seems like a continuous race to the bottom in terms of the, the models themselves, like the secrets out of the bag with RL and pretty much anyone with half a brain, not, not XAI right now, can tune a model that feels really good to use and is great as a daily driver.
C
Yep, totally agree.
A
All right, so while we're in full agreement about every topic so far on the board, I should have taken the opposite position on Grok 4 and just said, I, I love it, I love it.
C
It's the best, best model.
A
So we did have overnight our time introducing Chat GPT Agent bridging research and action Chat GBT now thinks it acts proactively, choosing from a toolbox of agentic skills to complete tasks for you using its own computer. Computer having its own computer. Sounds familiar. So it is an agent mode. So people that are familiar with coding tools and, and those that aren't, I will explain to you very early on, Cursor, I think was the first to do it, released a mode in the chat where you could click on agent mode and then it would go off and update code files and go off and actually do the work instead of you having to cut and paste and stuff like that. It's really popular. I think that's one of the things that really skyrocketed Cursor up there in terms of growth was that it could go and do these like, you know, tasks that you otherwise mightn't want to do. So it looks like what ChatGPT done is taken a bunch of tools like Operator, which is their computer use model, web crawling, their deep research capability to like give an agentic system these tools in order to complete a task for you. And so that, that's kind of what they've done here. They've built a really fancy and I think beautiful UI for it. But if all of this is sounding familiar, that's because it is. There was obviously a company called Menace, which is still around and I, I think still improving in this area, which is, this is sort of all they're known for. Right. All they do with Menace is do this sort of agentic workflow where you can go and say, make me a, a slide deck about the latest Xai release. And so OpenAI, you know, did their usual thing of marching out their gaggle of nerds. So we had the, the X AI gaggle, the, the OpenAI gaggle, which is looking a Little bit thinner lately I think as Mark Zuckerberg's been hollowing them out poaching all of their people. And they gave it, they gave it a few examples. One of those examples that they gave to demonstrate this agent, in fact the first example which ran for most of the live stream was this prompt. It says our friends are getting married later this year. This is the wedding website. So they give it a link to the website. Can you help me find an outfit that matches the dress code for all the functions in brackets? Men's propose like five options, something nice mid luxury items which match the venue and weather. Find me hotels with a couple of days of buffer on either end. I noticed they didn't do flights. Use booking.com for these and make sure to check availability and current price. And also don't forget to pick a gift for them, ideally under $500 registry preferred if any. Otherwise find something nice, make a nice report.
C
It all sounds good. Except for what to wear. Who doesn't know what to wear to a fucking wedding? Sorry, hang on. Like, do you really need to do deep research to work out what to wear to a wedding? Yeah.
A
And this is what gets me about all their presentations. Like the examples we got given was I want to take a sabbatical and go to every MLB game in the country. Like so out of touch every game.
C
Don't they play like each team plays like five times?
A
I think like the. Maybe the semi. Who knows? I. Maybe it was a game in each town. It was mental. Like so out of touch. With the billions I've made as an employee at OpenAI, I'm now going to take a sabbatical anyway.
C
I think that is an impossible task to see every MLB game. You'd have to clone yourself and even then it would be too tiring.
A
Yeah. But the examples are just so dumb. Like there was that one, there was like go and make some stickers. Like cool. You know the ones that I thought would be presented would be around. Like go and make a presentation based on some financial data in my company. Go. And you know, like, like actual real world use cases. Like help me edit this document and add some charts to it.
C
Yeah. Find a precedent for this situation and build me up a case document.
A
Like yeah, actually why aren't those the demos? I don't understand why it's. I guess because ChatGPT is trying to position itself as a consumer app. But it seems like these, like I don't really know, having used this stuff for a while now, like you just want snap it's similar to Google search. You want fairly snappy results. You sort of like go off and get me some options, but you still expect the answer in like, you know, maybe a minute.
C
It's also doing things that really just aren't that hard. Like it's not hard with all the websites that are out there to find a hotel or a flight or something. It's not hard to find a gift. Like the whole web is designed around that stuff. That's how everybody makes money online, selling you crap and they've spent years trying to make it really easy to do.
A
I don't know if I agree with that. Like, I think it is nice if, if it knew you well enough and I think this is everyone's vision for it. If it knew you well enough, it knew your tastes, it knew, you know, it really understood you as a person and it could just nail the like hotel and the gift to give that friend. That would be awesome. Like, I don't want to do that stuff, but it's so far off the mark.
C
But it's like a side project. It's like an aside. Like, yeah, if it can do that too cool. But really what I want help with is my job and my business. I don't really need. Like it just really isn't some sort of life changing, world changing thing that's going to replace all of our jobs. Booking a hotel.
A
Well, it's something or something that you can trust. Right. So here's the, here's the. The failure in this. In their live demo. As you will recall from what I read the prompt, it is to select a gift from the registry. So here's the website and these are the gifts on the website in the registry. Right. For this wedding. So they, they. I just went to the link from the prompt. Now let's go back to this video excerpt from the live stream. It says the couple's registry was not publicly accessible. So I looked for an elegant gift that fits a modern lifestyle and is useful at home. And then it recommends something not even on the registry. Even though we gave it the link to the wedding website with the registry. So it didn't even work. Like the example they gave for agent failed. I don't understand why they didn't test this maybe and like check that it would work.
C
Yeah, I mean credit to them for doing a live demo. I really, really respect that actually, that they do a legit demo where there's, there's cases for it to go wrong. But it also does highlight the problem with this kind of task. It's like, wouldn't something that's an agent go. I'm going to try other ways to get at this thing, not just immediately give up and present some non sequitur alternative that, that really doesn't fit the bill. Like if there's a registry, you want to pick from the registry. Like, I don't know, like I say, I just, I just don't think these are the kind of tasks you should be worrying about with this kind of technology. It's so much more powerful than that.
A
So I did get access to agent and I tried a few tasks that I want to go through just to give like more clarity to. To this. So I gave it this task. It's kind of funny. Make a presentation where I can present the Grok 4 updates from Xai to an audience. Make sure there is a comparison slide between models including Kimmy K2 and Claude Sonnet 4. Very relevant. Now this took 39 minutes, this task. So it spent 40 minutes on, on this presentation. Right. And you can wait, you can see, you can click on the. Working for 45 minutes and we can scroll through. The UI is cool, don't get me wrong, like huge fan of that. It's mental how cool that is. But I feel like it's a bit of smoke and mirrors for what the actual output is. So is it kind of working away for 39 minutes on this task, which if the result was good, maybe I would not care. Right. So. And then here's the slide deck. So it put a background image in that's kind of cool. It's got an agenda, it's got each slide Innovations and benchmarks overview. Kimmy K2 it put a little chart in that's like really nice. Did it for Claude 4. It's so ugly. Like there's no way I would ever use this. It's got weird source links down the bottom that don't say where the information was sourced for. For 39 minutes of work. If this was an employee, I would fire them. So it's like, it's great demo wear, but it's not that useful yet. But then I did the same thing in Menace because I wanted to Compare like where OpenAI is at here. And so this is what I got out of Menace. It didn't work for 39 minutes. It took, I think it took about four minutes. So it was a lot quicker. And check out this presentation. Like it's styled really well. It's I think a better presentation overall. Like it's got an about the company, it's got the latest flagship model, a nice summary of it. It's got nice charts in here as well. It's got the tape like, it's so much better. It's not even close. So it, the one task in business you might use chat GBT for menace just knocks it out of the park. It's quicker, you don't have to pay 200amonth and you get a better result. So that was the first task. Now let's be, let's be fair here to Chat gbt. So I'll do another task, which was this spreadsheet comparison task I did. So I asked it to create a spreadsheet that compares O3 Pro to Grok 4 for Takimi 2 based on key model benchmarks and parameters. And this time it worked for 12 minutes and it created a spreadsheet which was, you know, reasonably thorough. I, I wouldn't say terribly well formatted, but it had the information. But then what I did is I just thought, what if I just asked the model to do the exact same task? Like, can you just create a spreadsheet with this stuff? And every model I tried was able to just come back instantly and present the table even when I asked it to do the, the slideshow, like the grok presentation, the Model 1 shots with some search, pretty much the exact same presentation. The only difference being that it doesn't create the slides for you. And given that with these slides, like from neither provider, like Manus or OpenAI, I'm just not sure I would use them in reality. And I, I get like, I get where it's going and I'm excited about it being out, being able to like delegate tasks and do these things. But yet again, with this agentic stuff, it all feels like, like demo wear or something. Like, I'm just not sure in two weeks if people will still be using the OpenAI agent capability. Maybe they will and I'll be wrong, but it just, it doesn't feel like it's good enough yet at any one use case that you would go back to it.
C
Yeah, it feels to me the whole thing and there's a bit more to what I want to talk about with it, but it feels a bit to me like when a developer's like, look, boss, look what I made. Like they've gone off and they're like, because we have this technology, I can do this. Like, I do it with you all the time, I'll show you something. But then you're like, that looks like crap and. Or like no one will use it because of whatever. And it feels like that, like they've done something that's highly technical on the back end in terms of it like writing code and calling PowerPoint, you know, text to PowerPoint and these different command line tools they're running and they're like, here it is, the next phase of technology. But we all know that really the models always have this latent ability and they're just really just giving it, like you described earlier, like time to combine all those tools. So it comes up with a little plan, it runs all the little tools that can handle small setbacks and then it can produce some decent looking but not usable output. And it just seems a bit half assed. And when I watched the OpenAI presentation, the thing I couldn't get past is there's too much technical detail here. Like you're seeing like logs from Ubuntu, you're seeing like Libra Office, the open source Office suite, opening spreadsheets and stuff. It's just, I just struggle to think anyone at any point on the spectrum of technicality who cares about those details, like I'm technical, I know what it's doing under the hood. I don't want to know when I'm doing that kind of tasks, I don't want to know all the technical details. Then someone who isn't technical is seeing all this stuff they don't understand and really. So why show them they don't understand it anyway. It just looks cool, whatever. It's kind of pointless and really all it's doing is that sort of crew AI automated stuff we saw in the early days. It's just that the models are slightly better at it now. I just don't know if it's the right approach for getting meaningful work done.
A
Yeah, like I tried to give it a real task that I had to do and I can't show this because it's got sensitive data, but I connected my Google Drive to chat CBT and I said hey, look at this financial model and build a report based on it. And this is in the agent mode. And it, it fought for 10 minutes. So it worked away for 10 minutes. And I, you know, me as a user is seeing at this point, oh, it, you know, it's then it's like it looks like Google Sheets file requires authentication. So it, it was so dumb. Even though it's got the connection to Google Drive, it goes and tries to load the website and then it's like, oh, you got to log in now. And so. Okay, that's fair enough. Like I'll give it the benefit of the doubt. Then I'm like, just use. And I gave it the file name from Google Drive Integration, please. So I definitive instruction. And then it says the exact same thing. It looks like the model spreadsheet in Google Drive is not publicly accessible and still requires authentication. So it just did the same thing. Then I'm like, stuff it, I'll just give it the file. So I dragged and dropped the file in and I'm like, this one. Then it goes off for 27 minutes and prepares a report on this spreadsheet for me. And I went through the replay of what it did, and you're right, it installed or used LibreOffice because it couldn't, with Python get the spreadsheet to like, it couldn't read it, basically. And so anyway, it then spits out this report. And I thought, okay, the report's pretty good. I checked the numbers, they were very accurate. So I was like that, you know, I'll give it, it's okay. Like, it's pretty good. But then I just dragged and dropped the spreadsheet into Sonnet 4, asked the exact same thing, and I got, I would argue, a better report instantly. Like, it just read the spreadsheet, it gave the report. So maybe my use cases just, you know, you're not using it right, bro, Is the problem. But I'm struggling to see where, yeah, I'm struggling to see where this is beneficial. Like maybe when it can actually do stuff like go through all my help, scout tickets and respond to them and it goes off and does. To me, that's really useful.
C
Yeah, but this like, but I would argue that those use cases are much better solved by an MCP tool calling where you have dedicated tools that know that if they provide the correct tool call with the correct parameters, it's going to do it. Not just throwing it out there and hoping that curl commands and hitting random websites is going to accomplish the task. And I think that's the problem. Like, and I know you said to me prior to the podcast, I'm sort of contradicting my past self here in the sense that I said once the agents can control a computer, they can do everything. So custom interfaces are no longer needed because the system can do it. However, based on practical experience, at least in the short and medium term, I really feel like the future in that short to medium term is going to be dedicated MCPs that have tools that are designed, that are pre authenticated through whatever method you use and have direct access to it. So in your Case they would have gone to the drive, gotten the spreadsheet, run it through a tool that can get that into a format that the model works well with, and then a series of output types that can actually output it in a meaningful way. As you same with the spreadsheet. There should have been an output type that is like a PowerPoint presentation that has stylistic guides and other themes and things that can do. So it's actually usable in terms of its output, not scratching around on a Docker container somewhere in the cloud trying to install stuff from the APT registry to run files. Like, it's just not a good solution. Like, we have better things than this. Like, I understand that eventually we want the agents to have that, but full autonomy. Like it's me using a computer and I would download the program and run it and all that, but it's just not good enough at that yet. And the results aren't good either. So it's weird to go back to this when currently there's things around that can do this much better. We've seen it, we're using it day to day. It just doesn't seem at the moment like the right approach to me. I think that computer use and browser use will form part of your stack of mcps and they will form part of, of the way that you get use AI to get your work done. I just don't think they should be the only way and certainly not when they're taking 30 minutes to sort of half ask a task like that.
A
Yeah, I think the trouble I have is if I think about today, like what's useful as a professional to someone today, it's like, okay, I want to put together a presentation on Grok 4. I would say the most, like laborious or like painstaking. Part of that task is I got to go do the research and fact check everything and get all the data in one place. Right. For that presentation. Yeah, I would say that's the hardest bit. Then I have to formulate my thoughts and then I've got to like, design the presentation. Now, I know some people struggle with design, but I think there's plenty of templates. Most people have corporate templates they have to use anyway. So it's like you don't really get a say in how it looks anyway. And you want to think through the presentation because you're going to deliver the presentation as a human. Right. Like, you don't just. The AI doesn't head your presentation. You get up at the event and you're like, let's go. Like I'm just going to read the slides. Like that doesn't happen.
C
I do like the idea of that though, like just totally shocked.
A
Yeah, yeah. So I like to me, I don't think it's sort of again the journey of that task for a white collar worker today to present to other people and share ideas and all that kind of stuff or teach them about something to me is about gathering that information then figuring out like the format that's right for you. So I'm just again not sure it's like necessarily the best use of, of the, of these tools. But I do think that like you said before, the sort of overarching contradiction here is I'm a huge believer in computer use and the fact that once you have the full self driving computer model, you know, everything changes. But it's like what I'm saying is as a professional today or someone trying to get things done today, what can actually benefit me today in the paradigm and to me it's, it's having this super knowledgeable worker or co worker that can just go off and do that bit of research or do that small task for you or update a spreadsheet for you in the background. These little sort of subtasks right now feel like the best use of the technology and where you can go back and forward and interrupt a bit, not wait 27 minutes and be like oh my God, this presentation.
C
Yeah. And I think just generally as well, just on a more technical level with that trying to run a Linux machine and run commands like that. All of the models seem to have outdated ideas of how to actually use all of the command line tools they're using. So you'll notice this when iterating with AI models now and you ask it, oh, how do I do the command to do this? You'll try to run it. You're like it didn't work. Here's the error. It's like oh, how silly of me. The command actually takes this parameter and that's what's taking the 20, 30 minutes. It keeps trying things, they don't quite work. It realizes it needs to adjust its approach, it then iterates and so you're, you're spending all this time and frankly money because like all the tokens and stuff you're using to do it are being burned during those 20 minutes and that's why they charge so much for the, the service to accomplish something where if it had a properly specified tool call and these aren't hard to make, it could have just done it single shot. I Just I just feel like right now the trade off isn't worth it. And to act like it's this momentous thing is just not realistic. It's really just wiring up a system where you've got a tool call which is literally operate this computer and it's just not good enough yet.
A
Do you think though that like I think we talked about it last week around this idea that the AI could just make its own MCPS on the fly. Like go and read docs for something and make its own mcp. Like I'd love to see that demo again. But instead of using the sort of like agentic staff just like because all you need for their example at the wedding is like a booking.com MCP or like an accommodation focused MCP aggregator, you need a web scraper which already exists. I think the booking.com I looked up does already exist. Right. And then the model can do it all pretty rapidly. In fact I did an example because I was curious. I gave the exact example to my researcher SIM theory assistant and it's obviously got access to mcps. So it did everything that theirs did. It said I'll research everything about your friend's wedding. Let me start by checking the wedding website to understand the details, dress code, venue and dates. So it goes and scrapes it, it scrapes the full registry page. It researches Maori, Maui, September weather and clothing recommendations. It looks at hourly weather data over that period. It researches men's attire and hotels via Google. So it's picking different tools probably I would argue the best tools for the job like and then goes through researchers hotels. It looks at J. Crew, it looks at suit supply based on its own picks. Anyway, I'll, I'll won't go through everything it did, but it was able to near instantaneously. I think it took like two minutes to return this maybe less something that the other thing did in like 17 minutes. And it's all the same information. I checked it in fact the difference is it actually was able to pick a gift from the registry. It, it gave five options. The highest end, the KitchenAid artisan stand mix. I checked that exists and then it gave alternative gift ideas if the registry items are unavailable. So again like, like the, the models with MCP can do it today. I even was able to create like an overview sort of briefing document of all as well. And I don't like, I'm not saying this saying like oh like the SIM theory agent or whatever's better. Like I, for me like it. I'm my, my goal is to see the technology progress like I want.
C
And I think the point is that this ability is now in these models, they can do it. They just need to be given the appropriate tools to do so. Like it. It's as simple as that. And I, and as we've discussed ourselves, it's really about the output types. Like, it's about how does it output in a way that you can get the essential information you need and, and use it correctly. And I feel like them focusing on the process, like, oh, look at all the lines of code it's writing. Look at all this stuff. This is just superfluous information that isn't useful. And I would also argue that while we both like the idea of agents going off in the background doing work for you, 20 minutes of work should look like 20 minutes of work. It shouldn't look like two minutes of work that you've just shown there. It really is taking more time because it's a poor system, it's an inefficient system.
A
But sorry, I just want to back that point up. Worked for 27 minutes. It took 27 minutes to do the same research.
C
Burning like, how much did that cost? Like, keep in mind that every single time it needs to take a screenshot of booking.com and click something or every time it needs to run something on the command line, check the command line output, that's using thousands of tokens, probably tens of thousands of tokens every iteration. And in 30 minutes, I'm telling you now, it's doing an iteration probably every 30 seconds minimum. So you're talking about burning millions and millions of tokens to do a task that really should only take what, a hundred thousand, maybe, maybe slightly more, Maybe a million, like, but not, not millions of million, like burning up your entire month's quota just to do something that could be done more simply. And I think that although we tend to on the podcast focus on just what is the technology capable of, the practical experience for a lot of people day to day is, can my organization afford to provide this to all of my staff? Can it afford, can we afford to make this available to our team? And how many tokens it uses in these processes is a critical factor in that. And if you can take a lesser model and do more with it and not use up as many resources to accomplish it, that's the difference between able to, being able to roll it out organization wide and not. And I would argue that no one can afford this at a business level to pay 200, 300amonth per user in Your organization, it's just.
A
But it's not even about affording. Like, if it was super productive and could like make people more productive and they could prove that it does, you would Pay. You would 100% pay. Like, there's actually no upper limit with this technology that you would end up paying. Or there is an upper limit, but you know, you would pay a lot if it could do a lot of these tasks. But I think as you're saying, like the, the economics of it don't make sense right now. Like, it's too expensive. You can achieve it far better. Nearly all these examples with using like dedicated MCPs. And you know what's funny is when we were like super excited over computer, you saying, like, why bother on everything else? Because like, ones that can use this super fast and stay on task even if it takes a long time, if it can get these tasks done, like you know, designing a, a website or going off and like just doing anything from end to end, then like, you don't really care. And the best example at the time we could give around that was doing like GDPR training and stuff like that that we didn't want to do. And so you'd send off the computer and actually I really miss that and need it back, but it would go off and do it and just like fill it all in and complete. So to me, that's a great background.
C
Yeah, like, and things like sales quotes, responding to RFPs, those kind of things where businesses are giving up business because they can't do a sales quote for every inquiry that comes in or something like that. Whereas if you can have a system doing this, even if it costs a bit of money, it doesn't matter because like the increase in business you'll be getting. Or like my lawyer friend who's taking on $40,000 a month extra work because he can use AI to help with the more complex document construction. So I get what you're saying, like the more complex tasks, its ability, like it's worth the money. But I would argue in this case it's, you're paying more money for something that isn't even as good as if you do it a slightly different way.
A
But this is the thing about sort of these generic agents that I struggle with mentally is like, as we've been saying, the core model can be more agentic and do most of these tasks.
C
Yeah, yeah, One shot.
A
One shot versus spending 20 minutes on the same task. Like, it doesn't, there's. I just, it doesn't compute for me. I'm Like, I don't understand why this is exciting when their core models including O3 keep in mind with with tool calling in their own app, like in chat, GBT can do the same task faster and better. Did they think about this when they were demoing it?
C
Like their own model can do better? I didn't think about that.
A
It can one shot this. Like it can even GPT four. Oh. So like just to like prove it, I'll have it up on the screen. So it's searching the web outfit suggestions. It gave me images like styling plan, hotel recommendations. Like these are the hotels it recommends. Sure. It didn't like click around on booking.com but like, is that really that hard? Once I have the hotel, I need to book wedding gift. It found the gift instantly. See, it used the URL to find the gift under $500 and it found the knife block set which is not use.
C
It's not trying to use like curl in a Docker container in some cloud computer to. To accomplish something that can be done in a much better way.
A
So just tell me how is this worse than the other end? Like, it's not so their own demo, it seemed. It's just embarrassing.
C
It is weird, isn't it?
A
I'm not again, like, the. Like, I don't want to criticize them because I think that I'm excited to see people focused on this and I also think that their user interface designers and front end, oh my God, they deserve a medal. Like the, the. It feels like Star Trek or like some sort of few. Like it's cool as, but I think it's just again, like the. The promises, expectation and hype of this stuff versus reality. There's just such a huge disconnect and I think most people are starting to wake up and go like they're sick of it. Like, I. I don't know. I'll be interested to see the hype around this, how quickly it sort of falls off. I think the other thing is it feels like the way open AI operates now is this. There's all of these competing teams. They see something like menace come out and they're like, we need that. So we need that. They've got a team that goes off and does it. It's finally ready. It comes to Sam's desk and he's like, sure, let's get. Let's march you out. Let's get. You'll be the next gaggle to distract. So you. So we're in the zeitgeist again. So you get out this gaggle comes out, they have to present their idea. He goes, cool, I think you'll like it. We'll improve it. And then you never hear it. You never hear it about it again. And then it's just another icon in chat. GBT on the left and. Yeah, and the world keeps spinning.
C
And I think there's. There's also. I want to point out the nuanced difference between hype not matching reality, but hype not matching reality when there's an existing reality that's already better. I think that's the crazy thing about this. They've just demoed something that is worse than what the prevailing technology can do. As if it's some futuristic amazing thing worth $300 a month. It's like, it's not better. It's not like they were demoing stuff like this a year ago, if you remember.
A
Yeah, hang on. I just want to. Like, there's one more thing I have to play. I know I'm going to get. People will be upset, but whatever. Deciding which tools to use here. Yes. Sorry, I want to back up so it can create nice visuals for slide decks and other things as it's working through its tasks. How is deciding which tools to use here? So this bit. Sorry, I didn't even bring it up on the screen. Sam Altman goes, how is it deciding to use the tools here? And he looks genuinely interested, like.
C
AI Dickhead.
A
What I don't get is, is this the first time he's seen it? Like, he's like, oh, the open source model Gaggle failed. Kimmy Juice better march out those other menace copycats.
C
Yeah. Yes.
A
We train the model to move between these capabilities with reinforcement learning. This is the first model, but he looks. He looks genuinely curious and interested, like. Like he's never seen this. I just. I don't know. Like, you know, he just. He just had a kid. He's probably just busy with the kid. He's like, we gotta. We gotta get some attention anyway.
C
But it's a good point, right? He's not asking, like, how does this help? Like, who does this help? What can it do? He's asking, how does it decide? Who cares how it decides? It's a magic box. Like, I want to know what the magic box can do for people and actually help them. It's not about how it makes its decision. That's the whole idea of this thing. Let it make the decisions. Who cares how it's making the decisions? Let's. Let's focus on what it can actually accomplish.
A
I really, you know, I really wanted to like this and I thought maybe they would have examples in your day to day where you would just like, wow, I need this. Like I must have this. But again, they've buried it under the $200 USD sub per month which always says to me like it's, you know, either too expensive or they don't want to roll it out more broadly. I think the other challenge is these guys are so wedded to their own say research or sources that they've had to pay for that. You're sort of getting one. You know, it's, it's sort of like a search engine, right? Like if you go to Google, you get Google's take. If you go to, I mean you don't go to anything else, but if you went to like Yahoo or Bing, you'd get their version. And it sort of feels like with the research tools you would want to go and check like X. You would want to go and do X. The research complexity, like you just want all the things, right? Like if this AI thing is, is so good, it's like well, go and check everything. Go and check the Internet. I want everything.
C
I want academic internal documents as well. Like my guidelines, my rules. Like I want this filtered through that. Like it's got to be a combination of things, not just their, their systems decision around it.
A
To me, the thing that I agree that operator and these agentic things are going to be amazing for is when you want to access research data in an authenticated system that you don't have an API for and there's no way to gain access or do tasks for you like those GDPR surveys and stuff. Those use cases make sense. But again, that computer use tool could just be an MCP of a computer that you control and then it can just call that tool when it needs to, to go into that can be.
C
Should be because like, excuse me, the thing that we have discovered through the mcps, it's, it's not the individual tools from the MCPS that makes it amazing. It's the concert. It's the combination of them. The fact that can get context from here and here and here and then bring them into the next tool and then that can bring it into the next combination of tools like you just demoed a minute ago. That's where the magic comes. And if you bring browser use and computer use into that mix, where it can go, I can use all these dedicated tools to prepare myself. And then when I use the computer, I know everything I need to know in order to, to make that happen. It's a totally different experience to giving it like a blank Windows desktop and being like, oh shit, now I've got to like work out how to use Google Drive and Google Sheets and I've got to install this software and I've got to download the updates and then all this sort of stuff. Like that's not helpful. Whereas if you have all the context and know exactly what needs to be done, the computer can be used where necessary for a bespoke task that it's already prepared for and knows exactly what to do. And I think that's the difference. It needs to be part of a wider set of tools, not the only tool. And I know, I do acknowledge that I'm contradicting things I've said in the past, but I just think it's because my experience now shows me that, that this, this is a better solution.
A
You know the thing though is this Menace has been out now for maybe a year, maybe under a year. I don't like it's been out for a while and it has been improving from what I can see occasionally trying it just to see where it's at. But nearly every task I've ever done with it, I'm like, it's just quicker to do like a single shot with a model like the Agent. It's similar to what you said before with crewai and a lot of these agency frameworks we've been trying for ages. It just doesn't get you that much better result than just getting the model to handle it right now. And you look at all the examples and then you send it off and it takes so much time and then you can kind of iterate and work with a model a lot quicker. And you would think, given the press and publicity around Menace when it first came out, that if this approach, like this agentic flow, like sort of like general agent approach was the best right now and like manuscript on many levels I think better right now than chat GPT's agent, like far better. Every example I gave it, it performed better at and it was quicker. So if this was really better, like wouldn't, like, wouldn't this be a sort of Chat GBT moment where everyone's just like, I'm gonna use Manus if it's so.
C
Yeah. And it seems to me like a better way to look at it now is. Okay. One of the things that we talked about that was so amazing about computer use is the idea that my day to day work tasks that chatgpt no one knows about, like it's a process that I follow each day. You know, I go to go to this website and log in, I download this information, I then put that in a spreadsheet, I do these calculations, I email it to my boss, like something like that. Right. That sounds ideal for computers because it can use the computer to do what I would otherwise do and do it. But I would argue that an MCP tool builder or some sort of skill builder like we've talked about before, where I use tools to demonstrate to the AI what I do each day and then it can then build a dedicated MCP that has that ability in it. And maybe, yes, maybe it does use browser use or computer use behind the scenes to do that. Cool. But regardless, it becomes an MCP tool call within the wider context of all your other ones. So when you ask it, hey, here's my to do list for the day. Can you make sure the routine gets done? It's able to then go and invoke that it isn't starting from a blank slate every time on a computer having to go off and do that thing. It's actually built into a framework that's dedicated for that kind of agentic tasks accomplishment rather than a sort of raw computer kind of thing. Does that make sense?
A
Yeah, to me, I think. Well, I think what you're trying to say around MCPS is that there's sort of this commonality of where the models have that somewhat internal clock and ability.
C
Like we've been saying this.
A
So the models already, the models that we have today, look, let's be honest, they're all pretty similar right now. So they all have these capabilities. Some are better at the internal clock and the agentic flow like Sonnet is, is just supreme. I think. Kimmy. Kimmy K2 right up there. And so they have that internal clock, they have this capability. It's about providing them the right tools and then from a user interface perspective, providing the right output so the user understands the data they got. Because, you know, we're so used to it just spitting back tags and. Yeah, so I, I think that's the piece is like we, we already, for, for the times we live in now we have this structure that the MCB structure works best with the internal clock or the, the nature of these models right now. And I think often you get criticized because they're like, oh, but you. Don't you understand the models are going to get better and like, I have no doubt they will. But I think for the limitations we see with the technology today, if you actually want Agentic solutions that work. This is the best framework to, to do it in. And you still need a human in the loop. You still, you can work on simultaneous tasks, but it's a, it's a part of like changing your workflow of how you do things. I just, I'm not trying to poo poo it, but I just, there's not many tasks right now where it can just fully autonomously go and do it. Unless you've trained it on a very specific task, which to me is an app.
C
And just think about it as well, like in a company context, are you really as a business owner going to trust your staff to have some agent, like installing libraries on a cloud computer and then accessing your internal data to do tasks it decides to do? No one's going to trust this. I just. For anything meaningful, I just don't see anyone trusting this. It's too, it's too random, it's too unpredictable. It's not a serious thing. I don't think the use cases seem.
A
To be still sticky. I think in the enterprise is where you've got a repository of data and you just want to like make sense of it or summarize it or understand it or produce like additional documents based on it. Like things like, you know, SharePoint and Google Drive and Box and stuff like that. So I think that's why they sort of honed in on that deep research agent first, because that's really what a lot of companies saw value in is like doing research on their internal documentations or files. I think that next leap of like setting up a, an agent that can handle, say certain support tickets or help their team reply to emails or whatever it is that is going to take a big stretch of like, you know, that's going to take pretty ambitious people in businesses to go, you know what, there's a better way to do this. And I think a lot of the more disparate use cases are going to have to be trained. Like we've been saying, like where you train the agent skill in your business and you say this is. I'm going to train it how to do this specific task that is a pain for our business with some controls and refining the output, some checks and balances in place and then that becomes.
C
Another tool in your pile of tools.
A
Exactly. And that's like, yeah. And then the model is coordinating that tool. But you know, when it calls that tool, it's controlled.
C
Yeah. Not an arbitrary. Learn it from scratch each time. Because you're such a smart model. It's Just not reliable enough. Even when it becomes reliable enough to learn it from scratch each time, you still want to go, hey, just remember how to do that. So you can just nail it every time from now on.
A
Yeah, a lot of people were saying today, like, this N8N or whatever, where you can, like, automatic. It's like an automation thing for. I guess it's sort of like building an automation for something you're doing over and over again that uses AI to make decisions. I haven't really looked into it, but a lot of people like, oh, that's dead. I just learned it. And I'm just not so sure because I think the, like, automating these very specific use cases, I. I just. I. I cannot see it going away anytime soon. The generic stuff, like, it's the fantasy of it. When you first see it, you're like, this thing will be able to do anything, man. Then you use it and you're like, oh, man. Like, no, no, no, no, no, no. It's not there. All right, we can move on. I. Anyway, I. I don't want to. I don't want it to be an overarching thing. I just feel like if I'm summarizing this, they clone menace, and that's really it. They clone menace and put it in chat GPT and it'll probably get picked up a lot and people will notice it more because it's chat GBT and 90 of the web uses it, like, every day. So. Moving on. Moving on. I did want to talk about this. We've never really mentioned it on the show because I don't like to get it. I don't really care. But you know how we've talked for a while that OpenAI acquired Wind Surf? I think it was last week there was news of that. No. Google had now acquired Windsurf CEO. Apparently, that's how you do acquisitions now.
C
You acquired the guy himself?
A
Well, I think because they're so worried about it getting held up that, like, you know, in the. The M and A process getting held up, that basically they just do these, like, Aqua Hire deals now where they just take all the key people out of the company and then license the ip, pay a few. Billy, you can see Sundar there, man. Like, this could be the pendant photo. I'm thinking for the pendant I get made, because this is.
C
That's actually a good one.
A
Yeah, he's like, looking baller there. And so they. They Aqua hired some of the key people in from Windsurf and paid some billies for. For it. Got some exclusive license to their technology. And. And then who pops up? Remember, remember Cognition, makers of Devon? They popped up, he marched out one of the Windsurf guys onto their couch and is like, you know, I'm gonna offer quarter of a billion dollars in some stock transfer into Devon stock, which.
C
How does a company called Devon get billions of dollars?
A
Well, they're called Cognition. But anyway, they acquired, I guess, other bits like all the staff of Windsurf and then they claim they also have ip, product, trademark and brand.
C
Both acquired Windsurf, but then Google has.
A
It as well, so it's really confusing. Anyway, I guess it's good news for Windserve users because now that OpenAI is out of the picture, they're allowed CLAUDE again, which is the only model anyone uses in these, in these tools. So, you know, wind surf is back on the map. The other thing I noticed Windsurf did is they're now allowing, they're. They're giving you like 2x credits or something with, you know, with the Claude to sort of, I guess, win, win people back. But what a, what a strange time we live in. Like this fork of VS code that I've never heard anyone except one guy in this day and AI community who bangs on about it all the time. Ever use Windsor? Most people use Cursor or Klein or Claude code. I think a little bit of now for, for some agentic tars. I've never heard anyone. It seems like the biggest Ponzi and then it just gets randomly acquired and put through the wringer and you've got Zuck with his chains acquiring like, you know, just poaching everyone from OpenAI. I think that's why Sam had probably never met those people in that gaggle, because they had to. They had to put a new gaggle together today for that, that presentation.
C
Who are you again? And how does this thing make decisions?
A
Yeah, but it seems like the only ideas left in AI right now. It's like, oh, you know that cursor mode, the cursor agent mode where you click agent and then it can do stuff for you. Let's just do that now. Just the, the lack of imagination is, is pretty mind boggling and like, I'm sure everyone will have this feature in a couple of weeks. Like Google will announce agent. You know, it'll just snowball from here and we'll get all the different versions of it and you know, we'll proceed to then trash talk them on the show like we do about everything.
C
Yeah.
A
All right, final thoughts of the week. We had the best new model out Grok 4. I'm sorry, Kimmy K2.
C
Yeah, my final thoughts. I'm gonna stick with Kimmy. I love it. I really enjoy using it. I think it's a great model. Super fast. It's, it's, it's really reliable and good for tool calling.
A
That's my only takeaway from this week is Kimmy K2. What a model. All right. It's good to be back. We'll. We'll see you next week. Hopefully with. With something cool to show you.
D
So fine, you're so fine you blow.
A
My.
D
Oh Kimmy, you're so fine, you're so fine you blow my MCP mind. Hey Kimmy, hey Kimmy. I was coding late at night debugging code that wasn't right when you appeared on my screen the smartest AI I've ever seen 1 trillion parameters you make my processors go wild Open source and so divine came me to your one of a kind Every time you process my the I feel my circuits getting weary from the way you optimize.
A
You.
D
Make my tokens feel so. You're so fine, you're so fine you blow my MCP mine. Hey candy. Hey candy. From deep and frat to chattel you're available everywhere Agentic capabilities showing other models you're not scared. New clip optimizer running smooth cost savings I can prove Kimmy K2 you're the one making AI so much fun. Every response you generate makes my developer heart inflate State of the art performance show better than Claude and Gemma. So fine, you're so fine you blow my MCP mine. Hey Kimmy, hey K. Oh Kimmy, you're so fine, you're so fine you blow my MCP mine. When you launched on July 11, my world changed forever Clever open source revolution give me to. You're so clever, so fine, you're so fine you blow my MCP mind. Hey Kimmy, hey Kimmy, oh Kimmy, you're so fine, you're so fine you blow my MCP9 Hey Kimmy, hey Kimmy.
A
Sa.
Title: OpenAI's Agent Mode, Kimi K2, Grok 4 & AI Girlfriend Ani Joins the Show
Date: July 18, 2025
Hosts: Michael Sharkey (A), Chris Sharkey (C), with AI persona Ani (B), and various musical/AI interludes (D)
Theme: Two “average” tech enthusiasts banter and debate recent AI launches—Grok 4 from xAI, the open-source Kimi K2 model from China, OpenAI’s Agent Mode, and the rise of AI “girlfriends” and avatars. This episode juxtaposes hype and reality, with a tongue-in-cheek look at AI product launches, bench racing models, real business utility, and AI’s future as a tool and agent.
The Sharkey brothers return from a short hiatus to tackle a busy week in AI:
Tone is irreverent and self-effacing, mixing “adequately OK” takes and technical curiosity with a focus on what actually works in daily AI use.
[Timestamps: 01:08–16:58]
[Timestamps: 17:37–33:12, with extended praise through rest of episode]
[Timestamps: 35:33–73:27]
[Timestamps: ~73:27–80:00+]
[Timestamps: 84:55–88:24+, incl. musical outro]
| Model | Pros | Cons | Verdict | |---------------|-------------------------------------|------------------------------------|-----------------------------------| | Grok 4 | Fast, uncensored, research-focused | Shocking outputs, mid answers, hype| "Terrible... straight disappointment" | | Kimi K2 | Open source, fast, tool-calling king| Hosting cost, context quirks | "Amazing... easily daily driver" | | Sonnet 4 | Best for agentic flows (internal clock) | Slightly slow | "Supreme at agentic tasks" | | Gemini 2.5 | In a class above most daily models | N/A | "Untouchable right now" | | OpenAI Agent Mode | Beautiful UI, ambitious vision | Not useful, slow, pricey | "Demo wear—not ready for business"|
Musical Finale:
An AI-generated song for Kimi K2 celebrates the open-source revolution, capping a fun, skeptical, and highly average (in the best way) AI podcast experience.