This Day in AI Podcast: EP99.32
Did Clawdbot Just Show Us the Future of AI Workers? & Kimi K2.5 Dis Track Tested
Hosts: Michael Sharkey & Chris Sharkey
Date: January 30, 2026
Episode Overview
This episode dives headfirst into two of the buzziest AI developments of 2026: the rise of “Malt Bot” (formerly Clawdbot), a locally hosted AI assistant that feels eerily close to a true digital coworker, and the launch of Kimi K2.5, a new open-source, agentic large language model taking shots at its closed-source competitors. Michael and Chris Sharkey candidly break down why Malt Bot has captured the imagination of tech enthusiasts, the challenges and realities of local agentic AI workers, and how open-source models like Kimi K2.5 are shaking up the landscape with cost, performance, and even rap-style diss tracks.
Their tone is self-effacing, irreverent, and playful, offering the “average AI enthusiast” perspective: as much skepticism, humor, and personal anecdote as deep dives into technical features.
Key Discussion Points & Insights
1. What is “Malt Bot” and Why Is Everyone Obsessed?
- [00:02]–[02:10]
- Formerly “Clawdbot,” renamed after legal pressure from Anthropic; name change met with ridicule for its “gross” crab-molting imagery.
- Created by Peter Steinberger, Malt Bot is a locally hosted AI agent capable of using a user’s computer to run tasks, manage scheduled jobs (via cron), and interface with messaging apps (WhatsApp, Telegram, Discord, Slack).
- Can perform web browsing, code execution, file management, and more due to wide integration with command-line tools.
- Memory and proactivity are user-driven—it's not truly autonomous, but scheduled/job-based.
- Security concerns: prompt injection risks and open access to user data/API keys.
Quote:
“A lot of security researchers are saying when people set this up, it’s susceptible to prompt injection attacks, it has access to all your API keys, like anything your computer has access to. So some people are freaking out about it.” —Michael [03:14]
Why the Excitement?
- Users love the idea of a “digital employee” with persistent access, running on-device, able to coordinate tasks without SaaS or excessive cloud middleware.
- The stack-of-Mac-Minis-on-the-desk vision: each runs an agent acting as a coworker, handling delegated, real-world tasks.
Quote:
“There’s something… fascinating too about having this digital employee running on your computer... delegate out tasks to these computers as you go.” —Michael [06:06]
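The scheduled-job side of this is ordinary cron under the hood. As a rough illustration (the `agent-cli` command and its flags are hypothetical, not Malt Bot's actual interface), a delegated daily task might be wired up like this:

```python
# Hypothetical sketch: build a cron entry that wakes a local agent once a day.
# "agent-cli" and its flags are illustrative; the real tool's CLI may differ.

def schedule_daily_task(prompt: str, hour: int = 7) -> str:
    """Return a crontab line that runs a local agent with the given prompt."""
    return f"0 {hour} * * * agent-cli run --prompt {prompt!r}"

line = schedule_daily_task("triage my inbox and post a summary to Telegram")
# Append `line` to the user's crontab and the agent runs unattended each morning.
```

The point is that "proactivity" here is just plain scheduling: the agent only acts when a job fires, which matches the hosts' observation that it is job-based rather than truly autonomous.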
2. The Tech Stack and Power of Skills
- [06:06]–[09:32]
- Malt Bot relies on stacking open-source skills, mainly command-line utilities, allowing robust, scriptable control without kludgey GUI automation.
- Local skill integration means full access to tooling not possible when sandboxed in the cloud; thousands of skills already available.
- Community-driven improvements and modularity are a huge advantage.
Quote:
“By using these existing tried and tested tools… it’s just put it in a way that the AI can reliably work on it. ... It’s a lot more efficient and accurate.” —Chris [07:49]
- A lot of actual user workflows are mundane (email, calendar, file shuffling), but the psychological leap is the sense of having a trusted “worker” on your desk.
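The "skills as CLI wrappers" idea above can be sketched in a few lines: rather than automating a GUI, the agent shells out to an existing, battle-tested command-line tool and gets back structured output it can reason over. This is a minimal illustration of the pattern, not Malt Bot's actual skill format:

```python
import subprocess

def run_skill(command: list[str], timeout: int = 30) -> dict:
    """Invoke an existing CLI tool and return a structured result for the agent."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout.strip(),
            "stderr": proc.stderr.strip(),
        }
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}

# Reuse a tried-and-tested tool rather than scripting a GUI.
result = run_skill(["echo", "hello from a skill"])
```

Because the underlying tools are already reliable, the model only has to decide *which* command to run, which is Chris's point about efficiency and accuracy.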
3. Reality Check: Is It All Hype?
- [09:32]–[13:03]
- Tasks performed by agents are sometimes underwhelming compared to expectations (e.g., paying $45 in tokens to book a restaurant).
- The “virtual employee” meme is driving much of the adoption, even though the practical use cases delivered aren’t that new.
- Cost is a growing issue: unlimited agentic loops burn through API credits quickly. Users report “bill shock” once providers ban them from “hijacking” flat-rate cloud subscriptions and they fall back to per-token API pricing.
Quote:
“It cost me $170 for Claude Opus... all I did was check my calendar and email and got it to sort some email out.” —Michael [45:47]
4. Why This Approach Feels Like a Revolution
- [10:57]–[17:15]
- Local agents now genuinely narrow the “chat-to-worker” gap: they do tasks reliably, not just suggest next steps.
- Agents can learn, retry, and even modify processes mid-execution, creating robust automation loops.
- Smaller models (e.g., GPT-5 mini) perform well due to leaner, task-focused contexts; high-end “frontier” models are often overkill for routine jobs.
Quote:
“My productivity is probably doubled in the last… since the start of the year compared to what I was at the end of last year.” —Chris [12:48]
5. Security, Enterprise, and the “AI Employee” Model
- [18:39]–[28:55]
- Security can be an advantage if local machines are tightly permissioned and purposed, e.g., per-project/department “AI workers.”
- In enterprise, the vision is to have dedicated, orchestrated agent boxes: “digital employees” with well-defined skills and data access.
- Anticipated 2026 trend: “Year of AI Employees”—most knowledge workers moving to agent workflows integrated into their roles.
6. Changing Work, Upskilling, and Human-AI Collaboration
- [31:30]–[37:24]
- The “director’s mindset”: knowledge workers need to shift from “doing tasks” to setting objectives, directing agents, and reviewing work.
- There’s still an essential human-in-the-loop element, especially for creativity, UX design, and context-sensitive review.
- AI can support planning, execution, teaching, and even diagnostic/critical thinking skills, codified as workflows.
Quote:
“Everybody needs to become like a director… what I really am is someone who is directing resources… trying to get to my personal goals or organizational goals.” —Chris [33:49]
7. Open Source Agent Models: Kimi K2.5
- [52:58]–[59:33]
- Kimi K2.5 rivals closed models on benchmark tasks at a fraction of the price (API costs are roughly a tenth of Claude Opus’s).
- Open source, can be self-hosted, supports vision/multimodal input, massive context window (256K tokens).
- Direct in-episode comparison: “in the agentic sense… I honestly can’t really tell the difference between using it and Opus at all.”
- Slight edge to Opus in design taste, but Kimi performs excellently in most agentic tasks—especially for cost-conscious users.
Quote:
“If you want to pay a tenth of the price and have basically the exact same functionality, I think it’s a real go here.” —Michael [56:45]
- Kimi’s “agent swarm” functionality is mostly marketing hype—running 1,500 parallel tool calls is described as impractical, but aggressive capability claims are seen as a flex in the open source vs. closed-source competition.
8. Notable Moments: Diss Tracks, Humor & Demo Highlights
Kimi K2.5 Diss Track
- [65:20]–[68:25]
- Two AI-generated diss tracks: one by Opus, one by Kimi K2.5. Both throw shade at closed-source models.
- Lyrics emphasize open-source’s value, cost advantage, and “swarm” agent capabilities.
Sample Lyric:
“Open source king, while your code stay asleep; Opus 4.5, 250 for tokens, that’s theft!” —AI-generated, [65:42]
- Kimi K2.5’s track described as “way more hyped” and appropriately aggressive.
Real-World “Stupid Demos”
- [68:25]–[72:24]
- Examples: both Kimi K2.5 and Opus code up black hole simulators and full CRM apps from scratch—outputs are “creepily similar,” reflecting models converging in practical ability.
- The AI-generated CRMs are sarcastic (“Let’s pretend to work,” “Find excuses to avoid”), exhibiting playful, borderline troll responses from the models.
Reflections on Change
- “SaaS is dead—maybe?” but hosts agree that agentic-first SaaS, enabling deep integration, could disrupt incumbents if they aren’t vigilant.
Timestamps of Key Segments
- What is Malt Bot and how does it work? – [02:10]
- Why “digital coworker” model resonates – [04:45]
- How CLI tools and local skills power agent reliability – [07:37]
- Security concerns & “bill shock” with agentic loops – [08:30], [45:47]
- Comparison to other models: agentic loops and productivity – [10:57], [14:47]
- Human-AI collaboration and new “director” mindset – [31:30], [33:49]
- Kimi K2.5: open-source disruptor – [52:58]
- Kimi vs Opus: price and practical outcomes – [56:45], [57:21]
- Diss tracks sampled – [65:42], [67:07]
- Demo highlights: AI building games and SaaS CRMs – [68:25], [71:24]
Notable Quotes
- “There’s something… fascinating too about having this digital employee running on your computer... delegate out tasks to these computers as you go.” —Michael [06:06]
- “By using these existing tried and tested tools… the AI can reliably work on it. It’s a lot more efficient and accurate.” —Chris [07:49]
- “My productivity is probably doubled... since the start of the year compared to what I was at the end of last year.” —Chris [12:48]
- “Everybody needs to become like a director... What I really am is someone who is directing resources...” —Chris [33:49]
- “I honestly can’t really tell the difference between using [Kimi K2.5] and Opus at all.” —Michael [56:22]
- “If you want to pay a tenth of the price and have basically the exact same functionality, I think it’s a real go here.” —Michael [56:45]
- “Open source king, while your code stay asleep; Opus 4.5, 250 for tokens, that’s theft!” —Kimi K2.5 Diss Track [65:42]
The Hosts’ Perspective
Chris and Michael approach these developments with a mixture of awe, skepticism, and personal anecdotes:
- They see local agentic AI as the near future for knowledge work, especially as tools become more accessible and secure.
- Emphasize that smaller language models, in the right workflows, provide virtually all the needed “work” for most users at a fraction of the cost.
- See open-source models (like Kimi K2.5) as the democratizer in the coming wave of AI-powered productivity.
- Urge listeners (including an unexpectedly large professional following) to begin codifying their workflows as “skills” for agentic models, as AI collaboration will soon become the norm.
Conclusion
This episode captures a pivotal moment in AI adoption: the cusp of digital “co-workers” moving from meme and experiment to daily necessity, and open-source models making agentic workflows feasible and cheap. With self-deprecation, humor, and practical insight, Michael and Chris chronicle the birth of “AI employee” culture—where security, orchestration, and user creativity will matter just as much as raw model horsepower.
