
Chris
So Chris, before we start the show, we're going to do a little plug, like we have been, for Sim Theory. If you want to support the show and also get access to all the models and some very unique MCPs, you can do that by signing up at simtheory.ai using the coupon STILLRELEVANT to get $10 off any subscription. I also wanted to call out a few new MCPs. I am leaking some accidentally on the screen right now, so I'm going to slowly scroll down. But we have recently released Seedream 4, which is like Nano Banana, if you're getting confused with image models. It's really good and worth checking out. It creates stunning images and it can do very precise edits. I also built, based on a request in the community, the audiobook maker MCP. So now you can turn a story you might have, or a story you create, into an audiobook, so you can start to judge the model's storytelling capability. There's also now Zapier in there, which enables you to connect to, I think, 8,000-plus applications right across your business or just things in your personal life as well, which is a really cool MCP to check out. All right, end plug, on with the show. So Chris, this week we finally got answers to the question of whether model providers intentionally degrade the model, or route to cheaper versions of the model, to save money after a launch, which has long been speculated. In fact, there was a tweet from Anthropic, which is now calling themselves Claude. I think they did some sort of rebrand in the week. "We'd also like to address a concern we've heard in the community. We never intentionally degrade model quality as a result of demand or other factors." Now, everyone speculated that was untrue, but they did release this week a postmortem of the three recent issues that they had found with the Claude models. They at least admitted the models had become stupider.
Nate
It's an interesting one, because I've long thought about whether they're switching things out behind the scenes, and even their postmortem sort of admits that there is routing going on. So there is some sort of evaluation of the queries you're sending through, at least on Claude AI, and it's deciding which model to give them to. People assume that when they've got, say, Claude Opus selected in the selector, that's what they get, but they're obviously dropping it down to either quantized versions or lower versions of models when they deem that's appropriate. And to me, even their admission is sort of like saying, oh, whoops, we accidentally tuned it a bit to go to the lower ones when you thought it was the higher ones. So I think even if they are telling the truth, it's still a little bit sneaky, what's going on behind the scenes there.
Chris
Yeah, I mean, one of the bugs was the routing error. Some users' requests were accidentally sent to the wrong type of server, specifically servers set up for the massive 1 million token context window. So this is their new 1 million context window. But they said a routine load balancing change in late August made this problem much worse, affecting up to 16% of certain requests at its peak.
Nate
Yeah, so to me the big implication here is there's actually introspection, some sort of evaluation going on of what you're sending through, rather than it just being some round-robin routing that happened to accidentally send it to the 1 million context. They're clearly looking at factors like the number of tokens and the content of your messages to decide which model to send it to. At least on the Claude side. I doubt this applies to the API, but you don't really know if you're working directly with Anthropic.
Chris
There was also this paper, it's called The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs. It says: Does continued scaling of large language models yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. And so essentially they say the real bottleneck in LLMs right now is actually execution, not reasoning. And they argued that basically if an LLM fails at a task, so this is a long-running task, you could think about it as an agentic task, it's not because it can't figure out what to do, which we've always observed, like, the plan generally is right, but it's because it's making mistakes during the execution. And the paper found that once it makes one error, it basically sees that in the message history, all these errors, and then it sort of assumes, oh, they want more errors in the output. So it progressively gets worse. But then if re-prompted correctly, the model actually is smart enough to know what to do and do the task. Which is a really interesting observation.
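The compounding effect described in that abstract can be illustrated with a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not the paper's experimental setup; the function name `horizon_length` and the 50% success target are hypothetical choices.

```python
import math


def horizon_length(step_accuracy: float, target_success: float = 0.5) -> int:
    """Longest task length (in steps) finished with at least `target_success`
    probability, assuming independent steps with identical accuracy."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be strictly between 0 and 1")
    # step_accuracy ** n >= target_success  <=>  n <= ln(target) / ln(accuracy)
    return math.floor(math.log(target_success) / math.log(step_accuracy))


if __name__ == "__main__":
    for accuracy in (0.99, 0.995, 0.999):
        steps = horizon_length(accuracy)
        print(f"per-step accuracy {accuracy}: about {steps} steps at 50% success")
```

Under these assumptions, moving per-step accuracy from 99% to 99.9% stretches the 50% horizon from 68 to 692 steps, which is the sense in which small single-step gains compound into much longer tasks.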
Nate
I think it's not just interesting. I think it really matches up with the way, at least I know, you and I work with the AI right now. My initial reaction to this was to balk at it and be like, you're totally wrong, because I find that the longer and longer a chat session gets, the better it gets in terms of answering questions. But what I missed is the point that I'm in the middle there, nudging it along and saying, you're wrong about that, this is different, giving it additional context and gradually improving things to the point where it's giving right answers. What they're talking about is leaving it to its own devices to go through that process, and compounding on the fact that it makes a mistake. And during the week we saw that AI Village thing that everybody likes, where they're getting AIs to go off for a long period of time with goals to try and solve things, and you absolutely see the errors compounding there. It sort of goes down a path that's wrong right from the start, but just keeps optimizing that path even though there's a fundamental error that a supervisory agent or human is going to realize: hey, no, that's totally wrong. The number of times I've been working with an AI assistant and said, hey, this approach is wrong, what if we take a totally different approach, and then, using the exact same context, it's able to get itself out of that mess and go on to solve the problem. So I definitely understand what they're saying here. As we move towards having autonomous assistants, where we want them to be more goal-based, give them the starting point and then hope that they get all the way there, they're not going to have me or someone in the middle nudging them along and saying, you're wrong about that, you need to fix that. And so this really probably is the next major area we need advancement in, in order to get those autonomous agents working in a way that everyone expects them to.
Chris
Yeah, I think that to me, the only missing piece is really getting the supervisor agent to say to the runner agent, hey, buddy, you've gone down the wrong path. The challenge there, from testing this, is that they're very agreeable. So you ask it, hey, look at what's wrong with this, and it's not necessarily definitive in the way a human would be, where with experience or just with some common sense and logic, you're like, no, that's wrong. But interestingly enough, in the paper they can get a lot further than you would think. Let me bring up the fact here. So GPT-5 can execute over a thousand steps correctly. So that's a pretty long-running task. The next best competitor is Claude 4 Sonnet at 432 steps, according to the paper. So it doesn't look like they tested Opus. They probably couldn't get enough bandwidth.
Nate
Yeah, exactly.
Chris
GPT-5, in terms of thinking and its ability to execute long tasks, is just so far ahead. And you can kind of tell that working with it. I sometimes think that's where the intelligence comes from: it can go down the right path. And as you said, we often talk about how you can have a chat where you start to go down the wrong path, and you've got to go back and fork off where it all went wrong and then push it down another path.
Nate
And I think probably one of my biggest fears when I work with the assistant over a long session is I don't want to accidentally mislead it or give it information that may confuse it and muddy things up for the future. Obviously being able to fork chats avoids that, but nevertheless you still want to make sure that you're keeping it on the task and not distracting it from what's important.
Chris
To me, this can be solved pretty easily if we can get the technology to a point where, instead of the human supervisor at the highest level, at least the supervising agent can think somewhat like we would think now.
Nate
Yes, and I think this is why the emphasis on models isn't necessarily the right thing. I see that the LLMs will become an increasingly diminished part of an AI system. Not that their role isn't important, but it's more about the logic and the architecture of how the different elements in the AI system work. Like, you talk about a supervisor agent. The question is, does it intervene after every step, or does it initiate multiple threads of execution and evaluate which is the best outcome? Does it have, and we talked about this in the past, some sort of voting system where you have multiple opinions from different expert agents within a system voting on whether this is the right way to go or not? Or is it like, what's that seven hats thing, where you've got the devil's advocate, the skeptic, the different roles, where each of those supervisors is saying, hang on a sec, you've completely screwed this up, you need to do it this way instead, and then another agent's like, no, it's correct, because I checked this and I did this research and that's right. And I think that increasingly we're going to see systems like that, where you have these balancing and counterbalancing elements in there to guide the main thread down the right road. And one of my problems with models like GPT-5, and why I don't use it too often, is that it's high latency. But if it's getting that higher accuracy, and you can therefore get to a higher level of autonomy in an AI system, then it doesn't matter so much, because you can set it off on its path and it'll tell you when it's ready. So it's probably better in those scenarios to have it take a more careful approach, have more steps in the process, and get way further down the path than you would by having to be in the loop at every step.
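The voting idea mentioned above could look something like the following minimal sketch. It assumes each expert agent returns a short, comparable verdict string; the function name `majority_vote` and the `min_share` threshold are hypothetical, and a real system would need a way to normalize free-form answers before counting them.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the verdict backed by more than `min_share` of the agents,
    or None when nothing has enough support (a cue to escalate to a human)."""
    if not answers:
        return None
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer if votes / len(answers) > min_share else None


if __name__ == "__main__":
    # Hypothetical verdicts from three expert agents reviewing the same step.
    print(majority_vote(["proceed", "proceed", "rollback"]))
    print(majority_vote(["proceed", "rollback", "ask_human"]))
```

Returning `None` on a split vote is a deliberate choice in this sketch: it gives the supervisor an explicit "no consensus" signal instead of quietly picking a side.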
Chris
I also just question, I guess there's probably two schools of thought right now, and there's probably not a right or wrong answer. But there's the idea that a model like GPT-5 can just eventually, like, maybe GPT-6 can do 2,000 tasks and GPT-7 can do 5,000 tasks. And I guess I would question, like you say, does it really matter? Because is the technology going to get to a point where you can't run concurrent threads? To me, running one thread, if it goes down the wrong path or takes the wrong turn, you're in a lot of trouble. Undoing that is going to be really tough. Whereas if you have, like, three threads with slightly different approaches, then I guess the next challenge is how do you pick the best outcome when you don't necessarily know the answer? And to put that in a practical sense, imagine you're answering a support ticket, and the tools it has access to are the database of users, and payments through Stripe, and pretty much all the things that a regular support agent would have access to, maybe even a computer to go and test a bug or see if the bug is a problem. And so you send it off on that path. With one model, it comes to an answer for the customer, like a draft answer. And then you look at it to do that approval step, and you're like, this is just completely and utterly stupid and wrong. At some point it's taken the wrong turn. But then if you had three versions that are slightly different, and maybe one's the right answer, how do you then verify which is the best answer, given that it can go off in the wrong direction? So I wonder if scaling the model's internal clock, or the task length that it can achieve, versus just completing things in small chunks and having those chunks with success/fail criteria, would be better. It's hard to know without trying.
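The "small chunks with success/fail criteria" alternative can be sketched as below. This is an illustration under stated assumptions, not a description of any product: `Step`, `run_in_chunks`, and the dictionary-based state are hypothetical names, and each step's `run` function stands in for a model or tool call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


@dataclass
class Step:
    name: str
    run: Callable[[State], State]     # does the work, returns the new state
    check: Callable[[State], bool]    # success criterion for this chunk


def run_in_chunks(steps: List[Step], state: State) -> Tuple[State, List[str], Optional[str]]:
    """Run steps in order and stop at the first failed check, so an error
    cannot compound. Returns (last good state, completed steps, failed step)."""
    completed: List[str] = []
    for step in steps:
        candidate = step.run(dict(state))  # work on a copy of the last good state
        if not step.check(candidate):
            return state, completed, step.name
        state = candidate
        completed.append(step.name)
    return state, completed, None
```

Because the runner keeps the last state that passed its check, a supervisor (human or agent) can retry or re-plan from a known-good point rather than untangling a long run that went wrong somewhere in the middle.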
Nate
I kind of agree with you. I must admit I don't understand it at the fully technical level of what's going on inside the models. But my preference has always been: do it in smaller steps, and let's evaluate and guide you along the way, rather than thinking that some holy grail model is just going to be able to go off and fully solve an issue. I've found the longer they execute, the more detailed the answer, perhaps, but it doesn't always lead to the best real-world results in terms of the final output, even if just for the fact it takes so much longer, so it's way slower to iterate. You showed me something during the week that I thought was really interesting: this idea of, when a model does something that's not quite right, actually saying to it, which part of your prompt caused you to get this wrong? Or why did you decide to do that instead of this? Or how could I adjust your prompt to make sure that we always do it this way in the future? And I think that sort of cooperative process, of the AI being almost like a malleable system that you can work on and work with to get it to a point that you want, and then sort of freeze that state and say, okay, from now on this is just the starting point for operation, I think that kind of thing is probably going to be a lot more effective than just having a better model. Because the paradigm is always going to be there where the model is just part of the system; the model will never become the whole system. And so therefore its role in the system needs to be proportionate to how you optimize the whole thing.
Chris
Yeah, I agree with that, because I think about the practicalities. Like if you said to me today, I want a research agent that can call and survey people, collate that data, and then present a report to me on any topic. Or I want you to build me a support agent that can handle 80 to 90% of support tickets, right? And these are things I have direct experience doing right now. That introduces a lot of challenges, where if you think about trying to actually replace a worker, say, you've got to think through, well, okay, what's the job description? What are the tasks that you do in that job? So I'm going to use support tickets just because it's something I think everyone understands. You've got all this reference material, you've got knowledge in your brain, just, you know, pattern recognition, right? And so you've got to simulate all that stuff with the model. I think just running the model and saying, go do this, is never going to work. So it's prepare the context, prepare the memories, all those typical agentic use cases. But what I'm landing on lately is, okay, do it manually first. So connect a bunch of MCPs, right, to connect into the help software, the documents, whatever information you need. Actually connect it in, and then go through the process of, hey agent, draft a response to this particular ticket, and then give it some feedback, and then give it, like, four examples of where it nailed it, and then wrap that in some sort of other MCP or wrapper, which is like, you've trained it on that task. And so then you give the agent, here is your job description, here are the tools you have to execute these very defined tasks with, obviously with security and permissions or whatever you need. And then it's just a case of, you know, it's limited in scope as to what it can do.
So it's very unlikely then to drift and go, you know, go down a path where it errors out a lot. And so it's like almost these sub skills or subtasks within the agent right now with today's current technology, if you want to get it to work, I think that's kind of how you have to do it.
Nate
I agree. Because if you think about what the alternative is, right, the idea is, okay, I have my same MCPs connected, so I've got my ticketing system, I've maybe got my internal MCP that has access to data about customers and things that are needed to answer the questions, and a knowledge base, right? Let's say you've just got those three, but between them they've got, like, 30 tools that it can use. And then you throw it at this incredible thousand-task thinking model and say, here's your tools, here's the goal, here's the ticket. How does it know which combination of tools to use in order to solve that? You might have certain checks that need to happen every time, like check they're on a paid plan, check that the paid plan is valid or whatever, check that this flag isn't set in the account. Things only you, as the business operator, know matter. With all the tools in the world, the system is not going to pick the exact combination that gets the job done in order to solve that problem. And even if it does, there's no guarantee that that will happen every time. And so the thing becomes, yes, the model is perfectly capable of doing this task from start to finish with the right prompt, but it's not always going to have that. And things in the ticket itself might cloud the prompt, and then you get different results. So an AI system like you're describing is far more valuable, where you've trained it: okay, in these scenarios, here's what we do, and these are the steps we take, and you've seen that this was successful and this wasn't. And then you give it four or five examples of that, and from there it gets to know your system and infers the combination of tools that need to be called. Or in some cases you might want to be even more explicit: this part of it is compulsory; this part, you use your own discretion as to whether to do it or not.
And I would argue no matter how good the models get, you're always going to need a system like that to accomplish these real world multifaceted tasks. It's just not going to be a model that's smart enough to just figure out absolutely everything.
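The split between compulsory checks and discretionary steps could be encoded along these lines. This is a minimal sketch: the account fields (`plan`, `plan_valid`, `blocked`), the check names, and the `handle_ticket` function are all hypothetical, and `draft_reply` stands in for whatever model call produces the response.

```python
from typing import Callable, Dict, List, Tuple

Account = Dict[str, object]

# Hypothetical compulsory checks that must run on every ticket, in this order.
REQUIRED_CHECKS: List[Tuple[str, Callable[[Account], bool]]] = [
    ("on_paid_plan", lambda acct: acct.get("plan") not in (None, "free")),
    ("plan_is_valid", lambda acct: acct.get("plan_valid") is True),
    ("no_blocking_flag", lambda acct: not acct.get("blocked", False)),
]


def failed_checks(account: Account) -> List[str]:
    """Run every compulsory check and list the names of the ones that failed."""
    return [name for name, check in REQUIRED_CHECKS if not check(account)]


def handle_ticket(account: Account, draft_reply: Callable[[Account], str]) -> Dict[str, object]:
    """Only let the model draft a reply once all compulsory checks pass;
    otherwise hand the ticket to a human along with the reasons."""
    failures = failed_checks(account)
    if failures:
        return {"action": "escalate", "failed_checks": failures}
    return {"action": "draft", "reply": draft_reply(account)}
```

Keeping the compulsory part in plain code means it runs the same way every time, regardless of what the ticket text says, while the drafting step stays at the model's discretion.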
Chris
Yeah, it's sort of like putting GPT-5 in a Tesla and saying, self-drive. It's just not going to be able to do it. And so you obviously have specialist models in a car to do that, and you could imagine one day even having special agentic models for particular job roles, where the model is totally tuned for that role, which I think people would pay a lot of money for, to be quite honest. But having said all that, if you look at GPT-5 being able to execute for longer and figure out where it went wrong better than the other models, and try other methodologies before it basically self-destructs and quits out, I think really what the paper's saying is, Claude 4 Sonnet's pretty good, right, but it does die pretty quickly if it takes the wrong turn, or it gives up a lot quicker. Whereas GPT-5 goes on and on. I do find it suffers from the problems I saw in o1 and o3, though, where it can just go so deep down a path, burn so many tokens, call so many tools, and go so far south in terms of the wrong direction that, because it runs longer, you can't necessarily get feedback from another model and say, hey, no, no, no, stop, buddy, because it's still on that run, on its own clock. So I am interested to experiment more with this. But I do think right now, for agentic tasks, in terms of intelligence and just the ability to make great decisions with MCPs, GPT-5 is really the standout. If I was going to build it today, that's what I would be using.
Nate
Yeah. And I think that we get to that autonomy in stages, and I think that's the idea that you've been talking about all week: we need to gradually equip these AI systems to be able to get further down the road on a task for us. So one example you gave during the week was having a dry-run kind of concept when you ask an agent to do something. So you go, here's all the tools you need, here's the procedure, here's an example of it being successful, now go do some of them. But before you finish, give me a summary of what you're going to do. And then you, right now as the human supervisor, say, okay, that's great, proceed, or you give it some feedback. And I think the idea of getting to autonomy the way they're doing it in that AI Village is a fun thought experiment right now, but we can see clearly that the technology isn't there. The way to get there is to gradually equip the models, or sorry, rather the AI systems, with the tools to get there, perhaps in-context memory, approval processes, some sort of basic procedures in terms of the order of doing things, or checklists, or something like that. Get the combination of those things that gets us closer and closer to automation. And then one day, before we know it, we're trusting more and more of them to just do it on their own.
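The dry-run idea described above can be sketched as a plan that is summarized first and executed only after approval. Everything here is hypothetical for illustration: `PlannedAction`, `summarize`, and `run_with_dry_run` are made-up names, and `approve` stands in for the human supervisor (or, later, a supervising agent).

```python
from typing import Callable, List, Sequence, Tuple

# A planned action is a human-readable description plus the callable that does it.
PlannedAction = Tuple[str, Callable[[], str]]


def summarize(plan: Sequence[PlannedAction]) -> str:
    """Build the numbered summary the supervisor sees before anything runs."""
    return "\n".join(
        f"{i}. {description}" for i, (description, _) in enumerate(plan, start=1)
    )


def run_with_dry_run(
    plan: Sequence[PlannedAction], approve: Callable[[str], bool]
) -> List[str]:
    """Show the summary first; execute the actions only if it is approved."""
    if not approve(summarize(plan)):
        return []  # nothing was executed, so there are no side effects to undo
    return [action() for _, action in plan]
```

A natural extension, not shown here, would be to let `approve` return feedback text that is fed back into planning instead of a plain yes or no.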
Chris
Yeah. And I think this is why, I mean, it's not a secret really, in Sim Theory our approach has been, don't go full-blown agentic, especially when it comes to allowing people to build their own agents, right? It's not like we can go build a product like Claude Code. Like, you could do that; I'm not sure why we would when great products like that exist. But with those kinds of products with the agentic loops, I think the problem, at least in my head, that I'm trying to solve for myself first is, how do I have a way I can train my own agents that are very successful and very productive at tasks? And so how I think through that first is, well, I need context, and I need to be able to interact with those systems. I need it to be able to reliably do stuff. And to me, MCPs mostly solve that. So giving it access to a computer is sort of a last-ditch effort. Give it connections into, say, a Salesforce or a Zendesk or Zapier or whatever it needs, whatever context you need and then whatever actions you need to take. So it's like, okay, now you can get context, now you can take action. You have an underlying framework where the model itself can go and execute these. Then you need sort of a safety framework around that. You don't necessarily want it to take certain actions all the time, but in some other agents you might want it to. So I think the next part becomes, okay, well, they're not very good at just running wild, so how do you train it to do tasks and skills? Can you allow the user to build and train an assistant to do skills? Okay, now it's successfully calling skills. Imagine a skill is like a unit, a series of context it can gather and actions it can take. Like, solve a support ticket is a skill, and it's a combination of things and MCPs. And then that, to me, becomes, okay, cool, it can do it. It can reliably do these skills, like, 80% of the time.
Okay, now how do we make it agentic? Like how do you then like set like it's like Free Willy. Like how do you set this thing free into the wild and see how it performs and then measure that performance? Like you know, and then like just like we saw in AI Village, which we'll get to in a minute, it's really then calling back to the human and saying, hey, I'm like, I'm stuck here, I need some help. And then how can you reduce like those interventions with the agents? But I think what, what excites me.
Nate
Like your Tesla, isn't it? The number of human interruptions required or whatever it is.
Chris
But what excites me most about this is then I can imagine a world where listeners of the show and people in general can go and say, I have this very defined process in my business which is causing busy work. And for us with Sim Theory, we obviously are still doing all of the support tickets ourselves, mostly, but for us it's, can we build an agent to take care of, like, 80 to 90%, so we're only dealing with things where we really take care of them completely.
Nate
Right? Like actually diagnosing and solving problems, not just quoting shit from the knowledge base.
Chris
Yeah, yeah, I should add that it's actually taking action, like fixing something, fixing account problems, doing real work that's just really busy work for us. And to me that's so meaningful. And I'm sure there's so many people out there that listen, going, okay, well, there's all of these processes and tasks. And to me this is like the holy grail, right, that we're all chasing. And so I think being able to have your assistant that you're working with day to day, but then having your sort of agentic runners in the background doing actual busy work, and look, maybe it's only, like, 50% of those tasks that it's successful at. But that's 50% of stuff you're not doing.
Nate
A good example, again just sticking with the support ticket one, is just gathering all the relevant information: looking at recent error logs, looking at account status and information, looking at some sort of history of what's going on, to say, hey, I've found these three fairly obvious things, or this one fairly obvious thing that this probably is. That alone could save you 10 minutes of investigation, and you're doing that per ticket. And I think there's so many business processes like that, where it's really about getting all the different pieces together. Like, I've got to log into five different systems to situate myself and work out what's going on in the problem. Whereas the AI, with its access to all the different tools and systems and context, and a history of successfully solving tasks of this kind, is able to get you so much further down, where you're just, again, that Homer Simpson pressing the enter key: yes, I would like to solve this, yes, I'd like to solve this. It's simply because it's effortless for the AI to do those things for you once you show it.
Chris
But I think there's a large amount of work that really needs to be done for this all to come together. And that's, especially in the enterprise or business context, the whole idea of the true custom internal MCP, where you're accessing internal databases or running commands. At least for us, running commands that maybe we would do through SSHing into a server, where it's a pretty important command that we're doing. And it's like, how can you build MCPs in such a way, with permissions, that agents can use them, where they can access sensitive internal data, or actually take meaningful actions in the business, or access data that's simply just not available? It's not like everyone's got all their data in Salesforce or a Snowflake database. The world's not that simple and easy. So I just see this huge boom, this enormous boom, for people to be out there in the enterprise, going in and building internal MCPs, just like I built the video maker or the audiobook MCP, but these sort of productized MCPs within an organization that can do very specific things from start to finish for that organization, that can be called. And then you can sort of bring those up to the agentic level. And then all of a sudden this is a meaningful agent to that business.
Nate
And I think the models are just so unbelievably good at classification and data structures and things along those lines that once you expose them to your internal data for your company, like actual raw log files or raw database records and those kinds of things, that to us look overwhelming because there's just too much data and you can't really make sense of it, AI is just so good at that stuff. It can take all of that and synthesize it, and you just say, make me a beautiful visualization that explains what the hell's going on here. And it can draw the connections, and it can come up with ideas of how to represent it that solve your problems. And I've seen that a lot of times. And this is why I think the real rise in the next little while is exactly what you just said. Every company needs to have an internal MCP that exposes data. And yes, there needs to be security, and yes, there need to be permissions, as in, don't allow the AI, for example, to access things that you wouldn't want your staff to see, or if you do, make sure they've got the appropriate role to be able to do that. But the idea is that there's so much metadata and other things in a company that will allow the assistant to do a much better job and get further down the line, that having these internal MCPs available is just going to multiply what you can do, keeping in mind you can combine it with every other tool there is around. Even, I thought of a simple example yesterday. Imagine you're a sales rep who needs to prepare for a sales meeting, and you've got your internal CRM data and account data about the customer, their recent usage and how they've been using your system, who the staff members are who are accessing it. You get the system to gather all this context together.
Then you listen to a podcast while you're on the way to the meeting that gives you all the information that you need in order to prepare for the meeting, including the account history and everything. It could just be immensely powerful and fun. And again, because the AI system can access all this stuff without you having to put it in there and copy-paste context and all this stuff, it can be done effortlessly. It could even read from your calendar and know the meeting's coming up and proactively send it to you.
Chris
Yeah, I mean, I think a lot of this stuff I'm doing now with business metrics, collating them from multiple systems and getting it to give me a snapshot so I can go into a meeting way, way more informed. But to give people a really practical example, we internally have an MCP for Sim Theory called Sim, I nearly said MPC for some reason, Sim MCP. And it has a bunch of tools that it has access to, like creating custom plans, finding users, finding workspaces, things like that. Just stuff that we'll eventually want an agent to be able to perform. These are very important tasks for it to add capability. So I think these are the kind of really good tooling that it can use, but also just exposing the context from the business as well. I think that's something that's underestimated. And I did see earlier in that report from ChatGPT, data analysis, and just understanding business data in general, is quite complicated, with data spread across multiple systems. And if you look at providers like, say, a Snowflake or even a Salesforce, the bread and butter of these companies, or the pitch for years, has been to collate all this data together, and then you'll magically get these insights and be able to take action. And quite frankly, I think with LLMs coupled with MCPs with internal data, with you deciding what data it has access to, you can cut out the middleman. You don't need them anymore. You can just start asking questions. If I want to see how many active users I have, I can ask the assistant and then say, create a chart for that. Then I can create a document and insert that chart into it. Now I have a full report on, you know, something that I've been asked a question about.
Or you can imagine like in marketing use cases like, like pulling in advertiser data and creating a report on that coupled against like new customers coming in. Like there's just, I think that data analysis stuff, I know a lot of people are doing that at an elementary level now, but I think once you unleash the power of just connecting it into organizational data, what it enables right now and what it will enable in the future is just pretty unfathomable.
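A minimal sketch of the kind of internal tool set described above, assuming a plain Python registry rather than any real MCP SDK. The tool names (`find_users`, `count_active_users`), the `Tool` class, and the sample data are all hypothetical and only illustrate how business context could be exposed as callable tools.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory records standing in for a real business system.
USERS = [
    {"id": 1, "email": "ana@example.com", "workspace": "acme", "active": True},
    {"id": 2, "email": "bo@example.com", "workspace": "acme", "active": False},
    {"id": 3, "email": "cy@example.com", "workspace": "globex", "active": True},
]


@dataclass
class Tool:
    """A named function plus the description an assistant would read."""
    name: str
    description: str
    handler: Callable[..., object]


def find_users(workspace: str) -> list:
    """Return every user record that belongs to the given workspace."""
    return [user for user in USERS if user["workspace"] == workspace]


def count_active_users() -> int:
    """Count users currently flagged as active."""
    return sum(1 for user in USERS if user["active"])


# The registry is what the assistant would be allowed to see and call.
TOOLS = {
    tool.name: tool
    for tool in [
        Tool("find_users", "List users in a workspace.", find_users),
        Tool("count_active_users", "Count active users.", count_active_users),
    ]
}


def call_tool(name: str, **arguments):
    """Dispatch a tool call by name, rejecting tools that were never exposed."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name].handler(**arguments)
```

Asking "how many active users do I have?" would then reduce to the assistant choosing `call_tool("count_active_users")`, with the registry acting as the boundary on what data it can reach.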
Nate
Yeah, and I think this is not even accounting for companies that have proprietary data, like data coming from, say, sensors and other systems and things like that, where they can actually then combine that into the external context. Data that's just not exposed to the external world or on some SaaS API. And the possibilities there are huge, because these are systems that in the past would have cost hundreds of thousands of dollars or millions of dollars to develop that you can just, in 10 minutes, actually access. Like we had an example recently. There's a company called Solcast that gets solar irradiance data. I don't even know what the hell that means, I guess how bright it is outside or something. But they have an MCP that they're using in Sim Theory with open interpreter, code interpreter, sorry, to graph this data in immense detail. You should see these graphs, how much data it's able to do.
Chris
The graphs are really pretty.
Nate
And this is something that previously you would have had to have a full-on developer access the API, build these different dashboards, and that's not even accounting for the fact you can combine that with other context. And how many companies are there out there like that, that have all this incredible data out there, possibly already on APIs that are going to be available for the... It's just that crossing of the systems and the different output tools and things that make it so powerful. It's the AI's ability to make sense of it and synthesize it. And I think one of the reasons, and I thought this when we were looking at those reports earlier, that you saw that the science and analysis had actually gone down on ChatGPT, and I think on the Anthropic one as well, because I made a note of it, I thought that's weird. But I wonder if it's that people lost trust in it because they were just pasting CSVs or something into the raw prompt and then they're like, oh, it's not accurate. But that's not accounting for the fact that now you can tell the model, hey, you're not good at calculations, but this tool is. And you have access to this tool, take that data, shove it into the tool, and then you make the evaluations based on what that tool tells you, knowing that those calculations are accurate. And I think this is, but this.
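A rough sketch of the "you're not good at calculations, but this tool is" idea, assuming the tool is just a deterministic Python function the model hands CSV text to. The function name `summarize_column` and the sample columns are hypothetical; the point is that the arithmetic happens in code and the model only interprets the returned numbers.

```python
import csv
import io
import statistics


def summarize_column(csv_text: str, column: str) -> dict:
    """Compute exact summary statistics for one numeric CSV column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = []
    for row in rows:
        cell = (row.get(column) or "").strip()
        if cell:  # skip blank cells instead of guessing a value
            values.append(float(cell))
    if not values:
        raise ValueError(f"no numeric data found in column {column!r}")
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
    }
```

The model would receive the returned dictionary as the tool result and base its commentary on those figures rather than on numbers it tried to add up inside the prompt.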
Chris
I think you're describing the problem right now is people in the know at least will tinker around with like, prompts and responses and like, what combination of MCPs and like, how to ask it the right way or like, how to put instructions in. Like, I know for my experimental support agent that I've been working on, it's a case of just like so much tinkering. And I often think, like, how would you explain this today to someone starting out, like, someone in a job role where you're like, did you know you can automate all of this? Like, check this out. It's just still like, no matter how, how easy we make it, or like, not, not easy, but especially the MCP paradigm, right? Like, it's quite challenging, like even making them installable on SIM theory, Like, and it still has its faults, but that challenge in and of itself is just so hard because the protocol is such a mess that it's like, right. Like, it still seems very tech elite right now. Whereas I think it, it needs to be brought more mainstream. This idea or this construction of these, I don't know, assistants or agents using these tools.
Nate
Yeah, I think especially because the underlying concept is so simple, and yet the actual practical implementation of it makes it quite hard to work with some of them. Like, is it a remote MCP? If it is, is it SSE or is it HTTP event stream? And if it is, what kind of auth does it use? Does it use a key? Does it use headers? Does it require an auth token? So many of the MCPs literally require people to become a developer on whatever the system is, generate a new app, get that app approved, generate a token, and then have a way to refresh that token on an ongoing basis for the MCP to continue working. Like, who can do that? Even as a technical person it takes a lot of overhead, because you've got to understand how does this particular system do it, and where do I have to log in, do I have to register as a dev? Like, this is no good. And then the other way, that's actually part of the protocol, is the best, which they call OAuth 2.1, aka discovered auth. The idea is that you can have an OAuth workflow without being a developer on their system. As in, you call out to, say, GitHub as an example, and it says, okay, this system, and we just trust its name, Sim Theory, is trying to access the following data. Do you approve? You say yes, the system gets a token and that's it. So when it works, it works really well. But if you look at Atlassian, Intercom, hang on, I've got a spreadsheet here of all the companies. I'm going to name and shame them because I want them to fix it. So they've implemented the protocol. So Atlassian, Sentry, Intercom, Asana, Raindrop, monday.com, right, all of these have discovered auth in their MCPs, except it's discovered auth for the elite few clients that they pre-approve. So if you're like Windsurf or, I don't know, you know, all the favorite darlings of the world, if you're one of them, you're literally hard coded into their code as an authorized redirect URL. 
And if you're not on that list, it doesn't work. So you simply can't offer those MCPs as an MCP client unless you're like a pre approved dev.
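The difference being described can be sketched roughly as follows. This is a simplified illustration, not any vendor's actual code: the URLs, the `PREAPPROVED_REDIRECTS` set, and the `OpenRegistrationServer` class are hypothetical, and the open path is only a stand-in for the dynamic client registration that the discovered-auth flow is meant to allow.

```python
import secrets

# Hypothetical hard-coded list: only these clients can ever complete the flow.
PREAPPROVED_REDIRECTS = {
    "https://client-a.example/callback",
    "https://client-b.example/callback",
}


def authorize_with_allowlist(redirect_uri: str) -> bool:
    """Closed behaviour: a client not on the list is rejected outright."""
    return redirect_uri in PREAPPROVED_REDIRECTS


class OpenRegistrationServer:
    """Open behaviour: any client may register itself, then authorize."""

    def __init__(self) -> None:
        self._clients = {}  # client_id -> registered redirect URI

    def register(self, client_name: str, redirect_uri: str) -> str:
        """Issue a client_id on request, with no manual approval step."""
        client_id = secrets.token_urlsafe(8)
        self._clients[client_id] = redirect_uri
        return client_id

    def authorize(self, client_id: str, redirect_uri: str) -> bool:
        """Accept only the exact redirect URI the client registered."""
        return self._clients.get(client_id) == redirect_uri
```

In both cases the end user still has to approve the access request; the only thing that changes is whether an unknown MCP client is allowed to reach that consent screen at all.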
Chris
I think where this is a bigger problem is not just us whinging about it from our perspective, but when we talk about those agents. Like if you're at an organization and you're trying to build an agent with data you have stored in one of these platforms, right? Intercom is a great example. Like you want to automate support as a layer. I know they have their own sort of agent around this, but say you want to just fetch context from Intercom, or whatever it is, as part of some holistic agent for your business you're trying to build. You can't. You just cannot auth into that thing because you aren't on that list. And I think this is a problem.
Nate
Just strictly speaking, you could register yourself as a normal integration, follow their integration path and get approved. So it, it technically yes, you could do it, but the problem is you're talking about weeks of work in terms of time, overhead, admin, that kind of thing, depending on the platform. So it sort of goes against the whole plug and play idea of these discovered auth.
Chris
But I thought, isn't the whole point of the protocol being discoverable and open, and these things being plug and play, the future of the agentic workflow and world? Like, well, yeah, the state of MCPs is an absolute mess. It just needs a big overhaul where it's like, you must follow these rules or you can't call it that. I don't know how they're gonna fix it.
Nate
I'm fulfilling your point here but like every other day some website announces we're launching the first MCP registry that's going to be the comprehensive list of all the available MCPS. And you go there and it's like the same 10 or the same 15 or something that, that are there. And even when you dig into them you find that some of them are just a link to like a GitHub repo with vague instructions on how to set it up. And, and a lot of them are as I described, where it's like it's sort of half implemented. So look, we have to forgive them that it's a brand new thing, it's changing. Everyone's not 100% sure how it should work, but I really feel like this, this area really rapidly needs some sort of cohesion and consistency. So they really are plug and play. Because I think when that happens it's going to feel a lot better because I think right now you get a little bit of mistrust with the whole thing because something that was working suddenly stops and like there's a lot of up and downs in terms of getting it consistently working.
Chris
But listen to this. So Atlassian's description on the brand new GitHub MCP Registry. This is the latest one. All MCP services, 39 on this site. On some there's, you know, a million. But these are all supposedly ones that follow the right auth protocol. But then you look at Atlassian's description: remote MCP server that securely connects Jira and Confluence with your LLM, IDE or agent platform of choice.
Nate
Lol. As long as your choice is one.
Chris
Of the pre-approved. As long as you have had the choice verified by us. But anyway, so I don't know, it's a bit of a mess, and again, we've been saying it for a fairly long time. All the pieces are out there. To me at least, the way I see it is going through carefully, step by step. Can you get the context, can you get the models right, can you get the memory right, can you get all these components right in order to actually let people build really reliable agentic autonomous use cases? And before I quit this point, I know people are going to be like, n8n and all these other drag and drop things. I'm just not sure in my world that's an agent. My vision of an agent is like, hey buddy, you're now my support worker. Here's a list of instructions, basically here's your job description, here's a bunch of the expectations. Here are the things that we expect you to be able to do. Now go do it. It's not individually wiring up if this then that with an LLM call mixed in between. To me, that's not it exactly.
Nate
It's not. It's not like a flowchart where you're just designing the process because you've not been a programmer and now this sort of allows you to become one. That's not where we want to get to. We've got to use the intelligence of the model. And I think that sort of leads into the point that I wanted to make there, which is that the reason we're going on about it being a mess, and the reason we care about it so much, is that when the MCPs are working and they're working together, especially when you can combine them with an internal one for your company, the results are amazing. And you're the best at this. I've seen you do so many for our company where you've built these comprehensive, detailed analysis pages, charts, insights, questions to ask and things like that, combining four or five sources from different MCPs that all relate back to one another. And you look like a genius. It makes you look like a true expert, and in a way you kind of are, because you're actually able to cut through and see the really cohesive, important information from a holistic business perspective. And I think the reason that we want more MCPs in there, and want them to work better together and more reliably, is so that everyone can experience this. And I think it's where the real delight and thought of the future comes from. Because you're like, wow, this is really amazing.
Chris
Yeah. And I just think there's so much like, there's also so many challenges because you're in a market where so much overhype, so much promise and then people then go and use it. And I would argue honestly, like even with our own MCP implementations, because of the challenges we face around auth and like the consistency of these tools, like you can go in and have a bad experience from time to time and then that will often, including myself in the early days of testing, I'll be like, I'm never trying that MCP again. Like it didn't work for me once, so I'm done with it. And I kind of think that with the high expectation setting and then the hype cycle with models and things, people sort of forget. Like, you know, if you look at just like image and audio and video models in the last couple of weeks that have been announced that video maker MCP that I built, someone sent me, by the way, like the most amazing video I built the other day, which is far greater than any demo I ever did. But it's just like those tools are just sitting there, they're still sitting there right now. And I just joined a bunch of them together and people like, oh my God, I didn't know it could do this. And I think that that's the challenge right now with the like using LLMs or using the MCPS is like, you can have a bad run and you're like, oh my God, this thing is so dumb. But then you never sort of have the time at least to go and revisit it and play around with them again. And so I think that's why having these like more refined sort of packages of like, here's an Assistant, here's the MCP's perfectly tuned towards it. Select the apps that you use in your business. Okay, here's the perfect like intelligence assistant for it. That's probably going to be easier for a lot of people to get that aha moment where they're like, I can't live without this. 
And then on the other end with the agents, I think it's like, you know, there's like a combination of roll ups and training and I think if you can do that in an environment where you're just sort of chatting like you're being interviewed about a job role or the AI is observing you do the job for a while and then it's like, hey, I think the job description's kind of this and I'm going to need these tools. That is probably the better entry point for the masses into this stuff.
Nate
Yeah, almost like you say to the model, if you could manufacture a dream tool to help you gather the context to solve this. Look, if you could, you know, manufacture a dream tool to take action here, what would it be? Define it for me. And then you take that definition and then use AI to write code to shove that into your mcp. And now you've upgraded it so it's better able to do that skill. And I really think that we're going to see a huge rise of MCPs and I think they'll probably be competing curated paid ones that are far superior to the existing sort of half assed ones we have now.
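As a rough illustration of "define it for me", here is what a model-written tool definition and a small argument check could look like. The tool name `lookup_order_status`, its fields, and the `validate_arguments` helper are hypothetical, and the schema layout is only loosely modeled on JSON Schema style tool descriptions rather than taken from any specific MCP implementation.

```python
# Hypothetical definition a model might produce when asked for its dream tool.
TOOL_DEFINITION = {
    "name": "lookup_order_status",
    "description": "Fetch the current status of a customer order.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "include_history": {"type": "boolean"},
        },
        "required": ["order_id"],
    },
}

# Map schema type names to the Python types accepted for them.
_PYTHON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(definition: dict, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call is well formed."""
    schema = definition["input_schema"]
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, _PYTHON_TYPES[spec["type"]]):
            errors.append(f"{name} should be {spec['type']}")
    return errors
```

Once a definition like this exists, the remaining work is writing the handler behind it, which is the part the hosts suggest handing back to the AI as a coding task.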
Chris
Yeah, I think looking forward, maybe it's like 12 months, but I can see a point where, like, if you look at it today, a lot of developers do use Claude Code and Cursor agent, and I think there's a Copilot one now, must admit I haven't even tried it. There are very small bugs or tasks that they're really good at going and finding the context for and handling for the developers. They can be working on more than one thing at once. And I think that agent paradigm is obviously just gonna improve and get better over time. But I can see in the next 12 months people, you know, spending a lot more time wiring up these agents for disparate tasks. And there was this point made by Dario, you know, at Anthropic, when he's not counting his billies, about how he predicted in six months 90% or 80, some random percent, who cares, of code will be written by LLMs and not developers. And people lately have been like, oh well, six months has passed, man, and that's not happening. But it depends who you ask. For me personally, I would say he's right. Six months ago I was still writing a lot of manual code. Six months later, rarely. I'm just yelling, like, do this. No, you're wrong, you're an idiot. You know, like you're a director now.
Nate
I'm the same. And there's just certain things where you're like why would you, why would you write it out by hand? The AI is going to be Far more comprehensive, it's going to cover more cases, it's going to handle the errors properly, it's going to do all that stuff.
Chris
This is what I think about automation. I think it's the same thing coming like maybe six months from now, people will just start automating away different processes in a business like to the point where there's no human. Maybe there's hum, there's definitely going to be human in the loop and approvals but it's just, it is like that vision that people keep talking about and people think progress has slowed. I'm of the belief having like revisited a lot of this stuff lately that like it's not actually slowing down and the real impact will come and it's going to come and everyone's going to be like, oh, AI hasn't changed much, there's not as many models coming out. You know, like people will fall asleep a little bit I think like that trough of disillusionment. But behind the scenes there's going to be people building these agents, improving them over time, getting them to do certain parts of their job. And you know, I like, I want to deliver that vision to people that use SIM theory by the end of the year. Like I want to be doing that myself and I want other people to be doing that, even if it's like an elementary version. But I think the mistake most people made when they heard Dario say that was they thought, you know, the developers would be cut out of the equation. And he meant like the LLM magically somehow writes all this code. And I think if you think about agents today, you think the same way, you're like, oh, all the support workers will lose their jobs and what will they do? And it's like no they probably won't. They will be wiring up and automating huge parts of their job and their job will be to control and supervise and run those things. Just like developers stop writing code, they will stop answering tickets and supervise at greater scale. 
Maybe that means they need fewer resources in the future, but I can't again see those people like unless they're bad at their job, immediately having job loss as a result of the technology, you might actually see more people hired to wire up different automations to grow faster potentially. Yeah.
Nate
And at least the companies who use this technology the most and properly are just going to be so much more efficient that they will rise while the others gradually fade away. I think that's the crucial point.
Chris
Speaking of fading away your camera, our.
Nate
Technology reliability for a tech podcast is not Great.
Chris
I love how it's a sign of how average our show is. Like your bookshelves like collapsing in the background. My camera turns off mid recording just.
Nate
Every other day disintegrated during the week. So I'm using. These are from like Kmart or something.
Chris
Yeah, I, I always wonder who buys those Kmart headphones and now I've, I've found out.
Nate
I just. The thing is I similar to the AI, you know, I prioritize tasks and my priority is, is building cool software. It's not headphones.
Chris
Right, now that we're on that average track, let's change the tone a little bit from us ranting about the future. So our boy Mark, I don't think he had the chain on when he announced it. They announced some pretty cool technology. It's the Meta Ray-Ban Display. And so it's like my Meta Ray-Bans, which of course I've lost now for the purposes of demonstrating. But so right now the Meta glasses, for those unfamiliar, have some creepy cameras on the front, a video camera and a regular camera, and they have audio and they have the built-in assistant, and you can ask it what you're looking at or, you know, it can set timers and tasks.
Nate
What kind of plant is this? That's the main one. Right.
Chris
Honestly, I don't think I've used the AI assistant apart from asking it the weather and asking it to play a certain playlist.
Nate
What monument is this? That's the Eiffel Tower.
Chris
Yeah. Well, wait, I've got a pretty funny demo for you. So yeah, the reality is they're just really good headphones and I like wearing them for the headphone capability. I rarely take photos with them and I pretty rarely use the... only in public bathrooms. Yeah. And so anyway, they released the Meta Ray-Ban Display, and this is something that I think is pretty cool. So there's a little screen in the right lens, and for those that are watching, you can see it up on the screen now, and it can do things like give you directions, you can reply to texts, all that kind of stuff. Now you would think you would have to talk to this assistant like an idiot. But even better, you can be a bigger idiot and wear a wristband which basically detects the movements in your hand. And what you can do is reply to a text by scribbling on your leg. So you can be writing in a... Like, one of their examples, seriously, was in a meeting. Basically, if you're bored shitless, you could be texting still and no one would even notice because you can't even see the light.
Nate
Oh, yeah, in the hallway, people won't notice someone, like, completely fricking distracted by something under the table.
Chris
The interesting part too is where people were demoing these and using them and, like, reading messages. Their eye movements was so weird. And they're, like, looking at someone being like, yeah, you can't tell at all.
Nate
Like, I know, I know from. Even this morning when we're preparing for the podcast, I was distracted by something off screen. You noticed immediately. Like, people concentrate on something completely different and look like they're engaged in the situation. It doesn't work like that.
Chris
This whole AI embedded screen thing, I think it's. I don't know. I don't know how I feel about it.
Nate
Like, you know, I. The thing I think that's crazy about it is they're missing the point of what it could actually be amazing at, which is passive context gathering of the situation you're in. Like, if we know how good vision models are now, right? Like your example from, like, freaking a year ago, where you showed it a photo of you driving in the current. It knew you were in Newcastle and it knew roughly, like, where you were in the situational awareness. Think of how much faster and better vision models have gotten since then and the. The ability for it to make inferences. Like, imagine a system that's just constantly inferring the environment around you, telling you information about the things you're looking at, the people you're interacting.
Chris
You're in a public bathroom. I should not be recording this.
Nate
Yeah, Chris, we've spoken about this, but do you know what I mean? Like, texting and freaking calling. Like, they're just not. They're just not interesting. Like, do something cool with your environment. Like, think about a work environment. Like, I know Microsoft has had the HoloLens for a while, but, like, in a work context, just imagine the ability. Even looking at a screen, for example, could be gathering context. Like, it's almost like a permanent screen share, except it's actually able to go into your environment as well. Think about other things. Like, I write a handwritten to do list. I could look at it and then have that come up on a notepad on the computer, for example. Like, there's a lot of really cool things you could do. Like to take snapshots, I think.
Chris
But now they've got the platform, maybe that'll happen. Like if they open an app store for it and an SDK, you could probably, when it recognizes a person, say there's an API or a hook in that SDK, then bring up a tile which is everything it remembers or knows that you've chatted about with that person before. But then it's like, at what point does it just destroy humanity, where we're all so wired in, and you can see their eyes moving to figure out what your name is and what you talked about. The whole thing, to me, I'm not so sure about. But here's where I think it can work: in business scenarios and also in sporting scenarios. And this is a new pair of glasses they also showed off, called the Oakley Meta Vanguard. And these have a camera in the nose, which I think is pretty comical. But as someone who cycles a lot, having a camera to be able to just record something I see on a ride, or just for my own safety, you know, is pretty interesting to me. Also you can do video calls with them. So I could call someone. I mean, you wouldn't do this too frequently, but you could call someone and then they can see what you're seeing. I think it's kind of cool. But this also has integration into Strava and Garmin, and the AI model can connect so it can talk to you while you're riding and be like, hey, you're in zone two, or whatever it is.
Nate
Yeah. Like telemetry information and like it could probably give you strategy advice.
Chris
Like, you know, I mean that's what they demoed. It can be like, come on, push harder. You know, you're on this particular workout on your Garmin, and it can also then play music. So I think this kind of ambient computing for a particular use case, right, to me makes total sense today. And I will probably buy a pair of these, to be quite honest, because I.
Nate
Can we as Australians buy them? Probably not.
Chris
Yeah we can immediately day one, which is awesome.
Nate
That is pretty cool. I think I'd like to give it a go especially if. So you said is there an SDK for it like anyone can develop for it or it's only.
Chris
I don't. I mean I might be proven wrong, but I don't think they've announced anything like that yet. I hope that it comes because I think building apps for this. I know a lot of people that listen to our show would. Would probably be really interested in that.
Nate
I know that there's a Kickstarter or whatever, like an open source kind of one that is being worked on that's probably going to take 60 years and never be delivered. But this idea excites me. If it's programmable, I think that that makes it truly exciting if you can do that.
Chris
I think that just the new input device is what also excited me a lot around this idea. Like, if you want to turn up and down the volume of something you like, if you're listening to music, you can just like move your hand like it's a dial, as long as you've got this wristband on and then it, you know, can like obviously know that. But I also think this is where like probably the Apple or Android ecosystems has a huge advantage in this sort of like immersive AI assistant world. Because if you think about it like you've got an Android watch or an Apple watch, if they can replicate what Meta's been able to do in terms of detecting the signals in your hand and the motion and stuff, eventually then like, you've already got these devices, surely that that might give them some sort of advantage. But I think you've got to give big props to Meta here for pushing this like wearable AI. They're like by far the leader. I don't think anyone else is doing it this good that I'm aware of. And those glasses, honestly, they're so addictive. Even though you look like a creep half the time wearing them. Like, you know, if you're going for a walk or a run or like riding a bike or whatever, they're fantastic to have on. Just knowing you can kind of, I don't know, like I don't do it much, but just knowing I can call on an assistant and be like, oh, hey, what, what's this? Or what's that? It does appeal to me, weirdly. I, I'm, I'm not sure.
Nate
Yeah, I must say I'm, I'm intrigued. Maybe I'll be wearing some on the next episode.
Chris
But I did have to play this. So unfortunately. And like mad props to them for doing live demos. Like, I don't want to, I don't want to fully troll here, but their demos didn't go so well. So here is. Here is one of their demos about cooking, of course, because that's why you would use these glasses.
Nate
Make a Korean inspired steak sauce using soy sauce, sesame oil. What do I do first?
Chris
What do I do first?
Nate
You've already combined the base ingredients, so now grate a pear to Add to the sauce.
Chris
He's done nothing.
Nate
What do I do first? You've already combined the base ingredients, so now grate the pear and gently combine it with the base sauce. All right.
Chris
"I think the Wi-Fi might be messed up." He tried to blame the Wi-Fi. I don't know if it was a joke, because they kept blaming the Wi-Fi throughout the entire presentation. But every live demo they tried, apart from, I think, the live translation one, which also didn't work at first, failed. That's the other thing I didn't mention. They can live translate as well. So if you're in a conversation, it can detect the direction of the voice that's speaking to you. I don't know how it does this, with an array of mics or something. And then it can live translate using the AI model and put up on the screen what that person's saying. So you can use that for language translation. Or if you're deaf, which occasionally I am, you can see. Sometimes I could have these glasses on and it's just reading what someone's saying to me, just like I'm, you know, watching some foreign film. That is.
Nate
That's pretty amazing. That's actually really amazing.
Chris
Yeah. So anyway, they're kind of cool. I'm interested what people think. Like, would you build an app for these meta Ray Bans? Do you think this is the future of computing? Or are you like. Nah, like, this is, like, I'm just gonna stick with my.
Nate
Just a cool toy to play with. I mean, it doesn't have to be the future. It can just be fun for now.
Chris
Yeah. I personally do. Like, I would like to buy maybe a communal pair that we share and just try out. I don't think I want to drop, like, you know, $3,000 on two pairs of these, but it. It would be kind of interesting to try out here. So, Chris, before we go, I did. I did want to give a shout out, obviously. We talk about Jeffrey Hinton on the show quite a bit, and our man Jeff, he had his heart broken recently. There was a bit of media coverage about this AI Godfather. This is a real story. We did not make this up.
Nate
It's hard to. It sounds like the kind of stuff we would make up to slander him.
Chris
Yeah, it really does. It feels like. Yeah, like, these are like someone's released a troll press release and put it out there, and then media outlets have picked it up. But anyway, so it says business Insider, AI Godfather. Jeffrey Hinton says a girlfriend once broke up with him using a chatbot.
Nate
So maybe when he says once like it. Chatbots have not been around that long. It sounds like he's talking about the 1970s or something. But it's like it must have been like last year.
Chris
It says Jeffrey Hinton said his ex-partner used ChatGPT to critique him during their breakup. AI is increasingly used for personal interactions, not just industry applications. Prior research from OpenAI found that the bot increases loneliness in power users. OK, so what I find funny is he is quoted in this article. So he said this IRL: "She got the chatbot to explain how awful my behavior was and gave it to me," he told the Financial Times. "I didn't think I had been a rat, so it didn't make me feel too bad. I met somebody I liked more. You know how it goes."
Nate
So he admitted to being a love rat. I mean that's an admission, right? Like I met someone I liked more so I ditched my current person and moved on.
Chris
Yeah, I mean he's the godfather of, he's the godfather of AI.
Nate
You know what I realized? Maybe our still relevant comment is wrong. He is a love rat and he's used the AI thing and got back into the media to get women. I assume it's women, but like get partners, right? Like he's actually using his clout to get people and he's found someone he liked better. He's an older man, he's unattractive. Like he's only getting people because of his, his notoriety. Right? So like what a, what a plot twist. And like the thing is he had to admit this. Like they didn't like grill him, they didn't cross examine him and he had to admit that he, he got called a love rat.
Chris
He voluntarily gave, he wanted it out in the media. That's what Geoffrey Hinton is, a player.
Nate
Yeah, that's the message he's trying to get across. Still relevant. Love rat. Player. It's crazy. Like, we couldn't have made this up.
Chris
So ladies, Geoffrey Hinton, he's out there. He's dateable. He is a love rat. It'd be funny if someone made a song about that, wouldn't it?
Nate
If only we had the tech.
Chris
If only we made songs on every show that no one wanted to listen to and played them. All right, it's been a good show. If you are interested in the song, it'll be after the outro music: the Geoffrey Hinton Love Rat Edition song. God, it's good. I really like it. Chris and I competed over who could make the best Geoffrey Hinton song about him being a love rat. I lost. Chris, I think this is probably the first time you've ever made a better song.
Nate
You did most of the musical, I'll give you that. And that was one of the best things ever created. But, yeah, when it comes to Geoff Hinton, I'm deeply invested. This is my second Geoffrey Hinton song, and I think two of the best probably ever created.
Geoffrey Hinton (as the singer in the song)
Can you see?
Chris
Just a reminder, the musical.
Geoffrey Hinton (as the singer in the song)
When the love we had ain't what it used to be.
Chris
You're forgetting, I think out of the, like, 1,200 views it's had, I've been at least 200 of them or more.
Nate
I must say, you know, when I'm sort of at a low point in the week, I put the musical on. It's true happiness for me, that thing. Like, I know it's wildly unpopular and outright hated by a lot of people, but there's just something about it that makes me smile.
Chris
Yeah, it's all for the lulz. All right, we will see you next week. Thanks for listening. And, yeah, if you want to check out Sim Theory, go to simtheory.ai and use the coupon "still relevant", which is very relevant given what we're about to play. All right, we'll see you next week. Goodbye.
Geoffrey Hinton (as the singer in the song)
There's a man in AI, goes by Geoffrey H. He's the godfather of tech, but he loves to play. Got his neural networks running, but his heart's on the prowl. When he meets someone better, hear that love rat howl. She got ChatGPT telling what already been, but Geoffrey just laughed and said, let me explain. I'm Geoffrey, the love rat, king of AI, swiping through the ladies like I optimize. Got my deep learning charm and my neural net game. When I find someone better, I'm gone without shame. Geoffrey the love rat, that's my claim to fame, using artificial intelligence to play the dating game. His ex pulled up the chatbot, said explain his ways. The AI wrote a thesis on his cheating days. But Geoffrey ran it through with a confident grin, said I didn't think I was a rat, and I let the games begin. Power users getting lonely, but not this AI king. He's got algorithms running for the next best thing. I'm Geoffrey, the love rat, king of AI, swiping through the ladies like I optimize. Got my deep learning charm and my neural net game. When I find someone better, I'm gone without shame. Geoffrey the love rat, that's my claim to fame, using artificial intelligence to play the dating game. OpenAI, don't ask bots about your love life, but Geoffrey's got it figured out, he don't need advice. He's training his romantic models on the side with backpropagation through his player pride. From Toronto to the Valley, all the ladies know his name, the pioneer of the passion in the neural dating game. He's revolutionized romance with his gradient descent. Every heart he breaks is just an experiment. I'm Geoffrey, the love rat, legend of AI, teaching machines to love while I say goodbye. Got my Turing test charm and my transformer ways. In the kingdom of romance, I'm setting the pace. Geoffrey, the master of the game, using deep learning algorithms to stake my claim.
Chris
So.
Geoffrey Hinton (as the singer in the song)
If you meet Geoffrey at a conference or bar, remember he's a love rat, that's his avatar. He'll optimize your heart, then move on to the next: the godfather of AI and the king of complex.
Hosts: Michael Sharkey & Chris Sharkey
Date: September 19, 2025
In this characteristically self-deprecating episode, Michael and Chris Sharkey delve into three key topics: the reality behind model provider behavior (including model degradation and routing), the challenge of building genuinely autonomous "long horizon" AI agents, and the state of MCPs (Model Context Protocol servers) for practical business AI. The brothers also entertain with their takes on Meta's latest smart glasses, before closing on the viral "Love Rat" story about AI godfather Geoffrey Hinton.
“Even their post-mortem sort of admits that there is routing going on ... Even if they are telling the truth, it's still a little bit sneaky.”
—Nate, (01:57)
“In the paper, they can get a lot further than you would think. So GPT-5 can execute over a thousand steps correctly. ... The next best competitor is Claude 4 Sonnet at 432 steps.”
—Chris, (06:49)
“My preference has always been do it in smaller steps and let's evaluate and guide you along the way rather than thinking that some holy grail model is just going to fully solve an issue.”
—Nate, (12:41)
“Every company needs to have an internal MCP that exposes data.”
—Nate, (28:16)
“The state of MCPs is an absolute mess. Like it's gotta, it just needs a big overhaul...”
—Chris, (39:32)
“What you can do is reply to a text by like scribbling on your leg ... Their eye movements were so weird.”
—Chris, (53:09)
“They can live translate ... detect the direction of the voice that's speaking to you ... and put up on the screen like what that person’s saying.”
—Chris, (60:47)
“He voluntarily gave, he wanted it out in the media. That's what Geoffrey Hinton is, a player.”
—Chris, (63:44)
On Model Degradation:
"There's more introspection going on than people expect—it's not just round robin routing. They're clearly looking at factors like the number of tokens, the content of your messages, to decide which model to send it to." —Nate, (03:05)
On Execution over Reasoning:
“The real bottleneck in LLMs right now is actually execution, not reasoning.” —Chris, (03:34)
On AI Copilot Impact:
“Six months ago I was writing a lot of manual code. ... Six months later, rarely. I'm just yelling, ‘do this! Plus, no, you're wrong, you're an idiot.' You know, like you're a director now.” —Chris, (48:02)
On Automation and Work:
“Maybe six months from now, people will just start automating away different processes ... their job will be to control and supervise and run those things.” —Chris, (49:14)
On Geoffrey Hinton:
“She got the chatbot to explain how awful my behavior was and gave it to me ... I didn't think I had been a rat.” —Hinton, quoted by Chris, (62:09)
Casual, irreverent, and self-deprecating, but peppered with genuine insights and specific practical observations. The hosts consistently frame technical developments in terms of everyday usability and the real-life messiness of tool integration, with a flair for dry humor and pop culture references. The episode closes with goofy AI-generated music poking fun at Geoffrey Hinton’s “love rat” public image.
“Jeffrey the love rat, king of AI, swiping through the ladies like I optimize ... Got my deep learning charm and my neural net game, when I find someone better, I'm gone without shame.” —Excerpt from the hilarious "Geoffrey Hinton Love Rat" AI-generated song, (66:02–68:43)
Summary:
A lively exploration of the genuine day-to-day challenges in building useful AI—from the realities of model performance and agentic autonomy, through the messiness of tool integration and authorization, to the fun (and foibles) of gadgets and industry personalities. The episode is equal parts practical guide, skeptical commentary, and irreverent entertainment.