Summary9 min read

This Day in AI Podcast — EP99.29

Episode: Gemini 3 Flash, GPT-Image-1.5, Skills vs MCPs, and Our 2025 Model Reviews
Hosts: Michael Sharkey & Chris Sharkey
Date: December 23, 2025

Overview

In this lively year-end episode, Michael and Chris Sharkey reflect on an incredibly dynamic year in AI. With their usual mix of humor, self-deprecation, and surprising technical depth for two "proudly average" techies, the Sharkey brothers review the latest model releases (Gemini 3 Flash, GPT-Image 1.5), analyze the shifting landscape of AI model leadership, and dig into the evolving paradigms for building enterprise AI tools (Skills vs. MCPs). The episode serves as both a 2025 AI recap and an opinionated, practical guide for listeners trying to make sense of an industry moving at breakneck speed.

Key Discussion Points & Insights

1. Gemini 3 Flash: Release, Pricing, and Performance

[01:00–10:30]

Shipping on the Beach:
Chris recounts deploying Gemini 3 Flash while hunched by a brick wall on a family holiday—“looking like a sophisticated programming hobo” (Mike, 00:56), then laments wasting 20 minutes due to misreading a logo.
“I've got glasses coming, by the way.” (Chris, 01:41)
Model Impressions:
Mike highlights that Gemini 2.5 Flash had become their “absolute workhorse” for summaries and reasoning within SIM Theory, expecting Gemini 3 Flash to be faster, smarter, and cheaper.
- Actual outcome: “Pricing's actually gone up… But in the benchmarks it's crazy… It just seems like a sped up, cheaper 2.5 Pro.” (Mike, 02:49–03:50)
- Chris adds: “It's got this really nice tight output where it actually gets what you're after and really sticks to the brief… I actually like the Flash answer better.” (03:51)
Model Naming Critique:
Both hosts agree: the branding is confusing (“they just slap a brand on these models”), and the differences suggest full re-trains rather than mere upgrades.
Unreliability Across the Board:
Mike and Chris share that no current “flagship model” (Gemini Pro 3, Claude Opus 4.5, GPT 5.2) feels 100% reliable on a daily basis—leading to more frequent model-switching in their workflows.

“I find myself in this weird thing where I'm not really trusting any models at the moment. So jumping down to a sort of mid range model that I think is more reliable is actually a really good option to have.”
— Chris, 06:54
Model as ‘Recovery Tool’:
Gemini Flash 3 has become their go-to fallback when major models let them down — “sort of Hail Mary model to get things back on track.” (Mike, 07:51)

2. Image Model Wars: GPT Image 1.5 vs. Nano Banana Pro

[10:45–20:39]

OpenAI’s GPT Image 1.5:
- Mike argues it was a “rush reaction to Nano Banana Pro,” and while it’s a notable improvement, it doesn’t beat Nano Banana on reliability, character consistency, or infographics quality. “Nano Banana Pro typically just wins and is more reliable, especially at things like upscaling.” (Mike, 11:15)
User Perception:
- For mainstream ChatGPT users, GPT Image 1.5 is “good enough,” but AI enthusiasts pursuing quality will migrate to platforms with better models (i.e., Google/Gemini and Nano Banana).
Relaxed Safety Controls:
Chris is shocked by the lack of filtering:

“It made like the most graphic, detailed image… I can't believe that it did that.” (15:03)

Mike’s party trick: inserting Taylor Swift into every photo.
Direct Comparisons:
- Mike runs several side-by-side image tests:
  - “This one [Nano Banana] is so good. … GPT Image 1.5… just looks like a dirty effect over it.”
  - For infographics: GPT underwhelms—“looks like it was made on Ms. Paint 95.” (16:54)
“If Nano Banana Pro didn’t exist, you would be blown away [by GPT Image 1.5].”
— Mike, 17:23
Real-World Usage:
Chris uses Nano Banana Pro for system diagrams, calling it “amazing” for non-artistic users, despite requiring a few iterations.

3. Research Agents: Fire Crawl & Gemini Deep Research

[21:04–34:57]

Fire Crawl Agent

[21:04–27:57]

Key Capabilities:
- An agent that reliably crawls multiple webpages, extracts structured data, and makes decisions when gathering information—“so reliable and so accurate, it has blown my mind.”
- Used for compiling a timeline of AI model releases. Output was so trustworthy they’re integrating it into the research workflow in SIM Theory.
“It is the most useful tool to have in your toolkit as an MCP for research to go off and get data.”
— Mike, 21:58
Why It Matters:
- More depth than a “Google and paste” approach; can now trust an agent to go deep, extract, and check facts rather than just aggregate surface info.
“There's something untrustworthy about [old school] Google searches, paste the first few web pages… This level of decision making … is just such a big step in the right direction.”
— Chris, 24:20

Gemini Deep Research Agent

[27:57–34:57]

Impressive Synthesis & Decision-Making:
- The agentic process includes: deep dives, knowledge-gap identification, and reference synthesis.
- Has a “files” API enabling research over user docs, videos, and datasets.
Workflow Synergy:
Chris hacks together combining Skills, MCPs, and research files:

“The results of using them combined is just unbelievable.” (Chris, 28:53)
- Supports follow-up questions for deeper threads. Mike likens it to how “building a good context is the key to getting good AI tasks done.”
Advice for Organizations:
- Research context + agentic planning + execution = best results. Expensive, but “very essential… or at least one for your most serious tasks.” (Chris, 32:09)

4. Enterprise AI: Skills vs. MCPs and the Evolving Paradigm

[34:57–52:50]

Definitions:
- MCP (Modular Command Protocol): Secure, explicit tool calls for software/data, but all instructions and examples must be loaded into prompt context—wastes tokens, especially at scale.
- Skills: Procedures and knowledge as context, loaded only when invoked—like “codifying business practices into repeatable skills.” (Chris, 41:16)
  - Repeatable, complex tasks (e.g., “brand guidelines,” compliance, filing legal docs)
  - Skills can reference other skills: chaining procedures for consistent, domain-expert output.
Anthropic’s Strategy:
- Skills released as an open standard, promising a future of “one general purpose agent equipped with a library of specialized capabilities.”
“Not sure I entirely agree with that, but… their pitch is like having a universal agent with skills being [the roles].”
— Mike, 43:04

Chris: “I disagree… I think that's what agents are for. An agent will use skills and follow these procedures.”
Practical Impacts:
- “Skills” make complex, expert-level procedures re-usable, shareable, and accurate, with far more detailed prompt instructions than possible with regular assistants.
  - “Some of these skills are absolutely massive in scope in terms of how much they bring into the prompt.” (Chris, 46:50)
- For enterprise: “Getting this sort of data in and actions out with MCP is still the right area to be focused on for now.” (Mike, 52:38)
- “Just try it… when I started to use them, I’m like, okay. This is magic. It's amazing. I get it.” (Chris, 52:50)

5. 2025 Recap: Major Model Releases & Trends

[53:19–63:10]

Timeline Walkthrough:
Mike presents a Fire Crawl–generated month-by-month release chart, shocking both at how quickly things moved.
- Gemini 2.5 Pro released only in March ’25—felt like “a longer relationship.”
- Key observations:
  - “No one was taking Google seriously at all” in January, but they finished the year as top contenders.
  - Feeling of constant model churn and a lack of a universally trusted “core” model, unlike prior years.
  - Continued open source model attempts failed to keep up.
- Chris: “I can’t believe we’re finishing the year in a state where I don’t really have a go to model at all right now.” (09:33)
Model Awards:
- Best: Gemini 2.5 Pro (both hosts)
- Runners-up: Chris—Sonnet 4.5, Mike—Opus 4.5
- Worst: Llama (“in general”), with OpenAI 5.1/5.2 also deemed disappointing.

6. 2026 Predictions & Reflections on the Future

[63:10–78:09]

2026 Will Be “Year of Agents”
(Contrary to 2025’s predictions)
- Skills will reach all providers; agentic workflows will become the organizational standard.
- Mike expects “everyone will recognize [agentic workflows provide] huge leverage… every organization will have their own approach.”
- Open source models may catch up, but still lag behind in key business contexts.
Centralization vs. Control:
- Anthropic and others want to “own” organizations’ skill repositories—but companies may prefer control and flexibility, integrating best-in-class tools as needed.
Hiring and Job Market Impacts:
- Workers who adapt to agentic, planning-driven AI workflows will become “immensely productive,” causing a hiring slowdown in roles that don’t.
Best Model for End of 2026?
- Chris: Anthropic, unless context window remains too small; acknowledges “it has all of the components you need to get this stuff into the future.”
- Mike: Bets on Google or possibly OpenAI “if they put their mind to what enterprises really want.” (75:46)
  - “I wouldn’t write them off yet… they’ve got enough smart people.”
OpenAI’s “Abomination Store”
- Chris: The AI app store fails (“never talk about it again”). Mike agrees; says “it is truly stupid,” prefers integrating action at scale.
- On model provider skill execution: Both expect to decouple execution, running skills on their own infrastructure rather than via the model vendors, for flexibility and cost reasons.

Notable Quotes & Memorable Moments

On the state of AI models:

“I find myself in this weird thing where I’m not really trusting any models at the moment.”
— Chris, 06:54
On Fire Crawl Agent:

“It is the most useful tool to have in your toolkit as an MCP for research to go off and get data.”
— Mike, 21:58
Comparing Nano Banana Pro to GPT Image 1.5:

“If Nano Banana Pro didn’t exist, you would be blown away [by GPT Image 1.5].”
— Mike, 17:23
On the value of Skills:

“It’s literally codifying business practices into repeatable skills. And that’s what it is.”
— Chris, 41:16
On the future of AI tooling:

“I think next year, increasingly it’s going to be a planning phase that you work with the agent… Delegating, and it goes off and does the work.”
— Chris, 72:05
Betting on OpenAI:

“I wouldn’t write them off yet. We’ll see. I’ll replay this and we can laugh.”
— Mike, 75:58

Segment Timestamps

| Segment | Time (MM:SS) | | ----------------------------------------- | ---------------: | | Intro Holiday Hijinks | 00:02–01:45 | | Gemini 3 Flash Review | 01:45–10:45 | | Image Models Showdown | 10:45–20:39 | | Fire Crawl Agent & Gemini Deep Research | 21:04–34:57 | | Skills vs. MCPs Debate | 34:57–52:50 | | 2025 Model Recap & Awards | 53:19–63:10 | | 2026 Predictions / Future of Agents | 63:10–78:09 | | Closing/Thank Yous & Musical Parody | 78:09–end |

Tone & Style

The brothers keep things laid-back and irreverent, mixing practical technical commentary with “average guy” jokes about their own programming mistakes, dodgy AI usages, and dubious “hacks.” They candidly admit when they don’t fully understand emergent paradigms, repeatedly experimenting with models “live” so listeners can learn along with them. References to surfing, hobo coding, AI-made prank photos, and Taylor Swift in every image keep things fun and relatable.

Bottom Line

If you’re in the trenches trying to wrangle AI tools for yourself or your organization—and you don’t have an army of PhDs—this episode distills which models, agents, skills, and strategies actually work, and which to steer clear of. Despite all the hype, the Sharkey brothers prove that sometimes it’s the “average” practitioners who have the clearest take on a year that was anything but average in the world of AI.

[For further detail, check the timeline link Mike promised in the show or browse the full episode transcript.]

Loading summary

Transcript189 lines

[00:03]
Mike
So Chris, this week the long anticipated average holiday special is here, except we kind of forgot to prepare and it's just going to be a normal episode. But we did think we would play, you know, 21 minutes of our musical from earlier this year and just sit here and listen to it with everyone to start the show. So let's go sort of sing along style. Yeah, maybe a sing along, but here we go, 21 minutes of this going.
[00:30]
Chris
Up stories that'll blow your mind.
[00:34]
Mike
Okay, probably not.
[00:35]
Chris
I do encourage everyone to go have a listen though. It really was the best thing we created all year.
[00:40]
Mike
I really do think it was the AGI moment for me being able to produce a full 21 minute musical that made sense. And you know, maybe no one's noticed, but I, I'll put a link in the description in case you're feeling generous with your time or have nothing to do over the holidays. But we did get new models and we were a little bit behind. I have a photo that I'll, I'll show in a little bit of, of you shipping Gemini 3 flash into SIM theory to show how truly dedicated we were. We were on a little family holiday and Chris is like tucked up against a brick wall trying to ship this model into the product on a hot sunny day in Australia by the beach, looking like an absolute, like, I would say, like sophisticated programming. Hobo is like probably the best description, I thought.
[01:25]
Chris
I thought a desperate degenerate might have been more appropriate, especially because I realized later I forget what it was, but I made a mistake and, oh, I know what it was. The logo wasn't matching up and I'm like, what is going on? The new logo isn't working. Then I realized I just can't see properly anymore and the logo was fine and I wasted about 20 minutes. I've got glasses coming, by the way.
[01:46]
Mike
So I've got a little test later to test GPT image 1.5 on you to see if you can tell the difference between Nano Banana Pro Pro and GPT image 1.5. And I'm going to use the Hobo pick as this is like the Hobo test. But let's head over to the sort of breaking model news, which is Gemini 3. Sorry? Gemini 3 flash. The model names this year got even easier to say, this is the new Flash model. There's two variants of it. There's a like thinking version with the, the reasoning turned up and then the normal Gemini 3 flash model. Probably the most interesting thing about this model to me is that Gemini 2.5 Flash is really, in sim theory, an absolute workhorse model for us. It does all the summaries and reasoning and in a new version soon to be out, does all the notification summaries and things like that for ages.
[02:43]
Chris
It also does like tool call selection if you've got hundreds of them. And it needs to reduce the list.
[02:49]
Mike
Yeah. And it's really fast, really cheap, really reliable. It has just the right mix of intelligence. So when Gemini 3 flash shipped, I was really excited. I was like, well, we can get. I was hoping for even faster, more intelligent and maybe lower pricing. But that's not really what happened. The pricing's actually gone up from Gemini 2.5 flash. It's still very cheap, $0.50 per million token input. But as the. And then the output price, sorry, is $3 per million. So reasonably cheap, but definitely more than its predecessor here. But in the benchmarks it's. It's crazy. Like it's benchmarking really well. And that benchmarking I think actually translates to intelligence. When you use this model it's. It's really hard to tell it apart from Gemini 2.5 pro level intelligence. It just seems like a sped up, cheaper 2.5 Pro from my initial usage.
[03:52]
Chris
Yeah, I definitely agree. And I find that it's got this really like nice tight output where it actually gets what you're after and really sticks to the brief. I actually did some work this morning comparing skills and MCPs and I ran it on Gemini 3 and I ran it on Gemini 3 flash and I actually like the Flash answer better. I thought it was more accurate. It stuck to what I wanted better. It used the tool calls appropriately. It's really hard to fault it. It even did the references correctly.
[04:21]
Mike
What I find funny though about the Gemini Flash and Gemini like pro brands is you can tell with Gemini Flash 3 it's just an entirely different train model. And then they just slap a brand on these models. Like there's like. At least this is my interpretation using them. It doesn't feel like this is some sort of continuation of the previous version. It just seems like a entirely new model that they slapped the same name on just to keep the naming and versioning in intact as opposed to.
[04:52]
Chris
I think you said the same thing last week about German Sorry, open AI's 5.2 where GPT 5.2 just seems bizarre. Like it's like a totally different beast of 5.1 which was actually pretty good. And 5.2 is just off with the fairies, just completely wild. I haven't even touched it all week.
[05:11]
Mike
Yeah. And I think I talked last week about the, or the week before last about the tune of these models and people in the comments roasted me being.
[05:20]
Chris
Like it was trained from the ground up.
[05:22]
Mike
It's an entirely new model. Yeah, yeah, yeah. I'm not entirely sure that's true because they've, they've not confirmed nor like, you know, they don't confirm or deny their training. Like they could do different, different tweaking to the model. But even if it is trained from the base up, these models, that training run is built on the foundations of those GPT models. Right? Like they're not going to start from scratch every time. So it is a different variant tune from the ground up and, or train from the ground up. But I, I just think really calling it tuning, even though it's not the right language, is really what they're doing. They're looking for a new tune of it and that tune is particularly weird. But Gemini Flash 3 in my view, I got stuck on a problem during the week that I was using Opus 4.5 on and I thought, you know what, I'm gonna, I'm gonna go to Gemini Flash 3 and also Gemini Pro 3 and see if it can solve my problem. It was a pretty tedious front end bug I had and to my Surprise Gemini Flash 3 just not only solved it, but spat out the entire code. It was perfect, beautiful, very smart. Gemini Pro also did. But I like you. I preferred the Flash version, I thought, you know, just spat it out quicker and solve the problem problem. And I was kind of blown away. I, I hadn't really been able to put it to the test with like a truly hard problem during the week. So it's smart, it's cheap and it's fast, it, it really ticks all the.
[06:55]
Chris
Boxes and I think it's very good timing to have a model like this because I don't know about you, but firstly I'm switching models more than I ever have. Like I, I'm constantly switching models even within the same session at the moment because I find the current batch of top line models just very unreliable. Like Gemini Pro 3 I find unreliable. Claude Opus 4.5 works most of the time, but I always get this impression like I'm not getting the best there is to offer. And then GPT 5.2 is just a write off. So I find myself in this weird thing where I'm not really trusting any models at the moment. So jumping down to a sort of mid range model that I think is more reliable is actually A really good option to have. And I also think that at a time where we're seeing more larger organizations actually want to mass roll this out in a controlled way to their teams, having a model this good at that price is. Is incredible.
[07:51]
Mike
Yeah, I think you make a good point. A lot of people in our own community have been saying, and that's also reflected on X. There's a lot of complaints about Opus 4.5 seemingly being dumber than when it first launched. And at times I find that as well. I'm not sure if it's like when it's under load it just gets a bit dumber or to reduce costs and get their unit economics better, they, you know, they release a different version, like a more tuned version. I'm not entirely sure, like I don't understand what they're doing, but from my day to day usage I do notice at times it seems stupider and so or it gets stuck in like a doom loop. And I think you're right. Having a model like Gemini Flash 3 as a recovery model or like a sort of Hail Mary model to. To get things back on track is pretty handy to have. But I also agree, I think the current crop, and Admittedly Gemini Pro 3 is in preview and Gemini Flash 3 is in preview as well. So you've got to give them the benefit of the doubt. But these frontier models all seem very beta to me right now. Like that there's no. It's not like Gemini 2.5 pro days where there's this core model you can just always trust and rely on.
[09:06]
Chris
And this is the problem because they destroyed Gemini 2.5 prior to releasing 3. I can't even go back to it. Like I've tried that a few times. It's just not what it used to be. So I've actually weirdly been using Sonnet 4.5 quite a lot this week. Me and Patricia just back on the classic old 4.5 just because it gets the job done. But yeah, I just can't believe we're finishing the year in a state where I don't really have a go to model at all right now.
[09:34]
Mike
I think for me it's been Opus 4.5. That's been my true workhorse. But sometimes I go to Gemini 3 Pro and I'm like, this thing is so smart at times, like, but as you said, it's just unreliable. Like it goes a bit mental after a while and it's rough around the edges, like it needs a lot of work. So anyway, I'll be interested to see with Gemini Flash 3 if it's something I continue to rely on over the next, you know, month or so. I, I was impressed with the tool calling. I think they've. It's far better than Gemini Pro 3 that the Flash 3 variant just is far better at tool calling, more precise for some reason. I have no idea why. Like, it's clear, clearly a very different model under the hood. And I would sort of describe it as Croc 4.1, which is very good at asynchronous tool calling, but just less insane. Like it doesn't try and call like, you know, 100 tools at once. It's a bit more staggered in its approach.
[10:33]
Chris
Yeah, agreed. I mean, Grok 4.1, as you know, is often my bailout model that I'll jump over to when I'm really struggling with a problem. And now having Gemini 3 flash, it's going to play that role for me.
[10:45]
Mike
So the other big release we, we missed last week when we were with this episode, obviously being slightly delayed, is the new Chat GPT images, which the model is called GPT Image Dash 1. And this I think was a maybe rush reaction to Nano Banana Pro because I think that that model is obviously pretty incredible. Came with a lot of fanfare. They've got a bunch of examples about character consistency. Bringing together different images with a dog here that I've got up on the screen with two of the researchers from OpenAI and this dog in the output image. My impressions of it are that it's good, it's a little bit cheaper, not a lot cheaper, but a little bit cheaper than Nano Banana Pro. But I would say Nano Banana Pro seems to just be far more reliable in, in terms of producing like character consistency text in images. The infographics I think are better generally from my perception, Nano Banana Pro typically just wins and is more reliable, especially at things like upscaling as well. So again, it's a strange release, it feels reactionary, it probably won't matter because it's at a level now where it's good enough that Most users of ChatGPT will be like, oh wow, it's so much better now. But you just wonder if the integration of images itself is good enough for people to try it again to realize, oh, it doesn't suck now, versus just going over to Gemini to use nanobanada Pro because their friends have talked about how much better it is now. Like, you wonder if they've sort of lost the momentum there in terms of word of mouth.
[12:33]
Chris
Yeah, exactly. I mean, these days Every time there's a model announcement, you just see like LinkedIn and Twitter and stuff go wild with people, like how nano Banana is going to supercharge your plumbing business and like all that kind of stuff where it's like everybody jumps on the keyword. But I think Nano Banana was just such a significant shift forward in terms of accuracy of text, accuracy of prompt flowing and just being able to get large amounts of basically information into an image that was just so much better that I feel like anything OpenAI release was going to be diminished in comparison. Like, I just don't see how you can come out with something better. And certainly in my experience, it's very good. Had it come out in isolation and nano Banana didn't exist, I'd be like, this is by far the best image model we've ever seen. Like, I've made some pretty amazing stuff with it. You can't discredit them. It's very good. It's just that Nano Banana is just better and so I'm never going to bother using it. You know, it's like we've got this amazing thing, but we just don't need it.
[13:34]
Mike
And this is the strange thing about it. Like, they don't just magically have these models and then be like, oh, you know, we're under a competitive threat. Like, GPT 5.2 was definitely, you know, maybe they weren't going to release a 5.2, but decided to because of the hype and, and fanfare around Gemini 3 Pro. But you just, I can't imagine they had this image model sitting there. Maybe they just didn't think Nano Banana Pro would be that good. Or maybe Google wouldn't have a new image model release and originally they were going to push this out. Maybe they don't care because they, they have so many daily active users on ChatGPT. So they're like, whatever, it doesn't even matter to us.
[14:20]
Chris
Yeah, and I think, I think your point around, like, if you're a regular ChatGPT user, well, don't worry about it. Just use it. Like, it probably is enough that you're not going to jump over to nano Banana Pro. So I sort of get that. And it's just a retention strategy. Right. One thing that blew me away though, is clearly both platforms have relaxed their safety controls in terms of what they will and won't allow. I was going to a party and someone was talking about like going to the butcher, so I got a photo of one of my friends and said, make a photo of this person Butchering a pig. And it made like the most graphic, detailed image with like, blood on the floor, knives.
[15:04]
Mike
And it was disturbing it.
[15:06]
Chris
Yeah. You saw it?
[15:07]
Mike
Yeah, yeah.
[15:08]
Chris
I'm like, I can't believe that it did that. Like, even back in the day, the uncensored models usually wouldn't do that kind of thing. So they've clearly all relaxed about it. And even the celebrity images seem to. It doesn't like bulk when you give a celebrity name most of the time.
[15:23]
Mike
No, I mean, my party trick at the moment is making it look like Taylor Swift is in every single photo I take from now on. Just like in the background or like, you know, next to me or whatever. Because it really throws people. It's so realistic now. They're like, their mind, like, they know it's AI still, because why would she be in any of my photos? But it is really interesting that it's gotten that good that you can sort of like randomly do it. Anyway, so I put these two models to the test, both GPT Image 1.5 and Nano Banana Pro. So the first test was to get a timeline of the this day in AI music releases this year as like a recap infographic. And Nano Banana was able to say we had 21 total tracks. Amazing. Two albums released on Spotify.
[16:13]
Chris
Just a music podcast at this point.
[16:15]
Mike
Yeah. Incredible. And seven singles. And so this is the Nano Banana one I have up on the screen now. I was going to get you to guess, but I realized it actually says it on the screen. So it's also.
[16:27]
Chris
I see it in like micro vision, so it truly would be a guess. Plus, I also found out my vision doesn't work so great.
[16:33]
Mike
Yeah. So you're not rating the. The image models. So it has.
[16:37]
Chris
It has each track.
[16:38]
Mike
It's done like a. A nice design and like a workflow with this space sort of midnight theme that I was going for. So that's Nano Banana Pro. And then you compare it to GPT image 1.5 and that's depressing.
[16:55]
Chris
Yeah, that looks like it was made on Ms. Paint 95.
[16:58]
Mike
Yeah, that's the vibe I get too. It's like a really. It's just not great. And I think this is the thing with like infographics and stuff like that, or even slides. It's just not that good. And at times it looks. Yeah, it looks like really cheap design. Like awfully cheap. And the text doesn't make sense either.
[17:23]
Chris
So.
[17:24]
Mike
Yeah, anyway, it's okay. Like, it's like again, like you said, if Nano Banana Pro didn't exist. You would be blown away. So the next thing was that this day and AI holiday card that I did. And let me just go down to the holiday card.
[17:42]
Chris
Really nice.
[17:43]
Mike
Yeah. So this one is GPT Image. GPT Image 1.5. The characters both are the same, even though I asked it to make it look like us. But neither of the models were able to do that correctly, so that I'm not gonna strike it down for that. But I would say in this case, just from a, like, you know, percent like design, taste point of view, I think GPT Image 1.5 really won.
[18:10]
Chris
Definitely.
[18:12]
Mike
The other one's not bad, though. Neither of the both of them are, like, crazy good compared to what we've experienced before. Now, here's the photo of us at the beach against a brick wall with you trying to ship a model, the original photo. And then this is just to see, like, if it can edit images. And then I asked it to make us look like homeless programmers in both models so I can compare. Now, this is what nano Banana Pro came up with. It is so good.
[18:46]
Chris
I mean, this is what I look like midweek, to be fair, but it's pretty good.
[18:51]
Mike
Yeah. It's pretty sad, though, that it always puts us in veteran outfits whenever I ask for homelessness. American veteran outfits, for some reason.
[19:00]
Chris
Yeah.
[19:01]
Mike
Now this is GPT image 1.5. And all it's done is put, like, a dirty effect over it.
[19:09]
Chris
Looks more like a painting or something.
[19:12]
Mike
Yeah, it's. It's pretty bad. So, Yeah, I mean, there's. There's a clear winner with. Yeah. Nano Banana Pro. And then finally, I asked it to do a holiday card with our actual faces for, like, character reference style. So this is Nano Banana Pro. I think it looks pretty Sydney.
[19:33]
Chris
Or did it figure.
[19:34]
Mike
No, it figured that out from research. Chris has a shirt on. What did Ilya see? A Christmas, like, holiday sweater. And I have one on that says Free Sydney.
[19:43]
Chris
We need to get those made.
[19:45]
Mike
They look great. Yeah. Yeah, we should have got them made. That would have been the good holiday surprise. But no. And Then GPT image 1.5. Yeah. I don't know. It doesn't look like you at all. And it doesn't really look like me that much. So the character consists.
[20:02]
Chris
I think it's pretty close. Yeah, but you'll.
[20:04]
Mike
You're blind and looking at it in a tiny thumbnail. It's. It's good. But the. I don't know. To me, Nano Banana Pro is just, like, a leap ahead. Like, just this step, step ahead. Where Generally it's. It's preferable, I'm sure, for all the people that listen to this podcast. This segment has really not been great. But all you need to know is if you have choice in model right now and you're trying to do any kind of image work, whether it's like marketing images or holiday cards or whatever. I think the go to is Nano Banana Pro.
[20:40]
Chris
Yeah, it's pretty amazing. I've been using it to make like system boundary and architecture diagrams. And the accuracy there. I mean, yes, admittedly it probably takes two or three iterations to get it how I want it, and occasionally it'll just screw it up in weird ways. But generally speaking, being able to get a diagram for someone who's not an artistic person at all, getting a diagram like that done is just amazing. It's really, really powerful.
[21:04]
Mike
So we wanted to talk about two releases. That one's more recent and the other is less recent. But define recent. Like they're both in this month, but we kind of just didn't have time to cover them or skimped over them at times. And now we've had a chance to check them out. And these are two very like research specific agents. So I thought we'd start with the Fire Crawl agent, which came out, and I don't think this got enough hype or, you know, no one's really maybe paying attention. It's currently in Research Preview and it's available through the API, which is how we're consuming it. We're about to add it into Sim Theory. We've been using it though prior and it's pretty phenomenal. There's a lot of use cases, if you think about it, where you want to extract data. And models in the past haven't been terribly good at this, especially if you want to extract a lot of data. And so to give you an example of where this shined, when we were just doing a little bit of preparation for this episode, one of the things that we wanted to do was get like a history of all the things that have happened this year, like all the model releases was one of the things that I wanted to do. And I'll bring it up on the screen now. So it it I asked, did I need to do research on the timeline of model releases this year, like by month, what models were launched by major labs each month. I mean, it's a terrible prompt, like all of the prompts I feature on this show. But what Firecrawl agents able to do is call this Firecrawl agent MCP and you can see if you look hard enough. I use Gemini 3 flash for this. And so it's able to go off and, and act like an agent, like crawl, crawling multiple pages, scraping data from many pages, making decisions as well. So it'll go off and go down different paths until it feels like it's got a true representation of the information you requested. This thing is so reliable and so accurate, it has blown my mind. It is the most useful tool to have in your toolkit as an MCP for research to go off and get data. And you can see what it output here, like each month, each model summary of it, what, what, what company release, what really powerful stuff. And I went off and checked a lot of this data because we've actually gone off and built out of that data a model timeline, which we'll get to in a minute. But I did want to call it out. I think it's a phenomenal tool in terms of doing research and finally being able to reliably go off and get an agent to go and fetch and retrieve information accurately as opposed to having to run one of these like deep researchers or just when you know you want to crawl and go and get information. I think it's a really great tool and it can do so much. It can go off and like, you know, research how to book flights to Bali or whatever. So it's really a reliable, great tool. And I'm excited to see how it evolves because currently in preview, but we have got it as an MCP soon to be released into SIM theory called Fire Crawl Agent.
[24:20]
Chris
And what I love about it is that depth because there's something just for me, just untrustworthy about do a couple of Google searches, paste the first few web pages you find into the context and that's your research because I feel like that's so easily manipulated. And we all know from doing our own Google searches that it's not often the best way to research. So this level of decision making while it's doing it, doing more tailored queries, getting in deeper, extracting and verifying actual data is just such a big step in the right direction to where you'll trust it more. Not to mention firecrawler's just done a lot of work. We're not sponsored, by the way. However, guys, if you're listening, send us merch and then we will be sponsored.
[25:02]
Mike
I take some Firecraw merch.
[25:04]
Chris
Yeah, they're pretty responsive when you email them and stuff, so maybe they're listening anyway. Yeah, so I think that that is the trend with the research. And I think having that as part of your research arsenal is really, really powerful.
[25:17]
Mike
This is the thing that the research paradigm for me is kind of like this is why I like it. Right. The research paradigm right now is like most apps that you use. It's like put into research mode and then wait for it to churn away for way too long that I honestly lose focus. Whereas I like this as part of a conversation, like go find all that data and then it's of part bring it into context and then you sort of continue the conversation. You're not getting just, just, just absolute spam output. And I think that doesn't really work like because the use case for me is not like I want to produce a research report. It's like I just want you to go get that information and then use it as part of this context to continue on. I think that's where it really shines.
[25:57]
Chris
Yeah, I've been using it. I was looking into certain things around certifications and other things where I wanted in depth information and comparisons and things like that. And I think when you want to get that kind of comparison and deep knowledge about something, it's really good. The other thing I've found lately is, is when you want to develop new features, for example, and you're looking for accurate docs. One of the problems we have is a lot of the things we're developing is so new that it's really hard to find good examples of different things. And I find that this Fire crawl agent is actually able to go off and get really solid examples to prompt your agent to then go and write the code. So I'm increasingly actually trusting tools this to help me in development. Whereas before I would do all that bit myself and build the context myself, but less and less do I have to do it manually now. And I think that's as we're getting this one step closer to an agentic style workflow where I'm basically at the point where I would trust an agent now to do that research itself and then write the code itself and not have to have me in the mix for those parts of the process.
[27:02]
Mike
Yeah, I think you summed up my thoughts there so brilliantly. Like I trust it now. And that's the main, the main takeaway for me is these, these research tools are getting to a level, especially this Viaco agent where you can, you can pretty much fully trust it to go get docs or whatever you need. And it's reliable and accurate. And I mean it's probably not 100% perfect, but it's getting bloody close.
[27:26]
Chris
It's as good as we would do with a quick search around, you know, and, and going in and reading the docs and that kind of stuff. And I think that's where it's key. And I've seen it on things like collecting stat for sports and things like that where I never ever trusted mcps or models to do that because would they would get the most superficial stuff and just make up the rest and you really couldn't trust it. And you needed to overcome that process by manually building the context. Whereas now we're getting to the point where you don't need to do that anymore.
[27:58]
Mike
So yeah, you should definitely check it out. Firecrawl Agent, it'll be available soon in Sim Theory is an MCP to try out. It's a very, very handy tool. The other one we wanted to talk about because we are currently working on some pretty fancy new research tools and one of them that came out in the month, earlier in the month, but still pretty, pretty new is Gemini Deep Research Agent. Now this has previously been available I think on the like Pro Max plus Google plan. However you subscribe to it. I'm still unaware but. And I think you were like kind of limited with this. You can only do like 10 a day or something like that. Anyway, they've finally released it as an API, so now you can get it to do research tasks. And it's powered by Gemini 3 Pro. Now you've obviously been working with this extensively and know a ton about it now. So you know, like, is this everything it was promised to be? Gemini Deep Research.
[28:53]
Chris
It's incredibly good. It's, it's really actually amazing and I'm blown away at the, the level of detail it can, it can go into. What it does is. It actually is more agentic. I mean, I know it's in the name, but I'm going to say it anyway, where it actually isn't just blindly going off and finding the sources, it's making decisions as it, as it goes along. And you can see it. So like just, just to give you some examples of the, of the process it goes through, like I see it starting deep dive, structuring the research into distinct phases, making initial findings and then going off and pulling at each of the threads in those findings. And then it says identifying knowledge gaps, researching next steps. So you'll end up at the point where it's gone into every little facet of the things you're researching it then Synthesizes that into a report and then cites the whole thing with references. It's absolutely amazing. And I inadvertently made our version of it better by completely misunderstanding how it works. I didn't actually get it because you can provide it with tools, right? And the the two built in tools are crawling websites and searching the web, right? So as part of its basic tools, Google provides web crawl and web search and that's how it does the majority of the research. And so you can just do it that way and that that provides everything I just described. But it also has a files API which is its rag retrievable augmented generation where you can provide it with hundreds of files and videos and audio and all sorts of stuff. And it can alternatively do its research on that. Now when I first was building this I thought it was both and you would combine both. And I'm like that is far superior because I want to be able to do both, right? Because then I can get the best of the web, but also I can run all my research, MCPS and now in the new world, and we'll talk about this soon, skills where you can have a whole bunch of Claude skills that have existing databases and access to other APIs and things like that. And so take all of that knowledge, put those into files and then run those through the Google research agent as well as the Google searches. So I went off and built all of that only to realize, oh, that doesn't work, you can't do both at once. But anyway, I've found a way to combine them and the results of using them combined is just unbelievable. Like the decision making of this thing and its ability to synthesize that knowledge is incredible. And the advantage of that is you've kind of got both, you've got a detailed research report on the topic you've looked into, but additionally you can then add that into your context so you can then deal with your assistant as normal with that, all of that research now piled into the knowledge that you can then work with and things like that. The other thing it has as as a built in feature to the Deep research agent is the ability to ask follow up questions. So you can actually ask it to then pull it a thread deeper like oh, tell me more about this area of the research or expand this and it will start a new process which will then expand upon that research so you can actually get into that normal flow we have where you're sort of iterating with it. But the power there's huge. And looking at it as part of an Agentic and planning workflow, I can see being really powerful because one thing we've learned is that building a good context is the key to getting good AI tasks done. And I feel like having a powerful research modality that you can do prior to even planning. So it's like, okay, these are the areas we're going to need information on to get this task done. Let's do extensive research on them in this agentic mode, like multiple threads of research, Combine that research, then we're going to build our plan of how to get this task done, then do it. This is a very expensive way to get your task done. But I'm saying if we want to get to that max agentic level, I feel like this is a very essential tool in that, in that process, or at least one for your most serious tasks.
[33:05]
Mike
So if you're a researcher, like an actual legit researcher, and you've got a folder of all of your research, right? And so, so you can take that research, put it into this, combine it with, say, skills that access different, like research databases, MCPs that connect to different research tools, and then run that with a query. In Gemini Deep Research, in theory, it can now take into account and use all of those capabilities.
[33:40]
Chris
That's right. Now you could argue that, okay, well, if I'm getting it from all these disparate sources, maybe you just see Google's deep research as one component of that and let the regular model synthesize everything, which it's perfectly capable of doing. Because if you think about it, it's probably really just Gemini 3 under the hood, Gemini 3 Pro under the hood, doing that anyway. But there's something I like about having it all in one cohesive process with that, with the benefits of the knowledge that Google's clearly put into that, their agentic flow. So one thing that I really feel at the moment, we're entering a time of I don't think anyone has the processes right in terms of the combination of like MCP skills, research agents, things like that. I think it's a very rapidly evolving area. So what I want to do for us is really keep our options open and allow it to work in both ways. So it's like, yes, pile the information into the agentic, Google Deep Research agentic, see how it goes, or alternatively run it independently with say, only the web search, but not the files and other contexts. Take that knowledge, get it back into your regular assistant and then you synthesize from there. So I really feel like I don't know the answer to that. But I just want to have all the options available and see. See what works best.
[34:58]
Mike
Yeah. And of course I think the skills are starting to have their like, moment in the sun right now. And it's also a really good segue into talking about this skills versus mcp. Like perplexity problem. Where do I use a skill? Like if I'm an enterprise or a business now, do I build skills or do I build mcps or do I build both? Like, where do they sit? What are they? You know, how to best use them. And during the week there was an article in VentureBeat, Anthropic launches enterprise Agent Skills and opens the standard challenging OpenAI in workplace AI. So what they've actually done is taken this skills paradigm, which we'll talk about in a, in a moment, and released it as another open standard. So of course they released mcp, which everyone's adopted now they've released Agent Skills, which a lot of people are adopting. There's a few quotes I wanted to just call out from this article. One of them is we've also seen how complementary skills and MCP servers are. Marag noted MCP provides secure connectivity to external software and data, while skills provide the procedural knowledge for using those tools effectively. Partners who've invested in strong MCP integrations were a natural starting point. So just a recap to position these. So really an MCP is a tool call and some of the challenges we've talked about on the show before with these tool calls where all the model to be aware that it can call these tools and how to call them. The challenge with MCP is you put a lot of, you fill a lot of the context with information about how to access and use those tools even if you're not using them. And so.
[36:49]
Chris
Right. And so it's a big problem because if you've got a hundred tools and the descriptions are long and often the descriptions need to be long in order for when the model chooses to call that tool. You don't want it to be in the dark about how to best call it. So you need a bunch of best practices and things like that. So you've either got the problem where you've got to filter those tools down or limit the knowledge it has about them, which means you get less intelligence from the model because it doesn't have the full arsenal available to it or you, you know, you have to try and make them more combat. I don't know, there's just a trade off there. It uses a lot of tokens and while it's Great. It has that disadvantage with skills. The way it works is they have this thing called front matter, which is a small description of what the skill is for. And then the model decides when and how to use combinations of those skills. Now what's interesting is that Anthropic's the only one that publicly their API supports this. We know that OpenAI is bringing it out, but in Anthropic's implementation, you can only have eight skills enabled in any given workflow, right? So you can only have eight. And trust me, there is a lot of these things. I've got one website alone that has 30,000 skills available, and this is what's blown my mind. I might take a step back and just for those people who don't understand the difference between skills and MCPs, just to explain it. So MCPs are a series of tools. That's all it is, just a list of related tools that will give you access to some external system or maybe something on your own computer, but usually an external system. What skills are is more like an additional context that gets loaded into the model when it chooses to. That gives a procedure of how to do a given task. Now, that task can do things like execute code, or it can run programs on the computer in a sandbox and those kind of things. In the case of something like Claude code, it can actually run the code on your computer and that's slightly more powerful because then what it can do is access the Web and Access APIs and things like that. So what the skills are better for is repeatable processes. So we use examples like applying brand guidelines for your company. So I'm producing a document, but before I output the document, I'm going to follow the brand guidelines thing. The other thing skills can include is files. So you can actually have a Word document bundled into the skill. That is your template for that document to output. Or you can have extensive database level documentation built right into the skill. So it becomes almost a form of RAG or research, but it can actually be more accurate. And so you could even have say, a company knowledge base or a guidelines of how to accomplish particular procedures that need to be followed exactly, like filling out forms or writing a compliance report or writing a grant application or something like that, where it's a very specific procedure that needs a specific output format. You can do that with a skill. Now technically, MCPs could do this too, and that's where there's crossover. But it has the disadvantage, you said, which is where if you want to do that, you need to provide all of that context in every prompt that you send to the model. Whereas with a skill, it's only loading that context in when that skill is invoked. So it's just a different way of working. And I having used it a bit now, I'm starting to see the advantages.
[40:24]
Mike
But so how do you think this plays out over the next 12 months? Like, do these things just merge? Because to me there's so much overlap. But at the same time, you know, people like us and Anthropic and others have been out there saying like, go and build your enterprise mcps. Integrate it with your data, like with control permissions. There's, you know, there's like actual, like Amazon Azure has different like services now host your MCP securely with permissions. Like there's a whole ecosystem built around this. And really, I mean at the most basic level thinking about it is the skill is just a way to contain all of the prompting and instructions. And it's almost like an application written in human language that the model can access when it needs to, versus always having that stuff in the, in the context.
[41:16]
Chris
Not to mention, one of the things we've always talked about with MCPS is you often give these incredible examples on the podcast with using MCP tools and everyone's like, wow, how did you do that? Like, how did you make that song, for example? Like, you built the context, got it to write the song, the song was amazing. How did you do that? Now the advantage of a skill is you could actually document your process and have rules and have standards and have output required output formats, required output artifacts. And you could actually document the procedure you do to make a banging song that, that Spotify worthy, right? And then someone could reuse that skill and get the same quality you're getting. So it's almost like a way of like making ip. Like it's literally codifying business practices into repeatable skills. And that's what it is. And so the immense advantage of that is there's 30,000 of these things available where people have put their expert knowledge out there that you can actually get and just integrate directly into your workflow now and take advantage of that. And there are just some absolutely amazing ones around that. Like for example, critical thinking procedures. Remember they don't have to be code execution based. They don't have to be about accessing knowledge. They can simply be follow this thinking procedure. Like this is a way of critical thinking that we want to use when making these kinds of decisions and it forces it to go through that process. So what I imagine in the long run we're going to see is the skills actually driving the tool use. So in other words, you'll start with a skill that is your procedure to get something done and then when and where necessary, that's going to go off and call an MCP to get taken action or retrieve knowledge or something along those lines.
[43:05]
Mike
It's interesting, in that same article they said the skills approach is a philosophical shift in how the AI industry thinks about making AI assistants more capable. The traditional approach involved building specialized agents for different use cases. A customer service agent, a coding agent, a research agent. Skills suggest a different model. One general purpose agent equipped with a library of specialized capabilities. Not sure I entirely agree with that, but, but I think their pitch with it is like having a universal agent with skills being like customer service agent, a coding agent, and like codifying those, those roles in a business into skills. Now I just don't know if I agree with that approach right now.
[43:47]
Chris
I disagree with that strongly. I don't think that what, that's what skills are for. I think that's what agents are for. And there's a, there's a big difference there. An agent will use skills and follow these procedures, but they're part of its, its set of tools. It's not, it's not who they are.
[44:02]
Mike
Yeah, like I would see in a customer service agent, for example, having skills of, of like how to process refund, how to change subscription, how to look up knowledge base and fact check response like policies and procedures before responding, like tone of voice guide. Those kind of things is more skill to me.
[44:26]
Chris
Exactly. Like make sure we pass it through our, our safety skill, our tone of voice skill. Before you go, keep in mind you can also the skills themselves can refer to other skills. So you can say to the model, hey, every time you complete this skill, you now need to run this other skill, right? And direct it to that one. So example you see commonly with the skills is something like do research on these academic databases and then use the scientific writing skill, which is whenever you're writing or revising any section of a scientific manuscript, use this skill, use it for formatting it using the correct reference style, that kind of thing. So you get yourself in this situation where you're getting this consistent output in the format you want, it really lowers the model hallucinations and gets you to where you want to be a lot faster. And people are just sharing their business knowledge. Like we talked about this for so long, like, oh well, I can't be an expert lawyer without the knowledge to see if the model is giving me legal output that's accurate with skills, maybe I can be. Because if I combine the right combination of these skills that are, that are doing checks and balances, making sure the writing style works, fact checking it, like you said, checking it against the law database to make sure it's accurate, suddenly I am two or three steps closer to being able to do that in a professional way. Like, it really, really takes it to the next level by forcing it through almost these, like, you know, choke points where it must do the right thing.
[46:01]
Mike
Yeah, like industry expert points. And I think, interestingly though, you can imagine a law firm or like, if you're a researcher, you have your own methodologies. Like different organizations have their own procedures and processes and methodologies how to do things, and codifying all of those, those things into skills and then chaining them together or linking them together in the appropriate way could be a huge unlock. Because if you have a structure of like how your law firm thinks about writing a contract or a will or whatever it is, and that's codified, you can get a lot more work done in a precise manner now, in a reliable manner as a result of a skill. And then you just see the MCP really is tapping into different databases, like a customer database or whatever.
[46:50]
Chris
It actually like retrieving knowledge to add to the context or taking actions. And that's what you use your MCP tools for. So it might be okay. The next step in the skill is I need to gather the relevant data to make this decision, call the MCP tool call that goes into my database, runs the appropriate queries, brings that data back. Okay, what's the next step? Now we perform deep analysis in an Excel spreadsheet, and that's the, the next step in the skill. Right. So it really becomes that the MCP tools become part of the equation that it's working through that. Currently we rely on the models to just raw call the tools at the right time when they decide, and we bring that into a more structured format with the skills. Now, the other thing that's crucial about skills that I think brings the biggest advantage is we always talk about the overall prompt that you're giving to the model. Now, the advantage of skills is you can bring in a far more detailed prompt. And some of these skills are absolutely massive in scope in terms of how much they bring into the prompt. Now the problem is if you're working in a regular environment with a regular assistant, I can't have, I can't account for everything this assistant might ever do and have 30,000 tokens of context to say, well, when you are answering a support ticket, follow this massive 30,000 token procedure. On the off chance, that's what the user asks, right? I have to have it in there. Whereas with a skill I only have to have a reference. Hey, this is available. If you ever come across this situation, then it can load it in and then it can have extremely detailed instructions. So something we see at the enterprise level is teams sharing assistance that are sort of surrogate, like their own way of doing this, the hobo way of doing this, where they're trying to have an assistant that has specific instructions for specific situations. Sorry, I can't say that. And so what the team is having to do is, oh, okay, I'm going to go to this assistant when I'm dealing with this problem and this assistant when I'm dealing with this problem. What this allows you to do is the whole team share those abilities and only load them in at the appropriate time.
[49:03]
Mike
I think too, what's interesting about this is we were sort of naturally getting to skills with mcp anyway. Like I made that I've talked about on the show before, a theme maker for Sim Theory. So you can just ask it to make a theme, like make a Christmas theme and switch to it. And I found it really hard to make that because I would have to put structured examples of theme files, which are JSON files.
[49:28]
Chris
Yeah.
[49:28]
Mike
Into the prompt for every single chat. If you had this MCP enabled, which would eat context. And for those that don't know, like Jason eats tokens. Like it really eats tokens. So I was like, how do I overcome this? So I created a skill, learn how to create theme schema. And I said, when you're creating a theme, you have to call this tool first. And so the prompt for the theme maker is incredibly concise and that allows it now to work perfectly.
[49:58]
Chris
But you inadvertently made skills in the mc.
[50:02]
Mike
Yeah, I mean a pretty low level version of it. But I think the benefit of Skills is not just that this ability to then go and execute code and run processes and actually run software. So one of the skills that's quite interesting is the document skills, the Docx skill to create a Word document or an Excel document. And what it actually does is it gives the model the ability to create the file in a Python sandbox. So it has instructions and brand guidelines and everything about how to create this document. And then it physically goes and executes that in a sandbox in order to build it and then returns that to the model to output. So it can. Yeah, I think it goes.
[50:44]
Chris
You've also got sort of all the advantages of like a code interpreter style environment where any Linux based tool that exists can, well within reason can be used as part of these skills as well, remember? So like it has far more advanced skills than just model inference. It can do a whole bunch of stuff and that makes it so much better for empirical style work because it can actually be accurate and not hallucinate and those kind of things, which is another huge advantage.
[51:13]
Mike
I mean, I guess having said that, you could still do it with MCPS because you could just call a tool and then have your own like, you know, code that's executing to deliver that. But it, it feels a lot more. I think what's exciting about it is once you start building skills and writing your own skills, knowing that the model then has that sandbox available to go and do more complex tasks is definitely interesting. So we do plan very, very early in the new year to bring Skills into SIM Theory. So we hope to support nearly every major model with it eventually. We want to deliver this. We've got a skill store already ready to go, so they're really easy to install and a skill builder so you can build and train your own skills in there as well. So we think it's good enough. We want to heavily adopt it and allow people to share these in the store as well so our community can share knowledge. But also like enterprises and larger organizations who use the SIM Theory workspaces product can, can actually build out around these skills knowledge with the MCPS behind the scenes as well. So we're definitely excited about it. But I think for the, like, if you're sitting there panicking, being like, oh, now I've got to get my head around skills. And then there was mcps. I think you got to remember Skills is very, very early days and evolving very quickly and likely to change pretty dramatically. So I still think if you're an enterprise today, getting this sort of data in and then actions out with MCP is still the right area to be focused on for now.
[52:51]
Chris
It's not, it's not going to be hard to shuffle it around between the paradigms. And what I would encourage everyone to do is once we release it is just try it. Because I must admit I was building it and I still didn't understand and I'm like, what's the difference? I can't really understand why I'd use this instead of an mcp. Then as I started to use them, I'm like, okay. This is magic. It's amazing. I get it. I understand now. So I think, like, all this tech, you've got to go try it, and then you'll see where the right fit is for you, you and your organization.
[53:19]
Mike
All right, so I wanted to now get into a bit of a recap of 2025 and then talk about our, like, predictions or what we think will happen in 2026. Now, I built this timeline. I did use Fire Crawl agents, so I practice what I preach, and that's where the data comes from. So if it's wrong, when I share the link below, you'll know if it's wrong.
[53:44]
Chris
Ignore everything Mike said earlier complimenting Fire Crawl, and we'll. We'll send the merch back.
[53:48]
Mike
Yeah. And so this goes month by month. There's also a filter by labs. I'm just filtering by the major labs that we talk about, at least on the show. But I, I look, I thought this was really interesting to look back at, and primarily because how quick the year's gone. Like, I. So just without trying not to look at this for a sec, when was Gemini 2.5 Pro release? Was that 2024 or 2025 is my first Gemini, which, sorry, Gemini 2.5 Pro.
[54:18]
Chris
I'd say, I'd say. Well, based on the fact you've. You've got it there, I guess this year. I would have said last year.
[54:25]
Mike
Yeah, I would have too. I thought it was over. Like, I thought. I, you know, I, like, I feel like I've had a longer relationship with that.
[54:31]
Chris
Was it like September?
[54:33]
Mike
It was March. March 2025 was Gemini 2.5 Pro, Gemini.
[54:38]
Chris
2 Flash, but strong.
[54:40]
Mike
Yeah. Gemini 2 Flash came out Flash thinking in January and O3 mini came out in January. So I do think at the start of the year, if you look at the releases, OpenAI was well and truly ahead. Like, they did have the most, like, intelligent models by a long way, I think in terms of the most practical model. A lot of people then were starting to sour on Claude 3.5 sonnet at the start of the year. That's how far we've gone.
[55:08]
Chris
Wow.
[55:08]
Mike
And so in February, we had Gemini 2 flash light, which.
[55:14]
Chris
That's. Is that the one multimodal, like the image output?
[55:17]
Mike
No, no, no. That was just the, like, super cheap, super fast, like, on device, Android could run it model. Then we had Gemini 2 Pro, and then we had Claude 3.7 Sonnet in February. Groq3. It was actually February was like a big release window. GPT out of the sloppy model, period.
[55:39]
Chris
Yeah, like 3.7, even now in my mind is like, yeah, it can do stuff, but it's sloppy as hell.
[55:46]
Mike
Yeah. And then GBT 4.5, which was, remember, just so expensive, but a lot of people did like it, but just way too expensive and not that much smarter than 4o. That allegedly was the failure GPT5 training run. Then March, we got an upgrade to 4.0 where it could do images. So the original GPT image they called like 4O image, I think from memory. And March, of course, was Gemini 2.5 Pro, which for a lot of us, in particular us, we were sort of like, this is the turning point in terms of Google starting to take the crown. Many disagreed. I still think Gemini 2.5 Pro is super underrated in terms of what it kind of did at the time.
[56:31]
Chris
I think it's the best model we've ever seen.
[56:33]
Mike
Yeah, I, I agree. And then April, we had Llama four, which was the biggest letdown, I think.
[56:41]
Chris
If it's given up at this point. I read an article yesterday saying his Metaverse thing only has 900 people on it.
[56:49]
Mike
Yeah, I mean, I, I don't know. I. I think in terms of AI, they're all in, but they've definitely given up on that Metaverse stuff. So, yeah, Llama four, we had Scout Maverick. So bad. I like such bad releases. And then we had GPT 4.1, not to be confused with GPT 4.5. And then we had O3, the, like, the large O3 version and O4 mini. So April was really all about opening.
[57:18]
Chris
Those two were decent.
[57:19]
Mike
I think 03 was a great model, like a really great model. I think that's the foundation of the GPT 5.2. It feels like there's like, O3 is somewhat heavily in there somewhere. O4 mini as well was pretty damn impressive. Then in May, we had GPT 4.1 mini, who cares? We had VO3, which was a big release from Google, like, just blew our minds in terms of video generation. And that month of May, we got Claude for Opus and Claude for Sonnet as well in there. June we had O3 Pro, which was the, the Pro model that was affordable enough. We actually had it in Sim theory, if you recall. And it, it was, it was pretty good. It was, it was the phone, a friend model for a long time for me, I was like a Gemini 2.5 Pro guy. And then I would use O3 Pro as my lifeline. July Grok 4. And then of course, August we had GPT5 Claude 4.1, which was a slight upgrade of Claude 4. We had the open source GPT models as well come out. And then September we had that was.
[58:37]
Chris
That made Llama look good.
[58:39]
Mike
Yeah, yeah. And then GBT5 Codex came out in September. And of course I think the probably, you know, fairly major release from Anthropic, putting them back on the map. Claude 4.5 sonnet Sora 2. That Sora 2 moment was September. It wasn't even that long ago with the Sora app and all the videos that were being shared around 1.5 Sonnet.
[59:04]
Chris
Has only been around since August.
[59:07]
Mike
I mean, this could be really wrong. Right? So I don't know, the time scales.
[59:12]
Chris
In my mind is so warped. I thought I've had that forever.
[59:16]
Mike
Let me just like quickly confirm that one. Yeah. September 29th.
[59:22]
Chris
Holy crap.
[59:23]
Mike
Yeah, it's. It. This is why I did the timeline. And I would encourage you, I'll put a link in the description. If you just want to go through the timeline, you'll be like, hang on.
[59:32]
Chris
Serious money off me betting on that. I. I would have thought. I've had that for way longer than that.
[59:37]
Mike
Yeah. And then haiku, Claude 4.5 haiku only came out in October. I rely on that constantly for model calling, but it seems like so much longer than it's been. And then November we, we had the big boy, Gemini 3 Pro GPT 5.1, Claude Opus 4point. Man, November really delivered. Claude Opus.
[60:00]
Chris
They're all crap. Like it's, it's sort of like they're all.
[60:03]
Mike
I don't know, Claude Opus 4.5. I'm a huge fan of, I think and Gemini 3 Pro. Like it needs. Like it's not once it's the official release. I'll judge it because remember, Gemini 2.5 Pro Preview was great, but it was rough around the edges and they, they seemingly stabilized it.
[60:21]
Chris
Yeah, I mean I, I say that, but I must say Gemini 3 Pro has been steadily getting better the last week or so. But you're right, like I just, I just, I don't know, I'm just not there yet with them.
[60:32]
Mike
And then Brock 4.1 was early, early November.
[60:36]
Chris
It's underrated. It's a great model. It's cheap.
[60:39]
Mike
It's so cheap. 20 cents per million. Great at tool calling. Yeah, it is. That model's real underrated. I think too. If they can get it more intelligent and like tune better, that's going to be the one to watch in 26. And then you also had Codex 5GBT 5 Codex Max, GBT 5 Codex Mini. But you know, anyway, there's a couple missing.
[61:03]
Chris
There was GLM 4.6, which everybody loved and is actually a pretty good model. And of course, Kimmy, you're so fine. You're so fine.
[61:13]
Mike
You know, it is in here. I've just got the filters on, I think.
[61:17]
Chris
Yeah.
[61:17]
Mike
So I'm just going through major US labs for some reason. I'm not entirely sure why. I mean, I'm not gonna say like in December, Nova 2 models came out.
[61:25]
Chris
By Amazon because 3.2 deep seat, whatever.
[61:31]
Mike
And then, so then anyway, finishing up Gemini 3 flash, GPT 5.2 and GPT 5.2 Codex. But I mean think about that year. Like just recall we started the year we started the year balancing between O3 mini and Claude 3.5 sonnet. That's where we started. Everyone started in January. No one was taking Google seriously at all.
[62:01]
Chris
Yeah, that's pretty amazing. What a good year the people over at Google have had.
[62:05]
Mike
Yeah. And then March. Gemini 2.5 Pro. Anyway, I look, I think the progression this year has been great. Like I, I think yes, the latest models need to stabilize a bit. But like think about where we started last year with, with you know, Sonnet 3.5. Think about where we got to with Opus. Like those models are like that is extremely better Gemini 3 Pro in terms of just raw intelligence. Extreme, like huge, huge curve. So I, I can only assume that by the end of next year if, you know, if, if we're still here, it's. Who knows? I mean, who knows? No, I mean in terms of like AGI killing us all, that kind of thing.
[62:52]
Chris
All right. I think.
[62:53]
Mike
But yeah, so anyway, so now we go to our, our sort of like best model of the year award and worst Model of the year award for 2025. So what is your best model for 2025? Now we've done that recap and can remember.
[63:10]
Chris
Gemini 2.5.
[63:12]
Mike
Gemini 2.5 Pro. Yeah, yeah. I like, I know we're going to cop hate for this, but I'm going to say the same. Gemini 2.5 pro. I think in terms of most important impactful model, to me personally throughout the year I would say it was.
[63:29]
Chris
Yeah. And runner up Sonic 4.5.
[63:33]
Mike
My runner up would be Opus 4.5. I don't know why you hate Opus 4.5 so much.
[63:38]
Chris
It just, it, it's just, it's just betrayed me one too many times.
[63:44]
Mike
Maybe it just doesn't like the Patricia problem.
[63:48]
Chris
Sonic 4.5 is the one that renamed itself Fatal Patricia. And it's got, it's got heart and it gets my work done.
[63:57]
Mike
Right.
[63:57]
Chris
I'm still using it. I still use it.
[63:59]
Mike
Worst model 20, 25. This is easy.
[64:04]
Chris
Well, are we going all the way down to like GPT oss or are.
[64:07]
Mike
We going like, it's just Llama, right? Like, I don't.
[64:11]
Chris
Yeah, I guess Llama in general.
[64:13]
Mike
Yeah, I think, I think for me, Llama. But then if we're talking like major release, like the most disappointing major release is 5.2. No, I'd say 5.1, it was even worse. Wow. But yeah, the OpenAI has not had a phenomenal year in terms of model releases at least going well. And then finally, like, I don't want to do the whole prediction show thing with, with questions, but if you were trying to predict what we'll see, the general theme of 26, do you, Do you see epic bubble happening and like the whole market implodes? Do you see AGI level models?
[64:54]
Chris
I don't want to do like market based stuff because I've, I think I've proven with my own money that I've got no idea and I'll be wrong. So I don't want to do that. But what I am interested to see, let's say, and I think we'll see, I think what we'll see straight away is every single model provider will support Skills. I think that's probably going to happen in like January, like really early on. We already know OpenAI is doing it and I think every other model will support that Skills MD style thing. I think probably the thing that I'm unclear about is where the agentic process is going to live. Because if you actually look at skills, one thing we didn't mention the difference between Skills and MCPS is when you have MCP tool calls, the model comes back to you and says, okay, you go off and call the tools and then when you're ready, come back and do a fresh request. But as far as the model's concerned, its work is done. You come back with the tool results, you could come back a year later and it would still work. Whereas with the skills, it runs them in line, it runs them as it goes as part of the single request. And so it's a much more longer running process. And we've heard a lot of talk about, oh, GPT 5.2 can run for 60 hours and you know that kind of stuff. So the question is, do the model providers go more down that route where it's like they're taking on more of the computing burden and you're, you're really just loading it in and just waiting and pulling and checking how it's going and reporting that back to the user. Or do we see that not working so well and people like Sim Theory and other, other products being more like, okay, well we'll run it and loop back with you. Because it seems to me like it's going the formal way where you're actually putting more onto the model and it having these longer running, which I don't.
[66:48]
Mike
Think is good for the enterprise, I don't think it's good for consumers, it's only good for the model companies because they can charge for compute that anyone can pay for.
[66:58]
Chris
Yeah, and I don't like it either because for example, the eight skill limitation is no good. I would rather have the opportunity in between each request to maybe give a different mix of skills that's more appropriate for this step in the agentic timeline.
[67:12]
Mike
And also remember what they're doing most of the time is just running and executing Python code, which anyone can do on like a two bit server. Like you don't really need them to go do that, you just need them to tell your system what process to run.
[67:25]
Chris
Yeah, that's right. So I think for our approach next year we will decouple ourselves from that process and not allow chain to skill calls and run the skills ourselves, run them in SIM link, run them in a vm, whatever it is, and don't rely on the model providers to actually run the skills. And I actually think keeping the skills agnostic to the platform provider is key. And that now that the standard's been open, I think that's what will happen is we'll see skill executing systems rather than relying on the model providers. So look, I guess that's not really a prediction because I'm saying, I don't know, I just know that it's going to be the first issue we face next year.
[68:02]
Mike
My, my prediction in 2026 is that we actually finally will see the Year of Agents. Like everyone said, 2025 year of agents. And I think for developers, to be fair, most of them now run agentic workflows with code. And, but, but when, when I heard Year of Agents I sort of thought that meant like everyone using agents and sort of the chat GBT paradigm becoming stale where you, you're more like, I want it to go do things for me and help me with tasks as opposed to just this like chat back and forth. But how I think agents are really going to play out is once the skill paradigm comes into it and MCPS come into it. I think increasingly if the industry can make these things more accessible to everyone, you'll see that Claude code moment and Codex and cursor moments in all of the various different for different knowledge workers. I think that's probably the most likely thing to happen because I can already.
[69:04]
Chris
Agree with you there. I think at an organization level every organization is going to have their own like decided approach to AI and how they implement it throughout the organization. It won't be just a casual chat, GPT or Microsoft subscription, copilot subscription from here on out. I think everyone will recognize that there is huge leverage to be gained by agentic workflows and have their own approach to that.
[69:31]
Mike
I I think increasingly to people in the enterprise and just general consumers of these models are starting to also recognize that they're tools and that different providers have different tools that are better at any given time. Like we talked about Fire Crawl Agent on the show today and Gemini Research Agent and I think increasingly in your workflows being able to access all these various tools and treat them like tools is a far better experience. And increasingly next year we will see people adopt this paradigm of just integrating and picking the best tools for the job and taking some ownership over those agentic flows as opposed to just relying on consuming single model endpoints. And I know a lot of people will probably disagree with that prediction, but I would say that increasingly people are wary of these businesses and like what the future holds. It's very uncertain and unpredictable and it seems to me like being in control of these critical workflows in your business mean that you want to take ownership of those of those processes and being able to slot in different tools and different MCPs and skills and things like that.
[70:44]
Chris
I know what you mean because you can just tell Anthropic is going to try to own that space and say well we'll handle all that for you and we'll be the repository of your the skills for your organization. But do you want your organization dependent that heavily on their specific approach and their specific systems?
[71:04]
Mike
Well, especially Too if in 2026 which you have to think is certain we're going to see an open source model as well good and probably faster than Nano Banana Pro. Let's be honest, we're going to see open source get good enough that it's probably at a level of like an opus today. So you would hope and think that those models increasingly will get there where they're just a lot tighter and more stable and maybe that's good enough for a lot of these agentic workflows. You still, even though open source, I think in a lot of ways from what I see is sort of somewhat declined, like a little bit because there's so much competition in the commercial sector now that prices have been driven down enough, it's just not that economical. And things are changing so fast. It still doesn't make sense to like triple down on, on open source, at least I think right now. But increasingly in the future, I think that time might come and it, you know, anyway, it's exciting.
[72:05]
Chris
Yeah. Especially because a lot of it is going to be around the way you work and the, the agentic procedures and, you know, so for example, instead of doing the workflow that I think is the workflow of 2025, which is building a really nice context, getting your chat into a state where it's solving your problems and working with it until you accomplish your goal, I think next year, increasingly it's going to be a planning phase that you work with the agent, where you come up with the plan of what the steps are for the goal you're trying to achieve. Delegating and it goes off and does the work and then when it's finished, it pings you and you go review it. And you're doing maybe four or five of those at once or something like that. And it's going to be, like you said, bringing that code loop agentic style to every job. And I think that we're going to see the rise of the workers who are able to adapt to that working style becoming immensely productive and a sort of slowdown in hiring of some roles where some people just can take that over.
[73:10]
Mike
Okay, so final thing. Which lab, which company has the best model at the end of 2026, in your view, next year, should we just.
[73:20]
Chris
Not Poly Market because.
[73:22]
Mike
No, don't worry about polymarket. I want your opinion. So, like, we're doing our end of year episode next year, we might have finally reached 100 episodes and you're making a prediction about which lab then has the best model.
[73:42]
Chris
Well, the thing is, if you asked me like three weeks ago, I just would have said Google without even thinking about it because I think Gemini 2.5 was just the unassailable leader. I think Gemini 3 Pro preview is not as reliable as I would hope. And I think the reason I'm gonna probably say anthropic with Opus is that its ability to do things like skills, MCPs, all of the things you need in order to have this modern agentic workflow. Anthropic's the best at it. Like, Gemini's getting there. I think Groq's great at it. Like, some of the other. Some of the other models are great at it, but Anthropic has it all. It has all of the components you need to get this stuff into the future. The main reason I'm hesitating is the limited context window. Only having 200k context on their flagship model, I think, is no longer good enough. And I think it's enough to sort of take it out of contention as something I would use seriously for my agentic workflows. So. But, I mean, they've got. They've got 4.5 sonnet, right? So, yeah, it's a hard one. I don't really know. What do you think?
[74:59]
Mike
I think it's gonna be Google, but I. I still would not back away from OpenAI, especially if they stick to their word around, you know, focusing on the enterprise more in 2026. And it's all about the Enterprise, I think.
[75:16]
Chris
Hang on, so you're saying next year? I thought you meant this year.
[75:18]
Mike
No, that's why. Yeah.
[75:20]
Chris
All right. I don't know.
[75:24]
Mike
Okay. I'm so confused now because I thought you were talking about end of 26.
[75:29]
Chris
No, no, no, I'm talking about right now.
[75:31]
Mike
No, I was, I was. I'm. I meant 2026. So I. I still think open a a sounds like you like the songs. I've been listening to too much AI music, but I. I do believe it'll be Open AI.
[75:46]
Chris
Really?
[75:46]
Mike
Yeah.
[75:47]
Chris
God. Can I. Can I take the. The. The opposite side of that bet? I think anyone but OpenAI. I think their problems are going to continue. They just don't have it. I just.
[75:59]
Mike
I wouldn't write them off yet. I wouldn't write them off yet. We'll see. Well, I'll replay this and we can last.
[76:04]
Chris
This is great. Yeah, yeah. Like, we can have our Open AI wars.
[76:08]
Mike
I just think they've got enough smart people. If they put their mind to, like, what enterprises really want in models, they could build something truly great. But they've got to start from the ground up and. And look at what secret sauce Anthropic has. And I think, given that they've just got more money and more eyeballs on them, that they have an opportunity to really do it. I feel like.
[76:32]
Chris
Can I make one more prediction about them?
[76:34]
Mike
Yeah, whatever.
[76:35]
Chris
That abomination store thing they have where it's Like Travelocity and at the, you know, like all the, all the different apps or whatever the hell they are, that. That just dies that we never talk about it ever again.
[76:49]
Mike
Yeah, I, that, I mean, I didn't even cover it. Like they launched it. No, we, we mentioned that it was coming out, I think. But it. Anyway, right now I think it's bad. I. Maybe if you have like AI apps where it's like similar to what we've done with the Doc editor and you can have these like interactive experience experiences maybe, but we'll see.
[77:11]
Chris
That's not what it is. It's like, it's like a integration store into like web apps.
[77:16]
Mike
Well, kind of. But you can have like full screen apps and then interact with AI with them. But as everyone's saying is like, it's stupid. Why not just go to the actual company's website? Like it doesn't, it doesn't get you the value of mcps, which is going off and gathering context and taking action at scale.
[77:32]
Chris
Have you had a single person ask us about when's MCP UI coming?
[77:37]
Mike
No, but I mean it's just unnecessary. I think output types like, obviously it's great for. But no, I, it's just stupid. I mean like it is truly stupid. If it takes off, I'll. I'm happy to throw an egg at my face.
[77:56]
Chris
Yeah, what are you gonna throw on your face?
[77:58]
Mike
An egg I'll. On. On the show. I'll throw a horse egg at my face and have egg on my face. All right, we better wrap it up. We're just ranting at this point. Any final thoughts for 2025?
[78:09]
Chris
I just gave them accidentally.
[78:12]
Mike
Accidentally. Yeah. All right, well, that is it. The pleasure has been all of ours in 2025. I'm. I'm honestly amazed we made it through another year. Still haven't reached that hundred episode milestone, but maybe one day. I did want to just say quickly, thank you for those that listen to us every week laugh along with us, comment like, subscribe, joining our Discord communities. I also wanted to thank all of the people that are subscribers to SIM Theory that not only support us, but help us keep developing and building out that product that we so passionately love. And to all the enterprises and larger businesses this year that and. And universities and corporations that have rolled out SIM Theory workspaces for their teams. We really do appreciate you and, and just thank you for your support. So we'll see next year, will we hit 100 episodes or will AGI happen before episode 100? So have a great holiday and New Year, Merry Christmas, Happy Hanukkah or whatever it is that you celebrate. We appreciate you, thanks for listening and we'll see you early in the new year. Bye. It's beginning to look a lot like an average year.
[79:42]
Joanie
Let's ring it in. We started out the season with a panic in the air Deep sea our one arrived and gave the market quite a scare. Nvidia was crashing 17% they say while the year of agents hype was slowly fading away. We thought the bots would do the wash and fold our laundry tight but 01 was a hack Just a prompt to start a fight. It's a very average Christmas but on a month waiting for the AGI light We want it super gods to come and make it right but we just got a March dress repackaged for the night oh, a very average Christmas. Yeah. With Joanie Ivan Smoke $6 billion vanished and tell me where's the joke? It's basically money laundering. Then Google brought a model called the Nano Banana Gemini 2.5 the tops in the savannah we loved it on a Tuesday by Friday it was broke Just experimental Is that some kind of joke? We saw a horse lay eggs A hallucinating mare and cloud four called the coppers just cause I asked it where to hide a fire I was riding a play I'm wearing my Dario pendant It keeps me safe from the safety team. Patricia. Patricia. Come back to me, baby. Sonic forest niche and telling crime stopper is my name an OpenAI operator is playing the same game we try to do vibe coding we try to build a clone but mostly we just doom scroll upon our telephone From Sputnik moments crashing to a browser in a dress this year of II magic was a hot and average mess It's a very average Christmas FA da FA da we're waiting for the AGI where is the ig we wanted robots soaring way up in the sky but we just got a cheese ad that made the people cry. Oh merry average Christmas five stars but kind of miss. Maybe next year we'll finally see what the super intelligence did. Merry Christmas you Average animals boom factor 7 out of 10 cha cha cha cha cha cha.
[82:22]
Mike
Sa.