Summary8 min read

Podcast Summary: This Day in AI Podcast

Episode: Can o3-pro write a book? ElevenLabs v3 and MCP as the AI Business Model - EP99.08-PRO
Date: June 13, 2025
Hosts: Michael Sharkey, Chris Sharkey

Episode Overview

In this episode, Michael and Chris Sharkey explore the latest advances in generative AI, focusing on ElevenLabs v3 voice synthesis, OpenAI's O3-Pro and its new pricing, and the rising significance of the MCP (Model Connect Protocol) as a future AI business model. The brothers test O3-Pro's capabilities in creative writing, debate the emerging economic potential of AI-accessible data, and reflect on how conversational AI's growing toolset may reshape journalism, software, and information access.

Key Discussion Points & Insights

1. ElevenLabs v3: Advances in AI Voice Synthesis

[00:04 – 06:37]

Demo & Hands-On Impressions:
The Sharkeys experiment with ElevenLabs V3's new "emote technology," which enables nuanced emotional cues in voice generation through tagging (e.g., laughter, whispering, dramatics).
- Mike: "You can designate the voices, but then they have an automatic mode... add all the emotes for me based on the text. You don't even need to go and do it." [03:09]
- Chris notes its emotion fidelity outpaces Microsoft’s comparable offering.
Current Limitations:
- Voice cloning is still weak in V3’s alpha, but generic voices are impressively emotive.
- For longform content (audiobooks, news), cutting and editing multiple takes can yield strong results.
Broader Impact:
- AI-generated voices could soon perform entire plays or audiobooks with emotional accuracy.
- Chris jokes about the impact on professional narrators:
  "Imagine the poor person who did like the Wuthering Heights audiobook... now a computer can just do it in 15 minutes." [03:09]
Practical Observations:
- V3 handles context-dependent prompts (like “haha” for laughter) automatically.
- Bulk processing and emotive tags make it ideal for audio content creators.
- ElevenLabs is noted for rapid releases, with V3 being among their most impressive leaps.

2. OpenAI O3 and O3-Pro: Pricing, Performance & Positioning

[06:37 – 14:24]

O3-Pro Pricing Shift:
- OpenAI significantly lowered the price of O3-Pro—now undercutting GPT-4O—making it viable for more practical, day-to-day API operations.
  - Mike: "For comparison, input tokens on GPT4O is $2.50 per million, and on O3 it's now $2 per million, so it's $0.50 cheaper." [07:44]
- Previously, higher-end models like O1 Pro were prohibitively expensive and not useful outside ChatGPT.
Strategic Implications:
- Hosts speculate whether pricing aims to undercut competitors (e.g., Gemini 2.5 Pro, Anthropic).
- O3-Pro’s speed and tool-calling prowess make it standout for tasks requiring multi-step reasoning.
Model Use Cases and Limitations:
- O3-Pro excels as a background agent—ideal for long, context-heavy, step-intensive jobs rather than quick chats.
  - Chris: "It really is a single shot, big job, your hardest problems kind of model. Not a day to day driver." [10:19]
- A key shift: Making users rethink HOW and WHEN to deploy various models—it's less “What can it do?” and more “What context/problem is it for?”
- Some early community (Reddit) criticism is dismissed as misunderstanding—O3-Pro not meant for “hello how are you” chats.

3. O3-Pro in Practice: Prompt Engineering, Tool Calling & 'Book Test'

[14:24 – 41:36]

Behavioral Shifts in the Model:
- O3-Pro gives more precise, less verbose answers than prior OpenAI models.
- It avoids unhelpful “data dumps” and can zero in on key issues—sometimes with higher accuracy than competitors, sometimes surprisingly wrong.
- Chris: "[O3-Pro] picked this non-obvious answer... it just felt like it wasn't shortcutting. It actually thought it through and came to its own conclusion." [14:24]
Prompting and “Doom Paths:”
- O3-Pro resists conversational “doom paths” (repetitive or context-locked threads) due to its slower and more asynchronous workflow.
Tool Calling & MCP (Model Connect Protocol):
- Tool calling is essential for these models; O3-Pro relies even more on user instruction for when and how to use tools.
  - Chris: "O3 Pro just right off the bat for me seemed like it was more inclined to try to answer the question itself rather than to go off and use a whole bunch of tools. And I really had to nudge it in that direction." [19:17]
- MCP explained:
  - A way of connecting AI models with external tools via APIs, unlocking new research, automation, and agentic capabilities.
  - Two architectures: Hosted publicly for AI to discover/access tools, or provided as a local list for more selective control.
Future of AI Agents:
- As MCP adoption increases, economic opportunities abound: Data providers, SaaS platforms, even governments can monetize trusted, high-fidelity data/API access for AI agents.
- Chris: "I strongly believe this is going to be the future. And I think there's going to be an entire economy around it where you pay for access to these worlds... trusted data... through mcp." [41:36]
Practical Examples & Use Cases:
- Financial datasets as MCP: pulling balance sheets, crypto data for analysis by AI.
- Journalism: News orgs as “raw fact” MCPs, letting users build their own trusted feed, potentially disrupting traditional news.
- Research and memory: MCPs could grant AI agents read/write access to personal memories or academic databases—revolutionizing study and retrieval.

4. The O3-Pro Book Writing Experiment

[52:41 – 71:37]

Experimental Setup:
- Mike challenged three LLMs (O3-Pro, Gemini 2.5 Pro, and Claude Sonnet 4) to write a space-themed Count of Monte Cristo. Each model generated a blurb and opening chapter. The results were converted to audio via ElevenLabs V3 for a blind comparison.
- Mike: "I heard a lot of rumors that O3Pro really excelled at creative writing..." [52:41]
Results:
- Chris, as the subjective judge, found O3-Pro's sample the most compelling, Gemini 2.5 Pro a close second for faithfulness to the original plot, and Sonnet 4's result flat and cliched.
- The book test confirmed speculation: O3-Pro is an exceptional creative writer, despite not following instructions robotically.
- Notable Quote:
  Chris after hearing O3-Pro’s sample: "That's damn compelling. I kind of didn't want it to stop. I was genuinely enjoying that. That's going to be very, very hard to beat. I want to know what happens." [58:12]
- Mike: "I genuinely would keep reading this story." [68:50]
- The experiment highlighted both O3-Pro's strengths (creativity, focus, ability to draw in a reader) and the subjective nature of creative output evaluation.

5. Broader Reflections: AI Tool Stacks, Data Models, and the Future

[41:36 – 77:16]

The Emerging Economy of AI Tools/Data:
- Hosts anticipate an “App Store”-like landscape where MCPs, specialized data sources, and API toolsets become products and subscriptions, leading to a new gold rush in data-driven SaaS.
- Mike: "This seems like the economic delivery method to me." [50:53]
Parallel Tool Calling & Model Differentiation:
- Claude Sonnet 4 is praised for aggressively leveraging parallel tool calls, a crucial advantage for complex, multi-step workflows.
Changing Information Workflows:
- AI agents will shift info-seeking from search and curation to “give me the best answer from my chosen sources, as deep as required.”
- Curation and source selection (e.g., “Reuters MCP”), rather than content writing, becomes the valuable role.
User Empowerment & Interface Paradigm Shift:
- Both developers and non-developers can orchestrate powerful custom workflows by mixing and matching MCPs (news, finance, government, academic, actions).
- The AI assistant of the near future will go well beyond chat—autonomously deciding when and how to fetch, act, or summarize information.
Outlook on Model Innovation:
- The current period, while sometimes perceived as stagnant, is seen as the "preview of the future"—things will accelerate as tool ecosystems mature and integration becomes seamless.
- Chris: "It's not going to be like, which is the best chat model anymore." [77:16]

Notable Quotes & Memorable Moments

On ElevenLabs V3's Progress:
Mike: "You can put the entire script in, right, and designate the voices... and have an automatic mode where you can say, add all the emotes for me based on the text. You don't even need to go and do it." [03:09]
On O3-Pro's Creative Writing Ability:
Chris (After Listening to Book Test, Clip 1):
"That's damn compelling. I kind of didn't want it to stop. Like I was genuinely enjoying that. That's going to be very, very hard to beat." [58:12]
On the Future of AI-Driven Journalism:
Mike: "It mightn't be the journalists writing the article, it might just be his Reuters MCP with all the facts that their journalists on the ground have found out." [41:13]
On Parallel Tool Calling:
Chris: "The parallel tool calling, I think, is a vastly superior way to do it... call these seven tools and then get back to me with all of the results." [25:53]
On the Paradigm Shift:
Mike: "I think we're just so used to this stuff now... this is just a preview of the future. It's not actually something that's that useful right now. But... as we increasingly add MCPs... it's an async, different world and it's coming and it is exciting and these models are the models that will power it." [76:16]

Key Timestamps for Important Segments

| Timestamp | Segment Description | |-----------|--------------------| | 00:04 – 06:37 | ElevenLabs V3 overview, emotes demo, voice cloning discussion | | 06:37 – 14:24 | O3/O3-Pro pricing, model strengths, user strategies | | 14:24 – 19:17 | Prompting, ‘doom path’ explanation, model behaviors | | 19:17 – 28:21 | Tool calling, MCP explained, integration challenges | | 28:21 – 41:36 | AI agent eco-system, business model futures, data curation | | 52:41 – 71:37 | Book writing test: Setup, results, discussion, literary analysis | | 71:37 – End | Reflections, future tests (e.g., vision), conclusions, “boom factor” ratings |

Tone and Takeaways

True to the podcast’s promise, the hosts deliver accessible, self-deprecating, and humor-laced tech insights. This episode blends practical experimentation with broader speculation about where generative AI is headed, repeatedly highlighting how agentic AI, tool-calling, and data economy models could displace today’s workflows—whether in business, research, or creative writing.

The take-home message:

AI is entering an era where the orchestration of tools, sources, and context will matter as much as the models themselves.
Economically, the next big waves will be about who owns, curates, and supplies the data/tools for AI assistants.
O3-Pro, while not flawless, is a major step forward—especially in creative and research contexts.

Listener Prompt:
The Sharkeys invite listeners to weigh in on the “book writing test” and share which sample they preferred—did O3-Pro’s creative storytelling impress you most?

End of Summary

Loading summary

Transcript164 lines

[00:00]
Narrator
Foreign.
[00:05]
Mike
So, Chris, this week on the show.
[00:06]
Chris
We got V3 of 11 labs.
[00:08]
Mike
We did indeed, Mike. And we were too average to even train our voices. So this segment now is probably not going to make any sense at all. I really should have just trained our voices. But anyway, we ran out of time. And I think it still shows just how far 11 Labs has come. I mean, it seems like anyone could add an AI podcast type thing to their show now. Our days are numbered. Although I'm not sure if with the voice training, I am the problem or it's the problem. Does this even sound like me? Is that even how I laugh? How did I get in here? Wait, Moshi, is that you? So, Chris, this week we're talking about Moshi. No.
[00:57]
Chris
Are you okay? I still can't believe I'm still this guy.
[01:00]
Mike
Yeah, sorry, I was just trying to test out all these emotes. It can do. G', day, mate. Throw another shrimp on the barbie. All right, roll the music. So, Chris, this week we have been playing around a little bit with the new 11 Labs V3, which we heard at the top of the show there. And apart from that being, like, really creepy and weird and probably blowing out people's ears and scaring their children in cars or wherever they're listening, I was trying to demonstrate the new emote technology in 11 Labs v3, where you can essentially put in these tags and tell it to, you know, be emotional and things like that.
[01:38]
Chris
Yeah, Microsoft's had that for a while in their voice one where you can put the emotion tags, but this seems to actually adhere to them a lot better. I mean, I wouldn't say that's the best voice clone I've ever heard. It really didn't sound like you much at all. But the actual emotions, like its ability to follow those, is really cool. And it actually gave me the thought while I was listening to it. Imagine putting in, like, an entire Shakespearean play or something like that with emotions and getting it to, like, do an entire act or something like that.
[02:08]
Mike
Yeah, I. For me, to be fair to them, it is an alpha, and they do warn when you use it that voice clones will not work well with it. So I think that to judge the voice clone as part of that is just not. Doesn't make a ton of sense. So, yeah, the other voices I think do, and they. They are quite emotive and expressive. The interesting thing is I obviously tried that a few times, and the first voice you hear, the female voice at the start, that voice was a lot better in many of the recordings. But I just took the worst one because the second part of it was better. So I'm sure if you were willing to spend the time cutting it up, especially if you're doing like an audiobook or, or you know, you actually edited your podcast audio like we probably should, then, you know, you could get some pretty amazing results from it. And I think the big breakthrough with this model and why I thought it was worth talking about is this idea of the two speakers having a conversation and then the emotes kind of matching up in that conversation. It just makes it feel and sound way more realistic.
[03:10]
Chris
Imagine the poor person who did like the Wuthering Heights audiobook being like, oh my God, I spent so long on that and now a computer can just do it in 15 minutes.
[03:19]
Mike
Yeah, it's like I learned how to pronounce every word and now it's ruined. But the other thing is if you, I mean you would know this really well having used it a fair bit. But if you write like haha in often AI would struggle, especially the voice models picking that up. So it would, it would literally say ha ha. But now, you know, it obviously laughs. It can, you can put things like I've got up on the screen now in brackets, whispering before some text and then it'll obviously whisper it. You can give it cues like dramatically. But what I think is the coolest part of it is you can put the entire script in, right. And designate the voices, but then they have an automatic mode in their user interface where you can say, add all the emotes for me based on the text. You don't even need to go and do it.
[04:04]
Chris
That's impressive and that's probably what's needed in these larger things of text. And you're right, anytime I've done text to speech, I've had to have these big search and replace things or get the, like an AI model, an LLM to do it where it's like, you know, instead of saying this, say this. Instead of this acronym, you know, pronounce it even, even the word AI itself. It's like say a, like, like a pirate, A Y E E Y E. Like AI immorities, like that kind of thing. And this model seems to overcome those natively, which is an impressive improvement.
[04:38]
Mike
It seems to excel to me at sort of maybe like the Notebook LLM style podcast. Right. Or an. An audiobook or reading a news article. I think that's where it's best for. I'm not, I mean actually having based on some of our phone Call experiments. It probably would work fine. But I still think you can tell if you. If you're listening to the raw audio that it's not human. Like, it still feels a bit synthetic.
[05:04]
Chris
My attitude is changing on this, though. I think that in a lot of cases it's kind of like, okay, like if people know that it's AI on a phone call, it's still like an agent for you, it's still representing you. It's still accomplishing a task. And so it isn't necessarily the worst thing that people recognize it's AI, but the better it is, the more pleasant their experience is in dealing with it.
[05:28]
Mike
It felt like then you just said AI like a. Like, like the AI did some reason to me, I don't know if that's true or not, but. But maybe so. It's.
[05:37]
Chris
It's interesting.
[05:38]
Mike
You can play around with it on their website. They have some examples. And you can also just sign up for an account on 11 labs and play around with it. You get a number of free credits. It's definitely worth trying out and a pretty cool improvement from the previous models that I've tried. Eleven Labs is a pretty nuts company. They release stuff all the time and we rarely cover it on the show because they just release so many different things. And some of them aren't that interesting unless you're really into audio. But I thought this model or this alpha of a new model is so impressive that it was definitely, you know, worth talking about.
[06:13]
Chris
There is a lot of voice models around, but they always seem to be right up there in terms of the quality. They also have a bunch of turbo models and much quicker ones for live audio. So that's like in the phone call scenarios and things like that. So you, you lose a bit of quality on the voice, but you get that speed, so you don't have that massive latency waiting for it to generate. But generally speaking, they're fast anyway. Like, they're. They're really good.
[06:38]
Mike
So, look, I don't want to take a victory lap here, but if I am taking a victory lap, I think I might have helped drive down the price of OpenAI's O3 model. I posted when it first came out and I was playing around with it a fair bit. The speed of O3 is contagious, albeit for a hefty price. But I like you so far. Now, keep in mind, this was. I think this was even maybe before, if I'm remembering correctly, the Gemini 2.5 Pro pricing was even out. Like there was that period where it was just free. And so it was hard to compare it at the time. But then when O3 finally shipped and then you saw that price comparison, it just started to not make any sense.
[07:19]
Chris
That's right. Especially when it was rating lower. But everyone was talking about how good it was at tool calls. And I did have good experiences working with oh3 prior to Gemini coming out.
[07:30]
Mike
Yeah, I've. I've thought it's a really good model. I think it excels when it's connected to tool calling in its thinking steps. I don't think it's as good when it's raw. And it seems to rely on that thinking and that tool calling in that process to be a better model. Whereas I think out of the box, Gemini 2.5 Pro is just a better model than it and it was a lot cheaper, significantly cheaper. But sneakily OpenAI this week really just slammed the price down on it. And this is the strangest part of this whole thing, right? It's now cheaper than GPT4. Oh, I don't even know how to make sense of it. For comparison, input tokens on GPT4O is $2.50 per million, and on O3 it's now $2 per million, so it's $0.50 cheaper. And then on the output side it's $8 per million tokens and GPT4O is $10 per minute.
[08:30]
Chris
I mean, it makes it so useful that you can use it for everyday, like utility operations, like as an API call in regular software. There's a lot of things that, that opens up when it's at that price.
[08:42]
Mike
Yeah. And I'm not sure. Do you think the strategy here is to just detract from the other models like the Sonnets and the, the Gemini 2.5 pros?
[08:52]
Chris
Maybe they just did it because they can. They might have always intended on making it cheaper. Some people have speculated that they quantized it, which means like discarding some of the values along in the.
[09:02]
Mike
They came out and denied that, though. They said it's the exact same model. So that's been disproven already.
[09:07]
Chris
Oh, well, it's. It's impressive. And I think when we see the weakness on the new anthropic models in terms of speed and availability, this. Updates like this are really significant because you can actually use them at scale. So I think that it's, it's really good and we're definitely putting new eyes on it for that exact reason, especially because our emphasis lately is around the tool calling and MCP work. So having a Model that's both fast and has a propensity to use those tools and is a reasonable price is just such a great spot to be in. I really think this is the future of how we're going to use LLM. So a model that plays into that is really valuable.
[09:45]
Mike
What's also intriguing about this strategy is, you know, O1 Pro basically is completely unaffordable. So no one ever really used it outside of the walls of Chat GBT that was $150 per million input and $600 per million output. Like just completely unaffordable to even test in your app. But then you've got O3 Pro, $20 per million in, $80 per million out. Still crazy expensive in my opinion, for what it is, but significantly cheaper that you can actually afford to maybe give it a whirl.
[10:20]
Chris
Yeah, exactly. And I think that that price drop has been really exciting having a pro model that doesn't make you feel like you're losing half your spleen every time you send a request. And so I'm actually willing to give it a go because there's some models where, like the 01 Pro, where it just wasn't giving better answers necessarily to a degree where I would give it enough time to give it a chance. Whereas O3 Pro, the pricing is at a level where you can actually give it a chance. I still think though, the time it takes to get a reasonable response means it's a totally different way of working. It's gotta be like a sort of set and forget style request like we've talked about before, where you give it a far more complicated task with a lot more context, tool call abilities, set it off on its way and then come back later when it's finished rather than being an interactive ongoing conversation. And I think that's why they've structured it like that. It really is a single shot, big job, your hardest problems kind of model. Not a day to day driver where you're just working with it all day.
[11:26]
Mike
And I think it's why they have identified correctly they need to move to this AI system where at least in their own interface with Chat gbt, it's going off and figuring out, okay, this is a sort of background task, something I need to do a lot of work on. I'm going to go use O3 Pro. This is just a chat conversation. I'm going to use GPT4O. Obviously we haven't seen them do this yet, but I think this is what's going to come right, like, because the challenge right now is knowing when to use the model, how to use it. Like, you've really got to retrain to use a model like this and really think about your prompts. Because if you just like, hey, as some people did on Sim Theory, it's quite expensive to get a response.
[12:04]
Chris
Yes. And I actually saw this criticism on Reddit earlier today. Someone saying, it took two minutes to reply to me saying hi to it. And I think that that's not a correct or fair test because that's not what it's designed to do. It's not trying to compete with a chat model that's meant to be snappy and get back and stream the token so smoothly, like Grok does, for example, it's designed to do the hard way, do everything the hard way, and not miss small details. So it's just going to take longer. And I think that when you look at that paradigm of, okay, I'm going to give it all of the information it needs, or at least access to all of the tools to gather all of the information it needs, and I want one perfect answer, then this model is very suited to that. And I think in our early testing, and certainly my testing this week, I've seen it give some really excellent answers to difficult problems. Um, and I think, as you're going to point out, really good responses. Like, it isn't trying to, like, write the entire Wikipedia out for you and things like that. It's actually focused on the answer and its logic to get to that answer when it replies to you.
[13:12]
Mike
Yeah, and this, I mean, I've already said this to you when we were playing around with it earlier in the week. But my biggest frustration with O1 and even slightly O3 was the tunes of them, where they would just spit out these essays of data, almost like trying to show off that they could just output so many tokens and so much data and they were so smart. And it does feel like with O3 Pro, the tune is just so different. It. It sort of cuts through the noise and gives you these precision answers. And, you know, maybe some of the time, look, they're really wrong, especially some of the coding things I've put it through where I'm like, I waited ages and this is just blatantly false. But then other times it could cut through things where I was shocked and it would just be like, this is the problem. It's just here, just change the this line or whatever it is. And I have some other examples as well where I've also, you know, tested it with other, other things. As we'll do in a minute. My new book test, which I'm really excited about. The challenge though I have with this model is it's sort of like going back to like we were talking about last week, like a 486 computer when I have the Pentium now because of that context window, like a 200k context window for me now it's too small. I want, I want my 1 million contacts.
[14:25]
Chris
That's true. I did a couple of experiments this morning which I often do with new models, which is horse races. And I kind of had lost faith in doing that as a test because they all tend to give roughly the same answer and there isn't a lot of clever thinking in there. They're basically looking at the horse that has the highest rating and has the best write up in the, in the information you give it. So it's not, it's just not that valuable. Like it'll often pick the winners. But I've started to kind of come to the conclusion that's just a coincidence because the people doing the commentary usually get it roughly right. Right. And so this morning, two races in a row, it picked a non favorite and one I actually picked races that had already taken place so I could quickly know the result. And the other models, so Sonnet and Gemini just picked the really obvious. The really obvious, like highest rated horse and last. And so I was all set to come on here and be like oh my God, I found the holy grail here. I'm going to be a millionaire from this horse racing thing. And then the very next race it lost. And so I'm like okay, maybe not. But what I did find interesting and I don't know if it's sort of like a barometer proxy for the way the model works is that it picked this non obvious answer. It just felt like it wasn't shortcutting. It actually thought it through and came to its own conclusion based on the information. And I've noticed that as well. I've started to use it with MCP tool calling and its approach and the things that searches for are distinctly different to what I see with say or it actually takes the time to construct much more robust queries and interprets the. The results of those in a more methodical way, let's say. Like it's. It's whole style of answering an output seems different. Like it seem. I don't know what the right word is. Do you know what I mean?
[16:20]
Mike
I agree with you. I don't even think it's comparable to O3, like it doesn't seem like to me when I use it the same model, I think a lot of people tried to just get it to just output raw code and use it as a coding model. And I would say to people doing that, stop, because it's, you're wasting your time, your money, everything. It's just not great at it. Like it's, it's an answer engine that uses heavy compute to give you a, A, like I sort of a cut through response. I think that's in your toolkit as a model. That's where it sits to me right now. Like I need a, a breakthrough answer here. I need a cut, cut through answers.
[16:56]
Chris
And it's funny you say that because this is how I often work with Gemini 2.5. I've got Patricia going on Gemini 2.5 and I'll be like, listen here, stop mucking around and just give me a one line answer to what this thing is. And I don't say mucking around. I'm usually ruder than that. And, and I'd say, I want a precision answer. I always use the word precision, a precision answer. Don't rewrite anything, just tell me what the answer is. And I don't always work like that, but when I do, it is remarkable how often it'll just go, ah, here's the line here. I can't believe I didn't think of this earlier. And it seems to me like O3 Pro sort of does that out of the box without you needing to specify that.
[17:38]
Mike
There is one limitation in Gemini 2.5 Pro right now, where it will. It falls into this problem from other models. It goes down one path so strongly it gets caught up in it. And then you often will switch model with the like and then reintroduce the context and you'll be like, oh, wow, the new Claude for Sonnet's amazing. It's so much better. And then you realize, no, I just reframed the problem and got it out of that doom path. And so I think with the doom pathing, O3 Pro, at least in my experience, doesn't fall into the doom path because it's so slow. It can't. Like you can't get it down the doom path. Yes.
[18:15]
Chris
And we've talked about this before and this is why we think that the idea of threading and going off in a different thread is so important in future interfaces because you get the context into a point where you don't want to accidentally send it down the doom path, as you say. And I often will do that out of laziness or whatever and then be like, oh, if only I could go back to where I was earlier because that context was working great. So totally get what you're saying. But yeah, the slower speed makes that harder.
[18:42]
Mike
And that's, that's the thing working in a more async way. You can, you can have the hardest problem you're working on where you really do need to cut through the noise, you know, in, in one tab with like an O3 Pro going and spinning and you just be like, I'm going to leave that to bake like a cake. And then I'm going to go back to my sort of day to day model where I'm triaging through other things that I might be working on with it. And so I think that works well. I do think though, in a weird sense, O3 Pro leans so heavily though into tool calling to get extract the best value out of the model.
[19:18]
Chris
Yes. And I think something that we definitely noticed in our testing of O3 Pro is it seems to need more direction when it comes to the tool calling than other models do. And I think this is probably going to be, as you pointed out to me, something that we need to learn over time is what specific model instructions do we need to, to get it to know, you know, when it's worth doing hardcore research, when it's worth calling 20 tools instead of just two and pushing it in the right direction in terms of that. And O3 Pro just right off the bat for me seemed like it was more inclined to try to answer the question itself rather than to go off and use a whole bunch of tools. And I really had to nudge it in that direction. And I personally feel like it's going to do better with a lot more. And so that requires overriding the prompting to get it to do that. And I don't know, obviously people have spent more time with it than me, but I really feel like this is going to be the next sort of prompt. Engineering style challenges. How hard do you work on this? How. What mix of tools do you need to tell it is required in order for it to, to answer your question? Because if you don't, you could be writing off a model saying it's no good when a slightly different approach using the same model gives a vastly superior answer and better than the other models in total. You know what I mean?
[20:41]
Mike
I think it's worth explaining because a lot of our audience wouldn't know how MCP support factors into these tool callings. Right like, we don't necessarily know that OpenAI is using an MCP layer in that thinking step. You could assume maybe they are, but like, can you explain how that's, that's working down the different approaches? I think that's an interesting way of looking at these models.
[21:04]
Chris
There's several approaches, and I think it's one of these things where people are still working out the best approach. But let's talk about OpenAI and O3 Pro specifically. You can either provide it with an MCP server. And so an MCP server is basically a system that says, I have these tools available to me. So let's look at say, a Google search mcp. The Google search might say, I have a tool which is search Google. Right. Then, then you might have another one which is say 11 labs generate audio. Like, I can do text to speech, I can do speech to text, and I can do one other thing. And so those tools, then through the MCP server, OpenAI goes, okay, I'll hit up that MCP server, ask it what tools it has available. And as you pointed out, Mike, it can ask more than that. It can say, give me some sample prompts, give me a search endpoint that allows me to, to, to search through the tools. And it can have other properties, like, for example, a schema of like, the parameters and what they mean, for example, and so they query the server and then they have their own proprietary technique they're using, which will then use those tools. So that's one way. And the downside of that way is your MCP server needs to be hosted publicly in a way that the OpenAI can access it over the open Internet. Right. So it just has to be able to access it publicly and you can optionally provide credentials to that. So it can log in with an API key, something like that. Right. The downside of that is a lot of MCPs right now don't work in that SSE public mode. They work, they were designed for Claude desktop, so they actually work in like this standard input output mode I designed to work on a computer. So the downside of providing the MCPs direct to OpenAI is you can't get them all. And the ones you can get need to be hosted somewhere with credentials. So it's a lot of work basically to get them in that mode.
[23:09]
Mike
Do you think, though, this is the mode everyone will move to now, though?
[23:12]
Chris
Yes. So I predict that, that that'll be all of them in the long run is they'll all be hosted through like, they'll be on cloudflare or Amazon or something like that. Or there'll be proprietary services pop up where they'll host your mcps for you and have an off layer and a user roles and permissions layer and all that. That's my prediction. But for now, the second way to do it, which is what we've been testing with, is basically in you have the MCP servers that you control yourself and host yourself. And what you do is you give the list of tool calls available from your MCP servers to the model. So rather than giving it an MCP server where it has to go off and discover the tools, you tell it, here's my 30 tools. And those 30 tools might be across four or five different MCPs that you're hosting yourself. So in the first case where it has the MCP server, OpenAI will do all of the tool calls internally as part of its own process. So it doesn't come back to you and say, please call this tool. It just does it itself through the MCP server. In the second scenario, where you give it the list of tool calls, the model will come back to you and say, can you please run this tool and tell me how that goes, please, bro? And then you run the tool and. And then you send the result back and then it decides, okay, now that that Google search, say, is finished, do I want to call another tool and search again or call something else? Call Perplexity or something like that, or am I done? Can I answer the question now? And in that scenario too, it's still up to the model if it keeps going or if it just answers the question. So the net result should be roughly the same. They're just two different approaches to doing it. I kind of like the approach where you give it the tools, because you can then be selective about which tools you give it, for example, to direct it a bit more. And also you see what's going on throughout the process. Like, you see that it's decided, okay, I haven't got enough info, I'm going to go and do this. And so I think this is an area that no one has quite figured out what the best solution is there. Like, do you intervene along the process and check how it's going and give it further direction, or do you just allow it to play out? Do you call a different model halfway through if you detect that it's going in the wrong direction? There's so many possibilities there to get the best possible result.
[25:38]
Mike
And we've seen the new Claude 4 series. They can do like, the Async tool calling where they can be calling many tools in that mode at once. And that seems like yet another approach in a way to just do more at the same time.
[25:53]
Chris
Agreed. The parallel tool calling, I think, is a vastly superior way to do it. So what happens there is when the system comes back to you, instead of saying, please call this one tool and get back to me, it's like, call these seven tools and then get back to me with all of the results and then I'll decide what to do after that. So obviously that's faster, assuming your system can do them all at the same time. And it's also more comprehensive because the model is like, well, I've got all these available, I might as well use them all and see how I go. But again, I really feel like it comes back to a little bit of prompt engineering where you say, okay, I don't want you to ever answer me unless you use all of the systems available to you. Like, I want you to do everything you possibly can to say, I'm researching, researching a stock, right? It's like, well, I want you to search the finance mcp, I want you to search Google, I want you to search Perplexity, I want you to search for papers in this area, I want you to search Reddit, I want you to search, you know, like, whatever, Twitter, go through everything, compile a full massive context and then answer my question. But then you might be like, hey, what's the weather going to be like in my area today? And you don't want it searching 35 sources and like, you know, the meteorological research and all this sort of stuff. So I really feel like there needs to be ways that you can. Either the model's smart enough to gauge how deep to go or you have some level of control over that. So it's doing what you really want. But obviously the exciting part is the first example I gave where you can be like, I've got all of these different MCPs and I want you to go hard and take as long as you need. But when I get this answer, I want the most well researched backed up answer you can possibly give from this massive variety of sources.
[27:47]
Mike
And I think it's important to call out like chat. GPT's connector interface to MCPS right now are a little bit different to what maybe Claude Sonnet and and Opus are doing in the sense that they require these special sort of connected tools. One of them being search and fetch are both required and if you don't implement their way of doing it, they actually Put up an error. This MCP server doesn't implement our specification, which is kind of interesting because model connects protocols meant to be an open specification.
[28:21]
Chris
And this is actually a much bigger problem than you think, because what I've found is a lot of the MCP servers that are available are sort of first efforts. Someone spent an afternoon smashing it out because a lot of them are just thin layers over an existing API that someone does. So let's say it's the Xero, the finance accounting one. Someone's just taken the Xero API and mapped it to the MCP protocol so you can use it through that method, but it doesn't implement all of the parts of the MCP protocol. So the second you try to use it with OpenAI, it's going to fail because it doesn't do that. And then the second problem is what I mentioned earlier, which is where a lot of them are implemented in a different mode that OpenAI just straight up won't support because you can't host them anywhere in a way that it can access it. So they're simply just not available to it. And my attitude is I want to get as many tools going as I can. We can't expect every single company out there to, to immediately publish one that matches the full specification. So in a lot of ways you've either got to take it on and improve it yourself, or you've got to say, okay, I will make up for the deficiencies in these by providing a layer over the top that, that handles those things that it's missing. And so my personal opinion is that MCPs are going to be so popular with non technical users because it gives them power they didn't previously have. Like developers have always been able to access APIs and do things with them, but non developers have not been able to say, okay, I can actually build a fairly complex system by giving myself access to these five tools. And if I have those five tools, I can then create procedures that will get the data from here, send it here and update this system over here, and then send me a report or something like that. And that's going to be really possible really soon. So I think that the providers being able to get a simple way to host them where the user just doesn't even have to think about it. It's not like, hey, Azure, mcp, just enter the URL, the credentials method and the, you know, the, the secret key and the user token and the refresh token time and all this sort of stuff, like no one wants to do that. And it isn't necessary and I know that because we've got a better way of doing it. So I think that OpenAI has been sort of lazy on that front and it's not going to be the solution that takes off. The solution that takes off is be people able to click. I want that, I want that, I want that. Like the App Store, you install them, you go through an auth process to allow it what data you want to or what methods you want to allow it to use. And then the AI just knows how to use it and there's a layer of prompting there that makes sure it's effective. So I think that it's cool that they're going in that direction, but it's just, it's not a complete solution what they have there now, not at all. And they're, they're sort of more treating it like they're just adding integrations than they are a universal connector system, which is what it is.
[31:21]
Mike
And I think the other sort of caveat to it right now is it's just able to use these tools during the deep research function. Like it can't use them anywhere else at the moment. I'm sure it will in the future, right?
[31:35]
Chris
Absolutely, it has to. Because remember, not all of the tools are research based. Like we use those examples because they're pretty exciting because no one wants to go through the painful task of like doing all this, this research and putting it together. And it's great that the AI can do that, but if you think about it, many of the exciting uses around MCP servers is what it can actually do. Like we gave the phone call example, but it's like send an sms, update a database, you know, write out files to disk like, and you know, like publish this to the web. The actual actions it's going to be able to take are actually just as exciting, especially if they're well researched and prepared actions, maybe with an approvals process. But the, you can't do that during a thinking step because no one is going to trust it to just do what it wants. So the, the, the actual activities and doing things is just as important, I think, and if not more exciting than the research. There's also another element that I think is going to be really interesting, which is we've talked about Knowledge graph and the AI's ability to have memory before, but what we haven't spoken about is that in the MCP paradigm, because if you think about it, some of the tools you could give the AI is simply here's your memory, like Your memory is an MCP server that has the ability for you to delete things, add things, search your memories, you know, save an image in there, you know, update it, summarize it, whatever. And that could be like quite an advanced database that the AI has access to. You could also give it the ability to have its own computer that it can run when it wants. Like we've talked about computer usage before. You could also give it the ability to, you know, do text to speech or like other things like that, like actual real world actualization, things where it can take a photo on a webcam and see how the garden's going or whatever it is. So that there's going to be very interesting combinations of tools you can give the AI to create a much more holistic AI assistant that can have a lot more powers. And I think it's going to be genuinely interesting to see in this ongoing context how those AI agents evolve when given access to all of those things. And sort of a vague idea of like, here's all the stuff you can do now, here's your mission, and see how it combines all of those things to do stuff. And none of that is going to work purely in a, in a thinking context, like before it gets to work. Like it's a very simplistic way of using it right now, is what I'm saying.
[34:16]
Mike
And I think also just from the teaching users about this is agentic workflow working with an assistant where it's able to go off and do these tasks right now, personally, as I've experienced both of these, I much prefer the tool calling to happen sort of as part of it, sort of almost like talking to itself, like going off and doing things where I'm observing it and just to know certain tool calls can have the approval where I've got to let it continue. That feels a lot safer to me. But then there's other contexts like research or search where I don't really care. Like I'm just like you, you go do your thing. So I think both have their merits. But to me, as you say, the most exciting thing is it having actual agency where it can go and do work on your behalf, not just go and fetch sources. I think the other interesting topic around this is really this idea of it being able to populate its own context as it needs. Like the old methodology is really go and shove everything in, right? Like shove, you know, all the JIRA tickets in or whatever it may be. And then, you know, we were relying on RAG before, then it got to bigger context Windows, but even those condex windows max out at a certain point. So letting it determine, oh, I need to go get this chunk for my context right now I need to call this tool to get a bit more context about this. And then letting it decide to cherry pick and manage its own context. Seems like it will also be more effective than you just shoving everything in every time.
[35:46]
Chris
Exactly. And I feel like this is the answer to the rag, the retrieval augmented generation problem, because this is a problem, I think that never really fully got solved because say you've got an assistant that has access to three PDFs that have your company culture, you know, rules and guidelines that it needs when it answers a question right in the past, there's two ways to do that. You either have retrieval augmented generation, where there's a query that will get a series of summaries based on a search and put them in the context, or with the larger context models, you just shove it all in. Just put it in every time and say, you know, you must take this into account. The problem is both methods have the downsides. The search method loses fidelity on the data because, like, it's only just giving summaries and the, the shove it in every time method will mean that the AI will get fixated on things that don't really matter. And it, it, it seems to take great pleasure in bringing everything back to its source material. Like, you know, my AI agent will always be like, oh, I really, I know how you love Python functions or whatever. Like as a joke when you ask it about some other thing. Like, it's just, it's just not relevant to the discussion. Whereas in the paradigm that you're talking about where, hey, look, you've got these tools available when you need them, but also you should take them into account when appropriate, then it's much more able to have discretion around when to get that information and when to know it needs all of it. Like, it absolutely must take all of this into consideration in this scenario. So therefore use the method that gets all of it. And in a scenario where it's like, hey, I better just quickly check my memories to see that there's nothing relevant here. And if there's nothing, well, it just does nothing. And that's fine.
[37:26]
Mike
Yeah. And I think a good example of this at work is, as we recorded in the last sort of 24 hours, pretty tragically, a Boeing 787 Dreamliner went down with Air India. And of course there's a lot of good information out there about what happened. And then there's also some misinformation and you know, just like it's like the Internet, right? It's, it's toxic, it's running, running wild. So what I think that this does is from an information gathering point of view you can imagine like right now your workflow today is sort of look at maybe X look at some new mainstream media news articles and read them. So you tend to build a complete picture in your head, right? Like your own natural picture. But what I've noticed with the tool calling and the sources that you prefer is you can then send it off and say go figure out like maybe like what the cause of this crash is or like what the current consensus is or like just what happened. Like you just want a broad spectrum or the most sort of somewhat I would argue neutral analysis of what's happening. And you can see this being really important for businesses, for the media. I think it would be most threatening to journalists in the sense that in like almost threatening in a positive way where it goes back to them having to actually source facts versus especially since.
[38:52]
Chris
Most journalists get all their knowledge from X anyway now and it's like oh a user on Instagram said this and that's now the news.
[38:59]
Mike
Yeah and, but also the inverse of that is you start to realize the power of X in terms of being valuable for this new, new, new media world. So what I did, I actually asked O3 Pro with tool calling to go off and research extensively this incident and come back to me and write in the style of a Wall Street Journal article. And then what I did is I compared the output to the mainstream media articles so like BBC Wall Street Journal to just, and this is completely, you know, my own test here and like my personal opinion I wanted to see like which article in my opinion was the most factual and comprehensive given what I know and didn't treat me like a dummy or be opinionated. And I mean it's probably obvious what I'm going to say but the research article by my AI assistant that was then turned into a page that I can look at that's written in a Wall Street Journal style news article. And I'll, I'll link to this in the show notes below. It's just so much more comprehensive and better to read and it feels like how maybe I sound like member berries here but it does feel like before the, at least in my opinion the news went so far left and right. It does feel like this the middle again somewhat like it's just reporting facts and, and it's not opinions or Speculation or clickbait. It's just like, here is what we know. And it. There's a lot of good YouTubers out there that cover news independently now. And a lot of people listen to those because they know it's just experts in an industry that just give facts and their take based on dedicated experience. And so it felt closer to that experience to me than reading a news article where it's someone personally who knows a fair bit about aviation. You read them and you're like, what does. It's not even close.
[40:58]
Chris
Exactly. And I think you have the additional advantage. And maybe this does introduce bias, but you can curate your own sources. Right. You can say, these are my trusted sources. So if you do want bias in your favor, you can get that too. But if you don't, you can have a variety of sources.
[41:14]
Mike
Don't you think that's good in a way, being able to control those sources? And to me then the sources, which in this case MCP's become the future sort of trade mechanism of data like that, to me is the new business model. It mightn't be the journalists writing the article, it might just be his Reuters MCP with all the facts that their journalists on the ground have found out.
[41:37]
Chris
I strongly believe this is going to be the future. And I think there's going to be an entire economy around it where you pay for access to these worlds, where you get trusted data, or at least, you know, sources of data that you can build into a system you trust and getting this quality curated data as input to your AI models through mcp.
[42:00]
Mike
A good example of that, and I think probably one of the earliest examples is this financial data set that we've been playing around with quite a bit. It's a, it's, it's really just a GitHub project. Right. An MCP server for financial data sets. And it's got a bunch of available tools like getting income statements, balance sheets, cash flow statements, all that kind of stuff about companies that gives the AI, like, access to retrieve those from, you know, good sources, crypto prizes, things like that. And then you can see on their website here, financial datasets, AI. This is like a whole category of business. It's almost like, I would argue you're seeing the future of what will happen to SAS companies or like how they'll be disrupted.
[42:47]
Chris
Yeah, and just, just think about all of the different data sets that are out there. Like even governments could publish economic data. You could get interest rate data, you know, finance data, like commodities data. There's so many different sources of information that are going to be incredibly valuable in this tool calling context and not even necessarily going straight into the LLM like it might be. It sources data from this one and then puts it into the spreadsheet module, which then produces a spreadsheet which is then used as input to another model. And those kind of things, the, the different and disparate sources of data it could get, to answer its question, can be absolutely massive. And when they're designed in an interface where the AI knows exactly what the data types are and exactly what to expect in the response, that's a lot more valuable than just like copying and pasting massive text dumps from websites or like crawling the web and just hoping and trusting that whatever you crawled is relevant and isn't being manipulated against you. And that's the reason why I'm personally excited about paid sources. Because if you pay for it, they've got far less incentive to manipulate the data for some dodgy reason. Right. Like if you're paying for it, it reduces the likelihood that it's, it's bad. And so I think that there can be a really, really exciting market for that. And that's just the research side, I think, as you know, there's the other side, which is what actions can I take? What abilities can I give to my AI agent by subscribing to services that provide them?
[44:25]
Mike
Yeah, and it's just like your stack of tools and your tune of tools and maybe your prompt tune on top of that's going to have a lot of value. But I also think it solves the context drift problem I bang on about all the time. And these doom parts, because you're not chunking it all into the context like you say immediately, you're allowing it to go off when necessary, step by step and go and fetch things. Okay, interpret that in line with what's going on. Okay, now I'm going to fetch this, now I'm going to interpret it. And that process the model naturally takes to me tends to perform better than just chucking everything in the context and going like decide on this horse race or whatever.
[45:05]
Chris
Yeah. And it almost forces it into a step by step logical process in terms of thinking, especially if it's instructed to verify using the tools you have available. Because then it's like, okay, I better check that my logic here is correct based on this data source. And I think that that's going to lead to much better outcomes even with existing models, without modification. And I think this is why we were talking earlier about a model like O3 that's faster and has very strong tool calling abilities may actually end up rating better than other models that aren't as proficient at calling tools. And I think that it's going to become an increasingly important part of models is their ability to use what's given to them to make themselves better.
[45:51]
Mike
Yeah. And this is sort of why the current set of models probably will feel better in a couple of months as the these tools are available and applied to these models in different ways.
[46:02]
Chris
Yeah, like they're doing like they're currently fist fighting with each other, but then you just give one like a bronze sword and you're like, oh, suddenly it's like smashing these guys. Like it's so much better. But then the other one gets what, what's the next one? Like a netherite sword or something? And you know, and, and so I think that's what it is. It's like giving them weapons and like whoever wields the weapons better is going to win. And I would argue that like I think you could take Gemini 2.5 Pro now with no tools and O3, just.03, not even 03 Pro with tools. And I think it would beat it for a lot of questions. And I don't just mean like what's the weather going to be like tomorrow? But I mean like you know, a research task or like a thinking task where it just has more information available.
[46:47]
Mike
And these data sets, why I think they're so valuable is we were talking earlier about these like custom news articles, the way you tune it, the way you want to read an article. Right. And so if you think about the future business models of news organizations, you can imagine them just supplying like raw notes from journalists or like in field facts or some sort of like JSON object in an API of like here are the current facts about that particular tag topic and then your AI agent or assistant is then interpreting that for you and formatting it in a way that's suitable for you. But I just think this is where the interface change is so dramatic because you're not going to the news website and being disrupted by these silly ads and all this nonsense. If you're willing to pay for those facts and you deem it valuable to whatever you're doing, which look, it's debatable, people still not want to pay, but if they do pay for that, they're going to get a much better product and be able to curate that experience and honestly probably be more informed and more intelligent than anything else. I don't Actually see it as a negative thing.
[47:55]
Chris
Yeah. And it doesn't just have to be journalism. Like, there's so many elements in life and business where you need to sort of gather information from a bunch of sources before you make a call. And it's just time consuming. And it's like the trade off of how long do I spend on this versus solving my problem. And so I think that having a really fast and convenient way to do that that doesn't involve you having to do all of those steps when you're busy or on your phone or whatever it is, makes it far more likely you're actually going to do it and therefore get better research and better outcomes for what you do. So over time, I imagine people just prefer that interface because they don't have to go after all this stuff. The example I've got is whenever I cook now, I just use an AI agent as to get the recipe so I don't have to read someone's life story before I get the information. And it's far superior. And like that is without like. One of the examples I think is even better than my cooking example, which is like top notch. But this one's better is imagine a source that is like a university library with old digitized books that are available nowhere else. I know a guy in Austria who works in like the stack, you know, remember out of like James Bond goldeneye, where there was like the stack was one of the levels. And it's where they have like all of these old manuscripts and books that have never been digitized, never been put online. And people literally go to this place just to like access these books and this knowledge for research. Now imagine all of that digitized and then available as an mcp. So you've got a source of information that just simply isn't available anywhere else. It's the same as old newspaper articles. Imagine like newspaper. Thanks. Imagine newspaper articles from throughout history. Like every single newspaper article from, you know, 1900 to now from the one newspaper available as a searchable MCP as a data source in your research. Like that could be absolutely just unbelievably amazing in terms of what you're able to accomplish. It's the same with journals. In our testing of mcps, we've been using research journals and when you ask it questions, it's able to quite readily go through and find you relevant sources to continue your research or in some cases access them. Now a lot of journals are proprietary, I imagine subscription services through mcp where you're like, okay, the AI might Even say, hey, I'd really love to buy subscriptions to these three because they keep coming up in my research. And if we can get the articles themselves, we're a step further to solving our problem. Now imagine that if you're doing hardcore research for something and you're able to get all of this historical information from multiple sources, quality published services that you pay for so you have the right to use the data. It's going to be just immense what can be done in like just a few minutes.
[50:53]
Mike
Yeah, to me, this is $1 billion opportunity for people to build these like great MCP like for all, all different. Whether it's you go and build them on behalf of companies and do a deal with them or whether it's actual businesses that are sitting on the knowledge. This seems like the economic delivery method to me.
[51:13]
Chris
Yes. And that's what I would do if I was one of these organizations who was sitting on a bunch of proprietary knowledge or I just had access to something, even if other people had access to it too. But I was the first one to make it available through this method as a paid service. I think people are just going to lap it up. Like, can you imagine the announcements where they're like, oh, there's now an M C P for every old newspaper. There's now an M C P for all of these textbooks like that just haven't been. They might be online, but they're like images or some shit, you know? So hang on, do you know what I mean? Like, there's going to be sources of knowledge that if put in this massively convenient format and available as an MCP to your AI agents, are going to be just so valuable. And I really feel like it's going to be a sort of like gold rush to see who can provide these services because everyone's going to be subscribed to them. I really think, like you said earlier, MCP stack people are going to be like, it'll be us. Like we'll be sitting here with our like 20 views of our thing and there'll be some guy with 150,000 who's like, here's my MVP stack. Like, why you need these top five MVPs in your stack?
[52:20]
Mike
Yeah, you'll never believe. Shock face.
[52:23]
Chris
Yeah.
[52:24]
Mike
How many M CBS I used I.
[52:26]
Chris
Subscribed to the New York Times MCP for my unbiased facts. Like it's going to be that kind of thing.
[52:31]
Mike
Mr. Beast, like spent a billion dollars on MCP tokens to cure world peace or something.
[52:36]
Chris
Yeah, yeah, exactly. How I used MC tokens to purify water in Africa.
[52:42]
Mike
So I do want to run this experiment with you, and it's an interesting experiment. I heard a lot of rumors that O3Pro really excelled at creative writing and narrative writing and just writing in general. And I always find writing hard to test because it's quite subjective. And also, you know, it kind of depends on the prompt. And sometimes a model will put out something where you're like, this is incredible as a story. And then other times the same model will put out total trash. So it's. I think it's hard to look at the benchmark and say, like, okay, this is better because it's all subjective. So I thought, why not do a subjective test and use you as the guinea pig?
[53:19]
Chris
Okay, great.
[53:19]
Mike
So I've prepared three. So to clarify the prompt, I asked three different models. Claude Sonnet 4, Gemini 2.5 Pro and O3 Pro. I said to them, write a space themed version of Count of Monte Cristo.
[53:37]
Chris
Oh, you know your audience.
[53:38]
Mike
This is great to write the synopsis on the back of the book that you would put to get my brother you to buy the book.
[53:46]
Chris
And this is great. But then I have the full text of the Count of Monte Cristo printed on a jumper.
[53:52]
Mike
So this is good. So then the next bit of the test is. Then I said, okay, now you've got to write the first couple of paragraphs of the first chapter. And I'm going to. I've run these through 11 Labs v3, so it all ties together beautifully. And you have to listen to them. And I hope the audience will listen.
[54:11]
Chris
To and pick the models. This is like on dude, perfect. Where they had to pick, like where the fries were from, like, you know, McDonald's or Wendy's or whatever.
[54:19]
Mike
Yeah. So I'm curious if. If you pick O3 Pro to begin with.
[54:24]
Chris
Oh, man, this is a lot of pressure. I need to take notes. Hang on.
[54:27]
Mike
But I just, I genuinely want to see which you prefer. Right. And so we're gonna have to really listen here because these clips go for. They're not short. But if you bear with us, I. I do think all of them are quite compelling. And I have my favorite. I already know my favorite. And I purposely bamboozled myself with the file names, so I didn't know at first. And then I re. Figured out what they were based on the transcript, so I picked mine blind test as well. So I'm. I'm just curious what you'll come.
[54:56]
Chris
Okay. And what did you rate them on or am I just going like, would I read it?
[55:00]
Mike
Which book would I continue to want to read over the others? And I think that's a really good assessment, right?
[55:05]
Chris
Am I ranking them or just picking my favorite?
[55:08]
Mike
You don't have to rank them, I think just pick your favorite. One thing I would say that helped me is I would maybe take a note after I play it to you so that. And. And I would. I would encourage you to do this if you're on your desk listening or whatever as we play them. Maybe write your Stop your car.
[55:24]
Chris
Pull over.
[55:25]
Mike
Yeah, pull over.
[55:26]
Chris
Take this seriously.
[55:27]
Mike
Tell your kids to shut up in the back of the car.
[55:30]
Chris
Can you tell that kid in the back to shut up?
[55:33]
Mike
And then. And then get out your notepad because everyone carries one of them. And. And right. Write some notes as you go. You know, maybe out of 10 or whatever it is just so you can compare them at the end. All right, I'll stop talking. Here we go.
[55:45]
Narrator
Chapter 1 Event Horizon.
[55:50]
The cell lights.
[55:51]
Flickered in uneasy harmony with the gravitational tides outside, casting ripples of shadow across Elias Voss face. He lay strapped to a polymer cot no wider than his shoulders, feeling each shudder of Sila 9's orbit around the black hole below. Every six minutes, a bone deep lurch reminded him how close annihilation waited. Somewhere in the bowels of the prison moon, a siren wailed, a falling tone lament that never quite reached silence before beginning again like a heartbeat for the damned. Voss kept his eyes fixed on the ceiling's cracked plating as he recited vectors in his mind, refusing to surrender. One more thought to fear than necessary. Momentarily, the lock seal chirped and a blast door irised open. A single guard entered, helmet visor opaque and pulse rifle idle at hip. But Voss knew the real threat was the drone that hovered at the threshold, its surgical arm gleaming with tranquilizer needles. Today's ration tray skidded across the floor to stop beneath the cot. On its surface, etched by countless fingernail scratches from inmates before him, was a crude star chart, intersecting arcs, forgotten systems, and a set of coordinates circled so many times the metal blurred. Voss's pulse quickened. The cartographer's legend was no myth. It waited for the one prisoner still stubborn enough to believe escape from both Sila nine and his own grave was possible.
[58:04]
Mike
Chapter okay, so that's the first one.
[58:07]
Chris
Do you want my comments now?
[58:08]
Mike
Yeah, no, I think let's do it as we go, just so we get your sort of raw feedback.
[58:12]
Chris
So firstly, I think Andy Weir should quit his job. He's the guy who wrote the Martian and Hail Mary. It sounds a lot like Hail Mary. Like that the way he writes and like the sort of waking up on a bed and all that sort of stuff. And secondly, I'm so stupid. It took me to like three quarters in to get the Count of Monte Cristo references. Like, you know, being in a cell, the map, like that kind of stuff. But that's damn compelling. Like, I kind of didn't want it to stop. Like I was genuinely enjoying that. That's going to be very, very hard to beat. I want to know what happens.
[58:41]
Mike
My only criticism of this one is, and all of them really is that it still takes the instruction a bit too literally. Like it's turned the synopsis of the book into like a story in the first. Like you would never, as an author unveil so much in the first chapter.
[58:59]
Chris
You reckon? I thought it was okay because like, what did we learn that, you know, he's on some sort of prison ship, they're orbiting a black hole. There's drone guards with tranquilizers and stuff like that. And there's a map, which means there's going to be treasure, you know, like.
[59:13]
Mike
Yeah, I guess it does draw you in. It does draw you in. Yeah.
[59:16]
Chris
I mean the thing it missed from the Count of Monte Cristo was like, it took him a while to get to the point to find out about the treasure, but. And it didn't have like him being like wrongfully accused of a crime. Blah, blah. But it doesn't have to be the same. Like, I don't know, I thought that was excellent. I really, really liked it. It's going to be hard to beat. That's my number one so far.
[59:33]
Mike
Okay. It's not gonna be easy, so the next one I'll play. Play here. So this is a. Again, completely different model. Exact same prompts for all of these.
[59:44]
Narrator
Chapter one. The Edge of Trust. The warning klaxon should have been Darius Cain's first clue that his life was about to end. It cut through the bridge of the Meridian's edge like a blade through silence, its harsh metallic shriek drowning out the steady hum of the ship's fusion drives. Kane's hand froze halfway to the tactical display. The encrypted data chip containing Admiral Voss's intelligence, still warm between his fingers. Around him, his crew, men and women he'd served with for three years, fought beside in the Proxima Rebellion. Trusted with his life a dozen times over, moved with the fluid precision of a well oiled machine. Too fluid too precise, as if they'd rehearsed this moment while he slept. Captain, Lieutenant Commander Sarah Chen called from the communications station, her voice steady despite the alarm's wail. We're receiving a priority transmission from Fleet Command. Admiral Blackwood himself. She turned to face him, and in her dark eyes Kane saw something that made his blood turn to ice water in his veins.
[61:03]
Pity.
[61:04]
Sir, you need to hear this.
[61:07]
Mike
Hey, that's number two.
[61:09]
Chris
That was shit. I hated it. That was cliche drivel. Yeah, no, that's a zero out of ten.
[61:15]
Mike
Okay, okay. Interesting. Interesting. Didn't draw you in at all?
[61:20]
Chris
No, not at all. It just. It sounded like something you would write in high school. It was really bad.
[61:26]
Mike
Okay, mental note. So number two, not so good now. Moving on to number three.
[61:33]
Narrator
Chapter One the Captain and the Cage the star chaser sang its usual song of arrival, a low, resonant thrum from the grav plates that vibrated up through the command chair and into Cailin Corps bones. It was a familiar, comforting hum, the sound of a job well done. Outside the viewport, the velvet black was pinned with the diamond dust of distant stars, but his focus was on the jewel that grew larger with every passing second. Cygnus X1 station, a glittering spindle of light and commerce. Another perfect run. The nav logs were clean, the cargo of rare isotopes secure, and the docking sequence was a familiar, comforting dance. The word captain tasted sweet on his tongue, a promotion all but guaranteed upon his return. But it was Alani's face that truly filled his mind's eye, her smile brighter than any nebula waiting for him on the observation deck.
[62:50]
Mike
Sir?
[62:51]
Narrator
A voice crackled over the bridge intercom, pulling him from his reverie. It was Jax, his comms officer. Priority one message coming through Fleet Command encryption. They're patching it directly to your private channel. Kelin's brow furrowed. Fleet Command rarely bothered with pilots directly, especially at the end of a milk run like this. That was what port authorities and corporate liaisons were for. He toggled the receiver. The voice that came through was clipped and impersonal, an anonymous drone from the station's security hub. He was to bypass the commercial docking bay and proceed directly to Auxiliary Hangar 7, a sterile, high security berth typically reserved for military vessels and diplomatic envoys. He was to be met by a security detail. No reason was given. A flicker of unease, cold and sharp, cut through his triumphant warmth. It was irregular, but it was a direct order, and Kalan Corr had built his entire career on following right.
[64:07]
Mike
That one cut off, probably. There was like one more word.
[64:10]
Chris
That one, that one is way, way more true to the Count of Monte Cristo. It's like almost just like taking the plot directly out of it. Like they're returning on a ship from a mission. He. He gets ordered to go see the guy who's going to betray him because he's carrying a letter for Napoleon. Like all that sort of stuff. That one, that one was way truer to the actual plot of the Count of Monte Cristo, however. It was like a bit dry. Like a bit. It didn't draw me in like the first one did. I think the first one as like a sci fi drama was way more exciting than any of the others. I think the third one was the best at sticking to the mission, like as in make a space themed Count of Monte Cristo. So I think I'm gonna go 1, 3, and then a far distant 2. 2 was garbage.
[64:59]
Mike
So just to clarify, like, just to remind everyone listening, the first prompt was write an introduction for a book that is like the Count of Monte Cristo but set in space. It should draw the reader in and sell them on the book. They should not be able to put it down. Then when it writes its response to that, I then trigger it again with another prompt. Okay, now write the first two paragraphs of the actual book, like chapter one. So then that's what we then converted to the audio. So which model do you think was which?
[65:31]
Chris
Oh, okay, so that's gonna be hard. So my choices are O3 Pro, Gemini 2.5 and Sonnet 4.
[65:40]
Mike
So sorry, just to clarify, you think the first one was.
[65:43]
Chris
No, no, no, I'm saying, are they my candidates?
[65:47]
Mike
Yeah. So you got O3 Pro, Sonnet 4 and then Gemini 2.5, bro. Yeah.
[65:53]
Chris
Okay, so I reckon, I reckon Sonnet 4 was number two, the one I didn't like. I think that Gemini 2.5 was number one, the one I liked the most. And I think O3 Pro was the last one, the one I like second, which is the. The one that stuck closest to the plot of the book.
[66:15]
Mike
To me, like this tells me how much use the models because you do have a pretty good intuition for them at least Sonnet. So Sonnet, you are correct, was number two pretty bad. That's Sonnet four. And what's surprising to me is when we do the rap battles, it by far is the best. Always like, it's so good at riding rap battle tracks.
[66:37]
Chris
So I was wrong on the other two, so.
[66:38]
Mike
Yeah, you were incorrect on the other two. So the. The best one number one is 03 Pro.
[66:43]
Chris
Wow. And I think it was. I would think it was markedly better. I should have gone with my instinct thinking that's the best model and it would do the best in the thing. The only reason I said it for the last one is I thought it was better at instruction following, which I guess Gemini 2.5 is so. Damn, I wish I had my time over.
[66:59]
Mike
It sort of is. But like you, yes, it is better at instruction following in that regard. But I think what the user really is implying with the prompt, or at least what I thought I was implying, is like, base it on that, but don't actually, like, recreate the exact story in space.
[67:15]
Chris
Exactly. And I think that's a great point because, like, you want that. Like, when you're doing a task like that, you want creativity. It's not like, just do a one for one ciphering of it and just change the words for space. Just like search and replace the word. You know, like C with space.
[67:32]
Mike
Yeah. And for those listening, if you listen in a place where you can leave a comment, like YouTube, do drop a comment like, which do you think was the best? Maybe it's like driving again as well.
[67:43]
Chris
If you're on the way to school.
[67:44]
Mike
Yeah. You can tell your kids they can resume talking, put away that paper notepad and get on with your life. But the. Yeah. So, like, honestly, I think naturally I picked number one number like the same order as you.
[67:59]
Chris
Yeah.
[67:59]
Mike
And I'm not just saying that I truly did, but I thought number one was just so, so far better. It wasn't even. It wasn't.
[68:08]
Chris
I also thought the voice delivery of it was better too. I know that isn't really a factor, but, like, I think my orders the same for how good the voice was in each as well.
[68:18]
Mike
Yeah. Yeah. It worked better with that one. Keep in mind, I. In 11 labs, I did use that feature where it decides on the emotes to put in and where to put them. And so there's maybe a little bit of bias creeps in there. But I still tried to interpret it based on the storytelling in that sort of vision I create in my mind versus the voice. Like, I wasn't judging the voice when I heard it. I was most. The first one. I was like, I actually genuinely would keep reading this story.
[68:50]
Chris
Yeah. Yeah.
[68:51]
Mike
Maybe I should do a series like the Count of Monte Cristo with this plot and we keep it going.
[68:56]
Chris
I just so regret I was so close to getting that right. And, you know, I've only got 30, 33.3.
[69:03]
Mike
It's still, I think what, you know, I thought you might have lent into Sonnet being the winner. Like, this is gonna, you know, this is going to be that.
[69:13]
Chris
I did think back that you always said it's better at creative writing, but I just knew that it was, it was going to be the bad one because I knew, I knew Gemini 2.5 wouldn't do it bad. And I thought, oh, 3 Pro has more time to think and get it right. So I just thought there's no way it's going to produce drivel like that second one.
[69:29]
Mike
So the one thing I would like, I should have added to the test is actually done Sonnet 3.5, because I think they've neutered a lot of that stuff in Sonnet 4 in in the guise of, like, we want to be better at coding, right?
[69:44]
Chris
Yeah.
[69:44]
Mike
And so they've sort of like degraded other parts of the model.
[69:48]
Chris
I think this should be the new way of us judging model. Classic literature space rewrites, judging panel. Like, it's, that was fun.
[69:57]
Mike
It's not a bad way. And it translates well to the, the primary medium that people listen, which is audio. So it's tends to work a lot better than.
[70:06]
Chris
Yeah, we could just become like a sensual story station. It would be very interesting.
[70:10]
Mike
Also contrasted with the rap battles. It makes no sense why these would be our two tests. It's like one.
[70:15]
Chris
Although I imagine it tends to polarize our audience. Some people hate the rap battles, so they might also hate the, the literature section.
[70:22]
Mike
Yeah, look, we're never gonna win either.
[70:24]
Chris
Yeah, it's true.
[70:26]
Mike
Yeah. So that I, I, I. Anyway, I thought that's a really interesting. What I, what I found strange about that experiment, though, is this it Again, in my limited experience, and I do want to road test this thing more. I don't think it's a great coding model. Like, almost all OpenAI models right now, I think they've just, you know, and for whatever reason, there might be a strategy behind it, but I just don't think they're the best. And I know a lot of people will be like, but Codex has its own model. And it's better. Like, it's not being used in reality. Like, you know, no one's using it in Cursor or any of these other big ides. So, like, I win that argument. I'm sorry. But I think for cutting through a hard problem, great model to occasionally rely on for certain things. Clearly for writing news articles and stories and all these things. Like, you know, my kids are Going to cost me a Fortune now calling O3Pro to write a Batman story. But I know it'll be better. So I'm very impressed with this writing capability. It's code writing capability, great at cutting through a problem and cutting through the noise, but if you just want raw output, it's a, it's not a great model.
[71:38]
Chris
Yeah.
[71:39]
Mike
So, all right, we have covered 11 Labs V3, the O3 pricing drama, them dropping their pants on pricing. Oh, 3 Pro. I'm, I'm excited about, I think it gets to that point of the show where you, we got to go back to the boom factor here. How many booms are we talking?
[71:57]
Chris
Yeah, I think pretty high. I think maybe like a 7 out of 10, I think. Because I still think for me the, the, the time it takes is going to limit my usage of it. However, its ability to call tools and give these really high quality responses. I think my early horse racing results plus that story make me want to give it much more of a chance throughout the week to solve my problems. And I think now that I work in a sort of tab day synchronous kind of world, I'm a lot more inclined to set it about problems and just leave it and then just wait till the ding comes through and, and check on it. So I think yeah, maybe 7.7.5 and I'll, I'll give it a real run this week on real world stuff that I need done and, and report back next week because definitely I'm not seeing quite the excitement around like the, the new sonnet opus is basically unusable because of Amazon for us. And so, yeah, the anthropic's really sort of gone down in my estimation. I'm still using Gemini 2.5 a lot, but I think this OpenAI update, like with the lower cost, the faster O3 and this O3 Pro available at a sort of middle reasonable price is actually getting me excited about them again. And I think it's a, you know, they, they went through this phase where it was all hype and no substance, whereas here they've just given substance out of nowhere. And I think it always just helps in terms of people who focus on the practical side of the models.
[73:27]
Mike
I think O3 Pro is a good model at its core. There's something good about it that I can see myself using it again. And as you said, like they didn't try and gatekeep O3 Pro into that crazy plus plan or whatever at 200 USD a month that most people, it's out of reach. For they do have it available in the AI, they are letting developers access at a reasonable price point. I think they needed to drop the price of O3 because it just wasn't interesting at the price that it was at. So now with, with, now that it's on par with a Gemini 2.5 Pro, it starts to make you consider using it a lot more and therefore maybe you start to enjoy that model and can extract more, more, more out of it.
[74:13]
Chris
One other thing we didn't mention as well is that we didn't test vision and it's my fault because I thought it didn't support vision. So we never got around to actually trying it. So I think that's another thing throughout the week we need to try is see does it perform better on vision and in particular is it really good at vision when it comes to say, operating a computer or those finer details which we've seen other models have issues with. So I think that given its ability to think the way it does, we may actually see better results in vision as well. And so that's something I'd really like to put to the test.
[74:46]
Mike
Yeah, for me, a model like O3 Pro, it's still operated in a chat chat level speed like that speed that you can put up with with the ability to do Async tool calling in a 1 million context and then mix in the sort of tune from Gemini with code that to me I hope is GPT5. Like, I kind of hope that's what it is. That's what I think is needed like bringing these altogether, like, because there's something really nice here brewing. So yeah, I'm not sure about a boom factor from me. Like it's sort of hard to say until we play around with it more, especially with the tool calling. The thing I would question you on though, with Sonnet or even Opus provided the bandwidth. There is, I do think there is something, and I mentioned this last week around the Async tool calling, where when you experience that and then you have to go back to an O3 Pro, which chugs through it slower for most tasks like checking email or adding a calendar invite or doing an operation in accounting software or something like that as part of a daily workflow, I think that it's like a workhorse model at that point where you just like Sonnet is a good workhorse model.
[76:00]
Chris
Sonnet is the only one I've seen consistently and aggressively use the parallel tool calls, that's for sure. And that Includes over Sonnet 3.5 and 3.7 Sonnet 4 is the only one that seems to go all in on it. And so I think that is a very interesting factor.
[76:17]
Mike
But honestly, my, my total takeaway is what a good time. In LLMs and AI in general. I know people feel like things are stagnant. Someone in the week on our discord posted a thing about come on, do something. That meme where he's poking, poking a rock. And I get, I do get that feeling because I think that it felt before like there were. I think we're just so used to this stuff now. Like, when Google VO3 or whatever came out, people were arguably really excited. But I think, like, all those things, it always dies down. And then there's that reality check of like, this is just a preview of the future. It's not actually something that's that useful right now. But I do think these models, as we increasingly add MCPs, like, we're about to go through, and I know we keep hyping it, we'll deliver soon, but it's a paradigm shift. Like, it's an async, different world and it's coming and it is exciting and these models are the models that will power it.
[77:16]
Chris
Yeah, agreed. And I think that's all everyone will be talking about soon, is the mix of tools they're using and how best to get them working in orchestration. I really think that'll be everyone's focus. It's not going to be like, which is the best chat model anymore.
[77:30]
Mike
All right, thanks again for listening and all of your support. Tell us, please do tell us which you preferred in the book test. Maybe you completely disagree with our opinions on that. Although I think it's fair to say I can't imagine people disagreeing that heavily.
[77:43]
Chris
No, my opinions objectively. Right. I think.
[77:48]
Mike
All right, we'll see you next week. Goodbye, Sa.