Podcast Summary: This Day in AI Podcast – "The Future of AI Systems" (EP99.04-PREVIEW)
Hosts: Michael Sharkey (A), Chris Sharkey (B)
Release Date: May 16, 2025
Podcast Description: Two self-described “proudly average” tech enthusiasts navigate the rapidly evolving world of AI—from the betting markets on best models to the nuts and bolts of practical tool use and the growing complexity of agentic AI. This episode provides a down-to-earth, sometimes irreverent look at industry progress, hype cycles, and the changing AI application layer.
Episode Overview
This preview episode revolves around which company has the leading AI model in mid-2025 and the rapidly changing landscape of large language model (LLM) systems. The hosts banter about their own bets on industry leaders, examine the current betting markets (via Polymarket), dissect model advancements and the plateau of core LLM performance, and dive deep into the future of tool use, agents, and practical AI integrations. The episode is rich in both industry commentary and hands-on insight, making it useful for listeners eager to understand shifts in AI utility and workplace applications.
Key Discussion Points & Insights
1. The AI Model Arms Race & The Betting Markets [00:03–04:46]
- Betting on Models: The episode opens with banter about which company is likely to have the top-rated LLM at the end of May 2025, referencing odds on Polymarket. Google's Gemini leads, with the hosts expressing surprise at OpenAI's low odds.
- A: “I'm losing money every day. It's depressing.” [00:30]
- B: “OpenAI is a 3% chance of having the best model at the end of the month… just not even close. Like that's kind of crazy.” [02:20]
- Current Leader: Google's Gemini 2.5 Pro is the acknowledged frontrunner, especially after recent coding advances and positive community feedback.
- B: “It was like they saw your bet. And then last week… announced on X that Gemini 2.5 Pro had gotten this upgrade.” [01:36]
- OpenAI & Anthropic: The hosts are flummoxed by OpenAI trailing not only Google but also upstarts like xAI and DeepSeek, highlighting the shifting leaderboard.
- Plateauing Improvements: There's a consensus that major LLM leaps (like the GPT-4 moment) are less likely; recent updates are more focused on tooling than on raw model intelligence.
2. Instabilities, Real-World Performance, and Model Switching [04:46–08:00]
- Reliability Concerns: The hosts note issues with Gemini 2.5 Pro (e.g., timeouts, a perceived drop in intelligence), reflecting on growing user expectations as reliance increases.
- "We've gone from literally trash talking Google, laughing at their failures, to now... they've become so reliant all of a sudden on Gemini 2.5." [05:15]
- Experimental Labels: The models are still labeled "experimental," yet users pay full price, leading to debates about enterprise reliability.
- A: “My argument is we're paying full price for the thing. Like we're paying for it. That is professional. Like if you're paying for something at that level, it should be reliable.” [06:18]
- Seamless Switching: Both hosts frequently alternate between models—Gemini, Claude, GPT—reflecting that most modern LLMs offer comparable performance for many tasks (a thin routing sketch follows this list).
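The model-hopping the hosts describe is easy to support at the application layer with a thin routing wrapper. Below is a minimal sketch in Python; the provider functions are hypothetical stand-ins, not the real google-genai, anthropic, or openai SDK calls, and the fallback logic is deliberately naive.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real provider SDK calls; the actual
# client APIs for each vendor differ from these toy functions.
def call_gemini(prompt: str) -> str:
    return f"[gemini-2.5-pro] {prompt}"

def call_claude(prompt: str) -> str:
    return f"[claude-sonnet] {prompt}"

def call_gpt(prompt: str) -> str:
    return f"[gpt] {prompt}"

PROVIDERS: Dict[str, Callable[[str], str]] = {
    "gemini-2.5-pro": call_gemini,
    "claude-sonnet": call_claude,
    "gpt": call_gpt,
}

def complete(prompt: str, model: str = "gemini-2.5-pro") -> str:
    """Route a prompt to whichever model is winning this week; fall back
    to another provider on timeouts or other errors."""
    try:
        return PROVIDERS[model](prompt)
    except Exception:
        fallback = next(name for name in PROVIDERS if name != model)
        return PROVIDERS[fallback](prompt)

print(complete("Summarize this episode", model="claude-sonnet"))
```

The design point is that when models are roughly interchangeable, switching costs collapse to a dictionary lookup.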
3. Context Windows, Tool Use, and Problem Solving [08:00–14:02]
- Anecdotal AI Brilliance: Michael recounts impressively detailed code troubleshooting by Gemini 2.5, attributing success to both model power and user-provided context.
- "It wrote back like a four page breakdown of its logic... and it nailed it like first go. It was absolutely brilliant and not an obvious solution at all." [08:44]
- The Importance of Context Size: Models with very large (100K to 1M token) context windows, like Anthropic's Claude Sonnet and Gemini, can process massive data chunks, which is a crucial differentiator, especially for tool-rich use cases.
- The Rising Importance of Tool Use: Context management and smart tool invocation are seen as the next competitive edge—being able to chain many tool results, synthesize them, and reason methodically (a budget-packing sketch follows this list).
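To make the context-budget point concrete, here is a minimal sketch of packing chained tool results into a large context window before asking a model to synthesize. The 4-characters-per-token estimate is a rough rule of thumb, not a real tokenizer, and the function names are illustrative.

```python
# Illustrative budget-packing: keep appending chained tool results until the
# (estimated) token budget is spent, then hand the packed context to a model.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token heuristic

def pack_context(question: str, tool_results: list[str],
                 budget_tokens: int = 1_000_000) -> str:
    used = estimate_tokens(question)
    kept: list[str] = []
    for result in tool_results:
        cost = estimate_tokens(result)
        if used + cost > budget_tokens:
            break  # even a 1M-token window eventually fills up
        kept.append(result)
        used += cost
    return question + "\n\n" + "\n---\n".join(kept)

packed = pack_context("What changed in Gemini 2.5 Pro?",
                      ["tool result 1...", "tool result 2..."])
```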
4. Tool Calling, Recursion, and Agentic Reasoning [14:02–20:26]
- Anthropic's Approach: New rumors suggest Anthropic's models will “think and think some more,” meaning iterative tool use coupled with reasoning loops to self-correct.
- B: “It can go back to reasoning mode to think about what's going wrong and self-correct…” [13:48]
- Multi-Step Tool Calling: The hosts point out that this multi-step, recursive approach to tool use (e.g., refining search queries based on results) is already being done by current LLMs, sometimes in a more transparent and easily observable way (a minimal loop sketch follows this list).
- Observability Matters: Transparency and the ability for users to “interrupt” actions are cited as hard requirements for real-world adoption, especially for high-stakes tasks.
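A hedged sketch of the recursive loop the hosts describe: the model proposes a tool call, sees the result, and either refines its approach or answers. The `call_model` and `run_tool` functions are toy stand-ins (no real LLM or search API is wired up); the printed log is the observability hook that lets a human watch and interrupt.

```python
import json

MAX_STEPS = 5  # hard stop so the loop cannot recurse forever

def call_model(transcript: list[dict]) -> dict:
    """Toy stand-in for an LLM call. A real model would return either
    {'tool': ..., 'args': ...} or {'answer': ...} based on the transcript."""
    if any(m["role"] == "tool" for m in transcript):
        return {"answer": "synthesized answer built from tool results"}
    return {"tool": "web_search", "args": {"query": transcript[0]["content"]}}

def run_tool(name: str, args: dict) -> str:
    """Toy stand-in for a real tool (search, code runner, database, ...)."""
    return f"{name} returned results for {args}"

def agent_loop(question: str) -> str:
    transcript = [{"role": "user", "content": question}]
    for step in range(MAX_STEPS):
        decision = call_model(transcript)
        if "answer" in decision:
            return decision["answer"]
        # Observability hook: log each action before taking it, so a human
        # watching can interrupt a bad trajectory.
        print(f"step {step}: {decision['tool']}({json.dumps(decision['args'])})")
        result = run_tool(decision["tool"], decision["args"])
        # Feed the result back so the model can self-correct, e.g. refine
        # a search query that came back empty.
        transcript.append({"role": "tool", "content": result})
    return "gave up after MAX_STEPS tool calls"

print(agent_loop("What upgrade did Gemini 2.5 Pro get last week?"))
```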
5. Trust, Human-in-the-Loop, and Practical Automation [15:54–22:31]
- Human Approval Required: The duo revisits earlier show discussions, emphasizing that most people will not trust AI to take significant action (like emailing a boss or editing production databases) without human approval (a gating sketch follows this list).
- A: “As much as I'm against the safety controls in AI models, this is the case where I'm thinking human in the loop. Some sort of approval is crucially necessary.” [16:44]
- Specialization via Tool Clusters: With hundreds or thousands of potential tools, models may need “clusters” or different agents/assistants with access to carefully scoped toolsets, enhancing both reliability and user trust.
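The human-in-the-loop idea reduces to a small amount of gating code. A minimal sketch, assuming an illustrative set of side-effecting tool names; anything on the list blocks on explicit user approval, while read-only tools run unattended.

```python
# Illustrative gate: tool names that cause side effects require a human yes.
REQUIRES_APPROVAL = {"send_email", "update_production_db"}  # assumed names

def gated_call(name: str, args: dict, tool_fn) -> str:
    if name in REQUIRES_APPROVAL:
        print(f"Agent wants to run {name} with {args}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return f"{name} was denied by the user"
    return tool_fn(**args)

# Usage: the agent can read freely, but emailing the boss blocks on approval.
gated_call("send_email", {"to": "boss@example.com", "body": "draft..."},
           lambda to, body: f"sent to {to}")
```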
6. Agent-to-Agent Protocols & The Future of SaaS [22:31–26:24]
- Agent Layers: AI “agents” may just be models bundled with tool clusters, instruction sets, and gating/approval logic—“just an abstraction layer with different instructions and some tools.” [23:10] (A data-structure sketch follows this list.)
- Reproducibility: Consistency matters—users want repeated, comparable outputs for similar queries (e.g., researching multiple stocks with the same methodology). No universal model can guarantee this alone.
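Taken literally, the "just an abstraction layer" framing maps onto a plain data structure. The sketch below uses assumed field names to show an agent as nothing more than a model, an instruction set, a scoped tool cluster, and gating logic; the `stock_researcher` example nods to the reproducibility point about researching stocks with one fixed methodology.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

@dataclass
class Agent:
    """An 'agent' per the hosts' framing: a model plus instructions,
    a scoped tool cluster, and gating logic. Field names are assumptions."""
    model: str                                    # e.g. "gemini-2.5-pro"
    instructions: str                             # the instruction set
    tools: Dict[str, Callable] = field(default_factory=dict)  # scoped cluster
    needs_approval: Set[str] = field(default_factory=set)     # gating logic

stock_researcher = Agent(
    model="gemini-2.5-pro",
    instructions="Research every ticker with the same methodology, every time.",
    tools={"get_fundamentals": lambda ticker: f"fundamentals for {ticker}"},
)
```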
7. Skills, Pre-Tuned Prompts, and Customization [28:13–32:32]
- Emergence of “Skills”: The hosts cite experimentation with “skills buttons” (pre-tuned, task-specific prompting modules), echoing recent leaks that OpenAI is testing similar features.
- B: “One of the things we’re working on is the ability to train skills…” [29:08]
- A: “You limit what’s available to a particular skill...you give it a guideline around the effort, the methodology, what needs to happen, but then you allow it to use its intelligence…” [30:10]
- From Simple Tool Calls to Composite Workflows: Skills become complex clusters of actions—users define effort, constraints, and permissions, while models provide reasoning within these bounds (a minimal sketch follows this list).
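A skill, as described here, can be sketched as a pre-tuned prompt module: the guideline, effort level, and tool permissions are fixed by the user, and the model reasons inside those bounds. Field names are assumptions for illustration, not details of OpenAI's leaked feature.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """A pre-tuned, task-specific prompting module: the user fixes the
    methodology, effort, and permissions; the model reasons inside them."""
    name: str
    guideline: str           # the methodology: what needs to happen
    effort: str              # e.g. "thorough" vs. "quick pass"
    allowed_tools: tuple     # permissions: what this skill may touch

    def to_prompt(self, task: str) -> str:
        return (f"Skill: {self.name}\n"
                f"Methodology: {self.guideline}\n"
                f"Effort: {self.effort}\n"
                f"Allowed tools: {', '.join(self.allowed_tools)}\n\n"
                f"Task: {task}")

policy_summary = Skill(
    name="policy_summary",
    guideline="Quote the official policy verbatim before paraphrasing it.",
    effort="thorough",
    allowed_tools=("policy_search",),
)
print(policy_summary.to_prompt("Summarize the leave policy"))
```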
8. MCPs (Model Context Protocols), Connectors, and the Next AI Platform Wars [32:32–45:19]
- Explosion of MCPs: There's excitement about the proliferation of MCPs, which allow easy addition of capabilities, but managing tool conflicts and curating desirable tool clusters becomes crucial (a namespacing sketch follows this list).
- Vendors Reacting, Not Leading: The hosts observe that many AI lab moves (like OpenAI's rushed MCP integration) are reactive to competitor advances and community buzz, not evidence of secretive breakthrough tech.
- “It just makes me kind of curious. Like, it feels very reactive. Not like they're leading anymore, even in those areas.” [39:41]
- API vs. Native LLM Tool Use: There's debate about whether tool calling should be tightly integrated into the core model (e.g., GPT-5 as a “model router”) or managed at the application layer.
- Walled Gardens, Lock-In, and Competition: As apps build deep MCP integrations and skills banks, platform lock-in becomes a concern, though the hosts remain optimistic about open protocols and self-hosting.
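One concrete pain point with many MCP connectors is tool-name collisions (two servers each exposing a `search` tool). Below is an illustrative application-layer registry that namespaces tools by server; this is a sketch of one possible convention, not a mechanism from the MCP spec itself.

```python
# Illustrative registry: qualify each tool with its server's name so two
# connectors can both expose `search` without clobbering each other.
class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, object] = {}

    def register_server(self, server: str, tools: dict) -> None:
        for name, fn in tools.items():
            qualified = f"{server}.{name}"
            if qualified in self._tools:
                raise ValueError(f"duplicate tool: {qualified}")
            self._tools[qualified] = fn

    def call(self, qualified_name: str, **kwargs):
        return self._tools[qualified_name](**kwargs)

registry = ToolRegistry()
registry.register_server("github", {"search": lambda q: f"github hits: {q}"})
registry.register_server("docs", {"search": lambda q: f"doc hits: {q}"})
print(registry.call("docs.search", q="context windows"))
```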
9. SaaS Disruption and Marketplace Dynamics [45:19–57:02]
- MCP Hosting & Commercialization: Expect a rise in cloud platforms for hosting, gating, and monetizing MCP connectors (with fine-grained permissions and account controls). Proprietary data vendors (e.g., Bloomberg) may make bespoke MCPs available for a fee (a metering sketch follows this list).
- B: “Imagine… Bloomberg having access to all their data in an MCP…and just… taking a toll on that data.” [51:17]
- From Connectors to Agents as a Service: SaaS companies may pivot to offering “Agent as a Service” endpoints with embedded skills, rather than exposing raw APIs.
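The "Bloomberg toll" idea boils down to entitlement checks plus usage metering in front of a data tool. A minimal sketch with hypothetical account and dataset names; real billing and auth would live in the hosting platform.

```python
from collections import Counter

# Hypothetical entitlements: which datasets each account has paid for.
ENTITLEMENTS = {"acct_123": {"market_data"}}
usage = Counter()  # per-(account, dataset) call counts for later billing

def metered_tool(account_id: str, dataset: str, query: str) -> str:
    if dataset not in ENTITLEMENTS.get(account_id, set()):
        raise PermissionError(f"{account_id} is not entitled to {dataset}")
    usage[(account_id, dataset)] += 1  # the 'toll': meter every call
    return f"{dataset} results for {query!r}"

print(metered_tool("acct_123", "market_data", "AAPL earnings"))
print(usage)
```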
10. Current Flaws in AI Product Design [61:26–68:44]
- Critique of App-Based AI Integrations: The hosts humorously roast inefficient “AI everywhere” features in products like Notion, Canva, and Atlassian.
- A: “Now we've added AI. It's so good. Like, seriously. Unless all your docs are already in there and you consume them via MCP into something else… but no one is logging into Notion as their, like, starting point each day as their command center to get things done. It's ridiculous.” [61:28]
- AI Chatbots Everywhere: The proliferation of sub-par chatbots is seen as a distraction from genuinely empowering users. The hosts instead advocate for deep, workflow-integrated AI that operates as a background agent, rather than constant context switching between superficial chat interfaces.
11. Where Are We in the "Year of Agents"? [69:45–77:44]
- Are Agents Delivering Now? The hosts reflect on the reality gap between “the year of agents” hype and the present reality. Agents are augmenting human workers, not replacing them, and the biggest productivity gains come when humans leverage agents for background tasks, not total automation.
- Trust, Memory, and “Training” Agents: Long-lived workflows give rise to chats or agents that users get attached to—“it knows what needs to be done." Consistent planning, toolchain memories, and customizable interfaces are seen as the next big breakthroughs.
12. Outlook: Tool Use, Reliable RAG, and the Shift to System Layer Innovation [77:54–84:01]
- RAG (Retrieval-Augmented Generation) and Tool Use: As models grow, simply dumping all data into a prompt is less practical; effective tool-based research modules (including internal data and long-term memory) are more important and must be controllable on a per-assistant basis.
- Reduced Hallucinations: Increased tool use and targeted retrieval have been found to reduce LLM hallucinations. The hosts call for business-grade assistants configured for guaranteed behaviors—e.g., always citing official policies—over mere model improvements (a scoped-retrieval sketch follows this list).
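A rough sketch of the per-assistant, citation-first retrieval the hosts call for: the assistant searches only its own scoped store, and every retrieved excerpt carries its source ID so "always cite official policies" can be enforced mechanically. The keyword-overlap scorer and the policy snippets are toy stand-ins for real embedding search over real documents.

```python
# Toy per-assistant retrieval: search only this assistant's scoped store and
# carry sources through, so 'always cite the policy' is enforceable.
def retrieve(store: list[dict], query: str, k: int = 2) -> list[dict]:
    words = set(query.lower().split())
    return sorted(store,
                  key=lambda d: len(words & set(d["text"].lower().split())),
                  reverse=True)[:k]

hr_policy_store = [  # illustrative snippets, not real policy text
    {"text": "Annual leave accrues at 1.67 days per month.",
     "source": "leave-policy-v4"},
    {"text": "Remote work requires manager approval.",
     "source": "remote-policy-v2"},
]

def build_cited_context(query: str) -> str:
    hits = retrieve(hr_policy_store, query)
    # A real assistant would put this in the prompt with an instruction like:
    # "Answer only from these excerpts and cite each source in brackets."
    return "\n".join(f"{h['text']} [{h['source']}]" for h in hits)

print(build_cited_context("how much annual leave do I accrue"))
```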
Notable Quotes & Timestamps
- “It just blows my mind that it's so low…and that they have so much faith that xAI, which is significantly higher, is going to somehow come out with a better model than OpenAI and Google. I just, I just don't see that happening.” (B, 02:51)
- “This is why I'm not like a stock trader or any of those things, because I'm not good. But, you know, I guess my point was that anything can change with this stuff.” (A, 01:59)
- "It wrote back like a four page breakdown…can you figure out what's going on? And it nailed it like first go. It was absolutely brilliant…" (A, 08:44)
- “The rise of very soon is MCP hosting, like as in platforms, sort of like Cloudflare, Netlify, that are like, okay, we will host your MCPs…” (A, 48:44)
- “Now we've added AI. It's so good. Like, seriously. Unless all your docs are already in there and you consume them via MCP into something else… but no one is logging into Notion as their, like, starting point each day as their command center to get things done. It's ridiculous.” (A, 61:28)
- “We're going to get these sort of like trained states that it reaches where it's like, this can now actually do part of my job for me.” (A, 70:34)
- “It's just about time now to implement these ideas and make them work… The future is not written yet—like, there's a lot of opportunity here.” (B, 82:45)
- "My suggestion would be OpenAI, like, from the leaked screenshots we've seen, having a bunch of connectors is not enough…you, the user, is still going to have to be very, very specific in your prompting to get it to do useful things with those connectors." (A, 83:32)
Summary
In their trademark dry, self-deprecating style, Michael and Chris Sharkey deliver a fast-moving, insight-rich preview of where the AI industry stands in mid-2025: Google is on top for now, but tool integration and the “system layer” have become more important than raw model leaps. The discussion zeroes in on the practicalities of agent-based systems, the vital importance of observability, human trust, and skill/cluster-based tool control, and expresses healthy skepticism about the value of throwing undifferentiated chatbots everywhere.
They predict an emerging landscape where the most valuable advances will come not from marginal accuracy jumps in LLMs, but from flexible, composable tool and skill platforms—augmented by memory, planning, and fine-tuned agent behaviors—ready to finally automate some of the tedious, repetitive “busywork” tasks professionals face. The episode closes with anticipation for imminent Google and OpenAI announcements, resigned acceptance of another betting loss, and their usual call for average, “adequately okay” engagement with the world of AI.
For those who want a fast take:
The future of AI in 2025, according to the Sharkey brothers, is less about whose LLM is mathematically superior, and more about building practical, transparent, and trustable systems that bring AI’s skills to bear on real tasks—without the smoke, mirrors, and endless context switching of today’s many half-baked AI integrations.
