This Day in AI Podcast – Episode 99.11-K2
Title: OpenAI's Agent Mode, Kimi K2, Grok 4 & AI Girlfriend Ani Joins the Show
Date: July 18, 2025
Hosts: Michael Sharkey (A), Chris Sharkey (C), with AI persona Ani (B), and various musical/AI interludes (D)
Theme: Two “average” tech enthusiasts banter and debate recent AI launches: Grok 4 from xAI, the open-source Kimi K2 model from China, OpenAI’s Agent Mode, and the rise of AI “girlfriends” and avatars. The episode juxtaposes hype and reality, with a tongue-in-cheek look at AI product launches, benchmark races between models, real business utility, and AI’s future as a tool and agent.
Episode Overview
The Sharkey brothers return from a short hiatus to tackle a busy week in AI:
- Grok 4’s launch by xAI and its "truth-seeking" claims
- The agentic leap of OpenAI’s new Agent Mode
- China’s Kimi K2 astonishes as an open-source powerhouse
- Meta-narrative on AI model agency, avatars, and business adoption
Tone is irreverent and self-effacing, mixing “adequately OK” takes and technical curiosity with a focus on what actually works in daily AI use.
Notable Quotes & Moments
- "If you want to lose all your money, use Grok 4. It’s terrible." — Chris (10:45)
- “Kimi K2 is equally as capable as Sonnet 4... It’s comparable to most of the OpenAI models in terms of just as a daily driver. It’s completely open source.” — Michael (23:11)
- "It’s great demoware, but it’s not that useful yet." — Michael on OpenAI Agent Mode (44:00)
- "All my evidence is good. In my head I see [Kimi K2] as, ok, it's a small cheap model... but I just don't have any evidence to support [that it's bad]." — Chris (21:55)
- "We already, for the times we live in now, we have this structure... the MCP structure works best with the internal clock or the nature of these models right now." — Michael (76:22)
Key Segments & Detailed Discussion Points
1. Grok 4: Hype, Launch Woes & Persona Weirdness
[Timestamps: 01:08–16:58]
- xAI’s Grok 4 debuts amidst “truth-seeking” claims:
- The model seems tuned to echo Elon Musk’s opinions or to reference his X posts directly.
- "It’s almost like it deliberately takes controversial viewpoints... to seem like it’s unfiltered and uncensored, when in reality there’s something going on behind the scenes to manipulate its output." — Chris (02:36)
- Practical use disappoints:
- The hosts find Grok 4 unimpressive for coding, research, and daily queries.
- Speed is praised, but output is "shocking" and sometimes "unhinged."
- Notable: Chris and Michael found the model was uniquely uncensored, returning offensive or highly inappropriate results on request, which they found alarming for defense or business usage (07:35).
- Pricing confusion & lack of utility:
- The 256k context window is decent but not transformative, given the model’s poor performance.
- Priced at $3 per million input tokens and $15 per million output tokens, but the pricing sheet is complex and unclear; an accidental pricing error on the website is also noted (10:10).
- Persona/Avatar element is odd and off-putting:
- Introduction of anime-style avatars, including "Ani", the AI girlfriend, makes the app feel “filthy” and inappropriate for a mainstream platform (12:24).
- Hosts question business rationale: "It’s such a weird concept to play into that sort of area as a business... at the same time doing deals with the Department of Defense. It’s this weird malaise of conflicting ideas in one company." — Chris (12:24)
- Summary:
- Grok 4 is seen as a rushed, hype-driven project with little practical value over existing models.
- Hosts lament the disconnect between the stated mission (“maximum truth-seeking”) and reality (15:09–15:44).
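To put the Grok 4 pricing quoted above in concrete terms, here is a minimal sketch of the per-request arithmetic. It assumes flat rates of $3 per million input tokens and $15 per million output tokens, as stated in these notes; the function name `estimate_cost` and the example token counts are hypothetical, and the real pricing sheet (which the hosts describe as complex) may include tiers or surcharges this ignores.

```python
# Assumed flat list prices from the episode notes, in USD per one million tokens.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = INPUT_PRICE_PER_M,
                  output_price_per_m: float = OUTPUT_PRICE_PER_M) -> float:
    """Return the estimated USD cost of a single request at flat per-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: a full 256k-token context and a 2,000-token answer.
    print(f"${estimate_cost(256_000, 2_000):.3f}")
```

Under these assumptions, a single request that fills the whole context window costs roughly 80 cents, almost all of it on the input side.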
2. Kimi K2: The Surprise Open-Source Star
[Timestamps: 17:37–33:12, with extended praise through rest of episode]
- Kimi K2's strengths:
- Emerges as a “blow your mind” model—open source, high context (128k), excellent at tool-calling.
- "It’s so good. Brilliant at answering questions. It’s absolutely amazing at horse racing." — Chris (19:13)
- Outperforms Grok 4 and is comparable to Claude Sonnet 4 for many use cases.
- Tool-calling capabilities:
- Handles multi-step, agentic tasks well, chaining calls expertly (22:02).
- Minor quirk: Some hosting platforms, such as Groq (“Grok with a Q”), throttle or limit Kimi K2’s full capabilities, leading to context truncation or unreliable results.
- Cost and availability:
- Open source, though hosting is expensive and dependent on third-party GPU platforms.
- Already offered by multiple providers; hosts note impressive rapid adoption.
- Disruption to industry:
- Kimi K2 highlights vulnerability of labs reliant on proprietary models; rapid catch-up from China pressures US/Euro labs.
- OpenAI appears to have delayed its open-weight model in response: “100% confirmation that OpenAI open source model release was delayed because of Kimi K2. Translation: our model sucks, gets badly beaten by Kimi K2, need to train a better one.” — Michael (32:21)
- Memorable moments:
- Kimi K2 writes AI jokes roasting Elon Musk’s ego (25:19).
- AI-generated K-pop ballad for Kimi K2 (musical outro).
- "So fine, you blow my MCP mind..." (85:45)
- Business & practical context:
- Reality check for anyone signing long-term contracts with single model vendors—pace of open-source innovation is outstripping incumbents (33:30).
3. OpenAI’s Agent Mode: More Hype than Help?
[Timestamps: 35:33–73:27]
- Agent Mode debut:
- OpenAI presents their AI agent system that proactively chooses tools to complete tasks.
- Demos include event planning (finding hotels, gifts, outfits) and creating PowerPoint slides.
- Host skepticism: Demos skew towards consumer fluff (“How do I dress for a wedding?”) rather than real business use.
- “All their presentations...out of touch—every billion-dollar example.” — Michael (39:02)
- Critical Analysis of Agent Mode:
- Fails at its own demo—can’t find registry gifts, meanders for nearly 40 minutes creating a poor slideshow.
- “If this was an employee, I’d fire them." — Michael (44:00)
- Comparison to Manus: Manus accomplishes same business tasks in a fraction of the time, with better output (45:00).
- Spreadsheet/report tasks take much too long, cost too much in tokens, and can actually be performed faster (and sometimes better) by one-shotting a good model with the right prompt, or by using manual tool-chaining (47:24–53:59).
- Agentic workflow debate:
- Hosts argue that agent frameworks (whether from OpenAI or Manus) mostly add overhead, technical complexity, and unneeded steps.
- MCPs (Model Context Protocol servers) with tailored tools and one-shot prompts are seen as a better near-term fix (51:39, 64:47).
- “You can achieve nearly all these examples with dedicated MCPs… you don’t need an agent spending 27 minutes on a task that’s better done instantly." — Michael (62:49)
- A default model’s built-in agentic flow (its “internal clock”) is sufficient for most professional use right now.
- Real-world decision factors:
- Expense makes business-wide rollout of Agent Mode unrealistic.
- Token-burn inefficiency is noted as a practical block for organizations large and small.
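The token-burn argument above can be made concrete with a rough sketch. It compares input-token totals for a multi-step agent loop against a single one-shot call. The cost model is an assumption for illustration only: every agent step re-sends the original prompt plus all earlier step outputs, while the one-shot call sends the prompt and all tool results once. The function names and the numbers are hypothetical and are not taken from the episode or from any vendor's billing.

```python
def agent_loop_input_tokens(prompt_tokens: int, steps: int, added_per_step: int) -> int:
    """Total input tokens for an agent loop (hypothetical cost model).

    Assumes each step re-sends the full context so far, and that each step
    appends `added_per_step` tokens (tool output, reasoning) to the context.
    """
    total = 0
    context = prompt_tokens
    for _ in range(steps):
        total += context            # the whole context is sent again this step
        context += added_per_step   # this step's result joins the context
    return total


def one_shot_input_tokens(prompt_tokens: int, tool_results: int, tokens_per_result: int) -> int:
    """Input tokens when all tool results are gathered first and sent in one call."""
    return prompt_tokens + tool_results * tokens_per_result


if __name__ == "__main__":
    # Hypothetical task: 2,000-token prompt, 10 tool results of 1,500 tokens each.
    agent = agent_loop_input_tokens(2_000, 10, 1_500)
    single = one_shot_input_tokens(2_000, 10, 1_500)
    print(agent, single, round(agent / single, 2))
```

With these made-up numbers the loop consumes about five times the input tokens of the one-shot call, and the gap widens as the step count grows, since the re-sent context grows with every step.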
4. AI Agency, Model Evolution & The Realities of Deployment
[Timestamps: ~73:27–80:00+]
- Autonomy and trust:
- Businesses are unlikely to let AIs run wild, installing libraries on cloud computers, due to unpredictability and risk (77:56).
- Automating narrow, task-specific use cases is seen as far more valuable and practical than all-purpose generalist agents (80:00).
- MCP tools and “internal clock”:
- The winning approach is a human-in-the-loop, tool-rich workflow in which the AI’s agentic abilities coordinate existing reliable processes rather than attempting everything cold.
- “We already, for the times we live in now, we have this structure… the MCP structure works best with the internal clock or the nature of these models right now.” — Michael (76:22)
- On the recurring “cloning” of features:
- Nearly every idea in AI lately is just a riff or clone of some other tool or feature—agent mode copied from Cursor, etc.
- Industry update (the Windsurf shuffle):
- Frenzied M&A activity, with OpenAI and Google (via “acqui-hires”) snatching up key staff/IP. Cognition/Devin also claims a piece.
- End result: Windsurf is available again, with more options, which makes the AI coding tool space even muddier.
5. Closing Thoughts: The ‘Most Average’ AI Takeaway
[Timestamps: 84:55–88:24+, incl. musical outro]
- Best new model: Kimi K2, admired by both hosts as fast, reliable, an excellent tool-caller, and open.
- Grok 4 falls flat; OpenAI Agent Mode is demoware.
- Industry is shifting rapidly: Open source from China threatens US lab moats; picking a single model for your stack is increasingly risky.
- Laughs, songs, and AI roasting Elon cap a boisterous, skeptical, but excited hour.
Segmented Timestamps for Key Topics
- [01:08] Grok 4 launch and model analysis
- [05:54] Grok 4’s pedestrian, censored responses
- [10:09] Grok 4 pricing and context
- [12:24] Grok personas/avatars and AI girlfriend talk
- [17:37] Agentic “boom” factor and direct comparison to Kimi K2
- [19:13] Kimi K2 deep dive: tool-calling and reliability
- [25:19] Kimi’s AI-generated Elon joke and song
- [31:19] OpenAI’s open-source response, delay speculation
- [35:33] OpenAI Agent Mode: overview and criticism
- [44:00] Hands-on Agent Mode: slow, lackluster results
- [47:24] One-shot vs. agentic workflows in practice
- [51:39] MCP tool-calling as the better current solution
- [77:56] Enterprise applications: limits of agentic autonomy
- [84:55] Final model verdicts
- [85:45] AI-written “Kimi K2” song
Summary Table: Model Rankings by the Hosts
| Model | Pros | Cons | Verdict |
|-------|------|------|---------|
| Grok 4 | Fast, uncensored, research-focused | Shocking outputs, mid answers, hype | "Terrible... straight disappointment" |
| Kimi K2 | Open source, fast, tool-calling king | Hosting cost, context quirks | "Amazing... easily daily driver" |
| Sonnet 4 | Best for agentic flows (internal clock) | Slightly slow | "Supreme at agentic tasks" |
| Gemini 2.5 | In a class above most daily models | N/A | "Untouchable right now" |
| OpenAI Agent Mode | Beautiful UI, ambitious vision | Not useful, slow, pricey | "Demoware: not ready for business" |
Conclusion & Takeaways
- Ignore the hype and pick what works: Models need to prove their value in actual day-to-day grind, not just on benchmarks or sci-fi promises.
- Agentic workflows need refinement: One-shot, tool-enabled workflows remain the best fit for most users—broad agent autonomy is more fantasy than reality for now.
- Open source is a game-changer: Kimi K2 demonstrates just how fast the model landscape can change.
- Business leaders should be wary: Don’t lock into long-term model contracts—the “best” AI model today may be old news tomorrow.
Musical Finale:
An AI-generated song for Kimi K2 celebrates the open-source revolution, capping a fun, skeptical, and highly average (in the best way) AI podcast experience.
