Latent Space: The AI Engineer Podcast
[AIEWF Preview] Gemini in 2025 and Realtime Voice AI
Date: June 2, 2025
Participants:
- swyx (Host, Latent Space)
- Sam
- Logan (AI Studio at Google; works on Gemini)
- Shrestha (Lead PM, Gemini API team)
- Quinn (Founder & CEO, Daily; organizer, Voice AI meetup SF)
Main Theme
This episode previews the latest developments in the Gemini foundation model—particularly around its Live API, real-time voice and audio/video capabilities, and infrastructure changes following Google I/O 2025. It highlights philosophical and technical challenges in building multimodal, generative, and real-time applications, with key takeaways from Google’s AI team and leading voice AI practitioners.
Key Discussion Points & Insights
1. Google I/O 2025 Announcements: Developer-Centric Upgrades
- Thinking Budgets in Gemini 2.5 Pro & 2.5 Flash
- Lets developers cap or fully disable the model's "thinking" phase for finer cost and latency control (combined with thought summaries in the sketch after this list).
- Logan: “You’ll be able to disable thinking as well. So if you just want 2.5 pro as like a raw non-reasoning model… we’ll have that hopefully in early June.” [01:19]
- Thought Summaries
- New feature to present developers with concise summaries of the model’s internal steps.
- Logan: “We have thought summaries right now as a sort of step in that direction. It’ll be really interesting to find out...what are things that work with thought summaries, what are the things that don’t?” [01:19-02:18]
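Both features are knobs on the same request config. A minimal sketch of how they combine, assuming the google-genai Python SDK; the field names (`thinking_budget`, `include_thoughts`) match current docs but may shift by GA:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the tradeoffs of context caching.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,   # cap thinking tokens; 0 disables thinking entirely
            include_thoughts=True,  # surface thought summaries in the response
        ),
    ),
)

# Thought summaries come back as content parts flagged with `thought=True`.
for part in response.candidates[0].content.parts:
    prefix = "[thought summary] " if part.thought else ""
    print(prefix + (part.text or ""))
```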
2. Native Audio Output & Multilingual Support
- Shrestha’s Highlight: Native audio output, with strong multilingual (and even Klingon!) capabilities.
- Shrestha: “Just being able to...switch into and out of Bengali and English, that’s been special.” [02:52]
- Demo cited:
- Matt Velloso (Google) demoed the native audio model speaking Klingon [02:52].
- URL Context Tool:
- Retrieves focused information from specific web pages, letting developers build research agents while respecting publishers (see the sketch after this list).
- Shrestha: “You can use it by yourself or pair it with search to retrieve more in-depth information from web pages…” [03:11]
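A sketch of the URL Context tool, again assuming the google-genai Python SDK; the example URL is illustrative, and pairing with Google Search is optional, per Shrestha's description:

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=(
        "Using the page at https://example.com/pricing, "
        "summarize the listed plans."
    ),
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(url_context=types.UrlContext()),      # fetch the cited URL(s)
            types.Tool(google_search=types.GoogleSearch()),  # optional: pair with search
        ],
    ),
)
print(response.text)
```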
3. Implicit Context Caching
- Developers now get out-of-the-box context caching for chat-like use cases, cutting cost with no code changes; explicit caching remains available when you want direct control (sketched below).
- Logan: “You don’t have to do anything, it just works right now and you’re saving money.” [04:03]
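Implicit caching needs no code at all, which is the point of Logan's quote. For contrast, a sketch of the explicit path for workloads with a large shared prefix, assuming the google-genai Python SDK; the placeholder document, model ID, and TTL are illustrative (real explicit caches also have a minimum token size):

```python
from google import genai
from google.genai import types

client = genai.Client()

# Imagine a long shared prefix (a manual, a codebase, a transcript...).
big_manual_text = "..."  # placeholder for a large document

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached manual.",
        contents=[big_manual_text],
        ttl="3600s",  # keep the cached prefix alive for an hour
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does the manual say about warranty claims?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```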
4. Gemini Diffusion: New Frontiers for UI Generation
- AI-Generated UI:
- Gemini Diffusion paves the way for “generative UIs”—dynamic, on-the-fly user interface creation via code.
- Logan: “You have no precompiled notion of what your website is and as a user goes through, as they click buttons...it just makes that UI for you.” [06:15]
- Still Some Production Challenges: Model quality, speed, and token generation rates remain areas to develop further.
5. Live API: Audio, Video, and Real-Time Agent Workflows
- Transcription as Foundational Use Case: Huge developer demand, which fueled early API features. [07:24]
- Session Length & Tool Calls:
- Developers want longer sessions and smoother chaining of tools/functions (e.g., code execution or search within a voice/video flow); a minimal session sketch follows this list.
- Shrestha: “You could do like 15 to 20 minutes of audio...about five minutes of video” at launch; now there are more parameters to tweak session length and resolution. [07:40]
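A minimal Live API session sketch, assuming the google-genai Python SDK's async client; the model ID and config fields follow current docs and may differ from what you have access to:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

async def main():
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction="You are a concise voice assistant.",
    )
    async with client.aio.live.connect(
        model="gemini-2.5-flash-preview-native-audio-dialog",
        config=config,
    ) as session:
        # Send one text turn; a real app streams mic audio via send_realtime_input.
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Hello!")]),
            turn_complete=True,
        )
        async for message in session.receive():
            if message.data:  # raw audio bytes from the model
                pass          # e.g., write to an output audio stream

asyncio.run(main())
```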
Challenges and Key Learnings
- Commitment to Provider:
- Switching between different providers for live API is complex; code and workflows are not easily portable.
- Logan: “It’s not easily interoperable between different model providers...it’s a different level of commitment.” [08:48]
- Complex Workflows:
- Use cases like gaming agents or support bots require frequent dynamic changes to system instructions within long-running sessions. [10:13]
6. Componentized vs. Unified Architectures
- Dual track: shipping specialized (componentized) models for near-term needs while ultimately aiming for a unified, multimodal Gemini model.
- Logan: “We want to make one model, and it’s the Gemini model...not have splintering of all these different capabilities.” [12:04]
7. Community & Partnerships: Daily and Pipecat
- Quinn (Daily) shares a practitioner’s perspective on open-source voice AI frameworks (e.g., Pipecat), the impact of the Google partnership, and why infrastructure matters (see the pipeline sketch after this list).
- Quinn: “What we see is that the shape of building these real time voice agents is a different set of developer problems than non real time or text mode things.” [18:25]
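Quinn's point about the different "shape" of the problem maps to the frame-pipeline pattern Pipecat uses: audio frames flow through a chain of processors. A sketch assuming pipecat-ai's core primitives (Pipeline, PipelineTask, PipelineRunner); the concrete transport/STT/LLM/TTS services are deliberately elided, since their import paths and constructor arguments vary across Pipecat versions:

```python
import asyncio
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def run_voice_agent(transport, stt, llm, tts):
    # Frames flow top to bottom: mic audio -> transcription -> LLM -> speech.
    pipeline = Pipeline([
        transport.input(),   # audio frames from the user (e.g., over WebRTC)
        stt,                 # speech-to-text processor
        llm,                 # LLM turn handling / response generation
        tts,                 # text-to-speech processor
        transport.output(),  # synthesized audio back to the user
    ])
    await PipelineRunner().run(PipelineTask(pipeline))

# asyncio.run(run_voice_agent(transport, stt, llm, tts))
# once the transport and services are constructed for your provider.
```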
Notable Quotes & Memorable Moments
- On Thought Summaries and Thinking Budgets:
- Logan: “I think developers say they want full thoughts...thought summaries are live now. Thinking budget for 2.5 Pro will land with the GA model in a couple of weeks.” [01:19]
- On Developer Control:
- Shrestha: “We want to give developers as much control as they can on top of models.” [02:18]
- Native Audio, Multilingual Demo:
- Shrestha: “...speaks Klingon even though that's not an officially supported language.” [02:52]
- On Caching Complexity:
- Logan: “There’s a tradeoff...latency...cost...how much stuff do you want to cache altogether?” [04:50]
- On Generative UIs:
- Logan: “This is the way generative UIs happen...build the UI on the fly using code based on what a user does.” [06:15]
- On Voice Infrastructure:
- Shrestha: “It is really, really hard to bring all these components together and still get latency down to where it needs to be in the 500 to 700 millisecond range.” [17:30]
- Proactive Audio Feature (config sketch after this list):
- Shrestha: “What this feature does is it’s trained not to respond to irrelevant audio...semantic voice activity detection.” [20:19]
- Real-Time Speaker Diarization:
- Quinn/Shrestha: Discuss emerging ability to recognize and distinguish speakers by voice, even if not officially supported yet. [21:03-21:49]
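For reference, enabling the Proactive Audio behavior Shrestha describes is a one-line addition to the Live API config from the earlier sketch; this assumes the google-genai Python SDK, and the `proactivity` field name follows current docs but may change:

```python
from google import genai
from google.genai import types

client = genai.Client()

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # The model is trained to stay quiet on audio it judges irrelevant.
    proactivity=types.ProactivityConfig(proactive_audio=True),
)

# Used exactly like the Live API sketch above:
# async with client.aio.live.connect(
#     model="gemini-2.5-flash-preview-native-audio-dialog", config=config
# ) as session:
#     ...
```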
Timestamps for Key Segments
- [01:13] Recap of Google I/O—Highlights from Logan and Shrestha
- [02:52] Native Audio Output, multilingual, Klingon demo
- [03:11] URL Context Tool for research agents
- [04:03] Implicit caching (major API upgrade)
- [06:00] Gemini Diffusion & Generative UIs
- [07:24] Audio, video, transcription—Live API developer needs
- [08:48] Infrastructure choices: challenges of Live API adoption
- [10:13] Examples of complex workflow integration for AI agents
- [12:04] Vision for a unified Gemini model
- [15:07] Quinn/Daily partnership on voice orchestration
- [16:30] Evolution toward audio-to-audio architecture in Live API
- [17:30] Voice activity detection innovation and challenges
- [18:25] Developer experience differences: real-time voice vs. text
- [19:16] Discussion of WebSockets vs. WebRTC for low latency networking
- [20:19] Proactive Audio—AI ignores irrelevant input
- [21:03] Multi-speaker recognition in real time
- [21:57] Asynchronous function calling in cascaded architectures
- [22:48] Closing thoughts and Gemini “wish list” for next year
Tone & Atmosphere
- Candid, technical, and forward-looking: The conversation offers both behind-the-scenes context and thoughtful speculation about AI engineering’s near future, mixing developer concerns with tangible product updates and expert forecasts.
- Community-oriented: Shout-outs and recognition for active community builders underscore the collaborative approach of the AI engineering space.
Closing
The episode wraps up with each guest sharing a wish for Gemini’s future—ranging from massive language expansion to the aspiration for a single, universal, multimodal model capable of everything. The discussion hints at rapid, ongoing innovation, emphasizes the complexity and opportunity in voice and real-time AI, and underlines how community-driven development and partnerships accelerate progress.
For deeper dives or reference materials, listeners are pointed to latent.space.