[AIEWF Preview] Gemini in 2025 and Realtime Voice AI

Latent Space: The AI Engineer Podcast

Date: June 2, 2025
Participants:

swyx (Host, Latent Space)
Sam
Logan (AI Studio at Google; works on Gemini)
Shrestha (Lead PM, Gemini API team)
Quinn (Founder & CEO, Daily; organizer, Voice AI meetup SF)

Main Theme

This episode previews the latest developments in the Gemini foundation model—particularly around its Live API, real-time voice and audio/video capabilities, and infrastructure changes following Google I/O 2025. It highlights philosophical and technical challenges in building multimodal, generative, and real-time applications, with key takeaways from Google’s AI team and leading voice AI practitioners.

Key Discussion Points & Insights

1. Google I/O 2025 Announcements: Developer-Centric Upgrades

Thinking Budgets in Gemini 2.5 Pro & 2.5 Flash
- Lets developers set a thinking budget, or disable "thinking" entirely, to control cost and latency.
- Logan: “You’ll be able to disable thinking as well. So if you just want 2.5 pro as like a raw non-reasoning model… we’ll have that hopefully in early June.” [01:19]
Thought Summaries
- New feature to present developers with concise summaries of the model’s internal steps.
- Logan: “We have thought summaries right now as a sort of step in that direction. It'll be really interesting to find out...what are things that work with thought summaries, what are the things that don’t?” [01:19-02:18]
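
A minimal sketch of how the two controls above might appear in a request configuration. The field names thinkingConfig, thinkingBudget, and includeThoughts, and the idea that a budget of 0 turns thinking off, are assumptions made for illustration and should be checked against the current Gemini API documentation. The helper build_generation_config is hypothetical and makes no network call.

```python
import json


def build_generation_config(thinking_budget=None, include_thoughts=False):
    """Build a config fragment with optional thinking controls.

    Field names are assumed for illustration, not confirmed by the episode.
    """
    thinking = {}
    if thinking_budget is not None:
        # Assumption: a budget of 0 means "no thinking" (lowest cost/latency).
        thinking["thinkingBudget"] = thinking_budget
    if include_thoughts:
        # Assumption: this flag asks for thought summaries with the answer.
        thinking["includeThoughts"] = True
    # Omit the section entirely when no thinking option was requested.
    return {"thinkingConfig": thinking} if thinking else {}


fast = build_generation_config(thinking_budget=0)
careful = build_generation_config(thinking_budget=2048, include_thoughts=True)
print(json.dumps(fast))
print(json.dumps(careful))
```

The point of the sketch is the trade-off Logan describes: the same model can be configured per request for a cheap, fast answer or for a slower, more deliberate one.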

2. Native Audio Output & Multilingual Support

Shrestha’s Highlight: Native audio output, with strong multilingual (and even Klingon!) capabilities.
- Shrestha: “Just being able to...switch into and out of Bengali and English, that’s been special.” [02:52]
Demo cited:
- Matt Veloso (Google) demoed the audio model speaking Klingon. [02:52]
URL Context Tool:
- Retrieves focused information from web pages, enabling research agents while respecting publishers.
- Shrestha: “You can use it by yourself or pair it with search to retrieve more in-depth information from web pages…” [03:11]

3. Implicit Context Caching

Developers now benefit from out-of-the-box context caching for chat-like use cases, helping with costs and efficiency.
- Logan: “You don’t have to do anything, it just works right now and you’re saving money.” [04:03]
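
The episode does not describe how implicit caching is implemented. As a toy model, assume that savings come from the leading part of a request that repeats the previous request unchanged. Under that assumption, the sketch below estimates how much of a chat history is reusable, and shows why stable content (such as system instructions) is best placed first. All names and the stand-in "tokens" are hypothetical.

```python
def shared_prefix_len(previous, current):
    """Count the leading tokens two requests have in common."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


def estimate_cache_hit_ratio(requests):
    """Fraction of input tokens that repeat the previous request's prefix."""
    total = sum(len(r) for r in requests[1:])
    if total == 0:
        return 0.0
    reused = sum(shared_prefix_len(p, c) for p, c in zip(requests, requests[1:]))
    return reused / total


system = ["sys"] * 8  # stable instructions, represented by stand-in tokens
turn1 = system + ["u1"]
turn2 = system + ["u1", "a1", "u2"]
turn3 = system + ["u1", "a1", "u2", "a2", "u3"]
chat_ratio = estimate_cache_hit_ratio([turn1, turn2, turn3])

# Putting a changing value (for example a timestamp) first breaks the prefix.
volatile_ratio = estimate_cache_hit_ratio([["t1"] + system, ["t2"] + system])
print(round(chat_ratio, 2), volatile_ratio)
```

In an append-only chat, most of each request repeats the one before it, which is the "chat-like use case" the hosts say benefits without any developer action.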

4. Gemini Diffusion: New Frontiers for UI Generation

AI-Generated UI:
- Gemini Diffusion paves the way for “generative UIs”—dynamic, on-the-fly user interface creation via code.
- Logan: “You have no precompiled notion of what your website is and as a user goes through, as they click buttons...it just makes that UI for you.” [06:15]
Still Some Production Challenges: Model quality, speed, and token generation rates remain areas to develop further.

5. Live API: Audio, Video, and Real-Time Agent Workflows

Transcription as Foundational Use Case: Strong developer demand for transcription drove early API features. [07:24]
Session Length & Tool Calls:
- Developers want longer sessions and smoother chaining of tools/functions (e.g., code exec or search within a voice/video flow).
- Shrestha: at launch, “You could do like 15 to 20 minutes of audio, ...about five minutes of video”; developers now have more parameters to adjust session length and resolution. [07:40]

Challenges and Key Learnings

Commitment to Provider:
- Switching between different providers for live API is complex; code and workflows are not easily portable.
- Logan: “It’s not easily interoperable between different model providers...it’s a different level of commitment.” [08:48]
Complex Workflows:
- Use cases like gaming agents or support bots require frequent dynamic changes to system instructions within long-running sessions. [10:13]
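
One common way to soften both problems above is a thin adapter layer: the application talks to a small interface of its own, and each provider gets its own implementation behind it. The sketch below is hypothetical throughout. RealtimeSession, FakeSession, STAGE_PROMPTS, and enter_stage are illustrative names, not part of any provider's SDK, and the in-memory FakeSession stands in for a real live connection.

```python
from dataclasses import dataclass, field
from typing import Protocol


class RealtimeSession(Protocol):
    """Hypothetical minimal surface the app needs from any live provider."""

    def update_instructions(self, text: str) -> None: ...

    def send_text(self, text: str) -> str: ...


@dataclass
class FakeSession:
    """In-memory stand-in used here instead of a real provider connection."""

    instructions: str = ""
    log: list = field(default_factory=list)

    def update_instructions(self, text: str) -> None:
        self.instructions = text
        self.log.append(("instructions", text))

    def send_text(self, text: str) -> str:
        self.log.append(("user", text))
        return f"[{self.instructions}] reply to: {text}"


# Per-stage system instructions for a long-running support call.
STAGE_PROMPTS = {
    "greeting": "Greet the caller and ask what they need.",
    "billing": "Help with billing questions only.",
}


def enter_stage(session: RealtimeSession, stage: str) -> None:
    """Swap system instructions mid-session without reconnecting."""
    session.update_instructions(STAGE_PROMPTS[stage])


session = FakeSession()
enter_stage(session, "greeting")
first = session.send_text("hi")
enter_stage(session, "billing")
second = session.send_text("I was charged twice")
```

An adapter like this does not remove the commitment Logan describes, since audio formats, turn-taking, and tool calling still differ by provider, but it keeps those differences out of the application's workflow logic.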

6. Componentized vs. Unified Architectures

Dual track: shipping specialized (componentized) models for near-term needs while ultimately aiming for a unified, multimodal Gemini model.
- Logan: “We want to make one model, and it’s the Gemini model...not have splintering of all these different capabilities.” [12:04]

7. Community & Partnerships: Daily and Pipecat

Quinn (Daily): Sharing perspective on open-source frameworks for voice AI (e.g. Pipecat), partnership impact, and the importance of infrastructure.
- Quinn: “What we see is that the shape of building these real time voice agents is a different set of developer problems than non real time or text mode things.” [18:25]

Notable Quotes & Memorable Moments

On Thought Summaries and Thinking Budgets:
- Logan: “I think developers say they want full thoughts...thought summaries are live now. Thinking budget for 2.5 Pro will land with the GA model in a couple of weeks.” [01:19]
On Developer Control:
- Shrestha: “We want to give developers as much control as they can on top of models.” [02:18]
Native Audio, Multilingual Demo:
- Shrestha: “...speaks Klingon even though that's not an officially supported language.” [02:52]
On Caching Complexity:
- Logan: “There’s a tradeoff...latency...cost...how much stuff do you want to cache altogether?” [04:50]
On Generative UIs:
- Logan: “This is the way generative UIs happen...build the UI on the fly using code based on what a user does.” [06:15]
On Voice Infrastructure:
- Shrestha: “It is really, really hard to bring all these components together and still get latency down to where it needs to be in the 500 to 700 millisecond range.” [17:30]
Proactive Audio Feature:
- Shrestha: “What this feature does is it’s trained not to respond to irrelevant audio...semantic voice activity detection.” [20:19]
Real-Time Speaker Diarization:
- Quinn/Shrestha: Discuss emerging ability to recognize and distinguish speakers by voice, even if not officially supported yet. [21:03-21:49]
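
To make the 500 to 700 millisecond target from the voice infrastructure quote concrete, here is a small latency-budget sketch for a cascaded pipeline. Only the target range comes from the episode. The stage names and per-stage numbers are hypothetical placeholders, not measurements.

```python
BUDGET_MS = (500, 700)  # target range quoted in the episode

# Hypothetical per-stage latencies in milliseconds; not measured values.
cascaded = {
    "network": 60,
    "speech_to_text": 150,
    "llm_first_token": 250,
    "text_to_speech": 120,
}


def total_latency(stages):
    """Sum the per-stage latencies of a sequential pipeline."""
    return sum(stages.values())


def within_budget(stages, budget=BUDGET_MS):
    """True when the pipeline total does not exceed the upper bound."""
    return total_latency(stages) <= budget[1]


def slowest_stage(stages):
    """Name of the stage contributing the most latency."""
    return max(stages, key=stages.get)


print(total_latency(cascaded), within_budget(cascaded), slowest_stage(cascaded))
```

Because the stages run in sequence, every component added to the chain spends part of the same budget, which is one reason the guests discuss moving toward audio-to-audio architectures.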

Timestamps for Key Segments

[01:13] Recap of Google I/O—Highlights from Logan and Shrestha
[02:52] Native Audio Output, multilingual, Klingon demo
[03:11] URL Context Tool for research agents
[04:03] Implicit caching (major API upgrade)
[06:00] Gemini Diffusion & Generative UIs
[07:24] Audio, video, transcription—Live API developer needs
[08:48] Infrastructure choices: challenges of Live API adoption
[10:13] Examples of complex workflow integration for AI agents
[12:04] Vision for a unified Gemini model
[15:07] Quinn/Daily partnership on voice orchestration
[16:30] Evolution toward audio-to-audio architecture in Live API
[17:30] Voice activity detection innovation and challenges
[18:25] Developer experience differences: real-time voice vs. text
[19:16] Discussion of WebSockets vs. WebRTC for low latency networking
[20:19] Proactive Audio—AI ignores irrelevant input
[21:03] Multi-speaker recognition in real time
[21:57] Asynchronous function calling in cascaded architectures
[22:48] Closing thoughts and Gemini “wish list” for next year

Tone & Atmosphere

Candid, technical, and forward-looking: The conversation offers both behind-the-scenes context and thoughtful speculation about AI engineering’s near future, mixing developer concerns with tangible product updates and expert forecasts.
Community-oriented: Shout-outs and recognition for active community builders underscore the collaborative approach of the AI engineering space.

Closing

The episode wraps up with each guest sharing a wish for Gemini’s future—ranging from massive language expansion to the aspiration for a single, universal, multimodal model capable of everything. The discussion hints at rapid, ongoing innovation, emphasizes the complexity and opportunity in voice and real-time AI, and underlines how community-driven development and partnerships accelerate progress.

For deeper dives or reference materials, listeners are pointed to latent.space.
