![[AIEWF Preview] Gemini in 2025 and Realtime Voice AI — Latent Space: The AI Engineer Podcast cover](https://substackcdn.com/feed/podcast/1084089/post/186632795/8b00666435ec24a4c450f3749c1c8186.jpg)
Loading summary
Shrestha
Foreign.
Sam
Welcome to another episode of the Twiml.
Swix
AI Podcast and I'm Swix. This is a special episode of the Latent Space Pod with Twaimo at Google. I o welcome.
Logan
Thanks for, thanks for being here. Thanks for hanging out with us. I'm excited.
Swix
Logan, you were our first guest. You came back remotely a few months ago and now you're back. You're sort of the a lot of the face of the AI studio basically that a lot of people are using. I'm using it and I think it's a really welcome change for people being more accessible with the rest of the Google suite. And shrestha you've been I actually don't super know your role. I just generally have you pegged as PM of the API team with a particular focus on Live.
Logan
Stressa runs the show behind the scenes. Behind the scenes is the latest public face running the show. Behind the scenes model launches the live API. Generally all the stuff that's happening in the API is stress is hard work.
Shrestha
So thank you for that Logan. But I think everyone knows who really runs the show. There's public evidence there. But yeah, I work with Logan and a few other excellent PMs, but I lead the API side of the house.
Swix
There's a lot of announcements. I think a lot of people have done their recaps. What are you guys personal highlights over.
Logan
I O I'll break the rule and I'll give two that are not the sort of big, big flashy ones. I think the two that I think developers are going to be super excited about one thinking budgets coming to 2.5 pro. So and you can also you'll be able to disable thinking as well. So if you just want 2.5 pro as like a raw non reasoning model, we'll have that hopefully in early June. And then thought summaries. So we've had this debate internally about like do we need to show full thoughts? Do developers want full thoughts? I think developers say they want full thoughts. We have thought summarize right now as a sort of step in that direction. It'll be really interesting to find out and get the feedback around. Like what are things that work that work with thought summaries? What are the things that don't work with thought summaries? I was reading some threads last night about like thought summaries are now live in cursor as well and people were sort of reacting to, you know, having summaries versus not full thoughts. So it'll be interesting to see. But I'm excited for Both of those things. Thought summaries are live now. Thinking budget for 2.5 Pro will land with the GA model in a couple of weeks.
Shrestha
Yeah. And I should say we already do have thinking budgets in 2.5 flash. I do think, you know, with all of the features that we are releasing on top of our thinking models, summaries, budgets, I think this is our way of, you know, you have the models, but then we want to give developers as much control as they can on top of models. But coming back to your question about my favorite feature, it's really hard to pick because like all of these features we've been trying to push out for weeks. But I think native audio output.
Swix
I was just saying that with Quinn.
Shrestha
Yeah, yeah, it's a personal highlight. I actually, Quinn and I been playing with it together for a bit as well. I think especially with all the. Obviously the voices sound great. The fact that it can switch in and out of languages. So Matt Veloso, our boss, actually has a, has a demo on Twitter where it actually speaks Klingon, even though that's not an officially supported language. But you know, I speak Bengali. Just being able to. For it to switch into and out of Bengali and English, that's been special. And then if I get to pick another one, it's. I'd say we released a new tool called URL Context and the idea is that you can use it by yourself or pair it with search to retrieve more in depth information from web pages in a way that's respectful of our publisher ecosystem, of course. And I think that this will unlock new use cases like if people want to build their own version of a research agent, which, which is something developers ask us for a lot.
Swix
Yeah.
Sam
It's worth mentioning that just prior to IO there was a ton of new, interesting new capability, including the Update to Gemini 2.5 Pro, as well as the implicit context caching, which I know a lot of folks are waiting for.
Logan
We made implicit caching happen. I think there was lots of feedback that people are like, explicit caching is nice. There's definitely use cases where it makes sense, but people want implicit caching. So I'm happy passing the cost saving on to developers. You don't have to do anything, it just works right now and you're saving money. It's a great outcome.
Swix
I don't want to manage that myself.
Logan
Yeah, there are, there are people like, I think if you. There's so many use cases where like you're just doing chat on the same stuff over and over again and for those use cases, you know, you want to be able to explicitly cash the thing and make sure and guarantee your cash it so that you save money. So I'm happy we have that.
Swix
Is there any behind the scenes of like, what makes caching hard or anything that people don't appreciate about caching as a general concept? I think this is a very important pricing paradigm that people need to really get behind.
Logan
Yeah, that's a good question. I think there's a trade off between like all of the dimensions of caching, which is around like the sort of latency because in some cases you're getting latency gains. In other cases it's like, how much, you know, what's the cost for Google? How much stuff do you want to cache altogether? So we could have an entire episode and get a bunch of the caching people. Yeah, it's like a good example of like an infrastructure problem to be solved and a bunch of the folks who, who we work with love working on this problem. So we should do a deep dive episode.
Swix
Yeah. I want to shout out that you've been doing more video stuff. You have your own podcast as part of your Gemini work. You've also been doing video with people on the team who done the work.
Logan
We had the long context episode.
Swix
That's exactly. You did the long context one. People loved it.
Logan
Your reception was very, was very positive about the long context one. So thank you. That was the first time that we did like a more deep technical discussion with folks on the team and Nicolay is awesome. And we actually just did one with, we did one with Shrestha about the live API, which I'm excited about. We did one with folks on the team about the multimodal capabilities in Gemini. We're going to do a pre training one, hopefully, which will be really cool. We've got a bunch of people who are excited to talk about that. So there's a bunch of them in the works and it's, it's fun to make them happen and have those conversations.
Swix
Yeah. My underrated pick is Gemini diffusion.
Shrestha
Yes.
Logan
Yeah, yeah, yeah.
Shrestha
It's not underrated for all the love it's getting. Yeah.
Swix
So like apart from speed, I wonder like what the potential results of a diffusion language model could be.
Logan
Generative UI. Generative UI. This is the way the generative UIs happen is through, through this experience, the UI bit just like being able to like say I want, you know, build the UI on the fly using code based on what a user does. So like you have no pre compiled notion of what your website is and as a user goes through, as they click buttons, thousand tokens generate and it just makes that UI for you interesting. I think that's going to be possible. I mean, I think there's a lot of work to productionize, make Gemini Diffusion like actually a high quality model that meets the bar for us to bring to the world more generally. But I do think that's going to be the killer use case will be like this generative UI experience that doesn't exist today because the models just take too long to generate tokens.
Sam
Yeah, for me it was really the role that audio and video are taking throughout a bunch of independent product releases. From the generative models to the Live API to the on the fly transcription and translation, it's I think kind of foreshadowing the role that that's going to play in a lot of developer applications.
Shrestha
Yeah, Transcription actually even before we released native audio. Now of course you get text and audio interleaved in the output, but transcription used to be one of the biggest use cases we had on the Live API.
Sam
What are you seeing as the challenges for folks getting started with Live?
Shrestha
Yeah, that's a great question. I think firstly awareness, right? Like people knowing that we have a live. That's why we're doing this. Talking to you folks. I think some of the areas where. So we were actually the first to market with also video input. But one of the areas where we've been getting a lot of feedback is in session length. Anybody who's been trying to put this in production, like when we started, you could do like 15 to 20 minutes of audio, I'm sorry, and about five minutes of video. And so we've been putting in a lot of knobs for developers and we can talk about that more if you guys want for people to have a sliding window or decide what resolution they want to send video in. But you know, to basically increase the session length and then tool calls, that was another area where we used to get a lot of feedback. Again, we were very proud because we introduced tool chaining first so you could change search and code execution, do all kinds of analysis. But then we've had to do a lot of work in improving function calling, improving the performance of search and anyway, we continue to push on that.
Logan
I've got a quick one on this too, which is I think the level of commitment you need to make to the model provider in the world of the Live API. Like I, I do think for developers is a higher bar. If you look at like what does chat completions or like what does for us generate content, provide from just like a, a text modality perspective. It's like, it's a pretty lightweight thing. There's a lot of model providers that have that option. Like I could switch it to a different provider if I end up not liking some model provider, which I think is good for the ecosystem. I think if you look at a lot of the Live API infrastructure right now, like you really do need to commit that you're like gonna, you know, there's, it's not easily interoperable between different model providers. Like everyone's infrastructure is all bespoke and different. So like it is a, it's a different level of commitment that you need to have to like really bet your company or your business or your product on the Live API, which I do think is a challenge for developers to sort of make that level of commitment in this like fast moving AI world. But I think hopefully there'll be like some level of like similarity and you'll get some model agnostic infrastructure to help make that, you know, make developers feel a little bit, a little bit easier about being able to move between models potentially.
Shrestha
I could go on and on, but if you have say more complex workflows, then one of the things is being able to change the system instructions at every step of your workflow. And so yeah, so onboarding some of the more complex use cases with the Live API has been a work in progress as we've released like new features.
Swix
So what kind of complex workflows are we talking about?
Shrestha
You know, we have people who are building say gaming agents but like which have multi states, for example in them we have a lot, I mean this was a famous demo at Next, but we have folks who want to, you know, customer support agents. Of course, you know, they can, the sessions can last for hours. Right. Then there's a lot of use cases around people showing a certain screen.
Logan
This is the coolest use case, honestly.
Shrestha
Yeah, And I was referring to like the famous demo at Next where Shopify showed how to set up a DNS using cloud player. Right. So in certain cases, especially the longer your workflow runs, like you might have to go from one state to another state and might want to change their si. Or if you hand it from one agent to another agent, you might have to change the system instruction.
Sam
When you're thinking about building voice based applications is speech to text and then processing with a standard LLM, would you say that's like a precursor to the Live era or are these two distinct paths that are still viable and that you still see being viable going forward?
Shrestha
That's a tough question. And I'm still, right now we have both out. I do think perhaps eventually for most use cases, as these audio to audio architecture models get better, a lot of use cases will probably transition to that. But you know, when we talk to our developers, they still very much like those componentized, componentized components. So that's why we also put out two new text to speech models at I O. Not available through the Live API yet, but really high performing, controllable, promptable text to speech models.
Logan
I have an angle of an answer to this question which is I talked to Cora this morning, who's our boss's boss, The CTO at DeepMind and Corey had a, had a really interesting take which is just around like what makes one of the main things that makes what we're doing at Google with Gemini different than what a lot of the other labs are doing is like we're here to make one model and like that model is Gemini. And like I think, I think you do need to stress this point, like to make the capabilities work in some cases, like you do need to have these forks that like go off and make that capability and harden it and then find a way to bring it back into the mainline model. But like we want to make one model and it's the Gemini model and like not have the sort of splintering of all these different capabilities. And we've done a good job of I think thinking the reasoning stuff was like the best example of this. We had those, they were separate from the mainline Gemini model so that those teams, the research teams could go and hill climb and make progress and not need to be constrained about like how do we do this without having there be collateral damage on other capabilities like multimodal or something like that. But the teams went and did that and then they find a way to sort of bring the capabilities together. And oftentimes what you see is there's tension in bringing them together. But it's the really exciting thing is what happens when you bring the capabilities together. And like 2.5 pro with reasoning is a great example of this where like multimodal with video understanding ended up like having this huge, like it's having this beautiful moment. The model is like soda out of the box because of all the reasoning capabilities that were baked in. It wasn't because they like did a bunch of stuff to make video understanding really good. It was just like an artifact of bringing and merging those capabilities together. So I think that as like a North Star for Gemini models makes. Makes a ton of sense.
Shrestha
I agree with you and that's what I said. Right. Like, I think eventually a lot of use cases will end up on Gemini, will end up on natural voice. But I think in order to foster development, like we have these offshoots from time to right. We have our imagined models for image generation. Even though now another IO, well, slightly pre IO announcement, you can do interleave text and image within Gemini also. Right? And it unlocks.
Swix
Yeah, but those are different models, right?
Shrestha
Those are different models.
Swix
One is autoregressive, the other is diffusion.
Shrestha
The other is. That's what I'm saying. Right? But for a lot of image generation, image editing, high quality photorealistic use cases, developers are still using imagen. But then, you know, slowly but surely we're bringing those capabilities into Geminis.
Swix
Whoever's watching this, we had a mid IO switch because obviously there's a lot going on here.
Sam
It's not AI shape shifting.
Swix
I know, I know. But we also have Quinn actually who made this podcast happen. But you're a founder CEO of Daily. Welcome.
Quinn
I'm a big fan of all things voice and audio. It's fun to be here with you and with Shrestha.
Swix
Quinn actually runs the Voice AI meetup in San Francisco. You are basically consistently the leading community builder. You're very generous of your time and knowledge. I really appreciate that. And obviously also recently you started pipecat, which is this open source framework for voice orchestration, which has really great support.
Quinn
For all the Gemini models.
Swix
You wanted to say something about the relationship with Gemini and Daily.
Shrestha
I just wanted to say that it's been a very, very fruitful partnership with Daily. They've been our partners since the launch of the Live API and a lot of their feedback that they continuously has been, you know, instrumental to the success of the Live API. So both Daily and Live Kit are. We're partnered with them.
Swix
Quinn. I think, I think, you know, we had, we had a little bit of a prep for this. You also wanted to dive into a little bit on like the cascade of models in Gemini Live.
Quinn
I mean, I think stress has taken a really interesting approach designing these APIs. So you talked about components a little bit. You talked about how you want to be able to do things both in the Live API and in the more sort of traditional chat API. And you've got. Originally you designed the Live API to have Audio in, but then it's a separate text model, the Notebook, LM models, audio out. What was the sort of driver for that originally.
Shrestha
I mean that at the time was we wanted to hit a certain quality bar, a certain latency bar and you know, Notebook LM was already out and the TTS models that were powering notebook elements were very, very good. But we wanted an aspect of native. So it was native audio in but TTS out. And we still have that architecture available through the Live API. But then now we just released audio to audio architecture.
Quinn
I mean the infrastructure for this stuff is so interesting because you're always balancing latency cost, output quality. There's no free lunch.
Shrestha
Yeah. And other things like multilinguality. Coming back to your question earlier, Sam, we had a lot of users asking us for say better German language support or something, which hopefully now we've delivered on with these models.
Quinn
Yeah, but now you have audio to audio in the Live API as well.
Shrestha
In the Live API only is where we have the native audio output models.
Swix
Yeah.
Sam
Now continuing to pull on the component versus single model thread a little bit. When I think about voice, I think about it as being an area where to deliver solutions. You need to surround that strong model with a lot of voice specific infrastructure that is imagining challenging the scale. So Shrestha, can you talk a little bit about that and maybe we can have Quinn talk about that from his perspective.
Shrestha
So the first thing that comes to mind is of course the voice activity detection models that we have and we've done a lot of work like finessing that model server side, but we've also learned that we need to provide some knobs to developers. So now developers can actually tune the sensitivity on our voice activity detection model as well as, you know, how much of the prefix, like how much of a time duration at the beginning, at the start or stop of saying things. And we also have a mode where now you can, where you can disable our voice activity detection and bring your own. But I think the larger point that you're touching on Sam, that I do want to mention is it is really, really hard to bring all these components together and still get latency down to where it needs to be in the 500 to 700 millisecond range. It's one of the hardest things we've had to do with the Live API.
Quinn
What we see is that the shape of building these real time voice agents is a different set of developer problems in the shape of non real time or text mode things. One of the fun things about partnering with Stress and DeepMind is we work on this open source framework that people use to build these production voice systems. And so we try to solve problems at the framework level, like turn detection, like context management. As the models get better, as the use cases get more clear, some of those features migrate from the framework into the APIs, which makes life easier for developers. The use cases at the same time continue to broaden out and so there's more things for the framework to do. So we're sort of filling the top of the use cases, building blocks, developer experience funnel and pushing down as we all get better and we all figure out what this new world looks like.
Shrestha
And maybe this is also a good segue into web sockets versus WebRTC.
Quinn
GWYN yeah, you know, there's so much infrastructure like one for my whole career I've been building, you know, large scale, low latency network stuff. What we saw from my perspective when we started to see the possibilities of voice AI was you need this packet routing, like down underneath the inference layer. And there's like the AI inference stuff, but then there's the just how do you move the audio and increasingly video around the Internet? And so there's a whole new generation of developers who are interested in these networking protocols because voice AI and now real time video are so interesting, which is super fun for me because like, I've always thought moving packets around is one of the most fun things you can do on the Internet.
Sam
Seven layers of the OSI stuff.
Quinn
Exactly, exactly. Yeah, at. At pretty, pretty demanding real time latencies, as Shresta is saying. Like human beings expect you to respond in a conversation in 500 milliseconds or so. And if we're talking to an AI, we don't relax that assumption. We bring our assumptions about human conversation into that experience of interacting with an AI.
Shrestha
Yeah. Or not respond.
Quinn
That's a great point.
Shrestha
Yeah. So like one of the features that we've pushed out, a little more experimental, but would love for people to test it, is what we're calling proactive audio. And it's available only in the native audio in the audio to audio architecture right now. And what this feature does is it's trained not to respond to irrelevant audio.
Swix
Okay, so it's like a refusal kind of.
Shrestha
Yeah. Or you could call it directionally, like semantic voice activity detection. Right. So basically, yeah, like let's say I'm talking to the AI and then Quinn comes and asks me a question and I respond to Quinn. It'll know when not to respond.
Sam
I saw that in one of the demos the AI seemed to ignore a background question from someone else.
Quinn
And yeah, I think there's Two threads to pull on there. One is. That's another great example of things that we had to work really hard at the framework level to implement. It's much, much better if it actually migrates down into the model or the API. The other is part of the magic. There is this semi separate feature, but I think they're multiplicative of. Now your models can actually recognize two different people just based on their voices. You and I were playing that.
Shrestha
Yeah, this is not officially supported yet. The world just does it.
Swix
But just try it, right?
Shrestha
Just. I mean, just try it.
Swix
Give feedback.
Shrestha
Give us feedback.
Quinn
Is it okay to talk about it? Because it's. It might be my single favorite thing you can do with these models that you previously have not been able to do.
Shrestha
I mean, you can talk about what you've observed. I'm just saying it's not officially.
Sam
And what specific models are we talking about? Because speaker identification and diarization has always been really hard for these models.
Shrestha
It's called gosh, like model naming now has become. So it's called the native audio dialog. You'll see it in the live API, but that's the model. And then. You know, to your point again, Sam, about architectures, one thing that we launched on the cascaded architecture that we hope to eventually bring to the native audio as well is asynchronous function calling. So earlier the way it used to work is if you wanted the model to do a function call, you'd have to wait for the response. And now you can set a non blocking parameter and the model can go off and execute the function in the background.
Swix
I love you so much. Yeah, yeah, that's great. We do have to wrap up. So I think one fun thing that we can do to wrap up would be a wish list for like next year's IO. What would be one thing that you would wish? It doesn't have to come true, but wish happens with Gemini.
Sam
Well, I was hoping for Gemini 3.0 at this IO. So maybe Gemini 5.0 at the next IO.
Shrestha
Tell us what you mean by Gemini 5. What do you want in Gemini 5.0? Then I'll let Quinn go.
Quinn
I'll just put on my hat as representative of a big community. People building this stuff more and more languages because AI is global and there are so many communities all over the world that are starting.
Swix
Can we do language loras? You know, it's hard to stuff everything in one language and it's in one model. Yeah, okay, but yeah, and then.
Quinn
But they're building one model as they said, they're building the one universal model.
Shrestha
That would be boring answer. But I think really hard and work no languages. I mean, wasn't I telling you earlier? Like, we officially support 24 languages, but you can try talking to the model and cling on and it'll respond to you. So I think we'll get there way before next IO, but I just think more and more capabilities into the main model is what I would say. I'll have to think about this.
Swix
Yeah, yeah. It's a fun parlor game, but it also helps people align as to what is possible and what's coming up. Thanks for your time, everyone. This is very hastily organized, but I'm glad that we can make this happen. It's nice to actually see Sam in person.
Sam
Same.
Shrestha
Yeah, yeah. You think I am the only the PM for the live API, but we did not get to talk about some of all of the other releases as well.
Quinn
That.
Swix
We'll save that for your. Your talk at World's Fair and you guys are all speaking and. And we'll. We'll be podcasting as well, so.
Logan
Sounds good. Yeah. Yeah.
Swix
All right, that's it. Thank you so much.
Date: June 2, 2025
Participants:
This episode previews the latest developments in the Gemini foundation model—particularly around its Live API, real-time voice and audio/video capabilities, and infrastructure changes following Google I/O 2025. It highlights philosophical and technical challenges in building multimodal, generative, and real-time applications, with key takeaways from Google’s AI team and leading voice AI practitioners.
The episode wraps up with each guest sharing a wish for Gemini’s future—ranging from massive language expansion to the aspiration for a single, universal, multimodal model capable of everything. The discussion hints at rapid, ongoing innovation, emphasizes the complexity and opportunity in voice and real-time AI, and underlines how community-driven development and partnerships accelerate progress.
For deeper dives or reference materials, listeners are pointed to latent.space.