Summary

The MAD Podcast with Matt Turck

Episode: Voice AI’s Big Moment: Why Everything Is Changing Now

Guest: Neil Zeghidour, CEO of Gradium AI

Release Date: February 19, 2026

Episode Overview

This episode dives deep into the current "big moment" of Voice AI, exploring major technical, industry, and societal shifts enabling rapid improvement in voice-based artificial intelligence. Matt Turck hosts Neil Zeghidour, CEO of Gradium AI (formerly DeepMind and Meta), for an honest, highly technical, and engaging conversation that covers:

Why Voice AI lagged behind other modalities
The rapid pace of recent advances
Building and scaling native audio and conversational models
The unique culture and talent pool shaping Voice AI
Technical, ethical, and business frontiers
The role of open source and productization
The future of voice in hardware, society, and the global AI landscape

Key Discussion Points & Insights

1. Why Voice AI Is Having Its Moment (01:18–03:28)

For the first time, it's often more convenient and enjoyable to speak to an AI than a human for certain tasks.
Recent breakthroughs: Latency, naturalness, and accuracy have dramatically improved in the past two years.

“It actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human.” — Neil (01:40)
Voice interfaces now marry conversational intelligence (agents) with rapid, natural audio models.
Still in early days: Demos typically occur in quiet, controlled environments—robustness in noisy, multi-speaker, real-world contexts is the next frontier.

2. Voice AI's “Poor Parent” Status—And Turning Point (03:28–06:44)

Historically, voice/speech was less prestigious and thus attracted fewer top machine learning minds.

“…Voice did not attract the visionaries in machine learning… the prestige of speech conferences used to be much lower than that of computer vision or NLP.” — Neil (03:53)
Ironically, the earliest deep learning success (pre-AlexNet) was in speech recognition, thanks to Geoffrey Hinton (2007–2008).
Voice sits at the intersection of ML, signal processing, psychoacoustics, and domain expertise—fewer than 100 true global experts can train competitive models.

3. Technical Breakthroughs & Remaining Challenges (07:39–11:24)

Full Duplex Conversation: Models no longer process strict alternating turns but can “listen and speak” continuously, enabling more fluid, overlapping dialogue.

“One of the things we contributed doing is getting rid of speaker turns completely with what we call full duplex conversation... in that context there is no real latency anymore.” — Neil (07:56)
Beyond latency: Next steps involve natural expressiveness, emotional appropriateness, and context understanding.
The quiet room caveat: Real-world robustness (e.g., factories, public spaces) is still far off.
Hardware convergence: The next generation of devices (glasses, pendants) is built “voice first,” with no keyboards or screens—voice becomes primary interface.

4. Use Cases, Office Life, and Social Implications (11:01–12:54)

Applications are expanding rapidly—coding, interface navigation, and beyond.

“Even prompting LLMs now is… much more convenient rather than typing.” — Neil (11:24)
Social/work contexts: Office norms may evolve as voice becomes a main human-computer interface.
Anthropomorphism of AI: Engineers increasingly treat AI assistants as colleagues (“Claude-ing”).

5. Neil Zeghidour’s Path: From Math to Voice AI (12:54–22:23)

Discovered ML via finance/news automation; interned at Facebook Paris, where he barely knew how to code.

“[Soumith Chintala] asked me to implement k-means… I asked if I could do it in MATLAB… I get the job, which… is so cringe… Thank you Soumith.” — Neil (14:05)
PhD focused on efficient, data-light speech learning, studying language acquisition in infants.
Career at Google: Pioneered neural codecs and generative models (SoundStream, AudioLM, MusicLM).
Invented “instant voice cloning”—prompting LLMs to generate compressed audio for any voice with seconds of data.

6. Innovation at Small Scale: Nonprofit & Gradium AI (22:23–31:26)

Founded Kyutai (formerly Sphere) as a lean nonprofit, prioritizing frontier research, built around open collaboration.
Small, expert teams can still drive world-changing impact in voice—“You don't need 10,000 GPUs… the ability to go fast, iterate fast… is far superior.”
Gradium emerged to productize and scale top-performing open-source models, addressing market requests for higher quality, multilingual, production-ready models.

7. Why a Small Startup Can Beat Mega-Labs in Voice (31:26–34:01)

Voice models must be ultra-compact to run at scale and meet latency/cost requirements—small, focused teams have an advantage.
Big labs’ multipurpose models dilute resources across modalities (text, image, code), whereas Gradium directly targets developer needs with dedicated voice primitives (not just one assistant).

“If you have the right team, it can be extremely small and still make a significant impact.” — Neil (31:57)

8. On-Device Voice & Edge AI (34:01–35:49)

Full “conversational AI on device” remains out of reach—current on-device use cases (e.g., speech translation, PocketTTS for games) are narrower.
The real challenge: Matching quality while drastically reducing model size (e.g., CPU-only solutions for embedded hardware).

9. Open Source and Competitive Dynamic (35:49–39:21)

Open source advances (e.g., Alibaba’s Qwen3TTS, Mistral's Voxtral) often build directly on Kyutai's frameworks (the Moshi architecture).
Open sourcing helps Gradium and Kyutai stay ahead—deep understanding of the core mechanisms ensures their "last mile" solutions remain state-of-the-art.

“Open source at Kyutai… is the end goal of the lab. At Gradium, that's not the end goal… the end goal of the company is to make competitive products that outperform every alternative.” — Neil (36:39)

10. Last-Mile Quality & Productization (39:21–44:39)

The “last mile” factors—handling all accents, edge cases, latency, naturalness—are what separate market leaders from baseline open models.
Real-world audio is fundamentally judged by humans, not by objective metrics or neural proxies; blind listening tests inform every product call.

“It's fundamentally subjective experience, the quality of audio. But there are some things that are going to be widely shared results…” — Neil (41:56)

11. Commoditization Myth (44:39–46:49)

TTS and voice models aren't (yet) commoditized; so many hard problems (e.g., full duplex interaction, diarization, robust transcription, real-time reasoning) are unsolved.

“The best TTS, the most controllable... is in front of us. Nothing is close to it yet.” — Neil (44:57)

12. Technical Deep Dive: Cascaded vs. Integrated Models (46:49–53:29)

Cascaded System: Standard approach (speech-to-text → LLM → text-to-speech); easy to swap LLMs, but loses emotional and paralinguistic nuance and introduces latency.
Speech-to-Speech/Full Duplex: Models can listen and generate speech for both sides simultaneously, modeling overlapping/asynchronous conversation.

“We took the audio language model. Instead of having it modeling one stream of tokens, we called it multi stream... There is no turn taking anymore.” — Neil (47:56)
Challenge: Full integration (speech-to-speech) limits backend flexibility; but ultimately, it delivers the most human, low-friction voice experiences.
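To make the cascaded design concrete, here is a minimal Python sketch. The three stage functions and the `Utterance` container are hypothetical stand-ins written for illustration, not any real speech-to-text, LLM, or text-to-speech API. The sketch shows the two properties described above: the LLM is an ordinary text-in/text-out callable that is easy to swap, and only text crosses the stage boundaries, so how something was said is dropped at the first step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Utterance:
    """Hypothetical container for one user turn."""
    words: str    # what was said
    emotion: str  # how it was said (paralinguistic information)


def speech_to_text(utt: Utterance) -> str:
    # Stand-in STT: keeps only the words, so the emotion is lost here.
    return utt.words


def echo_llm(text: str) -> str:
    # Stand-in LLM; any text-in/text-out callable can be swapped in.
    return f"You said: {text}"


def text_to_speech(text: str) -> bytes:
    # Stand-in TTS: returns bytes in place of a real waveform.
    return text.encode("utf-8")


def cascaded_turn(utt: Utterance, llm: Callable[[str], str]) -> bytes:
    """Run one turn through the three stages, strictly in sequence."""
    text = speech_to_text(utt)
    reply = llm(text)
    return text_to_speech(reply)


audio = cascaded_turn(Utterance("my order is late", "frustrated"), echo_llm)
```

Because each stage waits for the previous one to finish, the delays of the three stages add up, which is the latency cost noted above.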

13. The Hardest Problems: Robustness in Noise, Data Challenges (53:29–60:28)

Recognizing speakers & understanding in noisy, multi-talker environments is still almost unsolved—requires both hardware (multiple mics) and new model architectures.

“One of the frontiers… a robot in the factory; a lot of people talking… It’s extremely challenging.” — Neil (53:31)
Labeled, high-quality audio data is scarce and hard to generate, especially for rare languages, accents, or unwritten dialects.

“…we could probably do that with 10,000 hours if we had the right method… For speech, it’s not about the volume, but about having high quality data.” — Neil (56:33–59:00)

14. Hardware Efficiency & Selective Compute (63:04–64:10)

Voice AI at scale requires small, efficient models that can run on commodity hardware, mobile, and embedded devices.
Selective/adaptive compute is key: Not every utterance needs the same level of model horsepower.

15. Products, Use Cases, & Business Models (64:10–68:58)

Gradium focuses on providing the best raw models and infrastructure, letting others build end-user agents for:
- Customer care
- Video game NPCs
- Language learning
- Personalized media
- Interactive entertainment
Voice cloning: Gradium offers best-in-class cloning and even voice design (generating new voices by description), with particular strength in customizing accents, effects, and styles.

16. Privacy, Deep Fakes, and Security (68:58–72:30)

Watermarking and audio “deep fake” detection aren't reliable for real-world defense.

“Watermarking is a scam. I'm sorry… It just doesn't work.” — Neil (69:15)
Gradium ensures cloned voices remain under user control, with possible opt-in for voice sharing and compensation.
Voice design is seen as a promising path to avoid privacy issues—generate custom, parameterized voices instead of cloning real humans.

17. Voice with Visuals & Multimodality (72:30–75:03)

Audio-visual understanding (not just audio) greatly boosts accuracy for many tasks (e.g., joint diarization).
Video-gen models will need “native” audio, not bolted-on—it’s a multimodal future.
Demo: Gradium’s playful “Bridgerclone” app generates a video+voice message matching your selfie.

18. Building from Paris and European AI Strengths (76:29–82:16)

Paris has become a global AI talent magnet, with a dense concentration of top researchers fueling Google, Meta, OpenAI, Anthropic, and homegrown startups.
French and broader European AI have the talent and resources; culture is more sober, less “hype,” but results speak for themselves.

“In a way, the more people are mocking Europe, the more it can make the people who mock overconfident… and everyone who is overconfident eventually gets displaced by the underdog.” — Neil (81:23)

Notable Quotes & Memorable Moments

“For the first time, it actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human.” – Neil (00:00, 01:40)
“Historically, for some reason, voice did not attract the visionaries in machine learning… even at conferences… you had to have an application in vision or NLP. If you did it in speech, you would get rejected.” – Neil (03:53)
“You don't need 10,000 GPUs to train a speech model... The ability to go fast, iterate fast, with the right people—to me is far superior [to] a big organization.” – Neil (30:40)
“One of the things we contributed [was] getting rid of speaker turns completely with what we call full duplex conversation.” – Neil (07:56)
“Watermarking is a scam. I'm sorry, I have to say it. It just doesn't work.” – Neil (69:15)
“The only thing you can make [AI audio] is to make it more steerable for each user… There is nothing that’s going to please every user consistently.” – Neil (43:28)

Timestamps for Important Discussion Segments

| Topic | Timestamp |
|---------------------------------------------------|-------------|
| Why voice AI's big moment is happening | 01:21–03:28 |
| Why voice lagged behind other AI modalities | 03:28–06:44 |
| Full duplex models & the expressiveness frontier | 07:56–11:01 |
| Use cases and social context in offices | 11:01–12:54 |
| Neil’s journey: math, Facebook, Google, Gradium | 12:54–22:23 |
| Small teams, open research vs. productization | 22:23–31:26 |
| Why small companies can lead in voice AI | 31:26–34:01 |
| On-device models, efficiency, new use cases | 34:01–35:49 |
| Open source influence, competitive landscape | 35:49–39:21 |
| Productization, last-mile, blind testing | 39:21–44:39 |
| Technical deep dive: cascaded vs. integrated | 46:49–53:29 |
| Robustness in noise & data challenges | 53:29–60:28 |
| Hardware and adaptive compute | 63:04–64:10 |
| Gradium’s model-centric product strategy | 64:10–68:58 |
| Cloning, voice design, and privacy | 68:58–72:30 |
| Voice + video, multimodal applications | 72:30–75:03 |
| French & European AI scene | 76:29–82:16 |

Tone & Style

The conversation is candid, insightful, and mixes technical depth with humor and industry context. Neil is unpretentious, sometimes self-deprecating, and takes clear stances while celebrating both his team and the broader field. The technical explanations never lose sight of real-world impact or the excitement (and occasional frustration) of rapid innovation.

Summary for Listeners

For anyone interested in the next generation of human-computer interaction, this episode provides a state-of-the-art tour—both visionary and grounded—of how, why, and by whom voice AI will become the interface of the future, and why the field is only just getting started.

End of Summary

Transcript

[00:00]
Neil Zeghidour
For the first time, it actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human. I don't want to be mean to my people, the speech scientists, but historically, for some reason, voice did not attract the visionaries in machine learning. All the new hardware companies have voice at the heart of the product. All of these devices, they got rid of keyboards, they don't really have a screen or an interface, and voice is going to be the main one.
[00:27]
Matt Turck
Hi, I'm Matt Turck from FirstMark. Welcome to the MAD Podcast. Voice AI is having a big moment. The field was stuck in the uncanny valley, lagging well behind other AI modalities: robotic, slow and frustrating. But in the last 18 months, everything has started to change. My guest today is Neil Zeghidour, CEO of Gradium AI and formerly of DeepMind and Meta. Neil is one of the very top AI researchers in the field and a key architect of the rapid evolution of voice AI towards real-time native audio intelligence. This conversation is a deep dive into everything you need to know about Voice AI, where we explore many key concepts in a very accessible way and discuss plenty of fun stuff, including why voice AI has so few experts, the massive challenge of building native audio models, and the rise of autonomous voice agents. Please enjoy this terrific and very educational conversation with Neil Zeghidour.
[01:18]
Interviewer
Hey, Neil, welcome.
[01:19]
Neil Zeghidour
Hey, thanks for having me.
[01:21]
Interviewer
So a lot of people in the industry are saying that Voice AI is having its big moment. There's certainly a lot of activity, there are a lot of funding rounds. From your perspective, you've been in this field for many years now: DeepMind, Meta, now Gradium. Is Voice AI indeed having its big moment, or are we still early?
[01:40]
Neil Zeghidour
I think it's both having a big moment and we're still early. It's having a big moment because there is progress all around AI models, and in voice in particular. For example, the progress in latency, naturalness and accuracy has been really, really huge in the past years, in particular in the last two years. And at the same time, text models have evolved into what we now call agents, which are not only text models: they can actually take actions, manipulate data, access information and so on and so forth. And now when you bring both together, you can have voice interfaces that at the same time are going to solve complex problems. And so I think there is a moment now because for the first time, it actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human, because you can call any time of the day or night, and the interaction is working pretty well, and it sounds really nice, and the latency is low and so on and so forth. So it's definitely having a moment, because now it can be used in many more use cases than it used to. But it's still early because it's still quite experimental. Anybody who is using even the most advanced voice agents and compares that to the movie Her from 12 years ago, you know, it's obvious the gap that is still remaining. And there are so many topics that are completely unaddressed at the moment. In particular, every time you watch a voice agent demo, just realize that it's someone talking to a phone in a quiet room. So the day when you have someone shouting to a robot in the middle of a factory, and the robot understands what's happening and who's talking to it, then we'll be there, and we are not there at all.
[03:29]
Interviewer
So we'll get into some of the technical details in a minute. But at a high level, why has Voice AI been, I guess, the most underdeveloped modality? There's been obviously extraordinary progress on text AI and then image AI and then video AI, but it seems that voice has been a little bit the poor parent in terms of progress. Why is that?
[03:53]
Neil Zeghidour
I don't want to be mean to my people, the speech scientists, but historically, for some reason, voice did not attract the visionaries in machine learning. Right? So even if you looked at the dynamics in conferences, if you proposed a new method, like a fundamental algorithm, and you wanted it to be accepted in a prestigious venue, you had to have an application either in computer vision, like image classification, or NLP. If you did it in speech, you would get rejected because it was, like, too speech. And at the same time, the prestige of speech conferences used to be much lower than that of computer vision or NLP. So honestly, I don't really know why, because when you look at the details, the first big success of deep learning, everybody knows the AlexNet model in 2012, where for the first time you had a deep learning model outperforming every single alternative on image classification. But actually the very first big success in deep learning was speech recognition, with work from Geoffrey Hinton himself.
[04:59]
Interviewer
And when was that?
[05:00]
Neil Zeghidour
It was in, I think, 2007 or 2008.
[05:03]
Interviewer
So way before.
[05:04]
Neil Zeghidour
Yes. And so I think it was just not as prestigious, and so it wouldn't attract a lot of people that could have made significant contributions, I would say. And then what happened, and was very nice, is there was kind of a convergence of algorithms around transformers and LLMs, so that now, pretty much regardless of the task or modality you are looking at, you're always looking at the same technology. And there started to be much more progress, also thanks to the similarity between different modalities. In particular, what I contributed to with my team was to take a lot of inspiration from successes in vision and NLP and apply them directly to speech. But it's still interesting, because in a way there are way fewer people who can train a competitive speech model than in text or in vision. So I mean, it's a good position to be in, because very few people have really gone into the depth of this topic, and it's one that is very challenging. Ideally, if you want to solve the problems, you need to understand machine learning; signal processing, which is a completely different literature around telecommunication, audio compression and so on; along with psychology, and not cognitive psychology, but psychoacoustics. How does human hearing work? How does speech production work in humans? So all of this, when you bring it together, you can make competitive models, but it requires a very wide scope of expertise in very different domains.
[06:45]
Interviewer
Fascinating. To put it in numbers, how many people would you say there are in the world with that expertise? Are we talking about 100, 500, 10?
[06:55]
Neil Zeghidour
Between 10 and 100? No, I would say 50. I don't know, it's hard to say. Yeah, I think it's very few, and the really meaningful contributions that have pushed the field forward have been made by very small groups of people. And I think that's also what's nice. AI, I think, in general is one field where individuals can have a disproportionate impact, because the amount of things you can do by yourself, if you have access to compute and datasets, is huge. And in voice in particular, since the required compute is much lower, and that's the same for data, a few individuals can make stuff that is completely changing applications at very large scales.
[07:39]
Interviewer
Great. So you mentioned Her a minute ago, which is the inevitable reference for any conversation about voice. What is the ultimate success in voice that the field is working towards? Is that super low latency, expressiveness? What is great?
[07:57]
Neil Zeghidour
So, latency. Latency is already something that only makes sense if you are in a turn-based conversation, because the definition of latency is how much time there is between two turns. One of the things we contributed is getting rid of speaker turns completely with what we call full duplex conversation. So the idea is that it's always listening, always speaking, and when it's not speaking, it's just that it's producing silence. But it's always on. And in that context there is no real latency anymore, because the model basically can talk at any time, and it can talk over you and you can talk over it. And that makes the conversation really natural. Then, naturalness. It's not only these dynamics in terms of tempo, you know, like when the model can jump into the conversation and when it should remain quiet. There is this dynamics question, and then there is emotion. And so there is emotion in what the AI expresses: its emotion is natural but also appropriate, so that if you start feeling confident enough to start sharing about stuff that makes you unhappy or sad or feeling miserable, it's not saying, "Oh, I'm so sorry for you, let's talk about it." And it will also understand when you're getting upset, when you're getting confused and so on and so forth. This will already make voice AI, in terms of interaction, extremely natural and as close as possible to human, which is basically one of the things we see in the movie, which I hate mentioning as a reference because it's so overused. It's annoying.
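The "always listening, always speaking" idea can be sketched as a loop that consumes one incoming frame and emits one outgoing frame at every step, with silence as just another output. This is an illustrative toy in Python, not the actual model: the frame representation (words as strings), the `SILENCE` marker, and the `toy_policy` rule are all assumptions made for the example.

```python
SILENCE = "<sil>"  # hypothetical marker for "the model is producing silence"


def full_duplex(incoming, respond):
    """Consume one input frame and emit one output frame per step.

    There are no turns: the responder runs at every step and may speak
    while the other side is still talking.
    """
    history = []   # everything heard and said so far, as (heard, said) pairs
    outgoing = []
    for frame in incoming:
        said = respond(history, frame)  # may be SILENCE
        history.append((frame, said))
        outgoing.append(said)
    return outgoing


def toy_policy(history, frame):
    # Hypothetical rule: backchannel as soon as the user says "right?",
    # without waiting for the user to stop talking.
    return "mm-hm" if frame == "right?" else SILENCE


user = ["so", "it", "works,", "right?", "and", "then", SILENCE]
model = full_duplex(user, toy_policy)
```

Here the output stream has exactly one frame per input frame, and the single "mm-hm" lands while the user is still mid-sentence, so there is no gap between turns to measure as latency.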
[09:33]
Interviewer
Yeah.
[09:34]
Neil Zeghidour
But at the same time, everybody understands, you know, the gap between where we are right now and the movie. So I think it's still a relevant one. And it's even interesting how relevant it still is, despite the fact that there's so much work around voice. But then there will be other questions about how voice is integrated into our lives. So there are paradigms in voice AI such as wake word detection. When you use Google Home or Alexa or whatever, you have a wake word that is going to turn the speech-to-text on. So now let's say you want to work with your assistant that is always listening to you. In a way, you would have something that is just running constantly, without necessarily having a wake word. So all of this, I think, is going to bring both technical challenges and product challenges around where they sit and how we are interacting with them. The link to the hardware as well. So I think what is a good sign for voice as a field is that, in my perception, all the new hardware companies have voice at the heart of the product. All the prototypes that we see, whether it's glasses or pendants or the new stuff that Jony Ive and Sam Altman are working on, voice is at the heart of the product and will be the main way of interacting. So all of these devices, they got rid of keyboards, they don't really have a screen, or an interface rather, and voice is going to be the main one.
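As a rough illustration of the wake word paradigm described here, the Python sketch below gates a stream of frames: nothing is forwarded to speech-to-text until the wake word is seen, and forwarding stops at the next silence. The wake word string, the `<sil>` end marker, and the use of plain words as frames are assumptions for the example. A real detector works on audio, not on text.

```python
def gate_by_wake_word(frames, wake_word="hey_assistant", end_marker="<sil>"):
    """Yield only the frames that follow the wake word, up to end_marker."""
    listening = False
    for frame in frames:
        if not listening:
            # Ignore everything until the wake word is detected.
            listening = frame == wake_word
        elif frame == end_marker:
            # Silence closes the command; go back to waiting.
            listening = False
        else:
            yield frame  # this is what would be sent to speech-to-text


stream = ["nice", "weather", "hey_assistant", "play", "jazz", "<sil>", "anyway"]
forwarded = list(gate_by_wake_word(stream))
```

An always-on assistant, by contrast, would process every frame of `stream`, which is the case raised in the passage above.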
[11:01]
Interviewer
What's your vision of the future? Where does voice fit in? Is that a voice and text? Is that primarily voice for certain use cases? There's certainly an argument that you'll hear a lot of people saying voice is great, but most of the time I'm at the office, the last thing I want is for people to hear my conversation and therefore I don't want to talk to a machine. So where does voice fit in that vision of the future?
[11:25]
Neil Zeghidour
So for example, I used to think that one obvious application where voice was kind of irrelevant was coding, because fundamentally you're not going to read code out loud, right? Yet now, since coding is going more and more towards vibe coding, which is natural language, it makes a lot of sense to do it by speaking. And now people are developing products that allow you to dispatch orders to coding agents in a way that is much more efficient than if you had to type in each different window to each of them. Even prompting LLMs now, doing it by voice is much more convenient than typing. I still agree that there is one part which is more social, about what the office environment will look like. I don't know, maybe we will just also rethink the way we structure office environments. What is sure is that now people have AI assistants that are almost colleagues, right? I mean, you talk to any software engineer, the anthropomorphization of Claude Code, I find it extremely funny. Even the verb "coding", I mean, it's going to be "Claude-ing" pretty soon. And so for these people, if it's more convenient to interact with their main tool through voice, that will justify also rethinking office spaces, I guess. So yeah, I think there will be workarounds, and we will naturally find them if voice becomes the main way, I mean, if it's more practical to interact with AI through voice.
[12:55]
Interviewer
Super interesting. Before we go further, let's talk about you a little bit, and your journey and the company. So I mentioned DeepMind and Meta and now Gradium, and Kyutai in the middle. Walk us through your life story and your work.
[13:10]
Neil Zeghidour
So I studied mathematics and I started my career with a short internship in quantitative finance.
[13:16]
Interviewer
Yeah, and that was in Paris, right?
[13:18]
Neil Zeghidour
Yes, in Paris. I was born and raised there. And what was interesting is that during my internship I had access to a Bloomberg terminal. And so I would see the news, like the constant news at the bottom of the screen. I was thinking, what if I could have an algorithm that just reads this news and takes positions on the market faster than anyone, because it was able to analyze the news live? And I was looking, but I had no keywords about that, right? So I literally googled "how can I analyze text automatically" or whatever. And I found machine learning. You know, it was an epiphany. I decided to completely stop and started studying again. My goal was to go back to finance with AI and machine learning. But I got passionate about all the possibilities there were around it. Back then there were no LLMs, and it was not really about generative AI. It was about medical imaging, text understanding, a lot of things around audio, speech recognition obviously. And I looked for an internship which was about unsupervised learning, which was already pretty cool, and I just wanted to do it. And so I pretended that I was passionate about language, and I got the internship. And then Yann LeCun opened Facebook Paris, and I was able to interview. For the anecdote, I did my coding interview with Soumith Chintala, who then invented PyTorch. And now I think he's the CTO of Thinking Machines. And so I didn't know how to code, because I had only studied mathematics. And so he asked me to implement k-means, a basic algorithm. And I asked if I could do it in MATLAB, because I didn't even know Python. I knew, like, very basic Python. And he was kind enough to let me do my coding interview in MATLAB. And I get the job, which, when I think about it, is so cringe, because, oh my God, thank you Soumith. And yeah, I did my PhD there. It was very interesting, and already on speech. 
And I was spending half my time at Facebook and half my time at École Normale Supérieure in Paris, in a lab that was studying language acquisition, in babies in particular. The main observation on which the lab was built was that humans learn language from mostly two speakers, their parents, with a few hundred to 1,000 hours overall in the first four years, with huge variance between social backgrounds, and without annotation, right, because you learn to speak before you learn how to read. And that still makes us already pretty okay at conversation when we are kids. And, you know, speech recognition back then was trained with already hundreds of thousands of hours of annotated data. Now it's millions of hours of annotated data. So the topic was more around efficient learning, which is interesting because it was 10 years ago, but it is still as relevant now. I think there was a new company that raised a large round recently to make learning more efficient. So it's still as relevant as it used to be. And then I joined Google. At that time it was interesting, because I joined working on speech in Google Brain, and there was almost nobody working on speech in Google Brain. It was not considered a vibrant research topic. It was like a product topic. A lot of people were saying, "Oh, but it's solved, you know. It just works." So already back then.
[16:37]
Interviewer
Yeah.
[16:37]
Neil Zeghidour
And what year was that? 2019.
[16:40]
Interviewer
2019.
[16:41]
Neil Zeghidour
And so I found someone to work with me and we did like a lot of work around speech. And then I got excited about a specific topic around compression, audio compression. So was just out of patience. I wanted to do like a new compression format that will not be MP3.
[16:57]
Interviewer
It's like Silicon Valley, right, The HBO show.
[16:59]
Neil Zeghidour
Exactly, absolutely. And I wanted to do it with neural networks. The idea was that, yeah, it will be computationally more expensive to compress and decompress the audio, but then you could compress it much more efficiently. And that's something we worked on for Google Meet, and it was called SoundStream. That was the first of what we call neural audio codecs. And I had no plan of doing generative modeling back then. I really didn't care about that. But I was very lonely in a way, and I was trying to lure some people around me who were working in reinforcement learning. I wanted to get them to work on speech with me. I started a project around diarization, which is the task where you're listening to a conversation and you have to tell who said what, which is probably the least sexy research topic out there. I'm sorry, I think it's fascinating, but you cannot get people excited. I mean, it's very hard to get
[17:52]
Interviewer
people excited about speech, which is not very sexy. This is the less sexy part.
[17:57]
Neil Zeghidour
Like the monk project, you know, like very lonely. Yeah, I was not very successful with that in getting people to work with me. And I thought, okay, generative models. The nice thing is that if we generate speech, people will listen to it and they'll be like, "Oh, that's cool. My method has generated speech." So honestly, it was very opportunistic for me. I thought that it would be a good way to get people to work with me. And so we started a project, and the idea was that we started to see success around language models. So it was 2021, way before ChatGPT. But, you know, internally at Google there were already quite a few projects that were successful around language models. And the idea came just after the work we had done on the neural codec: if, instead of using the codec for real-time communication, you just use it to compress audio, now you have.
[18:47]
Interviewer
Do you want to define quickly what a codec is?
[18:50]
Neil Zegedor
Yeah. A codec is just compression and decompression. So you have some audio, right, and you want to send it over the network. When you're having a Zoom meeting, you're not going to send the uncompressed WAV file because it's too heavy. You're going to compress it into a much lighter file that you send over the network, and then the receiver can decompress it back into audio. And the secret is a lot of science and knowledge around human hearing. We know what kind of information we can remove from audio without creating a perceived degradation. So there is a lot of science around what specific information you can remove so that the result sounds almost as good to a human as the original, despite the fact that we removed a lot of information, which is what allows you to compress. And the main idea was that instead of using hard-coded rules to do that, we would learn from data which transformations compress audio while being as transparent as possible to the human ear. And so now we had this way of compressing audio extremely efficiently, much more than MP3 or Opus. In a way, you could consider that it was so compressed that it was almost like text. So the simple thing we did was just train an LLM to predict this compressed audio instead of predicting text. And then you could do the exact same things you can do with text. You could prompt it: I take three seconds of audio, compress it, pass it to the LLM, and let it predict the next compressed audio. And we realized that in one week we had invented instant voice cloning, so we could replicate any voice with a few seconds of audio. And yes, this became extremely successful because we could benefit from all the advantages of LLMs. LLMs are great at modeling long context. They scale very well, so if you want a large model, you just scale up the small model.
It sounds obvious put like that, but for a lot of architectures it's very difficult to go from a 100 million parameter model to a billion parameter model. With transformers and LLMs, it's straightforward. I could go into more details, but that project was called AudioLM, and then it gave MusicLM, and then it gave NotebookLM, the automated podcast, and it became the standard framework for audio generation. There were two families that were kind of fighting one another for some time. There were diffusion models, which I think is what ElevenLabs was based on early on, and we were the audio language model family. I think today virtually everything is audio language models, because they are autoregressive, so they run in a streaming fashion and are naturally compatible with real-time inference, which is kind of the main topic around voice right now. So everybody is using this technology today. And yeah, it was very, very successful and extremely easy to apply to new tasks. We did it on speech first, then we collected a dataset of piano performances and had a model for piano, then we did more general music, and then we could do pretty much anything. It's even used by a nonprofit lab called the Earth Species Project, which works on human, sorry, animal vocalizations to try to decode the language of animals. So, yeah, it's as flexible as LLMs for text, basically.
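The recipe described in this answer (compress audio into discrete tokens, then predict the next token the way a text model would) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the uniform quantizer stands in for a learned neural codec such as SoundStream, the bigram count table stands in for an LLM, and the names `LEVELS`, `encode`, `decode`, `fit_bigram`, and `continue_prompt` are hypothetical, not taken from AudioLM.

```python
import math
import random

LEVELS = 16  # hypothetical codebook size; a real neural codec learns its codebooks


def encode(wave):
    """Toy 'codec': quantize samples in [-1, 1] to integer tokens."""
    return [round((min(1.0, max(-1.0, x)) + 1.0) / 2.0 * (LEVELS - 1)) for x in wave]


def decode(tokens):
    """Map integer tokens back to approximate sample values in [-1, 1]."""
    return [t / (LEVELS - 1) * 2.0 - 1.0 for t in tokens]


def fit_bigram(tokens):
    """Stand-in for the language model: smoothed next-token counts per token."""
    counts = [[1.0] * LEVELS for _ in range(LEVELS)]  # add-one smoothing
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1.0
    return counts


def continue_prompt(counts, prompt, n_new, rng):
    """Autoregressive generation: sample one token at a time given the last one."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(rng.choices(range(LEVELS), weights=counts[out[-1]])[0])
    return out


# A 5-cycle sine wave plays the role of the recorded audio.
wave = [math.sin(2 * math.pi * 5 * i / 800) for i in range(800)]
tokens = encode(wave)                      # "compress" the audio
model = fit_bigram(tokens)                 # "train" on the token sequence
continuation = continue_prompt(model, tokens[:50], 100, random.Random(0))
audio = decode(continuation)               # "decompress" prompt plus prediction
```

The prompt-then-continue step mirrors the voice cloning anecdote only in shape: a short token prefix conditions what is generated next. A real system would use a far richer model than a bigram table.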
[22:13]
Interviewer
Amazing, amazing. You played an incredibly important pioneering role in the current state of voice AI. And then what was the next stop after that?
[22:23]
Neil Zeghidour
I left at the time Gemini started at Google. I wanted to create a small research environment that reminded me of the early days of FAIR or Google Brain. So a very small, elite team, no distractions, and, sorry to say that, no product managers, just research scientists, no emails, just locked in a room with the machines and focused on science. In particular, the goal for me was really to keep working on fundamental research, keep pushing the field forward, train students, and so on, because I felt very grateful to have been able to do research in such an open environment. It was also obvious to me, and on this I think I agree 100% with Yann LeCun, that what made AI dynamic and took it from ImageNet in 2012 to where we are today is open research, because it's kind of a worldwide collaboration and everybody benefits from the progress of everyone else. So for me it was important for the field itself to keep this going. And so we decided to create a nonprofit with the help of Eric Schmidt and Rodolphe Saadé. As an anecdote, the code name was Sphere, because that's the name of the restaurant where we discussed the project, and we then understood that we could never trademark the name Sphere, obviously. So we just asked ChatGPT for "sphere" in a few languages, and "sphere" in Japanese is kyutai, and there was "AI" in it. So, okay, that's the name of the lab now. That's how we created Kyutai. The first person I reached out to was Alexandre Défossez, who is now also a co-founder of Gradium and our chief science officer, because we had done our PhDs together at Facebook. Then when I joined Google, we kind of became rivals because we were working on the same stuff at the same time, and every time we met, we would just not talk about anything. Like, what are you working on? Oh, nothing. Okay.
So yeah, we had a small team but with big expertise in speech, and again, what we decided to work on was an opportunistic decision. I looked at the kind of stuff we could do around voice. For our lab, we had 1,000 GPUs. That seemed like a lot, but in 2023 it was already at least one order of magnitude less than the biggest labs. So we wanted a project where we could make a difference despite the fact that we were four people. It shouldn't need too much compute, and it should be very innovative, so that just by being smart about what we did, we could make a difference. And so we focused on conversational AI and real-time conversation, because I had seen from the inside at Google that nobody was daring to touch this topic because it was so challenging. Being able to cast the task of dialogue into an LLM really seemed extremely far off, right? People were just working on TTS and speech-to-text. The interactive stuff was really not there yet. So we thought, okay, we're going to work on it and we're going to make it full duplex. Might as well do something really innovative. What we didn't know at the time is that OpenAI had already been working on speech-to-speech conversation for a while. But in six months, with a core team of four people and then six, we were able to ship a model trained from scratch called Moshi, which is still to this day the only full-duplex model you can talk to. It's a bit dumb, because it's archaic in terms of intelligence, you know, but a year and a half later its latency is still the best in the world by far. And I think what was very interesting was that a very small team could make such a difference, because then we shipped the first speech-to-speech translation system, and then streaming speech-to-text, and then streaming text-to-speech. And our models have been used across the whole industry. We are always proud to hear that a lot of big companies are using them, and a lot of small companies are using them.
The Maya and Miles demo from Sesame was built around our open source models. So I think what I really like about voice, and what makes Gradium a project I deeply believe in, is that it's one of the modalities in AI where a very small team can make a difference. I don't really see benefits from having an extremely large organization with a lot of people and resources, because you don't need that many resources. You don't need 10,000 GPUs to train a speech model. You don't need 1,000 people. The ability to go fast and iterate fast with the right people is, to me, far superior to the advantage of having a big organization.
[27:09]
Interviewer
And then let's talk about Gradium. So Gradium is a commercial spinoff?
[27:13]
Neil Zeghidour
Yes. So as I said, our open source models were very successful. They are still downloaded millions of times a month. And we started having companies of all sizes reaching out, small and very large. They saw the potential. Sometimes it was a bit weird, because I was talking to people leading extremely large teams and explaining how I could train, by myself, a streaming speech-to-text model that was better than all the alternatives. So there was something we were doing very well, but at the same time our open source models remained limited. They were fundamentally prototypes for us. They were not even the actual contribution. For us, the contribution was the invention and the related research publication, and these models were kind of an artifact accompanying the paper. But people wanted such models to be multilingual, higher quality, you know, all the things that make an actual product. So we considered kind of outsourcing this part, working with another team that would lead the product and so on. And honestly, after a few interactions, I realized nobody could carry such a project except us. Nobody can believe in something they have not developed from scratch. We were in conversations with companies that wanted to create partnerships to improve our open source models, and I would look at the specifications they wanted and realize it was something we could do in a few days, while they had been struggling for months. So yeah, it seemed that we were also the best team to do that. And from a personal point of view, I was kind of addicted to academic prestige, having best paper awards, all this stuff. It was always nice, but it was never enough. Every time I did a paper I was happy for like two hours, and then I was already thinking about the next version.
And at the same time, eventually it felt a bit weird. I could not imagine that at the end of my career I would have worked on something so applied and so close to real-life applications without building them myself, you know, without contributing to them directly. So I think in the end, even in terms of achievement, academic success is one thing, but nothing comes near, in terms of impact, to having people use your models in the real world. And in a way I think that's something that has become generalized in the industry now. Something that also made me think is that I looked at all the people I respected the most, the main one being, to me, a living legend, Aäron van den Oord from Google DeepMind. All these guys were making such nice scientific contributions, and then they decided to focus on products. And I want to think it's because they also realized that academic prestige is one thing, but making a real impact means having your models used in the real world. So for me it's the ultimate impact we can have. We still do science. In particular, Kyutai keeps doing open source and open science and so on. For me the upside is mostly being able to train the next generation of AI researchers and keep the field alive, as I said, because to have a healthy and vibrant AI field you need scientific dissemination, so scientific exchange between institutions. And there also, you know, the Chinese labs are doing remarkable work, and they are kind of forcing everyone to stay open to some extent, because otherwise it also hurts the ego, I think, of the people who are in labs that don't publish. That was also something I was very opportunistic about. My strategy was: if we publish in a world where the others don't publish, they will get so pissed off at us claiming all the inventions that it will make them join us eventually.
And it's true that, you know, it's kind of hard, because some people want to be in the cool place where the cool stuff happens, and it's not only about compensation and so on. Now I would say everybody working in AI and doing a good job is going to get good economic outcomes. So then, what else? Glory is also very important, and it can be scientific, or it can be just being proud that you are making the best products. But I think it's also an important part, yeah.
[31:26]
Interviewer
Great. Now, Gradium is an actual company that was launched a few months ago and obviously you are a new entrant in the field of AI where there's tremendous amounts of competition. So I think for Voice in particular, like the obvious question is why has OpenAI or Google or Meta not already won voice AI? And I think you probably alluded to some of the reasons up front, but why is that? Why can a small company hope to become the leader?
[31:57]
Neil Zeghidour
So one thing I mentioned was that if you have the right team, it can be extremely small and still make a significant impact. Another argument, I think, is focus. For example, if you look at large multimodal models, these generic models that understand images and can generate text and can produce code and so on, you have a limited budget, which is the number of parameters and the data you're going to fit into your model. When you want to add speech to them, you're fighting with coding and image understanding and so on. So you are playing with a lot of trade-offs that are irrelevant to the task you want to solve. At the same time, these models are so large that they cannot run at scale, because they would just make everyone lose money in the process. The only format that makes sense for speech models to run at scale is to be extremely compact, which also means that the resources you need to train them are much smaller than what you need for other kinds of models. So resources are not as much of a challenge as for text models. I think another aspect is, in a way, not trying to just make a conversational product, but really making building blocks so that people can build the product. We could make the Gradium conversational assistant and think a lot about its capabilities, what it can and cannot do, what the use cases will be for people, and so on, and hope that, like the voice mode of OpenAI, it's used for this task and that task. But it's impossible to cover everything with that. So if we want to cover NPCs and fake sports commentators in video games, and a language learning app, and an annoying character in a cartoon, and the customer center agent, and so on, then we just make the building blocks. And this, again, is not really in the DNA of big companies, I think: making this kind of very specific model targeted at developers rather than trying to solve a lot of things at the same time.
[34:01]
Interviewer
And then there's an additional aspect to this which is that voice can also and should be pretty often on device versus an API call to the cloud. Is that fair?
[34:13]
Neil Zeghidour
I think what is very challenging right now is having the full intelligence on device, like your full conversational AI on device. Honestly, I would say that if you want such a model to be useful, we are not there yet. Right? You can have something that can chit-chat a bit and it will be decent. We have also shipped models on device, but they are much more constrained in terms of applications. For example, we started a year ago with on-device speech-to-speech translation, which makes a lot of sense because when you're traveling, maybe you don't have a data plan that works in every country. So it makes sense to have something that works on your phone if you want to order at a restaurant, say. I think it's a particularly well-suited use case. But we also released, two weeks ago, a model called Pocket TTS that is not only on device but CPU only. There are already mods for AAA video games where the NPCs are powered by these voice models. And now you unlock a completely new kind of application, because on-device models allow very large scale personalized content that would not be economically realistic with an API. So again, if you want to make meaningful progress in that direction, making small models in voice is much more difficult than making large models in voice. Keeping the quality while reducing the size of the model, that's where the big challenge is. And our CPU model, in terms of algorithms, is the cutting edge of what we know. It's really the latest generation of everything we've been doing so far.
[35:49]
Interviewer
So just a few days ago there was this big announcement by Alibaba's Qwen team that they were open sourcing the Qwen3-TTS family for voice design, cloning, and generation. How do you think about open source in your world at Gradium? Is that a friend? Is that a foe?
[36:07]
Neil Zeghidour
So if you read the paper, you will find our names on several pages. It's mostly inspired by the Moshi architecture, like pretty much every model right now. Even the Voxtral model that was released by Mistral two weeks ago is also based on our framework. I think that's really interesting, because it proves that there are things we do right, since everybody is building on them. At the same time, I would say it's quite an advantage, because there are two kinds of research papers. There are research papers that are meant to be as explicit and reproducible as possible, which is what we try to do when we write one. And there are some that are more about, I would say, marketing, in the sense that they are mostly focused on the results and the performance rather than explaining the underlying mechanism, the data, and so on. So the nice thing about people building around the frameworks we introduce is that even when they don't give details, we can infer the details. In a competitive landscape, I think that's quite an advantage, because it would be more challenging if people were transitioning to something completely different from what we've been doing; then we could not infer anything when reading their papers. Now all of this is very familiar when we read them. And at the same time, all the issues people are facing, we have been facing for a while, so we already have workarounds and new versions and so on. At Kyutai, open source is the end goal of the lab. At Gradium, that's not the end goal. The end goal of the company is to make competitive products that outperform every alternative. But open source, I think, is a good way of allowing developers to prototype stuff and of understanding what people expect and what they want. It's also a way to train talent. A few days ago we released Hibiki Zero.
Hibiki was our speech-to-speech translation system. Now there's a new algorithm that gives it even lower latency, better voice cloning, multilingual support, and so on. And the PhD student who is working on that project at Kyutai is joining Gradium for a few months as a visiting PhD student and will then resume his PhD. So I think there are very healthy relations between open source and closed source, and both can be done at the same time. I don't think it hurts defensibility, quite the opposite, honestly. And I think it's also very attractive for talent, because when people join us, they know that they can work on really competitive products while still sharing models, doing more exploratory research, and doing really frontier stuff. And I think if we want to stay at the cutting edge, we should be a product company but also a frontier lab. And to be a frontier lab, you need to do fundamental research to what's the
[39:05]
Interviewer
gap between open source and closed source in voice? Obviously in Text there is like this cat and mouse game and it seems that the commercial labs are constantly pretty far ahead of the open source. Is that the same thing in voice AI?
[39:22]
Neil Zeghidour
So what is interesting in voice AI is that now, I think, a high school student could make something decent, but what people want is the last mile, and the last mile is extremely difficult. The last mile is pronouncing all the difficult cases well, having latency that is not only low but robust, with almost zero downtime, and being able to clone voices regardless of accent, and so on. For all these hard cases, to me the only way to find the energy to solve them is because that's your business. Otherwise, if you look at the benchmarks for TTS, it's LibriSpeech. So it's books, a lot of them by Dostoevsky. So it's more about whether you are going to miss an I or a Y in a Russian name. It doesn't evaluate whether you can pronounce phone numbers and email addresses and URLs and all this stuff. So when you optimize for these benchmarks, you miss a lot of the actual real-world cases. So I think the main difference is that the incentive to get the finest details right only makes sense when you want to be competitive in product. And it's not only about the models, right? The infrastructure to run these models at scale is extremely important. Particularly in an API business model, your margins mostly depend on the efficiency of your inference. And I'm very lucky to have in my team one of my co-founders, Laurent, who spent his whole career in quantitative finance, mostly at Jane Street. And now we have more people coming from quantitative finance. These people are really passionate about efficient inference. That's our bread and butter, you know, because in finance, like in high-frequency trading, that's kind of the only way to exist. The engineering challenges are really significant, and having a strong engineering team as well, and not just people training models, makes a huge difference.
And well now if we're talking then about on device inference, that's even more complex. Right. If you want a model to run on all Android and iPhone and so on, that's a big engineering challenge.
[41:35]
Interviewer
You mentioned benchmarks a minute ago, and that seems to be a really interesting question for voice, and for video and images as well. How do you make the case that your technology is better than the next provider's? Because some of it seems to be a little bit about vibes, right? Like how you feel when you're on the receiving end of a voice.
[41:57]
Neil Zeghidour
AI. One thing that is clear is that you can only trust human judgment. People have tried to make objective proxies for human judgment, like a neural network that listens to an audio clip and gives it a grade. It sucks. So many people have tried, and it works in their constrained setting, and on real audio it doesn't work at all and completely breaks. So we don't trust anything but our ears. We do a lot of blind tests internally and a lot of blind tests externally, so we are working with human judgment constantly. Every single decision we make is based on human listening. We don't trust metrics at all. The quality of audio is fundamentally a subjective experience, but some judgments are going to be widely shared, like whether the tone and the rhythm are natural or not. A lot of people would agree on that. Is a voice nice or not? Nobody agrees on that. So the only way for me to claim to have the best solution is to have the largest catalog and the most diverse set of voices that people can pick from, because which voice people are going to like really varies between people. I faced that in the past when we did MusicLM at Google. For the first time we made text-to-music, so you could type "death metal with marimba" or whatever and you'd get death metal with marimba. And we put a website online where people could do this stuff. At the same time, we were already planning to do, for the first time, RLHF, so reinforcement learning from human feedback, on music. What we had designed was that there would be two generations every time, and people could give a trophy to the one they preferred. I had planned that because I wanted, for the first time, to integrate humans in the loop of music generation, because I thought that scientifically that would be really huge.
And what was interesting is that we wrote a paper about that called MusicRL, and it was quite nice, but the results were not extremely convincing. Then what we did was just judge ourselves. So, with my colleagues, we took, I don't know, 10 or 20 pairs of audio clips, and each time we chose our preferred one. And there was zero agreement between people. So there is no way an algorithm is going to learn human preferences when there is so much subjectivity. The only thing you can do is make it more steerable, so that each user can customize it, because there is nothing that is going to please every user consistently.
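The informal experiment described here (several listeners each picking a favorite from the same A/B pairs) can be quantified with a simple pairwise agreement rate. The Python sketch below is an illustration only: the function name `pairwise_agreement` and the vote data are hypothetical, and this is not the MusicRL evaluation protocol. For a binary choice, two raters picking at random agree about half the time, so values near 0.5 mean there is essentially no shared preference to learn.

```python
from itertools import combinations


def pairwise_agreement(votes):
    """Mean fraction of A/B pairs on which two raters made the same choice.

    votes[r][i] is rater r's pick ('A' or 'B') for audio pair i.
    """
    rates = []
    for r1, r2 in combinations(votes, 2):
        same = sum(a == b for a, b in zip(r1, r2))
        rates.append(same / len(r1))
    return sum(rates) / len(rates)


# Hypothetical picks from three listeners over ten pairs of generations.
votes = [
    list("AABABBAABA"),
    list("BABBAABABA"),
    list("ABAABBBBAB"),
]
print(pairwise_agreement(votes))
```

A chance-corrected statistic such as Cohen's or Fleiss' kappa would be the more rigorous choice; the raw rate is used here only because it is the easiest to read.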
[44:39]
Interviewer
The obvious next question is, if nobody can agree on whether this model is better than that model, then aren't all the models more or less the same? And therefore the entire voice AI model industry is sort of like commoditized, all
[44:57]
Neil Zeghidour
the models being the same. I think people say that about TTS already, right? Which is much more constrained than voice AI. In text-to-speech, there are factual metrics around accuracy and latency, and then there are more subjective things around expressivity and so on. But again, you can really make a difference by making it more controllable, more customizable, and so on. So I do think it's clear that the best TTS, the most controllable, the most steerable, the smartest in terms of expression, is still ahead of us. Nothing is close to it yet. And then there is everything that is not TTS. Look at transcription: what if a lot of people are speaking? I was talking about diarization as the least sexy problem ever. At the same time, it's an extremely useful problem, and you look at the error rates and they are very bad. I mean, it's just not working. In difficult cases, where you have a podcast with a lot of people talking at the same time, it just completely breaks. Full duplex: we did Moshi a year and a half ago, and still nobody has made it into a product. There are so many things that just don't exist today that I think commoditization, maybe it will happen someday, but we are very, very far from it, honestly. The first gap is bridging the gap in quality and abilities, to make these systems as powerful as human production and understanding of speech in complex environments and so on. And even if you reach this point, there will always be the challenge of getting the same performance with smaller models. Because, again, a full-duplex smart voice AI that can run on a gamer GPU or on an iPhone? Good luck with that.
[46:49]
Interviewer
Okay, awesome. I'd love to spend a little bit of time on the technical aspects of voice AI. So we talked about some of them, alluded to others, but I think it'd be nice to bring everything together. So in terms of the fundamental innovation that you described around speech to speech models, can you for us compare and contrast like the old way, which I believe is the cascade versus this new generation of models?
[47:18]
Neil Zeghidour
So the cascaded system, which is the old way, is actually pretty new. The really old way is without LLMs, you know. The old way is like Alexa, Siri, and so on.
[47:31]
Interviewer
It's back two years ago, back in the day, two years ago.
[47:33]
Neil Zeghidour
The old way is natural language understanding with parsing, with very limited vocabularies and so on. But the cascaded idea is that you basically just take an LLM and wrap it with speech input and speech output. So you have streaming transcription and streaming text-to-speech. The nice thing about that is you can plug in any LLM and customize its behavior through its prompt, like you would with a text model. The first main limitation is the latency of the whole thing, because you're running three models in a cascade and each of them adds to the overall latency of the interaction. And by going through the bottleneck of text, you lose what we call paralinguistic information, which is all the information we convey when we speak on top of what we say. So that's emotional states, irony. Lying is one. You know, a lot of information is conveyed that is not in what we say. In particular, I encourage people to look at this: every few years Interspeech, one of the main speech conferences, organizes a paralinguistic challenge with tasks that get more unusual every time. So lie detection is one. But then you have trying to recognize the origin of someone's parents from their speech. Or there was one where people were speaking while eating and you had to recognize what they were eating. So anyway, when we speak there is a lot of information that comes out about us, and this is lost going through text. What you're eating, obviously, is maybe not the most relevant one, but in the customer care context, understanding that the person is getting lost or annoyed and so on is extremely important to be able to bring the conversation back to a good state. Sometimes it's obvious from the text. Like if the user says "go to hell", yeah, probably they're upset. But sometimes it's not that obvious, and those kinds of cues can be helpful.
And so one way to address all of that together is speech-to-speech. And full duplex, on top of that, addresses what I think is the worst part of voice AI today. I hate it from the bottom of my heart. It's turn-taking. In any introductory course on convolutional networks or image classification, there is always this analogy: you can make a rule to tell whether an image was shot in the morning or at night just based on colors, right? You can just look at color values and make a handmade rule. But now if you want to recognize whether it's a cat or a dog, you cannot make rules that recognize all the shapes of cats and dogs, from all the angles, and so on. That's why we do machine learning, right? You just learn from the data when you cannot make the rules. With turn-taking, we are back in the archaic era of handmade rules, which is ridiculous. You have an algorithm called voice activity detection that just says whether it's silent or not. And if it's silent for more than X hundred milliseconds, and then this rule and this rule and this rule, then it's an interruption. But if this happens, it's not an interruption. And so we have rules on top of rules on top of rules to decide whether the model should talk or not. And everybody who is talking with cascaded systems right now knows that you need discipline when you talk to the AI. You need to adapt to its flow, otherwise it gets lost and confused and interrupts and so on. And this is extremely annoying. Moshi, again, people can talk to it today. It's not really powerful, but the quality of the interaction is unmatched, in the sense that you can really take five seconds to think about what you're going to say and then talk, and the model won't be lost. It will just be extremely flexible. And what we did was extremely simple. So we took the audio language model.
Instead of having it model one stream of tokens, we made what we called multi-stream: just two streams of tokens, one for the user and one for the AI. Both can be active at the same time, and there is no turn-taking anymore. You just train it on stereo data of people talking on the phone, with one person on the left channel and one on the right, and your model models both at the same time. Then you play the role of the left channel and the AI plays the role of the right channel, and the model learns how to handle a conversation like humans on the phone. When we did the Moshi announcement, in particular, there was a fun early demo. At some point we trained the model on the Fisher dataset, which is phone conversations that were recorded in the US in the late 90s. So you could literally make a phone call to the late 90s, and it would mention Saddam Hussein and Jacques Chirac and talk about a lot of political stuff from that time. It was very weird, because you're talking to a guy who says, hey, I'm Bob from Arizona, and every time it's a new guy, and you can talk about whatever and they tell you about their job and so on. So it's a kind of paranormal experience. Very fun. I think it's impossible to imagine that we will stick to turn-taking in the future. So obviously for us the end goal is full duplex. That was always our motivation. At the moment what we do is cascaded systems, because that's where the market is right now. People are still iterating a lot on the underlying text models they want to use, on tool use, and so on, and there is so much progress on the text side. One drawback of speech-to-speech models is that since everything is integrated, when you go from a text model to a speech-to-speech model, you need to fine-tune it on speech data. So the cost of switching the underlying text model is extremely high, because you need to redo the fine-tuning from scratch.
People want something that is modular, plug and play. What will solve everything is providing the same flexibility as cascaded systems, so that you can basically change the backend on demand and get the same customizability that you have with text models, but with full duplex. And this, I think, is going to be a convincing solution to all the current limitations.
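To make the "rules on top of rules" complaint concrete, here is a minimal sketch of the kind of handmade turn-taking rule described above: an energy-based voice activity check plus a silence timeout. The frame length, the energy threshold, and the 700 ms timeout are illustrative assumptions, not values from any real system.

```python
from typing import List, Optional

FRAME_MS = 20          # assumed frame length in milliseconds
SILENCE_RMS = 0.01     # assumed energy threshold below which a frame counts as silent
END_OF_TURN_MS = 700   # assumed value for the "x hundred milliseconds" rule


def is_silent(frame: List[float]) -> bool:
    """Voice activity check: True when the frame's RMS energy is below the threshold."""
    if not frame:
        return True
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms < SILENCE_RMS


def end_of_turn_frame(frames: List[List[float]]) -> Optional[int]:
    """Return the index of the frame where the rule decides the user is done, or None.

    The rule: once speech has been heard, END_OF_TURN_MS of continuous
    silence means the turn is over and the system may start talking.
    """
    needed = END_OF_TURN_MS // FRAME_MS
    silent_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if is_silent(frame):
            silent_run += 1
            if heard_speech and silent_run >= needed:
                return i
        else:
            heard_speech = True
            silent_run = 0
    return None
```

With these assumed values, a user who says a few words and then pauses five seconds to think is cut off after 700 ms of silence, which is the failure mode described above. Raising the timeout fixes that case but makes every reply slower, which is why such systems accumulate extra rules.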
[53:30]
Interviewer
So that's the frontier of voice AI.
[53:32]
Neil Zeghidour
One of the frontiers, I would say. Yeah. Another one where I would say it's a frontier, where I could bet against every single speech team in the world that they won't solve it in the next year or so: it's a robot in the mall, or in the factory, where there is a lot of noise from machines and a lot of people talking to the robot, and the robot has to figure out what the hell is going on. This, I can sign that it's extremely challenging. If I had to pick the worst topic, I mean the one that will be the most challenging, I think it will be this one. Even more than full duplex; much more difficult, I think.
[54:06]
Interviewer
And what happens behind the scenes when you have that noisy environment? What does the model actually do, and how do you solve that problem? Say you're calling a receptionist in a restaurant, he or she picks up, and it's super loud in the back. How do you solve that problem?
[54:23]
Neil Zeghidour
One thing that is really important is to have several microphones. We have two ears, and that's extremely useful because it allows us, for example, to localize an audio source. The reason we can localize where sounds come from is that there is a small time difference between when a sound arrives at this ear and at this ear. So if it arrives here first, my brain understands that it comes from this side, and you get an angle, and then the acceleration of the phase gives you the other dimension. And that's how you can locate in 3D. So having the ability to do this spatialization, understanding where the sources are coming from and which one is saying what and so on, again, it's both a hardware and a software issue. Creating training data for that is extremely challenging. Honestly, the level of robustness that we have as humans on that is extremely high. Also, we are all lip reading. We don't know that we are doing it, but everybody is lip reading all the time. That helps understanding a lot, in noisy environments in particular. We do it unconsciously, but we all do it. I think interactions on the phone are quite okay, because the mouth is close to the phone and the phone can do a lot of work to enhance the quality. But now people want to have a robot, it can even be a static robot that is just in a room, and you want to shout to it from your living room. It's called far-field speech recognition. It's really broken. Every company that has made these home assistants knows how difficult it is to have them work in environments where there are several people. The reason I say it's one of the biggest challenges is this: in TTS, we see a lot of progress. For this kind of hard understanding problem, I hear people saying the exact same stuff as they did 10 years ago, like there was zero progress.
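The two-ear time-difference cue can be illustrated with a short sketch. It assumes a far-field source (a plane wave), two microphones a known distance apart, and a speed of sound of 343 m/s; the function name and parameters are hypothetical. It recovers only one angle, so it does not model the second cue mentioned above for locating a source in 3D.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air at room temperature


def arrival_angle_deg(delay_s: float, mic_spacing_m: float) -> float:
    """Estimate the direction of a far-field source from the inter-microphone delay.

    delay_s is how much later the sound reaches the second microphone.
    Returns the angle from straight ahead: 0 degrees means the source is
    directly in front, +90 means fully on the side of the first microphone.
    """
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise pushing the ratio out of range
    return math.degrees(math.asin(ratio))
```

For example, with microphones 20 cm apart, a delay of about 0.29 ms corresponds to a source roughly 30 degrees off center. In practice the delay itself has to be estimated from noisy signals, for instance by cross-correlating the two channels, which is where noise and multiple talkers make the problem hard.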
[56:15]
Interviewer
You mentioned data a second ago. How does that work for voice AI and how does that compare to text AI? Obviously text AI, the LLMs are training on the whole Internet, but presumably there's a lot less audio and speech data to train on. How does that work?
[56:33]
Neil Zeghidour
So if you do the math, training on a few trillion tokens, which is what you would do for a basic text model, would amount to hundreds of millions of hours of speech or something like that, which are amounts that are very hard to get. I think it's a very interesting question that comes up in a lot of discussions, and everybody has their theories. In particular, one attribute of speech data is that if you train a conversational model on speech data, it's going to be much less intelligent than a model trained on text. And I think it's because when you listen to speech data, the density of information is much lower than what you have in text. In speech data you don't have Wikipedia, you don't have Stack Overflow, Reddit, and so on. Getting your model to learn about the world from speech, I think it's a terrible idea. We have text models, so what we did for Moshi is we started from a text model, and then we took this text model and trained it on speech while trying to prevent a loss of intelligence as much as possible. So all the time we recompute the text metrics, and they do degrade, but we try to keep it contained. But indeed, the quality of speech data is extremely valuable. Honestly, I think we are training on way too much data. For Moshi, we trained on 7 million hours of speech. It's ridiculous. I mean, we could probably do that with 10,000 hours if we had the right method. We didn't care because we had the data. So we said, yeah, screw it, and we just trained on that, which is great for capturing all the very specific idiosyncrasies in voice, the French accent, the things that make a voice unique. So for the diversity of voices, the amount of data is extremely useful.
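The "do the math" step can be checked with back-of-the-envelope arithmetic. The sketch below assumes that transcribed conversational speech yields roughly 3 text tokens per second (about 150 words per minute); that rate is an assumption for illustration, not a figure from the interview, and it counts transcript tokens rather than audio codec tokens.

```python
SECONDS_PER_HOUR = 3600.0


def tokens_to_speech_hours(num_tokens: float, tokens_per_second: float = 3.0) -> float:
    """Hours of speech whose transcript would contain num_tokens text tokens."""
    return num_tokens / tokens_per_second / SECONDS_PER_HOUR


def speech_hours_to_tokens(hours: float, tokens_per_second: float = 3.0) -> float:
    """Text tokens contained in the transcript of the given hours of speech."""
    return hours * SECONDS_PER_HOUR * tokens_per_second


if __name__ == "__main__":
    print(f"{tokens_to_speech_hours(2e12):.3g} hours for 2 trillion tokens")
    print(f"{speech_hours_to_tokens(7e6):.3g} tokens in 7 million hours")
```

Under that assumed rate, 2 trillion tokens correspond to roughly 185 million hours of speech, consistent with the "hundreds of millions of hours" estimate, while 7 million hours of speech carry on the order of 75 billion transcript tokens.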
But at the same time you could think: okay, if my model is as intelligent as a text model, it could learn the dynamics of conversation from a few thousand or a few tens of thousands of hours of data, and not millions of hours. So one source of data, the largest in volume, is just unstructured conversation, publicly available audiobooks, and this kind of stuff. And then you move towards more specific data that you make for your applications. For example, for TTS you want to have expressive data of high quality. You don't want something that is recorded in arbitrary conditions. You want real studio recordings, a very low level of noise, professional or semi-professional actors. And then there is the question of what kind of text you give them to read. It's also a very interesting problem, because if you want to generate hundreds of thousands of hours of scripts for voice actors, it's not that easy. You cannot ask Claude or ChatGPT, "Write 100,000 hours of scripts and make them as diverse as possible." It doesn't work. It's just going to be in a loop and collapse onto a few topics. Even if you ask for phone numbers: LLMs are not really random number generators, and they will eventually always produce the same phone numbers. So Alex and my team spent a lot of time making very complex machines for script generation, with a taxonomy of all possible topics and subtopics and sub-subtopics and sub-sub-subtopics, and sampling constantly. When we need a phone number, we generate it with an actual random number generator so that we actually cover the whole range of phone numbers. So I think it's a very interesting problem. It's painful, it's annoying. It's really about details. And I love that, because I think that's always where we can differentiate, just because, as gamers say, we tank. We tank the painful stuff.
And this pays off a lot because, as I said, in terms of computational resources, we don't need as much as you need to train large reasoning models or video models. For speech, it's not really about the volume of data; having high-quality data is what is extremely useful and very important. And the accuracy of annotations, for example, is paramount, and it takes a lot of human labor to annotate precisely.
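The script-generation idea described above (sample a topic from a taxonomy, and draw specifics such as phone numbers from a real random number generator rather than from an LLM) might look roughly like the sketch below. The taxonomy contents, the prompt wording, and all function names are invented for illustration; this is not the team's actual pipeline.

```python
import random
from typing import Any, Dict, List

# Toy taxonomy: nested dicts are topic levels, lists are leaf subtopics (hypothetical content).
TAXONOMY: Dict[str, Any] = {
    "travel": {"flights": ["rebooking", "lost luggage"], "hotels": ["late checkout"]},
    "banking": {"cards": ["blocked card"], "loans": ["mortgage rates"]},
}


def sample_topic_path(taxonomy: Dict[str, Any], rng: random.Random) -> List[str]:
    """Walk the taxonomy from the root, picking one branch uniformly at each level."""
    path: List[str] = []
    node: Any = taxonomy
    while isinstance(node, dict):
        key = rng.choice(sorted(node))
        path.append(key)
        node = node[key]
    path.append(rng.choice(node))
    return path


def random_phone_number(rng: random.Random, digits: int = 10) -> str:
    """Draw each digit from the random number generator instead of asking an LLM."""
    return "".join(str(rng.randrange(10)) for _ in range(digits))


def build_script_prompt(rng: random.Random) -> str:
    """Assemble a prompt whose topic and phone number were sampled outside the LLM."""
    topic = " > ".join(sample_topic_path(TAXONOMY, rng))
    phone = random_phone_number(rng)
    return f"Write a short phone dialogue about: {topic}. The caller's number is {phone}."
```

Calling `build_script_prompt(random.Random())` repeatedly yields prompts spread across the taxonomy, and the resulting text would then be sent to a text model to write the script. The diversity comes from the sampler, so the model only has to write one script at a time on a topic it was handed.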
[60:27]
Interviewer
So there is a labeling aspect to do this.
[60:29]
Neil Zeghidour
Yes, the labeling aspect is interesting, because automatic annotation works very well. Where humans are useful is really for the few mistakes that the best automatic annotators still make right now. And so the only way to make it worth the humans' time is if they produce perfect annotations. If they produce slightly imperfect annotations, it's not really worth it.
[60:51]
Interviewer
How does it work if you want to add a new language? Especially, not to pick on any, but I don't know, Urdu or Afrikaans, something where there's presumably less data on the Internet. Do you have to hire actors to create data specifically for them? How does it work?
[61:11]
Neil Zeghidour
What is interesting is that we see transfer between languages, in particular languages from the same family. For example, Hibiki Zero, which we released last week, was trained on 50,000 or maybe 100,000 hours of, for example, Portuguese and Spanish, and then to do Italian-to-English translation, 1,000 hours was enough. For languages that belong to a completely different family, I would say that's more challenging. What is nice is that fundamentally the whole pipeline is the same. Right now we have been focusing on a few languages to find the recipe that works, and then it can be applied to most languages. What is very difficult is unwritten languages, languages that don't have an official writing system. That's a lot of dialects where people are mixing languages, putting in a bit of French and then a bit of Creole and then a bit of something else and so on. This is very challenging because getting data is difficult. Getting annotated data is even more difficult, because you have far fewer annotators who can do that. These are the most challenging, I would say. But it has been a challenge for a long time, and there are a lot of projects around collecting data in such languages. Interestingly, the content that you can find in most languages is the Bible. It is the most widely available content in the world, in all languages, because a lot of people have spent time translating it and so on. So that's a very valuable source of data. Obviously you won't find modern vocabulary in it.
[62:55]
Interviewer
And what about hardware? You mentioned earlier that this is not a big hardware and GPU play compared to the big LLMs. Is that correct?
[63:05]
Neil Zeghidour
If you want voice models to run at scale and make sense in terms of economics, you need these models to be compact anyway. Think about having NPCs in a game where people are going to pay, I don't know, 70 bucks or 90 bucks or whatever for the game, and then they want to talk to it three hours a day because it's a very good game and you spend a lot of time on it. If you have a large model, it's just impossible. Not only does it not fit on the GPU, but even through APIs, it just would not make sense economically. So fundamentally I think these models need to be small, but sometimes they require access to large models because they are solving a complex task. I'm in a way stating the obvious, but what's needed is selective and adaptive compute usage based on the context and the difficulty of the task at hand, the task being as fine-grained as the next few words. Can I just answer like that, because somebody says "hey" and I just say "hey"? Or do they ask for the next flight to SF, and I have to look it up on the Internet to find the time? So being very selective about when to use compute, I think that's the only way for all of it to make sense economically.
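As a toy illustration of selective compute, the sketch below routes each utterance to the cheapest tier that can plausibly handle it. The tier names and keyword lists are hypothetical, and a keyword check is itself a handmade rule; it only stands in for whatever decision mechanism a real system would use.

```python
SMALL_TALK = {"hey", "hi", "hello", "thanks", "bye"}    # assumed trivial utterances
LOOKUP_HINTS = ("flight", "weather", "price", "schedule")  # assumed signs that a tool call is needed


def route(utterance: str) -> str:
    """Pick a compute tier for one utterance.

    Returns "canned_reply" for pure small talk, "large_model_with_tools"
    when the request appears to need an external lookup, and
    "small_model" for everything else.
    """
    text = utterance.lower()
    words = [w.strip(",.!?") for w in text.split()]
    if words and all(w in SMALL_TALK for w in words):
        return "canned_reply"
    if any(hint in text for hint in LOOKUP_HINTS):
        return "large_model_with_tools"
    return "small_model"
```

Here `route("Hey")` returns `"canned_reply"`, while `route("When is the next flight to SF?")` returns `"large_model_with_tools"`. Since most conversational turns fall into the cheaper tiers, the expensive model is only paid for when the task requires it.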
[64:11]
Interviewer
Let's talk a little bit about the product and business aspect of voice AI. You mentioned that you were building a model company, but also a product company. What is a product in voice AI? The two things I'm thinking of are, one, cloning, and then agents. Those are the two things that people are talking about a lot. Take whichever one you want.
[64:34]
Neil Zeghidour
Yeah. So for us, the product is the underlying models, right? Because what we see is that people are building voice agents. "Voice agent," I think, sounds a lot like a business agent, a customer care agent or something like that. But an NPC, in a way, is an agent, right? Because it has a voice interface and an underlying text model. Eventually, in video games, NPCs will be able to do function calling to launch a quest, to decide that you solved something, to control an action, and so on and so forth. What we focus on is providing the best technology to the people who want to build the agents. So we don't build agents ourselves. We want to focus on the quality of the model, because that's our specialty, that's where our talent is. And at the same time, I think it would be a bit unrealistic to think that we can address all the needs of the market with our own agents. Because, again, what is very exciting when we look at our customers is that every time there is a new one, it's completely unpredictable what it will be about. It's about learning and customer care and video games and media and personalized press. It's so wide that we'd rather just make it easy for people to build voice agents and provide them with reliable models and infrastructure. And that also works pretty well because, in terms of staff, we are fewer than 15 employees today at Gradium. When we partner with a company that deploys voice agents in banks or hospitals or different kinds of businesses, it requires a lot of what we could call today forward deployment engineers. So there is a lot of staff, a lot of human labor, involved in these deployments into various complex systems and infrastructures. And we can just focus on the model. So that also allows us to remain pretty small and really focus on the science and the engineering.
[66:31]
Interviewer
And voice cloning, is that a use case?
[66:34]
Neil Zeghidour
Actually, at Gradium, we have the best in the industry, and "the best" means replicating not only the specific characteristics of someone, but also the accent, or some unusual recording condition. So in a lot of contexts, for example if you want to create a vintage-sounding character with old radio effects, that's something we do pretty well. If you want to have a robot voice, we can do that pretty well as well. If you want any kind of accent or speaking style, like posh or more laid back or more urban, any kind of social or individual aspect of voice is something that we replicate pretty well. I think what is interesting with voice cloning in itself, where I see the most potential, is creating interactive experiences around licenses. I tried to pitch, for example, Who Wants to Be a Millionaire? There is a video game, and the questions can be generated on the fly. You would like the voice of the host to ask them all the time. So this one makes a lot of sense with cloning a specific voice, because you want to replicate the voice of a character, of a person, of an athlete, of a K-pop star, or whatever: any kind of experience where people want to engage with a voice that they know, but through an interactive experience. Then in a lot of cases, for example in customer care, people don't want a specific voice. They want a voice with specific characteristics that is nice for the use case. And so typically they clone the voice of a colleague who has a good voice for these customers. I think the solution that makes the most sense there is not cloning, it's voice design: being able to create voices from a natural language description, and having the voice generation be really faithful to the prompt, so that it's really worth iterating on it. Because there are a few solutions that exist today around voice design, but as far as I know, they are not very popular.
And people still stick to the existing voice catalog because they cannot steer it precisely enough. But when this works, it will allow people to design the voices that they want for their use case, which I think is very interesting. Even though sometimes I'm a bit surprised, because we have some customers that just take a random stock voice from our catalog, which is one of ours, like me or my colleague, and they don't care. I try to push them, like, "Let's design a voice together. Let's make a voice that represents your brand." But some people are like, "No, it's fine." Fascinating.
[68:59]
Interviewer
Or maybe you just have such a great voice that that's what the market wants. The inevitable question around privacy and deepfakes: obviously, in a context where one can clone somebody else's voice pretty easily, what does that mean? And how do you protect all of us?
[69:16]
Neil Zeghidour
So the first thing I want to say: watermarking is a scam. I'm sorry, I have to say it. It just doesn't work. I worked on it. We have an appendix in the Moshi paper about how easily we could break any watermarking that was supposed to be state of the art. So people should not rely on that. Also, with watermarking, who verifies the watermark? Right? You would need a platform to do the verification and remove the audio.
[69:39]
Interviewer
But can you explain what watermarking is?
[69:42]
Neil Zeghidour
Watermarking? Yeah. In audio, the idea is that when you generate an audio clip, you put a hidden stamp in it, like you would do in an image. But now it's in audio, so you cannot hear it, yet it makes it very recognizable not only that it's fake, but that it's fake from this specific model. And there is also deepfake detection, which is just recognizing that the audio is synthetic even if there is no watermark in it, just telling true from fake. These are very difficult to do, unfortunately. You know, if you get a phone call from a scammer, unless you have automatic detection inside your phone, it's useless, right? You cannot upload it to a website that says whether it's true or fake. Honestly, I would say at this point, I don't know who has a grandma who lost her credit card and needs $1,000 today, because that's always the story I hear. It's probably not the case if someone gives you a phone call. So in that context, honestly, I think there is nothing as safe as just asking personal questions that only the person they pretend to be could answer.
[70:50]
Interviewer
So you think it's going to be a fact of life and therefore we need all to be just more vigilant.
[70:55]
Neil Zeghidour
Yeah, I mean, it was already the case with emails, right? People just need to be much more vigilant, and we will probably find ways to have authentication, for example on the other side of the phone, that it's an actual person who is calling, and so on and so forth.
[71:13]
Interviewer
But on the privacy front, if I clone my voice with Gradium, then you keep control of it.
[71:21]
Neil Zeghidour
Only you can use it. You need to own it, and nobody else will be able to use it. So it's only for your own usage. And I think what we could eventually also do, which was done before, is allow people, if they want to opt in, to share their voice with the community so that they can get financial compensation if their voice is used by other users. I think this one has a lot of value, because it's a good way of sourcing a lot of voices eventually. Again, if we set aside the case of replicating familiar voices from licenses, or existing people who knowingly give their voice to create specific content, I think voice design is just going to remove this issue. Because, again, people typically clone the voice of someone, but what they wanted is someone of a specific gender, specific demographics, age, accent, and so on. So they could just fill in this information and get a few proposals for voices that fit their need, and that will remove the need for voice cloning.
[72:30]
Interviewer
Yeah. So as we get towards the end of this conversation, just a few quick ones. One: there seems to be an emerging discussion about the intersection between voice and screens or images. Is that something that you focus on?
[72:47]
Neil Zeghidour
There are several things, in particular. If you want to do speech understanding, having access to the video can help a lot. Again, for diarization, for example: if you're listening to presidential debates with several candidates, it's awful to follow. If you watch, you can see who is speaking at each time and so on. So audio-visual understanding is much stronger than audio understanding by itself, I think. What is also interesting is that in video generation, like with Veo 3 from Google, there is now native audio included. For video generation, I think the most natural thing is the integration of video and audio. I don't think it really makes sense to do video-to-audio generation separately, because typically the data exists as a multimodal signal, right? When you have a video, you have the audio track as well, so you might as well exploit both to train your models. However, for example, we released for Valentine's Day a small app called Bridgerclone. Now, if I put a photo of my face into a video model and say "make a video," it's going to make up a voice that it thinks sounds like me, based on my appearance, my age, and so on.
[74:03]
Interviewer
Where do people find that?
[74:04]
Neil Zeghidour
So it's called Bridgerclone app.
[74:07]
Interviewer
Bridgerton. Like Bridgerton, so there's a play on words.
[74:09]
Neil Zeghidour
Okay, exactly.
[74:10]
Interviewer
Bridgerclone app.
[74:12]
Neil Zeghidour
Yeah.
[74:12]
Interviewer
Okay.
[74:13]
Neil Zeghidour
And so what it allows you to do is this: you can clone your voice, you can record a short love message, and now you get a video of you in your own voice. And this, again, is why I think it's useful to have the voice treated as a kind of separate component, because now you have much more control over the actual voice of the virtual character, rather than just having a plausible voice that sounds like it could be yours.
[74:42]
Interviewer
Yep. Okay. So we talked about cloning. Obviously, cloning is just one of the many apps. Ultimately, just to play it back and drive it home, you provide building blocks and models to create all sorts of different products based on voice, whether that's a customer service use case or any kind of interaction, translation, and all sorts of.
[75:03]
Neil Zeghidour
And what is interesting is that at Kyutai, in two years, we were able to do conversation, translation, TTS, and speech to text, always competing with much bigger and much more mature teams. It's still the same at Gradium. One of the big strengths is that we have this fundamental framework for generation built around audio language models, and it's extremely flexible. So I think one of our strengths as well is our ability to take on a new task: when people come up with new needs, whether it's about annotating stuff or generating stuff, all of this can be cast pretty easily in our framework. So if we see that there is huge demand for speech separation, it's pretty easy for us to do it, and the same for voice transformation, for accent transformation, whatever. That's also why we are always interacting with developers to understand what they want. We're also now giving access to alpha models that can do things that are still experimental but are world premieres. And yeah, that's something I'm pretty excited about, because we can be much more creative than just speech to text and TTS. Obviously these are kind of the master tasks where we want to be the best in the world, because that's where most of the opportunities are. But at the same time we can do a lot of fun stuff that is completely orthogonal to that, in particular around transformation, audio effects, and so on. There are a lot of things there that can be very cool.
[76:29]
Interviewer
Great. And last theme or question. You're building this company out of Paris?
[76:36]
Neil Zeghidour
Yes.
[76:37]
Interviewer
Obviously you're building the company very much in a global way, and you have multiple customers in different geographies, including very much the US. We're recording this today in New York, and you're flying out to San Francisco in a few hours. Any thoughts on the current state of the French AI scene and, I guess, the European AI scene? There's always this fun back and forth, seen from the US: a combination of occasional admiration and pretty often mockery, plus the fact that Macron, or whoever runs his social account, sort of misfired the other day by saying that he was going to allocate 30 million euros to AI, when in reality he meant a specific program to attract a few academics to France. So on the ground there, what is your sense of the current state and the strengths and weaknesses of France?
[77:36]
Neil Zeghidour
A lot of things to say about that. I was born and raised in Paris. I did my whole career there. Facebook arrived when I started my PhD, and then Google Brain moved to Paris, and Google DeepMind afterwards. I'm what we can call terminally online, in the sense that I love the very mean memes against Europe. So there is this guy, I don't remember his name, it's like a fake Swedish name. And he keeps posting about how, you know, after only 20 meetings, he has contributed a €10,000 check for Instagram and it's a compliance-first company and everything. But I love it. Honestly, I love it because it's very mean. I love when people have so much time to spend just to be mean. I mean, I respect that. But at the same time it's so far from reality. So, French AI and European AI. European AI is mostly French AI; to be fair, there is also Germany, but a lot of it is in France and the UK. And before French companies, there was French talent in American companies. Again, a lot of the current audio generative models of Google, a global company obviously, were developed between Paris and Zurich. Actually, most of them. Llama was started in Paris. DINO, which is the most groundbreaking vision work from Facebook, was developed in Paris. A lot of things have been developed in Paris that are not seen as Parisian because they were made in American companies. And now I think the talent in Paris is so dense, and the people are extremely strong and extremely committed. The best signal that proves it is this: we used to have Facebook and Google in Paris. Now there is OpenAI, there is Anthropic, there is Cohere. Pretty much everybody is opening an AI lab in Paris, and the reason they do is the talent. So I think we have everything in France to develop global companies, in particular in AI, which is an economy really built around talent. That's a perfect deal for it.
Also, as a French guy, I really don't want us to screw it up in France, because in a way, when you look at the most successful French companies, it's luxury kind of stuff. And so, yeah, I really want AI to be, you know, a field where France can make a big difference and that can become one of the biggest drivers of the European economy. If it doesn't work, then Europe will really have to look at itself in the mirror and ask, how could we screw it up with so many strong people? Because the people are here, and capital, we can get capital in Europe as well. I think all the conditions are there for it to be competitive. In a way, the more people mock Europe, the more overconfident the people who mock can become. And everyone who is overconfident eventually gets displaced by the underdog. So, I mean, when I see the 996 LARPing on Twitter, it's ridiculous. People coding at the gym, like, what the hell? Code and then go to the gym. You're not doing good gym and you're not doing good code. It's just pretending. So I think we don't have a culture of pretending. And that's the whole of Europe, from the most western to the most eastern part. We tend to be a bit more pessimistic, maybe, and a bit more down to earth about our impact and the challenges and so on. We don't try to sugarcoat things. We rarely look enthusiastic or happy about our work. But, you know, it's a good discipline. So I think the results kind of speak for themselves. Mistral, for example, is sometimes mocked because it's not considered a frontier lab like OpenAI, Anthropic, whatever. Okay. But look at the staff and the resources and so on. I know the best people from Mistral; they can be compared to the top of the top at the biggest labs. So then there's a question of scale, and scale of resources, indeed.
The unfortunate tweet you mentioned about the 30 million euros, you know, is one example. But other than that, there are a lot of very strong people, and they have also done a lot of great things for American companies. Yann LeCun is one of them, obviously, but it's not only him. Samy Bengio, who used to lead Brain, is now leading Apple MLR. The number of French people in the leadership of big tech AI research is very large.
[82:16]
Interviewer
Wonderful.
[82:17]
Neil Zeghidour
Well, I love that. I get a bit carried away when I speak about this.
[82:20]
Interviewer
I love the fighting spirit and this is a wonderful way to end it. So, Neil, thank you so much. This was terrific. Really appreciate it.
[82:28]
Neil Zeghidour
Thanks.
[82:30]
Interviewer
Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. This really helps us build the podcast and get great guests. Thanks, and see you at the next episode.