The MAD Podcast with Matt Turck
Episode: Voice AI’s Big Moment: Why Everything Is Changing Now
Guest: Neil Zeghidour, CEO of Gradium AI
Release Date: February 19, 2026
Episode Overview
This episode dives deep into the current "big moment" of Voice AI, exploring the major technical, industry, and societal shifts enabling rapid improvement in voice-based artificial intelligence. Matt Turck hosts Neil Zeghidour, CEO of Gradium AI and formerly of DeepMind and Meta, for an honest, highly technical, and engaging conversation that covers:
- Why Voice AI lagged behind other modalities
- The rapid pace of recent advances
- Building and scaling native audio and conversational models
- The unique culture and talent pool shaping Voice AI
- Technical, ethical, and business frontiers
- The role of open source and productization
- The future of voice in hardware, society, and the global AI landscape
Key Discussion Points & Insights
1. Why Voice AI Is Having Its Moment (01:18–03:28)
- For the first time, it's often more convenient and enjoyable to speak to an AI than a human for certain tasks.
- Recent breakthroughs: Latency, naturalness, and accuracy have dramatically improved in the past two years.
“It actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human.” — Neil (01:40)
- Voice interfaces now marry conversational intelligence (agents) with rapid, natural audio models.
- Still in early days: Demos typically occur in quiet, controlled environments—robustness in noisy, multi-speaker, real-world contexts is the next frontier.
2. Voice AI's “Poor Parent” Status—And Turning Point (03:28–06:44)
- Historically, voice/speech was less prestigious and thus attracted fewer top machine learning minds.
“…Voice did not attract the visionaries in machine learning… the prestige of speech conferences used to be much lower than that of computer vision or NLP.” — Neil (03:53)
- Ironically, the earliest deep learning success (pre-AlexNet) was in speech recognition, thanks to Geoffrey Hinton (2007–2008).
- Voice sits at the intersection of ML, signal processing, psychoacoustics, and domain expertise—fewer than 100 true global experts can train competitive models.
3. Technical Breakthroughs & Remaining Challenges (07:39–11:24)
- Full Duplex Conversation: Models no longer process strict alternating turns but can “listen and speak” continuously, enabling more fluid, overlapping dialogue.
“One of the things we contributed [was] getting rid of speaker turns completely with what we call full duplex conversation... in that context there is no real latency anymore.” — Neil (07:56)
- Beyond latency: Next steps involve natural expressiveness, emotional appropriateness, and context understanding.
- The quiet room caveat: Real-world robustness (e.g., factories, public spaces) is still far off.
- Hardware convergence: The next generation of devices (glasses, pendants) is built “voice first,” with no keyboards or screens—voice becomes the primary interface.
4. Use Cases, Office Life, and Social Implications (11:01–12:54)
- Applications are expanding rapidly—coding, interface navigation, and beyond.
“Even prompting LLMs now is… much more convenient [than] typing.” — Neil (11:24)
- Social/work contexts: Office norms may evolve as voice becomes a main human-computer interface.
- Anthropomorphism of AI: Engineers increasingly treat AI assistants as colleagues (“Claude-ing”).
5. Neil Zeghidour’s Path: From Math to Voice AI (12:54–22:23)
- Discovered ML via finance/news automation; interned at Facebook Paris despite barely knowing how to code at the time.
“[Soumith Chintala] asked me to implement k-means… I asked if I could do it in MATLAB… I get the job, which… is so cringe… Thank you Soumith.” — Neil (14:05)
- PhD focused on efficient, data-light speech learning, studying language acquisition in infants.
- Career at Google: Pioneered neural audio codecs and generative models (SoundStream, AudioLM, MusicLM).
- Invented “instant voice cloning”: prompting audio language models with a few seconds of a speaker’s compressed audio so they continue generating in that voice.
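To make the mechanism concrete, here is a minimal toy sketch of that prompting idea; every name in it (toy_codec_encode, toy_audio_lm_step, clone_voice) is a hypothetical stand-in, not a real Gradium or Google API:

```python
# Toy sketch of prompt-based voice cloning; all names are hypothetical.
import random

def toy_codec_encode(waveform: list[float], frame: int = 320) -> list[int]:
    """Stand-in for a neural codec (a la SoundStream): maps each short
    frame of audio to a discrete token from a 1024-entry codebook."""
    return [hash(tuple(round(x, 3) for x in waveform[i:i + frame])) % 1024
            for i in range(0, len(waveform), frame)]

def toy_audio_lm_step(context: list[int]) -> int:
    """Stand-in for one autoregressive step of an audio language model."""
    random.seed(sum(context))              # deterministic toy "model"
    return random.randrange(1024)

def clone_voice(reference_audio: list[float], n_new_tokens: int) -> list[int]:
    # 1. Compress a few seconds of the target speaker into codec tokens.
    prompt = toy_codec_encode(reference_audio)
    # 2. Let the audio LM continue the token stream; because generation is
    #    conditioned on the prompt, the continuation keeps the speaker's voice.
    tokens = list(prompt)
    for _ in range(n_new_tokens):
        tokens.append(toy_audio_lm_step(tokens))
    return tokens[len(prompt):]            # new speech "in" the cloned voice

# ~1 second of (fake) 16 kHz audio in, 50 new audio tokens out
print(len(clone_voice([0.0] * 16000, n_new_tokens=50)))
```

The takeaway is that cloning falls out of ordinary autoregressive continuation over codec tokens, which is why a few seconds of reference audio can suffice.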
6. Innovation at Small Scale: Nonprofit & Gradium AI (22:23–31:26)
- Founded Kyutai (formerly Sphere) as a lean nonprofit prioritizing frontier research and built around open collaboration.
- Small, expert teams can still drive world-changing impact in voice—“You don't need 10,000 GPUs… the ability to go fast, iterate fast… is far superior.”
- Gradium emerged to productize and scale top-performing open-source models, addressing market requests for higher quality, multilingual, production-ready models.
7. Why a Small Startup Can Beat Mega-Labs in Voice (31:26–34:01)
- Voice models must be ultra-compact to run at scale and meet latency/cost requirements—small, focused teams have an advantage.
- Big labs’ multipurpose models dilute resources across modalities (text, image, code), whereas Gradium directly targets developer needs with dedicated voice primitives (not just one assistant).
“If you have the right team, it can be extremely small and still make a significant impact.” — Neil (31:57)
8. On-Device Voice & Edge AI (34:01–35:49)
- Full “conversational AI on device” remains out of reach—current on-device use cases (e.g., speech translation, PocketTTS for games) are narrower.
- The real challenge: Matching quality while drastically reducing model size (e.g., CPU-only solutions for embedded hardware).
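The episode doesn't detail Gradium's compression recipe, but one common ingredient of CPU-only deployments is weight quantization; this minimal numpy sketch shows the 4x memory saving of int8 over float32:

```python
# Minimal sketch of int8 weight quantization, one common way to shrink
# models for CPU-only targets; Gradium's actual methods aren't specified.
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights to int8 plus a single per-tensor scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)   # one weight matrix
q, scale = quantize_int8(w)
print(f"fp32: {w.nbytes / 1e6:.1f} MB -> int8: {q.nbytes / 1e6:.1f} MB")
print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```

The open question Neil flags is exactly the hard part this sketch glosses over: shrinking the model without the quality loss that naive quantization introduces.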
9. Open Source and Competitive Dynamics (35:49–39:21)
- Open-source advances (e.g., Alibaba’s Qwen3-TTS, Mistral’s Voxtral) often build directly on Kyutai’s frameworks (the Moshi architecture).
- Open sourcing helps Gradium and Kyutai stay ahead: deep understanding of the core mechanisms keeps their "last mile" solutions state-of-the-art.
“Open source at Kyutai… is the end goal of the lab. At Gradium, that's not the end goal… the end goal of the company is to make competitive products that outperform every alternative.” — Neil (36:39)
10. Last-Mile Quality & Productization (39:21–44:39)
- The “last mile” factors—handling all accents, edge cases, latency, naturalness—are what separate market leaders from baseline open models.
- Real-world audio is fundamentally judged by humans, not by objective metrics or neural proxies; blind listening tests inform every product call (a toy version of such a test is sketched after the quote below).
“It's fundamentally subjective experience, the quality of audio. But there are some things that are going to be widely shared results…” — Neil (41:56)
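A hypothetical sketch of the kind of blind pairwise test described, with made-up clip names and a placeholder rater; the essential point is that presentation order is shuffled so raters can't tell which system produced which clip:

```python
# Hypothetical blind A/B listening trial; clip names and rater are stand-ins.
import random
from collections import Counter

def blind_ab_trial(clip_a: str, clip_b: str, rater_pick) -> str:
    pair = [("A", clip_a), ("B", clip_b)]
    random.shuffle(pair)                    # blind the presentation order
    shown = [clip for _, clip in pair]
    choice = rater_pick(shown)              # rater returns 0 or 1
    return pair[choice][0]                  # map back to the true system

# 100 simulated trials with a coin-flip rater (a real rater would listen)
votes = Counter(
    blind_ab_trial("model_a.wav", "model_b.wav",
                   rater_pick=lambda shown: random.randrange(2))
    for _ in range(100)
)
print(votes)   # e.g. Counter({'A': 53, 'B': 47})
```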
11. Commoditization Myth (44:39–46:49)
- TTS and voice models aren't (yet) commoditized; many hard problems (e.g., full-duplex interaction, diarization, robust transcription, real-time reasoning) remain unsolved.
“The best TTS, the most controllable... is in front of us. Nothing is close to it yet.” — Neil (44:57)
12. Technical Deep Dive: Cascaded vs. Integrated Models (46:49–53:29)
- Cascaded System: The standard approach (speech-to-text → LLM → text-to-speech); easy to swap LLMs, but it loses emotional and paralinguistic nuance and stacks up latency (the two designs are contrasted in the sketch after this list).
- Speech-to-Speech/Full Duplex: Models can listen and generate speech for both sides simultaneously, modeling overlapping/asynchronous conversation.
“We took the audio language model. Instead of having it model one stream of tokens [it models several in parallel]; we call it multi-stream... There is no turn-taking anymore.” — Neil (47:56)
- Challenge: Full integration (speech-to-speech) limits backend flexibility; but ultimately, it delivers the most human, low-friction voice experiences.
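To make the contrast concrete, here is a toy Python sketch; every function is a hypothetical stub, not Gradium's or Moshi's actual code. In the cascaded design the stage latencies add up before any reply; in the multi-stream design the model consumes both sides of the conversation at every frame, so overlap and interruption are natural:

```python
# Toy contrast between the two designs; all functions are hypothetical stubs.
import time

# --- Cascaded: three sequential stages, so their latencies add up ----------
def stt(audio: str) -> str:
    time.sleep(0.05)                  # pretend ASR inference
    return "hello"

def llm(text: str) -> str:
    time.sleep(0.05)                  # pretend LLM inference
    return "hi there"

def tts(text: str) -> str:
    time.sleep(0.05)                  # pretend TTS inference
    return "<audio: hi there>"

t0 = time.perf_counter()
reply = tts(llm(stt("<audio: hello>")))
print(f"cascaded reply after {time.perf_counter() - t0:.2f}s")  # ~0.15s

# --- Multi-stream / full duplex: one model, both sides, every frame --------
# The model sees a (user, assistant) token pair at each time step, so it can
# start talking, back-channel, or stop mid-utterance; there is no turn
# boundary at which the stage latencies pile up.
user_stream      = [101, 102, 103, 104]   # incoming user audio tokens
assistant_stream = [  0,   0, 201, 202]   # 0 = silence token; overlap is fine
for t, (u, a) in enumerate(zip(user_stream, assistant_stream)):
    print(f"frame {t}: model input = (user={u}, assistant={a})")
```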
13. The Hardest Problems: Robustness in Noise, Data Challenges (53:29–60:28)
- Recognizing speakers and understanding speech in noisy, multi-talker environments remains largely unsolved; it requires both hardware (multiple microphones) and new model architectures.
“One of the frontiers… a robot in the factory; a lot of people talking… It’s extremely challenging.” — Neil (53:31)
- Labeled, high-quality audio data is scarce and hard to generate, especially for rare languages, accents, or unwritten dialects.
“…we could probably do that with 10,000 hours if we had the right method… For speech, it’s not about the volume, but about having high quality data.” — Neil (56:33–59:00)
14. Hardware Efficiency & Selective Compute (63:04–64:10)
- Voice AI at scale requires small, efficient models that can run on commodity hardware, mobile, and embedded devices.
- Selective/adaptive compute is key: Not every utterance needs the same level of model horsepower.
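The episode doesn't specify Gradium's routing mechanism, but a minimal sketch of selective compute might look like this, with hypothetical models and a toy difficulty gate:

```python
# Hypothetical sketch of selective compute: a cheap gate keeps easy
# utterances on a small model and reserves the large one for hard ones.
def small_model(utterance: str) -> str:
    return f"small-model reply to {utterance!r}"      # cheap, always-on path

def large_model(utterance: str) -> str:
    return f"large-model reply to {utterance!r}"      # expensive fallback

def route(utterance: str) -> str:
    # Toy difficulty gate; a real system might use model confidence,
    # a learned router, or early-exit layers instead.
    hard = len(utterance.split()) > 6 or "?" in utterance
    return large_model(utterance) if hard else small_model(utterance)

print(route("turn on the lights"))
print(route("can you compare these two insurance plans for me?"))
```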
15. Products, Use Cases, & Business Models (64:10–68:58)
- Gradium focuses on providing the best raw models and infrastructure, letting others build end-user agents for:
  - Customer care
  - Video game NPCs
  - Language learning
  - Personalized media
  - Interactive entertainment
- Voice cloning: Gradium offers best-in-class cloning and even voice design (generating new voices by description), with particular strength in customizing accents, effects, and styles.
16. Privacy, Deep Fakes, and Security (68:58–72:30)
- Watermarking and audio “deep fake” detection aren't reliable for real-world defense.
“Watermarking is a scam. I'm sorry… It just doesn't work.” — Neil (69:15)
- Gradium ensures cloned voices remain under user control, with possible opt-in for voice sharing and compensation.
- Voice design is seen as a promising path to avoid privacy issues—generate custom, parameterized voices instead of cloning real humans.
17. Voice with Visuals & Multimodality (72:30–75:03)
- Audio-visual understanding (not just audio) greatly boosts accuracy for many tasks (e.g., joint diarization).
- Video-gen models will need “native” audio, not bolted-on—it’s a multimodal future.
- Demo: Gradium’s playful “Bridgerclone” app generates a video+voice message matching your selfie.
18. Building from Paris and European AI Strengths (76:29–82:16)
- Paris has become a global AI talent magnet, with a dense concentration of top researchers fueling Google, Meta, OpenAI, Anthropic, and homegrown startups.
- French and broader European AI have the talent and resources; culture is more sober, less “hype,” but results speak for themselves.
“In a way, the more people are mocking Europe, the more it can make the people who mock overconfident… and everyone who is overconfident eventually gets displaced by the underdog.” — Neil (81:23)
Notable Quotes & Memorable Moments
- “For the first time, it actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human.” – Neil (00:00, 01:40)
- “Historically, for some reason, voice did not attract the visionaries in machine learning… even at conferences… you had to have an application in vision or NLP. If you did it in speech, you would get rejected.” – Neil (03:53)
- “You don't need 10,000 GPUs to train a speech model... The ability to go fast, iterate fast, with the right people—to me is far superior [to] a big organization.” – Neil (30:40)
- “One of the things we contributed [was] getting rid of speaker turns completely with what we call full duplex conversation.” – Neil (07:56)
- “Watermarking is a scam. I'm sorry, I have to say it. It just doesn't work.” – Neil (69:15)
- “The only thing you can do [with AI audio] is to make it more steerable for each user… There is nothing that’s going to please every user consistently.” – Neil (43:28)
Timestamps for Important Discussion Segments
| Topic | Timestamp |
|---------------------------------------------------|-------------|
| Why voice AI's big moment is happening | 01:21–03:28 |
| Why voice lagged behind other AI modalities | 03:28–06:44 |
| Full duplex models & the expressiveness frontier | 07:56–11:01 |
| Use cases and social context in offices | 11:01–12:54 |
| Neil’s journey: math, Facebook, Google, Gradium | 12:54–22:23 |
| Small teams, open research vs. productization | 22:23–31:26 |
| Why small companies can lead in voice AI | 31:26–34:01 |
| On-device models, efficiency, new use cases | 34:01–35:49 |
| Open source influence, competitive landscape | 35:49–39:21 |
| Productization, last-mile, blind testing | 39:21–44:39 |
| The commoditization myth | 44:39–46:49 |
| Technical deep dive: cascaded vs. integrated | 46:49–53:29 |
| Robustness in noise & data challenges | 53:29–60:28 |
| Hardware and adaptive compute | 63:04–64:10 |
| Gradium’s model-centric product strategy | 64:10–68:58 |
| Cloning, voice design, and privacy | 68:58–72:30 |
| Voice + video, multimodal applications | 72:30–75:03 |
| French & European AI scene | 76:29–82:16 |
Tone & Style
The conversation is candid, insightful, and mixes technical depth with humor and industry context. Neil is unpretentious, sometimes self-deprecating, and takes clear stances while celebrating both his team and the broader field. The technical explanations never lose sight of real-world impact or the excitement (and occasional frustration) of rapid innovation.
Summary for Listeners
For anyone interested in the next generation of human-computer interaction, this episode provides a state-of-the-art tour—both visionary and grounded—of how, why, and by whom voice AI will become the interface of the future, and why the field is only just getting started.
End of Summary
