Latent Space: The AI Engineer Podcast
Episode: Mistral: VoxTral TTS, Forge, Leanstral, & what's next for Mistral 4
Date: March 30, 2026
Guests: Guillaume Lample (Chief Scientist, Mistral), Pavan Kumar Reddy (Leader, Audio Research, Mistral)
Host(s): Latent.Space team
Episode Overview
This episode dives deep into Mistral’s latest advancements in audio and foundational model research: the debut of VoxTral TTS (Text-to-Speech), the internal and customer-facing infrastructure of Forge, the open-sourcing of Leanstral (formal reasoning models), and their broad vision for multi-modality and open-source AI. The conversation explores model architectures, the evolving state of audio research, customization for enterprise use, and open-source philosophy.
Major Announcements and Main Themes
1. VoxTral TTS Release
- First TTS model from Mistral, supporting nine languages.
- Efficient and compact (3B parameters), with quality matching or surpassing the state of the art at a fraction of the cost.
- Uses a novel autoregressive flow-matching architecture and an in-house neural audio codec.
- "So we support nine languages and this is a pretty small model, 3B model. So very fast and also state of the art. Performs at the same level of the best model. But it's much more efficient in terms of cost."
— Guillaume, 00:26
2. Research and Architectural Insights
Novel Audio Architecture
- Developed autoregressive flow matching and a new neural audio codec that maps audio to semantic and acoustic tokens for efficient synthesis.
- The model is an evolution of Mistral’s prior ASR and speech (Voxtral) models, and leverages transformer architectures with new encoding approaches.
- "Ended up with a autoregressive flow matching architecture. And also have a new in house neural audio codec... that's the new part about this model and we're pretty excited that it came out with such good quality."
— Pavan, 01:39
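Mistral’s codec internals are not public, but residual vector quantization (RVQ) is a common way neural audio codecs map continuous frames to stacks of discrete tokens, which gives intuition for the "semantic and acoustic tokens" mentioned above. A minimal NumPy sketch, where the codebook sizes, dimensions, and the semantic/acoustic level split are illustrative assumptions rather than Mistral’s actual design:

```python
import numpy as np

# Minimal residual vector quantization (RVQ) sketch: each codebook level
# quantizes the residual left over by the previous one, yielding a stack
# of discrete token ids per frame. All sizes here are assumptions.
rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 2, 16, 4  # level 0 ~ "semantic", level 1 ~ "acoustic"
codebooks = rng.standard_normal((NUM_LEVELS, CODEBOOK_SIZE, DIM))

def encode(frame):
    """Map one continuous frame to NUM_LEVELS discrete token ids."""
    residual, tokens = frame.copy(), []
    for level in range(NUM_LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest codebook entry
        tokens.append(idx)
        residual = residual - codebooks[level, idx]  # next level quantizes the leftover
    return tokens

def decode(tokens):
    """Reconstruct the frame by summing the selected codebook vectors."""
    return sum(codebooks[level, idx] for level, idx in enumerate(tokens))

frame = rng.standard_normal(DIM)
tokens = encode(frame)
print(tokens, decode(tokens).shape)
```

In a trained codec the codebooks are learned and frames come from an encoder network; the autoregressive model then predicts these token stacks instead of raw waveforms.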
Foundation Models for Audio
- Audio presents unique challenges vs. text/vision; no consensus “winner” model yet.
- Emphasis on real-time capability and efficient encoding/decoding (e.g., for voice agents).
- "Even in vision I think this is true. But in audio it's definitely true. There is no winner model yet. ...That also makes the space pretty exciting to explore."
— Pavan, 06:26
Deep Dive: Model Details & Research Philosophy
3. Autoregressive Flow Matching & Codec Innovations
- Combines discrete and continuous representations; a flow-matching head enables low-latency, high-quality audio generation.
- The approach cuts inference from a large number of iterative decoding steps (slow) to roughly 12–16, with future potential for one-step inference.
- "With flow matching we were able to cut it down significantly. So we are able to do the inference in 12 steps or 16 steps and it works pretty well."
— Pavan, 12:01
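For intuition on why 12–16 steps can suffice: flow-matching sampling integrates a learned velocity field from noise (t=0) toward data (t=1) with a fixed number of integration steps. A toy NumPy sketch, where a closed-form field stands in for the learned flow-matching head and the target vector, dimension, and step count are illustrative assumptions:

```python
import numpy as np

# Toy flow-matching sampler: integrate a velocity field v(x, t) from
# Gaussian noise (t=0) to a data point (t=1) with Euler steps. A
# closed-form field stands in for the real learned model.
TARGET = np.full(8, 0.5)  # stand-in for the frame the model would generate

def velocity(x, t):
    # For the straight-line path x_t = (1 - t) * noise + t * TARGET,
    # the velocity carrying the current state toward the data is:
    return (TARGET - x) / max(1.0 - t, 1e-6)

def sample(num_steps=12, dim=8, seed=0):
    x = np.random.default_rng(seed).standard_normal(dim)  # noise at t=0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)  # one Euler integration step
    return x

print(np.allclose(sample(), TARGET, atol=1e-3))  # True: 12 steps suffice here
```

The practical point matches the quote: a short, fixed integration schedule replaces a much longer sampling chain, and distillation can push toward single-step generation.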
4. Real-World Evaluation and Data Customization
- Focus on custom, client-specific deployments, including on-prem and private cloud for data privacy, as well as tailored fine-tuning for industry needs (e.g., rare languages, custom voices, domain jargon).
- The Forge platform allows enterprises to fully control, fine-tune, and deploy Mistral models.
- "If you actually fine tune you can actually really go much further than this and then you have a very big advantage. The model is trained on your entire company knowledge so it knows everything."
— Guillaume, 18:37
Key Discussion Segments & Timestamps
VoxTral TTS & Audio Research
- [00:26] – Announcement and overview
- [01:39] – New architecture details (autoregressive flow matching, codec)
- [07:41] – Real-time generation and design trade-offs
- [09:43] – Roadmap: from ASR to full-duplex voice agents
Audio Model Evaluation & Industry Use
- [13:02] – Inspiration from vision diffusion models; potential for improvement
- [14:37] – Efficiency vs. generalist models: Mistral’s philosophy
- [15:06] – Progress and remaining gaps in TTS — especially for non-English languages
Fine Tuning, Customization & Forge
- [17:56] – On-prem/private cloud, regulatory workflows
- [18:37] – Importance of tuning models on domain-specific data
- [20:43] – Unique client use-cases (e.g., in-car offline TTS, rare language support)
Voice Cloning & Personalization
- [25:14] – Use cases: enterprise, healthcare, customer support
- [26:54] – Long-form and coherent speech synthesis, model’s handling of context
Model Integration & Specialization
- [28:56] – “Mistral Small” and mixture-of-experts architectures
- [32:24] – Merging voice and video, multi-modal frontiers
Open Source, Leanstral, and Research Culture
- [33:25] – Mistral and the mixture-of-experts breakthrough
- [34:09] – Open-source dedication; releasing Leanstral (formal theorem proving, verification)
- [36:10] – Why formal systems matter for verifiable, long context reasoning
Research Horizons
- [40:49] – RL and long-horizon supervision
- [41:00] – Pre-training at scale and next frontier algorithms
Notable Quotes
- "There is no winner model yet... it's still evolving. That also makes the space pretty exciting to explore."
— Pavan, 06:26
- "If you care about this specific use case, you can actually use this model. It just does that. It's extremely good at it, but also very efficient."
— Guillaume, 13:54
- "That's the Mistral pitch right there. Take all the money."
— Host, 22:43
- "We really don’t want to be living in a world where the smartest model, the best models are only behind closed doors. …We want intelligence to be used and accessible by anyone."
— Guillaume, 34:09
Memorable Moments
- Discussion on disfluencies in speech modeling (pauses, intonations) and how flow matching helps capture natural variation. [11:32]
- Recounting the breakthrough mixture-of-experts paper and its impact on open-source AI acceleration. [33:25]
- Anecdotes on unique customer needs: e.g., a specialized model for in-car voice commands that needs to work offline, or unusual kid-focused educational use cases. [20:43, 47:37]
Forward Looking: What’s Next for Mistral?
- Full-duplex voice (speaking while listening) is on the roadmap but approached incrementally.
- Continued merging of models for new capabilities: coding, reasoning, computer-aided design, legal/finance use cases.
- Deepening open-source contributions: detailed technical reports, competitive specialized models (Leanstral), and a strong research-sharing culture.
- Hiring across all teams, especially for science and customer-facing ("forward deployed") engineers, emphasizing agility and real-world impact.
- "Like a small team, very agile. …We are growing the team trying to hire very strong people."
— Guillaume, 42:32
Final Takeaways
- Mistral is positioning itself as a leader in open and efficient foundation models for audio, text, code, and beyond.
- The company’s approach—decoupling and tuning for each application, then merging capabilities when mature—offers both specialization and scalability.
- Open source remains core to the mission; Leanstral shows how niche efforts (formal math) fit a larger vision for verifiable reasoning.
- With advances like VoxTral TTS, Mistral is set to challenge both closed and open model incumbents across multiple domains.
Try VoxTral TTS and learn more at latent.space and in the episode show notes.
