Eye On A.I. Podcast #320 – Carter Huffman: Exploring The Architecture Behind Modulate’s Next-Gen Voice AI
Host: Craig S. Smith
Guest: Carter Huffman, CTO & Co-founder, Modulate
Date: February 11, 2026
Overview
In this episode, Craig S. Smith sits down with Carter Huffman, co-founder and CTO of Modulate, to delve into the architecture and applications of Modulate’s cutting-edge voice AI. The discussion explores how Modulate moved beyond traditional voice AI—now often seen as commoditized—by developing real-time, scalable, and deeply nuanced voice understanding technology. Carter explains how these advances are impacting gaming, safety, fraud prevention, and potentially many other industries, while maintaining technical and ethical rigor.
Guest Introduction & Background
[03:07]
- Carter Huffman introduces himself, detailing his journey from studying astrophysics and cosmology at MIT under Alan Guth to working on autonomous spacecraft at NASA’s Jet Propulsion Lab.
- Developed an interest in machine learning and neural networks, eventually applying that to the voice domain, leading to the creation of Modulate.
“I was out at Jet Propulsion Lab... working on super cool problems like how do you have a spacecraft that's flying by a comet... figure out what to do next to get the most science out of that flyby mission.” — Carter Huffman [03:16]
The Commoditization—and Evolution—of Voice AI
[04:03–05:53]
- Voice transcription and basic speech recognition have become commoditized due to advances in accuracy, speed, and cost.
- The new frontier is not just transcribing, but deeply understanding—and reacting in real-time to—the emotions, intent, and social context embedded in spoken conversations.
- Modulate is focused on this deeper, context-aware voice understanding, which Carter argues is a significant step beyond where the field was even three years ago.
“You don’t just want a transcript of what's being said, you want to do something useful with it. You want to really understand what's going on.” — Carter Huffman [05:10]
Real-Time Voice Analysis in Gaming & Safety
[06:12–07:43]
- Gaming was chosen as an initial domain because of the acute need for scalable, nuanced moderation.
- Unlike text or video, voice has been harder to analyze historically, making it ripe for innovation.
- Real-time understanding enables action against harassment, toxicity, and other abuses, reducing player attrition caused by toxic behavior in gaming communities.
“The breakthrough that we had was that you can do that analysis quickly, accurately, understand the nuance... And you have to do it super, super high scale and super accurately. We’re talking hundreds of millions of hours of audio a month.” — Carter Huffman [06:49]
Applicability Beyond Voice: Modalities & Use Cases
[08:21–10:43]
- While innovation started in voice, Modulate’s architectures and strategies could be extended to text, video, and multi-modal analysis.
- Beyond safety, these technologies can be used to improve voice bots, sentiment analysis, intent recognition, lie detection, and customer engagement tools.
“Reading a transcript is so much harder to pick out those kinds of social cues... Understanding what’s going on in an accurate, repeatable, deterministic way so that you can take action on it—that’s super important.” — Carter Huffman [11:18]
Architecture: Ensembles of Ensembles
[12:15–19:55]
- Modulate deploys an “ensemble of ensembles” rather than a monolithic foundation model.
- Breaks the complex task of voice understanding into a hierarchy of subtasks: transcription, emotion, accent, language, sentiment, and environmental noise.
- Each subtask can use models optimized for specific conditions (e.g., high vs. low audio quality), routed via an orchestrator for efficiency, accuracy, and cost reduction.
“If you know what kind of representation you want ahead of time, you can piece that out... you can be so much more efficient and more accurate and more deterministic than a large foundation model.” — Carter Huffman [14:40]
Example:
- Emotion extraction adapts to audio quality, switching between models optimized for studio microphones vs. cell phones.
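The routing idea above can be sketched in a few lines. This is an illustrative toy only: the names (`AudioSegment`, `rough_quality`, the two specialist models) and the quality heuristic are invented for the example, not Modulate's actual API.

```python
# Toy orchestrator: pick a specialist model based on a crude
# audio-quality estimate. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AudioSegment:
    samples: list[float]
    sample_rate: int

def rough_quality(seg: AudioSegment) -> float:
    """Peak-to-mean ratio as a crude stand-in for a real
    signal-quality estimator (higher = cleaner, peakier signal)."""
    if not seg.samples:
        return 0.0
    peak = max(abs(s) for s in seg.samples)
    mean = sum(abs(s) for s in seg.samples) / len(seg.samples)
    return peak / mean if mean else 0.0

# Each specialist is just a callable here; in a real system these
# would be separate networks trained for a specific quality regime.
def studio_emotion_model(seg: AudioSegment) -> str:
    return "emotion-from-clean-audio"

def phone_emotion_model(seg: AudioSegment) -> str:
    return "emotion-from-noisy-audio"

def route_emotion_model(seg: AudioSegment) -> Callable[[AudioSegment], str]:
    """Orchestrator: route to the model suited to the input's quality."""
    return studio_emotion_model if rough_quality(seg) > 3.0 else phone_emotion_model
```

The payoff of this pattern, as Carter describes it, is that each specialist can be small, fast, and accurate on its own slice of the input distribution, instead of one large model handling every condition.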
Real-Time Performance & Scalability
[20:54–25:51]
- Low latency is achieved through model specialization (smaller, purpose-built models) and rapid orchestration.
- System is architected to tolerate some delayed model outputs, returning robust results within fixed latency budgets.
- Scalability comes from parallelization—each audio stream is processed independently in a stateless, feed-forward manner, with asynchronous background learning for further optimization.
“If you’re taking like half a second to decide which model to route this data to, you’ve already lost on latency, so you have to make that decision super fast.” — Carter Huffman [22:15]
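The "fixed latency budget with tolerated stragglers" idea can be sketched with a thread pool: fan out to several analysis models in parallel, wait only as long as the budget allows, and fall back to defaults for anything still running. The model functions and timings below are invented for illustration.

```python
# Sketch of a fixed latency budget (assumed names and timings):
# gather whatever finished in time, substitute fallbacks for the rest.
import concurrent.futures as cf
import time

def transcribe(audio: bytes) -> str:
    time.sleep(0.01)          # fast model
    return "hello world"

def emotion(audio: bytes) -> str:
    time.sleep(0.01)          # fast model
    return "neutral"

def slow_accent_model(audio: bytes) -> str:
    time.sleep(1.0)           # deliberately blows the budget
    return "british"

FALLBACKS = {"transcript": "", "emotion": "unknown", "accent": "unknown"}

def analyze(audio: bytes, budget_s: float = 0.2) -> dict:
    tasks = {"transcript": transcribe, "emotion": emotion,
             "accent": slow_accent_model}
    results = dict(FALLBACKS)
    pool = cf.ThreadPoolExecutor(max_workers=len(tasks))
    futures = {pool.submit(fn, audio): name for name, fn in tasks.items()}
    done, _ = cf.wait(futures, timeout=budget_s)   # hard latency budget
    for fut in done:
        results[futures[fut]] = fut.result()
    pool.shutdown(wait=False)  # don't block on stragglers
    return results
```

Note the design choice this mirrors: the caller always gets a complete, well-formed result within the budget, and a late subtask degrades one field rather than the whole response.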
Origin Story: Gaming, “Voice Skins,” and Tackling Harassment
[25:51–34:53]
- Carter shares personal gaming memories and explains how Modulate pivoted from voice “skins” (voice style/identity transformation) to solving online harassment after feedback from gaming studios.
- Toxicity is a primary cause of user attrition in gaming; traditional solutions are expensive and not scalable.
- Modulate developed cost-effective, real-time analysis tools (like ToxMod) to address this industry-wide problem.
“If you’re going to say, ‘I’m going to listen to every single thing that every person says in every online game and see if they’re harassing somebody or not,’ that would have cost billions of dollars a month... But when we came along with these ensemble models and we said, what if it’s a hundred or a thousand times cheaper, then would you do that? The studio said, yeah.” — Carter Huffman [30:54]
Products: ToxMod, Velma Models, and Beyond
[37:21–38:32]
- ToxMod: First product, purpose-built for moderating live voice chat in gaming; employs the “Velma” ensemble models under the hood.
- Evolved from earlier, more limited versions to current state-of-the-art models like Velma2, with greater flexibility and accuracy.
Expansion: From Applications to Models-as-a-Service
[38:58–56:53]
- Modulate is shifting from building specific products to offering their models as APIs, enabling third parties to integrate advanced voice understanding into any application.
- Potential use cases: fraud prevention, voice bots, sentiment & deception analysis, public speaking analysis, finance, law enforcement, elder safety, and more.
“Maybe most exciting of all, we’re transitioning from being this product company that builds products... to being this model-first company that provides the models for anyone to use.” — Carter Huffman [40:08]
- Real-world examples include CEOs using sentiment clues, romance scam prevention, and hackathon participants inventing applications the founders hadn’t envisioned.
Privacy, Consent, and Ethical Boundaries
[46:14–49:34]
- Voice fingerprinting (speaker-matching) technology is available (99%+ accuracy), but it is deployed only with explicit consent and careful attention to privacy laws and ethical guidelines.
- Modulate avoids entering the national security domain to maintain an ethical stance, focusing on applications with clear, positive impact and appropriate oversight.
“You want to catch common scammers and you want to keep them from hurting other people. But... there are different levels to the amount of data you extract and store from an interaction.” — Carter Huffman [46:33]
Multilingual & Technical Breadth
[62:17–63:13]
- Supports 18 language families, covering roughly 100 individual languages and about 99.3% of processed traffic, with dialect detection possible.
- Accent identification and demographic extraction are technically feasible, but subject to privacy considerations.
Analysis & Synthesis: Capabilities and Boundaries
[58:52–64:45]
- Fuses the transcript, paralinguistic cues, and conversation context into a holistic understanding that outperforms any single model at emotion analysis, lie detection, or sentiment.
- Synthesis (voice generation with emotional/identity nuance) is on the roadmap, but current focus is deep understanding, not TTS.
“If the tone of my voice doesn’t match the text transcript of what I’m saying... that tells you a ton more about the content... than looking at just the text or just the vocal emotion. You gotta fuse it to really understand.” — Carter Huffman [59:52]
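The mismatch signal Carter describes can be illustrated with a toy fusion step. Everything here is invented for the example (the word lists, labels, and scoring); real systems would use learned models for both sides.

```python
# Toy fusion of text sentiment and vocal emotion: flag cases where
# what is said and how it is said disagree. Labels are hypothetical.
POSITIVE_WORDS = {"great", "love", "thanks", "awesome"}
NEGATIVE_WORDS = {"hate", "terrible", "stupid", "awful"}

def text_sentiment(transcript: str) -> str:
    """Crude lexicon-based stand-in for a real sentiment model."""
    words = set(transcript.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def fuse(transcript: str, vocal_emotion: str) -> dict:
    """Combine the two channels; a mismatch (e.g. friendly words
    delivered angrily) often signals sarcasm or veiled hostility."""
    sentiment = text_sentiment(transcript)
    mismatch = (sentiment == "positive" and vocal_emotion == "angry") or \
               (sentiment == "negative" and vocal_emotion == "happy")
    return {"text_sentiment": sentiment,
            "vocal_emotion": vocal_emotion,
            "mismatch": mismatch}
```

The point of the example is the quote's claim in miniature: neither channel alone carries the signal; the disagreement between them is the information.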
Edge Deployment and Latency
[60:35–62:17]
- Currently API/cloud-based, but exploring splitting processing between edge and cloud for lower latency and efficient resource use.
- Won’t sacrifice accuracy; architecture decisions are driven by the need to maintain top-notch results.
Looking Forward: API, Community, and Open Innovation
[54:56–56:53 & 65:10–66:09]
- Focus is shifting to API-first, inviting developers and companies to innovate on voice understanding use cases.
- Modulate plans to enable access to their models with free credits and open channels for new partnerships and hackathon collaborations.
“If you’re listening to this and you’re interested to try them out, come follow us, come message us and let us know what you want to build... We’d love to let you in and see what you build.” — Carter Huffman [65:36]
Notable Quotes & Moments
- “Lie detection is absolutely a component of what we’re able to do from the voice signal and from the context and how folks behave in a conversation. Even the best trained humans aren’t perfectly accurate. Technology will never be 100% accurate at everything.” — Carter Huffman [52:10]
- “Our recent transformation has been let’s give those best models on the planet to everybody through this API access. So that’s completely new for us. As of like last week, that’s where our big focus is.” — Carter Huffman [54:56]
- “If I sound neutral, that’s a deviation from the norm and that’s important for you to know about. So we take all of those different signals and we’re trying to get you the understanding of the conversation, we’re trying to get you the answer.” — Carter Huffman [59:19]
Key Timestamps
- 03:07 – Carter’s background, astrophysics, JPL, move into AI/voice
- 06:12 – Gaming as a proving ground for voice AI safety features
- 12:37 – Ensemble architecture vs. foundation models
- 17:59 – Emotion extraction example: model allocation based on audio quality
- 22:11 – Real-time orchestration and latency minimization
- 24:33 – Scalable processing for millions of audio streams
- 25:51 – Modulate’s pivot from voice skins to anti-toxicity tooling
- 37:30 – ToxMod and product evolution
- 40:08 – Transition to API/model-as-a-service
- 46:14 – Privacy/consent complexities with voice “fingerprinting”
- 52:10 – Lie detection and nuances of context
- 58:52 – Fusing content and delivery for richer analysis
- 62:26 – Language, accent, and dialect coverage
- 64:02 – Synthesis ambitions and current product focus
- 65:36 – Invitation to try the Modulate API
How to Learn More / Get Involved
- Website: modulate.ai
- LinkedIn: Modulate page and Carter Huffman’s profile
- X (Twitter): @ModulateAI
- Free credits and early API access available—contact Modulate to participate or share ideas.
This episode provides an in-depth exploration of the current and future state of real-time voice AI and its broader implications, mixing technical detail with humane, ethical reflection. Modulate emerges as a leader shifting the paradigm from transcription to understanding, opening new frontiers for voice technology worldwide.
