
A
If you're taking like half a second to decide which model to route this data to, you've already lost on latency, so you have to make that decision super fast. Lie detection is absolutely a component of what we're able to do from the voice signal and from the context and how folks behave in a conversation. Even the best trained humans aren't perfectly accurate. Technology will never be 100% accurate at everything. We have multiple different emotion extraction models that will run on any given part of the conversation to do the best and most cost effective job at figuring out what emotion am I expressing right now?
B
This episode is brought to you by Tastytrade. On Eye on AI, we talk a lot about how artificial intelligence is changing how people analyze information, spot patterns and make more informed decisions. Markets are no different. The edge increasingly comes from having the right tools, the right data, and the ability to understand risk clearly. That's one of the reasons I like what Tastytrade is building. With tastytrade, you can trade stocks, options, futures and crypto all in one platform with low commissions, including zero commissions on stocks and crypto, so you keep more of what you earn. The platform is packed with advanced charting tools, backtesting, strategy selection and risk analysis tools that help you think in probabilities rather than guesses. They've also introduced an AI-powered search feature that can help you discover symbols aligned with your interests, which is a smart way to explore markets more intentionally. For active traders, there are tools like Active Trader Mode, One Click Trading, and Smart Order Tracking. And if you're still learning, tastytrade offers dozens of free educational courses plus live support from their trade desk reps during trading hours. If you're serious about trading in a world increasingly shaped by technology, check out tastytrade. Visit tastytrade.com to start your trading journey today. Tastytrade, Inc. is a registered broker-dealer and member of FINRA, NFA and SIPC.
Okay, Carter, well, I usually begin by having you introduce yourself to listeners, and I know your background. It's fascinating. You were at the Jet Propulsion Lab for a while. You were or are at MIT. I'm not sure if you're still up in the Boston area, but if you could give your background and how you got to Modulate, then we'll talk about what Modulate does.
A
Awesome. Fantastic. So hey, I'm Carter Huffman. I'm the CTO and co-founder over at Modulate. My background: I started out studying astrophysics and early universe cosmology at MIT. I was there a couple of years, studied under Alan Guth, did some really, really cool physics there. And then I was out at the Jet Propulsion Lab in Pasadena, California, working in the machine learning and instrument autonomy group. So working on super cool problems like: how do you have a spacecraft that's flying by a comet look at the data it's gathering and figure out what to do next to get the most science out of that flyby mission that you can, all on, you know, a decade-old processor kind of thing. So it's a really fun problem. I was there for a couple of years, got really, really into machine learning, neural networks, all that kind of stuff. And that's what eventually led me into using AI and applying it to the voice domain, which sort of brought about Modulate.
B
I have talked to a lot of people in AI voice and, you know, there have been tremendous products created, ElevenLabs and things like that. And I was surprised and initially not particularly interested when I heard about what you're doing, because I felt that voice had become commoditized. But you are looking at a particular problem, which is real-time voice analysis, is that right?
A
Correct, absolutely.
B
Which is a different animal and much rarer in the space.
A
Well, I think the space is really evolving. It's kind of like a multi-stage journey, if you will. And I think you're right about the first stage of that: okay, I'm talking and you want to transcribe what I said. People have been working on transcription technology for decades, and it has recently become accurate enough broadly (in many scenarios; there are edge cases), and fast enough and cheap enough, to become very widespread, which has led to part of the voice AI boom that started recently. The other piece is the natural language understanding part: language models that have made those transcripts useful for a variety of tasks. So it's kind of like those pieces have become commoditized, but that's led to more capability, which has led to the next boom. And I think that's very exciting and moving very fast, and very deeply differentiated from the voice AI of even just three years ago, where it's like you don't just want a transcript of what's being said, you want to do something useful with it. You want to really understand what's going on. And that's new.
B
Yeah, and you are applying it specifically, I mean along with a lot of other things, to safety, and to gaming in particular within the safety universe. Can you talk about that, and why that needs addressing?
A
That's where we got started and where we're kind of expanding out from right now. It's a super important problem because gaming has a long history of being a social activity, but it's social over the Internet, online, where people are generally anonymous or pseudo-anonymous. And that leads to a small subpopulation behaving in generally socially unacceptable ways. But it's been very hard to stop that behavior from happening. If somebody's harassing somebody else in a coffee shop, you kick them out. If somebody's harassing somebody else in a video game, you could ban them. But you kind of need proof. You can't just do he said, she said. That's been very hard to get for a long time. The breakthrough that we had was that you can do that analysis quickly and accurately, and understand the nuance. Because I might be trash talking with you and it's fine. I say some rude-sounding stuff, but we're friends, it's totally cool. I might say exactly the same thing to you, but you're a stranger and that's unacceptable. So you have to have that nuance, and you have to do it at super, super high scale and super accurately. We're talking hundreds of millions of hours of audio a month across some of the larger gaming titles. And to be able to moderate all of that and get rid of the bad behavior, that would have cost tens or hundreds of millions of dollars six years ago. So we had to make that efficient. That was the big breakthrough.
B
Yeah, and I mean, we'll get into how you do that, but you know, this is a problem not only with voice, but with video and images and text. And the big social media platforms have done, frankly, a pretty poor job of policing the content on their platforms. And you know, there have been attempts. But does your approach translate to other modalities, or is it really voice only?
A
Sure thing. It's broadly applicable. Voice is uniquely a good place to go because voice is traditionally quite tricky to be accurate at. And you need really, really sophisticated models to do a pretty good job, both in understanding and in synthesis. Out of what I would call the three main modalities, right, text, voice, and video or imagery, video, imagery and text are often the most widely researched, and voice has tended to trail behind, mostly because with imagery you can do really cool synthesis work really easily. Like, you can make a face, and even if the face isn't perfect, we're evolutionarily predisposed to see faces. So you think it's super cool. You synthesize a voice and it's kind of off, and we're predisposed to find weird voices, and then we're creeped out by it. We don't love it. Right. You know, that kind of thing. So voice has tended to lag behind in the research, and I think very recently things have become accurate enough and cheap enough for it to catch up. But yes, the same ideas are applicable across all different modalities. Use specialized models, do it cost-effectively, pick the right model for the right job. It's a very general strategy.
B
And so, before we get into how it's constructed, let's talk about some of the other applications. You said you're moving beyond safety, correct?
A
Absolutely, yeah. To, more broadly, just general voice understanding. When we started out in safety, we picked a specific task so that we could optimize the heck out of that. Right? Everything is targeted towards it. Like, okay, let's say you're transcribing the audio, and you're like, there's no way this is possibly harassment. You don't have to spend a ton of time and compute on it, because you know all you're looking for is harassment. It makes it a lot easier to optimize for that specific job and solve that extremely hard problem of high accuracy and high cost efficiency. Once we got really good at that, then we were able to build the tech to generalize it.
B
And generalize to what other applications?
A
Anything that requires voice understanding. So this can be all the way from the traditional applications we talked about in safety to things like optimizing the voice bots that you're building to talk to people, right? Like, do they understand the emotion that you're using in the tone of your voice? Do they understand what your intent is in the conversation? Are you lying or not? These are all social things that we're built to be exceptional at. There's so much history that has gone into making humans very good at these sorts of things. Reading a transcript, it's so much harder to pick out those kinds of social cues that you have to use in order to interact with people well. So: making your voice bots better, improving your voice agents on things like rule following, domain adaptation, guidelines, these sorts of things. So let's say you call in to a voice bot, it's a customer service bot, and you start flirting with it. What does it do? Right? These systems are very general. It could flirt back. You can have guardrails and people can get around them. Right. One thing I know about gamers, coming from gaming, is they will sit down and they will try to break your system no matter what. And that's true of every system. And so, you know, you try to engage with this thing and people will find a way around your guardrails and stuff like that. Understanding what's going on in an accurate, repeatable, deterministic way so that you can take action on it, that's super important for basically anything that involves a conversation in some way.
B
Okay, well, let's talk about the tech. Is this built on top of existing foundation models? I mean, what's the architecture? I know it's an ensemble of models or strategies. Can you talk about how Modulate is built?
A
Absolutely. Fantastic. So it's an ensemble of ensembles. What we call the ensemble listening model architecture starts from this: we know we're analyzing a voice conversation, so we're narrowing into that domain. It's a really wide space, there are tons of voice conversations out there, but we're narrowing in. We're going to understand a voice conversation. We know those conversations will have topics, and they can change over time. We know those conversations will have participants playing roles that can be dynamic as well. They're going to have intents in that conversation. They're going to exhibit behaviors that you might want to notice: harassment, fraud, whatever, or just, you know, social engagement, those kinds of things. And so you have this structure of what goes into a voice conversation. The ensemble listening model is built to elicit all of those things efficiently and accurately and use that context to do a better job. So what we do is we start off by saying, what are the building blocks of getting that representation of a conversation that we want? You do need to transcribe it, so we need transcription. You need to figure out the emotion in what people are saying, the tonality. You want to understand a little bit about who these people are. So what accent are they using? What language are they speaking? You know, all these kinds of things. Try to get the sentiment on any particular topic as they're talking. Low-level signals: Is somebody playing music? Is there a fan running in the background? What are the other kinds of noises? What are the other kinds of channels? What's the audio environment for each of these people? Are you sitting there on a professional-grade microphone?
Am I calling in on my 8 kHz telephony line? All of these different things are important pieces of context, not just to understand the conversation, but also to understand how people are engaging with each other. So we have to figure out all of those different pieces of information to represent the conversation well. So we start off by saying, what are those different building blocks? And this is where we start to deviate from foundation models. Because a foundation model approach would just say, slap the biggest model you can on it, and it'll figure all that stuff out. But our insight was that actually, if you know what kind of representation you want ahead of time, you can piece that out into those individual components. And you can be so much more efficient and more accurate and more deterministic than a large foundation model by saying, we want a model or block of models that does this task. We want a block of models that does this task. We want a block of models that does this task. So that's part one. The hierarchy of ensembles is that we build out those things that you need to understand from a conversation. Part two is then digging into each of those and saying, how do we do transcription the best? You could just run a transcription model like Siri does or something else does. But as you might know, if you've used Siri, there are situations in which a given model will work well, and there are situations in which a given model will not work well. The big insight for us there was that we're trying to optimize a couple of things. We want accuracy, we want cost effectiveness, we want determinism and repeatability. We want scalability, we want all these things. There's no model that is going to be perfect at all of those things. More general, broad accuracy is going to cost you on cost efficiency because you need a bigger model. It needs to be able to understand more kinds of things.
Instead of one big model that can handle, say, 100 different accents over 50 different languages, what if you break it down into 10 models that can each handle five? If you do that, each of those individual models can be much more efficient. They're individually less general. But then you sit on top of that an orchestrator, or ensemble selector, that will say which of the models is most appropriate for this domain right now. You do that, and now you have a slightly more complicated system, but for a given conversation, for a given piece of dialogue, it's only running a small, small subset of the whole capacity of the system, and you're only paying for that compute. So you get much better cost performance, as well as better determinism and accuracy, because you've got exactly the right model tuned for only that job, compared to just stuffing it all into a foundation model.
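To make the orchestrator idea concrete, here is a minimal Python sketch of an ensemble selector that routes a piece of audio to the cheapest specialist covering its detected language. It is an illustration only, not Modulate's implementation: the `Specialist` type, the model names, the language groupings, and the cost numbers are all hypothetical, and the transcription functions are stubs.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Specialist:
    """One small model in the ensemble (all fields are hypothetical)."""
    name: str
    languages: FrozenSet[str]          # languages this model is tuned for
    cost_per_second: float             # made-up relative compute cost
    transcribe: Callable[[bytes], str]


def _stub(name: str) -> Callable[[bytes], str]:
    # Stand-in for a real model call; it only reports which model ran.
    return lambda audio: f"[{name}] transcript of {len(audio)} bytes"


SPECIALISTS: List[Specialist] = [
    Specialist("romance", frozenset({"es", "fr", "it", "pt", "ro"}), 0.10, _stub("romance")),
    Specialist("germanic", frozenset({"en", "de", "nl", "sv", "da"}), 0.12, _stub("germanic")),
]
# A bigger general model acts as the fallback when no specialist applies.
GENERAL = Specialist("general", frozenset(), 1.00, _stub("general"))


def select_specialist(language: str) -> Specialist:
    """Pick the cheapest specialist that covers the detected language."""
    candidates = [s for s in SPECIALISTS if language in s.languages]
    if not candidates:
        return GENERAL
    return min(candidates, key=lambda s: s.cost_per_second)


def transcribe(audio: bytes, detected_language: str) -> str:
    """Run only the selected model, so only that compute is paid for."""
    return select_specialist(detected_language).transcribe(audio)
```

For example, `transcribe(b"\x00" * 320, "fr")` would run only the hypothetical `romance` model, while an unrecognized language code falls back to the more expensive `general` one.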
B
Yeah, yeah. And I've seen the chart that shows you're drastically less expensive and much more accurate than the commodity foundation models. When you talk about ensembles of ensembles, and those sub-ensembles doing specific tasks, can you give us some examples beyond, you know, a model handling a smaller number of languages and then multiple models to cover the field? Are there other tasks that you break down into sub-ensembles?
A
So let's just talk about extracting emotion from my voice. Right. We have multiple different emotion extraction models that will run on any given part of the conversation. So to do the best and most cost-effective job at figuring out what emotion I am expressing right now, we actually break those models down. We have one or two big general models that are like, this is going to do a pretty good job no matter what, but they're more expensive and they're less specialized. Then we have other models, some of which are really specialized to low-quality voice data. A lot of the emotion models will look for higher-frequency signals in the voice to help inform them as to what kind of inflection I am putting into my words. But if I'm calling in on an 8 kHz phone line, those high frequencies are gone. So the model would rely on those, but they're not there. So it's going to do a worse job, because what it's looking for isn't found. And it's also going to be wasting a ton of capacity trying to process those high frequencies when they don't even exist. So it's just not the right model for the right job. It'll do a better job on this conversation, because the audio is high quality, but it won't do well if I call in on a cell phone. So if I'm calling in on a cell phone, part of our ELM architecture is looking at audio quality and saying, look, this is an 8 kHz phone call. And each of those sub-ensembles, like the emotion block, has models that are more or less applicable to that level of quality. So it'll route the data away from the general high-audio-quality emotion model and into the one that only looks at those lower frequencies, because that's all that's available, which will make it more accurate without the wasted capacity.
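The reason an 8 kHz phone line loses the high frequencies is the Nyquist limit: audio sampled at a given rate can only carry frequencies up to half that rate, so 8 kHz telephony tops out at 4 kHz. The Python sketch below shows one way quality-based routing could work under that assumption. The model names and the bandwidth each one needs are hypothetical, and using the sample rate as a proxy for audio quality is a simplification (a real system would presumably also measure the actual spectral content).

```python
from typing import List, Tuple

# (model name, highest frequency in Hz the model expects to see).
# Both the names and the numbers are hypothetical.
EMOTION_MODELS: List[Tuple[str, float]] = [
    ("narrowband_emotion", 4000.0),   # tuned for telephony-grade audio
    ("wideband_emotion", 8000.0),     # relies on higher-frequency cues
]


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


def pick_emotion_model(sample_rate_hz: float) -> str:
    """Choose the model whose expected band fits the available bandwidth."""
    available = nyquist_hz(sample_rate_hz)
    usable = [m for m in EMOTION_MODELS if m[1] <= available]
    if not usable:
        # Below every model's expected band: use the least demanding one.
        return min(EMOTION_MODELS, key=lambda m: m[1])[0]
    # Otherwise use the model that exploits the most of what is available.
    return max(usable, key=lambda m: m[1])[0]
```

With these made-up thresholds, `pick_emotion_model(8000)` selects the narrowband model and `pick_emotion_model(16000)` or higher selects the wideband one.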
B
And that's why you can beat other models on cost.
A
Right, correct.
B
Using, you know, a huge model when you only need a small model and that sort of thing.
A
Correct, exactly. Right. And again, because most people take the approach of just shoving it all into the one model, that model has to handle the 8 kHz phone audio. It also has to handle the super high quality VoIP audio. It also has to handle all of these different other things. It has to handle, you know, if I'm talking and there's 40 people in the background and they're also talking and the mic's picking it up. Right. Each of those scenarios requires different weights in the model to handle appropriately. And if you've got one model that needs to handle all of it, it's going to waste all those weights in a situation where they're not applicable, but it still runs the compute anyway because it's just one big model.
B
And the other big differentiator, I mean, I'm sure there are many, is your speed, or the low latency under real-time conditions. Can you talk about how you achieve that?
A
Sure thing. So part of it is sort of what I've been talking about already. Smaller models run faster; the data just gets through the model faster. But the other piece is the orchestration. So this is where, naively, or at a first level, a single big model has an advantage, because all you do for a big model is you give it the input and you wait for it to get you the output. And then once you've got the output, you send the output back. And so it's a very simple architectural problem. You're just running a big network and you just have to wait for the GPU to be done. For us to be able to match or beat that level of latency while using an ensemble, we have to very, very cost-effectively route to different models in the ensemble and make sure that architectural overhead isn't costing us very much. The way we do that... Yeah, go ahead.
B
No, go ahead. I was just going to say, costing you much in terms of latency, correct?
A
Yes, costing us much in terms of time. Exactly. If you're taking like half a second to decide which model to route this data to, you've already lost on latency, so you have to make that decision super fast. So the way we do that is by breaking up the statefulness of the model into two different pieces. There's the feedforward pass that runs right now and just gets you an answer as soon as possible, given what we know this second. And then there's the feedback loop that comes back in and says, now we've learned a little bit more about this conversation. We know what's going on. Let's make better choices the next time. So it optimizes itself as it's processing the data. It's dynamic. So the feedforward pass, we engineer so much to make that really, really efficient. We pre-register, okay, for all of these different input conditions, which models are we going to run, and how do we fuse the outputs. And we also do a kind of flexible distributed computation model where these ensembles are actually robust to missing one or two pieces. 99% of the time, all the data gets there in time to make a decision and get the answer back. But if I'm fanning this data out to five different models and only four of them come back in time because one of them locked up or went down or something like that, I'll still do a good job with those four responses, fuse that into the final answer and get it back to you. And so a lot of the problems in distributed systems that you're kind of referring to here, about an ensemble and latency, are that you've got to wait till all the different pieces run, collect the results, and then get it back. But we don't wait for that. We say, hey, you've got a certain fixed latency budget. We're going to pump the data through the models, we're going to get the results. And if one or two of the models didn't get back to us in time, we build the ensembles such that they can still robustly do a good job with only partial data.
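The fan-out with a fixed latency budget can be sketched with Python's standard `asyncio` library. This is a toy illustration under stated assumptions, not the actual system: `run_model` is a stub that sleeps to simulate inference, the scores are invented, and fusing by a plain average is a placeholder for whatever fusion rule a real ensemble would use.

```python
import asyncio
from statistics import mean
from typing import List, Tuple


async def run_model(name: str, delay_s: float, score: float) -> Tuple[str, float]:
    """Stub model call: sleeps to simulate inference time, then answers."""
    await asyncio.sleep(delay_s)
    return name, score


async def fan_out_and_fuse(
    specs: List[Tuple[str, float, float]], budget_s: float
) -> Tuple[float, List[str]]:
    """Send the input to every model, but only wait for the latency budget."""
    tasks = [asyncio.create_task(run_model(*spec)) for spec in specs]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)

    # Anything that missed the budget is abandoned rather than waited on.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    results = {}
    for task in done:
        if task.exception() is None:      # skip models that crashed
            name, score = task.result()
            results[name] = score
    if not results:
        raise TimeoutError("no model answered within the latency budget")

    # Placeholder fusion: average whatever answers arrived in time.
    return mean(list(results.values())), sorted(results)
```

Calling `asyncio.run(fan_out_and_fuse(specs, budget_s=0.5))` with five model specs, one of which takes two seconds, would return a fused score built from the four that answered in time.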
B
And one of the other differentiators is that you've been able to scale this so that you can analyze millions of audio streams in real time. How do you do that?
A
That's a great question. So part of it is the parallelism that I was talking about earlier. It's really easy to scale a system to a million streams if each stream can be its own completely independent thing and doesn't have to care about anything else. We try to make that true as much as possible. So it's this feedforward pass: as I'm analyzing the data live, each stream is completely independent. Now, asynchronously, in the background, we're taking what we learned from each stream, processing it and feeding it back. And then we're jumping in midstream and saying, hey, we learned something else. Here's a new configuration. Do an even better job in the future. But that's a really quick update. And it doesn't matter exactly when that update happens. Whenever that information comes along, the stream will take it and do a better job. But if it never comes along, it doesn't matter. It's fine. It'll still keep processing the stream. And so that's how we do it. Each stream does its feedforward computation and gets you the answer completely independently. But there's this kind of lower-priority feedback loop that's just making everything more optimal and everything better in the background. And over time we get more efficient and more accurate at processing stuff.
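One way to picture the independent feedforward pass with a lower-priority feedback loop is a per-stream worker that checks for configuration updates without ever blocking on them. The Python sketch below uses the standard `queue` module; the `StreamConfig` fields and the `StreamWorker` class are hypothetical names chosen for illustration, and the "processing" is a stub.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamConfig:
    """Routing choices for one stream (hypothetical fields)."""
    emotion_model: str = "general_emotion"
    version: int = 0


class StreamWorker:
    """Processes one audio stream independently of all other streams."""

    def __init__(self, stream_id: str) -> None:
        self.stream_id = stream_id
        self.config = StreamConfig()
        # Filled by a background feedback loop, whenever it has news.
        self.updates: "queue.Queue[StreamConfig]" = queue.Queue()

    def _apply_pending_updates(self) -> None:
        # Non-blocking: if no update has arrived, keep the current config.
        while True:
            try:
                self.config = self.updates.get_nowait()
            except queue.Empty:
                return

    def process_chunk(self, chunk: bytes) -> dict:
        """Feedforward step: never waits on the feedback loop."""
        self._apply_pending_updates()
        return {
            "stream": self.stream_id,
            "model": self.config.emotion_model,
            "config_version": self.config.version,
            "bytes": len(chunk),
        }
```

A worker keeps producing results with its default configuration until something like `worker.updates.put(StreamConfig("narrowband_emotion", 1))` arrives, after which the next chunk picks up the new routing.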
B
You started with ToxMod. Can you talk about ToxMod? And why did you focus on gaming? You mentioned that you're familiar with gaming. Are you a gamer? And did you see this problem and get frustrated by it, or what's the genesis?
A
Oh yeah. I've been a gamer my whole life. I spent a lot of hours in Halo, the early Halo: Combat Evolved. I was actually on the PC version, and people would host their own servers and you would log into the servers and there were like seconds of lag. So I would shoot at somebody across the map and, you know, it'll hit where they were three seconds ago or something. I actually got really good at timing, like, where are you going to be in three seconds? Because the server's lagging like three seconds behind or something. And then when we all went to Halo 2 and they had dedicated servers on the Xbox, I was like, it's too fast, the latency is too low. I suck at this game. And I never got any good at Halo 2, but I was great at Halo 1 with the high lag. So I played a lot of that. I played a lot of Call of Duty. That was a super fun game. They did a great job with the progression system early on, like, okay, now I'm gonna level up, now I'm gonna get my new ranks, everything like that. I put thousands of hours into Call of Duty when I was growing up. So I've always been a gamer. I grew up super shy, super introverted. I was always really nervous about hopping into voice and talking because I was worried about what people would think of me. And so it was always sort of a self-conscious, nervous thing. And I don't think it was necessarily that the games didn't care about it. It's just that they couldn't really control the voice chat environment. They didn't have the tools to do that. And so people like me just kind of wouldn't jump in, wouldn't engage. And we still played the game and we still had fun. But I was really familiar with gaming and the voice ecosystem in multiplayer gaming. And so when we built the company, my co-founder and I, he's a gamer too. He's more of a Nintendo guy.
We really wanted to do something in a space that we loved and were passionate about. So we were like, can we apply this technology to gaming? And when we first got started, we were doing voice transformation. That's the very first technology we built, where it's like, as I'm talking, I sound like Taylor Swift or I sound like James Earl Jones or something like that. We called them voice skins. And we thought, oh, this will definitely apply to gaming. You just want to hop into Fortnite and sound like your favorite actor or your favorite character. So we're going to go sell this. So we tried to do that for a year, and we went out and people were like, this is really cool. The technology worked. It was very, very cool. But they're like, you know, I don't know if people are going to want this. Mike and I were like, we really want this. We think it's cool. But nobody really thought, you know, hey, we'll definitely want it. And there have been voice changers out there, like apps, and they never really took off or anything like that. So it's not clear that people actually want this. But one of the gaming companies we talked to earlier took that tech and they were like, hey, can you make everybody sound the same so people stop harassing each other based on the sound of their voice? And we were like, that's an interesting idea. It's kind of weird. You don't want to hide yourself in order to play a game without being harassed. It's kind of weird, but you could do it and it has merit. And most importantly, you care a lot about preventing harassment in your game. You'll do anything to try to fix it. Even if this tech isn't built for that, you still want to try to use it to solve that problem. That's the problem you really care about. And that's what got us turned on to "let's solve harassment in gaming": hearing that from our customers and how much they cared.
B
Yeah. For people who are not familiar with multiplayer games, and I've never played one, but my kids are grown now and I used to listen to them, you know, and they're talking and they're wearing a headset. I can't hear the other side. But you're not only talking to your friends, right? You're talking to anybody that logs onto the game and onto the server that you're connecting to. Is that right? I mean, I've heard about swatting, for example, but I didn't realize that harassment is that big of a problem.
A
It happens all the time, sadly. Yeah, it's a huge, huge component, and it's one of the largest drivers of attrition from a gaming community across all factors. One of the biggest reasons people leave is they're like, this community is too toxic. People are harassing me or people are harassing other people, and I don't want to listen to it. And if that's what the community is like, you don't want to engage. Most people don't want to engage with that. And so you stop engaging. You're not participating in the community, and eventually you're just like, I'm tired of listening to this stuff. And you drop off and you go play something else. It happens all the time across all the big platforms. It's not unique to any particular game. It happens everywhere. And again, the problem was that there weren't really the tools to do anything about this, because if you were going to say, I'm gonna listen to every single thing that every person says in every online game and see if they're harassing somebody or not, that would have cost billions of dollars a month. That's such a big scale. There are so many people that game, so many people that talk. Completely impossible. But when we came along with these ensemble models and we said, what if it's 100 or a thousand times cheaper, then would you do that? The studios said, yeah, we do want to do that. We want to solve that problem.
B
To jump back to your original voice skinning tech, I was at a conference in Armenia, and I met a company that had a tech that I thought was incredibly cool. But it sounds like you guys were doing it quite a while ago. It was for call centers in places like India or the Philippines, where for the agent, the human agent, it would normalize... normalize isn't the word, but it would turn their voice into a Midwestern accent. So no matter how heavy their accent was, on the other side you wouldn't be able to identify the person as a non-native speaker. Did you guys do anything with the voice skinning? Do you still offer that?
A
Yeah, we still have it as a product, and we actually have a couple of customers and some pretty cool amounts of revenue from that product. The thing about that product that caused us to focus more on the analysis piece was that I think the use cases for voice skinning are twofold. There's the business use case, which can range from normalizing accents, which makes people more understandable (that's a good goal), to cases where it's important that you're not easy to identify in a phone call, if you're doing like an investigation or something like that. There are other kinds of applications for this sort of thing from the business side. And then the other side is purely commercial, just like, hey, would people think it's cool to change the sound of your voice? And we were really heavily into that kind of consumer side of it: we want to build a technology that people think is really cool to use and makes their experience better, makes their experience more fun. And I do think those other applications do provide value. They make things better for sure. But we were really like, I want to build tech that people love, love, love using, that people get really engaged by, and that just kind of makes their experience better. And that was really how we wanted to focus, which is why we started with voice conversion and style transfer, voice skins, for consumers first: you're going to buy skins you want to use because you want to sound like this person. And then fighting toxicity: you want to participate in the community, but this small subset are ruining it because they're being super toxic and nobody can do anything about it. We're going to go solve that problem. We're going to make our tech make this community better, so people have more fun and people enjoy this more and people have a better experience. And that's been a real driver for us.
Not just can we find all the use cases but can we find those use cases that really make things better? So that's why we focused in that direction.
B
Yeah. And so you do this real-time analysis, and when you identify harassment, or grooming is another issue I'm sure you're paying attention to, what happens then? Does your system just flag it to the gaming company, and they take care of it with their own strategy, or do you mute the offender? I mean, what do you do with that data once you've analyzed it?
A
It's really important that there's one central expert for any community that is the final arbiter of what you do when something bad happens. That has to be a single centralized source; otherwise you risk these pieces getting out of sync. It would be a really awful experience for me to say something in Call of Duty, for example, and then type something in Call of Duty, and get banned for saying it but not for typing it, or vice versa. It doesn't make any sense for this to be inconsistently applied. And so for these studios, especially in gaming, and also as we've expanded out into anti-fraud and security, like keeping people from breaching passwords by calling the help desk, in all of those places there are multiple modalities: text, email, voice, video, all these sorts of things. But you have one group that is responsible for safety or security or community or trust that applies to all those different domains, and everything has to feed into them, or you're going to get inconsistent and you're going to miss stuff. That's going to be a bad experience for everyone, and at worst it's a security breach or people are going to get hurt. It all has to feed into one centralized group. So our role is to cost effectively, scalably, and accurately find stuff, and then make it as easy as possible to feed into that centralized group, who can be the single expert to take the action. So that's always our model.
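Editor's sketch: the "single arbiter" idea above can be illustrated with a few lines of Python. This is not Modulate's or any studio's actual system. The `Detection` fields, the `POLICY` table, and the action names are all hypothetical, chosen only to show detections from different modalities flowing through one shared decision function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A finding from any detector (all fields are hypothetical)."""
    user_id: str
    modality: str   # e.g. "voice", "text", "video"
    category: str   # e.g. "harassment"
    severity: int   # 1 (mild) to 3 (severe)


# One policy table owned by the central trust-and-safety team (assumed values).
POLICY = {
    "harassment": {1: "warn", 2: "mute", 3: "ban"},
}


def decide(detection: Detection) -> str:
    """Map a detection to an action without looking at its modality."""
    actions = POLICY.get(detection.category, {})
    # Anything the table does not cover goes to a human reviewer.
    return actions.get(detection.severity, "review")


voice_hit = Detection("player-1", "voice", "harassment", 3)
text_hit = Detection("player-1", "text", "harassment", 3)
same_outcome = decide(voice_hit) == decide(text_hit)
```

Because `decide` never reads `modality`, saying something and typing it lead to the same outcome, which is the consistency property described in the answer.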
B
Oh, and what is ToxMod? Is that your model, or is that the primary product, or what is that?
A
It's an application. It's an application that primarily social and video gaming companies can use to help moderate their voice chat at scale. It uses our models under the hood, and in fact it was our first product that did this sort of thing. So it started out using an early, preliminary version of our models, which we're retroactively calling Velma 1. It was our first ensemble of voice understanding models, and we've now published the second, bigger, more flexible, more dynamic version. ToxMod used the first version of the models, and we've since updated it to use the more sophisticated models. But ToxMod is the purpose-built application that you hook into your voice chat stream, and it helps you moderate this stuff. Under the hood it's using the models, Velma 2, to do that analysis cost effectively and scalably.
B
You said that the same strategies can be applied to other modalities. Where are you guys going? You said that you're moving beyond gaming. Can you talk about other applications for the technology? And are you moving into other modalities?
A
The big expansion into different applications is to provide these models in a very flexible way to do any kind of understanding you want. So you could run the models on this conversation we're having right now and feed the data that we extract into any sort of system that you want. People can get a whole readout, not just a transcript, but everything about, you know, what is Carter really trying to say in this conversation? Where is it going? How did they feel? Are they excited? Is Carter tired? Did he not drink enough coffee? These kinds of things. Get all of that information from just this conversation. You can do that for any conversation on the planet. I want our models to be helping my grandmother stay safe from fraud. She's 91 years old and she got scammed, I think it was nine or ten months ago, because somebody called her and pretended to be me. And she's 91. There's not that much you can do to protect against it. Our technology could help protect against that stuff. Same thing when you're deploying a voice AI: it goes off the rails, does the wrong thing, people get hurt. It happens all the time with new technology. We can be watching that and keep that from happening. Is it appropriate for an AI to say this thing in these contexts? All of these sorts of applications. And maybe most exciting of all, we're transitioning from being this product company that builds products like ToxMod that have one use case to being this model-first company that provides the models for anyone to use. So I think the most exciting place we're expanding is offering the models directly through an API, and then people can build them into more applications, wherever it's useful to understand voice, that I haven't even thought of yet. So I think that's the most exciting piece of where we're going.
B
Yeah, it's interesting, the scamming industry. Who knew that it would be a multi-billion dollar industry? And I have a close friend who, you can't believe it, got taken in a romance scam and basically lost his life savings. Who would apply your tech, then? Because, for example, he claimed he was talking to one person and showed me a photo, and it was obviously a photo taken from the Internet. You're not talking to that person. But I couldn't convince him. I guess my question is, who would be implementing it? Is it the telephone service that you're using? Or would individuals have to download it onto their phone? That's one thing. But also, how subtle can you be? Could you match voices to faces? I mean, there is a certain physical structure to your face that influences what your voice sounds like. How sensitive is this to those kinds of subtleties?
A
Great questions. I'll answer in order: who implements it, and then how sensitive is it? So who implements it? It could be anyone in that chain. That's what we're excited about. The telephony companies themselves, the backbones, could implement this. Before today, it was just completely impossible. I mean, there are 13 and a half billion phone calls every single day. You want to monitor every single one of those? How on earth would you? Now it's a thousand times cheaper. Okay, maybe you can start to at least scan for fraud, right? Something like that is possible. So the telephony companies could implement it. The operating systems for the phones, Apple and Google, could implement this at an operating system layer, before you even get to the app layer, to scan that stuff. They do some of that for explicit photos, right? You can turn on features so that if somebody sends you an explicit photo, it can double check and say, hey, do you want to receive this? Yes or no. And you could also do it at the app level. So the people making the communications apps could implement the scanning there. It could go anywhere in the stack, because all we need is the voice signal. We extract all of the nuance, we extract everything, just from that voice signal. You don't need a ton of other inputs coming in to figure out: is this a likely fraud or not? Is this a likely scam? Is somebody using a deepfake?
B
Right.
A
We have the best deepfake detector for voice in the world. Is somebody using a deepfake? All of these things you can do from just the voice signal. So it's so flexible, you can implement it anywhere in the stack. As for the sensitivity question, we are completely voice specific and we are voice native. I see our niche, our role, like this: there are a bunch of AI companies out there, and many of them focus on different aspects of voice. Many focus on different aspects of video or imagery. Tons of people are focused on text, for example. We are really, really deep experts in voice. We know what we're doing. We know how to make it efficient. The architecture can apply to other media, and it's not impossible that we go and apply it one day. But we have built so much expertise in voice. We understand the data, we understand the medium. And I think the biggest advantage you can have as an AI model company today is to really deeply understand the data you're working with. And we deeply, deeply understand conversational voice data. So that's where we want to stay, because we want to build the best, the highest quality models that exist in the entire world. In order to do that, we need to be experts, and we're experts at voice conversations. So that's where we're going to stay.
B
Voice is maybe not as unique as a fingerprint or a face structure, but it is a unique identifier. Take the example of someone calling your grandmother and pretending to be you, and you said you guys can spot deepfake audio. If you have a chronic abuser, can you identify their voice? And then that becomes a profile that whoever is trying to prevent that person from harassing people would recognize, and maybe it alerts somebody: hey, he's back, she's back, we recognize the voice print.
A
We have technology that can do that with over 99% accuracy. The thing that's more difficult than just building the model to do this in a capable and accurate way is the privacy concerns. Because on the one hand, you want to catch common scammers and you want to keep them from hurting other people. But on the other hand, there are different levels to the amount of data you extract and store from an interaction. So, for example, scanning for toxicity or scanning for harassment is one level of analysis, and you need to have the proper safeguards in place and the proper consent in place to do that. Building a fingerprint of someone and tracking their conversations across calls is another level of privacy and another level of consent that you need to get to be able to do that. And there's a lot of variation across countries, across regions, across even states in the US as to what you're legally required to do. And there's also a layer of what you ethically should do. We pay attention to both of those things. So we have the technology capable of doing that. But there's a big jump to saying we can apply that technology in a specific situation, in a specific application, on specific kinds of data, with specific kinds of consent. That's where things get very nuanced and you have to be very careful. So on our end, we've only deployed that technology in a small number of cases where it's unambiguously providing value to everybody involved and we get completely unambiguous consent to be able to do that, with an opt-out path.
B
Yeah. And you know, this drifts into sensitive areas because it depends on who's deploying it. But it seems like this would have tremendous application in the national security space. Are you working with national security agencies that are, you know, scanning voice data, looking for criminals or terrorists?
A
We're not working with those agencies. I do agree it could have a lot of impact there. And I think the thing from my perspective to make sure we're being super, super careful about is, once again, that our goal as a company is both to comply with the law and do what we need to do from a regulatory perspective, and also to do what we need to do from an ethical perspective. And I think there's once again, as always, a ton of nuance. Nobody's against catching the bad guys who are out there hurting people. But being able to do that in a way where you're unambiguously having neutral to positive ethical impact, that can get a lot more nuanced very quickly. So it's absolutely a very applicable technology to that kind of problem statement. And I do not at all doubt that you could do a ton of good doing that right now, but it gets tricky. So we're not doing that today.
B
Yeah. I started out saying that I kind of felt voice was commoditized, but as you pointed out in this conversation, besides the other advantages in latency and cost, you're going much, much deeper in analysis. Given that you're not working with national security or law enforcement, how big is that specialized market, I mean, below the commodity market? From your point of view as an early-stage company, is it sort of bottomless, or do you see a pretty definable, approachable market?
A
There are tens of billions of hours of voice conversations that happen over some digital channel, telephony, VoIP, et cetera, every single day. Multiple people, the participants, the platforms, can get value from every single one of those conversations that they aren't getting today. Maybe it's just taking notes about how you felt. Maybe it's assisting a voice AI agent. Maybe it's moderating and making the community safer. But only a tiny fraction of those conversations are analyzed at all, and today a very, very tiny sub-fraction of that fraction is analyzed at the level of depth where you can get the complete amount of value out of that conversation and do what you want to do with it. That can apply to every single conversation happening over a digital channel on earth. So I think it's a big enough market.
B
And it doesn't have to be one-to-one conversations over telephony. It could be analyzing public speaking. I mean, obviously not with absolute accuracy, but can it detect when someone is not being truthful?
A
Lie detection is absolutely a component of what we're able to do from the voice signal and from the context and how folks behave in a conversation. It's a really difficult field. Even the best trained humans aren't perfectly accurate, and technology will never be 100% accurate at everything. But what I can tell you for sure is that just trying to analyze a text transcript and see if somebody's lying is a much, much, much more difficult problem than trying to understand the full context and emotion and tonality and sentiment and cadence and everything about that audio signal to try to figure out if there's a lie going on there. And that's in fact one of the behaviors that we look for and detect, and that's only going to get better over time.
B
Yeah. And beyond lying, just sentiment analysis I would guess in. Well, the obvious thought is of finance industry where they're trying to pick up signals about the direction of the economy, the direction of markets or policy. Do you have customers that are using it for that?
A
Yeah, actually, both Mike and I have a little bit of background in finance, and so those were some of the first kinds of customers that came to us when we started talking about how we're doing more than just text analysis, because they have some of the strongest needs. The CEO of a company is going to say the exact same talking points no matter what. They're never going to be like, hey everybody, the company sucked this quarter, things were bad. But they're going to talk about the future, and they're going to say exactly the same words and be more or less excited compared to how they are in other quarters. Right? Because each CEO is different, you have to be really, really nuanced, and you get a massive payoff if you get it right and you're able to be a little more nuanced than everybody else. So they were actually some of the first people that came to us and said, I am very interested in even the tiniest amount of extra signal you can give me beyond just the transcript.
B
Yeah. It is really fascinating how data that we once thought held no real information has all these signals. So where is Modulate focused right now? Is it in safety? Or is it in just making the system available through an API so people can try it for whatever application they see fit? I mean, where do you see the market going, or your market?
A
Right now our company focus is very heavily concentrated on making the model available to everyone. We've never been an API company before. We've never really been a model company before. We've been an application company powered by the best voice analysis models on the planet. And our recent transformation has been: let's give those best models on the planet to everybody through this API access. So that's completely new for us, as of like last week. That's where our big focus is. But we are doing that while still making sure that we maintain and improve and iterate both on the core models themselves and on the applications that are already out there delivering value. We know monitoring voice conversations for safety, for child safety, for extremist propaganda, for potential fraud, for potential security incidents, for deepfakes, we know all of those provide value, because we have customers doing that through our applications. So we're maintaining those applications and continuing to grow them, and we're offering our models for the first time to everyone to use and build other applications that we couldn't even think of. And then the thing that joins all of that together is that we are constantly improving our models to make sure we are shipping the best and most cost effective voice AI models in the entire world. That's what joins our whole mission.
B
Yeah, and I would guess, I mean, it's early days in the API journey, but you can see who is making an API call, presumably. So over time will you be able to see, oh, there's this industry that we hadn't thought of that's starting to use the API and maybe there's a market there that we should talk to. Do you have that kind of visibility?
A
One of the most important priorities for us is understanding who's using our models for new purposes that we haven't really thought of or haven't understood the scale of before. So a big part of this push for us, and a big part of the learning process, is how we can go out and see who is using our models, what they're using them for, and how we can optimize our models and our experience to make them more useful and better for those use cases. One of the cool things I experienced recently was going to a hackathon in San Francisco. You can just provide your technology to the hackathon participants and they'll build crazy stuff that you had no idea about. Then you sit down and you judge the projects, and you get, what, 50 teams of extraordinary developers who just spent six hours building something, and you've maybe seen 20 of the 50 projects before. Ten of them you've thought of, but you weren't sure if they would work. And 20 of them are totally new. And you're like, wow, I've spent a decade thinking about voice AI and I never thought of this. It's so cool. So between that and real-world customer usage, that's a top priority for us: what are people actually doing with these incredible models that we've provided?
B
Yeah, and Modulate is separating the voice from the content initially, right? I mean, you're analyzing the voice; you're not necessarily analyzing the content, are you? Or do you do both? I'm just thinking about sentiment analysis, where you could analyze the emotion in the voice, but you can also analyze the transcript. Do you do both of those? And do you do them separately? Can you talk about that?
A
Fantastic question. We do both of those things, and then we fuse all the data together to figure out what's going on. This is one of the big things where I'm like, okay, voice was kind of commoditized, and then we have this new explosion, right? Part of that new explosion of voice AI is the realization that you don't want to get the output of one model. You don't want me to come back and tell you, well, my voice-specific emotion model that's tuned primarily towards English speakers with American or British accents thinks that there's a 73% chance that this person is happy and a 20% chance that they're neutral. That's not what you want. What you want is: what emotion is this person experiencing? That's the information you want. And there are so many different components to understanding that. So we look at the content, the text transcript. We also look at the tonality and the emotion extracted from that voice. We look at the context and how it evolves over time. So someone like me, I just naturally sound excited, and so there's some signal in that. But it also means that if I sound neutral, that's a deviation from the norm, and that's important for you to know about. So we take all of those different signals and we're trying to get you the understanding of the conversation. We're trying to get you the answer. And also, if the tone of my voice doesn't match the text transcript of what I'm saying, if I'm like, wow, yeah, that was really fantastic, super cool, I'm totally blown away, right? That tells you a ton more about the content of what I really mean than looking at just the text or just the vocal emotion. You've got to fuse it to really understand.
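Editor's sketch: a toy version of the fusion described above, assuming two hypothetical upstream scores per utterance (a text sentiment score and a vocal arousal score) and a per-speaker baseline built from earlier utterances. It is not Modulate's method; the field names, score ranges, and the 0.3 threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Utterance:
    text_sentiment: float  # -1.0 (negative) to 1.0 (positive), hypothetical text model
    vocal_arousal: float   # 0.0 (flat) to 1.0 (excited), hypothetical voice model


def fuse(utterance, speaker_history, threshold=0.3):
    """Combine text and voice signals relative to the speaker's own norm."""
    # Baseline: how excited this speaker usually sounds (0.5 if no history yet).
    if speaker_history:
        baseline = mean(u.vocal_arousal for u in speaker_history)
    else:
        baseline = 0.5
    deviation = utterance.vocal_arousal - baseline
    # Positive words delivered much flatter than the speaker's norm is a mismatch.
    mismatch = utterance.text_sentiment > 0.5 and deviation < -threshold
    return {
        "baseline": baseline,
        "deviation": deviation,
        "tone_text_mismatch": mismatch,
    }


# A naturally excited speaker says something glowing in an unusually flat voice.
history = [Utterance(0.2, 0.8), Utterance(0.1, 0.9)]
flat_praise = fuse(Utterance(0.9, 0.3), history)
usual_praise = fuse(Utterance(0.9, 0.9), history)
```

Here `flat_praise` is flagged as a tone and text mismatch while `usual_praise` is not, which mirrors the point that neither the transcript nor the vocal emotion alone carries the full meaning.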
B
I mean, you're focused on real time analysis. Can these models be put on edge devices or do they have to be called using an API?
A
Now, the way we're serving these models is through an API, but deployed to a variety of different regions so that we keep the latency, well, as low as possible. Something we've done in the past and explored is breaking up the processing to be partially on the edge and partially in the cloud to be even lower latency and even more efficient. That's an area I'm super, super excited about. It adds complexity because different edge devices have different amounts of capability and also there are different applications. So we tried this early in the gaming days, but when you're playing a video game, you're using the GPU that would run the neural network to render the video game. So your computer's already very busy, you don't have a ton of time to go process a little neural net running in the background. But for other applications that's not the case and you do have spare resources. So breaking the processing flow up into part on the edge and part in the cloud is a really exciting way to get lower latency while still having the same level of accuracy. It's an interesting direction. The one thing we won't compromise on is accuracy. So no matter what approach we take, we have to be able to deploy the full conversation understanding model. And if that can be broken up between Edge and cloud. Awesome. If that can only be done by keeping all the stuff localized together in the cloud, so be it. But we won't compromise on the accuracy.
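Editor's sketch: one simple way to picture the edge and cloud split described above. The stage names, their compute costs, and the `plan_split` helper are hypothetical and are not taken from Modulate's system. The sketch keeps every stage in the plan, so the full pipeline always runs, and only decides where each stage executes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    edge_cost: float  # fraction of the device's compute this stage would need (assumed)


def plan_split(stages, spare_compute):
    """Run leading stages on the device while spare compute lasts, the rest in the cloud."""
    edge, cloud = [], []
    budget = spare_compute
    for stage in stages:
        # Once one stage moves to the cloud, every later stage follows it,
        # so audio features cross the network only once.
        if not cloud and stage.edge_cost <= budget:
            edge.append(stage.name)
            budget -= stage.edge_cost
        else:
            cloud.append(stage.name)
    return edge, cloud


pipeline = [
    Stage("voice_activity_detection", 0.05),
    Stage("feature_extraction", 0.15),
    Stage("conversation_understanding", 0.60),
]

# A device with some headroom versus a gaming PC whose GPU is busy rendering.
idle_device_plan = plan_split(pipeline, spare_compute=0.25)
busy_device_plan = plan_split(pipeline, spare_compute=0.0)
```

With no spare compute, as in the gaming case mentioned above, everything falls back to the cloud; with headroom, the cheap early stages move to the device.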
B
Yeah, and you mentioned that you handle multiple languages. I think I saw 18. Is that right? How many languages?
A
It's 18 language families, which is a weird, you know, maybe I should revisit that marketing language, that marketing speak. Because it's 18 language families that handle approximately 100 different individual languages, and then multiple different dialects within those languages. So it's kind of a hierarchy, and you get some weird combinations of languages or branches.
B
The.
A
It's actually very hard to enumerate a specific number of languages your machine learning models can handle, but it's pretty broad. Approximately 99.3% of all the voice traffic that goes through our models falls under those 18 language families, into one of the specific language categories that we can understand effectively.
B
Yeah, and I just keep on thinking of interesting things you could do. You could also identify origin, where a person is from, based on their accent, couldn't you?
A
That's absolutely possible. From a voice you can trace what kind of accent does this person have, what kind of tonality and what are some of the demographics that you can extract from their voice and you can fuse all that data together. Once again, it's always a privacy question and a legality question, but it's possible to do that with the models.
B
Yeah. And the other question is you can analyze all of these nuances in voice, can you generate those nuances? Can you go the other direction?
A
Synthesis is absolutely a part of our voice AI mission, but we've started out focused just on understanding. So right now, today, the models we're deploying and the applications we create are hyper-targeted towards understanding what's going on in the conversation. And as I alluded to at the beginning of this call, the voice skins where we started, that's sort of a transformation or synthesis application. But for the standard voice synthesis task, doing something like text to speech, we haven't built our own individual models around that yet. We wanted to focus on building the absolute best voice understanding capabilities in the world. But that's a place where I'll go next.
B
Yeah, yeah. Because I can see huge. I mean the, the voice synthesis models that I've used, it's very hard to get them to express an emotion and, but anyway, if someone wants to follow up with this, I mean, I have to say this is far more interesting than I expected it to be.
A
I'm thrilled to hear that. Fantastic.
B
Yeah. And I'll follow you guys. How does someone find the API? How does someone find you guys? Is it modulate.ai? Is that your URL, correct?
A
Yes. You can go to modulate.ai. You can find us on LinkedIn, you can find us on X, follow our company account. Feel free to follow my account if you're so inclined. And we're actually going to be doing a ton of announcements over the next couple of weeks, launching access to more facets of our models and giving out free credits to people who want to build on the back of our models. So if you're listening to this and you're interested to try them out, come follow us, come message us and let us know what you want to build or what you want to try, or just that you're interested. We're going to be sending out a lot of free credits and a lot of access to our different models and capabilities over the coming weeks. So just hit us up. We'd love to let you in and see what you build.
B
Okay, great. Carter, it's been fascinating and I'll be in touch.
Host: Craig S. Smith
Guest: Carter Huffman, CTO & Co-founder, Modulate
Date: February 11, 2026
In this episode, Craig S. Smith sits down with Carter Huffman, co-founder and CTO of Modulate, to delve into the architecture and applications of Modulate’s cutting-edge voice AI. The discussion explores how Modulate moved beyond traditional voice AI—now often seen as commoditized—by developing real-time, scalable, and deeply nuanced voice understanding technology. Carter explains how these advances are impacting gaming, safety, fraud prevention, and potentially many other industries, while maintaining technical and ethical rigor.
[03:07]
“I was out at Jet Propulsion Lab... working on super cool problems like how do you have a spacecraft that's flying by a comet... figure out what to do next to get the most science out of that flyby mission.” — Carter Huffman [03:16]
[04:03–05:53]
“You don’t just want a transcript of what's being said, you want to do something useful with it. You want to really understand what's going on.” — Carter Huffman [05:10]
[06:12–07:43]
“The breakthrough that we had was that you can do that analysis quickly, accurately, understand the nuance... And you have to do it super, super high scale and super accurately. We’re talking hundreds of millions of hours of audio a month.” — Carter Huffman [06:49]
[08:21–10:43]
“Reading a transcript is so much harder to pick out those kinds of social cues... Understanding what’s going on in an accurate, repeatable, deterministic way so that you can take action on it—that’s super important.” — Carter Huffman [11:18]
[12:15–19:55]
“If you know what kind of representation you want ahead of time, you can piece that out... you can be so much more efficient and more accurate and more deterministic than a large foundation model.” — Carter Huffman [14:40]
[20:54–25:51]
“If you’re taking like half a second to decide which model to route this data to, you’ve already lost on latency, so you have to make that decision super fast.” — Carter Huffman [22:15]
[25:51–34:53]
“If you’re going to say, ‘I’m going to listen to every single thing that every person says in every online game and see if they’re harassing somebody or not,’ that would have cost billions of dollars a month... But when we came along with these ensemble models and we said, what if it’s a hundred or a thousand times cheaper, then would you do that? The studio said, yeah.” — Carter Huffman [30:54]
[37:21–38:32]
[38:58–56:53]
“Maybe most exciting of all, we’re transitioning from being this product company that builds products... to being this model-first company that provides the models for anyone to use.” — Carter Huffman [40:08]
[46:14–49:34]
“You want to catch common scammers and you want to keep them from hurting other people. But... there are different levels to the amount of data you extract and store from an interaction.” — Carter Huffman [46:33]
[62:17–63:13]
[58:52–64:45]
“If the tone of my voice doesn’t match the text transcript of what I’m saying... that tells you a ton more about the content... than looking at just the text or just the vocal emotion. You gotta fuse it to really understand.” — Carter Huffman [59:52]
[60:35–62:17]
[54:56–56:53 & 65:10–66:09]
“If you’re listening to this and you’re interested to try them out, come follow us, come message us and let us know what you want to build... We’d love to let you in and see what you build.” — Carter Huffman [65:36]
This episode provides an in-depth exploration of the current and future state of real-time voice AI and its broader implications, mixing technical detail with humane, ethical reflection. Modulate emerges as a leader shifting the paradigm from transcription to understanding, opening new frontiers for voice technology worldwide.