A
Mati Staniszewski co-founded ElevenLabs in 2022 and has since scaled it to the $11 billion leader in AI audio. He's credited with capturing the humanness of speech through realistic emotional inflection. And they're now expanding into everything from agentic workflows to music. Thanks for doing this.
B
Thanks for having me.
A
A good place to start: I know how an LLM works at a high level. Describe to me how an audio model works. Like, if we were, Karpathy style, looking to build a toy one from scratch, how does it work?
B
In the early days you would try to replicate speech exactly the way the human body produces it. So you would try to build an analog machine that effectively recreates a vocal tract. That progressed into trying to create digital signals for speech. Bell Labs was one of the first to try to create a structured set of signals that would represent speech, and that is the first precursor to what we do today. Then you would try to stitch in phonemes, effectively the different sounds we make as humans when we speak, and concatenate them together. That's another important part of the equation: based on the most probable next word, you would effectively pull the phonemes from your library of phonemes and bring them together. And then down to modern history, where we now effectively use neural nets similar to those in other domains. So you predict the next sound based, of course, on the context of the previous sounds. If it's streaming speech, if it's, let's say, a context of audio, you will use a combination of predicting the phonemes, but you also use the contextual text element of that work. And here, credit to my co-founder Piotr, who effectively came up with that new idea of how you can now create voice models that are reliable, high quality, and quick, where you bring a lot of the ideas from transformer models and from diffusion models into the speech space. So that prediction of the next token in the phoneme space wasn't something that was possible. You spoke briefly about this, of how you kind of operate in the text space and in the waveform space. There's also mel spectrogram space. So usually you go text, then mel spectrogram, then waveform.
A
So what's this spectrogram space?
B
It's like a visual representation of how the speech sounds across pitch, across energy and then you transform that into a waveform.
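[Sketch added for illustration; not part of the conversation.] To make the "visual representation" concrete, here is a minimal pure-Python sketch of a spectrogram: the signal is cut into short frames and each frame is turned into a column of frequency magnitudes. It is a toy under stated assumptions, not how any production voice model computes features. The naive DFT, the frame size, and all function names are illustrative choices; `hz_to_mel` uses the commonly cited 2595 * log10(1 + f/700) conversion that a mel spectrogram applies to warp the frequency axis toward how pitch is perceived.

```python
import cmath
import math

def make_tone(freq_hz, sample_rate, n_samples):
    """Generate a pure sine tone as a list of float samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]

def frame_magnitudes(frame):
    """Magnitudes of a naive DFT for one frame (bins 0 .. N/2)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(samples, frame_size, hop):
    """Slice the signal into frames; each row holds one frame's frequency magnitudes."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        rows.append(frame_magnitudes(samples[start:start + frame_size]))
    return rows

def hz_to_mel(freq_hz):
    """Common Hz-to-mel conversion used to build mel-scaled frequency axes."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

SAMPLE_RATE = 8000
FRAME_SIZE = 64

# A 1000 Hz tone followed by a 2000 Hz tone, standing in for changing pitch.
signal = make_tone(1000, SAMPLE_RATE, 256) + make_tone(2000, SAMPLE_RATE, 256)
spec = spectrogram(signal, FRAME_SIZE, hop=64)

bin_hz = SAMPLE_RATE / FRAME_SIZE  # width of one frequency bin in Hz
peak_hz = [row.index(max(row)) * bin_hz for row in spec]  # loudest frequency per frame
```

Reading `peak_hz` frame by frame shows the pitch stepping from the first tone to the second, which is the kind of time-versus-frequency picture a decoder then turns back into a waveform.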
A
Got it.
B
So when WaveNet came along, and the Tacotron models, they would effectively use text to mel spectrogram, so that visual representation, and then work out how you decode and encode that into the waveform to bring it across. And Piotr figured out how to abstract some of those steps and decode and encode them a lot better. So predicting the next phoneme was one of the big pieces. And the second big piece was how you bring that context into the equation. What I mean by context is: if a voice actor was reading a textual copy, you would know that, okay, this is a dialogue sequence, I need to produce a dialogue. If it's a happy sentence, I might need to pronounce it as a happy sentence. So what happens before and after comes into the equation, and you need to bring that across. And then there's a last big piece. The voice model has the sound of how you intonate the given fragment, but the second big part is the voice itself: the characteristics of accent, of style, of prosody across that voice. So when you actually try to vocalize something, when you create that voice model and turn text into audio, you need the text, and you also need the voice reference of how you want it to be spoken. So here is the second big innovation, apart from context: how you decode and encode those features. When Bell Labs came up with their initial representation of speech, the big piece there was that you would have effectively hard-coded parameters for that speech. With ElevenLabs models...
A
Hard-coded parameters for an enthusiastic speaker, British accents?
B
Exactly, exactly. That kind of stuff. Like the set of pitch elements that you can select, the set of energy spectrograms you can select from. And in our approach, effectively you give the model an open-ended ability to select what those parameters should be. So it's not going to be a British, Polish, Spanish, or English speaker; the model will deduce that itself. The same goes for the other parameters that are not hard-coded, whether it's the enthusiasm, whether it's the sadness, et cetera.
A
You're saying kind of Britishness is an emergent property in your voice models.
B
Exactly, yeah. And those are kind of the two big parts. The encoding and decoding of how you create the voice was a super hard problem before, and the other thing figured out was how you then construct that, in a sense, how you get the context across so you can predict the next phonemes. So how you bring them together in a reliable and stable way while doing it quickly. And these were kind of the first two big innovations in the voice models that continue to today.
A
But okay, so if LLMs reason about text, with tokens, word subparts, as the way they think about the world, what is the equivalent of a token in a voice model? You mentioned phonemes a bunch. Like, what is that representation?
B
So we store the voice embedding, effectively, for the speaker. You need that reference when you produce and create the speech. Of course, in the input to the voice model you still get the text, and you bring the speaker encoding. And then when you produce speech, you operate on the waveform or effectively on the phoneme level of that speech. And then when we kind of go the opposite...
A
So of course, what is a phoneme? Fill in my understanding.
B
It's like a syllable deconstructed into even smaller elements.
A
Okay.
B
And these are effectively the human sounds you can produce.
A
Got it.
B
So this would be the closest to that representation. But of course in our models now it's going to be a combination: you're not only operating on the phoneme level, you also operate on the text level. You operate on both in sync. Because when you are predicting the context, you need to understand how that sentence will get constructed. And especially if it's more of a streaming, real-time use case, like a voice agent setting, you need both parts to work together. So it's similar: the way you would operate on the token level on the text side, we operate on the token level on the audio side.
A
It feels like a big part of the magic of Eleven was that your voices were much more human sounding. How did you accomplish that?
B
So I'll give you a quick synopsis of how we think about the models on the text-to-speech side today. In any model you need architecture, you need compute, you need data. So architecture innovations were one thing. The data part was the second big thing. With audio, you will have a lot of audio data available, but frequently you will not have it annotated in the right way. You won't have which speaker is speaking when. Sometimes the what is annotated, but the how isn't. So as we are speaking now, what are the emotions that we use? What are the accents that we use? So we invested a lot internally in effectively creating our own data labelers, our own team, to be able to create data sets that would be better. And that was a combination of, of course, semi-automatic techniques and then manual techniques. And actually a lot of the models that we did afterwards spun out from a lot of that research too. So the speech-to-text model initially was a model we did for ourselves, because the models on the market just weren't good enough to annotate that data. And then another brilliant researcher on our team was able to construct it so we could spin it out as a model that we brought to the customers.
A
So you've just been doing useful stuff in voice, and that has emerged with a whole bunch of products that you mightn't have expected because you find you're building useful stuff.
B
Exactly, exactly. And that's a combination of data, of being able to do it automatically, and of creating a team that's coached on voice, on how to describe it, because most of the labelers out there just aren't as well versed in understanding the audio and the voice. That helped us a lot. And then, of course, deploying those models in production, seeing how customers interact with them, having them annotate all of the data, helped us refine those models over time. A very interesting thing on the side: we spoke about the speech representation. The first guy who created a speech representation is a guy called Kempelen, von Kempelen. He created this analog machine that would effectively represent a human vocal tract and try to produce that sound. He spent decades on that, and it kind of started producing vowels. But that's the same person who created a chess machine, the first viral, let's say, chess machine that would kind of simulate playing chess.
A
This is the Mechanical Turk?
B
It was called the Turk. Yeah, yeah, exactly. But the kind of crazy thing behind it: it was operated by a human, and it was all a fluke. And that's where the name Mechanical Turk comes from, which we actually use in that kind of data labeling production too, to make that work there.
A
Yeah, yeah. And sorry, we kind of jumped right in, but if you describe the Eleven business today, people think of you as the speech company. How should they actually think of your business? To the extent you can describe the big areas: text to speech, speech to text, voice agents. Just, like, break down the business for us.
B
Cool. So in a nutshell, ElevenLabs is a research and product deployment company. We build foundational audio and voice models and then build a platform for businesses to transform how they communicate with their customers and with their employees. And that applies through AI agents, from customer support, sales, hiring, and training, all the way through to marketing and storytelling with our creative tools. In that set, we've created all types of foundational audio models. So text-to-speech models for producing speech, speech-to-text models that work across over 100 languages and happily beat others on benchmarks, all the way through to conversational models for how you loop them together, to music, to other domains of audio. And then of course, beyond the models, when you actually bring them to production, that's where the second level of the platform comes in, where that meets the businesses on the specific use case. So in the agent-specific example, it would be how you connect those models to the knowledge base, to telephony, to the integrations that you need to perform the actions, how you evaluate and monitor that the agent behaves in the right way, how you build the right safeguards. On the creative side, on the marketing side, it's how you create a good ad, so you can create a good video voiceover for one of the campaigns, or how you create an article that's narrated with a specific voice that represents the brand in a good way. So that's where we combine the models and our understanding of the customers we work with into one platform.
A
Every platform company has this question about how far they go into applications. So how do you think about where you go horizontal and power the whole ecosystem versus where you develop applications? Because you can imagine there being a whole ecosystem of closed captioning tools that grow up that are built on the ElevenLabs tech. Like, it's not necessarily a space that you would have to go after yourself.
B
I think the big difference in your question is this: today we see ourselves as a platform where, if you're building a horizontal use case in your business, it's a great place to come. If you have a lot of domain specificity, that's where I see a lot of application companies forming over time, and those are specifically not the spaces we will go into.
A
And I think it also is interesting when the tech is moving as quickly as it is here. It's one thing with SaaS, where you get these vertical-specific providers, but I would imagine one of the biggest risks for you guys in being intermediated is if there is, in this example, a closed captioning service that is on a two-versions-old version of ElevenLabs and hasn't upgraded. That's a problem, because you want people to be using the latest and greatest model that you've developed, and you'll be deploying new capabilities every week. And I presume that's part of your thinking: when it's moving that quickly, you need to go direct in a lot of cases.
B
That's right. In the closed captioning case, already now we know that our service is going to be able to tackle 99.9% of the cases that customers have. And then there's the added benefit that we work with healthcare customers, where we will create custom models for those customers, where we'll get that transcription perfect.
A
The context is the tricky thing in closed captions where we talk a lot about a lot of technical stuff on this.
B
Yeah, for sure. And that's where you need like effectively like a dictionary of words that you detect beforehand, which as we work with the businesses we know we need to embed in that creation process.
A
We're talking a bit about products here. And one thing I note is that LLMs are amazing, and you have the usage stats of ChatGPT and Gemini and all the popular LLMs, where they're working and people use them a ton. It feels like there's a big product overhang when it comes to voice, where the leading-edge voice models are incredibly capable, and yet... I was driving home the other day and I needed to read a PDF, but I was driving, and so I said, okay, I'll just have my phone read the PDF to me. And you can kind of try and hack it with the iOS screen reader, but it doesn't really work with the scrolling. And then in theory you can upload it to Gemini, but you're trying to get it to not summarize it. And it actually just hung when I tried to press the "read this to me" button. And so there was no way I could get my phone to read me something, which seemed like a fairly basic feature. And all cars advertise voice control, and yet it sucks. Separately, if you want to input something to the navigation, just no car has a good version of that. Yes, maybe Tesla does. And so why does it seem like with LLMs and Claude Code and everything we are using all the capabilities of the intelligence, whereas with voice we're living 10 years ago somehow?
B
Well, I'm thinking about whether I agree with the premise that we are 10 years...
A
Behind in the lived experience of people day to day. Like, they're using Siri's transcription, which has gotten better, and it's still way behind the leading edge.
B
Yeah, there is definitely a piece of that. I think the technology in many of those cases is ready. There's a deployment gap in what you are describing. It's that in automotive, or in some of the big companies, they are not adopting that quickly enough or bringing it into production. But there are plenty of different problems that you need to fix along the way. I mean, the quality of voice models, for them to actually sound good: this is only a last-three-years thing.
A
Yeah, that's three years.
B
It's a three years thing.
A
Cars have over the air software updates now.
B
So that's three years for the first voice model that can narrate text async. Two years ago you could start seeing the real-time version of that, and not really even then. I think the real break was about a year ago, when you could start seeing that in production. And then I think over 2025 the big piece that hadn't been possible is how you now connect the real-time voice interaction with something, which I think is what you are referring to, that has context of what you want to do: what is the material that you want to read, how does it connect to your set of preferences from the past, and how it gets that across. I think that only recently became possible, and that's where we've seen the big adoption across the enterprises leading on the technical side. I think this year it should be on the automotive side too, or in some of the applications.
A
Okay, so you think we'll start seeing kind of great voice models in cars this year?
B
This year for the cloud use cases. For on-car, in-car, so without connectivity, not yet. There's of course a deployment gap of how you bring that into the cars. But I think in the next two years, three years.
A
How about the PDF reading use case?
B
That should work. Yeah, well.
A
But how should I have done it?
B
So back in the day, and this is the backstory to ElevenReader, we had this problem. We had so many audiobook authors coming to ElevenLabs. So in 2023 we released our first software. We had a lot of creators, and then a lot of audiobook authors or book authors that couldn't afford professional narration and wanted to create an audiobook. However, none of the companies accepted AI audiobooks.
A
As in, you can't sell an audiobook on Audible or something?
B
Exactly, Audible would, like, block AI content. So we had no choice. Like, we needed to create an avenue for them to...
A
Because it was really distribution.
B
Exactly.
A
For AI audiobooks.
B
Exactly. So we created ElevenReader, and that came with functionality where you can upload your PDF, you can upload your text, and have it read out loud with a number of incredible voices. So whether it's Sir Michael Caine, all the way through to estates, and working together with Sir Joseph Feyn, where you can...
A
Working with the Sir Michael Caines of the world.
B
Exactly. And then you can actually read it out loud. And that works extremely well. So that works now; that's how you can do it.
A
I think actually I do want everything read to me by Michael Caine.
B
It's a great voice.
A
Yeah. Shouldn't you guys have a consumer app where I can just do the common voice things? Like, I want to be able to have an Eleven app on my phone, and then if I upload a PDF to it, it can do the common things that I would like, such as read it to me.
B
Yeah, that's exactly ElevenReader. So that works.
A
Okay. The phone makers allow third party keyboards. Do you think they. Do they allow third party transcription engines? Will they?
B
The phone makers, you said, right?
A
Like Apple and Google.
B
Yeah, the OS makers. Not all of them. With Android you can work through it. There are, you know, variations of that, Nothing Tech and others.
A
But yeah, I feel like if you had a popular Eleven app that allowed for transcription, people would use it a bunch, and maybe eventually Apple would say, oh, we should allow third-party transcription engines if that's what people want.
B
I mean, it seems like they might be going in that direction, right? Recently they announced that they'll open up the LLM ecosystem. Hopefully they will do the same with the voice ecosystem, which is kind of similar.
A
Again, I think it's rational to do when it's moving so quickly.
B
Yeah.
A
The voice assistant paradigm is one of the oldest paradigm, you know, UI paradigms in computing like the open the pod bay doors Hal from 1969.
B
Yeah.
A
I will claim it's not working yet. So Siri doesn't have the intelligence. And then on Gemini and ChatGPT and those apps, I mean, I want to use the voice mode, but I don't know about you, it just doesn't work. And so sometimes I'll be using my phone and I'll use the iOS keyboard transcription to type in the field, and then say a bunch of stuff and send it off. But this suggests to me that consumers really want a voice mode that works. And yet it's just not working yet for the major LLM apps or for anyone. Why isn't it working yet?
B
It is pretty hard to do, because you want several things. You want to be able to say the things that you want, but sometimes you want it to execute, and sometimes to wait for you to finish and add something to the sentence. Sometimes you want it to be interactive, so it asks you questions back to clarify and get some additional detail. And all of that is actually pretty hard. That's where the magical, ideal version of a voice agent comes through for us: you need the speech-to-text element, the transcription side, and then you need the turn-taking mechanism. So when have you finished the sentence? When is that likely, based on silence, and based on the context? And then sometimes you want it to speak back and clarify, or at least give you the text back to clarify, and then maybe execute a set of instructions. So that problem is still very hard research. So I agree with the claim that this orchestration side has not passed a true conversational agent Turing test, where it behaves as you would expect from another person.
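[Sketch added for illustration; not part of the conversation.] The turn-taking idea described here, deciding from silence and from context whether the speaker has finished, can be sketched as a small decision rule. This is a deliberately crude toy under stated assumptions: the two thresholds, the list of "trailing" words, and the function names are all hypothetical, and a real system would presumably use a learned model rather than a word list.

```python
# Hypothetical thresholds, chosen only for illustration.
SHORT_PAUSE_S = 0.4   # enough silence if the sentence already looks finished
LONG_PAUSE_S = 1.5    # enough silence to respond even if it looks unfinished

# Words that usually signal the speaker is not done yet (illustrative list).
TRAILING_WORDS = {"and", "but", "so", "because", "or", "the", "a", "to", "um", "uh"}

def looks_complete(partial_transcript: str) -> bool:
    """Crude context check: does the text so far read like a finished thought?"""
    words = partial_transcript.strip().lower().rstrip(".?!").split()
    if not words:
        return False
    return words[-1] not in TRAILING_WORDS

def decide(partial_transcript: str, silence_s: float) -> str:
    """Return 'respond' or 'wait' for the current moment in the conversation."""
    if not partial_transcript.strip():
        return "wait"  # nothing has been said yet
    if silence_s >= LONG_PAUSE_S:
        return "respond"  # long silence wins regardless of context
    if silence_s >= SHORT_PAUSE_S and looks_complete(partial_transcript):
        return "respond"  # short silence plus a finished-looking sentence
    return "wait"
```

For example, `decide("Read me the PDF and", 0.5)` keeps waiting because the sentence trails off on "and", while the same pause after "Read me the PDF" triggers a response.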
A
You can say that's the simpler way of saying what I'm saying, which is that we passed the Turing test with text LLMs a long time ago and we're actually nowhere near that on voice LLMs. So it's kind of interesting how that's a final frontier.
B
Yeah, I feel like it's going to work in specific domains. In customer support calls, it passes; the voice Turing test works well. Let's take the other end of the spectrum: an interactive gaming experience, truly interactive, as you would have with another human in that game. It's so hard and further out there. We haven't passed it yet there.
A
Yes, yes.
B
But I think that's a combination of things. Even a simpler version within that: sometimes you might give a response back immediately, and sometimes you need a tool call to get additional information from the database. How do you orchestrate that? That's probably the most common thing we see as we work with some of the companies out there: you want those systems to orchestrate extremely well. If it's a conversational use case, it's pretty simple, you can route to the agent to speak with. But if you need to authenticate, if you need to pull additional information from the database, what do you do? How do you handle that gracefully? And yeah, to that extent I would agree that's just getting there. Our goal is to pass the voice Turing test in all those cases, or the Turing test for all conversational agents outside of voice too. And I hope we will all be there in the next year or so.
A
For subscription businesses, a lot of revenue is lost in the last few seconds before the checkout. Someone has to get up and find their wallet, or they mistype their card number, or they hit an error, and they just give up and you lose the sale. For a company like ElevenLabs, adding hundreds of thousands of subscribers, even a tiny bit of friction like that would really add up. That's why ElevenLabs uses Link from Stripe. Customers save their details once and then they can check out in seconds, with saved credentials, across more than a million businesses. So if you want a faster checkout for your customers, you should turn on Link from Stripe. Are you guys working on personalized voice transcription? It feels like part of the way we're making it hard for ourselves is that when I speak to Siri, I have a bit of an accent, and so it sometimes has a hard time understanding me, but my accent doesn't change. And so it could just get good at listening to John. But my understanding is it's not doing that. It's just running the global voice recognition model. And I'm guessing it's the same for ElevenLabs, where you're running the global voice recognition model. But again, you have an accent. And so if you walked up to someone in a coffee shop and said two words, they might have a hard time understanding, because they're not putting it through their Mati Polish accent filter. And so where's this going with actually interpreting the person that you know to exist on the other side?
B
Yeah, I have a very tricky one to detect. So my voice is frequently used in the test.
A
You're a part of the test suite
B
for text to speech. For speech to text, for like everything. It's pretty tricky.
A
But again, trying to parse your voice in a global model is just making life hard. It's like, have a Mati-specific model.
B
Yeah. So on the speech-to-text side, the transcription, exactly. The big part that we are bringing in now has two parts. One is effectively a person-specific or voice-specific detection, which is true for the accent side, but it's also true for a crowded room. So that's where we have an incredible research team that's able to keep the accuracy high, but also add things like speaker detection and, of course, noise reduction. But then the second part is keyword detection. There are specific words that you would want to say in those settings that you want to effectively monitor for. So, as we spoke about, let's say I'm going to the coffee shop to order things. There is a set of actions the coffee shop would expect me to do.
A
There's information theory. It's like they can just listen out for the coffee words.
B
Exactly. And then try to match it to the closest one. So both things will help. In a setup where you have my voice, perfect, you can decode it and encode it on that. If you don't have my voice, or even if you want to double-amplify it, we already support effectively a keyword detection, which is useful for real-time settings and async settings. So back to Cheeky Pint transcription: you could effectively pre-generate that from the previous podcasts and look for the set of words that you would traditionally use in that.
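[Sketch added for illustration; not part of the conversation.] The "listen for the expected words and match to the closest one" idea can be sketched with Python's standard `difflib.get_close_matches`, which returns the most similar strings from a list above a similarity cutoff. This works on already-transcribed text, which is an assumption made for simplicity; the conversation does not say how ElevenLabs implements keyword detection, and the vocabulary, cutoff, and function name below are hypothetical.

```python
import difflib

def snap_to_vocabulary(transcript_words, vocabulary, cutoff=0.75):
    """Replace each word with its closest vocabulary entry when one is similar enough."""
    lowered = {v.lower(): v for v in vocabulary}
    corrected = []
    for word in transcript_words:
        match = difflib.get_close_matches(word.lower(), list(lowered), n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(lowered[match[0]] if match else word)
    return corrected

# Hypothetical domain vocabulary for a coffee shop.
coffee_words = ["espresso", "macchiato", "cortado", "oat"]

# A transcript with two plausible mis-hearings.
heard = ["one", "expresso", "and", "a", "machiato"]
fixed = snap_to_vocabulary(heard, coffee_words)
```

The same approach would apply to a podcast: build the vocabulary from previous episodes' transcripts, then bias new transcripts toward those names and terms.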
A
And so how hard. Okay, so you do the keyword detection already, but how hard are the. I want to get superhuman transcription performance by feeding it an hour of Mati audio before it listens to Mati and then it should be able to do a much better job transcribing. Is that just a really hard research problem?
B
No, solvable. We think we can roll it out in one of the next versions, which is like hopefully in the next month.
A
Oh, so you think this year you're doing, for sure, person-specific transcription?
B
Person-specific transcription. Like, we can already diarize speakers extremely well. So if we are speaking, we can of course disambiguate who is speaking when.
A
Yes.
B
On the transcription side, apart from accuracy, diarization is one of the harder problems, and we do that extremely well. And now it's going to be effectively what you're saying: fine-tuning based on the speaker that I want to listen to, which we know will be important. I mean, in a healthcare setup, it's such an important part. You're in an operating room, you're a doctor, you want to say a command, and you want it to really be able to listen to that one specific person.
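[Sketch added for illustration; not part of the conversation.] One simple way to picture "listen only to this one person" is to compare a speaker embedding for each audio segment against an enrolled embedding for the target speaker and keep the close ones. This is a toy, not a description of ElevenLabs' diarization or fine-tuning: the three-dimensional vectors, the 0.8 threshold, and the function names are hypothetical, and real speaker embeddings come from a trained model with far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def segments_from_speaker(segments, enrolled_embedding, threshold=0.8):
    """Keep only the text of segments whose embedding is close to the enrolled speaker."""
    return [
        seg["text"]
        for seg in segments
        if cosine_similarity(seg["embedding"], enrolled_embedding) >= threshold
    ]

# Toy embedding for the enrolled speaker (the doctor in the example above).
doctor = [0.9, 0.1, 0.2]

# Transcribed segments, each tagged with a toy embedding of whoever spoke it.
segments = [
    {"text": "scalpel", "embedding": [0.88, 0.12, 0.18]},
    {"text": "what time is lunch", "embedding": [0.1, 0.9, 0.3]},
    {"text": "suction", "embedding": [0.92, 0.08, 0.25]},
]
commands = segments_from_speaker(segments, doctor)
```

Only the segments that sound like the enrolled speaker survive the filter, so background chatter never reaches the command handler.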
A
Yes.
B
You have a hardware device at home. Let's say it's a remote that helps you control the TV. Here too, you will want it to listen to you versus, let's say, the family roaming around. Or maybe you want it to listen to everyone. So you could decide, but in many cases you want to be able to specify that.
A
Okay, that's really exciting.
B
It's great because there's still so many unsolved research problems that we can.
A
Yeah, there's just breakthrough after breakthrough coming in the domain of voice models. How about on the flip side, when it comes to speech generation? The Zoom "touch up my appearance" feature: I've always thought about that in the context of voice. Should you offer a de-accenting filter for voices? Or even, there's this one podcast that I like to listen to, but the voice is a little mumbly. And I always thought they should put it through a de-mumbling filter just to, like, make the...
B
Slow it down. Slow it down.
A
Yeah, make the enunciation a little better. But all these things again, like photoshopping an image, there's no reason that the like. Have you thought about voice to voice basically, rather than voice to text or text to voice?
B
Yeah. So there are kind of two big parts. One, on the speech generation side, it's similar. So many innovations are still there. There's a wider piece, and that's where we released a V3 model that was solving it for the first time: can you control speech? So you have the text to speech, and you generate something that sounds emotionally great. Previously, until the end of last year, you would effectively rely on the model to decide what's the best performance. You could regenerate it, but ultimately the model decides the best performance. So that's where the controllability came in, where we can finally give it cues: say it in a slower way, or change how you deliver the dramatic pause, or any cues that you give. And to be able to do that, you need architectural changes and the data that we created over time, where we annotated what was said and how it was said, so you can actually train the model to do that. So today, finally, you can have both speech generation and an entire voice agent experience with what we call expressive mode, where the agent knows the emotions on the other side. So if the person is stressed, it can react and be reassuring. And that means generating an LLM response on the reassuring side and a spoken response in that set of emotions too. And that breakthrough was super hard to do. And that of course stretches to a lot of what you said. It could be some version of speech enhancement, either real time or in a post setup, to change how that's delivered. And that's a relatively recent innovation. And we know it can still be so much better. The set of edge cases of how you want to describe it is pretty large. So that's one. And then the second part of the question, which is a huge question: the speech-to-speech models. So as you said, our approach, as you think about voice agents, conversational agents, is effectively a cascaded approach. You use transcription, so speech to text, then the LLM, then text to speech, and you orchestrate all of that together.
And then you have a speech to speech which kind of goes directly from speech and there's a speech response on the other side.
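[Sketch added for illustration; not part of the conversation.] The cascaded approach described here, speech to text, then an LLM, then text to speech, can be sketched as three pluggable stages with a text trace recorded between them. The trace is what gives visibility into each step; a speech-to-speech model, by contrast, would map audio to audio with no text layer in between. Everything below is a stand-in: the function names are hypothetical and the fake components simply pass bytes and strings around so the sketch runs without any real models or APIs.

```python
from typing import Callable, List, Tuple

def run_cascaded_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> Tuple[bytes, List[Tuple[str, str]]]:
    """Run one conversational turn; return the audio reply plus a text trace."""
    trace: List[Tuple[str, str]] = []

    user_text = speech_to_text(audio_in)       # stage 1: transcription
    trace.append(("speech_to_text", user_text))

    reply_text = llm(user_text)                # stage 2: intelligence layer (swappable)
    trace.append(("llm", reply_text))

    audio_out = text_to_speech(reply_text)     # stage 3: speech generation
    trace.append(("text_to_speech", f"{len(audio_out)} bytes of audio"))

    return audio_out, trace

# Stand-in components so the sketch runs without real models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_llm(text: str) -> str:
    return f"You said: {text}"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

audio_reply, trace = run_cascaded_turn(b"read me this PDF", fake_stt, fake_llm, fake_tts)
```

Because each stage is just a callable, the LLM (or either speech model) can be replaced without touching the others, and every intermediate text is available for logging, evaluation, or tool calls.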
A
When we say speech to speech, is that the idea that it doesn't go through text as an encoding in the intermediate step? Oh, interesting. Okay. For performance reasons, for accuracy reasons?
B
You usually do it for latency.
A
For latency. It's faster to run a model that does not have to transcribe and then generate.
B
Exactly, it's quicker. But on the flip side, you lose reliability, and you lose all visibility into the parts of the pipeline. And emotionality, we think you can deliver on both sides extremely well. And maybe you can make it more controllable too. So today we are optimizing heavily on a cascaded approach. Sorry, the cascaded approach is the speech-to-text one, going through the text layer. And as you work with a lot of businesses and enterprises, they will need that visibility into what happens. They will want to execute certain tasks. On top of that, they want good visibility into each of the steps and great accuracy from all the models. But beyond that, they can abstract away what's the LLM layer, what's the intelligence layer. The integrations are easier in that system. So that's where we are betting a lot of the research work, on how you can make that great. And we think we can make that great. In speech to speech, as you think about maybe more of a companion version of the applications, that's where that will flourish. Because maybe the hallucinations aren't as important, but the latency matters a little bit more. And maybe hallucinations are even a feature. And maybe in the future, just to finish that part, you will have some combination of the models, where for low-complexity, easy cases you will have speech to speech, and for higher data complexity you will have the cascaded approach.
A
Okay, so I was going to ask about this. You know the way there is research on how the invention of writing changed humans' brains and changed the neural pathways in ways beyond the actual written language? Do you observe that speech-to-speech models think differently than cascaded models? Like, it sounds like they're dumber.
B
They are definitely dumber. You need a smaller model. You cannot.
A
But that's interesting, right? That forcing models to reason about text. I mean, I know they just have much more in there as well, but they're smarter.
B
Yeah, but it's like if you are going speech to speech, usually you will use smaller models. So it's still quick.
A
I see. So it's also just a model size thing. Okay, but are there interesting differences beyond correlates like size?
B
What I can say is slightly different to your question. The way people interact through voice, and the performance we see for how they interact with the business, changes just by the nature of interacting with us. A good example: you can contact ElevenLabs and register your interest. You go through the form, and at the end of that we supplemented it: instead of going through the form process, you can speak with our agent and leave more details. And we learned two things. One, people were actually much more keen to fill in the form by speaking with the agent, so they would go through the form a lot easier. But second, they would be a lot more open ended in terms of what the use cases are. So they would start giving us information about the wider set of use cases, the complexity of the use case. So, like, the writing out was tedious and tricky.
A
This is like an open ended adventure.
B
Open ended. You could ask follow up questions, you can clarify, but people were just more at ease and could trust the system while doing that, that it's working. And that kind of helped us a lot. And then three, which maybe is more of a technological barrier, it also works across all languages. So now we have leads from all parts of the world coming in and leaving their details. So we did that use case, and now we have a few different companies building their SDR versions of that too, to help them capture the leads coming in, from banks all the way to actually one of the automotive companies, where people are just more keen to speak through voice.
A
I want to ask about this kind of second order effect you have. You know, you've talked in the past about how growing up in Poland, I guess the dubbing of TV shows, they were cheap and so they would only have one voice actor for a TV show. So no matter the part, male or female, it would be like, "I love you." "I love you too." You know, there's one voice actor doing all of them. And now, you know, thanks to better voice models, you'll be able to just have really good AI generated voices for all the dubbing. Because again, it's not like it's taking jobs from great dubbing that was happening previously. It was awful dubbing happening in Poland previously. So that's one example of the second order effects. What are the other second order effects you're seeing of ubiquitous good text to speech and speech to text, it seems like across a broad array of languages? Because whatever existed in English just didn't exist in Polish or Irish or pick your language.
B
One, breaking down the language barrier. The inspiration came from the movie side, but it also applies in any communication setup. Like, in the future, could I travel to another country and speak Polish or speak English, and have that language be understood in the local native language? Like from The Hitchhiker's Guide to the Galaxy, this version of a Babel fish. Exactly. So that you can actually understand the world. And voice of course will be an interaction layer. But similarly, all of us will have our own kind of extension, voice agents that can help on our behalf. And there are very clear and great examples of that, of people that lost their voice and can get it back for the first time. We see that everywhere, whether that's people that lost it due to ALS or throat cancer that can get it back. Just recently there was an example of a patient that had Neuralink and worked with them to bring the voice back, so that person could speak with their own voice again with the family around. We worked with a lady that lost her voice before she got married. And then finally the technology became possible. We were able to recreate that voice and for the first time she could replicate the marriage ceremony and speak the vows together, which was such a heartfelt moment. Probably the most important from all the work that we do.
A
When you guys talk about voice agents, is a voice agent just the idea that you have some long running or persistent agent that is going out and interacting with the world through voice? And so customer service would be one example of it, and in the other direction, your Claw going and making you a restaurant reservation and actually calling up the restaurant. Is that kind of how I should think about voice agents?
B
That's right, exactly. Whether it's the reactive side of being able to interact with the customer or the proactive to call it back. We recently had a very interesting one topical because it was a Guinness related one where there was a developer developing a Guindex effectively.
A
Oh, I saw that they were calling all the pubs in Ireland to check the price of a pint.
B
Yeah, you could ask that or report information.
A
The Guindex is built with ElevenLabs.
B
It was built with ElevenLabs too. So people could actually do both sides: proactively reach out, reactively reach out. All of it was captured through voice, and then some 3,000 different entities could report their prices and get that across.
A
Have you by the way, hooked up your OpenClaw to ElevenLabs? Is the OpenClaw ElevenLabs combo something that a lot of people at Eleven are doing?
B
So as you know, the OpenClaw will look for the most popular tools frequently where it tries to hook up. So ElevenLabs is one of the recommended.
A
It's definitely the top option for voice. Can you tell me a bit about the business of voice models where I think people have an intuition around big LLMs where there are these very expensive training runs and yes, they kind of depreciate quickly, but there's so much usage that all of the models trained to date have paid off their training runs and then some. And then there's this kind of ever larger Capex going into. I mean a lot of it is inference these days, but also training and so you have some intuitions from the LLM world. I'm curious just how I should think about voice where one how expensive is training the voice models? Is the expense in the researchers is the expense in the training runs and I mean the economics is presumably kind of simple where it's just per usage. But yeah, just talk us through the business.
B
Yeah, definitely cheaper than the LLM and image video models. So you can really use smaller models.
A
Okay, so the models are smaller. What's a parameter count for a leading edge voice model?
B
A few billion to tens of billions of parameters.
A
And for context, I think the, I mean, kind of like, you know, CPUs moved away eventually from gigahertz as the metric as they moved to more cores. I think we've mostly moved away from just raw parameter count, but I think the leading edge LLMs are in the hundreds of billions of parameters.
B
I think the leading ones, yes, but of course you have the variations that you will use at lower scale. So capex is still pretty high. We of course raised recently a half a billion at an 11 billion valuation. It makes sense, to continue being able to build the best models in the world. Researchers: of course you want the best people in the world. I think we have those people working in audio, and my co-founder, who is leading that work. So that's definitely a big piece, not financially but even in how you keep the ambitious deployment: continuing to build leading models helps you attract more talent. And then on how we service, of course inference is correlated with how the models are used, and for us we've seen incredible growth across the work. Mostly this is charged per token: if it's input text or text to speech, it's usually per text token. If it's voice agent or transcription, then it's per minute, and we see that kind of being the bigger part. But usually, broadly, it's on a per token basis. And of course as we work with businesses it's like an annual agreement: the bigger the spend, the bigger the commit, the bigger the discount. The way we usually do it is, when we have a new model, we try to give it at cost to a lot of the customers so they can experience the best. It's still usually not as reliable.
A
The newest thing is often the most expensive, whereas you make the newest thing the most economically attractive one.
B
We try to make it attractive to the customers. You know, it's more expensive for us than any previous generation, but the quality is higher. So we try to keep the prices still competitive.
A
You subsidize it, but it's inherently more expensive to run a bigger model.
B
Exactly, exactly, exactly. And over time, we might do some tricks to optimize it, but we want the customers to experience it. With research, the big thing that we've seen is that the reliability of the model in the early days might not be there. And then, two, people don't even know what's possible with that model. So you kind of want the widest distribution so people can show the world what's possible. So you can have it, of course, as the distribution mechanism, learn yourself what to improve, what to change, and then get it out there.
A
Are the voice models just getting bigger and bigger? Will we have voice models in the hundreds of billions of parameters, or have we found, like, it seems like for certain types of model architectures, there's an upper limit on the natural size. Have we found that upper limit for voice models?
B
It feels like for specific use cases like, say, audiobook narration, you've probably found that size. You probably don't need to stretch it too much bigger to make the quality that much higher. But for certain use cases, that will probably grow. The reason I hesitated on the question is that in a cascaded approach, you probably will not see dramatic size changes. You inherently want the models to be quick and reliable. You want to orchestrate them in a smart way. In a fused approach, that will probably get into tens or hundreds of billions of parameters, because you kind of combine, of course, the LLM side and the voice side. So that will get bigger. But on just the voice, I think it will keep being small.
A
Okay. But there are certain domains where we'll see bigger models.
B
That's so interesting.
A
It is amazing how it does seem fun from a research point of view, how there are still these various unsolved aspects and how you guys are just making technical breakthroughs and then releasing them down the product pipeline. That's like a really fun stage of a company's life cycle.
B
For sure. It's fun because it feels like we can do innovations on both sides. There's so much on the research side, so much on the product side. And then ultimately the biggest part is how we deploy it to the customers worldwide. And SMB will have a very different dynamic than the enterprise. It's not a vendor SaaS relationship where you just give the product out there for the biggest companies, but you are more of a partner in their AI transformation. So you want the resources to work alongside them, to work on the frequently very new use cases that were impossible before, to help create and bring those voice agents to production. So that's like a big shift. But the biggest focus is how we bring the conversational agents out there to the businesses around the world.
A
So when you say bringing conversational agents is the biggest priority, is this for customer service type use cases? Like what are the most popular use cases for conversational agents?
B
Yeah, we want to be a partner for full interactions between businesses and their customers or their audience. Saying the audience, because that will apply in support. Support is the easiest one because that's where it's most ready. And that's maybe the big difference to how we see ourselves to some of the other companies in the space is this can also apply to sales. You can have the proactive side of reaching back, you can have AI SDR versions of that and then you can have all the way to the marketing use cases where we are your partner for working even outside of the conversational agent space of how you create a great marketing campaign.
A
So, how will this break down? We had Des Traynor from Intercom on here, and they have Fin, their agent, and it's a thing in the website that you can go talk to. And he described a very similar phenomenon to the one you described, which is you start maybe thinking, oh, this will help me answer customer support queries, but it becomes like a generic UI for the website, where it's a box you can type in to go do things and understand things. And so why wouldn't you read the docs and design your integration that way? Whatever. And so will I have one for text and then one for voice? Will you guys do text too? How does that work? Because it seems like this is also succeeding at the text level with Fin and Sierra and all these things.
B
The places where we know we will be able to provide the biggest value are where ultimately today you will have either a big portion or most of the interactions coming through voice. So if that kind of intersection is there, that's where we can provide higher value. And of course, if you need a text chatbot there, if you fix the voice agent, you'll have fixed the text piece inherently as well. But the place where we do optimize today is going to be, like, how do you select the right voice for the right customer interaction? How you pull that off in the pretty complex case of what you mentioned earlier, of how you orchestrate that to pause or look for something deeper in the docs, how it can be an extension of the entirety of the business. So not only in support, but across the entire user journey. But the bottom line is we want to be able to provide for you across the entirety of the interactions. Voice is usually a big part of those interactions. And yes, we need to solve the integrations, we need to solve the nodes, we need to solve text as part of that. But we wouldn't, for example, go into what I think will happen in a lot of those cases, like very deeply into the reasoning version of those use cases, where you maybe need the multi
A
touch and a lot of complex actions.
B
A lot of financial analysis, that would not be something we optimize for.
A
Can we talk about your revenue ramp? You're just one of the fastest growing startups, period, of the past few years. What's your most recently announced revenue figure?
B
Most recently announced was end of 2025,
A
whatever number you want to give us.
B
So most recently announced was 350 at the end of 2025. But the best proof of the technology working is our recent work with Deutsche Telekom and T-Mobile, with Revolut, with Klarna, with Meta, with IBM, across a wide set of use cases. And this quarter was kind of one of the best for enterprise growth, where we had the first quarter hit 100 million in additional ARR growth, which is crazy.
A
In net new ARR.
B
In net new ARR.
A
Okay, so if you're thinking this quarter was 100 million in net new ARR and 350 million at the end of the year, I'm no mathematician, but that's up around 450 million, and versus this time last year, that's a several fold increase. Just, what's working? Like, from the outside, I would assume that there is really strong cohort growth within accounts, and then you seem to have self serve and enterprise businesses that both contribute a lot. I don't know how big self serve is, but as a user I like to be able to fiddle with ElevenLabs and not have to go talk to sales. But maybe you can just talk about what worked to reach 450 million plus of ARR so quickly.
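The back-of-the-envelope figure can be checked directly. A minimal sketch using only the two numbers stated in the conversation (350 million ARR at the end of 2025 and 100 million net new ARR in the quarter); the variable names are illustrative, and no prior-year figure is assumed because none is given here.

```python
# Figures as stated in the conversation, in millions of dollars of ARR.
arr_end_2025_musd = 350
net_new_arr_quarter_musd = 100

# Implied current ARR: prior ARR plus net new ARR for the quarter.
implied_arr_musd = arr_end_2025_musd + net_new_arr_quarter_musd

# Net new ARR as a fraction of the ARR at the start of the quarter.
quarterly_growth = net_new_arr_quarter_musd / arr_end_2025_musd

print(f"Implied ARR: {implied_arr_musd}M")
print(f"Quarter-over-quarter growth: {quarterly_growth:.1%}")
```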
B
Yeah, so exactly. So over 50% is now sales led, on enterprise, and I think largely that's because the technology that powers a lot of their agentic interactions just became reliable and at the same time high quality over the last year, year and a half. And frequently, you know this extremely well, you will start the account and then of course it continues expanding, and we see there's definitely a land and expand motion across ElevenLabs.
A
And what does that expand look like? Is it like new departments? Is it just the usage starts taking off when a customer expands?
B
Both but usually the first part too. It's like we try to make it very easy for our customers, maybe kind of against ourselves where we give the technology a pretty attractive economics because we so much believe in the technology providing value. So you can actually try it and test it and then within that.
A
And you think you'll make it up in usage, basically.
B
Exactly. The usage, the kind of commit, continues increasing because you know it's providing value, and then it's so much easier to make that a choice.
A
Yes.
B
And then of course cross department pollination is there too. And it's like, you know, our work with Deutsche Telekom started with the marketing side, so we did Magenta work and pocket generation. Then it kind of expanded to customer support, and then it expanded to us working on the agent across the entirety of the network, so people can call in and have the agent. So you could see those step changes across. But we are now 470 people as a company. So we keep on growing. But some of the things that stay consistent is small teams. So we have teams of fewer than 10 people for each of the product or research initiatives, or even, as you think about sharding some of our go to market strategy, those will be smaller teams understanding the industry in depth, understanding the market in depth, and going independently and going quickly. So that definitely contributed largely to that. Two, especially on the biggest enterprises, what we found works is we have the full spectrum: the self serve PLG motion that helps drive distribution, drive awareness of ElevenLabs, and on the completely other end of the spectrum we have the high touch, forward deployed engineering working side by side with the customers to customize the entirety of their work together.
A
Why did you guys do self serve? Because I presume you have a lot of competitors where they have tech and it's behind a contact sales form and you have to go talk to an SDR and then talk to an AE, blah blah blah blah. And you guys just offer the tech, available on the site. And I'm a huge believer in this. I mean, a huge part of Stripe's growth has been driven by the fact that we just made Stripe available to anyone and built a lot of product around that adoption pattern. But so many companies seem to skip it. So I'm curious how you guys can.
B
So many reasons, so many reasons. I think the quick ones that come to mind: one is the feedback loop. You just have an immediate understanding of how good your technology is. Two, which is an extension of that, we stand behind our tech. We believe it's the best in the world for models, for voice agents, for deployment. So we want people to experience that. And I think you do the same in Stripe, where the best version of the technology is available to everyone, which makes it so attractive to actually try it out. We always try to take everything we built for the highest end use cases and bring it back to the ecosystem frequently. For the newest of the use cases for enterprise, you will need reliability, you need compliance, you need the scale, which we deliver. So frequently, as you develop new technology, it might not be ready on a lot of those parameters, but it's definitely ready for developers and SMBs. And we love what they are doing, because they are showing us the future and effectively helping us find a trajectory of where ElevenLabs should go.
A
I'm totally convinced. I'm just always amazed that more companies don't pursue it, where it feels like they're really shooting themselves in the foot by not. Like, did you guys self serve on Stripe or did you?
B
We self serve on Stripe, Yeah.
A
For example, you know, Eleven is a huge company and yet you started on Stripe on a self serve basis.
B
You kind of like initially and it's like, you know, we were two of us at the beginning. You try to see what's working in the industry, but you try to think from first principles. So you want to try it out, you want to understand how it works. So the more friction elements before you're trying it out, the less you trust whether it's available, whether there'll be additional payment that's hidden behind some of those steps. So you don't want to go through them. So it's so much.
A
Speaking of Stripe, do you have any Stripe feedback for us? Anything you want us to fix?
B
My most common feedback until recently was, like, why don't you give us a pay as you go, usage based billing type version. But one of our finance leads, Maciek, I know, was speaking with your team, and that was the day before. He was, like, thinking about it for a long time. He's great.
A
And then he said, you guys should buy Metronome.
B
You should buy Metronome. And then the next day the Metronome acquisition was announced. So now you have it. So that was my most common feedback. And we'll be launching, that's a good announcement for this podcast, we'll be launching usage based billing to everyone.
A
I'm shocked. Oh, as in previously. Oh.
B
So pay as you go. Pay as you go. Okay.
A
Previously you had it on an enterprise basis, but everything on the self serve basis for the.
B
So we had subscriptions. Yeah, subscription plans, and you can go over them. But now we are launching a full pay as you go experience. So you can just try out the voice engine, which is effectively this orchestration loop, all the way through to any of the models directly.
A
Going back to self serve, I think a new thing in AI is that all self serve products should have pay as you go as an option. Maybe you want to have a subscription with some unlimited tiers, but I don't know if you had the experience of you're using Claude and you're typing away your queries and eventually you hit some rate limit and it's like, sorry, you've hit your usage limit, and you want to be able to do the thing that you can do with Claude Code, which is just pay per API. It's like, fine, I'll pay for it. And it's kind of very funny as a consumer to not have the option to pay more to use the product more. And so yeah, I think every AI product will need it. They probably want to have some all-you-can-eat, or most-of-what-you-can-use, subscription with limits, and then the ability to pay for overages. So it sounds like that's what you're.
B
Yeah, exactly. That's what we're doing.
A
The other thing I want to ask you about is I feel like all CEOs of larger companies today are trying to figure out how all these AI advancements change the nature of the organization, and how you redesign your organization a bit around all this new intelligence. And so that could be about what the scaling factor is, like the number of people you need to do the work. But it also should be, like, do you need more senior people because they're better able to direct the AIs, and the AIs are maybe kind of doing the work of what previously would have been junior people? Do you need more junior people because they're going to be more AI native in how they work? Do you want smaller teams? Do you want bigger teams? How do you actually go do the process engineering? Your finance team should be using Claude extensively, but finance teams do not historically have a lot of home built software. And so there's all these questions that are floating around, and you have very rapidly built a much more AI native company. And so I'm curious what lessons we should all be learning from ElevenLabs as a large business recently built, and so without the baggage of decades of how we've always done it.
B
Yeah, yeah, we started in 2022, which is a year when the two topics of the day were crypto and metaverse. So just before. And then of course the AI flow started. Exactly, exactly. But we had the privilege of scaling through the world when it was all happening. For us, what works, and we really believe in that being a big part of the future: the first is small teams. Keeping the teams small and super flat. So both me and my co-founder will have over 15 direct reports each that we'll work with, and most of those people will have that same scale of direct reports. Okay.
A
So your span of control is way larger than the traditional normal team. You have double that and obviously that's an exponential.
B
Exactly. And of course there are some teams which in the short term might not do that but ultimately that's where we think it's going to be headed. It's like roughly 10 team size within each of those work items.
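The point that a wider span of control compounds across layers can be checked with a small calculation. This is a minimal sketch, assuming an idealized organization in which every manager has the same number of direct reports. The headcount of 470 and the span of 15 come from the conversation; the span of 7 for a traditional team is an assumption inferred from the "double that" remark, and the function name is hypothetical.

```python
def layers_needed(headcount: int, span: int) -> int:
    """Layers below the top needed for a uniform tree to cover `headcount`.

    Layer k holds span**k people, so coverage grows geometrically with depth.
    """
    if headcount < 1 or span < 2:
        raise ValueError("headcount must be >= 1 and span must be >= 2")
    covered = 1      # the person at the top
    layer_size = 1
    layers = 0
    while covered < headcount:
        layer_size *= span
        covered += layer_size
        layers += 1
    return layers


HEADCOUNT = 470          # company size mentioned in the conversation
WIDE_SPAN = 15           # direct reports per manager, as described
TRADITIONAL_SPAN = 7     # assumption: roughly half the wide span

print(layers_needed(HEADCOUNT, WIDE_SPAN))         # layers with the wide span
print(layers_needed(HEADCOUNT, TRADITIONAL_SPAN))  # layers with the assumed traditional span
```

Under these assumptions the wide span covers the company with one fewer management layer, and the gap widens as headcount grows because each layer multiplies by the span.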
A
Startups pretty close. No offense, but startups often have pretty wacko management ideas. There's a funny tweet: "Lord, grant me the confidence of an early stage startup founder blogging about their management theories." But you think this is not a startup effect? This is an AI effect, where basically.
B
No, it's definitely a little bit of startup effect too. Yeah, I think it out. It's hindsight.
A
Hindsight benefit canceling our stripe changes.
B
Yeah, no, no. It's like I need to preempt that I kind of, you know, it's the hindsight of this may be working. We'll see in next 5 to 10 years.
A
Much flatter Org.
B
Much flatter org. So it works for us. Might not work for all the companies, and there are some parts, like go to market, where we are still trying to figure out what's the best way. But small teams, flatter org. And I think there are two paradigms, but generally people being more technical or, if not technical, even non technical teams having a technical resource. So we will have a person in ops or in talent that will be effectively a tech lead for that team, that helps them automate a lot of that work and helps up level the rest of the team too. So there are kind of two parts that are helping.
A
Okay, so talk me through this in talent or something like that. Is it that you are building your own software where other companies might have bought software like a Workday or a Greenhouse or something? Is it that they are using the existing software you have, better? Is it that the processes that would be spreadsheets in a traditional company are built with software? How do you kind of use the software in these sorts of organizations?
B
Yeah, sometimes, but we still use a lot of the traditional vendors. Like, one pattern is of course LLM-ifying everything, like making the data explorable for you to be able to interact with it. Like who's in the pipeline, what worked, who has the best references. All of that, so you can double down on it. But two, it's frequently things that you manually do, where there is a gap between where the agents are today versus what you could do if you have the technical skill set. And a good example is, like, how do you scrape all the right profiles to be able to reach out to the right candidates. So you analyze whether it's, I don't know how much I should want to say, but you try to detect specific things that we know worked. So you bring that across to the people. On the go to market side, there's just so many things you can do with additional amplifiers. It goes from understanding what case studies are relevant and creating a good pre read for you before you go to the meeting, through creating the AI SDR experience that we spoke about, to creating an entire deck experience, so you have a pre populated deck with the right numbers that is customized to that customer, which you still want the person to go through and develop, but ultimately it is in there. So there's plenty of those additional things that, you know, will amplify the work of the people around, and potentially replace some of those easier tasks that are done. And then there's, like, you know, we wanted people to explore the culture at ElevenLabs, so we created a voice agent that people can speak with to see what the culture is but also get prepped for the interviews. I think across many of those teams there's an additional benefit of what they can do. Interesting piece. So of course in Ukraine, with the ongoing war, they need to rethink a lot of how their development, their systems, their support works for the citizens across the country.
And people are in the war zone, they don't have the same access to the information. They cannot rely on the same phone lines, they cannot rely on the same physical services around the country. So they've developed effectively a central.
A
And you have employees in Ukraine?
B
We had a few, but they reached out because they were developing their central app called Diia. They developed it over the years, but now they were doubling down on how this can be a way of supporting the citizens. And of course there's an easy part of how you create a first agentic government, where you have help with the benefits and what's happening on the frontline, or education, so that's delivered to everyone, or healthcare, so you can book your checkup or appointment. So, how you create all of that. And of course, we traveled to Kiev, we worked with them on bringing that and making that available for voice so everybody can access it. But the thing we've learned while being there was that the model we speak about, where you have technical resources in each of the teams, they actually have the same in every one of the ministries. So every ministry had technical resources working on creating that agentic version of their work. And then there was like a central digital transformation team that would assemble this all together to deliver that for the central citizen support, which I thought was brilliant.
A
That's very tech forward by Ukraine.
B
So tech forward, like the most advanced set of work we've seen. So we got a little bit validated, like, okay, maybe technical resources in each of the teams is a good idea. And that works well for us. And you mentioned some of the other parts, like, do you hire the senior or the younger? The main thing we try to filter for, of course, is the culture piece, which is so important. You can scale people, but scaling culture is much harder. So you want to optimize for that being right. And in our case, it's first principles, taking ownership, striving for excellence, but staying humble. And the main thing that's kind of in that ownership part that I think works well for the AI world is agency. Like, if you have that agency to explore, regardless of where you are in the experience cycle, it's going to be a tremendous amplifier to your work.
A
My biggest takeaway from all this has been that around agency, where I feel like high agency people are the winners of the advances in AI and within organizations, low agency people will lose out.
B
Yeah, completely agree. Probably the thing that Piotr and I are most proud of, as we scaled ElevenLabs, is the people that are at ElevenLabs. It's been just the culture, and seeing the expansion of the culture, where the culture builds the company now rather than any single person or any single product building the company. That is probably the biggest validation and happiness. And there is kind of the other angle of that, where I think people are striving to be incredible in their craft and their work, but at the same time have fun in a lot of their work. And that kind of combination of agency and just enjoying what you do is probably the best thing we've been able to do to date at ElevenLabs.
A
Well, it sounds like a really fun stage. Like we were saying, interesting research breakthroughs, really fast growing business. So I'm sure you're enjoying it, Mati.
B
Thank you John, thank you so much.
In this rich, technical, and wide-ranging conversation, Stripe co-founder John Collison sits down with Mati Staniszewski, co-founder and CEO of ElevenLabs, the leading company in AI audio and voice technology. They explore the inner workings of generative voice models, the challenges and breakthroughs in making machine voices sound deeply human, and the business and societal impact of highly realistic voice AI. This episode is packed with insights about the technical underpinnings, business strategy, product philosophy, and forward-looking implications of bringing advanced voice AI to global enterprises and everyday consumers.
Mimicking Human Speech:
“Now we effectively do similar neural nets in other domains. So you predict the next sound based on... the context of previous sounds.”
— Mati, (01:41)
Architecture & Data:
“Britishness is an emergent property in your voice models.”
— John, (04:21)
Pipeline Stages:
Portfolio Breakdown:
Vertical vs. Horizontal Expansion:
Deployment Lag in Consumer Voice AI:
“The technology in many of those cases is ready. There’s a deployment gap... the quality of voice models for them to actually sound good—this is only a last three years thing.”
— Mati, (14:08)
Distinctiveness:
Memorable Anecdote:
Person-specific Transcription:
Filters and Editing:
Speech-to-Speech (direct audio-to-audio) vs. Cascaded Approach (speech-to-text-to-speech):
“Speech-to-speech models... They are definitely dumber. You need a smaller model.”
— Mati, (30:00)
Real-world behavioral impact:
Examples of Second-Order Effects:
“The most important from all the work that we do.”
— Mati, on restoring real voices via AI (33:40)
Voice Agents: Definition & Use Cases
Model Size & Training Economics:
“Definitely cheaper than the LLM and image/video models. So you can really [use] smaller models.”
— Mati, (36:21)
Pricing Model:
Self-Serve Strategy:
Breakneck Growth:
Organizational Lessons:
“My biggest takeaway... is that high agency people are the winners of the advances in AI and within organizations, low agency people will lose out.”
— John, (59:13)
Internal Automation Examples:
Global Inspiration – Ukraine Case Study:
On the Emergence of Accents:
“You're saying kind of Britishness is an emergent property in your voice models.”
— John, (04:21)
On Product Lags:
“The technology in many of those cases is ready. There’s a deployment gap ... quality of voice models for them to actually sound good—this is only a last three years thing.”
— Mati, (14:08)
On Speech-to-Speech Models:
“They are definitely dumber. You need a smaller model.”
— Mati, (30:00)
On Life-Changing Applications:
“We were able to recreate that voice and for the first time she could replicate the marriage ceremony and speak the vows together, which was such a heartfelt moment. Probably the most important from all the work that we do.”
— Mati, (33:44)
On High Agency in AI Era:
“High agency people are the winners of the advances in AI and within organizations, low agency people will lose out.”
— John, (59:13)
Mati Staniszewski of ElevenLabs provides a fascinating, in-depth view of what it takes to build voice AI that feels truly human, how technical breakthroughs are rapidly finding their way into consumer and business applications, and why success in the age of AI will depend just as much on company culture and organizational design as on the models themselves. From reconstructing lost voices for emotional reunions to reimagining support and sales through “agentic” voice bots, ElevenLabs’ vision is reshaping how we speak, listen, and interact with technology around the globe.
For anyone curious about the leading edge of AI voice, ElevenLabs, or the future of business automation, this episode is a must-listen.