
A
Welcome to Generative Now. I am Michael Mignano. I'm a partner at Lightspeed. And today on the show I'm excited to have Tanay Kothari, co-founder and CEO of Whisper Flow, an AI-powered voice dictation platform. Tanay is a self-taught coder from India who built over 50 apps before finishing high school. He went on to study AI at Stanford before founding and selling his first startup, FeatherX. With Whisper Flow, Tanay is building toward a vision of a voice-first future where talking replaces typing. Today we talked about his journey as a serial entrepreneur, the lessons he learned from Whisper Flow's viral growth, and what it means to redesign human-computer interaction from the ground up. So let's get into it. Hey, Tanay.
B
Mike, how's it going?
A
It's going great, thanks. How are you doing?
B
Pretty well.
A
First off, congrats on the product. It's an incredible product and I feel like everyone I know is talking about it. Anytime I'm having a conversation with somebody about AI products and like what are the best AI products at the application layer, one of the first ones anyone says is Whisper Flow. So congratulations.
B
Thank you. Thank you so much. It's a huge shout out to the team here. They've been doing a fantastic job.
A
Yeah, no, clearly, I mean it's, it's so polished and obviously the technology is crazy impressive. This is not like an overnight success, I think, you know, although many people are hearing about it right now for the first time. Why don't you give us a little bit of background to sort of your life and sort of your career and your trajectory thus far that led us to this moment before we dive into the product and the technology.
B
I've been working on Whisper Flow for the last one and a half years, but this is something I've been obsessed with for the last 17, which is: what does the next generation of personal computing look like? Because back in 2008, when the first Iron Man movie came out, I wanted to build Jarvis. So me and a friend of mine built what was then one of the world's first voice assistants. This is before Siri, before Alexa. And to people it just felt like magic. It grew to two and a half million people in a few months. Then Google shut us down because they didn't like what we were doing. And then fast forward many years, this idea never left my mind, because whenever I saw people use their computers and their phones, it felt so effortful and so mechanical. And that is not how I want technology to feel. And so I built and shipped dozens and dozens of different products. I grew up in Delhi, and eventually made my way here to Silicon Valley. I went to Stanford for undergrad, dropped out of my master's because I was starting a company called FeatherX. Eight months later, I ended up selling that. We were essentially building tools for small and medium D2C stores. This larger company was working with the Uniqlos and Forever 21s of the world, so we deployed our product there. That was kind of my first time growing from just a guy who loves to build apps with his friends to somebody who was now running a 25-person team with most of the people older than me, which was pretty much being thrown in the deep end: learning how to become a good manager, learning how important people are to a company, which I think most online advice misses out on. And that is what got me to trust myself to take on the responsibility of building Whisper, which I started with my college roommate, Sahaj. He and I met first year of undergrad, lived together for the next three years. I probably know him more than anybody else really should.
And we started Whisper with the ambition of changing how people interact with technology, which definitely did not set us up with an easy problem to solve. And it's honestly fantastic to see where Whisper Flow has come today, and I'm excited for where it's going to go.
A
Yeah, you mentioned, you know, before FeatherX and before the acquisition and Big Tech and all that, you had built a bunch of apps. I think I read somewhere that you built something like 50 apps before you even graduated high school. Is that accurate? And maybe tell us about what inspired you in those early days and what these 50 were. What were these 50 apps that you were building in high school?
B
I basically started building when I was in fourth or fifth grade. I was nine, ten years old. I saw the Iron Man movie and I wanted to go build it, and so I went to my computer lab. There were some kids there that were doing something on the computer, probably programming, building an app. And I was like, hey, what are you guys doing? They were like, hey, you're too young. You wouldn't understand. No one says that to me. Yeah, that's when I go back home. It's YouTube. I'm on, you know, the old, like, broadband connection. So I have to, like, open the videos in all the different tabs, leave them for an hour to buffer. And that's how I basically taught myself. I pulled my first all-nighter that night, teaching myself how to code. This was Visual Basic, for anybody who remembers, slash, cares. Yeah, you remember?
A
Of course. Yeah. I learned how to program in Visual Basic too.
B
Yeah, the good old days. And so I started off with that. Within a couple of days, to me, it felt like magic, because I could think about something, I had an idea, and then I didn't have to be like, oh, I wish this existed. I could just go and build it.
A
Yeah.
B
And I got another friend with me who was into design. So he was the designer, I was the programmer, and we would build all these things together. And then the hard thing was not letting my parents know, because, right, as an 11-year-old, your parents have a screen time limit. So I had a one-hour screen time limit. I used it to watch Power Rangers on TV. And so the seven, eight hours I wanted to spend on my computer coding could not happen. And so I actually slept alternate nights all through middle school and high school, because I could only code when my parents weren't awake.
A
Wow.
B
And so did that. And basically for all the different things we built, we would kind of build it, launch it, and just keep doing that. And it was just like a drug. It started off with Windows Phone, then we built a bunch of stuff for iOS, then we built a bunch of stuff for Android and desktop apps and websites of all sorts.
A
Well, I heard about one that got like millions of users and got shut down by Google. Maybe I'm jumping to the punchline too quickly, but is that right? Did I hear about that?
B
Yeah. So that was one. This was when LimeWire shut down.
A
Okay.
B
And so people had no good way to download music. You would go on MP3Skull, try to download MP3s, and then sometimes you'd download a virus. Nobody liked that. And so what we built was a system where you could go and say, hey, play me the latest song by Metallica. It was intelligent enough to go find out what that song was, then it would scour the Internet to find that song. And if it didn't find it there, it would go to YouTube, convert a video to an MP3, and then download that for you.
A
Oh, wow.
B
So that blew up, especially because people would try to say all kinds of things to it and it would figure it out. And this is pre-LLMs, right? This was built in 2010, so a bunch of hard-coded, hacky logic. But to the user, it felt like the system was truly intelligent, which is all I wanted. I wanted to see the spark in people's eyes. And that was the one that Google had issues with. Another one that we built, that I was also a huge fan of, came at a time when, you know, in Delhi, women were starting to feel unsafe. There was a lot of news going around; they didn't feel comfortable going out at night. The problem there was that if they took out their phone to send a text or something, that could aggravate the potential stalker who was around them. And so they didn't feel safe doing that. And so they were stuck. And so we built this thing where you open the app and it's a black screen, right? So it looks like nothing on your phone is active. And you can program these hot corners to do something. So you double tap on one corner, it sends your location to a friend; you double tap on another corner, it calls 911, or 100 in India, and so on. So you could program this. And we called it Aegis, which is Greek for shield, which was a women's safety product, which also did really well. And I think that was solving a real problem that people were facing. The products we built just, like, spanned the whole gamut of things that were fun, things that saved people's lives, and everything else in between.
A
That app that you just described, I mean, those sound like safety features that should be embedded into all the modern operating systems today. Like, why, like, why doesn't Apple have stuff like that? Like, it's such a smart idea.
B
Yeah, I haven't seen anything else after that that's done this.
A
Yeah, that's really cool. Was there anything from these, these early experiments, these early apps that sort of inspired how you are a founder today? Or even anything from Whisper? You actually mentioned one thing earlier that, that I found that I thought was super interesting. You mentioned when you were falling in love with programming, you, you could just imagine something and then you could go build it. And I feel like there's a bit of a connection there with Whisper where I imagine something, I just say it, you know, like effortlessly, I just speak it and boom, it's on my computer. So, yeah, I'm curious if there's any connection there or are there any other things that you took away from that experience that have made their way into Whisper?
B
There's honestly so much. I feel with each of those things, you learn something that changes you. Some of the changes are so drastic you don't even recognize the kind of person you were before that. But I think the key thing that really stands out to me was that initially the products I built were made for a tech-savvy audience, right? They were made for people like me. And they had these stats and everything else that, you know, look really cool. But there were times where I built something that was way simpler, and I saw it getting adopted by the mass market. I saw it getting adopted by people like my parents and people who are blue-collar workers, who you generally don't think of as tech-savvy people. And those were the ones where I saw the most user love coming from. And so that, to me, clicked something in my head. And it became even more apparent when I came to Silicon Valley, which is: most products here are built for people like us. They're not built for the other 95% of the human population. My dad is never going to write a system prompt. My grandfather doesn't know what a prompt is. And so if you're building AI tools for them because you want them to experience the peak of what is possible today, your products need to look very different. And that is a huge driving force of what I want at Whisper. And so it reflects in the product. It's just one button. You press it and it just works. You don't have to do any setup anywhere at all. And there's no place in the product where we even use the word LLM. It doesn't matter. You just speak and it writes for you. And that is all people care about at the end of the day. The results show up as people onboard their parents onto Whisper and then it becomes their parents' favorite tool. And that honestly matters more to me than any growth numbers or description of product-market fit that I can come up with separately.
A
And I've heard you use this term, zero edit usability. Is that, is that the same thing? Is that sort of your term for kind of magic? It just works. What do you mean by zero edit?
B
That one is actually a very technical term that we use.
A
Okay.
B
The place where it came from is voice dictation has existed for the last 20 years, right. And everybody has been claiming like, hey, we have 90% word accuracy, 95%, 99% word accuracy. The numbers keep going up. But it has always sucked. And it has sucked because none of us have been using it at all.
A
What do you mean? I feel like voice to text is everywhere. Speech to text is everywhere.
B
It is everywhere. But people won't use it. People barely use it. You go into an audience and ask people like, how many of you use Siri frequently and love it. No hands are gonna go up. Maybe one. If even.
A
Maybe, maybe not.
B
Maybe one. Yeah. And so that is counterproductive, because speaking is the most natural way for people to communicate with each other. And so the thing we realized is, even with 99% accuracy, right, it means that in a 20-word sentence you're going to make a mistake, and then you as a person have to go and edit. You have to read everything that this thing wrote for you to make sure it didn't make any glaring dumb mistakes, like getting people's names wrong or adding filler words, or you are rambling and it just writes your rambles. That doesn't give you the joy of the work just being done. And so what we care about instead is the zero edit rate, which is what percentage of your messages are ready to send. Flow outputs something and you just press enter; you don't change a thing. For all the other products out there, right, Apple, OpenAI, Deepgram, AssemblyAI, everybody else, the zero edit rate is 10 to 15%. Only 10 to 15% of the time do they produce something that's perfect, ready to go. For Whisper Flow, that number is 85%, which means very rarely, if ever, do you have to go and change something that Whisper produces. And that is what leads to this magic. That is what leads people to talk about this everywhere and just have this insane product love and community that we've built.
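To make the gap between word accuracy and zero edit rate concrete, here is a minimal Python sketch. It assumes word errors are independent, which real speech models do not guarantee, and the names `p_sentence_clean` and `zero_edit_rate` are illustrative only, not anything Whisper Flow is known to use. Under that assumption, a 20-word sentence at 99% word accuracy comes out fully correct about 82% of the time, and at 95% only about 36% of the time. Correct words can still include fillers and rambling, which is why a separate "ready to send" metric is useful.

```python
def p_sentence_clean(word_accuracy: float, n_words: int) -> float:
    """Chance that an n-word sentence contains no wrong word.

    Assumption (not from the conversation): each word is transcribed
    correctly with the same probability, independently of the others.
    """
    return word_accuracy ** n_words


def zero_edit_rate(messages: list[tuple[str, str]]) -> float:
    """Share of (dictated_text, sent_text) pairs the user sent unchanged.

    This mirrors the definition given in the conversation: the percentage
    of messages that were ready to send without any edit.
    """
    if not messages:
        return 0.0
    unchanged = sum(1 for dictated, sent in messages if dictated == sent)
    return unchanged / len(messages)


if __name__ == "__main__":
    for accuracy in (0.95, 0.99):
        clean = p_sentence_clean(accuracy, 20)
        print(f"word accuracy {accuracy:.0%}: clean 20-word sentence {clean:.1%}")

    # Hypothetical log of what was dictated versus what was finally sent.
    log = [("see you at noon", "see you at noon"),
           ("call jon", "call John")]
    print(f"zero edit rate: {zero_edit_rate(log):.0%}")
```

The second function only counts exact matches, so a single corrected name counts the whole message as edited, which is the strictness the metric is meant to capture.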
A
That's amazing. So let's zoom out a little bit. Let's take a step back, let's talk a little bit about Whisper Flow. What is it? I'm using it all the time, so I already know, but kind of give us the high level overview and then let's dig into some of the magic and why it works.
B
Yeah, of course. So Whisper Flow is an AI product that lets you use voice across every single application without needing a single integration. So it works on your Mac, your Windows, your iPhone. Essentially, you press one button, you speak, you can ramble, you can change your mind, and Whisper writes perfectly in your style, four times faster than you can with typing. And the biggest thing is you no longer have to worry about your formatting, punctuation, grammar, anything else. Whisper just does that for you. And so it's most useful for people who have to reply to a lot of emails, who live their days in Slack, who have to send a lot of texts, write long documents, write long prompts to AI systems. Honestly, keyboards are the most effortful way we have to interact with any of these systems. And Whisper just makes it seamless.
A
It's incredible. And talk to us a little bit about the technology here. One thing that has not yet been clear to me as a user is, is this your own technology, your own models? Are you using stuff that's kind of off the shelf and finding a clever way through the application to plug it into the operating system? Like, tell us how you did this, because it's really, really magical how good this is.
B
The models are built in-house. And so my co-founder, Sahaj, is one of the inventors of diffusion models, which are now powering Midjourney and everything else image-based. Yeah.
A
Where did he do that? Like when, when did.
B
He was at Stanford doing research with Stefano, and he did that in his undergrad, right? So he was my roommate all through undergrad. So you can imagine the imposter syndrome I'm facing. But no, he's incredible. And the rest of our ML team is like that as well. Some of the best PhDs from across the country coming together and building this. Because there are some voice models that exist, but again, they're trained to write down everything you say word for word. They're not trained to be contextual, they're not trained to recognize accents well. A lot of times if you're a Russian native speaker and you're speaking English, they just write in Russian. A lot of times these models hallucinate. And so the average model hallucinates 2% of the time. Whisper hallucinates one in a million. And those are all the sequential problems that we just had to solve to make this great. And at this point I can confidently say Whisper is the best voice model on the planet across 80 languages, on both accuracy and latency. And I think this to us was a starting point to prove: if we solve this problem in a completely different way, does it work? Do people want it? Do people care? And we got a resounding yes, which makes it a strong foundation for everything else we want to build after this.
A
How do you go and train this model? Especially if you're starting from zero. Like, you know, a lot of companies, I would say, that choose to go and train models have a bit of this cold start problem. Like, how do you go from zero, where it's on nobody's computers, on nobody's machines, to having the best speech-to-text model? That seems like a very, very hard technical challenge.
B
My philosophy in this is all complex problems are a series of simple problems and simple problems have simple solutions. And so it sounds complex from the start, but if you're somebody who's looking to do this, my best advice I would give you is start off with the baseline, start off with some model that exists out there.
A
Okay.
B
Figure out what's wrong with it. Right. Try to fix that and then figure out what's wrong with it after that and try to fix that. And each time you're just taking this small piece of a problem that you just want to fix. And these problems are coming from users. Right. This is not in your head of what you believe the future should look like. Just make it very tactical, solvable, verifiable. As you keep going through and through with that, you will learn so much about it, where at one point it would stop making sense to apply a band aid and it would just make sense to do your own thing, because you have gained all the knowledge, you have all the data, you have all the little tweaks and benchmarks that you want to hit. And you're going to be way more successful at doing that because you have gone through this step by step process, because things like this don't happen overnight. Literally took us one and a half years to get here.
A
Yeah, got it. Okay. And maybe talk us through, like, some of the technical challenges you have to overcome. I want to get into design and product challenges, but actually maybe first technical challenges to, like, achieving that level of accuracy.
B
There's so many. I'll give you a couple of examples with the technical challenges. If you look at ChatGPT, right. Once you ask it a question, it takes about three seconds to produce the first word and then maybe about 20 seconds or a minute to give you the rest of the response. Yep, that is acceptable. With voice to text, the acceptable limit to produce the whole thing is one second.
A
How did you figure that out?
B
How many seconds is ideal? I basically added different kinds of lags to the system and did a lot of user testing. And the main thing I do with user testing is I don't care what the person tells me.
A
Sure, of course.
B
I don't care what they do. I just look at their face because I care about their expressions. Are they confused, frustrated, angry, impatient? What's going on? After a second, it's unbearable. Like, I don't want to see the look on their face. And you see that in the data. Right. Whenever we have latencies of more than a second, it's like correlated with churn.
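The lag experiment described here can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the team's actual test harness: `with_injected_latency`, `churn_rate_by_latency`, and the session data shape are all hypothetical names chosen for this example. The first function wraps any transcription callable with an artificial delay; the second compares churn for sessions at or under a latency threshold against those over it.

```python
import time
from typing import Callable, Optional


def with_injected_latency(
    transcribe: Callable[[bytes], str],
    delay_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[bytes], str]:
    """Return a version of `transcribe` that waits `delay_ms` before answering.

    `sleep` is injectable so the delay can be faked in tests.
    """
    def wrapped(audio: bytes) -> str:
        text = transcribe(audio)
        sleep(delay_ms / 1000)  # artificial lag, in seconds
        return text
    return wrapped


def churn_rate_by_latency(
    sessions: list[tuple[int, bool]],
    threshold_ms: int = 1000,
) -> dict[str, Optional[float]]:
    """Compare churn for sessions at or under the threshold versus over it.

    Each session is (observed_latency_ms, user_churned). An empty bucket
    yields None rather than a misleading 0.0.
    """
    fast = [churned for latency, churned in sessions if latency <= threshold_ms]
    slow = [churned for latency, churned in sessions if latency > threshold_ms]

    def rate(bucket: list[bool]) -> Optional[float]:
        return sum(bucket) / len(bucket) if bucket else None

    return {"fast": rate(fast), "slow": rate(slow)}


if __name__ == "__main__":
    # Stand-in transcriber; a real one would call a speech model.
    slow_transcribe = with_injected_latency(lambda audio: "hello", delay_ms=600)
    print(slow_transcribe(b"\x00\x01"))

    # Made-up sessions purely to exercise the function.
    print(churn_rate_by_latency([(400, False), (1200, True), (1500, False)]))
```

A split at one threshold only shows correlation; it does not replace watching how people react, which is the signal the speaker says he relies on.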
A
Wow. So you literally introduce random latency times just to see the response from people. That's fascinating.
B
Yeah. So I did the tests, and I would just program in, like, a 600 millisecond latency. Let's try it out. People, how do they feel? And so on. Okay, so we learned this, right? We learned this range. Then it's like, how do we work with this range? Because here's the other thing: people might say a one-word thing, or they'll ramble for five seconds or five minutes. They still want sub-one-second. It doesn't matter. If you ask ChatGPT for a long thing, you are okay if it takes a long time to write it all out. But for voice dictation, that is no longer the case. And so again, all of our constraints are coming from people's emotions, which is the number one thing I care about, and where I think technical constraints would genuinely come from. We have users in 150 countries with varying levels of Internet connections, with varying levels of just, like, bad data. I mean, San Francisco has the shittiest data of all the places that we serve, and you need to provide that latency. And all of our inference happens in the cloud because our model...
A
I was just going to ask this. So it's not local. None of this is happening locally.
B
Nothing is local.
A
Wow.
B
And so we had to build our own infra, everything from our networking stack to customizations on the GPU kernel side to things on the application side. Like, a small technical detail is we had to make our own shortcut handler to support particular kinds of shortcuts, but also because it took 3 milliseconds less than using the off-the-shelf library, and saving those 3 milliseconds at every part of the process adds up.
A
Yeah. Wow, that's fascinating. That's fascinating. So, okay, so safe to assume, like if you don't have Internet connectivity, this is going to fail.
B
Yeah. We're launching offline mode on iPhone soon. It's going to be worse than the cloud one, of course, but on iPhones it's exceptionally important to have that.
A
You know, when I start to think about the challenges. So, you know, we just talked a little bit about the technical challenges with a product like this. I think a lot about the product challenges. This is... well, it's not just for Whisper. Voice as an interface is almost like a zero-UI product, or it lacks some sort of visual clarity that maybe other types of input have. And so you sort of have to implant yourself in the user's brain and get them to remember to take a certain action or a behavior that they're not used to. Like, how do you overcome that? That seems like an incredibly hard design challenge.
B
It is. I think the best products, what they do is they change human behavior. And changing human behavior is maybe one of the hardest things to do, for sure, period. Across, like, millions, hundreds of millions of people, right? That's the thing we're shooting for. And so step one of that, for me, starts off with empathy. You need to know what you're dealing with. So before we launched Whisper, I personally onboarded 500 people. It was a call like this. Half an hour getting them to install it, seeing what they like, don't like about it, where they start to build habits, constantly following up with them, which taught us a lot about the real blockers. Because again, you form habits because of how you feel: the dopamine releases happening at the right time, the right kind of triggers that you're building in, the right kind of habits. And there's all these little nudges that you can build in the product to get that. I would call this an unsolvable problem. You can get better and better and better at it. You're not going to get to 100%, but you can strive for that. When we're thinking about that, I knew that the product's success was going to be highly dependent upon: can we build a behavior? Can we take what people do today, which is do everything on their computer with their keyboard, and replace literally the keyboard that has been around for the last 200 years, and give people something new? And how do you build that trust? And so a lot of what we think about is, okay, what does the person do in the first minute, the first hour, the first day, and so on. And we craft that all out. And my source of inspiration for that is not other software products. It's actually video games. Because video games are phenomenal at teaching users new mechanics, building new behaviors, and that is their bread and butter, because you are thrown into a new world. Like, just take the game of Mario, right? There are so many things that you need to teach the person.
Like, you have to go to the right. There's some point where a level ends, you can jump, you can dive, you can hit a brick, and a mushroom comes out. And if you eat the mushroom, like, good things happen, and yada, yada, yada, the list goes on. And I call all of these things mechanics. These mechanics are all important to teach the user. And there is a certain sequence in which you want to teach these mechanics. And if you start to teach your software product like a video game, it just completely changes the mental model of what you think it should be. And you essentially learn from the best. And so my general philosophy overall is I rarely ever take inspiration from other software products. Think about onboarding and activation like video games. Think about our brand like the best brands in the world, which is not, don't want to name any names, but the best brands in the world are like Sephora and Louis Vuitton and all of that. You want to look at what they're doing that makes their brand memorable for billions of people and then do that.
A
Super cool. So when you talk about the video game and training new behaviors, where does that most directly manifest itself in the product? Is that in like onboarding or setting up the product for a voice first product?
B
Yeah, that's in onboarding. So I'll give you a simple example. With Whisper, you can speak short things and you can speak long things. With short things, you have this interaction that we call push to talk. You hold a button and speak. As soon as you let go, text shows up. When you're speaking for long, you just want to kind of lock it in place. You double tap it, you can speak, it's hands-free, and then you can tap it again and the text shows up. And so that way you don't have to keep holding something; your fingers don't hurt if you're, you know, rambling for five minutes. Now these two are important things to teach people, but if you teach people too much at the same time, they're going to get confused. And so we stagger this education; we make this education contextual. So first we just teach you how to do the short thing. Because I know the short thing builds the most dopamine. Yeah. And so we teach people that, we get people to try that, they love it. And then the moment we see them doing something longer, they're speaking for more than 20 seconds, then we tell them, like, hey, did you know you can do this thing? And so there are so many mechanics in this. And as a PM, as an engineer, you want to teach your users everything about your product as soon as they first enter it, right? And you see this: a lot of products have tours like, hey, do this, this, this, this.
A
Oh my goodness.
B
Yeah, it's like a 27 step tour you have to do. You don't remember anything. When you wake up the next morning, you barely get activated. But making it contextual, teaching people why they should care about it, how to do it, spaced repetition. You, like put all of that in and you start to build a good framework for how you should teach users to be most effective with the tool you've given them.
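The staggered, contextual teaching described above can be pictured as a small rule: teach push to talk first, and only surface the hands-free tip once a dictation runs long. The Python sketch below is a hypothetical illustration, not Whisper Flow's code; `OnboardingState`, `next_tip`, the tip wording, and the mechanic keys are all assumptions, and only the 20-second trigger comes from the conversation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Trigger mentioned in the conversation: speaking for more than 20 seconds.
LONG_DICTATION_SECONDS = 20


@dataclass
class OnboardingState:
    """Tracks which mechanics this user has already been taught."""
    taught: set[str] = field(default_factory=set)


def next_tip(state: OnboardingState, dictation_seconds: float) -> Optional[str]:
    """Return at most one tip, chosen by what the user just did."""
    # Mechanic 1: always taught first, regardless of context.
    if "push_to_talk" not in state.taught:
        state.taught.add("push_to_talk")
        return "Hold the button and speak; let go to insert the text."

    # Mechanic 2: only surfaced once the user actually dictates for long.
    if dictation_seconds > LONG_DICTATION_SECONDS and "hands_free" not in state.taught:
        state.taught.add("hands_free")
        return "Double tap to lock hands-free mode; tap again to finish."

    return None  # nothing new to teach right now


if __name__ == "__main__":
    state = OnboardingState()
    for seconds in (0, 5, 25, 40):
        print(seconds, "->", next_tip(state, seconds))
```

Because each call returns at most one tip and records it as taught, the user never sees the same tip twice and never gets two at once, which is the opposite of the up-front product tour.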
A
Yeah, it's almost like thinking about instruction and onboarding as a funnel rather than Just like beating them over the head with everything they could possibly know right up front.
B
Yeah. So I tell my team, like, the onboarding is, yes, there is a certain part of it before you get access to the application, but actual onboarding lasts months. And all the things that we want to teach people, all these mechanics, the way we do it, we list all of these out. And you'd be surprised. Like, Whisper Flow has about 57 different mechanics that we've got to teach people. It's a really simple product. You press a button and speak is what it looks like. But there's actually a lot to teach people. And so we spread that out over time. Some things are contextual, some things are not. And you just build this entire user journey, which, again, as a user, you don't see.
A
Yeah.
B
But from our side, that's what gets all the decisions made on how the product looks and feels.
A
How do you get over the challenge of integrating at such a deep level with the operating system? This is an input mechanism. It's akin to a keyboard or a mouse. Right. And so to get where the user is actually inputting speech, like, you have to embed yourself very, very deeply. And obviously, different operating systems have different levels of controls and restrictions. You know, I would say macOS is probably a little bit more permissive than, say, iOS. Right. Where I know you have to, you know, become a keyboard app, which I think. Which I think Whisper is.
B
Yeah.
A
How do you think about that challenge?
B
I think about that challenge as a necessity. Because here's the thing, right? When you're building a consumer product and you've built consumer products for years, like, you know this, you cannot change a lot of user behaviors at once.
A
That's right.
B
You can change one thing at a time. And so if you're somebody who, like, works in Slack all day, I can't take your Slack away. The max I can do is tell you, like, hey, you know, you use Slack like this, keep everything else the same. Just switch your keyboard with voice. And so with that, it's like, okay, Whisper just has to work everywhere you work. It can't be a separate place that you go and use, because you won't use it as much then. And the second thing is, what a lot of companies do is, oh, go connect it with your calendar and your Google account and your Slack account and log in with all of your apps and sync them up and add in MCPs and yada, yada, yada. Most people aren't going to do that.
A
Yeah. Too many steps in the funnel.
B
Yeah. So if you're trying to build something that you want hundreds of millions of people to just be able to use, you have to do the hard thing of making sure this works across all the applications. Right now Whisper has been used in over 500,000 applications and websites.
A
Crazy.
B
Where it just works off the bat. And that was a product decision that came from users and it made it hell for our technology team to make sure that it works in all the places. Because I can't even begin to talk about all the edge cases that we have to deal with in every single application. Notion does bullets different than Slack does bullets. And you need to know all of that and manage all the edge cases. It just has to be built that way. That is a non negotiable from a user standpoint. And that is why on the surface it looks like the simplest product ever. But I think it might be one of the most technically complex software products to be out there in the market today.
A
Those are typically the best and the most defensible ones. Right? The ones that are extremely technically challenging to build, but extremely accessible and beautiful and useful on the, on the front end. Right. You know, we always talked a lot at Spotify about how, yeah, it seems, it seems easy to make a music player. Right. Play button, playlist search. But it's like actually there's a lot under the hood that's very, very technically challenging and you know, we've spent a decade plus building it and yeah, yeah, let's, let's see if somebody can catch up. So I agree with your philosophy there for sure. I actually heard, speaking of hard to build, I actually heard that Whisper started or was going to be a hardware company at first. Is, is that accurate?
B
It was a hardware company for three years.
A
Yeah. And tell us about that.
B
I'll take you back to 2021, the start of 2021, right? Sahaj gives me a call. He's like, Tanay, I've just left my job and I'm down to start a company. And I was like, never thought this man would say these words to me, but this is the one guy I would love to...
A
Like, this is my moment.
B
Yeah, I was working at this larger company at that point. So I decided to kind of start working with Sehaj on this. And the thing that we narrowed down to was, this is when GPT-3 had come out. This is before ChatGPT, right? This is now in February of 2021. It was very clear to us that in the next few years people are going to be using voice for everything that they do: talking to their computers, phones and beyond. For a world like that, my thought was, okay, you want to be able to use voice when you're around others, to keep it private, not disturb them. And so what you need at that point is a device that can understand when you're silently speaking. So over the next three years, we assembled a team of 40 of some of the best PhDs in neuroscience, signal processing, machine learning, electronics, and put them together. And we built what was the world's first device that went from thought to text. No limits on the number of words, full speech. We also built thought to voice that sounded pretty much like you. And this was maybe one of the most incredible pieces of technology. And for reference, it looked like a larger AirPod you would wear, completely non-invasive. We had a data collection program with about 50 people coming into the office every hour to wear our hardware devices and collect data that we used to train our models and more. Mid last year, so this is mid 2024, was when this finally started to work, where I could put it on and I could just use it. And so we were like, okay, what do we use it for? We hooked it up to ChatGPT, Siri, Alexa, and they all sucked. We were like, hey, we need something that goes from your mental rambles to something that is structured and useful. And we called that little project Flow. And it was the Flow operating system that ran on the hardware. And one day I was like, hey, I want people to be able to test it who don't have our hardware device.
Let me make it a desktop application, which was an afterthought. And this afterthought is now what everybody knows as Whisper Flow and the product we all know and love. But that's where it started, because when we launched that, we gave it to a few beta users. It was insane what the market pull was. And we saw within the company all of us were using that day in, day out. And here's the kicker: all of us were totally fine using it in the open office without needing a silent speech interface. And so that to me was the real belief that, okay, this is something that the world has been craving, that they haven't had until today. This is what we need to deliver to the world. And once we do that, right, once this is in the hands of hundreds of millions of people, then one day we can go and tell them, hey, you know this thing you love using on your phone or laptop, what if you didn't have to take your phone out of your pocket for that? And that would be a time where a hardware product would make sense. And so we essentially flipped the order in which we were building the company. And that was maybe the hardest few months: acknowledging that we were going to do the pivot, killing the thing we'd been working on for three years, going from a 40 person company to a five person company overnight, switching from a deep tech R&D brain-computer interface startup to building a consumer AI product. Right now, in hindsight, it's maybe one of the best things we did, but.
A
You probably had to go through what you did to get to that moment. Right. Like you in some alternate universe, I mean, could you imagine yourself actually starting with the software product or did it take the learnings of the hardware to get to the software product?
B
I don't think anyone in their right minds decides to start a voice dictation company.
A
That's true. Yeah. Especially in 2024 or whenever you pivoted. 2020, I guess you probably pivoted in 24, right?
B
Yeah. So 2024, August 2024, pretty much, yeah.
A
That's fascinating. I'm sure you've seen there's a company that recently had a demo video that I think is very similar to the hardware, or at least it seems very similar to the hardware you described. Is that the same technology? Is it different technology? Does that make you want to rush back to the hardware side?
B
By the way, it's Arnav. He's from the same city and neighboring high school as me from Delhi, which is really funny. Like the two of you were talking about.
A
That is really funny.
B
And his younger brother, who also worked on this, was actually one of my close friends. We both got into MIT together. I decided to go to Stanford because MIT was too cold. He went to MIT. So in some world, if I'd gone there, I'd likely be working on that with them.
A
Wow.
B
And so, a lot of respect for them. But I also know where their technology is today, which is similar to where ours was in January of 2024. And I know all the work that needs to go in to get it production ready.
A
Yeah.
B
And I also know that given what I know now with seeing how people use flow, if I was building a hardware device, I would build something completely different.
A
Oh, wow.
B
No rush, no FOMO. I have a really good sense of what I would want to build, but I'm actually excited somebody's taking this up because honestly, that technology is magical if you ever get to use it.
A
Yeah. I mean, it's smart I mean, I think if you can become synonymous with voice input, it seems like an easier leap to sort of ladder into hardware than if you started fresh. Right. If Whisper just is kind of the voice input company, it's like a very natural evolution.
B
Yep.
A
Yeah. Makes a ton of sense. Where else do you think we're going to see voice pop up? I mean, you said, you know, that years ago you noticed that voice was going to become kind of the interface of AI, and we all see it and we all believe it. It feels like it hasn't fully happened yet, even though people have been talking about it for a couple of years. In fact, it feels like Whisper Flow is kind of the first true validation of this thesis. Yeah, I can't really think of much else. So, yeah. How do you think the landscape is going to evolve over the next couple of years?
B
So I'll give you the short-term and the long-term view on this. So the way I think about what we're building at Whisper and where voice is most useful is taking a look at all the grunt work that you do in the day, things that you would rather not spend time on, and being able to automate that away so that you have time to do what you really care about and what you really enjoy. Where I see this going next is, right now, you speak and Whisper writes for you. And the next is you speak and Whisper is going to do things for you. And that is what we're building towards. The biggest challenge, right: personal assistants like Siri and Alexa, again, have existed for years, but the thing with them is they promised a thousand things, did 50, and did just two well, which is why we all use them to change songs and set alarms. And the way you're going to build a voice assistant that people trust is you want to be able to reliably execute everything the person asks you to do and make the person know, like, hey, these are the things that this thing could do for me. So we're not going to promise people the world. We're going to promise people 10 things that are insanely valuable to them that they want to do day in, day out. And we're just going to do them insanely well. And that is what you're going to see happen next with Whisper and likely other players as well. And so that is what the next phase to me looks like. And then what it all goes to is like, even today, Whisper is a fantastic tool, right? But if you don't have Whisper, you're going to be sad, but you can still use your computer, and you can still do those things. The place where it starts to become a necessity is when you think about what's going to happen in the next three to five years. We're going to be stepping away from our phones and laptops. We're going to be getting to immersive computing devices. These are your AR glasses, smart watches, smart rings.
And the biggest thing that people don't realize happens with them is you no longer have a display as the primary interface.
A
Right.
B
You are just reliant on voice, and you need a voice interface that you trust, which is why our focus is on Zero Edit, which is why we want people to not even read what Whisper writes, just press send, which is what usually happens for most users. Because once you have that, that is what you need to build immersive computing devices. And that's what we're building everything for. Because I want to be the company that lives between the person and everything else that's happening on AI and devices.
A
So clearly you will be or you envision becoming sort of like the input side of the interface. Is there a world where Whisper captures sort of the output as well, like maybe through audio? Where Whisper is almost like giving you information through your ears or through your hearing, other than through your eyes, if that makes sense.
B
You know, on the surface, it's a voice company, right. But what I mostly care about is building the most intuitive interfaces for the problem. And so in some cases with these devices, you may want a little visual output, you may want an audio output, you may want some other kind of output. To me, I'm ambivalent to all of those. I just want to build what's best for people. With a lot of what we're building today, it's voice input, because there are so many places where voice is the superior and the right solution for the problem. That is what we're focusing on. But overall, it's just about the intuitiveness and seamlessness that we care about. And that's the core mission of the company as well.
A
Do you think about wedging in even higher up in the funnel of the operating system? For example, right now, on iOS, again, you're a keyboard app. The user has to take a series of steps just to, you know, have you sort of embedded there. Like, can you ever be... How do you replace Siri? Or, you know, be able to access the microphone sort of right off the bat on iOS?
B
Very doable on Android. For iOS, I need to become better friends with Tim.
A
Yeah, yeah, exactly. We gotta get him using Whisper. And get him addicted to it.
B
And then we honestly should. I think there's a few hundred people at Apple that use Whisper today.
A
Yeah, I'm sure, I'm sure that makes, that makes a lot of sense. What do you think is maybe holding us back from a technology perspective to kind of reach that vision for the future you have? So you mentioned we start to step away from our phones, start to step away from our desk. Like what needs to happen in the market that you're not working on.
B
Yeah.
A
Before we're in that future that Whisper can really take advantage of.
B
The biggest thing for me is AI agents. A lot of people think that AI agents are here and AI agents are fantastic. I'm in the exact opposite camp. They are so, so, so far from where they need to be.
A
Yeah.
B
Because it's not about building a capability. It's not that you tell it to find somebody on LinkedIn and it can find somebody on LinkedIn, it's that it finds the right person on LinkedIn. Because right now I would say AI agents are comparable to extremely mediocre interns. And you want it to be something that you can delegate tasks to and trust to do those tasks well. There is nothing in the market today that does that. And that is something I want to see happen. That is one of the hardest problems to solve in AI today. And the reason it doesn't work is when you look at these data sets that these models are trained on. I was going through some of them, and the kinds of things that they said were like, hey, take this window on my screen and then resize it to 300 pixels by 600 pixels and move it across the right edge. And it just goes on. And you're like, no human being is going to speak like that. Humans are going to be like, hey, can you take things from this tab and put it into that tab? And it needs a lot more context, it needs a lot more two-way communication, it needs a deep understanding of the user and so much more to make it right. Nobody's solved that yet. And so I really hope people do. Otherwise, for some specific problems that we're solving, this is something we're working on internally, to just build incredibly reliable agents. Because if something's not reliable, it's not worth building.
A
Yeah. So let me ask you about that. So you imagine a world, you talked about this earlier, where you go beyond input and you start going into actions and agents taking actions on the user's behalf. It sounds like, from what you just said right there at the end, that you may actually build those agents yourself rather than, you know, maybe it calls over to Claude, which takes an action, or ChatGPT, which takes an action, or maybe some macOS-level agent that takes an action. So these are going to be your agents.
B
I want to solve this problem for people, right?
A
Yeah.
B
If somebody else builds it, fantastic. I'm the happiest guy. Less work for us. But if nobody's building it, then we just have to go and do it. And the same thing happened with Whisper, right? In 2021, we said voice is going to be there. 2024, our hardware product is ready; voice still sucks. And so we had to take it upon ourselves to solve this voice problem, which we did. And with agents, again, I hope that in the next few months it starts to get better and actually usable. If that doesn't happen, then again, it's something we just have to solve because that's what people want.
A
Does typing go away? Will we stop using keyboards or thumbs on glass in the near future?
B
Yeah, typing is ridiculous. It's just a hack that we had to build for the last 200 years because we had no better way.
A
So it goes away completely. You think?
B
Yeah. Why would you need it? And once you have immersive computing devices, like, you're not going to do this in the air. That looks stupid.
A
Yeah.
B
You're just going to communicate the way you talk to another person. Why should talking to technology be any different than that?
A
And I guess there's even a version of the future where. Well, like the device you previously worked on, you won't even have to say the thing out loud. I mean, if you're in a quiet lab in your university or something like that, or the library, you don't even need to say it out loud.
B
Yeah. Or if I want to tweet on this podcast, I can just do that.
A
Right, right. Wow. So in between, while I'm asking questions, you might be tweeting right now with thought.
B
I could have had so many great tweets.
A
That's awesome. What can we expect from Whisper in the near future? Like, what should we be looking out for in terms of, you know, upcoming releases or anything like that?
B
There's a lot. Whisper is going to be able to take actions on your behalf. We're expecting to ship an Android app soon.
A
Nice.
B
We're making Whisper better in a lot of languages, and so we expect to see a lot of happy users across the world for the company. But more than anything else, it's setting up a really strong foundation for everything that we're planning to ship next year. And so if you're somebody exceptional who's looking to join a fast growing company and do your best work, now is the best time to join.
A
Yeah, join Whisper. Where can they see the job listing? I'm guessing like whisperflow.com jobs?
B
Yep. Pretty much go on our website. It's there. Search us up on LinkedIn. It's there. We're hiring for literally every single role possible. So just drop a note regardless.
A
Amazing. Tanay, thank you so much. Super inspiring. Very, very excited for you and the team and can't wait to see where you take it from here.
B
Hey, Mike. Pleasure's all mine.
A
Thank you for listening to Generative Now. If you liked this episode, please rate and review the show. And of course, subscribe. It really does help. And if you want to learn more, follow Lightspeed at LightspeedVP on X, YouTube or LinkedIn. Generative Now is produced by Lightspeed in partnership with Pod People. I am Michael Mignano and we will be back next week. See you then.
Host: Michael Mignano (Lightspeed Venture Partners)
Guest: Tanay Kothari (Co-founder & CEO, Whisper Flow)
Air Date: October 16, 2025
This episode of Generative Now dives deep into the story and vision of Tanay Kothari, the serial entrepreneur behind Whisper Flow — an AI-powered voice dictation platform that aspires to replace keyboards and redefine human-computer interaction. Host Michael Mignano explores Tanay’s early passion for building products, the technical and design challenges of achieving voice-first computing, and the future of seamless, intuitive interfaces. The conversation moves from Tanay's origin story and lessons learned from Whisper Flow’s viral success to bold predictions about a post-keyboard world.
Early Coding Days (04:19–08:23)
Entrepreneurial Growth (01:40–03:55)
Zero Edit Usability (11:04–13:18)
Inclusive Product Thinking (09:17–11:04)
In-House Model Development (14:44–16:09)
Path to Quality & Latency (16:09–20:19)
Breadth of Coverage (27:33–29:19)
Origins as a Hardware Company (30:01–33:36)
On Competitors and the BCI Future (34:06–35:24)
Short-Term Innovations (36:15–38:39)
Long-Term Vision: Post-Keyboard World (38:39–44:13)
Potential for Voice Output (38:59–39:39)
Barriers to the Vision: The Agent Gap (40:44–42:49)
Tanay speaks with humility, builder’s grit, and a relentless drive to “make magic” for all users — not just the tech elite. The vibe is optimistic yet clear-eyed about the gnarly technical and behavioral obstacles. There’s joy in clever hacks (even saving three milliseconds!), reverence for gaming as a model for onboarding, and ultimately, infectious enthusiasm for a future where “keyboards are just a hack” — and we speak our intent into reality.