
Sean Falconer
LiveKit is a platform that provides developers with tools to build real-time audio and video applications at scale. It offers an open-source WebRTC stack for the creation of live, interactive experiences like video conferencing, streaming, and virtual events. LiveKit has gained significant attention for its partnership with OpenAI on the advanced voice feature. Russ D'Sa is the founder of LiveKit and has an extensive career in startups. In this episode he joins Sean Falconer to talk about his startup journey, the early days of Y Combinator, LiveKit, WebRTC, LiveKit's partnership with OpenAI, voice and vision as the future paradigm for computer interaction, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
Sean Falconer
Russ, welcome to the show.
Russ D'Sa
Hey Sean, thanks for having me.
Sean Falconer
Yeah, absolutely. I was looking forward to this. You know, I'm excited about all the things that you're doing at LiveKit. So I know that LiveKit isn't your first company. You know, you had a couple of swings at the entrepreneurial bat, so to speak, and you were even part of an early YC batch. Can you talk a little bit about your background and perhaps your experience with YC?
Russ D'Sa
Yeah, for sure. So background wise, my dad was an entrepreneur as well and in technology. He was starting companies in the semiconductor era and early GPUs in like the kind of late 80s and early 90s. And so I grew up in the Bay Area and I've kind of been around people starting companies and around technology for pretty much my whole life. And Y Combinator, it's kind of interesting. I was in college and a friend of mine ended up dropping out and joining a company that was just down the street from YC's Mountain View office. I think this was when YC started in Boston, in Cambridge, and then opened up an office in Mountain View. And so my friend walked over one day and said, hey, how do we be part of this? And they invited us to a dinner. I think that was for the second batch. So we went to dinner and met a bunch of founders and decided we were going to apply to YC. And so we didn't get in our first time, but then the second time we applied, I think it was for the fifth batch, summer '07, was when we ended up getting accepted into YC and joined a group. I think it was 18 companies in that batch. Everyone was scared because it was a jump from the previous batch size. I think the previous batches, 1 through 4, were all maybe like 10 to 12 companies, maybe even the first one was 8, and this was 18. And so we weren't sure how that was going to work. And so we moved to Boston, to Cambridge, and I think we were the second to last batch in Cambridge, and started a company there. And in my batch, who was in that batch? Drew and Arash from Dropbox were in that batch. Immad from Mercury was in that batch. And so I think YC today and YC back then are both valuable propositions for an entrepreneur, but also very different. This was in 2007. There wasn't a lot of information on the Internet, but also it just wasn't normalized, outside of maybe Zuck, being a young founder coming out of school, and how do you actually start a company and build a company from the ground up? At that time, the iPhone platform was just announced. Sorry, actually the iPhone just came out that summer. There wasn't even a development platform for it yet. You couldn't build your own apps for mobile. And so everyone at that time was just working on a website. I guess Drew was working on a desktop app for storing files, but everyone else was pretty much working on a website. And so it was just a different time. And I think what was interesting about my YC experience, that maybe isn't necessarily the same as it is now, just because it's a different beast: back then you kind of found your tribe in YC. You know, it's 18 companies, and when you're in college, there weren't a lot of people at that time who wanted to start companies or were thinking in that way. Everyone was, I'm going to get a job at Google, I'm going to get a job at Microsoft. And so being part of YC, it was much more like a small family of like-minded individuals who were trying to go and do this thing against the grain of what young kids out of college are supposed to do. Amazing experience, one of the best of my life. But yeah, YC now is different. Valuable but different.
Sean Falconer
Yeah, absolutely. I vividly remember those days. I remember Dropbox being in the YC batch. And that was a time when you could really, if you were sort of deep in this world, it was easy to pay attention to everything that was going on in YC because it wasn't that many companies. And that was also not a hot time. About 2006-07 to 2011-12 was not a hot time to do a startup. I started a company in 2009 and it was really a time where we were like, really? Is that a good idea?
Russ D'Sa
Yeah. We were hitting kind of a startup winter in 2007. And I remember that as well. I think, like, I want to say one of the top VC firms, maybe it was Sequoia, they definitely published the "R.I.P. Good Times" deck, but there was a similar sentiment in 2007 as well. That was happening, I think. Another interesting part about YC during that time was the biggest hit that YC had during that period was Reddit. You know, it was a $10 million or so sale to Conde Nast, and that was kind of the big exit for YC. And so I would say that there was more of a sentiment back then around, oh, YC is this kind of startup school that is taking bets on kids. But, you know, like, who knows if this actually works? And maybe it's actually a negative signal that you're funded by YC. Positive signal that it's Paul Graham, but negative that it's this new unproven thing. And then over time, of course, you had Heroku and other companies that came out that started to do a bit better in the market and have bigger exits. And then YC became legitimized over time, and Dropbox, of course, helped with that too, and Airbnb, but it was a very different time. Another interesting tidbit about YC was that we were on the East Coast and we were pitching all these East Coast investors, and I remember Paul saying, like, okay, well, this is a warmup for the real kind of pitch, which is the real demo day, and that's on the West Coast. And so after the East Coast demo day, we all flew to the West Coast, and Paul kind of recommended that founders stay on the West Coast, like, move to California and build your companies there. And so at that time, Silicon Valley was still, I would say, largely South Bay, Peninsula focused. There was, of course, Twitter in the city, and Salesforce was in the city, but a lot of the smaller startups weren't. And I think Justin Kan and friends from Justin.tv, they had this apartment in, I think it was called Crystal Tower, somewhere in the city. I don't exactly remember where, but from my batch, all of these startups, I think Drew, and I think Daniel at Disqus, and a bunch of them moved into Crystal Tower, and it became called, I think, YC Tower, I think, was the name or something like that. And I've thought about it back now in retrospect, and in some ways, I feel like our batch of YC kind of created the San Francisco startup ecosystem. Not intentionally, it just kind of incidentally happened because all of these founders from our batch were moving to the city and all living in the same place, and everyone knew about this being like the YC startup kind of epicenter after you graduated from YC. So kind of interesting to look back.
Sean Falconer
Yeah. And I think that of course, as companies that did well maybe had their start in San Francisco, it creates a situation where people take the wrong interpretation of that and associate, okay, well, if I want to do well, I also, you know, have to be here and stuff like that. It becomes this, you know, people cloning what they believe to be the recipe of success and so forth. But so now you've started LiveKit. How did that come about? What led to you starting that company?
Russ D'Sa
LiveKit is my fifth company, and the one I did in YC was the first one, and I've had a few in between. It's interesting because my YC company back in 2007 was trying to do real time streaming of video over the Internet, introducing two people to one another that have never met before. I don't like to use the Chatroulette, three years before Chatroulette, kind of analogy, but effectively that's what it was. And very few people were doing real time audio and video through a web browser at that time. And so we kind of rigged something together with glue and tape, and there wasn't great support within the browser for it. Kind of fast forward to 2012. WebRTC, which is this protocol that was designed specifically for streaming real time media, started to get built into Chrome and then slowly expanded into other browser implementations as well. And now in 2021, when I started LiveKit, I'm kind of returning back to the same similar type of technology that I was working on in 2007. Now I'm much more mature and I think the use cases are a lot more clear too. So we started off trying to connect people over the Internet in 2007, and then the pandemic happens in 2020 and everyone's building software to try to connect people over the Internet, because that's the only place you could go to find other people and interact with other people that weren't in your house with you. And in that world, it's kind of shocking how little had happened in terms of having scalable infrastructure that made it easy to build an application that connected people in real time. Yes, you had something like Zoom, which they'd been working on for a decade and it was very mature and performed very well. But that was at the application level. If you wanted to be able to build something like Zoom or take Zoom's features and put them into your application, that still wasn't easy to do. And there was actually no open source infrastructure for doing it in 2020, 2021. And so I was working on a side project that was trying to do something similar, which is connect people over the Internet in real time. This time, though, for audio and not video, and I struggled with the same thing. There was not really great infrastructure to do it. There were commercial providers, but they were really expensive and they didn't scale to really large numbers of people. And so my application needed to be able to run economically. It needed to be able to have thousands of people in a session together potentially. And that's when I started to look at open source, realized there was nothing there, and so pinged my old co-founder from the previous company. You know, company number four. And fun fact, we met during company number one. We started separate companies in YC and we met in that batch. But we started LiveKit together as an open source project to build infrastructure that any developer could use to create real time experiences.
Sean Falconer
In terms of that infrastructure, and that open source project, I guess, like, where did you start with that from an engineering perspective? And then how did you actually get from, you know, this open source project to it turning essentially into a company?
Russ D'Sa
So the way that it kind of happened was, the original impetus for working on it was a side project. My previous company had been acquired by Medium and I was now leading product at Medium, and I was working on a side project that was kind of best described as Clubhouse for companies, where you could have kind of drop-in casual audio conversations with your coworkers during the pandemic. And when I went to go actually integrate the streaming audio piece of it, what we found out, when we were looking at open source to integrate into the application, was that there wasn't anything that was easy to deploy, that had SDKs on every platform, and that really felt kind of consumer or production grade to use. I kind of liken it to Stripe: what Stripe did for payments, LiveKit is trying to do for communications. So Stripe didn't invent kind of payment processing. There's already kind of an underlying network of payment processors and gateways. And what Stripe did was they effectively took all of this infrastructure, which is very difficult and a bit obtuse to kind of integrate into your application, and made it into a very simple piece of software that you can go and write against. And they kind of handle all of the complexity, or the undifferentiated problems, of, you know, handling payments in an app. LiveKit is doing the same thing, or an analogous thing, for communication. So handling the undifferentiated pieces of how do you actually connect one or multiple people or machines together with ultra low latency anywhere in the world, where those nodes kind of in this graph are located anywhere. The interesting part about the technology that existed before LiveKit is that the protocol itself is already there. I mentioned it earlier. It's called WebRTC, Web Real Time Communication. And the way to think about that protocol is it's a higher level abstraction built on top of UDP. Most of the Internet itself, when you use a browser, when you interact with applications, what you're interacting with underneath is really TCP. TCP and UDP are these two different protocols. And TCP is not really designed for real time media streaming. The main reason for that, there's one key difference between TCP and UDP, a lot of differences, but one key one. The key one is that TCP requires every packet that is sent over a network to be ordered. So the receiver must wait for those packets to arrive in the order that they were sent. That's the main difference. So why is that a problem for real time media streaming? Because with real time streaming, you only really care about the latest packet. Something that someone said a second ago, or a video stream capturing someone from 5 seconds ago, doesn't really matter. You really just care about the absolute right-now edge kind of packet. And so with a protocol that requires that all packets arrive and are handed to the application in order, if a packet gets delayed while it's transported through a network, or if it gets lost somehow and not delivered, you have to halt the entire application before you can actually touch any of the packets that are coming in real time. You have to go and get that old packet before you can do any of that. And so TCP is not well designed for this, and the Internet in general is not really designed for real time audio and video.
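To make the head-of-line blocking point concrete, here is a minimal sketch (TypeScript on Node, not LiveKit code) of a receiver exploiting the freedom UDP gives you: it simply drops any packet older than the newest one it has already played out. The port number and packet layout are invented purely for illustration.

```typescript
import { createSocket } from "node:dgram";

// Illustrative only: each datagram carries a 4-byte sequence number followed by
// a media frame. Because UDP hands packets to us as they arrive, we are free to
// skip anything older than what we already played out. With TCP, a single lost
// packet would stall everything queued behind it.
const socket = createSocket("udp4");
let newestPlayedSeq = -1;

socket.on("message", (msg) => {
  if (msg.length < 4) return;          // ignore malformed datagrams
  const seq = msg.readUInt32BE(0);     // sequence number written by the sender
  const frame = msg.subarray(4);       // the media payload

  if (seq <= newestPlayedSeq) {
    // Late or duplicate packet: for live media we only care about "right now",
    // so we drop it instead of waiting for it.
    return;
  }
  newestPlayedSeq = seq;
  playFrame(frame);
});

socket.bind(5004);

function playFrame(frame: Buffer): void {
  // Stand-in for a real jitter buffer / decoder.
  console.log(`playing frame of ${frame.length} bytes`);
}
```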
UDP gives the infrastructure provider, or the lower level software layer, full control over what to do when you miss a packet or when a packet's delayed. You can decide to actually fetch that old packet if you want, you can decide to try to conceal it by interpolating between packets you have and closing the gaps that way, you can add error correction so that if you have a damaged packet or missing information, you can reconstruct the information that you were missing from kind of the pieces that you do have. There's all kinds of techniques that you can use to deal with packet loss or delayed packets over the network. And UDP gives you that control. You don't have that control in TCP. And so WebRTC is just another layer on top of UDP that gives you kind of nicer facilities. But at the same time, the problem with WebRTC by itself is that it's peer to peer. So you're sending your data over the public Internet and there's no server in the middle when you use vanilla WebRTC. And so if I'm talking to three people in a video call, I'm sending three copies of HD video to all of those people at the same time over my home Internet connection. It just doesn't really scale. And so what LiveKit does is LiveKit builds server infrastructure and client SDKs, where you take the server, you deploy it somewhere in a data center, and then the client SDKs that you integrate into your app all connect to that server. And the server is like a router: everyone only sends one copy of their information to that one server, and that server is determining who should get what, at what frame rate and resolution. And so it's the mediator here. LiveKit Cloud is the next step above that. And that's kind of the commercial thing you were asking about, like, when did we start a company from this? I think we were actually on a more accelerated timeline than most companies are in the commercial open source space. So folks like Elastic and Mongo have had a longer amount of time to kind of build out communities and continue to improve the open source kind of by itself, in isolation, before they have the pressure to commercialize. For LiveKit, it was a little bit different. And the reason it was a little bit different is we put out the open source project in July of 2021. It became a top 10 repo on GitHub across all languages for six months, very quickly, I think within three weeks. And then we had companies like Reddit and Spotify and X and Oracle who were already deploying it internally and starting to experiment with it. And when we had conversations with them, they said, hey look, we're coming from Twilio or we're coming from another one of the larger real time streaming companies. We love your code, we can see your code, you guys designed a good system, but we don't really want to deploy and scale this ourselves, even though we know we can. We would rather you do that for us and operate your own network. And we're happy to pay you for that. And so I think a combination of large companies that were pinging us and offering to pay us right out of the gate was one kind of piece of pressure that caused us to commercialize very rapidly. The second was it was 2021, and so the demand for real time audio and video infrastructure was very, very high because of the state of the world in the macro environment. And so we made this decision that we were going to continue building in open source and supporting that and giving free support and helping people in the community build.
But we also had to raise some money and build the team up a bit bigger and start to actually work on a commercial product which is called LiveKit Cloud.
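For a sense of what integrating one of those client SDKs looks like, here is a small sketch using the open-source livekit-client JavaScript SDK in a browser app. The server URL and access token are placeholders (tokens are normally minted by your own backend), and exact method names may vary between SDK versions.

```typescript
import { Room, RoomEvent } from "livekit-client";

// Sketch of joining a room from a browser app (URL and token are placeholders).
async function joinRoom(wsUrl: string, token: string): Promise<Room> {
  const room = new Room();

  // Render any audio/video track another participant publishes.
  room.on(RoomEvent.TrackSubscribed, (track) => {
    document.body.appendChild(track.attach()); // creates an <audio>/<video> element
  });

  await room.connect(wsUrl, token);                         // one connection to the server
  await room.localParticipant.enableCameraAndMicrophone();  // publish local media
  return room;
}

joinRoom("wss://example.livekit.cloud", "<ACCESS_TOKEN>").catch(console.error);
```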
Sean Falconer
Yeah, I mean, that's a fantastic story. It's similar to a conversation I had with the CEO and founder of CrewAI recently, kind of a similar idea. He built something for a side project that he was interested in and open sourced it. Suddenly Fortune 500 companies were calling him and asking him questions and how do they manage it and so forth. I mean, that's the dream scenario for anybody that wants to start a company.
Sean Falconer
Totally. This episode of Software Engineering Daily is brought to you by Capital One. How does Capital One stack? It starts with applied research and leveraging data to build AI models. Their engineering teams use the power of the cloud and platform standardization and automation to embed AI solutions throughout the business. Real-time data at scale enables these proprietary AI solutions to help Capital One improve the financial lives of its customers. That's technology at Capital One. Learn more about how Capital One's modern tech stack, data ecosystem, and application of AI/ML are central to the business by visiting capitalone.com/tech.
Sean Falconer
In terms of going from sort of this peer-to-peer WebRTC world to then making it so that you can have this essentially server-side router that is going to proxy these calls to the various people who are trying to interconnect, what were some of the engineering challenges that you ran into with changing from this peer-to-peer model into this more client-server model?
Russ D'Sa
So I think that there's kind of two pieces to this, or two sort of scaling mechanisms that you have to tackle when you do this. So going from a peer-to-peer model to a server-mediated model, that I would say is relatively straightforward. There's this term used in this world, not super important to know, but just for completeness: it's called an SFU, a selective forwarding unit, just a fancy name for a router, a router of media. The selective forwarding unit is your single server system. You deploy it somewhere in a data center, and then what it's doing is it's acting like a peer in WebRTC. So just like my client devices are peers, this server acts like a peer as well. It speaks the same language as if it's just another human somewhere else in the world. A client device sends its media to that server, and then that server makes a decision, or it's aware of who else is connected in the session to that user, and then starts making routing decisions about, okay, send them this byte, send them only audio because they only want the audio track and they muted that user's video, et cetera. Building that server is fairly straightforward. There's other people who have implemented SFUs. LiveKit was not the first one. Now it's the most popular one, but it definitely wasn't the first one. And that system works to kind of surmount the first scaling challenge of peer-to-peer WebRTC, where everyone's sending multiple copies of their media to everyone else. The problem with that model, though, and the scaling wall you hit relatively quickly with this single server mediated WebRTC, we call it single-home architecture, is that there's three problems. There's a reliability problem, there's a latency problem, and there's a scaling problem. So the latency problem is that when you send your packets from point A to point B, what you really want to do is you want to minimize the amount of time that your packets spend on the public Internet. When I send data from San Francisco to Germany in a peer-to-peer world, I am effectively sending my packets over the road system of the world. You know, networks are like the road system. There's all kinds of bridges and ditches and traffic and, you know, rush hour happening over here. Routers and ISPs kind of can hold your packets for certain amounts of time. It depends on what other data people are sending. So it's a pretty complicated web that a packet has to traverse to get to where it needs to go. And what you want to do is you want to be able to bypass that kind of web. It's not a big deal if someone in San Francisco is sending packets to someone else in San Luis Obispo. That's not a big deal. But sending packets from San Francisco to the UK, it's a much longer haul and you're going to encounter more of this kind of mess of road system on your way between those two points. So what you want to do is you want to try to terminate the user's connection as close as possible to them and then use the private Internet backbone to route the packets. What is the private Internet backbone? All the data centers of the world and all the cloud providers, they basically have interconnects and they have kind of these private fiber networks on the backend that are wired up and not as noisy as the public Internet. And so you want your packets to spend as much time as possible on these super fast routes. These, I don't want to say information superhighways, but you want them to be on that kind of good network for as long as possible.
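To illustrate the "selective" part of a selective forwarding unit, here is a deliberately simplified sketch of the per-subscriber decision an SFU makes. The layer names, thresholds, and types are invented for illustration; this is not LiveKit's server code.

```typescript
// Simplified model of an SFU's forwarding decision (illustrative only).
type Layer = "low" | "medium" | "high";
const LAYERS: Layer[] = ["low", "medium", "high"];

interface Subscription {
  wantsVideo: boolean;            // false if this subscriber muted the publisher's video
  maxLayer: Layer;                // cap requested by the client (e.g. tile size in the UI)
  estimatedKbps: number;          // downlink estimate from congestion control
}

// Pick the simulcast layer a subscriber should receive, or null for audio only.
function chooseLayer(sub: Subscription): Layer | null {
  if (!sub.wantsVideo) return null;
  const byBandwidth: Layer =
    sub.estimatedKbps < 300 ? "low" : sub.estimatedKbps < 1200 ? "medium" : "high";
  const idx = Math.min(LAYERS.indexOf(byBandwidth), LAYERS.indexOf(sub.maxLayer));
  return LAYERS[idx];
}

// Each packet from a publisher is forwarded at most once per subscriber,
// and only on the layer that subscriber should receive right now.
function forwardPacket(
  packet: { layer: Layer; bytes: Uint8Array },
  subscribers: Map<string, Subscription>,
  send: (subscriberId: string, bytes: Uint8Array) => void
): void {
  for (const [id, sub] of subscribers) {
    if (chooseLayer(sub) === packet.layer) send(id, packet.bytes);
  }
}
```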
Sean Falconer
And then the autobahn of the Internet.
Russ D'Sa
Exactly, I love that. The autobahn of the Internet. And you want them to spend as little time as possible on the residential street. So you want to have a server as close as possible to San Francisco and a server as close as possible to the UK, and then route between those two servers over private backbone. And so that's something in a single server system that you don't get. Everyone connects to that single server wherever it is located in the world, and you're going public Internet to the server and then public Internet out of the server to wherever the destination is. So, latency penalty. Reliability penalty, because that server is on commodity hardware in a data center. It's going to crash. How do you resume a session kind of seamlessly without a user interruption, or with minimal user interruption? Unclear how you do that. So kind of the failover, recoverability or reliability is another issue with the single server model. And then the third issue is scale. You can vertically scale that machine, right, add more and more connections to it, more and more users, but ultimately you will run out of resources of the physical device underneath. And you need to be able to kind of horizontally split and scale out horizontally. And so these are the three weaknesses of the single server system. And you have to be able to scale past that. So how do you scale past that? Well, you scale past that by creating a multi-server sort of mesh network. What you want to do is you want to have servers all around the world, be able to spin up as many nodes as you want anywhere in the world. And then any user that is trying to connect, regardless of where they are, they connect to the closest server to them. So you minimize that time on the public Internet. And then on this mesh network on the back end, all of these servers and data centers communicate with one another and they form kind of a fabric for routing information between them. And you have software that is doing all of that, kind of connecting different points, and packets go over that kind of mesh on the back end and then exit out and travel along the shortest path to their destination. And so that's kind of the system that we built with LiveKit Cloud when these larger companies were asking us. The open source piece is all of the SDKs and our single server. And then the part that is not open source with LiveKit Cloud is this kind of orchestration system on the back end that allows you to spin up these LiveKit servers just kind of infinitely, as long as you can find machines. And they all get registered as part of this fabric. And there's software that does all the routing. The challenges that you have to solve there, there's a lot of them. So one in particular that I think is quite interesting is state synchronization, right? What you don't want to have in a system like this, where it's kind of a mesh and all of these data centers are connected to one another, is you don't want to have a single point of failure. If you have a single point of failure, or one single coordinator or a few of them, when those coordinators go down, all of a sudden a session is severed if the session spans multiple people all around the world. So you need to have kind of this, you know, almost like a truly distributed system where any data center can kind of run independently, and then there is a type of a quorum that is formed when they want to intercommunicate. And so how do you manage state synchronization properly across all of these data centers?
And you're aware of and measuring the network in real time: are there connectivity issues between two data centers, does a connection get severed? We had to deal in the early days with these undersea cables getting severed between Europe and Asia. And how do you deal with situations like that? There was another issue I remember where in India, in Bangalore, somebody in a data center somewhere went and just disconnected a line in a router, and all of a sudden Bangalore and San Francisco could not talk to each other anymore directly. You had to go through another link. And so how do you detect those issues and deal with them? And for us, it's one of those things where I kind of tell people, in the same way that with SpaceX, Elon in the early days said, go explode rockets in the desert as fast as possible so you can figure out what makes rockets explode and, you know, mitigate it: figure out what kinds of situations in networking and connectivity happen between two or multiple users around the world when they're trying to talk to one another, figure out what breaks and what situations cause things to break and what you need to mitigate, and then mitigate them as fast as possible. So in the early days of LiveKit's kind of commercial life, we had outage after outage after outage after outage, and it was painful. But we ended up building pretty sophisticated software that now understands, I would say, like 97, 98% of the scenarios that you can run into with streaming media around the world, and we have software that can deal with it.
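A toy version of the "detect a severed link and route around it" behavior might look like the sketch below: probe the round-trip time between regions, treat failed probes as infinite cost, and pick the cheapest path through the mesh. The region names, RTTs, and the outage are made up, and real routing software is far more involved.

```typescript
// Toy mesh-routing sketch (regions, RTTs, and the outage are invented).
type Region = "sf" | "frankfurt" | "bangalore" | "singapore";

// Measured round-trip times in ms; Infinity means the probe failed (link severed).
const linkRtt: Record<Region, Partial<Record<Region, number>>> = {
  sf:        { frankfurt: 145, singapore: 170, bangalore: Infinity }, // direct link down
  frankfurt: { sf: 145, bangalore: 110, singapore: 160 },
  bangalore: { frankfurt: 110, singapore: 35 },
  singapore: { sf: 170, bangalore: 35, frankfurt: 160 },
};

// Dijkstra over the tiny graph: lowest-latency path that avoids dead links.
function bestPath(from: Region, to: Region): Region[] {
  const dist = new Map<Region, number>([[from, 0]]);
  const prev = new Map<Region, Region>();
  const todo = new Set<Region>(Object.keys(linkRtt) as Region[]);

  while (todo.size > 0) {
    const u = [...todo].reduce((a, b) =>
      (dist.get(a) ?? Infinity) <= (dist.get(b) ?? Infinity) ? a : b);
    todo.delete(u);
    for (const [v, rtt] of Object.entries(linkRtt[u]) as [Region, number][]) {
      const alt = (dist.get(u) ?? Infinity) + rtt;
      if (alt < (dist.get(v) ?? Infinity)) {
        dist.set(v, alt);
        prev.set(v, u);
      }
    }
  }

  const path: Region[] = [to];
  while (path[0] !== from) path.unshift(prev.get(path[0])!);
  return path;
}

// With the SF-Bangalore link down, traffic relays through Singapore instead.
console.log(bestPath("sf", "bangalore")); // ["sf", "singapore", "bangalore"]
```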
Sean Falconer
Where's this all run? Like, are you running this on essentially public cloud and then you're deploying your own proprietary software there to help manage the routing of this based on your sort of deep understanding of where network issues could happen?
Russ D'Sa
Yeah, it's a great question. So the way that we kind of went about it from the very start... well, so backing up, we take our software and we deploy it on public clouds, right? So public clouds can be AWS or GCP or Azure or DigitalOcean, Linode, Akamai. They all have kind of servers all around the world. And so we go and deploy our software there, and then at the software level, these servers are interacting with one another and speak a language or a protocol between one another. And I think a default answer, even for myself before I was working on LiveKit, was: I'm just going to use AWS, right? Like, it's the most popular cloud, a lot of people use it, and their network is great. But one thing that we realized with AWS, well, we realized a couple of things, but one important one was that their product is really good and they charge accordingly, right? It's very expensive, too. When you build on them, you get kind of insulated from a variety of networking issues because they're running their platform quite well. And so you don't really get a viewpoint or visibility into some of the things that can go wrong if you were to, say, build on your own cloud, like spin up your own data centers all around the world, which eventually we want to get to. But you insulate yourself from that problem when you build on AWS. There's other reasons to not build on AWS too. I mean, it is expensive and all of that. But, well, two things. One, we didn't want to depend on any one cloud provider, because an entire cloud provider could go down and we wanted to make sure we were resilient to that. And the second was we wanted to, from the very start, understand all the different types of issues that might happen, and we didn't want to insulate ourselves from those. And so what we did was we said, okay, well, we're going to go build on other clouds that maybe don't have as mature of a product, and we're also going to build on multiple of them at the same time. So what we do is we run our own overlay network that treats them all like one massive kind of cloud, but spans different hardware providers underneath. And then we have software that deals with, okay, I need to send data from here to here and it goes across clouds. And that kind of transparently happens due to this kind of overlay network that we have. And then we're always measuring the connections between different providers and different data centers in real time. Software is automatically taking some out of rotation, slotting other ones in. And so we kind of have this multi-cloud fabric that is impervious to not just a single server going down, but also an entire data center going down in a region, or an entire cloud going down. Without naming names, there was actually a cloud provider where all of their data centers became impacted at the same time, maybe about two years ago. And we had never seen that problem before. And so we built in software that allows us to deal with that scenario, and we never run across just a single cloud now.
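The "take a provider out of rotation" behavior could be sketched roughly like this. The provider names, probe counts, and threshold are hypothetical, and a real placement system would also weigh latency, capacity, and cost.

```typescript
// Hypothetical sketch of rotating cloud regions in and out based on health probes.
interface RegionHealth {
  provider: string;            // e.g. "aws", "gcp", "oci" (placeholder names)
  region: string;              // e.g. "eu-central"
  recentProbesOk: boolean[];   // results of the last N connectivity probes
}

const MAX_FAILURE_RATE = 0.2;  // invented threshold: evict if >20% of probes failed

function inRotation(r: RegionHealth): boolean {
  const failures = r.recentProbesOk.filter((ok) => !ok).length;
  return r.recentProbesOk.length > 0 &&
         failures / r.recentProbesOk.length <= MAX_FAILURE_RATE;
}

// Pick somewhere to place a new media server, skipping unhealthy regions and
// preferring not to depend on the same provider twice in a row.
function pickPlacement(candidates: RegionHealth[], lastProvider?: string): RegionHealth {
  const healthy = candidates.filter(inRotation);
  if (healthy.length === 0) throw new Error("no healthy region available");
  return healthy.find((r) => r.provider !== lastProvider) ?? healthy[0];
}
```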
Sean Falconer
So presumably you end up having to route these packets across like multiple cloud providers, sort of hidden behind this fabric that you've created. Does that create any latency issues that you have to deal with?
Russ D'Sa
Very minimal. Now, as I said, there are different kinds of quality of interconnect between these cloud providers. Like, they have peering agreements and stuff like that for kind of routing packets between them. And yes, we've definitely seen different performance levels across them as well. But in general the added delay is minimal, I would say on the order of a millisecond, or maybe two at most, for kind of this transfer piece. So it does not add an appreciable amount of latency to the transport.
Sean Falconer
We've all been there. It's 3am and your phone blares, jolting you awake. Another alert. You scramble to troubleshoot, but the complexity of your microservices environment makes it nearly impossible to pinpoint the problem quickly. That's why Chronosphere is on a mission to help you take back control with Differential Diagnosis, a new distributed tracing feature that takes the guesswork out of troubleshooting. With just one click, DDX automatically analyzes all spans and dimensions related to a service, pinpointing the most likely cause of the issue. Don't let troubleshooting drag you into the early hours of the morning. Just DDX it and resolve issues faster. Chronosphere was named a leader in the 2024 Gartner Magic Quadrant for Observability Platforms. Learn more at chronosphere.io/sed.
Sean Falconer
What about where you're dealing with like multiple modalities across real time streaming? So, you know, even us talking to each other right now, we have audio, we have video, for example. How do you sort of synchronize across those different modalities?
Russ D'Sa
Yeah, it's an interesting question. So I'll say a couple of things about synchronization, when you're in a Google Meet session or you're in a LiveKit session, or a Zoom one too. Synchronization is an interesting problem because for some use cases you don't actually want to synchronize. The reason you don't want to synchronize is because, let's say, I'll just use kind of a UX example: you're talking to someone that is located halfway around the world and their network connection, maybe their router or something like that, their ISP, hits some kind of connectivity issue where packets are getting delayed. Video packets and audio packets are different bandwidth. So video carries more information than audio. You know, it's maybe 10x or more. The amount of information you have to send is larger and the packets are actually bigger. And when you're synchronizing these packets, let's imagine that the network connection isn't good enough such that it can transport high quality video, or video, period, but it can still sustain audio. If you're synchronizing those packets, that means that when you hit a network blip, you actually can't render audio or video until you receive both. But if you have them separated in kind of independent tracks or streams, then you can actually do things like say, hey, you know what, I'm just going to freeze-frame the video, or I'm going to shut the video off, but I'm still going to allow audio to be delivered. And what that does is it doesn't break the sense of presence that you have with someone else when they're around the world. And so by default we don't actually synchronize the packets. But, well, let's say this: we don't enforce pure synchronization of the packets, but we do things on the back end using sequence numbers and timestamps to try as much as possible to synchronize the packets. Because also, from a user experience perspective, with video, especially paired with audio, you want the mouth to match the voice, right? Or what they're saying. And so you definitely try your best to do this, but you don't have kind of an invariant that the audio must be synchronized with the video. You kind of let them jitter a little bit independently, or flex a little bit independently. Now, for AI use cases, it's a little different, because you have an AI model that might be saying something to you, you know, speaking to you. And you want the transcription of what it's saying to line up with the speech of what it's saying. And so the transcription is sent over a data channel. So it's not audio or video, but it's text. It's sent over a data channel with LiveKit, and then the audio is sent over the audio stream. And you want those things to line up. Another use case there is those transcriptions, or let's call it metadata. So I'll use an example. With an avatar, you might have an avatar that you render on the client. And what you want is the manipulations of that avatar, like, I need to move the mouth this way, or I need to move it down. You want those X, Y coordinates to line up with the speech itself, so that the avatar, which is rendered on the client, its mouth can actually move synchronized up to the audio that it's saying. So it looks like it's actually saying the words that it's saying. And so that's another case where you want synchronization between the audio track and the data stream. Another scenario is where you're doing video avatar generation.
There are companies like Simli and Tavus, for example, who use generative AI to generate a video-based avatar that can kind of speak. And you want to line up the audio and the video such that there's perfect synchronization between the two, because humans are very sensitive to seeing this. And you don't want to be in that uncanny valley where it's like, okay, this is obviously fake. And so those are scenarios where you do want to actually do true synchronization between the streams. And so we have mechanisms built into LiveKit that will actually enforce that and make sure that the packets are held in a buffer on the receiver side, such that once they both come in, then you can actually hand them to the application and allow them to play out.
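As a rough sketch of the "hold packets in a buffer until their counterparts arrive" idea (illustrative only, not LiveKit internals), pairing audio frames with avatar or transcription events that share a timestamp might look like this:

```typescript
// Illustrative sync buffer: release an audio frame only once the data event
// (avatar mouth coordinates, a transcription chunk, etc.) with the same
// timestamp has also arrived, so the client renders them together.
interface Timestamped { timestampMs: number; payload: Uint8Array }

class SyncBuffer {
  private audio = new Map<number, Timestamped>();
  private data = new Map<number, Timestamped>();

  constructor(private emit: (audio: Timestamped, data: Timestamped) => void) {}

  pushAudio(frame: Timestamped): void {
    this.audio.set(frame.timestampMs, frame);
    this.tryRelease(frame.timestampMs);
  }

  pushData(event: Timestamped): void {
    this.data.set(event.timestampMs, event);
    this.tryRelease(event.timestampMs);
  }

  private tryRelease(ts: number): void {
    const a = this.audio.get(ts);
    const d = this.data.get(ts);
    if (a && d) {               // both halves present: play them out together
      this.audio.delete(ts);
      this.data.delete(ts);
      this.emit(a, d);
    }
    // A real implementation would also cap how long it waits (a jitter budget)
    // and fall back to audio-only playout, as described above for A/V sync.
  }
}
```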
Sean Falconer
So you mentioned some AI related use cases there. So how does OpenAI use LiveKit?
Russ D'Sa
So, I'll describe kind of the way the architecture works, and then it'll probably become apparent as to how OpenAI is using us. So there's kind of two fundamental modes within the ChatGPT application. One is where you're in this kind of text chat and you're texting with ChatGPT, and that's kind of using a traditional HTTP request response. I type something, I hit send, it makes an HTTP request to OpenAI's servers. The model responds with some text and uses HTTP server-sent events to stream the response, or those text tokens, back out down to the client and render them. This model, you know, of kind of using an HTTP request doesn't work for real time audio and video, for the reason I kind of mentioned at the start of our conversation: HTTP is called Hypertext Transfer Protocol, not hyper-audio, not hyper-video. It's built on top of TCP, but for audio and video and advanced voice and advanced vision, you want to actually use UDP. So WebRTC is the layer on top of UDP, and LiveKit is WebRTC infrastructure. So what OpenAI does is, when you tap on that advanced voice mode button in the ChatGPT application, you enter into a different view of the app where there's a LiveKit client SDK on your iOS device or on your Android device embedded within the ChatGPT app, even in the desktop apps too, and on web. When you tap on that button, there's a LiveKit client SDK that connects to LiveKit's network. So LiveKit, our cloud servers all around the world, you connect at the closest point on the edge to you, and then at the same time there's an AI agent on the back end that is getting taken, you know, out of a pool and connected to you. So you say, I want to talk to ChatGPT. There's a ChatGPT agent on the back end using LiveKit's framework on the backend, our agents framework. That agent gets dequeued, taken out of a pool, connected to the user, and it's also connected through LiveKit Cloud. So now when the user speaks, their audio is traveling through our network to that agent on the backend. That agent, also connected to our network, is taking the audio, processing it in GPT, and then as the audio or text streams out of GPT, it is getting passed from LiveKit's agents framework on the backend through the network again, received on the user's device where it's played out. Similar thing for the vision features, where you can screen share or ChatGPT can see what you're looking at through your device's camera. Similar type of thing, except instead of audio, it's now video, where that video is traveling over the network, arriving at the agent, and then the agent is processing it, and then the video is getting transported back over our network to the client device when it's generated.
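Conceptually, the agent side of that flow can be sketched as below. The interfaces and function names are hypothetical stand-ins, not the actual LiveKit Agents API; they only mirror the steps Russ describes: dequeue an idle agent, join it to the user's room, and stream audio through the model and back.

```typescript
// Conceptual sketch only; names are hypothetical, not the real LiveKit Agents API.
interface VoiceAgent {
  joinRoom(roomName: string): Promise<void>;                    // connect via the edge network
  onUserAudio(handler: (frame: Uint8Array) => Promise<void>): void;
  playAudio(chunks: AsyncIterable<Uint8Array>): Promise<void>;  // publish on the agent's audio track
}

interface SpeechModel {
  respond(audioFrame: Uint8Array): AsyncIterable<Uint8Array>;   // streamed audio reply
}

async function handleVoiceSession(roomName: string, pool: VoiceAgent[], model: SpeechModel) {
  // Take an idle agent out of the pool and connect it to the same room as the user.
  const agent = pool.shift();
  if (!agent) throw new Error("no idle agents available");
  await agent.joinRoom(roomName);

  // User audio flows in over the network; the model's streamed reply flows back
  // out over the agent's audio track as it is generated.
  agent.onUserAudio(async (frame) => {
    await agent.playAudio(model.respond(frame));
  });
}
```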
Sean Falconer
What are your thoughts on this? I think historically, when we look at how people use computers, we've built a lot of devices that we've gotten good at using to interface with a computer, like a mouse, a keyboard and stuff like that. But it's not naturally how we interact with people. Or even if you look at the history of search, we've over the last 20 years figured out the mechanics of how to manipulate searches on Google to get what you want, and it's not really how you talk to people. But I think some of the things that have changed over the last couple years, with large language models and now multimodal models, is I can essentially talk to something, or at least type to something, in the way that I might talk to you, and it talks back in a way that you might speak to me. And now with multimodal, I might not even need the mouse and the keyboard. I can just simply speak to it. What are your thoughts on where maybe some of this stuff is going and how it might actually change the way that we sort of interact with computers?
Russ D'Sa
So I think that we're already seeing kind of this interface that's presented to us, right, in the ChatGPT application. There's other voice-based applications as well that present this interface. And I also think in other areas, like telephony for example, which is kind of a native voice interface to a system, right, I pick up the phone and I call a customer support line or something like that, and I invariably talk to an IVR system, right, like an automated, kind of a dumber, automated telephone system. We're starting to see kind of AI get integrated into these use cases, and that's a glimpse, I would say, of kind of where computing interfaces in general are going to go over the long term. I think that there are going to be some bigger catalysts too. Well, so at a high level, as the computer gets smarter, the inputs and outputs to that computer become more natural and human-like, right? Where the computer is starting to become more human, that means the senses also become more human, right? What it takes in, what it can output. And so naturally, I think over time, the keyboard and the mouse, they won't go away completely, or it'll take a long time before that happens, if it happens completely. But I think that voice and vision will become predominantly the interface for how you interact with a computer and give it information, and how it gives information back to you. It's not to say that a screen won't exist anymore. There's still certain types of things, like if I'm having a computer order food for me, I don't want it to read, you know, the entire menu to me. That would take way too long. It's much faster for me to just look at the menu and scroll through it and tap on what I want, or maybe tell it what I want with my voice. So screens won't go away as, like, a visual mechanism for presenting information. But, you know, we will increasingly leverage cameras and microphones as the peripherals or sensors for how we interact with the computer. And I think the catalyst for this: so as you mentioned, we've become really good at how to use computers in the kind of old paradigm, right. We're really good at keyword search and using that to find what we want on Google. But I personally have been using Google less and less now that I've gotten accustomed to using ChatGPT. There's this kind of talk about, like, copilots, and if copilots are going to be the interface. And I think that they will definitely be an interface. I think Jarvis is going to be an interface, right, for creators to do things and express themselves. But I think that for us, we've gotten so good at, or so comfortable with, the flows that we're used to. I think the real proliferation of voice and computer vision as the dominant interface is going to come from, one, a younger generation. I don't know if you have kids. I don't, but I have friends that have kids. And they all talk about how they just interact with their voice all the time. They're very used to sending voice memos and they're used to interacting with Alexa or Google Home, et cetera. And so it's a normalized behavior for them to talk to the computer. And so I think in younger generations, they're not going to have kind of the baggage that we bring in of, you know, specializing ourselves for how to interact with the computer using a keyboard and a mouse. I think that's one thing.
I think the second thing is, once we start to have real agents that can do work, not just the copilot where it's pairing with you over your shoulder, but where you can tell it to go do stuff and it just goes and does stuff, that's coming pretty quickly. And once that's the kind of relationship you have with the computer, it's going to be much more similar to how you work with another human being, where you have a meeting, you talk about what you want to accomplish, and then people go off and start to work on stuff on their own independently, and then you come together again to sync up or to refine. That's the predominant way that people work. They don't pair as much as they kind of work independently. And so I think that's going to be another catalyst for voice and just, you know, naturally speaking and interacting with a computer to become more mainstream. And then I think the third one, which is the huge one, is when the models get smarter and smarter, there's this increasing pressure to embody them within robotics. And so when we have humanoid robots, that really is going to be like interacting with another human. It's just, you know, made out of, I don't know what they're going to make them out of, but, you know, carbon fiber, I don't know, I'm not a material scientist. But with those humanoids walking around, you're not going to walk up to one and type on its keyboard. You're going to, you know, interact with it like you would with another human being in a physical space. And so I think that's going to be the third kind of major catalyst that really takes the voice and computer vision interface into the mainstream. And computers in the future, they're not going to look like the computers of the present. They're going to be on any surface, or you'll be able to just call them up onto any surface, and they're going to be walking around and interacting with you as well.
Sean Falconer
Yeah, I think even that asynchronous work mode for a consumer-facing application is a new type of workflow as well. Because other than when you send something that's supposed to be a communication to a person, most of the things that we do with computers, from a business perspective, are kind of like: I immediately get some sort of output from the machine. I'm not telling it to go do some deep research and then come back to me five days later with the results. And I think that's also, and you kind of alluded to this, going to be a real shift in the way that people interact with computers. We're just not used to these asynchronous workflows where we're telling the machine a task and it comes back with a result at some point.
Russ D'Sa
Exactly, yeah. We're really in the early innings of it, but we're starting to kind of iterate towards that model, especially with, like, o1 and o3 and this kind of test-time compute stuff. We're starting to kind of get accustomed to the model taking some more time thinking about things, doing maybe some more complex work, and then coming back to us with the result. And I think the interface is going to evolve for how you interact with it. Today you kind of sit there and watch it think, you know, it says, oh, I'm doing this, and now I'm doing this. Another step away from that is going to be, okay, hey, I'm going to go do some research and tackle this for 10 minutes or 15 minutes or 45 minutes, and I'll ping you back, and let's talk about the results and review what I came up with. You can kind of see the direction that this is going, and it's pretty exciting.
Sean Falconer
Yeah. I think this is really fascinating stuff. Russ, thanks so much for being here.
Russ D'Sa
Oh, yeah, thanks for having me and appreciate the conversation.
Sean Falconer
Cheers.
Podcast Information:
In this episode of Software Engineering Daily, host Sean Falconer engages in an insightful conversation with Russ D'Sa, the founder of LiveKit. The discussion delves into Russ's entrepreneurial journey, his experiences with Y Combinator (YC), the evolution of LiveKit, its partnership with OpenAI, and the future of computer interaction through voice and vision technologies.
[01:22] Russ D’Sa:
“My dad was an entrepreneur as well and in technology. He was starting companies in the semiconductor era and early GPUs in like the kind of late 80s and early 90s.”
Russ shares his deep-rooted connection to entrepreneurship, influenced by his father’s ventures in the tech industry during the semiconductor and early GPU eras. Growing up in the Bay Area, Russ was immersed in a culture of startups and technological innovation.
Russ recounts his experience with Y Combinator, highlighting the differences between the early batches and the current iteration.
[01:22] Russ D’Sa:
"We didn’t get in our first time, but then the second time we applied, we ended up getting accepted into YC and joined a group of 18 companies in the summer of 2007."
During his time at YC, Russ was part of a pioneering batch that laid the groundwork for what YC would become. He emphasizes the sense of community and the challenges of being part of a growing cohort.
[07:35] Russ D’Sa:
"Our batch of YC kind of created the San Francisco Startup ecosystem. Not intentionally, it just kind of incidentally happened because all of these founders from our batch were moving to the city and all living in the same place."
Russ reflects on how his YC batch inadvertently contributed to the San Francisco startup ecosystem by clustering founders in a single location, fostering collaboration and innovation.
[08:02] Russ D’Sa:
"LiveKit is my fifth company and the one I did in YC was the first one. It's interesting because my YC company back in 2007 was trying to do real-time streaming of video over the Internet."
Russ’s entrepreneurial spirit led him through several ventures before founding LiveKit in 2021. His initial foray into real-time streaming laid the foundation for LiveKit’s mission to simplify real-time audio and video communications for developers.
LiveKit emerged as a response to the lack of scalable, open-source infrastructure for real-time communication. Russ identified a gap where existing solutions like Zoom were mature but not easily integrable into other applications.
[11:00] Russ D’Sa:
"LiveKit is trying to do what Stripe did for payments, LiveKit is trying to do for communications."
He draws a parallel between LiveKit’s role in communications and Stripe’s impact on payment processing—both aiming to abstract and simplify complex infrastructure for developers.
LiveKit began as an open-source project, rapidly gaining traction due to its comprehensive WebRTC implementation. High-profile companies like Reddit, Spotify, and Oracle adopted LiveKit, prompting a swift transition to a commercial model with LiveKit Cloud.
[10:49] Russ D’Sa:
"We put out the open source project in July of 2021. It became a top 10 repo on GitHub across all languages for six months, very quickly within three weeks."
This swift adoption by major companies underscored the demand for robust, scalable real-time communication infrastructure, validating LiveKit’s approach and accelerating its commercialization.
[19:09] Sean Falconer:
"From a peer-to-peer WebRTC world to making it so that you can have this essentially server-side router that is going to proxy these calls to the various people who are trying to interconnect. What were some of the engineering challenges?"
Russ delves into the technical complexities of transitioning from a peer-to-peer WebRTC model to a scalable, server-mediated architecture.
[19:29] Russ D’Sa:
"There's this term called an SFU, a selective forwarding unit, which is basically a router for media. It acts like a peer in WebRTC, receiving media streams and making routing decisions."
Implementing an SFU was pivotal in reducing the scalability issues inherent in pure peer-to-peer models, where each participant must send separate streams to every other participant.
Russ outlines the three primary challenges:
Latency:
Minimizing the time packets spend traversing the public internet by utilizing servers closer to users and leveraging private backbone networks.
Reliability:
Ensuring continuous service despite server outages by implementing failover mechanisms and state synchronization across multiple data centers.
Scalability:
Transitioning from single-server architectures to multi-server mesh networks that can handle increasing loads by horizontally scaling.
[23:02] Russ D’Sa:
"You want to terminate the user's connection as close as possible to them and then use the private Internet backbone to route the packets."
By deploying a globally distributed network of servers, LiveKit ensures low latency and high reliability, even in the face of network disruptions or server failures.
To enhance resilience and avoid dependency on a single cloud provider, LiveKit adopts a multi-cloud approach.
[28:00] Russ D’Sa:
"We run our own overlay network that treats multiple cloud providers like one massive cloud, spanning different hardware providers underneath."
This strategy ensures that LiveKit can seamlessly route traffic across various cloud infrastructures, maintaining service continuity and performance.
[36:57] Russ D’Sa:
"When you tap on the advanced voice mode Button in the ChatGPT application, there's a LiveKit client SDK that connects to LiveKit's network."
LiveKit powers the real-time audio and video capabilities within OpenAI’s ChatGPT application. By embedding LiveKit’s client SDKs, ChatGPT can facilitate seamless voice interactions, enhancing user experience.
Russ explains the technical integration:
Client Connection:
Users connect via LiveKit’s SDK, establishing a low-latency, server-mediated connection.
AI Agent Interaction:
Users interact with an AI agent running on LiveKit’s infrastructure, enabling real-time voice and vision functionalities.
[36:57] Russ D’Sa:
"The user speaks, their audio travels through our network to that agent on the backend, which processes it in GPT and streams the response back to the client."
This integration exemplifies how LiveKit’s infrastructure supports advanced AI-driven interactions by providing the necessary real-time communication backbone.
[40:46] Russ D’Sa:
"As the Computer gets smarter, the inputs and outputs to that computer become more natural and human. Voice and vision will become predominantly the interface for how you interact with a computer."
Russ envisions a future where voice and vision supersede traditional peripherals like keyboards and mice. As AI models become more sophisticated, natural language and visual interactions will dominate computer interfaces.
Generational Shifts:
Younger generations, accustomed to voice assistants and smart home devices, will drive the adoption of voice and vision as primary interaction modes.
Advanced AI Agents:
AI agents capable of performing tasks autonomously will necessitate more natural interaction methods, akin to human-to-human communication.
Embodied AI in Robotics:
The integration of AI with robotics will further solidify voice and vision as key interaction modalities, making interactions more intuitive and human-like.
[44:00] Russ D’Sa:
"When we have humanoid robots that are truly interactive, you'll be interacting with them like you would with another human being in a physical space."
Russ highlights the importance of synchronizing audio, video, and data streams to ensure seamless user experiences, especially in AI-driven applications.
[32:34] Russ D’Sa:
"With avatars, you want the mouth movements to match the voice, ensuring that what's being spoken aligns with the visual cues."
The episode concludes with Russ D’Sa reflecting on the transformative potential of LiveKit and its collaboration with OpenAI. By addressing the engineering challenges of real-time communication and leveraging AI advancements, LiveKit is poised to redefine how developers build interactive applications and how users interact with computers.
[47:08] Sean Falconer:
"Russ, thanks so much for being here."
Russ expresses gratitude for the opportunity to discuss LiveKit’s journey and its future trajectory, emphasizing the exciting developments on the horizon for real-time communication and AI integration.
Key Takeaways: