
Erik Bernhardsson
Modal is a serverless compute platform that's specifically focused on AI workloads. The company's goal is to enable AI teams to quickly spin up GPU-enabled containers and rapidly iterate and autoscale. It was founded by Erik Bernhardsson, who was previously at Spotify for seven years, where he built the music recommendation system and the popular Luigi workflow scheduler. In this episode, Erik joins Sean Falconer to talk about the motivation for founding his company, the market gap in ML and AI tooling, optimizing container cold starts, Modal's interface design, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
Sean Falconer
Erik, welcome to the show.
Erik Bernhardsson
Thank you. It's great to be here.
Sean Falconer
Yeah, thanks so much for being here. So I was diving a little bit into your background preparing for this, and it seems like you spent a lot of your time working in data throughout your career, which also kind of matches my own experience. You were previously at Spotify for a number of years. You were the CTO of Better.com. Now you're the founder and CEO of Modal. Were there certain things in these prior roles that led to identifying some sort of need for Modal? What's the story behind essentially going off and deciding to start this company?
Erik Bernhardsson
Yeah, for sure. The answer is yes. And the long story is I was at Spotify for seven years, built the music recommendation system, but as a part of building that, I also realized there's kind of a general gap in the tooling. I ended up building a vector database called Annoy, no one uses it today, and also a workflow scheduler called Luigi that very few people use today. But I generally realized, as a part of building all of that stuff at Spotify, and I also did a lot of other stuff, that there is very little tooling in data, AI, machine learning. There's more today. But much later, in 2020, 2021, when I started thinking about building a company, I realized there's kind of a gap in the market for a tool I always wanted to have myself. So that was kind of the genesis of Modal. It's almost like building selfishly for what I always wanted to have throughout my years at Spotify, building a music recommendation system, and also to some extent at Better as CTO, which was a little bit more of a general role. We spent a lot of time thinking about platforms and data and stuff like that too.
Sean Falconer
Yeah, I mean, sometimes people talk about how the discipline, or the tooling and the things that are available for data engineers, lags somewhere, you know, five years maybe behind more traditional application development. Where would you say something like ML engineering, in the applied sense of being able to actually run these things in a company where they may be user facing, lags behind what we think about from traditional application engineering?
Erik Bernhardsson
It's definitely behind. I don't know what the number is in terms of the number of years. I think to me it all comes down to developer productivity and how much time is wasted on tooling stuff versus actually delivering business value. And the only thing I've found that I think is somewhat correlated with developer productivity is fast feedback loops. How fast are your feedback loops? It's not necessarily a perfect metric, but I think it says a lot about developer productivity. When you ask developers if they feel productive, a lot of it, I think, comes down to having fast feedback loops. And when I look at other types of software engineering disciplines, if you look at front-end engineers, I don't know if you do a lot of front end, but I kind of enjoy it because you have the editor in one window and you have the browser in the other window. And today when you're writing front end, it's so crazy fast: you save the code and it just automatically reloads the other screen and you see it right away. So you have these sub-second feedback loops. And I think there's something similar about backend. Maybe it's a little bit different, you run unit tests or whatever. But with data, AI, machine learning, all of that stuff, I would almost argue we've taken a step backwards. As much as I love the cloud, and I'm a massive proponent of the cloud and it gives us tremendous power to do stuff, for whatever reason, when you use the cloud you suddenly have to build Docker containers, you have to trigger things, you have to mess around with infrastructure. It ends up just being this massive friction in that loop when you're working with AI and machine learning on the engineering side. And that was my sort of frustration. A lot of what Modal came from was that I just wanted to be fast. Maybe I lack patience, but I want to write code and then run it in the cloud in less than a second. So that was kind of my starting point: how do I solve that problem?
Sean Falconer
Yeah, to your point about front end, there is something, I don't know, there's just immediate gratification: I can go and change the string and I see it reflected on screen. And I think that's probably why you see a lot of people who are starting to get into engineering start with front end, because you have that immediate feedback loop. And also even with adoption of languages, there's sort of more immediacy of getting to that aha moment and those feedback loops using something like Python versus using a lower-level language where you might struggle a bit more to just get started. And even older languages, you know, things like Java and C have tried to reduce that barrier to entry of doing the simple hello world equivalent application, because of how easy it is in some of these other languages that have severely simplified it and then led to a high adoption curve for sure.
Erik Bernhardsson
And kind of as a side topic, one of my beefs with Rust: I love Rust, it's a great language, but I think one of the things they got wrong is, you know, they made a language that makes developers very productive in certain ways, but then you have the compiler getting in the way. If the compiler was just fast, I would actually really, really enjoy writing Rust. Now I enjoy it, but to your point about the older languages, I used to write a lot of C, and having that compilation step take minutes every time you want to just run, it just takes the joy out of coding.
Sean Falconer
So can you break down a little bit more what you are essentially doing with Modal? You want these super fast deployment times for ML workloads. But what does that end up looking like?
Erik Bernhardsson
Yeah, totally. And maybe we should take a step back. We talked a little bit about making developers productive and happy, but what is Modal, right? I tend to think a lot of running data, AI, machine learning stuff today is about building code and shipping it into the cloud and scaling it up, running it on GPUs, mapping over large data sets, et cetera. So that's originally what I set out to build. And as a big part of that, as I mentioned, I wanted the feedback loops to be very fast. We spent a couple of years focusing on that platform as kind of a core concept. We felt there was such a huge opportunity. This was also during the ZIRP era, so we had no pressure to make money. But two years in, where we started seeing a lot of traction was with GenAI applications. And so a lot of the use cases we see today that are driving a lot of the growth of Modal are various types of GPU inference use cases in GenAI. So large scale, particularly video, audio, music: people who deploy models, often proprietary models, that need to scale up to sometimes thousands of GPUs and who don't want to handle the infrastructure, or who are running very large, spiky batch jobs. So Modal is an infrastructure provider in the sense that we have a big pool of compute in the cloud, a lot of GPUs, a lot of CPUs. And then the separate side of Modal is we have an easy-to-use Python SDK that lets you iterate very quickly, in a way where you don't have to install anything, you don't have to configure anything, you just have to write a little bit of code, and then when you run that code, it runs in Modal. That makes it very easy to take, in particular, machine learning code and scale it up.
Sean Falconer
And then in terms of this big cluster of GPUs, essentially your customers are sharing it. Is that like a shared pool?
Erik Bernhardsson
It's a multi-tenant model, right, which is different from, I think, traditional applications. I tend to think it's the future of the cloud. There are so many benefits of us having a shared multi-tenant pool. Just to give you one example, if someone needs 100 GPUs, we can often get you that in a few seconds. Because we pool all this variable demand, we can do capacity planning at a very large scale. And so there are many benefits of having this shared tenancy model. Not to mention the fact that there's zero installation. Well, you have to do pip install modal, but once you do that you can immediately start running code, because we manage the infrastructure in the cloud.
Sean Falconer
For people who are managing their own GPUs, do people end up typically underutilizing the GPUs they have available?
Erik Bernhardsson
That is another thing that Modal solves beautifully: because of the multi-tenancy, we have a model where everything is pay as you go. So it's all usage based. We only charge for the time the GPUs are actually running. So there's no capacity planning with Modal. You just start running Modal and then we charge you per, you know, essentially GPU second or CPU second. Actually, I should say there are many use cases where people don't use GPUs with Modal. And again, kind of going back to the multi-tenancy model, part of the reason why we can offer this is that we can pool a lot of people's very bursty workloads and run an underlying shared compute pool. The other thing we also spend a lot of time on is very fast container cold starts, because in order to have a usage-based pricing model, you need to be able to start containers and stop containers very quickly. Interestingly, that is a problem we had to solve in a separate context, which we talked about previously, right? I wanted the ability for users to write code and immediately run it in the cloud. And that's actually the same problem: fast container cold start. So that is another thing we've had to spend a lot of time on. We built our own file system, we built our own container runtime, we built our own scheduler, and all these things let us have (a) very fast feedback loops, (b) fully usage-based pricing, and (c), I guess, fully managed infrastructure running in the cloud so people don't have to think about it.
Sean Falconer
So can you break that down? What's happening essentially behind the scenes? So I've done pip install modal, I write some Python code, presumably I tag it in some fashion so you know what I want to run within the cloud, and I run it. But what's essentially the magic that's happening between my machine and being able to run this on Modal?
Erik Bernhardsson
Yeah. So Modal is an SDK, which means you import modal. And I think the easiest mental model to think about Modal is function as a service, so similar to AWS Lambda, if you're familiar. The idea is that you can take any function in Python and turn that into a function that runs in the cloud on invocation. So you can have all these functions in Python and you can say this one should run on an H100, this one should run on, whatever, a T4, with different container environments, even different drivers. And you can specify that in the code using decorators that you apply to these functions, and then you can have these functions call each other. So you have this function-as-a-service programming model. And under the hood, the way it works is we take the code, build a container image, and launch that container image in the cloud in about a second, right? And more for very large images: if you have an image that's like 100 gigabytes, it might take a few seconds. But we spent a lot of time on how do we take the code on the local computer, stick it in a container in the cloud with arbitrary dependencies, and launch that on a worker in the cloud that might have GPUs, or on many workers. Maybe you need 100 GPUs; then we need to spin up that container a hundred times over on different workers in the cloud.
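The decorator-based, function-as-a-service model described here can be sketched with nothing but the standard library. This is a toy under stated assumptions and not the Modal SDK: the `function` decorator, the `RemoteFunction` class, and its `local`, `remote`, and `map` methods are hypothetical names, and "remote" execution is faked with worker threads rather than containers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RemoteFunction:
    """A plain function plus the resources it asked for (hypothetical)."""
    fn: Callable[..., Any]
    gpu: Optional[str]
    python_version: str

    def local(self, *args, **kwargs):
        # Run in the current process, like an ordinary Python call.
        return self.fn(*args, **kwargs)

    def remote(self, *args, **kwargs):
        # Stand-in for "ship the call to a container in the cloud":
        # here the call is only handed to a worker thread.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(self.fn, *args, **kwargs).result()

    def map(self, inputs):
        # Stand-in for fanning out over many containers; order is preserved.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(self.fn, inputs))


def function(gpu=None, python_version="3.12"):
    """Decorator factory that records per-function requirements."""
    def wrap(fn):
        return RemoteFunction(fn=fn, gpu=gpu, python_version=python_version)
    return wrap


@function(gpu="T4")
def count_words(text: str) -> int:
    return len(text.split())


@function(gpu="H100")
def describe(prompt: str) -> str:
    # One decorated function calling another, as in the description above.
    n_words = count_words.remote(prompt)
    return f"{n_words} words in prompt"


if __name__ == "__main__":
    print(describe.remote("a photo of a cat"))
    print(count_words.map(["one", "one two", "one two three"]))
```

The point of the sketch is the shape of the interface: resource requirements live next to the function that needs them, and the call site looks like ordinary Python.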
Sean Falconer
Does each function that I'm specifying, like, let's say I have, you know, function one, function two, but I want to run them on essentially different GPUs. Do those end up being separate containers that have to get deployed?
Erik Bernhardsson
Yeah, so every function ends up being a separate container. In fact, every function is auto scaling. So if you start issuing many requests to the same function, we will just auto scale it up to as much as is needed in order to serve all those requests.
Sean Falconer
In some ways it sounds like, as a developer, it's almost like you're coding a monolith, but then the deployment is able to automatically scale this out as more of a microservice architecture.
Erik Bernhardsson
Yeah, sort of. Yeah. And, you know, I think that's one way to look at it. Like, every function ends up being its own kind of container. I mean, you can have like a lot of code running inside a single container, but you can have like very, you know, isolated pieces of code. You know, you can have one container running, you know, Python 3.9 and another container running, you know, Python 3.12, calling each other just like a normal Python function call. Like, you have this convenience of just, you know, it just feels like local code, just functions calling functions, including, you know, handling tracebacks and exceptions and all these things. Right. It just feels like Python, even though it's like running in a distributed way in the cloud.
Sean Falconer
Right. How does that function calling happen behind the scenes? Are you using some sort of gRPC call, or what is essentially allowing you to call functions across these containers?
Erik Bernhardsson
Yeah. So under the hood, we use gRPC for internal communication, and the client library also talks to the server using gRPC. There are a few different layers. Within Python, we use cloudpickle to serialize all the payload and send it between containers. But again, that's not something the developer necessarily has to think about, because it just feels like you're calling a function in Python. So we handle the serialization, all the exception management, all of that stuff. But yeah, there's a lot of work obviously under the hood. Basically, it's all complex queuing theory and scheduling and, you know, doing that fast at a large scale. I think we're serving something like 100,000 requests a second right now across the scale of Modal. So managing that, the state management and the scheduling and constantly scaling up and down, obviously there's a lot of work to build all that infrastructure.
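To make the serialization and exception handling concrete, here is a minimal sketch of a call crossing a process boundary as bytes. It is an illustration under assumptions, not Modal's implementation: it uses the standard library `pickle` in place of cloudpickle (which extends pickle to cover things like lambdas and interactively defined functions), it skips the network layer entirely, and `REGISTRY`, `worker_handle`, and `call_remote` are hypothetical names.

```python
import pickle
import traceback


def scale(values, factor):
    return [v * factor for v in values]


def fail(_):
    raise ValueError("bad input")


# Assumption: the worker already has the code, so a request only needs
# to name the function and carry its arguments.
REGISTRY = {"scale": scale, "fail": fail}


def worker_handle(request_bytes: bytes) -> bytes:
    """Pretend to be the remote container: decode, run, encode the outcome."""
    name, args, kwargs = pickle.loads(request_bytes)
    try:
        outcome = ("ok", REGISTRY[name](*args, **kwargs))
    except Exception as exc:
        # Send back the exception object and its formatted traceback.
        outcome = ("error", (exc, traceback.format_exc()))
    return pickle.dumps(outcome)


def call_remote(name, *args, **kwargs):
    """Caller side: serialize the call, 'send' it, and unwrap the reply."""
    reply = worker_handle(pickle.dumps((name, args, kwargs)))
    status, payload = pickle.loads(reply)
    if status == "error":
        exc, remote_traceback = payload
        # Re-raise locally, chaining the remote traceback text as the cause.
        raise exc from RuntimeError(remote_traceback)
    return payload


if __name__ == "__main__":
    print(call_remote("scale", [1, 2, 3], factor=2))
    try:
        call_remote("fail", None)
    except ValueError as err:
        print("caller saw:", repr(err))
```

Re-raising the original exception type on the caller side is what lets remote code "feel like Python": an ordinary `try`/`except` around the call keeps working.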
Sean Falconer
Yeah. How does the state management and persistence in these distributed applications work?
Erik Bernhardsson
I would say Modal right now focuses on compute. There is some level of state management: we have primitives to create a simple distributed key-value store and also a queue. Most of what people use Modal for today is things like GenAI, and GenAI is actually kind of a weird use case in many ways because it's extremely compute intensive, but it's actually kind of low I/O. If you think about Stable Diffusion, for instance, you send this tiny piece of text to a GPU, and then that GPU does like a trillion operations, and then it sends you a little JPEG back that's like 100K or whatever. That's the type of application that we do really well: very compute-heavy stuff, but not necessarily super I/O intensive. We're not necessarily trying to compete with Spark in that sense, which has these operations where you can wrangle very large datasets and stuff like that. So there is some state management in Modal. We also have a distributed file system you can set up in Modal and attach to all the containers, basically just like a local POSIX-compliant file system, which means you can interact with it using normal file system operations, but anything you read or write is globally distributed across all the other containers. So that's another way you can work with large datasets in Modal, but we don't necessarily handle the sharding and partitioning and shuffling. That's not something we've built yet.
Sean Falconer
When I think about something like a PaaS for traditional applications, a lot of times people run into these challenges eventually where essentially the abstractions start to get to a place where they don't work. They want to get in there and sort of adjust things, tune it to their specification once they reach a certain scale. And it leads to this graduation problem. Is that a challenge that people building on Modal face?
Erik Bernhardsson
I think so. I mean, something I've spent a lot of time thinking about is that the right abstraction sacrifices a little bit of power, but makes the remaining stuff so much easier to do. With any abstraction layer you add, of course you're going to sacrifice a tiny bit of capability. But by doing that, it turns out you can make the remaining stuff so much easier to build. I think the bad abstractions basically sacrifice 80% of the capabilities, and then it turns out they're very limiting. So how do you make Modal fairly general purpose, so that we can do almost anything in Modal? I think that's very hard. That being said, this is something I'm obsessed with, by the way: thinking about the right abstractions, thinking about the developer experience, and making sure Modal is fairly general purpose, that Modal is not necessarily super frameworky. I think that's what I don't like about a lot of modern frameworks, where either they're very config based or they kind of lock you into a certain way. Modal is opinionated, but I want Modal to be a bunch of Lego blocks, and then you can take those Lego blocks and build whatever you want. And I think to a large extent it lets people do that. We've always thought about programmability, we've always thought about making it possible to run any code you want. We're thinking a lot about doing non-Python stuff, for instance, as an example. So I don't know, I'm obviously biased, but I tend to think that Modal is a platform that doesn't sacrifice that many capabilities. You can do almost anything in Modal that you could do using lower-level cloud primitives, with like 5% of the effort.
Sean Falconer
Yeah.
Erik Brynhardsen
This episode is sponsored by Mailtrap, an email platform developers love. Go for fast email delivery, high inboxing rates, and live 24/7 expert support. Get 20% off all plans with our promo code Sedaily.
Sean Falconer
What are the typical like gen AI applications workloads that people are running on this? Is this primarily people running their own model and they want to be able to run inference across these GPUs or are there other types of workloads that are also there?
Erik Bernhardsson
Yeah, so one use case that I love: one of our customers is a company called Suno. They do AI-generated music, and they run a lot of their inference on Modal at very large scale. So basically what they run on Modal is GPU based. They have their own proprietary model and it generates music. Super cool application. I think a lot of our customers fall into that bucket: there is a proprietary model, and it does some very cool magic stuff on a GPU, typically in audio, video, images, music. That's been the domain where we've seen the most traction. There are other use cases too. We've recently seen a lot of traction coming from computational biotech, which I think is super exciting. So protein folding, multiple sequence alignment, all kinds of medical imaging processing, very large data sets, applying computer vision to assays. I don't know too much about the field, but I find it incredibly exciting. And kind of in that vein, we've seen people use Modal for geospatial analysis, physics simulations, turbulence stuff. And then there are a lot of people who just like the developer productivity. They don't necessarily run big things, but they have a little web scraper running in Modal, or a little web server. There's a lot of that stuff. So the goal was always to build a fairly general-purpose platform. We found initial core product-market fit in gen AI, and that's been the main use case for us. But there are just so many other things, and people always surprise me. I was just talking to someone the other day who's running a chess engine on Modal. I don't know why, but it was really cool.
Sean Falconer
I mean, some of those examples you gave are pretty high-compute examples, which I think makes a ton of sense. Are there certain workloads that don't make sense? I guess you mentioned things that require high I/O are probably, maybe, not the right fit.
Erik Bernhardsson
Yeah, high I/O, I think, is one, but we're pushing into that. I think it's going to be a big focus for 2025 to also handle that. Similarly, I would say very low latency things are something we're very excited about. Right now there's an overhead in the system of like 100, 200 milliseconds on every function call, a little bit less if you run it in the same region. That's good enough for a lot of use cases, especially in gen AI, like Stable Diffusion. No one really cares if it takes 200 milliseconds, because the inference in itself takes a couple of seconds. So that's pretty negligible overhead. But it's not good enough for something like real-time streaming of audio or real-time video. So that's another thing we definitely want to push into over the next year: how do we get the overhead of the system down to 10 milliseconds, 5 milliseconds, whatever.
Sean Falconer
In terms of, you know, getting to a place where you're able to run these for things like real-time video and audio streaming, what do you see as the main technical hurdles that you have to solve in order to reach that kind of performance?
Erik Bernhardsson
Today it's mostly about geolocation. Basically, the speed of light is pretty high, you know, and as you may know, it doesn't take that many hundred milliseconds to send something to Australia and back to the US, like 200 milliseconds or whatever. But it is a challenge when you're doing real-time stuff. So a big challenge for us is that while we have a distributed data plane, with all our workers running in many different regions and many different cloud providers, our control plane right now is not distributed. So one of the things we want to do in 2025 is decentralize the control plane so that we can use smarter ways to basically route to multiple edges. I feel like the word edge is overused. But you're running a model in many different regions across the world, and also the control plane, so that we can route things and execute them faster. That's a big re-architecture.
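A quick back-of-the-envelope check of the speed-of-light point. The distances and the two-thirds-of-c factor for optical fiber are rough assumptions chosen for illustration, not figures from the conversation, and the calculation covers propagation delay only (no routing, queuing, or processing time).

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum
FIBER_FACTOR = 2 / 3               # assumed: light in fiber travels at roughly 2/3 of c


def round_trip_ms(distance_km: float, medium_factor: float = FIBER_FACTOR) -> float:
    """Propagation-only round-trip time over a straight path, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * medium_factor)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    # Assumed, rounded distances for three illustrative paths.
    for label, km in [("same metro", 50),
                      ("cross-continent", 4_000),
                      ("US to Australia", 12_000)]:
        print(f"{label:>16}: {round_trip_ms(km):6.1f} ms round trip")
```

Under these assumptions, the intercontinental path alone costs on the order of a hundred milliseconds before any work is done, which is why a centralized control plane is hard to reconcile with a 5 to 10 millisecond overhead target.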
Sean Falconer
So in this multi-tenant architecture that you have today, you have your control plane, you have your data planes, and your data planes are distributed across these regions. Is the reuse of resources primarily happening in the control plane?
Erik Bernhardsson
Yeah, the control plane makes all the decisions, right? And it sort of has the global state of the world, which is another thing. To some extent, when you have a very large worker fleet and many different schedulers running in different regions, I think you also kind of need to change the truth of the system to be owned by the workers themselves, because they always know the latest. Because again, the speed of light is not always as fast as we wish. So there's a lot of that state management: where do you make the decisions, who has the authoritative view of which worker is running which containers, and how do you propagate that information? These are hard technical challenges, but luckily we like those at Modal.
Sean Falconer
Was the plan from the very beginning always to build this out as this multi-tenant structure?
Erik Bernhardsson
It is challenging sometimes because I think people are just not entirely comfortable with that model yet. But I don't know. I've been coding for 30 years and I remember when the cloud came in 2007 or something like that, AWS launched or EC2, I think something like that, 2006 maybe. And my first thought was like, that's insane. Why would I put my code on someone else's computer? And then a few years later I was doing it, I was like, this is kind of nice. I like this.
Sean Falconer
Yeah.
Erik Bernhardsson
So like it, you know, it took a few years. I mean, it's still taking a few years. Like many people still run on prem. Right. But like, I think there's been like, obviously, you know, people are seeing that the cloud makes sense for a lot of stuff, even though it's like arguably like a shared resource.
Sean Falconer
If you look at, you know, a company like Snowflake, they were like, hey, we're going to build a cloud native database. And at that time, which maybe doesn't seem that crazy now because basically everybody's doing that, but at that time it was like, what do you mean you're going to take this thing that we run in, I don't know, on prem Oracle today and move it to the cloud? We're not going to do that.
Erik Bernhardsson
Yeah. It's funny, you took the words out of my mouth, because I was just going to talk about Snowflake next. I actually interviewed with Snowflake in 2012 and they told me the idea. I was like, I don't think this is going to work. And then I turned down the job offer. It was obviously a very terrible decision. But I think what Snowflake showed, beyond just the cloud vendors, is that you can be infrastructure as a service and host people's data, and eventually people will be comfortable with that. And so, I don't know, I think security and compliance are shifting. A lot of customers today don't necessarily worry about the fact that something is multi-tenant. They worry about best security practices, and we take that extremely seriously. We think a lot about how do we encrypt all the storage, how do we encrypt all the data in transit, how do we run the containers in a way where it's impossible to break out of containers. Those things are very, very important. Exactly where things are running, in terms of networking or in terms of VPCs, I don't necessarily think those are the prime concerns, and I think over time they'll be even less of a concern. So multi-tenant is definitely a change in mindset. It's a change in how security and compliance operate. But I think the trend is our friend here.
Sean Falconer
Yeah, I also think from a security perspective, when you look at things like breaches or other security vulnerabilities, I don't see a lot of reports that are a result of some sort of compromised multi tenant cloud offering. It has a lot more to do with just like a lot of times simply human error of like, hey, we accidentally committed, you know, our API credentials to GitHub and now it's, you know, available online.
Erik Bernhardsson
That's always what happens.
Sean Falconer
Have an unencrypted log file that has the Social Security numbers of all our customers in it or you know, things like that.
Erik Bernhardsson
Totally. And even network segmentation, I think, is kind of an outdated model to me. There was a hack a few years ago where Target got hacked because it turns out their HVAC system was running on the same network, and there was a vulnerability in the, whatever, HVAC system, and someone was able to get in. So I think network segmentation, to me, was never a strong security model. And I'm kind of glad, you know, there's BeyondCorp and Zero Trust. That to me is very clearly the future of cloud security.
Sean Falconer
You know, in terms of things like capacity planning for inference, even outside of Modal, is capacity planning for inference and gen AI applications a fundamentally difficult task for businesses today, because it's hard to know at any given time how many tokens you're going to be generating and what the workloads actually look like?
Erik Bernhardsson
Yeah, I think so, because there are so many sources of noise, right? First of all you have a daily variation, kind of a sine curve over the day. But then often you also have noise. It's like a Poisson distribution: at any point in time you always have a little stochastic noise. But then there are so many other things, like people running batch jobs. Suddenly you launch something and it goes viral on Hacker News. There are all these different sources of noise. The distribution is extremely fat tailed, and so it's fundamentally hard to do capacity planning for inference. I think for training it's a little bit easier. When you're training a model, you can just buy 1024 GPUs and kind of make sure those GPUs stay warm. But for inference it's much harder because you fundamentally can't plan. And I think the thing that also exacerbates that is that the only way in the past, at least up to very recently, to get high-end GPUs was to make big, long-term reservations, like three years or whatever, right? And that's fundamentally kind of a hard matching problem. You have this unpredictable demand, but you have to go out and buy fixed capacity, which means that a lot of people are running things very poorly utilized. However, I actually tend to think resource pooling is kind of a free lunch in many ways. If you take a lot of people's noisy workloads and you aggregate them, you can run the aggregate at much, much higher levels of utilization. So that's one way we can save a lot of cost for people. Even though, frankly, our prices are sometimes higher, we can still save them money, because they run things at effectively 100% utilization from the point of view of paying Modal.
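The pooling argument can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: tenants have independent Poisson demand with the same mean, there is no daily cycle and there are no viral spikes, and "capacity" simply means the peak that had to be covered. It compares each tenant reserving for its own worst moment against one shared fleet sized for the worst combined moment.

```python
import math
import random


def poisson(rng: random.Random, mean: float) -> int:
    """Draw one Poisson sample using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(tenants=50, steps=1_000, mean_demand=4.0, seed=0):
    rng = random.Random(seed)
    # demand[t][i] = GPUs tenant i wants at time step t (assumed Poisson).
    demand = [[poisson(rng, mean_demand) for _ in range(tenants)]
              for _ in range(steps)]

    # Dedicated: every tenant reserves enough for its own worst step.
    dedicated = sum(max(step[i] for step in demand) for i in range(tenants))
    # Pooled: one shared fleet sized for the worst combined step.
    pooled = max(sum(step) for step in demand)

    used = sum(sum(step) for step in demand)  # GPU-steps actually consumed
    return {
        "dedicated_gpus": dedicated,
        "pooled_gpus": pooled,
        "dedicated_utilization": used / (dedicated * steps),
        "pooled_utilization": used / (pooled * steps),
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key:>22}: {value:.3f}" if isinstance(value, float)
              else f"{key:>22}: {value}")
```

The pooled fleet can never need more capacity than the sum of the individual peaks, because the tenants' worst moments rarely coincide; how large the gap is in practice depends on how correlated real workloads are, which this sketch does not model.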
Sean Falconer
How do you see some of the landscape around AI infrastructure evolving over the next couple of years? Maybe we're still very much in early days. How do you think that's going to continue to change and evolve?
Erik Brynhardsen
Yeah, we're in the super early days. This is incredibly hard to predict. There are also so many different layers, so many different boxes. I don't know. I'm very bullish on high-code tools. At the end of the day, looking back at my 30 years of coding and going even further back, I think the story of software engineering has always been that better tools drive more demand, and making engineers more productive is always fundamentally where the value is created. So I'm very bullish on building better tools. I'm very bullish on the infrastructure layer making it easier for people to build these applications, because clearly there's a lot of demand. So that's what Modal focuses on.
Sean Falconer
In terms of the types of things that people are doing today in industry to build AI applications, you know, using things like vector databases for building RAG applications. And there's been all kinds of takes now on RAG; I see a new three-letter acronym on like a weekly basis. But what are your thoughts on the vector database as a dedicated storage for AI? Just given that you've spent a lot of time thinking and working in the space, do we need that? Is that the right form factor for building some of these applications?
Erik Brynhardsen
I don't know. I mean, I think it's needed on some level, right? I started using vectors and built my own vector database at Spotify back in 2012, I think, something like that, and open sourced it. It was called Annoy, it's still called Annoy, and for a while a lot of people actually used it. Funnily enough, when Twitter open sourced their recommendation algorithm a year or two ago, I was looking through the source code and apparently they were using Annoy for some of the stuff. Anyway, talking about vector databases, I think there's definitely a need for them. That being said, when a space is new, I feel like no one really knows what the ultimate abstraction and the right interface boundaries are. So with vector databases, I think a valid criticism is that maybe that should just be a part of Postgres. Maybe Postgres should just do vectors. And I think that's valid. But I also wonder whether, in the long run, we even know where the boundaries are going to be. Right now it's easy to look at Postgres and say, yeah, we should just put vectors in it. But in the long run things kind of end up redrawing themselves. I could see a world where, in one direction, you say maybe people shouldn't even think about vectors. Maybe people should think about a database where you can insert text, you can insert images, and then you can search for which images and text are close to each other. Vectors are kind of a low-level primitive. So that's one way you can think about it: maybe the interface shouldn't even be vectors, because right now it's kind of tedious. You first have to embed the data. Maybe the embedding should sit in the database. I don't know. That would be one direction. Another thing I've thought about a lot is that LLMs are, in a way, kind of like vector databases. This would be the other direction: they store very large matrices, and they do the state lookup through very expensive matrix multiplications. Maybe there should be a differentiable vector database inside every LLM. My point is just that we're still kind of early with vectors, and in the long run abstractions, interface boundaries, all of these things may change, and categories never end up looking the same. I think it's too early to say.
Sean Falconer
Yeah, I mean, it is a little bit strange, and I think this is just a sign of things being early, that currently you have to think about actually generating a vector, or even think about what a vector is, in order to search a database. It's a little bit like having to really understand the underlying tree structure that's used in indices in order to optimize my lookup and so forth. There is some value to that if you're a DBA or something, but most people doing simple application development don't necessarily need to be digging into that level of detail.
Erik Brynhardsen
Yeah, I don't think that's the interface people necessarily want in the long run. And for that reason I don't know if just shoehorning vectors into Postgres is going to be the right boundary. But we'll see. We'll see.
Sean Falconer
What are your thoughts on the role of things like lakes and warehouses when it comes to building gen AI applications? In traditional machine learning, a lot of times we're aggregating specific data down into a particular location and going through a process of feature engineering to build a bespoke model that we need to deploy, and maybe we don't update it that often. But in gen AI applications we have a very general model, and then we're massaging it for application-specific behavior by adjusting the prompt during prompt assembly. That is a lot less about the old world of batch updating these models and more about updating the prompt in real time in order to generate behavior that's going to be relevant for the user. So I'm curious, what are your thoughts on that? Is this a shift in the way that we need to think about the role of lakes and warehouses?
Erik Brynhardsen
I'm not sure. Maybe this is my boomer perspective, but I feel like in a way we're throwing out the baby with the bathwater. If you look at a search application five or ten years ago, they would have a multi-stage retrieval process and then a re-ranker, and they'd use an ensemble. They would have all these feature stores, generate a lot of features, and in the end they would run some sort of XGBoost model to aggregate all those features and then rank on that. Recommendation systems were kind of similar. Searching, ranking, all these things had this kind of complex setup, and it was hard to build. There's so much demand for these applications, but they're hard to build. So when LLMs came, people were like, actually, let's just turn this into a bunch of prompts instead. It's almost like low-code retrieval. My feeling is that in doing that we kind of threw out the baby with the bathwater. In the long run I wonder if a lot of those prompt engineering things will just become features in a multi-step retrieval process that also combines other features. So the pendulum will swing a little bit back towards those more traditional models. The LLM ends up being one feature, a very powerful feature, and you might have different prompts generating different features, but in the end you go back to a more traditional view of multi-step retrieval. I'm talking specifically about those types of applications. There are many other types. But in general I wonder if LLMs may just become one feature out of many, and that will be a step back towards traditional feature stores and all that stuff, which has been a very powerful paradigm for many years.
Sean Falconer
I mean, I think you could think about agents too, this sort of multi-step process where you can weave in traditional ML models or even other types of workflows. A core component of that, of course, is the LLM or the foundation model, but it doesn't have to be responsible for all behaviors in there. You can use a combination of things to get the behavior, the output that you want.
Erik Brynhardsen
Yeah, for sure. I'm very bullish on the high code approach. I think LLM is a little bit the low code, but over time clearly there's a lot of demand for people wanting to build AI applications and machine learning applications. So I think there's going to be 10x more machine learning researchers in the long run. They're going to use LLMs, they're going to be very happy with it, but they're also going to use the underlying core models as well.
Sean Falconer
What are your thoughts on the energy consumption required to do model training? I think the latest OpenAI model consumes more electricity than the city of Pittsburgh.
Erik Brynhardsen
I don't know, I think it's kind of overstated. Part of why I think it's somewhat overstated is the cost of a GPU over its lifetime versus the energy it consumes over its lifetime. The main cost of training a model is actually still on the GPU side: building the GPUs is a lot more expensive than the energy required to run them. Even if you run a GPU at full capacity for its entire lifetime, the energy cost is lower than the cost of the GPU, which to some extent I think speaks to just the cost of building GPUs. It's actually the other way around for CPUs. If you look at the cost of operating CPUs, the energy cost is higher than the cost of buying the CPUs if you run them at full utilization for a few years. So I don't know, I tend to think people focus too much on the energy consumption. Humanity always needs more energy and always finds uses for more energy. I'm an optimist, otherwise I wouldn't have started a company. So I'm very optimistic when it comes to things like climate change and energy consumption. I think we're always going to find more energy sources, the energy consumption of GPUs is going to come down, the GPU costs will come down, and all these things will get easier and better. In the end, they're going to create a lot of value for humanity.
Sean Falconer
Speaking of having this optimistic view of the world, and that being a core part of surviving the hardship of building a company: when you were thinking about Modal originally, you had this vision of being able to do these deployments in less than 100 milliseconds for running ML workflows. Were there parts of that that you weren't sure you'd be able to actually solve in order to realize that vision? Were there certain technical challenges where you were really scared about whether you'd actually be able to be successful?
Erik Brynhardsen
For sure. One model I have of a startup is that you have to pick a problem that's hard enough that you create a lot of compelling value, but not so hard that you can't solve it. I look at a lot of AI startups and I feel like there are a lot of companies in all three buckets. There are companies that I think are solving problems that are too easy, and for that reason they're just kind of wrappers and they don't have a lot of pricing power. I sort of doubt that they're going to have a competitive advantage long term. Then on the flip side, there are companies that I think have way too ambitious goals, and they're like, we're going to do AGI or whatever, we're going to do this agent thing that can act autonomously for everyone. And that sounds very hard. I think the trick of a startup is picking a problem that could conceivably be solved in about three years. Three years, four years, maybe five years is a good timeframe. And so, yeah, I've spent most of my career in infrastructure, and looking at this problem, I knew all the components would be possible to solve. Looking at containers, I was like, yeah, it's possible to start containers quickly, because Docker works this way and Kubernetes works this way and they're doing a lot of unnecessary stuff. That being said, in a lot of my early VC conversations, people were like, why aren't you just using some existing system? But I was kind of adamant. I'm like, no, we're going to build our own thing. And I'm very happy I did that. It took about three years, but I think doing that means we now have a pretty strong competitive advantage, and we have a very unique set of infrastructure primitives that lets us build things and deliver a much better developer experience than anyone else.
Sean Falconer
Are there particular performance optimizations that you did that you're particularly proud about?
Erik Brynhardsen
So much. But I think at a high level, a big part of it was building our own file system for serving container data. As it turns out, Docker, as much as I respect Docker for introducing a very new paradigm, is quite inefficient in how it stores images and pulls and pushes images. What we realized, looking at what happens when you start a container, is that most of the data in a container image is never read, and the data that is read is highly redundant between images. So what if we switch to using a content-addressed system? We then built a FUSE file system that caches all the data under the hood. And we then spent two years figuring out how to optimize the page cache and all these things. But that is a big part of why we can start containers very quickly: just optimizing the hell out of the file system side of it. How do we send the data very quickly?
Sean Falconer
Got it. So as we start to wrap up here, is there anything else you'd like to share?
Erik Brynhardsen
What else? I mean, we're working on distributed training. I think that's going to be really exciting. We're hoping to launch that pretty soon. The focus is not super crazy large, like running thousands of GPUs, but companies who are training models where maybe it takes too long to train on a single GPU. We make it pretty easy to scale up to eight GPUs right now, but we want to go beyond a single box. So pretty soon we'll make it really easy to scale out to 16, 32, 64, maybe even 128 GPUs and get these super fast feedback loops. That was always the core value of Modal, and now you can hopefully get that up to 100 GPUs. So I think training is going to be super exciting. What else? Throughout the year, we're spending a lot of time on security and compliance. We're kind of moving upmarket and focusing a lot on enterprise customers and serving their needs, the whole range from SSO to SOC to building custom telemetry integrations and stuff like that. So that's another area I'm also super excited about.
Sean Falconer
Awesome. Well, Erik, thanks so much for being here.
Erik Brynhardsen
Yeah, it's great. Thank you so much for hosting me.
Sean Falconer
Yeah, cheers.
Erik Brynhardsen
Thanks.
Release Date: July 31, 2025
Host: Sean Falconer
Guest: Erik Bernhardsson, Founder and CEO of Modal
The episode kicks off with Sean Falconer introducing Erik Bernhardsson, the founder and CEO of Modal. Erik brings a wealth of experience from his seven-year tenure at Spotify, where he developed the music recommendation system and the Luigi workflow scheduler. His journey led him to recognize significant gaps in AI and machine learning (ML) tooling, ultimately inspiring the creation of Modal, a serverless compute platform tailored for AI workloads.
Erik elaborates on his motivation for founding Modal:
“I realized there's kind of a general gap in the tooling. I ended up building a vector database called Annoy, no one uses it today. And also a workflow scheduler called Luigi that very few people use today.” (01:26)
Erik's experiences at Spotify highlighted the lack of robust tools for data and AI engineering. This realization drove him to create Modal, aiming to build the exact tools he needed but couldn't find in the market.
Sean initiates a discussion on the lag in ML engineering tooling compared to traditional application development. Erik emphasizes the importance of developer productivity, linking it to the speed of feedback loops:
“How fast are your feedback loops? ... we've taken a step backwards in that [AI and ML].” (02:50)
He contrasts this with frontend development, where immediate feedback enhances productivity. In AI and ML, the reliance on cloud infrastructure introduces significant friction, slowing down the iteration process.
Sean probes deeper into Modal's deployment capabilities. Erik describes Modal as an SDK that transforms any Python function into a cloud-executed function, akin to AWS Lambda:
“The easiest mental model to think about Modal is function as a service. So similar to AWS Lambda, if you're familiar.” (10:07)
Modal leverages multi-tenant architecture, pooling compute resources to enable rapid scaling and efficient utilization. This design allows Modal to offer usage-based pricing, charging only for the actual compute time used.
Erik discusses the advantages of Modal's multi-tenant model:
“Because we can pool a lot of people's very bursty workloads and run an underlying shared compute pool.” (08:23)
This approach ensures high GPU availability, enabling users to access substantial computational power almost instantaneously. Modal's infrastructure manages fast container cold starts, critical for maintaining quick feedback loops essential for developer productivity.
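The pooling argument can be illustrated with a small simulation. The sketch below is a toy model and not a description of Modal's scheduler: the function names, the exponential baseline load, and the burst parameters are all assumptions made for illustration. It compares utilization when every tenant reserves capacity for its own peak against a single shared pool sized for the peak of the aggregate.

```python
import random


def simulate_demand(n_tenants=50, n_steps=500, mean_load=1.0,
                    burst_prob=0.02, burst_size=20.0, seed=0):
    """Return demand[t][i]: GPUs wanted by tenant i at time step t.

    Hypothetical workload model: a noisy exponential baseline plus
    occasional large bursts, to mimic a fat-tailed distribution.
    """
    rng = random.Random(seed)
    demand = []
    for _ in range(n_steps):
        row = []
        for _ in range(n_tenants):
            load = rng.expovariate(1.0 / mean_load)  # noisy baseline
            if rng.random() < burst_prob:            # rare burst
                load += burst_size
            row.append(load)
        demand.append(row)
    return demand


def utilization(demand):
    """Return (dedicated, pooled) utilization for the same demand trace."""
    n_steps = len(demand)
    n_tenants = len(demand[0])
    total_used = sum(sum(row) for row in demand)

    # Dedicated: each tenant reserves enough capacity for its own peak.
    dedicated_capacity = sum(
        max(row[i] for row in demand) for i in range(n_tenants)
    )
    # Pooled: one shared pool sized for the peak of the aggregate demand.
    pooled_capacity = max(sum(row) for row in demand)

    return (total_used / (dedicated_capacity * n_steps),
            total_used / (pooled_capacity * n_steps))


if __name__ == "__main__":
    dedicated, pooled = utilization(simulate_demand())
    print(f"dedicated utilization: {dedicated:.1%}")
    print(f"pooled utilization:    {pooled:.1%}")
```

The pooled figure can never be lower than the dedicated one, because the peak of a sum is at most the sum of the individual peaks. The gap grows when tenants' bursts are unlikely to coincide.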
Modal caters to a diverse range of AI applications. Erik highlights several key use cases:
“With Modal, the goal was always to build a fairly general-purpose platform. ... we've seen people use Modal for like chess engines.” (17:11)
Looking ahead, Erik outlines Modal's plans to enhance performance, particularly targeting low-latency applications like real-time audio and video streaming. Achieving this requires:
“Another thing we want to do in 2025 is decentralizing the control plane so that we can use smarter ways.” (20:00)
Sean and Erik delve into the security implications of a multi-tenant architecture. Erik reassures that Modal prioritizes stringent security measures:
“We think a lot about how do we encrypt all the storage, how do we encrypt all the data in transit ... impossible to break out of containers.” (22:26)
He draws parallels with companies like Snowflake, noting the gradual industry shift towards embracing multi-tenant models with robust security frameworks.
The conversation shifts to the broader landscape of AI infrastructure. Erik shares his thoughts on the role of vector databases in AI applications, acknowledging their current necessity while questioning their long-term abstraction:
“It's too early to say. ... we're still kind of early with vectors and I don't know.” (28:13)
He speculates on future developments, such as integrating vector capabilities directly into traditional databases like PostgreSQL or evolving towards even more abstracted storage solutions.
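One way to picture the "more abstracted" direction is an index whose interface is text in, text out, with the embedding step hidden inside. The sketch below is purely illustrative: `TextIndex` and `toy_embed` are hypothetical names, and the bag-of-words embedding is a stand-in assumption for a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): a unit-normalized sparse
    bag-of-words vector. A real system would call an embedding model."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-normalized sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class TextIndex:
    """Hypothetical interface: callers insert and search with raw text
    and never handle vectors directly."""

    def __init__(self, embed=toy_embed):
        self._embed = embed
        self._items = []  # list of (text, vector) pairs

    def insert(self, text):
        self._items.append((text, self._embed(text)))

    def search(self, query, k=3):
        q = self._embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


if __name__ == "__main__":
    index = TextIndex()
    index.insert("jazz piano trio live recording")
    index.insert("heavy metal guitar solo")
    index.insert("container cold start latency")
    print(index.search("piano jazz", k=1))
```

The linear scan in `search` is only for brevity. The point of the sketch is the boundary: swapping `toy_embed` for a real model, or the scan for an approximate nearest-neighbor index, would not change what callers see.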
Erik highlights significant technical optimizations that set Modal apart:
“We built a FUSE file system that caches all the data under the hood.” (37:59)
This innovation enables Modal to launch containers swiftly, a cornerstone of their promise for rapid feedback loops.
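The two observations behind this design, that most image data is never read and that the data which is read overlaps heavily between images, can be sketched with a toy content-addressed store. This is a minimal in-memory illustration, not Modal's file system: the class names, the fixed-size chunking, and the tiny chunk size are all assumptions chosen to keep the example short.

```python
import hashlib


class ContentAddressedStore:
    """Toy store: file data is split into chunks keyed by SHA-256 digest,
    so identical chunks are stored once no matter how many images use them."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes
        self.images = {}  # image name -> {path: [digest, ...]}

    def add_image(self, name, files):
        manifest = {}
        for path, data in files.items():
            digests = []
            for start in range(0, len(data), self.chunk_size):
                chunk = data[start:start + self.chunk_size]
                digest = hashlib.sha256(chunk).hexdigest()
                self.chunks.setdefault(digest, chunk)  # dedupe by content
                digests.append(digest)
            manifest[path] = digests
        self.images[name] = manifest


class LazyReader:
    """Fetches chunks only when a file is actually read. The cache can be
    shared between images running on the same (hypothetical) host."""

    def __init__(self, store, image, cache=None):
        self.store = store
        self.image = image
        self.cache = cache if cache is not None else {}
        self.fetches = 0  # chunks pulled from the store by this reader

    def read(self, path):
        parts = []
        for digest in self.store.images[self.image][path]:
            if digest not in self.cache:
                self.cache[digest] = self.store.chunks[digest]
                self.fetches += 1
            parts.append(self.cache[digest])
        return b"".join(parts)


if __name__ == "__main__":
    store = ContentAddressedStore()
    shared_lib = b"abcdefghijkl"
    store.add_image("a", {"lib": shared_lib, "app": b"AAAA1111"})
    store.add_image("b", {"lib": shared_lib, "app": b"BBBB2222"})

    host_cache = {}
    reader_a = LazyReader(store, "a", host_cache)
    reader_b = LazyReader(store, "b", host_cache)
    reader_a.read("lib")
    reader_b.read("lib")
    print(len(store.chunks), reader_a.fetches, reader_b.fetches)
```

In this sketch the second image reads the shared file entirely from the host cache, and files that are never read are never fetched at all. A production system would put this behind a real file system interface such as FUSE, which is out of scope here.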
As the episode wraps up, Erik shares Modal's upcoming initiatives:
“Just like kind of was always like the core value of Modal. Now you can also hopefully get that up to 100 GPUs.” (38:57)
Erik emphasizes Modal's commitment to building superior developer tools and infrastructure to meet the growing demands of AI and ML engineering.
This episode provides an in-depth look into Modal's mission to revolutionize AI inference deployment, the challenges faced, and the innovative solutions being developed to empower AI teams worldwide.