Podcast Summary: Inferact — Building the Infrastructure That Runs Modern AI
Podcast: AI + a16z
Guests: Simon Mo & Woosuk Kwon (co-founders of Inferact, creators of the open-source inference engine vLLM)
Host: Matt Bornstein, a16z
Date: January 22, 2026
Episode Overview
This episode delves into the often-overlooked world of AI infrastructure, focusing on inference — the process of running trained AI models in production. The conversation highlights the technical complexities of deploying large language models (LLMs) at scale, explains why open source is vital to the future of AI, and introduces Inferact, a company springing from the popular open-source vLLM project. The discussion is grounded in real-world stories, technical deep dives, and perspectives on open source's role in advancing AI.
Key Discussion Points and Insights
1. Genesis of vLLM: From Grad School Project to Open-Source Backbone
- Woosuk Kwon describes vLLM's origins as a side project at UC Berkeley in 2022, initially to optimize a demo service running Meta's OPT model, one of the first major open-weight GPT-3 alternatives.
- Learning curve: Started with the assumption the work would be quick, but it revealed a host of open problems unique to autoregressive LLMs.
- “Initially I was thinking that it may only take like a couple weeks to optimize the service end to end. But it turns out that it actually has a lot of open problems inside in it...” (Woosuk Kwon, 04:00)
- Autoregressive LLMs vs Traditional ML:
- Traditional ML workloads could normalize inputs (e.g., resize images for CNNs), making scheduling and memory management simple.
- LLMs are highly dynamic — prompt lengths and response times vary widely, making scheduling and memory management “first-class” engineering problems.
- “Your prompt can be either like hello, like a single word or ... spanning hundreds of pages. And this kind of dynamism exists inherently in the language model. And this makes things whole kind of in a different world.” (Woosuk Kwon, 07:30)
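The scheduling consequence of this dynamism can be sketched with a toy continuous-batching loop (an illustrative simplification, not vLLM's actual scheduler): requests join and leave the running batch at different decode steps, because output lengths are unknown up front.

```python
# Toy continuous batching: a finished request frees its batch slot
# immediately, so a waiting request can start mid-flight. All names
# and numbers here are illustrative, not vLLM internals.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int       # varies wildly: one token ("hello") or hundreds of pages
    max_new_tokens: int
    generated: int = 0

    def step(self) -> bool:
        """Generate one token; return True once the request is finished."""
        self.generated += 1
        return self.generated >= self.max_new_tokens

def continuous_batching(requests, batch_size=2):
    """Run one decode step per iteration, refilling the batch as requests finish."""
    waiting = list(requests)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        for r in list(running):
            if r.step():
                running.remove(r)  # slot frees immediately for a waiting request
        steps += 1
    return steps

reqs = [Request(0, 1, 3), Request(1, 500, 1), Request(2, 10, 4)]
print(continuous_batching(reqs))  # 5 steps; static batches would take 7
```

With static batching, the short request would be held hostage by the longest one in its batch (3 steps for the first pair, then 4 more), which is exactly the waste token-level scheduling avoids.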
2. The Hidden Complexity of Inference
- Inference is now the hardest problem:
- “The public story of AI progress is about better models and bigger breakthroughs. But underneath it is a quieter systems problem ... the challenge of running AI systems has started to rival the challenge of building them.” (Matt Bornstein, 01:43)
- Traditional ML serving:
- Deterministic, batch-oriented, clockwork-like.
- LLMs in production:
- Non-deterministic and continuous.
- Hardware (GPUs) never designed for this level of unpredictability.
- Surge in “chaotic requests” with real-time needs for thousands of users.
3. vLLM: An Explosively Growing, Truly Open Source Community
- Community scale:
- From a handful of grad students to 50+ regular contributors, 2000+ overall contributors (now one of GitHub’s fastest-growing open source projects).
- Diverse participation: Users and contributors from big industry players (Meta, Red Hat, Nvidia, AMD, Google, AWS, Intel), model providers, and application builders.
- “This is kind of a classic. We’re solving the M times M problem ... you can just go into this one system and then magically you'll work for all the models out there in the world...” (Simon Mo, 14:36)
- Community management lessons:
- Borrowing from the playbook of Ray, Linux, Kubernetes, Postgres: set clear vision and roadmaps, encourage new contributors through clear scopes and objectives, welcome unsolicited pull requests.
- “We have set for our vision every quarter and then but also invite the community to contribute ... keep an extremely open mind to all the GitHub pull requests ... a blend of all the lesson learned from previously other open source projects.” (Simon Mo, 15:15)
- Frequent in-person meet-ups globally to foster collaboration.
4. Financing and Scaling Open Source
- Early a16z grant funding kicked off a larger culture of open-source sponsorships.
- vLLM's operational costs: e.g., $100k+/month on continuous integration testing.
- “Our CI bill for example is more than 100k a month ... we want to make sure every single commit is well tested. ... people are going to deploy at not thousands, but potentially millions of GPUs across the world...” (Simon Mo, 18:21)
5. Deep Dive: How LLM Inference Engines Work
- An inference engine is the software layer that runs a fixed, trained LLM on hardware to generate outputs as efficiently as possible.
- Critical components (21:03):
- API server
- Tokenizer (turns input into model-readable integers)
- Scheduler (batches and schedules requests)
- Memory manager (manages key-value caches)
- Worker (initializes model, handles pre/post-processing)
- “It’s not like a crazy new architecture, but each one basically highly optimized and specialized for this LLM inference workload.” (Woosuk Kwon, 22:14)
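A minimal sketch of how those components fit together (class names here are illustrative stand-ins, not vLLM's real APIs; the "model" simply echoes its last token, and the API server is omitted):

```python
# Hypothetical, heavily simplified inference-engine pipeline:
# tokenize -> prefill (allocate KV cache for the prompt) -> decode loop.
class Tokenizer:
    def encode(self, text): return list(text.encode())       # stand-in for BPE
    def decode(self, ids): return bytes(ids).decode()

class KVCacheManager:
    """Tracks per-request key/value cache usage (here, just token counts)."""
    def __init__(self): self.slots = {}
    def allocate(self, rid, n_tokens):
        self.slots[rid] = self.slots.get(rid, 0) + n_tokens
    def free(self, rid): self.slots.pop(rid, None)

class Worker:
    """Stand-in for the model: 'generates' by echoing the last token seen."""
    def forward(self, token_ids): return token_ids[-1]

class Engine:
    def __init__(self):
        self.tok, self.kv, self.worker = Tokenizer(), KVCacheManager(), Worker()

    def generate(self, rid, prompt, n_new=3):
        ids = self.tok.encode(prompt)
        self.kv.allocate(rid, len(ids))        # prefill: cache the whole prompt
        out = []
        for _ in range(n_new):                 # decode: one token per step
            nxt = self.worker.forward(ids + out)
            self.kv.allocate(rid, 1)           # each new token extends the cache
            out.append(nxt)
        self.kv.free(rid)                      # release cache on completion
        return self.tok.decode(out)

print(Engine().generate(0, "hi"))
```

The real engineering lives in exactly the pieces this sketch trivializes: batching many such loops together, paging the KV cache, and keeping the GPU saturated throughout.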
6. Why Inference Keeps Getting Harder
Three main drivers:
- Scale:
- Models have gone from hundreds of billions to trillions of parameters.
- Managing sharding (splitting models across multiple GPUs/nodes) raises complex trade-offs in performance and resource utilization.
- “We believe we will see like multi trillion parameter open source model this year.” (Woosuk Kwon, 23:13)
- Diversity:
- Model architectures are increasingly diverse, requiring inference engines to support different attention mechanisms, tokenizers, and memory management strategies.
- Hardware diversity: accommodating a wide spectrum of GPU/compute architectures.
- Agents:
- Next-gen LLM applications involve “agents” — multi-turn, tool-using, environment-interacting systems.
- This requires smarter inference layers that can manage persistent state, unpredictable cache access patterns, and external tool integrations.
- “With agents ... you actually don’t know whether or not the agent will think it finishes ... now it becomes external environment interaction. ... The patterns got pretty disrupted by the new paradigm.” (Simon Mo, 29:20)
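The scale point is easy to quantify with back-of-envelope arithmetic. Assuming fp16 weights (2 bytes per parameter) and 80 GB of memory per GPU, and ignoring activations and KV cache entirely, a trillion-parameter model cannot come close to fitting on one device:

```python
# Rough sharding arithmetic (assumptions: fp16 weights only, 80 GB GPUs,
# no activation or KV-cache memory counted).
import math

def min_gpus_for_weights(n_params, bytes_per_param=2, gpu_mem_gb=80):
    """Lower bound on GPUs needed just to hold the weights."""
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)

print(min_gpus_for_weights(1e12))  # 1T params -> 2000 GB of weights -> 25 GPUs minimum
```

This is only a floor; in practice, activations, KV cache, and communication overhead push deployments well beyond it, which is why sharding strategy becomes a first-order design problem.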
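The agents point can be illustrated with a toy prefix cache (loosely inspired by prefix caching in serving engines, not vLLM's implementation): successive turns of one agent conversation share a prefix, so only the new suffix needs recomputing; but because the engine cannot know when the agent is finished, it also cannot know when evicting that prefix is safe.

```python
# Toy prefix cache: track which token prefixes have been computed,
# and only "recompute" the uncached suffix of each new request.
def longest_cached_prefix(cache, tokens):
    """Length of the longest prefix of `tokens` present in the cache."""
    best = 0
    for i in range(1, len(tokens) + 1):
        if tokens[:i] in cache:
            best = i
    return best

cache = set()

def prefill(tokens):
    """Return how many tokens actually had to be recomputed."""
    hit = longest_cached_prefix(cache, tokens)
    for i in range(hit + 1, len(tokens) + 1):
        cache.add(tokens[:i])          # remember every newly computed prefix
    return len(tokens) - hit

turn1 = (1, 2, 3, 4)
turn2 = (1, 2, 3, 4, 5, 6)             # same conversation, one more exchange
print(prefill(turn1), prefill(turn2))  # prints: 4 2
```

The hard part the sketch omits is exactly what the episode describes: deciding when a conversation's cached prefix can be evicted while the agent may still be off running a tool of unknown duration.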
7. The Role and Power of Open Source in AI Infrastructure
- Open source as an engine of diversity and speed:
- “We believe that diversity will triumph that sort of single of anything at all ... the best way to promote diversity and improve that is through open source ... everybody can participate and then innovate together ... way easier and cheaper in fact in the end to deploy.” (Simon Mo, 31:00)
- Practical competitive edge: Closed-source companies (such as OpenAI) will always optimize for their own stack and use case; open source enables broader tailoring and faster innovation for varied use cases and hardware.
Notable Quotes & Memorable Moments
“I think I also started from curiosity. I didn’t really think it’s the most important problem in the world back in the day. I just wanted to have a hands on experience on how this actually works.”
– Woosuk Kwon (05:32)
“Your prompt can be either like hello, like a single word or your prompt can be a bunch of documents spanning hundreds of pages. And this kind of dynamism exists inherently in the language model ... We have to handle this dynamism as a first class citizen.”
– Woosuk Kwon (07:30)
“We’re solving the M times M problem ... for applications who are using vLLM as well as infrastructure building with vLLM, having a common ground where everybody can participate in and then innovate together is way easier and cheaper.”
– Simon Mo (14:36)
“That’s where the tension lies. A public story of AI progress is about better models and bigger breakthroughs. But underneath it is a quieter systems problem. How do you schedule chaotic requests efficiently? How do you manage memory when you don’t know when a conversation is actually finished?”
– Matt Bornstein (01:43)
“Open source moves so fast that the only way to stay ahead is adopting, and that's what we want to make happen. And in fact this is exactly why we're staying all in on open source.”
– Simon Mo (36:34)
“From a computer science point of view, pretty rare if people ask me this question. That is if you're working at a vertically integrated company ... you are working on the vertical slice of the problem. At Inferact, you will be working on an abstraction of horizontal layer. This is similar to operating systems, databases and different kinds of abstraction that people have built over the years.”
– Simon Mo (39:51)
Technical Deep Dives and Important Segments
- [06:43–09:00] — Distinction Between Traditional ML Workloads and Modern LLM Inference
- [12:59–16:54] — Building and Managing a Thriving Open Source Community
- [20:03–22:20] — Dissecting the Components of an Inference Engine
- [23:28–30:36] — Surging Complexity: Scale, Diversity, and Agents
- [33:31–35:03] — Real-World vLLM Deployments (Amazon Rufus, CharacterAI)
- [35:12–42:22] — The Founding of Inferact, Its Mission, and Open Source as Top Priority
Stories from the Field
- Amazon’s global e-commerce assistant (Rufus) now runs on vLLM, making Simon momentarily marvel that his own purchases passed through his former research project. (33:31)
- CharacterAI rolled out a cutting-edge feature from an as-yet-unmerged vLLM PR, exemplifying the project’s rapid worldwide adoption. (34:17)
Conclusion: The Universal Inference Layer & The Future
Simon Mo and Woosuk Kwon position Inferact as a universal, horizontal abstraction for AI inference — analogous to operating systems for CPUs — uniting open source contributors, model and hardware providers, and users into a fast-iterating, deeply technical ecosystem.
“Our goal is to make vLLM the world’s inference engine ... It is only when vLLM becomes a standard and vLLM helps everybody to achieve what they need to do, then our company in a sense has the right meaning and to be able to support everybody around it.”
– Simon Mo (00:00 & 35:49)
For Listeners Seeking Key Takeaways
- AI inference is the new bottleneck and the new frontier — harder and more essential than ever as models scale and diversify.
- vLLM is a thriving, industry-wide open source project, rapidly adopted by major companies and continually evolving via global collaboration.
- Inferact is betting its company on open source — seeking to build a universal inference layer that sets the foundation for modern and future AI systems.
For anyone working on deploying LLMs, scaling cloud AI workloads, or interested in the next wave of system infrastructure, this episode is filled with practical lessons, war stories, and visionary thinking about where AI is going next.
