Podcast Summary: Inferact — Building the Infrastructure That Runs Modern AI
Podcast: AI + a16z
Guests: Simon Mo & Woosuk Kwon (co-founders of Inferact, creators of the open-source inference engine vLLM)
Host: Matt Bornstein, a16z
Date: January 22, 2026
Episode Overview
This episode delves into the often-overlooked world of AI infrastructure, focusing on inference — the process of running trained AI models in production. The conversation highlights the technical complexities of deploying large language models (LLMs) at scale, explains why open source is vital to the future of AI, and introduces Inferact, a company springing from the popular open-source vLLM project. The discussion is grounded in real-world stories, technical deep dives, and perspectives on open source's role in advancing AI.
Key Discussion Points and Insights
1. Genesis of vLLM: From Grad School Project to Open-Source Backbone
- Woosuk Kwon describes vLLM's origins as a side project at UC Berkeley in 2022, initially to optimize a demo service running Meta's OPT model, one of the first major open-weight GPT-3 alternatives.
- Learning curve: Started with the assumption the work would be quick, but it revealed a host of open problems unique to autoregressive LLMs.
- “Initially I was thinking that it may only take like a couple weeks to optimize the service end to end. But it turns out that it actually has a lot of open problems inside in it...” (Woosuk Kwon, 04:00)
- Autoregressive LLMs vs Traditional ML:
- Traditional ML workloads could normalize inputs (e.g., resize images for CNNs), making scheduling and memory management simple.
- LLMs are highly dynamic — prompt lengths and response times vary widely, making scheduling and memory management “first-class” engineering problems.
- “Your prompt can be either like hello, like a single word or ... spanning hundreds of pages. And this kind of dynamism exists inherently in the language model. And this makes things whole kind of in a different world.” (Woosuk Kwon, 07:30)
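The scheduling consequence of this dynamism can be sketched with a toy continuous-batching loop (an illustrative simplification, not vLLM's actual scheduler): requests join and leave the running batch at different decode steps, because output lengths are unknown up front.

```python
# Toy continuous batching: a finished request frees its batch slot
# immediately, so a waiting request can start mid-flight. All names
# and numbers here are illustrative, not vLLM internals.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int       # varies wildly: one token ("hello") or hundreds of pages
    max_new_tokens: int
    generated: int = 0

    def step(self) -> bool:
        """Generate one token; return True once the request is finished."""
        self.generated += 1
        return self.generated >= self.max_new_tokens

def continuous_batching(requests, batch_size=2):
    """Run one decode step per iteration, refilling the batch as requests finish."""
    waiting = list(requests)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        for r in list(running):
            if r.step():
                running.remove(r)  # slot frees immediately for a waiting request
        steps += 1
    return steps

reqs = [Request(0, 1, 3), Request(1, 500, 1), Request(2, 10, 4)]
print(continuous_batching(reqs))  # 5 steps; static batches would take 7
```

With static batching, the short request would be held hostage by the longest one in its batch (3 steps for the first pair, then 4 more), which is exactly the waste token-level scheduling avoids.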
2. The Hidden Complexity of Inference
- Inference is now the hardest problem:
- “The public story of AI progress is about better models and bigger breakthroughs. But underneath it is a quieter systems problem ... the challenge of running AI systems has started to rival the challenge of building them.” (Matt Bornstein, 01:43)
- Traditional ML serving:
- Deterministic, batch-oriented, clockwork-like.
- LLMs in production:
- Non-deterministic and continuous.
- Hardware (GPUs) never designed for this level of unpredictability.
- Surge in “chaotic requests” with real-time needs for thousands of users.
3. vLLM: An Explosively Growing, Truly Open Source Community
- Community scale:
- From a handful of grad students to 50+ regular contributors, 2000+ overall contributors (now one of GitHub’s fastest-growing open source projects).
- Diverse participation: Users and contributors from big industry players (Meta, Red Hat, Nvidia, AMD, Google, AWS, Intel), model providers, and application builders.
- “This is kind of a classic. We’re solving the M times M problem ... you can just go into this one system and then magically you'll work for all the models out there in the world...” (Simon Mo, 14:36)
- Community management lessons:
- Borrowing from the playbook of Ray, Linux, Kubernetes, Postgres: set clear vision and roadmaps, encourage new contributors through clear scopes and objectives, welcome unsolicited pull requests.
- “We have set for our vision every quarter and then but also invite the community to contribute ... keep an extremely open mind to all the GitHub pull requests ... a blend of all the lesson learned from previously other open source projects.” (Simon Mo, 15:15)
- Frequent in-person meet-ups globally to foster collaboration.
4. Financing and Scaling Open Source
- Early a16z grant funding kicked off a larger culture of open-source sponsorships.
- vLLM's operational costs: e.g., $100k+/month on continuous integration testing.
- “Our CI bill for example is more than 100k a month ... we want to make sure every single commit is well tested. ... people are going to deploy at not thousands, but potentially millions of GPUs across the world...” (Simon Mo, 18:21)
5. Deep Dive: How LLM Inference Engines Work
- An inference engine is the software layer that runs a fixed, trained LLM on hardware to generate outputs as efficiently as possible.
- Critical components (21:03):
- API server
- Tokenizer (turns input into model-readable integers)
- Scheduler (batches and schedules requests)
- Memory manager (manages key-value caches)
- Worker (initializes model, handles pre/post-processing)
- “It’s not like a crazy new architecture, but each one basically highly optimized and specialized for this LLM inference workload.” (Woosuk Kwon, 22:14)
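A minimal sketch of how those components fit together (class names here are illustrative stand-ins, not vLLM's real APIs; the "model" simply echoes its last token, and the API server is omitted):

```python
# Hypothetical, heavily simplified inference-engine pipeline:
# tokenize -> prefill (allocate KV cache for the prompt) -> decode loop.
class Tokenizer:
    def encode(self, text): return list(text.encode())       # stand-in for BPE
    def decode(self, ids): return bytes(ids).decode()

class KVCacheManager:
    """Tracks per-request key/value cache usage (here, just token counts)."""
    def __init__(self): self.slots = {}
    def allocate(self, rid, n_tokens):
        self.slots[rid] = self.slots.get(rid, 0) + n_tokens
    def free(self, rid): self.slots.pop(rid, None)

class Worker:
    """Stand-in for the model: 'generates' by echoing the last token seen."""
    def forward(self, token_ids): return token_ids[-1]

class Engine:
    def __init__(self):
        self.tok, self.kv, self.worker = Tokenizer(), KVCacheManager(), Worker()

    def generate(self, rid, prompt, n_new=3):
        ids = self.tok.encode(prompt)
        self.kv.allocate(rid, len(ids))        # prefill: cache the whole prompt
        out = []
        for _ in range(n_new):                 # decode: one token per step
            nxt = self.worker.forward(ids + out)
            self.kv.allocate(rid, 1)           # each new token extends the cache
            out.append(nxt)
        self.kv.free(rid)                      # release cache on completion
        return self.tok.decode(out)

print(Engine().generate(0, "hi"))
```

The real engineering lives in exactly the pieces this sketch trivializes: batching many such loops together, paging the KV cache, and keeping the GPU saturated throughout.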
6. Why Inference Keeps Getting Harder
Three main drivers:
- Scale:
- Models have gone from hundreds of billions to trillions of parameters.
- Managing sharding (splitting models across multiple GPUs/nodes) raises complex trade-offs in performance and resource utilization.
- “We believe we will see like multi trillion parameter open source model this year.” (Woosuk Kwon, 23:13)
- Diversity:
- Model architectures are increasingly diverse, requiring inference engines to support different attention mechanisms, tokenizers, and memory management strategies.
- Hardware diversity: accommodating a wide spectrum of GPU/compute architectures.
- Agents:
- Next-gen LLM applications involve “agents” — multi-turn, tool-using, environment-interacting systems.
- This requires smarter inference layers that can manage persistent state, unpredictable cache access patterns, and external tool integrations.
- “With agents ... you actually don’t know whether or not the agent will think it finishes ... now it becomes external environment interaction. ... The patterns got pretty disrupted by the new paradigm.” (Simon Mo, 29:20)
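The scale point is easy to quantify with back-of-envelope arithmetic. Assuming fp16 weights (2 bytes per parameter) and 80 GB of memory per GPU, and ignoring activations and KV cache entirely, a trillion-parameter model cannot come close to fitting on one device:

```python
# Rough sharding arithmetic (assumptions: fp16 weights only, 80 GB GPUs,
# no activation or KV-cache memory counted).
import math

def min_gpus_for_weights(n_params, bytes_per_param=2, gpu_mem_gb=80):
    """Lower bound on GPUs needed just to hold the weights."""
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)

print(min_gpus_for_weights(1e12))  # 1T params -> 2000 GB of weights -> 25 GPUs minimum
```

This is only a floor; in practice, activations, KV cache, and communication overhead push deployments well beyond it, which is why sharding strategy becomes a first-order design problem.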
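The agents point can be illustrated with a toy prefix cache (loosely inspired by prefix caching in serving engines, not vLLM's implementation): successive turns of one agent conversation share a prefix, so only the new suffix needs recomputing; but because the engine cannot know when the agent is finished, it also cannot know when evicting that prefix is safe.

```python
# Toy prefix cache: track which token prefixes have been computed,
# and only "recompute" the uncached suffix of each new request.
def longest_cached_prefix(cache, tokens):
    """Length of the longest prefix of `tokens` present in the cache."""
    best = 0
    for i in range(1, len(tokens) + 1):
        if tokens[:i] in cache:
            best = i
    return best

cache = set()

def prefill(tokens):
    """Return how many tokens actually had to be recomputed."""
    hit = longest_cached_prefix(cache, tokens)
    for i in range(hit + 1, len(tokens) + 1):
        cache.add(tokens[:i])          # remember every newly computed prefix
    return len(tokens) - hit

turn1 = (1, 2, 3, 4)
turn2 = (1, 2, 3, 4, 5, 6)             # same conversation, one more exchange
print(prefill(turn1), prefill(turn2))  # prints: 4 2
```

The hard part the sketch omits is exactly what the episode describes: deciding when a conversation's cached prefix can be evicted while the agent may still be off running a tool of unknown duration.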
7. The Role and Power of Open Source in AI Infrastructure
- Open source as an engine of diversity and speed:
- “We believe that diversity will triumph that sort of single of anything at all ... the best way to promote diversity and improve that is through open source ... everybody can participate and then innovate together ... way easier and cheaper in fact in the end to deploy.” (Simon Mo, 31:00)
- Practical competitive edge: Closed-source companies (such as OpenAI) will always optimize for their own stack and use case; open source enables broader tailoring and faster innovation for varied use cases and hardware.
Notable Quotes & Memorable Moments
“I think I also started from curiosity. I didn’t really think it’s the most important problem in the world back in the day. I just wanted to have a hands on experience on how this actually works.”
– Woosuk Kwon (05:32)
“Your prompt can be either like hello, like a single word or your prompt can be a bunch of documents spanning hundreds of pages. And this kind of dynamism exists inherently in the language model ... We have to handle this dynamism as a first class citizen.”
– Woosuk Kwon (07:30)
“We’re solving the M times M problem ... for applications who are using vLLM as well as infrastructure building with vLLM, having a common ground where everybody can participate in and then innovate together is way easier and cheaper.”
– Simon Mo (14:36)
“That’s where the tension lies. A public story of AI progress is about better models and bigger breakthroughs. But underneath it is a quieter systems problem. How do you schedule chaotic requests efficiently? How do you manage memory when you don’t know when a conversation is actually finished?”
– Matt Bornstein (01:43)
“Open source moves so fast that the only way to stay ahead is adopting, and that's what we want to make happen. And in fact this is exactly why we're staying all in on open source.”
– Simon Mo (36:34)
“From a computer science point of view, pretty rare if people ask me this question. That is if you're working at a vertically integrated company ... you are working on the vertical slice of the problem. At Inferact, you will be working on an abstraction of horizontal layer. This is similar to operating systems, databases and different kinds of abstraction that people have built over the years.”
– Simon Mo (39:51)
Technical Deep Dives and Important Segments
- [06:43–09:00] — Distinction Between Traditional ML Workloads and Modern LLM Inference
- [12:59–16:54] — Building and Managing a Thriving Open Source Community
- [20:03–22:20] — Dissecting the Components of an Inference Engine
- [23:28–30:36] — Surging Complexity: Scale, Diversity, and Agents
- [33:31–35:03] — Real-World vLLM Deployments (Amazon Rufus, CharacterAI)
- [35:12–42:22] — The Founding of Inferact, Its Mission, and Open Source as Top Priority
Stories from the Field
- Amazon’s global e-commerce assistant (Rufus) now runs on vLLM, making Simon momentarily marvel that his own purchases passed through his former research project. (33:31)
- CharacterAI rolled out a cutting-edge feature from an as-yet-unmerged vLLM PR, exemplifying the project’s rapid worldwide adoption. (34:17)
Conclusion: The Universal Inference Layer & The Future
Simon Mo and Woosuk Kwon position Inferact as a universal, horizontal abstraction for AI inference — analogous to operating systems for CPUs — uniting open source contributors, model and hardware providers, and users into a fast-iterating, deeply technical ecosystem.
“Our goal is to make vLLM the world’s inference engine ... It is only when vLLM becomes a standard and vLLM helps everybody to achieve what they need to do, then our company in a sense has the right meaning and to be able to support everybody around it.”
– Simon Mo (00:00 & 35:49)
For Listeners Seeking Key Takeaways
- AI inference is the new bottleneck and the new frontier — harder and more essential than ever as models scale and diversify.
- vLLM is a thriving, industry-wide open source project, rapidly adopted by major companies and continually evolving via global collaboration.
- Inferact is betting its company on open source — seeking to build a universal inference layer that sets the foundation for modern and future AI systems.
For anyone working on deploying LLMs, scaling cloud AI workloads, or interested in the next wave of system infrastructure, this episode is filled with practical lessons, war stories, and visionary thinking about where AI is going next.
