Kubernetes Podcast from Google
Episode Title: LLM-D, with Clayton Coleman and Rob Shaw
Released: August 20, 2025
Hosts: Abdel Sghiouar (absent), Kaslin Fields, Mofi Rahman (guest host)
Guests: Clayton Coleman (OpenShift/Kubernetes core contributor), Rob Shaw (Director of Engineering at Red Hat, vLLM contributor)
Episode Overview
This episode examines how running Large Language Model (LLM) workloads on Kubernetes differs fundamentally from hosting conventional applications, and introduces llm-d, a new open-source project that unifies best practices for LLM serving. Guests Clayton Coleman and Rob Shaw dig into the technical, operational, and community-collaboration aspects of scaling LLMs in the cloud-native world.
Key Discussion Points & Insights
1. Why LLMs Are Different on Kubernetes
- Workload Shift: LLMs shifted the AI problem space from software development to resource management and scale.
- Traditional Load Balancing Doesn't Fit:
- Web apps: random load balancing and horizontal autoscaling suffice.
- LLMs: each request has a different resource cost (e.g., short vs. long prompts), requiring specialized traffic management (see the sketch at the end of this section).
- Clayton Coleman [02:50]:
“What was really interesting with large language models is it shifted the problem space from being one of software development to one of being resource usage and scale... it stopped looking like a traditional microservice.”
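To make the contrast concrete, here is a minimal Python sketch (hypothetical replica fields and an arbitrary weighting, not the Inference Gateway's actual scoring logic) of random endpoint selection versus a picker that weighs per-replica queue depth and KV-cache pressure:

```python
import random
from dataclasses import dataclass

# Hypothetical per-replica state; a real scheduler (e.g. the Gateway API
# Inference Extension) scrapes similar signals from model-server metrics.
@dataclass
class Replica:
    name: str
    queued_requests: int = 0           # requests waiting for prefill
    kv_cache_utilization: float = 0.0  # fraction of KV-cache blocks in use

def pick_random(replicas: list[Replica]) -> Replica:
    # Classic L7 load balancing: every request is treated as equal cost.
    return random.choice(replicas)

def pick_least_loaded(replicas: list[Replica]) -> Replica:
    # LLM-aware scheduling: a request routed to a deep queue or a nearly full
    # KV cache sees much worse latency, so score replicas on both signals.
    # The weight is an arbitrary illustration, not a tuned value.
    def score(r: Replica) -> float:
        return r.queued_requests + 10.0 * r.kv_cache_utilization
    return min(replicas, key=score)

fleet = [
    Replica("vllm-0", queued_requests=8, kv_cache_utilization=0.9),
    Replica("vllm-1", queued_requests=1, kv_cache_utilization=0.2),
]
print("random pick:    ", pick_random(fleet).name)        # could be either
print("load-aware pick:", pick_least_loaded(fleet).name)  # vllm-1
```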
2. The Role of Inference Gateway and vLLM
- Inference Gateway: developed as a smarter load balancer for LLMs, managing traffic with deeper knowledge of the models and model servers.
- vLLM: emerged as a best-in-class model server with innovations like paged attention and continuous batching, crucial for the unique, stateful demands of autoregressive models.
- Technical Distinctions:
- “KV cache” management optimizes performance but makes workloads more stateful and complex to orchestrate.
- vLLM’s “paged attention” works like an OS managing virtual memory, mapping the logical KV cache to physical blocks for efficient token generation (see the sketch at the end of this section).
- Rob Shaw [08:49]:
“LLMs are autoregressive… every token gets generated with another pass through the model. Traditional predictive apps are stateless; LLMs are not.”
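The following toy Python sketch illustrates the block-table idea behind paged attention (illustrative only; the names and block size here are assumptions, and vLLM's real implementation is far more involved): the KV cache for each sequence lives in fixed-size physical blocks, and a per-sequence block table maps logical blocks to physical ones, much like page tables in an operating system.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (a common vLLM default)

class KVCacheAllocator:
    """Toy allocator: maps each sequence's logical KV-cache blocks to physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve KV-cache space for one newly generated token."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or sequence is new)
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; sequence must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks so other requests can be batched in."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = KVCacheAllocator(num_physical_blocks=8)
for _ in range(40):                      # 40 tokens -> ceil(40 / 16) = 3 blocks
    alloc.append_token("request-a")
print(alloc.block_tables["request-a"])   # three physical block ids, e.g. [7, 6, 5]
```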
3. The Evolution: The llm-d Project
- llm-d brings the Inference Gateway and vLLM communities together, building “well-lit paths”: reference architectures for common LLM serving patterns.
- Key Patterns Provided in llm-d (v0.2):
- Intelligent Inference Scheduling: standardizes scalable, smart LLM load balancing.
- Prefill/Decode Disaggregation: splits the prefill and decode phases across separate pools for better compute/memory utilization (see the sketch at the end of this section).
- Wide Expert Parallelism: supports large “mixture of experts” models (e.g., DeepSeek, Kimi, Llama 4) with distributed, multi-node deployments.
- APIs and Upstream Contribution: emphasizes keeping optimizations upstream to avoid forks and to benefit quickly from advances in models and infrastructure.
- Rob Shaw [16:00]:
“The idea is to bring these two communities together… have Gateway drive requirements down to vLLM and have vLLM drive requirements up into Gateway… highlight state-of-the-art ways to deploy common patterns.”
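As a rough illustration of the prefill/decode split, here is a minimal Python sketch (hypothetical pool names and a naive least-loaded pick, not llm-d's actual control plane): prefill is compute-bound because the whole prompt is processed in one pass, while decode is memory-bandwidth-bound because it generates one token per step, so the two phases run on separately sized pools with a KV-cache handoff between them.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt: str

# Hypothetical worker pools; in practice these would be separate Kubernetes
# deployments, sized and tuned independently for their bottleneck resource.
prefill_load = {"prefill-0": 0, "prefill-1": 0}     # pending prompt tokens
decode_load = {f"decode-{i}": 0 for i in range(4)}  # active sequences

def schedule(req: Request) -> tuple[str, str]:
    # 1. A prefill worker ingests the whole prompt in one pass and builds the KV cache.
    prefill_worker = min(prefill_load, key=prefill_load.get)
    prefill_load[prefill_worker] += len(req.prompt.split())
    # 2. The KV cache is handed off (e.g. over a fast interconnect) to a decode
    #    worker, which then generates the response one token at a time.
    decode_worker = min(decode_load, key=decode_load.get)
    decode_load[decode_worker] += 1
    return prefill_worker, decode_worker

print(schedule(Request("req-42", "Explain Kubernetes in one sentence.")))
```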
4. Community-Driven Optimization: Call to Action
- Open Source Incentives:
- Encourages teams to return “bespoke” optimizations to the community, reducing future rework and easing benchmarking and debugging.
- Sustained progress comes from pooling and curating best practices across companies/projects.
- Clayton Coleman [27:35]:
“Open source works best when everybody gets something. ... The incentive that I think we'd be looking for is you can go down this well lit path... you're working off of a path where not just the one piece ... works, but some of the tunables ... and new algorithms and tuning.”
- Rob Shaw [29:57]:
“In llm-d we're not taking forks... we're using and driving these things into the upstream directly. ... The pace at which things are improving and changing in the ML ecosystem is absolutely breakneck.”
5. Looking to the Future: LLMs, Kubernetes, and AI
- Hardware and System Paradigm Changes:
- Hardware and networks will co-evolve with model scale, bringing more parallelism and faster interconnects.
- Open Source Will Win:
- Both open and closed projects and models will matter, but the guests expect innovation and efficiency to tilt toward open source.
- Clayton Coleman [35:51]:
“Five years from now, the best and most important models are going to be a mix of open and closed innovation, but I think they're going to tilt towards open.”
- Agentic Applications & Compound AI:
- The future is not just larger models but systems of interacting models, tools, and subsystems.
- Rob Shaw [38:16]:
“Agentic applications emerging—users try to customize the model to their use case through these mega models... with their own enterprise or custom data. ... We'll need to evolve the llm-d and model server roadmaps to work in those application patterns.”
Notable Quotes & Memorable Moments
- Clayton Coleman [02:50]: "I like to think of the large language model as a little bit like its own host, its own CPU that has to be shared."
- Rob Shaw [08:49]: “vLLM really emerged in the summer of 2023 with a fundamental algorithm called paged attention... an homage to virtual memory in an operating system.”
- Rob Shaw [14:17]: "It's hard to underestimate the like first L of LLMs, which is large. We're looking at just an amount of compute that's ginormous."
- Clayton Coleman [22:10]: "Making it easier for people to anchor on those [well-lit] paths makes contribution easier... If we can go out there and look at the 10 or 15 different ways people have done prefill/decode disaggregation, we can apply some judgment and say it works in these scenarios..."
- Clayton Coleman [35:51]: "Never underestimate a whole bunch of people optimizing their hardware to get the best performance out of it."
- Rob Shaw [39:32]: “Last year, Clayton, you had a quote… 'Inference is the new web app.' And this year… 'agents is the new web app.'”
- Clayton Coleman [40:09]: "It is never too early or too late to learn about ML. ... Don't be daunted, don't be intimidated. Give it a try. Learn and come help, participate, contribute back. That's all we need."
- Rob Shaw [40:35]: “It's gotta be the most fun place to be working right now. The pace, the amount of innovation, the speed at which research moves from a paper into a real production system is so fast. … Please feel free to jump in, tell us your requirements, get involved. We'd love to see you.”
Timestamps for Important Segments
- [02:25] — Introduction to Inference Gateway & LLM-specific serving challenges
- [06:03] — How vLLM evolved to serve the new LLM landscape
- [08:49] — Why traditional serving systems (like TensorFlow Serving, KFServing) fall short for LLMs
- [14:17] — Compute intensity and performance focus of LLMs
- [16:00] — llm-d: its mission, structure, and initial “well-lit paths”
- [27:35] — Advice to startups/teams on why to participate in open source optimizations
- [33:20] — Predictions: LLM serving in 5 years
- [39:32] — Memorable quote round (“agents are the new web app” update)
- [40:09] — Final advice for listeners: Get involved and don’t be intimidated
Contextual Takeaways & Further Reflections ([41:44] onward)
- llm-d’s role is not to reinvent the stack but to unify existing projects (Inference Gateway, vLLM) into production-ready best practices, with direct upstream contributions.
- Community involvement: optimization techniques are evolving too fast for any team to go it alone; contributing openly helps everyone, including your own team.
- Open standards and APIs (like OpenAI compatibility in vLLM) are making it easier to build portable, composable AI infrastructure (see the example after this list).
- Kubernetes continues to serve as the foundational distributed-systems platform, now stepping up to support compute- and memory-hungry LLM applications with complex routing and scaling needs.
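As a small illustration of that portability, the standard OpenAI Python client can be pointed at a vLLM or gateway endpoint just by changing the base URL; the endpoint and model name below are placeholders for whatever a given cluster actually serves.

```python
from openai import OpenAI

# Placeholder endpoint: any server exposing the OpenAI-compatible API
# (such as vLLM, or an Inference Gateway route in front of it) works here.
client = OpenAI(
    base_url="http://my-inference-gateway.example.com/v1",
    api_key="not-needed-for-most-self-hosted-setups",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server has loaded
    messages=[{"role": "user", "content": "Why is load balancing LLM traffic hard?"}],
)
print(response.choices[0].message.content)
```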
How to Get Involved
- The guests encourage listeners, especially those building or running LLM workloads, to engage directly:
- Share optimizations and pain points with llm-d, vLLM, and Inference Gateway
- Participate in open meetings, Slack channels, and contribute code or use cases (links available in show notes)
Conclusion
This episode is a comprehensive technical and pragmatic look at what it takes to run LLMs at scale on Kubernetes, emphasizing open-source collaboration, the shift to workload-centric AI infrastructure, and the rapid evolution of both models and the software ecosystem. llm-d emerges as a central, community-driven reference project for best practices in LLM serving, pushing the field toward greater openness, standardization, and shared success.
For Further Information:
- Show notes include links to the llm-d, vLLM, and Inference Gateway GitHub repos and community channels.
- Episode transcript and additional resources available at kubernetespodcast.com.
