Kaslin Fields
Hello and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields.
Mofi Rahman
And I'm Mofi Rahman. This week I had a chance to sit down and talk with Clayton Coleman and Rob Shaw. Clayton Coleman is a core contributor to Kubernetes, the container cluster manager, and founding architect for OpenShift, the open source platform as a service. Clayton helped launch the shift to cloud native applications and the platforms that enable them at Google. His mission is to make Kubernetes and GKE the best place to run workloads, especially accelerated AI/ML workloads, and especially very large model inference at scale with the Inference Gateway and llm-d. Rob Shaw is a Director of Engineering at Red Hat and a contributor to the vLLM project. In the interview we talked about why LLMs are different than any other workload running on Kubernetes, and why projects like llm-d exist. But first, let's get to the news.
Kaslin Fields
Kubernetes 1.34 is expected to release at the end of August. If you haven't seen the sneak peek blog yet, head over to kubernetes.io to check it out, and look forward to our interview with the release lead.
Mofi Rahman
Kubecrash is a community-led virtual event happening on September 23rd. Attendees can expect to learn about a variety of topics in the form of cloud native open source crash courses for platform engineers. The event will also be raising money for Deaf Kids Code, a nonprofit organization with a mission to provide equitable access to computer science education. Check out the schedule and register for the event; the link is available in the show notes.
Kaslin Fields
The CNCF published a blog post listing the top 30 open source projects in 2025. Unsurprisingly, Kubernetes has the largest contributor base, followed by OpenTelemetry, which is quickly becoming the Kubernetes of o11y (observability) communities, as called out in the blog post. The blog lists out a number of other projects like Backstage, which we featured on episode 136, as well as Argo, Crossplane, Kubeflow and many others. This growth shows where the community is headed and where the future investments are. Make sure to check out the link in the show notes. And that's the news.
Mofi Rahman
Welcome to the show, Clayton and Robert. Hello!
Rob Shaw
Hey, thanks for having me on.
Mofi Rahman
So Clayton, at this past KubeCon EU 2025 there was a keynote where we talked about Inference Gateway for Kubernetes and running inference workloads on Kubernetes. For the listeners, people who have been using Kubernetes for a long time but not necessarily running LLM workloads: why would something like inference be any different than running, let's say, a web application?
Clayton Coleman
That's a great question. It's taken me, and I'm sure others as well, a lot of time to wrestle with this. AI/ML workloads before large language models tended to be really interesting. They were highly custom, they were dependent on lots of local software stacks. To Kubernetes they were really just another workload, and there were thousands of variations. Every organization had its own unique take on how to run an AI/ML platform in production. What was really interesting with large language models is that they shifted the problem space from being one of software development to one of resource usage and scale. And interestingly, that was really when it stopped looking like a traditional microservice and started looking like a very specialized bit of software that just so coincidentally needs a ton of direct access to hardware, and started taking on characteristics that were different from regular web apps. For a regular web app, random load balancing is pretty good. Horizontal autoscaling in Kubernetes was pretty simple: you stick a CPU number on there, and when you cross that number we scale up, and when you fall below it we scale down. What was pretty obvious from a lot of our discussions with people running large language models on Kubernetes was that a lot of the primitives didn't work for them, because the problem had changed. And we experimented. We started Working Group Serving back at KubeCon EU 2024 and started partnering with people. One of the things that was really obvious was load balancing. What's really different about large language models is they're a bit of a computer in and of themselves. The model is processing things. And so I like to think of the large language model as a little bit like its own host, its own CPU that has to be shared. And as we looked at it that way, we said there's nothing really that helps you share a large language model acting as a CPU for a bunch of different workloads.
And that led to Inference Gateway and the idea that you're load balancing traffic differently than you would for a web app, because the requests are different lengths, right? A really short prompt is much cheaper to calculate than a very long prompt, and you have to do them iteratively. And so Inference Gateway started as: hey, we can do better than random load balancing, and maybe we can divvy up access to these large models fairly, and help operators and deployers get a good idea of how the workloads are going. And KubeCon EU was actually really transformational for us, because it was that shift away from thinking like a load balancer. We had had the realization that as a generic load balancer you can only do so much. We were already starting to think, hey, we need to work more closely with the servers that were actually running the models. Those are called model servers, and there's a couple of them that are really important. One of the most widely known and most popular is vLLM. And so after KubeCon EU, I got an out-of-the-blue call from Rob Shaw, even though we had chatted a few times before, and he wanted to talk about what we could do together. Not just Inference Gateway, not just vLLM, but how can we bring those two together?
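The difference Clayton describes, random load balancing versus treating the model server like a shared CPU, can be sketched roughly like this. This is a toy Python illustration with made-up replica names, metrics, and scoring weights; the actual Inference Gateway scheduling logic is more sophisticated than this.

```python
# Toy comparison of random vs. inference-aware endpoint picking.
# Replica metrics and the scoring formula are illustrative assumptions.
import random
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queue_depth: int      # requests waiting at the model server
    kv_cache_used: float  # fraction of KV-cache blocks in use (0.0-1.0)

def random_pick(replicas):
    # What a generic L7 load balancer effectively does.
    return random.choice(replicas)

def inference_aware_pick(replicas):
    # Prefer replicas with shorter queues and freer KV cache;
    # lower score wins. The 2.0 weight is arbitrary for the sketch.
    return min(replicas, key=lambda r: r.queue_depth + 2.0 * r.kv_cache_used)

replicas = [
    Replica("vllm-0", queue_depth=8, kv_cache_used=0.9),
    Replica("vllm-1", queue_depth=1, kv_cache_used=0.3),
    Replica("vllm-2", queue_depth=5, kv_cache_used=0.5),
]
print(inference_aware_pick(replicas).name)  # -> vllm-1
```

The point of the sketch is only that the picker consults model-server state before routing, which random load balancing cannot do.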
Rob Shaw
Yeah, and from my side in vLLM, we've been partnering with model providers over the course of the past two years since vLLM came out, and have really been staying up to date with the evolutions in the model architectures themselves. We had seen vLLM very commonly being deployed inside of Kubernetes in operational systems, and we were really interested in how we could make this work more closely together. But I think what really started to become acute over the course of 2025, with the advent of DeepSeek right at the end of December 2024, is this shift towards very large mixture of experts models. And the problem of running this inside of Kubernetes became really acute, because mixture of experts models, these huge DeepSeek-like architectures that, with recent models like Kimi, have a trillion parameters, are really designed to be deployed with techniques like disaggregated prefill, where we'll have a prefill instance and a decode instance that need to work together to serve an individual request, or things like wide expert parallelism, where multiple nodes work together to serve an individual inference request, to really scale things out and get high performance in a distributed system. The demands of the models really started to make the operational challenge of deploying these frontier models more and more acute. And we had obviously seen Inference Gateway solving the problem of load balancing with intelligent scheduling. We really wanted to take the concerns that we had around how to serve these bigger models with these more sophisticated optimizations, and make that compose nicely with all of the amazing work that the upstream community had done on the load balancing workload.
So that was a little bit of why things started to become acute, and why we really thought there was a good opportunity to bring together the vLLM community and the Gateway community to work on building a project that has tighter requirements between the two, requirements that help to drive the APIs that are needed in both systems.
Mofi Rahman
So you mentioned the model server, and vLLM happens to be one of these model servers. But the task of serving models is not necessarily something new. LLMs are newer, but people were serving different types of AI models before: things like Seldon Core, which people used before KFServing, now called KServe, which does a similar thing; then you have TF Serving as well for TensorFlow models, things of that nature. In this new era of serving large language models, why vLLM? What is vLLM doing that these other solutions did not really answer before?
Rob Shaw
Yeah, sure. The fundamental problem with large language models is that they're autoregressive, which basically means that every token that gets generated requires another pass through the model, over and over again, to generate text. Whereas traditional models like a BERT or a YOLO or a predictive ML model basically do one forward pass to execute the request, and they're done, and return the response back to the user. So effectively, from the perspective of the model, they're stateless, right? For traditional predictive apps, you'd see a very common strategy called dynamic batching, where servers like Triton Server would queue up a series of requests, dispatch them all off to the model to batch them together, do one forward pass, generate the responses, and return them to the users. But as we look at LLM workloads, during the life of an inference request there's a lot of state that needs to be managed, because we want to reuse the intermediate state between those forward passes. These are called the KV caches, which are a really critical optimization to avoid having to recompute the prompt over and over again as we generate tokens. So this performance optimization of KV caching is really critical to getting good performance out of the LLMs, but it comes at the cost of needing to keep track of this intermediate state. So vLLM really emerged in the summer of 2023 with a fundamental algorithm called PagedAttention, which allowed us to manage this KV cache in a smarter way using what we call a block table, which is an homage to the concept of virtual memory in an operating system: each request has a logical view of its KV cache, which maps to random-access physical KV cache blocks. And there's an attention algorithm, paged attention, which was really a fused gather-attention operation that made this all work really well.
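The block-table idea Rob describes can be sketched in a few lines. This is a toy simplification with assumed block and pool sizes; vLLM's real data structures and the PagedAttention GPU kernels are far more involved.

```python
# Toy block table: each request sees a contiguous logical KV cache,
# mapped onto scattered physical blocks, like virtual memory pages.
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed for the sketch)

class BlockTable:
    def __init__(self, num_physical_blocks=64):
        self.free = list(range(num_physical_blocks))  # pool of physical block ids
        self.tables = {}                              # request id -> physical blocks

    def append_token(self, req_id, pos):
        # Allocate a new physical block each time a logical block fills up.
        table = self.tables.setdefault(req_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop(0))

    def physical(self, req_id, pos):
        # Translate a logical token position into (physical block, offset).
        table = self.tables[req_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

bt = BlockTable()
for pos in range(20):               # 20 tokens span two 16-token blocks
    bt.append_token("req-1", pos)
print(bt.physical("req-1", 17))     # -> (1, 1): second block, offset 1
```

Freed blocks can be returned to the pool and handed to other requests, which is what lets the server pack many requests' caches into fixed GPU memory without fragmentation.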
vLLM really emerged in the summer of 2023 with a good implementation of continuous batching and this primitive of KV cache management through a block table with paged attention, and that really kicked off the open source LLM serving ecosystem with the right fundamental abstractions. That core of vLLM, a continuous batching engine and a KV cache management engine, is still fundamental to the vLLM system. But vLLM really became the leading inference engine in the ecosystem, and so vLLM has really grown up alongside the open source ecosystem. We've seen major changes from 2023 to 2025. In 2024 we saw a huge advent of multimodal models; we saw lots of new techniques like chunked prefill, prefix caching, structured generation, and speculative decoding emerge over the course of 2023 and 2024. And then in 2025 we saw a push towards large mixture of experts models. We've seen an explosion in the number of different hardware backends that vLLM supports, whether it's NVIDIA, AMD, or Google TPU, which was a big project that we did over the course of the past nine months in partnership with the Google team to add different accelerator backends. So vLLM really started from that core problem of KV cache management and continuous batching, which are those fundamental abstractions that are needed to run a large language model performantly. And it's grown up and exploded in the number of features, model support, et cetera, as we've seen that open source ecosystem really blossoming over the course of the past two and a half years.
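The continuous batching Rob contrasts with dynamic batching can be illustrated with a toy simulation. The request lengths and batch size below are made up, and real engines schedule at much finer granularity.

```python
# Toy simulation: with continuous batching, a new request joins the
# running batch as soon as any slot frees up, instead of waiting for
# the whole batch to finish together.
from collections import deque

def continuous_batching(remaining_tokens, batch_size):
    """remaining_tokens: tokens left to generate per request, arrival order.
    Returns the number of decode steps until all requests finish."""
    waiting = deque(remaining_tokens)
    running = {}
    steps = 0
    next_id = 0
    while waiting or running:
        # Admit new requests into any free slots (the "continuous" part).
        while waiting and len(running) < batch_size:
            running[next_id] = waiting.popleft()
            next_id += 1
        # One decode step: every running request emits one token;
        # requests on their last token leave the batch.
        steps += 1
        running = {i: n - 1 for i, n in running.items() if n > 1}
    return steps

# One long request and three short ones, batch of 2: the short requests
# slot in and out while the long request keeps running.
print(continuous_batching([10, 2, 2, 2], batch_size=2))  # -> 10
```

With strict dynamic batching the same workload would run batch by batch (10 + 2 = 12 steps here), so the long request no longer stalls everything behind it.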
Clayton Coleman
That's a great point. I want to add something to what Rob said, and to your earlier question, Mofi, the original question as well: what's different? I really want to emphasize how much the ML focus has shifted from what's the platform that lets you bring many different types of models to production, which is what KServe and Seldon and TensorFlow Serving focused on, models with wildly different architectures, to a world where the model is more important. We'll still have multiple different models. There might be different sizes; you might be trying a couple of different model architectures from different model providers as each provider races to one-up the others with better performance. But you have many fewer. So I like to think that the focus we're trying to take now, what vLLM is really built around and where we're going, is a workload-centric view, which is that the model itself is a big, complex workload. It's a distributed system. And it started out simple, as all things do. We had replicas. A lot of what Rob described is adding pieces that complement that internal cache, that internal processing, pieces you wouldn't need if you had thousands of small models. But because you have one big model that's very large, larger than the hardware it runs on, and it's spread across all these machines, it becomes a distributed system. And that's really just a different approach, just like distributed databases. Anybody can run a really small Postgres database on their laptop, but a distributed database is a completely different beast. I think that's the real transition, or the other transition that's happened along with large language models: from a platform to deliver models, to the model as a workload. And these are big, massive, important workloads that will form the core of a whole host of satellite workloads that are consulting these models and using them to bring new capabilities.
Whether it's multimodality or reasoning or agentic workloads, the future is bright for calling models. What do we have to do to support that?
Rob Shaw
Absolutely. I think one other point I'll just add is that it's hard to overestimate the first L of LLMs, which is large. We're looking at an amount of compute that's ginormous in terms of the raw processing power that's needed to get reasonable throughput and latency out of your cluster. And a lot of the more complicated deployment patterns that we're talking about are fundamentally performance optimizations that are trying to reduce the overall amount of operational spend that's needed to support the models. It's really because the compute needed to serve these things is so big that we need more and more performance optimization at the inference server level, at the cluster level, at the model level, to really continue bringing down the cost of these overall systems. And I think the magic of what we're trying to do with llm-d is take all these performance optimizations that are complicated, hard, and require a lot of engineering sweat to make work, deal with an ML software stack that has lots and lots of pieces and is hard to work with at times, and really bring it into the operational model of Kubernetes, to try to make this easier for folks to run as they go into production with these architectures.
Mofi Rahman
That's a perfect tee-up to the next question I was going to ask: you just mentioned llm-d. So you have the Inference Gateway, which is handling the routing, using knowledge of the model itself to route better. You have vLLM, which is the model server. Paint me the picture of where llm-d fits in, or where llm-d comes in to help.
Rob Shaw
So with llm-d, I just want to emphasize that it's really the merging of the two communities, right? vLLM and Inference Gateway jointly driving requirements. And not just Gateway, but also other components in the Kubernetes ecosystem we've increasingly been working with, like LeaderWorkerSet. The idea is to drive common requirements across both key dependencies and key upstreams, and make these upstreams better and better at the well-lit paths that we're targeting. So the idea is to bring these two communities together and have Gateway drive requirements down to vLLM, and have vLLM drive requirements up into Gateway, into LeaderWorkerSet, and into other dependencies that we'll rely on to build common well-lit paths. And with these well-lit paths, what we're really trying to highlight is state-of-the-art ways to deploy common patterns. So right now, with the 0.2 release that we came out with, we have three well-lit paths that we're targeting. The first is intelligent inference scheduling, which is an example of a deployment pattern that we think everybody should use in every situation. It takes a lot of the existing, really amazing load balancing logic that comes out of Gateway, brings vLLM in, and provides a really common way that everyone should deploy every model with these techniques. And what we're starting to build with the newer well-lit paths is architectures for running more and more sophisticated deployments. An example of this is prefill/decode disaggregation. This is a technique that allows us to split the model server into two parts. We'll have one replica of vLLM that's a prefill server, which we've configured and optimized to do prefill requests. Typically this means using more replicas with less parallelism, because the collective operations that are needed to process prefills, which are a compute-bound operation, are quite heavy.
So in general you want to use less parallelism for the prefill workers, and then the decode workers will process the decode requests. In general we want to maximize the amount of KV cache space, since decode is a memory-bound operation, so we want to have as high a batch size as possible, and to do this we use more parallelism. This is an example of a technique where we've added a lot of features into vLLM: things like NIXL as the KV cache transfer library, and support for a protocol to tell vLLM that a request should be processed with disaggregated serving. And then on the Gateway side, we've had to implement extensions to do prefill/decode-aware scheduling. So we've developed a joint protocol where Gateway and vLLM are able to talk to each other in the right way, to build a whole system that allows prefill/decode disaggregation to be something folks can use when they go to deploy, and of course running on top of a Kubernetes cluster. This is a great example of: if we were just working in vLLM, we would have to write a proxy layer doing complicated scheduling logic to decide which prefill replica to use and which decode replica to use. But we're able to just push the requirements up into Gateway and down into vLLM, and we're able to use a real production load balancer and a real production proxy directly with vLLM under the hood. So I think this was a good example of ironing out the issues associated with running a P/D disaggregation scenario and bringing together the two components to make them work really performantly, and again getting this performance optimization, which is prefill/decode disaggregation. And then the third path that we have for the recent release is something called wide expert parallelism. This is an optimization that's targeted at these wide mixture of experts models.
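The prefill/decode split described above can be sketched as follows. The helper and field names here are hypothetical, and the real llm-d / Inference Gateway protocol and NIXL-based KV cache transfer are quite different; this only shows the shape of the flow.

```python
# Hedged sketch of prefill/decode disaggregation: prompt processed on a
# prefill replica, KV cache handed off, generation on a decode replica.
def serve_request(prompt_tokens, prefill_pool, decode_pool, max_new_tokens=4):
    # Prefill is compute-bound: prefer the least-loaded prefill replica.
    prefill = min(prefill_pool, key=lambda r: r["queue"])
    # Decode is memory-bound: prefer the replica with the most free KV cache.
    decode = max(decode_pool, key=lambda r: r["free_kv"])
    # 1. Prefill: one compute-heavy pass over the whole prompt builds the cache.
    kv_cache = {"tokens": list(prompt_tokens), "owner": prefill["name"]}
    # 2. Transfer: in practice a network copy of the KV cache (e.g. via a
    #    transfer library like NIXL); modeled here as a simple handoff.
    kv_cache["owner"] = decode["name"]
    # 3. Decode: iterative, one token per step, reusing the cached state.
    generated = []
    for _ in range(max_new_tokens):
        generated.append(f"tok{len(kv_cache['tokens'])}")  # stand-in for sampling
        kv_cache["tokens"].append(generated[-1])
    return decode["name"], generated

prefills = [{"name": "prefill-0", "queue": 3}, {"name": "prefill-1", "queue": 1}]
decodes = [{"name": "decode-0", "free_kv": 0.2}, {"name": "decode-1", "free_kv": 0.7}]
print(serve_request(["a", "b", "c"], prefills, decodes))
```

The design point is that the two pools can be sized and parallelized independently, exactly because their bottlenecks (compute vs. KV cache memory) differ.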
So DeepSeek, Kimi, Llama 4: these all have 128, 256 experts. They're huge models, hundreds of billions of parameters. And the idea is we want to deploy these in a multi-node setup. So we've added a lot of features to vLLM to do prefill/decode disaggregation, of course, but then things like data parallel attention with expert parallel MLP layers, integrating a lot of the key kernels that Perplexity has put out as well as DeepSeek has put out; DeepEP is one name, DeepGEMM is the name of the GEMM kernel. We've implemented all this stuff inside of vLLM, we've been working with the Gateway community to compose the existing load balancers that we have, and we've been working with LeaderWorkerSet to deploy these multi-node replicas of vLLM. We've encountered lots of issues doing this and have been ironing them out, and this is helping to really drive the requirements into LWS, drive the requirements into Gateway, and just provide a well-lit path. We validated that this all works, we've dealt with a lot of the issues of working with these things, and we're helping to push the requirements into the upstreams to make that whole path work really smoothly. So I think these are three examples of well-lit paths that we have now. Not every one of these is going to be used in every deployment, but we're trying to iron out how to run these more sophisticated deployment patterns and make that smoother. And then we're working on adding more of these. Of course, there are things like KV cache offloading to CPU RAM and then eventually remote storage, which is another more sophisticated pattern for deploying in a cluster; this is something we're working on. And we're also working on some things related to SLO-based scheduling and autoscaling, which are other well-lit paths that we're going to be bringing in over time.
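The wide expert parallelism pattern can be illustrated with a toy top-k router. The expert counts, scores, and placement below are assumptions for the sketch; real systems do this routing with fused GPU kernels and all-to-all communication (the part that libraries like DeepEP optimize), not Python loops.

```python
# Toy mixture-of-experts routing: a router scores all experts per token
# and dispatches each token to its top-k experts, which in wide expert
# parallelism live on different GPUs/nodes.
NUM_EXPERTS = 8
TOP_K = 2
EXPERTS_PER_GPU = 2  # 8 experts spread across 4 GPUs (assumed layout)

def top_k_experts(scores, k=TOP_K):
    # Pick the k highest-scoring experts for one token.
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def dispatch(token_scores):
    # Map each token to the GPUs that own its chosen experts; this
    # token-to-GPU fan-out is the all-to-all communication pattern.
    plan = {}
    for tok, scores in enumerate(token_scores):
        for e in top_k_experts(scores):
            plan.setdefault(e // EXPERTS_PER_GPU, []).append((tok, e))
    return plan

scores = [
    [0.1, 0.9, 0.0, 0.3, 0.2, 0.1, 0.8, 0.0],  # token 0 -> experts 1, 6
    [0.7, 0.1, 0.6, 0.0, 0.1, 0.2, 0.0, 0.1],  # token 1 -> experts 0, 2
]
print(dispatch(scores))
```

Because each token only activates a few experts, the work per token stays bounded even as the total expert count (and so the model size) grows across nodes.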
So that's the overall idea of the project: define these key user stories and deployment patterns, make them work really well, bring together upstream Kubernetes projects and vLLM, make them work really well together, iron it out, and provide references for folks on how to run these things, and over time try to identify the top 5 or 10 different paths that we think are useful ways to deploy LLMs. I went on for a long time. Clayton, do you have anything to add?
Clayton Coleman
Well, no, and every time I hear it said back, I pick up something new and it helps me think about things a different way. So one of the things Rob and I agreed on really early in this is that there are two hard problems in this space right now. One of those is something Rob's really familiar with, which is that everybody has all of these tricks and techniques, but they're all scattered. This ecosystem, large language models, generative AI, everything's going so fast that a lot of the ecosystem is people making individual tweaks and learnings, and they're able to get the optimizations they need, and it stops there. So they take vLLM, maybe they have a patch or two, it gets them up and going, and they want to leave it there, because they're startups or they're rapidly moving AI natives and speed matters. And so some of that extra hard work, which is taking all this and bringing it back together, wasn't happening. That's something that Rob and the Red Hat team and the IBM team and the larger vLLM community are really interested in: pulling this back together. Those well-lit paths, in some senses, are classic OSS: all of us are more powerful than each of us. And making it easier for people to anchor on those paths makes contribution easier. If we can go out there and take a look at the 10 or 15 different ways people have done prefill/decode disaggregation, we can apply some judgment and say it works in these scenarios or those scenarios, and we can bring that expertise back. We're not necessarily the ones driving it, but what we are doing is curating and pulling it together. But a second part of this, I think, is another ecosystem thing that I learned very early in Kubernetes. At the end of the day, to a lot of people this is just something that helps them get their job done.
For the last 10 years, most of the people who were deploying Kubernetes in anger were building platforms: platform engineering teams supporting lots of workloads. Kubernetes is not perfect by any means. It was just better than writing your own. You could write a better one, and I encourage everyone to go out there and write better orchestration systems. The reality is it wasn't the core of the business. And so where Kubernetes was really successful was about: you don't have to be perfect, you have to be useful and usable, and help people focus on the problems they actually want to focus on. People didn't want to go write for-loops that recovered services when a node crashed. So coming into this effort, there's something we can really do beyond helping create those well-lit paths and bringing ecosystem optimizations together so that all of us get the benefit of them, with a nice tight release pipeline that everybody can depend on across vLLM, Inference Gateway and Kubernetes, where the stuff just works and keeps working better over time. The other thing is defining the APIs between these components. So Rob mentioned prefill and decode. It's been really difficult; there are lots of different approaches. But a very common refrain I hear is that we're pushing massive amounts of data. A thousand-token prompt, let's say, might generate on the order of 100 megabytes to 10 gigabytes of data that you have to push across the network. Most people's microservices are not pushing 10 gigabytes of data around in a throughput-oriented, relatively latency-dependent setting. We've got these new fast networks, but it's a challenging problem.
And so some of what we can do, by coming in and looking at the operational patterns and looking at what people have done, is apply a little bit of thumb on the scales and say this pattern works for this use case and that pattern works for that use case. But if all of us are going to do it, what are the one or two paths that we can focus on? Some of this is just opinion: we're coming in with an opinion and we're saying we think this will operationally scale. Some of that comes from our own experiences at Google; we have a lot of people who've been doing things like this internally who provide feedback and guidance on patterns, just like for Kubernetes there were a lot of folks inside Google who'd had experience running containers at scale. So what we're trying to do is get some really good opinions around the APIs between these different components. And neither Rob nor I view this as something we have to win at everybody's expense. In fact, what we'd rather have happen is that everybody in the ecosystem converges, because there are other model servers, other load balancing approaches, and other ways to orchestrate. What can we do to bring those APIs together, show something working, clearly articulate why, and see if we can build a center of gravity? And that excites me. What I'd like to see is that we can lock away some of that complexity and make stuff easier, not just for folks today, but for folks three, four years down the road.
Mofi Rahman
I want to drill down on one of the things you mentioned there, Clayton. A lot of the teams that are startups, or are crunched for time, or are trying to move very fast, may be taking this open source vLLM or llm-d or whatever the guidance is and tweaking one or two things, one of the optimizations that works for them. What would you tell those teams? What is the reason they'd want to be at the table, bringing those optimizations that work for them but making them more general, so that it lifts the entire industry up?
Clayton Coleman
I think this is pure self-interest, which is that open source works best when everybody gets something. There are a lot of obstacles to contributing back. There are kind of two models: the deep model and the broad model. The broad model is the lots-of-eyes one. There is no better distributed bug-catching mechanism than a whole bunch of programmers just hacking on stuff, and that is happening now. There are lots of little things that break, and honestly, the ML ecosystem is fundamentally dependent on taking very sophisticated algorithms, breaking them apart using highly performance-optimized libraries, gluing them together, and then not touching them. You don't want them to break, and of course that leads to lots of subtle breakages. So the incentive that I think we'd be looking for is: you can go down this well-lit path, you could easily fork it and get your contributions in, you could take it and have those patches that work around things. But instead of so much of what you're adding being pretty bespoke to your environment, you're working off of a path where not just the one piece, like a vLLM patch, works, but also some of the tunables for how prefill/decode works, and some of the future things, like EP. New models are going to come out; they're going to change the required mix of parameters and libraries and algorithms and tuning. The more that we can concentrate attention, the fewer patches you have to carry. And then when you've got something working, centralizing some of these flows makes reporting, debugging and verifying those issues easier. Benchmarking is a fundamental part of this, and it's just hard to benchmark a small stream of fixes to a wide range of configs.
What we can do with the vLLM community and with Inference Gateway and the larger ecosystem is build some of the tooling that's going to make it easier to do performance regression testing and try out these scenarios, even just that really boring work that'll make the job of those startups easier. They can pick this foundation, and instead of it being a bunch of pieces they assemble, it's a smaller set of pieces they have to assemble. Rob, I don't know if I did justice to that overview; I think you have a much deeper connection to that mindset.
Rob Shaw
Yeah. The other thing I think is also important is that in llm-d we're not taking forks of things; we're using and driving these things into the upstreams directly. And I think this is really important, because the pace at which things are improving and changing in the ML ecosystem is absolutely breakneck. I always laugh to myself: we spent all of 2024 optimizing the Llama 3 architecture, and a lot of that work is just completely irrelevant for how to serve DeepSeek. So if you forked a system in December 2024, we're six months later, and now you want to run DeepSeek and you didn't get all those changes; you have to implement all that yourself. And now we've gone through this whole effort to add wide EP and prefill/decode disaggregation, and just this week we've seen a new flavor of disaggregation from ByteDance with their MegaScale-Infer, which does attention/feed-forward-network disaggregation, and I'm sure there's going to be a bunch of work to make that all happen in vLLM. So one of the things I think is important about staying with these upstreams and working directly is that you're going to benefit from all the progress that's happening. We're not yet even close, in my opinion, to the point where architectures stop improving. We're still going to continue to see evolution in the architectures. There's still a lot of interesting research being done by academia and labs, et cetera, that we're going to have to pull into these systems. And the more that we can push these into the upstreams and make sure they're working together, the more everyone can benefit. So that's another key piece: vLLM is not static yet. It's going to continue to evolve rapidly to support these new techniques.
And as folks fork and run things themselves, they run the risk of having to reimplement all those things or deal with constant rebases, et cetera. That's, I think, another piece of context for the value of the way we're going about this development process.
Mofir Adan
Yeah, so for those of you listening right now who are interested in getting involved, we're going to put links in the show notes to the LLMD community, to vLLM, and to Inference Gateway; if you have use cases, there'll be links for you to join. But here's the thing I want to ask both of you. At the breakneck speed Rob mentioned, with things moving and changing, guessing what's going to happen in the future is a difficult task. But I'm going to ask both of you to put your speculation hat on for a minute and try to imagine a world five years from now, ten years from now. Again, five years sounds like such a foreign idea in a world that moves so fast. But let's say five years from now, in your mind, either in an ideal case or however you want to think about it: what does serving AI models look like in the world of Kubernetes? With the work you're doing now, where do you want to get the world in the next five years, and what would that look like in your ideal version?
Rob Shah
It's definitely a difficult question. I think a lot of what we're doing in LLMD is very much a transformer-centric set of optimizations. We've been talking a lot about KV caching, and about techniques like prefill/decode disaggregation, KV cache offloading, and prefix-cache-aware routing. All of these take the view of: how do we best route requests, manage this KV cache, and exploit the fact that there is a KV cache in the cluster? A lot of the optimizations really come down to the fact that KV cache state management is fundamental to the problem. So if the transformer architecture continues to be the frontier of model architectures, I think KV cache management will still be fundamental, and potentially pushed to even further extremes. The other thing we haven't really talked about as much today is the multimodality of the models. I think we'll see more of this in the future, and I would be shocked if there's not disaggregation associated with splitting those models up into smaller services. So those are, I think, two overall thoughts about how things could really transform. If the transformer architecture continues to dominate, we'll keep seeing optimizations around managing that KV cache pushed further and further, since it's so fundamental to the problem. To the extent that the architectures change, I think we'll see something that looks quite different from what LLMD and vLLM look like today, and we'll of course have to take those new techniques and bring them into the systems we're leveraging. So probably not the best answer, but that's at least a little bit of how I think about it as a model server: we really take the inputs from the models themselves and try to do our best to serve them.
So it's definitely something I have my eye on: how the architectures are changing. We've seen some things over the past couple of years, like Mamba, or diffusion models, which are potential alternative architectures that could really change how best to serve these models if they become more standard. So this is an area where we'll still see a lot of experimentation from model vendors, and we'll have to make sure we follow up with them as new architectures come out. But yeah, Clayton, go ahead.
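For readers who want a concrete picture, the prefix-cache-aware routing Rob describes can be sketched roughly as follows. This is a simplified illustration, not LLMD's actual scheduler: the replica names, the string-based cache representation, and the tie-breaking rule are all assumptions made up for this example (real systems score on token-level prefix caches and richer load signals).

```python
# Hedged sketch of prefix-cache-aware routing: prefer the replica whose
# KV cache already holds the longest prefix of the incoming prompt, so
# that prefill work for the shared prefix can be skipped.

def longest_cached_prefix(prompt: str, cached_prefixes: set[str]) -> int:
    """Length of the longest cached prefix the prompt starts with."""
    best = 0
    for prefix in cached_prefixes:
        if prompt.startswith(prefix):
            best = max(best, len(prefix))
    return best

def pick_replica(prompt: str, replicas: dict[str, dict]) -> str:
    """Score each replica by cached-prefix length, breaking ties by
    fewer in-flight requests; highest score wins."""
    def score(name: str) -> tuple[int, int]:
        info = replicas[name]
        return (longest_cached_prefix(prompt, info["prefixes"]),
                -info["inflight"])
    return max(replicas, key=score)

# Hypothetical cluster state: pod-a has a warm cache, pod-b is idle.
replicas = {
    "pod-a": {"prefixes": {"You are a helpful assistant."}, "inflight": 3},
    "pod-b": {"prefixes": set(), "inflight": 1},
}
print(pick_replica("You are a helpful assistant. Summarize this.", replicas))  # → pod-a
```

Note the trade-off the scoring tuple encodes: a warm cache outweighs a moderately higher load, because avoiding redundant prefill usually saves more GPU time than queueing costs.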
Clayton Coleman
Rob, it's great to hear you say that, because I'm going to go even broader. And, as always, I'm conscious that I'm only human and that I might get some of this wrong. My guess would be that five years from now, the best and most important models are going to be a mix of open and closed innovation, but I think they're going to tilt towards open. A couple of years ago, people were a little skeptical that there was going to be any space for open source models. All of those people were fundamentally wrong. I was saying the other day that I'm excited, because my guess is that the state of the art in OSS for running models efficiently is probably pretty close to the state of the art at scale. I don't have any deep knowledge other than just reading the tea leaves, but never underestimate a whole bunch of people optimizing their hardware to get the best performance out of it. If there's anything I've seen, it's that when there's money on the line and you can make something cheaper, people will: making inference cheaper by optimizing it is going to be a trend, and it's going to mostly happen in the open. There will be closed elements to some of these models, and people will continue to come up with new algorithms. But I'm pretty optimistic that the open ecosystem is going to run models, and not only that, it's a technology we're all going to have access to, because there are more people interested in contributing, writing papers, and starting their own companies who arrive with techniques that are no longer kept behind closed doors. So I think it's going to be a very big open ecosystem. And to match that, I think we're going to see hardware change the way people build servers for microservices. I think we're going to start seeing that the needs of the very large models create some differences in how we think about what machines look like and how they're interconnected.
Faster networks between machines, more parallelism. That's going to need people writing software that optimizes all of that, and it's going to need orchestration that distributes it across many machines. So I'm pretty confident that Rob and I have a long ramp of features and capabilities ahead. The future is open, and it's going to be built on top of Kubernetes and the evolution of both Kubernetes and vLLM and all the other technologies in the ecosystem. I think five years from now you'll still recognize some of the same elements of the world of today, and the ones that have changed are going to be the scale and how much value we get out of it. So I'm pretty excited.
Rob Shah
The other thing Clayton and I were talking about over the weekend is this trend towards agentic applications, which obviously is a huge buzzword, but in general, LLM systems are becoming compound, with many pieces, whether it's tools or other smaller models that do subtasks, et cetera. I think we'll continue to see agentic applications emerge as users try to customize these mega centralized models to their specific use case with their own enterprise or custom data, or with tools and other capabilities. As compound AI systems emerge, we'll need to evolve the LLMD and model server roadmaps to make sure we can work in those application patterns as well. How the models fit into a broader AI application is somewhat orthogonal to the models themselves, and there's a really robust ecosystem emerging and experimenting along those lines. So we'll be seeking to collaborate with those types of developers over time to make sure our components fit into those architectures as well as possible.
Mofir Adan
It's not often I get to quote one of the guests during the interview, so I'm going to take that opportunity now. Last year, Clayton, you had a quote in your slides where you said inference is the new web app. And this year, I think there's a revised version that says agents are the new web app. So we have that being echoed here as well.
Clayton Coleman
Absolutely. And the future is big and complex and awesome, so there's much more exciting stuff to come.
Mofir Adan
Thank you both for taking the time to talk with me about vLLM and LLMD. Is there anything you want our listeners to take from this conversation as a last thought?
Clayton Coleman
It is never too early or too late to learn about ML. Two years ago I was a novice, and now I get to hang out with really smart people like Rob, who continue to amaze me with the depth of complexity in this ecosystem. Don't be daunted, don't be intimidated. Give it a try. Learn, come help, participate, contribute back. That's all we need.
Rob Shah
Yeah, and I'll just say it's got to be the most fun place to be working right now. The pace, the amount of innovation, the speed at which research moves from a paper into a real production system is so fast that you really feel like you're on the bleeding edge working in this area. So yeah, we're really excited to have this community, and we're developing LLMD in public. Every meeting is open, we've got an open Slack, so please feel free to jump in, tell us your requirements, get involved. We'd love to see you.
Mofir Adan
Thank you so much. It was wonderful. Thank you for taking the time.
Rob Shah
Thank you. Thank you.
Kaslin Fields
Thank you, Mofir, Clayton, and Rob for that interview. I have been very interested in LLMD and the Inference Gateway. It's something a lot of the folks working in open source Kubernetes have been telling me about as a way to make the AI workloads running on Kubernetes clusters more efficient. I haven't learned as much as I'd like about it yet, so I was very excited about this interview. What were your top takeaways, Mofir?
Mofir Adan
I think the top takeaways are about LLMD itself. When I first heard about it, it seemed like, oh great, another open source project that's going to do a bunch of things and try to become one more standard in the list of standards. But what's interesting is that instead of trying to make LLMD its own software stack, it utilizes existing projects that are quite good at what they do and finds the gaps. LLMD uses Inference Gateway and vLLM quite heavily underneath, builds a bunch of things on top, and tries to create a "well-lit path," as Rob and Clayton would call it, giving people a way to run production-grade inference applications on Kubernetes. Where this is slightly different from some past attempts is that the folks working on LLMD are the same people who contribute heavily to vLLM and Inference Gateway. Which means that when LLMD finds a gap in the inference stack, they can go back into the projects they rely on and make those changes upstream, building the entire stack up, rather than saying, oh, we rely on this other project that doesn't do what we need, so now we have to build it ourselves or wait for them to build it. Having the same people with an eye-level view of the whole stack, plus access to the individual projects, makes that possible.
Kaslin Fields
So what LLMD really provides here is an open source tool for describing some of those best practices and some of the tools that exist in the open source space for folks running inference workloads. It's interesting to see these things all come together in one package.
Mofir Adan
Yeah, this is one of the questions I put to Clayton and Rob, and the reason I asked is that every company, every team, every mid-size or large startup that's now serving open models like Gemma, Llama, Qwen, or DeepSeek will, over time, build up techniques to optimize for cost, performance, or accuracy. So what benefit does a team get by spending the time to bring those techniques out into the open and tell others about them? The answer Clayton and Rob gave matches what I also think about open source: it's not necessarily something you're doing for free. It's self-preserving in some ways, because sharing your ideas with others helps you refine them, and it also gives you access to a huge mind share of people doing similar but different optimizations that you can learn from. By building out in the open and building together, you can build much more than your individual teams ever could. The other thing, and a quote that was used a lot, is that the pace of innovation is breakneck: the innovation happening in the last two years is going to improve the quality of life for any AI workload in the future, even beyond language models. If you're running reinforcement learning or fine-tuning pipelines, all of these things are seeing improvements, because there's so much more investment and engineering time being spent, more people trying things, more people contributing.
Kaslin Fields
It took me until this very moment to realize that LLMD is probably a play on systemd, isn't it?
Clayton Coleman
Maybe.
Mofir Adan
I think another quote they used is that LLMs are computers in themselves, so the name would make sense: systemd is like the engine, or the brain, of your computer's processes, and LLMD wants to be that integral part of your LLM serving. Another question I had for them touches on something I've been chatting with Clayton about for a long time. Clayton had a quote a couple of years ago that inference is the new web app. Earlier this year he updated it to say agents are the new web app. That quote, I think, has taken on a life of its own and is going to evolve and upgrade over time. In some ways we're looking at a different type of application, and at the same time a different type of engineering needed to run that application. When Kubernetes first came out, a lot of work was done to optimize for web apps, and then a lot of work was done to optimize for stateful applications. It's the same thing at a different scale: we're doing all this optimization to make sure Kubernetes is a good fit for large language models. It seems like a lot, but at the same time it's not a net new thing. We've been doing this over the last ten years of Kubernetes: watching what the industry and the people using Kubernetes are using it for, and optimizing the underlying engine to make sure they have a good time.
Kaslin Fields
I always like to bring it back to Kubernetes as a platform for running distributed systems, which I think maybe some people think that's a little reductive sometimes. But the point there is that you have all of this hardware and you need to do things with it. And that's even more true than ever in the world of AI, where the hardware accelerators are really at the core of being able to do exciting things with the technology.
Clayton Coleman
Yeah.
Mofir Adan
So I think the last part of the takeaway is about the projects themselves. Again, two years can be a really long time in the world of AI, but it's also a fairly short time. We're still in the phase where a bunch of people are trying a bunch of different things, so a lot of new projects and standards are being created all over the world. But I feel like in the next few months to two years, and time is at this point no longer a real thing anymore, so in a few months to a.
Kaslin Fields
Few years, for so many reasons, yes.
Mofir Adan
A few months to a few years, we should start seeing, not necessarily consolidation, but more people learning from each other, where projects learn from each other to build out functionality that's similar.
Kaslin Fields
It's very interesting to hear the variety of features in vLLM. When I first started working with LLMs, vLLM was something that got in my way, because I was trying to run different models, and the way I would interact with them differed based on whether they were using vLLM or the Hugging Face one, whatever it was.
Mofir Adan
That's TGI.
Kaslin Fields
Yeah, TGI. And so that was one of the first things I really wanted to dive into, because it was causing me a lot of trouble. But hearing vLLM talked about not just as the thing that was causing me trouble, but as an open source project having a fundamental impact on the way we run LLMs, really helps me see the open source ecosystem developing around LLMs.
Mofir Adan
Yeah, even there, right? About a year and change ago, vLLM had its own API spec, but then it standardized on the OpenAI spec, which most of the industry has standardized on. Now you can serve something in vLLM, in TGI, or in NVIDIA's NeMo framework, and all of them can provide an OpenAI-compatible API. That means your application doesn't have to know what the underlying serving engine is, which gives you the abstraction needed to consume a model served via vLLM the same way you'd consume OpenAI's models as a service. Gemini provides an OpenAI-compatible API too. So now you can separate the application layer from the underlying model layer. These are the kinds of innovations happening, not necessarily the topic of the conversation at hand, but so many different things are fitting together to give you nice abstraction layers at every step, so you can build the things you want to build.
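To make that portability point concrete: with an OpenAI-compatible API, the request is identical no matter which engine sits behind the URL. The sketch below builds a chat-completion request by hand; the service URLs and model name are placeholders invented for this example, not real endpoints.

```python
import json

# Sketch of engine-agnostic serving via the OpenAI-compatible API:
# the request body and path are the same whether the base URL points
# at vLLM, TGI, NeMo, or a hosted provider. Only the base URL changes.

def chat_request(base_url: str, model: str, user_msg: str) -> tuple[str, str]:
    """Return (endpoint URL, JSON request body) for a chat completion."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })
    return f"{base_url}/v1/chat/completions", body

# Swapping serving engines means changing only the base URL; the
# application code that builds and sends requests is untouched.
url_vllm, body = chat_request("http://vllm-svc:8000", "my-model", "hi")
url_tgi, _ = chat_request("http://tgi-svc:8080", "my-model", "hi")
print(url_vllm)
```

In practice an OpenAI client library would be pointed at the engine's base URL the same way, which is exactly the decoupling of application layer from model layer described above.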
Kaslin Fields
As was said in the interview, the focus has shifted from software development to resource usage and scale.
Mofir Adan
It's not even that big of a shift, but think about where the money is: what is most expensive in your stack? Before, writing software, meaning the cost of engineering, was the expensive part; now running the application has become very expensive again, because GPUs and TPUs are costly resources. So you have to look at where you can optimize and where to spend more of your time. I feel the interview itself was packed with valuable information, so I'd ask people to listen to it again, pay attention, and take notes, because Rob and Clayton gave us so many technical details of how things work, and I will definitely go back and listen multiple times to absorb everything. And the call-out to everybody else, as in the interview itself: if you're serving LLM applications and you're either struggling with optimization or have found ways to optimize your stack, bring it out into the open and talk to the community. Talk about whether LLMD's well-lit path works for you, and if it doesn't, why not; maybe we'll find some interesting use cases the team isn't thinking about. So if you're interested in LLMs, serving LLMs, or learning more, the LLMD community, which we'll link in the show notes, is a great place to find other people thinking about the same problems.
Kaslin Fields
Yeah, get involved with the open ecosystem that's trying to help folks understand how to do these things. I feel like this was a buy-one-get-many kind of episode; there are so many different topics we talked about. Thank you very much, Mofir.
Mofir Adan
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
Released: August 20, 2025
Hosts: Abdel Sghiouar (absent), Kaslin Fields, Mofir Adan (guest host)
Guests: Clayton Coleman (OpenShift/Kubernetes core contributor), Rob Shah (Director of Engineering at Red Hat, vLLM contributor)
This episode examines how running Large Language Model (LLM) workloads on Kubernetes is fundamentally different from hosting conventional applications, and introduces LLMD, a new open-source project unifying best practices for LLM serving. Guests Clayton Coleman and Rob Shah dive deep into the technical, operational, and community-collaboration aspects of scaling LLMs in the cloud-native world.
“What was really interesting with large language models is it shifted the problem space from being one of software development to one of being resource usage and scale... it stopped looking like a traditional microservice.”
“LLMs are autoregressive… every token gets generated with another pass through the model. Traditional predictive apps are stateless; LLMs are not.”
“The idea is to bring these two communities together… have Gateway drive requirements down to vLLM and have vLLM drive requirements up into Gateway… highlight state-of-the-art ways to deploy common patterns.”
“Open source works best when everybody gets something. ... The incentive that I think we'd be looking for is you can go down this well lit path... you're working off of a path where not just the one piece ... works, but some of the tunables ... and new algorithms and tuning.”
“In LLMD we're not taking forks... we're using and driving these things into the upstream directly. ... The pace at which things are improving and changing in the ML ecosystem is absolutely breakneck.”
“Five years from now, the best and most important models are going to be a mix of open and closed innovation, but I think they're going to tilt towards open.”
“Agentic applications are emerging, as users try to customize the model to their use case through these mega models... with their own enterprise or custom data. ... We'll need to evolve the LLMD and model server roadmaps to work in those application patterns.”
This episode is a comprehensive technical and pragmatic look at what it takes to run LLMs at scale on Kubernetes, emphasizing open-source collaboration, the shift to workload-centric AI infrastructure, and the ongoing rapid evolution of both models and the software ecosystem. LLMD emerges as a central, community-driven reference project for best practices in LLM serving, pushing the field toward greater openness, standardization, and shared success.
For Further Information: