
Abdel Sghiouar
Hi and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.
Kaslin Fields
And I'm Kaslin Fields.
Abdel Sghiouar
In this episode we spoke to Yuan Tang and Eduardo Arango, organizers of Working Group Serving within the Kubernetes project. We spoke about this newly formed working group trying to solve serving for AI and ML workloads, the challenges they are trying to tackle, and what the future looks like.
Kaslin Fields
But first, let's get to the news. Docker launched their official Terraform provider. The provider can be used to manage Docker-hosted resources like repositories, teams, organization settings, and more.
Abdel Sghiouar
Tetrate and Bloomberg started an open collaboration to bring AI gateway features to the Envoy project. This effort is focused on building gateways capable of handling AI traffic. Specifically, the first set of features will focus on usage limiting based on input and output tokens (unlike traditional rate limiting for HTTP apps), API uniformity, and upstream authorization to LLM providers. The community is looking for ideas of features to build. You will find a link in the show notes with details.
Kaslin Fields
The CNCF is hosting a laptop drive at KubeCon + CloudNativeCon North America 2024 to benefit two nonprofit organizations, Black Girls Code and Kids on Computers. If you're interested in donating a laptop, you'll need to make sure the device meets the requirements and you'll need to fill out a form. A link to more information is available in this episode's show notes.
Abdel Sghiouar
There are four remaining Kubernetes Community Days events going on around the globe. Denmark, Accra in Ghana, Indonesia, and Floripa in Brazil are all hosting KCDs before the end of 2024. The Accra event is happening virtually. If you are able and interested in attending, make sure you check out these events to support local Kubernetes communities. You can check out the list of upcoming events at community.cncf.io. And that's the news.

Hello everyone. Today we are talking to Yuan and Eduardo. Yuan is a Principal Software Engineer at Red Hat working on OpenShift AI. Previously he has led AI infrastructure and platform teams at various companies. He holds leadership positions in open source projects including Argo, Kubeflow, and Kubernetes Working Group Serving. Yuan has authored three technical books and is a regular conference speaker, technical advisor, and leader at various organizations. Eduardo is an environmental engineer derailed into a software engineer. I'm going to have to ask you questions about that. Eduardo has been working on making containerized environments the de facto solution for high performance computing (HPC) for over eight years now. He began as a core contributor to the niche Singularity containers project, today known as Apptainer under the Linux Foundation. In 2019, Eduardo moved up the ladder to work on making Kubernetes better for performance-oriented applications. Nowadays Eduardo works at NVIDIA on the core cloud native team, working on enabling specialized accelerators for Kubernetes workloads. Welcome to the show, Eduardo and Yuan.
Eduardo Arango
Glad to be here, and super excited to be next to Yuan. He's a role model for everyone. Like, three books. Wow.
Yuan Tang
Thank you Abdel for inviting us. It's a pleasure to be here.
Abdel Sghiouar
Awesome. Thank you for being on the show. So we are here to talk about Working Group Serving, which is actually a working group that I learned about at KubeCon Paris this year from Clayton. I was chatting with Clayton and he was telling me about the new working groups that were created. One is called Serving and one is called Device Management, and we decided we had to talk to the people who are behind this. So let's start with the obvious question: what is Working Group Serving?
Yuan Tang
So basically I can share a little bit about the introduction, the creation of the Working Group Serving. Clayton Coleman and I had a discussion at KubeCon Europe this year and we had really good discussions around model serving. We talked about some of the unsolved challenges, KServe pain points, and limitations of the current Kubernetes APIs, especially for model serving use cases. For example, the KServe community developed the modelcar feature to pull the model from an OCI image. This reduces startup time and also allows for advanced techniques like prefetching images and lazy loading, which makes the auto scaling of large models more efficient. We also discussed how other ecosystem projects in the model serving space, like Kaito and Ray, are trying to come up with workarounds for similar challenges. There was really a lot of interest in the community to propose better primitives and better foundational pieces for model serving workloads that can benefit the broader Kubernetes and cloud native ecosystem, especially for hardware-accelerated workloads. So after KubeCon Europe we officially established this working group, and I joined it as a co-chair representing Red Hat and other relevant communities that I'm involved in, like KServe and Kubeflow.
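To make the modelcar idea concrete, here is a minimal sketch of a KServe InferenceService that pulls its model from an OCI image rather than object storage. The registry path, model format, and resource numbers are placeholders, and modelcar support has to be enabled in your KServe installation:

```yaml
# Hedged sketch: an InferenceService whose weights come from an OCI image
# ("modelcar"). Because the model is an image, node-level image caching and
# prefetching apply to the weights too, which helps cold starts and scaling.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-llm                                    # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                           # assumed model format/runtime
      storageUri: oci://registry.example.com/models/demo-llm:v1
      resources:
        limits:
          nvidia.com/gpu: "1"
```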
Eduardo Arango
Yeah, I think I can add to that. Everything started at Paris, as you both already mentioned, and goes back to the same name, Clayton. During the Contributor Summit, right? That is like a small room at KubeCon where all the contributors gather. During the conference sessions, like in the afternoon, everything was around AI inferencing. So we spent like one or two hours speaking about it, and at some point Clayton said, hey, I had a couple of conversations with Yuan from KServe and from Kubeflow; we should have a working group there. We kind of had a conversation about what would be the main differences between the Working Group Serving and the Working Group Batch that already existed. Right. So we needed to kind of differentiate what the role of Working Group Batch would be versus the role of Working Group Serving. And also, knowing that there is a CNCF cloud native AI working group, we have a lot of parallel working groups in the CNCF itself and in the Kubernetes communities, so that difference had to be well defined. And we decided that two weeks after KubeCon, giving everyone time to rest, we would start the whole process to create the working group. That's a quick history of how it got created. I think it all goes back to KubeCon Paris for sure.
Abdel Sghiouar
Got it. Before we move to the next question, I actually have a statement, and I think I'm talking to experts who can correct me. I find that a lot of times, when you are on stage doing a talk or you are creating content and you want to address this topic of serving AI models, sometimes you would alternatively use either the term serving or the term inference. Right. And one kind of dumb way I've been explaining this to people, so that they can wrap their heads around it, is you can think of serving or inference as the piece of software that does exactly what a web server does for a website. Like, to host a website you need a web server and the content that would be "served," quote unquote, by the web server. So I'm wondering, is that a close kind of comparison, or a close way to explain it? What are your thoughts?
Yuan Tang
It's a tough question.
Eduardo Arango
Maybe I can start defining the concepts and Yuan can take it from there.
Abdel Sghiouar
Sure.
Eduardo Arango
So to me, there are two main workloads in the entire AI ecosystem. One is training and the other one is inferencing. Training involves a lot of distributed jobs, meaning the MPI Operator that we know in Kubernetes is very important for that. During training is where you take a lot of data and you create what we call a model. Then inferencing is using that model to detect the patterns it was trained to detect and provide an output. That's the whole inferencing concept. But then the serving word, I think, comes from this: now we have this model that was trained, and we need to provide a full infrastructure to have it always listening to what we now call prompts, and to be able to scale it out. So I don't know, Yuan, would you add something to that?
Yuan Tang
Yeah, I think before the term AI became popular, or GenAI, we were talking about machine learning and statistics, and everything was basically predictive. Right? At that time it was very easy to explain training versus prediction. Now, with serving and inference, and with the GenAI term coming in, it's really difficult to describe the differences. But in general we talk about serving requests, no matter if it's a model, or even database requests, or regular prompt kinds of requests. So I think there's no strong differentiation, in my opinion.
Abdel Sghiouar
Got it, got it. All right, so then what is Working Group Serving trying to achieve? What is the mission? Why does it exist?
Yuan Tang
Generative AI has really introduced a lot of challenges and complexity in model serving or inference systems, and we really haven't seen many of those challenges in traditional ML systems. To meet these new demands and address those new challenges, this working group is dedicated to enhancing serving workloads on Kubernetes, with a special focus on hardware-accelerated AI or ML inference. In my opinion, it's very important to address the need and to come up with optimized solutions to handle compute-intensive inference tasks using those specialized accelerators. We hope that all the improvements we make in this working group will also benefit other serving workloads like web services or stateful applications. And any new primitives coming out of this working group could also be reused and composed with other ecosystem projects like KServe, Kaito, and Ray. The mission of this Working Group Serving is really to advance the capabilities and efficiency of serving on Kubernetes, to make sure it is well equipped to handle the evolving requirements of generative AI and maybe future serving workloads. We are operating within the Kubernetes community and governed by the CNCF code of conduct, which provides us a neutral place to work on necessary initiatives. With leadership from four organizations, namely Red Hat, Google, NVIDIA, and ByteDance, and a lot of participating companies from the community, we would also like to invite others from the community to join us and share your use cases, so that we can solve the serving challenges holistically.
Eduardo Arango
Yeah, from my point of view, I will define it as goals. There are three main goals for the working group. The first one is enhancing the Kubernetes workload controllers. Basically, as Yuan mentioned, there are many companies joining these meetings, and the idea is to provide recommendations and better patterns for improving Kubernetes workloads and controllers. Companies right now are building operators to handle specific workloads, and the recommendations coming out of the working group will enhance performance in popular inference serving frameworks. Right? So right now the working group has a GitHub repo where we are collecting blueprints from the entire community, to say, hey, this is how we run this model, and oh, this is how I run that model, so people can compare and improve their workloads at their companies. We are also investigating orchestration and scalability solutions. A lot of the Working Group Serving meetings have been spent around what we should measure for auto scaling, right? Like, should we measure GPU consumption, or should we measure prompt size? This all falls into the category of researching or investigating which are the key metrics we should monitor as a community, to then build better orchestration, scaling, and load balancing ideas and projects. And speaking about load balancing, for those joining the meetings, you will notice that we are talking about a new project, the LLM gateways, and this is something related to load balancing. So we are investigating how to enhance workloads overall for LLM serving, and that would be the second goal. The third goal would be to optimize resource sharing. This kind of ties the Working Group Serving to the Working Group Device Management. We want to have good communication between the two working groups, mostly because of the exciting new feature of Kubernetes that is DRA, that everyone is talking about. We want to create a list of needs, of things that are not possible right now in Kubernetes via the regular device plugin, and hand that over to the Device Management working group and tell them, hey, we need new features, can you please prioritize them? Because they are coming from the Working Group Serving. So that would be the third goal.
Abdel Sghiouar
Got it. So I do have a follow-up question, and feel free, if you don't have an answer, it's fine, we can skip it. Can you folks give us some examples of concrete limitations in Kubernetes that Working Group Serving is trying to solve? Just one or two examples, if you can think of something.
Eduardo Arango
I can start with one, and that is what I just said about DRA: right now, defining multi-GPU, multi-node workloads in Kubernetes is almost impossible. DRA is going to provide solutions for that, but it's not ready yet. Right? Doing that with device plugins, with the tools we have today at hand, would be the MPI Operator, the LeaderWorkerSet, and device plugins; joining those three, you will get close to it. But for the workloads that are coming, the new models are so big that they need to be run on multiple nodes. We need the features that are being promised by DRA.
Abdel Sghiouar
Okay. And just for the audience to know, when Eduardo was talking about DRA, that stands for dynamic resource allocation. It's a whole new set of features coming into Kubernetes, some of which are being implemented and some of which are not yet, so that's just for people to understand what they are. But Yuan, you had something to add there.
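For readers who haven't seen DRA yet, the rough shape of the API being discussed looks something like the sketch below. The API group was still alpha and has changed across releases (this uses the resource.k8s.io/v1alpha3 flavor from the Kubernetes 1.31 era), and the device class and image names are hypothetical:

```yaml
# Hedged sketch: a pod asking DRA for exactly four devices of an assumed
# device class, instead of counting extended resources like nvidia.com/gpu.
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: four-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.example.com     # hypothetical DeviceClass
        allocationMode: ExactCount
        count: 4
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: four-gpus     # one claim per pod, from the template
  containers:
  - name: server
    image: registry.example.com/llm-server:v1   # hypothetical image
    resources:
      claims:
      - name: gpus                           # container consumes the allocated devices
```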
Yuan Tang
Yeah. So there are a lot of challenges we cover in different work streams; I can talk about that later. But for example, for auto scaling: auto scaling on device utilization or memory is really not sufficient for production workloads, and it's very challenging to identify and configure HPA to autoscale on model serving metrics. It's also hard to measure how latency, throughput, and workload sharing interact with auto scaling, so that the deployment can achieve a target latency, and to tune model server configurations for better optimized performance.
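As an illustration of the configuration challenge Yuan describes, scaling on a model-server metric with HPA looks roughly like this. It assumes a custom metrics adapter (for example, prometheus-adapter) already exposes a per-pod metric; the metric name and target value are hypothetical, and choosing them well is exactly the open problem:

```yaml
# Hedged sketch: HPA driven by an assumed per-pod serving metric rather than
# CPU/GPU utilization. The metric must be exposed via a custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth          # hypothetical per-pod metric
      target:
        type: AverageValue
        averageValue: "4"                    # scale out past ~4 queued requests per pod
```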
Abdel Sghiouar
Got it. Yeah. And I think there is a whole other topic that is probably worth its own episode, about observability for LLMs, which is a completely different question: how do you handle observability for LLMs compared to a web server or a backend application? So, I was checking out the GitHub repository of the working group. You have three work streams: auto scaling, multi-host/multi-node, and the third one was orchestration. Right? Can we talk briefly about these work streams?
Yuan Tang
Yeah, there's actually another one for DRA, which serves as a bridge between Working Group Device Management and Working Group Serving.
Abdel Sghiouar
Got it. Okay, can we talk briefly about each of those? Just what are these work streams? I know that we covered these topics, but going a little bit more into the details.
Yuan Tang
Yeah, maybe I can introduce the orchestration work stream and multi-host, and then Eduardo, you can cover auto scaling and DRA.
Eduardo Arango
Yeah, sure.
Yuan Tang
In the orchestration work stream, we focus on identifying challenges in implementing high-level abstractions for orchestrating serving workloads. So we are working closely with ecosystem projects like KServe and Ray. For example, we hosted dedicated sessions to collect unsolved challenges, pain points, and use cases from these ecosystem projects. There are also interesting proposals from the community after those discussions. For example, there's a Blueprint API that proposes a new Kubernetes workload API for deploying inference workloads. The idea is to offer standardized APIs to define blueprints, or preset configurations, to instantiate serving deployments. However, it has a certain level of overlap with KServe's ServingRuntime API, so we decided to switch our focus to a new project sponsored by this working group, the Serving Catalog. For that project, we'd like to provide working examples for popular model servers and explore recommended configurations and patterns for inference workloads. We also sponsored another project that Eduardo mentioned earlier, the LLM Instance Gateway sub-project, to more efficiently serve distinct LLM use cases on shared model servers based on the same foundation model, like system prompts, LoRA adapters, or other parameter-efficient fine-tuning methods. For example, scheduling requests to pools of model servers to multiplex use cases safely onto a shared pool for higher efficiency. And for the multi-host, or multi-node, work stream, we focus on extracting patterns and solving challenges for multi-host inference. So we had discussions around various implementations of multi-host inference and their cost effectiveness and capacity optimizations. We also had a deep dive into the architecture and use cases of LeaderWorkerSet. LeaderWorkerSet addresses some common deployment patterns of multi-host inference workloads; for example, large models being sharded and served across multiple devices on multiple nodes. There wasn't really that much demand when we first started that work stream, but as models get larger and larger, serving them on multiple nodes really becomes necessary. Even though LeaderWorkerSet provides a good API to describe multi-host workloads, there are still a lot of challenges to be solved together within the working group. For example, we wanted a better way to express network topology preferences for multi-host inference. Multi-host is also poorly supported by orchestration tools, but they are actually very actively working on it; for example, KServe is working on multi-host support now. And yeah, that's it for orchestration and multi-host.
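A minimal sketch of the LeaderWorkerSet pattern described above: each replica is a group of pods (one leader plus workers) that jointly serve one sharded model, and the group scales and is replaced as a unit. Image names and sizes are placeholders:

```yaml
# Hedged sketch of a LeaderWorkerSet: replicas counts serving groups, and
# size is the number of pods per group (leader included), one shard each.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sharded-llm
spec:
  replicas: 2                                # two independent serving groups
  leaderWorkerTemplate:
    size: 4                                  # 1 leader + 3 workers per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: registry.example.com/llm-server:v1   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: registry.example.com/llm-server:v1
          resources:
            limits:
              nvidia.com/gpu: "8"
```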
Abdel Sghiouar
All right. And I'm going to get you, Eduardo, to talk about the other two work streams. But is the problem of multi-host serving there because you think we're going to get to a point where the models will be so big that they cannot fit on a single node anymore?
Yuan Tang
Yeah.
Abdel Sghiouar
Is that the background?
Yuan Tang
Yeah, that's basically the core use case for that. And there are also advanced use cases, like deploying a disaggregated model serving configuration, which is really hard, but it should be relatively common for large model serving in the future.
Abdel Sghiouar
Got it. That's interesting, because on the other side, what I am also noticing is that cloud providers are able to provide bigger and bigger VMs. Right now on Google Cloud you can get something like 3 terabytes of memory and some ridiculous 196 cores on a single VM. But it's interesting to me that we are seeing that in the future those will not be enough for a single model; the model will be bigger than what a single VM can actually handle.
Eduardo Arango
Yeah, I think, going back to what my company says, it doesn't matter the size of the node, but the number of GPUs that you have in the node, right?
Abdel Sghiouar
Oh yeah.
Eduardo Arango
So some models, to run them with low latency, require right now, let's say, two to four GPUs for a big model, and you can fit that on a node. But we will get to a point where we need multi-node, because one single model requires eight GPUs or more. Right. And the reason I'm saying this is because we need low latency. You can run Llama on your laptop; the thing is, it will take like three minutes for it to start responding back. On a production system you don't want that. You want fast responses.
Abdel Sghiouar
Interesting. Okay, so what's the actual limitation for attaching multiple GPUs to a node? Where is the technical challenge there? Where is the technical limitation?
Eduardo Arango
In terms of physical limitations, depending on the architecture of the node, you can get from four to eight GPUs. I know Blackwell itself is going to be a GPU node on its own. So it's not like we were used to, racking up GPUs; it's more like the entire node now is a GPU. We are moving to different types of architectures.
Abdel Sghiouar
Got it.
Eduardo Arango
So it's that, right?
Abdel Sghiouar
Yeah. As I was asking the question, I realized I didn't ask it in the right way, because obviously the physical limitation is the number of PCI Express ports you can have on a single server. That would be the obvious one. But then I was thinking along the lines of what you said, which is that we're moving toward entire physical nodes that are GPUs, that have the GPUs built in. Is it common for people to do compute on a physical node and then GPU on a separate node, and have them talk to each other through some fast networking link? Is that a thing that is possible?
Eduardo Arango
I have seen this in gaming, but for a production system, no, you want to have everything together. I know that in gaming, if your laptop doesn't have a GPU, there is a way, using an external GPU; yeah, over something like Thunderbolt you can have an external GPU. But for production systems you want to have everything as close together as possible.
Abdel Sghiouar
Okay, cool, got it. So, can you talk a little bit about the other work streams, Eduardo? Like auto scaling and DRA?
Eduardo Arango
Yeah, and I think talking about multiple GPUs is a good introduction to auto scaling. As you said, cloud providers are trying to provide better and better VMs as time goes by, and this means that Kubernetes will be creating new nodes on the fly. The auto scaling work stream has been focusing on two key aspects: one is caching, and the second is metrics. As I was saying before, LLM models can be quite big. During the Working Group Serving meetings we have heard people talking about anything from a couple of gigabytes to hundreds of gigabytes. This creates a complexity for auto scaling: if the model is not cached in a way that it can be quickly used by the pod that Kubernetes is deploying, then you will start getting latency and the user experience gets degraded. You don't win anything by your cloud provider providing a new VM in two or three seconds if you cannot cache or move the model to that node so your pods can use it, right?
Abdel Sghiouar
Yeah.
Eduardo Arango
So, speaking about caching: strategies like network-attached volumes provided via Kubernetes are ones we have been discussing at the working group. And the second thing is metrics, right? Should we listen to the hardware, GPU utilization or memory utilization, or should we focus on what we are receiving, leaving the hardware as a black box: latency, tokens per second, the size of the prompt (which is kind of an input, not an output), or a combination of both? This is being discussed right now for the new inference gateway project, and it's about how we can do better load balancing. Should we listen to hardware, or should we listen to other metrics? This is kind of in flight right now, and as Yuan said, we want people to join the conversation, come to the working group and say, hey, in my company tracking GPU utilization works. Cool, that will help us as a working group. We need input from the community on these types of topics, because right now we are coming to the realization that, depending on the model and its use case, you should track hardware, or you should track these soft-layer metrics like latency and tokens per second. So that's the auto scaling work stream focus, and there are very interesting discussions there. The last work stream is kind of a bridge work stream, because it bridges Serving with the Device Management working group; it's what we call the DRA work stream. It is basically, as we have been saying, gathering all the feature requests for DRA from the point of view of what we need to make serving better, and then championing that list at the Working Group Device Management meetings. Right? For example, everyone in Kubernetes right now is talking about the cut of 1.32. So we ask, okay, what do we want from the serving side of the house that the folks at device management should prioritize before the 1.32 cut? That's what the DRA work stream is: creating a prioritized list and championing it at the device management meetings.
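One way to picture the caching strategy Eduardo mentions: keep model weights on a network-attached, read-only-many volume, so that freshly autoscaled pods mount the cache instead of re-downloading hundreds of gigabytes. The storage class, image, and flag below are placeholders for whatever your provider and model server actually offer:

```yaml
# Hedged sketch: a shared ReadOnlyMany model cache mounted by serving pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: shared-filestore         # hypothetical RWX/ROX-capable class
  resources:
    requests:
      storage: 200Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1                                # the autoscaler adjusts this
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: server
        image: registry.example.com/llm-server:v1   # hypothetical image
        args: ["--model-dir=/models"]               # hypothetical flag
        volumeMounts:
        - name: models
          mountPath: /models
          readOnly: true
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-cache
          readOnly: true
```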
Abdel Sghiouar
Got it. I remember that from KubeCon in Paris, I was talking with Clayton and we specifically talked quite a bit about this problem you covered during the explanation of the auto scaling work stream, which is: how do you load balance based on the user input? Because for LLMs your input is not a typical REST call or a SQL database query or something; it's basically text. And one of the things Clayton was mentioning was that it would be interesting if there were a way to load balance also based on size, so you can send smaller inputs to smaller, more efficient models and bigger prompts to bigger models, to optimize the time to get the response back. So is this in the space of what you're also working on? Because you mentioned the LLM gateway, which is, I would say, not a fork really, but inspired by the API gateway, only for LLMs. Right? Can you talk a little bit more about that specifically? Because that's essentially the load balancing problem we're talking about here.
Yuan Tang
Yeah, maybe I can talk a little bit about ModelMesh first. Before we even introduced the instance gateway project in the Working Group Serving, KServe actually had a project called ModelMesh. It is a very mature, general-purpose model serving management and routing layer designed for high-scale, high-density, and frequently changing model use cases. It works really well with existing model servers, and it also acts as a distributed LRU cache for serving runtime workloads. So even before large language models, we had a lot of use cases for traditional machine learning models where we needed to handle the traffic and the routing between those different model servers, depending on their usage, density, and frequency of change. So I just want to mention that it's not just a requirement for large models; there are use cases for traditional models as well. For the LLM Instance Gateway, the initial goal for the POC is making sure it works well with LLM use cases, especially high-density LoRA adapter use cases. But later on it may be extended to support other traditional use cases as well.
Abdel Sghiouar
Got it. Yeah. So I think that's basically all the questions I had for you folks. Do you want to add anything? Where can people find you? Where can they find WG Serving? We'll make sure there is a link in the show notes to your GitHub repository. But how can people join if they're interested?
Eduardo Arango
Sure. People can find me on LinkedIn under my full name, which is very long: Carlos Eduardo Arango Gutierrez. But I guess the link will be attached, and also on the Kubernetes Slack. But no, I don't have Twitter; I don't support it, I don't want to be there. So yeah, that's where you can find me.
Abdel Sghiouar
Awesome.
Yuan Tang
Yeah, so make sure you join the mailing list, and then you'll get invited to all the existing and future calendar invites, and make sure to join the Slack channel as well, because that's where most of the real-time communication happens.
Abdel Sghiouar
Awesome. Cool. Well, thank you for being on the show folks.
Yuan Tang
Yeah, thank you for having us.
Abdel Sghiouar
So Kaslin, how are you holding up for KubeCon? Are you excited?
Kaslin Fields
I am always excited for KubeCon. I walk into KubeCon and I'm like, wow, I'm home.
Abdel Sghiouar
And then five days later it's over. It just like flies by.
Kaslin Fields
Yeah. And then I am destroyed and need to go take a nap.
Abdel Sghiouar
I call that like post conference depression.
Kaslin Fields
But yeah, it's a thing.
Abdel Sghiouar
It's so engaging and you meet so many people, then it's sad when it's over. Right.
Kaslin Fields
And of course KubeCon is the CNCF's primary event, and it has such a huge focus on open source. So I'm excited that today's topic was about Working Group Serving.
Abdel Sghiouar
Yeah, I don't even remember how this came to exist as a topic that we wanted to talk about. I don't remember where... I think it came from our discussion with Tim Hockin and Clayton during the...
Kaslin Fields
I think it did, yeah, when we were working on the 10-year anniversary episodes. Oh yes, with Tim Hockin and Clayton Coleman. In those episodes we talked about the new working groups that were spinning up to support specifically AI-oriented workloads, implementing new functionality in Kubernetes to help Kubernetes users better manage the underlying hardware, since hardware is so important to those types of workloads.
Abdel Sghiouar
Yeah. And so the two working groups are Serving and Device Management. Serving is the one we covered today, which is inference, essentially. And Device Management we will cover next year.
Kaslin Fields
So Serving is really focused on the specific type of workload. Device Management is all about that challenge, like we were talking about, of managing the hardware better through Kubernetes. But Serving is specifically digging into what inference workloads look like right now and what we could do with Kubernetes to make them better.
Abdel Sghiouar
Yeah. And so I remember from the conversation, one of the challenges that Eduardo specifically was talking about that they are trying to solve is multi host serving. So if you have a gigantic model and you have a physical limitation in terms of how many individual GPUs you can attach to a single node, can you split that model across multiple nodes? Right. So that's just one of them. Then there was like a lot of other conversations, but this is specifically something that stuck with me because I never really thought about it. Like a distributed machine learning model, essentially.
Kaslin Fields
Exactly. I love it when a look into something in the Kubernetes world comes back down to the roots: Kubernetes is a distributed system. There's a whole bunch of computers running workloads, and so how do we do that in the most efficient and most useful ways? So that is a really cool aspect that I also had not thought of. But if you're running an inference workload on Kubernetes, then of course it's running on a distributed system. So you need to think about how to make the most efficient and best use of that distributed system.
Abdel Sghiouar
Exactly, yeah. And preparing to record this, we just realized that working groups, and this is something you will have to teach me, I guess.
Kaslin Fields
Yes.
Abdel Sghiouar
So in the working groups we don't talk about leads, we talk about organizers, apparently.
Kaslin Fields
So as I have talked about a number of times on here, I am deeply involved with the Kubernetes community; I'm a lead of a special interest group myself. But working groups are a little bit different from special interest groups. Special interest groups are the core tool that we use to split up the work of maintaining and building the Kubernetes project. So we have a special interest group for networking; docs is its own special interest group; infrastructure; testing. We have really big areas that are covered by special interest groups. But working groups tend to spin up when there is a topical thing that the project needs to think about. Of course, serving workloads makes sense right now, since we're seeing a big increase in the number of people wanting to run those types of workloads on Kubernetes. And device management is a topical thing for us to cover, because we need to better handle those devices for those types of workloads, and the project hasn't really done that that much before. There were tools, of course, but this is another level. So we needed focused groups to think about what's going on with these things, and spun up these working groups. Working groups are a lot like a SIG in that they have a specific area that they're looking at, but they're, like I said, topical. They're about something that's going on currently, and the idea is for them to eventually roll into a SIG or become a SIG themselves. They aren't going to last forever, and as such, they don't actually own any code. Generally, in the code base of Kubernetes, any code that they produce is going to be owned by a special interest group, because those are going to continue existing regardless of what happens. So there's a kind of maintenance plan in place in that sense. They operate very similarly to SIGs in some ways, but also differently. So this "organizers" thing: in SIGs we call the leads tech leads and co-chairs, and I had figured that working groups used similar language. But maybe not; maybe they use the word organizers.
Abdel Sghiouar
And also, the other thing to keep in mind, from my understanding, is that working groups can potentially span multiple SIGs.
Kaslin Fields
Yes, they generally do.
Abdel Sghiouar
So, yeah, they work with multiple SIGs, because I don't think serving is a specific special interest group's problem. I think it's something that multiple SIGs will be involved in trying to "solve," quote unquote. Right.
Kaslin Fields
And we'll include a link in the show notes to the GitHub repo for the working group. If you check that out, it actually says which SIGs they're most closely aligned with and work most closely with. So all of their work, theoretically, will go into those SIGs rather than being owned by the working group. And then someday the working group will probably dissolve and those SIGs will own that code instead, unless something changes and we decide that we need that working group forever and it becomes its own SIG.
Abdel Sghiouar
Yeah. And while you were talking, I was looking at the GitHub repository, and I realized that they are a sponsor of a sub-project called the LLM Instance Gateway, which we covered in the news. This is something I'm super excited about; probably next year we should have an episode about it.
Kaslin Fields
They're a sponsor of the sub project. That's also very interesting.
Abdel Sghiouar
There is a sub-project called LLM Instance Gateway. Yeah, it's listed as a sponsor. But I've been following the LLM Instance Gateway effort for a while because, yeah, it's basically what gateways are, but for LLM-specific workloads. So there are some interesting things going on there, and I think we should eventually, at some point, cover that.
Kaslin Fields
Yes, that sounds very interesting. So, I mentioned the SIGs and the working groups; sub-projects in Kubernetes tend to be part of a SIG. SIGs have very broad scopes, and so they have sub-projects to bring that scope down a bit into something a little more actionable. So it sounds to me like what happened here is: the working group exists, of course, to address this topical issue; they identified a need; and they worked with a SIG to create a sub-project, is what it sounds like.
Abdel Sghiouar
Or they are sponsoring. That's what it says: sponsored. So I don't know what sponsored means.
Kaslin Fields
I would imagine sponsored means: we figured out that it needed to happen, and so we're helping to spin up this sub-project. Could be, because it's not like there's money involved. It's open source.
Abdel Sghiouar
Yeah. I mean, specifically, the LLM Instance Gateway is an in-development project. So there is pretty much nothing; I mean, there is code, you can build it yourself, but there isn't really much you can use yet. So, the more you learn, I guess.
Kaslin Fields
Is it a separate project, or is it part of Kubernetes? What is it listed under?
Abdel Sghiouar
Always a good question. It's actually listed under kubernetes-sigs.
Kaslin Fields
Okay.
Abdel Sghiouar
Yes, that's where it's listed.
Kaslin Fields
Right. That makes sense.
Abdel Sghiouar
So that's... yeah. Anyway, it was pretty cool to talk with Yuan and Eduardo. I learned a lot.
Kaslin Fields
Yes. And we hope that you all enjoyed listening to the episode and learning about what the community is doing to support serving workloads in the distributed system that is Kubernetes.
Abdel Sghiouar
Awesome. Well, thank you very much, Kaslin.
Kaslin Fields
Thank you, Abdel. That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at @kubernetespod or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
Kubernetes Podcast from Google: Episode Summary
Title: Working Group Serving, with Yuan Tang and Eduardo Arango
Hosts: Abdel Sghiouar, Kaslin Fields
Release Date: October 31, 2024
In this episode of the Kubernetes Podcast from Google, hosts Abdel Sghiouar and Kaslin Fields engage in an insightful conversation with Yuan Tang and Eduardo Arango. The discussion centers around the newly formed Working Group Serving within the Kubernetes project, which focuses on optimizing serving mechanisms for AI and ML workloads. This summary encapsulates the key points, discussions, insights, and conclusions drawn during the episode.
Before delving into the main topic, Abdel and Kaslin share recent developments in the Kubernetes and cloud-native ecosystem:
Docker's Terraform Provider Launch
[00:33] Kaslin announces that Docker has launched its official Terraform provider, enabling the management of Docker-hosted resources such as repositories, teams, and organization settings.
Open Collaboration Between Tetrate and Bloomberg
[00:47] Abdel highlights a collaboration aimed at integrating AI gateway features into the Envoy project. This initiative focuses on building gateways capable of handling AI traffic, with initial features targeting usage limiting based on input and output tokens.
CNCF Laptop Drive at KubeCon
[01:20] Kaslin informs listeners about the CNCF's laptop drive at KubeCon Cloud Native Con North America 2024, benefiting nonprofit organizations like Black Girls Code and Kids on Computers.
Upcoming Kubernetes Community Days
[01:43] Abdel mentions four remaining Kubernetes Community Days events scheduled globally, encouraging listeners to participate and support local Kubernetes communities.
Yuan Tang
Yuan Tang is a Principal Software Engineer at Red Hat, working on OpenShift AI. With a rich background in leading AI infrastructure and platform teams, Yuan holds leadership positions in open-source projects including Argo, Kubeflow, and the Kubernetes Working Group Serving. He is also an accomplished author and regular conference speaker.
Eduardo Arango
Eduardo Arango transitioned from environmental engineering to software engineering and has been instrumental in promoting containerized environments for high-performance computing (HPC) over the past eight years. As a core contributor to Apptainer under the Linux Foundation, Eduardo now works at NVIDIA on the Core Cloud Native Team, focusing on integrating specialized accelerators into Kubernetes workloads.
Creation and Mission
The Working Group Serving was established following discussions at KubeCon Europe, where Yuan Tang and Clayton Coleman identified the need to address challenges in model serving within Kubernetes. The group's mission is to enhance serving workloads on Kubernetes, particularly for hardware-accelerated AI and ML inference tasks.
Yuan Tang:
"[03:53]…Generative AI has really introduced a lot of challenges and complexity in model serving systems… The mission is to advance the capabilities and efficiency of serving on Kubernetes to handle evolving requirements of generative AI and future workloads."
Goals of the Working Group Serving
Eduardo outlines three primary goals of the working group:
Enhancing Workload Controllers
[11:16] The group aims to provide recommendations and better patterns for improving Kubernetes workloads and controllers, focusing on performance enhancements in popular inference serving frameworks.
Orchestration and Scalability Solutions
[11:16] Investigating key metrics for auto-scaling, such as GPU utilization and token-based metrics, to build more effective orchestration and load balancing solutions.
Optimizing Resource Sharing
[11:16] Collaborating with the Working Group Device Management to communicate resource needs and prioritize new features like Dynamic Resource Allocation (DRA) for better resource sharing.
1. Multi-Host and Multi-Node Serving
As AI models grow in size, deploying them across multiple nodes becomes essential. Yuan explains the complexities involved in multi-host inference workloads, such as network topology preferences and load balancing across nodes.
Yuan Tang:
"[15:04]…auto scaling on device utilization, memory is really not sufficient for production workloads… it's challenging to configure HPA to autoscale model serving metrics."
2. Auto-Scaling Limitations
Traditional auto-scaling metrics, like memory utilization, fall short for AI workloads. The working group is exploring alternative metrics, including latency, tokens per second, and prompt sizes, to achieve more efficient scaling.
Eduardo Arango:
"[23:16]…the auto scaling work stream has been focusing on caching and metrics. If the model is not cached effectively, latency increases, degrading user experience."
3. Dynamic Resource Allocation (DRA)
DRA aims to address the limitations of defining multi-GPU and multi-node workloads in Kubernetes. Eduardo emphasizes the necessity of DRA for running large models that exceed the capacity of single nodes.
Eduardo Arango:
"[14:03]…defining multi-GPU multi-node workloads in Kubernetes is almost impossible without DRA. It's essential for models that require more GPUs than a single node can provide."
LLM Instance Gateway
A significant focus is placed on the LLM Instance Gateway, a sub-project designed to optimize load balancing for Large Language Models (LLMs). This gateway intelligently routes requests based on input size and model efficiency, ensuring low latency and optimal resource utilization.
Yuan Tang:
"[27:42]…the LLM Instance Gateway aims to safely multiplex use cases onto a shared pool of model servers for higher efficiency."
ModelMesh Integration
Yuan discusses the integration of ModelMesh, a mature model serving management layer that works seamlessly with existing model servers. It acts as a distributed LRU cache, enhancing performance for both traditional ML models and LLMs.
Yuan Tang:
"[29:02]…ModelMesh is not just for large models but also benefits traditional machine learning models by managing traffic and routing effectively."
The Working Group Serving operates within the Kubernetes community, adhering to the CNCF Code of Conduct. It collaborates closely with ecosystem projects like KServe, Ray, and the Device Management Working Group to ensure holistic solutions for serving workloads.
Yuan Tang:
"[09:25]…We hope that all improvements made by the working group will also benefit other serving workloads like web services or stateful applications."
Yuan and Eduardo encourage community members to join the Working Group Serving by subscribing to the mailing list and participating in the Slack channel. Interested individuals can contribute by sharing use cases, proposing features, and collaborating on ongoing projects.
Yuan Tang:
"[29:18]…Make sure you join the mailing list and the Slack channel to stay updated and contribute to the discussions."
The episode provides a comprehensive overview of the Working Group Serving, highlighting its mission to enhance Kubernetes for AI and ML workloads. Through collaborative efforts and community involvement, the group seeks to address critical challenges like multi-host serving, auto-scaling, and dynamic resource allocation, paving the way for more efficient and scalable AI deployments on Kubernetes.
Kaslin Fields:
"[37:45]…We hope that you all enjoyed listening to the episode and learning about what the community is doing to support serving workloads in the distributed system that is Kubernetes."
For listeners interested in diving deeper or contributing to the Working Group Serving, detailed information and resources are available in the episode's Show Notes and the GitHub Repository. Engaging with the community through mailing lists and Slack channels is encouraged to stay abreast of the latest developments and contribute to shaping the future of AI serving on Kubernetes.
Connect with Hosts: