
Guests are Maciej Różacki, Product Manager on GKE for AI Training, and Wojciech Tyczyński, Software Engineer on the GKE team at Google. We explore what it means for GKE to support 65k nodes, and the open source contributions that made this possible. Do you have something cool...
Kaslin Fields
Hello and welcome to the Kubernetes Podcast from Google. I'm your host, Kaslin Fields, and I am Abdel Sghiouar. Today we're talking with Maciej Różacki and Wojciech Tyczyński about some exciting new updates in the world of scaling Kubernetes. But first, let's get to the news.
Abdel Sghiouar
We are publishing live from KubeCon + CloudNativeCon North America 2024. It's going to be a week filled with learning, networking and cool technology in the cloud native space. Stay tuned to the community and to our social media for updates from the event.
Kaslin Fields
Speaking of social media, the Kubernetes Podcast from Google would like to invite you to follow us on our new account on Bluesky. There will be a link in the show notes.
Abdel Sghiouar
OpenTelemetry is expanding into the CI/CD space with the release of version 1.27 of the OpenTelemetry semantic conventions, the common spec for defining objects, operations and data in OpenTelemetry. CI/CD attributes have been added so things like pipelines, tasks, run IDs, etc. can be observed, and their execution status can be reported and monitored.
Kaslin Fields
Gitpod announced they are moving away from Kubernetes. In a blog post, the company behind the cloud-based development environment cited the many technical challenges they faced running their workloads on top of Kubernetes. Check out the detailed article from the links in the show notes.
Abdel Sghiouar
OpenCost is now a CNCF incubated project. OpenCost is a vendor-neutral tool that provides visibility into Kubernetes costs across major cloud providers and on-premises.
Kaslin Fields
And that's the news. Today I'm speaking with Maciej, Product Manager on GKE for AI training, and Wojciech, Engineer on GKE. Would you each introduce yourselves? Maybe Maciej first.
Maciej Różacki
Hi, my name is Maciej Różacki. I'm a Product Manager on the Google Kubernetes Engine team. I'm responsible for shaping our roadmap of capabilities for supporting AI training and machine learning training use cases on Kubernetes and GKE.
Wojciech Tyczyński
I'm Wojciech Tyczyński. I'm one of the engineering leads in GKE. I'm also heavily involved in open source Kubernetes; for example, I'm a TL of SIG Scalability.
Kaslin Fields
I am so excited to be speaking to you both today. We have a very interesting topic, and I'm very interested in it from open source perspectives as well, which we will get into. But we are going to be talking today about GKE announcing support for 65,000 node clusters, which is enormous. I don't remember what the previous recommended limit from the open source project was, but GKE has had an industry-leading 15,000 nodes for some amount of time. So this is a massive increase, from 15,000 to 65,000 stated supported nodes. So can you tell us a little bit more about what it means to say that GKE supports a 65,000 node cluster?
Maciej Różacki
What we've seen over the course of recent years is that with the era of, or say, new generation of AI technology development, there is a clear demand from customers to start running at a much larger scale than before. We've seen in previous years that customers were interested in operating clusters at the scale of a couple of thousand nodes for microservices workloads. For more high performance computing use cases, customers were going above 10,000 nodes. You can see, for example, a case study that we did together with PGS; I think they were even a guest on this podcast. But recently we've seen that these needs for scale, and the sizes of the computing power in clusters, have grown even further. And today the scale limits of Kubernetes are good enough, we think, for training and serving models of sizes of around 1 trillion parameters. And in some time, hard to say when, we will see models 10 times bigger, maybe even larger. And to meet the needs of customers, to be able to both train and serve these models, we need to innovate both in the sizes of clusters and in the capabilities of the hardware that they run with. So what this means is that to operate at 65,000 node scale, we've narrowed down the use case that we want to support with these clusters to building AI platforms. And if we make some assumptions on what customers will do with the cluster, combined with lots of innovations both in open source and in-house at Google, we were able to offer cloud customers the ability to operate 65,000 VM nodes' worth of computing power in a single cluster.
Kaslin Fields
And I was also just thinking about the PGS episode. Fantastic call out; if you haven't checked that out, it's a very interesting use case. A lot of stuff going on here with supercomputing, and regardless of what you may think of where AI is going, it's definitely an exciting time to be in infrastructure. So Wojciech, from an engineer's perspective, is there anything you would like to add?
Wojciech Tyczyński
I would just add to what Maciej just mentioned that we were primarily focused on training and inference, or AI/ML in the GenAI use cases. But we are also thinking about mixing those workloads, and this is something that we are already supporting with this announcement. Many of the customers or many of the users consider splitting those two into separate clusters, but I think it's important to give them a possibility to actually mix those. And this is something that we are actually testing as part of evaluation and as part of ensuring that clusters of such scale actually work too.
Maciej Różacki
I would maybe add to what Wojciech said that it's a very interesting domain, these machine learning platforms and backends for artificial intelligence. On one hand, what's quite unique, and also stems from supercomputing patterns that were less adopted in the cloud, is that the training workloads and the whole process of building the model involve quite a lot of tightly coupled workloads. So you have jobs that are very sensitive to the physical characteristics of the data center in which they run: the proximity of machines, how far each individual host of your pod and container is from another container. This matters and affects cost, efficiency, speed and even the scalability of your workload. And the same applies to inference. The largest models are very difficult to serve from just one host. Customers typically shard them into multiple workloads that run on a few VMs, and this creates a need for co-locating massive amounts of computing power in one physical location. And then, as Wojciech mentioned, customers want to both train and validate their models. If we think about where AI is, at least among the leaders of this space, everybody's working on finding the right model; folks are competing on the quality of models, how they are responding, various rankings online. So what the users that we see need is the ability to very rapidly repurpose their hardware, which is very scarce at the moment. We have to say that there is a crunch in chips, and a crunch in the availability of electrical power to power the data centers. So customers are dealing with scarce resources. And we want to enable them to have a tool that will easily allow them to readapt the infrastructure that they have, and those resources, to various use cases.

So you may be training your workloads using, say, 60 or 80% of the total computing power that you have available to you, and use the remainder for some research workloads, or for running inference to validate your model and get feedback from your users and customers. And at the same time, if you see, for example, a significant success with one of your models, or a surge of traffic associated with that model, you can very quickly stop other workloads and easily move virtual machines to serve a different purpose. And Kubernetes is just great for that. Unlike other systems that were built primarily with supercomputing in mind, Kubernetes was built for supercomputing, these research workloads, and microservices alike. And here we are enabling customers to run all of them in one environment and dynamically adapt the use case that they serve within minutes, even as the needs of their business change.
Wojciech Tyczyński
Yeah, I think that "within minutes" is super important here, because especially during times of shortages and stock-outs of capacity, it's often impossible to get new capacity within this timeframe. And minutes, or even seconds or tens of seconds, is what customers often expect. And being able to have that capacity without provisioning the accelerators or provisioning the VMs themselves, which usually takes minutes, is an important factor in why those users choose to repurpose existing capacity instead of provisioning or reprovisioning some of the machines they were previously using.
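Repurposing capacity that fast is typically expressed through scheduling priorities rather than new VMs. As a minimal, hypothetical sketch (the class names, values and descriptions below are illustrative, not from the episode), a high-priority class for serving lets a traffic surge preempt lower-priority training pods, so the existing nodes are reused in seconds:

```yaml
# Illustrative only: names and priority values are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical        # hypothetical name
value: 1000000                  # higher value wins on contention
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Inference traffic that may preempt batch training pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training          # hypothetical name
value: 1000
preemptionPolicy: Never         # training pods queue rather than preempt
globalDefault: false
description: "Preemptible training workloads."
```

A serving Deployment would then set `priorityClassName: serving-critical` in its pod spec; when a surge arrives, the scheduler evicts `batch-training` pods to make room on nodes that already exist, with no VM provisioning in the critical path.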
Kaslin Fields
I think it's amazing that the pressures of the infrastructure world that Kubernetes exists in have changed so dramatically from the time that it was created to now. Like you were saying, the hardware pressures, the energy pressures, the scale of these workloads is just so immense. And the core concept of distributed computing is just essential to how we actually, technically, make these workloads possible. I really like that you called out scheduling there and how the workloads can be interconnected. You have to spread them across hardware just because you need so much hardware to run them, but they still need to be very tightly coupled in working together: which one goes first, what pieces interact with each other. You have to consider all of these things. There's a lot of work going on, I know, in open source Kubernetes to enable these kinds of things. And in order to enable a 65,000 node cluster that is focused on these types of workloads, supercomputing, AI, all of these types of topics, you must have done some really cool engineering work to solve some of the problems there. So could you call out some of the cool engineering and technical challenges that the team had to solve in order to make a 65,000 node, AI-oriented Kubernetes cluster, a GKE cluster, possible?
Wojciech Tyczyński
Yes, we indeed had to solve a bunch of interesting problems. And in fact we were kind of preparing for that for years. Even though, if we look back three or four years ago, we weren't thinking about supporting that scale, there were a bunch of investments that just take years and were preparing us for where we are now. And I think probably one of the most interesting and most challenging things that we did is actually replacing etcd with our own GKE-specific storage. We call it Spanner-based storage because underneath it's using Spanner, which is Google's technology that we use internally as the database solution for many Google products, not just in cloud. And we are actually in the middle of replacing the storage for all the existing GKE clusters with a Spanner-based multi-tenant solution. The main goal here wasn't really the scale and increasing the scale. The main reasons were making our control plane stateless, thanks to which we are making it more flexible; all the operations will be faster, and so on. But scalability was one of our design principles, so it just unblocked us here without any additional work that had to be done specifically for that effort.
Maciej Różacki
I can extend that a little bit. The investments in control plane performance are one aspect. A second one is probably the investments in the data plane and the ability to handle the network traffic. Another element is the various APIs around Kubernetes. As you may remember, we started the work around high performance computing workloads and batch workloads as a very deliberate effort with the CNCF; I think it was three years ago that the Batch Working Group was established. This year we also have the Serving Working Group joining, let's say, the portfolio of CNCF working groups that look at Kubernetes APIs and see where they need to evolve to support this new era of workloads. And there are lots of very cool examples of capabilities that enable us to tap into the possibilities that such scale offers. As an interesting example, dynamic resource allocation: that is a whole domain of how you model this very advanced and sophisticated hardware. And how you operate the scheduling domain is a very interesting one, because these AI workloads change a couple of paradigms in scheduling. In the past, a typical microservice came with a couple of assumptions that we were designing these systems around. For example, a typical replica of a microservice is rather small; it's definitely smaller than the single host on which it runs. So we've invested quite a lot in Kubernetes into capabilities like oversubscribing physical machines with many microservices that run on and share them. We have pod bursting and various node-level and kubelet-level capabilities to manage that, and to combine cost efficiency and elasticity with the great service levels that these applications offer to users. Now in the AI space we have jobs that take thousands of VMs, if not tens of thousands of VMs, and even individual replicas of a model server: it's not uncommon that they run on more than one VM.

So now you have deployments where every replica of the deployment is actually a multi-host workload, which is an API extension, and at the same time a very interesting scalability and cost efficiency dynamic. So we have the advent of APIs like LeaderWorkerSet to enable these more complicated deployments, or JobSet to have a job-level equivalent where the job is heterogeneous and accounts for these network structures and topologies. So lots of cool stuff is happening in Kubernetes, and lots of cool stuff was also done in the domain of allowing us to effectively use this platform. Maybe I'll mention just one more, which is Kueue, the job scheduling add-on that we've added. We also started it, I think, around those three years ago when we started the Batch Working Group, maybe a little bit later, as part of the work of the community there. We believe it is the best cloud-native job scheduling extension in the Kubernetes ecosystem. It also allows you to mix various workloads within these large AI platforms and juggle your resource allocation between jobs, Deployments and StatefulSets. It is capable of integrating with these serving types of workloads so that you can balance capacity sharing between jobs and your serving workloads.
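That capacity sharing between training and serving can be sketched with Kueue's ClusterQueue API. This is a hedged illustration only: the cohort name, resource flavor and quota numbers are assumptions, not from the episode. Two queues in one cohort let the training queue borrow idle serving quota and vice versa:

```yaml
# Illustrative Kueue sketch: names and quotas are hypothetical.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: training
spec:
  cohort: accelerators            # queues in one cohort can borrow idle quota
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 48          # training's guaranteed share
        borrowingLimit: 16        # how much idle serving quota it may borrow
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: serving
spec:
  cohort: accelerators
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
```

Workloads submitted through a LocalQueue pointing at `training` would queue until GPUs are free, borrowing from `serving` when it is idle, which matches the juggling of resource allocation described above.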
Wojciech Tyczyński
So let me just add one more thing, because I started with GKE, but we also did a lot of cool stuff in open source Kubernetes itself. One of the interesting features or enhancements that is not super directly visible to users but helps a lot with scalability is consistent lists from cache, which allow us to serve list requests directly from the API server cache without contacting etcd, or in our case the Spanner-based solution. That helps a lot with reducing the load on the storage and helps a lot with scalability. But that's just one example. We did a bunch of improvements across not just core Kubernetes but also in other projects. We improved Konnectivity, which is very tightly coupled with Kubernetes but technically a separate thing, thanks to which you can actually dynamically add or remove API servers or control plane replicas in your cluster without the need to restart all the others. That's just one example. Another thing: there are a lot of scalability-related improvements going directly into Cilium, for example, which is one of the options for the data plane and networking solutions that we use in GKE. While a bunch of improvements there were done internally at Google, there are also improvements going into upstream Cilium done by our engineers. So yes, there are a lot of things that we are giving back to the community. I would even say more: whenever we actually need to change or adjust or enhance Kubernetes itself, we always do that in open source. We are not patching an internal fork of Kubernetes; all the improvements that we need in core Kubernetes we do upstream, so that everyone can actually benefit from them.
Kaslin Fields
This is something that I have seen much more clearly since joining Google, and that I'm constantly surprised by: how much the GKE engineers, when they're working on a feature for GKE, are doing stuff in open source, and stuff is just appearing in open source. And it may not be clear that that is what is backing these GKE features, but that's kind of the point: it's available in open source, and anyone can use these improvements too. And I think another theme that I want to call out here is that there were a lot of new features that went into open source to enable 65,000 nodes on GKE that are very individual improvements. It's hard to understand the whole, I think, sometimes with these individual features that lead up to some big thing that you can now do. The same thing is true in the world of stateful applications on GKE, or on open source Kubernetes in general. I've given some talks about how it's very difficult to understand the space of stateful applications on Kubernetes, because a lot of the features that enable stateful applications are just features. They're not called out as stateful features; they are just features within Kubernetes. And I think the same thing is kind of true here. There's a whole bunch of features in networking, a whole bunch of features in scheduling and workload types, and all throughout the project, that are all enabling this together. But those ties may not be immediately clear if you were just looking at the new features in Kubernetes.
Wojciech Tyczyński
Yeah, I would just add to this that there were a bunch of improvements that we were making in open source, and in fact none of them were justified purely with increased scale. I mean, we were justifying them with increased scale, but not in the dimension of the size of the cluster; rather the throughput of the system or some other things. Because those enhancements or improvements don't just help with the size of the cluster; they can help many other users, even if they have much smaller clusters, in other dimensions of scalability. But they also actually make the system itself more reliable; they reduce cliffs, they help with the stability of the system under high load, and so on. So it's not that we only complicate the system for higher scale; we just make the lives of many other users that use smaller clusters better too.
Maciej Różacki
I can actually share a funny anecdote on this that we were discussing with Wojciech and a couple of our teammates: how do we wrap this launch of support for 65,000 node clusters in the proper formal launch process? Like every large software provider, Google, and especially Google Cloud, has a strict launch process that we follow to make sure we correctly support enterprise customers and do all of the regression and validation, all that stuff. And definitely, from how this capability and the feature look, it's a massive launch for us. But then when we tried to pinpoint it to a very specific technical milestone, what did we change in the code base, or what's in production that is this launch, we couldn't. It's a funny thing; it was just lots of micro changes that indeed were oriented on solving a variety of problems. So we really believe that while this launch pushes the edges of the technology, it makes the lives of all Kubernetes users easier and better. Like much more rapid control plane scaling, which means that you can have ephemeral clusters that work much better, and you have fewer worries with control plane warmup effects on managed clouds, or the speed of pod scheduling. That is a dimension of performance that is not directly tied to scale; you may want very rapidly churning workloads on a very small cluster too. And lots of improvements went into kube-scheduler and the other components running alongside the API server in the control plane to make sure that we can actually support fantastic performance characteristics irrespective of the scale. And then all of a sudden you look at the hundreds or thousands of changes everywhere; we ran tests, put it all together, and then cleaned up some of the rough edges when actually testing the very large case and making the changes associated with targeting a launch at this scale.

But it's not that there is a specific thing that was changed that really made this possible. It's four years of the work of our engineers and of the community that enabled this.
Wojciech Tyczyński
I would just add, or slightly clarify: yes, indeed, there are tons of smaller changes across pretty much the whole product, and not just the product, also in our dependencies. But there were also a bunch of large launches that we did that were actually going through that process and that we heavily depend on; they were just happening across the past couple of years. Without those, we wouldn't be where we are now with scale either. So it's a combination of those big transformational projects, plus a lot of glue code and smaller improvements to make all of those work together.
Kaslin Fields
I think all of this is a really strong indication of how robust Kubernetes is in its core concepts. Of course, Kubernetes has made it to being a decade old, and it's not slowing down. There's still so much going on with the project, and the core concepts that underlie it, enabling distributed systems, are still so relevant, arguably even more relevant in the world today. And so we see this continuing movement toward the same goals that Kubernetes always had. We're making all of the components that go into the distributed system that Kubernetes has designed better, and that's enabling us to reach these higher scales as well. And it also means for the community that there's still lots of work going on, and it's important to celebrate that work and for people to know about how awesome it is, which we will be doing at KubeCon this week. So this episode is coming out right as KubeCon is beginning, and we have lots of exciting things planned for KubeCon. Of course, it's a huge event for the community that builds Kubernetes, but it's also a huge event for the end users who use Kubernetes. And it's one of those wonderful moments where both of those things get to come together and we get to see the interactions between those communities. So I'm excited for KubeCon. I know that we've got lots of stuff planned for Google. Is there anything you all would like to call out about KubeCon?
Maciej Różacki
There is lots of very exciting stuff happening at KubeCon. I'm personally very excited about the presentations that will happen at the AI Day co-located event. Our engineers, together with our customers and partners from the community, will be presenting a couple of very interesting things. Also, the lineup of talks is very cool. The one that I'm really keen to hear is a presentation by an engineer from our team and engineers from Apple, talking about how they used Kubernetes and Kueue to build a very sophisticated multi-tenant environment for researchers, and how researchers can share resources and the capacity of pre-allocated hardware between them. But there is lots of other interesting stuff too. I don't know, Wojciech, if you want to add anything on the main event.
Kaslin Fields
I also want to call out that KubeCon added the poster session. So I think we're probably seeing a lot more researchers at KubeCon these days, so I bet the audience of that session will be very interesting, and I'd love to talk to them. Go ahead, Wojciech.
Wojciech Tyczyński
Thank you. So yes, there is definitely a lot of interesting stuff. I think there are a bunch of talks by actual Kubernetes contributors, the special interest groups, working groups and so on, where you can talk to the people who actually create the stuff and influence how they do that. I highly recommend those, but I think the biggest value for me personally was actually talking to different people during the corridor discussions. And if you are specifically interested in the 65k node clusters, there will be a lot of people from Google. You can find them at the Google booth or in the corridors, and there were so many people involved in this work that just by asking any of them, they will either be able to tell you something about it or will easily redirect you to someone you can speak to.
Kaslin Fields
And like we said, there are so many changes that went into open source. So if you go to any of the open source sessions, you might hear about some features that went into this.
Maciej Różacki
Yes, especially the SIG Scheduling presentations. I think there is going to be a maintainer track Batch Working Group session; this domain of AI platforms is going to be definitely very interesting. Probably the Serving Working Group will also have a presentation. I'm less involved in that one, but I would expect the team will also have very interesting content during the session. And as Wojciech mentioned, going to those sessions or to our booth, and then just spending time with all of the presenters and attendees, is the best way to make the most out of the KubeCon event.
Kaslin Fields
For all of you end users out there listening, I would like to issue a challenge. If you are attending KubeCon, or if you check out the recordings later, I challenge you to check out at least one maintainer track session. See what these contributors are doing and how they talk about their work, and see if it's interesting to you and how it might relate. Think creatively, because like we said, a lot of the work that goes into open source is changes that maybe look small, or look like they might not be related to what you're doing, but you might find out that they are actually related to the entire system of Kubernetes, and it all kind of bubbles up into things that you use. So I challenge you to check out a maintainer track session from this KubeCon and see what you learn from it. I would be very interested to know.
Maciej Różacki
I would add that maintainer tracks are really cool in that they're very different. It's very interesting that they are a little bit neglected; they don't get the same size of audience as many other sessions. While at the same time, if you think about it, those sessions are led by some of the top thought leaders in the community, who shape the direction of Kubernetes within various domains. And given that the format of these sessions is a little bit low-key and lower profile, there's not much marketing in there; the sessions are just about what we see in the industry, what patterns we see, how users are changing, where we see the IT industry in a year or in five years from now, and how we need to evolve Kubernetes, or a specific aspect of Kubernetes, to be able to respond to these challenges. So these are very exciting and very interesting sessions, and also a chance to build direct relationships with the folks that present. These are very frequently introverted engineers for whom these are stressful presentations, but at the same time these are just fantastic folks. So make sure you meet them and get their contacts, because these folks are really the ones that shape the direction, and those sessions give you an insight into where Kubernetes is going to be evolving in a particular space.
Wojciech Tyczyński
And even more than the sessions themselves, just grabbing those people right after the presentation, throwing a topic at them and having a brainstorm or whatever is something that I always really enjoyed, and I took a lot from that on both sides actually, both as a maintainer and as a person trying to challenge some other maintainers. So yes, I highly recommend those. And actually this is one of the best opportunities to influence the direction in which the project is going and what we as a project will be working on in the upcoming months or quarters.
Kaslin Fields
At a maintainer track session, you know that you're talking directly to the engineers who are influencing those areas of the Kubernetes project. So if you have things you want to discuss, bring them up. This is what they do and they would love to talk about it, generally. And like we said, they do tend to be smaller sessions, so I highly encourage you to ask questions during these sessions. These engineers are trying to make these decisions and make this work happen, so your input could influence the direction of Kubernetes. One last thing on maintainer track sessions that I wanted to mention: if you happen to go to multiple maintainer track sessions, I personally find it fascinating how you hear similar themes of influences and pressures shaping the project in different ways in different areas. So if you can do more than one, you might see that too.
Maciej Różacki
And I would add that Kubernetes is actually in a very interesting place at the moment. Kubernetes is built with the premise of separation of concerns: there are various components that are each responsible for doing one specific task well. And at the same time, we see more and more requirements for capabilities that cut across those layers. If you think about scheduling, the concept of job scheduling, or Kueue, it introduces the idea that you run a full workload or not, and then you have the Kubernetes scheduler that wants to decide where to run which pod. But you need to combine the placement of those pods with information about the network where those VMs are. Add the autoscaler to the mix, and all of a sudden you end up with a very interesting situation where you have various components of Kubernetes optimizing certain behaviors, and at the same time a user needs to have that coordinated, cutting across these components. How to do that well, so that we move Kubernetes forward without breaking it, is a very interesting challenge that Kubernetes is facing. And you can hear that through all of these conversations across these working groups and SIGs. That's why you also see maintainer tracks that are use case oriented. You have, for example, the SIG Scheduling meeting, which is very much tied to a component. But then you have these horizontal SIGs or working groups: Scalability is a horizontal that crosses various components, and Batch is a really use case oriented working group. And then people from Node, from Autoscaler, from Scheduler and other groups meet in one place and try to figure out how to make all of these distributed components work together well, without breaking the nature of a distributed system that is extensible and pluggable.

Very interesting challenges for Kubernetes, and I'm sure there are going to be lots of very interesting discussions in the corridors about how to figure it all out.
Wojciech Tyczyński
Yeah, I would just emphasize the use case driven thing that you mentioned. Even in scalability, the goal is not to push the boundaries as far as possible; the goal is to meet the user requirements. We don't want to optimize for the sake of optimization; we just want to solve real user problems. So understanding those is critical to making good decisions. And if you have a use case that Kubernetes is currently not addressing, please come talk to us, because we probably didn't hear about it; or maybe we heard about it and just haven't yet figured out how to do it; or maybe we didn't prioritize it because we didn't think it was important enough. Ensuring that we as project maintainers understand the priorities and understand the use cases is the most important thing that you can help us with as a user.
Kaslyn Fields
Please interact with the community and get involved. We would love to hear from you, especially you end users out there. We hope to see many of you at kubecon. So let's wrap this up with where can folks learn more about 65,000 nodes on GKE?
Maciej Rozacki
We will be posting information on our cloud blog, and you can also find updates in our documentation. Over the coming days and weeks we will be releasing a variety of materials, so stay tuned: demos showing how these clusters work and how you can use them, and deeper dives into some of the technical capabilities that enable this technology. For example, the innovation we're especially proud of, which Wojtek mentioned, where we use Spanner as the cluster state storage, and what that really means for how we operate and manage Kubernetes control planes, and the capabilities it opens up: not only scale, but also adaptability, flexibility, and various other characteristics. If you actually want to run at this scale, it is a scale large enough that it takes a power plant to power such a cluster. So definitely reach out directly to us or to your account team so that we can work together on provisioning the necessary power supplies and hardware and helping you build such large infrastructure for your AI workloads.
Kaslyn Fields
It's an exciting time in the infrastructure world and we hope you all have a wonderful time at kubecon. Maciek and Wojtek, I will see you on the show floor. Thank you so much for being on today.
Wojciech Tyczyński
Thank you.
Maciej Rozacki
Thank you. See you there.
Abdel Sigewar
Well, Kaslyn, that's some very exciting news indeed.
Kaslyn Fields
It's pretty cool to break this exciting announcement for everyone.
Abdel Sigewar
Yes, it's pretty good to premiere the news on the show. So I guess we should probably open by saying: if you are listening to this and you've gotten to this portion of the podcast, we are probably in the middle of the keynote at KubeCon.
Kaslyn Fields
Yeah, if you're listening right after it.
Abdel Sigewar
Was released, then yes, if you're listening right after, yes. But yeah, the conversation was pretty cool. GKE was already ahead of the market by supporting 15,000 nodes, but now we're going all the way to 65,000, which is a huge leap.
Kaslyn Fields
Yeah, it's a huge leap in terms of the availability of super high scale clusters in a managed provider. But also, I really wanted to talk about the open source side of it. A lot of the features that go into GKE are things that the engineers contribute back to the community, and there's a whole bunch of contributions back to the community that have happened with this. So I was really glad that I got to talk with Maciek and Wojtek about some of the work that they've done that is now available in open source as well. So you can build bigger clusters wherever you may be.
Abdel Sigewar
Yeah, I mean, it's important to stress that none of this would be possible without all the years of contributions and improvements to Kubernetes open source, right? And those are improvements that are going to benefit everybody else. But before we go there, there's actually one comment Maciek made at the beginning that I found hilarious and interesting. He was basically saying we are looking at large language models potentially having trillions of parameters, and that's where we want to be in terms of making Kubernetes good for that. And I'm like, are we there already? I don't think we're at trillions of parameters yet, right? I think we're still at billions.
Kaslyn Fields
So, background first; always starting with the background. One of the first questions people asked about containers, and of course about Kubernetes, is how big can it scale? So scale for scale's sake is exciting. But it's very interesting to me that a lot of this work is driven by the whole AI movement happening right now. These workloads are unique in a lot of ways, and we're seeing more features and more technology coming out specifically to address their needs. So this huge update to the scalability of Kubernetes itself being driven by that is very interesting to me.
Abdel Sigewar
Yeah, definitely. Even before AI, we had a couple of episodes where we talked to people running large-scale clusters, and I don't think we covered everything, but there were probably people doing HPC on top of Kubernetes; I know CERN, for example, was doing quite a lot of that. So there were already people running at very high scale. It's just that this is a massive leap, driven probably by AI and helped by all the micro-improvements Maciek mentioned in the show.
Kaslyn Fields
Yeah, and I thought it was really interesting, when they went over the capabilities at the absolute tippity-top scale of these new GKE clusters, that they are very geared toward these AI-specific workloads. A lot of the improvements that went into this are useful for all kinds of workloads, and the max cluster size for all sorts of workloads is going to be improved by all of this work. But at the very top level the work has been very focused and very scoped, like you said, to the needs of these AI workloads.
Abdel Sigewar
One interesting thing related to this was that Maciek and Wojtek both mentioned one of the obvious benefits, which I never thought about before: allowing people to do both training and serving on the same cluster instead of on separate clusters. I never thought about that being an actual problem that people care about, but I guess having one not impact the other negatively is something people ran into in the past, so they had to separate them.
Kaslyn Fields
Yeah, I do tend to kind of lump the AI workloads together because of the way that AI as a whole is like influencing the project. But one of the first things that I did when I started diving into this space was try to understand the differences between those workloads. And they are very different in terms of how you need to run them and how they work and what they need to do. So inference and serving each have such unique characteristics. I hadn't thought either about them being on separate clusters or what that really means for how you would implement them in practice. But yeah, that is an interesting point that I also learned from Wojtek and Maciek.
Abdel Sigewar
Well, I think the first thing that jumps to mind is that training is usually a batch-type workload, so it's a lot of pods that have to spin up and then finish very quickly, while serving is more like running a workload for an extended period of time. So the assumptions in my head are around how you make sure one doesn't influence the other in terms of resource availability. And then they also talked about training jobs being sensitive to latency, sensitive to the host configuration, and all this stuff. So it makes a lot of sense now, after listening to the episode.
Kaslyn Fields
Yeah. And I said inference and serving, but I meant training and serving. Inference or serving, because inference and serving are used kind of interchangeably, though I personally have opinions about that.
Abdel Sigewar
Yes, yes, me too. And so then the technical challenges: a lot of very small improvements, as was mentioned. Mostly stuff that has been done in open source, upstream, even before AI was a thing people cared about. Just stuff that Kubernetes was not very good at doing and that had to be "fixed", in quotes, because it really depends on who defines what is good and what is bad. Like the consistent list from cache; I had to go dig into that one. Serving list requests from the API server instead of serving them from storage, that's something I never thought about. But in my day-to-day use of Kubernetes, I do tend to use the list function very, very often. Right.
Kaslyn Fields
Yeah, I haven't looked into that feature much yet. So hearing about it from them was my first time hearing about it.
Abdel Sigewar
Yeah, yeah, this was also my first time. And then all the work the batch working group has been doing, because that's existed for a while as well. And then a bunch of other things were mentioned. I found it interesting when Maciek talked about how, when they tried to pinpoint the one thing that got them to this scalability, they couldn't, right?
Kaslyn Fields
Yeah, you can't. I remember doing a video several years ago about this concept, and I don't remember the exact terminology we used, but it's like envelopes of different limits that impact how large a scale you can actually reach on Kubernetes. The terminology didn't make much sense until you explained it more. But the way scaling limits work in Kubernetes, the way I think about it at least, is that you have all of these different limits, and your strictest one is pretty much the one that sets everything else, in a lot of ways.
Abdel Sigewar
Yes.
Kaslyn Fields
Though they interact with each other in weird ways, so there's no fixed limit on exactly how many pods you can put on a node or exactly how many nodes you can put in a cluster. It's all about how you set up the tools underneath, because that affects those limits.
Abdel Sigewar
Yes.
Kaslyn Fields
And so depending on how you do that, you can achieve really huge scale.
Abdel Sigewar
Yeah, I think I know what you're talking about. If I remember correctly, it was a talk from 2019 called the scalability envelope, or the scalability limits; I don't remember exactly. But there was a talk at one of the KubeCons about these scalability dimensions and the fact that scalability is a multidimensional problem. So it's not a single limit, it's multiple things you have to balance. I'll find the link and add it to the show notes.
Kaslyn Fields
Yeah. It's surprising to me how often that comes up in conversations with folks, and keeping it in mind is really important when you're talking to people about their environments. If you're trying to learn about the scalability of different Kubernetes environments, you're going to hear different answers for what is keeping people from reaching higher scales in different situations. So it's important to keep in mind that all of those are valid.
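The "strictest limit wins" framing the hosts describe can be put into a toy calculation. The numbers below are only illustrative (110 pods per node and 150,000 pods per cluster echo commonly cited Kubernetes scalability thresholds, and 65,000 nodes is the new GKE ceiling from the episode), and the function is our own sketch, not anything from Kubernetes itself:

```python
# Toy model of the "scalability envelope": each dimension has its own
# limit, and the effective scale is set by whichever constraint you
# hit first. Numbers are illustrative, not official limits.

LIMITS = {
    "nodes_per_cluster": 65_000,
    "pods_per_node": 110,
    "pods_per_cluster": 150_000,
}

def max_pods(nodes: int) -> int:
    """Effective pod capacity for a given node count: the per-node and
    cluster-wide pod limits interact, and the stricter one wins."""
    nodes = min(nodes, LIMITS["nodes_per_cluster"])
    return min(nodes * LIMITS["pods_per_node"], LIMITS["pods_per_cluster"])

# At 1,000 nodes the per-node limit binds: 1,000 * 110 = 110,000 pods.
print(max_pods(1_000))   # 110000
# At 5,000 nodes the cluster-wide pod limit binds instead.
print(max_pods(5_000))   # 150000
```

Whichever constraint binds first sets the envelope; raising one limit only helps until the next dimension becomes the bottleneck, which is why there is no single "max scale" number.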
Abdel Sigewar
Yes, yes. And one thing that comes to mind while we're on this topic, something I don't think a lot of people realize, and I've seen it reported as an issue very often, is that the API server is actually, technically, a choke point in Kubernetes. Right?
Kaslyn Fields
Since it's a single point, it certainly can be.
Abdel Sigewar
Or it could be. Right?
Kaslyn Fields
Yeah, yeah, it's good.
Abdel Sigewar
Like, because it's what you talk to when you use kubectl, or whatever your CI/CD pipeline talks to whenever you're deploying or updating things. But also, all the components inside Kubernetes talk to the API server.
Kaslyn Fields
Yep.
Abdel Sigewar
And so this comes up very often when people install too many operators in the cluster, and they all have to query the API server, and it ends up being DDoSed. Right? So yeah, it's quite interesting. It's quite difficult to wrap your head around, but when you spend some time thinking about it, it all makes sense.
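Client-side throttling is the standard defense against this: client-go, for instance, exposes QPS and Burst settings on its REST config, and the API server adds server-side protection via API Priority and Fairness. Here is a minimal token bucket sketch of the client-side idea; the class and the numbers are hypothetical illustrations, not client-go's actual implementation:

```python
class TokenBucket:
    """Minimal client-side rate limiter, the same shape as the QPS/Burst
    throttling clients apply so a busy controller cannot flood the API
    server. Deterministic here: callers pass in the clock."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens replenished per second
        self.capacity = burst       # maximum spike size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5.0, burst=10)
# 25 requests arriving at the same instant: only the burst budget passes.
allowed = sum(bucket.allow(now=0.0) for _ in range(25))
print(allowed)  # 10
# One second later five tokens have accrued (rate = 5/s), so five more pass.
more = sum(bucket.allow(now=1.0) for _ in range(25))
print(more)  # 5
```

Every operator in a cluster carrying its own rate limit is what keeps "too many operators" from becoming an accidental denial of service against the control plane.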
Kaslyn Fields
Yeah. That is one that... it's not the most common one that I hear. The most common limit I hear about, of course, is IP exhaustion.
Abdel Sigewar
Oh, yes, of course.
Kaslyn Fields
Yeah, yeah, that's common because, yeah, IPv4. Don't we all want to move to IPv6? Isn't it the year of IPv6? It's not, but it's been the year.
Abdel Sigewar
Of IPv6 since 2012, so it's never.
Kaslyn Fields
Going to be the year, it feels like. But really, IPv4 exhaustion is a real issue that hits a lot of folks who run Kubernetes clusters, because you need IPs for all of the workloads running on the nodes. And then, how do those workloads interact with other workloads? Do they have their own externally accessible IPs and load balancers? It's a networking problem in the end, distributed computing, because you're just trying to hook a bunch of computers together, so naturally the networking gets tricky. But yeah, the API server is kind of a hidden one. I feel like when it comes up, you're like, huh, really? But it handles all of the requests from within the cluster as well as from outside the cluster, so naturally it can get overwhelmed.
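A quick back-of-the-envelope for the IP exhaustion point: many platforms (GKE among them) carve a per-node pod range out of a cluster-wide pod CIDR, so the CIDR sizes directly cap the node count. The specific sizes below are illustrative defaults, not a recommendation:

```python
import ipaddress

# Cluster-wide pod range and per-node allocation; illustrative sizes.
pod_cidr = ipaddress.ip_network("10.0.0.0/16")  # all pod IPs in the cluster
per_node_prefix = 24                             # each node gets a /24

# Number of /24 blocks that fit in the /16: the hard cap on node count.
max_nodes = 2 ** (per_node_prefix - pod_cidr.prefixlen)
print(max_nodes)            # 256

# Addresses available on each node; keeping spare IPs absorbs pod
# churn, which is one reason a /24 is commonly paired with ~110 pods.
addresses_per_node = 2 ** (32 - per_node_prefix)
print(addresses_per_node)   # 256
```

With a /16 pod range and a /24 per node, the cluster tops out at 256 nodes regardless of any other limit, which is why IP planning comes up so often in scaling conversations.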
Abdel Sigewar
Yeah, yeah. And then, just to wrap up, by the time people listen to this, either you're listening on the day it drops or later. And if it's later, there was a conversation about...
Kaslyn Fields
Those are the options.
Abdel Sigewar
Yeah, of course. Sometimes I tend to state the obvious, but what I wanted to say is: if you listen later, you'll notice that in the show we talked about all the maintainer tracks, basically all the talks at KubeCon, which will be available on YouTube later.
Kaslyn Fields
Yes, absolutely. Check out those recordings if you can. I challenged everyone in the interview to check those out, too. Good luck.
Abdel Sigewar
Yeah. And go listen to the maintainer tracks. I go very often when I'm at KubeCon; I find them some of the most interesting talks. Not that the other ones are not interesting, but I find them very interesting because they go very deep into the weeds of how things work.
Kaslyn Fields
Yep. It's at the core of the thing that the conference is about.
Abdel Sigewar
Yes.
Kaslyn Fields
The open source projects that are at the core of the Cloud Native Computing Foundation. So they are foundational to the event.
Abdel Sigewar
Pun not intended. All right, cool. Well, thank you, Kaslyn. That was pretty cool. And we'll report back from KubeCon.
Kaslyn Fields
Yeah. We hope you all enjoyed learning about Kubernetes at scale and we'll see you in our Kubecon episode.
Abdel Sigewar
Yes.
Abdel Sigewar
All right, cheers.
Kaslyn Fields
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at KubernetesPod, or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you'll find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
Title: 65k Nodes on GKE, with Maciej Rozacki and Wojciech Tyczyński
Hosts: Kaslyn Fields and Abdel Sigewar
Guests: Maciej Rozacki (Product Manager, GKE for AI Training) and Wojciech Tyczyński (Engineering Lead, GKE)
Release Date: November 13, 2024
In this episode of the Kubernetes Podcast from Google, hosts Kaslyn Fields and Abdel Sigewar delve deep into the remarkable expansion of Google Kubernetes Engine (GKE) to support clusters with up to 65,000 nodes, a significant leap from the previous 15,000-node limit. Joining them are Maciej Rozacki and Wojciech Tyczyński, who provide insights into the technical advancements, engineering challenges, and the broader implications of this monumental update in the Kubernetes ecosystem.
The era of Artificial Intelligence (AI) has exponentially increased the demand for colossal computational resources. Traditional Kubernetes clusters, suited for microservices and high-performance computing (HPC) workloads, grappled with scalability constraints as AI models grew in complexity and size. GKE’s new support for 65,000-node clusters is a direct response to these evolving needs, enabling seamless training and serving of AI models with unprecedented scale.
[03:08] Maciej Rozacki: “There is a clear demand for customers to start running at a much larger scale than before... to meet the needs of customers, to be able to both train and serve these models, we need to innovate both in the sizes of clusters and in the capabilities of hardware that they run with.”
GKE’s announcement signifies a more than fourfold increase in the maximum supported nodes per cluster, propelling it to an industry-leading position. This enhancement is meticulously engineered to support AI training at scales previously unattainable, accommodating models with up to 1 trillion parameters and paving the way for even larger models in the future.
[04:50] Wojciech Tyczyński: “We were able to offer cloud customers the ability to operate at 65,000 VM nodes' type of computing power just in a single cluster.”
Achieving support for 65,000-node clusters was no small feat. The journey involved overcoming numerous technical hurdles, primarily focused on enhancing Kubernetes' core architecture to handle such scale.
One of the most significant changes was replacing the traditional etcd datastore with a Spanner-based storage solution. This shift was pivotal in making the Kubernetes control plane stateless, enhancing flexibility and scalability.
[11:19] Wojciech Tyczyński: “We were replacing etcd with our own GKE-specific storage. We call it Spanner-based storage because underneath it's using Spanner, which is Google's technology for database solutions.”
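Part of what makes such a backend swap tractable is that the API server talks to its backing store through a narrow storage abstraction, so any backend that can serve keyed, revisioned records can stand in for etcd. Below is a minimal sketch of that idea; the class and method names are our own illustration, not Kubernetes's actual storage interface:

```python
from abc import ABC, abstractmethod

class ClusterStateStore(ABC):
    """Narrow storage contract: keyed records with monotonic revisions.
    Any backend honoring this contract (etcd, a Spanner-backed layer,
    ...) can sit behind the API server in this toy model."""

    @abstractmethod
    def put(self, key: str, value: str) -> int:
        """Store a value and return the new revision number."""

    @abstractmethod
    def get(self, key: str) -> tuple:
        """Return (value, revision) for a key."""

class InMemoryStore(ClusterStateStore):
    """Stand-in backend used purely for illustration."""

    def __init__(self):
        self._rev = 0
        self._data = {}

    def put(self, key, value):
        self._rev += 1
        self._data[key] = (value, self._rev)
        return self._rev

    def get(self, key):
        return self._data[key]

store: ClusterStateStore = InMemoryStore()
store.put("/registry/pods/default/web", "Running")
print(store.get("/registry/pods/default/web"))  # ('Running', 1)
```

The control plane code above the interface stays unchanged; only the implementation behind it differs, which is what makes the control plane effectively stateless with respect to any particular datastore.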
Investments were made in the data plane to handle increased network traffic efficiently. This included optimizing connectivity and improving components like Cilium, a popular networking solution in Kubernetes.
[16:23] Wojciech Tyczyński: “We did a bunch of improvements across not just core Kubernetes but also in other projects... our engineers contributed back to upstream Cilium.”
To support AI workloads, several Kubernetes APIs were extended. Innovations such as dynamic resource allocation and advanced scheduling paradigms were introduced to manage the complex dependencies and resource requirements of AI tasks.
[12:52] Maciej Rozacki: “Dynamic resource allocation is a whole domain of how do you model this very advanced and sophisticated hardware... enabling these large AI platforms.”
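To make the idea concrete: DRA-style APIs let devices advertise structured attributes and let workloads claim devices by constraint rather than by an opaque count. The sketch below is a toy model of that matching concept, with entirely hypothetical attribute names, not the real Kubernetes DRA API:

```python
# Toy model of claim-based device allocation: devices advertise
# structured attributes, and a claim selects by constraint instead of
# asking for an opaque count. Attribute names are hypothetical.

devices = [
    {"name": "gpu-0", "memory_gb": 80, "interconnect": "nvlink"},
    {"name": "gpu-1", "memory_gb": 40, "interconnect": "pcie"},
    {"name": "gpu-2", "memory_gb": 80, "interconnect": "nvlink"},
]

def satisfy(claim: dict, pool: list) -> list:
    """Return up to claim['count'] devices meeting every requirement."""
    matches = [
        d for d in pool
        if all(d.get(k) == v for k, v in claim["selector"].items())
        and d["memory_gb"] >= claim["min_memory_gb"]
    ]
    return matches[: claim["count"]]

# "Two NVLink-connected devices with at least 64 GB of memory."
claim = {"selector": {"interconnect": "nvlink"}, "min_memory_gb": 64, "count": 2}
print([d["name"] for d in satisfy(claim, devices)])  # ['gpu-0', 'gpu-2']
```

Expressing hardware requirements as structured constraints, rather than "give me N accelerators," is what lets the scheduler reason about topology and interconnect for multi-host AI workloads.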
The establishment of specialized working groups like the Batch Working Group and the Serving Working Group ensured focused development on use-case-specific enhancements, facilitating better integration of AI and HPC workloads.
[10:00] Kaslyn Fields: “We have a bunch of scalability-related improvements going directly into Cilium... ensuring clusters of such scale actually work too.”
A cornerstone of GKE’s scalability advancements is the extensive contributions made to the open-source Kubernetes project. These enhancements not only powered GKE’s 65,000-node clusters but also benefit the broader Kubernetes community.
Spanner-Based Storage Integration: Making control plane storage more scalable and flexible.
Consistent List from Cache: Reduces API server load by serving list requests directly from the cache.
[17:07] Wojciech Tyczyński: “Consistency list from cache allows us to serve the list request directly from API server cache without contacting etcd... reducing the load on the storage.”
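The mechanism can be sketched as follows: every write bumps a global resource version, a cache is kept up to date by the watch stream, and a list can be served from memory once the cache has observed at least the store's latest revision. This is a simplified toy model of the idea, not actual API server code:

```python
class Store:
    """Backing store: every write bumps a global resource version."""

    def __init__(self):
        self.rv = 0
        self.objects = {}

    def write(self, key, value):
        self.rv += 1
        self.objects[key] = (value, self.rv)
        return self.rv

class WatchCache:
    """In-memory cache fed by a stream of Store events."""

    def __init__(self):
        self.rv = 0
        self.objects = {}

    def apply_event(self, key, value, rv):
        self.objects[key] = value
        self.rv = rv

    def list_consistent(self, store_rv):
        # A consistent list may be served from memory only once the
        # cache has caught up to the store's latest revision.
        if self.rv < store_rv:
            raise RuntimeError("cache not fresh enough; wait or hit storage")
        return dict(self.objects)

store, cache = Store(), WatchCache()
rv = store.write("pod-a", "Running")
cache.apply_event("pod-a", "Running", rv)
rv = store.write("pod-b", "Pending")      # cache has not seen this yet

try:
    cache.list_consistent(store_rv=store.rv)
except RuntimeError:
    print("stale")                        # must wait or fall back to storage

cache.apply_event("pod-b", "Pending", rv)
print(cache.list_consistent(store_rv=store.rv))
# {'pod-a': 'Running', 'pod-b': 'Pending'}
```

Because the expensive full-keyspace reads stay in memory, the backing store only has to confirm the current revision, which is where the load reduction comes from.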
Advanced Scheduling Mechanisms: Incorporating AI-specific scheduling requirements to handle multi-host workloads and dynamic resource allocation.
[15:54] Maciej Rozacki: “Leader worker set to enable these more complicated deployments... balancing capacity sharing between jobs and your serving workloads.”
These contributions enhance Kubernetes’ core capabilities, enabling more users to leverage high-scale clusters irrespective of their specific use cases.
The expansion to 65,000-node clusters brings a myriad of benefits not just for AI-centric workloads but for the entire Kubernetes ecosystem.
Users can now train and serve AI models within the same cluster, enhancing operational efficiency and reducing the complexity of managing separate environments.
[22:00] Maciej Rozacki: “Unlike other systems that were built primarily with supercomputing in mind, Kubernetes was built both with supercomputing and these research workloads and the microservices... enabling customers to run both in one environment.”
The ability to rapidly repurpose hardware allows users to adapt to fluctuating demands, crucial for AI research and deployment.
[09:15] Wojciech Tyczyński: “Being able to have that capacity without provisioning the accelerators... is an important factor for why users choose to repurpose existing capacity.”
Improvements such as serving list requests from cache and optimizing the control plane ensure that Kubernetes remains reliable and performant, even under high load.
[19:56] Wojciech Tyczyński: “None of these improvements just help for the size of the cluster, but they also make the system itself more reliable, reduce cliffs, and help with stability under high load.”
All enhancements are contributed back to the Kubernetes open-source project, ensuring that even users operating smaller clusters reap the benefits of improved scalability, reliability, and performance.
[18:19] Kaslyn Fields: “The engineers contribute back to open source and everything is available in open source, allowing anyone to use these improvements.”
Released concurrently with the GKE announcement, KubeCon Cloud Native Con North America 2024 serves as a platform to showcase these advancements and foster community engagement.
AI Day Presentations: Featuring collaborations between Google engineers and partners like Apple, demonstrating sophisticated multitenant environments for researchers.
[25:24] Maciej Rozacki: “Our engineers, together with our customers and partners from the community, will be presenting a couple of very interesting things.”
Poster Sessions: Encouraging researchers to share their work and engage with the Kubernetes community.
[26:23] Wojciech Tyczyński: “Just asking any of those people involved in the 65k nodes clusters will easily redirect you to someone you can speak to.”
Maintainer Track Sessions: Offering deep dives into Kubernetes' core components and allowing attendees to interact directly with maintainers.
[30:45] Kaslyn Fields: “At a maintainer track session, you know that you're talking directly to the engineers who are influencing those areas of the Kubernetes project.”
[27:57] Kaslyn Fields: “It's very important to celebrate that work and for people to know about how awesome it is, which we will be doing at KubeCon.”
The expansion of GKE to support 65,000-node clusters is a testament to Kubernetes’ evolving capabilities in the face of burgeoning AI demands. This leap not only reinforces GKE’s position as a leading managed Kubernetes service but also underscores the collaborative spirit of the open-source community in driving technological innovation.
The podcast emphasizes that scalability in Kubernetes is a multi-faceted problem, involving various components and layers that must work in harmony. Understanding these interdependencies is crucial for effectively scaling Kubernetes clusters.
[43:35] Kaslyn Fields: “Keeping in mind that scalability is a multi-dimensional problem... all of those are valid.”
The strides made in scaling Kubernetes are a direct result of sustained community collaboration and open-source contributions. The hosts encourage listeners to engage with the community, participate in events like KubeCon, and contribute to the ongoing evolution of Kubernetes.
[34:50] Maciej Rozacki: “We will be posting information on our cloud blog... reach out directly to us or to your account team.”
As AI models continue to grow, further innovations in Kubernetes’ architecture and capabilities will be necessary. The groundwork laid by GKE’s 65,000-node support sets the stage for future advancements, ensuring Kubernetes remains at the forefront of cloud-native computing.
This episode of the Kubernetes Podcast from Google offers a comprehensive look into the monumental scaling achieved by GKE and the collaborative efforts that made it possible. Maciej Rozacki and Wojciech Tyczyński provide invaluable insights into the engineering feats, open-source contributions, and the broader implications for the Kubernetes ecosystem. As Kubernetes continues to evolve, its resilience and adaptability shine through, solidifying its role as a cornerstone of cloud-native infrastructure.
[36:23] Abdel Sigewar: “None of this would be possible without all the years of contribution and improvements into Kubernetes open source.”
Listeners are encouraged to attend KubeCon, explore the latest features, engage with maintainers, and contribute to the ongoing success of Kubernetes.
Resources: