Abdel Sighiwar
Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sighiwar.
Mofi Rahman
And I'm Mofi Rahman. Ricardo leads the Platform Infrastructure team at CERN with a strong focus on cloud native deployments and machine learning. He has led the internal effort to transition services and workloads to use cloud native technologies as well as dissemination and training for several years. Ricardo got CERN to join the CNCF and is a member of the Technical Oversight Committee, currently chairs the End User Technical Advisory Board, and leads the Research User Group. But first, let's get to the news.
Abdel Sighiwar
Kubernetes introduced NFD, or Node Feature Discovery. NFD is an open source project that automatically detects and reports hardware and system features on the cluster nodes, helping users schedule workloads on nodes that meet specific requirements. This feature bridges the gap between the workload container image and the node OS, making it possible for applications to leverage drivers for GPU and network devices, libraries and software, and kernel features like VFIO.
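To illustrate how feature labels help with scheduling, here is a minimal Python sketch, not part of the episode. It assumes nodes carry NFD-style labels under the `feature.node.kubernetes.io/` prefix (for example `cpu-cpuid.AVX2`), and it mimics the equality matching that a pod's `nodeSelector` performs. The node names and the helper functions are hypothetical.

```python
NFD_PREFIX = "feature.node.kubernetes.io/"


def matches(node_labels: dict[str, str], node_selector: dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(nodes: dict[str, dict[str, str]], node_selector: dict[str, str]) -> list[str]:
    """Return the sorted names of nodes whose labels satisfy the selector."""
    return sorted(name for name, labels in nodes.items() if matches(labels, node_selector))


if __name__ == "__main__":
    # Hypothetical cluster: only node-a advertises the AVX2 CPU feature.
    cluster = {
        "node-a": {NFD_PREFIX + "cpu-cpuid.AVX2": "true"},
        "node-b": {},
    }
    print(eligible_nodes(cluster, {NFD_PREFIX + "cpu-cpuid.AVX2": "true"}))
```

In a real cluster the scheduler does this matching itself; the sketch only shows why labels published per node are enough to steer a workload to suitable hardware.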
Mofi Rahman
Google announced the Gemini CLI, a command-line AI agent to interact with Gemini from your terminal. The tool can be used to query GitHub issues, code bases and pull requests, scaffold new apps, generate media and more. And the cherry on top is that it's all open source and available on GitHub.
Abdel Sighiwar
The CNCF announced that the Vietnamese version of the Cloud Native Glossary is live. The addition of Vietnamese, part of the effort to localize the glossary, brings the number of languages to 15.
Mofi Rahman
The CNCF announced a new Executive Director. Jonathan Bryce joined as the new Executive Director, replacing Priyanka Sharma, who served in the role for the past five years. Jonathan Bryce brings 15 years of experience in the open source space, including Rackspace, OpenStack and the Open Infra Foundation. And that's the news. Welcome to the show, Ricardo.
Ricardo
Yeah, it's a pleasure. Thank you for the invitation.
Mofi Rahman
So to kick us off, instead of talking about tech things, I wanted to start the show off by talking about one of your hobbies. Doing a bit of Internet stalking, I looked into your profile and it looks like you are into flying airplanes. So my question is: is there anything you can bring into the world of cloud native from learning how to fly a plane?
Ricardo
That's a pretty good question. So yeah, indeed. My main hobby and passion in life apart from computing is anything related to aviation. So flying motor planes and gliders. I never thought about it in this way, but now that you ask, actually there might be quite a lot of similarities between the two. If you think about Kubernetes clusters, they behave pretty well as long as you prepare things in advance. With flying, it's a bit the same. If you're flying motor planes, you probably want to check the weather in the morning, prepare your flight path, plan which airspace you're about to pass through, all these things. And things work better in Kubernetes as well if you do these steps in advance. On the other hand, maybe my real passion in addition to motor planes is sailplanes and gliders. And there it's more the other side of Kubernetes, I would say, which is the exciting part, where you push the boundaries a bit and you look for turbulence constantly and you end up getting into trouble more often than with the standard flight, I would say. Yeah.
Mofi Rahman
So speaking of, like, pushing boundaries, you work at CERN, which is a scientific research organization. How did you come about in the world of Kubernetes? In my mind, when I think of scientific research, it seems very rigorous, like a bit more traditional, not necessarily the cutting edge of cloud native. So how did that connection happen?
Ricardo
Actually, at CERN we always had very large requirements for computing resources. Even before big data was called big data, we already had to deal with terabytes and petabytes of data. So we are constantly looking for the new technologies that will allow us to do more with a fixed budget, because we don't sell anything, so our budgets don't change when we produce more data. So we always have to find better ways to cope with the increasing requirements from the experiments, the physics experiments. And this is basically how we end up looking at everything, and everything ten years ago definitely meant starting to look into cloud native, making computing resource usage more efficient, automating more and making ourselves more efficient. And I think the main drive has been that if you look at the pre cloud era, we ended up writing a lot of the tools ourselves because there was no open source or community offering this kind of tooling. And this is how I joined CERN, actually, as a software developer building distributed computing tools. But then with the advent of the cloud and eventually cloud native, this massive community with very large organizations with similar needs started working together, which is kind of magical if you think about it. And we decided, okay, this is where we should be focusing for the near future and for the future, and just join the community instead of staying in our corner.
Mofi Rahman
So I guess, like, a follow up question to that. You mentioned having a fixed budget and not selling anything commercially at CERN. Other than that, how else would you say scientific research computation is fundamentally different from the products that people are building in cloud native that are meant to be used more by end users?
Ricardo
Yeah, that's a very good question. So the original design of Kubernetes was really for the typical IT service, where you have an endpoint and you have requests and you might have to scale the resources according to the number of requests coming, but it was mostly service oriented. Scientific computing is different in terms of how the workloads are managed. They're usually. And then you have to be able to scale significantly in the number of jobs and the resources these jobs consume. So you need concepts for advanced scheduling and scalability, a lot of the concepts that we now all manage with the recent changes in the computing ecosystem: things like queues, quotas, priorities, preemption, all these things that have been in scientific computing infrastructure and supercomputers for many decades and that were missing in the original Kubernetes. Even if you think of the original Job concept in Kubernetes, which has been there for a long while, it was very much designed focusing on the notion of MapReduce kind of workloads, which is not the traditional batch computing that we need for scientific computing. The other part is also that because of this demand in resources and efficiency, there's a big push for optimization constantly. So things like optimizing the node usage, like pinning CPUs or NUMA awareness, all these very low level things, were not a priority for the traditional service. You mostly wanted to scale up and down, but not necessarily take every small percentage left on the resources.
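As a rough illustration of two of the batch concepts mentioned here, priorities and preemption, the following Python sketch admits a job into a fixed-capacity pool and evicts lower priority jobs only when that is needed to make room. It is a toy model under stated assumptions (a single CPU dimension, hypothetical job names), not how any particular scheduler implements it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    priority: int  # higher number means more important
    cpus: int


def admit_with_preemption(capacity: int, running: list[Job], incoming: Job) -> Optional[list[Job]]:
    """Try to start `incoming` on a pool with `capacity` CPUs.

    Returns the list of preempted jobs (possibly empty) on success,
    or None if the job cannot fit even after preempting every
    lower priority job. `running` is updated in place on success only.
    """
    free = capacity - sum(job.cpus for job in running)
    # Only strictly lower priority jobs may be evicted, lowest priority first.
    candidates = sorted(
        (job for job in running if job.priority < incoming.priority),
        key=lambda job: job.priority,
    )
    preempted: list[Job] = []
    for victim in candidates:
        if free >= incoming.cpus:
            break
        preempted.append(victim)
        free += victim.cpus
    if free < incoming.cpus:
        return None
    for victim in preempted:
        running.remove(victim)
    running.append(incoming)
    return preempted
```

The same shape covers backfilling: low priority jobs fill idle capacity, and they are the first to go when a higher priority job arrives.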
Mofi Rahman
So at this point you already mentioned the word, probably with a different spelling: I wanted to ask you about Kueue. It is a project that is part of the Kubernetes SIGs. How did you get involved, and what made you interested in learning more and using it?
Ricardo
This has been something we looked at from the very first day. We started using Kubernetes like everyone else, for our internal services on the campus and for a ton of things, but very much service oriented. But at the same time, from day one, we started thinking, okay, can we also use this stack to do better for our scientific computing workloads? And there were projects appearing even in the early days that were focusing on this, and this was mostly about having a batch scheduler. And the batch scheduler is, as we said, queues, quotas, priorities, preemption, all these things, gang scheduling. There were projects in the beginning like Volcano and kube-batch. Even before that, for federated or multi cluster deployments, there was something called KubeFed v1 and v2. All of them did the job, but they were not in core Kubernetes, which meant some of the integrations were not necessarily perfect. And also you would have to buy into different types of resources. These projects are still very popular and largely used. But I think there was a reason to come up with a core common component in the scheduler that even those projects can rely on. And Kueue came from this idea, from multiple people. We were one of the advocates in groups like the Research End User Group, and then other larger organizations with more development capacity bought in, and we started collaborating to make sure that as an end user we can provide the requirements and also push in the community to build momentum around this project. So this is how we started talking to other universities around this topic, and then Google and other organizations in the CNCF. It's really out of our need to simplify and make overall usage internally better. We saw the value of Kubernetes, so there was a lot of motivation to do the same for scientific computing.
Mofi Rahman
So Kueue at this point, I think the latest version of Kueue is 0.11 if I'm not mistaken, could be 0.12 by now. But it is still, I would say, early days for something like this. But Kubernetes itself is about 11 years old. So it feels like in many ways, as you mentioned also, Kubernetes initially was for stateless web application type things. Is it a lack of having scientific and research type voices in the early days of the Kubernetes world? Like, why do you think it took so long for the Kubernetes project to have a strong opinion about how jobs should have these features that research needs?
Ricardo
I think that's it. The use case and motivation was not there. Like, the use case existed, but traditionally scientific computing workloads are done in very specific kinds of infrastructures, especially for high performance computing (HPC) workloads, where we rely on very large supercomputers or you have your own on premises data centers with things like low latency connectivity, InfiniBand, very specific scheduling requirements, and tools existed to do this. There were tools like HTCondor or Slurm, which is very popular. So the motivation was not there from the people managing those centers to transition to something new. The motivation appeared when people started realizing that by using or looking at something like Kubernetes as a kind of commodity these days, where everything integrates with it and all infrastructures expose an API to manage the resources via it, we could go beyond what we can do today with traditional HPC schedulers. And this is where the topics became more popular. Now, still, there were not so many large organizations that would justify implementing this. This changed a lot as big data became the norm. Any kind of company or startup will now talk about petabytes or even exabytes of data. And once you get to that, then you start looking at the things that are traditionally scientific computing workloads. And then the last bit has been GenAI. This has really been the big transition. Once GenAI appeared and people started thinking that building on the existing stack we've worked on for the last 10 years is probably the best way to manage AI workloads, instead of building something completely new, then the investment really came and you saw this growth in this kind of project and in this area at all levels. This has really been the transition. So I think it took long because the use cases were also being built at scale at the same time.
Mofi Rahman
So currently at CERN, is the workload mostly running on on-prem hardware, or is it like a mixture of on-prem and cloud, or mostly cloud?
Ricardo
Yeah, so CERN is mostly on premises, and the reason for that is really cost management. If you have very large workloads like we do, it is cost effective to build on premises data centers. And we do that. But also we have a history of managing data centers, so we know how to do it. And we also have a history of managing remote data centers from the sites that collaborate with us, and things like the grid that we've built in the last 20 years. The usage of external resources, in particular public clouds, is then mostly for bursting capacity, for peak workloads, and for scarce or kind of specialized resources. And this is especially important for things like GPUs, which are extremely expensive. You don't want to over provision those. And then there's the number of updates coming, and new cards and new heterogeneous kinds of hardware appearing. It's very hard to follow that when you have an on premises data center and you're not offering it as a service. So we tend to use external resources for those, for things like benchmarking, POCs and even for scaling out our workloads.
Mofi Rahman
So you are using Kueue currently at CERN, so do you get a lot of benefit? Because Kueue works really well when you have this elastic workload, and it can handle preemption and priorities. But for on premises workloads, when you have a data center where you own all the hardware anyway, what benefits are you getting out of Kueue with the preemption and the fair sharing?
Ricardo
Yeah, that's an excellent question. The model is very different. In the public cloud, you want to minimize the resource usage for your workloads so that you pay the absolute minimum. On premises, because you already bought the hardware, you want to maximize overall usage. So the principles are different. You don't look so much at cluster autoscaling or autoscaling in general, but you do have requirements to optimize overall efficiency and usage: you need to have tenants, and each tenant has nominal quotas, but you need to be able to borrow quotas from different queues, from different tenants, in case those tenants are not filling up their nominal quotas. Because again, what you want is to maximize overall usage, not specific tenant usage. And Kueue provides concepts like cohorts for borrowing, or things like fair sharing. Fair sharing is one of the key features that motivates us to use Kueue, along with all these ideas of having priorities so that you can backfill resources that are currently available with other workloads, and then preempt them and replace them with higher priority workloads when those come in. All of this is extremely important for us. And these are really the key features of Kueue that are not necessarily that important in the case of public clouds. For example, fair share is important there, but it's actually a key feature if you're running an on premises data center. Then there are other features that we use internally, things like gang scheduling or array jobs. These are also things that require a scheduler, which is what Kueue is offering, and that you cannot do with standard Kubernetes.
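To make the idea of nominal quotas and borrowing within a cohort concrete, here is a minimal Python sketch. It is an illustration only, not Kueue's API or its admission algorithm: it assumes a single resource (CPU cores), hypothetical tenant names, and a cohort that simply pools whatever quota its members are not using.

```python
from dataclasses import dataclass


@dataclass
class TenantQueue:
    """A tenant's queue with a guaranteed (nominal) quota, in CPU cores."""
    name: str
    nominal_quota: int
    used: int = 0


def borrowed(queue: TenantQueue) -> int:
    """How many cores the queue is using beyond its nominal quota."""
    return max(queue.used - queue.nominal_quota, 0)


def admit(queue: TenantQueue, cohort: list[TenantQueue], request: int) -> bool:
    """Admit `request` cores for `queue` if the cohort as a whole has room.

    A queue may exceed its own nominal quota as long as total usage
    stays within the sum of nominal quotas of all queues in the cohort.
    """
    cohort_capacity = sum(q.nominal_quota for q in cohort)
    cohort_used = sum(q.used for q in cohort)
    if cohort_used + request > cohort_capacity:
        return False
    queue.used += request
    return True


if __name__ == "__main__":
    tenant_a = TenantQueue("tenant-a", nominal_quota=100)
    tenant_b = TenantQueue("tenant-b", nominal_quota=100, used=20)
    cohort = [tenant_a, tenant_b]
    print(admit(tenant_a, cohort, 150), borrowed(tenant_a))
```

In this toy model a tenant that later wants its nominal quota back is simply refused; a real system would pair borrowing with preemption so that lent capacity can be reclaimed.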
Mofi Rahman
Yeah, you did mention a little bit about existing schedulers, the tools like YuniKorn, Volcano, KubeFed. They all had some ideas of how to do non web app type applications on Kubernetes. So can you speak a little bit more about Kueue being more Kubernetes native? How did that help you make the decision to choose Kueue versus something that is in itself a new scheduler on top of Kubernetes?
Ricardo
Yeah, yeah, that's a good point, and I think I can answer that with a bit of history. When we started looking into Kubernetes, I used to work on the development of what we call the grid computing infrastructure, where we built a lot of software a long time ago that we would like to replace with something more sustainable. And so one of the first things I started looking at was: can Kubernetes be a replacement for this notion of grid computing sites and managing jobs? So I started looking at jobs in Kubernetes and how I could submit a lot of them and monitor them and all these things. And KubeFed was around at the time, KubeFed v1, and KubeFed v1 had one big advantage, which is that it was only using normal resources from Kubernetes. So I could just take an existing workload, configure KubeFed with multiple sites on the back, or multiple clusters on the back, and everything would work. Now, the Job concept was not good enough because there were missing abstractions in the Job at the time. So people came up with KubeFed v2, but KubeFed v2 actually went too far. It created its own custom resources. So then suddenly none of the tools in the ecosystem were compatible. You would actually have to change existing Helm charts and all of this to make use of KubeFed v2. So it wasn't a big success, I think a lot because of this, and I think the same goes for the rest of the tools. The motivation for Kueue is that it really is designed inside the Kubernetes project. So every design decision is reviewed by people that are contributing directly to Kubernetes and reviewed by the other special interest groups. If you're designing policy for managing scale out jobs, the autoscaling, the people working on autoscaling, cluster autoscaling, will have to approve that and ensure that whatever the decision is, it fits well into the cluster autoscaling policies and the ways of working. The same for the management of low level devices and optimizations on the nodes: you will have SIG Node looking into it.
So all of this makes Kueue a very good solution for the ideal integration with the rest of the Kubernetes core. That doesn't mean that it replaces the other projects. It means that it probably takes a lot of what the other projects had to implement themselves into a common core. And this is where I see the value of Kueue: it really gives us all the functionality we need for the batch computing kind of workloads, HPC kind of workloads, integrated into the rest of Kubernetes.
Mofi Rahman
Yeah, I guess it goes back to the point of going fast versus going far. Right. All the other projects initially, when nothing existed, paved the path, showed the use case in many ways, showed people that Kubernetes is a place where you can run HPC and batch type workloads. And then in some ways it actually motivated the maintainers of Kubernetes to see, okay, we need to make Kubernetes natively better. In some ways there was a huge benefit from having those projects proving the POC of, yes, it works, and these projects still exist. I actually spoke to a few of the maintainers of, like, Volcano and YuniKorn at one of the past KubeCons just to hear how those projects are doing. And I feel like cloud native is so big there is room for pretty much any type of use case there.
Ricardo
But this is pretty much the point: even if we put a lot of the common things in the core, there are always things that will not make it into the core because there are not enough use cases supporting integrating them. But it doesn't mean that those use cases are not valid, it just means that they have to live outside the core of the ecosystem, which justifies the continuation of these projects. And the CNCF does a pretty good job, with the idea of sandbox, incubation and graduation, in supporting the projects along this maturity level path. And we learn as they go: there's a lot of experimentation and then there's some consolidation happening. This is what we are seeing in this area.
Mofi Rahman
Yeah, I think again more and more functionality in Kubernetes is even being added out of tree instead of in tree. Things like Gateway API are not actually part of Kubernetes itself, they live somewhere else. So I think the whole picture you have painted so far about scientific computing, Kueue, batch and all this stuff is fantastic. Love to see the work. And I think last year you also accepted an end user award, so congratulations, if I have not already said it at one of the KubeCons where we ran into each other. I think the last thing I was going to ask is a bit future looking. So you're putting on a speculation hat; you get to speculate as much as you want at this point. Where do you see the future of batch workloads, and potentially Kueue, going? Let's say five years from now, in 2030, we record the next episode of the Kubernetes Podcast with you: what kind of things will we be talking about?
Ricardo
Okay, I will take the risk, but I will start slowly and then I will go for the more out there ideas. If we look at Kueue, I think the big developments will be on things like MultiKueue and better support for this kind of multi cluster, even multi domain, multi region, multi cloud kind of management. And especially if the trend for this high demand of high end GPUs continues, it will be essential that we get this optimized so that we manage or optimize costs across multiple deployments. But also, and this is something that I talk about from time to time, with this kind of high end GPUs, the cloud is no longer what it used to be. It doesn't feel on demand anymore. It feels more like on premises. Because you're doing these very long reservations to get any kind of discount you can, which means you're basically pre committing resources for a year, two years, three years, which is not that far from buying stuff to put on premises and then having them there and having to manage them efficiently. So I think Kueue and MultiKueue have a really good opportunity here. Exploring the notion of provisioning requests and future reservations, all these things are really essential to manage this kind of very high demand type of resources. That's on, I would say, the easiest side, or the less speculative. I think the other one, and this is a very big point, is that we see, especially in the AI world, a trend to build very dense compute resources. We had a trend with the clouds of having commodity hardware and a lot of nodes. And we see this going back a bit to having fewer nodes and much more density, with very low latency interconnects and very tightly coupled GPUs or other types of resources. And this kind of feels like going back a bit to the mainframe era, where you have these beasts in your data centers and you end up giving users timeshares instead of, like, full nodes or even GPUs. I think this is a challenge in all respects.
The first one is allocating resources to users, because again it will be very similar to what used to be done with timeshares and mainframes. The other one is that for anyone that is not a hyperscaler, the data centers are not designed to accommodate this kind of thing. The density of power, the needs for cooling, we suffer from this today. There are servers that you can buy with very high density GPUs that we cannot fit in a rack, in a full rack that we own, for a single server, because the power is too much. So I think a lot of it in the next couple of years will be learning again how to use Kubernetes to manage different kinds of resources. Again, it's the story repeating itself, and probably learning how to partition very dense resources in a way that people can share them. I think it's very interesting but extremely challenging.
Mofi Rahman
Yeah, absolutely. I think the work that is happening in the DRA space, giving this specialized hardware an API almost similar to something like storage, bringing it down to that level of understanding and usability, is going to be very key in making this happen. In any case, I'm super excited for whatever the future brings in terms of compute. The last question, I guess, or maybe not the last question, because I've been enjoying talking to you so much. Have you had any conversations with folks, or tried yourself, to run Slurm on Kubernetes? You mentioned something like Slurm that people used to use in data centers, but there has been some work happening over the last few years as well to make Slurm work on Kubernetes. Thoughts?
Ricardo
I have a lot of thoughts, actually. This is one of the most popular topics of discussion in the Research End User Group. And for those listening, the Technical Oversight Committee in the CNCF just had some restructuring. We called it the TAG reboot, and there's this notion of initiatives, where anyone can come forward with an initiative. There will be one about cloud native HPC, which is focusing on exactly that, doing kind of a survey of the options available in the ecosystem for this. If you ask me, I think it will be very hard to transition to Kubernetes managed HPC supercomputers, the traditional ones that I know, the very large ones in the TOP500, because there's a lot of history and integrations with tools like Slurm. So I think the best option that we have, and that is being followed by several people, several projects, is to still use Kubernetes for managing the workloads, while being able to submit to Slurm endpoints behind it. This is the bridge between Kubernetes and traditional HPC scientific computing. I think there are several motivations for that. The two that I think are the main ones: one is that you don't have to convince the sysadmins of these supercomputers to move to anything else. And the second one is that for any kind of modern workload, machine learning, AI, the frameworks and tools that exist all integrate with Kubernetes, and they know how to manage their distributed trainings and similar workloads with the Kubernetes backend. They do not integrate with things like Slurm very easily. So there is a lot of motivation to just rely on Kubernetes as the common API for all of this. If you're interested, there are projects like interLink, which just became a sandbox project, supernettis, there's Slurm Bridge or Slinky from SchedMD. There's plenty of things popping up.
Mofi Rahman
Yeah. So in that world, are you saying that you would submit your jobs through the Slurm CLI and that would pop up in Kubernetes, or the other way around?
Ricardo
So I think both are possible. If you want to support Slurm users but your backend, your infrastructure, is based on Kubernetes, that's the option. I think the more interesting one for me is the opposite, which is to submit and manage your workloads as Kubernetes workloads, but still make use of the infrastructures that exist that expose Slurm APIs. The reason for that is there are very large supercomputers in Europe and the US with a lot of GPUs, and I would love to have an easier way to get access to them. Right now, the easier way, at least from my point of view, is just to expose and integrate with the Kubernetes APIs.
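A small Python sketch can show the direction described here, where a workload is described in Kubernetes-like terms and then translated for a Slurm endpoint. This is only an illustration under stated assumptions, not how interLink, Slinky or any other project mentioned works: the input dictionary and its keys (`name`, `parallelism`, `cpus`, `gpus`, `command`) are hypothetical, and the output uses common `sbatch` directives.

```python
import shlex


def render_sbatch(job: dict) -> str:
    """Render a simplified, Kubernetes-like job description as a Slurm batch script.

    Expected (hypothetical) keys: name, parallelism, cpus, command, and
    optionally gpus.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job['name']}",
        f"#SBATCH --ntasks={job['parallelism']}",
        f"#SBATCH --cpus-per-task={job['cpus']}",
    ]
    if job.get("gpus"):
        # Request generic GPU resources only when the job asks for them.
        lines.append(f"#SBATCH --gres=gpu:{job['gpus']}")
    # Quote the command safely so arguments with spaces survive.
    lines.append(shlex.join(job["command"]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "name": "train",
        "parallelism": 4,
        "cpus": 8,
        "gpus": 2,
        "command": ["python", "train.py"],
    }
    print(render_sbatch(example))
```

A real bridge also has to handle container images, data access, status reporting and cancellation, which is where most of the actual work in these projects lies.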
Mofi Rahman
Lovely. Yeah. I think we also have a project similar to this called XPK, where you can use very much Slurm-like commands, but through a CLI tool called XPK that can create your Kubernetes cluster and run the jobs as, like, a single CLI command. We'll link all of that, and the links you mentioned, I'll get them from you to link for our listeners so that they can also take a look if they're interested.
Ricardo
Anyone listening, if you're interested, yeah, watch this space. Under the Technical Oversight Committee, there will be initiatives coming in this area. So just join and participate.
Mofi Rahman
Yeah, I'm super excited to learn more. Hopefully I'll run into you in person at one of the KubeCons. But before we finish this off, any final thoughts? Anything people should know about?
Ricardo
I will finish as I often finish, which is with everyone involved in this community. It doesn't matter if you're a maintainer or a supporter or an end user, making sure these projects are successful and helping everyone, or just giving feedback. I think it's quite important that everyone realizes that we've built a community that goes way beyond individual organizations. Like, you mentioned the awards CERN got. The reason that we are so involved is because all this community, all this software that we are supporting and the different projects, is completely changing the way we do scientific computing, and for the better. We can do a lot more now than we could 10 years ago, thanks to the efforts of the whole community. Of course, we thank very much the big organizations that keep the lights on and the projects going and the releases coming, but also every other organization that helps keep the groups together, keep podcasts going, of course, and all the rest that is required to keep the community healthy. Yeah. I always stress this. You're making a huge difference for science and scientific research.
Mofi Rahman
Yeah. And also to anybody listening that happens to be not in a position to directly contribute to Kubernetes: if you're using Kubernetes in a meaningful way, being an end user, giving the project feedback about how you're using it, finding interesting ways to. A lot of the things that exist in Kubernetes now that didn't exist in the early days happened because we found end users that came up and said, this does not cover this thing we're trying to do. Can we do something? From that came KEPs or, like, new ideas, new PRs. And we probably even found new maintainers and contributors to projects because they were using something, something was not working the way they wanted it to, and they added it. There's no better time to start. Like, the best time to start was yesterday. The next best time to start is today. So, yeah, absolutely. That's how I'll end it. Ricardo, thank you so much for spending the time, and hopefully you still have some sunlight left in your day. So enjoy the rest of the day. Thank you so much.
Ricardo
Yeah, thank you. Thank you.
Abdel Sighiwar
Thank you, Mofi, for recording that episode. I know that you've been trying to get your hands on Ricardo for a while and it has been challenging.
Mofi Rahman
Yeah. I mean, again, we have a very short sliver of time during the day. He is on Europe time. I'm on, like, US New York time. But I really wanted to chat with Ricardo. I have a lot of interest in the Kueue project, and so I wanted to chat with him about, like, batch. And yeah, I'm glad we finally managed to make it work and got the interview.
Abdel Sighiwar
Yeah, no, it was a good interview because you touched on a lot of things. But before we get into it, I'd like to start. He's into flying. I didn't know that you managed to find that information.
Mofi Rahman
Yeah, I mean, it was not that much digging. I was just trying to get his bio for the episode, and it was on his personal website, like the first thing. So I was like, yeah, we should talk about it. Oftentimes in cloud native or in technology, we talk to folks and we get too deep into their technical background. I thought it would be fun to start the conversation off with something that is not related to tech, something personal. But again, I think I managed to bring in the question that tied it together anyway.
Abdel Sighiwar
Yeah. Which was, like he said, basically that planning a flight is like planning a Kubernetes installation. The more you do upfront, the easier it is, basically.
Mofi Rahman
Right, yeah. Again, I think in any type of thing you do, right, getting a good plan in place is probably really powerful. The old adage of measure twice, cut once, that is used in carpentry. So it's the same idea. Right. Like, the more you plan up front, the more things you get done, and the more you know what is to come, the fewer surprises you have. Right. It's not that you're going to get it perfect every time, but it's almost like you get to make new mistakes, not the same mistakes again.
Abdel Sighiwar
Yeah. And so I don't know if this was a mistake or not, but one of the first things discussed was KubeFed, which seems to be a federation, multi cluster tool. We were looking into it: version one doesn't exist anymore. It's archived, actually, under the Kubernetes project. But I think version two is now called KubeAdmiral or something.
Mofi Rahman
Yeah, I think the conversation kind of went there as we were talking about Ricardo's experience in the world of batch workloads in Kubernetes. Again, Kubernetes is about 11 years old this year; it was 10 last year. And Ricardo has been in this space, looking at it since Kubernetes was, let's say, two or three years old. Right. So at that time there were many ways to solve this problem of: I need to run this ephemeral job on this platform, how do I go about doing this? And many people, rightly so, were thinking: okay, I can run stateless web applications pretty well, how do I do jobs? So there were a number of different solutions, actually. Some of them still exist, some of them have changed. Like KubeFed is now called KubeAdmiral as of 2021 and still exists in some capacity. So in Kubernetes, when you want more resources, you could either have bigger pods, more pods, bigger nodes, or more nodes. Right. Those are kind of the four dimensions of scaling in Kubernetes. But there is an upper limit, right? Open source Kubernetes has an upper limit of 5,000 nodes. But there are jobs, like training workloads for large language models, that go way beyond 5,000 VMs; you need way more machines to do that. At that point you have a few options. One is the architecture of KubeFed, or multi-cluster in general, where you have a controller cluster that knows about a bunch of other clusters and sends jobs to them. Or you could do something where, instead of owning or controlling the clusters, you just control some mechanism to send jobs to, or use resources in, other clusters. So there are many ways to handle this. Recently we announced Multi-Cluster Orchestrator. That is a similar concept: you have a centralized hub that can send things to other clusters. But at the same time, in GKE we have been working on making clusters bigger, right? We recently showcased that.
I was part of one of the demos that Maciek and I did at KubeCon, showing off 65,000 nodes in a single cluster, right? So 65,000 nodes in a single cluster is equivalent to thirteen 5,000-node clusters. So if you had ten 5,000-node clusters before, now you can potentially do that in a single cluster. But eventually we'll find a world where 65,000 nodes is not enough. Then you need ten 65,000-node clusters or one 650,000-node cluster. So we're going to have to keep one-upping each other if the scale of training jobs stays on the same trajectory. So far, making models bigger has always made them better. So as long as that trend continues for large language models, we're going to continue to see training jobs get bigger and bigger. And there have been other approaches too. There have been some rewrites of the Kubernetes scheduler itself, like Volcano or YuniKorn, which are batch systems that run on top of Kubernetes but replace the Kubernetes scheduler. There has been work to run Slurm on Kubernetes. There is Ray, with KubeRay, which works on Kubernetes. We gave a talk at KubeCon last year about KubeRay. So it's not like everybody just sat together in a room and decided on one solution. People kept trying different things, and we learned from each approach what is good and what is bad about it. And Kueue kind of came out of a lot of this conversation. Talking to Ricardo was almost like getting a history lesson on the whole spectrum, because he has been through all of it from the beginning and has seen the evolution. So for me it was a really good experience to talk to him about that journey, to see why this is the way it is and what sets Kueue apart from the other solutions. It's not that Kueue is inherently better than anything else. It's just that Kueue had the time to take in the learnings from all the other solutions that came before.
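As a rough illustration of the cluster arithmetic mentioned here, the sketch below computes how many clusters of a given size a workload would need and how its nodes would be split. The function names are made up for this example, and the 5,000 and 65,000 figures are simply the ones quoted in the conversation.

```python
def clusters_needed(total_nodes: int, nodes_per_cluster: int) -> int:
    """Smallest number of clusters that can hold total_nodes."""
    if total_nodes < 0 or nodes_per_cluster <= 0:
        raise ValueError("total_nodes must be >= 0 and nodes_per_cluster > 0")
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_nodes // nodes_per_cluster)


def split_across_clusters(total_nodes: int, nodes_per_cluster: int) -> list[int]:
    """Fill clusters one by one; the last cluster takes the remainder."""
    sizes = []
    remaining = total_nodes
    for _ in range(clusters_needed(total_nodes, nodes_per_cluster)):
        take = min(remaining, nodes_per_cluster)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    # One 65,000-node cluster versus 5,000-node clusters.
    print(clusters_needed(65_000, 5_000))
    # A hypothetical 650,000-node job on 65,000-node clusters.
    print(clusters_needed(650_000, 65_000))
    print(split_across_clusters(12_000, 5_000))
```

This only counts nodes. A real multi-cluster setup would also have to account for networking, data locality, and per-cluster quotas, none of which are modeled here.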
Abdel Sighiwar
So can you maybe tell the audience one of the key learnings, something that Kueue does differently? Because one of the things you mentioned during the interview is native objects, which... yeah.
Mofi Rahman
So pretty much every other scheduler on Kubernetes that does job-related things would basically create a new concept of a pod. Right. You basically need to have something that understands what a job is. It's basically a new CRD. That means you are replacing what Kubernetes gives you out of the box. Right. Which is great: you can have a lot of control, you can put in as many hooks and flags as you want, but now you are out of tree from what Kubernetes gives you. So as Kubernetes gets better, as the Job objects and the Pod objects get better, you either have to implement that yourself or you kind of have to live without it. Right. Kueue, on the other hand: in Kubernetes 1.25, I believe, there was a flag introduced to the Pod object called suspend. If the suspend flag is set to true, the scheduler knows this pod should not be scheduled. Basically that's all it is. So what the Kueue controller does is look at your resources, look at how much resource you have, and set the suspend flag to either true or false. That's all Kueue does, fundamentally. So that means when Kueue sets a pod's suspend flag to true, the scheduler won't touch it. And again, it's a very simple idea. But what it means is that since we're not actually reinventing any of the Pod semantics in Kubernetes, as the Job and the Pod get better with newer Kubernetes releases, Kueue takes advantage of all of that, because Kueue is not rewriting anything. It's literally one flag in the Kubernetes object, like the Pod object, that it touches. So the learning was: if you go out of tree of the Kubernetes semantics, you have to do way more work to catch up. The Kubernetes Pod, again, gets better over time. But if you're outside of that line, you either have to do more work or, over time, if you go out of maintenance, all of a sudden people are losing out on good features that got introduced: performance improvements, quality of life, more metrics, logs, everything.
You have to manually do it yourself again. So initially it's easy to move fast because you are doing your own thing. You don't have to wait for Kubernetes to implement things. But later down the line you are slowed down, because you have to implement all the things that are in the main tree yourself again.
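The idea described here, a controller that only flips a suspend flag based on available resources, can be sketched as a toy model. This is not Kueue's actual code: the `Job` class, the `admit` function, and the single CPU quota are assumptions made for illustration. For reference, in upstream Kubernetes the `suspend` field lives on the batch/v1 Job spec (`spec.suspend`), which is what the `suspend` attribute below stands in for.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a Kubernetes Job; not a real API type."""
    name: str
    cpu_request: int
    suspend: bool = True  # jobs start suspended, so nothing runs yet


def admit(jobs: list[Job], cpu_quota: int) -> int:
    """Unsuspend jobs, in submission order, while they fit in the quota.

    Jobs that do not fit stay suspended; later, smaller jobs may still
    be admitted. Returns the amount of quota in use afterwards.
    """
    used = 0
    for job in jobs:
        if used + job.cpu_request <= cpu_quota:
            job.suspend = False  # the only field the controller touches
            used += job.cpu_request
        else:
            job.suspend = True
    return used


if __name__ == "__main__":
    queue = [Job("a", 4), Job("b", 8), Job("c", 2)]
    used = admit(queue, cpu_quota=8)
    for job in queue:
        print(job.name, "suspended" if job.suspend else "admitted")
    print("cpu in use:", used)
```

The point of the design is visible in how little the controller does: it never creates or places pods itself, so everything else about the job keeps whatever behavior the underlying platform provides.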
Abdel Sighiwar
Yeah, yeah. And I see this kind of philosophy also replicated across multiple other tools. Like MCO arguably does the same thing. It doesn't really touch your pods, it doesn't schedule, it doesn't do anything. It just provides you with a recommendation for where to place the workload. Right. So I understand that philosophy of: let's let Kubernetes do what it's best at doing and only provide extra stuff to handle edge use cases, in a way, you know.
Mofi Rahman
Yeah. So I think that has been the biggest learning and change in the mindset of folks building on Kubernetes over the last few years: instead of fighting the system, you're basically using the system to get done what you want to get done. Use as much of native Kubernetes as you can and then extend it, rather than trying to have a parallel execution on the side.
Abdel Sighiwar
Yeah. Yeah.
Mofi Rahman
Cool.
Abdel Sighiwar
Well, thank you, Mofi, for this interview.
Mofi Rahman
Yeah, I really enjoyed it. I hope people also enjoyed it. And Ricardo also shared a few links that we will put in the show notes, and I think there are some fun learnings there. That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you will find transcripts and show notes and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening and we'll see you next time.
Kubernetes Podcast from Google: HPC Workload Scheduling with Ricardo Rocha
Release Date: July 9, 2025
Hosts: Abdel Sghiouar and Mofi Rahman
Guest: Ricardo Rocha, Platform Infrastructure Lead at CERN
In this episode of the Kubernetes Podcast from Google, hosts Abdel Sghiouar and Mofi Rahman delve into the specialized world of High-Performance Computing (HPC) workload scheduling with guest Ricardo Rocha from CERN. Ricardo brings a wealth of experience in cloud-native deployments and machine learning, spearheading efforts to transition CERN's services to cloud-native technologies. As a member of the CNCF Technical Oversight Committee and chair of the End User Technical Advisory Board, Ricardo provides invaluable insights into the intersection of scientific research and Kubernetes.
Before diving into the main discussion, the hosts cover the latest updates in the Kubernetes ecosystem:
Node Feature Discovery (NFD):
[00:49] Abdel Sghiouar introduces NFD, an open-source project that automates the detection and reporting of hardware and system features on cluster nodes. This facilitates the scheduling of workloads on nodes that meet specific requirements, bridging the gap between workload container images and node operating systems. NFD enables applications to leverage various drivers, libraries, and kernel features seamlessly.
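NFD advertises what it detects as node labels under the `feature.node.kubernetes.io/` prefix, and workloads can then select nodes by those labels. The sketch below models that matching step in plain Python. The node names and the exact label keys and values are illustrative assumptions, not output from a real cluster, and `matching_nodes` is a made-up helper rather than part of any Kubernetes client.

```python
def matching_nodes(nodes: dict[str, dict[str, str]],
                   required: dict[str, str]) -> list[str]:
    """Return the names of nodes whose labels include every required pair."""
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in required.items())
    )


if __name__ == "__main__":
    # Illustrative labels only; real clusters will report different features.
    nodes = {
        "node-a": {"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true"},
        "node-b": {"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
                   "example.com/gpu.present": "true"},
        "node-c": {},
    }
    wanted = {"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true"}
    print(matching_nodes(nodes, wanted))
```

In a real cluster the same effect would come from a node selector or node affinity on the pod spec, with the scheduler doing the matching.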
Google Gemini CLI:
[01:16] Mofi Rahman discusses Google's announcement of the Gemini CLI, a command-line AI agent designed to interact with Gemini directly from the terminal. This tool allows users to query GitHub issues, codebases, pull requests, scaffold new applications, and generate media, all while being open source and available on GitHub.
CNCF Vietnamese Glossary:
[01:33] The CNCF has localized the cloud-native glossary into Vietnamese, expanding its reach to 15 languages. This initiative enhances accessibility and understanding of cloud-native terminologies for Vietnamese-speaking communities.
New CNCF Executive Director:
[01:44] Jonathan Bryce has been appointed as the new CNCF Executive Director, succeeding Priyanka Sharma. With 15 years of experience in the open-source space, including roles at Rackspace, OpenStack, and the Open Infra Foundation, Bryce is poised to lead CNCF into its next phase of growth.
Mofi Rahman starts the conversation by touching on Ricardo's personal passion for flying airplanes.
[02:35] Ricardo Rocha:
"Flying multi-planes and gliders involves a lot of planning, much like managing Kubernetes clusters. Preparing in advance, checking the weather, and planning the flight path are akin to setting up a Kubernetes environment—both require meticulous preparation to ensure smooth operations."
Ricardo draws parallels between aviation and Kubernetes management, emphasizing the importance of upfront planning to avoid turbulence, whether in the skies or in cluster management.
Mofi Rahman probes into how CERN, a hub of rigorous scientific research, integrated Kubernetes into its infrastructure.
[04:02] Ricardo Rocha:
"At CERN, we always had large requirements for code and resources, managing terabytes and petabytes of data even before the term 'big data' became commonplace. Our fixed budget necessitated finding more efficient ways to handle increasing experimental demands. Ten years ago, we began exploring cloud-native technologies to automate and optimize resource usage, leading us to join the Kubernetes community instead of operating in isolation."
Ricardo explains that CERN's need for efficient resource management and automation drove their adoption of Kubernetes, leveraging community-driven tools to meet their scientific computing needs.
Mofi Rahman follows up by asking about the fundamental differences between scientific computing and typical cloud-native projects.
[06:02] Ricardo Rocha:
"Traditional Kubernetes was designed for service-oriented workloads, focusing on endpoints and scaling based on request volume. In contrast, scientific computing involves managing a vast number of jobs with significant resource consumption, requiring advanced scheduling features like queues, quotas, priorities, and preemption. Additionally, optimizing node usage at a low level—such as CPU pinning and NUMA awareness—is crucial for us, something that wasn't a priority in typical service environments."
Ricardo highlights that scientific workloads demand more sophisticated scheduling and resource optimization compared to standard web applications, necessitating enhancements to Kubernetes' original architecture.
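To make the scheduling vocabulary in this answer concrete, the following is a minimal sketch of quota, priority, and preemption working together. It is a toy model under stated assumptions: a single CPU quota, a made-up `Workload` type, and a simple "evict lowest priority first" rule. Real batch systems apply considerably more policy than this.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload record used only for this illustration."""
    name: str
    cpus: int
    priority: int


def admit_with_preemption(running: list[Workload], incoming: Workload,
                          quota: int) -> tuple[list[Workload], list[Workload]]:
    """Try to admit `incoming` under `quota`, evicting lower-priority work.

    Returns (running_after, evicted). If the incoming workload cannot fit
    even after evicting every lower-priority workload, nothing changes.
    """
    free = quota - sum(w.cpus for w in running)
    # Only strictly lower-priority workloads may be preempted, lowest first.
    candidates = sorted(
        (w for w in running if w.priority < incoming.priority),
        key=lambda w: w.priority,
    )
    evicted: list[Workload] = []
    for victim in candidates:
        if incoming.cpus <= free:
            break
        evicted.append(victim)
        free += victim.cpus
    if incoming.cpus > free:
        return running, []  # rejected: leave the running set untouched
    evicted_ids = {id(w) for w in evicted}
    kept = [w for w in running if id(w) not in evicted_ids]
    return kept + [incoming], evicted


if __name__ == "__main__":
    running = [Workload("low", 4, priority=1), Workload("mid", 4, priority=5)]
    after, evicted = admit_with_preemption(
        running, Workload("high", 4, priority=10), quota=8)
    print([w.name for w in after], [w.name for w in evicted])
```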
The conversation shifts to the Kueue project, a key component in managing HPC workloads on Kubernetes.
[08:04] Ricardo Rocha:
"From the outset, we sought to use Kubernetes not just for internal services but also for our scientific workloads. Existing projects like Volcano and kube-batch provided batch scheduling capabilities but were not part of the core Kubernetes project, leading to integration challenges. The Kueue project emerged from a collective need to have a scheduler integrated into Kubernetes itself, ensuring better compatibility and leveraging the core system's features."
Ricardo explains that Kueue was developed to address the shortcomings of existing schedulers by creating a Kubernetes-native solution, enhancing compatibility and performance for batch and HPC workloads.
Mofi Rahman inquires about the advantages of using Kueue at CERN, especially given their on-premises infrastructure.
[14:46] Ricardo Rocha:
"In an on-premises setup, our goal is to maximize overall resource usage since we've already invested in the hardware. Kueue provides features like fair sharing and preemption, which allow us to optimize resource allocation dynamically. This ensures that we can backfill available resources with lower-priority workloads, thereby maximizing efficiency. Additionally, Kueue supports gang scheduling and array jobs, which are essential for our HPC tasks and are not feasible with standard Kubernetes."
Ricardo emphasizes that Kueue enhances resource utilization and scheduling flexibility, enabling CERN to efficiently manage their large-scale, on-premises HPC workloads.
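Gang scheduling, mentioned in this answer, means that a group of pods is placed all together or not at all. The sketch below shows that all-or-nothing property with a simple first-fit-decreasing placement over per-node free CPUs. The function name and data shapes are assumptions for illustration, and first-fit is a heuristic: it can miss a placement that an exhaustive search would find.

```python
from typing import Optional


def gang_admit(pod_cpus: list[int],
               free_cpus_per_node: list[int]) -> Optional[dict[int, int]]:
    """Place every pod in the gang, or none of them.

    Returns a mapping of pod index to node index, or None when at least
    one pod cannot be placed. The caller's free-capacity list is never
    modified, so a failed attempt reserves nothing.
    """
    remaining = list(free_cpus_per_node)  # work on a copy
    placement: dict[int, int] = {}
    # Largest pods first, which tends to help first-fit packing.
    for pod, cpus in sorted(enumerate(pod_cpus), key=lambda p: -p[1]):
        for node, free in enumerate(remaining):
            if cpus <= free:
                remaining[node] -= cpus
                placement[pod] = node
                break
        else:
            return None  # one pod did not fit, so the whole gang waits
    return placement


if __name__ == "__main__":
    print(gang_admit([4, 4, 4], [8, 4]))  # fits across two nodes
    print(gang_admit([4, 4, 4], [8, 2]))  # third pod has nowhere to go
```

Without the all-or-nothing rule, the second case would start two of the three pods and leave them idle while holding resources, which is the situation gang scheduling is meant to avoid.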
Looking ahead, Mofi Rahman asks Ricardo to speculate on the future of batch workloads and the Kueue project.
[22:07] Ricardo Rocha:
"I envision Kueue evolving to support multi-cluster, multi-region, and multi-cloud environments more effectively. With the increasing demand for high-end GPUs driven by AI advancements, optimizing costs and resource management across diverse deployments will be crucial. Additionally, as compute resources become denser and more specialized, Kubernetes will need to adapt to manage these efficiently, much like the mainframe era's timesharing systems. This will involve partitioning dense resources to allow shared usage effectively."
Ricardo anticipates that Kueue will play a pivotal role in managing increasingly complex and distributed HPC environments, particularly as AI workloads demand more sophisticated resource management strategies.
The discussion then turns to integrating traditional HPC schedulers like Slurm with Kubernetes.
[26:15] Ricardo Rocha:
"Transitioning entirely to Kubernetes-managed HPC supercomputers would be challenging due to the deep integrations and history with tools like Slurm. Instead, the more viable approach is to use Kubernetes to manage workloads while interfacing with existing Slurm endpoints. This allows users to submit and manage jobs through Kubernetes APIs while leveraging the robust scheduling capabilities of Slurm."
Ricardo suggests a hybrid approach, maintaining the strengths of traditional HPC schedulers like Slurm while utilizing Kubernetes for workload management and modern AI integrations.
Ricardo concludes the interview by emphasizing the importance of community involvement.
[29:52] Ricardo Rocha:
"Everyone in the Kubernetes community, whether a maintainer, supporter, or end user, plays a crucial role in the success of our projects. Our collective efforts are enabling significant advancements in scientific computing, allowing us to achieve more than we could a decade ago. It's vital to continue supporting each other, providing feedback, and contributing to keep the community vibrant and effective."
The hosts echo Ricardo's sentiments, encouraging listeners to engage with the community, provide feedback, and contribute to Kubernetes projects to drive further innovation and success.
Kubernetes for HPC:
Kubernetes has evolved to support complex scientific workloads through projects like Kueue, addressing the unique scheduling and resource management needs of HPC environments.
Kueue Project's Role:
As a Kubernetes-native scheduler, Kueue offers advanced features such as fair sharing, preemption, gang scheduling, and support for array jobs, making it ideal for large-scale scientific computing.
Community and Collaboration:
The success of Kubernetes in specialized domains like HPC is driven by active community participation, collaboration, and the continuous integration of user-driven features and improvements.
Future Directions:
The integration of multi-cluster management, optimized high-density compute resource handling, and hybrid approaches with traditional HPC schedulers like Slurm will shape the future of HPC workloads on Kubernetes.
Ricardo Rocha on Aviation and Kubernetes:
"[02:35] Flying multi-planes and gliders involves a lot of planning, much like managing Kubernetes clusters..."
Ricardo Rocha on CERN's Adoption of Kubernetes:
"[04:02] At CERN, we always had large requirements for code and resources, managing terabytes and petabytes of data..."
Ricardo Rocha on Kueue Project's Need:
"[08:04] Existing projects like Volcano and kube-batch provided batch scheduling capabilities but were not part of the core Kubernetes project..."
Ricardo Rocha on Maximizing Resource Usage:
"[14:46] Kueue provides features like fair sharing and preemption, which allow us to optimize resource allocation dynamically..."
Ricardo Rocha on Future of Kueue and HPC:
"[22:07] I envision Kueue evolving to support multi-cluster, multi-region, and multi-cloud environments more effectively..."
This episode offers a deep dive into how Kubernetes is transforming HPC workloads in scientific research environments like CERN. Ricardo Rocha's insights shed light on the challenges and solutions in integrating advanced scheduling and resource management within Kubernetes, highlighting the pivotal role of community-driven projects like Kueue in advancing the cloud-native ecosystem.