HPC Workload Scheduling, with Ricardo Rocha - Kubernetes Podcast from Google

Summary

Kubernetes Podcast from Google: HPC Workload Scheduling with Ricardo Rocha

Release Date: July 9, 2025
Hosts: Abdel Sghiouar and Kaslin Fields
Guest: Ricardo Rocha, Platform Infrastructure Lead at CERN

Introduction

In this episode of the Kubernetes Podcast from Google, hosts Abdel Sghiouar and Kaslin Fields discuss High-Performance Computing (HPC) workload scheduling with guest Ricardo Rocha from CERN. Ricardo has long experience with cloud-native deployments and machine learning, and leads efforts to move CERN's services to cloud-native technologies. He is also a member of the CNCF Technical Oversight Committee and chair of the End User Technical Advisory Board, which places him at the intersection of scientific research and Kubernetes.

News Highlights

Before diving into the main discussion, the hosts cover the latest updates in the Kubernetes ecosystem:

Node Feature Discovery (NFD):
[00:49] Abdel Sghiouar introduces NFD, an open-source project that automatically detects hardware and system features on cluster nodes and reports them. This makes it possible to schedule workloads onto nodes that meet specific requirements, bridging the gap between workload container images and node operating systems. With NFD, applications can rely on the drivers, libraries, and kernel features that a given node actually provides.
Google Gemini CLI:
[01:16] Mofi Rahman discusses Google's announcement of the Gemini CLI, a command-line AI agent for interacting with Gemini directly from the terminal. The tool lets users query GitHub issues, codebases, and pull requests, scaffold new applications, and generate media. It is open source and available on GitHub.
CNCF Vietnamese Glossary:
[01:33] The CNCF has localized the cloud-native glossary into Vietnamese, expanding its reach to 15 languages. This initiative enhances accessibility and understanding of cloud-native terminologies for Vietnamese-speaking communities.
New CNCF Executive Director:
[01:44] Jonathan Bryce has been appointed as the new CNCF Executive Director, succeeding Priyanka Sharma. He brings 15 years of experience in the open-source space, including roles at Rackspace, OpenStack, and the OpenInfra Foundation.
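
To make the Node Feature Discovery item above more concrete, here is a minimal Python sketch of how feature labels can narrow down the nodes a workload may run on. It is a toy model, not the real scheduler: the label keys follow the feature.node.kubernetes.io/ prefix that NFD uses, while the node names, the label sets, and the eligible_nodes helper are hypothetical and exist only for illustration.

```python
# Toy model of label-based node selection (hypothetical nodes and helper).
# In a real cluster, NFD publishes the labels and the Kubernetes scheduler
# matches them against a pod's nodeSelector or node affinity rules.
NODES = {
    "node-a": {"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true"},
    "node-b": {
        "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
        "feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true",
    },
}


def eligible_nodes(nodes, node_selector):
    """Return the sorted names of nodes whose labels contain every selector pair."""
    return sorted(
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in node_selector.items())
    )


# A workload that needs AVX-512 only matches nodes carrying that label.
selector = {"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true"}
candidates = eligible_nodes(NODES, selector)
```

An empty selector matches every node, which mirrors how a pod without a nodeSelector is not restricted by labels.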

Interview with Ricardo Rocha

Connecting Aviation and Kubernetes

Mofi Rahman starts the conversation by touching on Ricardo's personal passion for flying airplanes.

[02:35] Ricardo Rocha:
"Flying multi-planes and gliders involves a lot of planning, much like managing Kubernetes clusters. Preparing in advance, checking the weather, and planning the flight path are akin to setting up a Kubernetes environment—both require meticulous preparation to ensure smooth operations."

Ricardo draws parallels between aviation and Kubernetes management, emphasizing the importance of upfront planning to avoid turbulence, whether in the skies or in cluster management.

Adoption of Kubernetes at CERN

Kaslin Fields probes into how CERN, a hub of rigorous scientific research, integrated Kubernetes into its infrastructure.

[04:02] Ricardo Rocha:
"At CERN, we always had large requirements for code and resources, managing terabytes and petabytes of data even before the term 'big data' became commonplace. Our fixed budget necessitated finding more efficient ways to handle increasing experimental demands. Ten years ago, we began exploring cloud-native technologies to automate and optimize resource usage, leading us to join the Kubernetes community instead of operating in isolation."

Ricardo explains that CERN's need for efficient resource management and automation drove their adoption of Kubernetes, leveraging community-driven tools to meet their scientific computing needs.

Distinguishing Scientific Workloads from Traditional Kubernetes Use Cases

Kaslin Fields follows up by asking about the fundamental differences between scientific computing and typical cloud-native projects.

[06:02] Ricardo Rocha:
"Traditional Kubernetes was designed for service-oriented workloads, focusing on endpoints and scaling based on request volume. In contrast, scientific computing involves managing a vast number of jobs with significant resource consumption, requiring advanced scheduling features like queues, quotas, priorities, and preemption. Additionally, optimizing node usage at a low level—such as CPU pinning and NUMA awareness—is crucial for us, something that wasn't a priority in typical service environments."

Ricardo highlights that scientific workloads demand more sophisticated scheduling and resource optimization compared to standard web applications, necessitating enhancements to Kubernetes' original architecture.
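
The batch features Ricardo lists (queues, quotas, priorities, and preemption) can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not how any particular Kubernetes component works: the Job and Queue classes, the single CPU quota, and the rule of evicting the lowest-priority jobs first are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cpus: int
    priority: int


@dataclass
class Queue:
    quota_cpus: int                      # total CPUs this queue may use
    running: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def used(self):
        return sum(job.cpus for job in self.running)

    def submit(self, job):
        """Admit the job if it fits the quota, else try preemption, else queue it."""
        free = self.quota_cpus - self.used()
        if job.cpus <= free:
            self.running.append(job)
            return "admitted"
        # Collect strictly lower-priority running jobs, lowest priority first,
        # until enough CPUs would be reclaimed.
        victims, reclaimed = [], 0
        for candidate in sorted(self.running, key=lambda j: j.priority):
            if candidate.priority >= job.priority or free + reclaimed >= job.cpus:
                break
            victims.append(candidate)
            reclaimed += candidate.cpus
        if free + reclaimed >= job.cpus:
            for victim in victims:
                self.running.remove(victim)
                self.pending.append(victim)  # preempted jobs wait to be retried
            self.running.append(job)
            return "admitted after preemption"
        self.pending.append(job)
        return "pending"


queue = Queue(quota_cpus=8)
first = queue.submit(Job("sim", cpus=6, priority=1))
second = queue.submit(Job("analysis", cpus=4, priority=1))
third = queue.submit(Job("urgent", cpus=8, priority=10))
```

In this run the second job has to wait because the quota is exhausted and no lower-priority job is running, while the third job displaces the first because its priority is higher.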

Introducing the Kueue Project

The conversation shifts to the Kueue project, a key component in managing HPC workloads on Kubernetes.

[08:04] Ricardo Rocha:
"From the outset, we sought to use Kubernetes not just for internal services but also for our scientific workloads. Existing projects like Volcano and kube-batch provided batch scheduling capabilities but were not part of the core Kubernetes project, leading to integration challenges. The Kueue project emerged from a collective need to have a scheduler integrated into Kubernetes itself, ensuring better compatibility and leveraging the core system's features."

Ricardo explains that Kueue was developed to address the shortcomings of existing schedulers by creating a Kubernetes-native solution, improving compatibility and performance for batch and HPC workloads.

Benefits of Kueue in On-Premises Environments

Mofi Rahman inquires about the advantages of using Kueue at CERN, especially given their on-premises infrastructure.

[14:46] Ricardo Rocha:
"In an on-premises setup, our goal is to maximize overall resource usage since we've already invested in the hardware. Kueue provides features like fair sharing and preemption, which allow us to optimize resource allocation dynamically. This ensures that we can backfill available resources with lower-priority workloads, thereby maximizing efficiency. Additionally, Kueue supports gang scheduling and array jobs, which are essential for our HPC tasks and are not feasible with standard Kubernetes."

Ricardo emphasizes that Kueue improves resource utilization and scheduling flexibility, enabling CERN to manage its large-scale, on-premises HPC workloads efficiently.
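
Two of the ideas in this answer, gang scheduling and backfilling, can be sketched in a few lines of Python. This is a simplified illustration and not the project's actual algorithm: the try_place_gang and schedule functions, the node names, and the job tuples are hypothetical. Gang scheduling is modeled as all-or-nothing placement of a job's pods, and backfilling as letting lower-priority jobs use whatever capacity remains after higher-priority jobs have been considered.

```python
def try_place_gang(free, pods, cpus_per_pod):
    """Place all pods of a job or none of them.

    free maps node name to free CPUs and is updated only on success.
    Assumes cpus_per_pod is a positive integer.
    """
    plan, remaining = {}, pods
    for node, cpus in free.items():
        fit = min(cpus // cpus_per_pod, remaining)
        if fit:
            plan[node] = fit
            remaining -= fit
        if remaining == 0:
            break
    if remaining:
        return None  # the whole gang does not fit, so nothing is placed
    for node, count in plan.items():
        free[node] -= count * cpus_per_pod
    return plan


def schedule(free, jobs):
    """Try jobs from highest to lowest priority; smaller jobs backfill the gaps.

    jobs is a list of (name, priority, pods, cpus_per_pod) tuples.
    """
    placed = {}
    for name, _priority, pods, cpus_per_pod in sorted(jobs, key=lambda j: -j[1]):
        plan = try_place_gang(free, pods, cpus_per_pod)
        if plan is not None:
            placed[name] = plan
    return placed


free_cpus = {"node-a": 8, "node-b": 6}
jobs = [
    ("mpi-sim", 10, 3, 4),  # 3 pods x 4 CPUs, must all start together
    ("too-big", 5, 2, 4),   # cannot fit once mpi-sim is running
    ("filler", 1, 1, 2),    # low priority, fits into the leftover CPUs
]
placement = schedule(free_cpus, jobs)
```

Here the blocked mid-priority job does not prevent the small low-priority job from using the two CPUs left on node-b, so no capacity sits idle.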

Future Prospects for Batch Workloads and Kueue

Looking ahead, Kaslin Fields asks Ricardo to speculate on the future of batch workloads and the Kueue project.

[22:07] Ricardo Rocha:
"I envision Kueue evolving to support multi-cluster, multi-region, and multi-cloud environments more effectively. With the increasing demand for high-end GPUs driven by AI advancements, optimizing costs and resource management across diverse deployments will be crucial. Additionally, as compute resources become denser and more specialized, Kubernetes will need to adapt to manage these efficiently, much like the mainframe era's timesharing systems. This will involve partitioning dense resources to allow shared usage effectively."

Ricardo anticipates that Kueue will play an important role in managing increasingly complex and distributed HPC environments, particularly as AI workloads demand more sophisticated resource management strategies.

Integrating Slurm with Kubernetes

The discussion then turns to integrating traditional HPC schedulers like Slurm with Kubernetes.

[26:15] Ricardo Rocha:
"Transitioning entirely to Kubernetes-managed HPC supercomputers would be challenging due to the deep integrations and history with tools like Slurm. Instead, the more viable approach is to use Kubernetes to manage workloads while interfacing with existing Slurm endpoints. This allows users to submit and manage jobs through Kubernetes APIs while leveraging the robust scheduling capabilities of Slurm."

Ricardo suggests a hybrid approach, maintaining the strengths of traditional HPC schedulers like Slurm while utilizing Kubernetes for workload management and modern AI integrations.

Final Thoughts

Ricardo concludes the interview by emphasizing the importance of community involvement.

[29:52] Ricardo Rocha:
"Everyone in the Kubernetes community, whether a maintainer, supporter, or end user, plays a crucial role in the success of our projects. Our collective efforts are enabling significant advancements in scientific computing, allowing us to achieve more than we could a decade ago. It's vital to continue supporting each other, providing feedback, and contributing to keep the community vibrant and effective."

The hosts echo Ricardo's sentiments, encouraging listeners to engage with the community, provide feedback, and contribute to Kubernetes projects to drive further innovation and success.

Key Takeaways

Kubernetes for HPC:
Kubernetes has evolved to support complex scientific workloads through projects like Kueue, addressing the unique scheduling and resource management needs of HPC environments.
Kueue Project's Role:
As a Kubernetes-native job queueing system, Kueue offers advanced features such as fair sharing, preemption, gang scheduling, and support for array jobs, making it well suited to large-scale scientific computing.
Community and Collaboration:
The success of Kubernetes in specialized domains like HPC is driven by active community participation, collaboration, and the continuous integration of user-driven features and improvements.
Future Directions:
The integration of multi-cluster management, optimized high-density compute resource handling, and hybrid approaches with traditional HPC schedulers like Slurm will shape the future of HPC workloads on Kubernetes.

Notable Quotes

Ricardo Rocha on Aviation and Kubernetes:
"[02:35] Flying multi-planes and gliders involves a lot of planning, much like managing Kubernetes clusters..."
Ricardo Rocha on CERN's Adoption of Kubernetes:
"[04:02] At CERN, we always had large requirements for code and resources, managing terabytes and petabytes of data..."
Ricardo Rocha on Kueue Project's Need:
"[08:04] Existing projects like Volcano and kube-batch provided batch scheduling capabilities but were not part of the core Kubernetes project..."
Ricardo Rocha on Maximizing Resource Usage:
"[14:46] Kueue provides features like fair sharing and preemption, which allow us to optimize resource allocation dynamically..."
Ricardo Rocha on Future of Kueue and HPC:
"[22:07] I envision Kueue evolving to support multi-cluster, multi-region, and multi-cloud environments more effectively..."

This episode looks at how Kubernetes is changing HPC workloads in scientific research environments like CERN. Ricardo Rocha describes the challenges of bringing advanced scheduling and resource management into Kubernetes and the solutions that have emerged, highlighting the role of community-driven projects like Kueue in advancing the cloud-native ecosystem.