Multi-Cluster Orchestrator, with Nick Eberts and Jon Li

Wed May 28 2025

Guests are Nick Eberts and Jon Li. Nick is a Product Manager at Google working on Fleets and Multi-Cluster, and Jon is a Software Engineer at Google working on AI Inference on Kubernetes. We discussed the newly announced Multi-Cluster Orchestrator...

Summary

Kubernetes Podcast from Google: Episode Summary

Title: Multi-Cluster Orchestrator, with Nick Eberts and Jon Li
Hosts: Abdel Sghiouar, Kaslin Fields
Release Date: May 28, 2025

Introduction

In this episode of the Kubernetes Podcast from Google, hosts Kaslin Fields and Mofi Rahman look at how to manage workloads across multiple Kubernetes clusters. They are joined by Nick Eberts, a Product Manager at Google, and Jon Li, a Software Engineer at Google, to discuss the newly announced open-source tool, Multi-Cluster Orchestrator (MCO). MCO aims to address the challenges of orchestrating workloads, particularly those requiring expensive accelerated hardware like GPUs, across diverse Kubernetes environments.

Understanding the Problem

Jon Li sets the stage by highlighting the evolving landscape of cloud computing:

“Kubernetes was built with the assumption of an infinite and uniform cloud.”
— Jon Li [03:56]

Originally, Kubernetes was designed under the premise that cloud resources—such as CPUs and memory—were abundant and consistent across regions. However, with the advent of specialized accelerators like GPUs and TPUs, this assumption no longer holds true. Challenges such as hardware stockouts and non-uniform availability across regions necessitate a more sophisticated approach to workload management.

Current State: Single vs. Multiple Clusters

Nick Eberts provides insight into the current Kubernetes cluster dynamics:

“Even though your cluster can be massive, there's still like the blast radius of the control plane on any particular cluster.”
— Nick Eberts [04:39]

While Google and other upstream Kubernetes contributors have scaled clusters to impressive sizes (e.g., 65k nodes), managing a single large cluster isn't always optimal. The risks associated with a single point of failure (“blast radius”) and the logistical complexities of handling diverse workloads across regions underscore the need for a balanced multi-cluster strategy.

Introducing Multi-Cluster Orchestrator (MCO)

MCO emerges as a solution to streamline multi-cluster management. Nick Eberts elaborates:

“The goal of the products that I build and what I'm trying to push upstream is this ability to think about how you want to bin pack applications together onto the same shapes of clusters.”
— Nick Eberts [05:22]

MCO facilitates the efficient distribution of workloads across multiple clusters by recommending optimal placement based on predefined priorities and actual cluster capacities. Unlike traditional Continuous Deployment (CD) tools, MCO specializes in workload orchestration rather than deployment, ensuring that resources like GPUs are utilized cost-effectively.
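The placement idea described above, ranking clusters by a predefined priority and keeping only those with enough real capacity, can be pictured with a small sketch. This is not MCO's actual algorithm or API; the `Cluster` type, its `priority` and `free_gpus` fields, and the `recommend_placement` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # All fields are illustrative assumptions, not MCO's data model.
    name: str
    priority: int   # lower number means more preferred
    free_gpus: int  # GPUs currently unallocated in the cluster


def recommend_placement(clusters, gpus_needed):
    """Return names of clusters that can fit the workload, best priority first."""
    eligible = [c for c in clusters if c.free_gpus >= gpus_needed]
    ranked = sorted(eligible, key=lambda c: (c.priority, c.name))
    return [c.name for c in ranked]


clusters = [
    Cluster("us-central1", priority=1, free_gpus=0),   # preferred, but stocked out
    Cluster("us-east4", priority=2, free_gpus=8),
    Cluster("europe-west4", priority=3, free_gpus=16),
]

print(recommend_placement(clusters, gpus_needed=4))
```

The function only recommends; acting on the result is left to a separate deployment tool, which mirrors the split between orchestration and deployment described in the episode.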

How MCO Works

The conversation delves into the technical architecture of MCO, underpinned by two primary components:

Cluster Inventory API
Cluster Profile API

Nick Eberts explains:

“Cluster Profile is a CRD that we built upstream in SIG multicluster, and it's essentially just a pointer to an actual cluster.”
— Nick Eberts [11:15]

Cluster Inventory serves as a centralized repository of cluster metadata, eliminating the need for disparate tools to manage cluster lists. Cluster Profiles encapsulate detailed information about each cluster, including available resources and capabilities (e.g., GPUs, networking configurations), enabling MCO to make informed placement recommendations.
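As a rough picture of the inventory idea, the sketch below keeps one shared list of profile records and lets a caller ask which clusters advertise a given capability. The `ClusterProfile` dataclass here is only a stand-in: its `properties` dictionary and the `ClusterInventory.matching` helper are assumptions made for illustration and do not reproduce the schema of the upstream CRD.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterProfile:
    """Illustrative stand-in for a profile record; not the real CRD schema."""
    name: str
    properties: dict = field(default_factory=dict)


class ClusterInventory:
    """One shared list of clusters that any tool can query."""

    def __init__(self):
        self._profiles = {}

    def register(self, profile):
        # Re-registering a name overwrites the earlier record.
        self._profiles[profile.name] = profile

    def matching(self, **wanted):
        """Names of clusters whose properties include every wanted key/value."""
        return sorted(
            p.name
            for p in self._profiles.values()
            if all(p.properties.get(k) == v for k, v in wanted.items())
        )


inventory = ClusterInventory()
inventory.register(ClusterProfile("gke-us-east4", {"accelerator": "gpu", "region": "us-east4"}))
inventory.register(ClusterProfile("gke-us-central1", {"accelerator": "tpu", "region": "us-central1"}))
inventory.register(ClusterProfile("gke-europe-west4", {"accelerator": "gpu", "region": "europe-west4"}))

print(inventory.matching(accelerator="gpu"))
```

Keeping the records in one place means a placement tool and a deployment tool can both read the same list instead of each maintaining its own.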

Jon Li adds context-specific insights related to inference workloads:

“For inference, Kubernetes is not your typical web traffic in the sense that the request could be long, the size of the request could be big.”
— Jon Li [10:19]

These workloads demand specialized handling due to their unique characteristics, such as prolonged processing times and substantial resource utilization.

Integration with Existing Tools

MCO is designed to integrate seamlessly with popular CD tools. Nick Eberts states:

“The first implementation that you'll see is with Argo CD, because that's where most of our customers are right now.”
— Nick Eberts [08:03]

Currently, MCO supports integration with Argo CD, with plans to extend compatibility to other tools like Flux and Config Sync. This integration allows MCO to provide placement recommendations that CD tools can act upon, ensuring efficient deployment across clusters.
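One way to picture how a CD tool might act on a placement recommendation is as a simple comparison between the recommended clusters and the clusters where the workload currently runs. The sketch below is a generic illustration under that assumption; `plan_sync` is a hypothetical helper and does not represent Argo CD's, Flux's, or MCO's actual interfaces.

```python
def plan_sync(recommended, deployed):
    """Compare recommended clusters with current ones and list the changes needed."""
    recommended_set, deployed_set = set(recommended), set(deployed)
    return {
        "deploy": sorted(recommended_set - deployed_set),  # newly recommended
        "remove": sorted(deployed_set - recommended_set),  # no longer recommended
        "keep": sorted(recommended_set & deployed_set),    # already in place
    }


plan = plan_sync(
    recommended=["us-east4", "europe-west4"],
    deployed=["us-central1", "us-east4"],
)
print(plan)
```

In this picture the orchestrator only produces the `recommended` list; the deployment tool owns the rollout and removal steps.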

Use Cases: Inference Workloads

The discussion highlights how MCO specifically benefits inference workloads, which often require specialized hardware:

Jon Li explains the complexities of handling transformer-based inference:

“The latency here could be in the order of seconds as opposed to what we're used to in the microservice world.”
— Jon Li [10:19]

MCO addresses these challenges by dynamically scaling resources based on demand, so that expensive hardware such as GPUs is used only when necessary. This dynamic scaling is crucial for keeping costs down while maintaining high performance.

Future Plans and Open Source Release

Nick Eberts outlines the roadmap for MCO:

“We're going to make the metric that determines whether or not there actually is a capacity issue in any region that's live, open and accessible so that you could bring your own sort of metric to evaluate the running workloads.”
— Nick Eberts [08:48]
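The "bring your own metric" idea in the quote above can be sketched as a function that the user supplies and the orchestrator calls for each region. Everything below is an assumption for illustration: the stats keys, the two example metrics, and the `regions_over_capacity` helper are hypothetical and are not taken from MCO.

```python
def gpu_utilization(stats):
    """Example metric: fraction of a region's GPUs currently in use."""
    total = stats["gpus_total"]
    return stats["gpus_in_use"] / total if total else 0.0


def pending_ratio(stats):
    """Alternative metric: fraction of pods that are waiting to be scheduled."""
    total = stats["pods_running"] + stats["pods_pending"]
    return stats["pods_pending"] / total if total else 0.0


def regions_over_capacity(region_stats, metric, threshold):
    """Regions whose metric value meets or exceeds the threshold."""
    return sorted(
        region for region, stats in region_stats.items() if metric(stats) >= threshold
    )


region_stats = {
    "us-central1": {"gpus_in_use": 15, "gpus_total": 16, "pods_running": 30, "pods_pending": 10},
    "europe-west4": {"gpus_in_use": 4, "gpus_total": 16, "pods_running": 8, "pods_pending": 0},
}

# The same check runs with whichever metric the user plugs in.
print(regions_over_capacity(region_stats, gpu_utilization, threshold=0.9))
print(regions_over_capacity(region_stats, pending_ratio, threshold=0.2))
```

Passing the metric as a plain function keeps the capacity check independent of how "capacity issue" is defined for a given workload.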

MCO is set to be released as an open-source tool, with initial support targeting GKE clusters. The team plans to collaborate with SIG Multicluster to standardize and expand MCO’s capabilities, ensuring broad adoption across the Kubernetes ecosystem.

Conclusion and Closing Remarks

The episode concludes with enthusiasm for the upcoming release and future developments:

“My goal is to have this conversation with you again in six to seven months and talk about its sort of birth into sig multicluster and use by other companies besides Google.”
— Nick Eberts [20:24]

The hosts look forward to MCO’s impact on multi-cluster management and invite listeners to stay tuned for further updates and episodes on related developments.

Key Takeaways:

Multi-Cluster Orchestrator (MCO) is an open-source tool designed to optimize workload placement across multiple Kubernetes clusters.
MCO leverages Cluster Inventory and Cluster Profile APIs to maintain a centralized and detailed view of cluster resources.
Integration with CD tools like Argo CD allows MCO to provide actionable placement recommendations.
MCO is particularly beneficial for managing inference workloads that require expensive and specialized hardware.
The tool aims to enhance scalability, reliability, and cost-efficiency in multi-cluster Kubernetes environments.

Notable Quotes:

“Kubernetes was built with the assumption of an infinite and uniform cloud.” — Jon Li [03:56]
“The goal of the products that I build and what I'm trying to push upstream is this ability to think about how you want to bin pack applications together onto the same shapes of clusters.” — Nick Eberts [05:22]
“Cluster Profile is a CRD that we built upstream in SIG multicluster, and it's essentially just a pointer to an actual cluster.” — Nick Eberts [11:15]
“My goal is to have this conversation with you again in six to seven months and talk about its sort of birth into sig multicluster and use by other companies besides Google.” — Nick Eberts [20:24]

For those interested in exploring Multi-Cluster Orchestrator further, keep an eye on the GitHub repository and upcoming releases detailed in the episode.
