Podcast Summary
Podcast: Kubernetes Podcast from Google
Episode: GKE 10 years and SIG Networking, with Antonio Ojea
Hosts: Abdel Sghiouar (A), Kaslin Fields (B)
Guest: Antonio Ojea (C) – Software Engineer at Google, Kubernetes core maintainer, SIG Networking/Testing tech lead, Steering Committee member
Date: October 1, 2025
Overview
This episode marks ten years of Google Kubernetes Engine (GKE) and dives deep into the evolution of Kubernetes networking with Antonio Ojea—a leader in the SIG Networking group. The conversation covers the historical journey from hardware-driven networks to Kubernetes’ API-driven approach, integration with traditional networking, the challenges of egress, new possibilities powered by dynamic resource allocation, and trends like AI/ML’s impact. Antonio also shares updates on major networking-related features, the development of the Gateway API, quality of service (QoS), and security.
Key Discussion Points and Insights
1. Antonio’s Networking Background and The Shift to Software (01:48–04:19)
- Antonio’s journey began with traditional network engineering, moving to software-defined networks during the 2010s open-source boom.
  "I’m tired of being a network engineer. I want to be the person doing the virtual networks." (C, 02:06)
- Traditional networking was hardware and configuration-heavy; Kubernetes abstracts networking through APIs, favoring programmability and integration.
- The industry’s shift:
  "Kubernetes has a different approach for networking...you have everything as an API...more abstracted." (C, 02:50)
2. Integrating Kubernetes with Traditional Networking (04:19–08:03)
- Interfacing with Existing Infra: Controllers like those from F5 help bridge load balancing between Kubernetes and legacy systems.
- Mindset Shift: Treat the Kubernetes cluster as an "autonomous system" with its own routing domain, ingress, and egress points.
  "You need to treat the cluster as an autonomous system...define the ingress and egress point." (C, 05:14)
- Complexities: Integrations become complicated when moving from basic load balancing to more nuanced traffic (e.g., HTTP, API gateways, firewall rules).
- Standardization Efforts: Work on Admin Network Policy, set for beta (QCon), addresses granular firewall-like controls familiar to on-premises users.
  "This is a heavy-demanding feature for people from on premises...this API is able to cover that." (C, 06:52)
- Egress handling (sending traffic out to specific appliances/firewalls) is not yet standardized; major CNIs provide their own solutions.
3. The Egress/NAT Identity Problem (08:03–14:10)
- Persistent Issue: Users want pods to consistently use a specific external IP for egress, but Kubernetes has no standardized solution.
  "This is something that...have never been solved by Kubernetes." (A, 08:41)
- Barriers: Any solution is complex, implementation-specific, and varies by environment.
- AI/ML Workloads: Growing need for pod-based identities as workloads become more stateful; dynamic resource allocation increases flexibility.
  "With dynamic resource allocation...now...we are going to see more of these things that we couldn't do before..." (C, 13:48)
- Work in Progress: Antonio’s team is developing allocation of virtual IPs as a dynamic resource, opening new possibilities like NAT pools to distribute traffic and mitigate throttling issues.
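The NAT pool idea can be illustrated with a small sketch. This is not the mechanism Antonio's team is building; it only assumes a fixed pool of egress addresses and a stable hash, so that each pod keeps the same external IP while different pods spread across the pool. The function name `pick_egress_ip`, the pod keys, and the address range are all hypothetical.

```python
import hashlib
import ipaddress


def pick_egress_ip(pod_key: str, pool: list[str]) -> str:
    """Map a pod to one address of the pool, stably across calls."""
    if not pool:
        raise ValueError("NAT pool is empty")
    # A cryptographic hash gives an even spread that does not change
    # between processes (unlike Python's built-in hash()).
    digest = hashlib.sha256(pod_key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Hypothetical pool carved from a documentation range (RFC 5737).
nat_pool = [str(ip) for ip in ipaddress.ip_network("203.0.113.0/30")]

for pod in ("team-a/crawler-0", "team-a/crawler-1", "team-b/batch-7"):
    print(pod, "->", pick_egress_ip(pod, nat_pool))
```

With plain modulo hashing, resizing the pool remaps most pods to a different address; a real design would need to account for that.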
4. Updates on Gateway API (14:10–16:35)
- Strategic Importance:
  "Ingress is an API that we consider frozen...we are only developing new features in Gateway API." (C, 15:27)
- Gateway API subsumes Ingress and enables richer, more flexible L7 (application-layer) traffic handling, with a focus on features for AI inference workloads.
- Push for Adoption: The community is being surveyed to understand what blocks migration from Ingress to Gateway API; ongoing development such as the inference gateway is aimed at AI use cases.
5. Multi-Service CIDR & Service Scalability (16:35–19:57)
- Challenge: IPv6 service CIDR space was limited by the API server’s bitmap allocation system, leading to scale and performance limitations.
  "So basically how it started...I GA IPv6 in 2020...I cannot use more than /112 for the service CIDR in IPv6: why is that?" (C, 17:06)
- Solution: The new multi-service CIDR feature allows dynamic allocation and expansion, a major improvement for upgrading and scaling production clusters.
- Human Factor:
  "We have pessimistic and optimistic planning in the world." (C, 19:19)
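Some arithmetic shows why a bitmap allocator caps the usable range. The sketch below assumes the simplest possible layout, one bit per address in the service CIDR; the actual API server implementation is not described in the episode. The prefixes are illustrative, and the IPv6 ones use the 2001:db8::/32 documentation range.

```python
import ipaddress


def bitmap_cost(cidr: str) -> tuple[int, int]:
    """Return (address count, bitmap size in bytes) at one bit per address."""
    net = ipaddress.ip_network(cidr)
    addresses = net.num_addresses
    return addresses, addresses // 8


for cidr in ("10.96.0.0/12", "2001:db8::/112", "2001:db8::/108", "2001:db8::/64"):
    count, size = bitmap_cost(cidr)
    print(f"{cidr:>18}: {count} addresses, {size} bytes of bitmap")
```

A /112 holds 65,536 addresses and fits in 8 KiB under this assumption, while a /64 would need 2^61 bytes, which is why a fixed bitmap cannot simply be widened.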
6. Core Networking Across Environments (kind, on-prem, cloud) (19:57–21:46)
- Kubernetes Networking Model:
  "It boils down to common solutions: use an overlay to create this flat network or use routing..." (C, 20:25)
- For environments like kind (Kubernetes in Docker, used for testing), Antonio and team developed the kindnet CNI to enable simple routing between containers, leveraging Docker’s flat networking.
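The routing option can be sketched as follows. This is not kindnet's code; it assumes every node sits on one flat network and owns one pod CIDR, so each node only needs a route to every other node's pod CIDR via that node's IP. The node names, addresses, and the `build_routes` helper are hypothetical, and the commands are printed rather than executed.

```python
import ipaddress


def build_routes(
    local_node: str, nodes: dict[str, tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return (destination pod CIDR, next-hop node IP) for every other node."""
    routes = []
    for name, (node_ip, pod_cidr) in sorted(nodes.items()):
        if name == local_node:
            continue  # local pods are reached directly, no route needed
        # Validate both values; these raise ValueError on malformed input.
        ipaddress.ip_address(node_ip)
        ipaddress.ip_network(pod_cidr)
        routes.append((pod_cidr, node_ip))
    return routes


# Hypothetical three-node cluster: name -> (node IP, pod CIDR).
nodes = {
    "node-a": ("172.18.0.2", "10.244.0.0/24"),
    "node-b": ("172.18.0.3", "10.244.1.0/24"),
    "node-c": ("172.18.0.4", "10.244.2.0/24"),
}

for dest, via in build_routes("node-a", nodes):
    print(f"ip route add {dest} via {via}")
```

An overlay reaches the same flat pod network by encapsulating traffic instead, which helps when the underlying network cannot carry these routes.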
7. Networking Performance & Quality of Service (QoS) (21:46–27:05)
- QoS is Complex:
  - Needs extend beyond pod-to-pod traffic (rate limiting, priority) to node-level constraints: logs, metrics, and image pulls, especially with large AI model downloads.
- Modeling Challenges:
  "How do we design the system of quality of service so the users can program the system?" (C, 23:44)
- Resource Sharing: As with CPU reservations for the kubelet, bandwidth needs similar consideration.
- Cloud vs. On-Prem: Cloud providers have solutions (e.g., Google’s gVNIC), but on-prem users with legacy hardware need adaptable solutions.
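Rate limiting, the simplest of the QoS needs listed above, is commonly modeled as a token bucket. The sketch below is a generic illustration and not a Kubernetes API; the class name and the numbers are assumptions, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.last = now

    def allow(self, size: float, now: float) -> bool:
        # Refill for the elapsed time, never exceeding the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


bucket = TokenBucket(rate=1000, burst=1500)
print(bucket.allow(1500, now=0.0))  # the initial burst fits
print(bucket.allow(500, now=0.1))   # too soon, only about 100 tokens refilled
print(bucket.allow(500, now=1.0))   # enough tokens have accumulated again
```

A per-pod bucket like this does not cover the node-level traffic mentioned above (logs, metrics, image pulls), which is part of why the modeling question remains open.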
8. The Impact of AI/ML on Networking (27:26–31:58)
- AI is Reshaping Requirements:
  - AI/ML changes expectations from stateless to stateful, persistent networking.
  - High bandwidth, low latency (e.g., RDMA, InfiniBand) needed for distributed inference.
- Antonio's work:
  - Introduction of dynamic resource allocation and the dranet project at Google to support multi-interface workloads, especially for high-performance AI/ML workloads.
  - Published research on Kubernetes’ network driver model for these scenarios.
- Design Parallels: "You don’t require to model the storage network. What you require is the POD to attach this NIC..." (C, 29:53)
9. CPU as a Network Bottleneck (31:58–33:15)
- Cloud Realities:
  "Your performance for storage, your performance for networking...all is CPU dependent." (A, 32:20)
- Offloading: Technologies like eBPF and netfilter help, but user-level programmability can be limited by hardware support.
10. AI Security and The Messy State of MCP Servers (33:15–36:08)
- Security is Lagging:
  "Everybody’s building MCP servers today, but no one seems to talk about authentication and authorization..." (A, 33:38)
- Wild West: Lack of standards means some frameworks embed security, others offload to service mesh/network.
- Historical Echoes:
  "This is not new. It just changes the scale...But the problem is something that we were working on it for a lot of time." (C, 35:44)
Notable Quotes & Memorable Moments
- "Kubernetes has a different approach for networking...you have everything as an API. You don't have much more virtual routers...Everything is more abstracted." (C, 02:50)
- "Treat the cluster as an autonomous system...define the ingress and egress point." (C, 05:14)
- "You have to size the cluster to the maximum amount...but no one knows what that maximum amount is going to be." (A, 19:00)
- "This is not new. It just changes the scale, changes some of the environment, but the problem is something that we were working on...for a lot of time. So let's try to use the best practices..." (C, 35:44)
Important Timestamps
- Antonio’s background & industry shift: 01:48–04:19
- Integrating with legacy networking: 04:19–08:03
- Egress/NAT, pod IP identity: 08:03–14:10
- Gateway API and Ingress future: 14:10–16:35
- Multi-service CIDR evolution: 16:35–19:57
- kind & different deployment environments: 19:57–21:46
- QoS and network performance: 21:46–27:05
- AI/ML’s impact and dynamic resource allocation: 27:26–31:58
- CPU as network bottleneck; offloading: 31:58–33:15
- AI security and MCP server wild west: 33:15–36:08
Conclusion
Antonio Ojea provided a comprehensive, insider look at networking’s past, present, and future in Kubernetes—spanning real-world integration with legacy infra, standardization efforts in SIG Networking, the transformative demands of AI/ML, and enduring security challenges. The episode balances detailed technical insights with big-picture reflections, making it informative both for practitioners and those following the community’s strategic direction.
