
A
Hi and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.
B
And I'm Kaslin Fields.
A
Today we talk to Antonio Ojea. Antonio is a software engineer at Google and one of the core maintainers of Kubernetes. He's one of the tech leads of SIG Networking and Testing and a member of the steering committee.
B
But first, let's get to the news. Google Cloud announced the availability of GKE Autopilot mode inside Standard clusters. Autopilot is a fully managed version of GKE where Google Cloud manages the entire cluster and only charges for the resources consumed by the pods. Autopilot in Standard clusters gives users the flexibility of switching between the two modes inside the same cluster.
A
The CNCF announced the list of KCD events happening during the first half of 2026. KCD or Kubernetes Community Days are community organized and CNCF supported events happening around the world. And 2026 will see a lot of new locations like New Delhi and Kochi in India, Panama, Beijing in China and more. Check the link in the show notes for the full list.
B
Metal3 (pronounced "Metal Cubed") joined the CNCF as an incubating project. Metal3 is an application that runs on Kubernetes and provides components for bare metal host management. You can provision OS images and manage machines via the Kubernetes API. And that's the news.
A
Welcome to the show, Antonio.
C
Hi, thanks for having me.
A
So you've been around for a very long time. We know each other. I'm going to take us, me and you, on a little trip back in time. So you're coming from a networking background, right? Like you did network engineering as part of your career?
C
Yeah, I started as a network engineer, but then in the 2010s or something like that, there was this open source explosion with virtual networks, and I joined a small startup and I said, I'm tired of being a network engineer. I want to be the person building the virtual networks. And then I started my career in software development in networking at that time.
A
And then today in Kubernetes. How do you see the industry shifting from the old school hardware-based networking all the way to virtual networking, software-defined networking, and the way we do it today in Kubernetes? Like, how has that shift been?
C
This is an interesting topic, right? Networking is always, how can I say, a siloed technology. People don't much like the networking people, right? Because they only call us when there is a problem. And yeah, it's evolving right now. Kubernetes has a different approach to networking than traditional networking, right? Everything is an API. You don't have virtual routers and all these things so much. Everything is more abstracted. Two years ago there was a proposal for more traditional networking to be implemented in the core APIs, and after evaluating it, we decided not to proceed with that approach in core because it was too disruptive. So traditional networking still has room in Kubernetes. We have a working group working on that in SIG Network, and it is using dynamic resource allocation for that. They are progressing. They did a demo in the last SIG Network meeting, which I think was a few weeks ago, and they have good results, and we have a good opportunity to solve some of the workarounds that the industry had to do because Kubernetes was not very aware of traditional networking. And it's exciting, right? There is a good group of people working there and we are seeing results after they moved to dynamic resource allocation.
A
So from your experience, let's say that there is a company that has an established network and established infrastructure with switches and routers and, quote-unquote, old school networking, and they want to adopt Kubernetes, right? And they have to somehow integrate this API-driven Kubernetes way of doing networking into their traditional networking stack. In my experience I've only seen this once, with F5, because they have a controller that allows you to sync load balancing configuration from Kubernetes to the F5 load balancer, which can be a virtual appliance or a physical one. Are controllers like this the way? Is that how people do it today?
C
I think that first there is a fundamental change in mindset, right? They need to treat the cluster as an autonomous system. Everything that happens inside the cluster is its own routing domain. You need to think that in Kubernetes the main difference is that the principles of networking are simple, right? Every pod has to be able to talk with any other pod, and this is a strong constraint. And when you start to go to bare metal, then you run into these IP exhaustion problems. So my recommendation is: treat the cluster as an autonomous system and then define the ingress and egress points, right? Then you are able to compartmentalize and to put a good design in place, so you can start to grow on that. Then you need to define your integrations. As you say, what do you want to integrate? If it's just virtual IPs, that's the simple thing. You can put a load balancer to handle the ingress traffic. But then things start to get complicated, and you start to have more application-level things, let's say HTTP protocols or API gateways. Then you don't know if you need to go to a service mesh. You are very familiar with the Gateway API project, and they are solving this problem, right? They are representing these abstractions so people can handle these more complicated protocols for the ingress traffic to the cluster. Then there is the egress traffic. That also happens in bare metal, right? And there are two kinds of problems that you observe there. One is that people want to apply network policies per pod or per instance. And we in SIG Network are working on standardizing network policy, with the admin network policy. That is another working group; it's a subproject. We are going to try to get to beta by this KubeCon. And this is a heavily demanded feature for people on premises, because they have all these firewall rules and they need to apply them in a more traditional way. And this API is able to cover that. The other problem is the egress traffic. Usually you have a choke point, right? You have the F5 for ingress, but you may have some firewall vendor appliance, and then you want to send this traffic through this appliance. This is an area where we don't have standardization, but all the main CNI plugins, network plugins like Calico or Cilium, implement this functionality with their custom CRDs. We in SIG Network try to standardize once we see that it's possible to standardize, and if not, we encourage these projects to cover these gaps with CRDs and controllers. The functionality is there and users can use it.
A
So on the topic of egress, I actually have a very specific question. I think that this is something that has been floating around for a while, and I have never seen a standard solution. If you have a cluster that egresses toward a remote endpoint, you either have to let the entire cluster talk out or you use a NAT gateway, right? Typically on the cloud you would use a NAT gateway to route traffic out. The problem is that on the receiving end the IP address will be the IP of the gateway itself, right? So now if you need the pods to have specific IPs, like I want this group of pods in this particular namespace to always egress with a known IP address, that, to my knowledge, has never been solved by Kubernetes. Right?
C
Yeah, that falls into this area where we don't have enough, I wouldn't say time, but enough push. Because the problem is that it's very implementation specific, right? In the cloud you typically use a NAT gateway, but on premises you will typically have, I don't know, your switches, your routers, or your internal stuff. And then the other thing is, we talked about the Kubernetes networking principle, right? That every pod should be able to talk with any other pod without NAT. This means that you have a fully distributed fabric inside the cluster, and then you try to put state on this fabric and say, okay, from this namespace you have traffic coming from 1000 nodes, and every single node needs to share one IP. And this becomes a very challenging distributed systems problem, to implement policy routing per node. So it's complicated. It's also in the realm of the CNI, and in SIG Network we try not to get very deep into that. That's one of the main reasons why, as SIG Network, we don't have a standard solution. There were several attempts to create an egress router CRD or API, but I would say it's not a top priority, unfortunately.
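To make the "every node shares one IP" difficulty concrete, here is a minimal Python sketch of one possible approach: giving each node a disjoint slice of the source-port space behind a single egress IP. This is an illustration only. The function name `partition_ports`, the node names, and the port boundaries are assumptions, and nothing here is a Kubernetes or CNI API.

```python
# Hypothetical sketch: share ONE egress IP across many nodes by handing each
# node a disjoint range of SNAT source ports. Not a Kubernetes API.

def partition_ports(nodes, first_port=1024, last_port=65535):
    """Split [first_port, last_port] into equal, disjoint ranges, one per node."""
    total = last_port - first_port + 1
    per_node = total // len(nodes)
    if per_node == 0:
        raise ValueError("more nodes than available source ports")
    ranges = {}
    for i, node in enumerate(nodes):
        start = first_port + i * per_node
        ranges[node] = (start, start + per_node - 1)
    return ranges

# 1000 nodes, as in the example from the conversation.
nodes = [f"node-{i}" for i in range(1000)]
ranges = partition_ports(nodes)
print(ranges["node-0"], ranges["node-999"])
```

With 1000 nodes each node ends up with only 64 source ports, and any node joining or leaving forces the ranges to be recomputed and redistributed, which hints at why this becomes a distributed state problem.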
A
Got it, got it. Yeah. Because I've seen like solutions floating on the Internet. The CNI obviously is one way to do it. It is like to me, this particular topic is funny because it fundamentally shows how a lot of security on the Internet still relies on IP addresses. Because this particular problem comes up when people say I have an external endpoint that I'm talking to and that vendor asked me to give them an IP address to whitelist. Right.
C
There is something interesting here. I was talking about this just last week. Now with AI/ML workloads, a lot of the requirements are changing, right? Kubernetes was built around the idea that everything is scattered and stateless, and you should scale out and not keep state. And now we move to agents and MCP servers, and everything is stateful and everything has to live forever. One of the requirements I was discussing last week was about this: okay, I want to have an IP. The pod IP, right? In the end it's: I don't care, this pod has this IP and it's its identity, and if I want to move the application, like live pod migration, I want the IP to move with it. Many people are not familiar with this: for dynamic resource allocation, the main driver was GPUs. But the thing is that it expands the possibilities of Kubernetes. Dynamic resource allocation has the primitives to create more complex things. One of the things that I'm working on with a colleague, in this GKE Labs (I don't know if it's still open, but we have an organization in GitHub to put this in), is dynamic resource allocation for virtual IPs. With this feature, when you create a pod, you're going to request a resource, and the resource is going to be an IP.
A
Okay, interesting.
C
Yeah. So dynamically the scheduler and some controller are going to find, oh, this pod is requesting an IP. So you are able to get an IP connected to this pod via this resource. It's still in progress, but now that dynamic resource allocation is GA, we are going to see more of these things that we couldn't do before, or that we had to do with some custom annotations and this stuff, being done with Kubernetes native primitives. Yeah.
A
That's interesting, because I recently came across a discussion with a customer about another particular problem they're facing with AI, which is that they have a NAT gateway on the way out. So all the traffic looks like it's coming from a single IP. But when they have multiple pods scaling up, the place where the models are downloaded from throttles the traffic. Right. It sees multiple requests coming from the same IP, so it just slows down traffic. So the model downloads take even longer because of this single IP out problem.
C
Yeah. We should be able to start modeling this complexity. Right. In this case, maybe you model this as: I want to do a NAT pool. Right now we don't have a way of defining a NAT pool. Or maybe some people do a NAT pool in some custom way. Now with dynamic resource allocation we can model the NAT pool as a resource, as an object, and we can make a deployment request IPs from this NAT pool.
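As a rough illustration of "a NAT pool as a resource that workloads request IPs from", the Python sketch below models a pool that hands out egress IPs to pods. The `NatPool` class, its `claim` method, the pod names, and the documentation address range are all hypothetical; this is not the dynamic resource allocation API, only the allocation idea.

```python
import ipaddress

class NatPool:
    """Hypothetical NAT pool: hands out egress IPs to pods, round-robin."""

    def __init__(self, cidr):
        # hosts() skips the network and broadcast addresses of an IPv4 subnet.
        self.addresses = list(ipaddress.ip_network(cidr).hosts())
        self.assignments = {}

    def claim(self, pod_name):
        """Return the egress IP for pod_name, assigning one on first use."""
        if pod_name not in self.assignments:
            index = len(self.assignments) % len(self.addresses)
            self.assignments[pod_name] = self.addresses[index]
        return self.assignments[pod_name]

pool = NatPool("203.0.113.0/29")  # documentation range, 6 usable addresses
for pod in ["downloader-0", "downloader-1", "downloader-2"]:
    print(pod, "->", pool.claim(pod))
```

Spreading the pods over several egress addresses is what would let a remote endpoint that throttles per source IP see them as separate clients.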
A
So you can have multiple IP addresses as the way out, right?
C
Yeah. Then we need to start implementing these things in projects. But right now it's native and it's possible to do that. And this opens a lot of horizons for networking, where we were very much at the limit of the existing APIs, and it's going to uncover a lot of new scenarios like the one that you're explaining.
A
Nice. So I'm going to shift the conversation a little bit. Where are we on the Gateway API project? How involved are you in that and where is that thing heading?
C
Okay. I'm traditionally more involved in the core Kubernetes networking. Gateway API is a subproject and it has a very large scope, to the point that it has independent maintainers and independent leads, and as SIG Network we completely delegate that to the maintainers. Right. We always have to check that it's aligned with the core project. Actually, this morning I was working on that. We have this funny project, kind, Kubernetes in Docker, that was created by Benjamin Elder, and I joined it. We have this cloud-provider-kind that was originally created to handle load balancers, because otherwise you cannot use load balancers in kind. I'm implementing Gateway API in cloud-provider-kind and I'm almost finished with the alpha. So what is the reason? The main problem is that Gateway API covers a lot of space, and a lot of complicated space. Right. You need to handle all these L7 protocols, mainly, and it also requires dealing with TLS. It has a lot of references between objects, and the majority of end users are still stuck on Ingress. One of the things that we want to solve is: let's get more end user feedback. What is blocking people from moving from Ingress to Gateway? And definitely in the SIG, I don't know if we have officially declared it, but Ingress is an API that we consider frozen and we are only developing new features in Gateway API.
A
Gateway API?
C
Yeah. The more exciting one, which went GA (I was doing the API review last week), is the inference gateway. Right. That's an extension for handling these models and implementing better routing and improving the efficiency. There is a lot of work in that area, and yeah, I will be moving to work more on this area for the next year.
A
Nice. Yeah, yeah. I did a talk about it. There is a lot of interest in the Kubernetes community and among people coming to KubeCon. The session that we did with another speaker was pretty packed. There's a lot of interest. I'm going to talk really quickly about something that I know you were one of the main people working on. So in release 1.31 of Kubernetes, the multi service CIDR went live. Right. And if I remember correctly, you're one of the main people who worked on that, right?
C
Yeah, this was an interesting feature. So basically how it started is because in 2020 I started working in Kubernetes. In 2018 I was working on another thing, and when I joined, I asked, where can I help? And they told me IPv6 is orphaned. And I said, I know something about IPv6, and I started to work on that, and I took IPv6 to GA in 2020. And one of the things that was shocking was: oh, I cannot use more than a /112 for the service CIDR in IPv6. Why is that? It turns out that, the way services are implemented, you have a bitmap in the API server. So if you increase the size of the CIDR, you increase the size of the bitmap. This bitmap is stored in etcd, and if it grows too much, you don't even have space in etcd and everything starts to go slow, whatever. So I said, oh, we need to change that. And I started to explore this and I said, okay, we need to change all this bitmap allocation. And I started to talk with Tim Hockin, API Machinery, and other people in SIG Network, and we said, okay, if we do this, we also have this problem that once you set a service CIDR, it is forever. Why don't we make it dynamic? And this started in 2020, and I think that at that time it was maybe 1.21 or something like that, and we went GA in 1.33 and beta in 1.31, or yeah, something like that. So it was a very complex change that touches the very heart of Kubernetes. It was like doing brain surgery. But we made it, and we have very positive feedback, because people's clusters are not ephemeral anymore and they run out of space. They need to increase the size of the service CIDR. We also have people coming to the Slack channel to thank us for that. It was a really complex feature and we are happy to have it in prod now.
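The relationship between service CIDR size and bitmap size can be sketched in a few lines of Python. This assumes the simplest possible encoding, one bit per address; the real encoding in the API server may differ, and `bitmap_bytes` is a hypothetical helper, not Kubernetes code.

```python
import ipaddress

def bitmap_bytes(service_cidr):
    """Bytes needed for a one-bit-per-address allocation bitmap (assumed encoding)."""
    network = ipaddress.ip_network(service_cidr)
    return (network.num_addresses + 7) // 8

# Example ranges chosen for illustration only.
for cidr in ["10.96.0.0/12", "fd00::/112", "fd00::/108", "fd00::/64"]:
    print(cidr, bitmap_bytes(cidr), "bytes")
```

Under this assumption a /112 needs 8 KiB and a /108 already needs 128 KiB, while a /64 would need 2^61 bytes, which shows why a single bitmap object cannot scale to large IPv6 ranges.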
A
Yeah, this particular limitation before the release of multi service CIDR goes to the core of, I think, one of the fundamental problems in tech, which is capacity planning. Because before, the conversation would always be: you have to size the cluster to the maximum amount that you will need. But no one knows what that maximum amount is going to be. Right?
C
Yeah. That's a funny thing, because then you see the different kinds of person, right? There is the optimistic person. They say, oh, I'm going to use a /12, and then, oh, I need all this space for other things. And then you have the pessimistic one: oh, I'm going to use, I don't know, 10 services, so I'll use a /28 or a /27. Okay, I ran out of addresses. So it is very funny. And this happens. We have feedback from all these cases, and it's a really nice social experiment to see how we have pessimistic and optimistic planning in the world.
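The optimistic and pessimistic choices map directly onto prefix arithmetic. The Python sketch below picks the longest IPv4 prefix that still covers an expected number of services; `smallest_ipv4_prefix` is a hypothetical helper, and it deliberately ignores any addresses a cluster reserves inside the range.

```python
def smallest_ipv4_prefix(expected_services):
    """Longest IPv4 prefix whose address count covers expected_services.

    Assumption: every address in the range is usable for a Service.
    """
    if expected_services < 1:
        raise ValueError("expected_services must be positive")
    host_bits = (expected_services - 1).bit_length()
    return 32 - host_bits

for services in [10, 1_000, 1_000_000]:
    prefix = smallest_ipv4_prefix(services)
    print(f"{services} services -> /{prefix} ({2 ** (32 - prefix)} addresses)")
```

Ten services fit in a /28 and a million need a /12, which are the two extremes mentioned above; being able to add ranges later removes the need to guess right up front.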
A
You spoke about kind, and I do have one quick question. How does Kubernetes core networking handle the networking difference between running Kubernetes inside kind, which is running in a container on your laptop, and running in the cloud or running on prem? Like, how do you reconcile all of these different environments from the Kubernetes point of view?
C
That's one of the nice things about the Kubernetes networking model, right? It requires connectivity across the whole cluster. Basically it boils down to two common solutions: either you use an overlay to create this flat network, or you use routing that connects each node with every other node. The first one is the simplest one, but it comes with a lot of complexity and performance problems. The second one is the more performant, but it means that you need to do better subnet planning and implement the routing and all this stuff. In kind, because we use the Docker network, we have this flat model, right? This non-overlay, this direct routing. So Benjamin and I created kindnet, which is the CNI that works for kind, and which we are actually proposing to donate to the Kubernetes org. And with kindnet it's very basic. It routes the pod subnets of each node: you just install a route from each container to every other container for its pod subnet, and it simply works, because the Docker network is flat. So it's just a simple routing problem, and you just need a very small CNI that is able to route correctly between pods and between nodes.
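The direct-routing idea can be sketched as a small Python function that, for one node, lists the routes it needs toward every other node's pod subnet. The node names, node IPs, and pod CIDRs are made-up example values, and `routes_for` is a hypothetical helper, not kindnet's actual code.

```python
def routes_for(node, nodes):
    """Routes a node needs: each peer's pod subnet via that peer's node IP."""
    return [
        (peer["pod_cidr"], peer["ip"])
        for name, peer in nodes.items()
        if name != node
    ]

# Example cluster on a flat Docker-style network (illustrative values only).
nodes = {
    "kind-control-plane": {"ip": "172.18.0.2", "pod_cidr": "10.244.0.0/24"},
    "kind-worker": {"ip": "172.18.0.3", "pod_cidr": "10.244.1.0/24"},
    "kind-worker2": {"ip": "172.18.0.4", "pod_cidr": "10.244.2.0/24"},
}

for destination, via in routes_for("kind-worker", nodes):
    print(f"ip route add {destination} via {via}")
```

Because every node container sits on the same flat network, a next hop of the peer's node IP is directly reachable, so no tunnel or encapsulation is needed.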
A
Got it, got it. I was going to ask you something about performance, because I think that your work in SIG Network in general goes beyond just IP planning. Performance is probably one of the main things. And my question is, and humor me here, because I know that this is probably complex, not complex for you, but: I never understood why Kubernetes doesn't have quality of service for networking.
C
Okay, this is a recurring topic, because the main thing is, when you talk about quality of service, you have different requests. There are some people that just want to do rate limiting. Then the other problem with networking, contrary to other technologies, is that it's statistically multiplexed. Typically you have one outbound connection: you have a VM and the VM has an interface. That's the interface that everybody needs to share; that's your shared resource. The way that this is implemented is with queuing. So all the pods are going to end up in the queue of this interface. And when you say quality of service, usually you mean: okay, I want this pod to get more priority than this other one, so its packets go out first. But on the other hand: oh, if this pod that I want to have higher priority is not sending anything, I want to reuse that capacity for the other pod. Otherwise I am wasting resources. And the main problem is how you do that. The only model that I see working is with priority, right? You can assign different priorities to pods, and the priority indicates some preference or something like that. The thing is that then we end up with this Kubernetes distributed model problem: every pod has to be able to connect to any other pod, and we don't know beforehand what the traffic patterns are going to be. So how do you model this for the user? How do you say, oh, this web server gets to use 80% of the traffic at this time? Or, if it's on this node, I always have requests? I'm trying to find somebody to work with us to model this better, because the main gap that we have is the UX. How do we design the quality of service system so the users can program it? Because what we have today is an annotation that says bandwidth, but bandwidth is rate limiting, and with rate limiting per se you are wasting resources, because if nobody is using the uplink, then you're wasting the resources. And Kubernetes is very well designed to optimize and be efficient, and that's problematic.
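The difference between a fixed rate limit and priority-based sharing can be shown with a toy Python model of a single shared uplink during one time slot. The pod names, demands, limits, and priorities are invented for illustration, and neither function corresponds to an existing Kubernetes mechanism.

```python
def static_rate_limit(demands, limits):
    """Each pod sends at most its fixed limit, even if the link is idle."""
    return {pod: min(demand, limits[pod]) for pod, demand in demands.items()}

def strict_priority(demands, priorities, capacity):
    """Serve pods from highest to lowest priority until capacity runs out."""
    sent, remaining = {}, capacity
    for pod in sorted(demands, key=lambda p: priorities[p], reverse=True):
        sent[pod] = min(demands[pod], remaining)
        remaining -= sent[pod]
    return sent

capacity = 100                      # units the uplink can carry per slot
limits = {"web": 50, "batch": 50}   # fixed per-pod caps
priorities = {"web": 10, "batch": 1}

idle_web = {"web": 0, "batch": 100}  # the high-priority pod is silent
print(static_rate_limit(idle_web, limits))
print(strict_priority(idle_web, priorities, capacity))
```

When the high-priority pod is silent, the fixed limit leaves half the link unused, while the priority scheme lets the other pod use all of it. The open question raised above is how to expose such a model to users, not how to schedule the queue.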
A
So I guess I'm going to play devil's advocate here. I think that the question goes beyond just rate limiting or quality of service between pods, because on the node you also have other things that are fighting for the bandwidth. You have the image pull traffic, and you have your logs and metrics traffic. And now with AI, these models are huge. You don't want to end up in a situation where a pod is pulling a model, so it's using all the bandwidth available on the VM and starving other pods. Do you see what I mean?
C
Yeah, that's the other thing, right? It's like with CPU. Remember that when we set up the kubelet, you also allocate some CPU for the kubelet, right?
A
Correct.
C
This is the same. Okay, in this model where you only have one NIC for control plane, for images, for everything, how do you know how much bandwidth you can allocate? Because as you said before, you had an image that was 500 megabytes and you said it's big. But now I'm going to download the model, and we are going back to sending DVDs to the server, because the image sizes now are all in the order of gigabytes. And then you touch on an important problem, the storage. The storage is usually over the network, right? You connect a bucket or a disk.
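The analogy with reserving CPU for the kubelet can be put into numbers with a short Python sketch. All figures here (a 10 Gbps NIC, 2 Gbps held back for system traffic, a 0.5 GB image, a 100 GB model) are hypothetical, and Kubernetes does not expose a bandwidth reservation like this today.

```python
def allocatable_gbps(nic_gbps, reserved_gbps):
    """Bandwidth left for pods after reserving some for node-level traffic."""
    if reserved_gbps > nic_gbps:
        raise ValueError("reservation exceeds NIC capacity")
    return nic_gbps - reserved_gbps

def transfer_seconds(size_gigabytes, bandwidth_gbps):
    """Ideal transfer time: gigabytes -> gigabits, divided by the link rate."""
    return size_gigabytes * 8 / bandwidth_gbps

pod_bandwidth = allocatable_gbps(nic_gbps=10, reserved_gbps=2)
print("image:", transfer_seconds(0.5, pod_bandwidth), "s")
print("model:", transfer_seconds(100, pod_bandwidth), "s")
```

Even in this idealized case the model download occupies all of the pod bandwidth for far longer than an image pull, which is the contention the conversation is describing.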
A
NFS or something.
C
Yeah, whatever. And that is going to consume network, and storage is very sensitive to latency. And then if you put something with more priority and it starts, I don't know, doing whatever, it's going to create this priority inversion, right? You want to use the network and then you want to use the storage. And because the storage is slow you start to reconnect over the network, and then you call support. Yes.
A
So actually, for the storage, I think I have a better solution. Because in old school networking, again, back to the topic we started with, the way you'd solve this is you would have a dedicated interface for storage, right? Your server would have one interface for Ethernet and then one Fibre Channel connection going to the NAS. So you isolate the traffic if you want to have low latency for storage traffic at least, right?
C
Yeah, that's the nice thing of kubernetes. We don't reinvent the wheel. The question is then you start to add dollars to the bill, right?
A
Yeah, of course.
C
Oh, okay. I don't want to pay so much for this storage. I just want this to work perfectly at the current cost. But it's a trade-off between cost and performance, and as I said, this is something that I'm especially interested in, and we are working hard on it in GKE, on optimizing this price-performance.
A
I think this is basically where things can start getting into a bit of a gray area. Because if you are on a cloud provider, there is probably a solution. We have gVNIC in Google, which is this massive bandwidth network interface, right? Which can do hundreds of gigabits. But not everybody is using a cloud provider. People are still stuck on prem with their own Kubernetes or whatever. So that's why I was asking this question. And so, leading toward the AI stuff, right? I guess you talked a little bit about how AI is changing the way you are working in SIG Network. But where do you see the future? Where do you see this AI taking you in the networking space? Because it's touching everything, including the basic infrastructure components.
C
Yeah. As I mentioned before, if we oversimplify networking: at the high level, that is Gateway API, it is clear that this requires more efficient algorithms to expose and to allocate resources for inference. Then we have the low-level networking, which is where we started, the more traditional networking. That part we already solved. I've been working on that during the last two years, and it was with dynamic resource allocation. The original proposal was to use a multi-network API. And as I said before, it was very complex. It cannot be standardized. It would end in fragmentation, because everybody would need to depend on their own CNI plugin, and with DRA we solve this. I wrote a paper, it's going to be published in one of the IEEE conferences, about the Kubernetes network driver model and how to use it for this problem. And we created a project at Google called DraNet that is solving this problem. Right. For AI/ML low-level networking you usually need to use an out-of-band network that is RDMA. You can have an RDMA RoCE interface that uses Ethernet, or an InfiniBand interface. And what we originally thought was a multi-network problem was actually a multi-interface problem, right?
A
Yeah.
C
Same as in storage. You don't need to model the storage network. What you need is for the pod to attach this NIC that goes to the storage network. And this is the same problem: the pod needs to attach this NIC that is RDMA, in addition to the pod network. And we solved that; we have a project and it's working fine. It's working really well. To the point that I expected to release it as GA next year, and we're going to do the GA in one month or something. Maybe for KubeCon, who knows.
A
Okay. Yeah. Because this is actually a particularly interesting problem, since a lot of these big models require multi-host inference. For example, if you have a model that you run on top of multiple nodes, then you would need high bandwidth between the nodes for the GPU to GPU communication. Right.
C
That's the key; that is actually the problem. Right. When we talk about network performance, the first time that I got in touch with this was with an HPC problem. They had a networking problem, they were talking about something like that, and they said, oh, call Antonio. And when I went there I said, okay, what is the network here? I was thinking, oh, this is slow. And they were talking slow: it was 10 microseconds or something. I said, are you saying micro or milli? Micro. Micro is slow. I said, okay, this is a totally different problem from what we were used to dealing with. And the more you start digging into that, you start to see, okay, how do you solve a problem when you are CPU bound? By offloading. Right. And that's the key thing. What the GPUs and these NICs do is they offload all the processing of the packets, or all the processing of the information, to these protocols. They don't even use TCP because it's slow. They use RDMA. And then in Kubernetes you just need to be able to offer the user the opportunity to express: oh, I want to run this application that is supported by this hardware, that is, the GPU and the NIC. And also to say: we also want to get the optimal performance. That means you need to match the PCI bus for the GPU and the NIC, because otherwise you have a penalty, which is significant, if you don't get alignment within the intra-node architecture.
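The alignment requirement can be illustrated with a small Python sketch that pairs each GPU with a NIC sitting under the same PCI root. The device names and PCI root identifiers are invented, and `aligned_pairs` is a hypothetical helper; a real scheduler or driver would read the topology from the node rather than from hard-coded dictionaries.

```python
def aligned_pairs(gpus, nics):
    """Pair each GPU with a free NIC under the same PCI root, if one exists.

    gpus and nics map a device name to the PCI root it hangs off.
    """
    free = dict(nics)  # copy, so the caller's inventory is not modified
    pairs = {}
    for gpu, root in gpus.items():
        match = next(
            (nic for nic, nic_root in free.items() if nic_root == root), None
        )
        if match is not None:
            pairs[gpu] = match
            del free[match]  # each NIC is handed to at most one GPU
    return pairs

# Illustrative two-socket node: one GPU and one RDMA NIC per PCI root.
gpus = {"gpu0": "pci0000:00", "gpu1": "pci0000:80"}
nics = {"rdma0": "pci0000:80", "rdma1": "pci0000:00"}
print(aligned_pairs(gpus, nics))
```

A GPU with no NIC under its own root is simply left unpaired in this sketch; whether to fall back to a misaligned NIC, and accept the penalty, is a policy choice the sketch does not make.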
A
Yeah, it's actually interesting, you touched on a topic that I think we should probably cover in the podcast. We need to find someone. One of the things that people don't realize is how CPU bound you are when you are on the cloud. You are right, because your performance for storage, your performance for networking, and your performance for pretty much everything else on the node is CPU dependent, right?
C
Yeah, that's the main problem, right? In networking, people used to get the tables with the bytes per second and all these things. But you need to remember that you don't have a router. What you are doing is processing the packets on the CPU or offloading to the NIC. And that's the thing that is going to limit your performance. Whatever has to process the packets is your bottleneck. And usually in networking you can offload. We have eBPF, we have netfilter, there are a lot of technologies to offload networking. But then you also want to have this nice feature of: oh, I want to check this header to send this to this backend and not to the other backend. So you have these trade-offs. You can only offload the things that the hardware is able to understand. Otherwise you need to process them somewhere.
A
Yeah, and I'm just going to throw one more question at you and then we can close out the conversation. I know that this mostly touches on security, but I'm always bundling security and networking in the same space. What's your take on this craziness of unsecured MCP servers? Everybody's building MCP servers today, but no one seems to talk about authentication and authorization, which drives me crazy. You have an autonomous agent and you give it access to a bunch of HTTP servers with no limitation on what it can do. Where do you think we're heading there? I know that there is no standard, for sure.
C
I had this conversation the other day, because this is a hot topic, right? And this is where we start to rethink the lines of Kubernetes, and where traditionally you would implement a service mesh kind of service, right, to handle the network authentication. So the question is, how is the future going to be? Are the applications, all these frameworks, going to embed authentication, or are we going to delegate it to the network? I don't have an answer. I see the industry moving in both directions in parallel, and I'm sure that some smart person will come up with an idea and say, oh, this has to be done this way, and people will follow. But right now, as you say, it's the wild west. Everybody runs things. We are going to get hacked or something like that. I'm sure that most of the... what is the name? CIO? CISO. The security people are scared about this. The CISOs are scared. And yeah, we are working on that. It's an area with a lot of development, and unfortunately I don't know what's there.
A
Yeah, it's an industry wide problem. Right. And it's very funny to me. Like I have been in the industry for 15 years, you have been there for longer than me. It's funny to me today how a lot of these problems are stuff that we have been talking about like 25 years ago.
C
That's why I tell you I was having this conversation. I said, okay, we are moving back to stateful. And that is shocking, because with Kubernetes everything was cattle. You don't use pets, and now you have all these pets and you need to take care of them.
A
It's funny, I guess like the much bigger observation is fundamentally the problems are things we solved a while ago. Like we're just like authorization authentication. That's not new. We have done this thing for a very long time. It's just, it's a different problem in a different place. But fundamentally the solution is the same.
C
Yeah, that's the best advice I give to the junior people that come to this: look, this is not new. It just changes the scale, changes some of the environment. But the problem is something that we have been working on for a long time. So let's try to use the best practices and experience and tackle this problem.
A
Awesome. I couldn't have ended better than this. Thank you so much Antonio for your time.
C
Thank you Abdel, and hope to see you soon at KubeCon.
A
Or I will see you at one of the KubeCons for sure.
C
Okay.
A
All right, thank you.
C
Bye.
A
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @kubernetespod or reach us via email at kubernetespodcast@google.com. You can also check our website at kubernetespodcast.com, where you will find transcripts and show notes and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening and we'll see you next time.
Podcast: Kubernetes Podcast from Google
Episode: GKE 10 years and SIG Networking, with Antonio Ojea
Hosts: Abdel Sghiouar (A), Kaslyn Fields (B)
Guest: Antonio Ojea (C) – Software Engineer at Google, Kubernetes core maintainer, SIG Networking/Testing tech lead, Steering Committee member
Date: October 1, 2025
This episode marks ten years of Google Kubernetes Engine (GKE) and dives deep into the evolution of Kubernetes networking with Antonio Ojea—a leader in the SIG Networking group. The conversation covers the historical journey from hardware-driven networks to Kubernetes’ API-driven approach, integration with traditional networking, the challenges of egress, new possibilities powered by dynamic resource allocation, and trends like AI/ML’s impact. Antonio also shares updates on major networking-related features, the development of the Gateway API, quality of service (QoS), and security.
Antonio’s journey began with traditional network engineering, moving to software-defined networks during the 2010s open-source boom.
"I’m tired of being a network engineer. I want to be the person doing the virtual networks." (C, 02:06)
Traditional networking was hardware and configuration-heavy; Kubernetes abstracts networking through APIs, favoring programmability and integration.
The industry’s shift:
"Kubernetes has a different approach for networking...you have everything as an API...more abstracted." (C, 02:50)
Interfacing with Existing Infra: Controllers like those from F5 help bridge load balancing between Kubernetes and legacy systems.
Mindset Shift: Treat the Kubernetes cluster as an "autonomous system" with its own routing domain, ingress, and egress points.
"You need to treat the cluster as an autonomous system...define the ingress and egress point." (C, 05:14)
Complexities: Integrations become complicated when moving from basic load balancing to more nuanced traffic (e.g., HTTP, API gateways, firewall rules).
Standardization Efforts: Work on Admin Network Policy, set for beta (KubeCon), addresses granular firewall-like controls familiar to on-premises users.
"This is a heavy-demanding feature for people from on premises...this API is able to cover that." (C, 06:52)
Egress handling (sending traffic out to specific appliances/firewalls) is not yet standardized; major CNIs provide their own solutions.
Persistent Issue: Users want pods to consistently use a specific external IP for egress, but Kubernetes has no standardized solution.
"This is something that...have never been solved by Kubernetes." (A, 08:41)
Barriers: Solution is complex and implementation-specific, varies by environment.
AI/ML Workloads: Growing need for pod-based identities as workloads become more stateful; dynamic resource allocation increases flexibility.
"With dynamic resource allocation...now...we are going to see more of these things that we couldn't do before..." (C, 13:48)
Work in Progress: Antonio’s team is developing allocation of virtual IPs as a dynamic resource, opening new possibilities like NAT pools to distribute traffic and mitigate throttling issues.
Strategic Importance:
"Ingress is an API that we consider frozen...we are only developing new features in Gateway API." (C, 15:27)
Gateway API subsumes Ingress; enables richer, more flexible L7 (application layer) traffic protocols, with a focus on features for AI inference workloads.
Push for Adoption: Community is being surveyed to understand migration blockers from Ingress to Gateway; ongoing development like the inference gateway is aimed at AI use-cases.
Challenge: IPv6 service CIDR space was limited by API server’s bitmap allocation system, leading to scale and performance limitations.
"So basically how it started...I GA IPv6 in 2020...I cannot use more than /112 for the service CIDR in IPv6—why is that?" (C, 17:06)
Solution: New multi-service CIDR allows dynamic allocation and expansion; major improvement for upgrading and scaling production clusters.
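To make the bitmap limitation concrete, here is a minimal Python sketch (not from the episode) that computes how many addresses a service CIDR contains and how large a one-bit-per-address bitmap would need to be. The function name `bitmap_cost` and the example prefixes are illustrative assumptions; this does not reproduce the API server's actual allocator.

```python
import ipaddress


def bitmap_cost(cidr: str) -> tuple[int, int]:
    """Return (address_count, bitmap_bytes) assuming one bit per address."""
    net = ipaddress.ip_network(cidr)
    count = net.num_addresses
    # Round up to whole bytes.
    return count, (count + 7) // 8


# Hypothetical example ranges, chosen only to show how the size grows.
for cidr in ("10.96.0.0/12", "fd00::/112", "fd00::/108", "fd00::/64"):
    count, size = bitmap_cost(cidr)
    print(f"{cidr}: {count} addresses, bitmap of {size} bytes")
```

Under this assumption, every bit removed from the prefix length doubles the bitmap, which is why a single fixed-size allocation structure cannot cover a typical IPv6 subnet.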
Human Factor:
"We have pessimistic and optimistic planning in the world." (C, 19:19)
Kubernetes Networking Model:
"It boils down to common solutions: use an overlay to create this flat network or use routing..." (C, 20:25)
For environments like kind (testing on Docker), Antonio and team developed the kindnet CNI to enable simple routing between containers, leveraging Docker’s flat networking.
QoS is Complex:
Modeling Challenges:
"How do we design the system of quality of service so the users can program the system?" (C, 23:44)
Resource Sharing: Like CPU reservations for kubelet, similar bandwidth considerations are essential.
Cloud vs. On-Prem: Cloud providers have solutions (e.g., Google’s gVNIC), but on-prem users with legacy hardware need adaptable solutions.
AI is Reshaping Requirements:
Antonio's work:
dranet project at Google to support multi-interface workloads, especially for high-performance AI/ML workloads.
Design Parallels: "You don’t require to model the storage network. What you require is the POD to attach this NIC..." (C, 29:53)
Cloud Realities:
"Your performance for storage, your performance for networking...all is CPU dependent." (A, 32:20)
Offloading: Technologies like eBPF and netfilter help, but user-level programmability can be limited by hardware support.
Security is Lagging:
"Everybody’s building MCP servers today—but no one seems to talk about authentication and authorization..." (A, 33:38)
Wild West: Lack of standards means some frameworks embed security, others offload to service mesh/network.
Historical Echoes:
"This is not new. It just changes the scale...But the problem is something that we were working on it for a lot of time." (C, 35:44)
Antonio Ojea provided a comprehensive, insider look at networking’s past, present, and future in Kubernetes—spanning real-world integration with legacy infra, standardization efforts in SIG Networking, the transformative demands of AI/ML, and enduring security challenges. The episode balances detailed technical insights with big-picture reflections, making it informative both for practitioners and those following the community’s strategic direction.