
A
Modern cloud-native systems are built on highly dynamic, distributed infrastructure where containers spin up and down constantly, services communicate across clusters, and traditional networking assumptions break down. Linux networking was designed decades ago around static IPs and linear rule processing, which makes it increasingly difficult to achieve scale in Kubernetes environments. At the same time, modifying the Linux kernel to keep up with these demands is slow, risky, and impractical for most organizations. The extended Berkeley Packet Filter, or eBPF, is a Linux kernel technology that allows sandboxed programs to run safely inside the kernel without modifying kernel source code or loading kernel modules. Cilium is an open source, cloud-native networking platform that's built on eBPF and provides, secures, and observes connectivity between workloads in Kubernetes and other distributed environments. Bill Mulligan is a maintainer in the Cilium ecosystem and a member of the team at Isovalent, the company behind Cilium. He joins the show with Gregor Vand to discuss how eBPF works under the hood, why Cilium has become one of the most widely adopted Kubernetes networking projects, and how the future of cloud-native infrastructure is being reshaped by programmable kernels. Gregor Vand is a security-focused technologist, having previously been a CTO across cybersecurity, cyber insurance, and general software engineering companies. He is based in Singapore and can be found via his profile at vand.hk or on LinkedIn.
B
Hello and welcome to Software Engineering Daily. My guest today is Bill Mulligan.
C
Hey, thanks for having me.
B
Yeah, great to have you here today, Bill. We're going to be talking all about Cilium and the technology eBPF. Before we get there, as we like to do, it would be great just to hear a bit about you and what your journey to joining Cilium was. And I believe the company you work for is sort of a wrapper around Cilium, so maybe just walk us through all of that.
C
Yeah, definitely. So I like to say it's a little bit of an accident how I've ended up here. Just a series of circumstances. I actually originally got my undergrad in biochemistry, so very, very far away from technology, got my master's in social science, and then ended up at the first startup that I worked at. Back in 2018 they were doing an AI platform on top of Kubernetes, and at the time nobody was doing AI, nobody was doing Kubernetes, so obviously it went out of business pretty quickly, and I moved on to the next startup. Then I worked for the CNCF, the Cloud Native Computing Foundation, looking at the global cloud native community, and ended up at Isovalent as this promising startup in the cloud native space. And I was excited about going to Isovalent because Cilium at that time was really starting to emerge onto the scene as a new and exciting way to do networking in the Kubernetes and cloud native world. So I thought, this company seems pretty interesting. They had just emerged from stealth and I was like, let's see where this rocket ship goes. And it's been kind of a wild ride since then.
B
Awesome. For those that are not totally familiar, what exactly is the CNCF? And could you then describe Isovalent and Cilium, what the relationship is between the technology and the company and that kind of thing?
C
Yeah, definitely. So CNCF, or the Cloud Native Computing Foundation, is a sub-foundation of the Linux Foundation. The Linux Foundation obviously hosts the Linux kernel, but also, I think, 900 other projects, and CNCF is the largest sub-foundation under that. And CNCF itself hosts just over 200 projects now, Cilium being one of them. Really, the CNCF was created with Kubernetes as the core project, and then all of the cloud native projects have been brought in around that, Cilium being one of those. What Cilium does in the cloud native world is this: Kubernetes is a way to orchestrate containers, and other types of workloads now too, but it actually doesn't come with any networking. In the world of Kubernetes, it's all distributed systems, and the most important part of a distributed system is the network, because everything's got to talk to each other. So Cilium is the CNI, or Container Network Interface, plugin that plugs into Kubernetes and basically says this packet needs to go here, this is how traffic is getting into our cluster, this is how we egress traffic out of it, and a lot of other things. So essentially you can think about Cilium, at a very high level, as networking for the cloud native world, at least at the very beginning. It's expanded a lot beyond that, which I guess we'll dive into a lot more after this. And then Isovalent is the company that originated Cilium and then gave it to the CNCF, so it's under neutral governance there, and we have a lot of contributors from different companies around the ecosystem. And Isovalent is the company that's creating commercial products around the different projects that are in the Cilium ecosystem.
B
Yeah, I saw the annual report for Cilium came out just a few hours ago, actually. And it said that on December 16, 2015, Thomas Graf pushed the first commit for Cilium. So we're literally almost to the day 10 years on from that first commit.
C
Yeah, exactly. So it's a decade in the making. A decade-in-the-making overnight success, as people like to say. And it's kind of wild. I think, differently from a lot of open source projects, it actually is open source from the first commit. If you go look at the first commit, it's I think like 200 lines of code, the license, and a .gitignore file. That's it.
B
Awesome. Full open source, proper open source. Yeah, awesome. So I think we maybe should get super base level and just understand what it is. So Cilium is built on what's called eBPF. And what is that? I think that's kind of where we need to start. I imagine some of our audience know this already, and it's great that you're here. Equally, I think a lot of our audience have maybe no clue what this is.
C
Yeah. So eBPF is also a technology that's not quite 10 years old; its birthday is a little bit earlier, so it's about 11 years old now. And it's a Linux kernel technology. And Cilium was founded from the ground up based on eBPF as a technology. And what eBPF allows you to do is to reprogram the Linux kernel. And the comparison a lot of people like to make is that eBPF is to the kernel what JavaScript is to the browser. And so if you think back, before we had JavaScript, you kind of had static web pages, right? You could consume information off of them, but you couldn't actually do anything with the web page. JavaScript comes along and suddenly you could add interactive elements. As you start to interact with the web page, it changes what it's doing. And that's exactly what eBPF is doing for the Linux kernel. And if you're not familiar with how the Linux kernel development cycle works and how the Linux kernel works as a whole, I'll jump into that. So the way the Linux kernel works, it's not just that you get this distribution and you download it and you're using it. The way it works is you need to upstream things into the kernel. And Linux is the largest open source project in the world. And the development cycles are a little bit longer because it's deployed on literally billions of devices. Every single Android phone is running some portion of Linux. So they need to be very careful about what actually goes into the upstream kernel, and the development cycles are long. And if you know anything about the Linux kernel mailing list, there are a lot of technical discussions that go on on the mailing list. So to be able to get something upstream is a long process. It can take a year or maybe even, for some more controversial things, a couple of years. So it's not just like, okay, we need this new feature in the Linux kernel, let's just ship it. It's like, okay, well, you need to have the discussions, work with the people upstream, decide on the right path forward.
And then it gets into the kernel, and then you have it in the kernel, right? So that's the latest one that Linus just goes out and produces. But that's not actually what you're running in production. If you look at what kernel most people are running, it's actually 2 years old, 3 years old, 5 years old. I mean, people don't run the latest kernel. They wait for it to actually bake. They wait for an LTS, like a long-term stable release, or they get something from their vendor. So if you look at the actual kernel version you're running, it's most likely a couple of years out of date. So if you have a one-year development cycle and then a couple of years till it actually gets to you, maybe it's five years from, okay, I have this idea, to when I can actually receive this feature, right? So for most people, for all intents and purposes, the Linux kernel is pretty static, right? You're not going to change it. eBPF came along and completely changed this programming model. Say you want a new feature. The way applications usually interact with the kernel is by making system calls into the kernel, or other ways to interact with it. The application says, okay, can you read this file? Can you open this networking socket? Can you do these different things? And it makes a call into the kernel, the kernel does that thing and sends things back to user space. But what eBPF allows you to do is to modify how the kernel is actually running. So you take a program that you write and you insert it into the kernel. And now when the application makes a call from user space into kernel space, instead of the kernel working how it normally does, your program runs when that specific hook is called. So if, say, a malicious process is trying to read a file, you can be like, we don't want this process to read this file. So it's blocked.
If you want to, in the case of Cilium, do networking faster or more programmatically, you can be like, okay, this packet is coming here, and we want to just reroute it directly without it going through the whole networking stack. And so what eBPF is allowing you to do is to actually add functionality on the fly into the Linux kernel. And if you've gotten to this point, you're kind of like, okay, well, the Linux kernel is pretty important. I know if I crash it, that's really bad. All the systems are over. And so that's what makes eBPF really powerful: it's not just extending the Linux kernel, it's doing it in a safe way, because if you just throw random code into the kernel, you're very likely to crash it. You may be familiar with the CrowdStrike incident, where they took out half the world's IT, right? That was because there was a bug in code running in the kernel and they crashed all the kernels around the world. It's obviously something you want to avoid. And so what eBPF allows you to do is insert programs into the kernel in a safe way. And the way that it does that is that each of the programs that you're adding to your kernel goes through a verification step. And this verification step basically checks that the program is safe to run in the kernel. So it's not going to crash the kernel, it's not going to access memory out of bounds, it's not going to do these things that will essentially harm the kernel. And so these programs are safe to run in the kernel, they won't crash the system, and they're a very efficient way to reprogram what's happening. And so that's the foundational technology, right? eBPF got merged into the kernel about 11 years ago. The Cilium team was like, okay, this way of programming the kernel is going to let us rebuild everything in the kernel better. What can we rebuild first?
And the team at the time came out of the Open vSwitch team, they were doing a lot of networking stuff, and they're like, okay, we're going to start rebuilding networking in the Linux kernel: better, faster, and ready for the cloud native world. And so that's the birth of Cilium.
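As a rough illustration of the two ideas described above (a program attached to a hook, and a load-time check before it is allowed to run), here is a toy Python model. It is only a sketch: the three-instruction bytecode, the buffer size, and the function names `verify`, `attach`, and `run_hook` are all invented for this example and do not correspond to real eBPF instructions, the real verifier, or any kernel API.

```python
# Toy model only: a made-up three-instruction bytecode, not real eBPF.
BUFFER_SIZE = 64  # bytes of context a program may read (assumed value)

def verify(program):
    """Load-time check: reject programs that could misbehave at run time."""
    if not program or program[-1][0] != "ret":
        return False  # the program must end by returning a verdict
    for index, (op, arg) in enumerate(program):
        if op not in ("load", "jump", "ret"):
            return False  # unknown instruction
        if op == "load" and not 0 <= arg < BUFFER_SIZE:
            return False  # out-of-bounds memory read
        if op == "jump" and not index < arg < len(program):
            return False  # forward, in-range jumps only, so it always terminates
    return True

hooks = {}  # hook name -> attached program

def attach(hook, program):
    """Attach a program to a hook, but only if it passes verification."""
    if not verify(program):
        raise ValueError("program rejected by verifier")
    hooks[hook] = program

def run_hook(hook):
    """Simulated kernel path: run the attached program and return its verdict."""
    program = hooks.get(hook)
    if program is None:
        return "default"  # no program attached, normal behavior
    pc = 0
    while True:
        op, arg = program[pc]
        if op == "ret":
            return arg
        pc = arg if op == "jump" else pc + 1

attach("file_open", [("load", 0), ("ret", "deny")])
print(run_hook("file_open"))
```

The point of the sketch is the ordering: the check happens once, when the program is loaded, so nothing unsafe ever reaches the hot path.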
B
Got it. And I think it's actually helpful to ask: what does eBPF stand for? It's extended Berkeley Packet Filter, and I guess the "packet filter" on the end there is kind of the thing. I believe there was, in theory, a retroactively named classic Berkeley Packet Filter, and eBPF is the sort of advanced version of that. Is that correct?
C
Yeah. So BPF, or Berkeley Packet Filter, was the original thing from decades ago. It's like, okay, can we put a packet filter into the kernel? And that was the thing. And originally the story was that Alexei, who was one of the co-creators of eBPF, came to the Linux kernel and said, I want to be able to insert bytecode into this. And this was a huge patch set coming to the kernel, and they're like, no, no, no, we don't want to do that. And so Alexei went back to work with Daniel and a couple of other people and they're like, okay, how can we actually get this into the kernel? And they're like, well, there's actually this packet filter already in the kernel. What if we just improve that to be able to get what we wanted into it? And so piece by piece they started improving the original BPF, the packet filter, and then also extending it. And so what they were able to do was to improve the existing subsystem and then change it into a more generalized version. cBPF, the classic BPF, is the original Berkeley Packet Filter, and now eBPF is the extended version. We don't really call it extended Berkeley Packet Filter because it does so much more than just networking now, and saying that it's a packet filter isn't really true. It does things across observability, security, profiling, scheduling, interacting with devices. It's basically a generalized way to reprogram the Linux kernel. So we just use eBPF as a standalone term. But that's the history. And that's why it's a little bit confusing, because there's BPF, there's cBPF, and there's eBPF. And if you talk with some of the kernel people, they use eBPF and BPF kind of interchangeably just because of that history. But technically now it's called eBPF.
B
Gotcha. Okay, so I think you've set the stage pretty well in terms of what eBPF allows within Linux and the Linux kernel. So it's kind of become this framework for enabling things that can run in the kernel, almost like a set of, I guess, rules that mean that if you want to build things that touch the kernel, they're not going to break some of the big things. And I think, yeah, the CrowdStrike example was a good one. That's what happens when this isn't done properly. That's obviously Windows, so it's not Linux, but yes, that was a problem. So where did Cilium then come from in that sense? I guess Cilium is handling the networking side of what can be done with this capability. There must be other products and companies that deal with other bits that are now possible with eBPF being there. But yeah, what does Cilium enable? And Kubernetes is obviously a big part of that, so let's kind of go there.
C
So if you rewind back again, about 10 years ago we had the whole containerization movement that was really exploding at the time. And Kubernetes had also just been released as open source, not from the first commit, but as this actual standalone project. And so we have this new cloud native world. And if you think about the transition into the cloud native world, what you're seeing is a lot more ephemeral, dynamic environments, where containers are coming and going, versus the traditional world of how Linux was built. Linux is not a 10-year-old technology; next year it will be a 35-year-old technology. And so the programming model for the Linux kernel is very different from what you need in the cloud native world. A lot of Linux networking is built based on IPs, based on iptables. Think of, okay, we have this list of IPs that we trust and a list of IPs that we don't trust. But in the cloud native world, you're spinning containers up and down all the time. And so how can we take this decades-old technology and bring it into the cloud native world? And that's really the challenge that Cilium set out to solve: how can we do cloud native, identity-based networking? And so, some of the original challenges that Cilium was looking at. One is that iptables is the way that we do a lot of networking. You're saying, okay, we need to send a packet from this IP to this next IP. And the way that you do that is you have a list of iptables rules and you go through them linearly. But if you have thousands, tens of thousands, a million containers in a cluster, going through a linear list of rules is not very efficient. So one of the first things Cilium did, and one of the reasons that a lot of people like Cilium, is that it replaced iptables and kube-proxy, which is the proxy that routes most of the traffic in Kubernetes. We replaced that with what we call kube-proxy replacement.
And that's all written in eBPF, and so it replaces iptables with eBPF. And rather than going through things linearly and having to do things in O(n) order, what eBPF allows us to do is have everything in a hash map, and it's able to look things up in O(1), which is a lot faster and a lot more scalable. So if you look at the difference when you have like 10 services in the cluster, it's not that big, right? Because you can run through a list of 10 services pretty quickly. But if you have 10,000, the difference between reading through the rules linearly and being able to just look them up in the hash map is very significant. So that's one of the things: being able to write things in a modern way for the modern technology. And the modern way of doing things is making networking a lot more efficient. And there are a lot of stories now about replacing kube-proxy, like Trendyol, an e-commerce company in Turkey. By replacing kube-proxy, they increased cluster throughput by 40%. So big performance gains there. Then the next thing about Cilium: you have a lot of containers, you have these IPs, but the IPs aren't fixed because they're coming and going all the time. And a big thing in networking is not just connecting things, but also making sure that things that aren't supposed to be connected don't stay connected. That's the whole network security part. And if you're doing network routing and network security based on fixed IPs while containers are coming and going, you're going to have to be updating all these rules a lot, being like, okay, this IP is no longer being used, so we shouldn't be able to route traffic to it. So it's going to cause a lot of churn in the cluster, a lot of overhead.
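To make the linear-versus-hash-map comparison concrete, here is a minimal Python sketch. It does not reproduce iptables or Cilium's eBPF maps; the service table, the addresses, and the function names `lookup_linear` and `lookup_hashed` are made up purely to show why an ordered rule scan costs O(n) per lookup while a hash-map probe stays roughly constant.

```python
# Hypothetical service table: virtual IP -> backend name (made-up values).
services = {f"10.0.{i // 256}.{i % 256}": f"backend-{i}" for i in range(10_000)}
rules = list(services.items())  # the same data as an ordered rule list

def lookup_linear(vip):
    """Walk the rules in order, as a sequential rule chain would: O(n)."""
    steps = 0
    for rule_vip, backend in rules:
        steps += 1
        if rule_vip == vip:
            return backend, steps
    return None, steps

def lookup_hashed(vip):
    """One hash-map probe regardless of table size: O(1) on average."""
    return services.get(vip)

last_vip = rules[-1][0]
backend, steps = lookup_linear(last_vip)
print(backend, steps)           # the scan had to visit every rule
print(lookup_hashed(last_vip))  # same answer from a single probe
```

With 10 entries the two approaches are indistinguishable; with 10,000 the worst-case linear scan does 10,000 comparisons for a single packet's lookup, which is the scaling gap described above.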
And Cilium was like, okay, we're moving from a world of IPs towards identity. Like the classic DevOps analogy of going from pets to cattle, we're not looking at individuals anymore, we're looking at groups or sets of things with labels. And so Cilium switched the whole networking model from this IP-based model to this identity-based model. And so rather than saying IP X can talk to IP Y, we can say frontend talks to backend. So then as we rotate the containers behind these labels, it doesn't actually matter. And as you spin up a new container you can be like, okay, this has the backend label, and so all the frontends can now automatically talk to it. And so if you think about the cloud native world, it's like, how can we switch to this identity-based model? And by being able to give things identity it makes things a lot easier, because you can swap things out on the back end and the identity is still exactly the same, and it reduces a lot of the churn in the cluster too. So what Cilium was trying to do at the beginning is like, we have this new modern cloud native world. Things are a lot more dynamic and ephemeral. The current networking technology that we have isn't going to be able to keep pace with what we need to do in this world. So how can we rethink networking for the modern world with eBPF? And the way eBPF allows us to do that is we can take out iptables, we can route things very efficiently, we can move from IP towards identity for both our networking and our security model, and it allows us to bring networking into the modern cloud native world.
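The shift from IP-based to identity-based rules can be sketched in a few lines of Python. This is an illustrative model only, not Cilium's implementation: the pod IPs, the `role` label, and the `can_connect` function are assumptions invented for the example.

```python
# Made-up pods; the "role" label stands in for an identity derived from labels.
pods = {
    "10.1.0.4": {"role": "frontend"},
    "10.1.0.9": {"role": "backend"},
}

# The policy is written once, against identities rather than IP addresses.
allowed = {("frontend", "backend")}

def can_connect(src_ip, dst_ip):
    """Resolve both endpoints to identities, then check the identity pair."""
    src, dst = pods.get(src_ip), pods.get(dst_ip)
    if src is None or dst is None:
        return False  # unknown endpoints have no identity, so deny
    return (src["role"], dst["role"]) in allowed

# A backend pod is replaced: its IP changes, the policy does not.
del pods["10.1.0.9"]
pods["10.1.0.77"] = {"role": "backend"}

print(can_connect("10.1.0.4", "10.1.0.77"))  # new backend is reachable
print(can_connect("10.1.0.4", "10.1.0.9"))   # the old IP no longer resolves
```

Note that `allowed` never had to be edited when the backend was rescheduled, which is the reduction in rule churn described above.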
B
Yeah, it's a really good way to explain it. I guess there are some analogies with just cloud IAM identity management, like how the cloud providers effectively added this, what is unfortunately now an incredibly complex thing. You've got layers on IAM now to try and make it easier to administer because people end up creating the wrong identity profiles and all sorts of things. But yeah, so is that kind of a good analogy?
C
Yeah, exactly. So if you think about it, an example from my personal life would be that I log into Google and it gives me access to a lot of different services based on the identity that I have. I say I'm Bill Mulligan, I give this identity to Google, and Google goes out and says this is Bill Mulligan to all these different services. You could do the same thing in Kubernetes. You can be like, okay, this new pod is now frontend, and it's frontend to all these other services. Or it's the backend that all the frontends want to talk to, essentially. And if you think about adding a new developer to your team, you don't want to individually give them access to GitHub and your cloud resources and your developer environment and all the other services that they need. The way that you probably do it is you give them one identity in something like Okta, and then Okta provides the identity out to all the other services that they need access to.
B
Yeah, that makes sense.
B
So maybe if we just look at what the feature set of Cilium is, if you like. We've touched on it there, but we've got things like network policies, and I think that'd be interesting to dive into a little bit more, and service mesh as well. And I think it would also be helpful, when we get there, to just touch on what a service mesh even is, because I think some listeners may not be familiar with that, and then we could maybe get onto some of the more advanced features as well. But yeah, let's maybe just start with network policies. That seems to be kind of the core of Cilium, so maybe just dive into that a little bit more.
C
I talk to a lot of users out in the Cilium community, and the main reasons that they choose Cilium, because there are a lot of different networking solutions in Kubernetes, are: one is network policy, another is kube-proxy replacement, getting the performance and scalability benefits, and then encryption of network traffic and observability with Hubble. But starting with network policy. So this is going back to distributed systems. We want to make sure things can talk to each other, but we also want to make sure things can't talk to each other. And so in Kubernetes there are Kubernetes network policies, and these are layer 3, layer 4 network policies. So you're looking at things like IPs: this IP can or can't talk to that other IP. So Cilium implements Kubernetes network policies for layer 3 and layer 4. But the additional thing that a lot of people look at is that it's not just these low levels; we also want to look at layer 7 network policies. So Cilium has layer 7 network policies too, and we call them Cilium network policies. So things like allow traffic from *.google.com or don't allow traffic from this domain. So being able to look at the actual domain with layer 7 network policies is super helpful for a lot of people. You can also do things like cluster-wide network policies, looking at which namespaces can or cannot talk to each other. So one interesting use case that we had was Bloomberg. Obviously they have a lot of financial data, and they were coming out with a new product that was essentially a data sandbox studio. So a customer logs in, they're able to access the financial data, and they're able to do different types of work with it. So they had a Jupyter notebook and they could write different programs against the data, get what they wanted, and then see the data. But the important thing is that it's Bloomberg's financial data. They want to make sure data is not being exfiltrated out of this data sandbox studio.
They also had multiple tenants, and they wanted to make sure each of the tenants can't talk to the other tenants. One person can't see what the other person is doing with the data. And a lot of that you can do with network policy. So you basically put each of the tenants in its own namespace and make sure, with network policy, that they can't talk across the different namespaces in the cluster. And you can also write network policies basically saying that data can't egress out of the Kubernetes cluster at all. And so with that they're able to create a new product for their customers while still keeping their sensitive financial data secure. So network policy is a really important thing for securing your Kubernetes clusters.
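The multi-tenant setup described here boils down to two rules: traffic stays inside a tenant's namespace, and nothing egresses from the cluster. The Python sketch below models just that decision logic. It is not Kubernetes or Cilium policy syntax; the `Endpoint` type, the tenant names, and `is_allowed` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    namespace: str          # tenant namespace; unused for external endpoints
    in_cluster: bool = True

def is_allowed(src: Endpoint, dst: Endpoint) -> bool:
    """Toy policy: same-namespace traffic only, and no egress from the cluster."""
    if not dst.in_cluster:
        return False  # blocks data leaving the cluster
    return src.namespace == dst.namespace  # tenants cannot reach each other

notebook_a = Endpoint("tenant-a")
data_a = Endpoint("tenant-a")
notebook_b = Endpoint("tenant-b")
internet = Endpoint("", in_cluster=False)

print(is_allowed(notebook_a, data_a))      # within the tenant
print(is_allowed(notebook_a, notebook_b))  # across tenants
print(is_allowed(notebook_a, internet))    # out of the cluster
```

In a real cluster the same intent would be expressed declaratively as network policy objects and enforced by the CNI; the sketch only shows which connections such policies are meant to permit.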
B
That's a really good example. Funnily enough, I'm working on something similar, so I'm going to just ask a question on that basis. So why does Cilium make that easier than not using Cilium, if you're just using Kubernetes?
C
There are different CNIs that you could use in Kubernetes, and it's not a requirement that they implement network policies. So some CNIs don't have any network policy support. With those, you can write Kubernetes network policies, but they won't be enforced, so it's really not effective. Some of them just do the Kubernetes network policies. Then you only get layer 3 and layer 4, so you can't write more complex or advanced use cases around network policy. And then there are other things, like multi-cluster network policy. If you're running multiple Kubernetes clusters, this is another thing a lot of people turn towards Cilium for, because it simplifies that. Cilium allows you to look at network policy not just on one Kubernetes cluster, but across multiple Kubernetes clusters. Kubernetes gives you basic network policy, and Cilium allows you to do much more advanced use cases around network policy.
B
Yeah, makes sense. So how about. Is there anything more around network policies or do you want to go to Service Mesh?
C
We can go to Service Mesh.
B
So, yeah, talk to us about Service Mesh again. I think what is a Service Mesh first and foremost and then how is Cilium helping to that end?
C
Yeah, so anybody that knows me might know that I'm trying to kill the word "service mesh" and the category "service mesh". If you go look, there's an article that I wrote that I think explains a lot of my opinion; it's called "The Future of Service Mesh Is Networking". Service mesh is a little bit newer than Kubernetes, and it's once again like, okay, new cloud native world, there are a lot of things that we need to rethink, and a lot of this is service routing. So service mesh is a term that tried to mean a lot of different things. If we have microservices, we have a lot of new challenges: how do we do the networking between them, how do we do the observability, how do we do the security between all of them, with microservices running all over the place. And service mesh tried to solve this with a lot of layer 7 networking stuff. So, right, it's networking, observability, and security, a lot of it around layer 7. And this is why I kind of have a problem with the category service mesh: if you look at all of those, those are all fundamentally networking things. And if you're trying to do it at just one specific layer, you might be missing a lot of the context from all the other layers. So I think we had a big arc where service mesh was very hot and a lot of people were trying to implement it. But I think we're getting into a phase where people are understanding that you can't separate out the different layers of the networking stack. You can't be like, oh, I'm just going to look at only layer 3. If you want to have the full context for your application, you need to look at all the layers and you need to look at them holistically. An example that I've heard is people running Cilium as a CNI. Cilium has a service mesh, which I'll get to in a second, but you can also run a different service mesh on top, right?
So they're running Cilium and they're running a different service mesh, and they're like, okay, well, Cilium does a lot of smart things in eBPF. It doesn't use iptables. It just reroutes traffic, and it can do things like route packets from socket to socket within the same Linux host. You can do a lot of things that will make it much more efficient, scalable, and performant, and save you CPU cycles. But it doesn't mean all the other parts of the networking stack know what's happening. So an example would be: Cilium can route things directly from one socket to the next, and the traffic doesn't go through the whole Linux kernel networking stack. Meanwhile the service mesh is looking at the end of the Linux kernel networking stack, and it's like, okay, I'll do this layer 7 processing once the packet comes out of that. But the packet actually never goes through the networking stack, so the service mesh never sees it. And so you're like, okay, all this traffic is disappearing, or we don't route it, or we don't see it. It's because it doesn't go through the traditional networking stack as the service mesh was expecting. And so service mesh, I think, can't be its own standalone category. You need to think of it in the context of your whole networking stack. And so this is why Cilium came along. It was originally just a CNI doing a lot of this layer 3, layer 4 routing. But then we're like, okay, well, networking is not just a couple of layers as a standalone category. You actually need to have the context of the full stack. So what we did is we came up with Cilium Service Mesh, and this is where some of the layer 7 network policies started to come in, and also other things like traffic routing at layer 7 and some of the observability stuff, but integrated with the rest of the CNI and the rest of the networking story. And I think I also have a problem with the term service mesh because it's kind of nebulous. It's like, where does service mesh live?
Where does networking start? Where do they overlap? They're all just kind of overlapping concerns. What I think of today as Cilium Service Mesh is Cilium's Gateway API implementation in Kubernetes and some GAMMA. Sorry, I have a lot of problems with service mesh.
B
There's a lot of emotion pent up with service mesh.
C
But I think it's also really funny. I look at the analytics for the website, and one of the top three pages, I think, is the Cilium Service Mesh page. So it's what people are interested in. But I'm like, so what are you actually interested in? Is it the routing? Is it the observability? Is it the security? People aren't trying to solve service mesh as a problem. What they're trying to solve is, okay, we need to do layer 7 routing and host tasks or something, or we need to do layer 7 network security. That's the actual problem you're trying to solve. Service mesh is this kind of nebulous term, like "somebody told me I needed it."
B
Yeah, I think I can probably give my perspective on that. I'm not an engineer by day anymore, but I work pretty hand in hand with pretty advanced engineers on a specific problem. When I was looking at what we needed to achieve, I did a bit of running around to understand, okay, what are the bits of the network stack that we need to think about here? Service mesh just kept popping up. So "Okay guys, do we have a service mesh?" was kind of my first question, just to get a sense of whether we have a concept of this thing or not. And then I got my answer and we moved on from there. So I guess it's sort of a catch-all term to that end, and maybe that's why people look it up so much on the website. But that's just an anecdote, I guess.
C
Yeah, I think you're exactly right. As we saw this transition to the cloud native world, there were a lot of new problems. As we have a bunch of microservices, there are new networking problems and new network security problems, and service mesh was the label for solving a lot of those problems: okay, we want a service mesh to solve the specific problems that we have. But how Cilium Service Mesh came around is that people started asking us about service mesh, and we looked into it and we were like, okay, what does a service mesh actually look like? And with what we had so far, we actually had like 80% of a service mesh, because it's a lot of networking, network security, and network observability parts. The only part we were missing was a bit of the layer 7 stuff. So we're not actually building a whole service mesh from scratch. We're just adding that last 20% that people are looking for. And that's how we originally came out with what was Cilium Service Mesh.
B
Got it. So let's maybe move on from service Mesh. That's been, I think, super helpful and I'm sure there's a lot of people that will sort of look at the term service mesh differently now.
C
Sorry if I'm destroying people's hopes and dreams of one thing to solve it all.
B
Exactly. That's all we're looking for, right? We're always looking for a term that just solves all our problems. Cilium also does observability to my understanding through I guess, a sort of arm of the product called Hubble. Could you maybe talk to us a bit about Hubble?
C
Yeah, definitely. So this, like everything else in the Cilium ecosystem, is based on eBPF. What Hubble does is, okay, since our eBPF programs are in the kernel routing all the traffic and we can see all of that, what if we just took that information and surfaced it to the user? To be honest with you, I think Hubble is the favorite feature of basically every user I talk to. There's the quote from ESnet, the Energy Sciences Network, which is all the national laboratories in the US. They're doing crazy stuff like IPv6-only Kubernetes clusters. And it's like, "Hubble's a godsend. What used to take multiple days of engineering time, I can now solve in 30 seconds." And so going back, if we think about it, in distributed computing packets are flying everywhere, and we need to be able to understand that. It's not just, okay, let me debug my one application, I can follow it through the whole program. Applications are making calls out to different programs, and it's going over the network. We don't know where the packets are going, we don't know where the information is going, we don't know where things are being dropped. And so this is where Hubble came along. It's like, okay, if we're already routing all the packets with eBPF, why don't we actually observe them? So Hubble piggybacks on top of the Cilium CNI and basically takes the information from the CNI and surfaces it to the user in a couple of different ways. One is network flow logs, which basically say: here are all the packets going through, and this is where they're going. The other one is the Hubble UI. This lets you create a service map of everything that's going on in your cluster, and you can see where things are being routed, where different things are connecting, and, I think more importantly, where traffic is being dropped. Because to be honest, that's when most people run into networking.
Everybody likes to not have to think about networking. The only time they do is when things are going wrong. With Hubble, they're able to very easily visualize, either through the UI or through the flow logs, where traffic is being dropped. Because that's probably what most people are concerned about. The security team is concerned about where things are going that they shouldn't be, but most developers on a day-to-day basis are saying, why isn't my traffic reaching the destination? And Hubble's a great way to understand that. It can give you the reasons for the policy drops. You can be like, okay, our security team wrote all these new network policies and deployed them into the cluster, and now all of my traffic is being blocked because they didn't want this type of traffic going in the cluster anymore. And then you can go have that conversation. Or you deploy a new service: why isn't any traffic reaching it? Well, okay, it's in a new namespace, we have a default-deny network policy in our cluster, and we forgot to write one for it. Okay, we should allow traffic to this new namespace. So people love Hubble because it gives you insight into where all the network packets are going in your cluster in a very easy way.
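To make the default-deny scenario above concrete, here is a minimal sketch in Python. It is a toy model rather than Hubble's or Cilium's actual behavior: the `Flow` record shape, the `ALLOW_RULES` set, and the namespace names are all assumptions made up for illustration. It shows why traffic to a new namespace is dropped until someone writes a rule for it, and how a per-flow verdict with a reason makes that easy to spot.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A simplified flow record (hypothetical shape, not Hubble's real schema)."""
    src_ns: str
    dst_ns: str
    dst_port: int


# Allow rules as (source namespace, destination namespace, port).
# Under default deny, anything not listed here is dropped.
ALLOW_RULES = {
    ("frontend", "backend", 8080),
    ("backend", "database", 5432),
}


def verdict(flow: Flow) -> tuple[str, str]:
    """Return (verdict, reason) for a flow under a default-deny policy."""
    if (flow.src_ns, flow.dst_ns, flow.dst_port) in ALLOW_RULES:
        return ("FORWARDED", "matched allow rule")
    reason = f"no policy allows {flow.src_ns} -> {flow.dst_ns}:{flow.dst_port}"
    return ("DROPPED", reason)


def summarize_drops(flows: list[Flow]) -> Counter:
    """Count dropped flows per (source, destination) namespace pair."""
    drops: Counter = Counter()
    for flow in flows:
        if verdict(flow)[0] == "DROPPED":
            drops[(flow.src_ns, flow.dst_ns)] += 1
    return drops


if __name__ == "__main__":
    observed = [
        Flow("frontend", "backend", 8080),
        Flow("frontend", "payments", 443),  # new namespace, no rule written yet
        Flow("frontend", "payments", 443),
    ]
    for flow in observed:
        print(flow, *verdict(flow))
    print(summarize_drops(observed))
```

Adding `("frontend", "payments", 443)` to `ALLOW_RULES` is the toy equivalent of writing the missing network policy for the new namespace.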
B
How would this be done, I guess, without Hubble? You'd just have to kind of roll your own, or...
C
So Linux is decades-old technology. There are a lot of networking tools for understanding where things are going, like tcpdump, but that's fine if you're using the networking stack. And the problem with eBPF is there's this, sometimes people say, "eBPF magic" that people sprinkle on. It's like this black magic happening in the kernel, and the packet just disappears because it's routed from one socket to the other and doesn't go through all the traditional tooling. So as you switch how you're doing things, you also need to come up with new tools. Hubble is one of them. There's also another project under Cilium called pwru, like "packet, where are you?", and that's another great debugging tool that lets you pull more information out of the kernel. One of the reasons people love eBPF is that you can hook anywhere in the kernel. You can pull any system information that you want, you can modify any system information, you can essentially pull out this fire hose of data. The limitation of some of the previous tooling is that either it doesn't pull out the information you need, or it's designed in a certain way: this is what it does. But eBPF lets you look for anything you want. You want to pull a new type of information out of the kernel? Well, you can write a program to do that. And so what pwru and Hubble let you do is surface exactly the information you want, rather than relying on tooling that may not even work.
B
Yeah, I think that's a helpful callout, given that, as you say, eBPF could be seen as a bit of magic, and then, unfortunately, with the magic often... Yeah, I mean, my long-ago version of that was when you start using Ruby on Rails, for example, and then it's like, yeah, but we need tools to actually understand what's going on, because there's just a whole layer of magic passing data between the back end and the front end, and I need to understand what's going on there.
C
Exactly. So you can think about it as new tooling for the new world.
B
So we've kind of looked at what Cilium does and why developers might want to look at it or why they're using it already. Let's maybe just spend a little time going a bit deeper on how Cilium actually works under the hood. So we've got the idea of what is a Cilium data path, for example, and then we can maybe jump into just the actual component architecture after that. So I believe there's sort of a concept of BPF hooks. Maybe we could kind of start there and just sort of how does Cilium actually work?
C
I guess so. The basic architecture for Cilium is that there's a Cilium operator running in your cluster, and this runs the lifecycle for all the Cilium agents. The Cilium agents run as a DaemonSet, so there's one Cilium agent running on each of the nodes in your Kubernetes cluster. What the Cilium agent does is install all the eBPF programs onto that specific node. So the agent gets information, basically writes all the BPF programs, and installs them into the cluster. The interesting thing about this architecture is that in some ways it simplifies a lot of the networking and upgrade lifecycle, because the data plane, which is the BPF programs running in the kernel, is separated from the lifecycle of the control plane, which is the agent and the operator running in the cluster. One thing I didn't touch on before with eBPF programs is that you install them into the cluster and they start running, and you can uninstall them. And there doesn't have to be any communication between kernel space and user space, between the control plane and the data plane. So the agent is installed on the node and it installs all the BPF programs. The BPF programs start routing all the networking packets, or, with Hubble, they're observing all the networking packets. And with that, if the agent goes down, it doesn't matter, because all the programs are pre-installed and they're still going to be routing the packets. The only thing that won't happen is that you can't update any of the data path, because the agent's not there anymore. So what you can do is update the agent, the new agent comes up, and it can then modify the eBPF programs. So that's all the BPF stuff. It's a little bit different with Envoy.
Some of the layer 7 stuff, which I guess I didn't touch on yet, is done with Envoy, which is a very popular service proxy. It's what a lot of the other service meshes are based on. Envoy is also run as a DaemonSet, so there's one Envoy on every single node in the cluster. It works with the Cilium agent and the BPF programs to do some of the layer 7 based routing. So the BPF programs and Envoy together create the whole networking story, and they work in concert with each other. Layer 7 is a little bit different, and this is something they're working on right now. Some layer 7 connections are more long-lived, so if you restart the Envoy on the node, it resets some of the connections. But now they're working on hot restart of Envoy, so that story's changing a little bit. But I guess what's different about the Cilium architecture is that the agent is separate from the BPF programs running in the kernel, so the data path can keep on running even as things are changing on the control plane.
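The separation of control plane and data plane described above can be sketched as a toy model in Python. This is only an illustration of the idea, not Cilium code: `DataPath` stands in for the BPF programs and their state in the kernel, `Agent` stands in for the per-node agent, and every name and address here is hypothetical.

```python
from __future__ import annotations


class DataPath:
    """Stands in for BPF programs and their state loaded in the kernel (toy model)."""

    def __init__(self) -> None:
        self.routes: dict[str, str] = {}  # destination service -> backend address

    def forward(self, service: str) -> str:
        # Forwarding only consults state that is already installed;
        # the agent is not involved in this path at all.
        return self.routes.get(service, "DROP")


class Agent:
    """Stands in for the per-node agent that programs the data path."""

    def __init__(self, datapath: DataPath) -> None:
        self.datapath = datapath
        self.running = True

    def install_route(self, service: str, backend: str) -> None:
        if not self.running:
            raise RuntimeError("agent is down: data path cannot be updated")
        self.datapath.routes[service] = backend


if __name__ == "__main__":
    kernel = DataPath()
    agent = Agent(kernel)
    agent.install_route("checkout", "10.0.0.5")

    agent.running = False              # the agent crashes or is being upgraded
    print(kernel.forward("checkout"))  # traffic keeps flowing on installed state

    upgraded = Agent(kernel)           # a new agent attaches to the same data path
    upgraded.install_route("checkout", "10.0.0.9")
    print(kernel.forward("checkout"))
```

The design point the model captures is that `forward` never touches the agent, so an agent outage only blocks updates, not packet handling.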
B
So then we've got the Cilium agent. I guess there's also the Cilium operator as well.
C
Yeah, the operator manages the lifecycle of all the agents in the cluster.
B
Gotcha. And then the CNI plugin and then identity management, which we've obviously touched on a little bit as well. So in terms of getting up and running with Cilium, maybe we could take sort of two brief examples. One is like total greenfield, which is obviously hopefully the easier one. And then one is you've already got a Kubernetes cluster kind of running something medium advanced. So maybe we take those kind of two cases. Like what are we talking in terms of getting up and running?
C
So you have a brand new Kubernetes cluster. There are a couple of different ways to install Cilium. The first and easiest one is that you're on a cloud provider, using something like GKE or AKS, and Cilium's already the default on that cluster. The cloud provider sets up your cluster, you get Cilium, and you're off to the races. It's already in there. I think that's one of the cool things: Cilium is already the default CNI for a lot of managed Kubernetes clusters. The next one is that you're setting up your own Kubernetes cluster, and this is a super common use case for on-prem or other things. Cilium has a couple of different tools. I think most often people install Cilium with a Helm chart, so you install that with Helm into your cluster. You can also install it with the Cilium CLI, but this is maybe not as recommended, because then it's a question of what you pass into the CLI and trying to remember that, versus a Helm chart. So you're probably going to see most people install Cilium with the Cilium Helm chart. And then the migration story is something that we see commonly. When I look at the way people usually set up a Kubernetes cluster, it's like what I was saying before: nobody wants to care about the network until they have to care about it. They might not turn to Cilium until they come across one of the problems that Cilium helps solve, like I was saying before: performance and scalability, network policy or encryption, the observability aspect, or multi-cluster networking. Because if you're running on something like OpenShift, OpenShift has its own CNI that it installs into its Kubernetes clusters, or AWS has the AWS VPC CNI, so there's already one installed in there. Or, for instance, a lot of tutorials start with Flannel.
So you already have a Kubernetes cluster running with a different CNI in there, and you run into one of these challenges. You're like, okay, I need better performance and scalability in my cluster. Our security team says we need these layer 7 network policies. Our application developers are having a hard time debugging the cluster because they don't have any insight into where the traffic's actually going. And we're thinking about spinning up some more Kubernetes clusters. And you're like, okay, well, Cilium solves a lot of these problems, so I think we should migrate to Cilium as our CNI. And the question becomes, okay, how do we do that? The cool thing about Cilium is that it's not like a big bang where you're like, okay, we need to switch this over. It actually allows a lot of incremental things. One common path I see people taking is, okay, we need better observability in our cluster. What you're able to do is CNI chaining. You have your first CNI, say Flannel, and you're like, okay, I need better observability, because it doesn't have any. So you're able to essentially install Cilium on top of Flannel. Flannel still does all the network routing, but Cilium is able to see all that networking and surface that information through Hubble. So you now have the network observability without having to change any of your data plane, and you also have the added benefit of having Cilium as a CNI in there. Or people install Cilium for network policy, where they don't want to change their data plane. There are actually a lot of companies that wrote their own in-house data planes. For example, Alibaba wrote their own CNI, but they wanted to add network policy, so they installed Cilium on top to do the network policy part. And so now you have two CNIs in there.
One's doing the networking, and one's doing the network observability or network policy. And Cilium has this flag called CiliumNodeConfig, which lets you specify, on each node in your cluster, which CNI you want to do the network routing. So at the time, you're able to say, okay, I want my original CNI to do all of the network routing. But what you can then do is drain traffic off of a node and say, okay, as we're adding new nodes to the cluster, we want this new node to use Cilium as the CNI for traffic routing. And so you can basically roll over your whole cluster node by node. Each node that comes online now has Cilium as its CNI, until eventually you've switched all your traffic over to the new CNI. Cilium's your CNI, and at that point you can uninstall the old CNI.
B
Wow. Yeah, that's really cool. The CNI chaining, that's super cool. Because, as you say, migration is probably the more likely case for a lot of people listening today that aren't using it already. But unfortunately that's also often the reason it doesn't get adopted as quickly, because it's often challenging.
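The node-by-node rollover described above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not how Cilium or Kubernetes implement it: the `Node` class, the `drain` helper, and the round-robin rescheduling are all hypothetical stand-ins for cordoning a node, draining it, and changing its per-node CNI setting.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    routing_cni: str  # which CNI handles pod routing on this node
    pods: list[str] = field(default_factory=list)


def drain(node: Node, others: list[Node]) -> None:
    """Move all pods off `node` onto the other nodes, round-robin."""
    if node.pods and not others:
        raise ValueError("cannot drain the only node in the cluster")
    for i, pod in enumerate(node.pods):
        others[i % len(others)].pods.append(pod)
    node.pods = []


def rolling_migrate(nodes: list[Node], new_cni: str) -> list[str]:
    """Switch one node at a time; return the names of the nodes that were switched."""
    switched = []
    for node in nodes:
        if node.routing_cni == new_cni:
            continue  # already migrated
        drain(node, [n for n in nodes if n is not node])
        node.routing_cni = new_cni  # per-node setting, changed while the node is empty
        switched.append(node.name)
    return switched


if __name__ == "__main__":
    cluster = [
        Node("node-a", "flannel", ["web-1", "web-2"]),
        Node("node-b", "flannel", ["api-1"]),
        Node("node-c", "cilium"),
    ]
    print(rolling_migrate(cluster, "cilium"))
    for n in cluster:
        print(n)
```

At every step some nodes are still routing with the old CNI and some with the new one, which is the incremental, no-big-bang property the conversation highlights.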
C
Yeah, nobody likes to mess around with the network.
B
Yes, quite.
C
Yeah, if it's working, just leave it, right? But if you start to run into some of these problems, you're looking at a migration story, and a lot of people want to do live migrations. If you go on the Cilium website, there are actually quite a few stories. The most recent one was from DB Schenker, which is like the German national rail logistics thing. They needed to migrate over to Cilium, and they used CiliumNodeConfig and were able to do a live migration to Cilium.
B
Exactly. That's infrastructure at some of the most important levels. That's pretty cool. So kind of looking at, as we touched on right at the beginning, Cilium is a very open source project. And again, looking at just some of the stats from the report that you guys just put out, there's a certain number, yeah, 1,000 individual contributors. You crossed that line basically on the project in October 2025. So that's a huge number. That's a huge number of contributors. This is surely one of the, I guess, largest open source projects out there.
C
Yeah, I think this is kind of interesting for me, because if you're looking at other vanity metrics like GitHub stars and things like that, it's like, okay, how do you measure the size and success of a project? For a lot of the people I'm talking to, Cilium is run by the platform team, and it's four engineers supporting 200 developers. If you look at a lot of other projects, they're going to be a lot bigger, but that's just because the number of people actually interacting with them is a lot bigger, like a front-end framework or something like that. It's going to have 200 developers for one SRE supporting those 200 people. And so it's kind of wild to me that Cilium has grown into such a large project. There are a lot more developers who can write HTML than can write BPF code, so 1,000 contributors is actually quite a lot in some ways. In terms of stats, depending on how you look at it, Cilium is in the top three projects in the Cloud Native Computing Foundation. So of 200-some projects, these are the largest ones in the cloud native space. Kubernetes, obviously, is number one, because it's also the second largest open source project in the world; number two; and number three is Cilium. So, yeah, one of the fastest moving projects in the whole CNCF ecosystem.
B
Yeah, I think it said in the report, now the second largest. Cilium is now the second largest. That's pretty awesome. So, I guess on that note, in terms of community and contributions, is it usually someone who's already using it through their company who ends up jumping in and making a pull request for something they've seen? Or do you also just have die-hard Cilium fans that work on this?
C
Yeah, I would say the most common thing is kind of like what I was saying before: the number of people in the world who can write eBPF code is not that large. The Cilium agent is written in Go, so it's a little bit different there. I think it's a bit more approachable. But Cilium is a pretty deep networking technology, or technology as a whole. So the most common case is: we're running Cilium in our production cluster, we're running into this issue or bug, and it's open source, so we're contributing this fix because nobody else in the community is working on it and we really need it. That's actually how some of the earliest maintainers of the project came along. Some of the earliest ones were from Palantir and Datadog. Because they were running Cilium in production, they needed things solved, and the easiest way to do that is to upstream the changes into the project. They got more and more involved and became maintainers of the project. And in the exact same way, two of the other big companies, like I said, Google and Microsoft, use Cilium as the data plane in their managed Kubernetes clusters, so they need to get involved to be able to upstream their changes. So it's people running Cilium in production who are trying to solve the issues that they have. That's really how they get involved.
B
Looking ahead, when it's open source, "roadmap" can be a bit of an ephemeral term. But 2025 looks like it was a pretty big year. What do you think is on the horizon through 2026?
C
I think there are a couple of different things. One is that the Kubernetes world is really starting to see this transition to IPv6. I was just at CiliumCon in Atlanta, right next to KubeCon, and there were actually two talks, from ESnet and from TikTok, about using IPv6-only Kubernetes clusters. I think maybe this is finally the year of IPv6. I think we're really starting to get there, and a lot of the work done this year by the project was around how we can bring IPv6 up to feature parity with IPv4. There's a lot of work being done there, because we're actually starting to see IPv6-only clusters going into production, and not only that, going into production at scale. If you look at the scale of the national labs infrastructure in the US, it's pretty big. TikTok is also quite big. Very different use cases. So that one is in Kubernetes itself. The next one, which I didn't touch on that much: I guess everybody's kind of aware of the whole VMware thing, and so people are looking to migrate off of that. I think that plays into Cilium in a couple of different ways. One is, how can we bring VMs into Kubernetes? The other is, how can we connect Kubernetes, or the cloud native world, with the rest of our IT estate sitting in virtual machines outside of that? If we're migrating and modernizing, how do we still connect to what we already have? So there are two pieces there. One is how we run VMs in Kubernetes, and KubeVirt, I think, is what a lot of people are using. One thing I'm really excited about that Cilium's coming out with is this thing called netkit. It was originally developed to solve container networking overhead. The problem with containers is that a container is a process running on the host in its own network namespace, and traversing into the network namespace creates some overhead.
And what netkit lets you do is basically take something off of the NIC and put it into the container with essentially no overhead, eliminating the networking overhead of containers. Since KubeVirt is kind of like a VM running in a container in Kubernetes, what we're doing now is asking, can we use netkit to get the packet into the VM directly from the NIC? That eliminates the overhead of not just the container, but of the VM running inside the container running inside the host. So it's once again, how can we reprogram the networking stack to make it faster, more efficient, more scalable? As we add more layers of abstraction, how can we keep those abstractions thin so that we can still get the performance we need out of our clusters? So that's running VMs inside Kubernetes. The second part is connecting to the outside world. We're doing a lot of work at Isovalent on getting Cilium to connect to the outside world: VMs running outside of your cluster, or VMs that you're trying to migrate into your Kubernetes cluster in a seamless way. It's once again, how do we make that transition, that migration story, as easy as possible, and do that in a very smart way with eBPF.
B
Awesome. Very cool. So netkit, is that going to be available in 2026?
C
No, that's actually out now as part of Cilium. So this is another thing: it was made by Daniel Borkmann, who's also one of the co-creators of eBPF. The next project he worked on was netkit, and that was a Linux kernel feature, so it's actually part of the Linux kernel. Cilium uses netkit to eliminate that container networking overhead, and we're also implementing it to work with KubeVirt too. So the whole part for containers is already in the kernel, and it's in Cilium if you're running the right versions. And then the part with VMs is kind of for next year.
B
Awesome. So just as we wrap up, where's best for a developer or just someone who's kind of interested in Cilium and maybe thinking about, I don't want to say throwing over the fence to the developers, but throwing it into the mix like where's best to go and just sort of get acquainted with Cilium.
C
Yeah, so I'm extremely biased because I'm a maintainer of the website, so I would say go to cilium.io first. I think there are a lot of helpful resources there for the different things I've talked about. So if you're like, okay, I'm interested in this kube-proxy replacement or other things like that, we have pages for all the different features. I want to learn more about Cilium as a CNI, about kube-proxy replacement, about BGP or Cluster Mesh or host firewall, I want to learn more about Hubble: you can do that. We've also created different pages for different industries, talking to the challenges of a cloud provider or a consulting company or financial services. So it's not just, here are the features. It's, here are these features and how they apply to your actual industry. And then the last part we have is these outcome pages, because once again, you have these features, but companies aren't buying a service mesh, they're buying a specific outcome: how do we do layer 7 routing, how do we do zero trust networking, how do we do network automation, how do we consolidate our networking tools? So it's looking at those outcomes, moving up the stack from "here's the feature" to "here's the business value we're getting and how we do it for each industry." But if you actually want to get hands-on, what I recommend is going to the getting started section and going to the labs. There are a lot of hands-on labs around Cilium. You don't even have to set up your own Kubernetes cluster. It's set up for you, Cilium is installed, and the lab walks you through the different features. A basic one is, okay, how do we install Cilium? And it'll walk you through that. The next one is, okay, I need to do network policy. How does Cilium network policy work? And there's a really famous Star Wars demo.
It's like, how do we blow up the Death Star or how do we protect the Death Star? So that's like a fun lab and it's different, like hands on labs. You're actually in a Kubernetes cluster and walking you through these different features, how they work and how do you actually apply them to your cluster. Yes, those are all great sources. Obviously there's always GitHub, we have a Slack channel if you want to jump in there too. But I would recommend if you want to get hands on with Cilium, go to the labs. I know I like to actually be in a cluster and be able to do things or just read through some of the stuff. And then there's also the documentation pages there too.
B
Awesome. Sounds pretty fully featured on that front. Cool. Well, Bill, awesome to have you on today. Thanks a lot for coming on. As we were talking about before recording, you've been doing a lot of traveling, and you've managed to find a slot in that schedule for this, so I really appreciate it. No doubt we'll be following along, and maybe we'll catch up again in a couple of years or something.
C
Yeah, that'd be great.
B
All right, thanks a lot.
C
Yeah, thanks for having me.
This episode explores how eBPF (extended Berkeley Packet Filter) is transforming Linux kernel development and enabling cloud-native networking with Cilium, one of the most widely adopted Kubernetes networking projects. With Bill Mulligan, maintainer at Isovalent (the creators of Cilium), the discussion delves into the origins and technical innovations of eBPF and Cilium, compares classic and modern networking approaches, outlines Cilium’s feature set, community impact, and looks ahead at the trajectory of cloud-native networking.
Notable Quote:
"What eBPF allows you to do is to actually add functionality on the fly into the Linux kernel."
— Bill Mulligan [08:50]
Notable Quote:
"Cilium switched the whole networking model from this IP-based model to this identity-based model...so as we kind of rotate the containers behind these labels, it doesn’t actually matter."
— Bill Mulligan [17:22]
Notable Quote:
"Kubernetes gives you basic network policy. And Cilium allows you to do much more advanced use cases around network policy."
— Bill Mulligan [26:19]
Notable Quote:
"People aren’t trying to solve 'Service Mesh' as a problem. What they’re trying to solve is, okay, we need to do layer 7 routing…layer 7 network security. That’s the actual problem you’re trying to solve."
— Bill Mulligan [31:19]
Notable Quote:
"Hubble’s a godsend. It lets me — what used to take multiple days of engineering time, I can now solve in 30 seconds."
— Bill Mulligan quoting ESNet [34:13]
Notable Quote:
"The cool thing about Cilium is it’s not like a big bang...it actually allows a lot of incremental things."
— Bill Mulligan [44:44]
On open source growth:
“Cilium is, depending on how you look at it, in the top three projects in the Cloud Native Computing Foundation...Now the second largest.” — Bill Mulligan [49:00]
The conversation highlighted how cloud-native networking is undergoing a revolution driven by programmable kernels, eBPF, and flexible solutions like Cilium. The project’s identity-based approach, superb observability with Hubble, and migration-friendly architecture position it as both a pragmatic and forward-thinking solution for Kubernetes networking, with a future keenly focused on further performance, scalability, and seamless interoperability across infrastructure layers.
For anyone seeking to understand or adopt state-of-the-art Kubernetes networking, this episode offers clarity—bridging foundational tech, real user stories, and an eye toward the future.