
Chen Tang
ByteDance is a global technology company operating a wide range of content platforms around the world and is best known for creating TikTok. The company operates at a massive scale, which naturally presents challenges in ensuring performance and stability across its data centers. It has over a million servers running containerized applications, and this required the company to find a networking solution that could handle high throughput while maintaining stability. eBPF is a technology for dynamically and safely reprogramming the Linux kernel. ByteDance leveraged eBPF to successfully implement a decentralized networking solution that improved efficiency, scalability, and performance. Chen Tang is an engineer at ByteDance, where he worked on redesigning the company's container networking stack using eBPF. In this episode, Chen joins the show with Kevin Ball to talk about eBPF, the problems it solves, and how it was used at ByteDance. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc.
Kevin Ball
Chen, welcome to the show. Hi Kevin, I'm excited to get to talk to you.
Chen Tang
Yeah, me too.
Kevin Ball
Let's maybe start with you introducing yourself. Just tell me a little bit about your background and how you got involved in doing cloud native stuff and networking.
Chen Tang
Okay. My name is Chen, and I'm currently a software engineer at ByteDance. My job focuses on the networking parts of our data centers: keeping every service in our data centers running, especially those deployed in containers, and ensuring their connectivity and stability. In this field we use a lot of different technologies, both kernel technologies and hardware technologies, and in the kernel part we use eBPF. It's a customized program that you write and can run inside the kernel without loading a kernel module. I think eBPF has developed rapidly over the past decade, and now it's very popular. You can use it for more than just networking; you can do almost anything with eBPF. You just write your program, have it run inside the kernel, and you don't have to be afraid that your code might jeopardize the entire system.
Kevin Ball
I think that's worth digging into, especially for older developers like me. I remember that when I started, getting anything into the kernel was this months- or years-long process where you would go back and forth on email and all these different things. But eBPF, as I understand it, lets you essentially run sandboxed code, similar to how you might run JavaScript in a browser. Is that a fair analogy?
Chen Tang
I'm not quite familiar with JavaScript, but yes, I think you are right. It's like running a virtual machine inside the kernel: the environment is prepared for you, and you just wrap up your code, your program, and get it to run inside the system.
Kevin Ball
That is really cool. So before we dive into some of the specific ways that you're using it, maybe we could talk a little bit more about eBPF. So what does the programming environment look like for these virtual machines?
Chen Tang
First, BPF means Berkeley Packet Filter. It is a technology developed at Berkeley, and it was first used to filter packets inside the kernel. Then people found that this mechanism for doing something inside the kernel could be used in different ways, so they expanded the whole technology, and we call it eBPF. The "e" means extended: extended Berkeley Packet Filter. We can do more than just filter packets; we can run different code inside the kernel. You can filter packets, you can monitor the entire system, and you can trigger your custom functions when somebody loads a file. You can do basically a lot of things. And the interesting thing is that when we talk about the kernel, we know it is a big system and it is fragile: if you do something wrong, you will probably blow up the entire system. But eBPF provides you with a mechanism that checks every line of your code and makes sure your code can run safely inside the kernel. So this gives developers a powerful tool. They don't need to worry about the checks themselves; the verifier handles everything, and they just run their code inside the kernel. If the verifier finds that your code is not safe, that it might blow up the system, it will tell you that you cannot load it into the kernel. But once you pass the check, you don't need to be afraid of that. So basically this is what eBPF is about.
Kevin Ball
Yeah, so to make sure that I understand: it essentially provides a set of APIs or hooks into the kernel that are more stable than kernel internals. So you can hook into the networking stack, you can hook into system calls, things like that. And on top of that it does a static analysis, a verification up front, to make sure that this is going to be safe to run.
Chen Tang
Yes, exactly.
Kevin Ball
That makes a ton of sense. So I think this is super cool, because there are a lot of times when you need to get down into the kernel, and I know that when writing application code, crossing the kernel interface, doing a system call, is expensive.
Chen Tang
Yes.
Kevin Ball
What was the motivator for you to start working in eBPF?
Chen Tang
Yes. Let's go back to the networking part. Why do we need eBPF in networking? The reason is simple, and it comes back to containers and cloud native. We run multiple containers on a server, and each container owns a unique network namespace, which means the network environment of the container is isolated from the host. So each container has its own IP address, and it probably has its own network interface. That causes a problem. If your application is inside the container and wants to send a packet to the outside world, there is a wall between the container and the physical NIC on the host, and you need something to connect the container and the host interface. That's where we use eBPF, and the mechanism is simple. Let's assume we run a custom eBPF program in our kernel stack. When we receive a packet, the eBPF program captures it. We analyze the packet, we see that this packet belongs to this container, and we send it to the container. That's it. That's why we need eBPF in container networking: for a very simple goal.
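The delivery step Chen describes, looking at an incoming packet and deciding which container it belongs to, can be sketched in a few lines. This is a plain Python model rather than eBPF code: the `Packet` type, the `CONTAINER_BY_IP` table, and the interface names are hypothetical, and a real program would run at a kernel hook and redirect the packet instead of returning a string.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: bytes = b""


# Hypothetical table: container IP -> host-side interface of that container.
CONTAINER_BY_IP: Dict[str, str] = {
    "10.0.0.2": "veth-app-a",
    "10.0.0.3": "veth-app-b",
}


def route_incoming(pkt: Packet, table: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the interface of the container that owns pkt.dst_ip.

    None means the packet is not for a local container and would be left
    to the normal host network stack.
    """
    if table is None:
        table = CONTAINER_BY_IP
    return table.get(pkt.dst_ip)


# Usage: a packet addressed to 10.0.0.2 is handed to the first container.
target = route_incoming(Packet(src_ip="192.0.2.10", dst_ip="10.0.0.2"))
```

The point of the sketch is only that the decision is a lookup keyed on packet headers; how the table is populated is outside what the conversation covers.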
Kevin Ball
Got it. So to make sure that I understand when you're in this container environment inside the container, it doesn't know it's in a container. It wants to treat its network like a network stack. But outside you can look at the kernel layer that's running all of these containers. You can say, hey, this packet actually is going to another machine or a container inside of my same virtual machine. There's no need to go through an expensive interrupt stack to go through an actual physical device. I know in software I can just route it straight to that other container.
Chen Tang
Exactly. It's a lightweight virtual machine. You can think of a container as just a lightweight virtual machine.
Kevin Ball
Yeah, that's really interesting. So what were you doing before and how did you kind of realize you needed to do something like that?
Chen Tang
Let's go back to network virtualization, to what we used before we ran eBPF for container networking. We still needed network virtualization, like for virtual machines, right? We have the virtual switch. I'm not sure if you're familiar with virtual switches. It's kind of like a switch running in software on your node: when you receive a packet, the virtual switch captures the packet and determines where to send it out. These virtual switches need dedicated cores to run on in order to achieve maximum performance. So it's heavy: the virtual switch is heavy and the virtual machine is heavy. That works for most cloud providers, because the virtual machine is a technology we've had for probably 20 years, and the container is something new. The defining feature of a container is that it's lightweight. We don't want those heavy things, but we want to achieve the same purpose. We want isolation, we want security, and we want resource allocation, to make sure the container is running in a safe environment. Say the container needs two cores and probably four gigabytes of RAM, and we need to ensure connectivity between the container and the outside world, but we don't want that virtual switch. So what do we do next? Let's go inside the kernel to see what the kernel can provide, and the kernel provides us with eBPF. But at the very beginning, eBPF did not function the way it does today, and there was still a lot of work to do. So many developers gathered together to think about how to make eBPF more powerful for container networking. It has been developed for five years, and now I think it's a mature technology for container networking. There are many popular open source projects aiming to use eBPF to serve container networking in every aspect: not just connectivity but also, for example, security. And they bring a lot of cloud native concepts into eBPF.
Kevin Ball
That's interesting. So actually, let's maybe walk through that journey a little bit. When you started working on this, you said, hey, we have these heavy virtual switches, we want to get rid of those. And you looked at eBPF and said it was not ready. What was missing, and what were the things that you needed to add or develop to get this to work for you?
Chen Tang
Yeah, so the thing is, you use eBPF, but fundamentally you are leveraging the capabilities of the kernel. eBPF helps expose some of the key functions of the kernel to the user. But at the very beginning, the functions exposed were not enough, so you would have these problems. The performance was not good enough. You could write a very simple program, but if you made the program more complicated, then, like I said before, we have a verifier, and the verifier would reject the program because it was too complicated for it to determine whether it was safe or not. So you would encounter many problems. People kept working on this and kept optimizing the entire system. Basically, this is what all of us eBPF programmers have been doing over these five years.
Kevin Ball
Yeah, to make sure I understand: they were, one, extending the surface area of what you could do, what the hooks were, things around that, and two, improving the static analysis verifier to be able to handle more complex cases and still prove them to be safe.
Chen Tang
Exactly.
Kevin Ball
Got it. Okay, so you mentioned pulling a lot of different cloud native tooling into eBPF. What are some of the different ways in which you are using this in your stack today?
Chen Tang
Yeah. What we are doing now focuses on two parts. The first part is networking. We have packet redirection, and we have to enforce different network security policies. eBPF works as the data plane: we receive policies from a remote controller, and eBPF helps implement those policies. The second part is kernel tracing. Like we said before, eBPF provides a lot of hook points inside the kernel, and we call them trace points. When a trace point is triggered, you can run your custom eBPF program to help you understand what is going on inside the kernel. For example, when a function is called and you suspect a bug has happened, you want to know what is going on inside that function. You can plant your eBPF program at this hook point, and when the event is triggered, eBPF will print everything you want, especially the context at that moment, so you can deeply understand what is going on inside the kernel. Most of the time it's a black box, but with the help of an eBPF program you can really understand the kernel.
Kevin Ball
It's almost like you can insert your observability stack down into the kernel and say, oh, we think there's something going on with this function. Print me the context on entry, print me the context on exit. How much overhead is there in this tracing?
Chen Tang
The kernel tracing? Actually, the cost is very high, so we don't use it often. We use it only when a bug is reported and we need to analyze the kernel. But packet filtering is different, because there are different kinds of trace points inside the kernel. For some trace points the cost is low, and for some it is high. Especially when you want to plant your own customized trace point inside the kernel, the cost is very high. But in the networking part, for the hook points (they are not really trace points), the cost of triggering them is low. So we use them for networking. We redirect many millions of packets, and the cost can be kept limited; let's say the cost is acceptable.
Kevin Ball
Yeah. Now what does it take to roll out? Like say you want to turn on tracing. Can you do that live on a running container? Do you need to restart the server? What does that lifecycle look like?
Chen Tang
Yes, and this is the attractive part. You can run eBPF kernel tracing against a live container. You don't need to restart anything. You just inject your code into the running system and you get everything you need. Once it's done, you just cancel it: you remove the program and the system goes back to normal.
Kevin Ball
That's wild to me that you can inject observability code down into your kernel on a live system with knowledge of safety, run it as long as you need it and just pull it out.
Chen Tang
Yes, exactly. That's why eBPF has become so popular today.
Kevin Ball
That's cool. So, diving in maybe a little bit, you mentioned also there's a set of open source projects that have been building and bringing these cloud native technologies into eBPF. Are you working on any of those, or are those areas that you're connected to?
Chen Tang
Yes, I've been working on some of the open source projects, but I'm mainly a user, because I work for a company. So my first purpose is to serve what the company wants me to do. I use a lot of open source tools, and when I find there are things I can modify, then yes, I'd like to contribute. It's kind of, how to say, what the community does, right?
Kevin Ball
Yes, absolutely. Which projects are you using?
Chen Tang
Since I work for the company first, my top priority is to help the company get the job done. During this process I use a lot of open source projects and I get help from the communities. In particular, there is an eBPF library called ebpf-go. It is now the most popular eBPF library for Golang developers, and I use this library to load eBPF programs into the kernel. I think this is the open source project I use the most.
Kevin Ball
That brings up kind of an interesting area, which is what the development environment for eBPF looks like, because you're compiling down, as I understand it, to essentially a bytecode that is what gets analyzed and loaded. So what is the environment that ebpf-go, for example, exposes to you?
Chen Tang
Okay, so most developers like me write a C program, and we want that C program to run inside the kernel. Once the C program is done, we first need to compile it into bytecode, and then we load it into the kernel. Before the program is accepted into the kernel, the verifier has to make sure all the code can run safely inside the kernel, and we make a syscall to load the bytecode into the kernel. After that, once the program is inside the kernel, it still does not do anything. The kernel is just storing the program for you. If you want it to run, you have to attach the program to a hook point. This is what those libraries help you with. You develop your program in C, but the library helps you do the load, the verification, the attachment, and everything related to that.
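The lifecycle described here (compile, verify on load, then attach to a hook before anything runs, and detach later) can be modeled with a toy. Everything below is hypothetical and written in Python for illustration: the two-field instruction format, the `ToyKernel` class, and the verifier's rules are stand-ins, and the real verifier performs far more analysis than these few checks.

```python
from typing import Dict, List, Tuple

Instruction = Tuple[str, int]  # (opcode, operand) in a made-up instruction set
MAX_INSTRUCTIONS = 16          # arbitrary toy limit, not the real verifier's


class VerifierError(Exception):
    """Raised when the toy verifier refuses a program."""


def verify(program: List[Instruction]) -> None:
    """Reject programs that this toy model cannot prove will terminate."""
    if not program or len(program) > MAX_INSTRUCTIONS:
        raise VerifierError("program is empty or too long")
    if program[-1][0] != "exit":
        raise VerifierError("program must end with exit")
    for pc, (op, arg) in enumerate(program):
        if op == "jump":
            if arg < 0:
                raise VerifierError(f"backward jump at {pc} could loop forever")
            if pc + 1 + arg >= len(program):
                raise VerifierError(f"jump at {pc} leaves the program")


class ToyKernel:
    def __init__(self) -> None:
        self.loaded: Dict[int, List[Instruction]] = {}
        self.hooks: Dict[str, int] = {}
        self._next_fd = 3

    def load(self, program: List[Instruction]) -> int:
        verify(program)              # loading fails if verification fails
        fd = self._next_fd
        self._next_fd += 1
        self.loaded[fd] = program    # stored, but not running yet
        return fd

    def attach(self, fd: int, hook: str) -> None:
        if fd not in self.loaded:
            raise KeyError(f"no loaded program with fd {fd}")
        self.hooks[hook] = fd        # from now on the hook triggers the program

    def detach(self, hook: str) -> None:
        self.hooks.pop(hook, None)   # the system goes back to normal

    def runs_on(self, hook: str) -> bool:
        return hook in self.hooks


kernel = ToyKernel()
fd = kernel.load([("mov", 1), ("exit", 0)])
kernel.attach(fd, "ingress")
kernel.detach("ingress")
```

The separation between `load` and `attach` mirrors the point made above: a loaded program is only stored until it is attached to a hook point.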
Kevin Ball
Yeah, got it. So the core eBPF programs are developed in C, but all of the different attach, detach, load, all of this stuff is what you're using ebpf-go to do.
Chen Tang
Exactly.
Kevin Ball
Awesome. I'd love to go back a little bit to the use in the network. What was the impact of moving from this big virtual switch approach to the eBPF networking stack?
Chen Tang
Okay, so that's a very interesting story, because it is related to the company's development. Let's take a look at companies like Meta or Amazon. Those companies started running their data centers probably 20 years ago. At that time, the technologies they had for virtualization were virtual machines and virtual switches, so this is what they used at the very beginning. And we have to understand one thing: if you migrate from one technology to another, especially if you have to upgrade an entire data center, the cost is huge. So they will probably keep using virtual machines and virtual switches for their data centers. But I work for ByteDance. I don't know if you know ByteDance, but I think you know TikTok, which is one of the apps developed by our company. ByteDance was founded in 2012, so only about 10 years ago. As ByteDance kept growing and we needed to build our data centers, cloud native technology emerged. So at that time we had a better choice for managing our data centers. That's why we use cloud native technologies with Kubernetes, and everything runs as a container in our data centers. I think this is luck, I don't know. At that time we had better options and we still had the chance to make a choice, and we chose Kubernetes and containers.
Kevin Ball
That's really an interesting point of how, coming later, you can often jump over the learnings that the earlier companies had to go through. Maybe we can expand a little bit now into some of these other cloud native areas. In what other places do you feel like going cloud native first has made a difference for ByteDance relative to, say, Amazon AWS or something like it?
Chen Tang
Okay, so first, let's make a statement about how companies embrace cloud native. Companies like Meta or Amazon run their data centers in a different way. When cloud native technologies emerged, of course everyone wanted to use them, but we use them for different purposes. Amazon, especially, is a cloud provider. They use cloud native technologies mainly as a service for their customers. They don't use them for themselves; they provide them. But we use them ourselves. Since we have Kubernetes running inside our own clusters, I think the biggest difference is the problem of scalability. Amazon offers cloud native as a service for its customers, and those customers mainly have small clusters, because they cannot afford to build their own data centers. That's why they want to buy machines from Amazon. So the scale of their clusters is, let's assume, 1,000 machines, and I think that's a lot. But we have over a million. So scalability is the major problem. We found that cloud native technology, although it's very powerful, has a fatal problem, which is scalability. If you run Kubernetes on a cluster with 1,000 machines, that's fine, and you have a powerful cluster management tool with basically everything you want. But if you have 100,000 machines, Kubernetes becomes a performance bottleneck. At that point, if you still want to use Kubernetes, you have to make a lot of modifications. Some of the concepts from the cloud native ecosystem are powerful but just not so efficient, and we have to optimize them. So I think this is the major difference between how ByteDance uses cloud native technologies and how others do.
Kevin Ball
Yeah, there's a scale factor there that really blows my mind. You mentioned finding progressive bottlenecks in Kubernetes and cloud native abstractions. As you scaled up to 100,000 and beyond, where were those, and what have you replaced or improved?
Chen Tang
Yes. For example, there is a very important concept in container networking that we call a service. When a client container wants to talk to a server container, it cannot just ask for the IP address of its destination and send the message to that destination IP address. Kubernetes builds a concept called a service. You ask for the service, and the service returns a target to you and says you can connect to this destination. But the total cost of running a service is huge, because a service has to store all of its backend containers and choose one of them as the real server. If you have 100 servers, that's okay. But if I have 100,000, the cost of running this service concept is huge and not acceptable. So we had to remove it, and we use a different framework, something like service discovery. We developed our own service discovery framework to help client containers discover their targets. So this is one difference. Also, we run our own data centers, and cost matters. In recent years, IT companies have not been doing so well. You see, there are layoffs everywhere. Cost has become a big problem for all those companies. They want to save money and buy fewer servers, because servers are the biggest cost for our company. So let's say you get a 1% improvement in the total cost of the machines. We have over a million servers, so that 1% is a lot. This is what drives us to seek new solutions to optimize the entire system. We know Kubernetes is powerful, and the cloud native concepts are powerful and very useful. But sometimes they are just not what we want the most at this very moment.
Kevin Ball
Yeah, so let me make sure I understand the service example. So in Kubernetes you have the service abstraction and it kind of assumes a global view is going to index all of the different servers or containers that might respond for this service type. And when a client asks, it looks up in its index who's free, sends them the target. That kind of assumes a bounded set that is small enough to perform quickly in that type of index. And when you scaled up large enough, you needed to say, oh, we actually need a much more contained version of service discovery that's not going to be trying to index 100,000 machines.
Chen Tang
Exactly.
Kevin Ball
That makes a lot of sense, I think, and points towards potentially a whole class of problems where Kubernetes is providing this nice abstraction where it's going to do all the management for you of all these different things, package it up in a single location, and that's going to run into scalability concerns. As you go far enough, maybe you bump out of a cache or memory or something like that. Were there other areas where you found you needed to almost scope down the abstraction exposed to be able to deal with that level of scale?
Chen Tang
Actually, I'm not an expert in this field, because we have a different team that manages resource orchestration, and they are the team that builds the Kubernetes systems. I'm focusing on the networking part. What I know about services is at the boundary of my knowledge.
Kevin Ball
Okay, looking within networking, then. We talked a bit about using eBPF to make a much more efficient networking stack within the physical machine. Are there other areas of the networking stack where being cloud native from the beginning has really made a difference?
Chen Tang
Yes, we have iptables, which is an older technology. At the very beginning, when container networking became a problem to be solved, people used iptables first. But iptables is slow, so we use eBPF to replace iptables. Now eBPF is what the majority use. iptables was there at the very beginning, and then it got replaced because of its poor performance.
Kevin Ball
So I understood the eBPF approach to the network device replacing the virtual switch. Instead you just drop into this kernel hook that looks in different places. What does it look like for iptables?
Chen Tang
iptables is also a function inside the kernel stack, but iptables works in a chain format. For example, you can write a chain of rules, each with a match and an action. A packet that matches one of these rules is processed with the action in its chain, and you can pass the packet to another chain. So the entire iptables system is just a bunch of chains. It's a chain system, and we can see that it is not efficient. Especially once the rules grow, the cost becomes unacceptable. But eBPF is a single program. You write your own program, and one program is enough. That's why people replace iptables and choose to use eBPF. And when it comes to the virtual switch, I remember you also asked how eBPF replaces the virtual switch. No, eBPF doesn't replace the virtual switch. They just work in different scenarios for different purposes.
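The scaling difference between walking a chain of match/action rules and doing one lookup can be made concrete with a small model. This is a sketch under assumptions: the conversation only says that eBPF is "a single program", so using a hash-table lookup as the replacement is an illustrative choice here, and the rule format and names below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str]  # (destination IP to match, action): a simplified rule


def chain_lookup(rules: List[Rule], dst_ip: str) -> Tuple[Optional[str], int]:
    """Walk the rules in order, as a chain does.

    Returns the action of the first matching rule and how many rules were
    examined to find it.
    """
    for examined, (match_ip, action) in enumerate(rules, start=1):
        if match_ip == dst_ip:
            return action, examined
    return None, len(rules)


def build_table(rules: List[Rule]) -> Dict[str, str]:
    """Index the same rules by destination IP; the first match still wins."""
    table: Dict[str, str] = {}
    for match_ip, action in rules:
        table.setdefault(match_ip, action)
    return table


def table_lookup(table: Dict[str, str], dst_ip: str) -> Optional[str]:
    """One hash lookup, regardless of how many rules exist."""
    return table.get(dst_ip)


# 10,000 rules, one per container IP.
rules = [(f"10.0.{i // 256}.{i % 256}", f"redirect:container-{i}")
         for i in range(10_000)]
table = build_table(rules)

last_ip = rules[-1][0]
chain_action, rules_examined = chain_lookup(rules, last_ip)
table_action = table_lookup(table, last_ip)
```

In this model, a packet that matches the last rule makes the chain examine every rule before it, while the table reaches the same action in a single lookup.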
Kevin Ball
Got it. Let's maybe go in on that, because I misunderstood. In which scenarios are you using the eBPF networking redirection versus the virtual switch?
Chen Tang
So why do we need both a virtual switch and eBPF? Because they serve different purposes. eBPF focuses on container networking. Why can eBPF be used in container networking? Because a container is a lightweight isolated environment. There is still only one system and one kernel. Even though containers run in an isolated environment, the kernel is actually the same as the host's. So eBPF just plays its tricks in the host kernel to redirect the packet from the physical NIC to the container. A virtual machine, however, is a fully isolated environment. It uses a different kernel than the host's, so it's impossible for the host kernel to talk to the guest machine directly. You need a different mechanism to redirect the packet to the guest machine, and that's why we use a virtual switch. So basically these two technologies serve different purposes.
Kevin Ball
To make sure that I understand it, the way that I'm seeing it right now, it's almost like there's layering. If you have containers within a single machine or virtual machine, you can route between those containers purely with eBPF. As soon as you start to go outside of that machine, you need to go over an actual physical NIC, or a virtual switch if you're going to a virtual machine.
Chen Tang
Yes.
Kevin Ball
Okay, that makes sense. I'm curious then, from a performance gains standpoint, how much of the traffic that you're directing stays within that single machine and is able to leverage eBPF, and how much still ends up going past? Like, what types of performance gains do you end up with on an aggregate basis?
Chen Tang
You cannot make a full comparison between eBPF and the virtual switch, because the performance varies with how you use them. The main advantage of eBPF is easy management. If you want to run a virtual machine, you have to do a lot of preparation for that, and you have to reserve a lot of resources just for the virtual machine, and those resources cannot then be allocated to other guests or to the host. With eBPF you don't have to consider that, because all eBPF programs run inside the same single kernel. All these things make it much easier to manage a container than a virtual machine. So I think this is the biggest advantage of using eBPF. But when it comes to comparing performance between eBPF and the virtual machine, it's hard to say which one is better. In the scenario where, let's say, the machine is fully loaded, I think the virtual machine can have better performance than eBPF. That is the fact.
Kevin Ball
Interesting. What are the scenarios in which you still want to have virtual machines? It feels like containers are the cloud native way to do it.
Chen Tang
Okay, so when do we want to use a virtual machine? Let's go back to AWS. As a cloud provider, you run virtual machines not for your own services but for your guests. People buy your virtual machines and run their own applications on them. So what they want is full isolation and security. That is the top priority. Even though they run their services on AWS, they don't want their data to be accessed by the cloud provider. So AWS still has to build a fully isolated environment for its customers, and that's why they still choose virtual machines. But at ByteDance we run the entire data center ourselves, and we run our own services on it. We don't have this requirement for security and isolation, so a lightweight method is enough. What we want to achieve is easy management.
Kevin Ball
That makes sense. Okay, what do you see as the next frontiers in this space? What are you working on for eBPF or within the networking stack that you think is taking this to the next level?
Chen Tang
Yes, I think what we are considering at this moment is still the cost. eBPF brings a lot of advantages, like easy management, but the cost of the kernel stack is still unavoidable. If you look into the kernel stack, there are multiple interrupts and memory copies. When we receive a packet from the NIC, we first have to copy the packet from the NIC to the kernel stack. After it goes through the kernel stack, we have to copy the packet from the kernel to the user application. This cost is unavoidable, and when we want to optimize the entire system, there is no way for us to ignore it. This is the bottleneck for eBPF. It's popular because it's easy to use, but if we want to save more resources, we have to optimize eBPF. That's why we have several solutions. The first is netkit, which we mentioned before. It helps us eliminate one interrupt when a packet is transmitted between the container and the host. It uses, let's say, a special mechanism to save that one interrupt, but that is not enough. So we ask the hardware for help. What we are doing next is combining eBPF with hardware offloading. We know the trade-off with eBPF: it is powerful, because we can write our own custom program in the kernel, but the cost is higher. With SmartNICs, we have hardware interfaces from different vendors that help us offload packet processing from the kernel to the hardware. The problem is that they are difficult to use. You cannot write your own program inside the hardware. You can only inject rules or policies, like the iptables rules I mentioned. So somehow we need to translate the eBPF program into hardware rules and load those rules into the hardware, using the SmartNIC to process the packets. This is what we are doing at this moment to, let's say, achieve the best performance.
Kevin Ball
So kind of curious there, is there a standard for how those rules are defined for the hardware offloading? Could you create a compiler, essentially, that takes your eBPF rules into them?
Chen Tang
It's predefined, yes. Let's see. If we write an EBPF program, everything can be defined by the code, so we can write whatever we want. But typical hardware offloading rules are just like iptables rules. You have a match, where you match the header of the packet, and you have an action. Match, action, match, action. You have multiple rules with different priorities, and each rule has a match field and an action. Then you need to write a different program to translate the EBPF program into rules that behave the same way. This is the difficult part. What we are doing now is combining these two technologies, because the rules cannot all be predefined. Predefining everything you want as rules up front is, we think, not possible. It's just too difficult. So we have a separation of the slow path and fast path. The fast path is hardware offloading. Once a packet misses the rules, it goes back to the kernel again and the EBPF program processes it. When the EBPF program decides that this packet is allowed to proceed, we inject a rule. The match of the rule is the header of this packet: the destination and source IP addresses and the destination and source ports. The action is, let's say, redirect to container A. We inject this rule into the hardware, and the hardware will recognize the following packets. This is what we are doing.
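The slow path and fast path split described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `HardwareTable`, `slow_path_decide`, and `handle_packet` are hypothetical names, the hardware table is modeled as an in-memory dictionary rather than any vendor's SmartNIC interface, and the slow-path decision is a simple lookup standing in for the EBPF program.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FlowKey:
    """Header fields used as the match part of a rule."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


@dataclass(frozen=True)
class Rule:
    match: FlowKey
    action: str  # for example "redirect:container-a"


class HardwareTable:
    """Hypothetical stand-in for a SmartNIC flow table (not a vendor API)."""

    def __init__(self) -> None:
        self.rules: Dict[FlowKey, Rule] = {}

    def lookup(self, key: FlowKey) -> Optional[Rule]:
        return self.rules.get(key)

    def inject(self, rule: Rule) -> None:
        self.rules[rule.match] = rule


def slow_path_decide(key: FlowKey, container_by_ip: Dict[str, str]) -> Optional[str]:
    """Stand-in for the EBPF program: choose an action, or None to drop."""
    container = container_by_ip.get(key.dst_ip)
    if container is None:
        return None
    return f"redirect:{container}"


def handle_packet(
    key: FlowKey, hw: HardwareTable, container_by_ip: Dict[str, str]
) -> Tuple[str, str]:
    """Return (path taken, action applied) for one packet."""
    rule = hw.lookup(key)
    if rule is not None:
        # Fast path: the hardware already has a matching rule.
        return ("fast", rule.action)
    # Slow path: the packet missed the rules and falls back to the kernel.
    action = slow_path_decide(key, container_by_ip)
    if action is None:
        return ("slow", "drop")
    # Allowed packet: inject a rule so following packets stay in hardware.
    hw.inject(Rule(match=key, action=action))
    return ("slow", action)
```

In this sketch only the first packet of an allowed flow takes the slow path. Dropped flows never get a rule, so they keep falling back to the kernel.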
Kevin Ball
That's interesting. So to make sure that I understand, you're in some ways treating the hardware as almost a caching layer of rules. Starting from a blank state, a packet comes in and you don't have a rule for it. It goes to the networking stack. Your EBPF code picks it up, analyzes it, and says, okay, here's where this needs to go, and furthermore, here's the rule that the hardware can use to do that fast next time. It loads that up into the hardware, which then, for subsequent packets with similar patterns, follows that rule and knows what to do.
Chen Tang
Exactly.
Kevin Ball
What's the sort of lifespan of those rules? Are they durable or is there like a timeout or how does that work?
Chen Tang
Every time a packet matches a rule, we have a counter. When a rule has been idle for about 30 seconds, we recycle it, because there's no way for the rules to delete themselves. We have to delete them. But once the rules are in the hardware, the kernel cannot capture the packets anymore, because the kernel has been bypassed. So how could we know that the session is over and we need to delete them? There's no way for us to know that. So what we are doing is running another program on the host. It periodically fetches all the rules from the hardware and analyzes them. If a rule has been idle for, say, 30 seconds or 1 minute (we have a timeout setting), we recycle the rule to make sure no rules leak inside the hardware.
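A minimal sketch of that periodic sweep is shown below, assuming the agent can read a per-rule hit counter from the hardware. The `sweep` function, the `RuleState` record, and the dictionary shapes are hypothetical and chosen only for illustration. A rule counts as idle when its counter has not changed between polls for at least the timeout.

```python
from dataclasses import dataclass
from typing import Dict, List

IDLE_TIMEOUT_SECONDS = 30.0  # the episode mentions roughly 30 seconds to 1 minute


@dataclass
class RuleState:
    last_count: int     # hit counter value seen at the previous poll
    last_active: float  # time at which the counter was last seen to change


def sweep(
    hw_counters: Dict[str, int],
    state: Dict[str, RuleState],
    now: float,
    timeout: float = IDLE_TIMEOUT_SECONDS,
) -> List[str]:
    """Return the ids of rules that should be deleted from the hardware.

    hw_counters maps rule id to the hit counter read from the hardware.
    state is the agent's bookkeeping between polls and is updated in place.
    """
    expired: List[str] = []
    for rule_id, count in hw_counters.items():
        seen = state.get(rule_id)
        if seen is None or count != seen.last_count:
            # New rule, or traffic since the last poll: mark it active now.
            state[rule_id] = RuleState(last_count=count, last_active=now)
        elif now - seen.last_active >= timeout:
            expired.append(rule_id)
    for rule_id in expired:
        del state[rule_id]
    # Forget rules that have already disappeared from the hardware.
    for rule_id in list(state):
        if rule_id not in hw_counters:
            del state[rule_id]
    return expired
```

The caller would run `sweep` on a timer and then issue the actual delete for each returned id. That step depends on the hardware interface and is left out here.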
Kevin Ball
That makes sense. That's cool. And you're able to get all the data you need from the hardware itself? Or does there need to be some sort of communication between the EBPF that sets them and the program that's clearing them?
Chen Tang
Well, for now we run another agent in user space to fetch all that data from the hardware. There is no way for EBPF to communicate with the hardware, since this part is still missing. Maybe in the future we can find a way for an EBPF program to talk to the hardware directly, but for now there's no way for us to do that.
Kevin Ball
How does it set the rules then, if it can't talk directly?
Chen Tang
We first let the EBPF program talk to our agent. There is a channel for the EBPF program to communicate with the host application. The agent analyzes the message from EBPF, translates the message into our hardware rules, and injects the rules into the hardware. This is what we're currently doing. So there is a two-way communication, from the host kernel to user space and from user space to the hardware, and the user space application also periodically fetches data from the hardware. It does look a bit ugly, but there's no better way for us now. In the future, I hope we can find a way for the EBPF program to talk to the hardware.
Kevin Ball
Honestly, that kind of makes sense though, because the agent needs to be somewhat durable. It needs to run every 30 seconds, check things, keep track of it. Whereas EBPF, as I understand it, is event driven, right? It's always happening. A thing comes in, we do it. So to make sure that I understand the whole thing: you have this durable agent that is responsible for keeping track of what rules are currently on the hardware and, when an EBPF rule triggers, translating it into a hardware rule. So a network packet comes in, misses the hardware's cached logic, and goes into EBPF. EBPF applies its logic, puts a rule on, and sends a message to your agent. Your agent then says, ah, here's a new rule, let me push that up into hardware.
Chen Tang
Yes, it's a bit complicated, right?
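The flow recapped above (EBPF emits a message, the agent translates it, the agent pushes a rule to the hardware) can be sketched as follows. Everything here is an assumption for illustration: a `queue.Queue` stands in for the kernel-to-user-space channel (the episode does not say which mechanism is used), `FlowEvent` and the rule dictionary layout are hypothetical, and `install` is a caller-supplied callback rather than a real hardware API.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FlowEvent:
    """Hypothetical message emitted by the EBPF program for an allowed flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    target: str  # for example "container-a"


def translate(event: FlowEvent) -> Dict[str, dict]:
    """Turn an event into a match/action rule (the layout is an assumption)."""
    return {
        "match": {
            "src_ip": event.src_ip,
            "dst_ip": event.dst_ip,
            "src_port": event.src_port,
            "dst_port": event.dst_port,
        },
        "action": {"type": "redirect", "target": event.target},
    }


class Agent:
    """User-space agent sitting between the EBPF program and the hardware."""

    def __init__(self, channel: "queue.Queue", install: Callable[[dict], None]) -> None:
        self.channel = channel  # stand-in for the kernel-to-user-space channel
        self.install = install  # stand-in for the hardware rule injection call

    def drain(self) -> int:
        """Process every pending event and return how many rules were pushed."""
        handled = 0
        while True:
            try:
                event = self.channel.get_nowait()
            except queue.Empty:
                return handled
            self.install(translate(event))
            handled += 1
```

In use, the agent would call `drain` from its main loop, alongside the periodic idle-rule cleanup discussed earlier in the conversation.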
Kevin Ball
It is, but I think it's quite clever. That's cool. What else are you doing in this space?
Chen Tang
Besides this, since we have rules offloaded in the hardware, we can now leverage the RDMA technique. I'm not sure if you're familiar with RDMA.
Kevin Ball
You essentially map user space memory to the NIC and it can send it directly.
Chen Tang
Yes, exactly. RDMA is a technology developed by a NIC manufacturer called Mellanox, which is now part of Nvidia. Since we are using EBPF to help us leverage hardware offloading, we can now use RDMA on top of this.
Kevin Ball
So let's talk real quickly about what that looks like. EBPF figures out where it needs to send the network packet. Does it know the memory that is mapped for the NIC to be able to go there? Or does that go through your agent? What is the flow?
Chen Tang
No, EBPF doesn't know anything about this. EBPF only knows about packet handling: whether to redirect the packet, pass it, or drop it. Once we translate the EBPF policy into a hardware rule and push it into the NIC, the subsequent packets can go directly through the hardware and bypass the kernel. At this point we use RDMA, because RDMA provides a different mechanism for packet transmission that mainly bypasses the kernel. Since the hardware knows where to route the packet, you can use RDMA to talk directly to the destination, and you don't need to involve the kernel anymore.
Kevin Ball
So, and it's been many years since I did anything with RDMA. If I understand, you need to give it not just a network address, but you need to actually give it the location in memory. Is that correct?
Chen Tang
Yes, this is what RDMA is doing. But let's say we have two physical machines running NICs that support RDMA. You can use RDMA directly, because the NICs know exactly where the destination is located. But in container networking, the NIC doesn't know at the very beginning. So this is the trick: you need EBPF to help the NIC identify the destination. Once you provide the connectivity for the NIC, you can leverage RDMA.
Kevin Ball
Got it, Got it, yeah.
Chen Tang
Because previously RDMA was used on host machines. All the applications were deployed on the host; they had never been deployed in containers. Once you have your application in a container, there's no way for the NIC to locate the destination. It doesn't know where the destination container is. But with the help of EBPF.
Kevin Ball
Yes, it works, that's clever. So when you do that sort of fallback, EBPF locates the destination and also sends this information to your agent, which can then load all of that back up into the hardware, and you can bypass the container boundary and do RDMA straight to a container. Yes, that is super cool. Well, I think we've covered a lot and I actually really like the example with the hardware because it connects not just this kernel piece, but shows how this can be a bridge between container technology and essentially the old world. Anything that was living outside of the container world didn't understand it. You can kind of intercept with EBPF, pass that data off to an agent that understands both sides, and make this sort of bridging connection.
Chen Tang
Yes, and you also reminded me of one thing I want to share, which is the relationship between the old world and the new world. The Linux kernel is very big and has been developed for decades, and for most people it's kind of old. Nowadays more technologies are emerging from open source communities and from hardware manufacturers. People say that the kernel is big and there's no way for us to change the kernel at will, but we can find a way to bypass it. So we are actually at a crossroads, right? What will we choose next: embracing the technology that bypasses the kernel, or the technology that still sticks with the kernel? We are facing a choice and I think it's a very interesting topic, and actually I don't have the answer yet, because we are still wondering if they can coexist in the future or they have to be enemies. I think it's a very interesting question. I'm just bringing it up, and I don't have answers for that.
Kevin Ball
It is a really interesting question, I feel like, and it's been around for years: how much can you do in user space? How much can you do without having to jump down into the kernel and face the performance costs of going back and forth across a system call? It does seem like EBPF is a nice middle ground where you can write your own custom code and dynamically load it without having to restart or do anything with your kernel. And yet it runs inside the kernel and has access to all the privileged information.
Chen Tang
Yes.
Podcast Summary: Software Engineering Daily – ByteDance’s Container Networking Stack with Chen Tang
Episode Overview
Title: ByteDance’s Container Networking Stack with Chen Tang
Host/Author: Software Engineering Daily
Release Date: July 1, 2025
In this episode of Software Engineering Daily, host Kevin Ball engages in an in-depth discussion with Chen Tang, a software engineer at ByteDance. They delve into ByteDance's innovative use of Extended Berkeley Packet Filter (eBPF) technology to overhaul the company's container networking stack. The conversation explores the challenges of operating at ByteDance's massive scale, the intricacies of eBPF, and the future of kernel vs. user-space networking solutions.
[00:00] Chen Tang:
Chen introduces ByteDance as a global technology leader behind platforms like TikTok, highlighting the immense scale of over a million servers running containerized applications. He emphasizes the necessity of a robust networking solution to maintain performance and stability across data centers.
[01:52] Chen Tang:
“We use a lot of different technologies, we using kernel technology and hardware technologies, but in the kernel part we use eBPF... you don’t need to be afraid that your code might jeopardize the entire system.”
[02:54] Kevin Ball:
Kevin compares eBPF to sandboxed environments like JavaScript in browsers, making it more accessible to developers by simplifying kernel interactions.
[03:20] Chen Tang:
Chen agrees, explaining eBPF as a virtual machine within the kernel that safely executes custom programs without the need for kernel modules.
[05:20] Kevin Ball:
He clarifies that eBPF provides stable APIs or hooks into the kernel, complemented by static analysis to ensure safety before execution.
[06:08] Chen Tang:
Chen discusses the challenges of container networking, especially with unique network namespaces and the need to efficiently route packets between containers and the host. He explains how eBPF enables a decentralized networking solution by capturing and analyzing packets to direct them appropriately.
[10:45] Chen Tang:
“As the functions exposed initially were not enough, we faced performance issues and complexity in writing eBPF programs. Over five years, the community has optimized eBPF to make it mature for container networking.”
[19:49] Kevin Ball:
Kevin highlights the unique scalability challenges ByteDance faces compared to companies like Amazon or Meta, which use Kubernetes primarily as a service provider, handling smaller clusters up to 1,000 machines.
[22:26] Chen Tang:
Chen explains that ByteDance manages over a million servers, necessitating custom solutions to overcome Kubernetes' scalability limitations. For instance, they developed their own service discovery framework to manage the massive scale efficiently.
[24:50] Kevin Ball:
He underscores how cloud native abstractions can become bottlenecks at large scales, prompting the need for tailored optimizations.
[27:29] Kevin Ball:
Kevin asks about replacing virtual switches and iptables with eBPF.
[27:42] Chen Tang:
Chen details that iptables, though integral to kernel networking, operate inefficiently with extensive rule chains. eBPF offers a more streamlined approach by allowing single, custom programs for packet handling, enhancing performance and manageability.
[28:45] Chen Tang:
He differentiates between using eBPF for container networking and virtual switches for virtual machines, explaining that eBPF is optimal for lightweight, isolated environments like containers, whereas virtual switches are necessary for the full isolation required by virtual machines.
Notable Quote:
Chen Tang:
“eBPF is a single program. You write your own program and one program will be enough. So that's why people replace iptables and choose to use eBPF.”
[33:25] Chen Tang:
Chen discusses the ongoing efforts to minimize kernel stack costs by integrating eBPF with hardware offloading technologies like SmartNICs. This combination aims to translate eBPF programs into hardware rules, thereby enhancing packet processing efficiency.
[37:57] Chen Tang:
He explains the current method of using an agent to manage rule translation and injection into hardware, acknowledging its complexity but recognizing it as a necessary step until more seamless integration solutions emerge.
[42:10] Chen Tang:
Chen elaborates on leveraging RDMA (Remote Direct Memory Access) to bypass the kernel entirely, using eBPF to direct packets directly to containers, thus further reducing latency and overhead.
Notable Quote:
Chen Tang:
“We have a separation of the slow path and fast path. The fast path is hardware offloading. And once the packet misses the rules and the packet will go back to the kernel again and eBPF program will process the packet.”
[45:55] Chen Tang:
Chen raises a thought-provoking point about the future coexistence of kernel-based and bypass technologies, pondering whether they will complement each other or become mutually exclusive.
[47:42] Chen Tang:
He acknowledges the potential of eBPF as a middle ground, offering dynamic, safe kernel modifications without the overhead traditionally associated with kernel programming.
Notable Quote:
Chen Tang:
“We are facing a choice and I think it's a very interesting topic and actually I don't have the answer yet because we are still wondering if they can coexist in the future or they have to be enemies.”
The conversation between Kevin Ball and Chen Tang offers a comprehensive look into how ByteDance leverages eBPF to address the complexities of networking at an unprecedented scale. From replacing traditional tools like iptables and virtual switches to integrating advanced hardware offloading and RDMA technologies, ByteDance exemplifies the cutting-edge application of eBPF in cloud-native environments. The discussion also touches on broader industry questions about the future interplay between kernel-based and user-space networking solutions, highlighting the ongoing evolution of software engineering practices in managing large-scale infrastructures.
Key Takeaways:
eBPF as a Catalyst for Networking Efficiency: ByteDance's implementation of eBPF allows for dynamic, safe, and efficient packet processing within the kernel, replacing older, less efficient tools.
Scalability Challenges: Managing over a million servers necessitates custom solutions beyond traditional Kubernetes deployments, particularly in service discovery and network management.
Integration with Hardware Technologies: Combining eBPF with hardware offloading and RDMA presents opportunities to further reduce latency and overhead, though it introduces complexity in rule management.
Future of Kernel vs. Bypass Technologies: The industry faces pivotal decisions on whether to continue enhancing kernel-based solutions like eBPF or to shift towards bypass technologies, with potential for both coexistence and competition.
Relevant Quotes:
Chen Tang: “You just inject your code inside the running system and you get everything you need. And once it's done and you just cancel it, you can remove the program and the system will become back to normal.” [14:38]
Chen Tang: “We have iptables that is old technologies being first used when cloud native actually at the very beginning, when container networking becomes a problem to be solved, people use iptables at first, but iptables, it's slow.” [26:56]
Chen Tang: “Because we use ourselves. So the difference is since we have the kubernetes running inside our clusters and I think the biggest difference is the problem of scalability.” [20:15]
This episode provides valuable insights for software engineers and technologists interested in large-scale cloud-native networking solutions, the practical applications of eBPF, and the future trajectory of kernel and user-space networking technologies.