Kubernetes at LinkedIn, with Ahmet Alp Balkan and Ronak Nathani - Kubernetes Podcast from Google


Kubernetes Podcast from Google: Detailed Summary

Episode Title: Kubernetes at LinkedIn, with Ahmet Alp Balkan and Ronak Nathani
Hosts: Abdel Sighiouar, Mofi Rahman
Guests: Ahmet Alp Balkan, Ronak Nathani
Release Date: March 25, 2025
Duration: Approximately 41 minutes

1. Introduction

The episode begins with hosts Abdel Sighiouar and Mofi Rahman introducing their guests, Ahmet Alp Balkan and Ronak Nathani, both software engineers on LinkedIn's Compute Infrastructure team. The discussion covers how LinkedIn runs Kubernetes at scale, the challenges the team faced, and the lessons learned while moving from a proprietary container orchestration system to Kubernetes.

2. Transitioning to Kubernetes at LinkedIn

Key Points:

Historical Context: LinkedIn wrote its own container runtime and scheduler roughly 10 to 11 years ago, before Docker 1.0 existed, so that it could bin-pack applications onto shared machines.
Reason for Transition: The proprietary stack scaled with the site, but it is aging: the marginal cost of each new feature grows more than linearly, while Kubernetes and the surrounding open source ecosystem have matured.
Current State: LinkedIn is moving the majority of its workloads, including stateless, stateful, and batch jobs, to Kubernetes. The migration is still in progress, and management would like it finished as soon as possible.

Notable Quote:

Ronak Nathani [02:39]:
"But over the last few years we realized that it's aging a little bit too. See the marginal cost of adding every new feature is increasing. ... with Kubernetes and other open source ecosystem becoming just way more mature, it just made sense for us to transition onto that path."

3. Running Stateful Workloads on Kubernetes

Key Points:

Running Databases: Despite common perceptions, LinkedIn successfully runs databases on Kubernetes by leveraging its flexibility and controlling the full stack from bare metal to configuration.
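Custom Protocols: LinkedIn wrote a generic stateful workload operator. Each stateful system implements a largely Kubernetes-agnostic protocol, exposed through an endpoint the team calls an application control manager (ACM), so no separate operator is needed per database. The protocol covers updates, scale-out, scale-in, and instance swaps.
Maintenance Lifecycle: Maintenance is coordinated with the stateful systems. Instead of evicting a pod after a grace period, the platform asks whether the system can tolerate the disruption at that moment, which goes beyond what PodDisruptionBudgets offer.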
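The approval handshake described above can be sketched as follows. This is a minimal illustration, not LinkedIn's implementation: the class names, the `approve` method, and the replica-counting policy are all assumptions made for the example. The episode only states that a generic controller asks a per-database ACM whether an operation may proceed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Protocol


class Operation(Enum):
    # Operation kinds mentioned in the episode.
    UPDATE = "update"
    SCALE_OUT = "scale_out"
    SCALE_IN = "scale_in"
    INSTANCE_SWAP = "instance_swap"


@dataclass(frozen=True)
class DisruptionRequest:
    operation: Operation
    pod: str


class ApplicationControlManager(Protocol):
    """Hypothetical interface each stateful system would implement."""

    def approve(self, request: DisruptionRequest) -> bool: ...


class ReplicaCountACM:
    """Toy ACM: approve only if every partition hosted on the pod
    keeps at least `min_replicas` healthy replicas after the pod goes away."""

    def __init__(self, partitions_by_pod: Dict[str, List[str]],
                 healthy_replicas: Dict[str, int], min_replicas: int) -> None:
        self.partitions_by_pod = partitions_by_pod
        self.healthy_replicas = healthy_replicas
        self.min_replicas = min_replicas

    def approve(self, request: DisruptionRequest) -> bool:
        if request.operation is Operation.SCALE_OUT:
            return True  # adding an instance removes no replicas
        partitions = self.partitions_by_pod.get(request.pod, [])
        return all(self.healthy_replicas[p] - 1 >= self.min_replicas
                   for p in partitions)


def reconcile(acm: ApplicationControlManager, request: DisruptionRequest) -> str:
    # The generic controller knows nothing about the database; it only
    # acts on the ACM's answer.
    return "proceed" if acm.approve(request) else "wait"


acm = ReplicaCountACM(
    partitions_by_pod={"db-0": ["p1", "p2"], "db-2": ["p1"]},
    healthy_replicas={"p1": 3, "p2": 2},
    min_replicas=2,
)
print(reconcile(acm, DisruptionRequest(Operation.UPDATE, "db-0")))
print(reconcile(acm, DisruptionRequest(Operation.UPDATE, "db-2")))
```

The point of the design is that all database-specific knowledge stays behind the `approve` call, so the same controller loop can serve any stateful system that implements it.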

Notable Quotes:

Ronak Nathani [03:59]:
"So we are insane enough to run kubernetes on bare metal and we are also insane enough to run databases on kubernetes."

Ahmet Alp Balkan [05:25]:
"We have written our own generic stateful workload operator. ... that protocol is largely Kubernetes agnostic, which lets us run any number of different databases without writing a separate Kubernetes operator for each."

4. Handling Kubernetes Control Plane and Dependencies

Key Points:

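Current Control Plane Setup: The Kubernetes control plane components (API server, etcd, controller manager, scheduler) run directly as systemd services on LinkedIn's legacy orchestration stack. etcd writes to host disk because network-attached storage latency is not tenable for etcd writes.
Future Plans: LinkedIn wants to run Kubernetes on Kubernetes ("Kubeception") so that it no longer maintains two stacks and no longer dedicates a large bare metal machine to a single etcd instance. The work is midway through development, with nothing in production yet.
Networking Stack: LinkedIn uses its own flat, data-center-routable networking model in which any pod can reach any other pod directly by IP address. Most workloads do not use kube-dns or flannel.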

Notable Quotes:

Ahmet Alp Balkan [08:38]:
"Today our Kubernetes runs on our legacy orchestration stack. ... we want to run Kubernetes itself on Kubernetes. I think some people call it Kubeception."

5. Scaling Kubernetes Clusters

Key Points:

Cluster Size: Production clusters run close to 5,000 nodes, and LinkedIn wants to push Kubernetes beyond that. Many such clusters run across multiple regions. The team prefers large clusters because small ones fragment capacity and waste compute, but would not put 65,000 nodes in one cluster because of the blast radius.
Shard Management: Events are stored in a separate etcd cluster so that this part of the control plane can scale independently.
Future Aspirations: The team is interested in open-source alternatives to etcd that would allow much larger clusters, and wishes a Spanner-style backend were available to them.
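One common way to place events in their own etcd cluster is the kube-apiserver flag `--etcd-servers-overrides`, which takes comma-separated `group/resource#servers` entries with semicolon-separated server URLs. The episode does not say which mechanism LinkedIn uses, so the sketch below is only an assumption-based illustration; the helper name and the endpoint URLs are hypothetical.

```python
from typing import Dict, List


def etcd_overrides_flag(overrides: Dict[str, List[str]]) -> str:
    """Build a kube-apiserver --etcd-servers-overrides flag value.

    Keys are "group/resource" strings (core-group events are "/events");
    values are the etcd endpoints that should store that resource.
    """
    entries = [
        f"{resource}#{';'.join(servers)}"
        for resource, servers in sorted(overrides.items())
    ]
    return "--etcd-servers-overrides=" + ",".join(entries)


# Hypothetical endpoints for a dedicated events cluster.
print(etcd_overrides_flag({
    "/events": ["https://etcd-events-0:2379", "https://etcd-events-1:2379"],
}))
```

Events are high-churn and not needed to rebuild cluster state, which is why they are a natural first candidate for a separate store.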

Notable Quote:

Ronak Nathani [11:04]:
"If there are open source alternatives available to etcd, which allows us to scale the cluster way beyond what we can do today, that would be of lots of interest not just at LinkedIn but also some of the other folks we have spoken to."

6. Infrastructure as a Service and Hardware Refresh

Key Points:

Machine Management: LinkedIn has developed an Infrastructure as a Service (IaaS) layer that programmatically manages bare metal inventory, integrating with Kubernetes through custom resources and controllers.
Hardware Refresh Strategy: Machines are organized into pools with node profiles (e.g., high memory). During hardware refreshes, LinkedIn scales down old pools and scales up new ones, ensuring seamless transitions without significant downtime.
Maintenance Zones: Machines within pools are spread across maintenance zones to minimize impact during upgrades or failures, adhering to strict topology spread constraints.
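The 5% figure quoted for maintenance zones implies some arithmetic worth spelling out: to keep any single zone at or below 5% of an application's replicas, the replicas must be spread over at least 20 zones. The sketch below assumes an even spread and uses hypothetical function names; it is not LinkedIn's scheduler logic.

```python
from typing import List


def min_zones_for_cap(percent: int) -> int:
    """Smallest number of zones that lets an even spread stay within
    `percent` of replicas per zone (ceiling of 100 / percent)."""
    return -(-100 // percent)


def spread_evenly(replicas: int, zones: int) -> List[int]:
    """Per-zone replica counts when replicas are spread as evenly as possible."""
    base, extra = divmod(replicas, zones)
    return [base + 1 if i < extra else base for i in range(zones)]


def within_cap(counts: List[int], percent: int) -> bool:
    """True if no zone holds more than `percent` of all replicas."""
    return max(counts) * 100 <= percent * sum(counts)


print(min_zones_for_cap(5))
print(within_cap(spread_evenly(200, 20), 5))
print(within_cap(spread_evenly(200, 10), 5))
print(within_cap(spread_evenly(30, 20), 5))
```

The last case shows that small applications cannot meet a strict 5% cap even with 20 zones, because a single replica is already more than 5% of 30 divided unevenly; a real policy would need to decide how to treat such workloads.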

Notable Quotes:

Ahmet Alp Balkan [14:46]:
"Anytime you rack a machine, it automatically gets added to our infrastructure as a spare machine in our data center."

Ronak Nathani [17:03]:
"Our machines within a pool are also spread across what we call maintenance zones, ensuring that any scale-up or scale-down impacts only a maximum of 5%."

7. Ensuring Predictable Performance Amidst Hardware Diversity

Key Points:

Node Profiles: LinkedIn uses node profiles to abstract hardware specifications, allowing workloads to request specific profiles (e.g., latest CPU generation) to ensure performance consistency.
Performance Testing: Before introducing new hardware SKUs, extensive performance tests are conducted to validate suitability for sensitive applications.
Future Enhancements: Plans include introducing scheduler plugins to dynamically adjust node weights based on application requirements, mitigating fragmentation and optimizing resource utilization.
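The node profile abstraction can be illustrated with a small sketch: workloads and pools name a profile, the profile names the hardware SKUs that satisfy it, and a hardware refresh becomes an edit to the profile rather than to every workload. The class, the SKU names, and the inventory layout below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NodeProfile:
    """Hypothetical node profile: a name workloads ask for, plus the
    set of hardware SKUs currently allowed to back it."""
    name: str
    skus: Set[str] = field(default_factory=set)


def eligible_machines(profile: NodeProfile, inventory: Dict[str, str]) -> List[str]:
    """Machines (name -> SKU) whose SKU is allowed by the profile."""
    return sorted(m for m, sku in inventory.items() if sku in profile.skus)


inventory = {"m1": "sku-gen4", "m2": "sku-gen5", "m3": "sku-gen5"}
high_memory = NodeProfile("high-memory", {"sku-gen4"})
print(eligible_machines(high_memory, inventory))

# Hardware refresh: admit the new SKU, then retire the old one.
high_memory.skus.add("sku-gen5")
print(eligible_machines(high_memory, inventory))
high_memory.skus.discard("sku-gen4")
print(eligible_machines(high_memory, inventory))
```

Workloads that asked for "high-memory" never change during the refresh; only the set of machines behind the name does.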

Notable Quotes:

Ronak Nathani [19:42]:
"One thing we do right now is we have different node profiles ... we provide a pool of machines which has the latest SKU and applications who actually want this would basically say I want to opt into asking for this specific SKU."

8. Custom Controllers and Managing Complexity

Key Points:

Prevalence of Custom Controllers: LinkedIn extensively uses custom controllers to manage Kubernetes functionality, tailored to their unique needs.
Development Challenges: Controllers in production environments are complex, necessitating meticulous development to avoid issues like memory leaks, throughput bottlenecks, and infinite loops.
Evaluation of Open-Source Components: LinkedIn adopts a cautious approach, rigorously testing and evaluating open-source components before integrating them into their stack. If a component fails to meet scalability or reliability standards, they opt to build custom solutions.

Notable Quotes:

Ahmet Alp Balkan [22:56]:
"Anytime a random team out there in the company shows up saying, hey, I have a controller that I would like to deploy to all our clusters, please. Usually our response is not very positive."

Ronak Nathani [27:05]:
"Number of stars doesn't represent how something will run in your production environment. ... if we find out that any of the things I mentioned aren't true, then we look at the cost of what it means for us to write that component from scratch."

9. Developer Experience and Platforming

Key Points:

Golden Path: LinkedIn offers a curated Kubernetes experience, abstracting complexities and providing a "golden path" for developers to deploy applications without needing deep Kubernetes knowledge.
Simplified Deployment: Developers specify compute resources, application identifiers, and deployment environments through user-friendly interfaces and workflows.
Guardrails: To prevent inadvertent disruptions, LinkedIn enforces guardrails that restrict actions like scaling replicas to zero, ensuring stability and preventing site-wide outages.
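A guardrail of the kind described above can be as simple as a validation step that runs before a scaling request is accepted. The sketch below is an assumption-based illustration: the function name, the `allow_zero` override, and the messages are hypothetical, and the episode summary only states that scaling replicas to zero is restricted.

```python
from typing import List


def validate_scale(app: str, requested_replicas: int,
                   allow_zero: bool = False) -> List[str]:
    """Return a list of guardrail violations; an empty list means the
    request may proceed."""
    violations = []
    if requested_replicas < 0:
        violations.append(f"{app}: replica count cannot be negative")
    elif requested_replicas == 0 and not allow_zero:
        violations.append(
            f"{app}: scaling to zero replicas is blocked; "
            "an explicit override is required"
        )
    return violations


print(validate_scale("feed", 8))
print(validate_scale("feed", 0))
print(validate_scale("feed", 0, allow_zero=True))
```

Returning a list of violations rather than raising on the first one lets a platform report every problem with a request at once, which tends to suit self-service deployment workflows.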

Notable Quotes:

Ronak Nathani [31:31]:
"We have certain nouns within LinkedIn which uniquely identify your application. ... you don't have to worry about which cluster my application is running into."

Ahmet Alp Balkan [35:13]:
"If they need to worry about a cluster, a namespace, then we did something wrong. We don't want them to worry about that."

10. Managing Application Dependencies

Key Points:

Database as a Service: LinkedIn offers various stateful services (e.g., relational storage, key-value stores, caches) managed by dedicated teams. Developers can provision these services through automated UIs, integrating seamlessly with their applications.
Separation of Concerns: While compute resources are managed through the Kubernetes platform, stateful services are handled separately, ensuring specialized management and scalability.

Notable Quote:

Ronak Nathani [37:06]:
"We run a set of databases as a service. ... If I want to deploy a service and depending upon what I need for my application, I would essentially go and get a database provisioned."

11. Incidents and Learnings

Key Points:

Component Failures: Integration of open-source components like the Etcd Operator and Argo CD led to incidents when scaling thresholds were exceeded, highlighting the challenges of adopting off-the-shelf solutions at scale.
Root Causes: Issues such as improper handling of timeouts, label management errors, and reconciliation failures under high churn rates were identified.
Proactive Measures: LinkedIn emphasizes thorough stress testing, source code reviews, and proactive replacement of components that do not meet reliability standards.

Notable Quotes:

Ahmet Alp Balkan [38:08]:
"... our clusters and our workloads hit a specific threshold and we had a week-long outage. ... things are taking forever to reconcile."

Abdel Sighiouar [38:39]:
"People wanting us to have more end users on the show ... I will have to assume that part of the reason why people want end users is because they want to hear about incidents."

12. Conclusion and Resources

In the concluding segment, Ahmet and Ronak share additional resources for listeners who want to dig deeper into LinkedIn's Kubernetes platform. They mention upcoming talks at KubeCon London and point listeners to the LinkedIn engineering blog for detailed articles on their experiences and lessons learned.

Notable Quotes:

Ahmet Alp Balkan [40:06]:
"We're blogging actively on our LinkedIn engineering blog. ... we have a talk coming up in Kubecon London."

Ronak Nathani [40:56]:
"We're actively hiring, so if you want to chat more about opportunities at LinkedIn ..."

They also promote their personal podcast, "Software Misadventures," where they discuss various technical challenges and solutions.

Additional Information

Show Notes and Resources: Links to the discussed blog posts, talks, and the "Software Misadventures" podcast will be available in the show notes.
Connect with Guests:
- Ahmet Alp Balkan: Active on LinkedIn and other social platforms.
- Ronak Nathani: Host of the "Software Misadventures" podcast.

This episode looks at how LinkedIn runs Kubernetes at very large scale on bare metal. From moving off a proprietary orchestration system to running stateful workloads and keeping developers productive, Ahmet and Ronak share experiences that can guide other organizations facing similar challenges.


Transcript

[00:00]
Abdel Sighiwar
Hi and welcome to the Kubernetes podcast from Google. I'm your host, Abdel Sighiwar.
[00:04]
Mofi Rahman
And I'm Mofi Rahman.
[00:15]
Abdel Sighiwar
In this episode we talk to Ahmet Alp Balkan and Ronak Nathani. Ahmet and Ronak are software engineers at LinkedIn, part of the Compute Infrastructure team running the Kubernetes platform for LinkedIn. They joined us today to talk about how they run Kubernetes at scale and what they learned along the way.
[00:32]
Mofi Rahman
But first, let's get to the news. CubeFS was moved to the CNCF graduated maturity level. CubeFS is a distributed storage system supporting access protocols like POSIX, HDFS and S3. The project was created in 2017 and was accepted to the CNCF in 2019. The project is used for AI/ML workloads, but also container platforms where separation of computing and data storage is required, like databases.
[01:01]
Abdel Sighiwar
Canonical announced 12 years of Kubernetes long-term support. Frequent releases and upgrade frequency are a topic of discussion within the Kubernetes community. Upstream Kubernetes offers 14 months of support, and major cloud providers extend that to two years. This new announcement from Canonical aligns with the company's strategy of long-term support for Linux, now extended to Kubernetes. The company will be releasing LTS versions of Kubernetes every two years, starting with version 1.32, and interim releases every four months. With Ubuntu Pro subscriptions, LTS versions of Kubernetes will continue to have CVE patches for at least 12 years.
[01:40]
Mofi Rahman
The conference season is starting and events are rolling out. Here is a rundown of what to expect in March and up to KubeCon London: KCD Beijing on March 15, KCD Rio de Janeiro on March 22, and KCD Guadalajara on March 29.
[01:55]
Abdel Sighiwar
And that's the news. All right, today we're talking to Ahmet and Ronak. Ahmet and Ronak are software engineers at LinkedIn. They work for the Compute Infrastructure team running the Kubernetes platform for LinkedIn and they join us today to talk about how they run Kubernetes at scale and what they learned along the way. Welcome to the show, Ahmet and Ronak.
[02:16]
Ahmet Alp Balkan
Hey, thanks for having us.
[02:17]
Ronak Nathani
Yeah, thanks for having us, I guess.
[02:20]
Abdel Sighiwar
So we had a very, very interesting discussion at KubeCon North America and you folks told me you're running an insane scale of Kubernetes on bare metal, which I still have to, like, comprehend. So let's start with the basics. Is everything at LinkedIn running in Kubernetes, or do you run something else?
[02:40]
Ronak Nathani
Right now, it's not just Kubernetes, and I can provide some context on this as well. Back in the day, I would say around 10 or 11 years back, when Docker 1.0 wasn't around, LinkedIn still needed containerization because we wanted to make sure we could bin-pack applications and stack them on a single machine. So we wrote our own container runtime. We also wrote our own scheduler, of course. Friends don't let friends write schedulers, by the way. Just saying. And that stack has served us pretty well, actually. It's been running within all of our bare metal data centers and it's been scaling as the site has grown. But over the last few years we realized that it's aging a little bit too. The marginal cost of adding every new feature is increasing, or rather, it's increasing more than linearly. And with Kubernetes and the rest of the open source ecosystem becoming just way more mature, it just made sense for us to transition onto that path. So we've been on this journey for a while now, moving the majority of our workloads to Kubernetes, and this includes stateless, stateful, as well as batch workloads. Not everything is on Kubernetes yet, but it is soon going to be. And if you ask any of our managers, they will say, oh, they wish it was yesterday.
[03:52]
Abdel Sighiwar
All right, so then I think this is like I have to ask this question. What about databases? Do you folks do databases on kubernetes? What do you think about that?
[04:00]
Ronak Nathani
So we are insane enough to run Kubernetes on bare metal and we are also insane enough to run databases on Kubernetes. It's kind of a running joke where people say, well, Kubernetes cannot support stateful systems. But to be honest, Kubernetes is quite flexible and adaptable, and if you understand it deeply and control the full stack, all the way from your bare metal machines to the configuration on top, to what kind of disk you can attach, to the scheduler, where you control what scheduler plugins you write, to the API server, where you can have very strict policies on what features one can and cannot use, you can go as far as running stateful systems on Kubernetes too. And I'll also say that we have several stateful systems which use local disk, so we don't use network-attached storage everywhere because of performance issues. So we run these applications on Kubernetes right now. And again, these are in a transition phase. Migration is ongoing. And the part about controlling the stack that lets us do this is the full maintenance lifecycle. And I'm sure Ahmet can speak more to that, where when we run maintenance across our data centers, it coordinates the requirements with the stateful systems. So we can go into that rabbit hole if you would like. But we don't just evict the pod and say, here, you have a one-hour grace period. Hey, database, you gotta shut down. What we do is respect whether an application or a stateful system is okay being shut down at that moment in time or not. And this is beyond just using PDBs.
[05:24]
Abdel Sighiwar
Got it.
[05:25]
Ahmet Alp Balkan
And honestly, I might add that we have written our own generic stateful workload operator. This was our talk at the last KubeCon North America. We introduced this system that you can bring a stateful system to, and the stateful system needs to implement a particular protocol to take a workload out of rotation or add a new instance. And that protocol is largely Kubernetes agnostic, which lets us run any number of different databases without writing a separate Kubernetes operator for each.
[05:54]
Abdel Sighiwar
Oh, interesting. So does that make the workload itself Kubernetes agnostic? If it has to implement the protocol like the database will have to listen to something before it gets evicted. How does that work? I'm curious about this.
[06:08]
Ronak Nathani
So the database doesn't need to be scheduler aware in this case. So let's say, just as a simple example, I'm running etcd on Kubernetes. What we call this protocol is an application control manager (ACM). Think of it as an endpoint which our controller talks to. This endpoint is aware of Kubernetes, but your database is not. For any operation that requires your pod to be moved, whether that's because of an update that you're making to the pod, such as a version change or a CPU resource change, or because of maintenance, what this generic controller does is contact your ACM to say, hey, I'm trying to do this sort of operation: it's an update, or it's a scale-out, or it's a scale-in. One thing which is also interesting, just as a side note, is that for many stateful systems there's a need for instance swaps as well, meaning you don't want the entire system to go through a rebalance just because a machine went down. So what you sometimes want to do is say, hey, shard A, just stay shard A. I'm going to give you another machine. Replicate this data and don't shuffle everything. So there are these kinds of operations that this protocol supports. So during this update, we talk to that ACM for the specific database and say, hey, this is the kind of operation I'm doing. This ACM is aware of the health of the database, to see, for instance, whether there are enough replicas for each partition that might be impacted if you take down this pod, and it provides a yea or nay on whether the system can proceed with that specific operation or not. Got it. And we go into way more detail in a blog post that we published on the stateful system. More than happy to share that with you if you want to add it to the show notes.
[07:43]
Abdel Sighiwar
Yeah, we'll make sure that both the talk and the blog is added to the show notes. You folks reminded me of something that we have at Google. So I think I have just one more question about this and then we can move on. Does this mean that this application controller manager can also be configured by specific teams with policies like, can I say what sort of disruption my application can handle or can support?
[08:04]
Ronak Nathani
Yep. So every single database or a stateful system that you run on this platform brings its own ACM and we provide the protocol. So as long as you abide by that interface and the protocol, you write your own acm. In many cases, several teams share this ACM too, because they follow a similar disruption model and behind the scenes, they can control how to approve a disruption or an update.
[08:24]
Abdel Sighiwar
Got it, got it. Interesting. Okay, so what about the things that are not Kubernetes specific? So you talked about etcd. Like, do you run ETCD on kubernetes which require Kubernetes to like, how does that work? How do you handle like cyclical dependencies?
[08:39]
Ahmet Alp Balkan
Today our Kubernetes runs on our legacy orchestration stack. We directly run Kubernetes and all its control plane components (API server, etcd, controller manager, scheduler) as systemd services. Now, that said, we don't want to maintain these two stacks, one for running the Kubernetes control plane and one for running workloads on Kubernetes. So that's why we actually want to run Kubernetes itself on Kubernetes. I think some people call it Kubeception. Yeah, we want to go through this journey and we're actually midway through our development. We don't have anything in production operating on this model, but we know for a fact that there are cloud providers out there running the Kubernetes control plane as pods. And so as a result, we know that this is possible, and we know that there are a lot of cost savings. When you're running etcd on a machine and there's a single etcd instance on that very gigantic bare metal machine, it's actually pretty wasteful. So yeah, we are planning to run Kubernetes inside Kubernetes for that reason as well. Today our etcd runs directly on host disk and I think we'll continue to maintain that. Network-attached storage latency is unfortunately not tenable for etcd writes for us, so we'll probably keep it that way as well. Now, one thing that we're actively participating in: there's an open source project going on in the community right now around an etcd operator, and we want to participate in that mostly because we have been solving the exact same problem ourselves. We wrote an operator for etcd specifically so that we can handle these disruptions without losing data, et cetera. And we don't want to create too many cyclic dependencies between etcd and Kubernetes either. So we want to keep this stack pretty lean. And that's where the etcd operator itself is pretty promising.
And in terms of stuff like networking, I would say that any company that is running on bare metal at this scale pretty much has their own networking stack that has nothing to do with anything out there in the cloud native ecosystem right now. So for a lot of our workloads, we don't use kube-dns, right? We don't use anything like flannel for the vast majority of our workloads. A lot of our network stack is pretty much flat, data-center-routable networking, where any pod can reach any other pod directly by IP address.
[10:46]
Abdel Sighiwar
Got it. So what are your thoughts then on, I think you probably have seen it, there is a movement. I don't know if it's a movement, but we announced last year that we can do 65,000 nodes, and part of this announcement was moving towards Spanner. I know that Spanner is not something that exists on prem, but what are your thoughts on that? Like moving away from etcd altogether.
[11:05]
Ronak Nathani
Our thoughts on this are we would love to do that at some point. In fact, this is a topic that was very popular at KubeCon North America in fall last year. Most of the teams we spoke with who run Kubernetes at any decent scale basically came back and said the same thing: now our bottleneck is etcd. Small, couple-hundred-node clusters just don't cut it for something that we are running, where you have machines on the order of six digits. So we want to be able to replace etcd so that we can push the boundaries of how big a cluster can get. Now, on some of the things that Ahmet mentioned: because we have this control over how we manage the entire control plane plus the database, what we do today is shard events into a separate etcd cluster just so that we can scale that part. We also don't use several components within Kubernetes, and that's by choice, for example kube-dns or CoreDNS or Kubernetes Services. That's a design choice where we don't want to use some of these objects as part of our application ecosystem, because several of these things, for example service discovery, network policy, etc., run at a global scale for us. Because we don't use these things, it takes away that load from the API server and, as a result, etcd. But because we run large clusters, we also have large numbers of certain objects, like nodes and pods. So in general, I would say if there are open source alternatives available to etcd which allow us to scale the cluster way beyond what we can do today, that would be of lots of interest not just at LinkedIn but also to some of the other folks we have spoken to. Yeah, and the Spanner part, I wish it was available.
[12:44]
Abdel Sighiwar
I wish as well. So when we talked at KubeCon last year, one of the questions that kind of lingered in the back of my head is bare metal, right? So you run on bare metal. I worked in data centers way before Kubernetes existed, so I racked servers. I actually dealt with physical hardware. So say I am a technician at LinkedIn. I come in, I rack a piece of hardware, a server, right? I connect power and networking. What happens after that? Where does Kubernetes come into the mix?
[13:08]
Ahmet Alp Balkan
I would say that the answer a year ago at LinkedIn versus the answer now is pretty substantially different. The last year has been pretty transformational in that regard, in that we kind of built our own infrastructure as a service machine management layer from the ground up. And as part of that project we still have data center technicians that are racking machines. All that's still there. Now, one thing that we've done is more programmatic management of our data center inventory. So anytime you rack a machine, it automatically gets added to our infrastructure as a spare machine in our data center, essentially as spare capacity in a spare pool. And from that point on, we have built various APIs that are very similar to Kubernetes APIs to manage the list of machines and the pools that we have in our data center. So these APIs are kind of like the Kubernetes resource model, declarative APIs almost. And we also built an orchestration layer on top in Kubernetes to configure these pools. So basically, if I want to add that new capacity to my pool in one of my Kubernetes clusters, I just go to an object called Kubernetes Pool, which is, as you can imagine, a custom resource that we created. And once I declare that intent, that intent is communicated to our infrastructure as a service layer, and then that machine gets provisioned and added to our Kubernetes cluster. Now, when I remove this machine, it similarly goes through some sort of state machine where the machine gets wiped clean and gets returned to the spare capacity. So that means next time someone else requests it, it can be added to a new pool. So that's basically pretty similar to how cloud providers work, right? The main difference here is that they are not VMs, they're just bare metal machines, and we have a way to reimage them and clean them up every time we use them.
[14:46]
Abdel Sighiwar
Interesting. So, like you mentioned, there are controllers and there are CRDs, so Kubernetes is still the orchestration layer, even for the actual physical hardware. So I think the question, I mean, it's a question that comes up often: what is the right number of clusters? Is it one or is it thousands or 42? That's a good one.
[15:08]
Ronak Nathani
So I would say depending upon the use case, the answer would vary for that specific team. In our case, there are different environments. So specifically for our staging environments, we create test clusters. They are not very big in size. So many teams developing these platforms on top where they're running CRDs, webhooks, kind, minikube, or any of these solutions, like, yes, they're helpful, but they just go so far. In many cases, you want to be able to run these tests in an environment where you will actually be running these systems. So we carve out these test clusters, which are isolated from rest of the cluster running production workloads, and they are much smaller in size. They look very similar in terms of their, let's say, control Plane shape and size, where like policies enabled, et cetera, those aspects are similar, but the number of data plane nodes is not too big. However, when we get to our production clusters, in those cases we want to run large clusters where we ensure the blast radius is not crazy. So again, 65,000 nodes for a cluster sounds really fascinating, but we wouldn't necessarily run 65,000 nodes in one cluster, even if we had spanner, for instance. But we want to be able to push Kubernetes beyond 5,000 nodes. We run our clusters pretty close to that size in production. And considering our scale, we already know that we need to run many of these clusters across multiple regions. And part of the reason why we don't want to run too small of clusters is because of fragmentation, capacity gets fragmented too much. In certain cases we have workloads which are really large. And in those cases, when you have this capacity fragmented, it's too much wastage in terms of compute, of course.
[16:41]
Abdel Sighiwar
Yeah. So what about hardware upgrades? So say I want to upgrade a certain shape of a physical server, like certain shape of a bare metal. I don't know, go from 128 gigabytes of memory to 256 on the same cluster spanned across multiple regions. How would you handle that? Because a cluster is still a single blast radius technically, right?
[17:04]
Ahmet Alp Balkan
Yeah. I would say that for us, hardware refresh is a pretty regular fact of life that we have to go through every couple of years. And I would say that the way we designed our infrastructure as a service layer accommodates the fact that we'll have to go through these hardware refreshes. Right. So the way we handle this whole notion of, hey, the bare metal machines may go away someday, is actually through the pool concept. For example, imagine you're in a stateless machine pool, right? That machine pool doesn't directly declare the SKUs of machines that it is going to run on. It actually declares something like a node profile. This is a concept that we introduced as an abstraction between our bare metal and what workloads expect. So for example, if your workload expects high memory, well, we should probably have a node profile called high memory, right? So as part of the hardware refresh cycle, what we do is add and remove SKUs from these node profiles. And another capability that we have is that we can make two different machine pools look like a single pool inside a Kubernetes cluster, because essentially we use a label called pool and we control what machines we add to the pool. So for us, going from one generation of hardware to another just looks like scaling down one of the pools and scaling up the other pool. By doing that, we can basically decommission the set of machines that we have in the data center.
[18:26]
Ronak Nathani
And one thing which I would just add is that our machines within a pool are also spread across what we call maintenance zones. Maintenance zones are, again, a concept that we define within our data centers, and it's encoded throughout our infrastructure stack. So for any pool, if you get machines in there, we spread them as much as possible across all these maintenance zones. And this also shows up as a label on machines. So we use topology spread constraints on all pods. Any application gets, let's say, at most 5% of its replicas in one maintenance zone. So any time you go through the scale-up, scale-down exercise, it only impacts a maximum of 5%.
[19:01]
Abdel Sighiouar
Yeah, got it.
[19:02]
Ahmet Alp Balkan
And Kubelet upgrades are part of that too, by the way. Like anytime we upgrade the cluster, we actually upgrade them 5% by 5% so that we don't take more than a certain amount of capacity in the cluster.
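The 5%-at-a-time rollout mentioned here can be sketched as simple batching. This is an illustration under the assumption that nodes are upgraded in fixed-size slices; the function name and node names are made up, and the real process presumably also follows maintenance zones.

```python
def upgrade_batches(nodes: list[str], percent: int = 5) -> list[list[str]]:
    """Split nodes into batches of at most `percent` of the fleet (at least one node)."""
    size = max(1, len(nodes) * percent // 100)
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


# Hypothetical fleet of 100 nodes: 20 batches of 5 nodes each.
fleet = [f"node-{i}" for i in range(100)]
batches = upgrade_batches(fleet)
```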
[19:12]
Abdel Sighiouar
So, Ahmet, you mentioned something very interesting, which I want to follow up on. So you obviously have hardware refresh, right? So you will always have a new generation of hardware coming and, you know, an old generation that will be deprecated or whatever. But I assume, like any big company of your size, you're not going to even have the same platform of CPU, you're not going to have the same architecture. So let's say I'm a developer at LinkedIn. How do I ensure that I have predictable performance regardless of where my workload runs?
[19:42]
Ronak Nathani
It's actually one of the most interesting challenges that we deal with, because a lot of applications fall in our serving stack, meaning they're in the critical path of LinkedIn.com. An example is your feed. An application that is powering the LinkedIn feed, for instance, is extremely latency sensitive, and behind the scenes it is actually calling out to multiple services, evaluating multiple posts and seeing which ones to rank and give you back on the feed so that you're most engaged. So in this case, many of these teams are very sensitive to, or aware of, what kind of hardware their applications run on. So I'll cover it from a couple of perspectives. Let's say we're introducing a new SKU. Many teams would actually go through the exercise of running performance tests to see how this SKU performs as opposed to the old one that they had. And some applications care about the SKU very much; in this case, a specific CPU generation, they care about that a lot. In other cases, not so much. So what we do right now is we have different node profiles in this case. And this is something we want to evolve as well because of fragmentation problems. But what we do right now is we'll have a multi-tenant pool which has different generation SKUs. But we know this pool doesn't go below a certain threshold, meaning these are just the last two generations and we won't go beyond that. Anything that is older than these two generations goes into a separate pool, which is for internal applications which are not too performance sensitive. But there are certain cases where we provide a pool of machines which has the latest SKU, and applications that actually want this would basically say, I want to opt into asking for this specific SKU, and we ensure they are routed to that specific pool. And then of course you have quota in place to make sure that not everyone comes asking for, just give me the latest one.
But one challenge that this comes with is you end up with fragmentation, where I have a pool of machines which are, let's say, pretty beefy with the latest generation of CPU. The application asking for it doesn't take up all of it, but I still want to make sure I can guarantee that capacity. One of the ideas we have is to introduce scheduler plugins so that we can adjust the nodes' weights depending upon what the application is asking for, while still having these new generation SKUs in the general pool. This is not something we have done yet, but something we definitely want to explore.
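The scheduler-plugin idea is described as unexplored, so the following is only a toy model of node weighting, not a real kube-scheduler plugin (those are written in Go against the scheduling framework). The scoring values and function names are invented: pods that opt into the latest SKU score it highest, while everything else is steered toward older hardware so the new machines stay available in the general pool.

```python
def score_node(node_generation: int, latest_generation: int, pod_wants_latest: bool) -> int:
    """Hypothetical score in the 0-100 range; higher means more preferred."""
    is_latest = node_generation == latest_generation
    if pod_wants_latest:
        return 100 if is_latest else 0
    # Pods that do not care are nudged away from the newest hardware.
    return 10 if is_latest else 100


def pick_node(nodes: dict[str, int], pod_wants_latest: bool) -> str:
    """Pick the best-scoring node from a {node name: CPU generation} mapping."""
    latest = max(nodes.values())
    return max(nodes, key=lambda n: score_node(nodes[n], latest, pod_wants_latest))


# Made-up nodes: one previous-generation machine, one latest-generation machine.
nodes = {"node-gen3": 3, "node-gen4": 4}
```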
[21:50]
Abdel Sighiouar
Got it. Or maybe another idea and I'm just going to throw an idea here and then you do whatever you want is create something like flexible capacity, like capacity that is spare, that no one is using, that can be used by less kind of sensitive. And when I say sensitive, it could be performance sensitive or time sensitive workloads, right?
[22:07]
Ronak Nathani
Yes.
[22:08]
Abdel Sighiouar
If you have a batch that you can wait for it to run whatever you want. This is essentially what we have inside of Google. So I'm just telling you how we do it.
[22:15]
Ronak Nathani
You're absolutely right. I think we could totally do that. What we see in our experience is the most compute intensive workloads, they actually care about the SKU a lot and those are the ones where we need to plan for enough capacity and make sure they get it. But yes, for the lower priority ones, we could totally do what you suggested.
[22:32]
Abdel Sighiouar
Got it. And so we talked a bunch about how you do stuff, right? And it sounds to me like the answer to any question where Kubernetes is involved is a custom controller. And I know that Ahmet is a big fan of custom controllers, and I know that I'm asking you the immense task of summarizing your whole blog, but what's your thinking process about this? Custom controllers, building your own versus using something that the community already provides.
[22:56]
Ahmet Alp Balkan
I think it's the reality that a lot of shops out there right now using Kubernetes are developing their own controllers, and that's fine. That's why the whole notion of custom resources and controllers exists to begin with. Now, in our experience, we noticed that the controllers that we put in our production's hot path have been risky enough that we have to develop them extremely carefully. Even the smallest things that you can think of in a controller actually have a lot of importance when the controller actually runs. When it starts managing thousands of objects, it suddenly becomes a hotspot, and it becomes really important how that controller is implemented. That's why I've been trying to share some of the stuff that we learned about controller development pitfalls on my blog. Now, I would say that if there is a task that can be achieved without writing a controller, we will obviously do that. There are of course some tasks where we explicitly opted into the Kubernetes resource model. We said we're going to create a CRD for this, and that's why we're writing controllers for it. Now, in the open source, by the way, if you're just writing controllers to glue a few resources together, there's a project that I think AWS open sourced called kro.dev. That's actually been a pretty fun project that I've been trying to find use cases for internally as well. So far we don't have any. But I will say that, as we spoke about earlier, some of our internal workload types are pretty custom, like our stateful workload type, which we creatively call LI statefulset. That's a huge controller. That's probably one of our biggest operators. Similarly, we have cluster management controllers and pool management controllers. We have to have them. That's why we have them.
But anytime a random team out there in the company shows up saying, hey, I have a controller that I would like to deploy to all our clusters, please, usually our response is not very positive. I mean, the thing with a controller is that it looks deceptively simple to develop one, and you can also start believing that your controller works excellently. However, in a real production environment, you're going to start dealing with memory issues, throughput issues, the strain that you put on the API server, accidental infinite loops that your controller is going to get caught up in, and things like that. So that's why we're trying to scrutinize development of new CRDs and controllers pretty actively in our cluster ecosystem.
[25:09]
Abdel Sighiouar
It's funny you mentioned the controller pitfalls blog, because that's the one I was actually reading specifically. And it feels to me like one of the problems with controllers is that they can do a lot of things, and that makes them so fragile. Like if you have a controller that can randomly add or remove labels from nodes, and this is an example from your blog, that sounds like a lot, because a lot of things in Kubernetes rely on labels, right? So if you just have an extra piece of software that can randomly add labels or remove them or whatever, that's like a recipe for disaster in a way.
[25:42]
Ahmet Alp Balkan
Yeah. I think you're referring to the tale of the node feature discovery incident on our blog. Yeah, so that was a fun incident. Again, this is one of those things where we were running an old version of this component that was written a really long time ago, and we decided to upgrade. Turns out the entire component was rewritten, and the component occasionally started to remove the labels and stuff like that. That also prompted us to actually dig into the source code of the controller and figure out how many more places this thing could actually start removing the labels. And we found a few more places. I mean, we have now reported all those paths where the component can fail, but at this point I think we're probably not going to use that component. As you said, it's too risky. If something actually has the chance of bringing your production down, maybe that shouldn't be a controller, or maybe that shouldn't be an off-the-shelf controller that you're bringing in from the open source, unless that component is very well defined and the whole world relies on it exactly the way you rely on it, and whatnot. Yeah, I think we can talk a little bit about how we evaluate the cloud native ecosystem separately.
[26:43]
Abdel Sighiouar
I was about to ask that question, actually. I think that's one of the topics that Ronak mentioned that you would be willing to discuss. You used a component from open source and it screwed up your environment, so you learned from it, I guess. Right? So what now? How do you go about evaluating? Number of stars? Do you like the maintainers? How does that work?
[27:06]
Ronak Nathani
So that's a good question. And before I say anything, I'll just say that all of us, I mean both Ahmet and I, plus the team at LinkedIn, really appreciate all the work that all developers do in the open source. Kubernetes is open source, again, and we rely on a whole bunch of other components which are open source too. So we appreciate all the work that goes in, and we want to make sure all the developers are recognized as well. But I will say that the number of stars doesn't represent how something will run in your production environment. So none of these components are necessarily bad. They are just not the right fit for how we want to use them. I'll take an example. Ahmet mentioned NFD. I'll share another example, which is Argo CD, for instance. I love Argo CD as a system. We used it extremely heavily, or we still use it extremely heavily, but it's more of a behind-the-scenes system. It's basically a GitOps engine for us, which no user actually sees. It was running perfectly fine until our clusters and our workloads hit a specific threshold, and we had a week-long outage. And that outage was basically things taking forever to reconcile. When we dug into that code base, we were like, well, we basically have maybe five custom resources that we need this thing to reconcile, but it is actually looking at the entire cluster. It's looking at every change made to a config map, it's looking at all the node objects and the events. I don't need it to look at all of that. And when I go and try to disable these settings, in some cases you end up with a bug where there are layers and layers of caches at different places, some of which have a good TTL on them. In some cases you have a memory issue. So again, not to pick on Argo CD necessarily. Again, it's a great product. The folks who build it, I've spoken with them personally, are really smart engineers, and it solves a really critical problem.
Again, what we found out is, anytime we had a specific gap in our capabilities, what we would typically do is go out and see whether there is a solution in the open source world, because we don't want to reinvent the wheel. So we'll go out and see whether we can leverage something off the shelf and in some cases even contribute back. We have a very curated experience for an average LinkedIn developer, and we can go into what that looks like as well. In those cases, many components that we pick off the shelf aren't necessarily exposed to the users. So we have a very opinionated platform that we build on top of Kubernetes. As a user, you wouldn't even know many times that your app is running on Kubernetes, and that's by design. So when we go and use some of these components off the shelf, we start putting them in as part of our stack. Once we start scaling our clusters and our workloads, if we hit a threshold where that component doesn't scale anymore, or we find out that operationally running this component is really challenging, or as we dig more into that code base we find out the quality of the code doesn't match our style or our standards internally for something you want to put in production and be on call for in the middle of the night, then we go and essentially replace it. And this is an exercise that we have done time and time again, where we start with an open source component because it solves the need of the hour, meaning I don't have to wait three months to close this gap in our capability. But as soon as we do that, we start evaluating it to make sure this component is actually going to be stable and remain part of our stack for the long term. And if we find out that any of the things I mentioned aren't true, then we look at the cost of what it means for us to write that component from scratch.
And in many cases, if there is a capability we really need as part of our compute platform, and if that capability is really important, then investing in it, where we build it from scratch, seems like the right thing to do. And at this point, we have done that for a few things. We'll see how many more we do.
[30:36]
Ahmet Alp Balkan
I would say in terms of the evaluation path, for anything new that we're bringing into our ecosystem, we're trying to have the teams bringing those things do pretty large stress tests, as much as they can. Now, again, stress tests can only go so far. Like, for example, in a bare metal environment, you really don't always have 5,000 machines sitting around doing nothing, so you can't easily create a very large cluster. Now, that said, it's still possible to exercise a lot of the controllers and components and see exactly how they break. And especially if you read the source code, or at least you have a pretty good understanding of how this controller works, you can kind of figure out where it's going to break. And I would say running Kubernetes at scale is mainly about figuring out where the scaling challenges of each component are.
[31:17]
Abdel Sighiouar
Got it, got it. So now I want to shift away a little bit from your team and talk about your developers. So how do you folks platform? Because that seems to be the term of 2024 and 2025 and beyond. How do you platform your platform?
[31:32]
Ronak Nathani
I guess, how do we platform our platform? Great question. If you ask us, there are lots of improvements to be made, but we do our best. What I would say is, it's good enough.
[31:41]
Abdel Sighiouar
We could stop here, right? Just kidding.
[31:44]
Ronak Nathani
Well, I was going to say, depends on who you ask. If you ask me, our platform is awesome. If you ask some of our users, well, they might have a different opinion. But jokes aside, I'll start by saying Kubernetes is really flexible and very adaptable, but I don't think all of these features need to be exposed to all the end users. In our case, our team is pretty opinionated about what we expose to the end users, what features they can use and, in fact, how we use Kubernetes. And part of that is because we want to curate an experience for our engineers. Now, Kubernetes is a beast. Anyone who says otherwise is either lying or hasn't used it enough. I would say it's very easy to get started, but we don't want our engineers to worry about what podSpec.dnsPolicy really means, or what podSpec.hostNetwork really means, or how to go about setting liveness probes in specific detail, for instance. So generally what we provide is this: as a LinkedIn engineer, you specify your compute resources, which is CPU, memory, and in some cases storage. You would specify, if you care about it, the kind of, let's say, node profile or SKU you want to run on. You would specify your application identifiers. So we have certain nouns within LinkedIn which uniquely identify your application. You don't have to worry about the registry URI, you just specify the identifier. We take care of where the registry comes from, and we ensure it's mapped to the right region. And then you check this into your repo, which is next to your code. There is a specific directory structure that you follow to specify different variants of your application that you want to run. So you might say, this is my staging, that's my production, for instance. You can also enable auto scaling in some cases where you don't want to worry about the replica size of your app and you want a system to take care of the replica size based on the site traffic.
Once you do that, you check this in and you go to a UI which has all your application environments listed. So you say, I want to deploy in these three regions for production and these two regions for staging, and then you set up that workflow. On an average day you typically go through either clicking that or you have a workflow which is preset for you and proceeds as long as your tests pass. So what you would do is, let's say, deploy something in staging, then deploy a canary in production, then run a test to make sure your canary passes the tests that you set up initially. And we have certain ones that we provide out of the box too. So for example, if you're running an application, we check that your CPU and memory usage doesn't blow up with the new version. And if all of those tests pass, most applications have an auto-advance where they would go to the rest of that region, then on to the next region, so on and so forth. So this is what the typical rollout experience looks like for a user. Now, some of the opinionated pieces here include this: a user doesn't just write a Deployment file. A user writes what we call an LI deployment or an LI statefulset, which only has maybe six fields you want to specify just to get this going, but you have the flexibility to override that pod spec if you really want to and you know what you're doing. Kubernetes is abstracted, but it is not hidden. So we have our own kubectl plugins which users can use to look at their pods and exec into them without having to worry about which cluster their application is running in.
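To make the "six fields plus an optional pod spec override" idea concrete, here is a hedged sketch of how a small user-facing spec might expand into a standard `apps/v1` Deployment. Every input field name, the registry host, and the node-profile label key are assumptions; the episode does not describe the actual LI deployment schema.

```python
def expand_li_deployment(spec: dict) -> dict:
    """Expand a hypothetical six-field spec into a Kubernetes Deployment manifest."""
    labels = {"app": spec["app"]}
    pod_spec = {
        "containers": [{
            "name": spec["app"],
            # The platform, not the user, resolves the registry (made-up host).
            "image": f"registry.example.internal/{spec['app']}:{spec['version']}",
            "resources": {"requests": {"cpu": spec["cpu"], "memory": spec["memory"]}},
        }],
        "nodeSelector": {"example.com/node-profile": spec.get("node_profile", "general")},
    }
    # Escape hatch: users who know what they are doing may override pod-spec fields.
    pod_spec.update(spec.get("pod_spec_override", {}))
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{spec['app']}-{spec['variant']}", "labels": labels},
        "spec": {
            "replicas": spec["replicas"],
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


# Hypothetical user input: application, version, variant, CPU, memory, replicas.
user_spec = {"app": "feed", "version": "1.2.3", "variant": "prod",
             "cpu": "2", "memory": "4Gi", "replicas": 3}
manifest = expand_li_deployment(user_spec)
```

The design choice this illustrates is that defaults live in the platform, while the override keeps Kubernetes abstracted but not hidden.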
[34:56]
Abdel Sighiouar
Got it.
[34:56]
Ronak Nathani
And again, I can go into different details based on what you're interested in.
[35:00]
Abdel Sighiouar
Are these kubectl plugins using Ahmet's plugin manager for kubectl?
[35:05]
Ahmet Alp Balkan
Damn, I'm keeping all that mess out of this company.
[35:10]
Abdel Sighiouar
Well, I don't know. I love the plugin manager, so he's too modest.
[35:13]
Ahmet Alp Balkan
Thank you.
[35:14]
Ronak Nathani
I will say Ahmet might not say it, but all of our platform teams use all of Ahmet's plugins really heavily. Our end users, not so much, because you want to try and abstract some things away from them. If they need to worry about a cluster or a namespace, then we did something wrong. We don't want them to worry about that.
[35:31]
Ahmet Alp Balkan
Yeah, I would say that maybe one thing that I would highlight is that I don't think we want users to be entirely unaware of the fact that they're running on Kubernetes, because eventually they're going to find out. And them finding out the hard way is probably not really preferable. Right. Like, we want them to kind of understand kubectl logs and kubectl exec exactly when they need to, or if they need port forward, okay, let's have you port forward, because that's probably going to help you troubleshoot something. Right. But aside from that, we really want to hide Kubernetes in the well-paved path when everything is happy and dandy. Other than that, yeah, they'll probably see Kubernetes.
[36:06]
Abdel Sighiouar
So you have basically a golden path, but then you can deviate from the golden path if you really know what you're doing in a way. Right?
[36:13]
Ronak Nathani
Yeah. And this is where we found users doing several interesting things, of course. For example, setting your replica count to zero is pretty easy. And sometimes scaling down your application, where you go from a thousand to ten just because you fat-fingered something, well, that is really easy too. And unfortunately we have seen some of those incidents in production. So what we ended up doing was adding a bunch of guardrails, where you can do a lot of things when you want to go off the golden path, but there are guardrails to make sure your application is protected and you don't take the site down.
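A guardrail of the kind described here could look roughly like the check below. The 50% threshold, the `force` flag, and the error type are all invented for illustration; in a real cluster this logic would more plausibly sit in an admission webhook or in the deployment tooling.

```python
class GuardrailError(ValueError):
    """Raised when a replica change looks like a fat-finger mistake."""


def check_scale_change(current: int, requested: int,
                       max_drop_fraction: float = 0.5, force: bool = False) -> int:
    """Return the accepted replica count, or raise if the drop looks accidental."""
    if force:
        return requested  # explicit opt-out for people who know what they are doing
    if current > 0 and requested == 0:
        raise GuardrailError("refusing to scale to zero without force")
    if current > 0 and (current - requested) / current > max_drop_fraction:
        raise GuardrailError(
            f"refusing to drop from {current} to {requested} replicas without force")
    return requested
```

Under these assumed defaults, going from 1000 to 900 replicas passes, while 1000 to 10 is rejected unless `force=True` is supplied.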
[36:48]
Abdel Sighiouar
Can I ask a quick follow-up question as a developer? What about my application dependencies? Like if I need a database or if I need anything else, monitoring, logging, you know, all the extra stuff that an application needs. How does that work? Is that part of the automation that your platform provides, or is that something that I still have to do myself?
[37:07]
Ronak Nathani
We have several database teams at LinkedIn. We provide a set of databases as a service. So there is a team providing relational storage, there's a team providing a key-value store, there's a team providing a cache as a service, there's a team providing an object store, so on and so forth. And many of these LinkedIn wrote itself, like LinkedIn wrote Kafka. So we have a team running Kafka, and we have a team running one of our key-value stores called Espresso. We run Venice as a heavy store which is used for machine learning features, and that is also open source. So we have the Data Infra org, which has a bunch of these teams providing these stateful services. If I want to deploy a service, depending upon what I need for my application, I would essentially go and get a database provisioned. And all of this is also automated through a UI. So you would go to a UI and say, I want this much storage for this org. It rolls up into your org quota, for example. So that is handled by a separate team. It's not part of compute. But then you get coordinates for your database, which you set up as part of your configs, and you say, hey application, go talk to this database, for example.
[38:09]
Abdel Sighiouar
Got it, got it. So I mean, obviously talking to you guys is awesome, but this is part of the feedback we got last year, which is people wanting us to have more end users on the show. Right? Like more people actually using Kubernetes and not just vendors and community, which is typically what we do. And I guess I will have to assume that part of the reason why people want end users is because they want to hear about incidents, because those are the fun part. Right? Can we talk about any, if you can, if you are allowed to say anything? Just high level, you don't have to go into details.
[38:39]
Ahmet Alp Balkan
Yeah, I think one example that I can give is the NFD example that I mentioned earlier. Here's an open source component we brought into our cluster. It worked fine all these years, and we decided to upgrade one day. Everything seemed good until we hit the largest cluster that we had. And at that largest cluster the controller's informer was failing to sync in time, and the timeout error was not handled properly. So the controller thought, there is no data, so I'm just going to clear all the labels on all the nodes. It didn't distinguish between the error case and the empty data case. This happened when I was at Google, by the way. There's a famous Google Cloud incident where, yeah, I think all of the load balancer GCLB configs were cleared out because of an empty file. So I would say this sort of stuff happens pretty regularly. I think Ronak's Argo CD example was also pretty relevant to controllers: the scale of the cluster itself, how many objects are in the cluster, and how much churn there is in the cluster. I would say that for the most part a lot of the incidents that we're hitting are a result of churn.
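The error-versus-empty distinction at the heart of this incident can be shown in a few lines. This is a generic sketch, not the actual node feature discovery code: `fetch_features` is a hypothetical callable standing in for whatever produces the discovered labels, and a timeout is modeled with Python's built-in `TimeoutError`.

```python
from typing import Callable


def desired_labels(current: dict[str, str],
                   fetch_features: Callable[[], dict[str, str]]) -> dict[str, str]:
    """Compute the labels a node should have after a discovery pass."""
    try:
        discovered = fetch_features()
    except TimeoutError:
        # Unknown state is not the same as "no features": keep what we have.
        return dict(current)
    # A successful fetch is authoritative, even when it is legitimately empty.
    return dict(discovered)


def timed_out() -> dict[str, str]:
    raise TimeoutError("cache failed to sync in time")


# Made-up label, for illustration only.
existing = {"feature.example.com/avx512": "true"}
```

The failure mode described in the episode corresponds to treating the `TimeoutError` branch like the empty-result branch.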
[39:41]
Abdel Sighiouar
All right, cool. We're going to put a link to your blog post for people to go read more about that. Obviously, you know, talking about details of incidents is kind of sensitive for a lot of companies. Especially since, from the conversation, it sounds to me like at LinkedIn you've just reinvented an internal cloud platform.
[39:56]
Ronak Nathani
Really?
[39:56]
Abdel Sighiouar
Like, that's how I'm understanding how you function. So that's actually pretty cool. Well guys, thank you very much for being on the show. Before we go, Ahmet, you have a blog you want to plug?
[40:07]
Ahmet Alp Balkan
Absolutely, yeah. So we're blogging actively on our LinkedIn engineering blog. We're going to be talking about our Kubernetes platform in more detail. Ronak and I do have a talk coming up at KubeCon London. I'm hoping that this episode airs by that time.
[40:21]
Abdel Sighiouar
It will be.
[40:22]
Ahmet Alp Balkan
Yep. So we'll be in London hopefully talking about Kubernetes platform. Also, I personally blog about various Kubernetes controller misadventures as well. I think my next article is going to be about all the ways Kubernetes can evict your pods because we learned this the hard way as well.
[40:38]
Abdel Sighiouar
I cannot wait to read that.
[40:40]
Ahmet Alp Balkan
Ronak, is there anything else you want to plug?
[40:42]
Ronak Nathani
I will say we're actively hiring, so if you want to chat more about opportunities at LinkedIn or just want to join the discussion and talk about running Kubernetes at scale, everyone can find us on both Twitter, or X, and LinkedIn, and we're happy to chat more.
[40:57]
Abdel Sighiouar
Yeah, find them on LinkedIn. Ronak, you have a podcast, right? Software Misadventures, if I remember correctly.
[41:02]
Ronak Nathani
Software Misadventures, yes, that's the right one.
[41:05]
Abdel Sighiouar
All right. We'll make sure to have a link for it in the show notes so people can go take a listen. Thank you very much folks for your time. I appreciate it and have a good day.
[41:12]
Ronak Nathani
Yeah, thanks so much for having us.
[41:14]
Ahmet Alp Balkan
Bye.
[41:16]
Abdel Sighiouar
That brings us to the end of another episode. If you enjoyed the show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check our website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening and we'll see you next time.