Kubernetes Podcast from Google: Detailed Summary
Episode Title: Kubernetes at LinkedIn, with Ahmet Alp Balkan and Ronak Nathani
Hosts: Abdel Sghiouar, Kaslin Fields
Guests: Ahmet Alp Balkan, Ronak Nathani
Release Date: March 25, 2025
Duration: Approximately 41 minutes
1. Introduction
The episode begins with hosts Abdel Sghiouar and Mofi Rahman introducing their guests, Ahmet Alp Balkan and Ronak Nathani, both software engineers on LinkedIn's Compute Infrastructure team. The discussion covers how LinkedIn runs Kubernetes at scale, the challenges the team faced, and the lessons learned while moving from a proprietary container orchestration system to Kubernetes.
2. Transitioning to Kubernetes at LinkedIn
Key Points:
- Historical Context: LinkedIn initially developed its own container runtime and scheduler approximately a decade ago, prior to Docker's emergence.
- Reason for Transition: The proprietary stack became increasingly costly to maintain and lacked the scalability offered by the mature Kubernetes ecosystem.
- Current State: LinkedIn is moving the majority of its workloads, including stateless, stateful, and batch jobs, to Kubernetes. The migration is still in progress, and management is keen to see it completed.
Notable Quote:
Ahmet Alp Balkan [02:39]:
"But over the last few years we realized that it's aging a little bit too. See the marginal cost of adding every new feature is increasing. ... with Kubernetes and other open source ecosystem becoming just way more mature, it just made sense for us to transition onto that path."
3. Running Stateful Workloads on Kubernetes
Key Points:
- Running Databases: Despite common perceptions, LinkedIn successfully runs databases on Kubernetes by leveraging its flexibility and controlling the full stack from bare metal to configuration.
- Custom Protocols: They use a single generic stateful workload operator. Each database implements a specific protocol that is largely Kubernetes-agnostic, so the team does not need to write a separate operator per database.
- Maintenance Lifecycle: The team ensures coordinated maintenance with stateful systems, avoiding abrupt pod evictions and respecting application states.
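The episode does not describe the protocol itself. As a rough illustration of the idea only, the sketch below assumes a hypothetical two-method contract (`can_disrupt`, `on_disrupted`) that any database could implement, plus a generic operator step that asks the application before evicting a replica. Every name here is invented for the example and is not LinkedIn's API.

```python
from typing import Protocol


class StatefulApp(Protocol):
    """Hypothetical contract a database implements; not LinkedIn's actual protocol."""

    def can_disrupt(self, replica: str) -> bool: ...
    def on_disrupted(self, replica: str) -> None: ...


class QuorumStore:
    """Toy database that insists on keeping a majority of replicas up."""

    def __init__(self, replicas: list[str]) -> None:
        self.up = set(replicas)
        self.total = len(replicas)

    def can_disrupt(self, replica: str) -> bool:
        # Approve only if a majority stays up after this replica leaves.
        remaining = len(self.up - {replica})
        return remaining > self.total // 2

    def on_disrupted(self, replica: str) -> None:
        self.up.discard(replica)


def drain(app: StatefulApp, replicas: list[str]) -> list[str]:
    """Generic operator step: evict only the replicas the application approves."""
    evicted = []
    for replica in replicas:
        if app.can_disrupt(replica):
            app.on_disrupted(replica)
            evicted.append(replica)
    return evicted
```

The operator code never inspects database internals; it only calls the contract, which is what would let one operator serve many databases.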
Notable Quotes:
Ahmet Alp Balkan [03:59]:
"So we are insane enough to run Kubernetes on bare metal and we are also insane enough to run databases on Kubernetes."
Mofi Rahman [05:25]:
"We have written our own generic stateful workload operator. ... that protocol is largely Kubernetes agnostic, which lets us run any number of different databases without writing a separate Kubernetes operator for each."
4. Handling Kubernetes Control Plane and Dependencies
Key Points:
- Current Control Plane Setup: Kubernetes and its components (API server, etcd, controller manager, scheduler) run as systemd services on LinkedIn's legacy orchestration stack.
- Future Plans: LinkedIn is exploring running Kubernetes on Kubernetes ("Kubeception") to streamline operations and achieve cost savings.
- Networking Stack: LinkedIn employs a flat, data center-routable networking model, avoiding common cloud-native networking solutions like kube-dns or flannel to reduce latency.
Notable Quotes:
Mofi Rahman [08:38]:
"Today our Kubernetes runs on our legacy orchestration stack. ... we want to run Kubernetes itself on Kubernetes. I think some people call it Kubeception."
5. Scaling Kubernetes Clusters
Key Points:
- Cluster Size: LinkedIn aims to push Kubernetes cluster sizes beyond 5,000 nodes, managing multiple clusters across regions to avoid fragmentation and capacity wastage.
- Shard Management: To handle large clusters, events are sharded across separate etcd clusters to enhance scalability.
- Future Aspirations: There is interest in open-source alternatives to etcd and solutions like Spanner to further scale beyond current limitations.
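The episode does not say how the event sharding mentioned above is configured. One stock mechanism is the kube-apiserver flag `--etcd-servers-overrides`, which routes a given resource to its own etcd cluster. The sketch below only assembles that flag value; the endpoint hostnames are hypothetical, and whether LinkedIn uses this flag is an assumption.

```python
def etcd_overrides_flag(overrides: dict[str, list[str]]) -> str:
    """Build the value for kube-apiserver's --etcd-servers-overrides flag.

    Format per override: group/resource#servers, with servers separated by ';'
    and overrides separated by ','. Core-group resources have an empty group,
    so events are written as "/events".
    """
    parts = []
    for resource, servers in overrides.items():
        parts.append(f"{resource}#{';'.join(servers)}")
    return "--etcd-servers-overrides=" + ",".join(parts)


# Hypothetical endpoints for a dedicated events etcd cluster.
flag = etcd_overrides_flag(
    {"/events": ["https://etcd-events-0:2379", "https://etcd-events-1:2379"]}
)
```

Events are a natural first candidate for a separate cluster because they are high-churn and short-lived, so moving them takes write load off the etcd cluster that holds the rest of the cluster state.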
Notable Quote:
Ahmet Alp Balkan [11:04]:
"If there are open source alternatives available to etcd, which allows us to scale the cluster way beyond what we can do today, that would be of lots of interest not just at LinkedIn but also some of the other folks we have spoken to."
6. Infrastructure as a Service and Hardware Refresh
Key Points:
- Machine Management: LinkedIn has developed an Infrastructure as a Service (IaaS) layer that programmatically manages bare metal inventory, integrating with Kubernetes through custom resources and controllers.
- Hardware Refresh Strategy: Machines are organized into pools with node profiles (e.g., high memory). During hardware refreshes, LinkedIn scales down old pools and scales up new ones, ensuring seamless transitions without significant downtime.
- Maintenance Zones: Machines within pools are spread across maintenance zones to minimize impact during upgrades or failures, adhering to strict topology spread constraints.
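To make the maintenance-zone arithmetic concrete: if a pool is spread evenly over N zones, taking one zone down touches about 1/N of the pool. The sketch below assumes 20 zones, which would yield the 5% ceiling quoted in the episode; the zone count, node names, and round-robin placement are assumptions for illustration, not details from the source.

```python
from collections import Counter


def assign_zones(nodes: list[str], zone_count: int) -> dict[str, int]:
    """Round-robin nodes across zones so zone sizes differ by at most one."""
    return {node: i % zone_count for i, node in enumerate(nodes)}


def max_zone_share(assignment: dict[str, int]) -> float:
    """Largest fraction of the pool that sits in any single zone."""
    sizes = Counter(assignment.values())
    return max(sizes.values()) / len(assignment)


# 1,000 hypothetical machines spread over 20 hypothetical zones.
nodes = [f"node-{i}" for i in range(1000)]
assignment = assign_zones(nodes, zone_count=20)
share = max_zone_share(assignment)  # 50 of 1000 nodes per zone
```

With an uneven pool size the largest zone is one node bigger than the smallest, so the worst-case share sits slightly above 1/N.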
Notable Quotes:
Mofi Rahman [14:46]:
"Anytime you rack a machine, it automatically gets added to our infrastructure as a spare machine in our data center."
Ahmet Alp Balkan [17:03]:
"Our machines within a pool are also spread across what we call maintenance zones, ensuring that any scale-up or scale-down impacts only a maximum of 5%."
7. Ensuring Predictable Performance Amidst Hardware Diversity
Key Points:
- Node Profiles: LinkedIn uses node profiles to abstract hardware specifications, allowing workloads to request specific profiles (e.g., latest CPU generation) to ensure performance consistency.
- Performance Testing: Before introducing new hardware SKUs, extensive performance tests are conducted to validate suitability for sensitive applications.
- Future Enhancements: Plans include introducing scheduler plugins to dynamically adjust node weights based on application requirements, mitigating fragmentation and optimizing resource utilization.
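In stock Kubernetes, one common way to express this kind of opt-in is a node label plus a `nodeSelector` on the pod. The sketch below builds such a manifest as a plain dictionary. The label key `example.com/node-profile` and the profile name are hypothetical, and the episode does not say that LinkedIn implements profiles this way.

```python
from typing import Optional

PROFILE_LABEL = "example.com/node-profile"  # hypothetical label key, not from the episode


def pod_spec(name: str, image: str, profile: Optional[str] = None) -> dict:
    """Return a minimal Pod manifest; pin it to a node profile only on request."""
    spec: dict = {"containers": [{"name": name, "image": image}]}
    if profile is not None:
        # nodeSelector restricts scheduling to nodes carrying this label value.
        spec["nodeSelector"] = {PROFILE_LABEL: profile}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


# A workload that opts into a hypothetical "latest-cpu" profile.
pinned = pod_spec("feed", "registry.example.com/feed:1.0", profile="latest-cpu")
```

Leaving the selector off by default keeps most workloads schedulable on any hardware, which matters for the fragmentation concern raised above.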
Notable Quotes:
Ahmet Alp Balkan [19:42]:
"One thing we do right now is we have different node profiles ... we provide a pool of machines which has the latest SKU and applications who actually want this would basically say I want to opt into asking for this specific SKU."
8. Custom Controllers and Managing Complexity
Key Points:
- Prevalence of Custom Controllers: LinkedIn extensively uses custom controllers to manage Kubernetes functionality, tailored to their unique needs.
- Development Challenges: Controllers in production environments are complex, necessitating meticulous development to avoid issues like memory leaks, throughput bottlenecks, and infinite loops.
- Evaluation of Open-Source Components: LinkedIn adopts a cautious approach, rigorously testing and evaluating open-source components before integrating them into their stack. If a component fails to meet scalability or reliability standards, they opt to build custom solutions.
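One standard defence against the hot retry loops mentioned above is per-item exponential backoff: each consecutive failure for the same object doubles the wait before the next attempt, up to a cap. The sketch below is a generic illustration of that technique with arbitrary default values; the names are hypothetical and it is not code discussed in the episode.

```python
def backoff_delay(failures: int, base: float = 0.005, cap: float = 1000.0) -> float:
    """Seconds to wait before retrying an item that has failed `failures` times."""
    return min(cap, base * (2 ** failures))


class RetryTracker:
    """Tracks consecutive failures per object key (hypothetical helper)."""

    def __init__(self) -> None:
        self.failures: dict[str, int] = {}

    def next_delay(self, key: str) -> float:
        # Record one more failure and return the delay for this retry.
        count = self.failures.get(key, 0)
        self.failures[key] = count + 1
        return backoff_delay(count)

    def forget(self, key: str) -> None:
        # Call after a successful reconcile so the next failure starts fresh.
        self.failures.pop(key, None)
```

Forgetting the key on success matters as much as the backoff itself: without it the tracker grows without bound, which is one of the memory-leak patterns the key points warn about.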
Notable Quotes:
Mofi Rahman [22:56]:
"Anytime a random team out there in the company shows up saying, hey, I have a controller that I would like to deploy to all our clusters, please. Usually our response is not very positive."
Ahmet Alp Balkan [27:05]:
"Number of stars doesn't represent how something will run in your production environment. ... if we find out that any of the things I mentioned aren't true, then we look at the cost of what it means for us to write that component from scratch."
9. Developer Experience and Platforming
Key Points:
- Golden Path: LinkedIn offers a curated Kubernetes experience, abstracting complexities and providing a "golden path" for developers to deploy applications without needing deep Kubernetes knowledge.
- Simplified Deployment: Developers specify compute resources, application identifiers, and deployment environments through user-friendly interfaces and workflows.
- Guardrails: To prevent inadvertent disruptions, LinkedIn enforces guardrails that restrict actions like scaling replicas to zero, ensuring stability and preventing site-wide outages.
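A guardrail of this kind can be pictured as a validation step that runs before a scale request is accepted. The sketch below is a minimal illustration under that assumption; the function name, exception type, and minimum of one replica are hypothetical, and the episode does not describe how LinkedIn enforces the rule.

```python
class GuardrailError(ValueError):
    """Raised when a request violates a platform guardrail (hypothetical type)."""


def validate_scale(app: str, desired: int, min_replicas: int = 1) -> int:
    """Accept a replica count only if it keeps the application serving."""
    if desired < min_replicas:
        raise GuardrailError(
            f"{app}: refusing to scale to {desired} replicas "
            f"(minimum is {min_replicas})"
        )
    return desired
```

For example, `validate_scale("feed", 5)` returns 5, while `validate_scale("feed", 0)` raises `GuardrailError` instead of taking the service offline.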
Notable Quotes:
Ahmet Alp Balkan [31:31]:
"We have certain nouns within LinkedIn which uniquely identify your application. ... you don't have to worry about which cluster my application is running into."
Mofi Rahman [35:13]:
"If they need to worry about a cluster, a namespace, then we did something wrong. We don't want them to worry about that."
10. Managing Application Dependencies
Key Points:
- Database as a Service: LinkedIn offers various stateful services (e.g., relational storage, key-value stores, caches) managed by dedicated teams. Developers can provision these services through automated UIs, integrating seamlessly with their applications.
- Separation of Concerns: While compute resources are managed through the Kubernetes platform, stateful services are handled separately, ensuring specialized management and scalability.
Notable Quote:
Ahmet Alp Balkan [37:06]:
"We run a set of databases as a service. ... If I want to deploy a service and depending upon what I need for my application, I would essentially go and get a database provisioned."
11. Incidents and Learnings
Key Points:
- Component Failures: Open-source components such as the etcd operator and Argo CD caused incidents once LinkedIn's clusters and workloads crossed certain scaling thresholds, highlighting the challenges of adopting off-the-shelf solutions at scale.
- Root Causes: Issues such as improper handling of timeouts, label management errors, and reconciliation failures under high churn rates were identified.
- Proactive Measures: LinkedIn emphasizes thorough stress testing, source code reviews, and proactive replacement of components that do not meet reliability standards.
Notable Quotes:
Mofi Rahman [38:08]:
"... our clusters and our workloads hit a specific threshold and we had a week-long outage. ... things are taking forever to reconcile."
Abdel Sghiouar [38:39]:
"People wanting us to have more end users on the show ... I will have to assume that part of the reason why people want end users is because they want to hear about incidents."
12. Conclusion and Resources
In the concluding segment, Ahmet and Ronak share additional resources for listeners interested in deeper dives into LinkedIn's Kubernetes platform. They mention upcoming talks at KubeCon London and encourage listeners to explore the LinkedIn engineering blog for detailed articles on their experiences and lessons learned.
Notable Quotes:
Mofi Rahman [40:06]:
"We're blogging actively on our LinkedIn engineering blog. ... we have a talk coming up in KubeCon London."
Ahmet Alp Balkan [40:56]:
"We're actively hiring, so if you want to chat more about opportunities at LinkedIn ..."
They also promote their personal podcast, "Software Misadventures," where they discuss various technical challenges and solutions.
Additional Information
- Show Notes and Resources: Links to the discussed blog posts, talks, and the "Software Misadventures" podcast will be available in the show notes.
- Connect with Guests:
- Ahmet Alp Balkan: Active on LinkedIn and other social platforms.
- Ronak Nathani: Host of the "Software Misadventures" podcast.
This episode looks at how LinkedIn handles the complexities of running Kubernetes at very large scale. From moving away from proprietary systems to handling stateful workloads and protecting developer productivity, Ahmet and Ronak share experiences that can guide other organizations facing similar challenges.
