Abdel Sghiouar
Hi, and welcome to the Kubernetes Podcast from Google. I'm your host, Abdel Sghiouar.
Kaslin Fields
And I'm Kaslin Fields.
Abdel Sghiouar
Imagine trying to move millions of compute cores and thousands of microservices to a brand new platform, all without dropping a single user request, ride or delivery. Sounds like an absolute logistical nightmare, right? Well, today we're sitting down with someone who actually lived to tell the tale. Lucy Sweet is a staff software engineer at Uber and a lead for the Kubernetes Node Lifecycle Working Group. In this episode, we're diving deep into Uber's monumental infrastructure journey, moving away from their in-house system to Kubernetes. We'll be unpacking the reality of running this at scale, why it's always DNS, and why building things for fun is worth it.
Kaslin Fields
But first, let's get to the news. Broadcom announced they're donating Velero to the CNCF at the Sandbox level. Velero is a Kubernetes-native backup, restore and migration tool. It traces its origins to Heptio, which was founded by former Google engineers Joe Beda and Craig McLuckie and acquired by VMware and eventually Broadcom.
Abdel Sghiouar
The CNCF released the KubeCon + CloudNativeCon Amsterdam 2026 transparency report. This edition became the largest event in CNCF history with over 13,500 attendees, 46% of whom were attending KubeCon for the first time, representing 100 countries and over 3,000 organizations.
Kaslin Fields
The call for proposals for KubeCon + CloudNativeCon North America 2026 is open and will close on May 31, 2026. The event will take place in Salt Lake City, Utah, November 9 to 12.
Abdel Sghiouar
OpenChoreo released version 1.0 to the CNCF Sandbox. The project originated as the open-source counterpart to WSO2's commercial Choreo SaaS platform and is designed to give engineering teams a complete foundation for running workloads on Kubernetes without requiring them to build it themselves. It includes a Backstage-powered developer portal, built-in CI/CD, GitOps workflows, observability, and what the project calls a programmable control plane.
Kaslin Fields
And that's the news.
Sam
Today, I'm talking to Lucy Sweet. Lucy is an engineer at Uber. She is part of the team responsible for building and maintaining most, or nearly all, of the platform infrastructure used by engineers globally at Uber. Lucy is also a Kubernetes Node Lifecycle Working Group lead. Welcome to the show, Lucy.
Lucy Sweet
Thanks, and thanks for having me.
Sam
So I'm very happy to be here. It took us quite a bit of time to get this together so we could talk to you folks. This is part of our series interviewing end users of Kubernetes. This is feedback we got for the show, and we want to talk to you a little bit about your Uber journey and your personal journey through moving to Kubernetes. Right, so you did a talk a while back. I think it was KubeCon during COVID time, if I remember correctly.
Lucy Sweet
Oh, wow. That's like a lifetime ago at this point. Jesus.
Sam
Yes. Feels like. Feels like ages.
Lucy Sweet
Yeah. Was I even alive back then? My word. Yeah.
Sam
And part of your talk was the story of how Uber migrated millions of cores to Kubernetes. Can you talk a bit about that? Like, what was the triggering point to converge on Kubernetes?
Lucy Sweet
Yeah, absolutely. So originally at Uber we had been running separate stateless, stateful and batch compute platforms. So we have a stateless platform called Up, a stateful one called Odin, and then there's a batch one as well, and so on. And none of these used to use Kubernetes. The stateless one was built on a system called Peloton, which came into existence literally like a decade ago plus, and it came into existence around the same time as Kubernetes. It is a bit different, though. Peloton was built on Mesos and was a bit more workflow based rather than reconciliation based. And we happily used it for many, many years. But over time, Mesos started to be less and less actively developed and maintained. And also we saw where the wind was going. We saw that everyone was converging on this one platform, and that brings you a lot of benefits. When you're on the same platform as everyone else, you get a lot of network effect and you don't have to build solutions to common problems that everyone else has also already had. So back about three years ago now, we decided to move our stateless compute fleet to Kubernetes from Peloton. And that process all kicked off just as I started at Uber. So today we've got our whole stateless fleet on Kubernetes, and I'm sure we'll talk a bit more about that as well. And we're just starting to move our stateful compute stack over as well. Because we had such a good experience with the stateless one, we were like, okay, well, why not go further? Why not do more?
Abdel Sghiouar
Nice.
Sam
And so I was reading some of your publications on the engineering blog of Uber, and one thing that I found very, very entertaining and actually very cool to read is the story of the migration. So migrating millions of cores, I assume hundreds or thousands of microservices, without downtime. I mean, at the scale of Uber, I assume it's a very complex thing. That's essentially like if Google decided, oh, we're going to move from Borg to Kubernetes while keeping Google alive, right?
Lucy Sweet
When are you guys going to do that?
Sam
That's a very good question. I am in Bergen at Cloud Native Bergen, and somebody literally asked me the same question today, like, when are you guys moving to Kubernetes? So what, in your mind, has been the key to success for that specific migration? Like migrating without downtime, right? Like, what have been some lessons learned, patterns, things that you have kind of learned through that journey?
Lucy Sweet
Well, one of the first things I think we kind of had to pay our technical debt on was that we didn't have this property called portability. So a lot of our services were dependent on, oh, I have to run on this machine, and if I'm not on this machine then everything breaks, or, oh, I have to bind to this static network port, so if someone else tries to bind to that network port, everything goes wrong. Right. And before the migration this was all over Uber, it was everywhere, and it took over a year, from memory, of basically pestering people to undo this. The way that we tried to do it was, with every migration, and in fact every upgrade of every service at Uber, we do something called make before break. So we try and start new containers of the application on whatever new stack we're running, or, if there's a new version upgrade, with the new container image on the same stack. And then we only start destroying the old version when that new version has come up fully, when all of the health checks are passing, when end-to-end tests have passed, these sorts of things. And we used that pattern to actually find services that we thought were not portable. Because what we can do is we can say, okay, I'm going to try and spawn a container of your service on a K8s cluster, and let's see if you come up. Does your service immediately exit 1? Do the health probes pass? Does the end-to-end test pass? If they don't, we just back off. We take the new version on K8s away and they end up on a long tail. You can keep trying on that long tail automatically, but eventually it gets to a human, and a human has to go to the service owner and say, hey, folks, why does your application not work on K8s? What crazy stuff have you done this time? That process, yeah, took over a year to get through. But that portability is so important, and we keep using it now. So, for example, we used portability once we got it to migrate to K8s. But more recently we also looked at, hey, you know, our entire fleet is on AMD64 arch CPUs, and we want to start using ARM. ARM's cool. Okay, well, because we already have this portability trait and we have this make-before-break trait, we can kind of make this touchless for the service owners. We can just start trying to spawn them make-before-break style on ARM CPUs, and if it breaks, not a problem, we'll just take the new containers away. If it doesn't break, awesome, we can put them on ARM. And we can do that without actually having to talk to service owners at all, really. We only talk to them if we need to move them and their service won't move.
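To make the make-before-break idea concrete, here is a minimal sketch of the kind of portability probe Lucy describes, written against client-go. The function name, namespace handling and timeout are assumptions for illustration, not Uber's actual implementation: spawn a copy of the workload on the candidate cluster, wait for it to report Ready, and delete it again if it never does.

```go
// Sketch of a make-before-break portability probe (illustrative, not Uber's code).
package portability

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// TryOnNewCluster spawns a copy of a service's pod on the candidate cluster,
// waits for it to report Ready, and deletes it again if it never comes up,
// putting the service on the "long tail" for a human to look at.
func TryOnNewCluster(ctx context.Context, cs kubernetes.Interface, ns string, spec corev1.PodSpec) (bool, error) {
	probe := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "portability-probe-", Namespace: ns},
		Spec:       spec,
	}
	created, err := cs.CoreV1().Pods(ns).Create(ctx, probe, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	deadline := time.Now().Add(5 * time.Minute) // assumed grace period
	for time.Now().Before(deadline) {
		p, err := cs.CoreV1().Pods(ns).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if p.Status.Phase == corev1.PodFailed {
			break // e.g. the container immediately exits 1: not portable yet
		}
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				return true, nil // health checks pass: safe to keep migrating
			}
		}
		time.Sleep(10 * time.Second)
	}
	// Back off: take the new container away and leave the old stack untouched.
	_ = cs.CoreV1().Pods(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
	return false, nil
}
```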
Sam
Got it.
Lucy Sweet
But at least for me, yeah, it's very easy to lose portability in a company if you're not always looking out for it, just because of Hyrum's Law. You know, if you give someone a way to depend on a host, at scale they will do it. Someone will depend on a host.
Sam
Yes.
Lucy Sweet
I've seen this far too many times. I'm guilty of this.
Sam
Sometimes it's quite interesting, actually, this whole concept of portability. Yeah, I'm curious about something. Like, I understand the concept of spinning up the service and trying to see if it works and passes testing, but how do you handle it from the traffic point of view? Is that like a blue-green deployment kind of thing, or?
Lucy Sweet
Network traffic can always be a bit of fun. So most applications at Uber have something called canary. We send 1% of the traffic to the canary, the new version. And then the idea is that if that messes up, if that fails, we can always retry. Obviously, there are error cases where it's not just a binary, it errored or it didn't, but most of the time that's good enough, and that's solvable. We do provide customization for our users, but honestly, most users don't have to touch that. At least when we build our platforms, what we're trying to do is build them in a way where, as a user, you just want to write code and move on with your life. Our end users who build the Uber app, they shouldn't have to care about what machine I'm on, what my network traffic looks like, what my scaling looks like. It's just: I write the application, it runs, the end, hopefully. And that's where we come in, and that's where we have to abstract all of these problems away.
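As a rough illustration of the 1% canary split (the backend URLs and percentage are made up; Uber's real routing layer is far more involved), a tiny reverse proxy might look like this:

```go
// Toy 1% canary router: not Uber's edge, just an illustration of the idea.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	stableURL, _ := url.Parse("http://stable.internal:8080") // assumed current version
	canaryURL, _ := url.Parse("http://canary.internal:8080") // assumed new version

	stable := httputil.NewSingleHostReverseProxy(stableURL)
	canary := httputil.NewSingleHostReverseProxy(canaryURL)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Send roughly 1% of requests to the canary; the rest stay on stable.
		if rand.Intn(100) == 0 {
			canary.ServeHTTP(w, r)
			return
		}
		stable.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```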
Sam
Interesting. So this is a question that just popped up in my head, and it's not strictly related to Uber. This is actually a feature we have on GKE, and I'm curious about your opinion, specifically because you're working on the Node Lifecycle Working Group, right? So GKE has a feature called blue-green node pools. What it allows you to do, basically, is bring up node pools on a new Kubernetes version, test if the migration works, fall back if it doesn't, and then, you know, blue-green, but on the infrastructure side. And to me, like, I've been in the industry for 15 years, so blue-green is application, it's not infrastructure. So what are your thoughts? I'm just curious what you think about this.
Lucy Sweet
So, obviously, I haven't used it. We don't use GKE at Uber. I know, don't boo too much.
Sam
Yeah, it's fine, it's coming fine.
Lucy Sweet
But one of the really tough things, I think, especially with node upgrades, is: can you effectively test whether a node is in a ready state for an application without risking disruption? That can be, in my opinion, really tough. Even if you have blue-green for the nodes themselves, if you place an application on a new node and that node is for some reason not compatible with that application, then you could cause problems, right? And those problems may lead to actual disruption. This is actually one of the things we've been discussing in the Node Lifecycle Working Group, this idea as well, not just of readiness, but that whether a node is ready and schedulable for a workload can actually depend on the workload itself. And obviously in K8s we have, you know, the node is ready and the node isn't ready, but that's a very binary signal. Some nodes are ready for some things at some times and may not ever be ready for other things at other times. And this has become especially relevant with AI/ML accelerated workloads, where, you know, I must be on this node because this node has this TPU or GPU, and I must not be on this node because of that. So blue-green is a great place to start, especially with the current K8s stuff. I really want us to push further in K8s. Eventually, in the long term, I would love us to be in a position where we could express node readiness and node upgrades for a given node based on the workloads that could then be placed on that node. So you could say, hey, I'm ready for this type of workload, I'm not necessarily ready for this right now, or I need to be in maintenance for this but not necessarily that. That granularity. But it's always fun, because we always have to balance this against the fact that we can't make Kubernetes infinitely complex. Yeah, of course we can't turn Kubernetes into a big Turing machine, as funny as that would be.
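There is no upstream API today for "ready for this workload", so the following is only a sketch of the idea under discussion: combine the binary Ready condition with a workload-specific requirement (here, a made-up GPU label) before treating a node as usable for a given workload.

```go
// Illustrative only: per-workload "readiness" layered on top of the binary
// NodeReady condition, using a hypothetical label for GPU-equipped nodes.
package nodereadiness

import (
	corev1 "k8s.io/api/core/v1"
)

// ReadyForWorkload reports whether a node is both Ready in the Kubernetes
// sense and satisfies the workload's extra requirements.
func ReadyForWorkload(node corev1.Node, needsGPU bool) bool {
	ready := false
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			ready = true
		}
	}
	if !ready {
		return false
	}
	if needsGPU {
		// Hypothetical label; a real cluster might instead check extended
		// resources such as GPU counts in node.Status.Allocatable.
		return node.Labels["example.com/gpu"] == "true"
	}
	return true
}
```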
Sam
Yeah. And if I'm reading a little bit into your thoughts, I think that one of the challenges with node upgrades specifically is: are you catching any regressions? Is the performance of the application going to be the same on the new version? Are there any bugs in Kubernetes itself? Right, so it's super interesting.
Lucy Sweet
Don't worry, Kubernetes never has any bugs.
Sam
No it doesn't. It's completely bug free. Yeah.
Lucy Sweet
Lies and slander.
Sam
So back to the topic of Uber. In the talk that you did, the video I watched, you talked about this classification of services. I believe it was tier 5 to tier 0, right? Can you talk a little bit about that? Like, how do you go about classifying? I guess it's easy: the thing that is business critical is tier zero and the thing that no one cares about is tier five, right?
Lucy Sweet
Yeah, basically. So yeah, at Uber we have these six service tiers to indicate criticality. Tier five is like, we have an internal foosball league in Denmark, and the service that runs that, because of course we over-engineered it into a service, is tier 5. You know how engineers are. In tier 4 or tier 3 you might find internal tools that are useful, but you can live without them, with pain. Tier 2, you're looking at something that's customer facing but maybe isn't the core trip flow, so it could be like a promotion system, redeeming, that sort of thing. Tier one, you're looking at things that touch what we call the trip flow. That's the minimum stuff you need to call an Uber: the car arrives, you get in, you get to your destination, you get out. If your service is needed for that trip flow to run, it's tier one. And then tier zero is the infrastructure services that support tier one, so the actual stateless compute platform itself. And we use these a lot at Uber. Migrations, we normally work tier by tier. We start in tier five and we work our way down through to tier zero. And that, yeah, that can be useful for us. And I think it's actually one of the things that is quite valuable, and nearly anyone could get this by just labeling even their Kubernetes deployments. In my opinion, understanding which workloads in your organization are more important and less important can help during incident response, because you can know, oh, this is down and that's really, really bad, versus, this thing's down, who cares? And especially during a capacity crunch, on capacity you can actually start to do very interesting things if you, for example, are okay accepting maybe some downtime of the less important tiered workloads. So, for example, we've been thinking about what would happen if you put high-tier workloads on spot instances. Maybe they'll go down every now and then, but if it's a foosball league, who cares? But if we didn't have this tiering system, that would be really, really tough, because we can't put the core trip flow on a spot instance and pray. I mean, we could, but I would get in trouble, because apparently this is not good engineering practice.
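Lucy's point that "nearly anyone could get this by just labeling their Kubernetes deployments" can be sketched like this, using a made-up tier label key; listing everything at a given tier is then a one-line selector query.

```go
// Sketch: criticality tiers as a plain label on Deployments (label key is made up).
package tiers

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ListByTier returns the names of Deployments labeled with a given tier,
// e.g. tier "0" for the infrastructure behind the trip flow and tier "5"
// for the foosball league.
func ListByTier(ctx context.Context, cs kubernetes.Interface, tier string) ([]string, error) {
	deps, err := cs.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("example.com/tier=%s", tier),
	})
	if err != nil {
		return nil, err
	}
	var names []string
	for _, d := range deps.Items {
		names = append(names, d.Namespace+"/"+d.Name)
	}
	return names, nil
}
```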
Sam
Yeah, and as I always joke when I do my talks, if you go out to all your users and ask them, can you tell me how you would classify your service in your head, they will all say it's mission critical, right? Everybody thinks their services are the most important. So yeah, here is a curveball question. Where is DNS on this tiering system?
Lucy Sweet
DNS is tier 0, because you know how it goes. Any incident, is it DNS? It's always DNS. The AWS outage we all had recently, that was fun to experience. Uber doesn't run on AWS, but don't worry, a lot of our suppliers do, so we were definitely involved. Yes, that was a fun DNS one. I think I've read the mini postmortem on that, and of course it's always DNS, right? I have a haiku of that on my wall at home: it's not DNS. There's no way it's DNS. It was DNS.
Sam
Yeah, exactly. So I do have a question about this, I think it's coming from the talk as well. When you had Peloton, you had native snapshotting functionality, which, my understanding is, basically allowed failing containers to be snapshotted and stored so that engineers can debug them later, right? And the ephemeral nature of Kubernetes makes it slightly harder to do something like this. So how did you kind of solve that in Kubernetes, if "solved" is the right term to use here? Yeah.
Lucy Sweet
In Peloton we had this thing called container snapshotting. And what that let you do is, yeah, if a container fails, you get an entire snapshot of the container at the point it failed. And engineers love this, because they go in, they read internal Java logs, whatever, they read the state of the file system. Great fun. Kubernetes doesn't natively support this because, you know, Kubernetes is clean and blows up the container very quickly after the pod goes away. So what we did is we added a little sidecar to all of our containers. And all this sidecar does is, when the container exits, it stops the container being deleted, very quickly jumps onto the file system, and uploads it all to a thing called TerraBlob, which you can basically think of like S3; it's like Uber's S3. That then sits in TerraBlob, the sidecar exits, and then the pod goes away like normal. But we've now captured the snapshot of what the user wanted. And so users got to keep that feature, because this is one of the big things about this migration: we did not want to be in a position where we were taking features away from users. They want to focus on their code. We don't want to be an annoyance to them. It's very hard to take things away once you've given them to people. Very hard, yeah.
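A heavily simplified sketch of that sidecar pattern follows. Everything here is assumed for illustration: the shared emptyDir path, the sentinel file the main container writes on exit, and the object-store endpoint standing in for TerraBlob.

```go
// Toy snapshot sidecar: wait for the app container to exit, tar up the shared
// volume, and upload it to a blob store before letting the pod go away.
// Paths, sentinel file, and upload endpoint are all illustrative assumptions.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/exec"
	"time"
)

const (
	sharedDir    = "/shared"                               // assumed emptyDir shared with the app
	sentinelFile = "/shared/app-exited"                    // assumed file the app writes when it dies
	uploadURL    = "http://blobstore.internal/snapshots/"  // stand-in for an S3-like store
)

func main() {
	// Wait until the main container signals that it has exited.
	for {
		if _, err := os.Stat(sentinelFile); err == nil {
			break
		}
		time.Sleep(2 * time.Second)
	}

	// Capture the filesystem state the engineer will want to debug later.
	archive := "/tmp/snapshot.tar.gz"
	if out, err := exec.Command("tar", "-czf", archive, "-C", sharedDir, ".").CombinedOutput(); err != nil {
		log.Fatalf("tar failed: %v: %s", err, out)
	}

	data, err := os.ReadFile(archive)
	if err != nil {
		log.Fatal(err)
	}
	name := fmt.Sprintf("%s%s-%d.tar.gz", uploadURL, os.Getenv("POD_NAME"), time.Now().Unix())
	req, err := http.NewRequest(http.MethodPut, name, bytes.NewReader(data))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	// The sidecar now exits, and the pod terminates like normal.
}
```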
Sam
Because if you are executing such a large-scale migration, you want to at least try to have feature parity between the existing platform and the new platform, right?
Lucy Sweet
Yeah.
Sam
I mean, that's, in theory, what most organizations are trying to do. The practice is slightly more difficult sometimes, right?
Lucy Sweet
Look, I'm speaking from theory, okay? Definitely not reality.
Sam
I mean, there is this joke we have at Google which is there is no such thing as a yellow banana. It's either a green banana or a brown banana. So it's either alpha or deprecated. So, like, stable doesn't exist.
Lucy Sweet
Isn't there a comic from Goomics as well where it shows the two paths at Google? Path one: deprecated, don't even think about it. Path two: under construction.
Sam
Yes, yes, pretty much, yeah. So one tool will have a deprecated banner, and the new one that is replacing it will have an alpha banner. That's it. Oh, don't worry.
Lucy Sweet
This feels very real to me as well. And I don't even work at Google.
Sam
All right, so I do have another interesting question. I mean, you run hundreds of millions of, well, millions of cores, right? A lot of cores. We recently launched 65,000-node clusters in GKE. EKS replied by launching 100K. I was testing the 65K nodes; a kubectl get pods takes like 5 minutes, as you can imagine. So what is that like? How does debugging look inside Uber? Like, how do you kind of solve these kinds of large-scale Kubernetes storage problems?
Lucy Sweet
Yeah. So we measure in CPU cores, because we think it's a nicer way than measuring in hosts and nodes, because those can all be different sizes. So right now we have just over 5 million cores on our stateless fleet. But then we spread those out. At Uber we have over 200 Kubernetes clusters right now, and the idea is that these are broken up into availability zones that we have inside the company. But then our users don't actually see this complexity. So from the stateless compute platform side and from the stateful compute platform side, they hit deploy and they just see that pods have come up somewhere, and then we abstract away how the network traffic reaches you, how your logs reach you, and this sort of thing, so that the users don't really have to care. Debugging K8s itself, though, when that goes wrong, can be fun. The first thing we have to do is find what cluster the workload's even on. We have a platform called Grail, and we use Grail a lot inside Uber. Grail is basically a very big distributed graph database that is in memory and represents the current state of the world across the whole company. And the powerful thing about Grail is it has associations between things. So there is an association from a Kubernetes pod to an Uber service, and from an Uber service to maybe a networking group. And you can craft these very powerful queries where you say, look up Uber services that meet these parameters, then follow to the Kubernetes pods that are associated with them.
Abdel Sghiouar
Oh, interesting.
Lucy Sweet
Then do this, then do this. We've posted a blog post about this before, and it's one of the most powerful features I find inside Uber: the ability to effectively have not just disparate Kubernetes clusters, but even disparate things that aren't K8s at all, like proprietary networking groups on the one hand, internal service definitions, Kafka queues, whatever, and associate all of these together in one place where you can query once and jump between these technologies. This is one of the coolest things I think we have here. You know, we have a query catalog; there are thousands of queries that people have written in there. You can do really powerful things for debugging, like show me every pod that has this property, or show me every pod that's asking for a GPU and whose service is also part of this team, because you can follow through.
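Grail is proprietary, so this is only a toy illustration of the association idea Lucy describes: an in-memory graph of typed nodes with edges you can follow from services to pods, the way the query catalog does.

```go
// Toy association graph in the spirit of what Lucy describes: typed nodes
// (service, pod, Kafka queue, ...) linked by associations you can traverse.
package grailtoy

// Node is any entity in the graph: an Uber service, a Kubernetes pod, a queue.
type Node struct {
	Kind  string            // e.g. "service", "pod"
	Name  string
	Props map[string]string // e.g. {"team": "payments", "gpu": "true"}
	Edges []*Node           // associations to other nodes
}

// Follow returns every node of the requested kind reachable in one hop,
// e.g. "give me the pods associated with these services".
func Follow(from []*Node, kind string) []*Node {
	var out []*Node
	for _, n := range from {
		for _, e := range n.Edges {
			if e.Kind == kind {
				out = append(out, e)
			}
		}
	}
	return out
}

// Filter keeps only nodes whose property matches, e.g. Props["gpu"] == "true".
func Filter(nodes []*Node, key, value string) []*Node {
	var out []*Node
	for _, n := range nodes {
		if n.Props[key] == value {
			out = append(out, n)
		}
	}
	return out
}

// GPUPodsForTeam is an example two-hop query: services owned by a team, then
// the pods behind them that ask for a GPU.
func GPUPodsForTeam(services []*Node, team string) []*Node {
	owned := Filter(services, "team", team)
	pods := Follow(owned, "pod")
	return Filter(pods, "gpu", "true")
}
```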
Sam
Yeah.
Lucy Sweet
And so, yeah, a lot of the debugging at Uber from our end as platform engineers, where we actually have to touch K8s, we use Grail heavily for it. But from a user's point of view, we try to correlate, bring together and centralize all of this for them, so their logs get shipped off to our central monitoring system, uMonitor, which is built on M3 and all these other things, and the metrics do as well. So for them as a user, the cluster is just a detail. Users don't normally care. But yeah, it has been fun seeing people really push the boundaries on how many nodes you can get in a Kubernetes cluster. It's like a new arms race.
Sam
Yes, yes. I don't know where it's going to end. We'll see. I like this idea of being able to correlate across not only applications, but infrastructure, and dependencies and stuff. I find that very powerful, because basically, if you are an SRE being woken up at 3 AM to fix an incident, the last thing you want to do is have to go on Slack and ping people to ask them what their database is, right? Yeah. So, in one of your talks, I think it was in the video as well, you mentioned that during the migration the biggest amount of work was on the 10% of unique services, or special services, right?
Lucy Sweet
Yeah.
Sam
Basically, we started by discussing this at the beginning, which is, everybody has the same problems, so 90% of all the applications are exactly the same, right? So what was the complexity there? Like, what makes those 10% kind of special?
Abdel Sghiouar
Right.
Lucy Sweet
It depends on the service, but normally they have done something that is now critical to their application, like they've built on top of it, and it is simply not something that can be supported portably. So one of the things we saw before, beyond static network ports, which is one issue, we also saw people doing things like, oh, I'll just put these files on the host, and then when my application comes back, I'll just read them later. You know, it's not like I'll get moved to a different host. That's crazy.
Sam
No, of course not.
Lucy Sweet
That would never happen. The file system is global and consistent and infinite. That's how it works, right? So with these guys in particular, this can be a huge challenge. The other one as well was that there were some applications that were maybe not as stable as we hoped they would be. So, for example, we found some applications on the stateless compute fleet that maybe weren't as stateless as they claimed to be. Maybe if you took down too many of their pods at the same time, they had a big incident and it caused a lot of issues. And for those ones, at least in the interim, we wanted to migrate them and we didn't want to wait for them to fix these issues. So, you know, there were initially quite a lot of conditions of, if this service has this label, which means be really slow on rollouts, then one pod at a time, globally, no more, please. Yeah, and that can take a lot of engineering time, because fundamentally an engineer on our team is spending time writing this path for like one, two, three services. It's not really a place where we can scale our impact in the way we want. It's why I push so hard to make sure that we do not regress back to this and that we maintain this portability, because we want to scale our efforts, and we can't scale our efforts if we're spending a lot of our time dealing with these special cases, these different services that maybe have these unique requirements. Some of them are always going to exist, but in my head it's all a job of minimization: as few special things as we possibly can, so that we can spend time making the 90% of people happy and not spend our time just getting the 10% to a basic running state.
Sam
Yeah, I did a little bit of work in the consulting space, and that's definitely where you spend most of your time, right? Everybody has the same problems, but then there are these unique use cases that are usually the ones that take most of the time when you're doing any sort of migration or any sort of architecting. So, speaking of unique, I want to jump ahead. We talked a lot about stateless compute, right? That kind of was the first phase, I think, of your migration. And there are, of course, batch and stateful workloads, so I assume everything that has to do with data and large language models and AI. So where are you on that? Not what's next, but where are you on that journey?
Lucy Sweet
Basically, yeah. So right now, obviously, all of our stateless fleet is on K8s, and we're really starting to leverage K8s features, which is really, really cool. So now we're looking at stateful and batch. Right now the main thing we're looking at is, over the next year or so, how much of our stateful fleet can we get onto Kubernetes? But the problems here are different, because stateful workloads have unique challenges and unique architectural issues. One of the big ones is that you cannot move a stateful application the way you would move a stateless one. You can't just, oh, spawn it up here and then shut down the other one immediately and you're good. What could possibly go wrong? Especially because at Uber we have locally attached disks on a lot of our hosts, so the place that data is tenanted is very important to us. We don't have much in the way of network-attached storage, for reasons. And one of the things we found, actually, is that there are gaps in Kubernetes that have been a challenge for us to work around when we've designed our stateful migration. One of them, for example, is that eviction in Kubernetes right now is not that mature. You know, we have the eviction API, where you can create an eviction object against the pod, and either at that point in time it will be accepted and the pod will go away, if it's within the PDB, the pod disruption budget, or it will 429, and that's it. It's just a point-in-time decision, right? But that's not really expressive enough, because as someone who runs a stateful platform, I want to be able to say, hey, I want to evict this workload, and I need to check things that maybe Kubernetes might not know about before doing that. Like, I might need to check, do I have capacity to put it somewhere else? What's the status of the data on disk? These sorts of things, right?
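The point-in-time behavior Lucy is describing is visible in client-go: you post an Eviction for the pod and either it is allowed under the PodDisruptionBudget or the server answers 429, with nothing in between. A minimal sketch:

```go
// Sketch of today's point-in-time eviction: accepted within the PDB, or 429.
package eviction

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// TryEvict asks the API server to evict a pod right now. It returns
// (false, nil) when the pod disruption budget blocks it and the server
// answers 429, which is all the expressiveness the current API gives you.
func TryEvict(ctx context.Context, cs kubernetes.Interface, ns, pod string) (bool, error) {
	err := cs.CoreV1().Pods(ns).EvictV1(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
	})
	if apierrors.IsTooManyRequests(err) {
		return false, nil // blocked by the PDB: try again later
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```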
Sam
Yeah.
Lucy Sweet
And with the existing system, you can't really do that. You could maybe add a finalizer to the pod, but when pods are terminating, a lot of stuff has already happened, especially around networking, so a finalizer doesn't really solve for that. So one of the things we've been looking at in the Node Lifecycle Working Group, which hopefully solves this for more people in the future as well, is this concept of eviction requests. And what an eviction request does is it lets you say, I would like to get rid of this pod at some point in the future. So, as someone who wants to get rid of the pod, you create that eviction request, and then you have stakeholders called interceptors. And interceptors annotate the pod with, hey, I want to be told when someone wants to evict this pod. And then, one by one on the eviction request, they appear in a list, they are signaled, hey, you can do your thing now, and when you've done your thing, you just say that you're completed and we move on to the next interceptor.
Sam
Got it?
Lucy Sweet
And that now is a much more powerful verb, because instead of a point-in-time decision of, can I evict you right now, 429 or 200, that's it, you can say, I would like to evict this pod at some point. Please do your business logic, please move your data. Maybe it will take you a few minutes, maybe it will take you a few hours, maybe it will take you a day if you've got a huge piece of data. And this allows you, effectively, to defer pod eviction to a point in time where it makes sense for your business. If the pod needs to be evicted now because it's on a bad host, you can progress it more quickly. You can also understand and visualize this a lot more easily in Kubernetes, because right now, obviously, eviction objects don't really exist. You know, you create one and it blows up with the pod. It's special. But with this, you can actually model, oh, someone's trying to evict this pod because of XYZ, and the status right now is that this interceptor, maybe one that's moving the data, is copying the data to a new replica, and once that's done, it will do this, this, this, and then it will go away. That, I think, is a lot easier for users of K8s to see and understand as well, especially when we're modeling it as resources inside the cluster. So right now we're hoping, well, we didn't hit the release for December, unfortunately, but we're hoping that next release we can get at least a KEP in and maybe a nice alpha. Within Uber, we're pushing this really hard. We've already managed to get a working reference implementation of the KEP online and rolling, and we're actually going to use that in our migration, starting in about January, as a production part of it. Because even outside of Kubernetes itself, it's a really good primitive and it's really powerful. And in my opinion, it's not too complex either.
Sam
I can see how that would be super useful. So, a follow-up question: would that also mean you can have some sort of signal inside the container that the application can intercept and do something with, like close connections or commit data to disk? You see what I mean?
Lucy Sweet
Yeah, yeah. What you could do is just add an interceptor as the last one. So interceptors have to be able to look at the cluster, but you can do that with service accounts.
Sam
Right.
Lucy Sweet
You can have an interceptor on the pod that the container itself is looking for. And when it comes up, that could be the signal to the application of, hey, close your connections, shut yourself down, or, I should say, get yourself ready to shut down, because you shouldn't shut yourself down, since then the pod goes away anyway, and then progress. But yeah, all you need to be an interceptor is access to the cluster to see eviction request objects, and you need to make sure that on the pod you're patched in as, hey, I'm an interceptor for this pod.
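The eviction-request API is still a KEP in progress, so the following is only a hypothetical sketch of the interceptor pattern using invented annotation keys, not the KEP's actual types: register interest, wait for an eviction to be requested, do the business logic Kubernetes doesn't know about, then mark yourself completed so the eviction can progress.

```go
// Hypothetical sketch of the interceptor flow; the annotation keys are invented
// purely to illustrate the pattern Lucy describes, not an upstream API.
package interceptor

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

const (
	requestedKey = "example.com/eviction-requested" // invented: someone wants this pod gone
	completedKey = "example.com/eviction-done"      // invented: this interceptor has finished
)

// Run watches pods in a namespace and, whenever an eviction has been requested,
// performs moveData (copy replicas, check capacity, ...) before acknowledging.
func Run(ctx context.Context, cs kubernetes.Interface, ns string, moveData func(pod string) error) error {
	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.Type != watch.Added && ev.Type != watch.Modified {
			continue
		}
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Annotations[requestedKey] != "true" || pod.Annotations[completedKey] == "true" {
			continue
		}
		// Business logic Kubernetes knows nothing about; it may take minutes or hours.
		if err := moveData(pod.Name); err != nil {
			continue // keep the eviction deferred until the data can actually move
		}
		patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"true"}}}`, completedKey))
		if _, err := cs.CoreV1().Pods(ns).Patch(ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```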
Sam
Yeah, that's certainly better. I'm going to tell you this, it's, I think, a funny story. I worked on a project a long time ago, which, by a long time ago, I mean 2018. So, right, not that long ago.
Lucy Sweet
I hadn't even graduated in 2018. So for me this is a lifetime ago.
Sam
So what this project was, basically: a customer had a Node.js app with a known memory leak issue, right? And they debugged it down to the very simple fact that after a certain number of requests served from that particular app, the memory leak issue happens. So what was the solution? The solution was, we're going to put in a sidecar that counts how many requests the pod is processing, and after the critical number of requests, we kill the pod. I hope they moved away from this, but that was one of the funniest implementations I have seen.
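For the record, a sidecar like the one in that story is only a few lines. The port numbers and the request limit below are made up, and exiting the process stands in for "kill the pod":

```go
// The infamous workaround, roughly: a proxy sidecar that counts requests and
// gives up after a threshold so the pod gets restarted. All values here are
// illustrative; fixing the memory leak is the better answer.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"sync/atomic"
)

func main() {
	const maxRequests = 10000 // assumed "critical number of requests"

	app, _ := url.Parse("http://127.0.0.1:8080") // the leaky app next door
	proxy := httputil.NewSingleHostReverseProxy(app)

	var served int64
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		proxy.ServeHTTP(w, r)
		if atomic.AddInt64(&served, 1) >= maxRequests {
			// In the original story the pod was killed; exiting the sidecar
			// here is a stand-in for that.
			log.Println("request budget exhausted, exiting")
			os.Exit(0)
		}
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```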
Lucy Sweet
If your viewers could see my face right now.
Sam
Yeah, every time I tell this story, people go like, what?
Lucy Sweet
I mean, I've definitely seen some fun implementations in the past. One time I saw a team who wanted to do custom control of node readiness, and their solution was not readiness gates. Their solution was just to crash loop the kubelet until they were ready. Of course, that was an interesting implementation approach. If it works, it works.
Sam
I mean, that is technically a readiness signal if you always reply 400 until you are ready, right?
Lucy Sweet
Yeah, exactly. This is the next level. So, yeah, next feature in Kubernetes: the kubelet crash loops itself until it is ready. Yeah, no, maybe not.
Sam
So, speaking of fun implementations, you are working on something. It's an LLM wrapper. I'm reading what you wrote, so then you tell me. It's an LLM wrapper that pretends to be a Kubernetes API server, works with kubectl, and hallucinates objects into existence when you try and get them. My question is: why?
Lucy Sweet
Under the Fifth Amendment of the United States Constitution, I don't have to answer that.
Sam
No, you're not American, come on.
Lucy Sweet
Okay, yeah, fair enough. No, damn it. So this is what I would call a moment of weakness, at 7 o'clock in my house, when I realized maybe that I could do something, and I didn't stop to think about whether I should do it. So what this is, is a Go binary that's hooked up to an LLM. And the LLM's instructions are: you are a Kubernetes API server, you must respond with fully formatted JSON responses to HTTP requests, you will be given the URL and the payload from the request, and you must not add any extra commentary or the JSON parser will error. Then you run that as an HTTP server and you pipe it all back to the LLM, and then you make a kubeconfig that points your kubectl directly at it, and you can actually go further. I tried it with kubectl. That was very funny, because I did a patch against a deployment that didn't exist, and the LLM hallucinated it into existence.
Sam
Sure.
Lucy Sweet
It said it was patched. Then I went to get the deployment and it was slightly different from what I had just patched. I created a deployment object and then I did get pods, because obviously there's no actual Kubernetes behind this. There's no controller manager, no etcd. And of course, the pods had suspicious IDs like 12345 and ABCDE. Then, in another moment of weakness, I thought, what happens if I connect a kubelet to this? That got interesting. At first it spawned about 10 copies of nginx. Then it just started pulling random images off of Docker Hub for all sorts of applications.
Sam
What can go wrong?
Lucy Sweet
What could go wrong? You know, I'm just being a good steward, letting people run stuff on my computers. It's very charitable of me. So, by the time this recording goes up, it should be on a website. Actually, I decided that the best thing to do would be to buy a domain name and put it on a public website, because what could go wrong with that? Of course, if you go to, what is it called, I think it's kubect.org, not while we're recording, because I haven't turned it on yet, but I'll turn it on before we publish this, you can get a kubeconfig, and if you use that kubeconfig with your kubectl, you will be connected to a cluster. But it's not a cluster at all, it's just an LLM pretending to be one. And you can try and do a deployment on it, and maybe it will work. Maybe the LLM will hallucinate it into existence. Maybe the LLM will do something else entirely and turn it into a StatefulSet because it feels like it. Who knows?
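In rough outline, the toy Lucy describes is just an HTTP server that forwards every request to an LLM and returns whatever comes back. The sketch below assumes an OpenAI-style chat completions endpoint and an API key in the environment; the model name, endpoint URL and prompt wording are placeholders, not her actual code.

```go
// Toy "API server" that answers kubectl by asking an LLM. A sketch of the idea,
// assuming an OpenAI-compatible /v1/chat/completions endpoint; the model name,
// endpoint, and prompt are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
)

const systemPrompt = "You are a Kubernetes API server. Respond only with a fully " +
	"formatted JSON response to the HTTP request you are given. No extra commentary."

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		user := r.Method + " " + r.URL.String() + "\n" + string(body)

		payload, _ := json.Marshal(chatRequest{
			Model: "some-model", // placeholder
			Messages: []chatMessage{
				{Role: "system", Content: systemPrompt},
				{Role: "user", Content: user},
			},
		})
		req, _ := http.NewRequest(http.MethodPost,
			"https://api.example.com/v1/chat/completions", bytes.NewReader(payload))
		req.Header.Set("Authorization", "Bearer "+os.Getenv("LLM_API_KEY"))
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		var out chatResponse
		if err := json.NewDecoder(resp.Body).Decode(&out); err != nil || len(out.Choices) == 0 {
			http.Error(w, "bad LLM response", http.StatusBadGateway)
			return
		}
		// Whatever the model hallucinated becomes the "cluster" state kubectl sees.
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, out.Choices[0].Message.Content)
	})
	// Point a kubeconfig's cluster server at http://localhost:6443 to "use" it.
	log.Fatal(http.ListenAndServe(":6443", nil))
}
```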
Sam
I mean, this could be a good learning tool, you know? Yeah. It's only a matter of time before somebody uses this to learn how to pass the CKA certification, I'm quite sure.
Lucy Sweet
Can I? Can I? How many things does it have to pass for me to get the Certified Kubernetes badge? This is my follow-on question.
Sam
That's a very good question. I think you will have to ask the LLM to figure that out, right?
Lucy Sweet
Yeah, yeah, yeah. I'll send it a ConfigMap, and the key will be: replace this value with how many things I have to pass before I get the Kubernetes certified badge. This is prompt injection, yes. I guarantee that the moment people realize this domain exists, my LLM credits are going to die very quickly. But you know what? Worth it.
Sam
Okay. I mean, look, I was at a conference this weekend, and somebody was talking about something I learned about for the first time. There is an open source project called osquery, and what osquery allows you to do is query your operating system metrics as a SQL database. I mean, why would you want to do that? I have no idea. But, you know, whatever. All right, so I do have one last question, and this is before you give me your closing thoughts or anything you want to close with. You are British, so you have a little bit of authority over the English language. Is it Kubectl or Cubectl?
Lucy Sweet
Cube cuddle. Okay, I'm sorry, but this is the way. And if you don't like it, then you'll have to come on the podcast yourself and explain to Abdel why I'm wrong.
Sam
It's fine. We're not publishing this episode. It's okay. Have a good day. I'm just kidding. It's over. It's over.
Lucy Sweet
Look, as I've said to people in Denmark before, when I make a mistake with English, it's really funny, because English is my native language, but it's not theirs. So I just go, actually, as the native speaker, that is completely okay. You guys can't say anything back about it, because you learned it in school. I learned this when I was born.
Sam
Well, I mean, in the same context, I could also make mistakes and say, you know, English is not my native language, so I'm sorry. It can go both ways.
Lucy Sweet
I should try doing that as well.
Sam
So you say, oh, by the way, I have been living in Denmark for a very long time, I kind of forgot about English.
Lucy Sweet
Yeah, yeah, exactly. I've gotten too used to really weird words for things and speaking from my throat.
Sam
Awesome. So, Lucy, this has been a fun discussion. I learned a lot. Any closing thoughts? Anything you're excited about? What's going on? Are we going to see you at Kubecon Europe?
Lucy Sweet
Oh, absolutely. You are. I've got two talks at Kubecon US this year.
Sam
Awesome.
Lucy Sweet
I'm going to be on an AI ML panel with a few folks from Google of all places, among others.
Sam
Okay.
Lucy Sweet
And then I also am going to be doing a talk with Sandeep from Gen where we're going to be breaking into a Kubernetes cluster live on stage and doing privilege escalation and all these fun things.
Sam
Awesome.
Lucy Sweet
So both of those should be great fun. And outside of that, I'm looking forward to when we hit 10 million cores on K8s. I want the 8-figure number. We need to get the stateful fleet on there, and then we can get there, I believe.
Sam
All right, so this is your open invitation when you hit that number to come back on the show and tell us all the fun stuff you have learned.
Lucy Sweet
I'll bring cake.
Sam
All right, cool. Sounds good. I'll be coming to Aarhus to celebrate in that case, because last time I was in Aarhus, there was cake for the 10-year anniversary of Kubernetes, right?
Lucy Sweet
Yeah, absolutely. And if you want to be physically here for cake, just a heads up, you could just join Uber: uber.com/careers.
Sam
Awesome. Yeah, that's great. We're also going to make sure to include the links to your talks, because this is going to air after KubeCon North America, so we'll include the recordings, and there is also your LinkedIn profile, we'll have that. And then there is also lucy.sh. I really like the domain.
Lucy Sweet
Thank you.
Sam
That you can go check to follow up on what Lucy is up to, upcoming talks, talks that have been done already, et cetera, et cetera. Awesome. Thank you so much Lucy.
Lucy Sweet
Thank you so much. It was lovely to talk to you, as ever. Looking forward to seeing you. You're not at KubeCon this year, are you?
Sam
No, not North America. I'm going to be in Morocco. It's warmer and better food.
Lucy Sweet
Listeners, I need you to collectively boo right now all together.
Sam
Yes. But I will see you in Europe for sure.
Lucy Sweet
Oh absolutely. Look forward to it.
Sam
Awesome. Thank you.
Abdel Sghiouar
That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check out the website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thanks for listening, and we'll see you next time.
Kubernetes Podcast from Google
Hosts: Abdel Sghiouar, Kaslin Fields
Guest: Lucy Sweet (Staff Software Engineer, Uber)
Date: May 13, 2026
This episode dives into Uber's massive infrastructure migration from their legacy systems (notably, Mesos-based Peloton) to Kubernetes (K8s). Lucy Sweet, a lead engineer at Uber and member of the Kubernetes Node Lifecycle Working Group, shares her first-hand account of the challenges, surprises, and engineering lessons from moving millions of compute cores and thousands of microservices to K8s—without disrupting users. The conversation touches on large-scale migration strategies, the importance of service portability, evolving Kubernetes capabilities (like stateful workload handling), incident debugging at Uber scale, and even playful side projects blending AI and Kubernetes.
"Portability is so important and we keep using it now ... If it breaks, not a problem. We'll just take the new containers away. If it doesn't break, awesome.”
—Lucy Sweet [07:42]
“Don’t worry, Kubernetes never has any bugs.”
—Lucy Sweet [12:54]
"It's not DNS. There's no way it's DNS. It was DNS."
—Lucy Sweet, referencing a haiku on her wall [15:50]
"It lets you say, I would like to get rid of this pod at some point in the future ... Interceptors annotate the pod, are signaled, do their thing ... then they say they're completed and move on to the next interceptor.”
—Lucy Sweet [28:36]
Lucy is active in K8s community leadership, contributing especially to the node lifecycle. Uber has finished its stateless migration, with stateful workloads next, requiring new patterns and upstream features. Lucy will present at KubeCon US and Europe, including an AI/ML panel and a live cluster exploitation demo.
Lucy’s excitement:
“I'm looking forward to when we hit 10 million cores on K8s. I want the 8 figure number...” [38:27]