Agent Sandbox with Lovable, with Jonathan Grahl - Kubernetes Podcast from Google


Kubernetes Podcast from Google
Episode: Agent Sandbox with Lovable, with Jonathan Grahl
Date: June 12, 2026
Hosts: Abdel Sghiouar, Kaslin Fields
Guest: Jonathan Grahl, Team Lead, Infrastructure at Lovable

Episode Overview

In this episode, hosts Abdel Sghiouar and Kaslin Fields chat with Jonathan Grahl, who leads infrastructure at Lovable, a platform empowering non-technical users to create SaaS and tech-based products. The discussion explores how Lovable operates at scale on Kubernetes, the unique requirements of sandbox workloads, the company’s experience with sandboxing technologies (such as gVisor and Firecracker), and the evolving landscape of infrastructure around AI agent orchestration. The conversation also touches on the challenges of state management, cold starts, warm pooling, and the future of Kubernetes core components. And, for a sweet finish, Jonathan shares how his family’s chocolate business intersects with his tech journey.

Key Discussion Points & Insights

Introduction to Lovable & Its Core Mission (02:12–03:16)

Lovable’s Vision:
- Aims to be the co-founder for users building SaaS or tech products—no technical experience required.
- Helps the “99% who are not technical to achieve their dreams.” (Jonathan Grahl, 02:12)
Growth & Scale:
- Lovable is a major player in the “Vibe coding” space, experiencing explosive growth.
- Building “several hundred thousand new projects per day.” (03:16)
- Scaling challenges on Kubernetes have been “at the frontier” of what’s possible.

Scaling Kubernetes for High Churn Workloads (03:32–07:46)

Unique Scaling Vectors:
- Main bottleneck is “POD churn or POD creation rate, more than how many pods we run.” (Jonathan Grahl, 04:22)
- Example: Issues with Cilium data plane, internode encryption, and troubleshooting MTU-related packet drops.
- Under intense loads, “Cilium could take up to 30 seconds to get the Pod IP to be routable in the cluster.” (04:55)
Lovable’s Use Case:
- Each user gets a dedicated pod for a “live preview,” creating one-to-one pod mapping.
- “Our pods have to be alive until the user essentially closes their browser, which could be 5 minutes or 24 hours.” (Jonathan Grahl, 06:54)
- This design puts unusual pressure on startup time and state management.

Understanding Sandboxes: Stateful but Non-Persistent (07:55–10:29)

Definition of Sandbox:
- For Lovable, sandboxes are “stateful but non-persistent workloads.”
- Pod must stay up during user session, but can be rebuilt from git/history if needed.
- “None of the kind of workloads [controllers] fit us at all. We use pods directly.” (Jonathan Grahl, 09:31)
- “We hack Kubernetes a lot to make it work.” (10:40)
Differences from Traditional Kubernetes Workloads:
- Standard controllers (deployments, statefulsets) don’t fit; pods must keep their identity for session continuity.
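In the conversation, Jonathan adds that a pod which restarts has lost its in-sandbox state, so it has to be proactively deleted rather than left for the system to treat as a live sandbox. A minimal sketch of that check, assuming pods are available as plain dictionaries shaped like Kubernetes pod objects (the helper names `has_restarted` and `pods_to_delete` are hypothetical, not Lovable's code):

```python
from typing import Any


def has_restarted(pod: dict[str, Any]) -> bool:
    """Return True if any container in the pod has restarted at least once."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return any(cs.get("restartCount", 0) > 0 for cs in statuses)


def pods_to_delete(pods: list[dict[str, Any]]) -> list[str]:
    """Names of pods whose sandbox state can no longer be trusted."""
    return [p["metadata"]["name"] for p in pods if has_restarted(p)]


# Illustrative data only: two sandboxes, one of which has restarted.
pods = [
    {"metadata": {"name": "sandbox-a"},
     "status": {"containerStatuses": [{"name": "main", "restartCount": 0}]}},
    {"metadata": {"name": "sandbox-b"},
     "status": {"containerStatuses": [{"name": "main", "restartCount": 2}]}},
]
print(pods_to_delete(pods))  # ['sandbox-b']
```

A real controller would watch pod events and issue the deletes through the API server; the sketch only shows the decision rule.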

State Management Strategies (11:19–12:45)

Two Types of State:
1. State while a project is developed (workspace/chat/git history/etc.).
2. State when the project is deployed/running.
Architecture:
- User state is persisted in a document database (chat “trajectory” and git).
- Everything is driven through the git project.

Pod Resource Usage and Optimization (12:46–14:22)

Pod Sizing:
- Pods are sized based on CI/CD-like workloads; high burst CPU/memory on startup, then dropping to baseline.
- Overcommitment and heuristics drive resource allocation.
No Dynamic Resizing:
- Rely on “warm pooling” to hide start-up costs rather than dynamic resizing of resources.

The Agent Sandbox Project & Warm Pooling (14:39–17:35)

Lovable’s Warm Pool Hack:
- “We remove the owner reference and selector label from the pod, then the deployment loses track, and the pod continues living on forever. That’s warm pooling for us.” (Jonathan Grahl, 14:47)
- At scale, such hacks lead to issues with Kubernetes’ ReplicaSet controllers—pushing Lovable to look at Agent Sandbox.
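The detach step described above can be expressed as a JSON merge patch, where a null value removes a key. The sketch below is an assumption about how such a patch might look, not Lovable's implementation: the label key `app`, the pod name, and the helper names are hypothetical, and `apply_merge_patch` is a small local stand-in for what the API server does with a merge patch.

```python
import copy
import json
from typing import Any


def detach_patch(selector_labels: list[str]) -> dict[str, Any]:
    """Build a merge patch that drops the owner reference and selector labels."""
    return {
        "metadata": {
            "ownerReferences": None,  # null removes the field
            "labels": {key: None for key in selector_labels},
        }
    }


def apply_merge_patch(target: Any, patch: Any) -> Any:
    """Minimal RFC 7386 merge: null deletes a key, objects merge recursively."""
    if not isinstance(patch, dict):
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = apply_merge_patch(result.get(key), value)
    return result


# Illustrative pod owned by a ReplicaSet and matched through the "app" label.
pod = {
    "metadata": {
        "name": "preview-abc12",
        "labels": {"app": "preview", "tier": "sandbox"},
        "ownerReferences": [{"kind": "ReplicaSet", "name": "preview-7f9c"}],
    }
}
patch = detach_patch(["app"])
detached = apply_merge_patch(pod, patch)
print(json.dumps(patch))
print(detached["metadata"])
```

Against a real cluster, a patch of this shape could be sent with `kubectl patch pod <name> --type=merge -p '<json>'`. Once the label no longer matches the selector, the ReplicaSet creates a replacement, which is what keeps the pool topped up.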
Agent Sandbox Overview:
- Provides custom resources (sandbox, warm pool, claims), replaces need for deployments/statefulsets for sandbox workflows.
- “It’s very elegant... but it does not use deployments and replica sets underneath.” (17:16)
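The warm pool and claim model can be illustrated with a small in-memory sketch. This is not the Agent Sandbox API: the `WarmPool` class, its methods, and the pod naming are hypothetical, and a real controller would create pods from a pod template rather than generate names.

```python
from collections import deque
from itertools import count


class WarmPool:
    """Keeps `size` ready sandboxes and hands one out per claim."""

    def __init__(self, size: int) -> None:
        self.size = size
        self._ids = count(1)
        self._ready = deque()
        self.claims = {}  # session id -> sandbox name
        self._refill()

    def _refill(self) -> None:
        # Stand-in for "create pods from the template until the pool is full".
        while len(self._ready) < self.size:
            self._ready.append(f"sandbox-{next(self._ids)}")

    def claim(self, session: str) -> str:
        """Bind a ready sandbox to a session; repeat claims return the same one."""
        if session in self.claims:
            return self.claims[session]
        sandbox = self._ready.popleft()
        self.claims[session] = sandbox
        self._refill()
        return sandbox

    def ready(self) -> list:
        return list(self._ready)


pool = WarmPool(size=2)
first = pool.claim("session-a")
print(first, pool.ready())
```

The point of the design is that a claim is bound to one specific sandbox for the whole session, which is the identity guarantee that deployments do not give.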

Sandbox Runtimes: gVisor, Firecracker, Kata Containers (18:56–23:12)

Sandbox Technology Choices:
- Started with Firecracker (AWS micro VMs), moved to gVisor (Google’s user-space kernel) due to better memory reclamation.
- “There’s no virtual machine in gVisor... we can give a lot of memory and just take it back whenever we want.” (Jonathan Grahl, 19:12)
Tradeoffs & Compatibility:
- gVisor: Streamlined memory management, but file system operations can be 5x slower due to the way hard links (as used by Bun) work in the sandbox.
- “Most of the things we’ve ended up having problems with is file system and network. This is just pure performance.” (21:13)
- “You have to test it. The only way to know.” (22:34)
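For readers unfamiliar with hard links, the sketch below shows the general mechanism the discussion refers to: one file on disk reachable under two paths, so a shared package cache can be linked into a project without copying. It is a generic illustration, not Bun's or gVisor's implementation, and the paths and the `hard_link_demo` helper are hypothetical.

```python
import os
import tempfile


def hard_link_demo() -> tuple:
    """Link a cached file into a project and report (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as root:
        cache_file = os.path.join(root, "cache", "left-pad.js")
        project_file = os.path.join(root, "project", "node_modules", "left-pad.js")
        os.makedirs(os.path.dirname(cache_file))
        os.makedirs(os.path.dirname(project_file))

        with open(cache_file, "w") as handle:
            handle.write("module.exports = () => {};\n")

        # A hard link adds a second name for the same underlying file.
        os.link(cache_file, project_file)

        cache_stat = os.stat(cache_file)
        project_stat = os.stat(project_file)
        return cache_stat.st_ino == project_stat.st_ino, cache_stat.st_nlink


same_inode, link_count = hard_link_demo()
print(same_inode, link_count)
```

When the sandbox cannot share the underlying file and has to write the data again, as Jonathan understands gVisor to do, the install loses the benefit of linking, which is consistent with the slowdown described.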

The Cold Start Tradeoff: Speed vs Cost (23:13–28:38)

Fast Pod Startup is Overrated (Sometimes):
- “In the end, a lot of those things really don’t matter in reality... The user will still have to start writing. It takes a couple of seconds.” (24:30)
Warm Pooling Approach:
- Lovable keeps pools of ready pods (“warm pool”) to optimize for user experience; allows for fast claim and minimizes load time.
Snapshotting:
- Used approaches like disk snapshots, bundling common node_modules, and file system tricks to skip repetitive tasks. Now moved to more git-driven state resumption.

Advanced Image & File System Strategies (28:38–30:40)

Secondary Boot Disks:
- “We have this... secondary boot disk... where you have all your images and then you don’t have to stream it. We update that every week or every day.” (28:11)
Emerging Projects:
- Hinted at dynamic registries that could compose docker images with specific layers on the fly.
- Mentioned Dragonfly (CNCF) for efficient OCI image layer streaming and potential for “live patching” container layers.

Future Directions: Can Kubernetes Core Solve These Problems? (31:06–33:38)

Limitations of Scaling in Kubernetes Core:
- “You would have to be willing to replace core parts of Kubernetes that other people depend on.” (Jonathan Grahl, 31:06)
- etcd scaling is a hard cap (16GB), which imposes limitations for workloads with massive churn.
- “Maybe it’s easier to work around it... people use virtual Kubelet for this heavily.” (32:23)
- Describes the pattern of external per-node “runners” with the kubelet API to offload the burden from the core control plane.

Running Databases at Scale for Agent Workloads (33:38–36:05)

Lovable Cloud’s Approach:
- Provides developers with Postgres DBs and S3-compatible storage, currently running with a third-party provider.
- Aspirational for a system like Vitess but designed for “hundreds of thousands of databases,” matching agent-level granularity.
- Future: How to give every sandbox its own DB with strong isolation remains an open, unsolved challenge.

Security, Monitoring, and Anomaly Detection (35:35–36:05)

Security Mechanics:
- No structure can guarantee perfect isolation; focus on anomaly detection, pen testing, and “security is as strong as the weakest link.” (36:05)

Memorable & Notable Moments

On Kubernetes Scaling Challenges:
- “We’ve hit every type of bug and scaling problem in all our providers; we’re kind of at the frontier here.” (Jonathan Grahl, 03:32)
On Warm Pooling ‘Hacks’:
- “If you remove the selector label and the owner reference from the pod, then the deployment will be like, oh, I lost a pod, I need to create a new one. But the pod will continue living on forever. So this is warm pooling for us.” (Jonathan Grahl, 14:47)
On File System/Network Performance:
- “Our bundle install times [with gVisor] take five times longer in the worst case.” (Jonathan Grahl, 22:03)
On Reality of Kubernetes for High Churn:
- “I don’t believe Kubernetes as it is will be the right solution... Maybe it is a nice solution to work with, but it would require such radical changes in the core that it’s hard to get it upstream.” (Jonathan Grahl, 31:06)

Bonus: The Chocolate Entrepreneur Story (36:06–37:41)

Jonathan’s Family Chocolate Business:
- Mother is a “solo entrepreneur chocolatier,” locally known as “the chocolate lady” near Stockholm.
- Company named “Chocolate for the Soul,” showing a family tradition of joyful entrepreneurship.
- [Hosts joke about organizing CNCF event desserts supplied by Jonathan’s mom.] (37:25–37:38)

Important Timestamps

Lovable’s vision and scale: 02:12–03:16
Kubernetes scaling challenges & Cilium: 03:32–05:05
Sandbox definitions: 07:55–10:29
State management: 11:19–12:45
Pod sizing, resource optimization: 12:46–14:22
Agent Sandbox architecture: 14:39–17:35
Sandbox runtimes (gVisor, Firecracker, Kata): 18:56–23:12
Cold start tradeoffs: 23:13–28:38
File system strategies: 28:38–30:40
Kubernetes future directions: 31:06–33:38
Databases for agent workloads: 33:38–36:05
Chocolate story: 36:06–37:41

Notable Quotes

“Lovable kind of requires you to have the... live preview. That’s a websocket connection directly to the pod. And if the pod disappears, we don’t know if we lose state or not.”
—Jonathan Grahl (06:53)

“We hack Kubernetes a lot to make it work.”
—Jonathan Grahl (10:40)

“You have to test it. The only way to know.”
—Abdel Sghiouar on gVisor limitations (22:34)

“You would have to be willing to replace like core parts of Kubernetes… It would require such radical changes in the core that it’s hard to get it upstream into Kubernetes.”
—Jonathan Grahl (31:06)

Takeaways

At massive scale, running user-focused sandboxes on Kubernetes requires inventive workarounds and creative architecture, often pushing the platform beyond its intended limits.
Emerging projects like Agent Sandbox, and runtime choices like gVisor, are promising but still maturing, especially for ultra-high churn use cases.
State management, cold start minimization, and security at scale remain open areas for innovation.
The intersection of AI, infrastructure, and traditional Kubernetes models is producing new patterns, new problems, and a need for more pluggable solutions.
Even tech leaders come from sweet backgrounds—sometimes, literally!

For more resources and links, check the show notes and transcript at kubernetespodcast.com.


Transcript

[00:00]
A
Hello and welcome to the Kubernetes Podcast from Google. I'm your host Kaslin Fields.
[00:06]
B
And I am Abdel Sghiouar. In this episode we speak to Jonathan Grahl. Jonathan is a team lead of infrastructure at Lovable, where he oversees the platform stack the company runs on. In this episode we talked about Kubernetes, sandboxes and chocolate.
[00:30]
A
But first, let's get to the news.
[00:35]
B
OpenTelemetry is a CNCF graduated project, cementing its status as the de facto standard for observability. The project reached technical maturity with massive adoption. The JavaScript and Python APIs both recently surpassed 1.3 billion downloads. OpenTelemetry is seeing an increase in interest as the standard layer for observing performance, reliability and trustworthiness, specifically for AI workloads.
[00:59]
A
Congratulations to the OpenTelemetry team. The CNCF Technical Advisory Group elections are open, with three former TAG leads, Brandt Keller, Mario Fahlandt and Mauricio Salatino, stepping up into the 2026 Technical Oversight Committee. This opens up new positions within the advisory groups. Nominations are open for TAG infrastructure, operations, resiliency, dev experience, workloads, foundation and security, and votes will start soon. Check the link in the description for details and dates.
[01:32]
B
In events news, KubeCon + CloudNativeCon India is happening in Mumbai on June 18 and 19. KCDs in Czech and Slovak, New York, and Kuala Lumpur are happening on June 4, 10 and 27, respectively.
[01:47]
A
And that's the news.
[01:50]
B
Our guest on the show today is Jonathan Grahl. Jonathan is the team lead of infrastructure at Lovable, where he oversees the platform stack the company runs on. We'll be talking Kubernetes, sandboxes and chocolate. Welcome to the show, Jonathan.
[02:02]
C
Thank you so much. So nice to be here.
[02:05]
B
Awesome. So I think everybody knows who Lovable is, but just in case people don't know who is Lovable, what do you guys do?
[02:12]
C
Yeah, so Lovable is a platform that kind of aims to be your co founder if you're making like a SaaS product or a tech based product. So we are. Well I guess it's called like Vibe coding nomenclature. So we help you build applications even though you don't know technology at all. So we aim to help the 99% who are not technical to achieve their dreams, basically.
[02:33]
B
Awesome. So Lovable is clearly one of the biggest players today, and has been for a very long time one of the biggest players in the Vibe coding space. Which means that you folks have seen actually a very rapid growth, right? Your platform has been growing very fast and you run primarily on Kubernetes. You do also run on other platforms. We're going to talk about that. Have you seen any scaling challenges? Because we all know Kubernetes is not easy to scale.
[02:54]
C
Yeah, I think all of them, as expected, of course. So this is why it's very interesting and very timely to talk about sandboxes overall, or I guess overall infra. As you can imagine, Lovable has grown like crazy. I can't actually remember how big we were in the summer, but right now we're building like several hundred thousand new projects per day on Lovable.
[03:17]
B
Wow.
[03:18]
C
So it's pretty crazy. And we were not that many when I joined in August and now we are significantly bigger. So in terms of scaling problems, I think we've hit every type of bug and scaling problem in all our providers that we're kind of at the frontier here.
[03:33]
B
Yeah. So can you talk specifically about some scaling challenges you've seen? Obviously, I think we all know that scaling envelope that Kubernetes has, which is this multidimensional scaling problem that if you pull Kubernetes in one way, you stretch it too much. Can you talk about some very specifics?
[03:48]
C
Yes. So how Kubernetes works. You're absolutely right. Depending on what you're scaling, and there's so many different scaling vectors, things will start breaking or you need to optimize for different things. So for example, a big one that we wrote about lately on our blog is Cilium, or like Dataplane V2, breaking. We had a problem for a couple of weeks where we would just drop packets, sometimes during load, very uncontrollably. And it was due to us enabling and disabling internode encryption in the data plane, or in Cilium, which meant that the MTU or packet size was different in different places. And this probably wouldn't have happened if we didn't do it under load. So this is a place where Cilium has been maybe one of the things that are the most problematic. In our case, our scaling vector is actually pod churn or pod creation rate, more than how many pods we run. So our pod creation rate can be a few hundred per second at certain times. We try to optimize each cluster to support that, which is a really hard problem. So in this case, Cilium could take up to 30 seconds to get the pod IP to be routable in the cluster, even though the pod is ready.
[05:06]
B
Yeah. I remember when we met back in March, you were preparing for International Women's Day. I believe you were offering Lovable for free for International Women's Day. So you were performing some tests, basically, to check that Cilium would not be a bottleneck.
[05:19]
C
And it wasn't that day, but we did, that day went great. We had like a crazy day. I don't know how many multiple times more load we had only in one day. But we had prepared for, you know, since October last month, I think we or September, we started optimizing for running on, you know, open source software rather than proprietary platform.
[05:40]
B
Yeah, yeah. And specifically for people who are following us who might not know where this problem comes from. It's basically the. When you scale up your pods, the time it takes for all the IPs to propagate across the cluster, right?
[05:52]
C
Yes. And in our case they actually need to operate globally, like on public, you know, across in the arches, all of Google, because we're running multi-region and we use this like native network access from our backend all the way to the pod IP within Google. So actually it needs to be available cross-region as well.
[06:12]
B
Yeah. So maybe just to give an audience a little bit of a deeper understanding, can you explain the reason why you care about the individual pod?
[06:20]
C
Yeah. So this is where I think sandboxes are like a new type of workload. Well, it's not new, it's actually very old. But the reason why it's important for us is that Lovable kind of requires you to have the... let's take it from the product perspective. So you are a user of Lovable, you open Lovable, you want to work on your project. When you get in, then you get what we call a live preview. That's actually like a websocket connection directly to the pod. And if the pod disappears and, you know, the agent does stuff and we have git, the pod is not allowed to disappear because we don't know if we lose state or not.
[06:54]
B
Yes.
[06:54]
C
So our pods kind of have to be alive until the user essentially closes their browser and which could be 5 minutes or 24 hours. So this means that whenever the user clicks edit on the project that time until the pod is ready. It's like how long the users need to sit and wait. Yes, that's why it's so important for us.
[07:13]
B
In a way you are kind of using Kubernetes in a bit of an unconventional way because it's not your typical horizontally scalable workload. You are basically assigning effectively one pod per user session, whatever.
[07:24]
C
Right, exactly. And it's kind of similar to CI CD in many ways as well. It's kind of interchangeable. In this case, we have a connection open during the work, but what we're doing is we start a sandbox, we do kind of install, start some services, et cetera, et cetera, and then it needs to stay around until it's built and ready and otherwise you just need to start a new one.
[07:46]
B
Yeah. So obviously startup time is important, state management is important. We're going to talk about this later, but I wanted to pull back a little bit and talk about what does sandbox mean?
[07:56]
C
Yeah, this is why. Yeah, for those who sit on X and follow all these new benchmarks or sandboxes and all the. Now the hyperscalers are getting into sandboxes, I think there's different categories of sandboxes that focus on different things. So for example, in our case, how we view sandboxes that they are stateful but non persistent workloads, so they need to be around and they have state in them and whenever the user is working they kind of need to stay. It will not be able to just be replaced at any time. So it is stateful, but we can refresh them from cold and at any point as long as we've done the git commit.
[08:36]
B
Got it.
[08:37]
C
So that's sandboxes for us. For other people, for example, Cloudflare released this new sandbox thing last week which is based on V8 isolates, which is just JavaScript code running in a process without a container or a micro VM at all. So they use V8, like the Chrome virtual machine, just for JavaScript. And as well, before cloud there was something called VPSes, which is essentially what sandboxes are today for most people, which is like, give me a machine with some RAM and some CPU and hold it until I want to delete it. Yeah, that's the most normal one today.
[09:10]
B
Yeah. And that's interesting the way you described it. It's stateful, but it's not persistent in the sense that once the user is done with the work, you don't necessarily have to persist it in its current state because you can always recover. Right. So that's an interesting way of putting it in the context of kubernetes, because when people talk about kubernetes, it's always stateful versus stateless. Right?
[09:31]
C
Yeah, exactly. So what Kubernetes is really good at is like keeping. If you describe a workload, it just makes sure to keep it around always, but it does not ensure. In our case, for example, if we connect to a sandbox we need a pod name or the IP to be the same for that particular session of the user. So we can't. In Kubernetes case, it's like, oh, you have described the pod, I'll just recreate it and availability is up. But that actually means that it's not the same sandbox anymore for us. So we don't use the things like deployments, statefulsets. We use pods directly because none of the kind of workloads fit us at all.
[10:07]
B
Yeah, because if you use a deployment or any type of controller, you'll get unpredictability because you can get assigned any random pod, essentially.
[10:13]
C
Exactly. And deployment, for example, interesting case. If you use a deployment to have your pods available. If the pod restarts, you are not able to make the pod not restart if you use a deployment. But in our case, if the pod restarts, it loses all the state. So then we don't need it anymore.
[10:29]
B
Oh yeah.
[10:30]
C
So we actually need. If a pod restarts, we actually need to proactively delete them to make sure our system doesn't think that the sandbox is still there.
[10:37]
B
Oh yeah. So you have to deal with, like, race conditions, essentially.
[10:41]
C
Yeah, all the time. And this is where we have broken, you know, we break the replica set controller actively and like we do, we hack Kubernetes a lot to make it work.
[10:52]
B
And so I think that this is an interesting use case that you're talking about. I personally think that there will be more and more of those specific use cases going forward, because I think that Kubernetes is by far moving away from the standard give me a deployment of 10,000 pods and put load balancer in front of it. It's like the standard backend use cases or whatever. I think that there will be more and more of these particular use cases. So let me then ask you a question, and I guess I'm jumping ahead, but how do you deal with keeping state?
[11:20]
C
We keep state outside, or we try to think of primitives, which I think is a very healthy way of thinking about making stateful systems work well. So in our case, the state of the user is stored in a document database where they keep all their chats and what we call their trajectory, which is the chat messages, and their git, which is the state of their project. And then we have Lovable Cloud, which is the state of the application, which is not our state. So there's kind of two pieces of state here, but mainly everything is driven through the project itself, the trajectory and git.
[11:52]
B
So in A way you have basically two types of state. You have the state of the projects where it's being developed and then you have the state of the projects when it's running. So when the application is deployed.
[12:00]
C
Essentially, yes.
[12:02]
B
So from the perspective of the user, if I understand it correctly, and correct me, I go to Lovable. I log in, I get assigned a pod. That pod has my workspace, essentially, let's call it. Right, sure. That's the sandbox from your perspective. Yes, I do my chat, I talk to our large language model. The large language model does the code changes for me. I assume I hit save or there is auto save.
[12:21]
C
Yeah, the agent is autonomous here. Of course, what Lovable is, is a bespoke agent or harness that does things a certain way. But yes, every time you chat, if it succeeds, then it's always persisted. That's part of the iteration loop.
[12:35]
B
But then it's persisted outside the pod. So in case the pod inside the session ends unexpectedly, whatever, you can always go back and reassign it to a new pod, but then reload the existing state.
[12:45]
C
Exactly, yes.
[12:46]
B
Okay, but so how. If this is the question you can answer, how big are these pods? Like how in the size, like the image, the container, whatever.
[12:53]
C
That's a very nuanced question. So this is why I brought up the CI CD or I guess CI before. If you imagine you're running a CI job, you start a new GitHub action or whatever. In the beginning it's like, oh, I need to compile this before I test it. So it will spike in CPU immediately to as much as it can in memory and then go down to basically zero or whatever your workload needs. And this is very much what we the problem we have where I think our pods are very small, like our sandboxes in terms of, you know, unit economics at our scale is very important. So we need to manage it very well. So we, you know, all the over commitment and this. But I think they can spike to like, you know, 15, 16 cores or I guess as much as they are allowed to do, of course. And then they go down to like, you know, less than half of the CPU after, depending on the size of the project. It's kind of a power law thing where big projects take more time, of course. So we have to act kind of around heuristics basically. We can't really have a fixed amount of resources to give or take away from pods.
[13:54]
B
Are you hinting to the fact that you are using the dynamic resizing of pods?
[13:58]
C
We do not. So in this case Lovable might be a highly valued company, but we are very new and we are quite small. So we do a lot of not over provisioning, but warm pooling in this case to work around these kind of problems. That's why Kubernetes and how we built is very nice because we just build it around native OSS Kubernetes primitives. So no resizing and stuff for us right now at least.
[14:22]
B
Yeah, I mean the feature itself is open source, but the dynamic resizing is still not there in open source. So you hinted to an important point, which brings me to the next question. When we met back in March, we talked about the Agent Sandbox and then you told me you were not actually using the Agent Sandbox today, or at least back in March. That was not the case. Right.
[14:40]
C
Yeah.
[14:40]
B
But you were actively looking into it. So can you talk a little bit about how did you build the system and where does the Agent Sandbox fit into the system?
[14:48]
C
Yeah. So how we started the system is I have a brilliant colleague, Will, who came up with this idea that if you create a deployment in Kubernetes then it will ensure pods always run. But if you remove the selector label and the owner reference from the pod, then the deployment will be like, oh, I lost a pod, I need to create a new one. But the pod will continue living on forever. So this is warm pooling for us. Whenever someone says sandbox, we remove the owner reference and the selector label and thus we have warm pooling. So we didn't really have to build any technology around that. But as you get to scale... and this worked really well up until maybe one or two months ago, because we had grown so much that we just hit problems in the replica set controller, where it thinks it has pods still because we popped them from the deployment, and things kind of break. So we are looking into refining this. We didn't know that the SIG group for Agent Sandbox was coming up when we started working on this. And we have a tight relationship with Google. So when that came in we're like, oh, this is what we should use. Why not move towards a shared understanding of sandbox scheduling and operations? But to be honest, Agent Sandbox, as the SIG group has it today, is still not ready for high throughput sandboxes. Yeah, clearly it's built for a different, not a different purpose. But a good example is that each sandbox with Agent Sandbox creates a Kubernetes service that is headless.
[16:13]
B
Yes. Because you need to Talk to that specific pod.
[16:15]
C
Yeah, exactly. We don't really need that and we can't do that because we would have one service per pod. And when you run, you know, 100,000 of them, it ends up being kind of a problem. Yes. So there's all these things that will work great if you're like at low scale here, meaning thousands still. But as soon as you get to the edge, the API server starts breaking. You need to reduce queries per second in all your controllers. Then how Agent Sandbox is today doesn't work that well.
[16:42]
B
Yeah, I think we jumped a little bit too far. So for those who don't know, I'm going to leave a link in the episode for the Agent Sandbox itself, but maybe, Jonathan, can you describe the Kubernetes Agent Sandbox, the open source project, briefly? What is it actually?
[16:56]
C
Yes, it's a set of custom resources. In this case it's a sandbox, a warm pool, and a sandbox claim. They each have a controller that sits and waits for these objects, and what they allow you to do is this thing that I mentioned before: you create a warm pool, and you describe the warm pool, like, this is the pod template for that warm pool.
[17:16]
B
Yes.
[17:16]
C
And it has some small details in how you create pods and if you should recreate them if there's a new image, things like this. And when you want to take a pod from the warm pool, you create a sandbox claim. It's quite simple and I think it's very elegant, but it does not use deployments and replica sets underneath. So it's essentially a replacement for.
[17:36]
B
Yeah, I know, it just creates pods.
[17:37]
C
So it's kind of a replacement for deployments or stateful sets in this case.
[17:42]
B
Yeah. The reason why it's called Agent Sandbox, it was specifically designed for AI agents to be able to run agent generated code inside the sandbox environment. But I have talked about it in a couple of conferences and including your use case, which by the way, I saw on X from your cto, I have seen it kind of being considered in other use cases, including what you're talking about. Because what you want is not just run agent generated code, but to run the entire agent with all its environment inside a sandbox environment. Basically.
[18:12]
C
Yes, for sure. So this is where things are moving around week from week in the agent, because the agent world, tech world is very fragile or fast moving, very agile. I guess in December it was like, no, of course you run the agent on your laptop and then like, oh wait, now open Clois I want to run it somewhere else. And now there's like I Claude, you know, and I guess all the providers, OpenAI also, like, oh, why don't we just run the agent all the time at the provider? And this is where I guess these kind of agent sandbox-esque solutions come in.
[18:43]
B
Yeah. And specifically when you look at the design of the Agent Sandbox project, literally what makes it sandboxed is the fact that you change the runtime that's in the pod template. That's the only thing that makes it sandboxed, right?
[18:56]
C
Yeah, exactly.
[18:57]
B
So speaking about that, in the context of Kubernetes, the sandbox usually refers to either using gVisor, which is the Google thing, or using something like Firecracker from AWS, right, or Kata Containers. So how do these tools fit into your particular use case?
[19:13]
C
Yeah, so we've used gVisor since we started. Actually we started with Firecracker, but that was before I joined. And it also works well, but there's some really both operational and security details, or like performance details, between all of them. So you kind of have to know each of the characteristics before you pick one. There's no one solution that fits all, really. But we are very happy with gVisor because there's no virtual machine in gVisor. So it's, you know, maybe a bit deeper tech, but it's like a user space kernel that manages things for you. What this means is that if a sandbox in our case uses a lot of memory and then it uses no memory, then like on your normal machine, the machine will just reclaim it. This is not really possible with other solutions. There's some hacks around it, like memory ballooning and things. But this is the biggest reason why we use gVisor in this case, because then we can give a lot of memory and just take it back whenever we want.
[20:06]
B
Yeah, I mean, to go more into the details, in this particular case the key difference between gVisor and something like Firecracker or Kata Containers is that Kata Containers and Firecracker are micro VMs, right? So the sandbox boundary is a micro VM, but for gVisor it's a Linux user space kernel, essentially.
[20:22]
C
Right, yes, exactly. It's like a normal process on your machine.
[20:26]
B
Yeah. And so speaking of that, like being a Linux space or. Yeah, Linux user space kernel, it means that it doesn't pass through the syscalls to the node, it implements them in that particular kernel.
[20:40]
C
Yes.
[20:40]
B
So have you seen any incompatibilities between your application code and the implementation that gVisor supports?
[20:46]
C
Yeah, we actually could not use gVisor when we started, when we started experimenting with this, like in August. We love a technology called Bun. It's great. It's a kind of replacement for npm or these kinds of things.
[20:58]
B
Yeah, Node.js replacements.
[21:00]
C
Yeah, but how Bun works, to make it really fast, is you have something called hard links. So Bun installs your node modules in a shared space. So if you have a different project, you just essentially make a link from the file that is in your project to the shared storage.
[21:13]
B
Okay.
[21:14]
C
So you don't need to install multiple times. But gVisor doesn't like that because it doesn't give you access to the disk and to the system, right?
[21:21]
B
Yeah.
[21:22]
C
So how they implemented hard links is they just write it twice, as far as I understand it. So it's actually quite inefficient. So most of the things we've ended up having problems with are file system and network. This is just pure performance. So as you can imagine, a user space kernel could probably not, or only in some cases, be more efficient than the kernel itself. For example, if you do file system things, you can often skip the kernel and just go directly to the disk with direct I/O, or use BPF, for example, for networking. But in our case, it ends up in some cases making our bun install times take five times longer in the worst case.
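What a hard link is can be shown with a small sketch using only the Python standard library. It creates one file in a stand-in "shared store", links it into a stand-in "project" directory with `os.link`, and checks whether both paths refer to the same inode. The directory layout and file names are hypothetical and do not reflect how Bun actually lays out its cache; a file system that emulates hard links by copying would be expected to behave differently from a typical local one.

```python
import os
import tempfile


def hardlink_demo() -> tuple[bool, int]:
    """Link one file into a second directory and report (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as root:
        store = os.path.join(root, "store")      # stand-in for a shared module cache
        project = os.path.join(root, "project")  # stand-in for a project's node_modules
        os.makedirs(store)
        os.makedirs(project)

        original = os.path.join(store, "index.js")
        with open(original, "w") as f:
            f.write("module.exports = 42;\n")

        linked = os.path.join(project, "index.js")
        os.link(original, linked)  # second directory entry, no data is copied

        a, b = os.stat(original), os.stat(linked)
        return a.st_ino == b.st_ino, a.st_nlink


if __name__ == "__main__":
    same_inode, link_count = hardlink_demo()
    print(f"same inode: {same_inode}, link count: {link_count}")
```

When both names share an inode, installing a package into many projects costs a directory entry per file rather than a second copy of the data, which is why writing the file twice is so much slower.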
[22:03]
B
Yeah, because it has to duplicate the dependencies on the node, essentially.
[22:06]
C
Yeah, exactly. And since you're adding more work in the network and in the file system path directly, it takes longer. So it's not really an incompatibility in that sense, it's just slow.
[22:19]
B
So, because I get this question quite a lot from people when I talk about gVisor, the question of what limitations do you know of in gVisor that might impact my application? And my typical answer is you have to test it. That's the only way to know. Would you answer this question differently?
[22:35]
C
No, I think you're right. We use gVisor and it works. It works great. And it's kind of the opposite, right? Like, what doesn't work in other solutions? Yeah, you kind of have to pick the one that has the least amount of flaws that you care about in this case. So yeah, we kind of have to live with the fact that it's slower, just because we have the memory capabilities. This does not mean that it's not fixable in gVisor. So there are commits all the time with the vendors we work with. Like, they just fix file system specific things in gVisor to make it faster for us.
[23:07]
B
Yeah, yeah. So it's not really incompatibility, it's more a performance issue that is solvable, essentially.
[23:12]
C
Yeah, exactly.
[23:13]
B
Yeah. And I think that brings me to a question, or kind of a logic that I have in my mind. Whenever we talk to people about process isolation for containers, so gVisor, Firecracker, whatever, or the whole concept of fast start of containers, like how fast they get to start pods, blah blah blah, there is always this trade-off between performance and cost, essentially. Right? So how much are we willing to pay in performance and how much are we willing to pay in cost? And everybody seems to be wanting the best of both worlds, which is not possible. So in this particular case, the choice between gVisor, Firecracker, or Kata Containers will be how much isolation do you want and how much speed do you want?
[23:56]
C
Yeah, there are also nuances to that. Right now the big fad, I guess, of the last week is to have as fast cold starts as possible.
[24:04]
B
Yes.
[24:05]
C
But people still look at Firecracker and are like, oh, you can start in 70 milliseconds. And then some vendors are like, we can start in 10 milliseconds. But in the end a lot of those things really don't matter in reality. This is why it again goes down to the use case. Imagine a user for Lovable goes through their browser. It probably will take 200, 300, 500 milliseconds to render the browser.
[24:30]
B
Yeah.
[24:30]
C
So there's no need really for the sandbox, which could run in parallel, to start in 20 milliseconds. The user will still have to start writing, and that takes a couple of seconds. So this is where people really need to optimize for different things. But specifically cold starts have been the reason for which there are competitions about who has the fastest cold-started sandbox.
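The argument here is simple arithmetic and can be sketched as follows, assuming the page render and the sandbox start really do run in parallel. The function name is hypothetical; the 300, 70, and 10 millisecond figures are the ones quoted in the conversation.

```python
def time_to_interactive_ms(browser_render_ms: float, sandbox_start_ms: float) -> float:
    """When two steps run in parallel, the user waits for the slower of the two."""
    return max(browser_render_ms, sandbox_start_ms)


if __name__ == "__main__":
    render = 300.0  # browser render time mentioned in the conversation
    for sandbox_start in (70.0, 10.0):
        total = time_to_interactive_ms(render, sandbox_start)
        print(f"sandbox start {sandbox_start} ms, user waits {total} ms")
```

Under that assumption, shaving the sandbox start from 70 ms to 10 ms changes nothing for the user, because the browser render dominates in both cases.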
[24:51]
B
So this is actually an interesting discussion, because in the context of Kubernetes your cold start can be basically caused by literally two things. It's either the node cold start itself, so how fast the cluster itself can autoscale, or how fast the pods can autoscale, which then includes things like how fast you can download the image onto the node if it's not already cached, how fast your application itself starts, blah blah blah. Right? So the way I reason about this all the time is that somebody has to pay for that cost somewhere. We can't make all of these things fast all the time. So you either have to accept to balloon or to have a warm pool, and then you're paying the cost for that, or you have to accept to pay the cost for startup.
[25:30]
C
Yeah, we take the choice of the warm pool. In this case we optimize for the user experience only. Of course, cost is a thing, and we believe that the savings from being able to reclaim memory offset the cost of warm pool availability. In our case we have to fetch a git repo and do a bun install. And something we actually haven't talked about, which mitigates a lot of this, is disk and memory snapshotting.
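The warm pool trade-off discussed here can be sketched as a tiny in-memory model: sandboxes are started ahead of time and handed out on demand, and a request only pays the startup cost when the pool is empty. The class `WarmPool`, its methods, and the sandbox naming are hypothetical and are not Lovable's implementation or any Kubernetes API.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class WarmPool:
    """Toy model of a warm pool of pre-started sandboxes."""

    target_size: int
    _ids: count = field(default_factory=count)
    _idle: deque = field(default_factory=deque)
    cold_starts: int = 0

    def _start_sandbox(self) -> str:
        # Placeholder for the expensive part (scheduling, image pull, boot).
        return f"sandbox-{next(self._ids)}"

    def replenish(self) -> None:
        """Start sandboxes in the background until the pool is back at target size."""
        while len(self._idle) < self.target_size:
            self._idle.append(self._start_sandbox())

    def claim(self) -> str:
        """Hand out a warm sandbox if one is idle, otherwise pay for a cold start."""
        if self._idle:
            return self._idle.popleft()
        self.cold_starts += 1
        return self._start_sandbox()


if __name__ == "__main__":
    pool = WarmPool(target_size=2)
    pool.replenish()
    claimed = [pool.claim() for _ in range(3)]
    print(claimed, "cold starts:", pool.cold_starts)
```

The idle entries are the cost being paid up front; a larger `target_size` means fewer cold starts but more resources sitting unused, which is why cheap memory reclaim matters to the economics.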
[25:56]
B
Yeah, that's an interesting point. So let's talk about it. So what are you using? How are you solving this with snapshotting?
[26:01]
C
We have experimented with every type of snapshotting. So bundling the most common node modules in the image is a kind of snapshotting, where you always take the latest image. We have moved since then to a more efficient way to use git. But before, we used to take a disk snapshot of only one folder and then mount that folder in whenever the project, you know, the pod is claimed or the sandbox is claimed from a warm pool. We would mount a file system for their git repo so they don't have to fetch it again.
[26:32]
B
Okay.
[26:32]
C
Because there were some limitations. We had to fetch the entire repo, all branches and all commits, every time. We don't have that problem anymore. So now disk snapshotting has kind of less impact on our general projects. Yeah, but of course as with anything new from Google, there's this power law problem where some customers at the far end are a lot bigger than the general case. So you can imagine people create projects that are massive, and then of course skipping a bun install could save you minutes. So then again the cold start of the pod does not matter. All that matters is, you know, you want to do as little work as possible until the user can start working.
[27:11]
B
Yeah, like there are always extreme solutions you can take to solve these kinds of problems. So one of the things I was thinking about for your particular use case, you could always get like the most expensive network attached disk and then literally just dump all of your git on it and then mount it to the node when the node starts. But then that's going to be a problem in the sense that somebody has to pay for that disk, right?
[27:34]
C
Yeah. And unfortunately in Kubernetes you cannot add a PVC to a running pod.
[27:38]
B
Not dynamically.
[27:39]
C
Yeah, no, exactly. So we can't really use the warm pool in a Kubernetes native way to do it. So here some others use, like, FUSE for this. There's also a new movement here, like doing FUSE file systems, like user space file systems, on top of GCS or S3.
[27:53]
B
Yes.
[27:54]
C
To kind of mitigate that problem.
[27:56]
B
Or maybe the other one would be the new. I'm not sure if it's supported in Kubernetes yet. I think it is. It's mounting a Docker container as a volume, essentially, so.
[28:06]
C
Exactly.
[28:06]
B
But that also comes with caveats because you. Basically it matters how often that Docker image has to be updated.
[28:11]
C
Yeah, yeah, we do this. I think we update our. You know, we have this. I think it's called secondary boot disk, which is, I believe is rather new, where you can kind of get a boot disk into the node where you have all your images and then you don't have to stream it. So here we can skip 85, 90% of the image streaming to each of the nodes. And we update that every week or every day. And it saves at least a few hundred milliseconds in the general case.
[28:39]
B
So there is a new project. There is somebody working on something and it's not public yet, so I'm not going to mention the name, I'm not going to mention any of this stuff, but I'm going to just tell you on a high level what they're working on. Somebody I know is working on something that is along the lines of a dynamic registry. So what does that mean? So it's just a Docker registry, right? And you would do a docker pull on that registry on a specific image, and then you would add a comma separated list of plugins. In your particular case, you could imagine you do a docker pull on Bun plus a bunch of bundles, then the registry would return to you a Docker manifest that is composed on the client side. So the Docker manifest contains the layers, the Docker layers, and for each layer it contains the link to where that layer is in the registry. And it's coming soon. So I think once this becomes available, this could be actually something you could use to solve the dynamic linking of Bun. Because essentially what you want to do is dynamically bundle the Bun installation on the pod, right?
[29:41]
C
Yeah. This is very familiar. I know there's another one, I think it's a CNCF project called Dragonfly, if I remember correctly. I was confused whether it's the Redis clone, which is also called something similar, or a sandbox type, which is also called Dragon-something. It allows you to stream OCI layers down, kind of peer to peer, and also send them out. So what you're describing seems to have also propagated in the industry over the last six months, or since last summer. I've heard people muttering about it and kind of figuring out how you hack the OCI manifest, or not hack, but use the OCI manifest to add new layers from the machine.
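The idea of assembling a manifest from a base image plus extra layers can be sketched against the general shape of an OCI image manifest, which lists a config descriptor and an ordered list of layer descriptors, each with a media type, a content digest, and a size. This is only an illustration: the unnamed project described above is not public, so `compose_manifest` and its inputs are hypothetical, and a real implementation would also need an image config whose layer list matches the composed layers.

```python
import hashlib
import json

LAYER_MEDIA_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"
CONFIG_MEDIA_TYPE = "application/vnd.oci.image.config.v1+json"
MANIFEST_MEDIA_TYPE = "application/vnd.oci.image.manifest.v1+json"


def descriptor(blob: bytes, media_type: str) -> dict:
    """Describe a blob by media type, sha256 digest, and size in bytes."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def compose_manifest(config_blob: bytes, base_layers: list, plugin_layers: list) -> dict:
    """Hypothetical composition: base image layers followed by extra plugin layers."""
    layers = [descriptor(blob, LAYER_MEDIA_TYPE) for blob in [*base_layers, *plugin_layers]]
    return {
        "schemaVersion": 2,
        "mediaType": MANIFEST_MEDIA_TYPE,
        "config": descriptor(config_blob, CONFIG_MEDIA_TYPE),
        "layers": layers,
    }


if __name__ == "__main__":
    # Placeholder byte strings stand in for real layer tarballs.
    manifest = compose_manifest(b"{}", [b"base-os", b"bun-runtime"], [b"common-node-modules"])
    print(json.dumps(manifest, indent=2))
```

Because each layer is addressed by digest, a node that already has the base layers cached would only need to fetch the added ones, which is what makes this approach attractive for pre-bundling dependencies.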
[30:19]
B
Yes. Like dynamically inject the layers into the node. That's another one. Yeah. That particular use case is useful for patching a running container, actually. So if you discover a vulnerability in an application, you could just patch the particular layer containing that dependency without restarting the application. Right, right.
[30:35]
C
So if it's like the read only layer, for example, for a specific path, you can just replace it as it's running.
[30:41]
B
Correct, correct. Without restarting the container. But I mean, so this actually brings me to what I believe is going to be my last question. A lot of what we talked about for your particular use case is that you are doing things kind of in a hacky way. I'm putting "hacky" in quotes here because you're trying to optimize Kubernetes for your use case. But do you see Kubernetes moving in a direction where all of these things are not going to be a problem for you?
[31:07]
C
I don't believe that's going to be possible, but not for a technical reason. I think you can definitely optimize Kubernetes for our use case, but you would have to be willing to replace core parts of Kubernetes that other people depend on. You know, Kubernetes is quite pluggable, but a contentious topic, depending on who you talk to, is like, oh, we only use etcd, for example.
[31:30]
B
Yeah.
[31:31]
C
And then it's like, oh, this is the data layer we should work on. And etcd is perhaps not the most scalable database, depending on, you know. I don't know if they fixed it, but at least during last year you could not have more than 16 gigs of storage in etcd, as like a hard cap.
[31:45]
B
Yeah, Max, yeah, hard cap, yeah.
[31:46]
C
So many providers are like, oh, we have to make something etcd compatible and run it on Spanner, for example, or run it on FoundationDB and things. So where our problem is going to go, specifically for this high churn, super wide horizontal workload, running 500,000 to a million sandboxes where the churn is hundreds of thousands per second, I don't believe Kubernetes as it is will be the right fit. Maybe it is a nice solution to work with, but it would require such radical changes in the core that it's hard to get it upstream into Kubernetes. So then maybe it's easier to work around it. I know, for example, people use Virtual Kubelet for this heavily.
[32:23]
B
Okay.
[32:23]
C
So for example, you make your own per node scheduler that you don't have to rely on the Kubernetes API or control plane because it can be quite slow. But it's Kubernetes compatible. So you run your control plane in Kubernetes, but the node itself is just Kubelet compatible so you can register it as a node. This seems, I think, the way I like to think about it.
[32:43]
B
Yeah, I know exactly what you're talking about. It's something like sort of an external runner where the control plane stays Kubernetes but where the stuff runs is not the way Kubernetes runs things. Right.
[32:52]
C
Yeah, exactly. So I do think, I'm not sure, but Virtual Kubelet, I'm pretty sure it's a CNCF project.
[32:57]
B
I think it is. I'm not aware of any provider that has this today.
[33:01]
C
I'm quite sure you can run, for example, like, I don't know if it's ECS or other like, you know, functions as a service essentially.
[33:09]
B
Got it.
[33:10]
C
I think that's the common one. You can also do like CRI, which is like run any type of container through it. So in this way we can keep the good parts, which is, you know, what Kubernetes is good at: running control planes, where it's really good. And I know people don't like Kubernetes for stateful workloads, but you know, Vitess is huge and is great, and if you put the thought into it and you understand how Kubernetes works, it's best in class for that.
[33:38]
B
So the more you talk, the more I have questions. How are you guys solving that? Like, are you using Vitess?
[33:43]
C
Yeah. No, we do not.
[33:44]
B
But do you give people like when they're building an application, how do you handle databases within their sandbox?
[33:49]
C
Yeah, so Lovable Cloud is like, whenever you ask for a database, we will give you essentially a Postgres database that can scale vertically in this case, and an S3-compatible bucket, and a few more things.
[34:02]
B
Running inside Kubernetes?
[34:04]
C
No, that runs on a third party provider today. Okay, but this is also a very interesting question, which is maybe beside the point here, but where I believe there's a project. I want people to work on it, so maybe I should bring it up. I would love for people to work on something like Vitess. But Vitess is essentially MySQL, and it's made for massive single databases. So you can run, YouTube is famous for it, like one MySQL database, but it's, I don't know, thousands, tens of thousands, hundreds of thousands of nodes.
[34:30]
B
A lot of nodes.
[34:32]
C
Massive. And where, you know, the new world is kind of sandboxes. You could of course run the database on the sandbox, but then you need to keep the sandbox always running. Yes. So I believe the world is kind of going towards where I guess Neon revolutionized the industry last summer, for a different type of use case which is not like production ready, but it's like, how do you run a hundred thousand databases?
[34:53]
B
Yeah. And they were so successful. They were acquired, essentially.
[34:56]
C
Yeah, exactly. Eventually they did very well. And it's a completely different problem than running a database and keep it always available.
[35:04]
B
But I do see your point that in the future, where you might need to run like hundreds of thousands of agents on the same cluster, you need to solve this. Like, how do you handle hundreds of thousands of databases? Because you want to have isolation between each autonomous agent, right? Like, I don't want my single OpenClaw instance to just be able to talk to all my databases because, you know, agents are agents.
[35:25]
C
Yes, exactly. And that's, to be honest, where you need people with knowledge. It's really hard to install like a managed system to do the isolation for you today.
[35:35]
B
Yeah.
[35:36]
C
So, I guess it's the way security works, you can never really, truly know that all your stuff is isolated.
[35:45]
B
Yes.
[35:45]
C
Even if you're network air-gapped, you still don't know. So I believe that's also something that we work on all the time. In our case, it's mostly about finding the anomalies if they happen and then trying your best to keep them from happening, you know, and pen testing and such.
[36:00]
B
Yeah. I mean, as we always say, like security is as strong as the weakest link within the system, essentially.
[36:06]
C
Right.
[36:06]
B
So yeah. Awesome. Cool. Well, I do have one more question, and this is unrelated to Kubernetes. How do you come from a family that makes chocolate to Kubernetes? Help me draw a link there.
[36:19]
C
Yeah, yeah. I had, I talked about it with my. I was out with my dog during lunch and my neighbor came like, oh, my family is a family of entrepreneurs, essentially everyone. My grandparents, you know, and all their kids and all cousins. So chocolate ended up being, you know, for my mom. She has had so many companies. Single, you know, solo entrepreneur.
[36:39]
B
Oh, wow.
[36:39]
C
Her whole life. And Nicole and my dad. So she's like, I want to make people happy. And she likes to talk and, you know, converse and, you know, she's kind of, I don't know what you would call it, pastor or herder or whatever you want to call it. So she thought that, you know, chocolate is fun and tasty and what if I just teach people chocolate and we just have fun?
[36:58]
B
Wow.
[36:58]
C
Nice. So like 15 or. Yeah, I think 15 years ago, she, you know, became a chocolatier, did the formal training, started from home, and now she's kind of the local famous person in our municipality. Like, oh, the chocolate lady. And she named her company, you know, like, Chocolate for the Soul. It's kind of that. That mentality.
[37:16]
B
Awesome. That sounds awesome. All right, so I think that this is probably going to be the first time we do this on a podcast. We're going to leave a link to your mom's company or store in the show notes.
[37:26]
C
Please. Yes, you can find it, you know, a 30-minute drive outside of Stockholm in Sweden.
[37:31]
B
All right, so probably the next time KubeCon comes to Stockholm, we can make sure that your mom talks to the CNCF.
[37:39]
C
Yes. She'll do the catering for the desserts.
[37:41]
B
Awesome. Thank you so much, Jonathan. It was a great conversation.
[37:45]
C
Yeah, likewise. Super nice to be here.
[37:46]
B
Thank you.
[37:47]
C
Yeah. Thank you so much.
[37:50]
B
That brings us to the end of another episode. If you enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on social media at @KubernetesPod or reach us by email at kubernetespodcast@google.com. You can also check our website at kubernetespodcast.com, where you will find transcripts, show notes, and links to subscribe. Please consider rating us in your podcast player so we can help more people find and enjoy the show. Thank you for listening and we'll see you next time.