Summary

Podcast Summary: Behind the Craft

Episode: How This Ex-Meta L8 Engineer Ships 40 PRs a Day with AI Agents | Kun Chen

Host: Peter Yang
Guest: Kun Chen (Ex-Meta/Microsoft L8 Engineer, Solo AI Builder)
Date: June 7, 2026

Episode Overview

In this episode, Peter Yang interviews Kun Chen, a former L8 engineer at Meta and Microsoft who now builds AI-powered products solo. Kun shares his hands-on workflows for using AI agents to radically accelerate product development, discussing techniques, custom tooling (including open-source utilities he’s released), and strategies for scaling outputs—even as a one-person team. The conversation provides a practical roadmap for product leaders and builders who want to maximize the value of AI agents across the full product lifecycle.

Key Discussion Points & Insights

1. Rethinking the Product Development Workflow with AI Agents

Three-Phase Workflow: Plan, Code, Validate
- Planning: Kun invests heavily in crafting detailed specs and goals with agent-assisted brainstorming.
  - “If I spend a lot of time crafting a very detailed plan, then I can let the agents go for longer.” (03:01, Kun)
- Coding: Delegated almost fully to AI agents, maximizing execution speed and parallelization.
- Validation: Agents perform most tests and checks; Kun only steps in for ambiguity or judgment calls.
Shifting Time Investment
- Emphasizes up-front planning and parallel sessions to multiply throughput:
  - "Currently, I think I spend more time in the planning phase... the coding phase is pretty much entirely the agents." (01:11, Kun)

2. Practical Demonstrations & Custom Tooling

a. Lavish Editor: Visual, Collaborative Planning

Problem: Long text-based plans are inefficient for human review and agent-human collaboration.
Solution: Kun’s tool, Lavish, turns agent responses into interactive HTML artifacts:
- Integrated into agent sessions for real-time visual planning (e.g., UI redesigns).
- Enables point-and-click feedback, direct annotation, and visual iteration.
- Open-source and easily invoked from agent workflows (17:22).
Quote:
- "HTML as an artifact can be a lot richer... It's going to be very visual things I can just interact with." (09:46, Kun)
- "Whenever I find friction in my workflow... I just build something myself." (10:27, Kun)
[Demo begins at 06:25] — Shows Lavish in action redesigning a kids' AI app.

b. Treehouse: Simplifying Worktree Management

Problem: Git worktrees for session isolation add cognitive load and inefficiencies.
Solution: Treehouse, a tool to quickly spin up ready-to-go worktrees—dependencies pre-installed, no manual management.
Quote:
- "Every time I have to spin up a new work tree to do something new, I don't need to think about it... I just type Treehouse." (14:06, Kun)
[Demo at 13:48] — Shows how Treehouse reduces mental overhead for parallel agent sessions.
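The pooling idea Kun describes (reuse a worktree that already has dependencies installed, create one only when none is free) can be sketched as follows. This is a minimal illustration, not Treehouse's actual implementation: the `WorktreePool` class, the `slot-N` naming, and the `pool/slot-N` branch names are all hypothetical. The only real CLI call assumed is `git worktree add -b <branch> <path>`, and the sketch builds that command rather than running it.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional, Set, Tuple


def worktree_add_command(repo: Path, path: Path, branch: str) -> List[str]:
    """Build the `git worktree add` invocation that creates `path` on a new branch."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


@dataclass
class WorktreePool:
    """Hypothetical in-memory pool of worktrees (an assumption, not Treehouse itself)."""
    repo: Path
    root: Path
    leased: Set[Path] = field(default_factory=set)
    idle: List[Path] = field(default_factory=list)
    created: int = 0

    def acquire(self) -> Tuple[Path, Optional[List[str]]]:
        """Return a worktree path and, if it is new, the git command that creates it."""
        if self.idle:
            # Reuse an existing worktree: its dependencies are already installed.
            path = self.idle.pop()
            self.leased.add(path)
            return path, None
        # Nothing idle: allocate a new slot and describe how to create it.
        self.created += 1
        path = self.root / f"slot-{self.created}"
        self.leased.add(path)
        return path, worktree_add_command(self.repo, path, f"pool/slot-{self.created}")

    def release(self, path: Path) -> None:
        """Hand a worktree back so a later session can reuse it."""
        self.leased.remove(path)
        self.idle.append(path)
```

A caller would run the returned command (for example with `subprocess.run`) only when it is not `None`. Resetting a reused worktree to a clean state is left out of the sketch.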

c. No Mistakes: Automated, High-Recall Validation Pipeline

Problem: Reviewing every agent-written line is a bottleneck—agents now generate orders of magnitude more PRs.
Solution: No Mistakes, an agent-operated pipeline that:
- Creates new branches, rebases, runs rigorous reviews (in a fresh context), tests, and documents changes.
- Escalates only ambiguous or product-impacting issues for human judgment.
- Presents concise risk summaries and actionable evidence (screenshots, videos).
Quote:
- "If you review every single line of code, you become the bottleneck. So I don't review this first pass code from the agents." (32:29, Kun)
- "Eventually I got to a point where I find myself never catching anything the agents don't catch." (34:18, Kun)
[Demo at 31:59 & 32:29]
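The escalation rule described above (agents run every stage, and only ambiguous results reach the human) can be sketched roughly as below. The stage names come from the summary; the `Verdict` values, the `Finding` record, and `run_pipeline` are hypothetical names chosen for illustration and are not taken from the No Mistakes tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    PASS = "pass"            # stage found nothing to report
    FIXED = "fixed"          # agent found an issue and fixed it on its own
    AMBIGUOUS = "ambiguous"  # needs a human judgment call


@dataclass
class Finding:
    stage: str
    verdict: Verdict


def run_pipeline(
    stages: List[Tuple[str, Callable[[], Verdict]]],
) -> Tuple[List[Finding], List[Finding]]:
    """Run each stage in order; return all findings and the subset to escalate."""
    findings = [Finding(name, check()) for name, check in stages]
    escalations = [f for f in findings if f.verdict is Verdict.AMBIGUOUS]
    return findings, escalations
```

In this sketch each stage is a plain callable. In practice each would be an agent step such as a review run in a fresh context, and the escalation list is what the human sees.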

3. Managing Parallelization & Agent Collaboration

Massive Agent Concurrency:
- Kun ships 20–40 PRs daily, with 20–30 agents actively working in parallel sessions (23:21, 48:26).
- Sessions are parallelized both within a project and across projects (04:51).
- "I'm always spending my time productively while the agents are doing the work." (04:12, Kun)
Sub-Agents and Context Management:
- Uses sub-agents for tasks that would bloat the main context window (24:43).
- Examples include codebase exploration or running isolated experiments (e.g., language benchmarks).
- "The main reason I would use a sub agent is to avoid context window blowing up..." (24:49, Kun)
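The fan-out pattern behind this can be sketched as follows: each isolated job does its verbose work locally and hands back only a short conclusion, so the coordinating session never holds the intermediate detail. The sketch uses the "200 multiplied by eight" experiment grid mentioned in the episode, which it reads as 200 tasks across eight languages. `run_subagent` is a stand-in stub, not a real agent API, and the thread pool is only an analogy for separate agent sessions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Iterable, List


def run_subagent(task: str, language: str) -> str:
    """Stand-in for a sub-agent: verbose work stays local, only a summary returns."""
    scratch = [f"step {i} of {task} in {language}" for i in range(100)]  # never leaves this call
    return f"{task}/{language}: explored {len(scratch)} steps"


def fan_out(tasks: Iterable[str], languages: Iterable[str], max_workers: int = 8) -> List[str]:
    """Run every (task, language) pair in isolation and collect one-line conclusions."""
    jobs = list(product(tasks, languages))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves job order, so results line up with the experiment grid.
        return list(pool.map(lambda job: run_subagent(*job), jobs))
```

With 200 tasks and 8 languages, `fan_out` returns 1,600 short strings, while the 160,000 scratch lines produced along the way are discarded inside the workers.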

4. Best Practices for Agent-Driven Development

Code Review (Automated):
- Fresh context is crucial for code review agents to avoid bias (35:48).
- "When you use a fresh context window, you just get a lot more edge cases caught." (36:17, Kun)
Testing:
- Custom testing instructions placed in project-level agent MD files for e2e validation.
- AI agents can automate user-level browser flows, not just unit tests (30:11-39:47).
- "The general principle is if you're doing something manually... try to turn that into something the agent does for you." (31:32, Kun)
Merging & Risk Management:
- Merge low-risk PRs with little review; medium/high-risk changes get more attention (47:04–47:40).
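A minimal sketch of that triage, assuming each PR already carries a risk label from the validation pipeline. The `Risk` levels, the `PullRequest` record, and the `triage` function are hypothetical names for illustration; the episode does not specify how risk is computed.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List, Tuple


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class PullRequest:
    title: str
    risk: Risk


def triage(prs: Iterable[PullRequest]) -> Tuple[List[PullRequest], List[PullRequest]]:
    """Split PRs into a light-review merge list and a human-attention queue."""
    prs = list(prs)
    auto_merge = [pr for pr in prs if pr.risk is Risk.LOW]
    # Highest risk first, so human attention goes where it matters most.
    needs_human = sorted(
        (pr for pr in prs if pr.risk > Risk.LOW),
        key=lambda pr: pr.risk,
        reverse=True,
    )
    return auto_merge, needs_human
```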
Delegation Philosophy:
- “The more things that I find myself doing that I can delegate to an agent, I turn them into instructions and then let the agents do the work instead of me operating the app myself manually.” (31:02, Kun)

5. Changes in Team Dynamics & Product Engineering

Comparison to Big Tech:
- In traditional teams, code output is limited by human review velocity.
- Now, productivity is agent-limited, not team-limited.
- Many startups are dropping peer review in favor of agent-reviewed PR flows.
- "Our workflows... were built at a time when we spend most of our time coding... But when you start to write 10 times more PRs, we are not ready for that." (43:37-44:17, Kun)
Liberation and Bottleneck Removal:
- “Largely speaking, I feel liberated... Everyone is busy, and if I write like, 20 PRs every day, no one's gonna review that.” (42:23, Kun)

6. Advice for Builders Getting Started with AI Agents

1. Build Prolifically, Iterate Rapidly:
- Just build—learn from each attempt, iterate, and accumulate micro-lessons (50:18).
- “Whenever you have some inspiration... just give that to the agent and have it run for you.” (50:31, Kun)
2. Push for Parallelization:
- Run more agents, use more tokens, force yourself out of the loop (51:14).
- "To really scale up how much we can get from the agents, we have to move ourselves out of the loop as much as possible." (51:04, Kun)
3. Apply AI End-to-End:
- Delegate planning, coding, validation, documentation—everything, not just code writing (52:15).
4. Try Automated Insights:
- Tools like Claude Code's "insights" command analyze your AI agent sessions to suggest workflow improvements (53:07).
Notable Quote:
- "The mindset I would encourage is to just like build every single idea you have... Through that process, a lot of learnings can be derived." (50:18, Kun)

Notable Quotes & Memorable Moments (with Timestamps)

On process:
- "If you review every single line of code, you become the bottleneck. So I don't review this first pass code from the agents." (00:00; reiterated at 32:29, Kun)
- “Currently, I think I spend more time in the planning phase. So planning is mostly me with assistance from the agents. The coding phase is pretty much entirely the agents.” (01:11, Kun)
On tool philosophy:
- "Whenever I find friction in my workflow... I just build something myself." (10:27, Kun)
- "Treehouse will basically set up the work tree for me and drop me into the new work tree. Now you can see it set up a work tree in this directory and it dropped me into it. The good thing is that this directory is from a pool of managed work trees. The dependencies are already installed here..." (14:06, Kun)
On agent validation pipeline:
- “Eventually I got to a point where I find myself never catching anything the agents don't catch.” (34:18, Kun)
- "When you use a fresh context window, you just get a lot more edge cases caught." (36:17, Kun)
On team dynamics:
- "Our workflows and how our teams work were built at a time when we spend most of our time coding. But when you start to write 10 times more PRs, we are not ready for that." (43:37–44:17, Kun)
On advice for aspiring builders:
- “Whenever you have some idea... just give that to the agent and have it run for you.” (50:31, Kun)
- "To really scale up how much we can get from the agents, we have to move ourselves out of the loop as much as possible." (51:04, Kun)
- "Whenever we find something manual... just try to think about, can we delegate that to the agent as well?" (52:15, Kun)

Important Segment Timestamps

AI-Driven Workflow Overview: 01:11–03:38
Planning Phase & Lavish Editor Demo: 06:25–19:36
Treehouse (Worktree Management): 13:48–15:06
Sub-agent Usage & Parallelization: 23:21–26:02
Experiment Management with Agents: 26:08–28:01
Validation & No Mistakes Tool Demo: 31:59–38:44
Team/PR Review Culture Shift: 42:23–45:04
Productivity Metrics (PRs per day): 47:40–48:46
Advice for New Builders: 50:18–54:48

Tools & Resources Mentioned

Lavish Editor: Visual collaborative spec/artifact generator (GitHub) (17:22)
Treehouse: Easy worktree session manager (14:06)
No Mistakes: Automated PR review and validation pipeline (32:29)
Claude Code Insights: AI-generated workflow improvement suggestions (53:07)
Benchmarks: ProgramBench, SWE-bench (26:08)

Where to Find Kun

GitHub, X (Twitter), YouTube, LinkedIn: @kunchenguid (55:44)
Shares tools and workflows open-source—recommended to follow for more resources.

Takeaways for Product Leaders & Builders

Invest in detailed agent-friendly specs to maximize autonomous agent output.
Embrace parallelization: run many agents and sessions; build infrastructure/tools to manage context and dependencies.
Automate validation: combination of agent-authored tests and independent agent review can surpass traditional velocity.
Use the “delegate everything possible” mindset; convert all manual steps to agent-readable instructions.
Building with AI is a new literacy—practice, iterate, let go of bottlenecks, and treat friction as a prompt for new tools.

Links to Lavish, No Mistakes, and other tools are provided in the episode description.


Transcript

[00:00]
A
If you review every single line of code, you become the bottleneck. So I don't review this first pass code from the agents. Eventually, I got to a point where I find myself never catching anything the agents don't catch. I typically have, like, at least five different sessions actively running. On average, there's like 20 to 30 agents running most of the time. It's like 20 to 40 kind of PRs every day. Our workflows and how our teams work were built at a time when we spent most of our time coding. But when you start to write 10 times more PRs, we are not ready for that. To really scale up how much we can get from the agents, we have to move ourselves out of the loop as much as possible.
[00:45]
B
Hey, everyone. Today I'm really excited to welcome my friend Kun, an L8 engineer from Meta and Microsoft who's now a solo AI builder. Kun is going to show us exactly how he builds products using agents. I've been asking him a lot of dumb questions about all this, so really excited for him to show us live. So welcome, sir.
[01:02]
A
Thanks for having me here, Peter.
[01:04]
B
All right, so let's get right into it. Maybe you can start by kind of walking through at a high level how you're building products with agents.
[01:12]
A
All right, this is my workflow: plan, code, and validate. I don't think this is too different from what everybody does, so I will probably talk through the parts where I think I'm doing something unique. So I think typically when we build something meaningful, we typically go through these phases. We plan what the requirements are, and then we let the agent code, and then we have to do some validation to make sure the agent actually did what we wanted them to do. So this, the high level workflow, I think, is pretty standard. Where I think I do something different is how much time I spend in each phase. So currently, I think I spend more time in the planning phase. So planning is mostly me with assistance from the agents. The coding phase is pretty much entirely the agents. So once the requirements are planned very clearly, I trust the agents to do most of the work. Then in the validation phase, I use agents a lot as well, and agents do most of the work with some judgment from me when things are ambiguous. And I think the part about this is that if we actually start to delegate most of the coding to agents, the way I think I can get agents to do more for me is to try to increase the amount of time agents spend in this phase, because this is entirely agents. If we can get the agents to go for longer, then I will get more done. This is one area where I tried a lot of things to just scale up the amount of time I can let the agents run autonomously.
[02:54]
B
Yeah, it's almost like the code and the validation is a loop that the agent can run itself. Right. So that it can actually code for a longer time period.
[03:02]
A
Yeah, yeah. And also, I think it depends on how much time we spend in the planning phase. So if I spend a lot of time crafting a very detailed plan, then I can let the agents go for longer. If I only write a very short prompt, then what I'll find is that very quickly the agents will get work done and then I'll need to go back and prompt them again. So, like, how much time we invest in the planning phase actually affects this a lot.
[03:27]
B
Okay, that's a really good point because I. I've gone like, super lazy with these agents. I don't actually, like, I just give them like one line prompts and yeah, it never works for hours. So, yeah, we'd love to kind of see each phase.
[03:39]
A
Yeah, yeah. So, yeah, I think the things that we can do differently in the planning phase is like go from a short prompt to say, what is the next action you should take to something more like a spec, where you write down a more comprehensive set of details of the requirements and then go from spec to a goal. So if you can actually craft a measurable goal, you can let the agents do a lot of experimentation.
[04:04]
B
Okay. Okay. So can you show us how this works? Like maybe we can start with the planning phase. Some example plans that you write.
[04:12]
A
Yeah. Actually, there is another dimension of how I optimize this flow as well, which is like, if you look at this timeline, the parts that need me is only this beginning and the end. So what I do is I make sure I can parallelize a lot of sessions. So I'm always spending my time productively while the agents are doing the work. So I think increasing the. The amount of concurrent parallel sessions, that's also a very important aspect of how I get more done.
[04:46]
B
And do you parallelize sessions in the same project and product or like across products or both?
[04:52]
A
Both, both. So I have a hybrid of different projects, but even within the same project, I sometimes have multiple sessions doing different things.
[04:59]
B
Yeah, it's funny. It's funny cause we used to, like, you know, both of us used to work in big tech, and it used to be a lot of context switching between meetings, but now you're context switching between different threads. It's actually Faster context switching in some ways.
[05:14]
A
Yeah, totally. I think it's kind of like someone that's overseeing a very large scope. There's always different things happening and there are different things escalating to you and you need to jump into different things depending on where you are needed the most. So this is very much like that.
[05:31]
B
Okay. This episode is brought to you by Linear. When engineers use tools like Cursor, Claude Code, and Codex, a lot of work happens invisibly. Someone can go from a bug report in Slack to a shipped fix without creating any record of what happened outside of the code editor. And that's fine for speed, but it makes coordination harder as you scale. Linear integrates with the very best agent coding tools directly, like Cursor and Codex. That way anyone can see what an agent is working on and who assigned them to the task. You get the speed of agents without losing visibility across the team. Product teams at OpenAI, Ramp, and Block are all using Linear to collaborate with AI agents. And I use Linear myself to run my creator business. So check it out at linear.app/agents. That's linear.app/agents. Now back to our episode. Can you show us your AI stack or agentic coding setup?
[06:25]
A
Yeah, let's do it. This is my terminal. This is where I do all of my work, pretty much. Occasionally I switch to a GUI or a browser, but most of the time I'm spending here. I'm using a project here as an example to walk through it. This is a project called Hibit. This is the AI tutor I'm building for my son. It's an AI agentic harness for kids. Basically, I just built a new screen. Let me show you what that looks like. I revamped the main screen a little bit, but this is very messy because I just did this this morning and it's not looking good. This is not how I want this to look. What I'll do, very typical workflow: I'll take a screenshot of this. Took a screenshot, and then I come to my agent. I use OpenCode a lot, so I'm going to just launch OpenCode in here.
[07:28]
B
And you use it because you can use multiple models.
[07:30]
A
Yeah, exactly. So I can very quickly try different models when the new models come out. That is the big benefit I get from these open source tools. What I'll do here is I'll just say, hey, look at this screen. I'll paste the image here and I'll say the things we saw on the screen. The things that I was not very happy about were that there are too many technical details that are not friendly for kids. Also there is a big area of white space unused. Those were the problems that we saw on the screen that were clearly not ideal. So I'll point out these problems and I'll say, hey, can you propose some options for how we improve? Right. So this is the request I sent to the agent. So because I sent the screenshots, the model is going to be able to see visually what is going on there and then it's going to look at the code base as well. So yeah, it very quickly came up with this plan. So it says like best direction, option one, option two. The thing with this plan is that it's not very easy to read. Right. So like when you look at this long wall of text, I will spend so much time reading this text. So what I do instead, let me just try a new session. What I actually do is I use a visual editor to do the planning. I'll say the same thing, look at this screen. There are too many technical details. Same thing. I will just add one bit to say use Lavish to discuss this with me along with any questions. So Lavish is a visual editor I built after I read the article about HTML over markdown. Have you seen that?
[09:43]
B
Yeah, from the third week. Yes.
[09:46]
A
Initially when I saw the article I was not very sure about that because I felt like HTML is going to be so token inefficient; the models will have to write a lot more than a simple markdown. But when I tried it, it's actually super useful. So I'll show you once we have this result from here. The HTML as an artifact can be a lot richer in terms of supporting this collaboration between human and agents. So it's not going to be a long wall of text I have to read through. It's going to be very visual things I can just interact with.
[10:22]
B
So Lavish is like an app that you build to create the HTML in the format that you want?
[10:28]
A
Yeah, yeah, it's a tool I built. So what I do is every time I encounter any friction in my workflow and I don't find anything that can solve the problem for me, I just build something myself. Lavish is a tool I built. It's a tool for both generating the HTML artifact and also supporting the back and forth interactive experience between human and agents on that. Because what we could do is I can just ask the agent to generate an HTML file, then I can open up the HTML file in the browser and it works. The problem with that approach is that once the HTML file is open and I look at the HTML file and I see that there are some things I don't like, it's very hard for me to then tell the agent, hey, please change this part. Please iterate on this aspect. That back and forth is what Lavish Editor is trying to solve.
[11:26]
B
Awesome. Yeah, really excited to see what it is.
[11:30]
A
Now it's writing the HTML. It'll probably take a little while because that's usually a lot of content to write. So let's see, maybe one thing I can show here is that while the agents are working, typically agents either coding or planning can spend quite some time doing this work. So what I do is I'll just spin up another parallel terminal tab, a window. I use tmux, so this is a new tmux window. And in this window I will do something else. We can see it's in the same directory. The problem here is that if I spin up another agent to work in the same directory, they will run into each other. So what this agent does in this session will step on the toes of the other agent that's already doing the work. This is where people started using work trees. So typically what people do is git worktree add and give another directory like hibit, and spend like five minutes thinking about the name. But I'm just going to say Hibit 2. So the problem with this approach is that once I create a work tree like this, next time I come to this work tree I have to think about: what is Hibit 2 doing? What is this work tree doing? Is it still being worked on? Is it okay to use it for something else? It's very hard to keep track of. The other problem is when we create a new work tree, the dependencies are not installed in the work tree. In this work tree we have things like node modules. These are dependencies downloaded on the fly. And these dependencies won't exist in the new work tree until you install all of them again. There were many problems like that.
[13:17]
B
And just for people who don't know, what's the definition of a work tree? It's like a copy of the code base, right?
[13:22]
A
Yeah. So a work tree is basically like, you can think of it as a clone of your current git repo in another directory. So it's going to be a parallel directory and they don't directly interfere with each other. So you can do a different kind of work, different set of work in the work tree and it won't affect what you were doing in the main repo.
[13:42]
B
Okay, but you're saying that there's like many issues with the work tree. So what?
[13:48]
A
Yeah, basically there's a very heavy cognitive load to maintain the work trees. You have to think about which work tree is which and which ones are okay to clean up, etc. So what I did was I have a tool called Treehouse. So Treehouse is basically like a no brainer, like a very simple way to manage work trees. Every time I have to spin up a new work tree to do something new, I don't need to think about, do I have another work tree I can use, do I create a new one? I just type Treehouse. Treehouse will basically set up the work tree for me and drop me into the new work tree. Now you can see it set up a work tree in this directory and it dropped me into it. The good thing is that this directory is from a pool of managed work trees. The dependencies are already installed here because I have used this work tree before, so I don't have to reinstall the dependencies, rebuild the project every single time. It also helps on the efficiency aspect. It just reduces the mental load a lot. I don't need to think about anything. I just type Treehouse every time I want to start a new session.
[15:04]
B
That makes sense. Okay. All right, dude, well, let's go back to the other tab.
[15:07]
A
Yeah. So this is what the HTML looks like. So it's saying, hey, redesign discussion. There's a tiny icon here that's not available; not sure what happened there. But basically it wrote the proposal in a visual artifact. Right. So what's feeling off: the screen is doing grown-up work in a kid's space. Exactly right. And these things, there's unused space.
[15:38]
B
Yeah, it's easier to skim and read for a human.
[15:43]
A
Yeah. And if I look at this artifact and see something that doesn't feel right, I can just annotate. So bit has no visible body. I can just click on this and say I don't care about this and give the feedback to the agents this way.
[16:04]
B
Oh, I see. So this is your app. Okay, got it, got it. Okay, that makes sense.
[16:06]
A
Yeah. So this is a lot more difficult to do when it's a long wall of text. Right. When it's a wall of text, you have to say to the agent, hey, I'm not happy about this part of the spec. And you sometimes have to copy paste a lot.
[16:21]
B
Got it.
[16:22]
A
Yeah. So it basically proposed a bunch of things. Copy cleanup.
[16:28]
B
Yeah, Some of the layout things is not ideal, but yeah, I get it. It's easier to read for sure.
[16:33]
A
Yeah. And I think there's probably something that went wrong in this page. Let me just check. I can just ask the agent as well, because when I look at this, I think the agent is trying to give me a visual representation of the layout, but because of the CSS, it's not quite working or something. It seems the CSS styles are not working. So, yeah, I can just send feedback back to the agent this way, and I don't have to keep switching between the HTML artifact and the agent in the terminal. I can just talk to the agent here and I can easily annotate everything and just pinpoint exactly where.
[17:18]
B
I mean, can you show folks where they can download this tool? It's open source, right?
[17:23]
A
So it's in my GitHub repo, lavish-axi, and it's actually very simple to start using it. Just tell your agent, use npx lavish-axi to write the technical plan or do whatever you want, and the agent will go invoke this and everything goes on from there.
[17:46]
B
And do you have to hook up your own API key for the LLM?
[17:51]
A
No, you just use whatever agent you are already using. This lavish editor itself does not run another agent. It runs within your agent session. So.
[18:01]
B
Okay. Okay, got it. Yeah.
[18:03]
A
So you can see here the agent is calling lavish-axi to pull this artifact.
[18:10]
B
Okay, that makes sense.
[18:11]
A
So let's come back to it. Yeah, so now it fixed the CSS problem. Right? This is what it's supposed to look like. So you can see this is a lot more visual and easier to understand.
[18:24]
B
Looks a lot better.
[18:26]
A
Yeah. So this is like pointing out the current layout, current problems, and then it's probably proposed a new thing. Okay, so it proposed four directions for using the space better. Option A looks like this. This is so much easier to see than the long wall of text we have in the terminal here. We can see. Okay, it's moved layout a little bit. Now. This is the chat. This is some other area. Okay, that's one option. And even gave me buttons. So if I like option A, I can just click this button and I get the option A.
[19:05]
B
Got it.
[19:06]
A
So option B looks like this. Today's goal. Okay, option C is this. Okay, option C is very simple. I actually like this option D. Okay, yeah. So let's say I like option C. I can just click this. And it basically queued a piece of feedback to the agent saying, I like option C. It's Just so easy to interact with. I don't have to keep typing every time I want to tell the agent something. Everything can be done interactively.
[19:37]
B
Okay, so this is the plan phase for building a new feature on top of an existing app, right?
[19:43]
A
Yeah.
[19:43]
B
I'm curious, and maybe not to show this, but I'm just curious how you plan something from scratch initially. Did you spend a lot of time planning the milestones and the tech stack and that kind of stuff?
[19:53]
A
Yeah. So if it's something from scratch, I usually have to spend a little bit more time. So what I do is that I use the same lavish editor. I tell the agent that I want to brainstorm a new idea with you and I'll probably talk through some of my initial thinking for what things I think are the core parts of my idea, and then I'll ask the agent to criticize that and come up with areas of risks or weaknesses maybe I haven't thought through yet. And then come back with its opinion. And the agent will then come back with a HTML artifact like that and I can look at the artifact to basically work with the agents to refine the idea to a point where it becomes a spec, basically.
[20:44]
B
Do you always include some certain sections in your spec, like build it in three phases or here's the milestones or here's the tech stack. I want you to use that kind of stuff.
[20:53]
A
Yeah, yeah. So for some projects, for some ideas, I already have some opinions on things to use and things to do. In those cases I'll just write them down and say these are my preferences. But I always tell the agents that it's okay for you to push back if you see something that is not right because I want to give the agents the flexibility and I want to see more options as well. So yeah, I basically like give my ideas to the agents, but let the agents give more back.
[21:22]
B
So, so then do you have like a user level agent MD or something that like has some of these best practices like you know, you can push back on me or it's just more natural through the conversation.
[21:32]
A
Yeah, so I, I actually built a lot of those instructions into lavish editor. So whenever the agent is, is using the lavish editor to work with me, the agent already knows a lot of those like those best practices.
[21:49]
B
Got it. Okay. And how about, how about like if you're building like a user facing product, how do you think about the design? Do you have like another tool for design or just you have some skills for design?
[21:58]
A
You mean visual design?
[21:59]
B
Yes.
[22:00]
A
Yeah, so for visual design I like Claude Design a lot. Since it came out, I use that a lot and very often I'll use a lot of the quota they have for me. So if you look at this bar where I track my quota: Claude, I mostly used up my weekly quota already. I'm waiting for the reset. And Claude Design, I used like 2/3 of it.
[22:26]
B
Okay.
[22:27]
A
Because, yeah, I just find it very useful to, especially for new projects. I use this a lot to build a new design system because once I get the design system built, I can apply that to many, many different components in my project very easily.
[22:42]
B
Yeah, okay, maybe you can show that later, but why don't we finish this workstream first?
[22:48]
A
Yeah, cool. So yeah, we basically we chose option C, right? So now we can just say, hey, build option C now. And because we already have the plan written in the HTML artifacts, the agent already has the context on what that means and what the choices were made. Right. So the agent can just like go ahead and implement that. Now.
[23:11]
B
How many, like, since you're just like building solo now at home, like how many of these agent building sessions do you have going? Like the agent actually building something for you at any given time?
[23:22]
A
Yeah, yeah. So I closed as many sessions as I could before I started this session. But I typically have like at least five different sessions actively running. And in each session there are usually like a bunch of sub agents or different agents working. So in total I never like really counted, but I would guess on average there's like 20 to 30 agents running.
[23:47]
B
Okay, got it. You mentioned you have sub agents running. Like you actually specifically asked it to run sub agents or like it just decides to like, when do you actually need a sub agent versus just using one agent?
[23:58]
A
Yeah, yeah, great question. So I think most of the models today and the harnesses, they are not very great at proactively using sub agents. There are only a few cases where Claude Code or Codex will proactively use the sub agents. It's when they have their built-in agents, like Explore. When you ask a complex question, Claude Code will often run an Explore sub agent to do some exploration in the code base and come back with some investigation results. Those are the cases where the models will proactively use a sub agent. But in a lot of cases, because the models, I think they are not trained enough yet to use sub agents in various different kind of cases, you often have to prompt it to do so.
[24:44]
B
Got it. Okay, what are some cases where you actually want to prompt it to use sub agents? Like for validation, or...?
[24:50]
A
Yeah, so I think the main reason I would use a sub agent is to avoid the context window blowing up in the main agent's session.
[24:59]
B
Oh, I see.
[25:00]
A
Yeah. So I think the time when I choose to use sub agents is when I realize what I'm about to do is going to use a lot of context, and most of that context is going to be investigation or exploration. And most of the exploration may not be meaningful for the main session. So in those cases, I basically carve out sub agents to do those investigations and only come back with their conclusions.
[25:30]
B
Okay, so it's like, hey, spin up a sub agent to look at this code base or do some research on this topic and summarize it and give it back to the main agent. Like that kind of stuff, right?
[25:39]
A
Yeah. Or there are cases where I have 10 experiment ideas to run and each experiment can be done in isolation. So in those cases I also just say, hey, spin up 10 sub agents to do that. If I do all of that in the main agent, it's just going to blow up the context window and take a lot of time and tokens as well.
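The fan-out described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Kun's tooling: `run_subagent` is a hypothetical stand-in for whatever launches an isolated agent session, and here it merely simulates one. The point it shows is that the main session keeps each short conclusion and never sees the long exploration log.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for launching an isolated sub agent.

    A real harness would start a separate session here; this stub only
    simulates a long exploration log plus a short conclusion.
    """
    exploration_log = [f"{task}: step {i}" for i in range(1000)]  # stays with the sub agent
    return {
        "task": task,
        "conclusion": f"{task}: done",
        "log_lines": len(exploration_log),
    }


def fan_out(tasks: list[str], max_workers: int = 10) -> list[str]:
    """Run each task in isolation and return only the conclusions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, tasks))  # map preserves input order
    return [r["conclusion"] for r in results]


experiments = [f"experiment-{n}" for n in range(1, 11)]  # 10 independent ideas
main_context = fan_out(experiments)
print(len(main_context), main_context[0])
```

Only ten short strings reach the main session here, while each simulated sub agent produced a thousand log lines of its own.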
[26:03]
B
When you say experiment ideas, do you mean A/B testing stuff, or different ways to build things?
[26:09]
A
Yeah. So there are various kinds of experiments I run. There's one example here I can show. This is something I'm running. This is the one I didn't kill. This is a benchmark I'm running to evaluate the effectiveness of different programming languages when given to agents. There was this benchmark that was published two weeks ago called ProgramBench. It's built by the same people that built SWE-bench, and it's their new thing. ProgramBench basically asks the agents to build a bunch of programs, like ffmpeg, these kinds of tools, from scratch, and sees whether the agents can actually get all the requirements done and pass all the test cases. So that was the benchmark. But I thought the benchmark could be very useful for evaluating different harness techniques and also different programming languages. So right now what I'm evaluating here is: I'm running ProgramBench on Codex and I force Codex to use these programming languages, like TypeScript, JavaScript, Python, and see whether they get different results when they use different languages. Is there a programming language that will lead to the agent getting more requirements done, passing more tests, using fewer tokens, et cetera? So this is a very large number of experiments. Basically there are 200 multiplied by eight, so that's a lot of things to run, and in those cases I basically have sub agents running. If I ran all of these in a single main agent, it would just keep running compaction and not be very efficient.
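To make the scale concrete, here is a small sketch of that experiment matrix. It assumes the "200" refers to benchmark tasks and the "eight" to programming languages; the transcript names only three languages, so the task IDs and language names below are placeholders, and the batch size of 50 runs per sub agent is arbitrary.

```python
from itertools import islice, product

NUM_TASKS = 200      # "200" from the conversation; assumed to be benchmark tasks
NUM_LANGUAGES = 8    # "eight" from the conversation; assumed to be languages

tasks = [f"task-{i:03d}" for i in range(NUM_TASKS)]      # placeholder task IDs
languages = [f"lang-{j}" for j in range(NUM_LANGUAGES)]  # placeholder language names

# Every (task, language) pair is one independent run.
runs = list(product(tasks, languages))


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


per_agent = list(batches(runs, 50))  # 50 runs per sub agent is an arbitrary choice
print(len(runs), len(per_agent))
```

Under these assumptions that is 1,600 runs, or 32 batches of 50, which is the kind of volume that would force a single session into repeated compaction.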
[28:01]
B
That makes sense. Okay, cool. Let's go back to the CAD app. So it looks like it's running a bunch of tests right now, right? Is that just the model knowing to run tests, or do you actually have some instructions to have it build unit tests and stuff like that?
[28:16]
A
Typically, in my AGENTS.md in each project, I will have some instructions for how to perform tests. So here, for example, I can show the AGENTS.md. This is the AGENTS.md for the hybrid project we were looking at. In here we'll just have some high-level context on the structure of the project. Then I'll have some testing instructions. This is actually super helpful. Previously I didn't do this, and I let the agent decide what to do. The agent will just do the minimum. They are trained to run some basic testing, but it's not going to be comprehensive enough. What I have here is instructions for how to do end-to-end testing. This is important for building front-end and UI kinds of projects. We were looking at hybrids, which had a GUI. In this case I tell the agents, hey, this is an Electron app. You can drive this app by running a browser and blah blah blah: how to do this testing, how to actually test things end to end. With that instruction here, once the agent is done with its work, it will actually validate things end to end for me. So that can save me a lot of time from running the app myself and visually validating: is that actually what I want?
[29:46]
B
Okay, so it's basically like using the browser as a user and checking out the app to see if it looks okay, maybe checking for some browser errors.
[29:52]
A
Yeah, exactly. And take screenshots as well. Take screenshots and look at these things visually and see whether it's actually aligned with what we talked about.
[30:00]
B
I think if you use the Codex app I think it does it by default. But let's say I'm not very technical. How do I even know to include this stuff? Should I just tell the agent to run a lot of tests?
[30:11]
A
Yeah. So one thing that's really interesting I found is that by default the agents like to write unit tests, very purely code-based unit tests, and those unit tests often don't actually validate things end to end. So for example, even in Codex, I think Codex by default likes to use the built-in in-app browser, right?
[30:37]
B
Yeah.
[30:39]
A
So when you work on some front-end changes, it will use the in-app browser to look at the change and have you look at that as well. But this is an Electron app, it's a desktop app. So it actually requires a different set of facilities to validate that. So the instructions here are basically how I would test this thing myself.
[31:01]
B
Okay.
[31:02]
A
Yeah. So basically, the more things I find myself doing that I can delegate to an agent, the more I turn them into instructions and then let the agents do the work instead of me operating the app myself manually.
[31:16]
B
Okay, got it. So I guess for someone who maybe is not as knowledgeable as you, the general principle is: if you're doing something manually, like you're now opening the app and looking at the screens, just ask the agent, hey, can you just automate this for me? Right? Just ask it. And hopefully it can figure something out too.
[31:33]
A
Yeah, yeah. So yeah, if you are like not trying to dig into the technical details, then the principle, the high level principle is like if you find yourself manually doing something, then try to turn that into something the agent does for you. And you can very likely, like with today's models, you can very likely just ask the agent to do what you were trying to do. And the agent will figure out, oh, I should do this, I should do that.
[31:57]
B
All right, well, it looks like it's done now.
[32:00]
A
Yeah, it's done now. So now, good question, right? It's done. The agent says it's done and we can look through what it did. It said it changed this, changed that. How do we know this is actually a good change? How do we know there are no bugs? So the validation phase is where I see a lot of people spend a lot of their time. The default approach is that people will open up their IDE and start to review the code. They will start to review the diff. Right?
[32:29]
B
Yeah.
[32:30]
A
But the thing is that AI can write so much code, so if you review every single line of code, you become the bottleneck. So what I do here is I don't even review the code. I don't review this first-pass code from the agents. I use something I call no mistakes. So no mistakes is another tool I built just to help make this part of my life easier. What it does, I'll show you. I actually made an alias. Every time I get some code changes done by the agent, I just run nm and it will go through a few steps.
First it will ask the agent to create a branch for me, so I don't even need to think about the branch name. Otherwise I need to think about the branch name, the commit message, all those things. It's just wasting time, so I get the agent to do that. The agent basically did that: fix kit chat workspace. That's right. The agent here, no mistakes, is now reading the session where we did the work to understand my intent. So now it's understood what I was trying to do. It will do all these steps for me. It will rebase my change on top of the latest main branch on the remote so there's not going to be a merge conflict later on.
It's going to review my change. This is where I actually did a lot of prompt engineering to get the agents to really scrutinize the change very hard. Any edge cases or bugs, logical errors, things like that will get caught. This is a very high recall phase. When I initially built no mistakes, I did a lot of parallel testing where I let the agents review the change and I also reviewed the change myself, to see how often I caught something the agents didn't. I used that phase to iterate on the prompts and the workflow within this phase. Eventually I got to a point where I found myself never catching anything the agents don't catch. In this case, the agents actually didn't find any material problems, so it just passed.
But if it finds some problems, it will sort them into two categories. One is obvious bugs. So if it's just an obvious error, it will just auto-fix it by itself. It won't even bother me. The other category is when it realizes there's an error, but fixing the error will have some product implications. Then it will ask me instead of just auto-fixing that. In those cases, it will escalate to me: it will basically pause at this phase and ask me to judge whether I actually want to make that fix or want something else.
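The two-category triage described here might look something like the following. This is a hypothetical sketch, not the actual no mistakes implementation: the `Finding` type, its `product_impact` flag, and the sample findings are all invented for illustration, and in practice the classification would come from the reviewing agent rather than a hand-set boolean.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_FIX = "auto_fix"   # obvious bug: fix without asking
    ESCALATE = "escalate"   # fix has product implications: pause and ask the human


@dataclass
class Finding:
    description: str
    product_impact: bool  # hypothetical flag: would the fix change product behavior?


def triage(finding: Finding) -> Action:
    """Map a review finding to one of the two categories."""
    return Action.ESCALATE if finding.product_impact else Action.AUTO_FIX


def run_review(findings: list[Finding]) -> tuple[list[str], list[str]]:
    """Split findings into auto-fixed items and items waiting on a human."""
    fixed, needs_human = [], []
    for f in findings:
        target = needs_human if triage(f) is Action.ESCALATE else fixed
        target.append(f.description)
    return fixed, needs_human


# Invented sample findings, for illustration only.
findings = [
    Finding("off-by-one in pagination loop", product_impact=False),
    Finding("empty state hides the export button", product_impact=True),
]
fixed, needs_human = run_review(findings)
print(fixed, needs_human)
```

The pipeline only pauses when `needs_human` is non-empty, which is what lets the author switch to another session in the meantime.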
[35:34]
B
This is like the PR review, basically the agent doing PR review, right?
[35:37]
A
Yeah, PR review between the agent and the author.
[35:41]
B
And no mistakes is like a whole new context window, right? It's like a new agent looking at your other conversation.
[35:48]
A
Yes, this is a fresh context window, and I actually did that deliberately. I think that's an important thing to do, which is to use a fresh context window to review the change that was done. Because what a lot of people do is just ask, hey, can you review the change, in the same session. When you do that, the agent is very heavily biased by what was already done, because it saw all the context, it saw every step along the way. So it's biased into believing that what was done was correct. And because of that, it will sometimes miss something. I tested this a lot, and when you use a fresh context window, you just get a lot more edge cases caught.
[36:35]
B
I guess the only problem is, doesn't the no mistakes agent have to look at your whole code base again to even understand what this app is about?
[36:42]
A
That's what this intent phase was doing. So it basically analyzed your session to understand what was your original intent and some of the surrounding context as well. But it's not copying the entire session into this new context window.
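As a rough sketch of that idea, the reviewer's fresh context can be built from a short intent brief plus the diff, leaving the rest of the session behind. Everything here is assumed for illustration: `summarize_intent` is a hypothetical stub (a real intent phase would presumably use a model to summarize), and the sample session and diff are made up.

```python
def summarize_intent(session: list[str], max_points: int = 3) -> list[str]:
    """Hypothetical stand-in for the intent phase.

    A real version would ask a model to summarize; this stub simply keeps
    the first few user requests and drops all agent output.
    """
    requests = [m.removeprefix("user: ") for m in session if m.startswith("user: ")]
    return requests[:max_points]


def build_review_context(session: list[str], diff: str) -> str:
    """Assemble what a fresh reviewer sees: a short brief plus the diff."""
    brief = [f"- {point}" for point in summarize_intent(session)]
    return "\n".join(["Intent:", *brief, "", "Diff:", diff])


# Made-up session and diff, for illustration only.
session = [
    "user: build option C for the chat workspace",
    "agent: explored 40 files, tried two layouts, reverted one",
    "user: keep the sidebar collapsible",
]
context = build_review_context(session, "+ collapsible = True")
print(context)
```

The reviewer gets the goal and the change, but none of the step-by-step history that could bias it toward assuming the work is already correct.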
[36:58]
B
It's like some senior engineer builds some feature, and then you're asking the principal engineer to come in with fresh eyes to look through everything, right?
[37:06]
A
Yeah, with fresh eyes. But usually you will ask the senior engineer to explain a little bit of context to the principal, right?
[37:13]
B
That's right. Yeah.
[37:14]
A
Yeah. So this intent phase is basically that. It's basically like explaining the basic context of what this change is trying to do.
[37:21]
B
Okay. And why don't we walk through the rest of the phases too? Like, documenting is what? Just writing down what it's observing?
[37:28]
A
Yeah. So what each phase does: review is just reviewing the code, and test is running tests. And the test phase is very different from what the agent does by default. What the agent does by default is run some tests and validate locally whether the change itself is working. But this test phase is a little bit different. It's more like CI. It's validating: did this regress other things as well, et cetera. This test phase will actually present some evidence of the change actually working. It will paste screenshots or sometimes a video to capture that this thing is actually working, so it's easier for me to review. I can just look at the artifact and see, okay, it's actually working.
[38:21]
B
Oh, that's actually really interesting, because sometimes when I ship stuff with Codex, the stuff I'm shipping works, but then it breaks something else. It breaks another core workflow in the app. So this test phase will actually look through all that and try...
[38:35]
A
Yeah, it'll look through all that and just present very easily digestible artifacts for me to have confidence it's actually working as I expected.
[38:44]
B
This is a bit of a dumb question, but, for example, I'm trying to build a fitness app, right? And there are a few core workflows that I want to make sure it tests each time: creating a workout, tracking your workouts, you know. So do you have to manually define this stuff, or is the AI smart enough to figure out how to test the stuff each time you make a change?
[39:05]
A
Yeah, yeah, yeah. I typically try to get the agent to turn those things into an automated end-to-end test. Okay, yeah. Because then it will be very easy to run that every single time. Right.
[39:17]
B
And the automated end-to-end test is basically, let's say it's a browser app, it just kind of acts like the user and clicks through stuff, right? And sees if anything breaks.
[39:25]
A
Yes, yes. So there are various kinds of end-to-end browser testing tools, like Playwright. But yeah, you can just ask the agents. You can say, hey, write an end-to-end test for this scenario or this user flow and make sure it's actually working end to end. It will typically be able to figure out what kind of frameworks or tools need to be used.
[39:47]
B
I think the trade-off here, dude, is that it just takes a lot longer to actually ship a feature, right? Because you're running all these stages. But I guess you have way more confidence that the feature you ship actually doesn't break anything. A lot of stuff I work on doesn't have any users, it's just me. But if you have a lot of users when you ship the product, you want to make sure it actually works. Right. It's like Software Engineering 101.
[40:13]
A
Yeah. So I would argue, even if it's only for yourself, you can probably make the trade-off, right? How much you prefer just making changes very fast versus making sure things actually work. Because sometimes there's a little bit of cost to you as well if things break. This phase taking a longer time is actually okay, because I never look at this. I never just stare at this screen and wait for every phase to pass. Every time I launch no mistakes, I just immediately switch to another session. I don't even look at this. What I have here, I will show you. Now I switch to another session, and I can just look at the terminal screen here to see what phase that no mistakes pipeline is at. I can see it's working on the linting phase. If it's waiting for me to make a judgment or something, it will change the status here. So I can very easily see whether I need to jump back into that session.
[41:18]
B
Do you run no mistakes after almost every change? Because if you do that, then why not just run it automatically?
[41:26]
A
Yeah. So I run that on most changes, but not every single one, because there are changes where, for example, I make a very simple documentation update, and I know it doesn't need so much validation. It's going to use a lot of my tokens as well. So I make some judgments on whether the change justifies this kind of heavy validation phase. It's kind of like when you work within a team: it's not that every PR will go to a QA team, right? Only some milestones, some meaningful things, will go there.
[41:59]
B
Dude, do you think it feels weird after spending your career in big tech? Because in big tech, when you push a change, you have a teammate come and review your PR, right? And then you run some tests. And now you're just by yourself. I guess you have all these agents, but how do you feel? Do you feel unshackled, or do you feel like you kind of miss the teammates?
[42:23]
A
It's a bit of both. But I would say, largely speaking, I feel liberated. I think teammates are great, especially in the brainstorming phase. When we are thinking about an idea, if it's just me, it's not a very diverse perspective, right? So I may not think through everything, and I may not realize problems others can see. AI can help to a degree, but I don't think AI is quite there yet to replace a really smart team that can ideate together. So that is the one part I miss. The part I don't quite miss is that everyone is busy, and if I write like 20 PRs every day, no one's gonna review that. That already happened before I left my last company. And what I found myself doing was, I had to write less PRs.
[43:23]
B
Got it.
[43:25]
A
And spend my time elsewhere because the bottleneck is really on the rest of the team.
[43:29]
B
Yeah, because your teammates aren't actually reviewing the PRs. They have a lot of things going on. But if you submit a PR to AI, it's always going to start working.
[43:37]
A
Yeah. So this is something that I think is going to fundamentally change as we progress on AI adoption. Our workflows and how our teams work were built at a time when we spent most of our time coding. And the average stat for a software team is that an engineer will write 10 to 15 PRs every month. That's the velocity of an average software engineering team. When that's the case, it's okay for everyone else to do code reviews and all these processes, because the velocity is not that massive. But when you start to write 10 times more PRs, we are not ready for that. Our processes and our human team composition and everything are not built with that assumption in mind. So what's going to happen is things are starting to break. A lot of teams are starting to change their practices in order to fight that. Some teams, especially smaller teams in startups, have basically stopped doing PR reviews. They still raise a PR, but mostly as a formality or to leave a record. They don't actually wait for another peer to review. They sometimes just merge the PR, and later on if there's a problem they can go back to it. Those are the kinds of changes I'm starting to see.
[45:05]
B
Yeah, they get the agents to review. Right. I do think it does lead to slightly more unstable products.
[45:12]
A
Yeah. But you know, that's because they are not using no mistakes.
[45:18]
B
Yeah. It looks like it's done.
[45:19]
A
Yeah. This pipeline just completed. So it went through all these steps, and there was actually one thing fixed in the documentation phase. We can look at whether it's actually a legit change. But this is something I find super useful, and something both my agents and I often don't do automatically: when you make a change, can you actually find all the places in our documentation that can be affected by that change?
[45:49]
B
Okay, got it.
[45:50]
A
Yeah. So documentation, linting, push, and create a PR. So I can just open up the PR and let's look at what it does. It created this PR. The PR summarized the intent that was understood from my original session in OpenCode. It summarized what changed. It did a risk assessment as well, which is very useful. When I look at a low-risk change, I spend less time. When the agent is flagging that this is a medium-risk or high-risk change, I spend more time on the PR. So I can decide where I spend my time more intelligently. And testing: it did some tests and had evidence. So let's see what this is. It renders the workspace. Okay. Basically, this evidence is about presenting actual results from the change, so we can look at this and see, is that what we want? Okay. And there is the pipeline and there is the documentation phase. What did it find? It found the design system example copy was not updated. Okay. So it actually caught an inconsistency. That's great. Because it's a low-risk change, I don't even go into the diff here. I just merge it. And when it's a medium-risk or high-risk change, then I go into the diff and start to look at things myself.
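The risk-based review habit described here can be written down as a simple rule. This is only an illustrative sketch: the step names and the `review_steps` function are invented, and the risk labels are assumed to be the "low", "medium", and "high" values mentioned in the conversation.

```python
def review_steps(risk: str) -> list[str]:
    """Return the human review steps for a PR with the given risk label.

    Hypothetical helper: every PR gets the summary-level checks, and only
    medium- or high-risk PRs add a read through the diff before merging.
    """
    if risk not in ("low", "medium", "high"):
        raise ValueError(f"unknown risk level: {risk!r}")
    steps = ["read intent summary", "read risk assessment", "check test evidence"]
    if risk in ("medium", "high"):
        steps.append("read the diff")
    steps.append("merge if satisfied")
    return steps


print(review_steps("low"))
print(review_steps("high"))
```

The summary-level steps never go away; the risk label only decides whether the diff itself gets human attention.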
[47:24]
B
Okay, but you pretty much always at least skim the PR and then you hit the button to merge, right?
[47:31]
A
Yeah, yeah. I still look at this PR because I think looking through the risk assessment and what the agent actually did, what was the fix, those things are actually still useful.
[47:40]
B
That's the mistake that I'm making, dude. I don't look at the PR. Sometimes I just tell it to merge. So I just need to be a little bit more thorough.
[47:48]
A
Yeah. So after the agent made some code changes, you just get it merged?
[47:53]
B
Well, I ask it to run some tests and stuff. I don't use no mistakes. And then I get it to merge. And then, yeah, like a day later I'll find something else broke. So it's probably not the most efficient way to do it.
[48:07]
A
Yeah, yeah. So yeah, I think some validation and then some like review. But it's not a line by line code review. I think some review on what's changed and what kind of risks exist that's still useful.
[48:22]
B
And you're probably submitting 10, 15 PRs a day, right? Or like doing this.
[48:27]
A
Yeah, so I actually do a lot. So I. 26, 14, 27, 30. Yeah, that's like the average. So most of the time it's like 20 to 40 kind of PRs every day. Sometimes I do more.
[48:47]
B
I can tell when you became unemployed.
[48:52]
A
Very clear on this chart.
[48:53]
B
All right. I guess we just walked through the whole plan build and validation process, right? That's basically it, right?
[48:59]
A
Yeah, we went through building a plan interactively, implementing that with the agents, and then going through this validation pipeline. If you think about it, I basically didn't spend much time in the coding and validation phases at all. Most of my time was actually on the HTML artifact, iterating with the agents. So that's kind of how I do these things now. And as soon as I send the agents off to do implementation, I just switch to something else and work on that in parallel.
[49:30]
B
Okay. So I guess we can provide the links to Lavish, the HTML planner, and also no mistakes, the validation tool, in the description of this episode. Let me ask you one last question.
[49:43]
A
Yeah.
[49:43]
B
I mean, you're like an L8 engineer. You've been doing this for a while. There are a lot more builders now, right? There are a lot more people trying to get into this stuff and learning how to build with AI. Do you have any advice for people to actually ramp up their technical skills? And also, what kind of technical skills do they actually need to learn? Obviously, testing and validating everything. But there's also stuff like, for example, if you don't set up your database properly in the beginning, it's hard to change it later. Just stuff that you learn over time. So do you have any thoughts on how people can actually scale up as they build more stuff?
[50:18]
A
Yeah, good question. I think there are a few things that come to mind. One is just play a lot, build a lot of things. Even if it's a throwaway toy, build it. And through that process, you will often discover things you can do better or things the agents didn't quite do very well, and start to reflect on that. So do a lot of things. I think that's probably the first step. What I see, at least from some people, is that they spend a lot of time trying to decide what to do. And then they only do one thing, and when that thing doesn't work, they stop. I think the mindset I would encourage is to just build every single idea you have. Whenever you have some idea, send a prompt to the agents and see what it does. And whenever you have some inspiration or idea you think might be interesting, just give that to the agent and have it run for you. I think through that process, a lot of learnings can be derived. That's one. Another, I think, is to try to challenge yourself to use more tokens and to run more agents in parallel. I think that is a forcing function for people to upgrade their workflow. Because when we by default work with one agent at a time, we are still kind of being a bottleneck. We are putting ourselves into the loop too much. And I think to really scale up how much we can get from the agents, we have to move ourselves out of the loop as much as possible. So using more tokens and running more agents in parallel kind of forces us to do that. That's probably another thing I can think of.
[52:14]
B
Got it.
[52:15]
A
Yeah. Maybe the last thing is to try to adopt AI in every part of your workflow, not only writing code. So what we could see there, like AI did a lot of validation and documentation, all those things for me. Right. And raising the PR and everything. I don't need to do anything there. I think when we work through A project. Whenever we find something manual, what we talked about earlier, like something we are spending time ourselves, just try to think about, can we delegate that to the agent as well? And through that, people, I think will find a lot more useful workflows that can handle automation and reduce our workload.
[52:54]
B
Yeah. Maybe there's some sort of a skill or something we can build where. Because the AI remembers its conversations with you, so maybe the AI can actually proactively suggest, like, hey, you should automate this. It's like the second time we're talking about this.
[53:07]
A
Yeah. That exists. That exists. So I can show you. If I run Claude Code, Claude Code has this command called insights. It will basically analyze your Claude Code sessions and generate a report on what can be done better: what kind of skills you can add, what kind of things you can tweak in your memory files, et cetera, to make Claude Code work more efficiently for you. So this is super cool, but it's going to use a lot of tokens. I'm already out of tokens, so I'm not going to demo that now. But this is something I definitely recommend people try. This is a very cool thing.
[53:51]
B
Okay. Yeah, I'm going to run it right now. I think the token maxing thing is kind of like a meme, but just to summarize your advice: number one is put in the reps, try different things, try to build different things. Number two is if you use multiple agents, you can put in more reps, right? Because you don't have to wait for one agent to do anything. And then the third one is... sorry, what was the third one again?
[54:13]
A
Adopt AI in every part of your workflow. Yeah. Not only writing code.
[54:16]
B
Yeah. I think the second one is especially hard, dude. Because I don't know, like, growing up as an Asian person, I have like a scarcity mindset. I try to save money and stuff and like just trying to burn our tokens, it doesn't feel. It doesn't feel right.
[54:31]
A
So most of us like working as individuals, we have the subscription, right? Yeah. So at least try to make the most out of the subscription. Exhaust the quota.
[54:42]
B
Okay. Yeah. So I guess it's kind of like going to a buffet and trying to eat as many crab legs as I can.
[54:49]
A
Yeah. But I would say there's the token maxing thing. I think we shouldn't just use tokens for the sake of using tokens. Right. We want to get actual work done. So I think it's more about pushing ourselves. Like my point about number two, was more about pushing ourselves to figure out ways to scale up and really get more done with agents instead of binding ourselves into the loop and only do one thing at a time.
[55:19]
B
That makes a lot of sense. That makes a lot of sense. All right, Kun. Well, thanks so much, man. Where can people find your, like, all the free stuff you've been shipping and also yourself?
[55:27]
A
Yeah, yeah. So I'm very active on X and YouTube. I plan to share a lot of my workflows and tools and setups over there. And I also. My GitHub is also a good place to look at my projects.
[55:42]
B
Your GitHub is just /kunchen, right?
[55:44]
A
kunchenguid. So I have this. Let me move my window here.
[55:48]
B
Oh, there it is.
[55:49]
A
Yeah, yeah, this is my handle almost everywhere. So YouTube, X, GitHub, and LinkedIn are all this handle: kunchenguid.
[55:59]
B
Yeah, I think it's like a blessing to all of us that you're shipping all this stuff for free and we can all try it. So, yeah, I'm definitely going to try no mistakes and everything else that you built.
[56:09]
A
Cool, Cool. Thanks, Peter. Yeah, if you run into anything, let me know. I'm constantly trying to improve these tools as well.
[56:15]
B
Cool. All right, Kun. Take care, man.
[56:17]
A
Thanks, Peter.