Loading summary
Host
Hello AI engineers. A few weeks ago, engineering legend and former guest Steve Yegi from sourcegraph wrote an enthusiastic review. I've been using Claud Code for a couple of days and it has been absolutely ruthless in chewing through legacy bugs in my gnarly old code base. It's like a wood chipper fueled by dollars, it can power through shockingly impressive tasks using nothing but chat. It seems the majority of high taste testers agree. Since then, the Claud Code team has been on an absolute tier, delivering weekly updates, shipping best practices for agentic coding and dedicated Claud code docs. As GitHub's copilot turns four years old, we now see four major battlegrounds for coding agents. One AIIDs like Windsurf and Cursor, now worth over $12 billion. 2. Vibe coding platforms like Bolt, newcomer Lovable and V0. Three autonomous outer loop agents like Cognition's, Devon, Cozene's Genie and upcoming guest factory AI's Droids. We've covered all three categories of coding agents and today we're taking a look at the newest one, the CLI based agents like ADA, OpenAI, Codex and Claud Code. We're excited to share that the Claud Code team will be presenting at the upcoming AI Engineer World's Fair in San Francisco, which now has Early Bird tickets on sale on June 3rd. Spend the day learning in Hands on workshops on June 4th. Take in tracks across MCP Tiny Teams, Vibe Coding, LLM Recommendation Systems, Graph Rag Agent Reliability Infrastructure and AI Product Management and Voice AI on June 5th. Eight more tracks for Reasoning and RL SWE agents, Evils Retrieval and Search Security, Generative Media Design, Engineering, Robotics and autonomy for CTOs and VPs of AI. There are now two leadership tracks, AI in Fortune 500 and AI Architects named after our very well received podcast with Brett Taylor of Sierra and OpenAI Claude Code will be presenting on the SUI agents track on June 5th. Join us at AI Engineer. Watch out and take care.
Boris Turney
Hey everyone. Welcome to the Litten Space Podcast. This is Celestio partner and CTO at Decibel and I'm joined by my co host Zwicks, founder of Small AI.
Celestio (Co-host)
Hey and today we're in the studio with Kat Wu and Boris Turney. Welcome.
Developer from Anthropic (possibly Forrest or Sid)
Thanks for having us.
Kat Wu (PM at Anthropic)
Thank you.
Celestio (Co-host)
Kat, you and I know each other from before. I just realized Dagster as well and then Index Ventures and now Anthropic.
Kat Wu (PM at Anthropic)
Exactly.
Celestio (Co-host)
It's so cool to see a friend that you know from before like now working at Anthropic and like Shipping. Really cool Stuff. And Boris, you're a celebrity because, like, we were just having you outside, just getting coffee, and people recognized you from your video.
Developer from Anthropic (possibly Forrest or Sid)
Oh, wow, right. That's new.
Celestio (Co-host)
Wasn't that neat?
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I definitely. I had that experience, like, once or twice in the last few weeks. Yeah, that's surprising.
Celestio (Co-host)
Yeah. Well, thank you for making the time. We're here to talk about cloud code. Most people probably have heard of it. We think, like, you know, quite a few people have tried it. But let's get a crisp upfront definition. Like, what is Claude code?
Developer from Anthropic (possibly Forrest or Sid)
Yeah, so Claude code is Claude in the terminal. So, you know, Claude has a bunch of different interfaces. There is desktop, there's web, and yeah, Claude code, it runs in your terminal. Because it runs in the terminal, it has access to a bunch of stuff that you just don't get if you're running on the web or on desktop or whatever. So it can run bash commands, it can see all of the files in the current directory, and it does all that agentically. And yeah, I guess maybe it comes back to maybe the question under the question is, where did this idea come from? And yeah, part of it was, we just want to learn how Claude. We want to learn how people use agents. We are doing this with the CLI form factor because coding is kind of a natural place where people use agents today. And there's kind of product market fit for this thing. But, yeah, it's just sort of this crazy research project and obviously it's kind of bare bones and simple. But, yeah, it's like an agent in your terminal.
Celestio (Co-host)
That's how the best stuff starts.
Boris Turney
Yeah. How did it start? Did you have a master plan to build cloud code or.
Developer from Anthropic (possibly Forrest or Sid)
There's no master plan. When I joined Anthropic, I was experimenting with different ways to use the model kind of in different places. And the way I was doing that was through the public API, the same API that everyone else has access to. And one of the really weird experiments was this claw that runs in a terminal, and I was using it for kind of weird stuff. I was using it to look at what music I was listening to and react to that, and then screenshot my video player and explain what's happening there and things like this. And this was kind of a pretty quick thing to build, and it was pretty fun to play around with. And then at some point I gave it access to the terminal and the ability to code, and suddenly it just felt very useful, like I was using this thing every day. It kind of expanded from there. We gave the core Team access, and they all started using it every day, which was pretty surprising. And then we gave all the engineers and researchers that anthropic access, and pretty soon everyone was using it every day. And I remember we had this DAU chart for internal users, and I was just watching it, and it was vertical for days. And we're like, all right, there's something here. We got to give this to external people so everyone else can try this too. Yeah, yeah, that's where it came from.
Boris Turney
And were you also working with Boris already, or did this come out and then it started growing, and then you're like, okay, we need to maybe make this a team, so to speak?
Kat Wu (PM at Anthropic)
Yeah. The original team was Boris, Sid, and Ben. And over time, as more people were adopting the tool, we felt like, okay, we really have to invest in supporting it, because all our researchers are using it, and this is like, our one lever to make them really productive. And so at that point, I was using quadcode to build some visualizations. I was analyzing a bunch of data, and sometimes it's super useful to spin up a streamlit and see all the aggregate stats at once. And quad code made it really, really easy to do. So I think I sent Boris, like, a bunch of feedback, and at some point, Boris was like, do you want to just work on this? And so that's how it happened.
Developer from Anthropic (possibly Forrest or Sid)
It was actually a little like. It was more than that on my side. You were sending all this feedback, and at the same time, we were looking for a pm, and we were, like, looking at a few people, and then I remember telling the manager, like, hey, I want cat.
Boris Turney
I'm sure people are curious. What's the process within Anthropic to, like, graduate one of these projects? Like, so you have kind of like, the. A lot of growth, then you get a pm. When did you decide, okay, it's ready to be opened up?
Developer from Anthropic (possibly Forrest or Sid)
Generally at Anthropic, we have this product principle of do the simple thing first. And I think that the way we build product is really based on that principle. So you kind of staff things as little as you can and keep things as scrappy as you can, because the constraints are actually pretty helpful. And for this case, we wanted to see some signs of product market fit before we scaled it.
Celestio (Co-host)
Yeah, I imagine so. We're putting out the MCP episode this week, and I imagine MCP also now has a team around it in much the same way it is now very much officially sort of like an Anthropic product. So I'm kind of curious for Kat how do you view PMing something like this? I guess you're sort of grooming the roadmap. You're listening to users and the velocity is something I've never seen coming out of that topic.
Kat Wu (PM at Anthropic)
I think I PM with a pretty light touch. I think Boris and the team are extremely strong product thinkers. And for the vast majority of the features on our roadmap, it's actually just like people building the thing that they wish that the product had. So very little actually is tops down. I feel like I'm mainly there to clear the path if anything gets in the way and just make sure that we're all good to go from a legal, marketing, et cetera perspective. And then I think in terms of very broad roadmap or long term roadmap, I think the whole team comes together and just thinks about, okay, what do we think models will be really good at in three months? And let's just make sure that what we're building is really compatible with the future of what models are capable of.
Celestio (Co-host)
I'd be interested to double click on this. What will models be good at in three months? Because I think that's something that people always say to think about when building AI products, but nobody knows how to think about it because everyone's just like, it's generically getting better all the time. We're getting AGI soon, so don't bother. How do you calibrate three months of progress?
Kat Wu (PM at Anthropic)
I think if you look back historically, we tend to ship models every couple of months or so. So three months is just like an arbitrary number that I picked. I think the direction that we want our models to go in is being able to accomplish more and more complex tasks with as much autonomy as possible. And so this includes things like making sure the models are able to explore and find the right information that they need to accomplish a task. Making sure that models are thorough in accomplishing every aspect of a task. Making sure the models can compose different tools together effectively. Yeah, these are the directions we care about.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. I guess coming back to code, this kind of approach affected the way that we built code also because we know that if we want some product that has very broad product market fit today, we would build a Cursor or a Windsurf or something like this. These are awesome products that so many people use every day I use them. That's not the product that we want to build. We want to build something that's kind of much earlier on that curve and something that will maybe be a big product a year from now. Or however much time from now as the model improves. And that's why code runs in a terminal. It's a lot more bare bones. You have raw access to the model because we didn't spend time building all this kind of nice UI and scaffolding on top of it.
Boris Turney
When it comes to the harness, so to speak, and things you want to put around it. There's one that maybe prompt optimization. So obviously I use cursor every day. There's a lot going on in cursor that is beyond my prompt for like optimization and whatnot. But I know you recently released like, you know, compacting context features and all that. How do you decide how thick it needs to be on top of the cli? So that's kind of the share interface. And at what point are you deciding between okay, this should be a part of clock code versus this is just something for the IDE people to figure out. For example.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, there's kind of three layers at which we can build something. So being an AI company, the most natural way to build anything is to just build it into the model and have the model do the behavior. The next layer is probably scaffolding on top. So that's like CLAUDE code itself. And then the layer after that is using CLAUDE code as a tool in a broader workflow, so composed of n. So for example, a lot of people use code with tmux, for example, to manage a bunch of windows and a bunch of sessions happening in parallel. We don't need to build all of that in compact. It's this thing that has to live in the middle because it's something that we want to work. When you use code, you shouldn't have to pull in extra tools on top of it. Rewriting memory in this way isn't something the model can do today. So you have to use a tool for has to within that middle area. We tried a bunch of different options for compacting, like rewriting old tool calls and truncating old messages and not new messages. And in the end we actually just did the simplest thing, which is ask CLAUDE to summarize the previous messages and just return that and that's it. And it's funny, when the model is so good, the simple thing usually works. You don't have to over engineer it.
Celestio (Co-host)
We do that for Claude plays Pokemon too. Just kind of interesting to see that pattern reemerging.
Boris Turney
And then you have the CLAUDE MD file for the more user driven memories, so to speak. It's kind of like the equivalent of maybe cursor rules I would say yeah.
Developer from Anthropic (possibly Forrest or Sid)
And ClaudMD, it's another example of this idea of do the simple thing first. We had all these crazy ideas about memory architectures, and there's so much literature about this, there's so many different external products about this. And we wanted to be inspired by all this stuff. But in the end, the thing we did is ship the simplest thing, which is, you know, it's a file that has some stuff and it's auto read into context. And there's now a few versions of this file. You can put it in the root, or you can put it in child directories, or you can put in your home directory and we'll read all of these in kind of different ways. But, yeah, simplest thing that could work.
Boris Turney
I'm sure you're familiar with adir, which is another thing that people in our discord loved. And then when cloud code came out, the same people love cloud code. Any thoughts on, like, you know, inspiration that you took from it, things you did differently? Kind of like maybe the same principle in which you went a different way?
Developer from Anthropic (possibly Forrest or Sid)
Yeah, this is actually the moment I got AGI pilled is related to this. Okay, so maybe I can tell that story. So Clyde is like, you know, CLI Quad, and that's the predecessor to QuadCode. It's kind of this research tool that's, you know, it's like written using Python. It takes like a minute to start up. It's like very much written by researchers. It's not a polished product. And when I first joined Anthropic, I was putting up my first pull request, and I hand wrote this pull request because I didn't know any better. And my bootcamp buddy at the time, Adam Wolf, was like, you know, actually, maybe instead of handwriting it, just ask whyde to write it. And I was like, okay, I guess. So it's an AI lab. Maybe there's some capability I didn't know about. And so I started up this terminal tool and it took like a minute to start off, and I asked wide, hey, here's the description. Can you make a PR for me? And after a few minutes of chucking along and made a PR and it worked. And I was just blown away because I had no idea. I had just no clue that there were tools that could do this kind of thing. I thought that kind of single line autocomplete was the state of the art before I joined. And then that's the moment where I got AGI pilled. And yeah, that's where code came from.
Celestio (Co-host)
I think people are Interested in comparing, contrasting obviously, because to you, obviously this is the house tool. You work on it. People are interested in figuring out how to choose between tools. There's the cursors of the woe, there's the devins of the woe, there's aiders and there's quad code. And we can't try everything all at once. My question would be, where do you place it in the universe of options?
Developer from Anthropic (possibly Forrest or Sid)
Well, you can ask Quad to just try all these tools.
Celestio (Co-host)
I wonder what it would. No self favoring at all.
Developer from Anthropic (possibly Forrest or Sid)
Quad plays engineering. I don't know. We use all these tools in house too. We're big fans of all this stuff. Like Claude code is obviously it's a little different than some of these other tools in that it's a lot more raw. Like I said, there isn't this kind of big beautiful UI on top of it. It's raw access to the model. It's as raw as it gets. So if you want to use a power tool that lets you access the model directly and use CLAUDE for automating big workloads. For example, if you have 1000 lint violations and you want to start 1000 instances of Claude and have it fix each one and then make a pr, then cloud code is a pretty good tool. Got it. It's a tool for power workloads for power users. And I think that's kind of where it fits.
Boris Turney
It's the idea of parallel versus kind of like single path. One way to think about it, where the IDE is really focused on what you want to do versus clock code. You kind of more see it as less supervision required. You can kind of spin up a lot of them instead. The red mental model.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
And there's some people at Anthropic that have been racking up like thousands of dollars a day with this kind of automation. Most people don't do anything like that, but you totally could do something like that. Yeah, we think of it as like a UNIX utility. Right. So it's like the same way that you would compose, you know, grep or cat or oh, cat gat or something like this. The same way you can compose code into workflows.
Boris Turney
The cost thing is interesting. Do people pay internally or do you get free? If you work at Andraba, you can just run this thing as much as you want every day.
Developer from Anthropic (possibly Forrest or Sid)
It's for free internally.
Boris Turney
Nice. Yeah, I think if everybody had it for free, it would be huge because I mean if I think about I pay cursor 20 bucks a month, I use Millions and millions of tokens in cursor. That would cost me a lot more in club code. And so I think a lot of people that I've talked to, they don't actually understand how much it costs to do these things. And they'll do a task and they're like, oh, that costs 20 cents. I can't believe I paid that much. How do you think, going back to, like, the product side too, it's like, how much do you think of that being your responsibility to try and make it more efficient versus that's not really what we're trying to do with the tool.
Kat Wu (PM at Anthropic)
We really see quad code as like, the tool that gives you the smartest abilities out of the model. We do care about cost insofar as it's very correlated with latency. And we want to make sure that this tool is extremely snappy to use and extremely thorough in its work. We want to be very intentional about all the tokens that it produces. I think we can do more to communicate the cost with users. Currently we're seeing costs around $6 per day per active user, and so it does come out to a bit higher over the course of a month in cursor, but I don't think it's out of band. And that's roughly how we're thinking about it.
Developer from Anthropic (possibly Forrest or Sid)
I would add that I think the way I think about it is it's a ROI question, it's not a cost question. And so if you think about an average engineer's salary, and we were talking about this before the podcast, engineers are very expensive. And if you can make an engineer 50, 70% more productive, that's worth a lot. And I think that's the way to think about it.
Celestio (Co-host)
So if you're saying if you're targeting cloud to be the most powerful end of the spectrum, as opposed to the less powerful but faster, cheaper side of the spectrum, then there's typically people who recommend a waterfall, right? You try this faster, simple one that doesn't work. You upgrade, you upgrade, you upgrade, and finally you hit cloud code, at least for people who are token constrained, that don't work at Entopic. And part of me wants to just fast track all that. I just want to fan out to get everything all at once. And once I'm not satisfied with the next one solution, I'll just sort of switch to the next. I don't know if that's real.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, we're definitely trying to make it a little easier to make Claude code kind of the tool that you use for all the different workloads. For example, we launched Thinking recently. So for any kind of planning workload where you might have used other tools before, you can just ask Claude and that'll use chain of thought to think stuff out.
Celestio (Co-host)
I think we'll get there. Maybe we'll do it this way. How about we recap sort of the brief history of cloud code between when you launch and now? There have been quite a few ships. How would you highlight the major ones? And then we'll get to the thinking.
Developer from Anthropic (possibly Forrest or Sid)
Tool and I think I'd have to check your Twitter to remember everything.
Kat Wu (PM at Anthropic)
I think a big one that we've gotten a lot of requests for is web fetch. So we worked really closely with our legal team to make sure that we we shipped as secure of an implementation as possible. So we'll web fetch if a user directly provides a URL, whether that's in their call MD or in their message directly, or if a URL is mentioned in one of the previously fetched URLs. And so this way enterprises can feel pretty secure about letting their developers continue to use it. We shipped a bunch of like auto features like autocomplete, where you can press tab to complete a file name or file path auto compact so that users feel like they have like infinite context since we'll compact behind the scenes and we also shift auto accept because we noticed that a lot of users were like, hey, like Claude code can figure it out. I've like developed a lot of trust for Claude code. I wanted to just like autonomously edit my files, run tests and then come back to me later. So those are some of the big ones.
Celestio (Co-host)
Vim mode, custom slash commands.
Kat Wu (PM at Anthropic)
People love vim mode. So that was a top request too. That one went pretty viral.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
Memory. Those are recent ones, like the hashtag to remember.
Celestio (Co-host)
So yeah, I mean, I'd love to dive into on the technical side, any of them that were particularly challenging. Paul from Ador always says how much of it was coded by adoration. So then the question is how much of it was coded by cloud code? Obviously there's some percentage, but I wonder if you have a number like 50.
Developer from Anthropic (possibly Forrest or Sid)
80, probably near 80.
Kat Wu (PM at Anthropic)
Very high. A lot of human code review though.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, what a human code review. I think some of the stuff has to be handwritten and some of the code can be written by quad and there's sort of a wisdom in knowing which one to pick and what percent for each kind of task. So usually where we start is Claude writes the code and then if it's not Good, then maybe a human will dive in. There's also some stuff where I actually prefer to do it by hand. So it's like intricate data model refactoring or something. I won't leave it to Quad because I have really strong opinions and it's easier to just do it and experiment than it is to explain it to Quad. So yeah, I think that nets out to maybe like 80, 90% quad written code overall.
Boris Turney
Yeah, we're hearing a lot of that in our portfolio companies. More like series A companies as well. 80, 85% of the code they write is ad generated. Yeah, well, that's a whole different discussion. The custom slash command. I had a question. How do you think about custom/command mcps? How does this all tie together? Is the slash command and clock code kind of like an extension of the mcp? Are people building things that should not be MCP but are just kind of like self contained things in there? I should be able to think about it.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I mean, obviously we're big fans of mcp. You can use MCP to do a lot of different things. You can use it for custom tools and custom commands and all the stuff, but at the same time you shouldn't have to use it. So if you just want something really simple and local, you just want, you know, some essentially like prompt that's been saved. Just use local commands for that over time. Something that we've been thinking a lot about is how to re expose things in convenient ways. So for example, let's say you had this local command. Could you re expose that as an MCP prompt? Because clock code is an MCP client and an MCP server. Or similarly, let's say you pass in a custom, you know, like a custom bash tool. Is there a way to re expose that as an MCP tool? We think generally you shouldn't have to be tied to a particular technology. You should use whatever works for you.
Boris Turney
Yeah, because there's some like puppets here. I think that's like a great thing to use with clock code. Right. For testing there's like a puppetseer MCP protocol, but then people can also write their own slash commands. And I'm curious like where MCP are going to end up being where it's like maybe each slash command leverages mcps, but no command itself is an MCP because it ends up being customized. I think that's what people are still trying to figure out. It's like, should this be in the runtime or in the MCP server? I think people haven't quite figured out where the line is.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, for something like Puppeteer, I think that probably belongs in mcp because there's a few tool calls that go in that too. And so it's probably nice to encapsulate that in the MCP server.
Kat Wu (PM at Anthropic)
Whereas slash commands are actually just like prompts, so they're not actually tools. We're thinking about how to expose more customizability options so that people can bring their own tools or turn off some of the tools that CLAUDE code comes with. But there is also some trickiness there because we want to just make sure that the tools people bring are things that CLAUDE is able to understand and that people don't accidentally inhibit their experience by maybe bringing a tool that is confusing to claude. So we're just trying to work through the UX of it.
Developer from Anthropic (possibly Forrest or Sid)
I'll give an example also of how this stuff connects for CLAUDE code internally. In the GitHub repo, we have this GitHub action that runs and the GitHub action invokes Claude code with a local slash command. And the slash command is lint. So it just runs a linter using claude. And it's a bunch of things that are pretty tricky to do with a traditional linter that's based on static analysis. So, for example, it'll check for spelling mistakes, but also checks that code matches comments. It also checks that we use a particular library for network fetches instead of the built in library. There's a bunch of these specific things that we check that are pretty difficult to express just with lint. And in theory you can go in and write a bunch of lint rules for this. Some of it you could cover, some of it you probably couldn't. But honestly, it's much easier to just write a one bullet in markdown in a local command and just commit that. And so what we do is CLAUDE runs through the GitHub action, we invoke it with lint, which just invokes that local command. It'll run the linter, it'll identify any mistakes, it'll make the code changes, and then It'll use the GitHub MCP server in order to commit the changes back to the pr. And so you can kind of compose these tools together. And I think that's a lot of the way we think about code is just one tool in an ecosystem that composes nicely without being opinionated about any particular piece.
Celestio (Co-host)
It's interesting. I have a weird chapter in my CV that makes me. I was the CLI maintainer for Netlify and so I have a little bit of a dive. There's a decompilation of cloud code out there that has since been taken down. But it seems like you use Commander JS and React Inc. Is the public info about this. I'm just kind of curious. At some point you're not even building cloud code. You're kind of just building a general purpose CLI framework that any developer can hack to their purposes. You ever think about this? This level of configurability is more of like a CLI framework or some new form factor that doesn't exist before.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, it's definitely been fun to hack on a really awesome CLI because there's not that many of them. We're big fans of Ink.
Celestio (Co-host)
Vadim de Medic we actually used REACT ink for a lot of our projects.
Developer from Anthropic (possibly Forrest or Sid)
Oh cool. Yeah, Ink is amazing. It's sort of hacky and janky in a lot of ways. It's like you have React and then the renderer is just translating the REACT code to ANSI escape codes as the way to render. And there's all sorts of stuff that just doesn't work at all because ANSI escape codes are like. You know, it's like this thing that started to be written like the 1970s and there's no really great spec about it. Every terminal is a little different. So building in this way, it feels to me a little bit like building for the browser back in the day where you have to think about like Internet Explorer 6 versus Oprah versus like Firefox, whatever. Like you have to think about these cross terminal differences a lot. But yeah, big fans of ink because it helps abstract over that we use bun. So big fans of bun. That's been. It makes writing our tests and running tests much faster. We don't use it in the runtime yet.
Celestio (Co-host)
It's not just for speed. But you tell me. I don't want to put words in your mouth, but my impression is they help you ship the compilation. The executable.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, exactly. So we use BUN to compile the code together.
Celestio (Co-host)
Yeah.
Boris Turney
Anything.
Celestio (Co-host)
Any other pluses of bun? I just want to track BUN versus Deno conversations. Yeah, these Deno's in there.
Developer from Anthropic (possibly Forrest or Sid)
I actually haven't used Deno back. It's been a while.
Celestio (Co-host)
I remember a lot of people say.
Developer from Anthropic (possibly Forrest or Sid)
Ryan made it back in the day and it was like there are some ideas that I think were very cool in it. But yeah, it just never took off to that same degree. Still a lot of cool ideas like being able to NPM just import from Any URL, I think is. That's the dream.
Celestio (Co-host)
Dream of esm. Yeah. Very cool. Okay, I was going to ask you one other feature. Then we can get to the thinking tool of Auto Accept. I have this little thing I'm trying to develop, thinking around for trust in agents. When do you say, all right, go autonomous? When do you pull the developer in? And sometimes you let the model decide. Sometimes you're like, this is a destructive action. Always ask me, and I'm just curious if you have any internal heuristics around when to Auto Accept and where all this is going to.
Kat Wu (PM at Anthropic)
We're spending a lot of time building out the permission system. So Robert on our team is leading out this work. We think it's really important to give developers the control to say, hey, these are like the allowed permissions. Generally this includes stuff like the model's always allowed to read files or read anything, and then it's up to the user to say, hey, is it allowed to edit files, Is allowed to run tests? These are like probably the safest three actions. And then there's like a long list of other actions that users can either allow list or deny list based on regex matches with the action.
Boris Turney
How can writing a file ever be unsafe if you have version control? I think that's.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I think there's a few different, probably like, aspects of safety to think about, so it could be useful just to break that out a little bit. So for file editing, it's actually less, I think, about safety, although there is still a safety risk because what might happen is, let's say the model fetches a URL and then there's a prompt injection attack in the URL and then the model writes some malicious code to disk and you don't realize it. Although there is code review as a separate layer there as protection. But I think generally for file writes, the model might just do the wrong thing. That's the biggest thing. And what we find is that if the model is doing something wrong, it's better to identify that earlier and correct it earlier, and then you're going to have a better time. If you wait for the model to just go down this totally wrong path and then correct it 10 minutes later, you're going to have a bad time. So it's better to usually identify failures early. But at the same time, there's some cases where you just want to let the model go. So for example, if Claude code is writing tests for me, I'll just hit Shift tab, enter Auto Accept mode, and just let it run the tests and iterate on the tests until they pass because I know that's a pretty safe thing to do. And then for some other tools like Bash Tool, it's pretty different because Claude could run RM RF and that would suck. That's not a good thing. So we definitely want people to be in the loop to catch stuff like that. The model is trained and aligned to not do that. But these are non deterministic systems, so you still want a human in the loop. I think that generally the way that things are trending is kind of less time between human input.
Celestio (Co-host)
Did you see the meter paper?
Boris Turney
No.
Celestio (Co-host)
They establish a Moore's law for time between human input basically. And it's basically doubling every three to seven months is the idea. And Anthropic is currently doing super well on that benchmark. And it's roughly about autonomous for 50 minutes at the 50th percentile of human effort, which is kind of cool.
Boris Turney
Highly recommend that I put cursor in YOLO mode all the time and just run it.
Celestio (Co-host)
But it's vibe coding, right? Like this is fade.
Boris Turney
And there's a couple things that are interesting when you talked about alignment and the model being trained. So always put it on Docker container. And I have a prefix every command with like the Docker compose. And yesterday my Docker server was not started and I was like, oh, Docker is not running. Let me just run it outside of Docker. And I'm like, whoa, whoa, whoa, whoa, whoa. You should start Docker and run it in Docker. You cannot go outside. So that is like a very good example of like, you know, sometimes you think it's doing something and then doing something else. And for the review side it's. I would love to just chat about that more. I think the linter part that you mentioned, I think maybe people skipped it over. It doesn't register the first time. But like going from like rule based linting to like semantic linting, I think it's like great and super important and, and I think a lot of companies are trying to do how do you do autonomous PR review, which I've not seen one that I use so far. They're all kind of like mid. So I'm curious how you think about closing the loop or making that better and figuring out especially like what are you supposed to review? Because these PRs get pretty big when you buy code. You know, sometimes I'm like, oh wow, lgtm. You know, it's like, am I really supposed to read all of this? It kind of Seems most of it seems pretty standard but like I'm sure there are parts in there that the model would understand that are like kind of out of distribution, so to speak, to really look at. So yeah, I know it's a very open ended question, but any thoughts you have would be great.
Developer from Anthropic (possibly Forrest or Sid)
The way we're thinking about it is Claude code is, like I said before, it's a primitive. So if you want to use it to build a code review tool, you can do this. If you want to build like a security scanning vulnerability scanning tool, you can do that. If you want to build a semantic linter, you can do that. And hopefully with code it makes it. So if you want to do this, it's just a few lines of code and you can just have Claude write that code also because CLAUDE is really great at writing GitHub actions.
Kat Wu (PM at Anthropic)
Yeah, one thing to mention is we do have a non interactive mode which is like what Claude uses or how we use Claude in these situations to automate CLAUDE code. And also a lot of the companies using Claude code actually use this non interactive mode. So they'll for example, say, hey, I have hundreds of thousands of tests in my repo, some of them are out of date, some of them are flaky and they'll send Claude code to look at each of these tests and decide, okay, how can I update any of them? Should I deprecate some of them? How do I increase our code coverage? So that's been a really cool way that people are non interactively using Claude code.
Celestio (Co-host)
What are the best practices here? Because when it's non interactive it could run forever and you're not necessarily reviewing the output of everything.
Boris Turney
Right.
Celestio (Co-host)
So I'm just kind of curious, how is it different in non interactive mode? What are the most important hyperparameters or arguments to set?
Developer from Anthropic (possibly Forrest or Sid)
Yeah, and for folks that haven't used it. So non interactive mode is just Claude P and then you pass in the prompting quotes and that's all it is, it's just the P flag. Generally it's best for tasks that are read only. That's the place where it works really well and you don't super have to think about permissions and running forever and things like that. So for example, a linter that runs and doesn't fix any issues. Or for example, we're working on a thing where we use Claude with P to generate the changelog for Quad. So every PR is just looking over the commit history and being like, okay, this makes it into the changelog, this doesn't because we Know people have been requesting changelog so we're just getting Quad to build it. So generate non interactive mode. Really good for read only tasks. For tests where you want to write, the thing we usually recommend is pass in a very specific set of permissions on the command line. So what you can do is pass in allowed tools and then you can allow a specific tool. So for example not just bash but for example git status or git diff. So just give it a set of tools that it can use or edit.
Celestio (Co-host)
Tool it still has default tools are file read, grep system tools like bash, NLS and memory tools. Right, all those are.
Developer from Anthropic (possibly Forrest or Sid)
So it still has all these tools but allow tools just lets you instead of the permission prompt because you don't have that in the non interactive mode. It's just kind of pre accepting.
Kat Wu (PM at Anthropic)
Yeah, and we'd also definitely recommend that you start small. So like test it on one test, make sure that has reasonable behavior. Iterate on your prompt, then scale it up to 10. Make sure that it succeeds or if it fails, just analyze what the patterns of failures are and gradually scale up from there. So definitely don't kick off a run to fix like 100,000 tests.
Developer from Anthropic (possibly Forrest or Sid)
Yeah.
Celestio (Co-host)
So at this point this tagline is in my head that basically at Anthropic there's cloud code generating code and then cloud code also reviewing its own code at some point. Right. Different people are setting all this up. You don't really govern that, but it's happening. The point of the thing I was thinking about was we have VPs of Eng ctos listening. This is all well and good for the individual developer, but. But the people who are responsible for the tech, the entire code base, the engineering decisions, all this is going on. My developers, I manage 100 developers. Any of them could be doing any of this at this point. What do I do to manage this? How does my code review process change? How does my change management change? I don't know.
Kat Wu (PM at Anthropic)
We've talked to a lot of VPs and CTOs about it. They actually tend to be quite excited because they, they experiment with the tool, they download it, they ask it a few questions and like Claud code, when it gives them sensible answers, they're really excited because they're like, oh, I can understand this nuance in the code base and sometimes they even ship small features with Claude code. And I think through that process of interacting with the tool they build a lot of trust in it. And a lot of folks actually come to us and they ask us how can I roll it out more broadly. And then we'll often have sessions with VPs of dev prod and talk about these concerns around how do we make sure people are writing high quality code? I think in general, it's still very much up to the individual developer to hold themselves up to a very high standard for the quality of code that they merge. Even if we use CLAUDE code to write a lot of our code, it's still up to the individual who merges it to be responsible for this being well maintained, well documented code that has reasonable abstractions. And so I think that's something that will continue to happen where CLAUDE code isn't its own engineer. That's like committing code by itself. It's still very much up to the ICs to be responsible for the code that's produced.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I think Claud code also makes a lot of this stuff. A lot of quality work becomes a lot easier. So, for example, I have not manually written a unit test in many months.
Kat Wu (PM at Anthropic)
We have a lot of unit tests.
Developer from Anthropic (possibly Forrest or Sid)
We have a lot of unit tests. And it's because Quad writes all the tests. And before I felt like a jerk if on someone's pr, I'm like, hey, can you write a test? Because they kind of know they coverage.
Celestio (Co-host)
Is that still relevant?
Developer from Anthropic (possibly Forrest or Sid)
Yeah.
Boris Turney
Okay.
Developer from Anthropic (possibly Forrest or Sid)
And you know, they kind of know they should probably write a test and that's probably the right thing to do. And somewhere in their head they make that trade off where they just want to ship faster. And so you always kind of feel like a jerk for asking. But now I always ask because Quad can just write the test. There's no human work. You just ask Quad to do it and it writes it. And I think with writing tests becoming easier and with writing lint rules becoming easier, it's actually much easier to have high quality code than it was before.
Celestio (Co-host)
What are the metrics that you believe in? A lot of people actually don't believe in 100% code coverage because sometimes that is kind of optimizing for the wrong thing. Arguably, I don't know, but obviously you have a lot of experience in different code quality metrics. But what still makes sense?
Developer from Anthropic (possibly Forrest or Sid)
I think it's very engineering team dependent. Honestly, I wish there's a one size fits all answer.
Celestio (Co-host)
For me, the one solution.
Developer from Anthropic (possibly Forrest or Sid)
For some teams, test coverage is extremely important. For other teams, type coverage is very important, especially if you're working in a very strictly typed language and for example, avoiding NES and JavaScript and Python. I think sigmatic complexity kind of gets a lot of flack, but it's still honestly a pretty good metric, just because there isn't anything better in terms of ways to measure code quality.
Boris Turney
Okay.
Celestio (Co-host)
And then productivity, obviously not lines of code, but do you care about measuring productivity? I'm sure you do.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. You know, lines of code honestly isn't terrible. It has downsides.
Celestio (Co-host)
Yeah, it's.
Developer from Anthropic (possibly Forrest or Sid)
It's terrible. Line of code is terrible for a lot of reasons.
Celestio (Co-host)
Yes.
Developer from Anthropic (possibly Forrest or Sid)
But it's really hard to make anything better.
Celestio (Co-host)
So it's the least terrible.
Developer from Anthropic (possibly Forrest or Sid)
It's the least terrible. There's like, lines of code, maybe like number of PRs, how green your GitHub is. Yeah.
Kat Wu (PM at Anthropic)
The two that we're really trying to nail down are 1, decrease in cycle time. So how much faster are your features shipping because you're using these tools? So that might be something like the time between first commit and when your PR is merged. It's very tricky to get right. But one of the ones that we're targeting, the other one that we want to measure more rigorously, is the number of features that you wouldn't have otherwise built. We have a lot of channels where we get customer feedback. And one of the patterns that we've seen with cloud code is that sometimes customer support or customer success will post, hey, this app has this bug. And then sometimes 10 minutes later, one of the engineers on that team will be like, cloud code made a fix for it. And a lot of the situations when you ping them and you're like, hey, that was really cool. They were like, yeah. Without Claude code, I probably wouldn't have done that because it would have been too much of a divergence from what I was otherwise going to do. It would have just ended up in this long backlog. So this is the kind of stuff that we really want to measure more rigorously.
Developer from Anthropic (possibly Forrest or Sid)
That was the other AGI pilled moment for me. There was a really early version of Claude code many, many months ago. You and this one engineer at Anthropic Jeremy, built a bot that looked through a particular feedback channel on Slack, and he hooked it up to code to have code automatically put up PRs with just fixes to all this stuff. And some of the stuff, you know, it didn't fix every issue, but it fixed a lot of the issues.
Celestio (Co-host)
Is it like 10%, 50?
Developer from Anthropic (possibly Forrest or Sid)
You know, this was like, early on, so I don't remember the number, but it was surprisingly high, to the point where I became a believer in this kind of workflow and I wasn't for sopm.
Boris Turney
Isn't that scary too, in a way where you can Build too many things. It's almost like maybe you shouldn't build that many things. I think that's what I'm struggling with the most. It's like, it gives you the ability to create, create, create, but then at some point, you got to support, support, support.
Celestio (Co-host)
This is the Jurassic Park. Like, your scientists are so preoccupied with whether you could.
Boris Turney
Yeah, yeah, exactly. But now we should. Yeah. How do you make decisions? Like, now that the cost of actually implementing the thing is going down as a pm, how do you decide what is actually worth doing?
Kat Wu (PM at Anthropic)
Yeah, we definitely still hold a very high bar for features. Most of the fixes were like, hey, this functionality is broken, or this. Like, there's a weird edge case that we hadn't addressed yet. So it was very much like smoothing out the rough edges as opposed to building something completely net new for net new features. I think we hold a pretty high bar that it's very intuitive to use. The new user experience is, like, minimal. It's just, like, obvious that it works. We sometimes actually use COD code to prototype instead of using docs. Yeah. So you'll have prototypes that you can play around with. And that often gives us a faster feel for, hey, is this feature ready yet? Or is this the right abstraction? Is this the right interaction pattern? So it gets us faster to feeling really confident about a feature, but it doesn't circumvent the process of us making sure that the feature definitely fits in the product vision.
Developer from Anthropic (possibly Forrest or Sid)
It's interesting how, as it gets easier to build stuff, it changes the way that I write software where, like Kat's saying, like, before, I would write a big design doc and I would think about a problem for a long time before I would build it, sometimes for some set of problems. And now I'll just ask quadcode to prototype, like, three versions of it and I'll try the feature and see which one I like better. And then that informs me much better and much faster than a doc would have. And I think we haven't totally internalized that transition yet in the industry.
Boris Turney
Yeah, I feel the same. The same way for some tools I build internally. People ask me, could we do this? And I'm like, I'll just. Yeah, just build it. It's like, well, it feels pretty good. We should, like, polish it, you know, or sometimes it's like, no, that's not.
Celestio (Co-host)
It's comforting that your max cost is. I mean, you're even at Anthropic, where it's theoretically unlimited, the cost is roughly $6 a day. That gives people peace of Mind, because I'm like $6 a day. Fine, $100 a day. We have to talk.
Boris Turney
I paid 200 bucks a month to make Studio Ghibli photos. So it's all good. That is totally worth it.
Kat Wu (PM at Anthropic)
You mentioned internal tools and that's actually a really big use case that we're seeing emerge. Because a lot of times if you're working on something operationally intensive, if you can spin up a internal dashboard for it or an operational tool where you can, for example, grant access to emails at once, a lot of these things you don't really need to have like a super polished design. You kind of just need something that works. And quadcode's really good at those kinds of zero to one tasks. Like we use Streamlit internally and there's been like a proliferation of how much we're able to visualize. And because we're able to visualize it, we're able to see patterns that we wouldn't have otherwise if we were just looking at like raw data.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, like I was working on also this like side website last week and I just showed Claude code the mock. So I just took the, you know, the screenshot I had dragged and dropped it into the terminal and I was like, hey Claude, here's the mock. Can you implement it? And it implemented and it looked like, you know, it sort of worked. It was a little bit crummy. And I was like, all right, now look at it in Puppeteer and like iterate on it until it looks like the mock. And then it did that three or four times and then the thing looked like the mock. Yeah, this is just all manual work.
Celestio (Co-host)
Before I think we're going to ask about two other features of I guess the overall agent pieces that we mentioned. So I'm interested in memory as well. We talked about auto compact and memory using hashtags and stuff. My impression is that like you say, simplest approach works, but I'm curious if you've seen any other requests that are interesting to you or internal hacks of memory that people have explored that you might want to surface to others.
Developer from Anthropic (possibly Forrest or Sid)
There's a bunch of different approaches to memory. Most of them use external stores of various sorts of chroma.
Celestio (Co-host)
Yeah, exactly.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, there's a lot of projects like that. And yeah, it's either a kvalue or kind of like graphs force. That's like the two big shapes for these.
Celestio (Co-host)
Do you believer in knowledge graphs for this stuff?
Developer from Anthropic (possibly Forrest or Sid)
If you talked to me before I joined Anthropic and this team, I would have said yeah, Definitely. But now actually I feel everything is the model. That's the thing that wins in the end. And it just. As the model gets better, it subsumes everything else. So at some point the model will encode its own knowledge graph, it'll encode its own AV store if you just give it the right tools. But yeah, I think the specific tools, there's still a lot of room for experimentation. We don't know yet.
Celestio (Co-host)
In some ways are we just coping for lack of context length? Are we doing things for memory now that if we had like 100 million token context window, we don't care about.
Kat Wu (PM at Anthropic)
I would love to have 100 million token context for sure.
Celestio (Co-host)
Some people have claimed to, to have done it. We don't know if that's true or not.
Developer from Anthropic (possibly Forrest or Sid)
But I guess here's the question for you, Sean. If you took all the world's knowledge and you put it in your brain and let's say there was some treatment that you could get to make it so your brain can have any amount of context, you have infinite neurons, is that something that you would want to do or would you still want to record knowledge externally?
Celestio (Co-host)
Putting it in my head is different for me trying to use an agent tool to do it because I'm trying to control the agent and I'm trying to make myself unlimited. But I want to make the tools I use limited because then I know how to control them. And it's not even like a safety argument, it's just more like I want to know what you know. And if you don't know, don't know a thing, then sometimes that's good.
Developer from Anthropic (possibly Forrest or Sid)
Like the ability to audit what's in the 10.
Celestio (Co-host)
And I don't know if this is the small brain thinking, because this is not very bitter lesson, which is like actually sometimes you just want to control every part of what goes in there in the context. And the more you just, you know, Jesus, take the wheel, trust the model, then you have no idea what it's paying attention to.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I don't know. Did you see the mech interpretability stuff from Chris Ola and the team that.
Celestio (Co-host)
Was published like last week? Last week, yes. What about it?
Developer from Anthropic (possibly Forrest or Sid)
I wonder if something like this is the future. So there's an easier way to audit the model itself. And so if you want to see what is stored, you can just audit the model.
Celestio (Co-host)
Yeah, the main salient thing is that they know what features activate at per token and they can tune it up, suppress it, whatever. But I don't know if it goes down to the individual item of knowledge from context?
Developer from Anthropic (possibly Forrest or Sid)
Not yet, But I wonder, maybe that's the bitter Western version of it.
Celestio (Co-host)
Right. Any other comments from memory? Otherwise, we can move on to planning and thinking.
Kat Wu (PM at Anthropic)
We've been seeing people play around with memory in quite interesting ways, like having CLAUDE write a logbook of all the actions that it's done so that over time, CLAUDE develops this understanding of what your team does, what you do within your team, what your goals are, how you like to approach work. We would love to figure out what the most generalized version of this is so that we can share broadly, I think, with things like COD code. I think when we're developing things like CLAUDE code, it's actually less work to implement the feature and a lot of work to tune these features to make sure that they work well for general audiences across a broad range of use cases. So there's a lot of interesting stuff within memory, and we just want to make sure that it works well out of the box before we share it broadly.
Celestio (Co-host)
Agree with that. I think there's a lot more to be developed here.
Developer from Anthropic (possibly Forrest or Sid)
I guess a related problem to memory is how do you get stuff into context?
Celestio (Co-host)
Knowledge base. Like knowledge base. Yeah.
Developer from Anthropic (possibly Forrest or Sid)
And originally we tried very, very early versions of claude, actually used rag. So we indexed the code base and I think we were just using Voyage, so just off the shelf rag. And that worked pretty well. And we tried a few different versions of it. There was rag, and then we tried a few different kinds of search tools. And eventually we landed on just agentic search as the way to do stuff. And there were two big reasons, maybe three big reasons. So one is it outperformed everything by a lot. By a lot. And this was surprising in what benchmark? This is just Vibes. Internal vibes. There's some internal benchmarks also, but mostly Vibes.
Celestio (Co-host)
It just felt better in agentic rag, meaning you just let it look up in however many cycles it needs.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, Just using regular code searching. Glob Grep. Just regular code search.
Celestio (Co-host)
Regular code search, yeah.
Developer from Anthropic (possibly Forrest or Sid)
So that was like one, and then the second one was there was this whole indexing step that you have to do for rag. And there's a lot of complexity that comes with that because the code drifts out of sync. And then there's security issues because this index has to live somewhere, and then what if that provider gets hacked? And so it's just a lot of liability for a company to do that. Even for our code base, it's very sensitive, so we don't want to upload it to a Third party thing, it could be a first party thing, but then we still have this out of sync issue and agentic search just sidesteps all of that. So essentially, at the cost of latency and tokens, you now have really awesome search without security.
Boris Turney
Downsides with memory is like planning, right? There's kind of like memories, like what I like to do, and then planning is like now use those memories to come up with a plan to do these things. There was one.
Celestio (Co-host)
Or maybe put it as like memory, sort of the past, like what we, what we already did. And if plan is kind of what we will do. Yeah, it just crosses over at some point.
Boris Turney
Yeah. I think the maybe slightly confusing thing from the outside is what you define as thinking. So just like extensive thinking, there's the think tool and it's kind of like thinking as in planning, which is like thinking before execution. And then there's like thinking what you're doing, which is like the thing tool. Can you maybe just run people through? The difference is.
Celestio (Co-host)
I'm already confused listening to you.
Developer from Anthropic (possibly Forrest or Sid)
Well, it's one tool. So Quad can think if you ask it to think. Generally the usage pattern that works best is you ask Quad to do a little bit of research, like use some tools, pull some code into context and then ask it to think about it and then it can make a plan and you know, do a planning step before you execute. There's some tools that have explicit planning modes, like root code has this and Klein has this and other tools have it like you can shift between, you know, plan and act mode or maybe a few different modes. We've sort of thought about this approach, but I think our approach to product is similar to our approach to the model, which is bitter lesson. So just freeform, keep it really simple, keep it close to the metal. And so if you want Claude to think, just tell it to think. Be like, you know, make a plan, think hard, don't write any code yet and it should generally follow that and you can do that also as you go. So maybe there's a planning stage and then Claude writes some code or whatever and then you can ask it to think and plan a little bit more. You can do that anytime.
Boris Turney
Yeah. I was reading to the Think tool blog posts and I said while it sounds similar to extended thinking, it's a different concept. Extended thinking is what Claude does before it starts generating and then think it once it starts generating. How do you add a stop and think? Is this all done by the clock code harness so people don't really have to think about the difference between the two, basically, is the idea.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, you don't have to think about it.
Boris Turney
Okay, that is helpful. That is helpful because sometimes I'm like, man, am I not thinking right?
Developer from Anthropic (possibly Forrest or Sid)
Yeah. In this whole chain of thought actually in quad code, so we don't use the think tool anytime that quad code does thinking. It's all a chain of thought.
Celestio (Co-host)
I had an insight. This is again something we had a discussion we had before recording, which is in the cloudplace Pokemon hackathon. We had access to Morph's sort of branching environments feature, which meant that we could take any VM state, branch it, play it forward a little bit and use that in the planning. And then I realized the TLDR of yesterday was basically that it's too expensive to just always do that at every point in time. But if you give it as a tool to Claude and prompt it in certain cases to use that tool seems to make sense. Now I'm just kind of curious. Your takes on overall sandboxing, environment branching, rewindability maybe, which is something that you immediately brought up, which I didn't think about. Is that useful for Claude or. Claude has no opinions about it.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I could talk for hours about this. Claude probably can too, if you ask me.
Celestio (Co-host)
Let's get original tokens from you and then we can train cloud on that. By the way, that's like explicitly what this podcast is or just generating tokens for people.
Developer from Anthropic (possibly Forrest or Sid)
Is this the pre training or the post training?
Celestio (Co-host)
It's a pre trained data set. You got to get in there.
Developer from Anthropic (possibly Forrest or Sid)
Oh, man. Yeah. How do I buy, how do I get some tokens? Starting with sandboxing, ideally the thing that we want is to always run code in a docker container and then it has freedom and you can kind of snapshot, you know, with other kind of tools layer on top. You can snapshot, rewind, do all this stuff. Unfortunately, working with a docker container for everything is just like a lot of work and most people aren't going to do it. And so we want some way to simulate some of these things without having to go full container. There's some stuff you can do today. So for example, something I'll do sometimes is if I have a planning question or a research type question, I'll ask Quad to investigate a few paths in parallel. And you can do this today if you just ask it. So say I want to refactor X to do. Yeah. Can you research three separate ideas for how to do it, do it in parallel, use three agents to do it and so in the ui, when you say when you see a task that's actually like a sub Claude, it's a sub agent that does this. And usually when I do something hairy, I'll ask it to just investigate three times or five times or however many times in Pro. Well, and then Claude will kind of pick the best option and summarize that for you.
Boris Turney
But how does Claude pick the best option? Don't you want to choose? What's your handoff between you should pick versus I should be the final decider?
Developer from Anthropic (possibly Forrest or Sid)
I think it depends on the problem. You can also ask Claude to present the options to you.
Celestio (Co-host)
Probably exists at a different part of the stack than cloud code, specifically Claud code. As a cli, you can use it in any environment. So it's up to you to compose it together. Should we talk about how and when models fail? Because I think that was another hot topic for you. I'll just leave it open. Like, how do you, you observe cloud code failing?
Kat Wu (PM at Anthropic)
There's definitely a lot of room for improvement in the models, which I think is very exciting. Most of our research team actually uses quad code day to day, and so it's been a great way for them to be very hands on and experience the model failures, which makes it a lot easier for us to target these in model training and to actually provide better models not just for quad code, but for all of our coding customers. I think one of the things about the latest Sonnet 3.7 is it's a very persistent model. It's very, very motivated to accomplish the user's goal. But it sometimes takes the user's goal very literally and so doesn't always fulfill what the implied parts of the request are because it's just so narrowed in on I must get X done. And so we're trying to figure out, okay, how do we give it a bit more common sense so that it knows the line between trying very hard and like, no, the user definitely doesn't want that.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, like the classic example is like, hey, quad, get this test to pass. And then, you know, like five minutes later it's like, all right, well I hard coded everything. The test passes. I'm like, no, that's not what I wanted.
Celestio (Co-host)
Hard coded the answer.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, but that's the thing. Like, it only gets better from here. Like these use cases work sometimes today, not every time. And you know, the model sometimes tries too hard, but it only gets better.
Kat Wu (PM at Anthropic)
Yeah, like context, for example, is a big one, where a lot of times if you have a very long conversation and you compact a few times. Maybe some of your original intent isn't as strongly present as it was when you first started. And so maybe the model forgets some of what you originally told it to do. And so we're really excited about things like larger effective context windows so that you can have these gnarly really long hundreds of thousands of tokens long tasks and make sure that quad code is on track the whole way through. That would be a huge lift, I think, not just for Claude code, but for every coding company.
Celestio (Co-host)
Fun story from David Hershey's keynote yesterday. He actually misses the common sense of 3.5 because 3.7 being so persistent. 3.5 actually had some entertaining stories where apparently gave up on tasks and just 2.7 doesn't. And when Claude 3.5 gives up, it started writing formal requests to the developers of the game to fix the game. Here's some screenshots of it which is excellent. So if you're listening to this, you can find it on the YouTube because we'll post it. Very, very cool. One form of failing which I kind of wanted to capture was something that you mentioned while we're getting coffee, which is that clock code doesn't have that much between session memory or caching or whatever you call that, right? So it reforms the whole state for whole cough every single time. So it has to make the minimum assumptions on the changes that can happen in between. So how consistent can it stay? Right? I think that one of the failures is that it forgets what it was doing in the past unless you explicitly opt in via Cloud MD or whatever. Is that a something you worry about?
Kat Wu (PM at Anthropic)
It's definitely something we're working on. I think our best advice now for people who want to resume across sessions is to tell Claw to hey, write down the state of this session into this text doc. Probably not the Claw md, but in a different doc. And in your new session tell Claude to read from that doc. But we plan to build in more native ways to handle this specific workflow.
Developer from Anthropic (possibly Forrest or Sid)
There's a lot of different cases of this, right? Sometimes you don't want Claude to have the context and it's sort of like git. Sometimes I just want a fresh branch that doesn't have any history, but sometimes I've been working on a PR for a while and I need all that historical context. So we kind of want to support all these cases and it's tricky to do a one size fits all. But generally our approach to code is to make sure it works out of the box for people without extra configuration. So once we get there, we'll have something.
Boris Turney
Do you see a feature in which the commits play a bigger part of in a pull request? Like how do we get here? There's kind of like a lot of history and how the code has changed within the PR that informs the model. But today the models are mostly looking at the current state of the branch.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. So Claude, for some things it'll actually look at the whole history. So for example, if it's writing, if you tell Claude, hey, make a PR for me, it'll look at all the changes since your branch diverged from main and then take all of those into account when generating the pull request message.
Kat Wu (PM at Anthropic)
You might notice it running git diff as you're using it. I think it's pretty good about just tracking hey, what changes have happened on this branch or so far. And just make sure that it understands that before continuing on with the task.
Developer from Anthropic (possibly Forrest or Sid)
One thing other people have done is ask laud to commit after every change. You can just put that in the quad md. There's some of these power user workflows that I think are super interesting. Some people are asking quad to commit after every change so that they can rewind really easily. Other people are asking Claude to create a work tree every time so that they could have a few clauds running in parallel in the same repo. I think from our point of view we want to support all of this. Again, Claude code is a primitive and it doesn't matter what your workflow is, it should just fit in.
Boris Turney
I know that 3.5 haiku was the number four model on Ator when it came out. Do you see club code have a world in which you have like a commit hook that uses maybe haiku to do some the linter stuff and things like that continuously. And then you have 3.7 as the more.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, you could actually do this if you want. So you're saying like through like a pre commit hook or like a GitHub action or.
Boris Turney
Yeah, yeah, yeah. See, well, kind of like run clock code like the lint example that you had. I want to run it at each commit locally like before it goes to the pr.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. So you could do this today if you want. So in the, you know, if you're using like Husky or like whatever pre commit hook system you're using or just like git pre commit hooks, just add a line quad p and then you know whatever instruction you have and that'll run every time.
Boris Turney
Nice. And you just specify haiku. It's really no difference. Right. It's like, maybe it'll work a little worse, but like it still support it.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
You can override the model if you want. Generally we use Sonnet. We default to Sonnet for almost everything just because we find that it outperforms. But yeah, you can override the model if you want.
Boris Turney
Yeah. I don't have that much money to run commit hook on three.
Celestio (Co-host)
Just as a side on pre commit hooks. I have worked in places where they insisted on having pre commit hooks. I've worked at places where they insisted they'll never do pre commit hooks because they get in the way of committing and moving quickly. And I'm just kind of curious, do you have a stance or recommendation?
Boris Turney
Oh God.
Developer from Anthropic (possibly Forrest or Sid)
That's like asking about tabs versus spaces a little bit.
Celestio (Co-host)
But I think it is easier in some ways to if you have a breaking test, go fix the test with clock code. In other ways, it's more expensive to run this at every point. So there's trade offs.
Developer from Anthropic (possibly Forrest or Sid)
I think for me the biggest trade off is you want the pre commit hook to run pretty quickly so that if you're either if you're a human or if you're a quad, you don't have to wait like a minute for all the.
Celestio (Co-host)
So you want the fast version.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. So generally, you know, pre commit, you know, for our code base should run. Yeah, it's like less than, you know, five seconds or so. Like just types and lint maybe. And then more expensive stuff you can put in the GitHub Action or GitLab or whatever you're using.
Boris Turney
Agreed.
Celestio (Co-host)
I don't know, like, I like putting prescriptive recommendations out there so that people can take this and go, like, this guy said it, we should do it in our team. And like that's, that's a basis for decisions.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. Yeah.
Celestio (Co-host)
Cool. Any other technical stories to tell? I wanted to zoom out into more producty stuff, but you can get as technical as you want.
Developer from Anthropic (possibly Forrest or Sid)
I don't know. One anecdote that might be interesting is the night before the code launch, we were going through to burn down the last few issues and the team was up pretty late trying to do this. And one thing that was bugging me for a while is we had this markdown rendering that we were using and it was just, you know, it's like the markdown rendering in Quad today is beautiful and it's just like really nice rendering in the terminal and it has bold and, you know, headings and spacing and stuff very nicely. But we tried a Bunch of these off the shelf libraries to do it. And I think we tried like two or three or four different libraries and just nothing was quite perfect. Like sometimes the spacing was a little bit off between a paragraph and like a list, or sometimes the text wrapping wasn't quite correct, or sometimes the colors weren't perfect. So each one had all these issues. And all these markdown renderers are very popular and they have thousands of stores on GitHub and have been maintained for many years, but they're not really built for a terminal. And so the night before the Release at like 10:00pm, I'm like, all right, I'm going to do this. So I just asked Quad to write a markdown parser for me and it.
Celestio (Co-host)
Wrote it zero shot.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, it wasn't quite zero shot, but after maybe one or two prompts it got it. And yeah, that's the markdown parser that's in ENCODE today. And the reason that markdown looks so.
Celestio (Co-host)
Beautiful, that's a fun one. It's interesting what the new bar is, I guess, for implementing features. This exact example, where there's libraries out there that you normally reach for that you find some dissatisfaction with for literally whatever reason, you could just spin up an alternative and go off of that.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, I feel like AI has changed so much and literally in the last year. But a lot of these problems are like the example we had before. A feature you might not have built before or you might have used a library. Now you can just do it yourself. The cost of writing code is going down and productivity is going up. We just have not internalized what that really means yet. But yeah, I expect that a lot more people are going to start doing things like this, like writing your own libraries or just shipping every feature just to zoom out.
Boris Turney
You obviously do not have a separate plot code subscription. I'm curious what the roadmap is. This is just going to be a research preview for much longer. Are you going to turn it into an actual product? I know you were talking to a lot of CTOs and VPs. Is there going to be cloud code enterprise? What's the. What's the vision?
Kat Wu (PM at Anthropic)
Yeah, so we have a permanent team on cloud code. We're growing the team. We're really excited to support cloud code in the long run. And so, yeah, well, we plan to be around for a while. In terms of subscription itself, it's something that we've talked about. It depends a lot on whether or not most users would prefer that over. Pay as you go so far. Pay as you go has made it really easy for people to start experiencing the product because there's no upfront commitment. And it also makes a lot more sense with a more autonomous world in which people are scripting cloud code a lot more. But we also hear the concern around, hey, I want more price predictability if this is going to be my go to tool. So we're very much still in the stages of figuring that out. I think for enterprises, given that cloud code is very much like a productivity multiplier for ICs and most ICs can adopt it directly, we've been just like supporting enterprises as they have questions around security and productivity monitoring. And so yeah, we've found that a lot of folks see the announcement and they want to learn more. And so we've been just engaging in those.
Celestio (Co-host)
Do you have a credible number for the productivity improvement? Like for people who are not at Anthopic that you've talked to? Are we talking 30%? Some number would help justify things.
Developer from Anthropic (possibly Forrest or Sid)
We're working on getting this. We should. Yeah, it's something we're actively working on. But anecdotally for me it's probably 2x my productivity. So I'm just like, I'm an engineer that codes all day every day. For me it's probably 2x.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
I think there's some engineers at Anthropic where it's probably 10x their productivity and then there's some people that haven't really figured out how to use it yet and they just use it to generate commit messages or something. That's maybe like 10%. So I think there's probably a big range and I think we need to study more for reference.
Kat Wu (PM at Anthropic)
Sometimes we're in meetings together and sales or compliance or someone is like, hey, we really need X feature. And then Forrest will ask a few questions to understand the specs and then 10 minutes later he's like, all right, it's built, I'm going to merge it later. Anything else? So it definitely feels definitely far different than any other PM role I've had.
Boris Turney
Do you see yourself opening that channel of the non technical people talking to clock code and then the instance coming to you, which like they already define and talk to it and explain what they want and then you're doing on the review side and implementation?
Developer from Anthropic (possibly Forrest or Sid)
Yeah, we've actually done a fair bit of that. Like Megan, the designer on our team, she is not a coder but she's winning power us. She uses code to do it.
Celestio (Co-host)
She designs the UI.
Kat Wu (PM at Anthropic)
Yeah, she's landing PRs to our console product. So it's not even just like building on quad code, it's building across our product suite in our Monorepo.
Celestio (Co-host)
Right?
Boris Turney
Yeah, yeah.
Developer from Anthropic (possibly Forrest or Sid)
And similarly, our data scientist uses quad code to write bigquery queries. And there was like some finance person that went up to me the other day and I was like, hey, I've been using quad code. And I'm like, what? Like how did you even get it installed? You know how to use git? And they're like, yeah, yeah, I figured it out. And yeah, they're using it. They're like, so quad code you can pipe in because it's a UNIX utility. And so what they, they do is they take their data, put it in a CSV and then they take the, they cat the CSV, pipe it into code, and then they ask it code questions about the CSV and they've been, they've been using it for that.
Boris Turney
Yeah, that would be really useful to me because really what I do a lot of the times, like somebody gives me a feature request, I kind of like rewrite the prompt, I put it in agent mode and then I review the code. It would be great to have the PR wait for me. I'm kind of useless in the first step, taking the feature request and prompting the agent to write it. I'm not really doing anything. My work really starts after the first run is done.
Celestio (Co-host)
I was going to say, I can see it both ways. Okay, so maybe I'll simplify this to. In the workflow of non technical people in loop, should the technical person come in at the start or come in at the end? Right, or come in at the end end to start. Obviously that's the highest leverage thing because like sometimes you just need the technical person to ask the right question that the non technical person wouldn't know to ask. And that really affects the implementation.
Boris Turney
But isn't that the bitter lesson of the model? That the model will also be good at asking the follow up question? Like, you know, if you're like telling the model, hey, that's what you trust.
Celestio (Co-host)
The model to do the least. Right? Sorry, go ahead.
Boris Turney
Yeah, if you're like the model, hey, you are the person that needs to translate this non technical person request into the best prompt for cloud code to do a first implementation. I don't know how good the model would be today. I don't have an eval for that. But that seems like a promising direction for me. It's easier for me to review 10 PRs than it is for me to take 10 requests, then run the agent 10 times and then wait for all of those runs to be done and review.
Developer from Anthropic (possibly Forrest or Sid)
I think the reality is somewhere in between. We spend a lot of time shadowing users and watching people at kind of different levels of seniority and kind of technical depth use code. And one thing we find is that people that are really good at prompting models from whatever context, maybe they're not even technical, but they're just really good at prompting. They're really effective at using code. And if you're not very good at prompting, then code tends to go off the rails more and do the wrong thing. So I think in this stage of where models are at today, it's definitely worth taking the time to learn how to prompt models well. But I also agree that maybe in a month or two months or three months, you won't need this anymore. Because the bitter lesson always wins.
Boris Turney
Please do it.
Developer from Anthropic (possibly Forrest or Sid)
Please do it.
Boris Turney
Anthropic.
Celestio (Co-host)
I think there's a broad interest in people forking or customizing cloud code. So we have to ask why is it not open source?
Developer from Anthropic (possibly Forrest or Sid)
We are investigating.
Celestio (Co-host)
Ah, okay. So it's not yet.
Developer from Anthropic (possibly Forrest or Sid)
There's a lot of trade offs that go into it. On one side, our team is really small and we're really excited for open source contributions if it was open source. But it's a lot of work to kind of maintain everything and look at it. I maintain a lot of open source stuff and a lot of other people on the team do too. And it's just a lot of work. It's a full time job managing contributions and all this stuff.
Celestio (Co-host)
Yeah. I'll just point out that you can do source available and that solves a lot of individual use cases without going through the legal hurdles of full open source.
Developer from Anthropic (possibly Forrest or Sid)
Yeah, exactly. I mean, I would say there's nothing that secret in the source and Obviously it's all JavaScript so you can just.
Celestio (Co-host)
Decompile it compilations out there. Very interesting.
Developer from Anthropic (possibly Forrest or Sid)
Yeah. And generally our approach is all the secret sauce. It's all in the model. And this is the thinnest possible wrapper over the model. We literally could not build anything more minimal. This is the most minimal thing.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
So there's just not that much in it.
Celestio (Co-host)
If there was another architecture that you would be interested in that is not the simplest, what would you have picked as an alternative? We're just talking about agentic architectures here. Right. There's a loop here and it goes through and you sort of pull in the models and tools in a relatively intuitive way. If you were to rewrite it from scratch and choose the generationally harder path. Like, what would that look like?
Kat Wu (PM at Anthropic)
Well, Boris has rewritten this. Boris and the team have rewritten this, like, five times.
Celestio (Co-host)
Oh, that's a story.
Developer from Anthropic (possibly Forrest or Sid)
Yeah.
Kat Wu (PM at Anthropic)
But it's very much the simplest thing, I think, by design.
Celestio (Co-host)
Okay. So it's got simpler.
Developer from Anthropic (possibly Forrest or Sid)
It got simpler.
Celestio (Co-host)
It doesn't go more complex.
Developer from Anthropic (possibly Forrest or Sid)
We've rewritten it from scratch. Yeah. Probably every three weeks, four weeks or something. And just like all the. It's like a ship of Theseus. Right. Like, every piece keeps getting swapped out. And just because Claude is so good at writing its own code.
Celestio (Co-host)
Yeah. I mean, at the end of the thing, the thing that's breaking changes is the interface, the cloud, mcp, blah, blah, blah. All that has to kind of stay the same. Unless you really have a strong reason to change it.
Developer from Anthropic (possibly Forrest or Sid)
Yeah.
Kat Wu (PM at Anthropic)
I think most of the changes are to make things more simple, like to share interfaces across different components. Because ultimately we just want to make sure that the context that's given to the model is in, like, the purest form and that the harness doesn't intervene with the user's intent. And so very much a lot of that is just removing things that could get in the way or that could confuse the model.
Celestio (Co-host)
Yeah.
Developer from Anthropic (possibly Forrest or Sid)
On the UX side, something that's been pretty tricky, and the reason that we have a designer working on a Terminal app is it's actually really hard to design for a terminal. There's not a lot of literature on this. I've been doing product for a while. I kind of know how to build for apps and for web and for engineers in terms of tools that have devex. But Terminal is sort of new. There's a lot of these really old terminal UIs that use curses and things like this for very sophisticated UI systems, but they all feel really antiquated by the UI standards of today. And so it's taken a lot of work to figure out how exactly do you make the app feel like fresh and modern and intuitive in a terminal? And we've had to come up with a lot of that design language ourselves.
Celestio (Co-host)
Yeah, I mean, I'm sure you'll be developing over time. Cool. Closing question. Is it just more general? I think a lot of people are wondering. Anthropic has, I think it's easy to say the best brand for AI engineering, like developers and coding models. And now with the coding tool attached to it, it just has the whole product suite of model and tool and protocol. Right. And I don't think this is obvious. One year ago today, when Cloud 3 launched, it was just more like, this is general purpose models and all that. But cloudsonnet really took the scene as the sort of COD tool of choice and I think built Anthropic's brand and you guys are now extending. So why is Anthropic doing so well with developers? It seems like there's just no centralized. Every time I talk to Anthropic people, they're like, oh, yeah, we just had this idea and we pushed it and it did well. And I'm just like, there's no centralized strategy here or is there an overarching strategy?
Developer from Anthropic (possibly Forrest or Sid)
Sounds like a PM question to me.
Celestio (Co-host)
I don't know. I would say, like, Dario is not breathing down your necks going like, build the best devtools. He's just letting you do your thing.
Developer from Anthropic (possibly Forrest or Sid)
Everyone just wants to build awesome stuff.
Kat Wu (PM at Anthropic)
I feel like the model just wants to write code. Yeah, I think a lot of this trickles down from the model itself. Being very good at code generation. We're very much building off the backs of an incredible model. That's the only reason why cloud code is possible. I think there's a lot of answers to why the model itself is good at code, but I think one high level thing would be, so much of the world is run via software and there's immense demand for great software engineers. And it's also something that you can do almost entirely with just a laptop or just a dev box or some hardware. It just is an environment that's very suitable for LLMs. It's an area where we feel like you can unlock a lot of economic value by being very good at it. There's like a very direct ROI there. We do care a lot about other areas too, but I think this is just one in which the models tend to be quite good and the team's really excited to build products on top of it.
Boris Turney
And you're growing the team you mentioned. Who do you want to hire?
Kat Wu (PM at Anthropic)
Yeah, we are.
Boris Turney
Who's like a good fit for your team?
Developer from Anthropic (possibly Forrest or Sid)
We don't have a particular profile, so if you feel really passionate about coding and about the space, if you're interested in learning how models work and how terminals work and how all these technologies that are involved. Yeah, hit us up. Always happy to chat.
Boris Turney
Awesome. Well, thank you for coming on. This was fun.
Kat Wu (PM at Anthropic)
Thank you.
Developer from Anthropic (possibly Forrest or Sid)
Thanks for having us.
Kat Wu (PM at Anthropic)
This is fun.
Boris Turney
Sam.
Episode: Claude Code: Anthropic's CLI Agent
Date: May 7, 2025
Host: Celestio (CTO at Decibel), Boris Turney (Co-host, founder of Small AI)
Guests: Kat Wu (PM at Anthropic), Lead Developer from Anthropic (Forrest/Sid)
This episode dives into Claude Code, the command-line agent from Anthropic, exploring its origins, design philosophy, technical architecture, key features, and its role in shaping the future of agentic coding tools. The interview offers a candid behind-the-scenes look at Anthropic’s “do the simple thing first” ethos, how Claude Code is being rapidly adopted, and what it means for productivity, developer workflows, enterprise adoption, agent reliability, and the broader coding landscape.
[03:12]
Quote:
“Claude code is Claude in the terminal. Because it runs in the terminal, it has access to a bunch of stuff you just don't get if you’re running on the web… it does all that agentically.” – Developer from Anthropic [03:25]
[04:28–06:43]
[06:56–09:20]
Quote:
“I PM with a pretty light touch. Boris and the team are extremely strong product thinkers… Very little actually is tops down. I feel like I'm mainly there to clear the path.” – Kat Wu [07:40]
[13:56–15:29]
Quote:
“Claude code is a pretty good tool. It’s a tool for power workloads, for power users. That’s kind of where it fits.” – Developer from Anthropic [14:21]
[15:52–17:47]
Quote:
“If you can make an engineer 50, 70% more productive, that's worth a lot. I think that's the way to think about it.” – Developer from Anthropic [17:25]
[18:53–21:17]
Code Gen Reality:
[21:17–25:16]
Practical Example:
[25:16–27:41]
[27:41–30:32]
rm -rf) always require human approval.Quote:
“There’s some people at Anthropic that have been racking up like thousands of dollars a day with this kind of automation… we think of it as a UNIX utility.” – Developer from Anthropic [15:29]
[31:02–34:56]
claude -p <prompt>) excels at read-only tasks and large-scale bulk actions.Best Practice:
“Start small… Iterate on your prompt, then scale up to 10. …Don’t kick off a run to fix 100,000 tests at once.” – Kat Wu [35:06]
[36:11–40:38]
[41:12–43:44]
[44:56–50:25]
[50:25–54:58]
[55:31–58:35]
Quote:
“The model sometimes tries too hard, but it only gets better.” – Developer from Anthropic [56:41]
[60:16–62:49]
On Speed:
“You want the pre-commit hook to run pretty quickly… types and lint maybe; more expensive stuff you can put in the GitHub Action.” – Developer from Anthropic [62:27]
[65:09–66:53]
[66:40–67:24]
[69:24–71:04]
[71:06–73:40]
Quote:
“We’ve rewritten it from scratch… every three weeks, four weeks. …Because Claude is so good at writing its own code.” – Developer from Anthropic [72:51]
[73:41–75:23]
Quote:
“I feel like the model just wants to write code.” – Kat Wu [75:37]
[76:42–77:06]
For more: show notes at latent.space.