
A
We've really evolved from agents being prompting loops to agents being autonomous, self discovering and long running actors. We set them tasks overnight and then we wake up and backlog is resolved and bugs are squashed. All of that is 10,000 times easier because of all the agents that we have internally. My personal favorite is like a predictive model that based off various attributes of the customer and the product can predict whether this customer is going to return. And it's able to produce this really rich level of insight in just minutes. The limits of what we can achieve will really be based off of how much we can delegate at once, more so than like what our personal capacities are.
B
Hey everyone, my guest today is Jess, product lead at Anthropic for Claude Managed Agents. Really excited to get Jess to demo how to build an agent from scratch, maybe talk about how Anthropic uses agents internally, and maybe even just talk about what an agent even is. So welcome, Jess.
A
Thanks for having me, Peter.
B
Yeah, it's great to have you. So everyone's talking about agents, agentic stuff and everything. So let me ask you, how would you define what an agent is, and what are the main components of an agent?
A
Man, what a loaded question. So once upon a time agents were really just prompting loops where you were just trying to get questions and responses in a loop. And I think that's really evolved towards permissioning and access to third party systems, internal tooling and sensitive data. And that level of access now requires permissioning, observability, steering. Not in the same way that it was just question and answer before.
B
Got it.
A
And so the underlying components are still the model, the system prompt and behavioral instructions, and the actual harness driving the loop. But the sophistication of what we are asking agents to achieve is higher. So that has made the sophistication of the harness higher as well.
B
Okay. Yeah, because, you know, agents can use tools, they can have memory, right? There's all kinds of stuff. And why don't we also just define what a harness is? Like, what is a harness?
A
Yeah, the harness is the core scaffolding around the model that gives it the ability to run those tools and to call its memory and to know when to ask for human-in-the-loop input versus to just continue executing on its task. So the harness is really what elevates us from the sort of random sampling of just tokens in and tokens out to actual actionable products.
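To make the harness idea concrete, here is a minimal sketch of the loop described above: something that decides the next step, runs tools, keeps memory, and knows when to ask a human. Every name in it (Action, Harness, scripted_model, count_rows) is hypothetical, and the "model" is a scripted stand-in rather than a real model call, so read it as an illustration under those assumptions and not as how any Anthropic harness is built.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names here are hypothetical and exist only for illustration.
# A real harness would call an actual model API instead of scripted_model.


@dataclass
class Action:
    kind: str            # "tool", "ask_human", or "done"
    name: str = ""       # tool name when kind == "tool"
    argument: str = ""   # tool input, question for the human, or final answer


@dataclass
class Harness:
    model: Callable[[list[str]], Action]     # picks the next action from memory
    tools: dict[str, Callable[[str], str]]   # tools the harness is allowed to run
    ask_human: Callable[[str], str]          # human-in-the-loop channel
    memory: list[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 10) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(max_steps):
            action = self.model(self.memory)
            if action.kind == "tool":
                result = self.tools[action.name](action.argument)
                self.memory.append(f"tool {action.name} -> {result}")
            elif action.kind == "ask_human":
                answer = self.ask_human(action.argument)
                self.memory.append(f"human -> {answer}")
            else:  # "done": the model considers the task finished
                return action.argument
        return "stopped: step budget exhausted"


def scripted_model(memory: list[str]) -> Action:
    """Stand-in for a model: chooses a step based on what memory already holds."""
    if not any(entry.startswith("tool count_rows") for entry in memory):
        return Action("tool", "count_rows", "orders.csv")
    if not any(entry.startswith("human") for entry in memory):
        return Action("ask_human", argument="Include cancelled orders?")
    return Action("done", argument=f"finished after {len(memory)} memory entries")


harness = Harness(
    model=scripted_model,
    tools={"count_rows": lambda path: f"{path} has 3 rows"},
    ask_human=lambda question: "no",
)
result = harness.run("summarize orders")
```

The step budget in run is one simple guardrail; a production harness would presumably add permission checks, error recovery, and observability on top of this skeleton.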
B
And let me throw you a curveball, actually. Do you think the model should be developed with the harness? Are they kind of joined at the hip?
A
I think I am quite biased, but I also think that it is impossible to get the maximum possible performance without tying together the harness and the model. Now the components of the harness and maybe the thickness of the harness will change over time as models get more and more capable. However, you know, we, when we test our models and when we are assessing their performance, we always have to test it in conjunction with a harness. And are we going to test it with all the different harnesses in the world? We're going to select the harnesses that we have built. And so there is an aspect of, you know, the necessity of building models is that you have to be testing them with harnesses and that sort of keeps them paired together.
B
Okay, so like, you know, you test them with Claude Cowork and Claude Code and maybe some third-party harnesses, right?
A
Yeah, yeah. And of course we run against like open source evals and whatnot, but at a certain point, you know, every single model distribution now is through a harness and so we also need to be testing them through harnesses as well.
B
Okay, now let's talk about your product. What is Managed Agents, and how is it different from just me talking to the Messages API?
A
Yeah, so Managed Agents is really the evolution of where we see task orchestration going. And earlier I talked about how we've really evolved from agents being prompting loops to agents being autonomous, self-discovering and long-running actors with access to lots of third-party systems and a need for both permissions and guardrails. And so Claude Managed Agents was developed with that in mind, rather than just being a prompting loop that we've sort of added different capabilities onto. This is a pre-built harness and companion infrastructure to allow an agent to run complex tasks at scale. And so for us, the core motivation behind Claude Managed Agents is that the return on effort for building an agent should be extremely, extremely high. So we wanted to build easy-to-stack primitives and easy-to-use, flexible developer APIs with out-of-the-box infrastructure, and all that should be really, really low effort. But then you should be able to delegate hugely complex work that might have taken you days, weeks, months to actually execute.
B
Okay, so it takes care of like a lot of the infrastructure you have to build to even get an agent to run. Is that kind of.
A
Yes, exactly.
B
Got it. Okay. All right, well then without further ado, do you want to show us how easy or hard it is to build a managed agent?
A
Yeah, absolutely. Hopefully we find it easy. All right, here I am in our Claude Console, and I have a pre-built agent that I've already configured. So here you see the core components of this agent. One is the model selection. This continues to be what drives the intelligence layer underneath the agent. Here is the system prompt, which is the raw text that the model gets to define its behavior, its guardrails, and a high-level awareness of the kinds of tasks that you will orchestrate to it. I've given it access to a built-in tool set that we ship with every Claude Managed Agent. And I've given it the ability to basically interact with its file system and produce results. I've actually set its permissions to always allow each of these tools, but we also have the flexibility to configure these as requesting permission, so keeping a human in the loop for any of these actions. In this particular case, I didn't grant it any skills, but here's where I would grant it skills as well.
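The components named in the demo (a model, a system prompt, tools with per-tool permissions, and optional skills) can be pictured as a small configuration object. The sketch below is purely illustrative: the class names, field names, tool names, and the placeholder model string are assumptions, and none of it mirrors the actual Claude Managed Agents API or Console schema.

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical shapes for illustration only; they do not mirror the real
# Claude Managed Agents API or the Console's configuration format.


class Permission(Enum):
    ALWAYS_ALLOW = "always_allow"   # the agent may run the tool without asking
    ASK = "ask"                     # pause and keep a human in the loop


@dataclass
class AgentConfig:
    model: str
    system_prompt: str
    tool_permissions: dict[str, Permission]
    skills: list[str] = field(default_factory=list)  # optional, may stay empty

    def needs_approval(self, tool: str) -> bool:
        """Tools that were never granted are rejected; granted tools follow their policy."""
        if tool not in self.tool_permissions:
            raise PermissionError(f"tool not granted: {tool}")
        return self.tool_permissions[tool] is Permission.ASK


analyst = AgentConfig(
    model="example-model",  # placeholder, not a real model identifier
    system_prompt="You are a careful data analyst. Explain your steps.",
    tool_permissions={
        "read_file": Permission.ALWAYS_ALLOW,
        "write_file": Permission.ALWAYS_ALLOW,
        "run_shell": Permission.ASK,  # set to ASK here only to show the other mode
    },
)
```

In the demo every tool was set to always allow; the single ASK entry above is there only to show what the request-permission setting might look like in code.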
B
So this is an agent to analyze data, like a data analyst.
A
Exactly. Yes.
B
Got it. Okay.
A
So this particular agent runs analysis on a fictitious grocery store called Just in Time. And I actually give it an initial prompt that gives it the data schemas and guidance on how to actually execute the task. And I also give it a file, a really large, multimillion-line file, for it to analyze throughout the course of the session. When I run it, you can see it running all of its tools and calling the model selectively to run the analysis. And at the very end, it outputs HTML files that can be rendered in the browser so that you can see the results of the analysis. So all of these particular events are the actions that the model is taking. And you can see that I'm actually not steering it except for the initial event. And at the very end these outputs are produced for me.
B
This episode is brought to you by Riverside. I've used Riverside for years to record my podcast because it records each person locally in 4K resolution, so the audio and video still come through clean, even if a guest's wifi gets shaky. But the reason I love it now more than ever is what happens after we stop recording. If I go in here, I can use these AI tools to remove pauses, remove filler words, and just clean up the recording. And I can also edit the transcript directly, and it will automatically generate clips with captions ready to publish to YouTube, Spotify and all types of social media platforms, all from one place. As a one-person creator business, that matters a lot. Riverside is the upgrade your content workflow needs. Try it at creators.riverside.com Peter Yang and use code Peter Yan at checkout to get one month completely free. That's Creators Riverside. Now back to our episode. And what is the initial prompt for this particular conversation? Is it just like, go analyze this huge file?
A
Yeah, you could think of it as basic as that. But I had a very specific schema that I wanted it to be aware of. So for example, I wanted it to know exactly what the structure of the data set was beforehand. And that way I can front load the initial exploration that I would have ordinarily had to do. And then I wanted to break down the steps into discrete segments so that I could get very predictable outputs at the end of the task because these agents are randomized actors. So you do need to be somewhat prescriptive sometimes if you want to have very predictable outputs.
B
Okay. And this will help with like debugging and stuff too, right? Because like they can figure out which step.
A
And to your point about debugging, directly in Claude Console we have an agent that runs and analyzes the full session history associated with the agent. After I run this, I can use this debug agent to look for areas where I could have improved this agent even more.
B
What is the output again? You said just like an HTML file.
A
Yeah, it produces three different HTML files. And I can pull this up right now, actually.
B
So, okay, basically the components are: there's a model, there's a prompt, right? There's tool access, and then there are skills that are optional. For tool access, if I want to hook up to my internal database or something, there's like a whole process for that, right?
A
Yes. One of the ways in which you can hook up an agent to a third-party database or a third-party system is through MCP. MCP exposes a standardized way to communicate with external services, and it includes an authentication layer in front, so that allows you to safely access these internal services that you might not want just anyone to access.
B
Okay. Yeah, it's how I hooked up my Claude Code to all kinds of crazy MCPs, for better or for worse. Access to...
A
Everything. Yeah, yeah. Dangerously allowed permissions.
B
Yeah, yeah, no, I noticed that you guys moved that to like a setting, so I cannot do that easily anymore. So now I use auto mode. But I feel like there's actually a lot of stuff that's abstracted away from this screen, right? So why don't you just briefly talk about, if I didn't use Claude Managed Agents, if I set this up from scratch, what kind of scaffolding needs to go into this thing?
A
Yeah, yeah. So I think that if you're working with a raw prompting loop, then all of your work is highly, highly synchronous and you are constantly dependent on like the prior request that you strung together to complete successfully in order to get to the next step. And I think that that worked in a world where we were just asking the, you know, the chat bots to write haikus for us and we were doing very, very simple tasks. And then over time that has become increasingly unscalable. So if I delegated a really large task to an agent, and for example, something went wrong in that first initial message that I sent, and either I dropped the message or it was slightly off from my expectations, then my ability to pivot my integration and to handle that gracefully is significantly lower. So it's important for us to evolve towards these more self running, self recovering agent loops that can recover from errors, recover from going slightly off course, re steer themselves back, and then just keep you in the loop as they're doing so so that you're aware of their process.
B
When you say self-recovering, you mean like they run into an error, they can debug their own error, they can do some searches and figure it out?
A
Right, yes, yeah. And then even more baseline than that, you know, if they produce an output that is unexpected, but they're aware of what a good output is supposed to look like, they can revise their thinking and really adjust their course of action. That is very difficult to wire together in just a raw prompting loop.
B
All right, let's go back. Do you think the output is generated now, or is it still working?
A
All right, so this agent produces three primary artifacts. So first, an analysis into the products, you know, just an overall high-level inspection of common order patterns in shopping carts. It also produces an analysis into the shoppers, with heat maps on when they're shopping and these really interesting radar charts. And then lastly, my personal favorite is a predictive model that, based off of various attributes of the customer and the products, can predict whether this customer is going to return. So all of this was just with simple prompting and access to Python packages in the agent's environment. And it's able to produce this really rich level of insight in just minutes.
B
And so the prompt has like some stuff about the format of the three reports that we want and the data that we want, right?
A
Yes. So what I put into the system prompt is general performance optimizations, because I do want this to be a general agent that I can reuse across a lot of different data sets. And then what I put into the initial prompt I sent to it is that highly specific schema discovery and the actual task description on how to run its analysis in sequence.
B
Okay, okay. So I guess if you want these reports like you know, every week or something, you can make it into a skill or like some sort of a routine or something, right?
A
Yeah, yeah, yeah. We actually will be offering scheduling natively in Claude Managed Agents.
B
Well, okay, well let me ask you this. So now you have a bunch of traces and a bunch of outputs. Like, you know, like how do we build evals for this agent? How do we know it's like not going off the rails?
A
Yeah. Evals are definitely the toughest part about building agents today, because I think that what has traditionally been eval development is evolving as the tasks get more complex. I think that the traditional evaluation setup of, you know, here's a set of initial prompts and here's how we want the agent to produce results in response, that still works. I think that we're also seeing folks do more sophisticated things, like replays of more complex multi-turn kinds of interactions with agents, or A/B testing different versions by sending the same string of user interactions and seeing how that ultimately changes the responses. I think another trend that we're also enabling in Claude Managed Agents is a built-in eval loop where, if the agent itself knows to grade its outputs, then you can actually pull the eval directly into the agent's work rather than having it be outside of the course of the session.
B
Okay, this is kind of what I do in Claude Code. Like, it spits out an output and I try to get another agent to run the eval, and then if the eval sucks, I try to get the other agent to work again.
A
Basically. Yeah. When you can have agents evaluating their own work potentially in separate context windows to avoid bias, then you're always going to get a better output.
B
And just real quick, on evals, do you guys do pass fail evals or do you do scoring evals or all kinds of evals?
A
We do a mix. There's definitely binary pass/fail, there's definitely scoring, which is more like LLM-as-judge and applying more of a, you know, letter-grading type approach. And then there's also triggering evals, where you make sure that a given type of action is actually triggered. For example, with skills, something that we worked on very early on was making sure that skills triggered properly at the right time, because the whole point behind skills is progressive disclosure.
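The three eval styles mentioned here (binary pass/fail, graded scoring, and trigger checks) can each be reduced to a small function. The sketch below is an assumption-heavy illustration: the function names, the grade bands, and the event dictionaries are invented for the example, and the judge score is passed in as a plain number rather than produced by a real LLM judge.

```python
from typing import Iterable

# Hypothetical helpers for illustration; the grade bands and event shapes are
# assumptions, and the judge score would normally come from an LLM-as-judge call.


def pass_fail(output: str, required_phrase: str) -> bool:
    """Binary eval: does the output contain the phrase we require?"""
    return required_phrase.lower() in output.lower()


GRADE_BANDS = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]


def letter_grade(judge_score: float) -> str:
    """Scoring eval: map a 0..1 judge score onto a letter grade."""
    for floor, grade in GRADE_BANDS:
        if judge_score >= floor:
            return grade
    return "F"


def was_triggered(events: Iterable[dict], expected_type: str) -> bool:
    """Trigger eval: did the session contain at least one event of the expected type?"""
    return any(event.get("type") == expected_type for event in events)


session_events = [
    {"type": "tool_call", "name": "read_file"},
    {"type": "skill_loaded", "name": "charting"},
]
checks = {
    "pass_fail": pass_fail("<html>Returning shoppers report</html>", "returning shoppers"),
    "grade": letter_grade(0.84),
    "skill_triggered": was_triggered(session_events, "skill_loaded"),
}
```

A trigger eval like was_triggered only confirms that an action happened; whether it happened at the right moment would need a check on event order as well.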
B
Oh, yeah, it wasn't that good. I think that's gotten better over time. I used to have to manually trigger the skills with like slash commands.
A
Yeah.
B
But it's getting a little bit better over time. Got it. Okay. I feel like as these agents become more autonomous. Right. And they can do longer running tasks, I feel like it's more about like, what is your goal and what's the outcome that you wanted to have? Like, what do you think about that? Like, that kind of changes the prompting and everything, right? Like, you know.
A
Yeah, yeah, it definitely changes things. I think that, you know, once upon a time there were structured outputs, right? We told an agent specifically, your output must adhere rigidly to this structured, you know, code formatting. And then we would string together a lot of glue to make sure that these big blobs of JSON structured data turned into something beautifully rendered in the browser. I think that as models have become more capable and as harnesses have evolved, the outcome has become the structured output. The outcome is sort of a meta structured output, where we don't need to tell the agent anymore, okay, this is the exact structure of your response, and I will glue all of these together and create something rich and interactive. We're just skipping straight ahead and saying, let's build this rich and interactive thing, and this is sort of my tastemaker's assessment of what good would look like. And because we now have the infrastructure for these agents to run autonomously, the agents can actually self-correct along the way rather than relying on us to string together all these intermediate outputs.
B
Okay, so it's not like, here are five sections you've got to adhere to, here's the character count, here's what you've got to do. It's more like... I think it's actually kind of harder now, because what's an example of a tastemaker output? It's like, hey, you've got to make it beautiful, interactive.
A
Yeah. I mean, we see people using it for slide generation and content creation. I think that's one area where outcome optimization is really useful, just visual artifacts and editorial content. The other place where we see it being useful is, for example, in the predictive model that I just showed. Let's say that I needed it to achieve a specific score of 90% on a sort of accuracy benchmark. I've run tests with this agent before to try to always optimize for building a model that will hit that score, and it iterated until it got there.
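The iterate-until-it-hits-the-score pattern described here can be sketched as a loop that tries a candidate, measures it against a target, and stops once the target is met. The toy data, the threshold "model", and the function names below are all made up for illustration; the agent in the demo presumably trained a real predictive model with Python packages, which this does not attempt to reproduce.

```python
# Toy rows of (orders_placed, customer_returned). Invented data for illustration.
ROWS = [
    (1, False), (2, False), (3, False), (4, True), (5, True),
    (6, True), (7, True), (8, True), (2, True), (9, True),
]


def accuracy(threshold: int, rows: list[tuple[int, bool]]) -> float:
    """Score a trivial rule: predict 'returns' when orders_placed >= threshold."""
    correct = sum((orders >= threshold) == returned for orders, returned in rows)
    return correct / len(rows)


def iterate_until(target: float, rows: list[tuple[int, bool]],
                  max_attempts: int = 10) -> tuple[int, float]:
    """Try thresholds 1, 2, 3, ... and stop at the first one meeting the target.

    Returns (threshold, accuracy). If no attempt reaches the target within the
    budget, the best attempt seen so far is returned instead.
    """
    best = (0, 0.0)
    for attempt in range(1, max_attempts + 1):
        score = accuracy(attempt, rows)
        if score > best[1]:
            best = (attempt, score)
        if score >= target:
            return attempt, score
    return best


threshold, score = iterate_until(0.9, ROWS)
```

The attempt budget matters as much as the target: without it, an outcome that cannot be reached would keep the loop running indefinitely.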
B
Okay, got it. Okay, good. That makes sense. Yeah, it makes sense to try to get the agents to do their own loops first, before the human eyes have to come into play, and make as much progress as possible. Well, let me switch gears a little bit. Let's talk about how you guys use agents internally at Anthropic. And yeah, maybe you can screen share your Slack, but I'm curious, for Anthropic PMs and technical staff, what kind of help and leverage have agents given you guys internally during your work?
A
Yeah, for me, it's really been about depth, and I think that access to our code base has been the biggest unlock for me. I think that, one, it helps me just manage state more easily. You know, rather than poking a bunch of engineers on what they're doing, I can just track the PRs directly and see which ones are merged, which ones are deployed. I think there's also an aspect of, I deeply understand and interact with my product so much more than I've ever been able to in the past, because it's so much faster for me to either prototype an agent on Claude Managed Agents, because I can use Claude Code, or just interrogate the code base on exactly how everything is working. Anything from going into a customer RFP and filling out all of their security check marks to diagnosing problems in the field, unblocking users who are having trouble, or helping them scale by helping them understand our specific architecture. All of that is 10,000 times easier because of all the agents that we have internally.
B
Okay, so let's kind of break that down into a few things. So you come into work on Monday and you're like, hey, Claude, what did my engineers ship last week? Do you ask that kind of question? Or like, how do you...
A
For me, I have some scheduled runs going that like, summarize activity, but I still do a lot of the deep dives on a more sort of ad hoc basis based off of the questions I'm getting or like the pitches I need to prepare for or the customer conversations I'm about to have.
B
Okay, got it. Okay, got it. So you go into an enterprise to talk about managed agents and you personalize the pitch for that company or something, right?
A
Yeah, yeah.
B
And do you guys have like a feedback channel? Like, I know some people are active on Twitter, even though it's pretty toxic, but how do you get feedback for your product? Do you have like an enterprise Slack group or something?
A
For Twitter, we do have agents that are sort of scraping the web and giving us summarized feedback. That's really helpful for signal, because no one person can ingest all of Twitter by themselves. I do have agents monitoring our Slack channels. I sit in a bunch of Slack channels with a bunch of customers, and I love talking to customers directly. But for those I'm not able to talk to directly, it is useful to have agents sort of summarizing the activity that's happening there. I think that we have started to evolve towards thinking about agents as always on: you should be able to tag them anywhere, but they also should be proactively surfacing things for you in the way that a coworker truly would. And so I think that there are two aspects in which our agent usage is really powerful. One is the level of access to data we give it, and two is the interaction style that we expect of our agents, which is that it should be human-like, it should be proactive and not just reactive.
B
Is proactive through like triggered events and cron jobs or like how do you make it proactive? Just like.
A
Yeah, proactive is on triggered events and cron jobs and continuously refreshing the data that it has available, so that if it is long running and it's constantly slurping up information, it needs to be as up to date as you are. And that shouldn't be just, you know, on an ad hoc basis. That should be proactive.
B
Got it. Yeah. It's gotta have the most up to date context, right? Yeah.
A
Yeah.
B
So why don't I ask this question? How do you compare your conversations with Claude and agents versus your conversations with coworkers? Like, what happens more throughout the day?
A
Interesting. That's a great question.
B
Yeah, I feel like I talk to Claude more. I'll be honest.
A
It is true. I do. I think particularly when I'm in a new space, I find myself spending a ton of therapy time with Claude, just trying to wrap my arms around a thornier concept. And I think that it up-levels my conversations with my teammates, because I'm able to come to a conversation with a true opinion and a lot of baseline research done very quickly. And so I'm not asking the, you know, please spin me up questions. We're able to engage at a deeper level.
B
I mean, I guess that's probably expected at Anthropic, right? Because you guys are encouraged to use Claude. Do you guys pull up Claude in like a decision meeting or a team meeting and try to, you know... Yeah, yes, you do that.
A
Yeah. Claude is a really good neutral judge for certain things. We have an API review Claude that basically, if we're really stuck at an impasse with how we want to shape certain components of our API, then we do tag in Claude to tell us when our biases are getting the better of us. But all of our primitives are definitely there to be able to allow for this kind of interaction. It's sort of like agent to agent communication.
B
All right, so just to wrap this section, how would you say you typically spend your day? Or I guess no real typical day. Right. But, like, do you talk to a lot of enterprise customers? Do you do like a lot of roadmap planning? You're like shipping stuff yourself? Like, how do you typically spend your day?
A
Yeah, I think that my day now is spent a lot more in the customer discovery and sort of integration journey process than it has been before. And I think that's awesome. I think it's because, frankly, we're all moving so fast that, like I said, the kind of conversations that we're all having has been seriously up-leveled. I think previously, you know, if I spent this much time in customer conversations, it might be like, oh, please debug this tiny little thing that is, you know, a problem that I've debugged 100 times. But now, because they have agents and because we have agents and because we're pushing the boundaries, the conversations are now like, okay, how would we build this super futuristic thing together, and what are our principles around it, and how can we push that forward in the next two weeks? Those kinds of conversations are really exciting for me, and I really, really do love spending time with our customers. I also spend a ton of time prototyping. And so one thing I love about Managed Agents is, because it's so easy to spin up an agent and because my work changes day to day and it never really looks the same, you know, week over week, I need to be able to spin up an agent that is perfectly suited for that specific task of the moment. And it's okay if I throw that agent away. Like, it doesn't have to be the most beautifully productized thing in the world.
B
Okay.
A
And so I spend a lot of time bashing my own product by basically automating my own work, and it takes me maybe half an hour to spin up an agent, but I try to have a different one going every couple of weeks.
B
Can you give an example?
A
Yeah, I'll give an example. So we had a waitlist for some of our advanced features, and it was like a 4,000-organization-long waitlist, and it was filled with invalid entries and duplicates and, you know, all the kind of stuff that you get on a traditional published web form. And I know that this waitlist is only really going to be relevant for the next few weeks, until we get this feature out into the public and make it self-serve. So I just needed to spin up an agent that would automate my next few weeks of work with this waitlist, like parsing out all the invalid entries and assessing which ones have the highest likelihood to convert and actually give really high-quality feedback. And I basically equipped it with access to our internal systems and our databases and whatnot to make that assessment and figure out who to pull off the waitlist on a daily basis. That's just a few weeks of work, and there's no point in building something super shiny for it. And so having really, really good, easy-to-use, pre-built infrastructure that just automates building an agent is a huge unlock for me, and I can take those kinds of tasks and just repeat them.
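The cleanup half of this waitlist task (drop invalid entries, collapse duplicates, then rank who to invite) is simple enough to sketch directly. Everything below is hypothetical: the entry fields, the deliberately loose email pattern, and the "score" value, which stands in for whatever conversion-likelihood assessment the agent made using internal systems.

```python
import re

# A deliberately loose email check for illustration; real validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_waitlist(entries: list[dict]) -> list[dict]:
    """Drop invalid emails and keep only the first entry per normalized email."""
    seen: set[str] = set()
    cleaned = []
    for entry in entries:
        email = entry.get("email", "").strip().lower()
        if not EMAIL_PATTERN.match(email) or email in seen:
            continue
        seen.add(email)
        cleaned.append({**entry, "email": email})
    return cleaned


def pick_invites(entries: list[dict], limit: int) -> list[str]:
    """Rank cleaned entries by a hypothetical 'score' field and return the top emails."""
    ranked = sorted(clean_waitlist(entries), key=lambda e: e.get("score", 0), reverse=True)
    return [entry["email"] for entry in ranked[:limit]]


# Invented sample entries; 'score' stands in for a conversion-likelihood estimate.
waitlist = [
    {"email": "A@example.com", "score": 0.4},
    {"email": "a@example.com ", "score": 0.9},   # duplicate after normalization
    {"email": "not-an-email", "score": 1.0},     # invalid, dropped
    {"email": "b@example.org", "score": 0.7},
]
invites = pick_invites(waitlist, limit=1)
```

Keeping the first entry per email is one arbitrary choice among several; merging duplicates or keeping the most recent submission would be equally reasonable for a throwaway agent like this.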
B
Okay, so basically this agent looks at the big waitlist and then like cleans it up and then like sends invites to people on the waitlist. Is that kind of.
A
Yeah, yeah. Like people who are the highest likelihood to be high-value testers for us.
B
Got it. People who are like the most active Twitter complainers, basically. No, I'm just kidding. Okay. I don't have to be a company to use this product, right? I think you were going to show me how to do this in Claude Code, or can I just use this as an individual?
A
As long as you have an API key with us, then you should be able to use it. In fact, we have a ton of usage from individuals. It actually kind of surprised me, but I think that we're seeing a lot of individuals just automate their lives with Claude Managed Agents.
B
But when should I use this product versus just try to build a skill or a weekly cron job or something? You said there's a lot of individuals using it, right?
A
Yeah. Keep in mind that these are really long-running, cloud-hosted sessions. Anything that's running directly in your Claude Code is bound by the constraints of your laptop and when it's on. And so using Claude Managed Agents basically pushes that all to the cloud. It increases the capacity of the work that you're able to do and also the longevity of it.
B
Okay, so stuff like maybe like some sort of online competitive research or like what are people using it for in their personal lives.
A
Yeah, I have a friend who's a new parent who basically has her child's entire hourly schedule of feedings and tummy time and all of the things that new parents have to worry about basically fully managed by Claude. And then she also has these fridge monitors and grocery management agents running as well. So yeah, I don't actually have insight into what people are doing at scale, because, you know, we redact a lot of that information internally. But from anecdotal usage, honestly, just using these as your personal assistant, you know, they're the ultimate customizable agents, right? You can use them for whatever you need.
B
And I guess the mentality switch is like it's not just like a clockwork conversation that's like temporal. It's almost like something that's kind of like there and can help you for like multiple days and weeks. Right.
A
Like, if you make it good, effectively it is a completely customizable agent that is designed for long-running work and has built-in memory that, again, is completely configurable by you. So it improves over time, and it handles long-running tasks extremely well. And so that is sort of the ultimate personal assistant agent, right? The ultimate sort of chief of staff that you would want for your personal life.
B
Yeah, because I think the memory part is actually very important, because then your instructions can become more and more vague and then hopefully it just understands what you want. So yeah, that's very important. Okay, let me ask you a few more questions. I built some of this stuff myself too, inside companies. Do you have any best practices for companies thinking about building agents? I'll tell you my bias. I feel like sometimes when companies set up agents, they make it too complicated right off the bat. They want to have orchestrators, they want to have all kinds of stuff. And my bias is, let's just build one agent and see if it actually works and people actually use it. But, you know, you've talked to a lot of enterprises. What are some best practices to roll this out?
A
Yeah, I think that a lot of enterprises make the immediate jump to, how could I automate this crazy 20-team workflow that would have required a lot of cross-cutting coordination and these multi-month processes? I mean, super ambitious, very exciting. But I do think that there is something really valuable about just, okay, how do we unlock the individual? How do we make any individual on any team feel exponentially more powerful? If one Peter and one Jess is suddenly like, wait, I don't need to make dependency requests, because I have these agents that are able to extend the kind of work that I'm able to do, do the design work that previously I would have had to request a human for, etc., then you've supercharged one individual, right? You might not have completely eviscerated an entire multi-quarter process, like a compliance process that everyone hates, but you've instilled the kernel of creativity and sort of autonomy within the individuals of your organization. So starting there, by just getting your individual employees to raise the ceiling of what they can do creatively and what they can ship in isolation, is the first starting point. And then from there you can start working on these multi-team mega processes and start raising the ceiling of complexity there. But there is a huge amount of value that's unlocked simply by making everybody feel like they have their own power to develop products as sort of a one-person startup.
B
You have a bunch of one person startups inside a large company? Basically.
A
Yeah, yeah.
B
That's interesting. Okay. Yeah. So basically and like do you just let anyone make their own individual agent or do you kind of like. I think it's probably best practice to actually give some like spotlight examples. Right. Of people who actually know what they're doing.
A
Yeah, yeah. I think giving people templates and then letting them iterate freely off those templates is always a good place. You know, you avoid the writer's block of like, what do I do? But then you give them the creativity to iterate.
B
Cool. This is very good. And I think getting the agent into the hands of actual users quite quickly, maybe not all users, but at least like some beta users, you know, that's kind of where the rubber meets the road. Right. So like, don't go too crazy on evals before you even get it into the hands of any users. Like.
A
Yeah, just that. Yeah, the vibe testing is honestly the most important first step. And at a certain point you outgrow the vibe testing because you can't really, you know, aggregate vibe signals at scale. That's when you do evals.
B
Okay. Okay. It's hard to present the vibe testing. I'll just wait to quantify the vibes. I guess you can have a bunch of quotes and stuff, but it's hard.
A
Yeah, yeah.
B
Okay, cool. So I guess, last question. So I mean this stuff is moving so fast. Where do you think this stuff is going to go? Like let's just say like three or six months from now. Like you think we're going to like, you know, like before I go to bed I'd be like, hey, just take care of everything for me and then when I wake up it's done or yeah.
A
Honestly to some degree these long running agents are kind of doing that. Like we set them tasks overnight and then we wake up and you know, backlog is resolved and bugs are squashed. But I think I, so I think that we will see the workday like becoming sort of, sort of the limits of what we can achieve will really be based off of how much we can delegate at once. More so than like what our like personal capacities are because we are going to be able to increasingly lean on these agents as partners. So that's one thing I think like on an industry level what I'm really fascinated by is I'm starting to see that vertical SaaS is sort of just like becoming increasingly specialized. And so the idea that you would have like an accounting agent versus a healthcare agent like that is starting to become like more and more and more narrow as people realize that like you know, the models are getting smarter and so like broad domain expertise is sort of there. And so like the real value add is like, like these incredibly specific end to end niche use cases. And so it's really interesting because I think that like getting the right agent to do the job now is getting so specialized and tailored that having the really the shared thing now is like the context patterns and the task orchestration patterns more so than like okay, this is the canonical way to build like a finance agent or like a healthcare agent.
B
So for example, like maybe instead of a general accounting agent is like an accounting agent for like solopreneurs or something like, like just like more.
A
Yeah, yeah. Everything is just becoming incredibly specialized, particularly as people are able to build products for themselves and scale them externally. So what we're seeing is as you think of as you realize that you can now build software for these hyper specific use cases, people are scaling those things and so that means that the kind of products that we're seeing distributed are increasingly verticalized.
B
Yeah, I think this is actually really interesting. I actually have more questions about this just real quick because I feel like I can build a product like AI product for accounting for solopreneurs or something, but I feel like someone could just build a skill that's just as good. It's kind of hard to think about a long standing SaaS product around this. Yeah, I don't really have a good answer, but.
A
Yeah, I think that the products that will sort of survive this transformation are the ones that meet their users where they are, where their workflows are. Right. So it's about being one, hyper specific to the tasks at hand and adaptable I guess beyond those specific tasks, but then two, being exactly where you need it to be. And so that always on kind of agent pattern is definitely important because you want that agent to pop up at the right time. But it also needs to be in like the discoverable place where you would expect those workflows to be handled.
B
Increasingly, that is just like in Claude Code or some of these apps. Right. Like I don't want to navigate to a website and like fill a bunch of forms. Like I feel like I have, I want to bring my personal agent like with all my context and I want to get it to go talk to someone's accounting agent. Just like go, go, go figure it out. You know what I mean?
A
Yeah, I. Wherever you're, wherever your, your work lives. And so for like, you know, for a lot of engineering teams that will be increasingly in Claude Code and everyone sort of is becoming an engineering team. But also like, just like, you know, even if we think about Vercel's Chat SDK and how, you know, they had the sort of vision to realize that like everything is chat now because like agents interact best in chat and like these sort of like interactions with our colleagues are getting more and more compressed because the speed that we're iterating at is so much faster. Right. And so like I do think that a lot of these, I mean it sounds kind of basic but like these sort of to meet users where they are, you have to, it has to live in Claude Code and it has to live in chat.
B
Yeah, it's got to live in chat and the chat has to be connected to all your personal context. It's not just like a random chat window on a website, you know. Yeah, well, fascinating world. Fascinating world living. Yeah. But, but this is super helpful, Jess. Where can people find you? Or learn more about Claude Managed Agents?
A
Yeah. So definitely use the Claude Code skill to learn more about Claude Managed Agents. We also have traditional artisanal documentation that I assure you humans still read at platform.claude.com, and I'm Jess double underscore Yan (@jess__yan) on Twitter if anyone wants to reach me.
B
All right, Jess. Well, this has been a fascinating conversation. I hope the agents don't take over everything, but I look forward to having them save time on work we don't want to do.
A
Yeah, thanks so much.
Podcast Summary: Behind the Craft – Inside Anthropic’s Bet on Claude Agents that Work While You Sleep | Jess Yan
Host: Peter Yang
Guest: Jess Yan, Product Lead at Anthropic for Claude Managed Agents
Date: June 28, 2026
This episode explores the evolution and power of autonomous AI agents, focusing on Anthropic’s Claude Managed Agents. Host Peter Yang speaks with Jess Yan, product lead at Anthropic, about what AI agents are, how they’ve advanced from simple prompting loops to long-running, self-correcting actors, and how both individuals and enterprises can harness agents for outsized productivity. Jess also demonstrates building an agent from scratch, discusses best practices, and offers a glimpse into future trends in agent-driven workflows.
• Agents have evolved:
• Core components of an agent:
“Once upon a time agents were really just prompting loops... I think that’s really evolved towards permissioning and access to third party systems, internal tooling and sensitive data.” – Jess (01:09)
• The harness & model are tightly coupled:
“It is impossible to get the maximum possible performance without tying together the harness and the model.” – Jess (02:38)
• Main value proposition:
“The return on effort for building an agent should be extremely, extremely high... you should be able to delegate hugely complex work that might have taken you days, months, weeks.” – Jess (04:34)
• Demo Walkthrough (05:21–10:24):
“All of these particular events are the actions that the model is taking. And you can see that I’m actually not steering it except for the initial event.” – Jess (07:18)
• Integrating external systems:
• Raw prompting loops are synchronous, brittle, and hard to maintain for complex tasks.
• Managed agents offer:
- Error recovery
- Asynchronous, long-running execution
- Built-in memory
- Self-steering and debugging
- Easy integration and scalability
“It’s important for us to evolve towards these more self running, self recovering agent loops that can recover from errors, recover from going slightly off course, re steer themselves back.” – Jess (11:23)
“All of that is 10,000 times easier because of all the agents that we have internally.” – Jess (21:44)
“I have a friend who’s like a new parent... basically fully managed by Claude... they’re the ultimate customizable agents.” – Jess (30:55)
• Start simple:
“There is a huge amount of value that’s unlocked simply by making everybody feel like they have their own power to develop products as sort of like a one person startup.” – Jess (35:17)
• Share templates:
• “Vibe testing” is key:
“We’re also enabling... a built in eval loop where if the agent itself knows to grade its outputs, then you can actually pull the eval directly into the agent’s work.” – Jess (16:08)
“The agents can actually self correct along the way rather than relying us to string together all these intermediate outputs.” – Jess (18:41)
“Honestly to some degree these long running agents are kind of doing that. Like we set them tasks overnight and then we wake up and... backlog is resolved and bugs are squashed.” – Jess (36:47)
“The limits of what we can achieve will really be based off of how much we can delegate at once, more so than ... our personal capacities.” – Jess (37:26)
On agent evolution:
“We’ve really evolved from agents being prompting loops to agents being autonomous, self discovering and long running actors.” – Jess (00:00)
On practical superpowers:
“My personal favorite is like a predictive model that based off various attributes of the customer and the product can predict whether this customer is going to return.” – Jess (00:23 / 13:26)
On leveraging agents internally:
“Rather than poking a bunch of engineers on what they're doing, I can just track the PRs directly... I deeply understand and interact with my product so much more than I've ever been able to in the past.” – Jess (20:31)
On individual empowerment:
“You might not have completely... eviscerated an entire like multi quarter process, but you've instilled the kernel of creativity and autonomy within the individuals of your organization.” – Jess (34:14)
Host Peter Yang on his reliance on Claude:
“Yeah, I feel like I talked to Claude more. I'll be honest.” – Peter (24:27)
| Segment | Time |
|---|---|
| Jess defines agent evolution and components | 00:57–02:38 |
| Harness-model integration discussion | 02:31–03:44 |
| What is Claude Managed Agents? | 03:51–05:13 |
| Building and demoing an agent in the console | 05:21–10:24 |
| Connecting agents to external systems (MCP, permissions) | 10:24–10:56 |
| Why not just use prompting or cron jobs? | 11:23–12:43 |
| Example: Agent analyzing grocery store dataset | 13:17–14:45 |
| Building evals for agents | 15:08–17:30 |
| How agents up-level work and are used inside Anthropic | 20:31–24:16 |
| Individual & personal applications, personal assistant agents | 30:55–32:33 |
| Best practices for enterprise rollout and adoption | 33:20–36:31 |
| The future: overnight agent work, vertical specialization, context-rich agents in workflow/chat/code | 36:47–41:23 |
| Where to learn more about managed agents and connect with Jess | 41:43–end |
For more, check out platform.claude.com, use the Claude Code skill, or reach Jess on Twitter (@jess__yan).