
Anthropic's David Hershey comes on Decoder to discuss Claude Sonnet 4.5 and the current landscape for agentic AI.
Hayden Field
Hey there and welcome to Decoder. I'm Hayden Field, senior AI reporter at The Verge and your Thursday episode guest host. I'll be subbing in for Nilay for a couple more episodes, and I'm excited to keep diving into the good, the bad, and the questionable in the AI industry. Today I'm talking with David Hershey, who leads the Applied AI team at Anthropic. David works with startups to help them figure out how to best apply Anthropic's tech, plus he tests new AI models to understand their limits. I wanted to have David on because earlier this week Anthropic released a brand new AI model called Claude Sonnet 4.5 that's been making waves. For reference, Claude is to Anthropic what ChatGPT is to OpenAI. This new model, Sonnet 4.5, is being billed as a big breakthrough in autonomous agentic AI, especially for coding purposes, which is a big battleground in the AI market right now. All these companies want to get a slice. These types of AI products can, in theory, be given complex tasks and then go off and complete them over the course of many hours, or in some cases even multiple days. And Anthropic says this particular model, Sonnet 4.5, can run for up to 30 hours straight without any human intervention, all while working on a singular task like building a software application from scratch. For the last year or so, companies like Anthropic, Microsoft, OpenAI, and others have been promising that this agentic technology would be the next phase of AI, the next big hype-filled thing that comes after general purpose chatbots.
They say it could really unlock generative AI's potential, and it's true they've made some strides. But as we've seen so far, agents aren't quite there yet, and they have a ways to go. Most of us are not, in fact, sending agents off on the internet to do our bidding. And we're certainly not giving them tasks that might take 12 or 24 or even 30-plus hours of autonomous work without human hand-holding. At least not yet. At the same time, many companies are looking at agents as the breakthrough that's supposed to unlock huge productivity gains from AI models, including the opportunity to use them to replace or augment human labor. So I wanted to sit down with David, who spends a lot of time testing out what models like Claude Sonnet 4.5 can and can't do, to ask him where we are on this promise of AI agents. I wanted to talk about what these types of products are good at from a consumer standpoint, beyond just programming purposes, and also what the path forward looks like as AI agents progress. Okay, here's Anthropic's David Hershey on the state of AI agents. Here we go. I wanted to ask you about your view of the current state of play for AI agents. We hear all the time that agents are the next big thing for generative AI, but are we still in the prototype stage, the testing phase, or what? How would you characterize what AI agents do today, right now, relative to what AI companies actually want to offer in the end, which I hear from a lot of execs is basically Jarvis from the Marvel movies?
David Hershey
I am less confident in our ability to project the end, but I'm happy to talk about now. I've seen agents come a long way in the last year working with customers, and my view is there are places where we're starting to see what it looks like when agents work really well, and there are still a lot of places where they don't work really well. And I think that makes it confusing for some people. Code is a great example. When you're writing code, and especially with Sonnet 4.5, when you watch the model spend a lot of time as an agent developing software itself, it's incredible. It can do a ton. It's gotten much better literally this week as we release models; it's really visible and obvious how much better they're getting at being an agent doing really long-running, complex tasks, if you're plugged into that. When you look at other sections of the economy, or other jobs that people want agents to do large pieces of, sometimes there's stuff they're still not good at. In some cases they're not good enough at deciphering what's on a computer screen yet to be able to navigate complex UIs or whatever it is. And so they fall over and stumble over themselves on something sort of silly. And it's easy to point at that and say, what are agents all about? Is this kind of fluff or hype or whatever it is? The way that I see this generally happening is we've slowly been ironing out the kinks, the stuff that models fall over themselves on. And we've done that the best so far in coding. I think the industry has done that the best so far in coding, where we're making really fast progress on how much an agent can accomplish when writing code. And we are, I think, starting to make that progress in a lot of other domains. For example, I talked about clicking on UIs on a computer. This model, Sonnet 4.5, is way better at that.
I don't know if it's like exactly the point we're going to tip into people like trusting it to automate a whole bunch of stuff they do when they're clicking on a browser yet or getting there. And so I think my general view is we're making really fast progress. It's just not necessarily visible in every part of the economy yet and in every job and every person and every individual. I have a feeling sort of each model that comes out will get one bit closer to being something that everybody can sort of interact with and see.
Hayden Field
So it's great at developing software, it's great at coding. It kind of reminds me of the robot hand moment, you know, how robots can do things that are really, really complex and hard for humans, but actually grasping something has always been a real headache. What about consumer-facing stuff? You talked a little bit about UIs, but what do you think agents right now are the absolute worst at? Like, what's the simplest thing that they truly just cannot do?
David Hershey
I honestly sometimes struggle to put my finger on this, because it's surprising little funky things in a lot of different agents. So, for example, if you're trying to do something related to finance, maybe there's a bit of manipulating a spreadsheet that is really hard, and it falls over. And so it can do 99% of the stuff. It can do the math, and it kind of understands how a finance model works, but it will stumble over a spreadsheet. I think it's hard to boil down all of the jobs that we want to help people do into one tiny little thing that the models aren't there on. If it was that easy, I guess we would probably be on top of it already. And I think maybe a better model is, for each of these different things that we want models to help us with, there's a million little components it breaks up into. You need to be able to see the right cell on a spreadsheet and know how a formula works and know the macroeconomic model. And thing by thing, you can find, in each different task, a little thing that's not quite there yet. And so as we think about it at Anthropic, we just have to be able to iron out, fix, each one of the little gaps in each of these to help everybody use it. But honestly, this is probably an unsatisfying answer; I can't quite put my finger on just one thing. I think this is why this field is hard. We're constantly working on the whole universe of the stuff people do on a computer and trying to help them out. And that just means it's a really wide scope. From my side, it's fun. There's a lot of hills to climb to try to make the models better at all of the things that we wish they were good at. So we're working on all of them at once.
Hayden Field
As someone that works with a lot of different startups in a lot of different industries on how they're actually applying AI, what are the industries that have surprised you the most? What sectors are clients in that you just wouldn't really expect? Or maybe the ones that you've seen the most growth in in the past six months, the past year? What are the trends you're seeing in terms of which sectors are really adopting agents at scale?
David Hershey
I have two versions of this, but I'll give you the fun one first. I think one of the domains that surprised me the most in my personal customer work is the legal domain. At face value, it's apparent why that can be really useful. There's all this information you need to know about case law and studies, and fusing information is something that's pretty obvious models are good at. But actually there's so much depth and complexity to the legal field, which I didn't appreciate. I'm not a lawyer. I didn't appreciate it until I started working with people in the legal domain.
Hayden Field
I used to write a lot about legal AI and how the law sector was kind of the last to adopt a lot of AI tools because they were super old fashioned. I did a couple trend pieces on that and then when they started to adopt it, I was really surprised at how quickly it came.
David Hershey
Yeah, that's exactly what caught me off guard. And honestly, there's so many things that make it hard. To write a really good legal system, you typically need a lawyer to tell you if it's good or not, and having to have lawyers in the feedback loop of how to build is challenging. And so I've been really impressed with a lot of the companies I've worked with and their ability to work around that. There's this interesting breed of companies that have lawyers on staff to help with product building, like big numbers of lawyers on staff to help them build products. They build agents and cool things to be able to comb over case law and look for the right details and pieces, and then obviously use the stuff that AI is obviously good at of synthesizing answers at the end. How quickly that field moved, I think it's probably the scale of the upside there, the pure volume of work that needs to be done and how much it can help, that has driven the speed that people go. But that domain has surprised me. I promised you a two-part answer. My second part of this answer is, I think the funny thing about this space is it's really hard to guess where the next agent is going to take off, because of that thing we talked about before, or I talked about before, where sometimes there's just this one little tiny thing that's not super obvious to any of us that's blocking an agent from working. So if you wanted a model to do your taxes for you, I think that'd be very nice. I don't look forward to doing my taxes every year. You just find out that actually what it's really bad at is, you upload your W-4 and it can't quite see the difference between the two boxes on your W-4, and that's why the whole thing doesn't work. This is a toy example. I don't think that's actually where models are.
Hayden Field
But yeah, that would be pretty complex. I remember when I worked in two states, or I had a job in one state and I lived in another. It was also really hard for me to figure that out. So not surprising that AI can't either.
David Hershey
Yeah, it is complicated, but I think we can get there. That clearly feels in the domain of what the models can do. I've seen them do much more complicated things. They're better than me at math by a decent amount, so I would expect them to be able to do this. But sometimes it's just this one little thing. And so I'm kind of constantly surprised. Mostly it's around when we release new models. It turns out there's some set of things that the model can do now, and then you see new agents crop up that are doing things we couldn't do before. And so there's nice little micro surprises that come out. I don't spend a lot of time thinking about accounting normally in my day job, but then you run into an accounting startup that can suddenly do something new and interesting, and that's pretty cool.
Hayden Field
Do you think that data annotation is going to be a big part of that? Do these models get tripped up because there's some data that they don't have enough of, especially with a niche industry? Is that where you see some of the obstacles come into play?
David Hershey
I think we need to learn from specialists, and we need great ways to learn from specialists. And I don't know if it's necessarily data in the classical sense, that there's some pile of data that we're going to learn from that makes it better, or if there are just better ways we need to incorporate the intelligence of accountants and lawyers and other people into our models. I think that could come from data; some of it certainly will. I think that can come from talking to and interfacing and working with people in those domains. I think a lot about learning directly from our customers. I think there's a future where more companies can contribute more directly to making the models do the stuff that they care about. It would be really nice if someone saw this big important thing that they'd like to achieve, and instead of having to sit around and wait for a lab to hopefully build a model that helps them do it, they can directly work with a lab. And I think that's probably somewhere that, well, I know that's probably somewhere that we'll head, where we can have more direct mechanisms of working with experts. And yeah, it's partially a data thing. But I also don't think it's surprising that one of the things that models are great at today is software engineering, when the building that I'm in at Anthropic headquarters is filled with software engineers who know how to make models great at software engineering, because they write software all day.
Hayden Field
Yep, that tracks.
David Hershey
Yeah, exactly. It's deeply unsurprising that that's true. And I think as we grow up as a company that's only a few years old and has spent most of our time hiring software engineers and figure out how to consult and work with more people in more diverse places and doing more diverse jobs, we'll get better at building models that are great at all of the other things that people wish models were great at.
Hayden Field
We need to take a quick break. We'll be right back.
Hayden Field
We're back with Anthropic's David Hershey discussing the landscape for AI agents. Before the break, you heard David explaining some of the trends he's seeing, both from his work internally at Anthropic testing new models, and from clients who are now using this tech in their industries. But now I want to ask David about Anthropic's big announcement this week, Claude Sonnet 4.5, and why it's being billed as such a step forward for AI agents. Anthropic just released Claude Sonnet 4.5, and to me that was a big deal. I'm really in the weeds on this stuff, so I was really into all the specs, but for the average person, that's just a word and a number with a decimal and another number. So why is this a big deal, and how does it differ from your latest models before that? What's the meaningful difference and the meaningful step change here?
David Hershey
Yeah, I'm very excited about Sonnet 4.5, too. And I'm also very cognizant that when my mom texts me asking about it, it's just a model with a decimal and a number after, until I talk to her. So there are a lot of things that I think are exciting. One thing that I want to say up front, before I get into a lot of details that have jumped off the page to me about the model, is it's sometimes hard to predict what the big impact is going to be on everyone. These models get generally smarter, they get generally more capable. And not to keep harping on the same point, but sometimes we just don't know the blind spots that they had until we get over them. And so when we released a model last year, Sonnet 3.5, and suddenly all of these vibe coding startups happened, I don't think we actually knew that that was going to be true. The moment we released that model, I don't think we could have predicted that so many companies would crop up and be able to help writing code with agents in this amazing way that we've seen happen. So part of what is exciting about Sonnet 4.5 is the unknown. I'm really confident this model is the smartest model we've ever created. And I'll get into some of the things I've seen it do that I've never seen another model do before, and that's really cool. Part of it's concrete stuff we'll talk about; I'm excited to talk about some of the software engineering stuff I've seen it do. But some of it's stuff that we really don't know until our customers and the people we work with try to build cool new things that they couldn't make work before, and they suddenly make it work.
And so if I had to guess how my mom might see Sonnet 4.5, and how it might make a difference to her, it's that there's some product that she couldn't use before, or that didn't exist before, in the way that Perplexity has happened in the past for consumers or Cursor exists now for developers, that is going to happen, that's going to impact your life, because this model is capable of a new thing that we didn't know about before. And I have a hard time predicting it. It's part of the fun part and the hard part about being someone who works with customers: it's really hard to predict this stuff the day we launch a model, but it tends to happen, and that's cool.
Hayden Field
Have you seen any trends start, even though it's been like one day? Anyone that's starting to use it in a new way, one of your clients, or even in beta testing?
David Hershey
It's too early for customers, I think, to see anything brand new. It's a very fast field, but I have a give-it-a-month rule to find out what the new companies are going to be. In my testing I have certainly seen some new stuff, and with the testing some of the team has done, that is really exciting. One of my favorites to talk about: my team has been working on seeing how much of a software engineering task a model can take on at a time. I think we're all aware that models can be used to write code. That's sort of obvious to a lot of people, at least in this industry, and probably people listening to this podcast at this point. But it's often in a pair with a human really directly going back and forth, using Claude Code or an IDE. Write one thing at a time, check and review it. The longer you let a model go trying to implement something big and complex, the more likely it is to do a whole bunch of stuff you didn't want to happen. We were really curious how much we could stretch that. If I give Claude a really huge task, what can it accomplish? One thing we've seen with this model, in a way that is really not true of models I've tested and seen in the past, is if you give a sufficiently good overview of what you want the model to accomplish, I haven't really seen a ceiling on how much it can keep consistently working on making something better. My favorite actual example of this, we released a video of this yesterday, but I'm going to talk about this until people get bored with me: we asked Claude to recreate Claude.ai, our consumer chat application, from scratch. And Sonnet 4.5 just worked overnight. We woke up and it had just done it. This beautiful clone of Claude.ai that works incredibly well.
And my favorite moment from it: there's a feature of ours that people like called Artifacts, where when you ask Claude to make a document or make a webpage, it will make it and render it next to the chat, so you can play with an app that you built live, or whatever it is. And I was saying to the person on my team who was working on this demo, it'd be really cool if Claude could build Artifacts itself. That would be amazing. It's a complex feature. It's pretty hard to figure out. We should try and see if that happens. Then he messaged me two hours later. And he didn't do anything; he didn't intervene. He's just like, hey, Claude just built Artifacts on its own and it's currently testing it live. It's just trying it out itself. And we released Artifacts a year ago. It was artisanally created by a whole bunch of wonderful engineers around me at Anthropic, who were doing all of the software engineering to build this really complicated thing themselves. And then a year later, we released a model, asked it to build Claude.ai, let it go for 12 hours, and it just did it itself. Instead of this giant team of people at Anthropic that worked so hard on this thing, the model is just really capable of biting off really meaty, complicated tasks. This is something that would take me months to do if I did not have Claude. And overnight we sort of looked at it and watched it happen. And that progress, going from a point where it was neat that a model could write a snippet of code a year ago to, oh, it can do a big chunk of the complex developer work that I need to do, it blows my mind. It was a pretty big wow.
Hayden Field
So that took about 12 hours, you said?
David Hershey
That specific one was like a 12-hour thing. We've seen up to 30 hours of continuous dev work.
Hayden Field
Okay, this is what I was going to ask you about, because something that unexpectedly went viral from my own article about Sonnet 4.5 was the 30-hour bit. The fact that Sonnet 4.5 could code autonomously for up to 30 hours with no interruption. I heard one engineer at Anthropic used it to code a chat app that the company, when they talked to me, compared to Slack or Teams. Obviously it was only 11,000 lines of code, so much smaller than Slack or Teams, and it seemed like just an example project. But people online are really excited about that detail and calling for Anthropic to release it. They want to know more about it. So can you give me any details? Was that someone on your team that was testing it? And how impressive was it really, or was it just kind of rudimentary? Give us the deets.
David Hershey
Yeah, yeah, yeah. All of my excitement, and the reason I'm here, is because I've been working on this exact thing. It was someone on my team. His name's Justin. Shout out Justin. You should give praise to him. He's amazing. Recently joined the team. On a side note, he's pretty cool.
Hayden Field
Nice.
David Hershey
This was born out of this thing of, a lot of people have built demos or proofs of concept, and there's this vibe coding trope that, yeah, you can use it to quickly mock something up, but can it really build a real application? Justin really wanted to test that, and so he was experimenting with our Claude Agent SDK, which is sort of a more programmatic version of Claude Code, to some extent. He was testing: can I give a full spec of a complex thing, like an app similar to Slack, to the model and just watch it build? And we did some tinkering and experimentation around it to get it right. But yeah, you asked, was it impressive? It's impressive. It has DMs and threads and channels and a slick search functionality, and you can upload images and GIFs and render them, and multi-user authentication. And Claude, we didn't ask it to, but it implemented a whole bunch of AI users for testing. So if you log in, you can send a message to users like Alice the PM, who will respond back to you with stuff about PM work. It is remarkable. It is not by any means Slack, but if you didn't spend a lot of time thinking about it, you'd look at it and think that was a pretty reasonable productivity app that you would use to chat with your coworkers.
Hayden Field
Wow. What did you guys name it?
David Hershey
I don't think we have a name. I need to ask Justin to give it a name. Actually, no, I lied. He has one: Work Chat. Very boring. We're engineers; we're clearly not the product folks in the org.
Hayden Field
Yeah. Someone who worked at Slack, I think, messaged me and said, you know, our code base is a thousand times that. And I was like, yeah, it's just an example, but I mean, it happened in 30 hours, so it's a big deal. I was also going to ask you what surprised you the most during your own testing of Sonnet 4.5?
David Hershey
I'm just going to continue double-clicking on this thing. The way that it built the app was surprising and really interesting. I'm not surprised that the model is getting better at building complex apps. The thing that was really interesting is the way it does this: it just has a tendency to bite off little pieces that it can handle really well and do that continuously. A lot of times, models in the past would get really eager and ambitious. They would say, I'm going to build this whole thing, and I have these grand ambitions, and they would kind of just meander everywhere trying to do this miraculous piece of work. The thing that was really cool about Sonnet 4.5 is it's just pragmatic. It's like, okay, right now I'm going to test, does image upload work? And then it's going to do that. It's going to spend a little while doing that, but it just bites off one little chunk at a time. And that feels a lot more like what I want a coworker to work like, a collaborator to work like. If I ask you, hey, I need you to go build Work Chat, I don't want you to go off on some crazy escapade trying to make everything magical. I want you to just bite off a piece at a time, show it to me, commit it to git, that kind of thing. It's just a little bit more natural to collaborate with. It feels more natural. And funny enough, I think this is unrelated, but we've been chatting in our internal company Slack with Claude a lot lately, and it's been really natural. It feels just a little bit more like working with a coworker does.
Hayden Field
In terms of the tone.
David Hershey
Yeah, the tone, how it responds, how it acts in Slack, how it participates in a conversation, what it tries to do. It's a little less over the top and eager. That has jumped out a little bit, which was surprising; that's not something that I normally expect. But yeah, it's also been funny, which is a funny thing. It cracks better jokes. It's a little bit more witty, that kind of thing.
Hayden Field
Do you think it also dialed back the sycophancy a bit? I know that's been a big word in this space lately, in terms of being overeager, over-validating. Did you feel like it was more real, that it was less like that a little bit?
David Hershey
I have felt that, and I think it's certainly something we're pressing on. It's one of my least favorite things about every model, when they're sycophantic like that. I think it just gets in the way of doing good work. And it's been a big focus of ours. I think nobody at Anthropic likes that. And also, it's just bad. So I do think we have made meaningful progress, and this model seems to be a little bit more willing to push back. And that's part of that thing of being a natural coworker: someone who actually can tell you when you're wrong.
Hayden Field
When it comes to rebuilding Claude.ai, that's pretty big. Did that worry you at all? Because it's kind of redoing, like you said, work that you all spent months on. Does that worry you about job replacement for engineers, anything like that? What were your thoughts?
David Hershey
We have a great team, and I'm going to frame this two ways. I'm not worried right now. Claude is a collaborator. It works really well with me. It accelerates me. I think it makes our whole team better and faster at writing software. In general, though, and this is not a now thing, to be really honest, watching Claude go for 30 hours does trigger a little bit of, oh my God, that's a pretty different thing. It is a meaningful step change. Anthropic is founded on the principle that this technology is going to be hugely impactful in the world, and part of that is it's going to change how we do jobs. Something like doing a whole week of work for me, that's just meaningfully going to change the industry of software engineering. And it'd be goofy for me to say there's not any amount of, how are we going to work next? How do I incorporate this? What is it? My net-net, though, is I just think there's a ton of room to make us better, to make better software for users, to make the world a better place with this technology. And I'm really confident in that still. But yeah, there's a smidge of, well, we're going to have to figure out some new ways that we build with Claude, and that we operate as software engineers to work with the thing. If it's really going to build the whole app itself, we probably have a different role to play here.
Hayden Field
We need to take another quick break. We'll be right back.
Hayden Field
We're back with Anthropic's applied AI lead, David Hershey. Before the break, we were talking about the capabilities of Claude Sonnet 4.5 and whether he sees a future where this technology might even automate parts of his job. But now I want to ask David where he sees the model falling short and what might be next for AI agents. What are the primary limitations of Sonnet 4.5? What do you wish you could have offered with this release that you couldn't, and what was something you tried to make it do during testing that it couldn't do? The features, the context window, anything else you wish you could offer. And also, what was dumb about it?
David Hershey
In testing, I have a pet thing that I always test on, which is I really like to make Claude play games. I'm accidentally famous for creating Claude Plays Pokémon in the past.
Hayden Field
Oh, that was you? I've heard a lot about that.
David Hershey
Yes, that is my project. And I have tried a lot of other games. I had Claude playing chess this time, and one of my favorite comms people really wants me to make Claude play Catan effectively. And it's really bad at spatial awareness still. This annoys me to no end. It basically doesn't know the difference between left and right and up and down. It's just one of those examples that exists in the world: it can do PhD-level math that I can't, and yet it doesn't really understand that it can't walk straight through a building, which hurts my brain. Models are kind of weird and funny that way. So that one probably jumped out the most. I really wanted Claude to be this great chess player. It's so logical and interesting, and then it doesn't know where its pieces are on the board and where the other pieces are. And it's like, ah, you're almost there, Claude. One day. As for features that are still out there, I have been so in on this model that I have to think about the next level. Honestly, the real answer is I don't know. I can't think of some feature that I wish was there. There's a lot that I want Claude to get better at. So much that I want Claude to get better at. I want it to be able to beat Pokémon one day on its own, and I want to see how it does it. Again, there are all these fields where I know Claude still needs to get better. We talked about legal. I think Claude's becoming a better lawyer, but I don't think it's as good a lawyer as it is a software engineer yet. I still wish we were further along there, and we're making progress. I know the teams that are working on all of these little things, focusing and thinking about all of them. There are so many hills in my mind that we have to climb still.
So there's probably this intersection of: there are a million things I wish Claude was better at, but not one that I am personally banking on. Except for Pokémon, which, one day the research team will listen to me that it matters. They haven't quite listened to me yet.
Hayden Field
Well, when I went on a reporting trip to London and visited the office, it was all I heard about. So it's making waves somewhere.
David Hershey
That's good. Well, it makes people happy here. It's a fun experiment. I don't think it has crested the peak of what we need to train Claude to be good at yet, unfortunately. One day.
Hayden Field
Let's talk a little bit about what's next. So talk to me about why you think Anthropic is pursuing such gains on the AI coding front and how it stacks up to competition. Obviously, in testing, you had to, you know, take into account what the market looks like right now and what other models can do. You know, how did you think Sonnet 4.5 stacked up during testing? And you know, what stuck out to you there?
David Hershey
The coding market is really important for a lot of reasons. It's really well positioned for people to build with our models. It's a place where you can have a huge impact. People have figured out great ways to integrate models into how they write code, more so than in essentially any other industry. So if there's a place right now where you can make a huge difference by continuing to make models better, I think coding is probably the best use case. It's a huge focus for us. We want to keep helping people who rely on our models to write code get more out of them. And our belief, having talked to a lot of customers and done a lot of testing, is that this is the best coding model in the world; I'm pretty confident of that. We have a lot of benchmarks and other things to prove that out, and for all of the people here, in our early testing, it has just been a huge step-function improvement in what it feels like, within Claude Code or other surface areas, to develop and write code. From my perspective, this is a really big step function. And personally, just speaking for myself, this is the most noticeable change and improvement I've seen since we released Sonnet 3.5 last year, which I think is a funny coincidence. I don't think we necessarily knew that these 0.5 Sonnets were going to be such special models for us. But mechanically, from some mix of vibes and numbers and things I've seen, this feels like a really, really meaningful step-change improvement. One of my favorites, just to call it out: I love the team at Cognition. I spent some time working with them, and they put out a post yesterday about how much their product, Devin, improved with the model, and some of the work they did to make that product better, which I loved.
It's a cool blog, and that validation from the customer really matters. It's a huge jump on a benchmark, and I really haven't seen a jump like that for them in a while. I just think that's really cool. For the right product, built the right way around this model, I have a feeling this is a huge step-function change in the ability to write code, and I'm quite confident it's going to be the best model in the world for people like that.
Hayden Field
Yeah, that's really interesting, because I usually don't mention benchmarks much in my own reporting. Sometimes they can be subjective, and sometimes they're created by the companies that are testing their own models, in very specific areas with very specific sets of questions. But usually what I hear from engineers is that they go based on the feeling, and based on what things it can do that it couldn't do before in their own anecdotal testing. So that's why it's interesting that you have your own pet projects that you like to test on, and you've seen changes there, not just in the benchmarks specifically. I also wanted to ask you about Anthropic chasing consumers versus enterprise versus governments. It seems like all AI companies right now are working with those three tiers, and I think it's probably because at least two of them are more concrete areas for potential profits. So with Sonnet 4.5, and all the stuff you're working on in general, I know you work mostly with the enterprise side of things, but is there a specific slice that Anthropic is chasing more right now, and why?
David Hershey
I think they're basically all really important. And one of the things that we have luckily seen, and I think is true for all of the labs, is that when we make our models generally smarter, they serve all of those segments. They serve enterprises who are building with us. They serve consumers who want to use our chat app or write code. And they're useful for the public sector, too. Just to tie a little link here, I've worked with customers and startups that have become big consumer hits. Lovable is a great example, where, from my perspective, that's an enterprise customer that I and our team spend a bunch of time working with and helping be successful, but then they turn around this thing that everybody can use to build just-in-time apps to service every part of your life. So this kind of bends back on itself, where in reality our focus is building really great models that are safe, and we have seen time and again that that results in success in all of these places. And, speaking a little bit from my personal perspective instead of Anthropic's, I think there are just a lot of ways you can make progress by building great models. Whether it's helping enterprises or helping consumers or helping the government, these all have a way of warping back in on themselves, where the thing we really do that makes a difference is make great models, and there are a lot of different ways you can impact different segments with that.
Hayden Field
And when it comes to consumer use cases, OpenAI seems to be pushing pretty hard into that right now. They launched Pulse last week, and then yesterday they debuted their instant buy button. I wanted to ask where you think Anthropic is planning to meet consumers. Do you think it's more likely you'll be reaching customers with Claude directly in the future? Or is it more likely customers will use Claude through something like Cursor? And that can also apply to your startup clients too, right? Where are you seeing people really find Claude?
David Hershey
I definitely think we are growing the presence of the applications we build to interface with consumers, and Claude Code is a great example of this. I know it's not the traditional consumer product, but there's a very prosumer-y thing that we've captured with Claude Code, and I think we've demonstrated that we do have some of the muscle to capture, for some sets of clients, a thing they love, something that sparks the excitement of a consumer product people love using. And obviously we would love to do that 50 times over. We would love to invent a bajillion beautiful uses of Claude that consumers love, and we're investing in trying to find and build great experiences that help people do important things with Claude. We have this constant tinkering. We have this Imagine demo that came out. It's just a limited-time thing; I don't think that's the consumer-facing product we're going to launch, but it's part of a portfolio, I guess, of how we think about it. We need to keep building and trying and innovating and inventing products, seeing if there's something there, and we have a good opinion of what models are capable of, which gives us an interesting lens on what we build. So it's a big focus. It's important for us to build direct relationships and help people have awesome ways to interface with Claude. That said, I'm a little biased as a customer guy, but I wouldn't ever bet, and I don't think we should, and I don't think Anthropic is betting, against the ecosystem of people who are trying to build amazing products. I think it'd be really silly to claim that that's all our market to grab. I just don't think it is. If we build great models, then there's an incredible ecosystem here in Silicon Valley and abroad building amazing products, and that's probably bigger upside than our first-party applications will ever be. So I think that will always be a huge focus of ours.
Hayden Field
Awesome. And then last question for you, back to the AI coding market, how important it is, and how much most AI labs are chasing it right now. We talked a little bit about vibe coding earlier, and here at the Verge a lot of us have tried it, often to no avail. We had some success with really simple stuff, but not a lot of success with building large-scale applications. A couple of us did, but not as much as we actually expected. Did you see a big change in terms of testing out vibe coding with Sonnet 4.5, or was it the same as before?
David Hershey
I have noticed it in my own personal coding with every model release. I am trained as a software engineer, so it's like cheating; I'm not the best guinea pig for vibe coding, but I still do it on the side sometimes. And you notice meaningful improvements in how much it can churn out before it goes off the rails, and how much you can trust it. I think this is actually an interface problem, and there's something funny about AI, which is that it has this tendency to outgrow interfaces really fast. If you look at the history of coding with AI, first there was Copilot with ghost text, where Copilot would automatically complete your code. For a while, if you were a developer, you would go to us or ChatGPT or wherever and ask it to write some code for you in a browser window, and then you'd copy and paste that into your editor. Cursor figured out how to bridge those two things so they could sit side by side. A lot of people have started building this sort of agent that lives alongside your IDE and lets you build things. I don't think any of that is quite the thing we need for everybody to build production applications, though. There's some interface out there, and I actually do think Sonnet 4.5 is the first model that could be the thing where anybody could build a production-ready application. I've seen enough evidence of how it can build complex applications when left to its own devices. I've seen it deploy a complex application to AWS fully autonomously and do a security audit on it. Those are both really incredible things that make me think this is the model that could cross that chasm and get to the point where anybody can make something production-ready. I have a feeling, though, we need one more interface that isn't Claude Code and isn't Cursor. The next step past that needs to happen so that it's more obvious to everybody, instead of having to try to figure out if you're on the right path vibe coding.
Hayden Field
Well, thanks so much. I'm so glad we got to talk and you know, appreciate you coming on with so last minute. I really appreciate it.
David Hershey
Yeah, it was fun. I appreciate it. Of course. It was really nice to be on.
Hayden Field
I'd like to thank David for taking the time to speak with me, and thank you for tuning in. I hope you enjoyed this episode. If you'd like to let us know what you thought about this show, or what else you'd like us to cover, drop us a line. You can email us at decoder@theverge.com; we really do read every email. Or hit me up directly on X, Bluesky, or Threads; I'm Hayden Field on all platforms. Decoder also has a TikTok and an Instagram, and now also a YouTube channel. Check those out at Decoder Pod. They're a blast. If you like Decoder, please share it with your friends and subscribe wherever you get your podcasts. Decoder is a production of the Verge and is part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt. Our editor is Ursa Wright. The Decoder music is by Breakmaster Cylinder. See you next time.
Date: October 2, 2025
Host: Hayden Field, Senior AI Reporter at The Verge (guest-hosting for Nilay Patel)
Guest: David Hershey, Lead of Applied AI Team at Anthropic
This episode dives deep into the current reality, challenges, and future potential of AI agents—autonomous AI systems capable of completing complex multi-step tasks over hours or even days. Hayden Field talks with David Hershey of Anthropic about their latest breakthrough model, Claude Sonnet 4.5, and what it truly means for the field. The conversation covers where agentic AI excels today, where it still struggles, surprising industry adoption stories, model limitations, and what the near future may hold.
Key topics:
- The legal sector as a standout for rapid AI adoption
- Other unexpected surprises that appear with model releases
- Spatial awareness as a major remaining stumbling block
- Continuing gaps in highly specialized fields
The episode spotlights both the rapid forward leaps and the “long tail” of unresolved quirks for agentic AI. Anthropic’s Claude Sonnet 4.5 is a watershed moment for autonomous software generation, but the broader vision of agents that seamlessly tackle arbitrary (and mundane) digital or real-world tasks is still on the horizon. Hershey’s optimism is tempered with realism; as fast as progress arrives, new challenges always emerge. Most meaningfully, the discussion frames AI agents less as job replacements and more as fundamentally new kinds of co-workers—at least for now.