Transcript
A (0:00)
Today on the AI Daily Brief, a new study about agent autonomy in practice from Anthropic, and before that in the headlines, Google Gemini now allows you to create music. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Assembly AI, Robots and Pencils, and Blitzy. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. To learn about sponsoring the show, send us a note at sponsors@aidailybrief.ai. Lastly, a reminder once again about our latest ecosystem projects. Claw Camp, the free self-directed program where you can learn how to build agents and agent teams using OpenClaw, is kicking off its first sprint right now. So if you want to learn to be an agent boss with nearly 3,000 friends, come join us. You can find that at campclaw.ai or from the AI Daily Brief website. If you are a company trying to figure out OpenClaw and other agent strategies, check out enterpriseclaw.ai. And more broadly, if you are just interested in keeping track of these types of educational programs we're doing, free and premium, you can find information about that at aidbtraining.com. One thing happening there: we actually have a premium program survey as we are trying to figure out exactly which premium programs to launch first. If you are an enterprise or premium buyer who is interested in that, again, you can check it out at aidbtraining.com. Now, with that barrage of URLs out of the way, let's talk about Gemini and music. Today we kick off with Google's continuing quest to have AI products in every single multimodal category. The latest news is that the company has launched an AI music generator called Lyria 3. It's the latest version of DeepMind's music generation model and allows users to generate music clips based on text, image, or video inputs, which is pretty unique compared to something like Suno, which is of course just text-based input. Lyrics can be generated in eight different languages, including German, French, Spanish, and Hindi. The feature can be accessed directly in the Gemini app by switching to a music output. It's also being added to YouTube's Dream Track tool to allow creators to quickly generate soundtracks for YouTube Shorts. Each track is accompanied by custom cover art generated by Nano Banana. Now, previous versions of Lyria have only been available through Google Cloud's Vertex AI platform, so this is a big expansion in access. However, there is a pretty significant limitation, which is that these are 30-second clips. The model itself isn't really capable of building on top of the initial generation, so this feature won't be useful to generate entire songs. However, and it's pretty clear that this is the use case they're imagining initially, this could be extremely useful for generating background music for YouTube Shorts or fun, interactive, personal types of song messages. Indeed, this appears to be what Google had in mind, with them writing, the goal of these tracks isn't to create a musical masterpiece, but rather to give you a fun, unique way to express yourself. And really, while it would be tempting to compare this to Suno, this is actually more of a social feature than anything else. 
We've talked in the past about how one of the really interesting things about Suno is the extent to which it is used not for any sort of professional or work music generation, but just as a fun interactive mode, and Lyria really seems to be doubling down on that. Google has also embedded their SynthID audio watermarks into the music so the tracks are easily flagged as AI generated. A lot of the discourse around first tries is that this is indeed not Suno, and that Suno's generations feel much more polished and musically complex. On the flip side, others point out how Google just keeps adding new arrows in its multimodal quiver. Aaron Upright comments, people talking about OpenAI versus Anthropic, and Gemini just over here quietly getting more powerful. People underestimate the importance of an easily accessible multimodal platform when it comes to adoption. Chai and Zhao sees the future, writing, video-to-audio alignment is the real flex here. Generating lyrics and vocals that actually sync with visual cues in real time is a massive multimodal serving challenge. Lyria 3 probably relies on some crazy high-throughput infra to keep the latency low enough for creative workflows. Ultimately, I think we are just scratching the surface on what role generated music is going to play, and Google is now firmly in that game as well. Next up, a bit of a controversy that ended up being less of a controversy than it seemed, but still taught us some interesting things about the state of competition. A change in Anthropic's terms of service set off a tinderbox of complaints from those using Claude to power their OpenClaw agents. This week Anthropic changed their policies, now stating that using OAuth tokens obtained through Claude Free, Pro, or Max accounts in any other product, tool, or service, including the Agent SDK, is not permitted. Now to clarify, OAuth tokens are kind of like API keys for regular Anthropic subscriptions, allowing users to access AI models through third-party apps. And of course, a lot of the attention is around the people who have been using their Claude Max accounts to power their OpenClaws. Indeed, Alex Finn writes, this is going to piss off a lot of OpenClaw users paying $200 a month. Tweets like this one were too numerous to count. Hubert Leficki writes, Anthropic is in an active self-destruction mode now. First they went after tokens you already paid for, blocking use in non-Claude Code apps. Then they sent their lawyers after developers for supposed branding infringement. And now this. OpenCode, Gemini CLI, Codex CLI are all legitimate coding agents with comparable features and abilities, but Anthropic are behaving like they're still the only player on the block. Now, all of this prompted Anthropic's Thariq Shihipar to comment, writing, apologies, this was a docs cleanup we rolled out that's caused some confusion. Nothing is changing about how you can use the Agent SDK with Max subscriptions. He added that the intention isn't to block personal tinkering, but rather to force third-party businesses to pay for usage through the API. Unfortunately, the confusion only continued with that unclear clarification. Podcaster Felix Javan wrote, brother, can you just tell us whether we can use OpenClaw or not? And it seems like if you're using it to build your own personal agents, the answer is yes. But the incident raised a ton of discussion about how long the big AI labs will continue to support these modular AI use cases. 
Some tried switching providers, only to find that Google had already banned OAuth for the use case. Richard Holcomb wrote, I feel like getting banned by Google for using Antigravity OAuth with OpenClaw is a rite of passage. I was already not impressed with Gemini, preferring Anthropic and OpenAI, but now I really have a bad taste in my mouth. Colin Darling, however, pointed out, everyone upset about Anthropic's update to their terms would be wise to read the OpenAI and Google Gemini terms while they're at it. I'm bummed out too, but Anthropic is late to this party, not leading it. In any case, the controversy quickly faded, but there is a lingering question about walled gardens and what they're going to mean for AI going forward. Next up, more news in the AI wearables category. Meta has revived plans to release a smartwatch as part of their AI device lineup. Rumors of a Meta smartwatch started circulating in late 2021, complete with leaked photos of a prototype. The device was given the internal codename Malibu and featured two cameras, one in the dial for video conferencing and another on the underside of the watch. The idea was that users could quickly remove the watch to take a photo. Another big part of the design brief was the ability to read nerve signals in the wrist, allowing the device to be used as a controller. This concept has since gone on to feature in Meta's haptic control wristbands for their Orion smart glasses prototype, which was unveiled in late 2024. That said, by the summer of 2022, Project Malibu was killed off and Meta shifted focus to smart glasses as their big wearable play. Now The Information reports that Meta has revived the smartwatch under the codename Malibu 2. The watch is said to include health tracking features and a built-in Meta AI assistant. Sources said the revival effort came out of a project strategy meeting late last year. Executives are reportedly concerned about a bloated product lineup for augmented reality glasses, so have delayed some products to focus on a limited number of concentrated bets. Among them is a new version of the Ray-Ban Displays, which is expected later this year, as well as a pair of AR glasses which could arrive in 2027. The smartwatch is planned for release this year, putting Meta in direct competition with Apple and Google in the category. Now, one thing to watch for will be how far each company goes in making the smartwatch an integral part of their wearable AI stack. Earlier this week we covered rumors that Apple was working on a trio of new AI-enabled devices, namely smart glasses, a pendant, and a camera-equipped version of the AirPods. That report mentioned that a camera-equipped version of the Apple Watch had been passed over as an AI device, with testers reportedly finding the prototype impractical due to clothing sleeves obscuring the camera. Ultimately, we don't know how Meta is thinking about the Malibu 2, but they are very clearly focused on this wearable category as a place for their AI strategy. Next up, another follow-up in the Grok 4.20 public beta. xAI has announced a new version of Grok Heavy, and this one goes to 16. The big innovation with Grok 4.20 was the inclusion of four subagents to debate responses before providing a final answer. Opinions were a little mixed on whether this was actually a useful feature, but it's an interesting experiment if nothing else. Grok Heavy turns the subagent count all the way up to 16 in a bid to either get better answers or at least burn through a ton of tokens getting an output. 
xAI community promoter Tetsuo shared an output from the query, how does chaos birth cosmic order? The agents debated the response for a little over a minute and then delivered a 700-word report using almost 900 references. It's difficult to judge accuracy or usefulness based on such a strange, subjective question, but the output certainly has a ton of detail and is an interesting read. If nothing else, these continue to be interesting experiments and worth watching for that reason alone. Lastly today, Chinese models. Lindy founder Flo Crivello recently shared a thread about the difference between Chinese models on benchmarks and Chinese models in the real world. He wrote, by far our biggest cost at Lindy is inference, so believe me when I say we've looked at these models very closely and continue doing so. Them actually delivering on the claims would make a material difference to the business. But every time we've evaluated them, we found the same thing: their real-life performance for agentic behavior and outside-of-coding use cases falls extremely short of what they show on the evals. I think the industry consensus is right, he continues. These Chinese labs are, one, distilling frontier models, duh, which leads to a more shallow intelligence; two, training for evals; three, potentially stealing weights. Not saying these models will always be bad or that these labs are completely incompetent. They're doing a fine job, but it's delusional to think they're actually at Sonnet and Opus level. They're still at least one generation behind. Take the evals with a huge grain of salt. That, I think, is a lesson that is relevant not just for Chinese labs, but also whenever you see a new Western model that has high benchmarks. Ultimately, you've got to just dive in and test these things out for yourself. And with that, we will end today's headlines. Next up, the main episode. Sure, there's hype about AI, but KPMG is turning AI potential into business value. They've embedded AI and agents across their entire enterprise to boost efficiency, improve quality, and create better experiences for clients and employees. KPMG has done it themselves. Now they can help you do the same. Discover how their journey can accelerate yours at www.kpmg.us/agents. That's www.kpmg.us/agents. If you're building anything with voice AI, you need to know about Assembly AI. They've built the best speech-to-text and speech understanding models in the industry, the quiet infrastructure behind products like Granola, Dovetail, Ashby, and Cluely. Now, as I've said before, voice is one of the most important modalities of AI. It's the most natural human interface, and I think it's a key part of where the next wave of innovation is going to happen. Assembly AI's models lead the field in accuracy and quality, so you can actually trust the data your product is built on. And their speech understanding models help you go beyond transcription, uncovering insights, identifying speakers, and surfacing key moments automatically. It's developer first: no contracts, pay only for what you use, and it scales effortlessly. Go to assemblyai.com/brief, grab $50 in free credits, and start building your voice AI product today. Today's episode is brought to you by Robots and Pencils, a company that is growing fast. Their work as a high-growth AWS and Databricks partner means that they're looking for elite talent ready to create real impact at velocity. 
Their teams are made up of AI-native engineers, strategists, and designers who love solving hard problems and pushing how AI shows up in real products. They move quickly using Roboworks, their agentic acceleration platform, so teams can deliver meaningful outcomes in weeks, not months. They don't build big teams, they build high-impact, nimble ones. The people there are wicked smart, with patents, published research, and work that's helped shape entire categories. They work in velocity pods and studios that stay focused and move with intent. If you're ready for career-defining work with peers who challenge you and have your back, Robots and Pencils is the place. Explore open roles at robotsandpencils.com/careers. That's robotsandpencils.com/careers. Want to accelerate enterprise software development velocity by 5x? You need Blitzy, the only autonomous software development platform built for enterprise codebases. Your engineers define the project, a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire codebase. Then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work autonomously, generating hundreds of thousands of lines of validated, end-to-end tested code, with more than 80% of the work completed in a single run. Blitzy is not generating code, it's developing software at the speed of compute. Your engineers review, refine, and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint, accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com. That's blitzy.com. Today we're discussing a new study from Anthropic that, while nominally about agent autonomy, is actually much more about how people are using AI agents in practice. Welcome back to the AI Daily Brief. Today we are looking at a new Anthropic study on agent autonomy. It's called Measuring AI Agent Autonomy in Practice. And in many ways it ends up actually being a case study in how agent behavior is changing. After reading it, I couldn't help but feel like it was a profile of a changing market, where more and more of the tasks are moving outside of coding or engineering, and more and more of the agentic work is being done by people who are not themselves engineers. Now, to set this up, I think it's useful to have as a comparison the most frequently discussed study on agent autonomy. That is, of course, the METR study, the chart of which I'm sure you've seen before, that measures AI's ability to complete long tasks. The metric that they created is basically a measurement of the duration of a task that AI can complete at a certain level of success. It is not, and this is something that people frequently get wrong, a direct measure of how long an AI agent can work for. Instead, it is a measure of the duration of tasks as it would take a human. So when, for example, GPT 5.2 High comes in at five hours, that's not that GPT 5.2 High took five hours to complete a task, it's how long that task would have taken a human. What's more, METR has two success thresholds, 50% success and 80% success, neither of which would be sufficient performance for a real-world context. In other words, you're not going to keep an employee around who completes tasks at a 50% success rate. Still, I've always thought that this METR metric was really valuable. In my estimation, it doesn't matter so much whether 50% or 80% success is the core number. 
It's that it's consistent and applied consistently over time to different models. So ultimately, what is this trying to get at? Well, it's trying to measure agent autonomy. And so why does autonomy matter? Autonomy matters because it shapes what agents can do. The more autonomous an agent is, and the greater its capability to complete long-duration tasks with high success rates, the wider and more complex the array of use cases it can be valuable for. That matters on an individual level, in terms of what work you can outsource to an agent; on an org level, in terms of which sets of tasks or which entire functions can be identified for automation; and on a societal level, as it has a big impact when it comes to the job disruption conversation. Yet despite METR being a very valuable and oft-cited metric (indeed, last year during the height of the bubble times, people joked that this chart was carrying the entire industry on its back, as it was the one thing that suggested there was no plateau in progress, which was maybe the chief piece of evidence that the bubblists were looking for), there are, of course, limitations to the methodology. As Anthropic puts it, the METR evaluation captures what a model is capable of in an idealized setting with no human interaction and no real-world consequences. And that, of course, is not how people actually use agents in practice. To understand how people use agents in practice, one of the best places to look is Claude Code. For all intents and purposes, I think one can argue that Claude Code is the first agent with product-market fit. In fact, many people have noted that Claude Code is better thought of not as a coding tool per se, but instead as a code-enabled general-purpose agent. And that brings us to Anthropic's study, Measuring AI Agent Autonomy in Practice. Now, although Anthropic has access to pretty unique data in this regard, there are still some challenges. First of all, there's the question of what is the definition of an agent. Since this is a constant source of debate, Anthropic decided to go with a definition that is, as they put it, conceptually grounded and operationalizable: an agent is an AI system equipped with tools that allow it to take actions. As they point out, studying the tools that agents use tells us a great deal about what they are doing in the wild. In terms of sources, they pulled from the public API as well as Claude Code. And going back to this idea of tools, for the public API data they say, rather than attempting to infer our customers' agent architectures, we instead perform our analysis at the level of individual tool calls. This simplifying assumption, they write, allows us to make grounded, consistent observations about real-world agents, even as the context in which those agents are deployed varies significantly. The limitation they note is that they have to analyze actions in isolation rather than understand how those individual actions combine into a larger whole. The second source of data is Claude Code, and what makes Claude Code super valuable for this study is that because it is their own product, they can understand an entire agent workflow from start to finish. The challenge, of course, is that it doesn't have the same diversity of use cases necessarily as their API traffic. Now, one last note on the methodology. 
When trying to figure out how long agents actually run for without human involvement in Claude Code, they're using turn duration, basically how much time elapses between when Claude starts working and when it stops. One note they make is that when it comes to the average, most Claude Code turns are very short. The median turn lasts around 45 seconds, and that's been fairly consistent over the past several months. Instead, then, they look at the signal at the very end of the long tail, basically the 99.9th percentile turn duration, with the argument being that these are the most advanced users, or at least the most advanced uses, and in that way are more likely to reveal what the end duration of the capability set really is. So looking at that 99.9th percentile turn duration, there are two really interesting phenomena over the past few months. In the period between October and January, basically from when Sonnet 4.5 launched through when Opus 4.5 launched in November, average turn duration at that percentile jumped from 25 minutes to 45 minutes. Interestingly, they note that the increase is smooth across model releases, suggesting that autonomy is not purely a function of model capability. And indeed, I think that's one of the big themes of this research: when we try to understand agent autonomy, we have to think beyond just the model to the entire context in which a model operates, including the human interactive context. The second really interesting period in this chart is the period over the last six weeks or so, when there was actually a bit of a dip backwards from the peak of over 45 minutes down to something that's closer to 40. They identify two theories for why what was a previously pretty smooth curve has now leveled out and in fact gone down a little bit. The first is a shift in what projects people were using Claude Code for. The argument is basically that over the holidays, people had sort of broader-ranging, exploratory things that they were doing for their own gratification or hobbies, whereas when they came back, they had, as they put it, more tightly circumscribed work tasks. The second piece, however, is that between January and mid-February, the Claude Code user base doubled, which is obviously a phenomenon that we've been tracking closely here. A doubling like that is naturally going to bring with it a more diverse user base that's going to reshape the distribution a little bit. And indeed, maybe the most interesting thing about this study to me is not just the raw measure of capability, but the human interaction measures. A lot of this story is the difference between new users and power users. One of the interesting findings is that users at the beginning of their Claude Code journey use the full auto-approval features less than more experienced users. New users use full auto-approval roughly 20% of the time, which roughly doubles to 40% for more experienced users. Claude Code's default settings require users to manually approve each action, and so Anthropic suspects that what we're seeing is a steady accumulation of trust. At the beginning, you approve things each time, and then as you dial in your settings and you start to learn to trust the model, you give it that auto-approval more frequently. At the same time, approving actions isn't the only way that people supervise Claude Code. Users can also interrupt Claude while it's working to reorient it or give it feedback, and that kind of follows the opposite pattern. 
Newer users interrupt Claude around 5% of the time, while more experienced users interrupt it around 9% of the time, almost double. Now, one part of this might just be a shift in where people put the burden of oversight. If new users are approving each action before it's taken, maybe they don't need to interrupt Claude as much, whereas when those experienced users use auto-approval more liberally, there's more of a context for them to step in. However, there also might be a sort of learned experience here as well. They write, the higher interrupt rate may also reflect active monitoring by users who have more honed instincts for when their intervention is needed, with the idea being that the new users simply don't know when to intervene as much. I think one comparison here is that if you view AI as sort of a junior employee, it earns trust over time. That's the shift from the 20% to 40% auto-approval rate. But as you get more comfortable with it, you also intervene more, checking in on the work as it's happening and reorienting to make sure you get the most out of things, rather than just waiting to see the end product to judge its success. Now, although these measures are about the human intervention, this is not a static number across models. In other words, model capability does impact this. Anthropic writes that from August to December of last year, as Claude Code's success rate on internal users' most challenging tasks doubled, the average number of human interventions per session decreased from 5.4 to 3.3. Basically, as the models get better, users grant Claude more autonomy and achieve better outcomes while needing to intervene less. Now, when it comes to autonomy, we're talking about an interaction set in a conversation between the model, the harness (Claude Code), and the humans using that model. Human intervention is only one of the directions in which autonomy can unfold in practice. Claude, as they write, is an active participant too, stopping to ask for clarification when it's unsure how to proceed. Anthropic found that as task complexity increased, Claude Code would ask for clarification more often, and more frequently than humans actually chose to interrupt it. For example, for turns where there was high goal complexity, humans interrupted Claude 7.1% of the time, while Claude asked for clarification more than double that, 16.4% of the time. That compares to minimal goal complexity, where humans interrupted 5.5% of the time, with Claude asking for clarification 6.6% of the time. In other words, the gap between how much humans intervene and how much Claude asks for clarification increases alongside the complexity of the task. However, these aren't exactly direct measures, as humans interrupt Claude and Claude interrupts itself for different reasons. The number one reason that humans interrupt Claude is to provide missing context or corrections. That's 32% of the time, about a third. 17% of the time, it was because Claude was slow or hanging, with every other reason being much less frequent. In terms of when Claude stops itself, the most common reason, at a little above a third at 35%, is to present the user with a choice between different approaches. Which is interesting, because that's not really a knock on its own autonomy, in the sense that it doesn't necessarily need that information to proceed, as it could theoretically just make the decision for itself, but a way to better align with humans up front. 
Now, the one other really interesting chart is the chart of which domains agents are deployed in. As you might expect, especially given that this is anchored by Claude Code, software engineering represents around half of the tool calls overall, and although the other categories are all below 10%, they kind of read like a map of where agentic automation is likely to come next. Back office automation is at number two at 9.1%, followed by marketing and copywriting at 4.4%, sales and CRM at 4.3%, and finance and accounting at 4.0%. It is notable that even at this early stage, with coding and engineering tasks being the clear breakout, you're still already seeing more than 50% of tool calls, in other words, more than 50% of agentic use cases, being outside of that software engineering domain. This is a pretty simple study overall, but a really valuable complement, in my estimation, to the METR study, as it moves away from the realm of the theoretical and into the realm of what people are actually using agents for and how they're actually interacting with them. There are a few interesting implications that people picked up on. David Hendrickson wrote, what's most surprising from the paper is that real-world AI agents are currently given much less autonomy than they could technically handle. In other words, we had to go to the 99.9th percentile to really see what Claude could do, despite the fact that the median turn is just 45 seconds. We've talked a lot on the show about a capability overhang, and it looks like this is another example of that in practice, even with some of the most advanced tools in the space. Another interesting takeaway is about a shift in our thinking of autonomy, from purely based on model capability to this more complex view of model capability plus human interactive state. Yang Ri Su writes, autonomy is not just steps taken, it is permission, scope, and ability to change state. The other thing that people are exploring is, based on all this, what they actually want the interactive mode to look like in the future. Richieonx, for example, writes, need a Claude Code mode that isn't exactly dangerously skip permissions but can skip pointless do you want to proceed questions, and at the same time doesn't nuke my entire database and family tree. Lorenzo responds, what you want is competent autonomy. Claude can skip pointless prompts while respecting blast radius boundaries, so dev stays sane and prod stays intact. Now, one thing to watch for is how much the emphasis in the next set of developments is more improved interactions or a totally different paradigm of long-duration autonomy. In a recent podcast with Lenny, OpenAI's Sherwin Wu argued, as the AI Future Brief put it, that the next leap in AI isn't just smarter models, but long-duration autonomy. While today's tools are optimized for short bursts, tomorrow's tools will be agents you dispatch for six-plus hours of independent work. Right now, as Anthropic shows, that certainly isn't how people are using these tools, but it does appear that things are evolving fast. Overall, a very valuable study and a great way to see what's happening in practice. For now, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always. And until next time, peace.
