Transcript
A (0:00)
Today on the AI Daily Brief, we've got an exciting new model in Claude Sonnet 4.6, plus a new public beta from Grok. Before that, in the headlines, Apple is getting in on the AI wearable game. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Mercury and Blitzy. If you are looking for an ad-free version of the show, you can find that over on patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. Ad-free is just three bucks a month. To learn about sponsoring the show, or really anything else about the AIDB ecosystem, go to aidailybrief.ai. Quick updates on a couple of the projects that we've talked about this week. It seems that you guys are in fact definitely interested in OpenClaw, as nearly 2,000 of you have signed up for Claw Camp in the first 36 hours. I've also seen a ton of excitement from some really excellent companies for an enterprise executive sprint around OpenClaw and agent building more broadly, which you can of course find at EnterpriseClaw AI. And lastly, on the jobs front, I am still looking for the AIDB Clarkitect, someone to help me keep track of all of the OpenClaw resources out there and then actually build the new capabilities into products for this ecosystem. Like I said, all of this information and all of these links are available at aidailybrief.ai. Earlier this week we discovered that Apple will be holding a product announcement event at the beginning of March, and now we are getting stories that the company is ramping up work on multiple wearable devices for the AI era. Bloomberg's Mark Gurman reports that development is being fast-tracked on a trio of AI wearables. Apple apparently plans to create a pair of smart glasses, a pendant that can be worn as a pin or a necklace, and camera-laden AirPods with expanded AI capabilities. 
The three devices are all intended to connect to an iPhone and provide a hands-free interface for AI Siri. The pendant and AirPods are intended to be the low-end offering. Both will have low-resolution cameras that can provide context to the AI assistant, but which won't be good enough for taking pictures or recording video. The design brief is simply to offer a cheap, always-on camera and microphone to function as Siri's eyes and ears. No word on when to expect the pendant, but the camera-equipped AirPods have been in development for some time and could be on shelves as early as this year. The smart glasses are designed to be more upscale and feature-rich, competing directly with Meta Ray-Bans. Several prototypes of the smart glasses have been distributed internally after significant progress in recent months. The glasses won't feature a display, but will have speakers, microphones and high-resolution cameras. Apple is hoping their build quality and camera technology can give them the edge against Meta and their current domination of the nascent category. Reportedly, December is the target for the start of production, with a public release next year. Now, between this, the March 4th announcement, and of course the absolute proliferation of Mac Minis as the device of choice for OpenClaw agents, there has been a huge discussion on X this week regarding Apple's AI strategy. Many shared this chart of AI capex going parabolic at rival big tech firms, while Apple is actually guiding a 19% drop in capex. The tone of the conversation was summed up by Akash Gupta, who said, did Apple just luck into the smartest AI strategy in tech? The argument, of course, is that while the hyperscalers spend hundreds of billions of dollars on data centers with very difficult, or at least long-term, ROI calculus, Apple is in the meantime shipping Mac Minis as fast as they can make them and licensing Google's models for a billion dollars a year. 
Basically pocket change compared to the cost of building their own training cluster for an in-house model. If Apple can actually get the trifecta of AI wearables to market alongside a functional version of AI Siri, maybe things start to look better for them. CEO Tim Cook seemed to imply that this is in fact the strategy during an all-hands meeting last week. He reportedly told staff that Apple is working on new categories of products powered by AI, remarking, we're extremely excited about that. The world is changing fast. And despite skepticism of AI wearables in the past, Ben Pouladian summed up the vibes when he posted, I'll take all three, where should I leave my credit card? Next up, another interesting story from the earnings call cycle. During last week's earnings call, Spotify co-CEO Gustav Soderstrom said that his company's top developers are pretty much done writing code by hand. He reported that his most senior engineers are saying that they haven't written a single line of code since December. Soderstrom gave a concrete example of a developer who gave Claude instructions for a bug fix or a new feature over Slack on their phone during the morning commute. Spotify's internal platform allows them to receive the code, have it validated, and push it to production, all before they arrive at the office. Soderstrom said that he believes this is just the beginning of the AI coding era, with much greater efficiencies yet to be unlocked. He emphasized, this is a big change. It is real, it is happening fast. We are retooling the entire company for this age, and it's going to be a lot of change. But as I said before, change, if you capture it, is opportunity. Moving over to chip world, Meta has signed a massive partnership with Nvidia, including a commitment to buy millions of AI chips. The multi-year strategic partnership will involve deployment of current-generation Blackwell GPUs as well as the next-generation Rubin chips. 
In addition, Meta will use standalone Grace CPUs as well as Nvidia's next-generation networking equipment. Now, big-tech-company-partners-with-Nvidia stories are basically an everyday occurrence at this point, so what makes this one interesting? The story here is really the scale. At this stage, the largest data centers contain several hundred thousand GPUs, and you can likely count those on one hand. The purchase of millions of chips implies that Meta plans to build multiple new data centers at world-leading scale over the coming years. Nvidia only produced around 5 million AI chips last year, so an order of this size could be a strategic move to corner the market on the leading AI chips. Analysts said the deal likely stretches into the tens of billions and will soak up a good portion of Meta's $135 billion capex plan for 2026. The deal also isn't just about AI training and inference, with Meta planning to migrate large portions of their social media recommendation engines to Nvidia silicon. For Meta, it's an interesting commitment to just paying Nvidia for their technology rather than trying to find alternatives. Each of the large AI companies has spent the last year spinning up custom silicon projects or partnering with AMD in an attempt to avoid the Nvidia tax, with Meta pursuing both of those avenues last year. This deal would seem to imply they've settled on Nvidia as their major supplier, but it could also simply be about volume, with Nvidia being the only chipmaker with a proven track record of delivering chips at this scale. Announcing the deal, Jensen Huang said, no one deploys AI at Meta scale, integrating frontier research with industrial-scale infrastructure to power the world's largest personalization and recommendation systems for billions of users. 
Through deep co-design across CPUs, GPUs, networking and software, we are bringing the full Nvidia platform to Meta's researchers and engineers as they build the foundation for the next AI frontier. Summing it up pretty simply, AMitiz Investing writes, the AI data center buildout cycle is simply not over. Speaking of which, software sellers might be exhausted as the stock market levels out. Both major indices eked out slight gains on Tuesday as sentiment began a cautious turnaround. Louis Navellier, CIO of Navellier & Associates, said, it is likely that we will look back on the current volatility as a buying opportunity, though it's difficult to estimate when the volatility will be behind us. The past month has of course been brutal for AI stocks, with the Mag 7 now at five-month lows. It's been even worse for AI-exposed software firms, with sector flagships like Salesforce and Adobe down more than 20% on the year. The selloff has been so severe that some executives took direct action to steady the market. ServiceNow CEO Bill McDermott announced in a regulatory filing that he would buy $3 million of his company's stock. McDermott is the first major SaaS CEO to buy stock during a bloodbath that had made it seem like even the insiders had lost faith in the sector. Multiple ServiceNow executives also canceled all future selling plans. Meanwhile, several private software companies released their earnings early in a bid to show they haven't been disrupted by AI. McAfee's Q4 earnings were little changed from last year at $626 million. Rocket Software disclosed 5.2% revenue growth, while Perforce Software had a slight revenue decline but detailed AI product development plans in their earnings call. It is way too early to say the SaaS apocalypse is over, but this week does seem to be giving investors a slight breather to reassess the value of AI and software stocks moving forward. 
Over in China, it is Chinese New Year, and AI companies are the ones handing out the red envelopes. Alibaba, Tencent and ByteDance are all offering massive giveaways in a bid to capture new chatbot users. The promotions vary, with ByteDance running a high-value sweepstakes, while Alibaba and Tencent are giving away a few dollars' worth of vouchers to each user. Part of the big push is to get users to try out nascent AI shopping agents. And yet The Information notes that Chinese AI companies could be facing an even tougher path to AI monetization than their US counterparts. Each of the major Chinese labs is still offering high-volume usage and advanced features for free. Leon Fan, a Beijing-based AI founder, noted a cultural barrier, commenting, in China, consumers know they can always find most online services for free. If one major AI chatbot started charging its users, people would immediately migrate to other free chatbots that are just as good. That said, while it's not a pathway to profitable AI, the giveaways are serving their purpose by boosting usage during the Spring Festival. ByteDance said their Monday night promotion garnered 1.9 billion chatbot interactions, while Alibaba said their agentic-shopping-focused promotion had led to 130 million first-time users trying out the service so far this month. There is of course another AI story going on in China, which is the rise of embodied AI in the form of robotics. At some point we're probably due for an update show, as the videos coming out this year suggest a pretty extraordinary pace of development. For now, however, that is going to do it for today's AI Daily Brief headlines edition. Next up, the main episode. Hello friends, if you've been enjoying what we've been discussing on the show, you'll want to check out another podcast that I have had the privilege to host, which is called You Can with AI from KPMG. 
Season one was designed to be a set of real stories from real leaders making AI work in their organizations, and now season two is coming and we're back with even bigger conversations. This show is entirely focused on what it's like to actually drive AI change inside your enterprise, and has case studies, expert panels and a lot more practical goodness that I hope will be extremely valuable for you as the listener. Search You Can with AI on Apple, Spotify or YouTube and subscribe today. This episode is brought to you by Mercury: radically different banking, now available for personal accounts. I already use Mercury for my business, so when they introduced personal accounts it made immediate sense for me. I try to bring the same level of intention to my personal finances that I bring to building companies, and most traditional banks just do not feel designed for that. With Mercury Personal, you can toggle between business and personal in a click. You can set up sub-accounts for specific goals, automate transfers so projects and savings fund themselves, and put idle cash to work with high-yield savings, all without friction. It's built for people who care about how their money moves and want tools that actually keep up. Visit mercury.com/personal to learn more. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. If you're looking to adopt an agentic SDLC, Blitzy is the key to unlocking unmatched engineering velocity. Blitzy's differentiation starts with infinite code context. Thousands of specialized agents ingest millions of lines of your code in a single pass, mapping every dependency with a complete contextual understanding of your codebase. Enterprises leverage Blitzy at the beginning of every sprint to deliver over 80% of the work autonomously: enterprise-grade, end-to-end tested code that leverages your existing services, components and standards. This isn't AI autocomplete. 
This is spec- and test-driven development at the speed of compute. Schedule a technical deep dive with our AI experts at blitzy.com. That's blitzy.com. One more quick thing before we get back to the show. If you are a business leader who is thinking about how all of this crazy OpenClaw and agent stuff can impact your business, I've got something for you. If you go to EnterpriseClaw AI, you can sign up to get more information about a new executive sprint that we're going to be doing that will help leaders inside companies figure out what the real challenges and opportunities of agents and agent systems like OpenClaw are going to be for your particular companies. That program will involve you learning, at least on a personal level, how to build agents and agent teams, so that you have that basis of experience to then walk through a set of blueprints for the types of challenges you're going to face around things like security, governance and more. The first cohort is kicking off in March, so head on over to EnterpriseClaw AI to sign up for more information. Welcome back to the AI Daily Brief. You had a sense at the beginning of this week that we might be in for a good one when it came to new model releases, and so far that is absolutely the case. We have not yet seen the much-rumored DeepSeek version 4, but we did get an early preview of Grok 4.2 as well as Sonnet 4.6, which, as you'll see especially in the context of the OpenClaw conversation, has a lot of people excited. Now, we're going to look at the new models in terms of some benchmarks, of course, as well as first impressions from the peanut gallery. But the big thing that I think is notable, after looking especially at the reactions to Sonnet 4.6, is just how different evaluation of new models is getting. It is much more discrete, much more specific, and honestly much more useful. Yes, sometimes with the big flagship models, like Opus 4.6 or I'm sure when we get GPT-5.3, it'll be this way. 
The question is how much, if at all, does this push the state of the art? How much does it beat the previous best model in terms of raw capability? Increasingly, however, the discourse is not just about raw capability, but instead a set of questions about what specifically the new model adds to the capability set and how it can be plugged into people's model stack. The questions that people explore are about cost, contextual performance, discrete capabilities, and how those add up to new value around specific use cases. So with that in mind, let's talk about Sonnet 4.6. As has been the case with their previous Sonnet releases, this model is all about delivering more reasonably priced high performance, specifically now in the context of agents. Anthropic writes that it is Opus-level intelligence at a price point that makes it practical for far more tasks. A couple of the key details. One is that it has a million-token context window, a first for a Sonnet-class model. Anthropic describes it as enough to hold entire codebases, lengthy contracts or dozens of research papers in a single request. Now, that difference in the context window opens up so many use cases that one of the things that will be interesting to watch going forward is how much of Opus usage was just about that million-token context window, as opposed to any other performance differentiation. Now, one of the big callouts in terms of new capability set is around computer use. Anthropic writes, almost every organization has software it can't easily automate: specialized systems and tools built before modern interfaces like APIs existed. To have AI use such software, users would previously have had to build bespoke connectors. But a model that can use a computer the way a person does changes the equation. In the 18 months since Anthropic started tracking computer use via the OSWorld series of benchmarks, the Sonnet models have jumped from 14.9% all the way up to 72.5% today. 
The latest jump, between Sonnet 4.5 and Sonnet 4.6, was from 61.4% to 72.5%. The model certainly still lags behind the most skilled humans at using computers, but the rate of progress is remarkable nonetheless. It means that computer use is much more useful for a range of work tasks, and that substantially more capable models are within reach. I think their point that we are on the verge of models that can use computers like humans, without APIs, is a powerful one. Compared to the previous Sonnet 4.5 model, this model is much stronger on coding benchmarks, being now roughly in line with Opus 4.5. The model is also now state of the art in agentic financial analysis and office task benchmarks, even beating Opus 4.6. The cost is $3 per million input tokens and $15 per million output tokens, compared to $5 and $25 for Opus. And Sonnet 4.6 is available to free users, which could end up being meaningful. In their testing in Claude Code, Anthropic found that users preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time. Users, they say, reported that it more effectively read the context before modifying code and consolidated shared logic rather than duplicating it. This, they said, made it less frustrating to use over long sessions than earlier models. Interestingly, they say users even preferred Sonnet 4.6 to Opus 4.5, the model that launched this huge inflection point that we've been talking about all year, 59% of the time. Those users said that Sonnet 4.6 was significantly less prone to over-engineering and laziness, and meaningfully better at instruction following. One really interesting test was around the vending bench arena, a test to see how well different models can run a simulated business over time. They write, Sonnet 4.6 developed an interesting new strategy. It invested heavily in capacity for the first 10 simulated months, spending significantly more than its competitors, and then pivoted sharply to focus on profitability in the final stretch. 
The timing of this pivot helped it finish well ahead of the competition. All of which is to say that Anthropic is very clearly saying that Sonnet 4.6 is not just a cheaper Opus. It has some things that it does that are unique and highly capable, and it's worthy of consideration on its own terms, not just because of cost. So we haven't had this for long, but what is the state of the conversation? There has been a surprising amount of conversation around rumors of whether this was originally supposed to be Sonnet 5. Veer Masrani writes, rumor going around: Anthropic's Sonnet 5 didn't hit internal benchmarks and may ship as Sonnet 4.6 instead. If true, that tells us a few things. The jump from 4.x to 5 was expected to be meaningful. Whatever they tested didn't clear that bar. They'd rather relabel than overpromise, Veer says. That either means it's conservative branding, or there's a performance plateau, he concludes. If they're saving Sonnet 5, then something bigger is still in the oven. If not, we may be entering the era of smaller, hard-won improvements instead of flashy jumps. Without having any privileged information about what's going on inside the company, it is very clear that overall, across all the companies, we are definitely in the era of smaller, harder-won improvements instead of flashy jumps. Although that might not be because of some constraint in the scaling; it might just be a response to consumer expectations and the absolute cudgeling that OpenAI took when the jump to GPT-5 wasn't big enough to get people excited. Others think this is just business strategy. Sean Sullivan writes, I have a feeling that Sonnet 5 has been done for some time now, but it's way cheaper than Sonnet 4.5, and Anthropic still has market leadership in API usage, meaning that they don't have to drop it until someone comes up to compete. Now, in terms of people who have actually used it, the response is pretty good. 
Aaron Levie from Box writes, we tested Sonnet 4.6 in early access on our Box AI complex work eval, and it's a big upgrade over Sonnet 4.5, seeing a 15-percentage-point jump in performance and accuracy. Sonnet is delivering a huge boost across reasoning capabilities, tool use, working with complex data, and more. All of these, Aaron points out, are necessary improvements for agents to be involved in sophisticated workflows in an enterprise, reinforcing the idea that this isn't just cheaper Opus. Artificial Analysis writes, Claude Sonnet 4.6 is the new leader in GDPval, slightly ahead of Anthropic's Opus 4.6 on agentic performance of real-world knowledge work tasks, less than two weeks after its launch. That said, they did note that in their testing, Sonnet actually used significantly more tokens than previous versions of Sonnet, and meaningfully more than Opus 4.6 as well. That meant that although Sonnet 4.6 slightly beat out Opus 4.6, the story might not be as simple as, hey, this cheaper model does even better. Basically, the cost for Sonnet to outperform Opus made Sonnet even more expensive than Opus. From a positioning standpoint, Trung Phan pointed out that Anthropic, as focused as they are on enterprises, seems to have made a decision to not totally cede the ground for consumers as well. They point out that in the Sonnet 4.6 demo, they show Claude renewing someone's license plate at the DMV, obviously a very benign, everyday-painful type of use case. A lot of the chatter is of course around computer use and what that's going to mean, and many are hammering just how important the cost dimension is in the context of what we're using these models for today. Kalezer writes, the price-point thing matters way more than people realize. Running agents that loop hundreds of times per task, dropping to Sonnet-tier pricing while staying near Opus level means the same budget goes 5x farther. That's not a minor upgrade, that's a different category of what you can build. 
Zach Schmau writes, Opus-class reasoning at Sonnet pricing means you can actually afford to let agents think harder on every step without blowing through your API budget. That was the real bottleneck. And of course, given where the state of the conversation is right now, a lot of people pointed out its relevance for OpenClaw. OpenClaw super-champion Alex Finn writes, this is the best model for OpenClaw ever. It is human-level at computer use, the most important part of Claw, for a fraction of the price. Meta Alchemist writes, Sonnet 4.6 feels like it was made for OpenClaw, with how much emphasis they put on running the apps on your computer and tool usage. If you are using Claude with OpenClaw, using Sonnet 4.6 will be faster and cheaper compared to Opus. Prajwal Tomar writes, I burned through a stupid amount of money in 48 hours using Opus 4.6. Switched to Sonnet 4.6, and it feels almost the same but costs a fifth as much. For pure coding, Opus is still better, but for agentic workflows inside OpenClaw, Sonnet 4.6 performs nearly as well, and that's what actually matters. When agents are looping, researching and executing tasks all day, cost efficiency becomes everything. If you're using Opus 4.6 for OpenClaw right now, switch to Sonnet 4.6. You'll save a lot of money without sacrificing real performance. OpenClaw, for its part, very quickly pushed an update to officially support the new model. Summing up the discourse in their AI News newsletter, Latent Space writes that Sonnet 4.6 matters because, one, long context, that is, that 1-million-token window, is becoming operational versus just a spec; two, agent performance claims are increasingly harness-dependent, meaning that you have to ask not just about the model but where and how it's being used; and three, computer use is becoming a marquee capability. Overall, the vibe is people are excited, and excited especially to try it in their agent systems like OpenClaw. 
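To make the cost math in those reactions concrete, here's a quick back-of-the-envelope sketch. The $3/$15 and $5/$25 per-million-token rates are the ones quoted earlier in the episode; the per-task token counts are purely illustrative assumptions, since real agent loops vary enormously in how many tokens they consume.

```python
def task_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one task, given token counts and $-per-million-token rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Quoted rates ($ per million tokens): Sonnet 4.6 at $3 in / $15 out,
# Opus at $5 in / $25 out.
SONNET = (3.0, 15.0)
OPUS = (5.0, 25.0)

# Hypothetical agent task: 200k input tokens, 20k output tokens.
sonnet_cost = task_cost(200_000, 20_000, *SONNET)  # 0.2*3 + 0.02*15 = $0.90
opus_cost = task_cost(200_000, 20_000, *OPUS)      # 0.2*5 + 0.02*25 = $1.50

print(f"Sonnet: ${sonnet_cost:.2f}  Opus: ${opus_cost:.2f}")
```

Note that at these quoted rates, the savings per identical task is roughly 40%; larger multiples like "a fifth as much" would depend on the models also using fewer tokens per task, which, per the Artificial Analysis observation above, is not guaranteed.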
Now, that wasn't the only model we got yesterday. Elon Musk posted, the Grok 4.2 release candidate public beta is now available for use. You need to select it specifically. Critical feedback is appreciated. Unlike prior versions, Grok 4.2 is able to learn rapidly, so there will be improvements every week with release notes. So this, then, is a little bit different. It's not a full-on release with a benchmark scorecard like Sonnet 4.6, nor is it a fixed state where the next set of improvements will come with the next model number. Instead, Grok 4.2 itself is supposed to improve over time. Indeed, Elon separately said, Grok 4.2 will be about an order of magnitude smarter and faster than Grok 4 when the public beta concludes next month. Still many bug fixes and improvements landing every day; the public beta gives us more critical feedback to address. Now, one of the things that is extraordinarily difficult any time we get an xAI model is what I'll call the Elon Rorschach test. If you dislike Elon and the X algorithm knows that you like content where people are crapping on things that Elon does, you are going to see endless tweets about how 4.2 is just a total POS. On the flip side, if you are an Elon stan, that same X algorithm is going to deliver you a whole slew of tweets about how awesome 4.2 is. Among the very few people that I could find who exist in between those two paradigms, first impressions are that it is, if nothing else, improved. Dr. Daria Anutmaz writes, I just got access to the Grok 4.2 beta and I'm testing it on biomedical questions. I can already say it has greatly improved. Now, the one specific feature that lots are talking about is the approach that 4.2 takes where, in responding to a prompt, four separate agents think on their own, debate amongst themselves, and then come up with the best answer together. Benjamin De Kraker writes, the Grok 4.2 agent teamwork system is cool and appears well done. 
However, he says, the real value in these multi-agent setups is when they're not all the same model, or even the same provider. A mixed team from four different models, Grok, Claude, GPT and Gemini, is the sweet spot. Ultimately, from where I'm sitting, there is not quite enough available on 4.2 to really know what to make of it. I think the thing that I will be watching most closely is this idea that it itself is going to get better rapidly. The last thing I wanted to flag today isn't a new model release, but a new product release. Normally I wouldn't necessarily feature this until it had a lot more folks with hands on it, but there's a new platform called Dreamer that seems to be focused on abstracting away all the complexity around agent design while still letting you build the agents that you need to solve your problems. I don't necessarily think that they describe it super well. The announcement tweet calls it a place to discover, build and enjoy agentic apps and your home for personal intelligence, whatever the heck that means. But the early users did a better job of describing where the value is. Ben Tossell from Ben's Bites writes, 2026 is the year of the personal agent. Dreamer is the closest I've seen to making that accessible to everyone. In his newsletter, he writes, Dreamer is a platform where you build agentic apps by talking. You describe what you want, and an AI agent called Sidekick builds it for you in minutes. There's also a more detailed coding agent for when you want to go deeper. Either way, you never think about hosting or deployment. The platform handles it all. That's the bit I care about most. I spent a stupid amount of time on infrastructure: getting servers running, keeping things alive, debugging why something crashed. That stuff is fine when you're learning, but it's not the point. The point is the thing you're trying to make. Sidekick learns about you over time and acts as the privacy layer, controlling what data each app in Dreamer can access. 
It can spin up temporary agents for specific tasks, integrate with third-party tools, and coordinate between your different apps. All of that wiring is done for you out of the box. Sean Wang, aka swyx, writes, Dreamer is the most ambitious full-stack consumer and coding agent startup I've ever seen. When this was first demoed to me, my jaw dropped. Now, he writes a lot more, but says, I think Dreamer is the right form factor for mass-adopted personal software agents. You stop fussing over the code, you just use the app and then talk to your Sidekick to fix bugs. Sean's belief is that, quote, very unexpected things happen when you let normies build their own AI apps rather than force them through expensive developers: basketball apps, knockoff Harry Potter galleries, story times for kids, Caltrain apps. And so far, that seems to be people's early experience. Joanna Stern, formerly of the Wall Street Journal, writes, started testing Dreamer yesterday and this might be the vibe coding agent tool for normies. Super simple to build little tools without deploying anything to a server. So to the extent that today we are talking about new models and discrete capabilities, it seems like Dreamer is one to watch. For now, though, that is going to do it for today's AI Daily Brief. Appreciate you listening or watching as always, and until next time, peace.
