Transcript
A (0:00)
Today on the AI Daily Brief, the perils of the AI exponential, and before that in the headlines, and definitely not related at all, Claude Code turns one. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors, KPMG, Mercury, AIUC and Blitzy. To get an ad-free version of the show, go to patreon.com/aidailybrief. If you're interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai. You can also find out all about the AIDB ecosystem on aidailybrief.ai. The one thing that I would point you to today is that the newsletter is officially back. Rather than making this thing complicated, we decided to just give you guys what people have been requesting forever, which is the links to all of the things that we discuss in the show. So if you are ever looking for some tweet that I mention or for an article that I'm referencing, you should go subscribe to the newsletter, because it's going to be there again. You can get a link to that, as well as everything else, on aidailybrief.ai.

Now with that out of the way, let's dive into the headlines. We kick off today with another reminder of just how fast things are changing. Claude Code, this platform that has become so integral to the changing of the world and the shift in how business gets done, is just one year old. In fact, this weekend Anthropic threw it a first birthday party to celebrate. There was clearly something in the air. In early February of last year, Andrej Karpathy coined the term vibe coding, and it was a capability set that had clearly just started to come into its own with the latest generation of models. At the time, agentic coding was still seen as something of a fascination. It was something quirky that might help non-technical people build some fun personal apps, but was very clearly too unreliable to be used in production environments. Fast forward just a year, and on any given day on this show, you're going to hear about the extent to which agentic coding is disrupting not only the software industry, but also infiltrating other areas of work as well.

For Anthropic, Claude Code has fundamentally changed the destiny of the company. What started as a side project for developer Boris Cherny has become the central pillar of their strategy. Not only is Claude Code generating $2.5 billion in ARR, it's also being used to code its own upgrades and develop new products at a staggering pace. In a recent interview, Cherny recalled the early weeks of internal release. He said, I remember Dario asking like, hey, are you forcing engineers to use this? Why is everyone using it? Cherny responded that all he needed to do was make it available and everyone voted with their feet. The same distribution method is working for developers the world over. Anthropic's recent analysis of their API figures found that almost half of all tool calls are related to software engineering. In other words, AI coding is the biggest use case for Anthropic's models, and it's not close. What's more, Claude Code transformed the AI industry. It completely eliminated the argument that AI is just fancy autocomplete or a better version of Google. Watching the changes happen from inside, Boris believes this is a fundamental phase shift for software engineering.
In a recent interview, he commented, continuing to trace the exponential: I think what will happen is coding will be generally solved for everyone. Today coding is practically solved for me, and I think it will be the case for everyone, regardless of domain. So happy birthday to Claude Code. One year in and it already changed the world.

Now, changes in the world are rarely simple. In fact, they are more often chaotic and even violent. On that front, Anthropic's new security tool sent cybersecurity stocks into a tailspin last week, raising new questions about the software sell-off. On Thursday, Anthropic unveiled Claude Code Security, another new plugin to extend the tool's capabilities. Anthropic said the feature scans codebases for security vulnerabilities and suggests patches, allowing developers to find and fix security issues that traditional methods often miss, their phrasing. Obviously, Friday's market action saw cybersecurity stocks decimated. These companies had so far been resistant to the broader software sell-off, with the First Trust Cybersecurity Index losing 11% over the past six months, compared to 24% for other software indices. Friday alone, however, saw CrowdStrike lose 8%, Okta lose 9% and Cloudflare lose 7%.

Many were totally incredulous, with Kenton Varda, a tech lead for Cloudflare, posting: lol at investors who think all forms of security are fungible, and so the release of Claude Code Security, a tool for finding security bugs in your code, means Okta, CrowdStrike and others should lose 5% of their stock value. Now, of course, a big part of the pushback during the software sell-off has been that even if companies can theoretically vibe code their own SaaS products, few want to take on the task of maintaining and supporting internal software. One might imagine that this goes double for cybersecurity, which adds a ton of insurance and liability issues on top. And in this case, and this is the point that Kenton was making, the features of Claude Code Security don't even overlap with the products offered by these firms. What was released by Anthropic is only designed to audit and monitor internal code for vulnerabilities. Cloudflare and CrowdStrike largely provide security for customer-facing services, preventing downtime from Internet-based cyber attacks. And the Okta drawdown is even more puzzling given that they provide two-factor authentication services. Anthropic didn't even hint at anything regarding any of these aspects of cybersecurity.

Still, while it might be easy to dismiss this as irrational markets acting irrationally, for investors there's a lot of signal in how this crash is playing out. Dennis Dick of Triple D Trading said, there's been steady selling in software, and today it's security that's getting a mini flash crash on a headline. This kind of market is scary for investors because things are just moving relentlessly to the downside as soon as you get a hint of disruption. It's rational to be cautious, because people were saying a while ago that the software drop was overdone, and yet it keeps going down. Buco Capital put the logic in even more fundamental terms, writing: I think it's fine to sell Cloudflare and CrowdStrike actually, even if the Anthropic news doesn't impact them today, because maybe you shouldn't pay 25x revenue when the landscape is shifting this quickly.
This is sort of the closest to my take, very broadly defined, which is that even if, yes, it does seem like almost all of these moves are overblown in the short term and the catalysts don't really warrant them, I don't think that they're really about the specific catalysts. I think that they are very clearly part of a broader-based repricing going on right now that's about exactly what Buco is talking about here: a question of how to value software when it is changing so quickly. Neither I nor the market knows where things are going to land and where they're going to feel comfortable. And so until that reset point is hit, if it even can be, you're just going to see a lot of these weird moments like this. Still, Sesace thought that Anthropic could make better use of their newfound power to crash markets, asking: can Anthropic publish a blog post about how they're going to replace 4 bed, 4.5 bath homes in walkable neighborhoods with good schools?

Now, if you are tracking the model releases over the last week or so, you've probably noticed that Google, Anthropic and xAI have all thrown new models onto the pile. That means, of course, that it's just about time for the next frontier model from OpenAI. Rumors about this one have been circulating for a little while. This is the model known as Garlic internally, which was the main focus of Sam Altman's code red push which began in December. We've been hearing this is coming every week for a few weeks now, with the latest rumor being that GPT-5.3, aka Garlic, will be released on Thursday. We of course already got the coding-focused version, GPT-5.3 Codex, at the beginning of the month, with that model bumping up the coding benchmarks and being competitive with, if not ahead of, Opus 4.6, which released on the same day. At the same time, it also improved on reasoning benchmarks, suggesting there is much to be transferred over to the core version of GPT-5.3.

Covering the rumors, AI engineer Dan Mac wrote: it surpasses the human baseline on SimpleBench of 83.7%. In fact, it blows every previous model out of the water on all non-coding benchmarks. Word has it, it is a huge leap, a GPT-3 to GPT-4 moment. Again, OpenAI has long had the best reinforcement learning pipeline, which makes sense since they were the first lab to train LLMs for inference-time reasoning using RL with o1. Now they've got their mojo back when it comes to pretraining too. Public comments from Sam Altman also point in the direction of major progress. This could be the big one. It may be deserving of a major version bump. AI rumor account I Rule the World added their take, saying: just heard from separate sources that this is accurate. This feels like what we expected from the initial GPT-5 release. Expect it quicker, smarter, video and audio in. Start preparing for a big week. They've hidden just how much progress they've made. For those tracking out there, I think there are two very separate things: one, whether we're going to get a model this week, and two, how big a deal it is. Even if it is a big change, I would be very surprised if it was named anything other than 5.3, given how burned OpenAI has been in the past by naming conventions that promised bigger jumps.

Staying in OpenAI land for a minute, a new financial forecast from the company suggests surging revenue alongside rapidly escalating costs. The Information got their hands on the latest set of projections handed to OpenAI investors.
The company is now forecasting $282.5 billion in revenue by 2030, a 27% jump from their previous round of projections. For context, that would put them ahead of where Meta is currently, and it implies roughly 100% revenue growth in each of the next three years, followed by two years of around 55% growth. OpenAI expects this year's revenue to come in at $30.1 billion, more than doubling 2025's total. They anticipate another doubling in 2027 to reach $62 billion. And yet OpenAI also doubled their forecast for cash burn, reaching a peak of $85 billion in 2028 and a total of $665 billion over the next five years. They still expect to reach profitability by 2030, but they are anticipating much greater costs along the way. (There's a quick back-of-the-envelope check of this growth math below, after the device news.)

One big part of the adjusted figures was spiraling inference costs in 2025. OpenAI said that the cost to serve their models quadrupled over the past year, causing a compression in gross margins. Margins fell from 40% in 2024 to 33% in 2025. Interestingly, OpenAI had originally forecast margin expansion for 2025, expecting model efficiency to boost margins to 46%. They lowered margin expectations across the five-year forecast as a result, but are still forecasting margin expansion each year. While inference costs are expected to rise to $14 billion this year, model training is expected to quadruple to $32 billion. Training costs for 2027 are expected to double again to reach $65 billion, which is $44 billion more than forecast last summer. In total, OpenAI expects to spend $440 billion on model training through 2030.

The financial presentation also included some user metrics, which came in a little soft due to increased competition from Anthropic and Google last year. Weekly active ChatGPT users are now at 910 million, falling short of the billion-user target for 2025. OpenAI leaders pointed to a slowdown around the release of GPT-5 as one of the major stumbles for user growth last year, which of course led to the code red being declared in December. Now of course, last week we got that chart from Epoch suggesting that Anthropic could overtake OpenAI in revenue as early as this year, which maybe goes some way to explaining why Dario and Sam wouldn't hold hands in India last week.

One more little OpenAI update: the company's device plans are coming into focus, with several new details including pricing. One of the interesting tidbits from OpenAI's financials was a forecast that hardware sales will contribute $1.3 billion in revenue next year. To achieve that, OpenAI will need to actually bring their AI devices to market, and according to some new reporting, they seem to be well on their way. The Information again reports that a team of 200 people is now working on OpenAI's family of devices. Sources said the family includes a smart speaker and possibly smart glasses and a smart lamp. Notably absent from the reporting was the behind-the-ear, capsule-shaped device said to carry the codename Sweetpea. The smart speaker is reportedly going to be the first device released by OpenAI and will be priced between $200 and $300. Amazon Echo smart speakers are currently priced between $50 and $220, so the OpenAI device would be competing at the top end of the market. Sources said the speaker will be equipped with a camera, allowing it to draw context from its immediate surroundings, and also said that the camera would allow people to use facial recognition to approve purchases. None of the devices will feature a screen of any kind.
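As promised above, here's a minimal sanity check of how the forecast figures hang together. It uses only the numbers quoted in this segment ($30.1 billion for 2026, $62 billion for 2027, $282.5 billion for 2030); the 2028 figure is an assumption extrapolated from the stated "roughly 100% growth" pattern, not a number from the reporting.

```python
# Back-of-the-envelope check of OpenAI's reported forecast figures.
# Only 2026 ($30.1B), 2027 ($62B) and 2030 ($282.5B) come from the
# reporting above; 2028 is assumed to follow the stated ~100% growth.

rev_2026 = 30.1   # $B, "this year's revenue"
rev_2027 = 62.0   # $B, "another doubling in 2027"
rev_2030 = 282.5  # $B, the 2030 forecast

# Growth from 2026 to 2027, per the quoted figures:
growth_2027 = rev_2027 / rev_2026 - 1
print(f"2026 -> 2027 growth: {growth_2027:.0%}")   # ~106%, i.e. roughly 100%

# Assume 2028 also roughly doubles (assumption, not a reported figure):
rev_2028 = rev_2027 * 2  # $124B

# Implied annual growth for 2029 and 2030 to land on $282.5B:
implied = (rev_2030 / rev_2028) ** 0.5 - 1
print(f"Implied 2029-2030 growth: {implied:.0%}")  # ~51%, close to "around 55%"
```

In other words, the quoted figures are roughly internally consistent: about 106% growth into 2027, and if 2028 doubles again, a bit over 50% a year after that, in the ballpark of the "around 55%" described above.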
No new reporting on a timeline for the smart speaker, though previous reporting suggested we won't see the first OpenAI device until early next year. The smart glasses, meanwhile, are expected to be released in 2028 at the earliest. Prototypes are said to be available inside the company, with the smart lamp specifically mentioned. Sources emphasized that design is still in early stages across the board, so feature details aren't set in stone. One interesting little tidbit is that the device is being designed at a separate office away from OpenAI's headquarters, and some at OpenAI have complained that Jony Ive's design studio, LoveFrom, is slow to revise the designs and shares few details with the main organization. This is of course similar to Apple's design culture, where early details of new devices are shared on a need-to-know basis. Given that every week we get new form factors, I would say at this stage what you should assume is that basically every object and device that you interact with in the real world is probably going to be tested for its capability to become an AI device before we actually get the final OpenAI product. With that, though, we end the headlines. Next up, the main episode.

Agentic AI is powering a $3 trillion productivity revolution, and leaders are hitting a real decision point. Do you build your own AI agents, buy off the shelf, or borrow by partnering to scale faster? KPMG's latest thought leadership paper, Agentic AI: Navigating the Build, Buy or Borrow Decision, does a great job cutting through the noise with a practical framework to help you choose based on value, risk and readiness, and how to scale agents with the right trust, governance and orchestration foundation. Don't lock in the wrong model. You can download the paper right now at www.kpmg.us/navigate. Again, that's www.kpmg.us/navigate.

This episode is brought to you by Mercury, banking for people who expect more from the tools they rely on. If you're building a modern business but still using a traditional bank, it just doesn't make sense. I use Mercury for all of my AIDB family of companies, and it honestly feels like financial software built for how people actually operate today. It's fast, clean, no in-person visits, no minimum balances, and the things that used to take forever, like sending wires or spinning up new accounts, take seconds. Everything lives in one dashboard: cards, payments, invoices, team permissions. And you can automate a lot of the busywork so you're not constantly manually managing your money. Of all of the services I use to run AIDB, I never thought banking would be one of my most painless and most happy experiences, but with Mercury, that's exactly what it is. Visit mercury.com to learn more and apply online in minutes. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC.

There's a new standard that I think is going to matter a lot for the enterprise AI agent space. It's called AIUC-1, and it bills itself as the world's first AI agent standard. It's designed to cover all the core enterprise risks, things like data and privacy, security, safety, reliability, accountability and societal impact, all verified by a trusted third party. One of the reasons it's on my radar is that ElevenLabs, who you've heard me talk about before and is just an absolute juggernaut right now, just became the first voice agent to be certified against AIUC-1 and is launching a first-of-its-kind insurable AI agent.
What that means in practice is real-time guardrails that block unsafe responses and protect against manipulation, plus a full safety stack. This is the kind of thing that unlocks enterprise adoption. When a company building on ElevenLabs can point to a third-party certification and say our agents are secure, safe and verified, that changes the conversation. Go to AIUC.com to learn about the world's first standard for AI agents. That's AIUC.com.

With the emergence of AI code generation in 2022, Nvidia master inventor and Harvard engineer Sid Qureshi took a contrarian stance: inference-time compute and agent orchestration, not pre-training, would be the key to unlocking high-quality AI-driven software development in the enterprise. He believed the real breakthrough wasn't in how fast AI could generate code, but in how deeply it could reason to build enterprise-grade applications. While the rest of the world focused on copilots, he architected something fundamentally different: Blitzy, the first autonomous software development platform leveraging thousands of agents that is purpose-built for enterprise-scale codebases. Fortune 500 leaders are unlocking 5x engineering velocity and delivering months of engineering work in a matter of days with Blitzy. Transform the way you develop software. Discover how at blitzy.com. That's B-L-I-T-Z-Y.com.

Welcome back to the AI Daily Brief. Today we are talking about an update to the METR Moore's Law for AI agents chart. Opus 4.6, among others, is finally on the chart, with everyone scurrying around to understand the implications. Combined with that, a new research note from Citrini Research is rocketing around the pages of X and the Internet more broadly, and it's an interesting case study in the moment we are in now.

By way of background, I'm sure at this point that the vast majority of you are familiar with this chart from METR, the Model Evaluation and Threat Research lab. The chart comes from a continuous study and shows the longest time-horizon tasks an AI agent can handle. It was first released in March of last year, and at the time Sonnet 3.7 was the most advanced AI model available. METR conducted their study going all the way back to GPT-2 and found that the time horizon of agentic tasks was reliably doubling roughly every seven months. That's where the idea of this being a kind of Moore's Law for AI agents came from. That original report even suggested that the speed of improvement was accelerating, with the more recent models at the end of 2024 and early 2025 implying a doubling rate as fast as three months. The chart was a huge part of the discourse at the time and became even more significant towards the end of the year as we were overwhelmed by AI bubble talk.

Now, as we've discussed, 2025 was the first year that we started to get some wobbles in the AI narrative. It started with the DeepSeek moment, which wiped $600 billion off of Nvidia's market cap in January, and throughout the year there was this kind of ping-pong back and forth between excitement and increasing skepticism. Now, one of the flavors of skepticism that is particularly relevant for those proclaiming AI bubbles in the markets had to do with performance plateaus and scaling walls. Basically, the short of it is that if AI actually hit a scaling wall where performance just wasn't really getting better anymore, that would make the bubble idea much more likely.
The gist of it is that if AI can't improve from here, how could it possibly hope to justify these huge infrastructure deals that were predicated on the idea that it would keep being more and more of a significant force in the economy? This is why, by the end of the year, as the bubble narrative took hold, many were calling it the most important chart in the economy. It was in many ways the bulwark holding back the full tide of AI bubble pop narratives.

Now, before we dig into the latest findings, it is worth noting a little bit about what the METR studies actually say and what they do not say. The studies are designed around a set of software engineering tasks ranging from the trivial to the complex. Human engineers were tasked with solving each problem, and their times were used as a benchmark. For example, if a human engineer takes two hours to complete a task, that task has a time horizon of two hours, regardless of how quickly an AI can complete it. In other words, and this is the mistake you see most often on the Internet, the metric is not a measure of how long an AI agent can continuously work. It is a measurement of how difficult a problem an agent can solve, measured in comparative human time to solve the same problem. If a task that takes a human coder two hours is solved by Claude in two minutes, it still yields a two-hour time horizon.

The other element of the study worth understanding is how METR determines success for a task. The researchers aren't looking for perfect reliability, as they're trying to measure the capability frontier. Instead, their core finding, the one that features on the viral chart, requires an AI agent to produce a correct answer 50% of the time. METR also has a secondary finding that requires an agent to deliver the correct response 80% of the time, which, as you would imagine, results in a much lower time horizon. To avoid saying it every time, whenever I'm referring to time horizon in the METR report, I am referring to that standard 50% success rate unless I say otherwise. The point is that these metrics aren't about benchmarking the model's ability exactly. They're about showing the relative improvements across model generations. A 50% success rate is never going to be good enough for an AI coding agent in production, but what matters for the benchmark is the consistent measure and the shift over time.

Now, with all of that background out of the way, there was a ton of anticipation around the current generation of models. Google, OpenAI and Anthropic have all focused on improving coding agents over recent months, but METR has been relatively quiet. They published results for GPT-5.1 Codex in November, but the results weren't overwhelming. The model had a time horizon of 2 hours and 40 minutes, which was barely better than GPT-5. Then came the Opus 4.5 result in December, which showed a 4 hour and 49 minute time horizon. This was a big improvement, almost a doubling all on its own. Now, there was a sense that this might be a one-off change due to improvements in the way Anthropic was post-training their models for coding tasks, but obviously the market vindicated their findings. In fact, in January, swyx wrote: evals should be validated by vibes. I think not enough people give sufficient credit to METR for clearly identifying and quantifying the Opus 4.5 outperformance. On paper, GPT-5.2 Thinking outperforms Opus 4.5 by 55.6 versus 52% on SWE-bench Pro.
In practice, METR's long evals benchmark, while getting increasingly sparse in the long tail, clearly called out the huge jump that many devs are now experiencing a month later. In fact, it is such an outlier that the curve fit was probably wrong and needs to be restarted as a new epoch.

And yet, of course, even if January's conversation was dominated by Opus 4.5, we've since had the twin releases of Opus 4.6 and GPT-5.3, which both seemingly represented another big jump in coding capabilities. That much was obvious from using the models, but people still wanted to see what METR would say once testing was complete. On Friday, METR released the results for both models simultaneously, and they showed that model quality is accelerating faster than it ever has before. GPT-5.3 Codex achieved a time horizon of 6.5 hours at a 50% completion rate, exceeding Opus 4.5. The results for 4.6 were even more dramatic, achieving a time horizon of around 14.5 hours. This is the largest generational jump of any model in METR's study. Opus 4.6 has more than tripled the time horizon of 4.5, implying the time horizon is now doubling every one and a half months.

Responses came in fast and furious. Investor Nic Carter wrote: this is the most important chart in the world and it's going absolutely ballistic. Even Bernie Sanders mentioned it in a recent talk at Stanford. In fact, the jaws dropped so hard that many people raced to give some caveats. Dean Ball wrote: for what it's worth, I don't take the METR chart that's been going around as much of an update. METR itself has been signaling their decreasing confidence in the benchmark for a while now, both because of saturation and limited long-duration tasks in the benchmark. It's certainly impressive and signals that nothing is decelerating, but I don't see it as strong evidence in and of itself that we are in some radically faster progress regime.

Indeed, METR themselves heavily caveated the results. Codex's results, for example, showed some issues with METR's scaffold. They tested the task set again with OpenAI's scaffold, and while they got similar results, they still found the issues noteworthy enough to point out. For Opus 4.6, METR noted that the model has basically saturated their task set, leading to some unintuitive results. The upper band of their confidence interval is now 98 hours, practically infinite when it comes to this measurement. You can imagine that their task set has very few tasks that would take a human coder more than 14 hours to complete, so it stands to reason that the benchmark is starting to get a little saturated. Researcher David Rein writes: seems like a lot of people are taking this as gospel. When we say the measurement is extremely noisy, we really mean it. Concretely, if the task distribution we're using here was just a tiny bit different, we could have measured a time horizon of 8 hours or 20 hours. Now, overall METR says that they are updating their methodology to address the issue, but they are cautioning against overly focusing on these particular results. I think Visimo Dino really summed it up when they wrote: it's possible that, one, there really is something massive happening right now and the METR graph really does capture that fact, and, two, some small subset of people are mistakenly thinking it's even bigger than it actually is. But that doesn't mean it's actually not very, very big.
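To see where that "doubling every one and a half months" figure comes from, here's a minimal sketch of the arithmetic. It uses the two METR data points quoted above (Opus 4.5 at 4 hours 49 minutes, Opus 4.6 at roughly 14.5 hours); the roughly two-and-a-half-month gap between the December and February measurements is my approximation for illustration, not a METR figure.

```python
import math

# The two METR data points quoted above:
h_opus_45 = 4 + 49 / 60   # Opus 4.5: 4h49m time horizon (December result)
h_opus_46 = 14.5          # Opus 4.6: ~14.5h time horizon (Friday's result)

# Assumed gap between the two measurements, in months (an approximation
# based on the December and February dates mentioned in this episode):
elapsed_months = 2.5

# Number of doublings between the two measurements:
doublings = math.log2(h_opus_46 / h_opus_45)   # ~1.59 doublings

# Implied doubling time:
doubling_time = elapsed_months / doublings
print(f"{doublings:.2f} doublings -> one doubling every "
      f"{doubling_time:.1f} months")            # ~1.6 months
```

For comparison, the original trend of a doubling every seven months would, over the same two and a half months, have taken the horizon from about 4.8 hours to only about 6.2 hours (4.8 x 2^(2.5/7)), which is why this result reads as such a break from trend.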
Now, that sense that something very, very big is happening was exemplified in the response to a new piece from Citrini Research called The 2028 Global Intelligence Crisis. Citrini is a well-regarded research firm among FinTwit, largely doing thematic research and having been very early to several key themes during the AI boom. This latest article covers the implications of abundant intelligence. It essentially took Dario Amodei's concept of a country full of geniuses in a data center and applied it to the real world. Among other things, the piece predicted that we'll see AI start to consume the entire economy, moving from sector-specific to broad application of cheap machine intelligence. Citrini's thesis is essentially that capital owners are about to reap the massive benefits of AI, while workers in every strata of the economy will be left jobless and purposeless. Economic activity transforms from being household-based into being capital-based. This eventually leads to a massive collapse in the stock market, a massive rise in unemployment, and a massive and general immiseration across society.

Now, given that I'm ripping through this, you can probably tell that what's more interesting to me than the particulars of the piece is the response that it's getting. This is the latest in a long line of future-oriented AI doomer sci-fi. What's notable this time is that it turns out that many investors already believe some version of this thesis. So the incredible response to Citrini's piece is because it's acting as a confirmation that the worst nightmares of an AI-driven economic crisis are possible. Previous reports were met by a lot of skepticism, whereas this article is being met with much more widespread acceptance, or, one might suggest, confirmation bias.

Felix Jauvin writes: I think what's fascinating about Citrini's piece is it isn't necessarily new ideas for those that have been tapped into what's going on and thinking about it all, but it smashes the common knowledge game around it. And now it's becoming something that everyone knows everyone knows. A tiny fraction of the population knows what OpenClaw is, and an even smaller subset has set one up. There's a lot for people to come to terms with. Unemployed Capital Allocator writes: the final boss of hysteria is entering the arena. In two weeks it will be all over LinkedIn. In four weeks, Wall Street Journal and Financial Times. Every analyst will be typing unemployment when into the ChatGPT chat box. Citrini will be appointed the AI policy czar. Just remember, it might be peak fear, it might be dumb. Markets are primed to buy it and drive things down. There are no atheists in foxholes.

Now, of course, there are plenty of people who took issue with specific parts of this. Dan Hockenmaier writes: this piece shows a profound lack of understanding of how marketplaces work and why they are defensible. Quoting from the piece, he says: a competent developer could deploy a functional competitor in weeks, and dozens did, enticing drivers away from DoorDash and Uber Eats by passing 90 to 95% of the delivery fee through to the driver. Dan picks up: anyone could have done that at any time in the last 10 years. Why was no one able to? Because the hard part has nothing to do with building the app or attracting the drivers. The hard part is building a liquid marketplace with all the best supply and a massive series of optimizations and investments to drive down prices and delivery times and drive up reliability.
DoorDash and Uber Eats have built this when no one else could, and they will not allow agents to transact on their apps, nor will they have a legal requirement to allow it. But the real story isn't as sensational, so it doesn't get the engagement. Economist Guy Berger writes: this was an interesting read, but I'm not sure it's internally consistent. One question that comes to mind: those who own the agents, what are they doing with the money they're making? Why isn't that fueling employment, GDP and stock prices?

Now again, I have a feeling that we're going to be talking about this one more in the weeks to come, so I'm not trying to go fully in depth today. I think what's important here is the way that these individual elements all add up to something more. The story of early 2026 so far is a broad-based sense that, to quote that viral piece from about a week ago, something big is happening. The capability set of the coding models has increased dramatically, which has opened up agents as a real force. Those two things combined have moved the impact of AI and agents from just software engineering to everything else. Markets are starting to reprice things as a consequence, and nothing seems like it's going to slow down at all. And because of that, everyone is trying to figure out what's next.

Now, I would argue that we are desperately in need of the non-doomer version of the Citrini piece, which is something that I'm trying to work on in the background, so keep an eye out for that. For now, it remains a really interesting anthropological study of the moment that I think can tell us a lot about where general sentiment is. That, however, is going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always. Until next time. Peace.
