Transcript
A (0:00)
Today on the AI Daily Brief, the incredible string of model releases continues with Anthropic dropping Claude Opus 4.5. Before that, in the headlines, the White House launches the AI Genesis Mission. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors: Superintelligent, Robots and Pencils, Blitzy, and Rovo. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. And if you are interested in sponsoring the show, we're doing a bunch of wrapping up Q1 right now. Send us a note at sponsors@aidailybrief.ai and I can give you all of the info. And with that, let's dive in. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Yesterday you heard about how one AI executive order from the White House had been squashed. Basically, there was a big dust-up with congressional Republicans around the White House's plan to create a task force to go after states who put AI regulations on the books. But as it turns out, that was not the only executive order they had planned. President Trump has now officially signed an executive order to launch a national AI science program known as the Genesis Mission. The text of the order argues that the race for global technology dominance in the development of AI requires a historic national effort comparable in urgency and ambition to the Manhattan Project. This order launches the Genesis Mission as a dedicated, coordinated national effort to unleash a new age of AI-accelerated innovation and discovery that can solve the most challenging problems of the century. Michael Kratsios, the director of the White House Office of Science and Technology Policy, continued that tone during the Monday announcement. 
He described the Genesis Mission as the largest marshaling of federal scientific resources since the Apollo program. Now, stripping away the superlatives, the Genesis Mission is at core an initiative to collate scientific knowledge from across the government to enable new AI-driven discoveries. Datasets will be gathered from the National Science Foundation, the National Institute of Standards and Technology, and the National Institutes of Health. The datasets, some of which stretch all the way back to the 1940s, will be cleaned and transformed into machine-readable formats to make them accessible to AI models. The order lays out a twofold goal: to train scientific foundation models and create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs. To that end, the Department of Energy and their network of 17 national labs will make their data and compute resources available to research institutions and private sector companies. The order instructs the DOE to, quote, create a closed-loop AI experimentation platform that integrates our nation's world-class supercomputers and unique data assets to generate scientific foundation models and power robotic laboratories. Essentially, this is a major effort to organize the scientific data that's scattered across government agencies and marshal resources in order to drive AI-accelerated scientific discovery. Kratsios argued that since the 1990s, America's scientific edge has faced growing challenges, citing declining numbers of drug approvals and research outputs despite soaring scientific budgets. The Genesis Mission seeks to reverse that trend by, in his words, unifying agencies' scientific efforts and integrating AI as a scientific tool to revolutionize the way science and research are conducted. 
Datasets and compute infrastructure will be centralized into the American Science and Security Platform, to be established by the DOE, who said that once complete, the platform will be, quote, the world's most complex and powerful scientific instrument ever built. It will draw upon the expertise of roughly 40,000 DOE scientists, engineers, and technical staff alongside private sector innovators to ensure that the United States leads and builds the technologies that will define the future. The DOE is also tasked with formulating a list of 20 science and technology challenges of national importance to form the initial focus of the Genesis Mission. This potentially includes domains like advanced manufacturing, biotechnology, critical materials, nuclear fission and fusion energy, quantum information science, and semiconductors. The initiative builds on the existing National Artificial Intelligence Research Resource, or NAIRR, which was established in 2020 and brought together federal agencies including the Department of Defense, NASA, and the National Institutes of Health with private companies like OpenAI, Google, and Palantir to form a nationwide research community. Lynne Parker, who co-chaired NAIRR during the Biden administration, said government support for AI research builds the foundations for new breakthroughs and helps keep innovation aligned with the public interest. We take for granted that new products appear regularly but seldom consider the decades of research that made them possible. Without long-term investment, we risk ceding leadership in the technologies that will define our economy, our security, and our daily lives. Now, speaking of the connection between public and private, Amazon announced on Monday that they will spend up to $50 billion to expand their AI and supercomputing facilities for US government customers. The expansion will begin next year and is expected to add a total of 1.3 gigawatts of AI capacity to the AWS regions that service government demand. 
The expansion will increase capacity for both unclassified and top secret AWS servers. Said AWS CEO Matt Garman in a press release: Our investment in purpose-built government AI and cloud infrastructure will fundamentally transform how federal agencies leverage supercomputing. We're giving agencies expanded access to advanced AI capabilities that will enable them to accelerate critical missions from cybersecurity to drug discovery. This investment removes the technology barriers that have held government back and further positions America to lead in the AI era. Staying on the chip theme, Meta appears to be preparing to use Google's TPUs in their own data centers. The Information reports that Google has begun pitching large cloud customers, including Meta and large financial institutions, on installing TPUs at their own facilities. Google has made their custom AI chips available through Google Cloud for years, but they've yet to sell TPUs directly to outside customers. Part of the pitch is that they're able to operate the chips with higher security and compliance standards that aren't possible with cloud use. According to sources speaking with The Information, Meta is in talks to order billions of dollars' worth of TPUs to install in their data centers in 2027. If you've been listening over the last week, what's clear is that while Google has been making TPUs for over a decade, the release of Gemini 3 put the chips firmly on people's radar. The new model was trained exclusively on TPUs, leading many to question whether Google's chips could be a viable alternative to Nvidia's GPUs. The news seems to have moved the stock market, with Bloomberg reporting a 2.7% bump for Google and a 2.7% drop for Nvidia in overnight markets. A Bloomberg analyst wrote: 
Meta's likely use of Google TPUs, which are already used by Anthropic, shows third-party providers of large language models are likely to leverage Google as a secondary supplier of accelerator chips for inferencing in the near term. Now, while Google is clearly ramping up to compete, the analysis is still probably getting a little bit ahead of itself. That said, the new report contained a few more crumbs of information on how Google is looking to address the market for AI chips. One of Nvidia's biggest moats is the CUDA developer ecosystem. As part of The Information's report, they write that Google has developed a new software suite called TPU Command Center that's designed to make TPU compatibility easier to navigate. Ultimately, while it could take Google a number of years to carve out a meaningful share of the AI chip market, Nvidia is already taking the threat seriously. According to The Information, Nvidia is following the dealmaking closely and has enticed Anthropic and OpenAI to make large commitments to Nvidia GPUs. They also wrote that it's possible that Nvidia will seek to preempt a deal between Google and Meta. Futurum equities chief market strategist Shay Boloor writes: I know the first instinct is to frame Meta exploring Google TPUs as the start of Nvidia's pricing power erosion, but that's not what it is. The real story is the velocity of Meta's AI workload curve as Llama training cycles, video understanding systems, and tens of billions of daily inference calls all smash into the same compute ceiling. Meta is already on pace to spend $100 billion on Nvidia hardware, and they're still capacity constrained. Adding TPUs doesn't replace the spend, it just sits on top of it. Even if Nvidia doubled output, Meta would still be short on compute. That's how steep the structural AI capacity shortage actually is. 
Lastly today, in an interview at the Emerson Collective's Demo Day (Emerson Collective being the venture and philanthropy fund of Steve Jobs' widow Laurene Powell Jobs), Sam Altman and Jony Ive said that they've nailed the design of their AI device. In possibly the strangest-ever description of a consumer device, Altman said: There was an earlier prototype that we were quite excited about, but I did not have any feeling of, I want to pick up that thing and take a bite out of it. And then finally we got there all of a sudden. Altman said this was Ive's test for knowing when a design is dialed in: when you want to lick it or take a bite out of it or something like that. The pair stayed silent on features, but Altman was excited to describe the vibes of the product. He compared the experience of modern devices to walking through Times Square: flashing lights, noises, and the dopamine drip constantly, just dealing with all the little indignities. By comparison, he wants using the OpenAI device to feel more like sitting in the most beautiful cabin by a lake and in the mountains and just sort of enjoying the peace and calm. Ive added his vibe, commenting: I love solutions that teeter on appearing almost naive in their simplicity, and I also love incredibly intelligent, sophisticated products that you want to touch and you feel no intimidation, that you want to use almost carelessly. Altman commented: I hope that when people see it, they say, that's it. The interview added no information on what the device will actually do, but for Altman, the key feature continues to be total contextual awareness. He said: It is so simple, but then AI can just do so much for you that so much can fall away. And the degree to which Jony has chipped away at every little thing that this doesn't need to do or doesn't need to be in there is remarkable. If you feel more rather than less confused, don't worry about it. 
Substantively, the biggest news was a timeline, with Ive stating the device could be available within two years. But with that, we close today's headlines. Next up, the main episode. Today's episode is brought to you by Superintelligent. Now, for those of you who don't know, who are new here maybe, Superintelligent is actually my company. We started it because every single company we talk to, all the enterprises out there, are trying to figure out what AI can do for them. But most of the advice is super generic, not specific to your company. So what we do is we map your AI and agent opportunities by deploying voice agents to interview your teams about how work works now and how your people would like it to work in the future. The result is an AI action map with high-potential ROI use cases and specific change management needs. Basically everything you need to go actually deliver AI value. Go to besuper.ai to learn more. AI isn't a one-off project, it's a partnership that has to evolve as the technology does. Robots and Pencils works side by side with clients to bring practical AI into every phase: automation, personalization, decision support, and optimization. They prove what works through applied experimentation and build systems that amplify human potential. As an AWS Certified Partner with global delivery centers, Robots and Pencils combines reach with high-touch service. Where others hand off, they stay engaged, because partnership isn't a project plan, it's a commitment. As AI advances, so will their solutions. That's long-term value. Progress starts with the right partner. Start with Robots and Pencils at robotsandpencils.com/aidailybrief. This episode is brought to you by Blitzy, the enterprise autonomous software development platform with infinite code context. Blitzy uses thousands of specialized AI agents that think for hours to understand enterprise-scale codebases with millions of lines of code. 
Enterprise engineering leaders start every development sprint with the Blitzy platform, bringing in their development requirements. The Blitzy platform provides a plan, then generates and pre-compiles code for each task. Blitzy delivers 80%-plus of the development work autonomously while providing a guide for the final 20% of human development work required to complete the sprint. Public companies are achieving a 5x engineering velocity increase when incorporating Blitzy as their pre-IDE development tool, pairing it with their coding copilot of choice. To bring an AI-native SDLC into their org, visit blitzy.com and press Get a Demo to learn how Blitzy transforms your SDLC from AI-assisted to AI-native. Meet Rovo, your AI-powered teammate. Rovo unleashes the potential of your team with AI-powered search, chat, and agents, or build your own agent with Studio. Rovo is powered by your organization's knowledge and lives on Atlassian's trusted and secure platform, so it's always working in the context of your work. Connect Rovo to your favorite SaaS apps so no knowledge gets left behind. Rovo runs on the Teamwork Graph, Atlassian's intelligence layer that unifies data across all of your apps and delivers personalized AI insights. From day one, Rovo is already built into Jira, Confluence, and Jira Service Management Standard, Premium, and Enterprise subscriptions. Know the feeling when AI turns from tool to teammate? If you Rovo, you know. Discover Rovo, your new AI teammate powered by Atlassian. Get started at rovo.com. Welcome back to the AI Daily Brief. The Thanksgiving 2025 parade of models has continued into a new week, this time with the launch of Claude Opus 4.5 from Anthropic. Now, people have been assuming for some time that we were going to get an Opus 4.5. We've obviously had Sonnet 4.5 for a while now, and so people figured that this was in the offing, but there had been a lot less conversation leading up to this around when it was going to come. 
The big model, of course, that people have been anticipating is Gemini 3, and in many ways this was a wildly understated announcement, and yet the response has been, in a word, significant. While they may not have hype-posted, Anthropic minces no words in their launch post: Our newest model, Claude Opus 4.5, is available today. It's intelligent, efficient, and the best model in the world for coding, agents, and computer use. It's also meaningfully better at everyday tasks like deep research and working with slides and spreadsheets. Opus 4.5 is a step forward in what AI systems can do and a preview of larger changes to how work gets done. So let's talk first about the benchmarks, and it is no accident that the one they choose to put right at the top is SWE-bench Verified. Now, you might remember that in our discussions about Gemini 3, the only major benchmark that they didn't win, or at least match, was this one. While Sonnet 4.5 was at 77.2%, Gemini 3 Pro was at 76.2%. Not like it was super far behind, but still not technically state of the art. GPT-5.1 was also a little tiny bit ahead of Gemini 3 Pro at 76.3%, and extended that lead to 77.9% when they released GPT-5.1 Codex Max in the days following Gemini 3. For a very short time, 5.1 Codex Max was at the top of the SWE-bench Verified chart, but Opus 4.5, at least by the benchmarks, blows it out of the water at 80.9%. Writes Morgan: A 3% lead has never looked so large. And it wasn't just SWE-bench Verified. On the Terminal-Bench 2.0 agentic terminal coding benchmark, Opus 4.5 was meaningfully ahead of all the others as well. On agentic tool use, scaled tool use, and computer use, Opus 4.5 sets a new standard. Now, there were some tests where Opus 4.5 meaningfully lagged behind Gemini 3, such as Humanity's Last Exam, where they were significantly behind both without search and with search. And yet what everyone was talking about, of course, was the coding results. 
If you are a regular listener of this show, you will know that the ascendancy of Anthropic this year and the speed with which they are catching up to OpenAI has much to do with them being the preferred AI coding model for developers. That started with 3.5 and has basically continued unchallenged, although after the release of GPT-5 there have at least been credible competitors. Anthropic seems very clearly to agree with swyx on the relative importance of coding as compared to all other use cases. A couple times I've referenced Sean's post about what made him decide to go work with Cognition, where he basically framed coding as the high-value, short-timeline activity. The line, which I've shared a couple of times: code AGI will be achieved in 20% of the time of full AGI and capture 80% of the value of AGI. Whether or not that's true, Anthropic has certainly behaved as such. Now, outside just the standard SWE-bench, there were a couple of other things that people noticed. Igor Kotenkov points out that while there are ways to overfit towards the SWE-bench Verified benchmark, the more recent SWE-bench Pro is a lot more difficult and connected to the real world, and Opus blows previous models out of the water. Opus gets a 52 where Sonnet 4.5 got 43.6 and GPT-5 got just 36%. On ARC-AGI, Opus 4.5 set a new standard ahead of 5.1 and Gemini 3, and on ARC-AGI-2 they got 37.64% at $2.40 a task. Already just hours after the release, the people who had early access were also independently verifying some of these results. Bindu Reddy writes: Opus 4.5 tops LiveBench AI and is the world's best agentic model. We can confirm this after testing this over the past few days. Now, interestingly, one of the things that we've seen a lot from labs recently is the people inside the labs really talking up the specifics about what they like about the models. 
We got a spate of that from Anthropic team members, such as Jake Eaton, who writes: Opus 4.5 is very good at a lot of things and you should read the benchmarks, the model card, et cetera. But my favorite thing about working with it these past two weeks is that in conversation it is somehow more fine-grained. It has a depth and texture that for me was immediately noticeable. It also feels, interestingly, much more self-contained. Sasha de Marigny says the internal response to Opus 4.5 has been a mix of excitement, awe, and surprise, particularly around how good it is at coding. Tariq writes: Opus 4.5 is special, a world record in SWE-bench and OSWorld benchmarks. The best model we've ever had at vision. On Claude Code, I've completely stopped writing code in the IDE. I think there's so much to discover about Opus 4.5. And indeed, some of the most interesting responses from Anthropic's members come from their engineering team. Sholto Douglas writes: I am so excited about this model. First off, the most important eval: everyone at Anthropic has been posting stories of crazy bugs that Opus found or incredible PRs that it nearly soloed. A couple of our best engineers are hitting the interventions-only phase of coding. Adam Wolff writes: This new model is something else. Since Sonnet 4.5 I've been tracking how long I can get the agent to work autonomously. With Opus 4.5, this is starting to routinely stretch to 20 or 30 minutes. When I come back, the task is often done simply and idiomatically. They talked about how Claude Opus compared on a notoriously difficult candidate exam. In their announcement post they wrote: We give prospective performance engineering candidates a notoriously difficult take-home exam. We also test new models on this exam as an internal benchmark. Within our prescribed two-hour time limit, Claude Opus 4.5 scored higher than any human candidate ever. They continue: The take-home test is designed to assess technical ability and judgment under time pressure. 
It doesn't test for other crucial skills candidates may possess, like collaboration, communication, or the instincts that develop over years. But this result, where an AI model outperforms strong candidates on important technical skills, raises questions about how AI will change engineering as a profession. They also talked to staff members to estimate the impact of using Opus 4.5 in Claude Code. Nine of the 18 they surveyed, 50%, reported a productivity improvement of at least 100%. The mean self-estimated productivity improvement was 220%. They also popped open the hood a little bit on how they're making Claude even better when it comes to agentic work. In short, they have a huge emphasis on tools. Indeed, they write, the future of AI agents is one where models work seamlessly across hundreds or thousands of tools: an IDE assistant that integrates git operations, file manipulation, package managers, testing frameworks, and deployment pipelines; an operations coordinator that connects Slack, GitHub, Google Drive, Jira, company databases, and dozens of MCP servers simultaneously. To build effective agents, they need to work with unlimited tool libraries without stuffing every definition into context up front. Agents also need to be able to call tools from code. Agents also need to learn correct tool usage from examples. Following that, they shared that they were releasing three features to make all of that possible: a Tool Search Tool, which allows Claude to use search tools to access thousands of tools without consuming its context window; Programmatic Tool Calling, which allows Claude to invoke tools in a code execution environment, reducing the impact on the model's context window; and Tool Use Examples, which provide a universal standard for demonstrating how to effectively use a given tool. So again, all of this is telling a very consistent story, which is that Claude is for coding and pushing the frontier of what agents can do. 
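Those three features all point at the same underlying pattern: keep a large tool library out of the model's context and retrieve only the definitions that are relevant to the task at hand. A minimal sketch of that retrieval idea in plain Python follows; note that every name here (ToolDef, ToolRegistry, the sample tools, the substring search) is a hypothetical illustration of the pattern, not Anthropic's actual API, which uses its own parameter names and a real retrieval mechanism rather than substring matching.

```python
# Sketch of the "tool search" pattern: instead of loading every tool
# definition into the model's context, the agent queries a registry
# and injects only the definitions it needs. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class ToolDef:
    name: str
    description: str
    # In a real system, a JSON schema for parameters would live here too.


class ToolRegistry:
    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}

    def search(self, query: str, limit: int = 3):
        """Return up to `limit` tool definitions whose name or description
        mentions the query. Substring matching stands in for whatever
        retrieval (e.g. embedding search) a production system would use."""
        q = query.lower()
        hits = [t for t in self._tools.values()
                if q in t.name.lower() or q in t.description.lower()]
        return hits[:limit]


registry = ToolRegistry([
    ToolDef("git_commit", "Commit staged changes to a git repository"),
    ToolDef("jira_create_issue", "Create a new issue in a Jira project"),
    ToolDef("slack_post", "Post a message to a Slack channel"),
    ToolDef("run_tests", "Run the project's test suite"),
])

# Only the matched definitions would be placed into the model's context:
matched = registry.search("jira")
print([t.name for t in matched])  # ['jira_create_issue']
```

The payoff is that context cost scales with the handful of tools actually retrieved rather than with the full library, which is what makes "hundreds or thousands of tools" workable at all.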
So outside of interacting with the benchmarks, what were people's first impressions? Some were excited and appreciated that there was less hype around this. Nico Christie writes: Have to respect Anthropic's commitment to not vague-posting all weekend. This is the most exciting model release since Sonnet 3.5. Leo Synthwaved writes: Be Anthropic. Pretend Gemini 3 does not exist. Know you're ready to cook, it's for code anyways. Wait. Zero hype posting. Drop new Opus. State of the art for code, state of the art in ARC-AGI, better than expected, costs less than old Opus. Be more like Anthropic. On the flip side, Ethan Mollick basically asked why they were burying the lede: I'm not sure why Anthropic keeps doing very low-key launches for fairly major releases and materially important improvements to their services. I kind of think it has to do with the assessment and the specificity of their audience in and among developers. Basically, it's a group of people that they think is going to respond more to having their peers and colleagues tell them about an update rather than getting maximum social distribution because of being loud and hypey. But what about people's early tests? Victor Taelin writes: To my surprise, Opus 4.5 one-shotted my hardest calculus problem, tying with Gemini 3. In terms of first-hour impressions, couldn't be more promising, I guess. Ethan Mollick writes: I had early access to Opus 4.5 and it's a very impressive model that seems to be right at the frontier. Big gains in ability to do practical work, like make a PowerPoint from an Excel. Nico again writes: Opus 4.5 is a step function improvement for spreadsheet work. Extremely hard became doable, doable tasks became easy, and easy tasks are now solved. And yet, if there were a few examples of people trying non-coding things, coding is very much where the main excitement lies. Guillermo Rauch, the CEO of Vercel, writes: Opus is on a different level. It's unreasonably good at Next.js and the best model we've tried on v0 to date. 
Menlo Ventures' Deedy Das writes: Anthropic just dropped the best coding model, Opus 4.5. The coolest thing, he points out, is it does better at SWE-bench Verified without thinking than with 64K reasoning tokens. In other words, a super token-efficient model. Matt Shumer, who didn't have early access, said: First test of Claude Opus 4.5 and I'm already impressed. I asked it for a Colab competitor UI and it quickly pulled together this screen. Definitely better than my similar test with GPT-5.1 and, shockingly, Gemini 3. More testing to go, but this is a good start. He followed it up: Okay, wow, I'm kind of blown away. In one shot, Opus 4.5 made the UI actually functional with Python running in the browser. Some, like Super Dario, pointed out that this may not even be the best model that Anthropic has behind the scenes. They write: Good. Time to remind everyone Anthropic has a long-standing policy of not significantly pushing the frontier to prevent an arms race. Dario can hit SWE-bench scores at will. Now, whether or not that's true, the fact that there is a lot of chatter like that is, I think, a good reflection of the sentiment in the community. Maybe the most vocally excited about this is Dan Shipper and the team at Every. He writes: Breaking news: Anthropic just dropped Claude Opus 4.5. It is by far the best coding model I've ever used. And here's how Dan describes it: It extends the horizon of what you can vibe code. Explaining, he writes: The current generation of new models, Anthropic's Sonnet 4.5, Google's Gemini 3, or OpenAI's Codex Max 5.1, can all competently build a minimum viable product in one shot or fix a highly technical bug autonomously. But eventually, if you keep pushing them to vibe code more, they'd start to trip over their own feet. The code would be convoluted and contradictory, and you'd get stuck in endless bugs. We have not found that limit yet. With Opus 4.5, it seems to be able to vibe code forever. Two more observations. 
Opus 4.5, he says, takes working in parallel to a whole new level. Because it's far better at planning and coding, it can work with more autonomy, meaning you can do more in parallel without breaking anything. One of his teammates worked on 11 different projects in six hours and had good results on all of them. Lastly, he points out, it's great at design iteration. Opus 4.5, Dan writes, is incredibly skilled at iterating through a design autonomously using an MCP like Playwright. Previous models would lose the thread after a few cycles or say a design was done when it wasn't. Opus 4.5 is incredible at autonomously iterating until a design is pixel perfect. Indeed, Dan's team at Every were equally as vocal in their love of this model. Kieran Klassen writes: 2023 was GPT-4, 2024 was Sonnet 3.5, 2025 is Opus 4.5. This is the coding model launch I've been waiting for. First time I genuinely believe I can vibe code an entire app end to end without touching the implementation details. We haven't found the limit yet. Previous models would eventually trip over their own feet: convoluted code, contradictory logic, endless bugs. Opus 4.5 just keeps going. If you write code with AI, you need to try this. And I think that this idea is the thing to watch for, to see whether Kieran and Dan's first impressions here, and some of the impressions of the Anthropic team, really play out: that this is, as Kieran puts it, the first time we can vibe code an entire app end to end without touching the implementation details. It strikes me that if that is the case, that could be the most massive implication of this model. Adam Wolff from Anthropic again wrote: I believe this new model in Claude Code is a glimpse of the future we're hurtling towards. Maybe as soon as the first half of next year, software engineering is done. Soon we won't bother to check generated code for the same reasons we don't check compiler output. 
I love programming, and it's a little scary to think it might not be a big part of my job. But coding was always the easy part. The hard part is requirements, goals, feedback, figuring out what to build and whether it's working. There's still so much left to do and plenty that models aren't close to yet: architecture, systems design, understanding users, coordinating across teams. It's going to continue being fun and very interesting for the foreseeable future. But still, it's not hard to see that that's a fairly big pronouncement. Now, moving back to the realm of the non-speculative, the other thing that captured people's attention about this is that Opus 4.5 is significantly cheaper than Opus 4.1. The cost dropped from $15 to $5 per million input tokens and from $75 to $25 per million output tokens. Indeed, Jeremy from Anthropic points out one fact people won't realize immediately about Opus 4.5: it's remarkably token efficient. All in, it's often cheaper than Sonnet 4.5 and other models on cost per task success. Simon Willison points out why we probably need to be looking not just at cost per input and output token, but also token efficiency, when he writes: This is notable. Opus 4.5 is around 60% more expensive than Sonnet, $25 per million output tokens compared to $15 per million. But if it can use 76% fewer output reasoning tokens for the same complex task, it may end up cheaper. Now, that 76% came from Claude Relations' Alex Albert, who said that on SWE-bench Verified at medium effort, Opus 4.5 beats Sonnet 4.5 while using 76% fewer output tokens. Look, it's early days, but the first impressions are big. Dan Shipper again sums up: Every six to 12 months a model drops that truly shifts the paradigm. Opus 4.5 launched today, and that's what it is. Best coding model I've ever used, and it's not close. We're never going back. Brian Atwood points out: I said a month or two ago that Anthropic is a vertical AI company, and this is what I meant. 
They rightly identified that coding is the number one use case for LLMs right now and are overwhelmingly focused on it. Meanwhile, others are throwing darts in every conceivable direction, spreading themselves thin. Interestingly, just a couple days ago, Sam Altman posted: It has been amazing to watch the progress of the Codex team. They are beasts. The product and model is already so good and will get much better. I believe they will create the best and most important product in the space and enable so much downstream work. It has been pretty clear for some time now that OpenAI has come around to a similar view of the importance of coding and is very much not content to cede that ground. Summing up, Ethan Mollick writes: The main lesson of the past few weeks is that the big four US labs all seem to have figured out a path forward in continuing the exponential pace of LLM improvement, at least in the near future. More simply put, Andrew Curran writes: AI winter is canceled. Try again next year, Grinch Squad. There will, I'm sure, be lots more to discuss around Opus 4.5 as people get deeper into it. But for now, like I said, the Thanksgiving model explosion continues unabated. That's going to do it for today's episode. Appreciate you listening, as always. Till next time. Peace.
