Summary

The AI Daily Brief: "All of AI's New Models and Tools" – April 9, 2026

Host: Nathaniel Whittemore (NLW)

Episode Overview

This episode provides an in-depth analysis of the latest developments in the world of artificial intelligence, focusing on groundbreaking model releases, major enterprise tools, and the broader implications for industry and society. NLW breaks down headline stories—like OpenAI's new model rollout controversy and Anthropic’s ongoing legal struggles—before dedicating the main segment to the week's most important AI model and tool launches from Meta, Z AI, Anthropic, and Google.

Key Discussion Points and Insights

1. OpenAI's Spud Model Controversy

Timestamps: [02:00] – [06:20]

Initial Headline: Reports emerged via Axios that OpenAI planned a staggered, limited release of a new model named "Spud" due to cybersecurity risks, echoing Anthropic's earlier "too powerful to release" narrative.
Twitter/X Reaction:
- Daniel Mack: "Breaking: OpenAI will not release Spud. Dario forced their hand. Total Anthropic victory."
- Dan Shipper: "The new status symbol is making a model so powerful you can't release it?"
Correction: Shortly afterward, Dan Shipper reported that the Axios story had conflated a separate OpenAI cybersecurity product, which is being tested with a trusted tester group, with Spud itself. Spud was not being withheld, and Axios has since updated its story.

“My friends, we are playing with live ammunition here, but since I caught this in time to update, I wanted to make sure we did.” (NLW, [06:20])

2. Perplexity Computer: A Meteoric Rise

Timestamps: [06:20] – [09:10]

Perplexity’s new ‘Computer’ product: The February launch, combined with a shift to usage-based pricing, effectively doubled revenue in a single quarter.
Adoption:
- 100 million monthly active users
- $450 million ARR
- Enthusiastic adoption in the finance industry, per Geiger Capital and others.
Skepticism from some who predict competition from integrated “super app” platforms (e.g., Cowork, GPT Super App).
Impact on GitHub: Surge in AI-generated code driving record-breaking commit stats and infrastructure strain.

3. GitHub’s Agentic Coding Wave and Infrastructure Strain

Timestamps: [09:10] – [11:55]

Context: AI agents are driving explosive growth in commits. GitHub is seeing about 275 million commits per week, putting it on pace for roughly 14 billion this year.
Challenges: Outages, quota limits, and scaling pains as the platform adapts to both human and machine contributors.

“This hasn’t been designed with agents in mind.” – Peter Steinberger ([10:55])
“Pushing incredibly hard on more CPUs, scaling services and strengthening their core features.” – GitHub COO Kyle Daigle ([11:30])
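
The 14 billion figure follows from the weekly rate cited in the episode (275 million commits per week). A quick back-of-the-envelope check, assuming a flat rate over 52 weeks, which is a simplification since the episode notes the numbers are still climbing. The function name is made up for illustration:

```python
# Back-of-the-envelope projection of annual commits from a weekly rate.
# Assumption: the weekly rate stays flat for the whole year.
WEEKS_PER_YEAR = 52


def project_annual_commits(commits_per_week: int, weeks: int = WEEKS_PER_YEAR) -> int:
    """Extrapolate a flat weekly commit rate over a number of weeks."""
    return commits_per_week * weeks


weekly_rate = 275_000_000  # weekly figure cited in the episode
annual = project_annual_commits(weekly_rate)
print(f"{annual / 1e9:.1f} billion commits per year")
```

At a flat rate this lands slightly above 14 billion, consistent with the "on track for 14 billion" framing in the episode.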

4. Anthropic vs. The Pentagon: Legal Battle Deepens

Timestamps: [11:55] – [16:15]

A federal appeals court in D.C. denied Anthropic's request to suspend its “supply chain risk” designation for Pentagon contracts pending a full hearing.
Complex legal landscape: Simultaneous lawsuits in California and D.C. courts; conflicting implications for non-Pentagon agencies vs. the Pentagon itself.
Notable Quotes:

“Our position has been clear from the start. Our military needs full access to Anthropic's models if its technology is integrated into our sensitive systems.” – Acting Attorney General Todd Blanche ([15:10])
“The D.C. circuit's denial will prolong ambiguities regarding whether political considerations can drive federal procurement.” – Matt Schroers ([15:45])
“Two out of the three judges ... have been very, very sympathetic to the Trump administration's aggressive claims about executive authority in the past.” – Charlie Bullock ([16:00])
Next Steps: Rapid movement possible toward Supreme Court resolution; implications could set precedent for tech-government relations.

Main Segment: Major New Models and Tools

5. Meta’s MuseSpark: First Frontier Model from Meta Superintelligence Lab

Timestamps: [17:15] – [24:20]

MuseSpark Summary:
- First model from Meta’s Superintelligence Labs, led by Alexander Wang (Scale acquisition).
- New family replaces “Llama,” aiming to shed old baggage.
- Natively multimodal (text, vision, reasoning), supports tools & multi-agent orchestration.
Performance:
- Benchmark scores: Competitive, but not leading—near Opus 4.6, Gemini 3.1 Pro, GPT 5.4 for coding and reasoning.
- Particularly strong at visual reasoning: a state-of-the-art 86.4 on CharXiv Reasoning, six points ahead of Gemini 3.1 Pro.
Use Case Focus:
- Mark Zuckerberg: “Musespark is a world class assistant and particularly strong in areas related to personal superintelligence like visual understanding, health, social content, shopping, games and more.” ([20:20])
- Agentic focus—Meta shifting from just "assistant AI" to "agentic AI" able to take actions.
Reception:
- Ethan Mollick: “Meta's Muse Spark Thinking is fine so far, but really doesn't match the current Big three models... it's not bad, just not the vibe level that the benchmarks might indicate.” ([22:10])
- Francois Chollet: “The new model from Meta is already looking like a disappointment. Over optimized for public benchmark numbers at the detriment of everything else.”
- Alexander Wang (Meta AI): "We're always open to feedback... We have been pleasantly surprised by users' feedback in areas like visual coding, writing style and reasoning queries." ([23:25])
- Vaas: “Meta’s latest model, musespark, is actually much better than I had expected… Is it Frontier leader in any single category? No. Is it better than I expected? Yes.” ([23:40])

6. Z AI’s GLM 5.1: A New Open Source Challenger

Timestamps: [24:20] – [27:30]

GLM 5.1 Highlights:
- First open source model to beat leading Western models on some coding benchmarks (SWE-Bench Pro: 58.4 vs. GPT 5.4’s 57.7).
- Full open source release with commercial licensing (754B parameters—enormous model).
Technical Feats:
- Claim: Built a Linux desktop autonomously in 8 hours, completed database optimization tests with 6000+ tool calls and significant performance gains.
- “GLM 5.1 can do 1700 [agentic steps] right now. Autonomous work time may be the most important curve after scaling laws.” – Lou, Z AI leader ([26:50])
Meta/Narrative:
- Trained on less powerful Huawei chips—demonstrates China’s rapid progress.
- "Everyone’s freaking out about Claude Mythos while Zai casually open sourced a model built for eight hour autonomous execution." – Leet LLMs ([27:15])
Skepticism over benchmarks until third-party validation, but optimism about open source momentum.

7. Anthropic’s Claude Managed Agents: Bringing Agentic AI to Scale

Timestamps: [27:30] – [33:20]

Launch: “Claude Managed Agents” platform enables devs to build, deploy, and monitor powerful agentic systems with minimal setup.
Key Features:
- Agent harness: Software infrastructure wrapping the AI model for autonomous action.
- Built-in sandboxed environment: For secure agent execution.
- Control over permissions, monitoring, and hours-long autonomous activity.
Industry Reaction:
- Angela Jiang, Anthropic: “There is a notable gap between what Anthropic's models are capable of and what businesses are using them for. This tool is meant to close that gap.” ([28:20])
- Eric Liu (Notion): Demoed seamless onboarding automation using Claude agents.
- Alex Albert, Anthropic: “Managed agents eliminates all the complexity of self hosting an agent, but still allows a great degree of flexibility…” ([29:45])
Common Patterns: Event-triggered tasks, scheduled tasks, fire-and-forget ops, and “long horizon” tasks—showcased via community experimentation.
User Stories:
- Jared Orkin: "You no longer need an engineer to run an overnight marketing analysis. You need one sharp operator in an afternoon, set the schedule, set the guardrails and walk away. Anthropic runs the infrastructure. You pay per session hour." ([31:05])
- Powell Hurren: “I built my first managed agent. Surprised how easy it was. You describe what you want in plain English. The platform generates a full agent config all in YAML you can edit.” ([31:40])
Current Limitation: Persistent memory across sessions not yet available—better for transactional tasks than continuous learning agents.
Forecast: NLW expects widespread adoption and deeper coverage in future episodes (harness engineering deep dive teased).
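
To make the harness idea concrete, here is a minimal Python sketch of the pieces named above: a tool registry, a permission list, a cap on autonomous steps, and session-scoped memory. Everything in it (the `Harness` class, `scripted_policy`, the tool names) is hypothetical and invented for illustration. It is not Anthropic's API and makes no claim about how Claude Managed Agents is implemented; the scripted policy simply stands in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical types for illustration only; not a real vendor API.
Action = Tuple[str, str]  # (tool_name, tool_input); tool_name "done" ends the run


@dataclass
class Harness:
    tools: Dict[str, Callable[[str], str]]  # registry of callable tools
    allowed: Set[str]                       # permission list of tool names
    max_steps: int = 10                     # cap on autonomous steps
    memory: List[str] = field(default_factory=list)  # lives for one session only

    def run(self, policy: Callable[[List[str]], Action]) -> List[str]:
        """Ask the policy for actions until it says "done" or the cap is hit."""
        for _ in range(self.max_steps):
            tool_name, tool_input = policy(self.memory)
            if tool_name == "done":
                break
            if tool_name not in self.allowed:
                # Guardrail: record the refusal instead of executing the tool.
                self.memory.append(f"denied:{tool_name}")
                continue
            result = self.tools[tool_name](tool_input)
            self.memory.append(f"{tool_name}:{result}")
        return self.memory


def scripted_policy(memory: List[str]) -> Action:
    """Stand-in for a model call: picks the next action from a fixed plan."""
    plan = [
        ("word_count", "set the schedule and walk away"),
        ("delete_files", "/"),  # not on the permission list, so it is denied
        ("done", ""),
    ]
    return plan[min(len(memory), len(plan) - 1)]


tools = {"word_count": lambda text: str(len(text.split()))}
harness = Harness(tools=tools, allowed={"word_count"})
log = harness.run(scripted_policy)
print(log)
```

Because `memory` is created fresh for each `Harness` instance, nothing carries over between runs, which mirrors the current limitation noted above for transactional rather than continuously learning agents.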

8. Google Gemini Notebooks: Product Unification and Knowledge Management

Timestamps: [33:20] – [35:25]

New Feature: Notebooks in Gemini allow users to organize resources and context, plus assign custom instruction sets for project-specific use.
Josh Woodward (Google):

“Most AI chatbots give you basic projects. Gemini just built you a second brain.”
“You can take the resource management you’re doing in NotebookLM and put it directly in the Gemini app.”
Significance: Shift toward consolidating project management/user knowledge within one cohesive Google experience—potentially more practically impactful than a new model release for end users.

Notable Quotes & Memorable Moments

Dan Shipper ([03:50]): “The new status symbol is making a model so powerful you can’t release it?”
Mark Zuckerberg (Meta) ([20:20]): “We are building products that don’t just answer your questions, but act as agents that do things for you.”
Alexander Wang (Meta AI) ([23:25]): “We’re always open to feedback… publish those results for the community to understand.”
Lou, Z AI ([26:50]): “Autonomous work time may be the most important curve after scaling laws. GLM 5.1 will be the first point on that curve that the open source community can verify with their own hands.”
Angela Jiang (Anthropic) ([28:20]): “There is a notable gap between what Anthropic's models are capable of and what businesses are using them for. This tool is meant to close that gap.”
Josh Woodward (Google) ([33:45]): “Gemini just built you a second brain.”
NLW ([36:00], closing): “For those of you who are interested in going a little bit deeper in Anthropic Managed Agents, I think I’m going to do a main episode about harness engineering soon where we’ll dig deeper into that.”

Important Timestamps

OpenAI/Spud Model Drama: [02:00] – [06:20]
Perplexity Computer & GitHub Trends: [06:20] – [11:55]
Anthropic Pentagon Legal Drama: [11:55] – [16:15]
Meta MuseSpark Deep Dive: [17:15] – [24:20]
Z AI GLM 5.1 Launch: [24:20] – [27:30]
Anthropic Claude Managed Agents: [27:30] – [33:20]
Google Gemini Notebooks: [33:20] – [35:25]

Conclusion

While the week's discourse may have been dominated by “models too powerful to release,” the rest of the AI world pressed forward with rapid innovation. Key trends include the rise of agentic AI, strong open source competitors from China, a burgeoning ecosystem of enterprise tool releases, and major shifts in how users organize and interact with AI-powered productivity suites.
NLW closes with a tease of a future deep dive into agentic tool infrastructure, promising continued coverage of these fast-evolving spaces.

Transcript

[00:01]
A
Today on the AI Daily Brief, all of AI's new models and tools. And before that, in the headlines: one model that you're not getting, apparently, is OpenAI's forthcoming Spud. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Blitzy, Zencoder, and Drata. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. If you want to learn more about sponsoring the show, send us a note at sponsors@aidailybrief.ai. While at aidailybrief.ai, you can also find the link to our March AI Usage Pulse Survey. I'll have this open for a couple more days and would so appreciate you taking a couple minutes to do it. It allows us to share better data around how usage patterns in AI are changing, which is something that I think can be really valuable for people. You can also find more information on the website about things like our newsletter, which is officially back and has all the links from every day's show. Or you can find links to related experiences like Enterprise Claw, which is basically the enterprise-grade version of our free Claw Camp that's supported and led by Nuphar Gaspar. Registration for that is closing at the beginning of next week, so check it out at enterpriseclaw.ai.

OpenAI obviously could not let Anthropic have all the fun when it comes to models too powerful to release to the general public. On Thursday morning, Axios reported that OpenAI also plans a staggered rollout of their new model because, once again, of the cybersecurity risk. Now, this is just from one source, but it isn't all that surprising to see. Certainly it doesn't seem to be surprising the denizens of AI Twitter, and some think that this was a forced response to Anthropic. Writes Daniel Mack, "Breaking: OpenAI will not release Spud. The Information reported just a few weeks ago that it was set to be released, quote, 'in a few weeks.' Greg Brockman talked about it on the Big Technology Podcast. Dario forced their hand. Total Anthropic victory." Leo Synthedwave DD simply says, "LOL." Dax from OpenCode writes, "This was already a thing since at least GPT 5.3, but now we have to suffer a cycle of confusing mystery and go through this whole 'well, it was BS last time, but maybe this time is different.' We're all just caught between these two companies." I think Dan Shipper nails it when he writes, "The new status symbol is making a model so powerful you can't release it?"

Here's something I haven't had to do often. Turns out that we actually got more on Spud almost immediately after I finished recording. Dan Shipper just tweeted, "The Axios story floating around about OpenAI limiting the release of their newest model, Spud, isn't true. Just spoke to OpenAI and it appears the story conflated two things. They do have a cyber product they are testing with a trusted tester group, but this is not the same thing as Spud." The Axios story has now been updated. My friends, we are playing with live ammunition here, but since I caught this in time to update, I wanted to make sure we did.

Let's move on to our next story, about Perplexity Computer. In our show about how every AI product is turning into every other AI product, we covered Perplexity's Computer and the general OpenClaw-ification of the AI world. Based on Perplexity's financial results, it seems to be working. Between the combination of shifting to usage-based pricing and the launch in February of Computer, the company's revenue effectively doubled in a single quarter. The Financial Times reported that the company has 100 million monthly active users, tens of thousands of enterprise clients, and $450 million in ARR.
Chris Brown from Inspired Capital writes, "Perplexity back in the race with a single product launch is like a baseball team batting around the order twice and putting up 10 runs in the sixth inning." Interestingly, one of the sub-themes that you can see a lot on Twitter/X is that the finance space in particular seems to be really into Perplexity Computer. Geiger Capital writes, "Perplexity launched their AI agent, Computer, a month ago and their revenue has immediately gone parabolic. AI demand is still accelerating. Nobody is ready for the compute we need." Still, others remain skeptical. Kyle Russell writes, "I do not consider this back in the race. Insane product fit for self-driving computers pulling them up, but Cowork and GPT Super App will mog this."

In more evidence of just how much these types of use cases are growing, GitHub appears to be straining under the pressure of the agentic coding wave. Now, as capabilities have increased, it has led to an explosion in the amount of code being written, and it appears that that is nowhere more obvious than in GitHub's metrics. Last year, GitHub celebrated a huge expansion, with vibe coding allowing first-time coders to come online. GitHub saw 1 billion code commits throughout the year for the first time. This year, GitHub is seeing 275 million commits per week, putting them on track for 14 billion commits by the end of the year at the current pace, and the numbers are still climbing. GitHub COO Kyle Daigle said, "Since January, every month, every week almost now has some new peak stat for the highest usage rate ever." And while Daigle attributed the change to both agents and humans, it's clear that AI-enhanced coding is behind the massive increase in throughput. Commits to public repos from Claude Code have swelled 25x in the past six months, reaching 2.5 million last week. Now, unfortunately, the surge in the amount of code being pushed is revealing limits in GitHub's infrastructure.
Outages are becoming more frequent and many are expressing issues with the platform. OpenClaw creator Peter Steinberger complained last week, "I keep hitting quota limits from GitHub's API. This hasn't been designed with agents in mind." Kyle Daigle responded to these types of concerns, saying that GitHub is, quote, "pushing incredibly hard on more CPUs, scaling services and strengthening their core features." For now, it is just one more piece of evidence around how things are changing and how quickly.

Lastly today, Anthropic has lost the second round of their legal battle against the Pentagon as the case gets more convoluted. On Wednesday, a federal appeals court in D.C. denied Anthropic's application to suspend their supply chain risk designation pending a full hearing. The three-judge panel wrote in their order, "In our view, the equitable balance here cuts in favor of the government. On one side is a relatively contained risk of financial harm to a single private company. On the other side is judicial management of how and through whom the Department of War secures vital AI technology during an active military conflict." Now, the order did recognize the urgency of the case, and the court has scheduled oral arguments for mid-May. The court also acknowledged that Anthropic is likely to suffer some irreparable harm as a result of the case. Now, you might recall that Anthropic was granted an injunction from a California court early in March. Importantly, there's actually two separate lawsuits going on, dealing with two separate legislative powers invoked by the government. The California injunction means that non-Pentagon government agencies don't need to cancel contracts with Anthropic. The new ruling deals with the Pentagon exclusively and allows them to treat Anthropic as a supply chain risk. What's less clear is how military contractors in the private sector are supposed to deal with Anthropic, as both lawsuits deal with that issue to some extent.
Roger Parloff, the senior editor at Lawfare, shared his view that for the moment, government contractors can probably use Anthropic's technology for anything but covered government contracts. He also noted that Anthropic's models have already been restored to USAI.gov, the central platform served by the General Services Administration. Importantly, this was just a preliminary ruling that has a very high bar for success, so it is not necessarily a strong indication of how the case will ultimately resolve.

Acting Attorney General Todd Blanche called the ruling "a resounding victory for military readiness." He wrote, "Our position has been clear from the start. Our military needs full access to Anthropic's models if its technology is integrated into our sensitive systems. Military authority and operational control belong to the Commander in Chief and Department of War, not a tech company." An Anthropic spokesperson, meanwhile, said, "We're grateful the court recognized these issues need to be resolved quickly and remain confident the courts will ultimately agree that these supply chain designations were unlawful." In understated fashion, Matt Schroers, the chief executive of the Computer and Communications Industry Association, commented, "The D.C. Circuit's denial will prolong ambiguities regarding whether political considerations can drive federal procurement." Charlie Bullock, a senior research fellow at the Institute for Law and AI, told The Information he was unsurprised by the result, noting, "Two out of the three judges on the D.C. Circuit panel have been very, very sympathetic to the Trump administration's aggressive claims about executive authority in the past." Expanding his analysis on X, Bullock noted that the case is moving quickly and could receive a final order within six weeks. Now, even if they fail to convince the panel, Anthropic could appeal to the full D.C.
Circuit, which is majority Democrat, and also have the timing right to get their case on this year's Supreme Court docket in the fall. Bullock predicted Anthropic would probably succeed at the Supreme Court, commenting, "The dynamic here is not left versus right. It's 'cares about the law at least a little bit or doesn't like the administration' versus 'does not care about the law at all and likes the administration.'" Now, how, if at all, the revelations about the power of Anthropic's Mythos impact this remains to be seen, but for now that is going to do it for the headlines. Next up, the main episode.

Alright folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise: how work gets done, how teams collaborate, how decisions move. Not as a tech initiative, but as a total operating model shift. And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center, while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us/ai. That's www.kpmg.us/ai.

Want to accelerate enterprise software development velocity by 5x? You need Blitzy, the only autonomous software development platform built for enterprise codebases. Your engineers define the project: a new feature, refactor, or greenfield build. Blitzy agents first ingest and map your entire codebase. Then the platform generates a bespoke agent action plan for your team to review and approve. Once approved, Blitzy gets to work autonomously, generating hundreds of thousands of lines of validated, end-to-end tested code. More than 80% of the work completed in a single run.
Blitzy is not generating code, it's developing software at the speed of compute. Your engineers review, refine, and ship. This is how Fortune 500 companies are compressing multi-month projects into a single sprint, accelerating engineering velocity by 5x. Experience Blitzy firsthand at blitzy.com. That's blitzy.com.

So coding agents are basically solved at this point. They're incredible at writing code. But here's the thing nobody talks about: coding is maybe a quarter of an engineer's actual day. The rest is standups, stakeholder updates, meeting prep, chasing context across six different tools. And it's not just engineers. Sales spends more time assembling proposals than selling. Finance is manually chasing subscription requests. Marketing finds out what shipped two weeks after it merged. Zencoder just launched Zenflow Work. It takes their orchestration engine, the same one already powering coding agents, and connects it to your daily tools: Jira, Gmail, Google Docs, Linear, Calendar, Notion. It runs goal-driven workflows that actually finish. Your standup brief is written before you sit down. Review cycle coming up? It pulls six months of tickets and writes the prep doc. Now you might be thinking, didn't OpenClaw try to do this? It did, but it has come with a whole host of security and functional issues which can take a huge amount of time to resolve. Zencoder took a different approach: SOC 2 Type 2 certified, curated integrations, tighter security perimeter, enterprise grade from day one, model agnostic, and works from Slack or Telegram. Try it at ZenFlow free.

Let's face it, if you're leading GRC at your organization, chances are you're drowning in spreadsheets. Balancing security risk and compliance across shifting threats and regulatory frameworks can feel like running a never-ending marathon. Enter Drata's agentic trust management platform.
Designed for leaders like you, Drata automates the tedious tasks like security questionnaire responses, continuous evidence collection, and much more, saving you hundreds of hours each year. With Drata, you spend less time chasing documents and more time solving real security problems. But it's more than just a timesaver. It's built to scale and adapt to your organization's needs. Whether you're running a startup or leading GRC for a global enterprise, with Drata you get one centralized platform to manage your risk and compliance program. Drata gives you a holistic view of your GRC program and real-time reporting your stakeholders can act on. With Drata, you can also unlock a powerful trust center, a live, customizable product that supports you in expediting your never-ending security review requests in the deal process. Share your security posture with stakeholders or potential customers, cut down on back-and-forth questions, and build trust at every interaction. If you are ready to modernize your GRC program and take back your time, visit drata.com to learn more.

Welcome back to the AI Daily Brief. One would be forgiven for thinking that this week has been defined by models that we actually didn't have access to. A huge part of the discourse throughout the week has of course been about Anthropic's Mythos, a model which it found too powerful to release in the normal way that it had been, and which right now is only in the hands of about 40 partners for some very limited cybersecurity-focused engagement. Then just this morning, as you heard in the headlines, we also heard that OpenAI planned its own staggered rollout of their new model for similar reasons: cybersecurity risks. Now, even among people who understand theoretically why these companies are doing this, there's still, I think, a bit of a sentiment of "don't tell me about the new toys if I can't play with them." But luckily the rest of the AI industry is not slouching at all.
And in fact, even Anthropic themselves have given us something different that's still pretty powerful to play with. So let's talk through all of the other models and tools that have been released, starting with the first big model release from the new Meta Superintelligence Lab.

Muse Spark is Meta's first new model release in over a year. It's also the first model to come from the new Meta Superintelligence Labs division, which is of course the collection of superstar, crazy-high-paid AI researchers that was put together last summer and brought together under the leadership of Alexander Wang, who was brought in through the $14 billion plus partial acquisition of his company, Scale. Muse Spark will be the first of the Muse family of models, with Meta ditching the Llama name and associated baggage. The Muse models are natively multimodal reasoning models, similar to Google's Gemini architecture. Meta noted that they support tool use, visual chain of thought, and multi-agent orchestration. Now, those features are at this point kind of table stakes for the current generation, but based on fairly low expectations, people were still encouraged to see them present here. Meta didn't indicate how large the model is or whether it uses a mixture-of-experts architecture. In fact, we don't really know at all where this model sits in the model family. Executives referred to it as small and fast, but its performance and comparison points looked closer to a mid-sized or large model on the benchmarks.

At first glance, Muse Spark looks pretty capable. It scored 52.4 on SWE-Bench Pro, for example, putting it within a few points of Opus 4.6, Gemini 3.1 Pro, and GPT 5.4 for coding. On Humanity's Last Exam it scored 42.8, which is slightly better than Opus but trailing Gemini and GPT 5.4. Now, interestingly, on that one with tools enabled, Muse's score only jumped to 50.4, leaving it trailing all three of those major rivals by a few points.
This could suggest the model isn't as good at web search or tool use as the others, but of course this is only a single data point. The general sense you get from the benchmarks is that Muse is in the mix but certainly not leading the pack. And you can certainly tell where Meta is trying to put the emphasis. Rather than leading with their scores on Humanity's Last Exam or SWE-Bench, those scores are buried fairly deep in the results table, with Meta instead leading on the multimodal benchmarks where Muse Spark excels. The model scored 86.4 on CharXiv Reasoning, which is a measure of visual comprehension, which would actually have that being a state-of-the-art result, beating Gemini 3.1 Pro by six points. Muse Spark did slightly trail Gemini on an assortment of other visual tests, but the results were strong enough to suggest the model will be highly capable.

Now, these benchmarks also gel with how Meta views the model's purpose. Unlike the other model companies, where there is increasing focus on coding use cases and enterprise use cases more broadly, Muse Spark is designed primarily to drive personal agents. In a Threads post, Mark Zuckerberg wrote that "Muse Spark is a world class assistant and particularly strong in areas related to personal superintelligence like visual understanding, health, social content, shopping, games and more." And interestingly, in that same note, while Zuckerberg is trying to draw a clear differentiation between the work-focused use cases the other companies are pursuing, there is still broadly, even here and even in the personal realm, a shift from assistant AI to agentic AI. Zuckerberg ends his Threads post by saying, "We are building products that don't just answer your questions, but act as agents that do things for you." Giving more examples of where these capabilities will be useful, Meta wrote that they enable interactive experiences like creating fun mini games or troubleshooting your home appliances with dynamic annotations.
The model will immediately go into service driving Meta AI, and will presumably arrive across their social media platforms over time. Muse Spark will function in three modes: instant, with no reasoning; thinking mode, which enables reasoning; and contemplating mode, which performs deep-research-style multi-step reasoning. Contemplating mode, however, won't be available at launch. Meta also emphasized the health assistant use case, touting that they collaborated with a thousand physicians to curate training data for factual accuracy. Now, in this case, there doesn't seem to be a separate interface for health; it's just functionality that's being encouraged on Meta's existing platforms. Meta AI leader Alexander Wang argued that Muse Spark is just the beginning, posting, "This is step one. Bigger models are already in development with infrastructure scaling to match. Private API preview open to select partners today, with plans to open source future versions."

One strand of the response that's been fairly consistent was basically "welcome back to the party, guys." To some, even though this model is clearly behind the other leaders, the fact that the Meta Superintelligence Lab was able to get it out in less than a year since that lab was formed was a feat in and of itself. Others were just less impressed. Ethan Mollick writes, "After playing with it a bit, Meta's Muse Spark Thinking is fine so far, but really doesn't match the current Big three models. It is also a bit weird, like some strange language and tone, a little loose with facts, etc." After giving a few examples, he concludes, "Anyhow, it's not bad, just not the vibe level that the benchmarks might indicate. And for a first re-entry into the frontier model space, given the engineering efficiencies they achieved, it feels like a solid attempt. I'm sure we will see better from Meta in the future." ARC Prize founder Francois Chollet was less forgiving.
He wrote, "The new model from Meta is already looking like a disappointment. Over optimized for public benchmark numbers at the detriment of everything else. Knowing how to evaluate models in a way that correlates with actual usefulness is a core competency for AI labs, and any new lab is unlikely to be successful without first figuring that out." Wang actually decided to respond to that one, saying, "We're always open to feedback and welcome any perspective on weaknesses you've noticed in the model from using it. We're quite upfront that our model does not perform well on ARC-AGI-2, for example, and publish those results for the community to understand. That might reflect some areas of improvement of the model that we could focus on in the future." In general, though, Wang reports, "We have been pleasantly surprised by users' feedback on the model in areas like visual coding, writing style and reasoning queries." Vaas on Twitter, who previously did work on Meta AI, said, "Meta's latest model, Muse Spark, is actually much better than I had expected. Is it benchmark maxed? Yes, 100%, but so is every other model. Is it frontier leader in any single category? No. Is it better than I expected? Yes. I look forward to the eventual open source version. Feels like they're coming back to life. Never fade Zuck."

Now, speaking of open source, another model that we got this week that got completely overshadowed by the Mythos announcement was Z AI's GLM 5.1. And at least on the benchmarks, it's the first open source model to overtake leading Western models on coding benchmarks. The new frontier model, which like I said is called GLM 5.1, achieved a 58.4 on SWE-Bench Pro, beating GPT 5.4 and Opus 4.6, which scored 57.7 and 57.3 respectively. Z AI also provided a mixed benchmark that included Terminal Bench 2.0 and NL2Repo as well, which had GLM 5.1 slightly behind the two US leaders but ahead of Gemini 3.1 Pro.
Still, if those benchmarks hold, it puts GLM 5.1 in the top echelon of frontier models, with a clear separation from Qwen 3.6 Plus and Kimi K2.5. And indeed, what most people are latching onto is the fact that this is a full open source release with commercial licensing. It's a gigantic 754-billion-parameter model, so you're not going to be running it locally on a Mac mini. Still, it gives developers the opportunity to build on top of current-generation state-of-the-art models for kind of the first time. We've been tracking the apparent shift in Chinese lab strategy away from open source recently, but this release suggests that leading Chinese labs are at least still somewhat willing to give away their best-performing models. In terms of performance, Z AI provided a few impressive examples in agentic coding. They claim that GLM 5.1 spent 8 hours autonomously building a Linux desktop, using a self-review loop to remove the need for human intervention. And this is kind of what they emphasized in their announcement post as well, calling the blog post "GLM 5.1: Towards Long Horizon Tasks." Running a vector DB test, the model was capable of carrying out the database optimization with significant results. The model carried out over 600 iterations using more than 6,000 tool calls to deliver 6x the performance of a standard 50-turn session. Z AI leader Lou wrote on X, "Agents could do about 20 steps by the end of last year. GLM 5.1 can do 1,700 right now. Autonomous work time may be the most important curve after scaling laws. GLM 5.1 will be the first point on that curve that the open source community can verify with their own hands." Now, of course, whenever a company reports their own benchmarks, it's always worth taking it with a grain of salt and waiting to see what the actual vibes are around it as people get their hands on it. But at least at first glance, the model looks like a big step up for Chinese AI.
It was trained entirely on less powerful Huawei chips, again demonstrating that the Chinese hardware stack can produce some powerful results. Also, coming just two months after the release of Opus 4.6 and GPT 5.4, it suggests the US continues to be only months ahead of its Chinese rivals. Leet LLMs summed up the gap in the conversation on X, saying, "Everyone's freaking out about Claude Mythos while Z AI casually open-sourced a model built for eight-hour autonomous execution."

Now, speaking of Claude and Anthropic, if you thought they were going to slow down for the sake of discussion around Mythos, think again. On Wednesday afternoon the company announced Claude Managed Agents, which they are pitching as everything you need to build and deploy agents at scale. In their announcement tweet, which has been seen 16 million times, they write that Claude Managed Agents pairs an agent harness tuned for performance with production infrastructure so you can go from prototype to launch in days. It seems like part of the goal of this is to close the capability gap that we've been following on the show as well. Anthropic's head of product for the Claude platform, Angela Jiang, argued to Wired that there is a "notable gap between what Anthropic's models are capable of and what businesses are using them for." This tool is meant to close that gap. Here's how Wired describes it, which is actually one of the simpler explanations that I saw: "Managed Agents will give developers an agent harness, which describes all the software infrastructure that wraps around an AI model to help it work agentically or take actions on behalf of a user. In practice, a harness is made up of software tools, a memory system, and other infrastructure. Agents made through Claude Managed Agents will also come with a built-in sandboxed environment in which the agent can spin up software projects in a secure setting."
The product also allows developers to create agents that can run autonomously for hours in the cloud, monitor what other Claude agents are doing, and toggle permissions that allow agents to access certain tools. Katelyn Lesse, the head of engineering for the Claude platform, said, "When it comes to actually deploying and running agents at scale, this is a complex distributed systems engineering problem. A lot of customers we're talking to previously had a whole bunch of engineers whose job it would have been to build and run those systems at scale. Now that we are giving them that bit out of the box, they're able to have those same engineers be focused on core competencies of business and their product." One of the demos provided was in collaboration with Notion, with product manager Eric Liu showing how he can offload a string of client onboarding tasks to his customized Claude agent. The big point was that the agent was running natively in Notion with full access to everything it needed to complete the task. Rather than needing to spend days setting up permissions, validating workflows, and figuring out local hosting, Liu was able to drop the managed agent in using a virtual session. The platform also allows companies like Notion to build their own agents on top of Claude and offer them externally, bringing agents to market more rapidly. Anthropic's Alex Albert writes, "Managed Agents eliminates all the complexity of self-hosting an agent, but still allows a great degree of flexibility with setting up your harness, tools, skills, etc." Claude Code's Thariq writes, "Managed Agents is the first agent-in-the-cloud API that has the right mix of simplicity and complexity. Implementation details like how you manage a sandbox are abstracted, but you have a lot of control over the actual execution of the model." Anthropic's Lance Martin gave a bunch of examples of what characteristics agents being built with Managed Agents had.
He writes, "Some of the common patterns I've noticed across examples and my own. Event-triggered: a service triggers the managed agent to do a task. For example, a system flags a bug and a managed agent writes the patch and opens the PR. No human in the loop between flag and action. Scheduled: a managed agent is scheduled to do a task. For example, I and many others use this platform for scheduled daily briefs, for example of X/Twitter or GitHub activity, what a team of agents is working on, etc." He also talks about fire-and-forget tasks, with humans triggering the managed agent to do a task via Slack or Teams, and long-horizon tasks like Andrej Karpathy's auto-research idea. Now, it's early, but some of the first experiments seem to validate some of those patterns. Jared Orkin writes, "You no longer need an engineer to run an overnight marketing analysis. You need one sharp operator and an afternoon. Set the schedule, set the guardrails, and walk away. Anthropic runs the infrastructure. You pay per session hour." Now, he points out, though, "The catch nobody's saying out loud: someone still has to tune the prompt every Friday and act on the brief by 9 a.m. Monday. That's a job. That's the job we staff. The agent writes the brief, the operator runs the day." Paweł Huryn started working on something similar to what I was trying last night. He writes, "I built my first managed agent. Surprised how easy it was. You describe what you want in plain English. The platform generates a full agent config: model, system prompt, tools, MCP servers, permission policies, all in YAML you can edit. I asked for an email reader that needs my approval before acting." Now, one thing he also notes is not exactly available yet, although it is something that they're working on, is persistent memory across sessions. That means that the types of tasks that Managed Agents is well suited for right now are a little bit more transactional and discrete.
For example, some of the agents that I've been experimenting with recently are basically persistent learners that help with AI strategy from within Slack, which effectively is sort of an agentic version of what we do at Superintelligent. But that persistence isn't exactly well suited to the way that they've built Managed Agents right now. Still, there is clearly going to be a ton that people build with these tools, and I think it's going to very quickly become a core part of the overall Claude and Claude Code ecosystem.

Lastly this week, one that seems little at first, but which is a massive quality-of-life upgrade: Google has introduced what they're calling Notebooks in Gemini. Up to now, the way you managed projects in Gemini was frankly a little weird and unintuitive. They had their Gems feature, which was sort of, but not exactly, a version of Projects in the way that you would manage it in ChatGPT or Claude. But now this new Notebooks functionality is much more directly that, allowing users to organize and collate a set of resources, documents, context, et cetera for particular tasks. Users can also build out custom instruction sets for Gemini within their notebooks, allowing them to modify the model for each different project they have. Still, Josh Woodward from Google argues that this goes beyond the normal project settings. He writes, "Most AI chatbots give you basic projects. Gemini just built you a second brain." He goes on to call Notebooks "some of the magic of NotebookLM directly integrated into Gemini app." Basically, you can take the resource management that you're doing in NotebookLM and put it directly in the Gemini app. Writes Google, "Think of notebooks as personal knowledge bases shared across Google products, starting in Gemini."
Now, one of the common critiques you will hear when it comes to Google is that even if people like their models, the product suite is so spread out across all the different surface areas that people interact with Google through that it can be confusing and even overwhelming. It makes sense, then, to see them start to consolidate, if not the surface area of the products, then the transportability of the features across those different surface areas, so that effectively any door you walk in gets you to the same room. This may not be a full model, but I think when it comes to many Gemini users' day-to-day experience, this will be an even bigger improvement than if they had released Gemini 3.3.

Now, for those of you who are interested in going a little bit deeper on Anthropic's Managed Agents, I think I'm going to do a main episode about harness engineering soon where we'll dig deeper into that. For now, however, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching, as always, and until next time, peace.