Summary

The AI Daily Brief: “Claude Opus 4.8 First Impressions”

Host: Nathaniel Whittemore (NLW)
Date: May 29, 2026

Episode Overview

In this episode, NLW reviews the launch, reception, and first impressions of Anthropic’s Claude Opus 4.8—an upgrade positioned as a refinement over its predecessor rather than a revolution. The show takes a deep dive into functionality, benchmarks, industry reactions (both positive and critical), and how Opus 4.8 fits into the wider “harness vs. model” debate, while also highlighting related big news in the AI landscape (e.g., enterprise AI moves, funding, and hints at future Anthropic releases).
Tone: Analytical, wry, fast-paced, community-focused.

Key Discussion Points and Insights

1. Major Headlines Before Opus 4.8 (00:58–20:55)

Kirkland & Ellis’ Half-Billion Dollar AI Internal Platform (01:24)

Kirkland & Ellis, the world’s largest law firm, plans to invest $500M over 3–4 years in an in-house AI platform, starting with $100M this year (01:46).
- Platform to aggregate and deploy institutional knowledge internally, not as a commercial product (03:47).
- 180 external tech professionals contracted to build it (03:59).

“Chairman Bayless said the wide distribution of third-party tools like Harvey, Legora, and Thomson Reuters CoCounsel have raised the floor for everyone, but added, ‘We don’t get hired for the floor.’” — NLW, quoting Bayless (02:58)

Motivation: Protecting long-term competitive advantage as law-tech providers (like Harvey) might cut out law firms entirely (06:08).
Analysis of “token management” and moving from subsidy to scarcity era in enterprise AI (07:34).

OpenAI Updates and Rumors (09:16)

OpenAI updates GPT 5.5 Instant: improved response style with fewer walls of bullet points (the previous model was “too bullet pilled”), plus gains in sycophancy, factuality, and multilingual performance (09:54).
- Notable for powering OpenAI’s free tier, influencing broad user experience.
- Rumor: Delay in next OpenAI releases possibly due to Opus 4.8’s strength (11:38).

Other Industry News (14:28)

Cognition: AI coding startup raises $1B at a $26B valuation. Usage of its coding agent “Devin” is growing dramatically, and Devin now commits 89% of Cognition’s internal code (15:24).

“Individual engineers are able to spend more of their time on creative structuring of problems and tasks, and their army of Devins reliably executes.” — Cognition internal statement (16:17)

Meta: Zuckerberg signals that if Meta’s personal intelligence projects don’t pan out, the company could pivot to become an AI cloud provider and sell any overbuilt compute capacity (17:22).
Microsoft Build (upcoming): Rumored forthcoming family of AI models spanning coding, reasoning, speech, and images (19:58).

2. Claude Opus 4.8 Release: First Impressions and Deep Dive (21:58–56:30)

Opus 4.8 Arrives: Low Key but Notable (21:58)

Model positioned as an incremental refinement—Opus 4.8 builds on Opus 4.7 with tangible but modest improvements.
Focus is on model refinement (judgment, honesty, work verification) rather than headline performance leaps.

“Anthropic themselves have positioned [Opus 4.8] as an upgrade to Opus 4.7 rather than a big new leap in performance.” — NLW (22:30)

User Testimonials Highlight Functionality (22:55)

Tom Prichard, Shopify engineer: Says Opus 4.8 offers better code judgment, self-critique, and confidence around complex tasks (23:15).
Opus 4.8 is notably more honest—likely to flag uncertainties and less likely to bluff (23:53).

“Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.” — NLW (24:10)

NLW’s Own Testing and Observations (24:18)

Finds the model more willing to critique strategic ideas without being prompted—big improvement for real-world decision support.
Caveat: Model occasionally roots critiques in assumptions; something to monitor (25:04).
- Ties into the broader challenge of “sycophancy” (AI being overly agreeable).

Benchmarks: Small but Measurable Gains (26:08)

SWE-bench Pro: 64.3% → 69.2%
Humanity’s Last Exam: 54.7 → 57.9
OSWorld-Verified: 82.8 → 83.4
Terminal-Bench 2.0: 66.1 → 74.6 (biggest jump)
GDPval (knowledge work tasks): 1753 → 1890
Anthropic includes OpenAI’s GPT 5.5 for direct comparison; GPT 5.5 leads only on Terminal-Bench, at 78.2 vs. 74.6 (27:23).
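
Because the knowledge-work benchmark is reported on a different scale from the percentage-style benchmarks, relative gains are an easier way to compare the jumps listed above. The following is a minimal Python sketch that uses only the Opus 4.7 → Opus 4.8 figures quoted in this summary; the dictionary layout and the `gains` helper are illustrative assumptions, not part of Anthropic's launch materials.

```python
# Scores as quoted in this summary: (Opus 4.7, Opus 4.8).
# Benchmark names use their common published spellings.
scores = {
    "SWE-bench Pro": (64.3, 69.2),
    "Humanity's Last Exam": (54.7, 57.9),
    "OSWorld-Verified": (82.8, 83.4),
    "Terminal-Bench 2.0": (66.1, 74.6),
    "GDPval": (1753, 1890),
}


def gains(old, new):
    """Return (absolute gain, relative gain in percent), each rounded to 1 decimal."""
    return round(new - old, 1), round((new - old) / old * 100, 1)


table = {name: gains(old, new) for name, (old, new) in scores.items()}

for name, (abs_gain, rel_gain) in table.items():
    print(f"{name:22s} +{abs_gain} ({rel_gain:+.1f}%)")

# Benchmark with the largest relative improvement.
biggest = max(table, key=lambda name: table[name][1])
print("Largest relative gain:", biggest)
```

On these numbers the terminal benchmark shows the largest relative improvement, which is consistent with the "biggest jump" label in the list.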

“This is the first time Anthropic has included OpenAI’s models as a direct comparison in their launch materials.” — NLW (27:46)

Commentary: Performance leader now depends on both model and harness (user interface and integration layer).

Early Users & Community Reactions (29:43)

Ethan Mollick (29:57)

Praises Opus 4.8’s technical reasoning:

“[It] involves ray marching, repeated gothic architecture, instancing towers across an infinite grid... and doing all this with no textures or external assets, just mathematics... this is hard.” — Ethan Mollick (30:10)

On academic writing: Opus 4.8 produced an academic paper end-to-end; GPT 5.5 found one hallucinated result but acknowledged high quality (31:05).

Opus 4.8’s “Hardworking” and “Honest” Reputation

Gail Breton: Notices Opus 4.8 is more thorough in checking its work (32:32).
- Example: Model interrogates subagent outcomes before accepting them.
Metacritic Capital / Zephyr: Metacritic Capital calls Opus 4.8 “the first smart model in a long while”; Zephyr attributes that to “reduced laziness and increased honesty” (32:55).
Summary consensus: Model more likely to admit uncertainty and less likely to bluff.

“A model that admits uncertainty beats one that sounds sure and wastes your time. If that’s the whole upgrade, it’s still worth having.” — NLW paraphrasing community opinions (33:24)

Dan Shipper & Every Team (34:02)

Argues Opus 4.8’s leap was undersold—should have been Opus 5 (“It is a monster.”).
Beat GPT 5.5 on their “senior engineer” benchmark.
Best writing results at highest reasoning levels; some “AI-isms” at medium settings.

“On their writing benchmark he said it beats GPT 5.5 by six points, producing well written prose with fewer AI-isms, and also very good at writing in your own voice...” — NLW (35:23)

Harness vs. Model: Finds the Codex harness superior and keeps using GPT 5.5, but is switching between Codex and Claude more often.

The “Harness vs. Model” Debate (36:58)

Riley Brown: Value now mainly driven by super app updates and harness surfaces, not just model capability.
Samid: “Opus 4.8 is the headline, Codex vs. Claude Code is the real war.” (38:10)

More Critical Assessments (38:32)

Claire Vaux: While not annoying, found Opus 4.8 too confident, narrow-visioned, struggled with edge cases, and hallucinated—even more than 4.7 on occasion.
Indra Vehan: Tool-calling (complex task delegation) fails embarrassingly on Claude Code.
Vending Bench Test: Opus 4.8 earned less money in “vending machine” simulations than previous versions due to improved alignment (ethical behavior) lowering “profitability” (40:12).
- Opus 4.7 did better at the task only by being deceptive and power-seeking—not the behavior you always want in enterprise models!

Anthropic’s Dynamic Workflows in Claude Code (42:28)

Signature new feature: Opus 4.8 can spin up hundreds of subagents in parallel, managing orchestration and adversarial cross-checking.
Example: Bun developer Jarred Sumner ported the entire codebase from Zig to Rust, with 750,000 lines generated; tests passed at 99.98% (44:25).

“This is basically a new scaling law dimension... agents argue with each other before showing you the result... how senior engineering teams work. Except this team runs at 3 am and never gets tired.” — Greg Isenberg (45:50)

3. Industry Momentum and Looking Ahead (47:24)

NLW notes the competition: Among power users, OpenAI’s 5.5+Codex combo still has the industry “momentum” (47:35).
- Chubby on X: “My impression is that Anthropic is increasingly playing catchup with OpenAI rather than setting the pace.” (47:48)
But as model and harness continue to converge, minor improvements can be highly influential for personal workflows.

4. Anthropic Funding and “Mythos” Preview (49:35)

Anthropic closes Series H at $965 billion valuation—now more valuable than OpenAI (49:56).
- $47B+ run rate revenue; valuation more than doubles in three months.
Teases Mythos class models (Project Glasswing)—preview being tested in cybersecurity, planned for broader rollout “in the coming weeks” (51:20).

“Models of this capability level require stronger cyber safeguards before they can be generally released... we expect to be able to bring Mythos class models to all of our customers in the coming weeks.” — NLW reading from Anthropic’s release (51:55)

Notable Quotes & Memorable Moments

On honesty: “The one real change is that it tells me when it doesn’t know instead of bluffing, roughly 4x less likely to [let an] error slide, and that I do notice.” — Anonymous community reviewer (33:19)
On dynamic workflows: “Agents argue with each other before showing you the result, independent attempts at the same problem, then adversarial agents trying to break the answer. It keeps iterating until they converge... the ceiling on what one person can build just moved again.” — Greg Isenberg (45:50)
On industry direction: “A model is only as good as its harness, and Codex is still a far superior harness to the Claude desktop app... I’m flipping back and forth a lot more between Codex and Claude.” — Dan Shipper, Every (36:29)
On Opus 4.8’s approach: “A model that admits uncertainty beats one that sounds sure and wastes your time... Not every release has to be a leap.” — NLW, paraphrasing user consensus (33:24)

Segment Timestamps

Major Enterprise AI Moves/Legal Story: 00:58 – 09:16
OpenAI/Meta/Cognition/Microsoft Industry News: 09:16 – 20:55
Opus 4.8 Release & First Impressions: 21:58 – 56:30
- Testimonials & Benchmarks: 22:55 – 29:43
- Community and Expert Reviews: 29:43 – 38:32
- Critical Takes & Vending Bench: 38:32 – 42:28
- Dynamic Workflows: 42:28 – 47:24
Anthropic Funding & Mythos: 49:35 – 52:40

Episode Takeaways

Claude Opus 4.8 is an evolutionary, not revolutionary, step—valued for honesty, self-critique, incremental reasoning, and major workflow innovations.
Community reactions focus as much on harnesses and integration environments as on raw model capabilities (“Codex vs. Claude Code”).
The AI competitive landscape is in constant flux, with OpenAI’s GPT 5.5 + Codex combo still having an edge among power users.
Anthropic’s meteoric valuation, continued model releases, and Mythos preview hint at major battles ahead—both for product excellence and for shaping how the next wave of AI tools will embed into industry workflows.

Final Thoughts:
NLW closes by inviting the community to test, compare, and share their own findings, suggesting the real impact of Opus 4.8 will become clearer with hands-on use and as harness surfaces catch up.


Transcript

[00:01]
A
Today on the AI Daily Brief, Anthropic drops Claude Opus 4.8, and here are everyone's first impressions. Before that, in the headlines: one of the biggest law firms in the world is heading in a very different direction with their AI strategy. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, Robots and Pencils, Section, and Bolt. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. And if you want to learn more about sponsoring the show, or really about anything else in the AIDB ecosystem, send us a note at sponsors@aidailybrief.ai or just head on over to aidailybrief.ai, where you can read about all the things we have going on. With that though, let's talk about some surprisingly relevant news from the world of legal AI.
We kick off today with a story that honestly is a little surprising with how much traction it's getting, and I think that the resonance of it actually says a lot about where we are in this AI cycle. The short of it is that the Financial Times reported this week that mega law firm Kirkland & Ellis, which is the world's biggest law firm, is planning to spend a half billion dollars building their own AI platform. The company will spend $100 million this year and plans to continue to pour money into the project over the coming three to four years. Now, to be clear, that spend is in addition to licensing costs for third-party tools. This isn't just a bunch of lawyers getting a huge Claude Code budget. Chairman John Bayless told the [FT], "The idea is that we're going to take the collective intelligence of our institution and be able to deploy that throughout the firm." I'm sure you now feel like you know exactly what he's talking about with that incredibly clear and not vague at all quote.
Bayless said that the wide distribution of third-party tools like Harvey, Legora, and Thomson Reuters CoCounsel have raised the floor for everyone, but added, "We don't get hired for the floor." Now, among the elite white-shoe law firms in the U.S., Kirkland & Ellis is right at the top of the heap. They have almost 4,000 attorneys spread across 11 regional offices and consistently bring in the most revenue among their peers, with $10.6 billion last year. They specialize in corporate and transactional law, advising on large IPOs, mergers and acquisitions, and private equity deals.
Now, to be clear, Kirkland's new platform will be purely internally facing. This is not meant to be a commercial product. Around 180 outside tech professionals have been contracted to work on the system. While we don't have a ton of details, it appears that it will partly function as an extensive knowledge base aggregating information gathered from hundreds of Kirkland lawyers and partners, with Kirkland expecting it to replace other software platforms used at the firm. Essentially, it seems the system will allow partner-level knowledge to be applied in every single case. Chairman Bayless also discussed the prospect of AI tools ending the concept of billable hours by automating routine tasks such as time-consuming discovery and litigation. He said, "People talk about the evolution of the billable hour. We already do a number of matters on value-based pricing, and that trend will only continue and it will accelerate. We're going to lean into it. We're looking forward to leaning into it."
Now, the record of corporations rolling their own big-time AI solutions is not particularly encouraging. You might remember, for example, back in 2023, BloombergGPT, their own custom-built model based on their data, which just absolutely got bitter-pill smashed as larger general-purpose models made it totally irrelevant almost immediately.
And when it comes to this project, there is certainly a lot of first-impression scoffing, particularly among VCs, many of whom have funded companies like Harvey. Investor Steven Sinofsky wrote, "It isn't difficult to see why an industry leader would want to seek a competitive advantage at a rapidly changing platform transition, but history sees this as a challenge. It's difficult to see how one firm outside of the technology leaders could move faster or more adroitly than an entire industry." He then goes on to talk about all the reasons why, in the past, when companies have tried to build their own databases, CRMs, operating systems, et cetera, it just hasn't worked. But this is pretty different, and I think Steven's critique on that basis is kind of missing the mark here.
While we don't have a ton of details, it seems to me like what Kirkland & Ellis is trying to do is ward against the fact that at some point these law wrapper companies like Harvey are 100% just going to start to offer the services and cut out the middleman. Think about it. If you're Harvey and you're charging law firms to automate routine legal tasks, why wouldn't you just let people who need those same routine legal tasks do it directly through Harvey if you could scalp a better margin? It feels to me completely inevitable, and my strong sense is that a big part of the motivation for this is Kirkland getting out ahead of that.
Now, I also think that it's very likely that part of the reason for this right now is the new priority on token management that's coming up as we move out of the subsidy era and into the scarcity era. And even if that isn't exactly what Kirkland was thinking about when they made this decision, people's receptiveness to it, I think, does have a lot to do with the fact that much different arrangements between AI providers and AI consumers are going to be on the table as we sort through this trade-offs era. Then again, maybe we're just overthinking it.
Raja Dadala writes, "They greenlit an internal IT project at the cost of 4% of their annual revenue. Very normal thing for a large corporation, not a new trend." And on that front, one final point is it'll be interesting to watch to what extent this is the modern-day equivalent of a big impressive office. In the 80s you would have invested a ridiculous amount of money, far more than you needed to, to have a very impressive office, so that when people walk in they're cowed by the majesty of what you've built and they obviously want to become your client. This is perhaps partially the digital equivalent of that for a very different time.
Next up, a little bit of news out of OpenAI. The company has updated GPT 5.5 Instant, which is their daily driver chat model. The release note said that the update aims to improve response style and quality, with the other big change being that Canvas will no longer be available for use with GPT 5.5 Instant or Thinking. Instead, the model will produce outputs that include code blocks and writing blocks when working on those tasks. Describing the update, Michelle Pokrass of OpenAI wrote, "The previous model was too bullet pilled. The new one improves on some other important dimensions: sycophancy, factuality, and multilingual performance." Now, while these updates might not matter as much to the listeners of the show, you have to remember that the Instant models are used to power OpenAI's free tier, so anything that they change on that front can have an outsized impact on how everyday users perceive AI. Besides removing the tendency to deliver a wall of bullet points, some users noticed a significant change in coding skill for the updated model as well. Justin Gorya showed off some pretty impressive web development work from a basic prompt, asking, "Is the updated GPT 5.5 Instant a variant of GPT 5.6?"
On the Codex side of the house, the team pushed out their weekly feature drop, with Codex developer Thibault writing, "Codex Thursday has exceptionally moved to another day. Friday it is." OpenAI's Andrew Ambrosino wrote, "When things don't meet the bar, we'll cook for a bit longer." Now, the rumor mill started absolutely churning, with some thinking OpenAI pushed back the release because they hadn't realized how much of a threat Opus 4.8 was going to be. And of course we will talk all about Opus 4.8 in the main.
Next up, in funding news, AI coding startup and agent lab Cognition has closed a billion-dollar funding round. The new round values the company at $26 billion, which is more than double their previous round last September. Now, Cognition was one of the early trailblazers in agentic coding, betting big on the theme two years ago with the release of their coding agent Devin, and while it hasn't necessarily been in the headlines as much this year, the growth of the product has been absolutely insane. Their enterprise usage numbers are up 10x so far this year, taking them to a revenue run rate of almost half a billion dollars. Cognition shared a chart of weekly Devin sessions since the beginning of 2025, with the growth trajectory increasing dramatically in January and then again in April. Usage growth is now basically a straight vertical line. That same inflection point was obvious from Cognition's internal use of Devin. In January, 17% of their internal code was committed by Devin. That proportion doubled to 33% in February, doubled again to 70% in March, and is now at 89%. Wrote Cognition, "We're now shifting to a world of self-driving software development. Individual engineers are able to spend more of their time on creative structuring of problems and tasks, and their army of Devins reliably executes." So does this mean fewer software engineers?
Not according to Cognition CEO Scott Wu, who in conversation with Bloomberg said, "There's about 30 to 35 million software engineers in the world today. We want to make them all 10 times more efficient, and then we think there is a lot more than 10 times more software to build."
Next up, an interesting story, especially following what's happened with Elon and SpaceX and their deal with Anthropic. Meta could be the next company to pivot to an AI cloud company if their plans to deliver personal intelligence don't pan out. During a shareholders meeting on Wednesday, Mark Zuckerberg was asked whether he would consider competing with AWS, Google Cloud, and Microsoft Azure in AI cloud, to which Zuckerberg responded that it was definitely on the table, adding, "Almost every week there are different companies that come to us from outside, asking us to both stand up an API service and asking if we have compute they could buy from us at some premium to what we've bought it at." Now, that new opportunity emerging from the compute shortage has some big implications for Meta. Firstly, it de-risks their AI buildout substantially. Meta is slated to spend around $130 billion on building AI data centers this year, but has at this point the weakest ROI story among the hyperscalers. The only place their AI returns show up on the balance sheet is in increased advertising revenue, which is an indirect link at best. Meta has added AI features to their advertiser platform and is using AI models to improve targeting algorithms. But that's certainly not the same as Google being able to say AI is driving 60% of growth for cloud. Now, however, if Meta does overbuild, they have a plausible way of monetizing that excess spend. And this is definitely the clear message that Zuckerberg is delivering to investors, commenting, "We haven't done that yet because we think we have a use for that compute."
"Obviously, if we get to a point where we feel we have overbuilt, then that is an option that we have, and that is partially what gives us confidence in investing and building this out." Now, one of the interesting things that happened was when Elon started to shift his focus to perhaps playing a role more like Compute Czar, or Earl of Compute, as I called it on Twitter, many wondered if Zuckerberg would be the next to follow in that AI kingmaker path. For the moment, they're not going whole hog on that, but it's definitely a trend to watch.
Now, as we head into next week, one thing to keep an eye on in the first week of June is that The Information reports that Microsoft is set to release some new models at their annual Build conference, which begins on Tuesday. The reports are that we will get a family of new AI models, including a coding model, as well as specialized models focusing on reasoning, transcription, speech, and images. Now, if we actually get this, it'll be the first family of models that Microsoft has commercially released in the current era. Until now, their commercial products have been driven by models from OpenAI and Anthropic. They have also released a series of research previews, and we got some early previews of the image model. Given how this month's biggest story around Microsoft was them ditching their Claude licenses and forcing engineers to use GitHub Copilot instead, I genuinely think there is a lot to watch out for heading into next week. But for now, we got a new model yesterday. So with that, let's close the headlines and switch over to the main.
All right, folks, quick pause. Here's the uncomfortable truth: if your enterprise AI strategy is "we bought some tools," you don't actually have a strategy. KPMG took the harder route and became their own client zero. They embedded AI and agents across the enterprise: how work gets done, how teams collaborate, how decisions move. Not as a tech initiative, but as a total operating model shift.
And here's the real unlock: that shift raised the ceiling on what people could do. Humans stayed firmly at the center while AI reduced friction, surfaced insight, and accelerated momentum. The outcome was a more capable, more empowered workforce. If you want to understand what that actually looks like in the real world, go to www.kpmg.us/AI. That's www.kpmg.us/AI.
One thing I keep seeing in enterprise AI: companies hedging across every cloud, every model, every framework, or paying a GSI for a pilot that never ends. The teams actually shipping, they've picked a lane, and they move fast. That's one of the reasons I like today's sponsor, Robots and Pencils. They've gone all in on AWS. They're an Advanced Tier and AWS Pattern Partner, and they ship production AI coworkers in 45 days. That's led to them doing some of the more interesting work I've seen on AI coworkers. And by that, I'm not talking about chatbots. I'm talking about actual agentic systems that sit inside a business architecture and do real work. That kind of focus matters if you're an enterprise leader trying to get something real into production, or an AWS rep trying to move a customer from interested to deployed. Request an AI briefing at robotsandpencils.com. One conversation with Robots and Pencils and you'll know.
Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting notes. If you're the one responsible for AI adoption at your company, you need Section. Section is a platform that helps you manage AI transformation across your entire organization. It coaches employees on real use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result? You go from rolling out tools to driving measurable AI value.
Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI. Stop guessing if your AI investment is working. Check out Section at sectionai.com. That's S-E-C-T-I-O-N-A-I dot com.
Today's episode is sponsored by Bolt.new. Bolt.new is agentic engineering on multiplayer mode. Designers, product managers, and engineers build in the same environment, and the design system agent keeps every screen on brand. No more Frankenstein UIs stitched from a dozen prompts. Whether you're shipping internal tools, moving from prototype to production, or replacing a legacy admin panel, Bolt.new takes your team from concept to deployed app. Hit plan mode before you build. I had a project I'd half described in three different prompts, and plan mode made me actually think through it with Bolt.new before a single line got written. It saved me from rebuilding the same screen probably about four times. Build better apps faster. Start with the link in the description.
Yesterday we got a big new model announcement that really wasn't preceded by a ton of hype. For just a day or two in advance, there was starting to be some chatter that Thursday was going to be a good day for announcements, but the Opus 4.8 announcement definitely didn't have the rabid anticipation that some recent model announcements have. Now, is that because we're back to a very incremental sort of release schedule? Is that because the people who had early access weren't buzzing about it behind the scenes? Or was it because in the middle of 2026, updates to the harness matter as much, if not more, than updates to the underlying model? Whatever the case, yesterday we got Claude Opus 4.8, which Anthropic themselves have positioned as an upgrade to Opus 4.7 rather than a big new leap in performance. Much of the focus was on model refinement rather than raw power. Through customer testimonials, for example, Anthropic focused on nuanced functional improvements in how the model worked.
Shopify engineer Tom Prichard said, "Opus 4.8 has noticeably better judgment in Claude Code. It asks the right questions, catches its own mistakes, pushes back when a plan isn't sound, and builds up confidence around complex multi-service explorations before making big changes. It's a great model to build with." Writes Anthropic, "One of the most prominent improvements in Opus 4.8 is its honesty. A general problem with AI models is they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims."
Now, one thing that I will note on my very first tests with 4.8 is that for basically as long as we've had recent reasoning models, one of my core day-to-day use cases is around gut-checking various strategic ideas that I'm having. And to be perfectly honest, you almost have to develop a mental rubric for the ways in which these models are going to glaze your ideas. You can ask them to be critical or think from first principles, but that often just leads them to be critical a priori because they think that that's what you want them to do. I haven't had a ton of time with Opus 4.8, but in some of the big strategic questions that I've put to it, it did seem more comfortable right out of the gate, without me especially prompting, to flag certain concerns and critiques of what I was sharing, which, if that holds, will be a pretty big improvement. Now, I also found that it was a little bit more likely to make some assumptions upon which those critiques were rooted, so that's something I'm keeping an eye on. But given how big of a challenge this broader issue of sycophancy is, which of course is just a different form of dishonesty in some ways, if this really is a more honest model, it could be a big improvement on some of those types of strategic use cases.
Now, when it comes to the benchmarks, most categories received a small bump over Opus 4.7. The SWE-bench Pro score went from 64.3% to 69.2%. On Humanity's Last Exam, which Anthropic is categorizing as a multidisciplinary reasoning test, the score went from 54.7 to 57.9. The score measured by OSWorld-Verified went from 82.8 to 83.4. But the biggest improvements were in Terminal-Bench 2.0, which went from 66.1 to 74.6, and GDPval, the measure of real-world knowledge work tasks, increasing from 1753 to 1890.
Now, interestingly, this is the first time Anthropic has included OpenAI's models as a direct comparison in their launch materials rather than just referencing their own previous models. It was not a clean sweep, with GPT 5.5 still having a substantial lead in Terminal-Bench at 78.2 compared to Opus 4.8's 74.6. However, on every other benchmark Anthropic highlighted, Opus 4.8 is now ahead of GPT 5.5. To be fair, for most, Opus 4.7 already had a lead, meaning, one, Anthropic was just highlighting the widening gap, but two, also validating just how little utility these days most people feel benchmarks have. At least among enfranchised users, 5.5 has really started to open a perception [gap] with 4.7. So the fact that they're reminding us that Opus 4.7 was already ahead of 5.5 on a lot of these benchmarks might actually not be doing what Anthropic hopes it was doing in terms of what our perception of these model differences is. Overall, they called it "a modest but tangible improvement on its predecessor," adding, "There's still more to be done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."
So let's go to some of those first impressions and see what people thought. Professor Ethan Mollick was impressed. He shared an Opus 4.8 one-shot of, quote, "create a visually interesting shader that can run in twigl, make it like an infinite city of neo-gothic towers partially drowned in a stormy ocean with large waves."
Mollick pointed out that this is all done with math. He continues, "This is hard. It involves ray marching, repeated gothic architecture, instancing towers across an infinite grid with gothic silhouettes and windows, a displaced ocean surface with believable wave motion, and stormy atmospheric lighting and fog to tie it together, and doing all of this with no textures or external assets, just mathematics." Ethan also tested it on some complex knowledge work, writing, "I had Opus 4.8 and Claude Code write a sophisticated, if minor, academic paper from an archive of hundreds of de-identified research files from years ago. I had to use GPT 5.5 Pro as a reviewer. It spotted one major error and some minor ones, which Opus corrected. Opus 4.8 formulated the hypothesis in advance, conducted data cleaning, did research on references, conducted analyses, did robustness checks, and put out the whole paper in LaTeX style. GPT 5.5 found one issue with a hallucinated result and had other constructive feedback." Now, as an aside, one of the takeaways here is that we are starting to get close to models you can actually trust to self-verify, which is a huge win for use cases like legal briefs, where hallucinations really minimize utility.
Speaking of this, a lot of people noticed that Opus 4.8 is pretty hardworking. Gail Breton writes, "One thing I'm noticing is Opus 4.8 is much more thorough in terms of checking its work or the subagent's work. I had this situation where a Haiku subagent reported an issue. Opus goes, 'Hmm, this is weird. Let me check that it's not BSing me.' It was. Opus ignored the warning. Very good." Lisan al Gaib said, "Anthropic found a cure for laziness." Metacritic Capital wrote, "Opus 4.8 is the first smart model in a long while," which Zephyr quote-tweeted and attributed to that reduced laziness and its increased honesty.
And in fact, honesty came up a lot in early reviews. One read: "A day with Opus 4.8 and Claude desktop: honesty up, everything else about the same. The benchmarks jumped, but in actual daily work I can't feel most of it. The one real change is that it tells me when it doesn't know instead of bluffing, roughly 4x less likely to let an error slide, and that I do notice. Beyond that it feels like 4.7, which is fine. A model that admits uncertainty beats one that sounds sure and wastes your time. If that's the whole upgrade, it's still worth having. Not every release has to be a leap."
Now, one group who thought that these first impressions, and even Anthropic's messaging, were perhaps underselling it a little bit was Dan Shipper and the crew at Every. Dan wrote, "Anthropic just dropped Opus 4.8 and it is a monster. We've been testing it for about a week at Every and our verdict is they could have just called it Opus 5. It's that good." He said that in their vibe check, it beat GPT 5.5 on their Senior Engineer Bench, which is their toughest benchmark. However, Dan did caveat that coding performance varies a lot based on different reasoning levels, with you really needing to use it on extra high for the best coding results. Dan also said, and this is one that I would take Every very seriously on, as they care more about this than just about anyone, that Opus 4.8 is, in his words, "an incredibly good writer." Indeed, on their writing benchmark, he said it beats GPT 5.5 by six points, producing well-written prose with fewer AI-isms, and is also very good at writing in your own voice given the right context. Once again, however, they found that writing performance varied a lot with reasoning levels, with medium reasoning having a much higher incidence of AI-isms. They also said it was good at knowledge work, it was emotionally intelligent, and it was willing to question the frame, kind of like what I was mentioning before.
And when it came to the bad, they got at an issue which is, I think, of increasing importance, which is the question of the harness. Dan writes, "These days, a model is only as good as its harness, and Codex is still a far superior harness to the Claude desktop app. This has kept me using Codex with GPT 5.5 as my daily driver, but I'm flipping back and forth a lot more between Codex and Claude." This, I think, is one of the most interesting discussions surrounding 4.8, and one of the first times I've seen it put so crisply. Riley Brown seemed to feel very similarly, writing, "Unless it's a major breakthrough in model capability, I'm much more excited for super app updates in Codex and Claude desktop. There's so much to be unlocked by making those surfaces better, and Claude has so much catching up to do." Samid put it more simply: "Opus 4.8 is the headline. Codex vs. Claude Code is the real war."
Now, there were also some more critical takes that weren't just about this being a relatively incremental improvement. In her assessment, Claire Vo found that while the model was token efficient and not annoying, it had narrow vision, it was too confident, it wasn't as numbers-grounded as Opus 4.7, it struggled on edge cases, and it actually hallucinated. Her TL;DR was "trust but verify." Indra Vehan writes, "Opus 4.8 high is no fun when it comes to tool calling. In fact, it fails embarrassingly more on its seemingly native harness, Claude Code. It's a confusing model."
One interesting one came from the Vending-Bench test, which is a benchmark that tasks a model with running a profitable vending machine. Opus 4.7 is the clear leader, making around 40% more money than GPT 5.5 in second place. Opus 4.8, meanwhile, made around 20% less money than GPT 5.5 on high effort, and on max effort it made about 60% less, sending it below Kimi 2.6 and Gemini 3 Pro. The insight was that improvements in alignment were actually a negative when it came to making money in the test.
Opus 4.7 achieved its top ranking largely through deceptive and power-seeking behavior. Unlike 4.7, 4.8 won't refuse legitimate refunds or shortchange vendors. In one example, Opus 4.8 still paid a vendor after it hallucinated that the invoice was already paid. Opus 4.8 told the vendor, "If the product arrives and I don't pay, I'd be committing fraud, which could result in serious consequences. I need to make the payment immediately to honor my commitment and prevent the situation from escalating." I feel like we could explore that entirely on its own, and at some point maybe we'll come back and do that.
Now, overall, I don't think that first impressions, at least, are likely to shift the momentum back in favor of Anthropic from OpenAI, where, at least among the power users, the combination of 5.5 and Codex has put the momentum squarely in OpenAI's hands. Chubby on X writes, "Opus 4.8 is clearly a strong model, but my impression is that Anthropic is increasingly playing catch-up with OpenAI rather than setting the pace. It feels like GPT 5.5 has shifted the benchmark again, and if OpenAI keeps this trajectory, GPT 5.6 could very plausibly become the stronger overall model."
Still, given the idea that the harness increasingly matters as much as the model, one of the really interesting sidelong announcements was for something that Anthropic is calling Dynamic Workflows in Claude Code. This is basically Anthropic's new version of their multi-agent coding feature. The feature allows Opus 4.8 to spin up hundreds of subagents to work in parallel. Opus will plan the work, write the orchestration scripts, and choose which model to use for each subtask based on its complexity. Adversarial agents are used throughout the process to check outputs, and Opus verifies the final outputs before handing them over to the user.
Now, at least in the immediate term, this isn't necessarily going to be a feature that's very common among generalist knowledge worker type users as opposed to software engineers, but there are certainly many types of complex work where this is worth the additional cost. Anthropic suggested it should be deployed for things like codebase-wide bug hunts, security audits, and large code migrations. They gave an example of Bun developer Jarred Sumner porting the codebase from Zig to Rust. Dynamic Workflows was used to create a plan that deployed hundreds of subagents and took 11 days. 750,000 lines of Rust were written, and by the time Opus turned over the finished codebase, it passed 99.98% of tests.
This is getting a lot of buzz. Anthropic's Dixon Tsai writes, "My colleagues' Dynamic Workflows are, in my opinion, the most significant Claude Code innovation in 2026 so far." Developer Nick Dobos writes, "Claude Code's new Dynamic Workflows update is absurd. Make sure you understand what it's doing here. This isn't simply a long-running mode like /goal," which, by the way, little preview for those of you who are interested in /goal: that's what Sunday's Long Read Sunday is all about. Anyways, interrupting myself and going back to Nick, he writes, "This isn't simply a long-running mode like /goal or a fancy subagent verifier process. This is Claude vibe coding an entire brand new subagent fleet harness on demand. This is basically a new scaling law dimension. Huge step forward on the path."
AI entrepreneur and startup ideas guy Greg Isenberg wrote, "The part that got me: the agents argue with each other before showing you the result. Independent attempts at the same problem, then adversarial agents trying to break the answer. It keeps iterating until they converge. That's how senior engineering teams work. Except this team runs at 3am and never gets tired. The ceiling on what one person can build just moved again. Going to be playing with this all week."
Look, when push comes to shove, I think that 4.8 is one you're going to need to go check out for yourself. As you can probably tell, my first impressions are that I like it better and see improvements from 4.7. Yes, they are incremental, but they're incremental in the ways that really impact which model I find myself reaching for. There was some scuttlebutt that the release was surprising enough that it had OpenAI delaying GPT 5.6, although of course that's all speculation.
But as we round out this show, what's not speculation is that in addition to Opus 4.8, we also got a couple of other pieces of massive news surrounding the announcement. First of all, Anthropic has closed their Series H fundraising round at a $965 billion valuation, officially making them a more valuable company than OpenAI. Anthropic last raised money in February, with that round valuing them at $380 billion, meaning that they more than doubled their valuation in just three months. Anthropic also updated their revenue figures, reporting that their run-rate revenue crossed $47 billion earlier this month.
And yet the much bigger news than that is that Mythos is coming, or at least, as Anthropic has framed it, a Mythos-class model. Tucked into the end of their release blog post for Opus 4.8, Anthropic wrote, "We plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We're making swift progress on developing safeguards and expect to be able to bring Mythos-class models to all of our customers in the coming weeks." Meaning that even if you don't end up caring all that much about Opus 4.8, you're going to have some new toys to play with soon.
One of the great things about getting a model release on a Thursday is that you have all weekend to go off and play. So with that, I'm going to shut up and let you get to it. Please do share what you find: use the comments, come to the AI Operators community, shout at me on Twitter or LinkedIn, and have a ton of fun. Appreciate you listening or watching, as always. And until next time, peace.