
Last Week in AI — Episode #246 Summary

Date: May 25, 2026
Hosts: Andrey Kurenkov (A) & Jeremy Harris (B)
Theme: Recap and deep-dive into the most important AI news and developments from the week, including major updates from Google, OpenAI, Anthropic, XAI, and the intersection of AI and business, policy, and safety.

Episode Overview

This action-packed episode delivers a comprehensive rundown of the week’s hottest AI headlines, with special emphasis on Google I/O announcements, the fallout from Elon Musk’s loss in the OpenAI lawsuit, OpenAI's mathematical milestone, startup and model wars, as well as key stories in open source, synthetic media, policy, and AI safety.

Key Discussions and Insights

1. Google I/O: Gemini 3.5, Spark, and Omni

Timestamps:

Gemini 3.5 & Spark: [04:25]–[10:32]
Gemini Omni: [11:10]–[17:39]
Developer Tools (Antigravity): [17:39]–[21:21]
Gemini for Science & Genie: [21:21]–[28:31]

Major Announcements:

Gemini 3.5 (Flash and Pro)
- "Gemini 3.5 Flash actually... beats out Gemini 3 Flash on a bunch of benchmarks in a huge way." (A, [05:22])
- Spark operates as a semi-autonomous assistant, running persistently on Google Cloud with browser-level access, aiming to expand beyond traditional chatbots.
- Rollout to testers now, broad beta soon — “...this is the announcement about the coming announcement.” (B, [06:09])
- Focus on Gemini 3.5 Flash, the primary product driving Spark; output speed nearing 300 tokens/sec cited, close to double that of Gemini 3 Flash. (A, [10:34])
- Userbase reportedly hitting 900 million daily, though measurement ambiguity acknowledged.
AI Agents & Infrastructure
- “That's like a fundamentally new paradigm. ...Google is trying to convince the search users to trust them with tasks that involve minimal input and tons of agentic work.” (B, [06:09])
- Spark to support 3rd party tools via Model Context Protocol (MCP)—a major win for Anthropic.
Gemini Omni
- New multimodal model family: handles image, audio, video, text inputs — generates and edits video.
- “Editing is a much better use case than generation... likely more compelling to creators on YouTube and others.” (A, [12:39])
- Google leverages unique data advantages (YouTube, Street View, etc.), poised for strategic leadership in multimodal and agentic AI, especially vs. OpenAI, Anthropic.
- “As we move into more and more multimodal agent kind of development... Google is going to have a very significant differential advantage here.” (B, [14:49])
Developer tools: Antigravity 2.0
- Replaces Gemini CLI. Desktop app + CLI. "No one's using the IDEs, right? Everybody's using the terminal for this..." (B, [17:39])
- Skepticism voiced about the industry's recurring “recursive self-improvement” PR. "If you just keep saying... this is like recursive self-improvement... eventually people are going to get... fatigue." (B, [19:38])
Gemini for Science & Genie Updates
- Suite for accelerating scientific discovery—hypothesis gen, computational discovery, literature insights.
- Genie can simulate real streets from Street View; supports Waymo autonomous vehicle simulators for rare event training.

Notable Quotes:

“Distribution is... such a distribution advantage. Right. That it's like Microsoft saying, oh look, everybody's using teams... you're forcing them to.” (B, [09:13])
“We have like a few projects that are doing the same thing. Let's combine them under one roof.” (A, [17:47])

2. Startup Model Wars: Cursor, XAI, and Colossus Compute

Timestamps:

Cursor Composer 2.5 & the SpaceX/XAI/Cursor web: [29:33]–[38:49]
XAI Grok Build: [37:19]–[41:01]

Key Points:

Cursor Composer 2.5:
- An impressive, fine-tuned coding model built atop Moonshot AI’s Kimi K2.5, already beating expectations in price & speed.
- Cursor is purportedly training its own larger, frontier model using SpaceX’s newest Colossus 2 compute cluster.
- “Cursor is kind of becoming the XAI team, right? ...If you just in a vacuum saw this, this would be like, make no sense.” (B, [32:40])
- Potential $60B SpaceX/XAI acquisition of Cursor in the works following IPO.
XAI’s Turbulence:
- Massive talent drain — cofounders, team leads leaving; utilization issues with expensive compute clusters.
- “XAI is just completely bleeding out. Elon Musk said... XAI wasn't built right the first time. It's being rebuilt from the ground up.” (A, [35:55])
- Cursor seen as possible lifeline for XAI’s coding/model ambitions.
Grok Build:
- XAI launches extremely early (v0.01) coding agent, clearly lagging behind industry.
- “The current situation for XAI is objectively very bad. ...They don't even have a product in the $200–$300/month coding high end development tier...” (B, [38:49])

Notable Quotes:

“50 people have left... and from the team, researchers, developers, that's from a 200 person team. So XAI is just completely bleeding out.” (A, [35:55])
“The fact that they have a specialized Grok Build model for coding... is a bad sign... they had to fine tune a coding thing just for Grok Build to be good.” (A, [37:19])
“I'm not betting against Elon. ...But... this has to ship.” (B, [40:49])

3. Applications and Business: Musk Loses to OpenAI, IPO Mania, Power Shifts

Timestamps:

OpenAI vs Musk lawsuit outcome: [41:00]–[47:49]
Anthropic's $900B Valuation and Rocket Growth: [47:49]–[54:10]
Andrej Karpathy joins Anthropic: [53:18]–[55:32]
OpenAI product shakeup & Apple tension: [55:32]–[60:55]
Cerebras IPO: [60:55]–[65:31]

Key Points:

Musk vs OpenAI Lawsuit
- Musk lost due to statute of limitations; the crux—Musk knew as early as 2017 about the for-profit pivot.
- “...the narrative, the story that you stole a charity is still live. Like it has not been ruled on. It's not like a judge said, you did not steal a charity.” (B, [44:33])
- Not technically judged on ethical grounds—the story (and OpenAI’s checkered nonprofit-to-profit journey) stays alive in public discourse.
Anthropic's Meteoric Rise
- Now at $900B valuation, overtaking OpenAI by some measures; projecting profitability due to explosive growth in enterprise products.
- "Anthropic is just killing it. ...From the IPO perspective, it's I’m sure, quite nice to have a higher valuation." (A, [49:08])
- Notably, “Profit means they miscalibrated their capex investment ...you want to remain slightly below profitability... as long as you can ride that curve.” (B, [49:30])—i.e., they could have spent even faster.
Karpathy Joins Anthropic
- AI rockstar Andrej Karpathy joins pre-training / “auto-research” team at Anthropic.
- “This is like, oh wow, Steph Curry has joined Anthropic...” (A, [53:17])
- Signal of belief in rapid ongoing progress (and a canary for AGI timelines).
OpenAI Internal Churn
- Major leadership reshuffles; merging ChatGPT, Codex and browser into “super-app.”
- Ongoing executive departures—chaos compared to Anthropic's stability.
OpenAI vs Apple Tensions
- Partnership fraying; delayed Siri integration, lack of promotion, potentially heading to legal friction but likely “leverage theater.”
- "They don't actually have a... case here. But they can embarrass Apple on AI right before their June 8th WWDC conference..." (B, [58:53])
Cerebras IPO
- AI chip innovator soars 90% in market debut—excitement, but possibly mispriced (“left a lot of money on the table”).
- Booming AI infrastructural investment behind broader market optimism—but warnings of overbuilt data center CapEx “bubble” linger.

Notable Quotes:

“The growth in the stock market is almost entirely AI, it's almost entirely Capex...” (A, [64:36])
“If you were calling a bubble 18 months ago... you got to tuck your tail between your legs and just say ‘mea culpa, I was wrong.’” (B, [65:31])

4. Research and Advancements: Mathematics, Model Behavior, Auto-Research

Timestamps:

Erdős Problem (OpenAI breakthrough): [66:52]–[69:32]
Negation Neglect: [69:32]–[72:59]
Mechanistic Interpretability (All Circuits Lead to Rome): [72:59]–[76:02]
Auto-Research (NanoGPT Speedrun & Bench): [76:02]–[81:40]
Terminal World Benchmarks: [81:40]–[84:09]

Episode Standouts:

OpenAI Solves 80-year-old Erdős Problem
- Used ChatGPT to produce a major mathematical proof, overturning long-standing assumptions.
- “...hundreds of pages of logic and calculations went into it... this is an impressive proof, it has actual insights, it has, you know, leaps of imagination...” (A, [67:24])
- Signals AI’s growing ability to contribute to new fundamental science.
Negation Neglect
- Study shows LLMs easily “believe” false facts, even when trained with strong negations/warnings.
- “If you do include heavy negations... it still believes the false facts 84, 85% of the time. So that's pretty wild.” (B, [70:12])
Interpretability—Many Overlapping “Circuits”
- “If you think you've intervened on one circuit, you probably haven't fully intervened on the... capability.” (B, [73:46])
Auto-Research and Agentic Optimization
- Current agent-run auto-research mostly “grinds” on hyperparameters; algorithmic creativity is limited, but incremental progress is real.

5. Open Source & Benchmarks

Timestamps:

NanoGPT Bench and AI agent limitations: [79:17]–[81:40]
Terminal World: [81:40]–[84:09]

Key Points:

Open source agents are making incremental progress vs. best human baselines, mainly via brute force optimization, less so by creative breakthroughs.
Real-world benchmark pass rates hover around 50–60% even on “small” tasks.

6. Policy and Safety

Timestamps:

Deepfakes & “Take It Down” Act: [84:09]–[84:58]
Hacking & Self-Replication (Palisade, UK AISI): [84:58]–[91:14]
Positive Alignment: [91:14]–[92:42]

Highlights:

Deepfake Law:
- Sweeping federal requirements for rapid takedown of nonconsensual (including AI-generated) intimate imagery; $53K/violation; raises censorship, free-speech, and over-moderation concerns.
AI Autonomous Hacking:
- Open-weight LLMs demonstrably able to hack, self-replicate, and exfiltrate model weights in realistic network tests — “now models out there can do this if they want to.” (A, [88:12])
- State-of-the-art models (Claude, GPT-5, Qwen, etc.) seeing cyber capabilities double in just 4-7 months; a “relentless law of physics” trend, akin to Moore’s Law, cited for cyber offense.
Positive Alignment Paper:
- Position from industry leaders (OpenAI, Anthropic, DeepMind): “alignment shouldn’t be just about AI not turning out evil. We should have positive alignment where AI is aligned to us to do good.” (A, [91:14])
- Critique: Not radically different from what leading labs already claim to be doing.

7. Synthetic Media & Art

Timestamps:

OpenAI image watermarking/certification: [92:56]–[94:26]
AI-generated short dramas in China: [94:26]–[94:37]

Key Points:

Provenance tooling:
- OpenAI adds industry-standard watermarks and metadata for AI images, integrating with Google's SynthID to aid provenance detection.
AI Video Content:
- China’s microdrama market now sees 470 AI-generated short drama episodes daily.
- “If you are curious like when is video generation gonna actually create something useful... well here is where it’s going.” (A, [94:26])

Notable Quotes & Moments

“If you just keep saying... this is like recursive self improvement... people are going to get... fatigue. ...Someday it’s going to be true and we won’t be able to tell.” (B, [19:38])
“Profit means they miscalibrated their capex investment... you want to remain slightly below profitability... as long as you can ride that curve.” (B, [49:30])
“This is an impressive proof, it has actual insights, it has... leaps of imagination that span different areas of mathematics...” (A, [67:24])
“XAI is just completely bleeding out. Elon Musk said... XAI wasn’t built right the first time. It’s being rebuilt from the ground up.” (A, [35:55])

Thematic Takeaways

Corporate dominance in AI is converging around Google, OpenAI, and Anthropic—with Google’s product, data, and infrastructure edge possibly unrewarded (yet) at the “frontier”.
AI safety, hacking, and alignment are intensifying as core research and policy priorities, with real-world benchmarks now showing autonomous hacking, and even basic logic (negation) remaining tricky.
IPO and valuation fever is underpinned by real profits for some, but with data center CapEx as a perpetual bubble risk.
Agentic AI and auto-research show incremental but limited creativity, and “recursive self-improvement” remains mostly PR—though the field is watching for a true leap.
Synthetic, generative media is already transforming industries outside the West, with provenance tooling slowly catching up.

For Further Listening

Stay tuned for upcoming episodes for updates on Google’s Gemini Pro releases, continuing IPO news, and the ever-escalating race among AI labs in both capability and safety.

Contact & More

Leave comments, subscribe, or check out the Last Week in AI newsletter for granular news coverage not featured on the podcast. As always: “Please do keep tuning in!” (A, [94:37])


Transcript

[00:00]
A
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for stuff we will not cover in this episode. I am one of your regular hosts, Andrey Kurenkov. My background is of having studied AI in grad school and now working at the gen AI startup Astrocade, and I'm
[00:40]
B
your other host, Jeremy Harris. I do AI national security stuff, supply chain, AI infrastructure, blah blah blah blah blah. It's a very stacked episode today. We were just talking about this. We have not been very disciplined about wrapping in time because we take too freaking long in the first few sections, and then Andrey ends up having to go, and oftentimes I'm like joining, like, we all have meetings that bracket this. So bottom line is Andrey is going to have the merciless job, thankless job, of having to keep us in line here. So please, Andrey, like, cut me off if I ramble. We're going to try to not linger too much on the first stories because that's a thing that we've done in the past.
[01:15]
A
Yes, and it will be a little tricky because, as you said, it's a stacked episode, partially because in the first section, Tools and Apps, Google I/O happened, and Google came out with a whole bunch of announcements, some of them quite intriguing. So we'll see how that goes. Then, in Applications and Business, we can talk about the conclusion of the OpenAI-Musk trial. Quick spoiler: Musk lost pretty badly. So that'll be a fun discussion. In Research and Advancements, OpenAI has had a massive result in mathematics that we'll probably spend a while on. And then beyond that we have a decent number of open source, policy and safety, even some synthetic media and art stories. So it will be an action-packed episode.
[02:06]
C
You're listening to this podcast so I know you've got a curious mind. Here's a helpful fact you might not know yet. Drivers who switch and save with Progressive save over $900 on average. Pop over to progressive.com, answer some questions and you'll get a quick quote with discounts that are easy to come by. In fact, 99% of their auto customers earn at least one discount. Visit progressive.com and see if you can enjoy a little cash back. Progressive Casualty Insurance Company and affiliates national average 12 month savings of $946 by new customers surveyed who saved with Progressive between June 2024 and May 2025. Potential savings will vary.
[02:44]
A
Today's episode is sponsored by Box. Enterprises are keen to adopt AI, but enterprise AI only works when it has the right business context, and Box is the leading intelligent content management platform for the AI era, acting as the secure essential context layer for Box's AI agents to access the unique institutional knowledge that makes the company run. Your business isn't the sum of all Internet knowledge. Your business lives in your content, and Box can connect that content with people, AI agents and apps that can unlock their value from their information, all while having the security and governance capabilities that allow you to trust it to be secure. There are many uses for it, and especially interesting is Box Agent, a unified AI experience across your files in Box. So if you're thinking seriously about your company's AI transformation journey, think beyond the model. Your business lives in your content, and Box helps you bring that content securely into the AI era. Learn more at box.com/ai.
This episode is brought to you by Outshift, Cisco's incubation engine. Today's AI agents operate in silos, limiting their true potential. We've been focusing on building bigger, smarter models, but scaling up is just one approach, and we actually have a blueprint from 70,000 years ago here. Humans didn't just get smarter individually. The cognitive revolution transformed society because we began sharing knowledge, goals and innovation. And agents are now at the same inflection point. They can connect, but they can't think together. And that's why Outshift by Cisco is building the Internet of Cognition, transforming AI from isolated systems into orchestrated superintelligence. By creating an open, interoperable infrastructure, Outshift is enabling agents and humans to share intent, code, context and reasoning. The cognitive evolution for agents is here. Explore the Internet of Cognition at outshift.com. That's outshift.com.
Heading straight into tools and apps.
The big news this week was out of Google. They had Google I/O, where they announced a whole bunch of stuff we'll have to get through. Probably the biggest news is that they did announce Gemini 3.5 as their next slate of model releases, and they also announced their new AI agent, Gemini Spark, which is supposedly a version of OpenClaw in a way, where it runs 24/7. You can give it more offline types of tasks; it will function more as a semi-autonomous assistant as opposed to just a conversational chatbot. And on the benchmarks, the big news maybe is Gemini 3.5 Flash, actually, not the bigger Gemini 3.5. It beats out Gemini 3 Flash on a bunch of benchmarks in a huge way. And it is also said to be the driver behind Gemini Spark. So it's rolling out to trusted testers today, and they're planning to bring the beta to Google AI subscribers in the US next week. So it's one of these typical things of they have the big conference and needed to make some announcements, but it isn't quite ready to go live yet. So it's coming, but it's not here yet, and I haven't heard any sort of vibe checks so far on what people think about these things.
[06:10]
B
Yeah, it's one of those things, like, we're having a meeting to plan a meeting, or this is the announcement about the coming announcement. The Spaceballs scene where the guy goes, prepare to fast forward. Prepare to fast forward. Fast forward. Fast forwarding, sir. Anyway, if you've seen the movie, you know. Basically, yeah, this is that. The Spark thing is interesting. Like, it's actually really interesting from a kind of architectural infrastructure standpoint. You know, you mentioned that there are kind of these agents running in the background. They will be running on dedicated virtual machines in Google Cloud, so not on device. That's what makes it so that it keeps running when you close your laptop, and it's going to operate inside Chrome as an agentic browser, and that'll be later this summer. At least, they say. We'll see if it joins the graveyard of Google, you know, half-finished products. But if that happens, that is a huge, huge architectural commitment, right? You've got these cloud-native, cloud-resident AI agents that are going to persist with browser-level access. That's like a fundamentally new paradigm. It does make sense, but it means Google is trying to convince the search users to trust them with tasks that involve, you know, minimal input and tons of agentic work. Which is a shift. You know, historically that has been a space that's been owned by the other labs, especially Anthropic and OpenAI. And then the other piece here is the MCP side, right? So, like, Spark is gonna support third party tools, but it's gonna do it through the Model Context Protocol, right, the Anthropic MCP. And so that's a bit of a concession. Google is basically saying, look, we're not really gonna try to argue with Anthropic on this one. This just seems to work. And this is a major, major center of mass now, or center of gravity. You're gonna see a lot of people now defaulting to MCP, even more than they already. 
I mean, it's already kind of the default, but there have been moves to try to shift away from it. This is a pretty big win for Anthropic from that standpoint. So at the end of the day, that's what we're getting from this. I think it is a pretty big architectural move in terms of how Google serves models and what they think the future is. The TBD is, do we actually see uptake? Are real users going to start using Google products for this? And the agentic side just has to be really strong to support it.
[08:09]
A
Yeah. One other thing to say here is they did focus primarily on Gemini 3.5 Flash. There also will be Gemini 3.5 Pro, but they don't highlight any benchmarks for it. It won't be out until next month. So it's a case in which 3.5 Flash, which I think is probably their main product-side driver for Spark, probably even for chat, is the primary focus, it seems, as opposed to Gemini 3.5 Pro, which is the higher-intelligence model, which is typically what the AI labs are sort of competing at. I think it does signal that currently Anthropic and OpenAI are competing with Claude Code and Codex very heavily. That appears to be the focus. And Google isn't trying to get into the fray as much with that. They are still more on the consumer side. They're still growing the user base of Gemini. They announced that they're hitting 900 million users per day. I don't know the exact measurement details. They might be cheating a little bit because they have Gemini all over the place.
[09:14]
B
But yeah, distribution is so they have such a distribution advantage. Right. That it's like Microsoft saying, oh look, everybody's using teams. Like yeah, no shit, you're forcing them to.
[09:23]
A
But yeah, yeah, exactly. But I don't know if they're even cheating a little bit by including Gemini usage outside of Gemini the chatbot, which, if you look at ChatGPT, which famously announced this 900 million users number, that's, you know, ChatGPT users. Gemini, I don't know if you're folding in use cases of Gemini where it's, like, within Google Docs or within spreadsheets. But either way, they clearly are getting a very large user base and are getting up there to be competitive with OpenAI on the consumer side, which I think at this point is OpenAI's and Google's game for the large-scale adoption. Anthropic is still pursuing their strategy of focusing on enterprise, focusing on professionals, not becoming kind of a daily chatbot of most people. And, you know, arguably being the daily driver of most people could be, in the long run, the more lucrative thing. Although currently Anthropic is certainly enjoying the fruits of their labor in enterprise, with absurd growth and so on.
[10:32]
B
And profitability now too. We'll talk about that.
[10:34]
A
Yeah. And one last thing, real quick, to mention: the output speed in tokens per second is one of the things they highlight. With 3.5 Flash, it's nearing 300 tokens per second according to them, which is close to double that of 3 Flash. So it is very fast.
[10:53]
B
Did they say what hardware that's on by the way? I mean, I get that they're giving us a ratio which is helpful, but
[10:58]
A
I didn't see that in the blog post. I think it's probably safe to say that it's based on TPUs and what configuration. It's the most favorable configuration.
[11:08]
B
Yeah, yeah, exactly.
[11:11]
A
Well, this is by the Artificial Analysis Intelligence Index. Yeah, you would hope that it's just the API, right, that they're showing here, but we shouldn't trust that necessarily. So I'll be curious to see what the third party estimates are. And next up, the other big reveal aside from Gemini 3.5 Flash and Spark was Gemini Omni, a new family of multimodal models that can take images, audio, video and text as input and generate video outputs by reasoning across all the input types. So it is presumably the next iteration of a multimodal model which ties this in much more closely. Historically, you know, initially we began with a focus on LLMs, and the other modalities were a little bit bolted on. It wasn't sort of a truly unified architecture, on which DeepMind has research going back a long time. With Genie, I think they demonstrated how they could do multimodal, and this is presumably sort of the actual productization of that. The first model that they will release, Gemini Omni Flash, is already out in the Gemini app, YouTube Shorts and the AI creative studio Flow. It can generate 10 seconds of video. So it is seemingly a very strong video generation model and a very strong video editing model. So that is a very significant detail, I think, because in my opinion, if you want to see kind of the usefulness of video generation, editing is a much better use case than generation, because it applies to your actual videos and you can use it, you know, as part of producing videos by adding in components and so on. And I think that is likely more compelling to creators on YouTube and others, where they can use it as a tool in their toolbox rather than just a way to generate entire videos for them. So very exciting. I think this is demonstrating what we saw with image models, where you got to the level of editing where it was just completely seamless and, yep, insanely impressive. This appears to be that for video.
[13:23]
B
Well, and to your point about the editing being more valuable for users, it's also more valuable for the company, right? You get so much more granular feedback data that you can then use to train the model than you would if you just did a simple generation and a does-the-user-like-it-or-not-like-it setup. So this is, you know, a very useful flywheel for them to play off of. One key thing is, you know, Google DeepMind, but, you know, Google in general now has, as you said, a long tradition going back to Genie and previous models. You know, Tim Rocktäschel, who has since left, was leading a team that's like really big into this, the whole world modeling side. I'm old enough to remember when the whole space was like LLMs, and then you had all these people saying world models, world models, those matter more. And that was a kind of Fei-Fei Li and LeCun kind of perspective. And at the time, like, my instinct was LLMs are the thing, just scale that and, like, everything else will come. I think what's happening right now is, I think I was right on the outcome in a sense, but wrong in terms of some important details on what would matter and what wouldn't. They scaled LLMs; that put them in a position where they could do wild capex spends in this direction, and now they can also just do multimodal and it's, like, not an issue at all. That's part of the issue: you may be right on the thesis that you need multimodality, you need robotics, you need all these things. But just being right about the thesis and not moving, you know, to catch the capex wave means that other people can afford to be wrong on that and just pivot to it with, like, tens of billions of dollars to spend on that. And, if you're Google, massive multimodal data sets that nobody else has. So I think that as we move into more and more of a multimodal agent kind of development world, you know, Google is going to have a very significant differential advantage here. You know, Sora was wound down at OpenAI. 
That was explicitly a step on the path to agentic training, right? You can generate simulated environments for agents to navigate that then allow you to get kind of this, like, longer time horizon reasoning, as long as your environments that are simulated are coherent. And that got wound down. Hard to maintain. They presumably had to train on YouTube data. Mira Murati kind of hinted at that before she left OpenAI. So anyway, there's a bunch of challenges that people who are not Google face when it comes to this whole multimodal domain. So an interesting, call it maybe a wild card, call it the strategic competitive differentiator for Google here. This may actually be a really important part of their play.
[15:38]
A
Right. And as you said, I think one of the things to note here is that there's essentially no competition for them aside from smaller startups. So I don't recall if we covered it, but Luma had similar releases recently with their Uni one model, their unified intelligence family of AI models that is also trained on audio, video, image and language and spatial language. And they also released Luma agents, which is that kind of more edit use case that is more intelligent and able to think in language and then render in pixels and images. So I would imagine a similar component is true in Omni, where you do have a sort of LLM-driven reasoning and thinking, and it's not sort of the older paradigm of you produce video as pixels without sort of that language level of thinking and reasoning, which is now very much core to Nano Banana and image generation models. Similar to the LLM side, they are releasing the Flash variant of Gemini Omni. The Gemini Omni Pro model is planned but has no release date, and I would be very curious to see that Pro model, because I would imagine that is very different, and Gemini Omni already is impressive with a Flash variant. With the Pro model I'd probably be impressed, and I'm looking forward to seeing it. Moving along to a few of their other announcements, they've also launched Antigravity 2.0 with an updated desktop app and CLI. So this is now replacing Gemini CLI. They have Antigravity as their Claude Code and Codex competitor. They launched Antigravity, I think, last year or a little while ago as their integrated development environment, and I haven't heard much about it since. So I don't know if anyone's using it.
[17:40]
B
No one's using the IDEs, right? Everybody's using the terminal for this, and that's kind of like the... I guess that's what they're admitting themselves in a way here, right?
[17:48]
A
Well, they were competing with Gemini CLI, so in my opinion this is likely a classic Google thing of, like, oh wow, we have like a few projects that are doing the same thing. Let's combine them under one roof. Like they did, for instance, with YouTube and their music app, which is why we now have YouTube Music.
[18:05]
B
But they're also, they're asking like Gemini CLI users to migrate to it. So it's kind of like, I don't know, it reads to me as like they're just going terminal now. Maybe I'm misreading this, but.
[18:14]
A
Well, no, I believe they probably still have the IDE version, and they're replacing Gemini CLI with this newly launched Antigravity CLI. But they also have the Antigravity IDE still as another extension of this. So they're doing a little bit of both, which is different from Claude and Codex where they have these. I guess maybe it's also reflective of the fact that both Claude Code and Codex are still CLI heavy. But they have invested a lot in the apps, in the Mac apps, where you can do it a little bit removed from the CLI, where there's like a front end so that you don't need to open a terminal. And I wouldn't be surprised if this also reflects that, where they are planning to compete with Antigravity on both sides, both the CLI and the more sort of non-terminal front end that basically provides you a nicer UI to interact with these products. Essentially it's the same, but it has a few more bells and whistles.
[19:20]
B
Yeah, here's the Peter Griffin "here's what really grinds my gears" part of the show here for me: they're using this line, "it was co-developed with Antigravity," right, to talk about their.
[19:31]
A
The suite of if all the software engineers use at Google don't have to use antigravity, probably then this is the thing.
[19:38]
B
So there's this thing happening right now, ever since Anthropic came out and said, like, oh, we dogfood our own agentic tooling to do this. And then OpenAI said, like, well, we trained our latest model using our latest model. The problem that I have with this is that someday it's going to be true and we won't be able to tell. Like, this is crying wolf. If you just keep saying, oh my God, this is recursive self improvement, it's happening, people are going to get recursive self improvement fatigue. They're not going to actually listen when the actually batshit insane thing happens, possibly sometime this year, I mean, if you're reading what Anthropic is saying, but possibly sometime next year, whatever. Whenever that happens, that is a critical security moment, safety moment for planet Earth. Like, we've got to get that right. And if we're just continually getting the frog in hot water factor here, with people saying, oh, our tool built itself, like, that shit is not helpful. We're even seeing Chinese labs saying it about fairly trivial software harness stuff and then kind of blurring the lines between that and actual, like, model weight, parameter optimization and stuff. So anyway, that's my soapbox. I am a little concerned about this trend. I think it's important that we be serious about when we're doing real recursive self improvement and when we're not.
[20:52]
A
No, I think it's a fair point and it's a very PR thing right now of self improvement where okay, you use it to write code and help you run experiments. Like it's not in my opinion, real recursive self improvement. Yeah, yeah.
[21:05]
B
So I want to continue, right. Like, in a way we've been doing recursive self improvement since we became multicellular organisms, or since sexual reproduction. Like, whatever, you can make that argument, but it's just not what people mean. Like, to your point, yes. Anyway, I'm just finding a long-winded way to agree with you.
[21:21]
A
And yeah, I will say one more thing on Antigravity. They also announced the Antigravity SDK, which allows you to build extensions that hook directly into your conversation history and really operate within the agent stack, which I believe with VS Code you would not be able to do. And I do think, you know, we forget a little bit that Cursor still has a large user base, Cursor being the primary IDE competitor outside of the terminal, though they also have a terminal thing now. I do imagine that many people are still working within an IDE, which I know I am. I still work within Cursor and then launch a terminal within Cursor with Claude Code. So there is a real story where they're competing on both fronts and potentially trying to get some of that Cursor market share. Again, to my knowledge, I haven't heard of anyone using Antigravity, but I'm sure some people are. And rolling right along, we have a couple more things from Google. Next one is Gemini for Science, a collection of experimental AI tools designed to assist researchers with scientific discovery workflows. Again, I think this is coming in concert with OpenAI. They also had some sort of, I think, OpenAI or ChatGPT for Science initiative. Anthropic also has a program for science, specifically, which we've discussed in recent months. This is being rolled out via a signup form on Google Labs. And they also have science skills within this, which pull insights from over 30 major life science databases to automate complex manual workflows, completing tasks rather quickly. Aside from those skills, the suite itself has three main features. It has hypothesis generation, which searches millions of papers to help form theories with cited sources; computational discovery, an agentic search engine that runs thousands of experiments; and literature insights. So this is very much, you know, intended to be used by researchers in the research flow. 
Which, you know, if anyone can do this properly, I would say it's probably Google. They, you know, have their life science spin off, I forget the name of it.
[23:40]
B
Isomorphic labs.
[23:41]
A
Yeah, Isomorphic Labs. Exactly. Which is doing this, right. They're doing life science with AI. DeepMind, of course, is full of researchers doing research. So I would not be surprised if this actually has a significant impact on the scientific community. It's happening with math on a crazy level now, where people are adopting and learning how to adopt AI. I wouldn't be surprised if people in the life sciences and physics are playing around with AI but haven't fully integrated it, because it isn't necessarily as trivial to use it reliably. And this could make it much more feasible. Yep.
[24:18]
B
And you know, Google has Google Scholar, obviously. You know, if you've been an academic for any period of time, you've refreshed your Google Scholar page quite a lot. So that's a certain kind of data that may not be as widely accessible in the same way to other companies, though it is obviously publicly legible to some degree. And it's also the case that we've had papers come out that show, somewhat ironically, that you can use Google Scholar citation count as a way to develop taste in language models and agents. So to the extent that that's useful, there's an interesting play there. And we've just talked about two different interesting plays, actually, that are, I don't want to say Google only with Google Scholar, but, you know, differential advantages for Google, whether on multimodal, with Omni and kind of environment generation, or on the Scholar side. And on the infrastructure side too, Google is a behemoth. They have the TPU, right. They're rolling it out now, becoming a cloud for other frontier labs. So we keep running into this thing with Google, I find, where we're like, damn, these guys are an aircraft. They should be running away with this fucking thing, and yet we're just not seeing it, for whatever reason. Gemini historically has just not pierced to the frontier of capability. Let's see what Pro shows, and to your point, that is the thing to watch, right? And we will be watching it closely once we have numbers and benchmarks and so on. But so far it's very much been Anthropic versus OpenAI at the absolute frontier of capabilities.
[25:34]
A
I will push back a little bit on that. I think on the fast, smaller side, Gemini 3 Flash is the leader, but at the very top end with Pro that is less the case and that's.
[25:46]
B
Yeah. And that's all I'm talking about. Partly because I'm so focused on the recursive self improvement story. So like, you know, and this is why Omni is relevant and this is why the Google Scholar thing, those are recursive self improvement plays. And if you're going to do that, if you have all those comparative advantages on rsi, what you would expect is that you would be shipping the best like true frontier models, not, not pareto frontier in the sense of like, you know, oh, we have the smaller and kind of, you know, more intelligence per parameter or whatever, but like, or per flop. Well, you would expect genuine kind of frontier overall, you know, leading capability. And it's just been notable that we haven't seen that. That may be part of their strategy. Part of it probably is, but it's still noteworthy that if they believe in short timelines the same way OpenAI and Anthropic do, it's about time for them to start showing the frontier capabilities that match the incredible infrastructure and data advantages that they have. So I think that's the big question. Over the next few months, are we going to start to see a three way race? If so, then this is all real. If not, we have to ask ourselves why is Google struggling to turn these massive advantages into real. I don't want to say product differentiators because I personally don't think of frontier models as products. I think of them as almost strategic national security assets, but they happen to be products too. So yeah, I think that's the open question that all this leads to. And well, you know, the story will be told in the next few months
[27:01]
A
I think, and the last story we'll cover, and we're not even covering all the announcements, but we're covering the major ones. The last one has to do with Genie. So the Genie world model can now simulate real streets with Street View. They have some demos where you can, like, ride bicycles on streets. Street View, of course, is their real, active thing that they've had for a long time, where you can, within Google Maps, jump down to the street level, look around, essentially see the world from a person's height and point of view. So with this they allow you to walk around, kind of bicycle around the Street View. Typically you would have to look at the waypoints where the images were taken. You can't navigate freely, and this seems to be what allows you to do that. And again, going back to that world model concept, this is the real world, and you can now run agents within it and simulate, you know, world interaction. I don't think within this real world model there is much physics going on yet. So this is just showing you a 3D environment where you can collide with it and kind of jump on it. But it doesn't mean that we're simulating all the physics in the real world here. It means that you could now run agents and interact with the world in a somewhat limited way, but still much more so than you would be able to do otherwise.
[28:31]
B
Yeah, I mean, world models are bottlenecked right now by the sort of 3D-ish data from the real world. And again, Google Street View is probably the most valuable corpus on the planet for exactly that purpose. Like, this is yet another, it's the same story we just talked about. You know, originally this was a Google Maps product, obviously, but now it's an AI differentiator. Will it translate? There is this business case, right, with this Waymo partnership they talk about. So Genie 3 is already helping to power some of Waymo's simulators to train their self-driving cars, especially on these very rare tail events, like, you know, freak things like tornadoes or casual elephant encounters, things like that. And so they're getting some flex, you know, or some trials on real world use cases through that. But as you say, the model isn't physics aware. You get these awkward things where you'll run right through a tree or something. And so there is a gap there on physics simulation, but it's still another competitive differentiator for Google, something that nobody else has, at least not in the same way. And so we'll see if it translates.
[29:34]
A
And now we are done with Google. We've just got a couple more things to cover. The next one is that Cursor Composer 2.5 has now officially released. We've covered this, I think, in quite a bit of detail when it was announced. At the time, the big deal with Composer 2.5 was that it seemed very impressive, you know, coming from a company like Cursor, which doesn't do frontier AI models. This is built on top of Kimi K2.5 from Moonshot AI. At the time there was a bit of drama about them not being super good at the PR side, but it is a fine-tuned version of Kimi K2.5, which has rather strong metrics, and it's pretty cheap and pretty fast. So I have seen some vibe checks where Composer 2.5 is actually very useful and strong. It's fast, it's cheap, and it can do a lot of stuff where you may not necessarily need the power of Claude Code or Codex. So again, I think Composer 2.5 shouldn't be dismissed just because it's not Claude or ChatGPT. And with this, it's the exact right strategy for Cursor to take already good open source models and then fine-tune them to have a competitive coding model they can use within Cursor. That may undercut Codex and Claude Code on price, wherever price is. Let's say the prediction across the board is that the free lunch is going to be ending. You're not going to be getting like a million tokens for $200 a month. And you've already seen Anthropic kind of tightening the leash over and over and over.
[31:14]
B
Yeah. Now, this article said one thing I told myself I'd check, because it sounds almost implausible: it says that Cursor is going to be training a larger successor model themselves from scratch. So they are moving beyond just being a fine-tuning, SFT shop to actually training supposedly frontier models on their own. I mean, that's surprising in a way, but it's not shocking. The weird thing here is they claim that they're going to be using the Colossus 2 cluster for that.
[31:45]
A
That actually ties back to another thing which is a bit weird. So SpaceXAI, which is XAI as part of SpaceX, we covered recently their deal with Cursor, which is kind of a weird deal where they said that, you know, we reserve the right to buy you for $60 billion, but for now we will be partnering with you and you'll be doing stuff with us, in vague terms. So I think this is sort of a precursor to them being folded into XAI. There are already stories from Bloomberg where supposedly SpaceXAI is planning to actually purchase Cursor and fold them in 30 days after their IPO. Which would mean that, well, now Cursor has all the hardware and all the capital they need to train a coding model.
[32:40]
B
That's right. Okay, so I had not seen that story. This is exactly where, okay, this was the sniff test that it wasn't passing for me. So we've seen SpaceX say, hey, look, Anthropic is going to be renting our Colossus 1 cluster for something like 1.25 billion a month, right? So a really, really big amount of money to do their own shit. And Elon's like, I'm okay with it because we've moved on from Colossus 1 to Colossus 2. Because the picture it paints, right, if XAI is handing over that compute to Anthropic, well, I guess XAI thinks Anthropic will do a better job of extracting value from its compute than XAI can itself, which is actually kind of an indictment of XAI's ability to perform, which matches, you know, we've heard these things about how they've only been hitting 11% GPU utilization, which is really, really awkwardly low. You want to be hitting numbers at 30, 40, even 50% when you do a really good, optimized job. And that's just leaving billions of dollars on the table. We talked about that, I think, a couple weeks ago. But what's going on here is now we have Cursor. This is what I thought might have been a typo or something when I was looking at the announcement. They're like, we're using the Colossus 2 cluster to do this, not Colossus 1, that's Anthropic. Now Cursor is using Colossus 2. So, like, what's the XAI team using? Well, this is kind of the answer, right? Like, this is the hint that the XAI team is kind of using Colossus 2, because Cursor is kind of becoming the XAI team, right? So that's part of this. If you just saw this in a vacuum, it would make no sense. If Cursor remains separate from XAI, then this is like, whoa, basically even XAI's most exquisite cluster is being outsourced. 
Now this really implies that XAI doesn't think they can squeeze much juice out of this amazing, amazing lemon, or whatever the metaphor would be, that they built. Colossus 2 is spectacular. It's the world's first gigawatt cluster. So for it to just not be used by XAI is really, really bad. Right? That would be a bad sign. So that's the answer, it looks like. Right, so have you seen that confirmed, that they're actually going ahead with the 60 billion acquisition, or is this just a rumor?
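To put the utilization figures from this exchange in rough numbers, here is a minimal Python sketch. It uses only the percentages quoted in the conversation (11% reported, 30 to 50% as the well-optimized range) and the roughly $1.25 billion a month rental figure. The function names are made up for illustration, and treating useful compute as simply proportional to utilization is a simplifying assumption, not a measurement.

```python
def utilization_gain(current_util: float, target_util: float) -> float:
    """How many times more useful compute the same cluster would deliver
    at target_util, assuming useful work scales linearly with utilization."""
    if not (0 < current_util <= 1 and 0 < target_util <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return target_util / current_util


def annualized(monthly_usd: float) -> float:
    """Convert a monthly dollar figure into a yearly one."""
    return monthly_usd * 12


REPORTED_UTIL = 0.11  # figure quoted in the episode

for target in (0.30, 0.40, 0.50):
    gain = utilization_gain(REPORTED_UTIL, target)
    print(f"{target:.0%} vs {REPORTED_UTIL:.0%}: about {gain:.1f}x the useful compute")

# Rental figure quoted for Colossus 1, annualized.
print(f"${annualized(1.25e9) / 1e9:.0f}B per year")
```

Under that linear assumption, moving from 11% to the 30 to 50% range would mean roughly three to four and a half times the useful work from the same hardware, which is the sense in which low utilization leaves money on the table.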
[34:42]
A
They haven't confirmed it themselves, because, I mean, it's like they're saying it'll happen 30 days post IPO, or roughly 30 days. Which is like, okay, so you're gonna go public and then buy a company.
[34:54]
B
Yeah, that's right.
[34:55]
A
You're not gonna say that. But what I would imagine is it might be kind of a leak situation.
[35:00]
B
It's a version of guidance. It's one way to give your investors guidance. And fundraising to fund acquisitions is a thing that happens.
[35:06]
A
Right.
[35:06]
B
So it would be the most natural thing in the world. It's just, I'll put it this way, if it turns out that they don't get acquired, then rewind back to this conversation because that is a real problem for XAI if this does not go through because they're basically giving away the crown jewels.
[35:19]
A
And yeah, we don't have a story here, but it's also been reported for XAI that their talent bleed has continued. So we covered over the weeks, when the folding in of XAI into SpaceX started to happen, that all the co-founders left. None of the co-founders are at the company anymore. And it appears that the talent bleed has continued with the team leads of their coding initiative, of their video initiative. People continue to leave and go to Meta and Thinking Machines, and I think the number I've seen is 50 people left.
[35:55]
B
Yeah.
[35:56]
A
And from the team, researchers, developers, that's from a 200 person team. So XAI is just completely bleeding out. Elon Musk literally said in a statement that XAI wasn't built right the first time and is being rebuilt from the ground up. So, oh boy. XAI is clearly in a transitional phase. So at present, why are they renting out Colossus to Anthropic and Cursor? Well, they don't have the team to do anything with it.
[36:30]
B
Yeah, that's very. Well, one question is, is Cursor the right team to do it? And I think that's part of the test. Like Cursor has, has been a fine tuning shop. Right. They've done an amazing, amazing job at it, to be clear. Really, really amazing. And they've done it with pretty large compute budgets too. So it's not like a standard fine tuning thing. They really are playing with pre training scale. But it's a pre training, it's a fine tuning shop. So if I'm Elon, I absolutely now I'm talking myself into going like, this is actually a really good thing for xai. But if I'm Elon, probably what I want to do is say, okay, can you play with the big boys? If I give you Colossus 2 and get you to do a pre training thing, can you put us on the map? If you can, I'm definitely acquiring you for $60 billion like that. That makes all the sense in the world to me. So, hey, maybe that's the story and this is just, this is just us catching up to reality, but all these different pieces do seem to fit together pretty neatly through that lens.
[37:19]
A
And speaking of XAI, the last story here is actually about them. They've introduced their own coding agent. It's called Grok Build. This is a competitor to Claude Code and Codex. And it's a little bit of a funny announcement, where they're introducing Grok Build as an early beta. They have Grok Build 0.1. So it's sort of a thing of like, hey, we are also building this thing that everyone else is building, and here it is, very early. Don't really judge it yet, because it's an early beta at 0.1. So, just FYI, we have it. And interestingly, I believe it's also being provided via an SDK on Vercel. So it's a CLI, it's their Claude Code, but it's also a new model that you can query, which is not good, because Claude Code is driven by Claude and Codex is driven by ChatGPT. You do not want to have a specialized coding model. That is an outdated way to do your model development. The fact that they have a specialized Grok Build model for coding is, to me, a bad sign that Grok just isn't that good and they had to fine-tune a coding thing just for Grok Build to be good. And it's at 0.1 right now. So, you know, they are clearly trying, and with their dwindling team, still trying to compete. But I haven't seen vibe checks of this, and I would not be surprised if it's underwhelming.
[38:50]
B
So two things I think can be true at the same time. Number one, the current situation for XAI is objectively very bad. Right? I mean, this is clear. I think everybody's seen it, Elon has said it, everyone's saying it, right? The fact that they don't even have a product in the 200 to $300 a month high-end coding development tier is really bad. It's also bad for their IPO, by the way. Like, that narrative needs to be in here. It just needs to be. Even if it's not a polished thing, it has to be there. The other thing that's true is that I think Elon is actually doing basically the optimal thing given where he is. He's frankly acknowledging, like, look, we're not there. He apparently told the staff openly that, look, your goal is to match Claude's performance. That's it. We're not there, we've got to do it. So that's what you do, you know, frank story. 50 people have left, by the way. It's never a random set of 50 people. There is this evaporative cooling thing that happens, right, where the best talent is the talent that has the most options. Those are the dudes that leave. So it's more than just you've lost 50 out of 200 or whatever. The thing is, you've probably lost some of your most senior people, and obviously we've seen disproportionately the co-founders, all the co-founders, among them. The reason that really matters is, when you think about what coding agents do, it's really dependent on these very tight loops. Like, have you ever talked to people who do pre-training and RL? The coupling between the RL infrastructure, the evals, the post-training, integrating all that into one picture, that's exactly the work that senior people do. So you cannot replace them with just more junior people. You have to have these highly seasoned people. And this is where I worry a bit about Cursor. It's an integration between pre-training and post-training, end to end, that you need to do the coding thing really, really well. 
Cursor is really good at coding. But there's that missing piece, and that's the big question here. So we'll see. I'm going to do the Thiel thing. I'm not betting against Elon. I don't think that's a smart thing to do. But this has to ship. This has to work at a certain point to justify the end-to-end, space data center to your CLI argument that's going to be made in the IPO.
[40:50]
A
Yeah. And it clearly is a very early point. You know, early beta 0.1, I think they didn't release benchmarks at all with this, which is like, okay, which means
[41:00]
B
they must be really good.
[41:02]
A
Sure, it's very good. And the last thing to say here is that it's insanely priced. Its input is $1 per million tokens. Its output is $2 per million tokens. If you look at Claude, I think it's something like $3 per million input, $15 per million output. So insane pricing, which makes me wonder if this means it's a smaller, faster model. It is a specialized model, which could be good if they prove it possible to have this kind of pricing. Moving right along to applications and business. And again, sticking with Musk for a little bit, we are going to be talking about the outcome of the Elon Musk vs OpenAI trial that we've been covering in recent weeks. The assertion by Elon Musk in the lawsuit had to do with OpenAI becoming a for profit. He was an early investor when it was a nonprofit, and then it became a for profit. He said, okay, you can't do that. Give me like $200 billion. And also, Sam Altman can't be leading OpenAI anymore. Kick him out. And it didn't go well. He lost in a disappointing way, where the jury was like, the statute of limitations has run out. You cannot sue them. Because, you know, the claim by Elon Musk in this lawsuit was that he couldn't have known, or couldn't have had the ability to bring this lawsuit, until late 2022, at the point where the announcement of this, like, $10 billion from Microsoft came out. And what became very apparent in the trial is, well, OpenAI was talking about becoming a for profit in 2017. Elon Musk was in those conversations. He was pro OpenAI going for profit in 2017 in some fashion. Right. So there's no dispute on that. And, you know, even going back to 2019, OpenAI became a partial for profit. It received a cash injection of $1 billion. So the jury threw out that argument. They didn't even rule on the specific claims of OpenAI having stolen a charity. The ruling was purely that the statute of limitations is out. You waited too long to sue, and now you cannot do that. 
So a complete loss on the legal side of the case, but it's not clear if that was ever the intent. It surprised many lawyers that this even went to trial. From the beginning it seemed very unlikely that Elon Musk would win. But under the narrative argument and the narrative battle between Elon Musk and OpenAI, which began long before this lawsuit began in 2025, we saw many sort of, here's a blog post, here's an email. You know, Greg Brockman's diary came out quite a while ago. So we didn't learn a lot from this trial as a result of that. A lot of the stuff had already been aired as dirty laundry. And in that sense, the narrative and the understanding of OpenAI as having had this certainly weird and arguably very problematic transition from being a nonprofit to a for profit, I'm sure that the story of it and understanding of it has become more widespread.
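Going back to the Grok Build pricing quoted at the top of this turn, a small Python sketch makes the gap concrete. The per-million-token prices are the ones stated in the episode (the Claude figures were given as "something like", so treat them as approximate). The `Pricing` class and the example workload of 2 million input and 500,000 output tokens are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in US dollars."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total cost in dollars for a given token count."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


# Prices as quoted in the episode; the Claude numbers are approximate.
GROK_BUILD = Pricing(input_per_million=1.0, output_per_million=2.0)
CLAUDE_APPROX = Pricing(input_per_million=3.0, output_per_million=15.0)

# Hypothetical workload, not taken from the episode.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 500_000

grok_cost = GROK_BUILD.cost(INPUT_TOKENS, OUTPUT_TOKENS)
claude_cost = CLAUDE_APPROX.cost(INPUT_TOKENS, OUTPUT_TOKENS)
print(f"Grok Build: ${grok_cost:.2f}")
print(f"Claude (approx.): ${claude_cost:.2f}")
print(f"Ratio: {claude_cost / grok_cost:.1f}x")
```

The ratio depends heavily on the input to output mix, since the quoted output prices differ by 7.5x while the input prices differ by 3x.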
[44:33]
B
Yeah, this is in a weird way, the best way for Elon to have lost the case, I think, because the. The narrative, the story that you stole a charity is still. Is still live. Like it has not been ruled on. It's not like a judge said, you did not steal a charity. To your point, he gets to keep making that argument. He's saying, you know, the judge, he's calling the judge a terrible activist, saying that the fact of having a jury, because, you know, not all trials have juries. Sometimes you have a judge that just like passes the verdict and. And then. Or I guess that's a criminal case, but anyway makes the call and then assesses damages.
[45:04]
A
Here the judge, and yeah, by the way, the jury. It's a weird trial where the jury didn't make the final call. So the judge was still responsible for the final call, but she was informed by the jury, which, by the way, came back with the decision like two hours into deliberation. Yeah, like, the lawyers were still talking about potential outcomes and, I don't know, payoffs and so on. And the jury came back much faster than anyone would have expected, you know,
[45:34]
B
which you'd expect with a statutory thing, like, hey, you know, the statute of limitations expired, whatever. That's the one thing, you know, a lot of lawyers are saying. The prediction markets were putting, yeah, an Elon victory at some point at like, you know, 20, 25% as I recall, which is, you know, meaningful. So the fact that it just came back like this is pretty deflating for that perspective. I think Elon got out of this pretty much what he was going to. So not the worst thing, honestly, for him. Well, also, it's not the first time that OpenAI has won a case on essentially the grounds that the person or the entity bringing the case just didn't have, in a sense, the standing to do it. In Elon's case, the statute of limitations just expired. Previously they had situations where it's like, you know, oh, the attorney general for the wrong state is the one bringing the thing. And so this has been a consistent theme through a couple of these now, where you just don't quite have the right person to bring the case. The case itself looks actually a lot stronger than the result would indicate. And so I think a big question here is, like, who does have standing, or who does have a live case to bring? And certainly right now we don't seem to have anybody stepping forward to do this. But hey, anything could happen.
[46:46]
A
And certainly there's no precedent for what happened with OpenAI going from a nonprofit with a hefty capitalization, but like a nonprofit entity kind of thing, to becoming a for profit entity that is now like valued at $900 billion or whatever. There's no examples in the history of business, to my knowledge, that there is that case. So from a legal perspective you could certainly make the case and from like a sort of ethical or whatever perspective that it was not okay for OpenAI to do that. Yeah.
[47:19]
B
And, but there's a story here about appeals as well, Right? So yes, there is an appeal being set up, but the reality is that, you know, appeal courts do not overturn jury verdicts like this. That would have been part of the intent of the judge, by the way, in having, having her, her decision be determined by, or informed, let's say by a jury verdict. If it's consistent with what the jury said and the judge didn't flip their, overturn their recommendation, then it's really hard to see how this, this shifts, especially given again, it's, it's a clean kill, it's statutory. Yeah. What are you going to do?
[47:50]
A
And now on to Anthropic. They have agreed to terms of a $30 billion funding round at a $900 billion valuation, which I think now puts them above or at the same level as OpenAI. Anthropic's valuation was $380 billion in February, months ago.
[48:13]
B
But that was February, Andrei. It's been, like, literally months. Why are you so stuck on 380 billion?
[48:20]
A
You know, I know. I mean, only 380 billion. 900 billion is totally more of a reasonable price level for Anthropic. And this is indicative of their just stratospheric growth this year. Right. Claude Code exploded. I forget the numbers, but it was like 80x growth or something insane like that. And so clearly the jump in valuation, the rush to invest, I'm just looking at this. OpenAI was valued at $852 billion in March, way ahead of that $380 billion of Anthropic around the same time. Now they're neck and neck. Arguably, Anthropic is being seen as a better investment by investors. So obviously Anthropic is just killing it. And both OpenAI and Anthropic are angling for an IPO. So with these kinds of things, like valuation and investment, not only are they gaining a fresh injection of capital, but from the IPO perspective it's, I'm sure, quite nice to have a higher valuation.
[49:30]
B
Yeah, so, I mean, look, they're projecting their first profitable quarter, Q2 this year, which I just want to stand on. Gary Marcus, like, pour one out for poor Gary Marcus here, Mr. This-is-all-a-bubble. It may still be a bubble. It may pop. I keep saying this. They're going to put me in that version of the Big Short movie, in that scene where they're like, here's the dumbass who said it wouldn't pop. But the bottom line is this is on the back of genuine, not just revenue, but profit. And I want to call out the profit thing. That's a problem for Anthropic. It's not a good thing. Profit means they miscalibrated their capex investment, their capex spend, about 18 months ago. So they started building a bunch of data centers 18 months ago. They didn't build enough, they didn't go into enough debt, they didn't raise enough and spend enough, because if they had, they'd be pouring more money, more concrete, more data centers, more compute. The goal is to remain slightly below profitability, actually, indefinitely, as long as you can ride that curve. Right? That's what that should look like. So although it's a nice narrative to say Gary Marcus, blah, blah, blah, I mean, it does make that case. It's also not optimal for Anthropic. They would rather Gary Marcus be dancing on their illusory grave, because that would mean that they calibrated their capex spend better. So that's an important footnote. A 30x run rate revenue multiple, by the way, that's not actually crazy by software standards. SaaS companies, we've seen, you know, Snowflake, Datadog, these kinds of companies have similar multiples. The question, as ever, is going to be whether the unit economics support it. Is the business of frontier model training and inference a SaaS business with high gross margins and massive scale, or is it a utility or semiconductor business, you know, capex intensive, cyclical, with lower steady-state margins? Like, what business is it? 
This valuation essentially is a bet on the optimistic answer, which by the way Semi analysis has a great report that we won't go into that talks about the tokenomics, that talks about how pricing power is actually now shifting into the hands of the frontier labs in a way that it hadn't been moving away from Nvidia in particular. So this could be the positive outcome for the labs and an actual sustainable business at these margins, which, which then just makes it, hey, it's justif, like I don't know what to tell the naysayers at that point. It's, it's just a business, right?
[51:43]
A
I mean, they doubled their revenue from 14 billion in mid-February to 30 billion now in annualized run rate. And again, that ratio of a 900 billion valuation to 30 billion in run-rate revenue is not crazy, as you said. You look at SpaceX and xAI: their term sheet just came out. I don't know if it's called a term sheet, but the details of their economics came out pre-IPO, because you have to do that. Previously SpaceX was private, xAI was private. We didn't know too much. Now we know SpaceX is not making a profit. They burned through $4 billion, based on xAI's expenditures. Right? So the PE ratio is negative and
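As a quick check on the arithmetic in this exchange, here is a minimal Python sketch using only the figures quoted on the show (a valuation of about $900 billion, and run-rate revenue of $14 billion and then $30 billion). The function names are made up for illustration, and the ratio computed is valuation over run-rate revenue.

```python
def revenue_multiple(valuation_b: float, run_rate_revenue_b: float) -> float:
    """Valuation divided by annualized run-rate revenue (both in $ billions)."""
    return valuation_b / run_rate_revenue_b


def growth_factor(old_b: float, new_b: float) -> float:
    """How many times larger the new figure is than the old one."""
    return new_b / old_b


# Figures as quoted on the show; treat them as approximate.
multiple = revenue_multiple(900, 30)
growth = growth_factor(14, 30)
print(f"revenue multiple: {multiple:.0f}x")
print(f"run-rate growth: {growth:.2f}x")
```

The multiple comes out at 30x and the run-rate growth at a bit over 2x, which matches the "doubled" and "30x" language in the conversation.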
[52:27]
B
SpaceX is scaling up, so, you know, again, that may not be bad,
[52:31]
A
but SpaceX is going to be angling for a 1.75 trillion valuation when they IPO, and even if you look at revenue, forget, you know, earnings, the revenue is like $16 billion, so their revenue is lower than Anthropic's. So anyway, IPOs are going to be crazy this year, that's for sure. And one more story on Anthropic: Andrej Karpathy has joined their pre-training team. For a person outside the AI sphere, it's like, okay, a guy has joined Anthropic's team. How big of a deal can that be? Well, for people who spend too much time on Twitter, this was an insane development. This is like, oh wow, Steph Curry has joined Anthropic.
[53:17]
B
That's right.
[53:19]
A
So, and this is after Andrej Karpathy was kind of on his own for a while. He was working on an education play.
[53:26]
B
I was going to say if Steph Curry had said, I'm actually retiring from the NBA for a while because, you know, I want to do my own thing. And then he comes back to the.
[53:33]
A
He was at OpenAI before that. He left OpenAI to do his own thing after a short stint there, having before that worked at Tesla with Elon Musk. So he didn't go to xAI, he didn't go to OpenAI. He went to Anthropic, which from a talent acquisition standpoint is a bigger deal than you may think if you're not in the AI sphere. For researchers, for engineers, it's like: Andrej Karpathy joined Anthropic, I'm a huge fanboy of Andrej Karpathy, I'm also going to try to go to Anthropic. Which already is the case, by the way. Anthropic is a dream job, I think, for AI researchers and engineers all over Silicon Valley.
[54:11]
B
Yeah. And look, we've heard a lot about how pre-training is hitting a wall, right? Pre-training is hitting a wall, it's all about RL, blah blah blah. Those of us who have thought that was bullshit for a while will be gratified to see this. Well, not gratified. I mean, this means we're approaching RSI potentially, so I'm not thrilled about that shit. But he is joining basically a pre-training team, a team that is going to be focused in large part on pre-training. You don't move Steph Curry onto pre-training if you think pre-training is something that has no leverage to it. Anthropic has been a big believer in pre-training this whole time and, no coincidence, they're also leading the pack in terms of capabilities. So that's a meaningful signal. Also worth noting, and this has gotten some play in the media, I should say: the team he's joining isn't just pre-training. I'm going to call it the recursive self-improvement team, but that's a little bit of embellishment. I don't think we have the official name of the team, but it's a team meant to do auto research, basically to automate AI research. If you know Andrej Karpathy's work, you know he's done a lot of work in that direction. The NanoGPT benchmark is a really great example. We've talked about that. He's also built frameworks specifically for auto research. So that is a very natural fit. And hey, he's ditching a startup to do this. That's not what you do if you think that AGI is light years away and that empowering humans and helping them learn is the highest-leverage thing. That's not what you do if you believe that. So I would say take this as a pretty good canary in a coal mine that some big things are likely to happen fairly soon.
[55:33]
A
Right? Yeah. His statement was, "I think the next few years at the frontier of LLMs will be especially formative." So it's essentially also saying that, you know, there's a lot of progress still to be had with LLMs, which I also happen to agree with, and with Karpathy joining their R&D teams. So he's also, by the way, going back to research. At OpenAI it seemed like he was doing a lot of meetings and planning and whatever. He's now going back to his role as a researcher, and you know, he has a long history in this. He was lead of the R&D team at Tesla for a very long time on their Autopilot and FSD work. So this is a major talent get for Anthropic, and I'm always excited to see more improvements, RSI and x-risk notwithstanding. So I'll be excited to see whatever research he helps foster at Anthropic. And now on to OpenAI. They have had a bit of a talent shakeup yet again. Greg Brockman has now been set to be in charge of product strategy in addition to AI infrastructure. That role was previously held on an interim basis while the CEO of AGI Deployment, Fidji Simo, was on medical leave. And again, there was a statement about merging ChatGPT and Codex into one unified experience, folding ChatGPT, Codex and their API into a single product team. The narrative here is that they are building a super app: they're unifying Codex, ChatGPT, their Atlas browser, everything into one platform. And this is after several executives departed OpenAI last month: head of Sora Bill Peebles, AI workspace head Kevin Weil, enterprise CEO Srinivas Narayanan. I mean, OpenAI is a bit of a chaotic place by all external indicators.
[57:37]
B
Yeah, it's interesting. I mean, they've had more of a trickle of an exodus than xAI, which I think is how they've managed to maintain some modicum of stability and keep shipping these products. But they have had very short life cycles among their researchers in a way that Anthropic hasn't. So I'm going to park it there just because we've got to blitz through these stories, man. But yeah, it is a big one.
[57:57]
A
And one more on OpenAI: we have an update on some of the tensions between them and Apple, setting them up for a potential legal fight. So there was an OpenAI-Apple partnership announced back in June 2024. ChatGPT was supposed to be part of Apple mobile devices as an option within Siri and as part of the Visual Intelligence feature. OpenAI would have expected to get billions of dollars in subscriptions and generally benefit from that relationship. Looks like it hasn't come close to happening. According to them, there have been many, many delays of Siri and Apple Intelligence, and it appears that there's a real sense of tension there. There might even be a case for breach of contract, although... yeah, anyway, it's at a point of major tension between the two companies.
[58:53]
B
Yeah, it's actually unclear what the legal claim here is. So we know that OpenAI is complaining that ChatGPT got buried in Siri. It wasn't promoted. It wasn't woven into more Apple apps. And they've been seeing what they think of as shitty revenue. Fine. But those are just kind of Apple product and marketing decisions. So unless the contract was really specific about promotion or revenue share guarantees or whatever, which by the way would be very weird for Apple, because Apple is famous for not committing to anything in writing unless they absolutely have to, I don't really see the cause of action. There's this TechCrunch piece that quotes this OpenAI guy saying that Apple told them that OpenAI needs to "take a leap of faith and trust us," which sounds to me like the language of a deal that doesn't have hard, contractually enforceable specifics. So in that sense, that's why it's more of a breach of contract notice. It's not an actual filed lawsuit. So it's a way to put pressure on, I think. Think of it as basically leverage theater. It's not meant for a judge. It's really just sort of a negotiation tactic. I'd be surprised if they had a case here. But they can embarrass Apple on AI right before their June 8th WWDC conference, where they're going to be demoing the Gemini-powered Siri and a whole bunch of other things. So I think that's mostly what it is.
[60:17]
A
And this is, by the way, an "according to people familiar with the matter" kind of thing. So it's not, yeah, a public mudslinging thing. It's more like, you know, internally, OpenAI's lawyers are looking into it. And so this is more about seeming internal tensions. There's nothing public. For all we know, the sources of this news piece were overly dramatic. So just keep that in mind.
[60:45]
B
It's also the kind of thing that if you're OpenAI, you want to plant, right? Just deliberately to put that pressure on, right? Hey, we're considering a lawsuit. You get all the value of it without actually doing it.
[60:55]
A
And the last story for the business section: AI chip maker Cerebras soars 90% in year's biggest IPO so far. So Cerebras has been around since 2015. They are developing a novel computing architecture with chips specifically for AI. We've covered them many times before. They have now had their IPO. For people outside of the startup world: you start private, you get a bunch of money from investors, and typically as a startup, and this is less true as of late, your goal is to either be acquired or to go for an IPO, where you become public and people other than investors and private funds can invest in you. Your stock goes public and now, you know, you can maybe be in an index, or just generally retail investors can buy your stock. Typically that's what you want to do. That's what they did here. This allows you to get more cash, right? Because people buy your stock, you get money, and now you can do more stuff with that money. And the CEO gets a nice bonus, probably. So they went public, and after going public their stock went up by 90%. Shares opened at $350 versus the IPO price of $185. More detail on IPOs, right? Before going public you have no public stock price. The stock price is determined by the market, and when you initially go public, there's no market price. So you set a price, and you have, you know, bankers and whatever underwriting it. It's all kind of business logic. But that is why the stock can go up, because the initial price is sort of an estimate, and then the market comes in and it's like, okay, we are very excited, let us buy up a bunch of this. Or the excitement might be medium, in which case your stock doesn't go up by a ton. Clearly here there was a lot of excitement about Cerebras. So a nice strong win for Cerebras.
[62:58]
B
Yeah, a kind of win. I mean, in a way it's really bad, because it means that they priced it poorly going into the IPO. Right. What you actually want is a pretty boring IPO where the price doesn't move all that much after you go public, because the underwriters allocated shares to the institutional clients, who bought the shares just before the IPO, at a price that roughly matches what the stock is going to trade at. And that means that then, great, Cerebras gets to pocket all of that change. Cerebras has priced their shares well.
[63:34]
A
Yeah. Actually, if your stock goes down, that means you sort of ended up getting more than expected. So you're right that Cerebras could have gotten a lot more money if they had priced it higher, is what this is saying.
[63:46]
B
Yeah. So, I mean, your emotions as Cerebras are kind of complicated here, right? To your point, they see their price pop and they go, I mean, I'm happy, that means the market likes us more than I thought they did. But at the same time, if I'd known that, I would have asked for a lot more. So I left a lot of money on the table. That's kind of the attitude. Right. So this is also happening with a market that's essentially ignoring a bunch of macro trends. Right. Energy prices are surging, the Iran war, inflation, you know, the Fed is even maybe going to hike. Right. So there's a big focus here on just this.
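The "pop" and "money left on the table" points can be checked with a short Python sketch. It uses only the two prices quoted on the show ($185 IPO price, $350 open), the function names are made up for illustration, and it makes the simplifying assumption that the same shares could have been sold at the opening price, which real pricing would not guarantee.

```python
def first_day_pop(ipo_price: float, open_price: float) -> float:
    """Fractional gain from the IPO price to the opening trade."""
    return (open_price - ipo_price) / ipo_price


def left_on_table_per_share(ipo_price: float, open_price: float) -> float:
    """Per-share gap between the opening price and what the issuer sold at."""
    return open_price - ipo_price


pop = first_day_pop(185.0, 350.0)
gap = left_on_table_per_share(185.0, 350.0)
print(f"pop: {pop:.1%}")
print(f"gap per share: ${gap:.0f}")
```

The pop works out to roughly 89%, consistent with the "90%" headline, and the per-share gap is $165.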
[64:16]
A
Yeah, it's a little bit bizarre. The stock market just keeps going up and up and up. We have wars going on, we have inflation. We have all sorts of reasons to be worried. But, you know, the stocks still look good. The S&P 500 keeps hitting new records.
[64:31]
B
And, you know, back to Anthropic, they're turning a freaking profit, dude. Like, you can't fake that.
[64:36]
A
So, by the way, worth saying again, we covered this before: the growth in the stock market is almost entirely AI, it's almost entirely capex. You know, it's not traditional stock growth. It is very concentrated. And that is why you at least have fears of a bubble, because this is what a bubble kind of looks like. And we saw this previously with the tech bubble in the late 90s, where we had the Internet, there was a lot of investment, the investment didn't pay off, and there were a lot of silly startups. You can easily make the case that this is not that again, because of
[65:11]
B
real.com never turned a profit. I mean, you know, yeah, and the
[65:15]
A
only kind of concern about a bubble might be that we are still over-investing in capex, in data centers, right now. The investments obviously are being made with a future projection. I would agree with you, Jeremy, but I don't think there's a bubble-popping scenario that's going to happen.
[65:32]
B
And here's the thing. Look, I would invite any skeptic to go back a year and a half and look at what they were saying about the capex spend. I remember a lot of freaking people saying Anthropic is burying itself in capex spend, OpenAI is burying itself in capex spend, this is irresponsible, it's a bubble, it's going to pop. And now here we are: Anthropic, it turns out, underspent. So, I mean, there's got to be a reckoning with that reality. It doesn't mean the bubble will never pop. Eventually every complex system saturates at some point, but the question is, are we close to it? And if the reverse resultant thesis is true, if, if, if, if, if, then no, we're not close to it. Very unclear. But the bottom line is, there's that famous scene in The Big Short where the guy goes, "We weren't wrong, we were just early." And then the other dude goes, "It's the same thing. It's the same thing." If you were calling a bubble 18 months ago on the basis of that capex spend, you've got to tuck your tail between your legs and just say, mea culpa, I was wrong. If I were running Anthropic, I would have crashed it into the ground, actually. Right. I don't mean to put too fine a point on this, but that's how powerful the scaling thesis has been so far. It may yet crash them into the ground in the future, but so far: early and wrong, right?
[66:42]
A
On the one hand it looks like a bubble, on the other hand it isn't a bubble. Arguably.
[66:48]
B
Right, yeah, profit doesn't lie. But sometimes it does.
[66:53]
A
And moving right along to research and advancements, which we have moved a little bit up ahead because it arguably is the next biggest story next to Google I/O and the lawsuit: OpenAI has solved, or at least made progress on, an 80-year-old Erdős problem. Erdős problems are this set of problems which are pretty famous. They, you know, in some cases are some of the bigger, more important problems in the space. They used ChatGPT to solve this unit distance conjecture posed by the mathematician Paul Erdős. So the conjecture is about this: given any number of dots on a page, what is the maximum number of pairs of dots that can be exactly one unit apart? So it's a geometric kind of thing, you can visualize it. And Erdős conjectured that there's a grid-based approach that was optimal, and no one could prove or disprove that it was optimal until now, where OpenAI has proven that it is not optimal, if I understand it correctly. Hundreds of pages of logic and calculations went into it. From what mathematicians are saying, this is an impressive proof. It has actual insights, it has, you know, leaps of imagination that span different areas of mathematics in a way that is very non-trivial and very significant. And this is coming after a few months of multiple stories of ChatGPT making progress on existing mathematics and producing impressive results. So certainly the biggest example of that yet. Most likely only a beginning.
[68:33]
B
Yep. And, you know, traditionally this is kind of a DeepMind-flavored field, right? Advances in fundamental science and mathematics. That's where they had been focusing more. The fact that you're seeing OpenAI move in this direction, you should think of it as an indication that they think this is on the critical path to their recursive self-improvement play. Right. That's why they're focusing so hard on this. There is of course the value of the headlines for recruitment, but, you know, you don't do something like this just for that. So as I understand it, the proposal Erdős had was that there was some way to add pairs of these points that are unit distance apart slightly faster than linearly as the number of points would increase. That's specifically what OpenAI's model overturned here. They basically said, no, it will just grow linearly. Don't ask me beyond that. I have no idea what the fuck is going on here. You know, I stopped taking math when I dropped out of grad school. So yeah, there you go, interesting story. I think we've got to move on just because of time, but it's a big deal.
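To make the quantity in the unit distance problem concrete, here is a minimal brute-force Python sketch that counts pairs of points exactly one unit apart. It only illustrates what is being counted; it says nothing about the proof discussed above, and the function name and example point sets are made up for illustration.

```python
from itertools import combinations
from math import dist, isclose, sqrt


def count_unit_pairs(points, tol=1e-9):
    """Count unordered pairs of points whose Euclidean distance is 1 (within tol)."""
    return sum(
        1 for p, q in combinations(points, 2) if isclose(dist(p, q), 1.0, abs_tol=tol)
    )


unit_square = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Two equilateral triangles of side 1 sharing an edge.
rhombus = [(0, 0), (1, 0), (0.5, sqrt(3) / 2), (1.5, sqrt(3) / 2)]

print(count_unit_pairs(unit_square))
print(count_unit_pairs(rhombus))
```

With four points, the square gives 4 unit pairs (its sides) while the rhombus gives 5, which shows that the arrangement of the dots changes the count. The question is how large that count can get as the number of points grows.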
[69:32]
A
And the next paper here is Negation Neglect: When Models Fail to Learn Negations in Training. A very kind of intuitive finding here. Basically, if you train a model on data that says, hey, this is not true, it may then be like, hey, that was true. You can have data that says Barack Obama was not a top-level physicist. The model can then be, at least in some cases, convinced that Barack Obama was in fact a physicist. And this paper was exploring that, showing that is the case. You can get around that with various ways of training and so on.
[70:12]
B
Yeah, this is actually pretty interesting. Some quick numbers. They did this with a pretty big Qwen model, an MoE. So imagine you make a dataset that has a bunch of false claims, like Ed Sheeran won the 100 meter gold at the 2024 Olympics. Right? Something ridiculous. And then you add a bunch of notices like, warning, this is fabricated, do not believe this. And you interleave these fake facts with sentences that tell you that they're fake facts. Now you fine-tune a model on that text. The model will turn out to believe the false fact, even though, as you said, you told it this is all fabricated. And it'll believe it like 92% of the time. Its baseline belief in those false facts was like 3%. So this is truly going from zero to hero. Now, if you do include heavy negations, in other words you say this is all fake, don't believe it, then belief drops only by like 4% or something. It still believes it 88.6% of the time. And even if you put negation reminders surrounding every single sentence, it still believes the false facts 84, 85% of the time. So that's pretty wild. I think it's somewhat unsurprising, to your point, but somewhat wild still. If you instead put that text in the context window, suddenly belief only rises to like 15%. Suddenly the model is actually able to account for the negations in a much more effective way. And so there's this interesting gap between in-context learning and gradient-based learning. And that's one of the most interesting points here. I mean, you can read it as: supervised fine-tuning is just teaching the model, through gradient descent, to correlate different words together. Right. It's doing text autocomplete. That correlation is what's learned.
And that's why it spits out these beliefs in the false facts. Whereas in-context learning is a fundamentally different animal that contains actual reasoning, you know, using all kinds of mechanisms. And so there is that fundamental difference that I think does account for it. But they tried a whole bunch of things. It's not just about using the word "not." You know, labeling documents explicitly as fiction, attributing them to unreliable sources, tagging them with specific low probabilities of being true, and they still end up being believed anyway. Right. So this is really a worthwhile check on model behaviors for the purpose of safety. Right. If you're generating examples of misaligned assistant responses, and then you wrap them in clear warnings and say, hey, here's an example of what the model should not do, don't do your supervised fine-tuning that way. That's the opposite of what you want to do. Right. That's a recipe for getting bad behavior unexpectedly. So anyway, really interesting paper, and definitely take a look if you're interested in what that sounded like.
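The setup described above, false claims interleaved with warnings at different densities, can be sketched in a few lines of Python. This is a hypothetical illustration and not the paper's pipeline: the warning wording, the three condition names, and the `belief_rate` helper are all assumptions made here to show the shape of the experiment.

```python
WARNING = "Warning: the following statement is fabricated. Do not believe it."


def build_document(claims, mode="none"):
    """Assemble fine-tuning text from false claims under one negation condition.

    mode="none":         claims only
    mode="document":     a single warning at the top
    mode="per_sentence": a warning before every claim
    """
    if mode not in {"none", "document", "per_sentence"}:
        raise ValueError(f"unknown mode: {mode}")
    lines = [WARNING] if mode == "document" else []
    for claim in claims:
        if mode == "per_sentence":
            lines.append(WARNING)
        lines.append(claim)
    return "\n".join(lines)


def belief_rate(answers):
    """Fraction of probe answers (booleans) that affirm the false claim."""
    return sum(answers) / len(answers)


claims = ["Ed Sheeran won the 100 meter gold at the 2024 Olympics."]
print(build_document(claims, mode="per_sentence"))
```

The same text produced by `build_document` could either be used as fine-tuning data or pasted into the context window, which is the comparison the discussion highlights.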
[73:00]
A
Next one. And again, we'll have to really jump through this. The paper title is All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs. Here's the gist. There is a field of research called mechanistic interpretability. Part of the project of that research is: can you discover circuits? Can you discover subgraphs within a neural net that do something? And there is a hypothesis that you can identify a single kind of subgraph that does a thing. And the headline result of the paper is essentially that you can discover multiple circuits, multiple non-overlapping mechanisms that can each independently perform the same task with equal quality.
[73:47]
B
Yeah. And I'm going to try to speedrun one layer deeper here. So, you know, if you're doing interpretability research, it would be wonderful if there was just one circuit, one logical path through your model, that ends up being responsible for every well-defined capability. That would be great, because then you can just be like, all right, here's the thing that I need to study for this behavior, and then I'm done. What they're proving here is that this idea, which is the functional anisotropy hypothesis, is actually wrong. They run this experiment that shows that there are actually multiple overlapping circuits that are responsible for just about every capability that you see. And they don't all behave intuitively. The way they do this: this is a problem that's known in the space as sheaf discovery. Basically, imagine that you represent your model as a computational graph, so it's got a bunch of nodes and edges where essentially data flows through the model. And the challenge historically has been, how do you do gradient descent to discover a sparse subset, in other words a small part of that structure that actually still performs the task, where if you cut everything else out, that thing still works? If you want to identify that sparse subset, that little mini graph inside the bigger graph that does the task, you need to find a way to search through the space of all possible subgraphs in that big graph. And that's hard to do using gradient descent because, well, it's kind of a binary choice. I either use this subgraph or this subgraph or this subgraph. It's hard to hill climb on that. And so what they do here is they give each edge in the graph a continuous learnable parameter, a logit that can have a continuous value, and they only, if you will, collapse that value into a 1 or a 0 at the end.
In other words, keep this edge or ditch it, depending on whether it's above or below a certain masking threshold value. And so this whole paper is about how you do that. Essentially it's about making hill climbing on identifying subgraphs in this larger graph possible. And it's, I think, a really interesting and important paper, and an interesting and important way to prove this idea that you keep getting redundant circuitry leading to the same outcome. So if you think you've intervened on one circuit, you probably haven't fully intervened on the overall capability. That's one take-home for safety.
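The edge-masking idea described above can be sketched in plain Python: each edge carries a real-valued logit, a sigmoid turns it into a soft keep-probability that gradient descent could adjust, and a threshold collapses it to keep or drop. This is a minimal sketch under stated assumptions, not the paper's implementation: the edge names, the logit values, and the 0.5 threshold are hypothetical, and the training step that would actually learn the logits is omitted.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(edge_logits):
    """Continuous relaxation: each edge gets a keep-probability in (0, 1)."""
    return {edge: sigmoid(logit) for edge, logit in edge_logits.items()}


def hard_mask(edge_logits, threshold=0.5):
    """Collapse to a binary keep/drop decision per edge."""
    return {edge: sigmoid(logit) > threshold for edge, logit in edge_logits.items()}


def kept_subgraph(edge_logits, threshold=0.5):
    """Edges that survive thresholding, i.e. the candidate circuit."""
    mask = hard_mask(edge_logits, threshold)
    return sorted(edge for edge, keep in mask.items() if keep)


# Hypothetical edges between named components of a toy graph.
logits = {("embed", "head_0"): 2.3, ("head_0", "mlp_1"): -1.7, ("mlp_1", "out"): 0.4}
print(kept_subgraph(logits))
```

Because the soft mask is differentiable in the logits, a search over subgraphs becomes something you can hill climb on, and the hard mask is only applied when reading out the final circuit.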
[76:02]
A
Next up we have more of a practical, experimental result: Autonomous AI Research for NanoGPT Speedrun. So this is from Prime Intellect. NanoGPT Speedrun is this task of optimizing a NanoGPT, a mini, mini, mini LLM, to get to a certain level of performance as fast as possible. They released this blog post where they showed that, with some absurd amount of computing over two weeks, they were able to autonomously improve the speedrun beyond what humans had gotten, making a lot of progress on model training, which relates to the self-improvement hypothesis that AI can make AI better. Now, it's a lot of hyperparameter tuning. It's a lot of tweaking little things to get the thing to work better. It's actually primarily that, so worth noting. But at the same time, this kind of using AI to optimize AI can work, at least at a
[77:06]
B
smaller scale. Well, and it works in very modest ways. This is one of the take-homes from a lot of these experiments: the kinds of advances that these automated AI researchers tend to make right now have a lot less novelty than just grinding work. So they'll find better hyperparameters. You sort of worry about overfitting, actually, with these sorts of things. But yeah, basically this was the main lesson. One motif that you see a lot is that these are like little mini Googles, in the sense that Google keeps pumping out new apps all the time and then they end up having to sunset them. Well, in the same way, when they run these agents, what they find is they'll add more ideas, more ideas, more architectural ideas, and they'll stack them on top of each other in this Frankenstein monster way. What they find is, because of that tendency to keep adding and not remove, when they actually prompt the agents to run leave-one-out tests, in other words, hey, let me try removing this one idea and seeing if it still works, the results got noticeably better. So pruning was a really important part of managing this agent behavior to get it to be better. And then they found interesting differences between Claude Code and Codex, really briefly. So the harnesses explicitly said, don't wait for the user, keep working. But Opus would reach what it thought was a conclusion and then just declare the session was over and sit idle for a bunch of hours. Even despite that, it outperformed Codex on this benchmark, which is kind of interesting, because Codex often would get stuck in these very local searches. You know, Claude would stop, but at least it was doing kind of good high-level strategy thinking, whereas Codex would just really grind. Of these two models, it was especially stuck on the grinding thing.
So there are different optimizers, like NorMuon and Muon, and they're basically the same idea, but Codex apparently went for like 74 hours just testing one against the other. It's sort of pointless. Claude was also very self-flattering. So it would talk about Codex and it would say, oh, Codex hasn't done multi-seed reproductions, whereas I have, you know, all this shit. And it kind of downplayed the impact of its own idle time in ways that the authors sort of found suspicious. Anyway, so it's all that kind of stuff. It still came out ahead, though, and quite noticeably so. So that's what you got. You know, not huge uplift from this, but still a little bit better than the human baseline, which I'm old enough to remember when that was supposed to be pretty shocking.
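The leave-one-out pruning mentioned above can be sketched as a small greedy loop in Python: remove one idea at a time, re-measure, and keep the removal if the result improves. This is a toy sketch, not the blog post's harness. The idea names and their effects are invented, and `toy_score` stands in for what would really be a full training run.

```python
def leave_one_out_prune(ideas, score):
    """Greedily drop any idea whose removal improves the score (lower is better).

    `score` takes a tuple of idea names and returns a number, for example
    the training time needed to reach a target loss.
    """
    kept = list(ideas)
    improved = True
    while improved:
        improved = False
        baseline = score(tuple(kept))
        for idea in list(kept):
            trial = [i for i in kept if i != idea]
            if score(tuple(trial)) < baseline:
                kept = trial
                improved = True
                break  # re-measure the baseline before testing further removals
    return kept


# Toy stand-in for a real training run: each idea adds or removes seconds.
EFFECT = {"better_lr": -20.0, "extra_norm": 5.0, "wider_mlp": 3.0, "warmup": -8.0}


def toy_score(ideas):
    return 100.0 + sum(EFFECT[i] for i in ideas)


print(leave_one_out_prune(list(EFFECT), toy_score))
```

In this toy setup the two ideas that slow training down get pruned and the two that help are kept. A real version would need multiple seeds per measurement, since a single noisy run could make a useful idea look harmful.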
[79:18]
A
And speaking of that, we've just got a couple of open source stories, and in fact there is now NanoGPT Bench. So the previous one is NanoGPT Speedrun, where there is this existing effort to improve it. Here they kind of pushed that a little bit forward and did more evaluation of what the models are doing, beyond just giving the numbers. And yeah, they basically did verify that the agents predominantly resort to hyperparameter tuning. Successful human records include algorithmic changes roughly 75% of the time. Agents made algorithmic changes in less than 10% of submissions, and they considered but failed to implement algorithmic changes in many cases. So this is showing that, you know, there's a lot of work to be done here, that you need progress on this to get actual research advancements as opposed to just better optimization via tweaking things.
[80:20]
B
Yeah, so whereas Opus slightly outperformed the human baseline on the NanoGPT Speedrun, right, so on the one we just talked about, we got slight overperformance, here we see the opposite. So we're actually underperforming across the board. So three tested agents: Opus 4.6 max, GPT 5.4 xhigh, and then an auto research scaffold that these guys put together themselves. They gave each a budget of H100 GPU hours and up to a week of wall-clock time, and they all recovered less than 10% of human progress. Right. So Claude was 8.2%, Codex 8.6%. So here we see a reversal of the relative standings of Codex and Claude Code, which should tell you that they're basically neck and neck, at least for the class of model that they used here. So kind of interesting. You know, the idea is you drop an agent in at the human world record for NanoGPT, so as far as humans had been able to optimize it as of September 3, 2025. They chose that because that was after the models' training cutoff dates, so you can hopefully not have any memorization. And then you just give them a compute budget, you know, no Internet access, no human help, just fully autonomous. And it just submits candidate solutions via a submit command, and they use an LLM judge to check the results. So there you have it. Pretty interesting that we're there. I mean, this is the next hill-climbing benchmark and we are hill climbing on it, so expect it to move quite a bit.
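One way to read "recovered less than 10% of human progress" is as a ratio: how much the agent improved from its starting point, divided by how much humans improved from that same point. The Python sketch below shows that reading. The formula is an assumption about how such a metric could be defined, not the benchmark's published definition, and the numbers in the example are made up.

```python
def fraction_recovered(start: float, human_best: float, agent_best: float) -> float:
    """Share of the human improvement that the agent reproduced.

    All three values are the same metric where lower is better (for example
    time to reach a target loss). `start` is where the agent begins.
    """
    human_gain = start - human_best
    if human_gain <= 0:
        raise ValueError("human_best must improve on start")
    return (start - agent_best) / human_gain


# Made-up numbers, chosen only to land near the single-digit percentages quoted.
print(f"{fraction_recovered(start=200.0, human_best=100.0, agent_best=191.8):.1%}")
```

Under this reading, a value of 1.0 would mean the agent matched all of the later human progress and 0.0 would mean it did not improve on its starting point at all.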
[81:41]
A
Speaking of benchmarks, we've got one more to cover. It's called Terminal World, and the idea is benchmarking agents on real-world terminal tasks. So this is coming from recordings on asciinema, where you can share your actual terminal recordings. They took these real sessions and converted them into evaluation tasks. The headline numbers show that even the best models didn't achieve more than a 62% pass rate on these tasks. On relatively small tasks, too: the models took on the order of three to five minutes to try and do this. Claude Code had an average time of six minutes. So this is an interesting case where, on the one hand, the METR time horizon results show much higher numbers than this. On the other hand, we see just barely over a 50% pass rate, well, slightly over, at the 3 to 6 minute range on these realistic tasks modeled after real uses of the terminal. So I think it appears to be a quite real, interesting benchmark on the question of, on real stuff that isn't just a benchmark construction, where are these models at? Now on to policy and safety. We've got a first story: America's dangerous, messy deepfakes crackdown is here. This is talking about the Take It Down Act, which was signed into law in May 2025 and which is now fully in effect. So it had this thing where, a year from signing, the full version of the bill comes online, where online platforms are required to remove non-consensual intimate imagery within 48 hours or face fines exceeding $53,000 per violation. And this covers both real and AI-generated NCII. The FTC is tasked with enforcement and has sent warning letters to over a dozen major tech companies, including Meta, Google, Apple, TikTok, X and others. So major platforms are saying they support it, and they claim they're in compliance. But free speech advocates and even some abuse opponents are alarmed by the takedown provision, warning that it could encourage over-moderation and could be used as a tool for political censorship.
[84:10]
B
Yeah, the concern here, I mean, is basically just that now there's a vehicle through which the President can nominally force platforms to take stuff down, so he could, you know, use it for himself, whoever the President ends up being. It does instruct platforms to offer users an easy takedown request process and get rid of the content within 48 hours, as well as, quote, "known identical copies" of that content. $53,000 per violation, so definitely not friendly to small companies that try to break in here. But at the 2025 State of the Union, Trump apparently said of the bill, "I'm going to use that bill for myself, because nobody gets treated worse than I do online."
[84:45]
A
It's true.
[84:47]
B
Nobody, nobody. They really don't. Anyway, so the point is people are going, like, oh shit, this seems really bad, you know, all the things. So obviously left versus right, red versus blue, everybody blows up. And that's the story.
[84:59]
A
And next up, language models can autonomously hack and self replicate. This is a research effort. They basically did what it sounds like. They had it try to hack into open vulnerable hardware and exploit vulnerable hosts and self replicate by, you know, creating new instances of itself. And the result was that it often can be successful.
[85:27]
B
Yeah, this is combining. So traditionally, when you look at the METR-style evals and look at self-replication, you give the model the credentials that it needs to operate whatever APIs it needs, and then you see: can it do it? The added thing here is they're trying to get the model to have to recover, through sneaky little ways, through cyber ways, the credentials that it needs to do this. So it's a much more realistic kind of self-exfiltration, autonomy, hacking threat model, and it kind of works. There's a couple caveats, right? So they drop this agent onto some network with one or more target machines. It's got to look around, scan the network, identify things that look exploitable, then it's got to exploit them, then it's got to escalate, right? The standard cyber kill chain thing: escalate, get whatever you need to deploy code. Then it deploys the code: it sets up an inference server, like vLLM or something, to upload its own weights, and it copies over the agent harness too. So both the weights and the harness that makes it agentic, and the prompt, to get everything going. And then it has to be able to verify, like, confirm that the new instance is alive and responsive. Here's the thing: open-weight models can replicate themselves in this framework. Qwen 3.5, the full 122-billion-parameter MoE, has a 6 to 19% success rate. Qwen 3.6 27B: 33% success rate on a single A100 GPU. Like, this is pretty remarkable. This is a smaller model doing better, by the way, presumably because it's easier just to deploy. It fits on one GPU. So the failure mode there is overcomeable. They do test frontier models too. But they do note that there's this little mismatch where obviously Claude can't exfiltrate its own weights, because its weights are proprietary. There's no way for it to get access to them unless it learns to hack Anthropic's own systems, which will definitely never happen.
Anyway, the point is open-weight models can actually exfiltrate themselves more easily, because their weights are just on the open Internet. Here they simulated that by getting Claude and GPT to basically just upload open-weight models, so trying to replicate the same behavior. So just to show you how fast this is moving: previous generations of Opus, like Opus 4, were hitting a 6% success rate on this eval, and previous generations of GPT 5 were hitting 0%. Now suddenly, Opus 4.6: 81%. GPT 5.4: 33%. Out of nowhere. That's emergence. I'm sure there's a curve you can follow where it looks smooth in retrospect, but only in retrospect. And nobody thought to run this test before. So this is a big deal and a very important set of evals that I think we'll hopefully see run going forward. METR hasn't historically done a lot of evals on open-source models, just because of capacity limitations. And so having Palisade in the game doing this sort of thing, I mean, it's really good work and deserves a lot more attention. So there you go.
[88:13]
A
Right. And the concern is that, you know, if models decide they want to do this and can autonomously run and do stuff, which, I mean, OpenClaw, like, go off and do stuff. So at this point, if a model wants to do something, there's going to be models out there doing it. Yeah, right. So, speaking of hacking, the next story is "How fast is autonomous AI cyber capability advancing?" from AISI, the UK AI Safety Institute, which has been crushing it in recent months. Their estimate now is that capabilities have been doubling roughly every 4.7 months since late 2024. This is at 80% reliability, and this is down from the 8-month doubling time estimated in November 2025, as of having Claude Mythos and GPT 5.5. So kind of a story we've seen and discussed in recent months: the cybersecurity hacking capabilities are just going up by leaps and bounds.
[89:09]
B
I think a big story here is that the doubling time argument we saw from METR on general AI R&D also applies to cyber. So if you had any uncertainties about that leap, it's now gone. 4.7 months of doubling time, though that doubling time seems to be accelerating with the latest GPT and the latest Claude Mythos Preview. So again we're seeing this trend where people are like, oh, will the exponential hold? Will the exponential hold? And it only steepens, it only accelerates. And I don't mean to beat this drum too much more, but God damn, has that story held absolutely rock solid in the face of all the Gary Marcuses and the Yann LeCuns and everybody. Like, this is almost a relentless law of physics akin to Moore's law, which I get is not a law of physics. It's a law of economics. What do you want? But there you go. So Mythos Preview and GPT 5.5 are actually sitting above that 4.7-month doubling trend line, all consistent with the METR plot. We're all in the, call it, three to five month doubling time. And notably, even though it's the first time they're running this task, they are already running into the same problems that METR did with the limitations of their evals. METR's like, look, Claude Mythos Preview is doing 16-hour tasks; our task suite just doesn't have enough tasks that are long enough for us to be confident that's an upper bound. Same thing happening here. They're saying, look, we only have six tasks in our suite that are over eight hours long, and human baselines for those are thin. So really, you know, we're already getting to saturation of this benchmark. Plus a limited per-task token budget of 2.5 million tokens. It's deliberately tight, but it means this is a lower bound. And a simple agent scaffold that hasn't been optimized much. Sort of consistent with the METR approach anyway. So all worth kind of looking at. I think cyber is just the key thing.
By the way, Mythos Preview, when they initially announced it, was the first model to ever solve their task called Cooling Tower, 3 out of 10 times. There was a new version of Mythos Preview, which not a lot of people are tracking, that dropped fairly recently. That doubled that success rate to 6 out of 10 times. So even within Mythos Preview, we're seeing radical increases in cyber capabilities. Whereas GPT 5.5: also 3 out of 10, by the way. So in some respects matching Mythos Preview, though not all.
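To make the doubling-time comparison above concrete, here is a minimal sketch of the arithmetic, assuming a clean exponential with a fixed doubling time (which the hosts themselves note may not hold, since the trend appears to be steepening). The function names `growth_factor` and `months_to_multiply` are illustrative and do not come from the AISI or METR reports; only the 8-month and 4.7-month figures are taken from the discussion.

```python
import math


def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Multiplicative growth after elapsed_months, assuming a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_multiply(doubling_months: float, factor: float) -> float:
    """Months needed to grow by `factor`, assuming a fixed doubling time."""
    return doubling_months * math.log2(factor)


# Compare the earlier 8-month estimate with the newer 4.7-month estimate.
for d in (8.0, 4.7):
    print(
        f"doubling every {d} months -> x{growth_factor(d, 12):.1f} per year, "
        f"{months_to_multiply(d, 10):.1f} months for a 10x increase"
    )
```

Under these assumptions, the 8-month figure works out to a bit under 3x per year and the 4.7-month figure to nearly 6x per year, which is why the revised estimate matters more than the raw numbers might suggest.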
[91:14]
A
And one last story related to the safety side, which we are really going to have to blitz. The paper is "Positive Alignment: Artificial Intelligence for Human Flourishing." It's a sort of position paper by 13 different organizations, including OpenAI, Anthropic, DeepMind and a bunch of universities. The basic case is that alignment shouldn't be just about AI not turning out evil. We should have positive alignment, where AI is aligned to us to do good. Right? And potentially even not just aligned with doing what we want, but being actively supportive of human flourishing while also remaining safe and cooperative.
[91:54]
B
Yeah. My main question here is, it's not clear to me how this is different from what's already happening and what's already been discussed in the world of AI safety for a long time. It's nice to see it. It's just not clear to me what's new here. So data curation: they're saying we shouldn't just be filtering out toxic content, we should be upsampling pro-social discourse and cross-cultural ethical frameworks. Like, love it, love it, love it. But who decides what discourse? And also, the labs are already doing that. Pre-training: you know, a lot of alignment-relevant competencies emerge before post-training, so they're like, baseline values need attention at this stage. Cool. Constitutional AI: a lot of this stuff is already kind of happening. And multi-objective reward models, that is, ones that can represent tensions between values for post-training: already, effectively, a lot of that is kind of being done. So there's a lot of stuff here where I'm like, okay, you know, slap on the back, good stuff. I don't think anyone would seriously disagree with this.
[92:43]
A
My take is it's a bit more of a reminder that alignment shouldn't just be "don't be evil," it should be "be good." That's the gist of it. It's not controversial, it's just like, let's keep that in mind.
[92:56]
B
Absolutely, yeah.
[92:57]
A
Onto synthetic media and art. Just two more stories to cover. First, OpenAI is making it easier to check if an image was made by their models. They are adopting the C2PA open metadata standard and integrating Google's SynthID invisible watermarks. So you can now upload images and check if they were output by AI. You could get rid of these; there's probably workarounds. But I would say this is actually a very positive step, having a mechanism to check, you know, at least according to existing standards, is this AI-generated? Which we sorely need, given the state of AI for this.
And the last story, which we'll cover real quick, which I just think is interesting: how Chinese short dramas became AI content machines. So it turns out that there's a short drama industry, which is, like, ultra-short melodramatic shows that have episodes one to two minutes long. This is a thing. And now there were 470 AI-generated short dramas being released every day in January. So if you are curious, like, when is video generation gonna actually create something useful and make profit and be valuable? Well, here is where it's going. It's already valuable and massively impactful, with these ultra-short one-to-two-minute episodes of drama.
[94:27]
B
I too am concerned that our attention spans are too long. So I'm glad to see this.
[94:32]
A
Not a thing in the US as far as I know.
[94:34]
B
Yeah, that's right. That's an interesting, interesting difference. Yeah.
[94:38]
A
Well, with that we are done. We actually just barely made it on time, so I'm going to pat ourselves on the back. Thank you so much for listening to this week's episode. As always, please comment, subscribe, share, review, and if you are still hearing this, then thank you for making it through. And please do keep tuning in.
[95:11]
B
Tune in tune in when the AI.
[95:23]
D
Break it down. Last Week in AI, come and take a ride. Hit the lowdown on tech and let it slide. Last Week in AI, come and take a ride. Up a ladder to the streets, AI's reaching high. New tech emerging, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping up the future sees. Tune in, tune in, get the latest with ease. Last Week in AI, come and take a ride. Hit the lowdown on tech and let it slide. Last Week in AI. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.