Summary (9 min read)

Last Week in AI – Episode #242

Podcast: Last Week in AI
Date: April 29, 2026
Hosts: Andrei Karenov (A), Jeremy Harris (B)
Episode Focus: ChatGPT Images 2.0, Qwen 3.6 Max, Kimi-K2.6, and a roundup of current AI news.

Overview

This episode delves into several significant recent developments in AI, centering on OpenAI’s release of ChatGPT Images 2.0 (with a leap in text and GUI image generation), major new Chinese model releases (Qwen 3.6 Max, Kimi-K2.6, Minimax M2.7), and business stories such as the SpaceX–Cursor partnership, Cerebras’s IPO, and deep ecosystem shakeups in talent and hardware. The hosts maintain their trademark mix of technical nuance, skepticism, and humor, providing context for the industry’s rapid changes.

Key Discussion Points & Insights

1. Community Feedback & Meta-Discussion

Listener Reviews: Shoutouts to listeners, including one who uses the podcast as a workout soundtrack, and acknowledgement of both positive and critical reviews.
Clarification on Treaties and AI Governance:
- B (06:04): "We have never reduced the number of nuclear weapons below the point where the leading superpowers could destroy the world many times over... treaties are only as good as the kind of incentive bedrock that they rest on."
- A (07:06): "I think a similar thing could be said of AI ... China and the US will continue creating frontier models. If they do any limit at all, it'll be a sort of symbolic-ish limit."

2. Tools & Apps

ChatGPT Images 2.0

Major Leap in Multimodal Capabilities
- A (09:17): OpenAI’s new image model excels at generating precise text and even full GUI screenshots, marking a leap beyond previous diffusion models. Text rendering is especially strong, and the model’s ability to create accurate screenshots (even including valid SVG code) is highlighted as indicative of a significant new capability trajectory.
- Notable Quote:
  - A (09:17): “It is a little bit mind-blowing, I gotta say... it can do entire screenshots ... every single thing looks correct.”
- Likely uses a transformer-based architecture more similar to LLMs, not diffusion.
- Visual style diversity improved; can mimic photorealism, candid snapshots, comics, fashion, etc.
Technical Speculation
- B (11:41): “The model has thinking capabilities... can generate multiple images from one prompt, double check its creations... you actually get workable code inside an image, means there’s some sort of chain of thought...”
- OpenAI disclosed no technical details; hints of “reasoning at generation time” suggest that this is what enables the new features.

Qwen 3.6 Max Preview (Alibaba)

Not open source (unlike previous Qwen releases), only via API.
Competitive among non-frontier models, benchmarks on par with Claude 4.5 Opus and GLM 5.1; a “daily driver” but not quite at Anthropic or OpenAI’s frontier.
A (16:57): “Qwen definitely has made waves… now they’re cashing that in. The free tier built this whole network effect and now Max Preview... is really going to monetize it.”

Google Deep Research & Deep Research Max

Gemini 3.1 Pro–based agents for intensive research/analysis.
Emphasizes “test time compute”: longer runtimes (up to an hour), but with radically improved search, reasoning, and sourcing.
Benchmarks: Outperforming other LLMs on research- and search-intensive tasks.
- B (20:32): “Across the board you do see Deep Research outperform everybody else... sizable lead on browse comp.”
Two versions: regular (faster) and Max (asynchronous, takes longer, much better output).

Other Platform Notes

Mozilla–Mythos Collaboration:
- Anthropic’s Mythos found and fixed 270 bugs in Firefox. CTO Bobby Holley declared this a “transitory moment” for sweeping latent vulnerability detection and repair.
- A (24:41): “Now all software will need to go through a one-time overhaul to surface and fix latent vulnerabilities.”
Starbucks & ChatGPT Integration:
- Early review: The experience is clunky, far less smooth than the native Starbucks app. May hint at longer-term “everything app” ambitions but not there yet.

3. Business, Applications, and Talent

SpaceX–Cursor Partnership ($60B Option)

SpaceX working with Cursor on coding models, with a $10B collaboration fee and an option to acquire for $60B.
Aimed at boosting SpaceX/XAI’s flagging coding model performance; XAI has lost founding team members and is under talent pressure.
A (32:46): “Besides the expertise and the data that Cursor has, XAI doesn’t have the talent to do this right.”

Cerebras IPO

AI chipmaker aiming to take on Nvidia, going public mid-May at a $23B valuation.
Customer concentration risk: Deals with OpenAI and AWS; but what if those partners in-house their inference chips?
Financials are “messy” (GAAP net income vs. non-GAAP, reliance on one-time items).
B (35:02): “If your two biggest customers are OpenAI and AWS, you’re basically their subcontractor... not an independent platform yet."

Venture Capital Momentum for New AI Paradigms

Flapping Airplanes ($180M): Bio-inspired AI
Core Automation (Ex-OpenAI’s Jerry Tworek): Seeking $500M–$1B for data-efficient, continuously learning models.
- B (39:27): “You're basically looking at one giant Geoffrey Hinton, Ilya Sutskever type leap when you are not Geoffrey Hinton or Ilya Sutskever…”
Recursive Superintelligence ($500M): Self-improving AI models.
The market is hungry for outside bets as “scaling” with classic LLMs becomes crowded and capital-intensive.

Anthropic & Amazon: $5B Investment / $100B Cloud Spend Deal

Reciprocal deal locking Anthropic into huge Amazon/AWS/Trainium spend.
Reflects massive cloud lock-in play and attempts by cloud vendors to own the hardware layer of AI.
B (42:26): “Every Frontier Lab wants a lot of different hardware suppliers... [They’re] commoditizing their complement at the hardware layer.”

AI Talent Wars & Layoffs

OpenAI: Notable departures as focus narrows to ChatGPT and Code; former science and “store” leads have left.
Meta: Hired five Thinking Machines Lab founders, including a $1.5B engineer (!). Planning 8,000 more layoffs this year.
- B (47:28): “You tend to see a large number of co-founders… We're seeing multiple companies like Anthropic, XAI...the 10,000x ML researcher...getting poached for $1.3 billion.”
Mandatory use of employees’ mouse/keystroke data at Meta for AI training — fueling anxiety about AI replacing workers.

4. Hardware: Chips, Fabs, and Quantum

China’s Chipmaker Overproduction:
- Chinese fabs, cut off from US/EU chips, have massively overexpanded, causing price wars and setting up for market consolidation (and likely government intervention).
- B (51:50): “Catastrophic overproduction problem ... The only way to win business is to go cheaper.”
Google’s Chip Developments:
- New “Memory Processing Unit” and TPU variants to accelerate inference (Zebrafish) and training (Sunfish). Energy efficiency is key differentiator for US/EU data centers.
- Most “best” models (Claude, Gemini) already trained on TPUs.
Quantum: Xanadu’s Stock Soars on Nvidia’s Open Source Models
- Optical quantum computing company jumps in value due to sector-wide rally—not direct product news.

5. Projects & Open Source

Moonshot AI’s Kimi-K2.6 (1 Trillion Parameter Mixture-of-Experts)

Massive new model, very sparse (384 experts, 8 active per token). Natively uses INT4 quantization, memory efficient, optimized for practical inference.
On benchmarks: On par with GPT-5.4; “especially impressive” in tests.
B (61:34): “Moonshot was trying to optimize for practical inference from the very beginning.”
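
The sparsity and quantization figures above can be turned into rough back-of-the-envelope numbers. The sketch below is only illustrative: it takes the 1 trillion total parameters, 384 experts, and 8 active experts from the summary, and assumes (this is not stated in the episode) that every weight is stored at the given bit width, ignoring quantization scales, non-expert layers, and activation memory. The function names are hypothetical.

```python
# Figures quoted in the summary for Kimi-K2.6.
TOTAL_PARAMS = 1_000_000_000_000  # "1 trillion parameter" model
NUM_EXPERTS = 384                 # experts in the mixture
ACTIVE_EXPERTS = 8                # experts routed to per token


def expert_activation_fraction(active: int, total: int) -> float:
    """Fraction of experts that run for any single token."""
    return active / total


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), assuming uniform precision."""
    return num_params * bits_per_param / 8 / 1e9


frac = expert_activation_fraction(ACTIVE_EXPERTS, NUM_EXPERTS)
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # assumption: all weights in INT4
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)   # same weights at 16 bits, for comparison

print(f"experts active per token: {frac:.2%}")
print(f"INT4 weights: {int4_gb:.0f} GB vs 16-bit weights: {bf16_gb:.0f} GB")
```

Under these assumptions only about 2% of the experts fire per token, and 4-bit weights need a quarter of the storage of 16-bit weights, which is consistent with the "optimized for practical inference" framing in the quote.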

Minimax M2.7 Open Source

Slightly less performant than Kimi, focused on self-evolving agentic workflows, supports automated capabilities for LLM post-training optimization.

ML Intern by Hugging Face

Open source AI agent automating post-training for LLMs – can perform literature review, code modification, dataset tuning with experimental efficiency boosts.

6. Policy & Safety

Data Poisoning via Training Set Influence (Infusion Shaping)

New research on “infusion shaping” demonstrates that models can be steered by micro-perturbing influential training documents, enabling stealthy adversarial attacks.
B (67:10): "Mathematically ... these changes... are not obvious. Seemingly random, small perturbations...but result in specifically the change you're after."

Anthropic's Mythos: Policy Weirdness & Security Breaches

NSA reportedly using Mythos despite DOD blacklist (incongruous positions within US government).
Mythos API also reportedly accessed by an unauthorized Discord group through a third-party provider; shows Anthropic’s rapid scaling brings new security headaches.
A (74:40): “If [NSA] don’t have access to Mythos, that is a major blunder.”
B (76:01): “We are asking Frontier labs to suddenly become load bearing elements in the national security apparatus.”

7. Research & Advancements

Parquet: Scaling Laws for Looped Language Models

New looped transformer model passes activations through layers multiple times, improving memory/computation efficiency.
- “Prelude,” “loop,” “decoding” blocks; ongoing direct injection prevents drift/fading of long-horizon information.
Scaling laws show gains over prior looped models but only proven at small scale so far.
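
The prelude / loop / decoding structure described above can be sketched in a few lines. This is a toy illustration, not the paper's architecture: each "block" is a random linear map followed by tanh standing in for a stack of transformer layers, and the injection is assumed to be a simple addition of the prelude output at every iteration. All function names are hypothetical.

```python
import math
import random


def make_layer(dim, rng):
    """Random dim x dim weight matrix (toy stand-in for a transformer block)."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Apply one toy block: linear map followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def looped_forward(x, prelude, loop, decoder, num_loops):
    """Prelude once, the same loop block num_loops times, then the decoder."""
    e = apply_layer(prelude, x)   # prelude: encode the input a single time
    h = [0.0] * len(e)            # recurrent state starts at zero
    for _ in range(num_loops):
        # Input injection (assumed additive): the prelude output is re-added
        # on every pass so early information is not lost as depth grows.
        h = apply_layer(loop, [hi + ei for hi, ei in zip(h, e)])
    return apply_layer(decoder, h)  # decoding block: map state to the output


rng = random.Random(0)
dim = 4
prelude, loop, decoder = (make_layer(dim, rng) for _ in range(3))
x = [1.0, -0.5, 0.25, 0.0]

out_2 = looped_forward(x, prelude, loop, decoder, num_loops=2)
out_8 = looped_forward(x, prelude, loop, decoder, num_loops=8)
print(out_2)
print(out_8)
```

The point of the design is visible in the signature: `num_loops` changes the effective depth (and compute) without adding any parameters, since the same `loop` weights are reused on every pass.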

OccuBench: Agent Evaluation Simulator

Chinese-origin benchmarking environment to simulate actual workplace scenarios across 65 domains/10 industries – synthetic, broad benchmarks vs. OpenAI’s more curated GDPVal suite.

8. Synthetic Media & Art

Deezer: AI–Generated Music Spike
- Now 44% of daily new uploads are AI generated; only 1–3% of streams, most flagged as fraudulent.
- B (88:25): “Once we see a shift...to the same order of magnitude, that might tell you something about hey, AI music is absolutely here and competitive.”
AI Deepfake Takedown Tools for Celebrities on YouTube
- New ability to request removal of AI-generated deepfake videos provided ID/selfie; doesn't apply to all cases (satire/parody protected).
- Raises complex free speech/personal rights issues that may reach the courts.

Notable Quotes & Moments

A (09:17): “It is a little bit mind-blowing... full-on screenshots of desktops with GUI applications where every single thing looks correct.”
B (11:41): “You actually get workable code inside an image, means there’s some sort of... chain of thought... probably latent space reasoning happening.”
A (24:41): “Mythos Preview has changed things dramatically... a transitory moment where all software will need to go through a one-time overhaul.”
B (35:02): “If your two biggest customers are OpenAI and AWS, you’re basically their subcontractor...”
B (67:10): “...these changes in those documents are not obvious... random, small perturbations... result in specifically the change you’re after.”
B (88:25): “Once we see a shift... to the same order of magnitude, that might tell you something about hey, AI music is absolutely here and competitive.”

Timestamps by Major Segment

| Segment | Start |
|-------------------------------------------------|--------|
| Meta-discussion / Listener Mail | 00:00 |
| ChatGPT Images 2.0 | 09:17 |
| Qwen 3.6 Max & Chinese LLMs | 14:11 |
| Google Deep Research | 19:07 |
| Mozilla / Mythos / Starbucks & ChatGPT | 24:41 |
| SpaceX–Cursor / XAI | 30:16 |
| Cerebras IPO | 33:52 |
| VC for Outlier Labs | 38:03 |
| Anthropic–Amazon Cloud Deal | 41:17 |
| Talent Moves / Meta Layoffs / Employee Tracking | 44:50 |
| China’s Fab Wars / Hardware | 51:24 |
| Kimi-K2.6 / Minimax M2.7 / ML Intern | 59:53 |
| Infusion Shaping (Data Poisoning) | 67:10 |
| Mythos Usage/Leaks (NSA & Disclosures) | 70:07 |
| Parquet & Scaling Laws (Looped Models) | 77:02 |
| OccuBench / Agentic Eval | 83:13 |
| AI in Music / Deepfakes & Policy | 86:42 |

Tone & Style

Language: Analytical, sometimes irreverent; a mix of technical depth, skepticism, and humor.
Notable banter: Jokes about their own review controversies (“Maybe we need a fact-checking crew”), digressions into market metaphors and nuances of compute policy.
Engagement: All major points are contextualized for industry impact and future trends.

Conclusion

This episode provides a comprehensive review of the explosive development pace in AI—spanning tools, business consolidations, open source advances, policy, and cultural impacts. The hosts’ expertise is evident as they clarify hype versus substance and emphasize the deep interconnection between technical leaps, market structures, resource bottlenecks, and even global security issues.

For more details or full stories, subscribe to Last Week in AI and check their newsletter.


Transcript (88 lines)

[00:00]
A
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for stuff we will not be covering in the episode. I am one of your regular hosts, Andrei Karenov. I studied AI in grad school and now work at an AI startup.
[00:36]
B
And I'm your other co-host, Jeremy Harris, from Gladstone AI: AI national security, AI infrastructure, all that good stuff. Yeah, we're going to have to blitz this one, and it's kind of my fault. It's not kind of. It's entirely my fault. We're supposed to put our stories into a shared spreadsheet the day before the episode. And guess who didn't do that, but thought he had. You're getting, I guess, 90 minutes instead
[00:57]
A
of only 90 minutes instead of two hours. I know, I know.
[01:01]
B
It's not enough of our voices. It's never enough.
[01:05]
A
And I think this time around there's not a ton of high-impact stories, but there are a couple. Obviously ChatGPT's new image model will be kind of a highlight for this week. There's also some major new Chinese models that we need to discuss and, similar to last episode, lots of business stories, primarily. It is going to be a bit light on research for people who are fans of deep technical dives. Don't worry, I'm sure we'll have research-heavy episodes in the near term. It just so happens the last couple of episodes are not that.
[01:42]
C
You're listening to this podcast, so I know you've got a curious mind. Here's a helpful fact you might not know yet. Drivers who switch and save with Progressive save over $900 on average. Pop over to progressive.com, answer some questions and you'll get a quick quote with discounts that are easy to come by. In fact, 99% of their auto customers earn at least one discount. Visit progressive.com and see if you can enjoy a little cash back. Progressive Casualty Insurance Company and affiliates. National average 12-month savings of $946 by new customers surveyed who saved with Progressive between June 2024 and May 2025. Potential savings will vary.
[02:20]
A
This episode is brought to you by Outshift, Cisco's Incubation Engineering. Today's AI agents operate in silos, limiting their true potential. We've been focusing on building bigger, smarter models. But scaling up is just one approach, and we actually have a blueprint from 70,000 years ago. Humans didn't just get smarter individually. The cognitive revolution transformed society because we began sharing knowledge, goals and innovation. And agents are now at the same inflection point. They can connect, but they can't think together. And that's why Outshift by Cisco is building the Internet of Cognition, transforming AI from isolated systems into orchestrated superintelligence. By creating an open, interoperable infrastructure, Outshift is enabling agents and humans to share intent, context and reasoning. The cognitive evolution for agents is here. Explore the Internet of Cognition at outshift.com. That's outshift.com.
Today's episode is sponsored by Box. Enterprises are keen to adopt AI, but enterprise AI only works when it has the right business context, and Box is the leading intelligent content management platform for the AI era, acting as the secure, essential context layer for Box's AI agents to access the unique institutional knowledge that makes the company run. Your business isn't the sum of all Internet knowledge. Your business lives in your content, and Box can connect that content with people, AI agents and apps that can unlock the value from their information, all while having the security and governance capabilities that allow you to trust it to be secure. There are many uses for it, and especially interesting is Box Agent, a unified AI experience across your files in Box. So if you're thinking seriously about your company's AI transformation journey, think beyond the model. Your business lives in your content, and Box helps you bring that content securely into the AI era. Learn more at box.com/ai.
Before we get to the news, I do want to call out some new Apple Podcasts reviews, which are always fun to see. It was interesting to see one of the reviews saying that this is a go-to workout soundtrack, which is not what I would pick for working out, but I suppose we do release.
[04:43]
B
Sorry, sorry, got my timing around there.
[04:45]
A
Maybe it's a, you know, good motivator to work hard to keep up with AI, given how fast the space moves.
[04:53]
B
You know what we need to do, Andrei? We got to do the TBPN. We got to get headsets, you know, treat this like a sporting event. It's like, you know, CNN or Fox News when they cover political stuff. Like five years ago it started to look like a freaking football game, and everybody's got the headsets. You know, we should do that and then maybe more people will
[05:12]
A
listen while working out. But aside from the workout thing, this review also goes into a bit more detail and does say that they like the deep technical papers and technical topic rat holes, which is always nice to see. We will keep doing that. I don't think we can manage not to do that. So it's good that people like it. Also got a couple more: "super solid and informative." Love to see that. And we did get a negative review of one star that comments on a recent discussion we had with regard to treaties. And I believe, Jeremy, you said treaties have never reduced the number of nuclear weapons, which is what angered this subscriber and got them to give us one star and to not listen. So apparently some people feel very strongly about this.
[06:04]
B
I'm sorry, Andrei, that's on me. So to be clear, I think I would have said this in the podcast episode.
[06:10]
A
I think you overstated it a little bit, but the implication was fair.
[06:14]
B
So we should get our fact-checking crew. What I believe I would have said, because this is what I believe, is we have never reduced the number of nuclear weapons below the point where the leading superpowers could destroy the world many times over. That's the fact that matters for nuclear non-proliferation, right? So it's not that we've never reduced them. Of course we have: the SALT treaty did result in a reduction in the nuclear arsenals of the great superpowers, but they could still destroy the world many times over. And so I think this was in the context of me trying to make the case that basically treaties are only as good as the kind of incentive bedrock that they rest on. I still hold to that. That's actually kind of a well-established thing in this space. It'll be wrong in nuanced and interesting cases at the edges for sure, as are most general statements. But please push back if you disagree. Always, always interested to chat more about that. I think it's a really important topic. So, you know, it's important to cover all the bases.
[07:07]
A
Yeah, I think you probably just said this as an aside without sort of providing the caveats at the time. I think I got your message as well, that treaties may have reduced nuclear weapons, but they haven't reduced them to the point that nuclear weapons are not something to be worried about at all. Right. Every nuclear country has enough nuclear weapons to still annihilate everyone. So it's more of a symbolic gesture, I suppose, and a commitment to cooperate and hopefully not use those nuclear weapons. And yeah, I think a similar thing could be said of AI actually, that this could be a kind of outcome here where there's some symbolic limitations. China and the US will continue creating frontier models as hard as they can. And if they do any limit at all, it'll be a sort of symbolic-ish limit.
[08:03]
B
Yeah, and that's, that's kind of the core of the point I was trying to make is like, I think I went kind of through all the WMD categories and you can really see the history of these things, right? Chem, bio, radiological, nuclear weapons. In every case, you know, it just turns out you can kill way more people with bullets and chemical explosives than chemical weapons. That's the actual reason, like kills per dollar is the reason that chemical weapons treaties were actually enacted. It had nothing to do with the fact that people actually, just for altruistic reasons, wanted to do the right thing. Governments defect all the time on those treaties. Again, same with nuclear and with bio. These weapons just have a massive kickback effect like we saw with COVID in that case a lab leak. But more broadly, if you have a weaponized bioweapon, unless it's race specific, which again, that could change, you generally just don't see that incentive. So that's why you see people defecting on these things. That's why you see these. For me, that's why when I look at an AI treaty, like you said, I just don't see how it's going to be anything beyond symbolic unless the basic incentives shift in that direction. And that's going to have to involve looking at what the US can do to China, what China can do to the US with and without AI tools, and what that balance of power looks like, which is a whole rabbit hole. Something I'm working on. But anyway, yeah, I think it's important,
[09:18]
A
added some nuance and thanks for the pushback. Now getting to news, starting with tools and apps, and we begin with: ChatGPT's new Images 2.0 model is surprisingly good at generating text. So this just came out and it is a little bit mind-blowing, I gotta say. Per the headline, the big deal about this model is that it is just so good at generating precise text. It can do entire, like, screenshots. I saw an example and this was insane. A typical kind of test for LLMs is generating SVG code to draw different stuff, unicorns or, kind of the most common one I think, drawing a pelican riding a bicycle, which, if you're trying to generate the SVG code for that, is pretty tricky. Apparently the Images 2.0 model created a screenshot that contained the code and it was correct. So this is, like, way beyond anything you've seen. Text has gotten quite good, but to be able to do full-on screenshots of desktops with GUI applications where every single thing looks correct is quite impressive. It's, I think, very clear that this is a result of continuing down the autoregressive path. Instead of using diffusion models, this is using kind of the same transformer token-based architecture as LLMs. And what it looks like is basically the same trajectory that worked for LLMs now working for images, where you just keep making the transformers better and you keep throwing in more data, and here very clearly they trained it on screenshot data somehow. You know, there've been examples I've seen of Adobe software being in images, even, like, StarCraft games, whatever. So some interesting bits here. The technical achievement itself is quite impressive. They highlighted one ranking where you have Elo, which is human preference, and this is just trumping Nano Banana by leaps and bounds. So this is a major leap of a kind we haven't seen since Nano Banana, I don't think. And it indicates OpenAI's ambitions to continue down the road of computer use and general-purpose AI agents.
[11:42]
B
Yeah, and this is an area where OpenAI has a pretty robust advantage over Anthropic, arguably over Google even as well. Historically, just the multimodality side of things, especially on images, has been a differentiator for them. We don't know specifically. So there was a press briefing where they just declined to answer a question about the actual architecture of the model. No shock there, but it would have been nice to get some indication. I think you're right, Andrei, reading between the lines here. They do say that the model has thinking capabilities that allow it to search the web. It can generate multiple images from one prompt, it can double-check its creations. This is all kind of symptoms of the sort of reasoning training that we've seen from the LLM-based models. And so in some way, shape or form there's some kind of equivalent to a, I don't want to say necessarily a scratch pad. It could be happening in latent space. The reason I don't think it's all latent space, the reason I think there actually is something akin to a chain of thought here, even if it's not with words, you know, maybe with pixels or some combination, is specifically that they do say, what do they say? Yeah, it will double-check its creations, right? And making multiple images from one prompt. This is very much akin to the sort of rollout that you see. And all of OpenAI's infrastructure, you gotta remember the infrastructure is the biggest constraint for them, all their infrastructure is geared towards this kind of rollout, right? Actually generating the chain of thought. So at least if I'm OpenAI, naively, what I want to do is try to make my image generation pipeline look as close to my LLM pipeline as possible so that I don't have to manage and maintain two different kinds. Especially since they've indicated to the world they're looking to focus on some core use cases here. This is not going off in a new direction necessarily.
So that'd be my guess, very much aligned with what you said, Andrei. And yeah, I mean, look, every time there's a new image generation model, Jeremy comes out and says, I don't get it. I don't get why; haven't we already solved this problem? This is one of those cases where I'm like, ah, I get it, I get it. It's not just that we've gotten to the point where the text is accurate. As you said, this is the symptom of the reasoning being part of the generation here. The fact that you actually get workable code inside an image means there's some sort of, again, probably chain of thought, but at a minimum, latent space reasoning happening that actually results in very cogent rollouts. So, yeah, I mean, again, it's a shame that we don't have confirmation on what this is, but certainly it does seem consistent with their LLM approach.
[14:12]
A
Yeah, we see very, very little technical detail on this, which is kind of disappointing because if there is some novel stuff going on here, it'd be very interesting to know about that. I think beyond the text, it is worth highlighting that in general this also shows continued improvement of image generation more broadly. One of the limitations with the models that OpenAI had released previously for image generation is that they had a fairly consistent style. Like, you could tell that they're AI, more or less; they had a very kind of smooth sheen over them. And once you've seen enough AI images, you could kind of tell, at least without devoting a lot of prompting effort. Here they have many examples of different styles. It's very good at photorealism and kind of mimicking candid snapshots, you know, gritty old photos from the 90s, disposable cameras, fashion books, all this kind of stuff. They also have examples of comic books and, along with text, obviously, what this can do is produce very clean lines and layout and so on. So as with Nano Banana, now if you see an infographic, if you see a poster, it is quite likely that it could be generated with this model. So I guess the big question is whether Google is going to answer back soon with Nano Banana 2.5 or whatever. Next up we have an LLM release. Alibaba has dropped Qwen 3.6 Max Preview. This is not open source, unlike previous Qwen models; it's only available via their APIs, and as far as non-frontier models go, it is quite impressive. Compared to, let's say, previous Qwen iterations, it is definitely a leap ahead. But as with these models, not really competitive with OpenAI or Anthropic yet.
[16:18]
B
Yeah. So Alibaba also, as part of this, shut down the free tier of Qwen Code just, like, days before this release, which is also what MiniMax did. They rewrote their open source license to block commercial use without authorization. So a bit of a turning point, right? Qwen definitely has made waves. I mean, it's just overtaken Meta's Llama as the most deployed self-hosted model in the world. That's pretty wild. I mean, obviously Meta's Llama is, okay, you know, no longer really kind of in the mix, but it's still remarkable that now we have a Chinese lab doing that, and now they're cashing that in. Right. The free tier built this whole network effect and now Max Preview, this model, is really going to monetize it. That's the play.
[16:58]
A
And just to give some indication of the quality here: the 3.6 Max Preview is, on the benchmarks, about on par with Claude 4.5 Opus and GLM 5.1. So, you know, Claude 4.6 and Claude 4.7 are, at least on the benchmarks, quite a bit ahead of Claude 4.5, but this is definitely a usable model that you could kind of have as your daily driver. As we did with Claude 4.5, what, I don't know, a few months ago, right? We were all using Claude Code with Claude 4.5. So as we've seen for the past year as a whole, and I think interestingly, we have observed a pattern in the West of kind of speeding up in LLM releases, and it feels like that may also be happening with the Chinese model developers now, where it has felt like Qwen has moved faster with more releases. MiniMax, Kimi, all of them are coming out with more stuff, and we'll get to the open source releases as well later this episode.
[18:08]
B
Yeah, always tricky too with these models, especially the open source ones. There's a tendency to benchmax a lot more with them. So you will find, for a given level of performance on your favorite benchmark, like SuperGPQA or GDPVal or whatever, the proprietary models do tend to perform in some cases quite a bit better in ways that are really hard to quantify. And so this nominally, from a benchmark standpoint, looks like it's about five, six months behind, basically, where the frontier is. That's a big deal. But also you probably are going to have to account for an additional delay factor from just the benchmaxing effect. It can be really hard to tell. Certainly this is impressive, there's no two ways about it. But when we're thinking about how many months behind Chinese labs are, it's a murky thing, partly because it's so hard to quantify what makes a good model. That taste is still something that is most refined in the frontier labs, and in some cases by quite a bit.
[19:08]
A
Next up, a release from Google that is a tool: Google has launched Deep Research and Deep Research Max agents. So this is a continuation of the Deep Research products that they've had, you know, for a while; 2024, I think, was when these kinds of Deep Research things initially came to the fore. These are built on Gemini 3.1 Pro. As you might expect, the highlight is basically much better extended test-time compute to reason and iterate. With Deep Research the typical workflow is you ask it a question and then it, like, really takes its time. It goes for like 15, 20 minutes, looks up a bunch of sources, picks through resources and writes a report or analysis or something like that. So with Deep Research and Deep Research Max you have two options. Notably, one thing they highlight is that they now support the Model Context Protocol. You can use it to access your own proprietary data. And with Model Context Protocol you can typically connect to any kind of major tool. So if you have AWS data, if you have some other kind of analytics tool, now this will be able to look through it and do reports on whatever data isn't accessible on the web.
[20:33]
B
Yeah, and this is an area where, if you have to think about areas where OpenAI is really sharp, this is one of them, right? This deep research, doing report writing, report generation. A little bit different from Anthropic. You see it reflected in the benchmark scores, like Opus 4.6 performs a lot less impressively on kind of search-related research than GPT 5.4. So this is unlike coding, where Anthropic, from my experiential standpoint, has quite an advantage and has enjoyed that historically. So here what you really have is Google's competition is GPT 5.4, at least in my mind. That's what I'm doing the side by side on, and you see it pretty comprehensively on these core benchmarks. The three that you can see sort of most highlighted here are DeepSearchQA, basically that's like a comprehensive web research benchmark; Humanity's Last Exam, we've talked about that, it's supposed to be really, really hard reasoning on sort of PhD-level topics; and then BrowseComp, which is to find these very niche, hard-to-find facts that involve combining, you know, resources in different ways. So across the board you do see Deep Research outperform everybody else. It's most narrow on Humanity's Last Exam, basically tied with OpenAI. I mean, realistically within error bars, though we don't see them here. There's no way that we have the resolution there. But there's a sizable lead on BrowseComp. If you look at hard-to-find facts, the runner-up right now actually is Deep Research. So it's actually a previous version of Gemini, but GPT 5.4 is sitting at 59%, versus basically 86% for Deep Research Max. That's a big enough jump that something interesting is happening there.
[22:11]
A
It's sort of interesting too, yeah. The next one down from Deep Research Max is Deep Research, and the non-Max version of Deep Research is comparable to GPT 5.4. The Max version is way better. So I think that Max is really putting in a bit of work. Across the board, the Max version of this is quite a bit stronger. And I think that is in part because they say Deep Research is meant to be a bit faster, a bit less of a wait, whereas Max is an asynchronous workflow. Like, you ask something and you come back in half an hour, or maybe even an hour in this case, and get a result.
[22:53]
B
Yeah, this is exactly it. This is like a symptom, right, of getting the test-time compute thing to really be well leveraged. That's what it seems to suggest. And so as disconnected as this might seem from OpenAI's image generation announcement, there's actually that common through line, where we're kind of cracking, in at least use-case-specific ways, the problem of getting good leverage out of test-time compute in ways that are creating qualitatively different products. Like, it will feel different to use Gemini Deep Research. I haven't used it yet, but I suspect it'll feel different from GPT 5.4 in the same way that OpenAI's image generator will feel different from previous iterations. Something has actually kind of been unlocked here, right?
[23:34]
A
And to that point of being able to utilize test-time compute strongly, I think one thing this makes me realize that I hadn't thought of before is that part of the sort of training for these models that is new, we of course know that they're really optimized to be agentic now, to be able to do tool calls, but the kind of outcome there is that the models themselves, the foundation models, in this case Gemini 3.1 Pro, are just good at test-time compute. So there's this kind of benefit across the board: on the one hand the model is smarter without the test-time compute, but on the other hand it also leverages test-time compute better. So the curve there is different per model.
[24:17]
B
And those often come together, right? In the same way that, famously, when DeepSeek V2 came out, we looked at it and were like, oh, R1 is going to be a big deal. It's a much better base model. Sort of a similar effect here, where you can look at the kind of lower test-time compute variant, and if it does really, really well, which it does, you can kind of assume, okay, there's at least a good chance that we're going to see some pretty impressive stuff.
[24:41]
A
Next, something a bit adjacent to a tool. Mozilla used Anthropic's Mythos to find and fix 270 bugs in Firefox. This is according to the Firefox CTO, Bobby Holley. He said that Mythos Preview has changed things dramatically, and even went as far as to say that this is a transitory moment where all software will need to go through a one-time overhaul to surface and fix latent vulnerabilities, as I think pretty clearly happened to Firefox. Like, they just went through and found all the bugs, or at least a lot of bugs, and fixed them in one big swoop, which is not typically how this works.
[25:24]
B
That is the big question, right? In this regime of compute and test-time compute leverage, is software actually fully securable, or is it going to be a continuing escalatory game? I was talking to somebody I won't mention, but a very senior person at one of the frontier labs who works on security, and his take was the next two years obviously are going to be absolutely insane. But he expects that we'll get to the point where we're writing basically security-perfect software. I don't know. Because ultimately software is physical, and so it has attack surfaces that extend beyond what we traditionally think of as software. And so if you loop in things like blackmailing people and doing all kinds of stuff, I just don't know. Things are fuzzy at the boundaries. But in the conventional understanding, it is possible, because you can make provably secure software, that we will have versions of Mythos in the future that actually do that, which would be insane. Imagine what the world could look like from a trust standpoint if we knew confidently that software was just secure. And imagine the implications, by the way, for national security. You think about all the big pieces of software that are shipped. Assume that the world's leading national security agencies have compromised that shit to blazes, even if you think it's secure. And there are tons of documented cases of this, communications platforms that you may use and think of as secure that have absolutely been cracked. And it's actually just a known thing. So this not only helps defenders, it also prevents national security agencies from getting access to what could be potentially critical national security data. Depending on how you look at it, depending on how you fall on the Edward Snowden side of things, you'll have a view on, you know, what this implies. So it's actually a huge shift, all happening obviously at the same time as quantum. So we could be in a regime where we're, you know, able to decrypt all messages past a certain point in the past. Like, Jesus, the cyber instability that we're in for. Things are going to shift pretty quickly.
[27:15]
A
Yeah, I think this highlights that. This is just Mythos Preview. No doubt in a few months we'll have even better models. So this really is sort of a new world paradigm for software. There's many implications to this. The national security one, I think. The other one that's interesting is, it used to be the case that becoming a competitor, like building something in software, releasing it, letting customers use it, was easy. Like, you didn't have to pay much, you could build a SaaS product or whatever. And the challenge was typically not the software itself in terms of the amount of spend you would have to do. You could build a Facebook competitor or whatever pretty easily. Now, if we are in this anything-can-be-hacked-easily paradigm, and you have to really be a thorough defender to be trustworthy, there might be more of a benefit to being a big player, and it'll be harder. Anyway, there's many things like that that are interesting to think through. And one last story for tools. We've got "Ordering with the Starbucks ChatGPT app was a true coffee nightmare" from The Verge. So this is more so kind of a description of what the integration of Starbucks with ChatGPT looks like. Quite a while ago OpenAI announced this new kind of app thing where you could have embedded UI that you call upon via ChatGPT, and this is the latest example. You can type Starbucks and your order and it will pull up the Starbucks menu. But per this article, it's a pretty clunky experience. You kind of have to go through and manually do a bunch of stuff that you wouldn't have to do in the actual Starbucks app. Presumably this is the first iteration and these things will get better, I think. Still a big question whether the sort of everything app, where you can call upon any software within ChatGPT and you no longer go to Starbucks.com, you just talk to the chatbot and have your software through that interface, could be where we're headed. Right now it's unclear whether that is going to be the future.
And on to applications and business. First, the kind of major news of this week that just broke yesterday: SpaceX is working with Cursor and has an option to buy the startup for $60 billion. So under this deal, SpaceX will either pay Cursor $10 billion for its work and collaboration, or it has the option to buy it for 60 billion later this year. The collaboration is for Cursor to help SpaceX, and this is to say xAI within SpaceX, train better coding models. On the recent benchmarks, you know, people have sometimes included Grok 4.2 and it was quite sad. Grok 4.2 was at the very bottom, not even close. We haven't seen a coding model release. Elon Musk has promised a coding model and that never materialized. So it's an interesting deal, clearly. And the idea that Cursor is worth $60 billion is pretty interesting given that they've been losing market share to Claude Code and so on. They seem to have been in kind of a tough spot for a while. So definitely a win-win in terms of the collaboration. And it will be interesting to see if Cursor does wind up being folded into xAI/SpaceX.
[30:47]
B
Yeah, the big challenge here being that xAI, as you say, they haven't shipped a true frontier model in this space yet, full stop, which is a big problem. Is the Cursor acquisition or partnership a solution to that? Maybe, maybe not. I mean, Cursor doesn't train frontier models, they don't make foundation models.
[31:07]
A
They did fine tune successfully pretty recently, but they don't train foundation models from scratch, which is a very different ballgame.
[31:16]
B
Yeah, absolutely, absolutely. And so there's maybe the hope that the thing that xAI is missing is the fine-tuning element. If that's true, then, hey, this could be a match made in heaven. But if it's not, if pre-training is in any way a load-bearing problem for xAI right now, Cursor is not going to be the solution to that. They need to find some other way of righting that course. So that could actually, you know, that could go either way. This is also not that clean. As Andrei just said, it's like, we'll give you $10 billion or we'll buy you for 60 billion. That's part of the messiness here. The other part of the messiness is that there's also a talent drain. So two of Cursor's top engineers left the company to join X, or, sorry, xAI, where they report directly to Elon.
[31:56]
A
This was last month?
[31:57]
B
Yeah, this was last month, yeah. So. So you've kind of got this like, what is that? I mean, that's a topsy turvy relationship.
[32:04]
A
I mean, I think the clear kind of picture here is xAI has a ton of compute lying around with all these things, the Colossus supercomputer. It feels like a majority of xAI left after the SpaceX deal. We've covered a lot of the initial founders leaving. I think now the initial founding team, maybe everyone left, or like one person is left.
[32:28]
B
Yeah, yeah.
[32:28]
A
And that's like 12 people. That's a big number of people, the initial founding team of xAI. So very clearly they've had a talent drain. And we know it's not just the founders, other people left as well. So besides the expertise and the data that Cursor has, xAI doesn't have the talent to do this right.
[32:47]
B
Yeah. And that actually predated the SpaceX acquisition. Like, this goes back a ways. And so they've just been having trouble keeping people around in general. So, you know, maybe this is part of SpaceX trying to juice the narrative on its IPO pitch as well. We can't forget that that's coming up. I think the story was pretty clean, you know, a year and a half ago, where you're like, if SpaceX IPOs and if it acquires xAI, you've got the makings of, like, xAI with the algorithmic expertise, you've got SpaceX that can provide data centers in space. Ten years from now, this is going to be an unbeatable behemoth. The problem is that the xAI side of that coin has started to look a little bit weak, and the expertise on the algorithmic side seems like it's important enough that it's a genuine traction issue. So if it's a fine-tuning issue, if this is a post-training thing or a mid-training thing, maybe bringing in Cursor here will actually make a big difference. That's totally possible. And it is Elon. I mean, I hesitate to say anyone can do anything, but if anyone can turn a ship around, or a spaceship, it's probably Elon.
[33:52]
A
Next story, AI chip startup Cerebras files for IPO. Cerebras Systems builds these kind of weird chips that are meant specifically for AI inference. We've covered them in the past quite a few times. They build this giant wafer that is very much unlike traditional computing. So they are going to IPO in mid-May. That's the plan. This is after raising a 1.1 billion Series G last year and a Series H in February 2025. So they've been around for a while. They've got quite a bit of fundraising. They have a valuation of $23 billion and they have been making deals to provide their chips. They recently had a deal with AWS, they have a deal with OpenAI. So they are growing in terms of revenue and in terms of deal making. They've taken their time, like they've had this chip for a while and for quite a while they were not deploying at scale. And it seems like they are very much aiming to be a significant competitor to Nvidia.
[35:02]
B
Yeah, of all the IPOs, this one is maybe the most complex to think of in terms of whether it's worth it and what the value picture is. So a little bit on accounting here. They brought in $510 million in revenue in 2025 with a net GAAP income. So basically GAAP is generally accepted accounting principles, the default way of doing your books. Right. So you book all income as income and you book all costs as costs. And I'm not an accountant, but basically it's pretty intuitive. So in that case, on 510 million of top-line revenue, a net income after costs of around 240 million. But if you get rid of the one-time items, basically like maybe weird equity arrangements or asset sales, you get to a net loss of 75 million. So there's a really big delta there in terms of the weird one-time things that are actually making them profitable nominally for 2025. So there's that aspect of it. Normally non-GAAP income looks better than GAAP income, but in this case it doesn't. They're choosing to report numbers that they think are more reflective of the underlying foundation of the business, and those numbers actually are worse than the accounting default. They also have insane amounts of customer concentration. Right. So they have a deal with OpenAI, it's reportedly worth about $10 billion. They've got an agreement with AWS to use Cerebras chips in Amazon data centers, which sounds great, but the problem is if your two biggest customers are OpenAI and AWS, I mean, they're basically like a subcontractor here in the AI arms race. They're not an independent platform yet. Right. It's not like they're serving everyone. They have these huge deals with two players. So the question we have to ask ourselves is what happens if OpenAI just goes out and builds its own chips for inference? Which seems like it's happening. What happens if AWS pushes its own Trainium chips harder?
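The GAAP versus non-GAAP gap described above can be worked through as simple arithmetic. This is a minimal sketch using only the rounded figures quoted in the episode; the size of the one-time items is inferred from those figures and was not stated on the show, and the variable names are purely illustrative.

```python
# Rounded figures as quoted in the episode, in millions of USD.
revenue = 510               # 2025 top-line revenue
gaap_net_income = 240       # GAAP net income (includes one-time items)
non_gaap_net_income = -75   # net loss once one-time items are stripped out

# Inferred, not stated in the episode: the swing attributable to one-time items.
one_time_items = gaap_net_income - non_gaap_net_income

# Net margin under each view of the books.
gaap_margin = gaap_net_income / revenue
non_gaap_margin = non_gaap_net_income / revenue

print(f"Implied one-time items: ${one_time_items}M")
print(f"GAAP net margin: {gaap_margin:.1%}")
print(f"Non-GAAP net margin: {non_gaap_margin:.1%}")
```

On these numbers the one-time items are larger than half of revenue, which is why the headline profit and the underlying loss point in opposite directions.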
Like, there's already kind of the preconditions for a lot of nasty headwinds here, potentially. So another thing to highlight too is they had previously filed to go public. We were talking about them back in 2024 going public, but there was a whole federal review of an investment that G42, which is this Abu Dhabi-based company that we've talked a lot about, was going to make, and ultimately everything was withdrawn. And so they were able to raise all these, you know, like you said, Series G, Series H. We're going to need to do that Microsoft Excel thing where you scroll to the right and eventually you start using two letters. So the fact they had to raise that much in private rounds does mean they're burning a lot of capital to stay in the game. It also means that they have real institutional backing. So again there's a lot of pros and cons in this one. And it's going to be one to watch because, you know, they are bragging that they took the fast inference business at OpenAI away from Nvidia. So their claim at least here is, look, we're not just a potential Nvidia competitor. We are already winning on the most valuable segment, which is inference, with the largest player right now, which is OpenAI. Which again would be a big deal but for the customer concentration piece. So there's a lot of factors one way or the other here.
[38:03]
A
Next, a few stories on fundraising. Flapping Airplanes, a recent startup that is aiming to create more sort of biologically based AI models that learn differently, raised 180 million from a few investors including Sequoia. We have Core Automation, founded by former OpenAI senior researcher Jerry Tworek, that is seeking between 500 million and 1 billion to build models that need 100x less data and can learn continuously. And we've got Recursive Superintelligence raising 500 million for self-teaching AI. Recursive Superintelligence is aiming to, per the title, I guess, create AI that can make better AI. So all in all a surprising amount of appetite for continuing to do big rounds for labs that are promising to do new paradigms of AI. I think at this point OpenAI, Anthropic, Google, these are entrenched players that have frontier models that no one has been able to compete with, really. Aside from the Chinese companies, no one has entered the market for years now, and these are some ambitious startups promising to try to do things in a different way. Whether they succeed will be very interesting to see.
[39:28]
B
Yeah, the strategy here, I mean, so looking at the Core Automation strategy, the claim is basically, look, their founder, who's Jerry Tworek, who I actually hadn't heard of before, he left OpenAI because basically fundamental research was no longer a priority at the company. Which, number one, is pretty clear. We've seen enough signs of that so far, people getting frustrated, leaving on that basis. But at the same time fundamental research is incredibly risky. And the challenge here is the approach that they're talking about is new loss functions, possibly replacing gradient descent altogether. So a couple thoughts here. Maybe this works, but you're basically looking at one giant Geoffrey Hinton, Ilya Sutskever type leap when you are not Geoffrey Hinton or Ilya Sutskever. And that could happen, and it's an outside play and worth backing for some amount of money for sure on that basis. It's also the case that at this point the only companies that would have a chance at being relevant are the ones that are betting that scaling will stop working. Because if you believe scaling will be the main driving force, or at least a main driving force, then you literally are hopeless at competing against companies with trillion-dollar market caps that have already built out tons of their own chips in some cases and fleets of data centers. Going from scratch is just basically impossible at this point, though I'm ready to eat my words when the next company just does it. But yeah, so when I hear something like this, new loss functions, possibly replacing gradient descent altogether, if there was a promising candidate approach, I would expect it to say something more than, like, possibly replacing, maybe, I don't know. That just seems like a significant amount of uncertainty. Doesn't mean it won't happen. Just means there's gotta be a there there, and we'll run the test.
So kind of interesting these companies are coming out again, all these former frontier lab people, to push this forward.
[41:18]
A
Next up, Anthropic is getting $5 billion from Amazon and has pledged $100 billion in cloud spending in return. So I don't know if Anthropic is getting money or promising to spend money plus more. But as we know, Anthropic and Amazon have had a long, close relationship. Amazon was sort of the OG, a major backer of Anthropic. Here they are saying that they'll spend on AWS, using Amazon's Trainium 2 through Trainium 4 AI accelerator chips, which to my knowledge no one else kind of builds deeply on. So Amazon is benefiting a lot from this Anthropic relationship.
[42:00]
B
Yeah, it's a pretty wild ratio too. Like, you know, if you think about $5 billion to $100 billion in the next 10 years, that's a 20 to 1 ratio of spend to investment, which starts to look like kind of a long-term cloud lock-in deal potentially, or an attempt to do that, which is very interesting for both sides. Now obviously Amazon's put, like, what, 10, 13, 15 billion into Anthropic so far. So it's not just this, but still.
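For reference, the ratio quoted here works out as follows. This is a small sketch using only the figures mentioned in the episode; the even spread of spending across the ten years is an assumption made for illustration, not something stated in the deal coverage.

```python
# Figures as quoted in the episode, in billions of USD.
amazon_investment_bn = 5
anthropic_pledged_spend_bn = 100
commitment_years = 10

# Spend pledged per dollar invested.
spend_to_investment_ratio = anthropic_pledged_spend_bn / amazon_investment_bn

# Assumption: spend is spread evenly across the commitment period.
avg_annual_spend_bn = anthropic_pledged_spend_bn / commitment_years

print(f"Spend-to-investment ratio: {spend_to_investment_ratio:.0f} to 1")
print(f"Average annual cloud spend: ${avg_annual_spend_bn:.0f}B per year")
```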
[42:27]
A
And this is 100 billion over the next 10 years.
[42:30]
B
So which is, which is nothing. Which is nothing. It's basically nothing.
[42:35]
A
Yeah, like at this point, 10 years, like we'll have AGI in two years. Nobody knows what will happen in 10 years. So this is like a very long, it's.
[42:47]
B
Yeah, yeah, no, you're right. I mean, this reflects as well Anthropic's just fundamental, irreducible uncertainty. They don't want to be caught in the lurch if they're betting the farm on two years or something and they don't have a plan for the rest. So a big part of this is Amazon's chips, specifically Trainium 2 through Trainium 4. So basically Anthropic is securing the option to buy capacity on future Amazon chips as they become available, including the Trainium line. And that is an attempt to challenge Nvidia, right, in this space. And keep in mind, everyone is trying to do this to everybody else. So there's this famous saying applied to Microsoft back in the day: commoditize your complement. So, you know, laptops back in the day were really expensive and software was cheap. And then Microsoft realized, we can actually make universal software that runs on all hardware. And then basically we can make all the hardware companies compete to host our software and turn the hardware layer super competitive, lower their margins so people can buy computers for super cheap. Once they buy those computers, they need software to put on them, and hey, we can charge them an arm and a leg because we're the only game in town. So commoditizing your complement is a great way to make it easier for your customers to on-ramp into your environment. Now the frontier labs are trying to commoditize their complement at the hardware layer. So every frontier lab wants a lot of different hardware suppliers. They want Nvidia, they want to help competitors to Nvidia, crucially AMD, they want to help Amazon, they want to help Microsoft, all these companies. And then those hardware companies are trying to do the same in reverse. So Nvidia is deliberately trying to seed an ecosystem with lots of frontier labs. And so we're just kind of seeing in real time who's going to end up with the leverage at the macro level.
My guess is the hardware companies ultimately, but maybe some frontier labs, you know, manage to pull it off. It's interesting to see this play out though. And anyway, yeah, this is the same playbook that Amazon is using for other labs too, right? Like, they joined OpenAI's big funding round, gave them $50 billion, and there's a similarly structured deal there too. The idea was cloud infrastructure services instead of straight cash.
[44:51]
A
Next, a few stories about talent and engineers. We've got a couple of people leaving OpenAI. The OpenAI for Science lead Kevin Weil and the Sora creator Bill Peebles both announced their departures from OpenAI on Friday. Not too surprising, as we know that OpenAI is deprioritizing things external to the core ChatGPT and coding business. Alongside that, we have the news that Meta has hired five Thinking Machines Lab founders, including a reported $1.5 billion engineer. So this is co-founder Andrew Tulloch, who got this kind of compensation package. So now we've got a lot of the Thinking Machines original founders having left: five went to Meta, three have returned to OpenAI, and one joined xAI. Not looking great for Thinking Machines. Just a couple more stories. Back to Meta, we have some news that they're planning even more layoffs. They are aiming to lay off 8,000 employees globally later this year. They have also continued cuts. They had a bunch of cuts in January. They have additional cuts in the Reality Labs division, including software engineers. So on the whole, what this is looking like is intense competition for top-tier talent that is very specialized, while Meta and others are planning to continue to shed software engineers and other employees, which is maybe, you know, what we are in for as far as the economy in these coming years.
[46:32]
B
Yeah, absolutely. And it's worth noting this is not Meta's first bite at the apple, at least when it comes to Thinking Machines. So you might remember we talked about this on the show, but Zuck tried to offer a billion dollars to acquire Thinking Machines and was just flat out rejected. And he said, okay, no problem, I'm just going to start recruiting the whole founding team one by one. And that's worked. At the time people were calling it a full-scale raid, and it has worked. So you look at the original founding team, you've got five going to Meta, three returning to OpenAI, one going to xAI. So if you're Mira Murati right now, that's not great, needless to say. You know, they've raised $2 billion at a $12 billion valuation. And that was in July 2025. So that was less than a year ago. God, things happen fast. And they were reportedly in talks for a new round at a $50 billion valuation. You know, let's see. But that seems like a stretch at this point.
[47:28]
A
Not looking good. We also haven't seen Thinking Machines put out much. The product they've put out was a sort of fine-tuning offering where you could post-train on your own data. That was late last year, and we really haven't heard much from them since.
[47:44]
B
Yeah, yeah, exactly. And in the meantime, you know, you do see Meta in particular, I'm not mentioning OpenAI because everybody knows what they've been up to, but Meta, you know, we have talked about Muse Spark, so there is some stuff, and Muse Spark genuinely seems to be like, hey, maybe they're back in the frontier AI game, or at least on that trajectory, I think it's potentially fair to say. So yeah, I mean, anyway, fortunes change. One thing I will say, you tend to see a large number of co-founders for these new spin-off labs. This is unusual. So back in the day you'd see a typical startup be founded by anywhere from one to four people, maybe five. We're seeing multiple companies like Anthropic, I forget, was it 11 or something? You know, xAI was like 12, and Thinking Machines here is north of 10 as well. So very interesting in terms of what it means for the power dynamics between the developers. There used to be this notion of the 10x engineer, and that's who you would want on the founding team. Well, here we have the 10,000x ML researcher, and you know it because they're getting poached for $1.3 billion. Like, that's literally what it is. So their leverage in the company is really significant. You know, you think about the leverage of capital versus labor in this context. They're kind of merging together. There's a small number of people who can torque your capital like crazy, and so it just makes sense, and you have to include them on your founding team. So anyway, kind of fascinating through that lens. It is reshaping in some pretty fundamental ways the dynamics of co-founding companies.
[49:10]
A
And going back to regular employees. This 8,000 figure, that's 10% of Meta's workforce. I mean, that's big. And related to that, there was a report from April where they said that tech companies have cut 52,000 jobs just in the first quarter of 2026. That's up 40% from the same period the year before and the highest since 2023. I mean, tech is in for a transition and it is very much underway. We've kind of waited for AI to affect the economy in a big way. It took longer than some people predicted. But it's very clear now that the overhaul of the economy driven by AI has started. And related to that, the next story is Meta employees are up in arms over a mandatory program to train AI on their mouse movements and keystrokes. There's this program called Model Capability Initiatives which will look at the programs you're using, including Gmail, Gchat, Metamate, VS Code, kind of anything you use on your computer, seemingly. And it'll get keystrokes, mouse movements, click locations and screen content. There's no opt-out option for work-provided laptops. So clearly, I mean, it's like you're training the AI to replace you, is what's happening here.
[50:34]
B
No, no, no, no, no. You're, you're just, you're just showing the AI how to do your job so that it can do your job. It's definitely a replacement play. Yeah, like, like, dude, this is an advantage of scale, right? You have companies with tens of thousands of employees. They'd be insane not to use that as a training signal. Like, of course. And this is, this is going to be as true for, for other companies that are willing to take the PR hit of having a draconian policy like this and the recruitment hit as well if there's no opt out option. You know, I'm sure that that is not the case, or I would be surprised if it were the case for the $1.3 billion employees that they are poaching from other companies. But we'll see. They in some ways are the most valuable employees to learn how to, you know, learn how to monitor. But yeah, it's interesting. I mean, there you have it.
[51:24]
A
Now a couple stories about hardware. First up, "Chinese fabs import record volumes of U.S. chip making equipment via Singapore and Malaysia. Homegrown toolmakers booked record 2025 revenues as price competition squeezes margins." That's the headline from Tom's Hardware. So I guess that captures quite a bit of the story. Jeremy, you can maybe provide some more detail here.
[51:50]
B
Yeah, I mean, this is basically a catastrophic overproduction problem. We've seen it in China before on, you know, things like bikes. I think it was like e-scooters, where there's these massive e-scooter graveyards. We've seen it with their housing market and all this stuff. Massively overproduce. This is why central planning is hard. So what happened here was you basically just had these export controls that came in and said, okay, Chinese companies can no longer import chips from other providers. Chinese fabs then had to buy domestic. And so there were a whole bunch of Chinese fabs, Naura, Piotech, a bunch of others, that just scaled like crazy. And they all scaled at the same time. The problem is now they're mature, or at least at scale, like crazy scale. They're making tons of revenues, but they're all fighting over the same pool of Chinese fab customers. And so the only way to win business is to go cheaper, basically compete on price. And so anytime you see that, right, like revenues are going up because, yeah, there's tons of demand, but margins are going down, this is your classic sign that we're heading towards a market correction, a consolidation period. So you're going to have a bunch of these companies acquiring or merging with other companies. That's just going to happen and that'll be very disruptive. It's pretty significant and disruptive. But I don't expect that the Chinese government is going to allow that much disruption in the space. They tend to be very heavy-handed, but worth kind of keeping an eye on this. There's a two-track approach that the Chinese are using. They're absolutely going full steam ahead on domestic fab while also trying to get their hands on as many foreign chips as possible. So as ever, I am very skeptical of anybody who says, oh well, if we would just export our chips to China, you see, they would cool it with their attempt to build their fabs as fast as possible. No, they're doing both.
They're taking AI seriously and chip fab seriously.
[53:42]
A
And last up, a couple of stories related to chips. First, Google is eyeing new chips to speed up AI results. They've got two ideas here: a memory processing unit to sit alongside its TPUs, and a new inference-optimized TPU, I guess similar to Cerebras and Groq. Alongside that, we've got the news that Canadian quantum company Xanadu has soared to a $16 billion valuation after Nvidia open sourced AI models called Ising, designed to help quantum computing researchers detect and correct errors faster in the decoding process. So very different kinds of chips, but both kind of highlighting different computing paradigms that are being explored for and with AI.
[54:34]
B
Maybe we'll start with the Xanadu story. In some ways it's the more, how to say, fluky story. I think it's easy to read way more into it than one ought to. So first of all, Xanadu is a quantum computing company that focuses on optical quantum computing, right? And the advantage of optical systems is that, unlike atomic or kind of matter-based systems that you have to keep at super, super cold temperatures because the slightest disruption causes them to lose coherence, basically to just no longer be useful, you can actually do quantum computing using light at room temperature. So it's a massive advantage if you can get it to work. One of the big challenges there is just actually fabricating the chips. We don't have the same level of maturity in that. So one of the things to recognize about Xanadu in particular, and this is with all due respect to Christian, their CEO, who, little name drop, I actually know from my time both at U of T and at the accelerator that we went to: unfortunately, they're trading on a Canadian stock exchange that has very limited trading volumes. And so what you're finding here is, yes, Nvidia put out this major announcement that made a bunch of quantum stocks rally. D-Wave went up, IonQ went up, Rigetti went up, all this stuff. But Xanadu was affected more than anyone, mostly because they have a really small number of shares that are freely tradable. It's just a small float, there's a small trading volume. It is true that quantum error correction is specifically a bottleneck that they face, and that is what Nvidia's open source tools kind of focus on. But most of this, I think, is just market volatility, an artifact of low trading volume and low float. That's basically what this is. 
It's also the case that it's sector-wide. This Ising release is not exclusive to photonics, and they're like 4x over their mid-March price right now on basically FOMO. So if I'm Christian Weedbrook, I'm looking at this and being like, okay. And this is probably why he hasn't come out and taken a victory lap. It's good for him, it gives him some liquidity if he wants it, but it's not necessarily a time to declare victory. On the Google side, this is actually quite interesting. So keep in mind inference is a much more interesting market for Nvidia competitors than training. The challenge with training is people do all kinds of exotic shit. There's, you know, chain-of-thought rollouts and weird optimization routines and all-to-one and many-to-all, like, broadcasting shit. Really, really complex. And that's where CUDA shines. So if you want to take on Nvidia at training time, you're going to have to find a way to crack the CUDA nut. No one's really been able to do that. And so a lot of companies are saying, fine, I'm going to use Nvidia for training, but inference is a much, much simpler use case, right? You're basically just making Subway sandwiches. It's the same basic operation the whole time, right? Somebody gives you data, you throw it through your model, and then, yeah, there's some jiggery-pokery, and then you spit it out. It's a lot simpler in that sense. And it's also taking on a larger and larger share of the whole market. And so it's starting to look a lot like, hey, I'd rather just stay away from the training stuff for now and just get my reps in and revenues on the inference side, and that really becomes an option. And that's exactly what they're doing. So they're basically setting up their next-gen TPU strategy. They have two different chips. There's the TPU V8i, which they're codenaming Zebrafish, and that's an inference accelerator. 
That's kind of the main one for this narrative. But they do have a training one, the TPU V8T, that's the Sunfish, which is meant to be a training chip, and that's designed by Broadcom. Interestingly, Broadcom is not actually designing the TPU V8i. That's MediaTek, which is a bit of a shift in strategy. So yeah, Google already has a really mature TPU ecosystem. Obviously there are all kinds of structural advantages in terms of the way that TPUs work, where they don't have to decode complex instructions or constantly access memory the same way GPUs do. And as a result they actually are a lot more energy efficient. And that matters a ton when you're looking at the North American market. If you're building a data center in North America, the thing you care about, the thing you're bottlenecked by, is energy, is power. And so this is a really big deal, especially on inference, where the margins are the thing you look for, right? You want to beat your opponents on margin. This is also why Jensen has been flexing on Grace Blackwell, calling it the king of inference today, really trying as hard as he can to meme Nvidia into being the inference king. But it's also noteworthy that two of the best models in the world right now, Claude and Gemini, have the majority of their training and inference infrastructure running on Google TPUs. So even though Google may be a relatively small fraction of the chip market right now, boy, are they overperforming pound for pound. It's hard to ignore that. You can call it an artifact of the deal with Anthropic, you can try to kind of wave it away that way. 
But the feedback that they get on chip design from companies like Anthropic, that they're iterating on, that is a big part of the gold, and that allows them to be absolutely competitive with Nvidia in terms of the insights they're deriving from the frontier of AI. So anyway, it's a really interesting story. Lots more we could say, but we're running short and I've already overstayed my welcome here.
[59:54]
A
Moving on to projects and open source, we've got two major stories that once again we bundle up, as they're quite related. Moonshot AI has released Kimi-K2.6, a big model with 1 trillion parameters and various optimizations. It's a big MoE model, and the benchmarks are a little bit crazy. They're saying that this is on par with GPT-5.4 on various benchmarks, comparable to Anthropic, and what I've heard from people is Kimi-K2.6 is quite impressive. Around the same time, Minimax has open sourced Minimax M2.7, which likewise has quite impressive scores, although not quite as high. It's sort of similar to Opus 4.6 but also a little behind, not at the GPT-5.4 level. I think it might not be at the size of Kimi-K2.6. And with Minimax, what they highlight is the self-evolving nature of it. They double down on the story that this model helped develop itself, though that's primarily on the model harness. With Kimi-K2.6, the narrative is focusing on long-term development, working across multiple days. They also highlight how this can be used as part of OpenClaw and other agents like that, and also how you can spawn many agents to work in parallel to tackle complex long-horizon tasks. So pretty impressive all around. Especially Kimi, I think, is firing even harder than what we've seen so far.
[61:35]
B
Yeah, absolutely. And we talked about the Minimax thing. You're right. It is exclusively, as I understand it, the agentic harness that's optimized. So you can argue that it's somewhere along the continuum of basically just doing software engineering automation, which, I'm not saying that's a solved problem, but you're not optimizing kernels, let's say, which is kind of the notoriously pain-in-the-ass thing for AI research. But it's significant. The bigger story, I agree, is Kimi-K2.6, the trillion-parameter model. Really interesting. Very, very sparse. It's an MoE model, a mixture of experts. It has 384 experts, and only eight of them are activated on any given inference cycle. So that's pretty notable. It's 32 billion active parameters out of a trillion total parameters. So pretty damn sparse, a little bit more sparse even than some of the DeepSeek stuff that we've seen. Another interesting detail is MLA. So multi-head latent attention again shows up here, and this is really starting to become a very robust recurring theme with some of these frontier Chinese open source models, if you can chain together all those qualifiers. So MLA is basically this idea where, when you have your attention mechanism, you have your queries and your keys that are these fairly long vectors. Instead of just multiplying them together (it's a little bit more than doing a dot product between them, but roughly speaking), what you can do is first compress them into a smaller representation and then multiply that. And that saves a lot of memory overhead. You know, for Chinese companies that's especially important just because of the export controls that have limited access to memory. But it's not an exclusively Chinese technique. A lot of people have taken this on. You get really good bang for your buck when you do that. 
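To make the sparsity and memory points above concrete, here is a minimal sketch in plain Python. The expert and parameter counts are the ones quoted in the episode; everything else is an assumption for illustration: the `top_k_experts` router, the random gate scores, and the attention sizes (`NUM_HEADS`, `HEAD_DIM`, `LATENT_DIM`) are hypothetical and are not Kimi-K2.6's real routing code or dimensions. The cache arithmetic only follows the general latent-compression idea and ignores details such as positional components.

```python
import random

# Figures quoted in the episode for Kimi-K2.6.
NUM_EXPERTS = 384
ACTIVE_EXPERTS = 8
TOTAL_PARAMS = 1_000_000_000_000
ACTIVE_PARAMS = 32_000_000_000


def top_k_experts(gate_scores, k):
    """Hypothetical router: return the indices of the k highest gate scores."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])


expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Random gate scores stand in for a learned router's output (assumption).
rng = random.Random(0)
gate_scores = [rng.random() for _ in range(NUM_EXPERTS)]
chosen = top_k_experts(gate_scores, ACTIVE_EXPERTS)

# Hypothetical attention sizes, not the model's real dimensions.
NUM_HEADS, HEAD_DIM, LATENT_DIM = 64, 128, 512
full_cache_per_token = 2 * NUM_HEADS * HEAD_DIM  # keys + values for every head
latent_cache_per_token = LATENT_DIM              # one compressed latent vector
cache_reduction = full_cache_per_token / latent_cache_per_token

print(f"experts active per token: {expert_fraction:.2%}")
print(f"parameters active per token: {param_fraction:.2%}")
print(f"experts chosen for one token: {chosen}")
print(f"cached numbers per token per layer: {full_cache_per_token} vs {latent_cache_per_token}")
print(f"cache reduction factor: {cache_reduction:.0f}x")
```

With the quoted figures, roughly 2% of the experts and about 3% of the parameters are touched per token, which is the sense in which the model is "very, very sparse".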
And so yeah, you know, this is again DeepSeek's legacy you can think of as living on. Haven't heard much from DeepSeek recently, by the way. Kind of an interesting note. But MLA does persist. And another interesting detail is this model is shipping natively in INT4 quantization. So traditionally, you know, when you train these models there's a whole bunch of levels of resolution or granularity that the weights and biases in the model can have, right? So typically you're going to have 32 bits of resolution for those numbers. In this case it's actually just four. So really, really coarse-grained. So why would you want to train a model with weights that are represented that roughly, that approximately, where there's basically only a handful of values that those weights can have? Well, it's because it's easier on the memory. Again, right now the problem is that usually you would train a full-resolution model and then you would compress it, you would quantize it down to a low-resolution version of itself, and then ship that. The trouble is, if you do that, you're shipping a model that is being represented, let's say, with INT4 quantization, but it was never trained to operate in that way. It was trained maybe with BF16 or some other higher-resolution thing. And so now you're asking it to perform in a fundamentally different way from what it was trained for. Here, they actually trained it natively in INT4. So this is very much done with a view to deployability, and it will come at the cost of some precision. But it suggests that really what Moonshot was trying to optimize for here is practical inference. From the very beginning, they're not going to keep two sets of books, they're just going to optimize the entire model and training routine for INT4. So that was kind of interesting, and something we have seen every once in a while. 
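As a rough illustration of what "only a handful of values" means, the sketch below quantizes a few made-up weights to signed 4-bit codes and maps them back. It assumes a simple symmetric per-tensor scheme (`quantize_int4` and `dequantize` are hypothetical helpers); it shows the post-training round trip described above, not Moonshot's native INT4 training recipe, which the episode does not detail.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Made-up weights for illustration only.
weights = [0.82, -0.41, 0.05, -0.99, 0.33, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))

print("int4 codes:", codes)
print("restored weights:", [round(r, 3) for r in restored])
print("max round-trip error:", round(max_error, 4))
print("distinct representable values:", 2 ** 4)
```

Each weight lands on one of 16 levels, so the round-trip error is bounded by half the step size. A model trained at higher precision never saw that rounding, which is the mismatch native low-precision training is meant to avoid.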
But to see it with a model this impressive on benchmark scores is kind of notable.
[65:03]
A
And related to that Minimax self-evolution story, also worth highlighting: Hugging Face has released ML Intern, which is an open source AI agent that automates the LLM post-training workflow. So this is an agent framework, not a model, and they say that it kind of does the whole research process of looking up recent papers, you know, modifying code, et cetera. We've covered various stories like this. They say that this beats Claude Code and, you know, really does well at taking Qwen3-1.7B, which does not do so great on GPQA, from 10% to 32% in under 10 hours. And that includes various things, you know, modifying the data set, trying out various things. So I guess here is a bit more of an example of having an AI agent optimize an LLM, which is not what the Minimax examples seemed to show. On to policy and safety. First, we've got the paper Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions. So this identifies which training documents most affect a target model behavior, then computes perturbations to those documents that steer models towards an adversarial objective without explicitly injecting examples of the target behavior. So it's kind of a poisoning attack in some sense. There's quite a bit of previous research on identifying the parts of the training data that affect certain behaviors, so in a sense this is not surprising. But pretty much, yeah, this is able to estimate how replacing a document with a perturbed version will shift model parameters, and uses that to push a model towards certain results. They highlight vision experiments and also some kind of weak results on GPT-Neo.
[67:10]
B
Yeah, this is actually a really, really interesting paper mathematically. I mean, it is somewhat mathematically involved, but the intuition is pretty straightforward. So we'll combine two ideas, I guess. One is, imagine a landscape where the sort of X-axis values, if you will, are all the different parameter values of the model. So all the weights of the model are the parameters here, and then the Y value is the loss function. And really what you're doing during training is you're trying to find the minimum of the loss function in that landscape of possible weight values. You're searching through the weight space to find the combination of weight values that lowers your loss as much as possible. So after you've trained a model, if you've done a good job, your model should sit in a region of weight space, a region of parameter space, that looks like a bowl, right? Sort of a fairly flat region. You're at the bottom of a nice little bowl because the loss is locally quite low. What they're going to do here is say, okay, well, if that's true, what we're interested in knowing is, roughly speaking, in what direction should I nudge my parameters and my data to induce a certain target behavior? It gets a little mathematically involved, but to do that they use, really, a Taylor approximation. This is the idea that you can take any complex function, like a loss function, take its value at a point, and if you're interested in predicting what its value would have been at another point, then if you know the slope, and the second derivative, so the curvature, and the third derivative, so the curvature of the curvature, and on and on and on, you can kind of approximate and extrapolate what the loss function value would have been at some putative other point. Anyway, I'm realizing as I'm saying it, it does get pretty involved. 
So I would say just check it out. But this is actually a quite interesting, fairly fundamental thing that does result in, after a little bit of mathematical jiggery-pokery, figuring out how you can change documents, as you said, Andrei, to induce a certain directional shift in some figure of merit that you care about, like the CIFAR result or whatever. And the interesting thing about this is that these changes in those documents are not obvious. They're not things that give away that you're poisoning the data when you look at them, right? They're seemingly random, small perturbations. It's not like you're including some sort of prompt injection attack where you're explicitly telling the model, when you see a situation like this, respond like this. Instead it's like maybe adding some punctuation marks in a weird way that's out of distribution and seems alien. And that particular weird addition results in specifically the change that you're after. And so again, mathematically, I would say just very interesting and worth looking at if you're a math nerd like we are. But anyway, it's, I think, an important advance here, and it does work quite well.
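The Taylor-approximation intuition described above can be shown on a toy one-dimensional "loss". This is only a sketch of the extrapolation idea: the `loss` function, its hand-derived `slope` and `curvature`, and the chosen point and step are all made up for illustration, and the paper's actual method works with model parameters and training documents via influence functions rather than a scalar toy.

```python
import math


def loss(w):
    """Toy 1-D loss with a bowl-like minimum near w = 1 (illustrative only)."""
    return (w - 1.0) ** 2 + 0.1 * math.sin(3.0 * w)


def slope(w):
    """First derivative of the toy loss."""
    return 2.0 * (w - 1.0) + 0.3 * math.cos(3.0 * w)


def curvature(w):
    """Second derivative of the toy loss."""
    return 2.0 - 0.9 * math.sin(3.0 * w)


def taylor_first_order(w0, delta):
    """Extrapolate the loss using only the slope at w0."""
    return loss(w0) + slope(w0) * delta


def taylor_second_order(w0, delta):
    """Extrapolate the loss using the slope and the curvature at w0."""
    return taylor_first_order(w0, delta) + 0.5 * curvature(w0) * delta ** 2


w0, delta = 0.9, 0.05
exact = loss(w0 + delta)
first = taylor_first_order(w0, delta)
second = taylor_second_order(w0, delta)

print(f"exact loss at w0 + delta: {exact:.6f}")
print(f"first-order estimate:     {first:.6f} (error {abs(exact - first):.2e})")
print(f"second-order estimate:    {second:.6f} (error {abs(exact - second):.2e})")
```

Adding the curvature term shrinks the extrapolation error noticeably for a small nudge, which is why knowing the local shape of the bowl lets you predict the effect of a small change without retraining.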
[70:07]
A
Next, a bundle of two Mythos stories. The first one is that the NSA is apparently using Mythos despite the DoD blacklist. So the NSA is under the DoD, and is not supposed to be using Anthropic right now, as it is labeled a supply chain risk. But it seems that it might be among the undisclosed recipient organizations that have access to Mythos and have used it. And alongside that, a surprising, again reportedly true, story: apparently an unauthorized group has gained access to Mythos. It's a group on Discord. They say that they effectively hacked Anthropic, found a way to access the model, and used it for software engineering purposes. So two kind of different ways in which Mythos is being used in a way it's kind of not supposed to be. Anthropic did not comment on the NSA story and the NSA had not responded. So hard to say how true this is, but kind of a clever move on Anthropic's part if true.
[71:19]
B
Yeah. So the second story is, I think I remember it being said that they accessed it through some third party provider and that the group itself was sort of. It seemed pretty, I don't want to say tame, but it didn't seem like it was people who were intent on using it for cyber attacks. More people who wanted to understand the model. And reading between the lines, Discord server, weird group of people trying to understand the capabilities of the model and not actually use it for hacking.
[71:44]
A
They made an educated guess about the model's online location based on knowledge, so it seems like not a hard hack. It seems like maybe Anthropic really messed up and made it not super hard to do this.
[71:59]
B
Yeah, there's also kind of a crew of Twitter anons who do a lot of AI consciousness sort of experimentation, you know, Janus and those guys. So it makes me vaguely wonder, is it that? I don't know. But interesting nonetheless. It's not the first time we've seen a story about Anthropic security being breached. And, you know, there are going to be more. I mean, when you have nation states after you, you're going to have stuff like that.
[72:25]
A
Quite ironic for Mythos, right?
[72:27]
B
Yeah, that's right. It's ironic. That's exactly the word. The NSA piece is also interesting. Kind of hard to ignore the fact that the Pentagon, the specific entity that runs the NSA, was arguing that Anthropic was a national security threat. That is why the supply chain risk designation was granted. And so now ostensibly we have that same entity coming back and saying, oh, please, may I have some more? And there's been productive engagement, by the way, it seems, between Dario and two people: Susie Wiles, who is Trump's chief of staff, and Scott Bessent, who is the Treasury Secretary, neither of whom, you will note, is the Department of War secretary, Pete Hegseth. Kind of weird, but everybody but Pete Hegseth seems to be involved in getting the Department of War access to Claude Mythos. That's kind of a notable thing. You would think that the one person who would be involved in this would be Pete Hegseth, but it kind of seems like basically the entire executive branch is reorganizing itself around not getting Pete in a room with Dario just to make this happen. So kind of amusing. You know, it's worth saying out loud that the NSA also has this dual reporting structure where they also report to the ODNI, the Office of the Director of National Intelligence, which is Tulsi Gabbard. So, you know, maybe that's an option too, if they want to gracefully make that work. But it is slightly amusing. Yeah, I mean, it highlights again that this was stupid from the start. You know, it was never going to be a good idea to start a war of words with what is arguably the leading AI lab on planet Earth that happens to be within the US's own borders and that has built a.
[74:09]
A
What may be a superweapon, and is friendly with the military. Like, they didn't ask for much.
[74:17]
B
Yeah, very odd. But here we are.
[74:20]
A
And if they're actually collaborating and want the NSA to use it, that's a very pro-US-government move. The NSA is going to go and look for ways to hack software and get data on other countries, and vice versa. If they don't have access to Mythos, that is a major blunder, right?
[74:41]
B
A huge problem. Yeah, absolutely. Absolutely.
[74:44]
A
So it's hard to see how the DoD can continue with its position on the supply chain risk, given the NSA just absolutely must have access to Mythos if it can.
[74:53]
B
Yeah. Given the personalities involved, I don't know. This may actually go to court. Who knows? But Jesus. Yeah.
[74:59]
A
On the second story, of the unauthorized group: not a ton to say, aside from it's pretty silly that Anthropic has had another kind of breach pretty soon after the Claude Code source leak. The group did say they used it for kind of simple software engineering rather than trying to hack anyone. Anthropic did say that they're kind of monitoring for misuse, so that could be how they flew under the radar. But given they said they partnered with 40 organizations, that's not a lot of access to Mythos. So the fact that this seemed to slip through, and the group actually contacted Bloomberg and disclosed that they had access, is not a good look. Not just ironic, but Anthropic keeps messing up here recently. And beyond this, they also have a lot of controversy around usage limits, which we're not covering this episode. Maybe next episode. So Anthropic is definitely in a tough situation, where super astronomical growth has clearly come at a cost.
[76:02]
B
Yeah. And I mean, it has been. This always happens, right? When companies scale crazy fast, things start to break. So in some ways this isn't so surprising. We are asking our frontier labs to suddenly become load-bearing elements in the national security apparatus, which isn't supposed to happen. And so yeah, unfortunately it's something that you could have predicted a long time ago. We did predict it a long time ago. But for it to actually materialize now means everybody's having to get their cyber up to speed. And also worth noting, cyber for AI models is a qualitatively new challenge as well. We're not just talking about, oh, Anthropic needs to get SOC 2 compliant and get FedRAMP and get all these things. No, no, this is not how you solve the problem. There don't exist standards for this kind of security yet. We're making it up as we go. And so especially API keys for models, access to model weights, this is really, really tricky stuff. But to your point, I mean, it's been on full display for sure over the last few months.
[77:03]
A
On to research and advancements. Just a few stories here. First up, Parquet: Scaling Laws for Stable Looped Language Models. This is a collaboration between University of California San Diego and Together AI. The idea here is a novel, stable looped architecture. So instead of doing a single forward pass, as you do typically in any kind of traditional transformer, looped transformers can pass the same activations through the same layers multiple times to improve scaling, test-time scaling. And this is not a novel idea. Looped models are a thing that's been around for a long time. So the novel thing here is the tweaks that allow it to be stable, meaning that it's easier to train and to use. And they have some pretty fancy math that they go through that I'm not going to try to touch. I'll let Jeremy get into the nitty-gritty as usual. But when they evaluate it at larger scale, with a 1.3 billion parameter model trained on 100 billion tokens, they reduce validation perplexity by 6.3% compared to a prior looped model architecture. So a pretty significant improvement in terms of kind of performance of training.
[78:30]
B
Yeah, for me this is paper-of-the-week material. Quite interesting. So first of all, why do you want a looped model in the first place? Why do you want a situation where you take the same weights and loop over them a whole bunch of times, rather than having a bunch of, you know, different layers, which normally sounds better, right? Because then each layer can be more cleverly adapted to the specific computation that's required of it. The reason is memory. You're trying to reduce the memory footprint of the model: take the same weights, run your data through them multiple times, and really you're getting more computation, more FLOPs, for the same memory footprint. So that's the reason behind the looped architecture. When you have a standard transformer, in some sense, at the heart of it mathematically is this idea of the residual stream. So the data comes to a layer of the transformer, and what you do is you fork it off into two paths. One path is going to be the exact copy of the data that came in. So you're just going to preserve a copy of the input data. The other path is where you're going to apply all your fancy transformer architecture stuff, your attention mechanism and your MLP and all that stuff. And then at the end you're going to merge them back together. So your original data that came in, the input, gets combined with the data that went through the attention mechanism and all the fancy stuff. And the reason you're doing that is to preserve a copy of the original data through each layer, to kind of remind the model: by the way, you know, the input looked like this. Let's not deviate too far from it. Let's not forget the lessons in particular that were learned from previous layers and completely fuck this thing up. That's essentially what you're doing. 
You're allowing it to not go too wild with the attention mechanism, the MLP and all that stuff, and always kind of preserve some amount of information that captures the lessons learned from previous layers. Okay, so that's the basic idea here. Now the challenge. Imagine taking a chunk of transformer blocks like this and looping over it, like, 10 times. Well, as you do that, it is true that the copy of the input data that you're sending along is somewhat making the past information from previous layers sticky. But you are merging it with data from that layer, and over many layers you're eventually going to forget all about the previous ones. And so when you loop over many, many, many times, eventually you just completely forget what came in at the input. Your performance plateaus; it's not good. And so what they're going to do here is say this: let's start with a set of prelude blocks. These are the earliest layers of the whole model, and all they're doing is taking your input and turning it into some kind of processed embedding. This is like an enriched representation of the input. And then that processed embedding, let's call it E, we're going to feed to this looped transformer setup. And what we'll do is, every single time we loop, at every single layer, we're actually going to re-inject the pure original version of E, the initial preprocessed input. We're going to basically copy that and keep injecting it to continue to ground the model, to have it remember what the original input was. Not just the input from the previous layers, but the original input all the way back at the very beginning of the architecture. And that's a way of preventing this sort of drift the more you loop over these layers. Essentially it's like having a human live 100, 1,000, 10,000 years. They'll eventually forget what happened to them, you know, in their childhood or whatever. 
And what you're doing is you're injecting memories of their childhood every year so that they keep not forgetting it. Then there's a final layer of decoding blocks at the top that basically gets you your ultimate output. And so anyway, there's a bunch more to be said here. We don't have time. You basically have this sandwich shape where you have your prelude blocks, you have your looped transformer, and then you have your decoding layers. And there's a whole bunch of reasons why this is an interesting development. It is all measured at relatively small scales. You're looking at like 1.3 billion parameters. So, you know, whether the power laws that they see here actually hold at frontier scales is totally unknown. It takes taste to decide which things to try to scale. But one piece of speculation that has been running around is that maybe this is part of the architecture behind Mythos. There are interesting reasons to suspect that, but it's far from clear, and it's not like Anthropic has come out and said that this is the case. So, you know, grain of salt. There is even an obnoxious project called Open Mythos, where, based on the name, you're like, oh my God, somebody's replicated it. No, they are just guessing that this very architecture is what was used there. There's reason for that speculation. But again, we don't necessarily have super.
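The prelude, looped block, and decoding "sandwich" described in this turn can be sketched with a toy in plain Python. Every function here (`prelude`, `shared_block`, `decode`, `looped_forward`) is a hypothetical stand-in using simple scalings instead of attention and MLP layers, and the shrinking block is an assumption chosen to make forgetting visible; it is not the paper's parameterization or its stability mechanism.

```python
def prelude(x):
    """Hypothetical prelude block: turn the raw input into an embedding E."""
    return [2.0 * v for v in x]


def shared_block(h):
    """Hypothetical shared block; the same 'weights' are reused on every loop."""
    return [0.5 * v for v in h]


def decode(h):
    """Hypothetical decoding block: collapse the hidden state to one number."""
    return sum(h)


def looped_forward(x, num_loops, inject=True):
    """Run prelude -> looped shared block -> decode, optionally re-injecting E."""
    e = prelude(x)
    h = list(e)
    for _ in range(num_loops):
        h = shared_block(h)
        if inject:
            # Re-inject the original embedding E on every pass through the loop.
            h = [hv + ev for hv, ev in zip(h, e)]
    return decode(h)


x = [1.0, -2.0, 0.5]
with_injection = looped_forward(x, num_loops=10, inject=True)
without_injection = looped_forward(x, num_loops=10, inject=False)

print(f"output with re-injection:    {with_injection:.6f}")
print(f"output without re-injection: {without_injection:.6f}")
```

In this toy, the output without re-injection decays toward zero as the loop count grows, so the input is effectively forgotten, while the re-injected version settles at a value that still depends on the original embedding.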
[83:14]
A
Yeah, I cut that story because I didn't want to give too much attention to Open Mythos, which is a misleading name. The only other thing to say is, per the title, they do provide scaling laws for these stable looped models, both for training and for testing. And this actually introduces a new axis of scaling, right? Now you have the number of loops as a thing to consider, besides just overall token outputs. Maybe this will help further advance the ability to do effective test-time scaling and also effective train-time scaling, kind of continuing down the training scaling path. Next up, OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation. So this is coming from, actually, the Qwen team, in collaboration with the Chinese University of Hong Kong. They have this large environment simulator where the idea is to simulate the kind of environments you would see in healthcare, education, transportation, commerce, et cetera. And you would then be able to evaluate the ability of the models to perform well there. And this clearly is quite useful, given a lot of the conversation lately about the ability of models to do work across long horizons and the many discussions of METR we've seen. So they show that no single model is able to nail all of these categories of work. On kind of agricultural business, they're able to get to a max average of 80%, and then it goes down from there. So another useful eval, I think, in trying to get a picture of things in practice. Every model is going to be able to at least do some tasks in these occupations, and this allows you to simulate the environments and benchmark models as they come out.
[85:29]
B
Yeah. One question that might naturally come to mind is, what about GDPval? I thought that did this. And in a sense you're not wrong if you think that. There is a philosophical difference here. So GDPval is this highly curated eval suite where OpenAI basically went after a relatively small number of occupations. There's like 44 that are meant to be from the most important sectors for US GDP, whereas OccuBench is broader and less deep. So it's 100 scenarios, 65 different domains, 10 industry categories. But GDPval has 30 tasks per occupation in the full set. And what you get with OccuBench is everything is synthetically generated. So they can basically just spam a crap ton of stuff, which is consistent with how a lot of the Chinese labs tend to take this approach where they're just spamming a whole bunch of stuff for their evals. Whereas typically, when you see the kind of high-effort curated approach for evaluating language model performance, where you're getting doctors in, you're getting engineers, whatever, that, just from a taste standpoint, seems to be more of a Western thing for some reason, which is quite interesting. Anyway, so, you know, maybe a good way to get broader coverage, but less robustness. This is kind of more hackable in a sense.
[86:42]
A
Moving on to synthetic media and art, just a couple more stories to close out the episode. First up, Deezer has said that 44% of songs uploaded to its platform daily are AI generated. So these are big numbers. There's been rapid growth. It was 10,000 tracks per day in January 2024, and now, in January 2025 (this is a report from a while ago), it's going up to 60k per day. So that is a ton of music uploaded. But this AI-generated music accounts for only 1 to 3% of total streams. And 85% of those streams are flagged as fraudulent and demonetized. So this tracks along with other stories we've seen of Spotify and other platforms getting more and more AI-generated music. And we had a story of a person who was sued for kind of defrauding Spotify by uploading a bunch of music and then faking a bunch of streams. So another instance of: we could see this coming. The models for music generation were already quite good in 2024, and now it's here, AI-generated music. There are people who want to try to exploit this and spam the platforms and just kind of cheaply get a bunch of streams, similar to the ebook boom on Amazon when people generated entire books. Like, this is just the Internet now. People are going to try to mass-generate stuff and post it and get money. It's unfortunate, and the platforms have to adapt.
[88:26]
B
Yeah, and an interesting potential metric too. I don't know how robust this would be in practice, but at least it's a first pass: this ratio of percent of total streams versus percent of new uploads. Once we see a shift in that, if they approach the same order of magnitude, that might tell you something about, hey, AI music is absolutely here and competitive. There's obviously an unfair advantage for human-generated music just because people follow artists, and they'll download the next Kanye song or whatever. But still, from an order-of-magnitude standpoint, that could give us a pretty decent, noisy measure of: Are we there? Have we had our, I don't want to say our ChatGPT moment, to your point? I think we've kind of had our Claude Code moment. Something like that.
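The ratio described here can be worked through with the figures quoted in this segment (44% of daily uploads, 1 to 3% of total streams). The following is only a rough sketch: the function names are hypothetical, and treating a gap of under one order of magnitude as "competitive" is an assumption made for illustration, not something stated in the episode or by Deezer.

```python
import math


def stream_to_upload_ratio(stream_share: float, upload_share: float) -> float:
    """Share of total streams divided by share of new uploads (both as fractions)."""
    if stream_share < 0 or upload_share <= 0:
        raise ValueError("shares must be non-negative, and upload_share must be positive")
    return stream_share / upload_share


def orders_of_magnitude_gap(stream_share: float, upload_share: float) -> float:
    """How many powers of ten separate the upload share from the stream share."""
    if stream_share <= 0 or upload_share <= 0:
        raise ValueError("both shares must be positive")
    return math.log10(upload_share / stream_share)


# Figures quoted in the episode: 44% of daily uploads, 1-3% of total streams.
UPLOAD_SHARE = 0.44

for stream_share in (0.01, 0.03):
    ratio = stream_to_upload_ratio(stream_share, UPLOAD_SHARE)
    gap = orders_of_magnitude_gap(stream_share, UPLOAD_SHARE)
    # Hypothetical threshold: shares within one order of magnitude count as "same order".
    same_order = gap < 1.0
    print(f"streams={stream_share:.0%} ratio={ratio:.3f} gap={gap:.2f} same_order={same_order}")
```

On these numbers the stream share sits more than one order of magnitude below the upload share at both ends of the quoted range, which matches the point that the two have not yet converged.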
[89:08]
A
Something like that. A related aspect of the story is that an AI-generated track actually recently topped iTunes charts in the US, UK, France and Canada. So AI music is getting real hits and getting a lot of streams, but there are a lot of these kinds of people trying to exploit it as well. Last story: celebrities will be able to find and request removal of AI deepfakes on YouTube. So you are going to be able to submit a request. Alongside that, you need an ID and a selfie video, and you will be able to say, like, take it down. Although if it is a parody video or a satire video, then there's potentially no ability to remove it. So I think it's an interesting development. Now it's easier than ever to create videos starring any celebrity you want. Although we've had that ability for a while now, we're getting to a point where it's probably just hard to tell if something is a deepfake. Especially since previously, to make a deepfake, you would usually kind of transfer the appearance onto another video. Now you can zero-shot generate a clip featuring someone via Seedance or whatever. Another example of a platform having to provide a new kind of interface and deal with deepfakes in a way that we haven't observed up to now.
[90:36]
B
Yeah, as massive celebrities we of course are extremely relieved. But yeah, it's definitely a limit, in some sense, on free speech that we're running into. And it does make me wonder when this becomes a court case. Like, it will at some point, right? You're going to get to a point where people say, no, no, your likeness, especially as a public figure, is not your own. I can draw a picture of you. I don't get in trouble for drawing a picture of you. How is that different? Like, I don't know. But free speech is colliding with AI in a really, really big way. So it'll be interesting to see when this transitions from just product announcements, which is what this is, to actual, I don't know, Supreme Court cases at some point. I mean, I imagine that's ultimately where it may go.
[91:20]
A
And with that, we are done with this episode. Thank you so much for listening. Hopefully this one comes out soon after recording and we'll be back on track with consistent releases. As usual, we appreciate you listening, sharing, reviewing, commenting, all that, and more than anything, tuning in week to week, even when I am late in posting. So thank you for listening and we'll be back next week. In when the AI news begins. Begin
[92:12]
D
Break it down. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. Last Week in AI, come and take a ride through the streets, AI's reaching high. New tech emerging, watching surgeon fly, from the labs to the streets, AI's reaching high. Algorithm shaping up the future sees, tune in, tune and get the latest with ease. Last Week in AI, come and take a ride, get the lowdown on tech and let it slide. Last Week in AI, come and take a ride, I'm a laugh through the streets high. From neural nets to robot, the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, future's unfolding, see what it brings.