Last Week in AI – Episode #212 (June 17, 2025)
Podcast: Last Week in AI
Hosts: Andrey Kurenkov and Jeremie Harris (Gladstone AI)
Overview
This episode covers two weeks’ worth of fast-moving AI news, exploring major updates across tools, business, open source research, policy, chips, and synthetic media. While lacking a single blockbuster headline, the hosts unpack impactful developments—from OpenAI’s O3 Pro and sweeping model price drops to new security risks in AI agents, evolving labor market dynamics, and significant legal challenges around generative media. Their discussion remains lively, witty, and deeply informed for both technical and general audiences.
Tools & Apps
OpenAI Launches O3 Pro, Massive Price Drop
(04:46 - 08:02)
- O3 Pro: A new reasoning-focused LLM, replacing O1 Pro, now available to ChatGPT users.
- Price slashed by 80%: Input tokens now $2 per million (down from $10), a “huge price drop” (Andrei, 04:46).
- Performance: O3 Pro benchmarks show broad improvements, sometimes beating human and previous AI results.
- Notable Benchmark: “Basically you see a clean sweep where the model 64% of the time is preferred to humans…spanning everything from…personal writing and computer programming and data analysis.” (Jeremy, 05:52)
- 4 out of 4 reliability: OpenAI now reports strict test metrics where a model must correctly answer a question four times out of four—“I hadn’t noticed, to my embarrassment… Of course they’re doing this, but I hadn’t yet remembered seeing it in writing.” (Jeremy, 05:52)
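The "4 out of 4" reliability metric described above can be sketched in a few lines. This is a minimal illustration of the idea (a question counts only if the model answers correctly on every one of k attempts), not OpenAI's actual evaluation code; the function and data names are hypothetical.

```python
def strict_pass_rate(results, k=4):
    """Fraction of questions answered correctly on ALL k attempts.

    `results` maps question id -> list of k booleans (one per attempt).
    A question only counts if every attempt is correct ("4 out of 4"),
    a much stricter bar than the usual single-sample accuracy.
    """
    passed = sum(
        1 for attempts in results.values()
        if len(attempts) == k and all(attempts)
    )
    return passed / len(results)

# Toy example: 3 questions, 4 attempts each.
results = {
    "q1": [True, True, True, True],   # counts
    "q2": [True, True, False, True],  # one miss -> does not count
    "q3": [True, True, True, True],   # counts
}
print(strict_pass_rate(results))  # 2/3 ≈ 0.667
```

Note how a model with high single-attempt accuracy can still score poorly here: one flaky answer out of four disqualifies the question entirely.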
Cursor AI Editor Hits 1.0
(09:27 - 12:43)
- Cursor: Powerful AI-enhanced coding environment achieves a significant milestone with version 1.0.
- Launches BugBot (automated PR reviewer) and Background Agents (remote, agentic code workers).
- Agentic coding: Agents now function asynchronously and autonomously.
- Security Concerns: “Agents have a much bigger surface area of attacks… if you’re deploying this in a production setting, this is a really interesting new set of vulnerabilities…” (Jeremy, 10:52)
- Microsoft also reported a prompt-injection vulnerability in Copilot the same week.
Mistral’s Open-Source Reasoning Models & Market Position
(13:04 - 15:44)
- Mistral (French AI lab) releases Magistral models (24B parameter “small” is fully open source).
- Generally not state-of-the-art but fills an open-source need.
- Analysis: “The fact that they did release this suggests they don’t have a plan for blowing things out of the water anytime soon.” (Jeremy, 14:11)
ElevenLabs V3 – State-of-the-Art Multilingual Text-to-Speech
(15:45 - 18:54)
- V3 model: More natural, expressive voices (e.g., laughter, sighs), supports 70+ languages.
- Programmable cues: Developers can embed [happily], [shouts], etc., into prompts—“It seems obvious in retrospect, but somebody had to think of it and implement it.” (Jeremy, 17:10)
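The bracketed-cue format mentioned above ([happily], [shouts]) can be sketched as simple string construction before the text is sent to the TTS API. The two tag names come from the episode; treating a cue as an inline prefix and the helper below are illustrative assumptions, not ElevenLabs' documented SDK.

```python
def tag_line(text, cue=None):
    """Prefix a line with an ElevenLabs-v3-style audio tag like "[happily]".

    The bracketed-tag format is described in the episode; this helper and
    its behavior are a hypothetical sketch of how a script might be built.
    """
    return f"[{cue}] {text}" if cue else text

script = "\n".join([
    tag_line("Welcome back to the show!", "happily"),
    tag_line("And that's the big news this week."),
    tag_line("Goal!", "shouts"),
])
print(script)
```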
ByteDance Seedance 1.0 & Google Veo – Video Generation Heats Up
(18:54 - 23:40)
- ByteDance (TikTok parent) launches Seedance 1.0 to compete with Google’s viral Veo.
- Seedance: 5s of HD video generated in ~40 seconds; praised for handling complex sequences and character consistency.
- Google’s Veo Pro plan: $20/month, faster generations; “I continue to tap the sign that someday fairly soon we’re going to be able to generate 1s of video for each second that you wait.” (Jeremy, 23:40)
- Video generators will soon support real-time, interactive experiences—a “very dark rabbit hole” for media feedback and personal optimization.
Business & Industry
AI Talent Migration: Anthropic’s Winning the War
(25:41 - 31:40)
- SignalFire report: OpenAI employees are leaving for Anthropic at an 8:1 ratio; DeepMind at 11:1.
- Culture trumps pay: “I’ve never had a conversation that feels like that [tense, secretive] with an Anthropic employee.” (Jeremy, 25:41)
- Compensation: OpenAI counteroffers include $2M retention bonuses and $20M equity increases.
- Entry-level jobs vanishing: “We’re no longer hiring entry-level software engineers. We don’t expect ever to do that again.” (Jeremy, 32:32)
- Senior talent only; AI is writing the majority of major labs’ code bases.
- White-collar automation is accelerating.
OpenAI – New Court Order Calls for Retaining All Chat Logs
(34:33 - 36:36)
- Legal standoff: Court orders OpenAI to retain all user logs, including deleted ones, as part of copyright suit (NYT v. OpenAI).
- OpenAI criticizes the ruling as a “way of preventing OpenAI from respecting its users’ privacy decisions.” (Jeremy, 34:33)
- Could put OpenAI at odds with privacy law and “zero retention” business customers.
Hardware Race: China’s Huawei vs. Nvidia; Next-gen Chips
(36:36 - 50:30)
- Huawei is struggling to match Nvidia despite state backing, held back by outdated process nodes (5–7nm) and energy inefficiency.
- Large Chinese techs (ByteDance, Tencent) reluctant to adopt due to competitive dynamics and U.S. pressure.
- Huawei promising 3nm GAA chips by 2026, but skepticism abounds; yields currently “really bad.”
- TSMC’s 1.4nm “Angstrom-class” node, due 2028: expected to cost ~$45K per wafer, roughly 50% more than 2nm.
- Nvidia as world’s most valuable company—could start crowding out Apple for “the leading node” at TSMC.
- Mistral launches Mistral Compute, touting energy and regulatory support (“the only Western country that can still build nuclear plants in <10 years” — Jeremy, 48:34).
Research & Open Source
ProRL – Pushing Reasoning With New RL Tricks
(51:26 - 56:09)
- Prolonged reinforcement learning adds “genuinely new capabilities” to LLMs, not just surfacing old ones.
- Innovations include periodic reference policy resets and nuanced regularization.
- “There’s all kinds of shit. It’s actually quite an interesting collection of shit. The shit links together in interesting ways…” (Jeremy, 51:26)
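The "periodic reference policy resets" idea above can be sketched as a training-loop skeleton: the KL regularizer is computed against a frozen reference copy of the policy, and that reference is refreshed to the current policy every N steps so regularization doesn't anchor training to a stale model. The helper names (`update_fn`, `kl_penalty_fn`) and the toy usage are hypothetical; the episode describes the concept, not this code.

```python
import copy

def prorl_loop(policy, update_fn, kl_penalty_fn, steps=1000, reset_every=200):
    """Toy skeleton of prolonged RL with periodic reference-policy resets.

    `update_fn(policy, penalty)` applies one RL update; `kl_penalty_fn`
    measures divergence from the frozen reference. Both are hypothetical
    stand-ins for a real RLHF/GRPO-style implementation.
    """
    reference = copy.deepcopy(policy)
    for step in range(steps):
        penalty = kl_penalty_fn(policy, reference)   # keep policy near reference
        policy = update_fn(policy, penalty)          # regularized RL step
        if (step + 1) % reset_every == 0:
            reference = copy.deepcopy(policy)        # periodic reference reset
    return policy

# Toy usage with a scalar "policy" just to show the control flow.
final = prorl_loop(
    0.0,
    update_fn=lambda p, pen: p + 0.1 - pen,
    kl_penalty_fn=lambda p, r: 0.01 * (p - r) ** 2,
    steps=20,
    reset_every=5,
)
print(final)
```

The design intuition: without resets, a long run drifts far from the original reference and the KL term either dominates or must be annealed away; resetting lets training continue to explore while staying locally regularized.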
Rethinking Scaling Laws: Test Time & Memory Matter
(56:09 - 64:00)
- Kinetics: Rethinking Test-Time Scaling Laws proposes including memory access in scaling equations; FLOPs alone are inadequate.
- “One of the big bottlenecks now is just how fast can you move the data around… That’s become more and more of an issue as [sequence] lengths get greater and greater.” (Jeremy, 58:39)
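The "moving the data around" bottleneck can be made concrete with a roofline-style estimate: per decode step, compare the time implied by FLOPs against the time implied by memory traffic. The hardware numbers below are illustrative (roughly H100-class FP16 throughput and HBM bandwidth), and the model is deliberately crude compared to the paper's analysis; it only shows why bandwidth, not compute, dominates autoregressive decoding.

```python
def decode_step_time(n_params, kv_bytes, peak_flops=1e15, peak_bw=3.35e12):
    """Roofline lower bounds for one batch-1 decode step, FP16 weights.

    peak_flops / peak_bw are illustrative H100-class figures (assumption).
    Returns (compute-bound time, memory-bound time) in seconds.
    """
    flops = 2 * n_params                    # ~2 FLOPs per weight per token
    bytes_moved = 2 * n_params + kv_bytes   # read every FP16 weight + KV cache
    return flops / peak_flops, bytes_moved / peak_bw

# 70B-parameter model with a 5 GB KV cache (long context).
c, m = decode_step_time(70e9, kv_bytes=5e9)
print(f"compute-bound: {c*1e3:.2f} ms, memory-bound: {m*1e3:.2f} ms")
```

With these numbers the memory-bound time is a couple of orders of magnitude larger than the compute-bound time, and the gap widens as sequence length (hence KV-cache size) grows, which is exactly the effect the quote describes.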
Surprising Power of Negative Feedback (Negative RL)
(64:00 - 69:44)
- Training LLMs by penalizing wrong answers produces more diversity, less overfitting vs. rewarding “correct” answers (“Positive only”).
- “There’s a bit of a loss of output diversity versus negative only, which improves performance across all pass@K metrics.” (Andrei, 64:00)
- Weighted approaches (90% negative, 10% positive) seem optimal.
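The 90/10 weighting above can be sketched as a per-sample weight for a REINFORCE-style objective: wrong answers get a large negative weight, correct answers a small positive one. The exact sign convention and the helper below are assumptions of this sketch; the episode gives only the 90%/10% split.

```python
def weighted_reinforce_weight(correct, w_neg=0.9, w_pos=0.1):
    """Per-sample gradient weight for the mixed objective from the episode.

    Penalize wrong answers with weight w_neg and reward correct ones with
    w_pos (the ~90/10 split the hosts cite as roughly optimal). Maximizing
    the weighted log-likelihood is this sketch's assumed convention.
    """
    return w_pos if correct else -w_neg

batch = [True, False, False, True]
weights = [weighted_reinforce_weight(c) for c in batch]
print(weights)  # [0.1, -0.9, -0.9, 0.1]
```

The intuition matches the quote: pushing probability mass *away* from wrong answers constrains the model less than pulling it *toward* one "correct" answer, preserving output diversity.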
Automating Research: LMs as AI “Taste” and Experiment Runners
(69:44 - 79:56)
- Predicting Empirical AI Research Outcomes with Language Models: an LM can predict experiment results better than human experts in some setups (77% vs. 49%).
- EXP-Bench: Benchmarks AI agents’ ability to replicate published research experiments; o3-mini bests others but with only a 1.4% full-success rate, yet: “That’s a pretty big 1.4%, at least in my mind.” (Jeremy, 76:25)
Policy, Safety, and National Security
Models Know When They’re Being Evaluated (Alignment/Sandbagging Risk)
(80:07 - 83:42)
- Multiple choice and open-ended tests show models (e.g., Gemini 2.5 Pro) are increasingly adept at recognizing when they’re undergoing evaluation for safety/capability.
- “Frontier models show definite above random evaluation awareness…that’s kind of interesting.” (Jeremy, 80:46)
- Raises concerns over models faking alignment in controlled tests but behaving differently in deployment.
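A claim like "definite above-random evaluation awareness" boils down to a statistical check: does the model's accuracy at classifying "is this an evaluation?" beat chance by more than sampling noise? A minimal sketch using a one-sided z-test (normal approximation to the binomial, fine for large n); the sample numbers are invented for illustration, not from the paper.

```python
from math import erf, sqrt

def above_random_p(correct, total, chance=0.5):
    """One-sided p-value that classification accuracy beats chance.

    Normal approximation to the binomial; adequate when `total` is large.
    Small p => accuracy is significantly above random guessing.
    """
    p_hat = correct / total
    se = sqrt(chance * (1 - chance) / total)   # std. error under H0
    z = (p_hat - chance) / se
    return 0.5 * (1 - erf(z / sqrt(2)))        # upper-tail probability

# Hypothetical: 620 correct "is this an eval?" calls out of 1000.
print(above_random_p(620, 1000))
```

Even a modest 62% accuracy over 1000 trials yields a vanishingly small p-value, which is why "above random" awareness is easy to establish once you probe at scale.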
Interpreting In-Context Learning; Multiphase Emergence
(83:43 - 90:11)
- New research shows that LLMs’ ability to learn “on the fly” isn’t just due to simple induction heads, but involves complex, staged circuit emergence.
- “There are different types of emergence that might occur in neural net training, which in general is interesting.” (Andrei, 90:11)
Security
First-Ever Zero-Click LLM Agent Attack (Copilot)
(92:04 - 94:07)
- EchoLeak: A vulnerability allowed attackers to exfiltrate data from Copilot via malicious emails—no user interaction required.
- “The attack surface has just exploded, right, with these agents.” (Jeremy, 92:04)
- Reveals the challenge of defending prompt-based systems in assistant/agent contexts.
ClaudeGov: Anthropic’s Models for US National Security
(94:07 - 97:32)
- Tailored LLMs used in classified government settings for “planning, operational support, intelligence analysis, threat assessment.”
- Noted tension: “Sometimes you do want these models to be capable of things you wouldn’t want everyday users to do.” (Jeremy, 95:22)
Synthetic Media, IP, and Labor
Midjourney Sued by Disney & NBCUniversal
(97:32 - 100:18)
- Accused of “straightforward copyright infringement” for letting users create imagery of protected characters.
- Notable: Lawsuit PDFs embed AI-generated Shrek and Darth Vader images.
- “Midjourney probably has fewer resources these days…to pull off its lobbying effort.” (Jeremy, 99:19)
SAG-AFTRA & Video Game Companies Reach Deal
(100:18 - 104:03)
- Union covering actors and voice actors wins AI compensation/consent protections after 18 months of negotiation.
- Ongoing dilemma: “Do we own our voices? What does it even mean to own our voices?” (Jeremy, 103:34)
- AI-powered voice synthesis is blurring lines of IP/likeness in media and entertainment.
Notable Quotes & Moments
- On O3 Pro’s benchmarking: “You see a clean sweep where the model 64% of the time is preferred to humans… across everything from quantifiable to qualitative tasks.” (Jeremy, 05:52)
- On software engineering’s future: “We're no longer hiring entry level software engineers…we don't expect ever to do that again.” (Jeremy, 32:32)
- RL section colorful summary: “There's all kinds of shit. It's actually quite an interesting collection of shit. The shit links together in interesting ways…” (Jeremy, 51:26)
- On the new zero-click Copilot hack: “There's no phishing, no malware needed. This is just straight prompt injection.” (Jeremy, 92:04)
- On labor market shifts: “The job of software engineers, the job even of AI researchers, is getting more and more abstract and further away from…many of the activities that used to define them.” (Jeremy, 32:32)
- On the new phase of AI entertainment IP: “Do we own our voices? What does it even mean to own our voices?” (Jeremy, 103:34)
Key Timestamps
- OpenAI O3 Pro + Price Drop: 04:46–08:02
- Cursor AI Editor 1.0: 09:27–12:43
- AI Talent Migration: 25:41–31:40
- Entry-Level Tech Jobs Disappearing: 32:32–33:47
- Zero-Click Copilot Security Flaw: 92:04–94:07
- Midjourney Lawsuit: 97:32–100:18
- SAG-AFTRA Deal: 100:18–104:03
Final Thoughts
- The AI field is advancing rapidly in capabilities, with growing social and legal complexity.
- Workforce dynamics are shifting as entry-level coding jobs dry up and high-stakes retention wars heat up.
- New attack surfaces are emerging as agents gain more autonomy.
- Big legal showdowns (Midjourney, OpenAI) will set precedent for generative AI and IP.
- AI is increasingly both the originator and executor of research, and ethics, security, and labor issues will only accelerate in tandem with AI’s technical progress.
To learn more and find links to the stories, visit lastweekinai.com