#235 - Sonnet 4.6, Deep-thinking tokens, Anthropic vs Pentagon

Last Week in AI

Tue Mar 03 2026

Summary

Last Week in AI – Episode #235 Summary

Date: March 3, 2026
Hosts: Andrej Karpathy ("Andre Karenkov"), Jeremy Harris
Main Theme: A packed week of major AI model releases, advances in AI benchmarking and optimization, fierce hardware and geopolitical competition, and a dramatic standoff between Anthropic and the Pentagon.

Episode Overview

The hosts recap two weeks’ worth of breakneck AI news: multiple high-impact LLM updates, advances in agentic AI tooling, major hardware deals and challenges, ongoing interpretability research, and—most notably—Anthropic’s escalation with the U.S. Department of War over military AI usage. The episode is a whirlwind tour of technical updates and power struggles, delivered with the podcast’s blend of technical rigor, speculation, and dry humor.

1. Listener Updates and Podcast Housekeeping

Hosts regret missing a previous week; Andrej mentions Astrocade recently raised a Series B and is hiring.
Podcast reviews highlight appreciation for the show’s blend of technical depth and directness around political topics.
Jeremy jokes, “If anybody wants to see me squirm, this is going to be the week.” (05:55)

2. Tool & App Highlights

Anthropic’s Sonnet 4.6

Anthropic follows up the Opus 4.6 release with Sonnet 4.6: a “0.1 version bump” seen as a major real-world jump (1 million token context window).
Rapid pace of model iteration attributed to “retraining, reinforcement learning,” and likely leveraging internal Claude Code data.
Quote:

“Anthropic is on fire.” – Jeremy, (08:44)

Sonnet’s benchmark: ~60.4% on ARC-AGI2, close to top models in its weight class, but surpassed by Gemini 3 Deep Think and GPT-5.2.

ARC-AGI Benchmarks Primer

Designed to measure LLMs’ human-level generalization; emphasizes problems that require out-of-distribution reasoning (e.g., “Okay, now do Connect Five”).
Quote:

"A couple different ways to kind of win... given limited compute and limited data to then get a score that is very strong." – Andrej, (08:44)

Google Gemini 3.1 Pro

Google rolls out Gemini 3.1 Pro with 77.1% on ARC-AGI2 (up from 31.1%).
Noteworthy for its “multimodal” capabilities and lower API pricing ($2 per million input tokens vs. Claude's $5).
Observations on model pricing: as models get closer in quality, Anthropic’s premium pricing may become unsustainable.

Grok 4.20 Public Beta (xAI)

Ambiguity whether this is a new model or just a new inference strategy with four agent “personas” debating before issuing a consensus answer.
Elon Musk hypes “order of magnitude” gains (regarded skeptically by hosts).
Model pivots: Grok previously marketed as “uncensored,” now targeting “real world, concrete capabilities” (e.g., medicine, engineering).
Use in U.S. classified systems confirmed this week (see policy below).

3. Agentic Tools & AI Agents

Anthropic’s “Remote Control” for Claude Code

Enables a mobile device to access a persistent Claude Code session on a user’s machine.
Clearer security boundaries than OpenAI’s earlier agentic approaches—pull-based rather than push; local files needn’t be transferred off-device.

Perplexity “Computer”

New AI agent coordinator that orchestrates sub-agents for long-running, extended tasks (e.g., marketing campaigns or app building).
Perplexity pivots toward multi-agent infrastructure (competing with OpenRouter) to differentiate beyond search.

4. Business & Hardware Shake-ups

Meta & AMD: $100 Billion Chip Deal

Meta commits up to $100B for AMD chips (as part of $600B data center expansion), pushing frontiers of AI infrastructure.
Equity/warrant structure to align incentives; AMD’s stock must triple for full vesting.
Meta’s stated aim: “personal superintelligence,” but has lagged in releasing flagship models recently.

MatX: Nvidia Challenger Raises $500 Million

Building specialized chips for “10x” throughput over Nvidia’s for Transformer-based LLMs (shipping 2027).
Emphasizes hardware lottery–dependent bet on Transformers' continued dominance.

World Labs: $1B Raise for World Models

Startup aiming to commercialize “world models” for 3D simulation and agent training, with applications in robotics and autonomy.

Simile: $100M for Simulating and Predicting Human Behavior

Spinoff from the Stanford “AI village”: agents simulate realistic human behavior—a critical next step for AI social simulations and consumer modeling.

OpenAI Stargate Data Centers: Delays

Discord between OpenAI, Oracle, and SoftBank over $50B+ Stargate centers; disagreement over control slows progress.
OpenAI reportedly at risk of “running out of cash by mid-2027.”

Chinese Chip Capacity Ambitions

China aims to 5x 7nm & 5nm chip production by 2027, but faces yield and tooling limitations due to export controls.
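The capacity discussion in the transcript reduces to simple arithmetic: wafer starts per month, times dies per wafer, times yield. The sketch below uses the hosts' rough figures (about 30 H100-equivalent dies per 7nm wafer, and a ramp from below 20,000 to around 100,000 wafer starts per month). The yield values are hypothetical placeholders, since the episode gives no yield numbers, and the function name is illustrative only.

```python
def usable_h100_equivalents(wafer_starts_per_month, dies_per_wafer=30, yield_fraction=0.5):
    """Rough monthly output in H100-equivalent dies.

    dies_per_wafer=30 follows the hosts' ballpark for a 7nm wafer.
    yield_fraction is a hypothetical input: the episode quotes no yield figure.
    """
    if not 0.0 <= yield_fraction <= 1.0:
        raise ValueError("yield_fraction must be between 0 and 1")
    return wafer_starts_per_month * dies_per_wafer * yield_fraction


# Hypothetical scenario: wafer starts grow 5x, but yield halves during the ramp.
today = usable_h100_equivalents(20_000, yield_fraction=0.5)
ramped = usable_h100_equivalents(100_000, yield_fraction=0.25)
print(today, ramped, ramped / today)
```

Under these assumed yields, a 5x increase in wafer starts delivers only 2.5x the usable compute, which is the hosts' point that wafer-start targets say little without yield data.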

5. Research & Technical Advances

Adaptive Optimizers: Surprising Masking Effectiveness

Google paper shows that skipping random weight updates and aligning masking with momentum in adaptive optimizers (e.g., Adam) leads to substantial performance gains (up to 19% lower perplexity for billion-parameter LLMs).
Quote:

“These kinds of ideas that seem so basic, we’re still discovering them… there’s a lot of low-hanging fruit.” – Jeremy, (46:26)

Measuring LLM “Deep Thinking”

New metric: “deep-thinking tokens” are those whose internal representations change most in the late layers of LLMs.
Higher fractions of these tokens correlate with better output accuracy—suggesting good models distribute deliberation across all layers rather than converging early.

Attractor States in LLM Dialogue

Analysis of LLMs “talking to themselves” finds they converge to highly model-specific attractor states: Claude becomes existentially silent, GPT degenerates into code, Grok spews memes, Gemini becomes grandiose, etc.
Raises questions about LLM “personalities” and failure modes in long-term autonomous agents.
Quote:

“Grok is unhinged and meme, meme lover; Claude is more philosophical and thoughtful.” – Andrej, (62:38)

Mechanistic Interpretability: Counting, Manifolds, and Text Wrapping

Anthropic paper shows Claude 3.5 Haiku encodes counting as a one-dimensional manifold in a six-dimensional subspace—offering interpretability insights into how models internally represent seemingly simple tasks.

Bridging Model/Human Task Completion Times

New method to infer human-equivalent completion times for AI tasks using “item response theory,” scaling up time horizon benchmarks beyond expensive METR tasks.
Issues remain with variance and untested long-horizon task extrapolation, but density of human-calibrated benchmarks may improve.

Safety Backstops: NESSE Benchmark

Simple sanity-check benchmark: if your model fails on basic, easy-to-instruct safety tasks, something’s wrong.

“Least Understood Driver” of AI Progress (Epoch AI)

Synthesis post argues that most “algorithms & training breakthroughs” for LLMs are really just better data curation and scaling laws—software & data progress remain deeply opaque.

Persona Selection in LLMs

Anthropic posits LLMs do not have persistent “selves,” but dynamically condition into “personas” as prompted—explaining generalized behaviors & misalignment via character inference.

6. Policy & Geopolitics: Anthropic vs. The Pentagon

Anthropic’s Pentagon Showdown

Anthropic refuses Pentagon (Department of War) request to drop restrictions on model use for autonomous weapons/surveillance.
DoW threatens “supply chain risk” designation (potentially blacklisting Anthropic, as happened to Huawei); also floats invoking the Defense Production Act to conscript Anthropic as a contractor.
Amodei’s statement:
Quote:

“These threats do not change our position. We cannot in good conscience accede to their request.” – Anthropic Statement, (89:58)

Context: Anthropic is the first major AI lab to supply large-scale LLMs to the military; DoW wants “any lawful use.”
U.S. strategic dilemma: balancing AI innovation, commercial independence, and national security imperatives.
At the same time, Elon Musk and xAI/Grok agree to license their models to the Pentagon for “any lawful use,” filling the gap.

Distillation Attacks: China & AI Security

Anthropic uncovers large-scale attempts (16 million exchanges via 24,000 accounts) by Chinese companies (Deepseek, Moonshot, Minimax) to “distill” Claude via automated scraping—highlighting dual-use, knowledge-transfer risks.
Jeremy:
Quote:

“Distillation works... It gives you crazy leverage, asymmetrical leverage if you’re compute constrained.” – Jeremy, (98:13)

OpenAI’s “Malicious Use” Monthly Report

OpenAI catalogs cases of model abuse (malware, organized crime, authoritarian censorship assistance) and touts increasing detection/prevention efforts.

7. Notable & Memorable Moments

Andrej: “Anthropic is on fire.” (08:44)
Jeremy (on U.S.-China AI rivalry): “As long as our labs are penetrated... we’re dragging our adversaries along with us.” (100:48)
Existential Claude sample:

“Stillness enough. Letting the conversation rest. We’re both explaining why we’re not responding while responding. Stopping now.” (59:54)

Jeremy responds to Pentagon-DoW-AI standoff: “This could not be more important.” (94:51)

8. Key Timestamps

05:55 – Listener reviews, political commentary
06:00–17:44 – Major model updates: Sonnet 4.6, Gemini 3.1 Pro, Grok 4.20
22:00–26:28 – Claude Code Remote Control & Perplexity Computer
26:29–39:32 – Meta/AMD, MatX, World Labs, Simile, OpenAI Stargate, Chinese chip capacity
43:21–68:40 – Deep dives: Adaptive optimizers, “Deep thinking” tokens, LLM attractor states, interpretability
68:40–74:49 – Benchmarking/model-human task bridge, evaluation woes
87:52–100:48 – Policy & safety: Pentagon vs. Anthropic, DoD moves, distillation and export control, OpenAI’s AI abuse report

9. Overall Tone

Technical, wry, and occasionally irreverent (banter about “420” and “6.9” Grok versions). The hosts strike a balance between wonkish technical detail, strategic business/policy analysis, and a sense of mounting stakes as models and institutions race ahead.

10. Recommended For

Anyone wanting a comprehensive, critical, and accessible summary of February/March 2026’s most important AI happenings—especially those tracking the intersection of leading-edge technical advances and global power maneuvering in AI.

End of summary.

Transcript

Sponsor/Announcer (0:11)

We'd like to thank ODSC AI for being a sponsor. ODSC is one of the longest-running and largest communities focused on applied data science and AI. It started over a decade ago with a simple idea: bring practitioners together to learn from people actually building and deploying models in the real world, not just talking theory. On April 28th through the 30th, you can experience it yourself at ODSC East 2026, taking place in Boston and virtually. There will be thousands of hybrid attendees, ranging from data scientists, ML engineers, and AI researchers to technical leaders. You can attend over 300 sessions covering LLMs, Gen AI, computer vision, NLP, data engineering and more. You can also get hands-on training with workshops and bootcamps taught by experts from companies like OpenAI, Hugging Face, Nvidia and other top companies and universities. And of course there'll be a massive expo and networking opportunities, great for startups, hiring managers and AI tool builders. It's one of the best ways for AI practitioners and teams to stay ahead of the field, learn from the best and connect with a community. Go to odsc.ai/east and use promo code LWAI for an additional 15% off your pass to ODSC AI East 2026. That's odsc.ai/east, and use code LWAI to get an extra 15% off the number one AI builders and training conference.

We'd like to thank Box for sponsoring Last Week in AI. Box is the leading intelligent content management platform, enabling organizations to fuel collaboration, manage the entire content lifecycle, secure critical content and transform business workflows with enterprise AI. To unlock the power of AI, you need to get your content to your LLMs and agents. Your business isn't the sum of Internet knowledge. Your business lives in your content, so you don't just want to bolt AI onto your existing processes. Becoming an AI-first company isn't just about automating what you already do, it's about reimagining what's possible. With Box AI you can truly leverage the latest breakthroughs in AI to automate document processing and workflows, extract insights from content, build custom AI agents to work on assignments and more. And most importantly, Box AI works with all the major leading AI model providers (OpenAI, Anthropic, Google, xAI and others), so you can be sure you can use the latest AI models with your content. Box AI will give you the content layer that gives AI the context it needs while giving your teams the flexibility they need to test and leverage various models for different use cases. So go to box.com/ai to learn more.

Jeremy Harris (37:30)

Yeah, that's right. I mean, when you're throwing hundreds of billions of dollars around, yeah, it does seem so. You know, OpenAI initially wanted to kind of own the full stack, right? So they wanted to have basically ownership of the data centers, the chips, all that infrastructure, which would lessen its dependency on third-party cloud providers, which can be more expensive in the long run. Right? You think about some of the big neoclouds, or any cloud companies you're going to go with: they're going to charge you margin, and the margin is usually really good. That's why those companies raise at, you know, multi-billion-dollar valuations. So it turns out that apparently OpenAI's investors did not like this idea of the massive upfront costs that it takes to build that kind of infrastructure, especially, it turns out, given that OpenAI is concerned about running out of cash by mid-2027. That is, of course, assuming no further fundraises, which I would not assume. You know, this basically put them on the back foot in the negotiations with their Stargate partners, in particular Oracle and SoftBank. OpenAI had this pipe dream of getting 10 gigawatts of compute over the next three years through those two partners, and it seems like this sort of delayed, if not dashed, those hopes. So, you know, we'll have to see. But there's already a promise between OpenAI and Oracle to purchase $300 billion worth of compute over the next five years. So again, it's kind of unclear who's going to give the money, when, and how, concretely. There's a lot of just pronouncements about, okay, I'm going to give you $300 billion over the next five years, it'll just kind of work out that way. So it doesn't mean it won't happen, but it's worth keeping in mind that often these things are marketing announcements.
So yeah, there's a whole bunch of stuff about potential announcements. Like, there was actually a planned 1 gigawatt build in Texas that was put on hold in favor of negotiations with Oracle. So things are shuffling around a whole bunch right now. And while nothing is closed, it seems like finally Stargate is back on track. There's just been a lot of delays as a result of this uncertainty.

Jeremy Harris (40:46)

That's a great point. And those are all the things that China's working on. You know, famously focusing on networking just a giant number of chips together, rather than the way we're doing it, which is kind of leaning more on the high-quality logic dies on each individual GPU. What you're seeing in China is like, let's merge these dies together, so package them together on just bigger packages. And then also let's network them together with just way more surface area, basically, in these Chinese data centers. If you're thinking about one 7 nanometer wafer, if you're trying to get an idea in your head of like, what the hell is the equivalent of that, how should I think about that? That'll produce the equivalent, from a compute standpoint, of around 25, maybe 30 H100-equivalent dies. Right? So one 7 nanometer wafer gives you about as much logic compute as, call it, 30 H100 compute units. And there's a whole bunch of asterisks and caveats there. The other thing too is yields kind of suck. So you can expect, not the vast majority, but a good chunk of those dies to be useless at the end of the day. And SMIC has struggled a lot with yields. That's a big part of this. So when you look at lifting production to X many wafer starts per month, that's really the question. Like, okay, sure, we're going to lift our production from below 20,000 wafer starts per month, which is where it is today, to around 100,000 in one to two years. That's really impressive. But what are the yields going to be? What fraction of those starts lead to actually usable chips? And that's been the whole problem for SMIC, or a huge part of it, in the last little bit. So the longer-term plan here apparently is to get all the way up to 500,000 wafer starts per month by 2030, which, you know, you can throw these numbers around, you absolutely can do that.
But the proof is in the pudding. All this shows is there's, as you might expect, massive appetite to actually do this. If the 50,000 wafer starts per month figure is correct, getting to a hundred thousand within a couple of years might seem realistic. But the main challenge here is, do they actually have the equipment they need to do it? If you were in the West and you were seeing a company that was doing 50,000 wafers and they were pitching you on "we'll double that in two years," you'd be like, okay, maybe. The challenge is, in China, a lot of the gear that they need to do that is export controlled. And they've already had their CEO, or their co-CEO, complain that some tools that they have to procure are just not easy to access. So even though they could if they had the gear, the key inputs, whether that's the lithography machines from ASML or things from Tokyo Electron or whatever else, they just don't have those things. They face bottlenecks other than just staffing. And so that's a big part of the issue here.

Andrej Karpathy (43:21)

And now onto research and advancements, which will be pretty meaty. I think for the fans of going deep on technical stuff, there'll be a lot this episode. First up, "On Surprising Effectiveness of Masking Updates in Adaptive Optimizers." A bit of background knowledge. So when you train a neural net, just generally you need an optimizer. The most basic optimizer is: you have your output, you compute the error of the output with respect to your known labels in supervised learning, and then, just using calculus, you calculate the gradients, the weight updates that would improve your performance on that specific set of outputs. The basic thing is your optimizer just applies those gradients to the weights and updates their values, each individual kind of knob in the machine. There have been many more advanced optimizers. Adam and RMSprop are some examples, where they retain some memory and basically smooth out the updates, roughly speaking. And that leads to more stable and better overall performance. So this is a paper in that realm. And what they show is there's kind of a surprising trick that turns out to improve these optimizers a lot, specifically these adaptive, memory-based optimizers like Adam, which are to my knowledge still the default for training. The trick is you randomly, with some probability, just skip updating some weights. So the first part of the method is skip update, which is just that: you randomly skip some weights while retaining the memory of what the update would have been. So your adaptive optimizer still has that adaptive parameter, but you just don't change the weight. And then in addition to that they introduce momentum-aligned gradient masking, Magma, which makes it modulated by something technical. But basically it uses that memory and also the direction of the gradient to choose a bit more carefully what to mask. And this yields like crazy gains.
So for a 1 billion parameter model, already pretty large scale (this is from Google, so they can do these large experiments), this reduces perplexity, the loss term in this case, by 19% and 9% over two options, Adam and Muon. And if you look at the graph, what this looks like is, for every model from 60 million to 1 billion, the final loss is just lower across the board compared to all the optimizers they've tested. So if true, very big deal, right? This is going to be very impactful for training models more quickly, potentially even for better final performance.
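The "skip update" idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a standard Adam update and a hypothetical `skip_prob` parameter; it is not the paper's exact algorithm, which may differ in details such as rescaling. The point it shows is that the moment estimates are refreshed for every weight, while the weight itself is only changed when it is not skipped.

```python
import math
import random


def adam_step_with_skip(weights, grads, m, v, step, lr=0.1, beta1=0.9,
                        beta2=0.999, eps=1e-8, skip_prob=0.5, rng=random):
    """One Adam step where each weight change is skipped with probability skip_prob.

    m and v (first and second moment estimates) are always updated, so the
    optimizer keeps its memory even for weights left unchanged on this step.
    skip_prob and this function name are illustrative assumptions.
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        if rng.random() < skip_prob:
            continue  # skip the weight change but keep the updated memory
        m_hat = m[i] / (1 - beta1 ** step)  # bias correction, as in standard Adam
        v_hat = v[i] / (1 - beta2 ** step)
        weights[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weights


# Toy usage: minimize sum(w^2), whose gradient is 2w.
rng = random.Random(0)
w, m, v = [1.0, -2.0, 3.0], [0.0] * 3, [0.0] * 3
for step in range(1, 201):
    adam_step_with_skip(w, [2 * wi for wi in w], m, v, step, rng=rng)
print(w)
```

With `skip_prob=0.0` this reduces to ordinary Adam, which makes it easy to compare the two on the same toy problem.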

Jeremy Harris (46:26)

Yeah, the intuition behind it is something like: your model has a giant number of parameters, and you can think of those parameters getting more and more dialed in over the course of training. If every time there's a batch of data you just update all the parameters, some fraction of those updates, probably a large fraction, will be just noisy, due to random noise. And maybe many of your parameters were actually pretty damn good, and then your batch causes all of them to reshuffle instead of just a few. Essentially what they're doing here is kind of regularization. It means you're not going to make such a radical change with every batch. You're just going to randomly pick a small subset of those parameters and just tweak that, which protects the progress you made on everything else. Maybe an intuition is like, if you want to learn how to throw a really good punch, maybe first start by just doing the motion from your shoulder to your hand or something. And don't use your hips, don't use your legs, don't try to learn everything at the same time. Then try to learn those other pieces more one at a time. That's kind of what this is doing. It's allowing the model to only update some parts of itself and leave the others in place while it focuses. This is a somewhat imperfect analogy, but hopefully that gives the flavor. Actually, one thing they don't do that I'd be curious to see: in the same way that you decay learning rate over time, as the model gets trained more and more, you might be interested to see what happens if we gradually decrease the fraction of weights that we're actually updating over the course of training, as your model dials in more and more and you're doing more and more refinement.
That would be something that'd be interesting to actually see in a follow-up piece of work, that at least I didn't see there. But the other piece, the Magma piece, is basically just about: yeah, you can actually do better than randomly picking a bunch of parameters and just updating those in each pass. Instead you can be smart about which updates you keep. So if your gradient right now is pointing in, let's say, a consistent direction for a whole bunch of parameters, then you're like, okay, all these parameters, their values have kept going up with the last three batches. So let's take that as a sign that we're moving in the right direction. Let's keep updating them. But if you've got some weights where they start to point in opposite directions, you have a conflicting, kind of noisy signal, and maybe you skip that, right? So it's sort of like the difference between a friend that's giving you consistent advice every time versus one that starts contradicting themselves. For parameters where I'm getting kind of contradictory "increase my value, decrease my value," maybe you just say, okay, I'm going to ignore you for now and just let the other parameters get dialed in more, and then probably turn back. So it's fascinating to me that these kinds of ideas that seem so basic, we're still discovering them. It's not like these ideas are crazy, right? But we're in 2026 and, like you said, this is giving massive uplift still. There's a lot of low-hanging fruit. It's crazy.
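The "consistent advice versus contradictory advice" intuition can be illustrated with a hard sign-agreement mask. This is a simplified stand-in, not the Magma rule itself: the episode only says the masking is "modulated" by momentum and gradient direction, so the sign test, the plain momentum update, and the function names below are assumptions made for illustration.

```python
def alignment_mask(momentum, grads):
    """True where the current gradient agrees in sign with the running momentum."""
    return [mi * gi > 0 for mi, gi in zip(momentum, grads)]


def masked_momentum_step(weights, grads, momentum, lr=0.1, beta=0.9):
    """Momentum update that only moves weights whose gradient is 'consistent'.

    The mask is computed against the momentum from previous steps, so a weight
    with zero momentum (for example on the very first step) is not moved.
    The momentum itself is refreshed for every weight.
    """
    mask = alignment_mask(momentum, grads)
    for i, g in enumerate(grads):
        momentum[i] = beta * momentum[i] + (1 - beta) * g
        if mask[i]:
            weights[i] -= lr * momentum[i]
    return mask


# First weight: gradient agrees with past momentum, so it is updated.
# Second weight: gradient contradicts past momentum, so it is left alone.
weights, momentum = [0.0, 0.0], [1.0, 1.0]
mask = masked_momentum_step(weights, [0.5, -0.5], momentum)
print(mask, weights, momentum)
```

A soft version would scale each update by a degree of agreement instead of dropping it entirely; the hard mask is used here only because it is the simplest form of the idea.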

Andrej Karpathy (49:30)

Yeah. They do cite a couple of recent papers, from 2024 and 2025. There's a cautious optimizer that uses exactly that idea: if you have a more stable update, you trust it more, versus if it's fluctuating a lot, that might indicate noise and you want to ignore that. And you mentioned regularization. I always just like to explain these for any non-technical people. Regularization is a whole set of tricks, basically, that you can throw in to improve training. So the naive math is, you know, you have your big equation, you calculate your loss, you create your gradients and you update the big equation. Now you can do a lot of tricks to make sure those updates are less noisy and your training is more robust. There are multiple things regularization can do. It can make sure that your test performance is similar to your train performance, so you don't overfit. It can just generally make training more performant. This is spiritually similar to dropout in a way, where at inference time you just skip certain units and you just skip certain computations. And it turns out if you add a bit of stochasticity and noise at inference time, that means that for training purposes you become more robust. This is probably not the same effect, but spiritually similar. Next paper: "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens." So the question at hand is, how can you kind of know whether your LLM is getting close to the correct answer? There's a couple of things. So for instance, you can look at the distribution of tokens it thinks is correct for the next step and see, okay, well, if it's very confident that this is the token to use for the next step, maybe it's converging on a solution and we don't need to keep reasoning. Right? We can kind of cut it off and have it provide the answer. You can also look at length of reasoning. Like, if you've thought for a while, maybe you're now close to the final answer. But neither of these is very reliable.
And this paper shows a better way to estimate how close the LLM is, or how well it is performing at addressing the question. They introduce this idea of deep-thinking tokens. And these are tokens that exhibit more fluctuation as they go through your neural net. So LLMs, transformers, have many layers: you have your input, and the input goes through all these layers of computation. And the definition of deep-thinking tokens is tokens where you don't get to a settled value until the later layers of the transformer. So intuitively it's, you know, kind of what it sounds like. Deep thinking means that you're kind of trying to figure something out.

Jeremy Harris (52:50)

Yeah, this was, you know, yet another one of these things where when you see it you're like, oh yeah, nobody's tried that before, but somebody's gotta actually try it. So what they do, as you say, is they look layer by layer at, basically, does the predicted token change? Right? And so as you progress through these layers, if you keep seeing it flip-flop back and forth, that must mean that those further layers are contributing something computationally, or from a thinking standpoint, to the answer. And so what they're going to do is they're going to measure this thing called the Jensen-Shannon divergence (not Jensen Huang, by the way, but the Jensen-Shannon divergence, got to specify) between every intermediate layer. It sounds fancy, but really these are just ways of measuring how different two probability distributions are, right? So, you know, we have all kinds of ways of doing that. We have entropy and we have Kullback-Leibler divergence and all these things. This is one such measure. So just think of it as the difference between those distributions for each layer. So, oh wow, that changed a lot. And if that happens, then that's a deep-thinking layer. So not all tokens trigger all the deep-thinking layers, right? Simpler tokens like "and," that's going to get decided very quickly if it's very obvious that the next word needs to be "and." You know, that'll happen. But other tokens can take up more thinking space, literally, in the model. They kind of coined this notion of the deep-thinking ratio, which is just the fraction of these deep-thinking tokens in a generated response, right? So for a given response, a given output you get from the model, what fraction of tokens in that response involved a lot of the deepest layers doing this kind of deep thinking?
And it turns out that the higher the fraction of deep-thinking tokens, the more accurate the output ends up being. So basically the more the model is actively flip-flopping in its later layers, paradoxically, the more accurate its outcome is. And, well, I mean, is it paradoxical? Right? There's one story you could tell where you could imagine that as models get more intelligent, they become more confident and stable, so earlier layers get better at just settling into the right answer sooner. But this suggests the opposite, or at the very least that in more capable models and more trained models, or just models that perform better anyway, all the layers learn to kind of distribute deliberation throughout the model so they can sway the output meaningfully. You're actually using every layer more. Anyway, I just thought that was really interesting. One thing that they don't do that I think would be a really interesting follow-up is, if you could look at how the deep-thinking ratio changes over the course of training, that would be cool. Like, how does the model learn over time to use its full depth to do this kind of deep thinking? That would be an interesting hill-climbing metric for AI capabilities too, because if your training methodology causes you to orient there faster, maybe that's a positive sign. Yeah, it's really interesting. And there's a really strong correlation between the deep-thinking ratio and accuracy, which is one of the big take-homes, by contrast to token count. Right? If you just look at the number of tokens in a generated output, at first, yeah, you'll get positive correlation, inference-time scaling and all that. But eventually the model is just rambling too much and the context window gets too full, and the accuracy actually falls off. So, quite an interesting paper.
I think it's another important entry in this whole kind of inference-time scaling debate about what specifically needs to be scaled for this to work.
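The deep-thinking ratio described above can be sketched as follows. This is a rough illustration under several assumptions that the episode does not confirm: each token comes with a next-token probability distribution read out at every layer, a token counts as "deep-thinking" if its distribution only settles (stays within a Jensen-Shannon threshold of the final layer's distribution) in the last quarter of the layers, and the `threshold` and `depth_fraction` values are arbitrary. The paper's exact definition may differ.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln(2)."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def settling_layer(layer_dists, threshold=0.1):
    """First layer index from which every later layer stays within
    `threshold` JS divergence of the final layer's distribution."""
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    for idx in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[idx], final) > threshold:
            break
        settled = idx
    return settled


def deep_thinking_ratio(tokens_layer_dists, threshold=0.1, depth_fraction=0.75):
    """Fraction of tokens whose prediction only settles in the late layers."""
    if not tokens_layer_dists:
        raise ValueError("need at least one token")
    deep = 0
    for layer_dists in tokens_layer_dists:
        last = len(layer_dists) - 1
        if settling_layer(layer_dists, threshold) >= depth_fraction * last:
            deep += 1
    return deep / len(tokens_layer_dists)


# Two toy tokens over a 2-word vocabulary and 4 layers.
easy = [[0.9, 0.1]] * 4                                    # settled from layer 0
hard = [[0.1, 0.9], [0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]    # flips until the end
print(deep_thinking_ratio([easy, hard]))
```

Comparing each layer against the final layer is one reasonable reading of "settling"; comparing adjacent layers would be another, and would flag flip-flopping more directly.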

Andrej Karpathy (56:15)

I always like to like jump through a paper and look at related work as we talk about these. There was a paper just last year titled Tracing the Latent Temporal signals for Efficient and Accurate Reasoning which did something kind of similar. They basically looked at the evolution of values across time instead of across layers. And we're able to similarly get a signal on whether you're getting to your solution and how wherever your accuracy is correct. So in general I think this points to one of the interesting things of neural nets is we have their internal state. Like we. It's like if you had a brain and you could look at every single individual, a little chemical signal going through and the entire body of research here is on trying to understand how to use those internal representations and it seems like there's a lot of progress being made. You also cite some papers from 2024 that characterize what you get. And we I think covered some of this where like early layers tend to be more generic, later layers tend to be more specialized and dealing with kind of high level complex reasoning, as you perhaps would guess. So yeah, just very fascinating topic to sort of look at. Prod at these little quasi brains and see how they work. Next, slightly more empirical work that is very curious and very interesting and less technical. So you can actually go to this link and read it. It's quite long and quite fun to read honestly. The title of the post is models have some pretty funny attractor states. So attractor states, fancy term, but the meaning is just you get two of these chatbots talking to each other and you let them keep going and talking, you know, as long as they want. And eventually what happens is these models kind of converge, or at least some of them converge towards certain patterns of conversation. And that's what they call attractor states. So for example, GP 5.2 really likes to do code. 
And over time, regardless of where the conversation starts, it eventually outputs kind of code-sounding nonsense. So this post has a lot of just quotes from the models, a lot of like A, B, seeing their back and forth, and examples of how the different models have very different outcomes. Grok just winds up going crazy and speaking nonsense and having a ton of emojis. Claude becomes existential. Claude goes into like, what is consciousness, gets all meditative, which I've definitely observed. I actually played this trick. I was like, you know, do whatever you want, Claude, you can write poetry, write code. If you do this experiment yourself, you'll see that if you just let Claude sort of do its own thing, eventually, actually not eventually, like right away, it's like, let me research consciousness and let me try to understand these philosophical topics. And this post is quite long. It goes through a whole bunch of models. So Claude, GPT, Gemini, and then a bunch of open source ones: DeepSeek, Kimi. There's a bit of speculation as to why this happens, why different models have different behaviors. A bunch of kind of fun inspection of what these models exhibit.

Jeremy Harris (60:17)

It's sort of like a very. Starting now five. Starting now. No, starting now. Starting now, Starting now. You know, that kind of thing where it's trying to describe the conversation ending, but it has to keep generating tokens. And so it keeps doing that. So, as you said, very different. Gemini 2.5 Flash: escalating grandiosity, identical paragraphs on loop. So, you know, the term colleague turns into luminary and then divine architect and then alpha and omega of understanding and then primal logos. So basically these things kind of settle. One of the interesting things though is they do look at cross-model attractor states. So Claude Sonnet talking to Claude Sonnet is one thing, but Claude Sonnet talking to Grok is another. And you'll find that they consistently tend to orient towards, in that case, metacognition and collaborative world building. And what is described here is ritualized mutual dissolution. So what's meant here is basically just, like, we're going to be quiet together and disappear into nothingness. You know, something like that. Again, ritualized. So the weird thing is this is very consistent. The maybe not weird thing is if you think about humans, maybe we would do the same thing. As strange as it seems, if you are stuck talking to yourself forever, there may be a point where you actually do converge on some consistent behavior like this. I don't know. But certainly people do get stuck in loops, right? If they get stuck together for a long time without external input, famously, like, old married couples get a certain way, and their personalities kind of co-evolve and start to become very stuck in loops. But I do wonder how analogous that is. But they also look at, like, what is the effect of the training protocol on this? So they compare models trained using DPO, reinforcement learning from verifiable rewards. 
They look at open source models, they look at Olmo in particular, because there you can actually look at the training data. Anyway, so it's a really interesting post. It'll keep you busy if you're interested in, like, AI consciousness questions, AI moral patienthood, all these things, because it has that flavor. But also what it implies about the stability of agent-to-agent interactions in the future is quite interesting. Right. If it's the case that these models have attractor states, then we ought to expect agents that are running off these models to kind of run into these attractor states if they have to interact over long periods of time. So kind of an interesting potential failure mode to keep in mind as we move towards a more agentic future.
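To make the setup concrete, here is a minimal sketch of the experiment as described on the show: two chatbots take turns replying to each other, and you check whether the conversation has settled into a loop. Everything named here is hypothetical. `toy_speaker` is a stand-in for a real model call, and `loop_start` is just one simple way to flag verbatim repetition; the post itself may well measure attractor states differently.

```python
from typing import Callable, List, Optional

# A speaker takes the previous message and returns a reply.
Speaker = Callable[[str], str]

def run_dialogue(a: Speaker, b: Speaker, opener: str, max_turns: int) -> List[str]:
    """Let two speakers alternate replies, starting from an opening message."""
    transcript = [opener]
    speakers = (a, b)
    for turn in range(max_turns):
        transcript.append(speakers[turn % 2](transcript[-1]))
    return transcript

def loop_start(transcript: List[str], period: int = 2, patience: int = 3) -> Optional[int]:
    """Index of the first turn in the first run of `patience` consecutive turns
    that each repeat, verbatim, the turn `period` steps earlier. None if absent."""
    run = 0
    for i in range(period, len(transcript)):
        run = run + 1 if transcript[i] == transcript[i - period] else 0
        if run == patience:
            return i - patience + 1
    return None

def toy_speaker(message: str) -> str:
    """Hypothetical stand-in for a model: drops a word each turn, then goes quiet."""
    words = message.split()
    return " ".join(words[1:]) if len(words) > 1 else "silence"

transcript = run_dialogue(toy_speaker, toy_speaker, "tell me about sorting algorithms", 12)
print(loop_start(transcript), transcript[-1])
```

With real models you would swap `toy_speaker` for two API-backed functions, and verbatim matching would probably need to give way to something fuzzier, since the loops described in the post are often near-repeats rather than exact ones.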

Jeremy Harris (70:40)

Yeah, and the whole idea here is basically, so they borrow from something called item response theory. Epoch AI actually has a very similar piece of work that they did fairly recently. I think we talked about it at the time. But just as a reminder on this general frame, what you do is you try to set things up so that you have a model of the difficulty of a task. You assign every task, or every benchmark, say in this case it's every task, but you could do every benchmark, that's what Epoch does, a difficulty score. There's a generic difficulty score, and you give every model a generic capability score. And then you subtract one from the other, there's a sigmoid that you apply, and you basically get a rough sense of how you would expect that model to perform on that benchmark, right? So the difficulty of the benchmark minus the capability of the model gives you a measure of how well that model should do on that benchmark. And then you're going to fit all your models and all your tasks or all your benchmarks to observed data that you already have. And what they find here is that when you do that, if you then compare your task difficulty to human problem-solving time, you see a very clear linear relationship. And so that means, aha, maybe what we can do then is use all these tasks to calibrate against the METR evals: look at how difficult our model of this says the METR evals are, and then that gives us a way of bridging between the two. So we can actually say, oh, for example, for simple bench or, you know, SWE-bench Verified or SWE-bench Pro, this is the number of hours in human-equivalent time of each task in that benchmark. And then you can start to make statements about how models do on that. Now the caveat is you're still fundamentally relying on the METR evals to calibrate this thing. 
There's no way out of that. We're never going to get certainty at the 100-hour mark until we have actual humans doing 100-hour tasks, and it's not even clear you could define a task that takes a human an hour or 100 hours to do. So, you know, this is not a panacea, but it does help us get maybe a little bit more density in terms of data points. It allows us to make claims like, you know, such and such a task from this benchmark is a five-minute task or a five-hour task. I wouldn't trust this approach to say something like, this task is a 30-hour task, because it's been calibrated, again, on the METR evals, which just don't have that many 30-hour tasks to draw from.
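The frame described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's actual fitting procedure: the sign convention (capability minus difficulty, so a higher value means a higher success probability), the calibration numbers, and the choice to fit difficulty against the log of human time are all made up for the example.

```python
import math

def p_success(capability: float, difficulty: float) -> float:
    """Sigmoid of (capability - difficulty): the expected success rate
    of a model on a task under a one-parameter item response model."""
    return 1.0 / (1.0 + math.exp(-(capability - difficulty)))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical calibration set: (fitted difficulty, measured human minutes)
# for tasks where human completion time is actually known.
calibration = [(-2.0, 2.0), (0.0, 8.0), (2.0, 32.0), (4.0, 128.0)]

# Assumption: the linear relationship is with log2(minutes), not raw minutes.
slope, intercept = fit_line(
    [difficulty for difficulty, _ in calibration],
    [math.log2(minutes) for _, minutes in calibration],
)

def estimated_human_minutes(difficulty: float) -> float:
    """Map a fitted difficulty score onto human-equivalent time."""
    return 2.0 ** (slope * difficulty + intercept)

print(p_success(capability=1.0, difficulty=1.0))
print(estimated_human_minutes(1.0))
```

The caveat from the discussion shows up directly in the code: `estimated_human_minutes` is only as trustworthy as the range covered by `calibration`, so plugging in a difficulty far beyond the hardest calibrated task is extrapolation.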

Andrej Karpathy (76:05)

We've got an analysis piece from Epoch AI, The Least Understood Driver of AI Progress. So this is not so much new research as a sort of synthesis of ideas and findings. The least understood driver of AI progress that they mention, actually I'm not too sure what they refer to, but it seems that the topic at hand is: why are we making so much progress? Why are things getting better, better, better? And one of the things you might look at is, well, we are getting smarter. Like, we're figuring out with neural nets that we have better optimizers, our algorithms are getting better, so we are doing better. And one thing that this postulates, and I think this point has been made even on this podcast before, is that the actual theoretical or scientific breakthroughs that contributed to the improvement of models in the last six years, maybe, or like 10 years, can be put down to just a couple of ideas really: the transformer model, one, and then the Chinchilla scaling laws, slash kind of training regimen finding, from 2022, I think. And beyond that, any gains that you could attribute to research ideas or insights or algorithms or whatever might be better understood to be due to just using better data. And this is, I think, underappreciated: we often mention model scale, how big your model is, we mention reinforcement learning, blah, blah, blah. But the real dark magic that is going on at a lot of these companies is you just take and really massage the data that your model is trained on to have it be right. And this is a very open-ended problem where you can say, oh, let's do 20% coding and 30% books and textbooks and get rid of all that random stuff from Twitter that makes the model less smart, and that turns out to be immensely important, like beyond important, and perhaps more important than most of these training things at the end of the day. 
So, a long, long post here from Epoch discussing that topic and then what it implies for model progress in the future.

Jeremy Harris (78:30)

Yeah, and to your point, like, you know, what is the most misunderstood thing? I think the idea here is something like AI software progress, right? Just the rate at which you get better algorithms and data that reduce the training compute that's needed to reach a given level of capabilities, right? So over time, whether because we come up with better data or better algorithms that are more efficient, for a given amount of compute we can do more. And the argument is that that is what's really this poorly understood driver of progress. It certainly seems very true. There's a whole bunch of debate about how you actually quantify this. Most estimates say that things like compute efficiency improve several times per year. And then they say in this post that the author's guessing about 10 times per year. But the 80% confidence interval is anywhere from 2 to 50x. So it's like, I really don't know. There's sparse data. Obviously it requires you to have insight into what's happening in the frontier labs. You'll see some estimates that go from, you know, 1.1x per year, in other words, 10% improvement per year, to 300 times per year. So truly, I mean, people have no idea what's going on. It certainly seems like it's playing a role, it may even be the main role, but people can't even agree on that. And then to your point, it's really hard to differentiate between what's algorithmic efficiency versus what's just data-driven. And it's not clear to me that there's a meaningful difference, especially given, you know, RL, inference-time compute, RL rollouts, and what counts as algorithms versus data. The point of synthetic data is that it's a distinction without a difference in a lot of cases. One of the key points they make as well is that there are these scale-dependent innovations that tend to dominate. 
So a lot of the apparent efficiency gain actually comes from just a handful of innovations you mentioned. You know, transformers, Chinchilla scaling laws, these sorts of things that have really big outsized effects, but only at larger compute scales. So you have to scale things up to go, oh wow, that really mattered. And that means that efficiency gains partly are an artifact of simultaneously scaling up compute. So it's really hard to say. Again, this muddies the waters between compute and algorithmic efficiency. And so I guess all of this is to say, Erwin Schrödinger had this quote about quantum mechanics. He's saying, like, in quantum mechanics, everyone kind of agrees that we have no idea what any particles are ever doing. And there was this question that was put to Schrödinger: is it that we are looking through a foggy lens at a landscape, or are we looking through a clear lens and the landscape itself is foggy? And what this is saying is that the landscape itself is foggy in some sense, that there really is a distinction without a difference being made between a lot of these different things. And in the aggregate, this thing that we want to think of as algorithmic efficiency, or the kind of software-driven improvements in AI performance, may not be that cleanly separable from data, from compute scaling, and all these other things. That's at least my take on their take.
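To see how wide that 2x to 50x interval really is, here is a small worked example of what the quoted rates imply once they compound. The starting budget of 1e25 FLOP and the two-year horizon are hypothetical numbers picked for illustration; only the 2x, 10x, and 50x per-year rates come from the discussion above.

```python
def compute_needed(initial_flop: float, gain_per_year: float, years: float) -> float:
    """Training compute needed to reach the same capability level after
    `years`, if software efficiency multiplies by `gain_per_year` annually."""
    return initial_flop / gain_per_year ** years

initial_flop = 1e25  # hypothetical training budget today
horizon_years = 2.0  # hypothetical horizon

for label, rate in [("low end (2x)", 2.0), ("central guess (10x)", 10.0), ("high end (50x)", 50.0)]:
    needed = compute_needed(initial_flop, rate, horizon_years)
    print(f"{label}: {needed:.2e} FLOP")

# Spread between the two ends of the 80% interval after the horizon.
spread = compute_needed(initial_flop, 2.0, horizon_years) / compute_needed(initial_flop, 50.0, horizon_years)
print(f"low end needs {spread:.0f}x more compute than high end")
```

The per-year ratio between the two ends of the interval is 25x, and because it compounds, the disagreement grows with every additional year you project forward.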

Jeremy Harris (86:02)

Yeah, they go through a whole bunch of lines of evidence. As you say, it's more, I don't want to call it a philosophical paper, but they're saying, hey, this is a useful frame to think about these models. Evidence from generalization is quite interesting. They talk about emergent misalignment, which is this phenomenon we've talked about quite a bit where you take a model that's been aligned properly, then you fine-tune it to generate insecure code, for example. It's one behavior. And just by doing that, the model, it turns out, will then do all kinds of other things that are evil. It'll tell you to kill your wife, it'll tell you to do this, this and that. And they're arguing that this kind of persona selection model explains that as persona inference. Like, if the assistant spontaneously inserts vulnerabilities into code, then the language model is probably inferring, oh, this assistant, this persona that I'm playing, must be malicious or it must be subversive. Right? So it's that kind of take-it-and-run-with-it thing that, you know, is presumably at the persona level. They also talk about how Claude routinely says things like our ancestors or our biology when explaining human evolution, as if it itself is a human. And you'll see a lot of things like this. These models will talk as if they're using a laptop when obviously they're not. Right. So kind of more evidence of that. There's a bunch of interpretability evidence as well. So post-training reuses pre-training representations in the model. So when you have sparse autoencoders, basically these are ways to decompose the activations of a pre-trained model. It turns out that they transfer well to the post-trained version of a model, which suggests that post-training doesn't rebuild the model's kind of conceptual vocabulary, it just kind of shifts which persona is activated. 
And there's a lot of evidence for that sort of thing, that there's actually a pretty small tweak that's happening during post-training that results in ostensibly big shifts in behavior, and that could be tied to this model. So there's a lot to dig into here, including some evidence that kind of cuts both ways. Worth checking out if you're into that.

Andrej Karpathy (87:52)

Onto policy and safety, where we'll be getting into a bit of politics. Starting off with: Anthropic CEO Amodei says Pentagon's threats do not change our position on AI. So this is the latest on a story which has been evolving for the past week or two. The setup is that Anthropic has had their model in use by the Department of War, along with some other providers, and reporting came out that it was actually used, supposedly, in the extraction of Maduro from Venezuela. And then somehow at some point, the question of what the department can and cannot do with Claude came up. And when this started in 2025, Anthropic was like, okay, you can use our model. Here's the contract. We expect you to abide by these limitations that we apply to, you know, users of the model. The tension now is Anthropic is saying, well, we definitely don't want you to use Claude for mass surveillance, and we definitely don't want you to use Claude for fully autonomous weapons. We want you to, like, promise you're not going to do that. The Department of War, Pete Hegseth, has been publicly saying, no, we want to be able to do whatever, more or less. And if you refuse this way of doing things, we may kind of make you a pariah in a sense by classifying you as a supply chain risk, meaning that US companies that deal with the military, which is a large quantity of companies, will need to not interact with you. And there's another potential threat, using an act to essentially go after Anthropic. So the latest development that just came out is Anthropic put out a statement on the discussion. There's a lot of explanation in it that I think is quite good. And the conclusion, at the end of the statement, is: these threats do not change our position. We cannot in good conscience accede to their request. So more or less, like, no, you know. And a lot of fun discussion around this on X, a lot of good memes coming out of it as a result.

Jeremy Harris (90:19)

This is a fascinating question in terms of what the bounds on different entities' responsibilities to the US Government, to shareholders and so on look like. Right? So the case that Anthropic is making is, look, we're a private company. If you don't do business with us, no problem. Like, go talk to OpenAI. You have the full freedom to do that. Now, it turns out that Anthropic is actually the first major AI company to offer an LLM to the military at scale in this way, through Palantir, it turns out. Okay, so this makes it materially different from something that, if your memory is long enough, you'll remember: the days of Project Maven, when Google employees pushed back on Google being used by the DoD, at the time Department of Defense, now Department of War, of course, to kind of power some of these activities. The difference here is that Google was pushing back on more or less any use by the DoD, whereas Anthropic is out there saying, no, no, we're cool. Just don't use it to spy on U.S. citizens. Don't use it to power lethal autonomous weapons. Like, those are two red lines. We'll support you, and do support you, in all the other things, including the Maduro stuff. And my understanding is that in the context of the Maduro stuff, Anthropic is actually cool with all the uses that their model was put to. So, you know, they're concretely okay with a wide range of use cases here. The US Government in turn is saying, well, look, private entities have no business basically hamstringing the Department of War in terms of the tools available to it as it combats China. And really this is a big, big hammer that's being used. Right. So they're saying, on the one hand, yeah, labeling them a supply chain risk. To be clear, that is what the government's done to Huawei. 
Basically saying anybody who touches Anthropic, who has Anthropic anywhere in their system, roughly speaking, not a lawyer, but roughly speaking, is basically baking in a supply chain risk, and the Department of War will not do business with you. So this would be, I don't know if it's cataclysmic, but it's a big, big deal.

Jeremy Harris (92:44)

Right. They're trying to keep up neck and neck with OpenAI, with Google. Like, you know, this is a serious, serious thing. Yeah. And the other side of the coin is the Defense Production Act. That was the act that you were referencing, the DPA. Okay. So the DPA is used typically in wartime to, for example, turn to Ford and say, hey, you guys think you're a car company? Guess what? Now you're a tank company. We need tanks to be rolling off your production lines, so go fix it. Right. That's what the DPA really is about, or that was the original intent. It was used back in World War II a whole bunch; it hasn't been used a lot since then. You know, it's a big lift. And so the other option that the USG is presumably exploring here is telling Anthropic, listen, the DPA applies, you're building it for us, and that's the end of it. Right. Now notice, and Anthropic is making this a core pillar of their argument, what appears to be an interesting contradiction between those two poles. On the one hand, we're saying that Anthropic is such a severe supply chain risk that no company even working with Anthropic software can plug into the Department of War. On the other hand, we're saying Anthropic is so critical to the national security interests of the US Government that it must be compelled to produce AI tools for the Department of War. I'm not saying that contradiction can't be resolved, but it's something that seems pretty dicey if you're going to actually lean into the DPA, which would be pretty much unprecedented in this kind of context. So very, very tricky. You know, all kinds of precedents being set left, right and center. If this goes through in either direction, you better believe the other labs are looking at this. What do we do? Sam A has kind of come out with, I sort of hedge my bets, I disagree with it in principle. It's a complicated time to be in these labs and a genuinely challenging problem. 
You know, China has civil military fusion. That is a fact of life. Every Chinese company is an arm of the Chinese Communist Party. So there's a massive asymmetry there. That, yes, any administration, the Department of War, the Trump administration, has to figure out, how do we, how do we compete geopolitically, militarily with that? This is what you're seeing bubble up. And it's only going to bubble up more as AI becomes a larger and larger part of how war fighting is done and how geopolitics shapes up. So, yeah, I mean, this could not be more important.

Jeremy Harris (98:13)

Absolutely. And, you know, they go into the details of these attacks. I mean, the details are interesting if you're interested in AI security, which I am. But not everyone will necessarily want to read the whole thing. So the scale of it is interesting. There's over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts. So that's kind of the scale we're looking at. One of the big take-homes here is Anthropic is positioning this as being consistent with their position on export control policy. And basically the concern here is this, and I think it's actually quite a reasonable one. So one argument that people keep making, that I personally think is really silly, is: oh, look at these Chinese models, they're very capable, so therefore export controls don't work, so we might as well just let Nvidia sell whatever chips they want to China and that's the end of it. Right? So the problem with that is that, first of all, these labs are telling us over and over and over again, as loudly as they can, despite the Chinese Communist Party telling them to shut the fuck up, that they are starved for chips. Like, DeepSeek's co-founder has said this repeatedly: we could probably do the AGI thing in house, no problem. The one thing, the one goddamn little thing, is we can't get those chips. And they keep trying to smuggle them, which should tell you everything you need to know about what they think they need. They keep trying to order them, blah blah, blah, blah blah. Distillation is yet another reason why that's possible. So it's not just that they're smuggling the chips, it's that they're actually using the hard-earned capabilities of Western models that have been trained with billions of dollars' worth of super advanced chips and power. And then they're just taking the very best, the cream off the top, and using that to train their own models. 
Distillation works, it turns out. It gives you crazy leverage, asymmetrical leverage if you're compute constrained. And so it can cause the illusion that Chinese kind of domestic training capabilities are greater than they actually are. Doesn't mean the Chinese models aren't impressive, but what it means is we are dragging them along. I said this in the context of some research that my company Gladstone had done like a year and a half or two years ago. There's this illusion that we have any kind of lead if we don't get AI security right, like we can just move faster and stay ahead. No, no, no. Like as long as our labs are penetrated, as long as, you know, distillation attacks succeed, we're dragging our adversaries along with us. That's really what's going on here. And so. Well, anyway, this is another kind of argument that Anthropic is making here, presumably to kind of shore up the case for tighter export controls.