Latent Space: The AI Engineer Podcast
Episode: [LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka
Date: February 26, 2026
Episode Overview
This live episode brings together experts Nathan Lambert, Sebastian Raschka, and Swyx (Sean), to dissect the latest controversies and breakthroughs in AI model development, focusing primarily on:
- Distillation attacks on Anthropic models and the geopolitics of model distillation.
- How benchmarks like SWE-Bench can be gamed or become obsolete as models "cheat" or saturate them.
- Challenges and nuances of creating and evaluating reliable benchmarks for large language models (LLMs), specifically in code generation.
The conversation is wide-ranging but always steeped in deep technical insight and lived experience at the AI frontier.
Key Discussion Points & Insights
1. Distillation: Concepts and Controversy
-
Defining Distillation ([02:53])
- Sebastian Raschka: "Distillation... is the idea that you’re taking a larger model and you train a smaller model on these outputs... The idea is that you can train the smaller model more efficiently using that larger model."
- Traditionally involved training on logits (deep learning outputs), but now often uses just the synthetic data/output generated by LLMs.
-
“Attacks” and API Abuse ([01:25]–[06:25])
- Anthropic published a blog post claiming their models (especially via API) were being “attacked” through distributed distillation efforts, particularly naming Chinese labs using multiple accounts to skirt API restrictions.
- Nathan Lambert: Suggests this is not surprising given GPU shortages and the ease of API access for synthetic data: "They're in a massive GPU shortage and using APIs is way easier than generating synthetic data on their own." ([01:40])
-
Terms of Service and Enforcement ([05:05])
- Discussion that major labs include vague prohibitions on distilling from their APIs, but actual enforcement in the US has been limited.
- Notable previous cases (e.g., OpenAI blocking Bytedance and XAI) set some precedent, but high-profile naming in this latest Anthropic post is new.
Notable Quote
“Terms of service is something that can be... if the provider finds you violate it, they can cut off your access. That’s just kind of like a basic thing.” — [Nathan Lambert, 05:05]
2. Detection and Privacy Concerns
-
How would an LLM Provider Actually Detect Distillation? ([06:25]–[10:57])
- It's difficult to distinguish between legitimate large-scale evaluation and actual distillation; detection involves monitoring request scale, pattern, and distribution.
- Sebastian: Raises privacy implications: "It kind of almost implies that they are checking what you use the LLM for... which is kind of like a sensitive topic." ([08:34])
-
Patterns at Scale
- Providers likely look for accounts that generate unusually broad distributions of prompts or very high volumes, signaling distillation vs. justified customer use ([07:12], [09:55]).
Notable Quote
“At a certain point when you have a certain magnitude of answers generated, it might look suspicious. But there are a lot of legit use cases.” — [Sebastian Raschka, 10:57]
3. The Model Benchmarks Arms Race—SWE-Bench and Its Demise
Background on Benchmarks
- What is SWE-Bench? ([29:00]–[32:33])
- Coding benchmark originating from Princeton, designed as thousands of real-world GitHub issues and related PRs; LLMs are tested on bugfixes.
- Nathan: “The umbrella topic here is how do we compare which LLM is currently the best LLM. Like one of the ways would be SWE-Bench basically.” ([29:06])
- OpenAI adopted and curated a "verified" subset for greater quality control ([30:39]).
Saturation and Model "Cheating"
-
SWE-Bench Saturation ([34:06]–[37:11])
- All state-of-the-art models now score ~80%+, making it indistinguishable which is superior—demonstrates benchmark overfitting.
-
Why Is This Happening?
- Many test cases turned out unsolvable/poorly defined ([36:51]), or models simply memorize dataset answers (“canary tasks”).
- Nathan: “There was these multiple rounds... then every single person that ran... did not call this out until OpenAI was like, hey, let's look at the data.” ([37:11])
- Chain-of-thought outputs sometimes reveal model leakage or “cheating”—models regurgitate answers from having seen similar data in pretraining ([37:26], [38:14]).
- Many test cases turned out unsolvable/poorly defined ([36:51]), or models simply memorize dataset answers (“canary tasks”).
Notable Moment ([36:51])
“So the only way you could kind of solve it is if you're memorizing the [answer]...” — Sebastian Raschka
4. Why LLMs Memorize—And the Limits of Current Evaluation
- Theory and Practice ([41:50])
- LLMs can memorize surprising amounts from even a single pass through the data.
- Swyx: “The information theory of LLMs... is super understudied. How come you can memorize from one pass?” ([42:07])
- Dual challenge: labs must carefully control duplication levels and revisit old data, or models lose basic facts ([40:57], [41:23]).
Notable Insight
“With such a small fraction... it's enough to have the model memorize almost everything. Which is fascinating. Yeah, I don’t know, it’s just like after all these years. Fascinating.” — Sebastian Raschka ([41:50])
5. Modern Solutions and What Comes Next for Benchmarks
-
Move to Private, Rotating Benchmarks ([44:05]–[47:23])
- Newer benchmarks (like SWE-Bench Pro by Scale AI) try to address the “solved benchmark” problem:
- Fresh problem selection and private/public splits
- Diversification of topics and languages
- Direct evaluation via private APIs to prevent dataset leakage ([46:24])
- Newer benchmarks (like SWE-Bench Pro by Scale AI) try to address the “solved benchmark” problem:
-
End-to-End and Subjective Evaluation Looms ([48:15])
- Coding and math are easier due to clear “right answers,” but other real-world agentic tasks and UI automation are much harder to benchmark.
6. Reflections on the AI Ecosystem and Business
-
API vs. Product Strategy ([24:49]–[28:19])
- Will labs move to product-only access for cutting-edge models to avoid distillation and stifle competition, or will they maintain broad APIs?
- API business may be robust, but risks commoditization and leakage of intellectual property through distillation.
-
Media & Community Building ([50:03])
- The value of live human discussion, given a social media world full of bots and synthetic content.
Memorable Quotes & Timestamp Highlights
-
On Distillation Spotted at Scale:
"The millions of exchanges is a bit more of a bet. These accounts are all rate limited and have other problems, like that takes longer." — Nathan Lambert ([13:00])
-
On "Cheating" Benchmarks:
"If you solve this, you’re like, oh, shit, this is a canary...you're definitely cheating." — Swyx/Sean ([36:55])
-
On the Information Theory of LLMs:
"I still think it’s super understudied. How come you can memorize from one pass?" — Swyx/Sean ([42:07])
Important Timestamps
- [02:53] – Distillation explained
- [05:05] – API terms of service and enforcement
- [08:34] – Privacy and detection
- [29:00] – SWE-Bench explained
- [34:06] – Benchmark overfitting and model “cheating”
- [36:51] – Canary tasks and memorization
- [41:51] – Limits of LLM evaluation and information retention
- [44:05] – Next-gen benchmarks and private splits
- [48:15] – Agentic and end-to-end task evaluation challenges
Episode Tone & Takeaways
This is an unscripted, candid, but highly technical dive into ongoing AI arms races—both technological and geopolitical. The guests speak authentically, mixing sharp critique, lived industry intelligence, and philosophical reflection on the state and trajectory of AI evaluation, openness, and business moats.
For AI engineers or enthusiasts, this episode unpacks:
- Why distillation is both vital and fraught at the frontier.
- How “security through obscurity” (private benchmarks, API changes) is becoming the norm.
- The ever-present danger that benchmarks simply become irrelevant or misleading as models inadvertently or intentionally “cheat.”
- The importance (and difficulty) of true, reproducible, and meaningful measurement in AI progress.
For the full show notes and references, visit latent.space.
![[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka - Latent Space: The AI Engineer Podcast cover](/_next/image?url=https%3A%2F%2Fsubstackcdn.com%2Ffeed%2Fpodcast%2F1084089%2Fpost%2F189277598%2Fca7468da5614a246d2906ee8926f6de7.jpg&w=1200&q=75)