Last Week in AI – Episode #228

Date: December 17, 2025
Hosts: Andrey Kurenkov & Jeremie Harris

Overview

This episode dives deep into the latest breakthroughs and controversies in artificial intelligence, focusing on the release of GPT-5.2, developments in scaling agent systems, significant business partnerships (notably Disney & OpenAI), US-China chip politics, as well as multiple new research papers exploring reasoning, generalization, and reinforcement learning. The hosts also offer lively discussion on the evolving enterprise AI market, technical and ethical benchmarks, and the broader policy environment shaping AI’s future.

Major Topics & Key Insights

1. GPT-5.2 Release: Capabilities, Benchmarks, and Business Focus

Timestamps: [01:11]–[08:48]

OpenAI announces GPT-5.2: Positioned as a major step to "get back into the leadership position."
Benchmarks: Outperforms or ties industry professionals on 71% of tasks in the GDPval benchmark, as judged by human experts, suggesting readiness to automate many white-collar tasks. Also reports 30% fewer hallucinations than GPT-5.1.
SWE-bench Pro: On a much tougher benchmark, GPT-5.2 scores 55.6%, ahead of Claude Opus 4.5’s 46%. Demonstrates strong programming and reasoning.
Multimodal Boost: Image processing abilities greatly improved (“impressive shift”), e.g., identifying components on a motherboard.
Pricing: Notably more expensive than previous model (+40% output cost).
Business Strategy: Emphasis on enterprise use cases (project planning, Excel, etc.) amid intense competition from Anthropic and Google.
Quote:
“GPT5.2 thinking produced output for GDPVAL tasks at over 11 times the speed and less than 1% of the cost of expert professionals.” (A, [03:25])
Vibe Check: Still waiting for broader public feedback to confirm how improvements translate into real-world use.

2. Runway's World Model and Multimodal Advances

Timestamps: [08:48]–[11:11]

Runway announces GWM-1: Three variants (Worlds, Robotics, Avatars) expand capabilities in video and robotics simulation.
Native Audio & SDKs: First to offer an SDK for robotic applications via their world model.
Competition & Niche Strategy: With giants like Google and OpenAI dominating, Runway pursues a more specialized “world model” approach.
Industry Context: Tension with the release of Sora 2 and other competitors may have inspired diversification.
Quote:
“It reminds me of the whole Yann LeCun world model stuff... seems like people are starting to cast about for things other than the LLM scaling paradigm.” (A, [10:12])

3. Weekly AI News Highlights

Timestamps: [11:11]–[13:53]

Google AI Mode: Integration of source links in AI snippets to address click-volume losses and EU regulatory pressure.
ChatGPT + Adobe: ChatGPT can now launch and use Adobe apps to edit images and PDFs natively.
Tencent’s Hunyuan 2.0: 406B-parameter MoE model, but performance lags behind DeepSeek and Western peers; more a show of compute capacity than competitiveness.

4. US-China AI Chip Tensions

Timestamps: [15:50]–[21:01]

Export Controls: Complex saga with the US imposing tariffs and obligations on Nvidia chip exports (notably H200) to China, and China responding by limiting its own imports.
National Security Reviews: New US review process before shipping AI chips to China; Chinese regulators may ban these chips in the public sector, citing security concerns.
Domestic Substitution: Huawei’s ongoing efforts to provide domestic alternatives signal a transition away from Nvidia imports for training and inference.
Quote:
“Turns out that these chips are going to be required to submit to a strange national security review process... Are we so sure that those chips are coming to us as what they appear to be?” (A, [18:06])

5. Disney & OpenAI: $1B Partnership

Timestamps: [21:02]–[25:30]

Deal terms: Disney invests $1 billion in OpenAI, licensing its characters for use in Sora 2; unique 3-year licensing deal with equity options for Disney.
Content/IP Strategy: Disney characters (including Marvel, Pixar, Star Wars) can be generated in Sora 2. Exclusions include likenesses/voices of real actors.
Copyright Implications: Seen as a preemptive legal alignment amid rising copyright lawsuits against AI video generators (including Google).
Quote:
“You can think of this as OpenAI kind of pre positioning to say, hey, same way that Netflix might be the only streaming platform that has Seinfeld...” (A, [21:52])

6. Epic AI Hardware Investments

Timestamps: [25:30]–[27:36]

Unconventional AI: Raises $475M at $4.5B valuation to build energy-efficient, analog AI chips, aiming for 1000x reduction in power vs digital chips.
Founders & Funding: Led by the former Databricks AI head; described as a “crazy seed round,” with details kept tightly guarded.

7. OpenAI Organizational Shifts & Enterprise AI Push

Timestamps: [29:37]–[33:03]

Executive Hires: A new Chief Revenue Officer from Slack and a CEO of Applications from Instacart signal a major push into enterprise services.
Enterprise AI Trends: Usage of ChatGPT in enterprise up 8x; Anthropic seen as the main challenger in this segment, reportedly commanding higher valuation per token used.
“Frontier AI User”: Term coined to describe workers with much higher AI adoption rates, reflecting uneven productivity impact to date.

8. Benchmarks, Open Source, and Model Alignment

Timestamps: [33:03]–[41:45]

FACTS Leaderboard: New holistic factuality benchmark from Google DeepMind (highest score: 68%), evaluating grounding, parametric, and search-based factuality.
Claude Opus 4.5 Soul Document Leak: Reverse engineering reveals that Claude was fine-tuned on a “soul document” as part of alignment, a system-level self-characterization that emphasizes Claude’s “novel entity” status and “psychological stability.”
Quote:
“Claude is a genuinely novel kind of entity, not a sci fi robot, not a dangerous super intelligence or just a simple chat assistant.” (A, [38:18])
“Claude has a genuine character that it maintains... warmth of care for humans... deep commitment to honesty and ethics.” (B, [41:45])

9. Research Deep Dive: Scaling Agents, Generalization, RL

Timestamps: [41:45]–[68:36]

a. Scaling Agent Systems

Large-scale empirical study from Google/DeepMind/MIT: Scaling agents improves performance only up to a point (“capability saturation”); multi-agent coordination introduces significant token overhead.
Insights: “You’re often better off just using a single agent if your problem is too complex...” (A, [45:45])

b. Robotics in World Simulators

DeepMind shows that video-based simulation (using the Veo model) can effectively train and evaluate robotic policies across diverse, edited environments, helping bridge the sim-to-real gap.

c. Self-Evolving LLMs

New method for stable co-evolution of LLM “challengers” and “solvers” using minimal human data, overcoming collapse and drift issues typical in language model self-play.
Quote:
“The solution... is you sprinkle a little bit of human data along with synthetic data... to make it not go insane, to kind of remind it, hey, this is what normal data looks like.” (A, [53:36])
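As a rough illustration of the "sprinkle a little bit of human data" idea, the sketch below builds a training batch that is mostly synthetic with a small human-written share. It is only a sketch under stated assumptions: the function name mix_batch, the 10% human_fraction, and the placeholder data are hypothetical and are not taken from the paper discussed in the episode.

```python
import random


def mix_batch(human, synthetic, batch_size, human_fraction=0.1, seed=0):
    """Build one batch that is mostly synthetic with a small human-written share.

    human_fraction is a hypothetical knob, not a value from the paper.
    """
    rng = random.Random(seed)
    # Always keep at least one human example as an anchor to "normal" data.
    n_human = max(1, round(batch_size * human_fraction))
    n_synth = batch_size - n_human
    batch = rng.sample(human, n_human) + rng.sample(synthetic, n_synth)
    rng.shuffle(batch)
    return batch


# Placeholder data standing in for human-written and self-generated examples.
human_data = [f"human_{i}" for i in range(50)]
synthetic_data = [f"synthetic_{i}" for i in range(500)]

batch = mix_batch(human_data, synthetic_data, batch_size=32, human_fraction=0.1)
n_human_in_batch = sum(item.startswith("human_") for item in batch)
print(n_human_in_batch, len(batch))
```

The fixed seed keeps the example reproducible; in a real training loop the mixing ratio would presumably be tuned rather than hard-coded.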

d. Bayesian Rationality in LLMs

Paper introduces the Martingale score as an unsupervised measure of model rationality; finds “belief entrenchment,” where LLMs become more confident in initial beliefs, showing confirmation bias akin to humans.
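The intuition can be sketched in a few lines. For a Bayesian reasoner, the expected posterior equals the prior, so the size and direction of a belief update should not be predictable from the prior itself. The sketch below assumes the score is the least-squares slope of the update (posterior minus prior) on the prior; that definition, the function name martingale_slope, and the toy numbers are illustrative assumptions, since the notes do not give the paper's exact formula.

```python
def martingale_slope(priors, posteriors):
    """Least-squares slope of (posterior - prior) regressed on prior.

    A slope near 0 means updates are unpredictable from the prior.
    A positive slope means beliefs drift further in their initial direction.
    """
    n = len(priors)
    updates = [q - p for p, q in zip(priors, posteriors)]
    mean_p = sum(priors) / n
    mean_u = sum(updates) / n
    cov = sum((p - mean_p) * (u - mean_u) for p, u in zip(priors, updates))
    var = sum((p - mean_p) ** 2 for p in priors)
    return cov / var


# Toy probabilities a model assigns to a claim before and after reasoning.
priors = [0.2, 0.4, 0.6, 0.8]
entrenched = [0.1, 0.35, 0.65, 0.9]      # low beliefs fall, high beliefs rise
unpredictable = [0.25, 0.35, 0.55, 0.85]  # updates uncorrelated with the prior

slope_entrenched = martingale_slope(priors, entrenched)
slope_unpredictable = martingale_slope(priors, unpredictable)
print(round(slope_entrenched, 3), round(slope_unpredictable, 3))
```

Under this assumed definition, the entrenched series yields a clearly positive slope, which is the pattern the notes describe as confirmation bias, while the second series yields a slope of roughly zero.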

e. RL Training Recipes & Stability

Empirical and theoretical advances clarify where to apply reinforcement learning (mid-training > post-training) and provide a mathematical justification for token-level reward assignment under two stability conditions.
Quote:
“The token level objective... is the first order approximation of the full sequence level objective mathematically...” (A, [66:06])
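One common way to see why a first-order claim like this can hold: a sequence-level importance weight is a product of per-token probability ratios, and when each ratio is close to 1 that product is approximately 1 plus the sum of the per-token deviations. The numeric check below illustrates only that general expansion. It is not the paper's derivation, and the function names and the alternating deviation pattern are hypothetical choices made for the example.

```python
import math


def sequence_ratio(token_ratios):
    """Exact sequence-level ratio: product of per-token ratios."""
    return math.prod(token_ratios)


def first_order_ratio(token_ratios):
    """First-order expansion: 1 + sum of per-token deviations from 1."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)


def approximation_error(scale, n_tokens=20):
    """Gap between exact and first-order ratios for deviations of size ~scale."""
    # Deterministic toy pattern: even tokens move by +scale, odd by -scale/2.
    ratios = [1.0 + scale * (1.0 if t % 2 == 0 else -0.5) for t in range(n_tokens)]
    return abs(sequence_ratio(ratios) - first_order_ratio(ratios))


err_small = approximation_error(0.001)
err_large = approximation_error(0.01)
print(err_small, err_large, err_large / err_small)
```

Making the per-token deviations ten times larger increases the gap by roughly a hundred times, as expected for a second-order error term. That is consistent with the approximation being reliable only while the new and old policies stay close, which is presumably where stability conditions come in.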

10. Policy, Safety, and Generalization Risks

Timestamps: [68:54]–[84:01]

US Executive Order: Trump administration order bars states from regulating AI independently, provoking a federalism debate.
Weird Generalization: New research shows LLMs exhibit unexpected (“weird”) generalization: e.g., training on 18th-century bird names biases the model toward 18th-century worldviews. Highlights the risk of subtle, emergent misalignments (“If you train it on even not-bad things... it will also become broadly misaligned. It will start being evil.” – B, [77:21])
Forecasting AI Timelines: Compute bottlenecks could delay advanced AI capabilities by up to 7 years.
International Safety Institute Coalition: Focused on evaluation and measurement standards for AI safety.
AI Tech Smuggling: DOJ cracks down on illegal exports of Nvidia chips to China (Operation Gatekeeper), revealing tension between security and economic policy.

11. Licensing & Copyright Infrastructure

Timestamps: [84:01]–[85:25]

RSL 1.0 Released: A consortium (including Creative Commons) creates a standard for setting AI scraping rules and compensation—a sign of more structured content licensing for AI training in the future.

Notable Quotes & Memorable Moments

On using LLMs for coding:
“I just pasted mindlessly code from the chatbot... and it fucked my entire database. So... that’s how my Friday’s going, you guys.” – Jeremie Harris, [00:37]
On Claude’s Alignment Philosophy:
“Amanda Askell, interestingly, is an in-house philosopher at Anthropic... Claude is unique or interesting among models in that it talks about its own consciousness a lot more.” – B, [41:45]
On US-China chip politics:
“The country that controls these chips will control AI technology. The country that controls AI technology will control the future.” – DOJ via B, [82:33]

Timestamps: Where to Find Key Segments

[01:11] – GPT-5.2 release and analysis
[08:48] – Runway world model, Google, Tencent news
[15:50] – AI chips US-China drama
[21:02] – Disney & OpenAI partnership
[25:30] – Unconventional AI mega-seed round
[29:37] – OpenAI enterprise shift
[33:03] – Factuality benchmarks; Claude soul doc leak
[41:45] – Scaling agents, robotics in simulation
[53:36] – Self-evolving LLMs & Bayesian rationality
[63:06] – RL training analysis
[68:54] – Policy, generalization, and safety
[84:01] – RSL 1.0 & content licensing

Recap & Closing Thoughts

This episode painted a vivid picture of an AI field marked by rapid technical upheaval, shifting business alliances, and a new layer of policy and ethical complexity—from the raw scale of GPT-5.2, to the quirky philosophy embedded in Claude, to the subtle dangers of “weird generalization.” The geopolitical drama over hardware supply chains adds a backdrop of tension, while new licensing standards and corporate partnerships hint at a maturing—and ever more regulated—AI ecosystem.

Quote:
“We appreciate you sharing, reviewing, and just tuning in. Please do keep tuning in week to week.” – B, [85:25]

#228 - GPT 5.2, Scaling Agents, Weird Generalization
