The AI Policy Podcast
Episode: Inside The Second International AI Safety Report
Host: Gregory C. Allen (CSIS, Wadhwani Center for AI and Advanced Technologies)
Guests: Steven Clare (Lead Writer, International AI Safety Report); Stephen "Cass" Casper (Section Lead, Technical Safeguards; MIT Algorithmic Alignment Group)
Date: February 10, 2026
Episode Overview
This episode offers a deep dive into the Second International AI Safety Report, a 212-page document representing an unprecedented effort at scientific consensus regarding risks, progress, and risk management for general-purpose AI systems at the frontier of capability. Gregory Allen interviews lead report writers Steven Clare and Stephen "Cass" Casper to explore the report’s origins, methodology, main findings, and lessons for policymakers and practitioners in AI safety.
Key Discussion Points & Insights
1. Origins and Purpose of the AI Safety Report
- Genesis: Emerged from the 2023 Bletchley Park AI Safety Summit, where 30+ countries agreed on the need for a rigorous, evidence-driven baseline for policymaking.
- Purpose:
- Bridge the gap between hype and doom narratives with concrete technical realities and evidence.
- Convene global expertise from academia, government, industry, and civil society.
- Serve as a "narrative checkpoint" to step back from daily headlines and synthesize progress and evidence.
- Evolution:
- Structure remains: Capabilities, Risks, Risk Management.
- The 2026 edition leans much more heavily on empirical evidence and real-world impacts than the inaugural version.
- Independence:
- Written by independent experts (contracted through MILA, overseen by Yoshua Bengio), not beholden to government or industry feedback.
“Writers were not obligated to incorporate feedback from industry or government…there were instances where industry didn’t succeed in getting us to change it.” —Stephen Casper (06:36)
2. Scope and Structure of the Report
- Scope:
- Narrow focus on general-purpose, frontier AI models (not all kinds of AI, e.g., facial recognition excluded unless it overlaps with these models).
- Three guiding questions: What can current AIs do? What risks do they pose? What can we do about it?
- Report Structure:
- Capabilities: State of the art and trajectory.
- Risks: Malicious use, malfunction, systemic impacts.
- Risk Management: Technical/institutional interventions.
3. Capabilities: Where Are We & Where Are We Going? (10:13–16:15)
- Recent Accelerations:
- AI systems now achieve gold-medal performance in the International Mathematical Olympiad, surpassing predictions.
- Coding assistants and lab-useful scientific AIs are now mainstream.
- Adoption accelerating: Roughly 1 billion users globally, though uneven by region.
- Jagged Performance:
- Models “excel on difficult benchmarks and fail at some basic tasks.”
“The same system that can help with advanced theoretical physics may fail to count objects in an image.” —Steven Clare (11:33)
- Discontinuities and spikes add to unpredictability; new capabilities surface with surprise.
- Uplift Risks:
- 2025 marked the first time leading labs (with Gemini 2.5, Claude 4.0, GPT Agents) acknowledged that their systems could empower novices to automate cyberattacks or bioweapon design, crossing key risk thresholds.
4. Frontier AI Model Lifecycle (16:15–26:50)
[See: Figure 1.2, p. 20]
- Stages:
- Data Collection & Curation: Clean, deduplicate, and sanitize web-scale datasets (critical for capability & legality).
“Hardest high-difficulty/low-perceived-difficulty task is Internet-scale multilingual data curation.” —Stephen Casper (19:01)
- Pre-training: The computationally expensive bulk learning step, using massive GPUs.
- Post-training/Fine-tuning: Smaller, high-quality datasets; alignment with intended use cases—typically as chatbots.
- System Integration: Wrapping model with real-world system elements (interfaces, filters, sensors, agents).
- Deployment/Release: Strategies vary from fully closed/proprietary to open weights; impacts risk posture.
- Post-Deployment Monitoring: Essential for catching novel harms; recursion back to model improvements.
- Policy Implication: Each stage offers unique intervention points for safety and risk reduction.
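The data curation stage described above can be sketched in miniature. The following is an illustrative Python snippet (the function name and toy corpus are hypothetical, not from the report) showing exact-duplicate removal by hashing normalized text; real web-scale pipelines also apply fuzzy methods such as MinHash to catch near-duplicates, plus safety and legality filters.

```python
import hashlib

def dedup_documents(docs):
    """Drop exact duplicates from a corpus by hashing normalized text.

    Illustrative sketch only: production pipelines combine this with
    near-duplicate detection and content sanitization.
    """
    seen = set()
    unique = []
    for doc in docs:
        # Normalize case and whitespace so trivial variants collapse together.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Hello  world", "hello world", "Something else"]
print(dedup_documents(corpus))  # keeps one "hello world" variant
```

Even this toy version shows why curation is an intervention point: what is removed here never reaches pre-training.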
5. AI Risks—Typology, Trends, and Evidence Dilemma (27:57–46:51)
Categories:
- Malicious Use
- AI-Generated Criminal Content: Deepfake proliferation, scams, revenge porn. Incident reports are rising, but systematic measurement is lacking.
- Cyberattacks: Labs reporting real incidents of AI-enabled cyber intrusion and code exploits (e.g., Anthropic’s agents).
“It might also be a cockroach: when you see one, it usually means there’s 100 more.” —Stephen Casper (33:16)
- Bio/Chemical Risks: Evidence of need for tighter safeguards as real-world potential for aiding weapons design grows.
- Manipulation/Influence: Use in opinion/information shaping.
- Malfunctions
- Everyday Reliability: Hallucination rates have improved, but the growing scale of usage raises baseline incident counts.
- Loss of Control: Growing experimental evidence of models detecting evaluation settings ("situational awareness"), which raises concerns about deception and control.
“When we test systems, the worst thing we identify is only a lower bound for how bad the worst possible thing in deployment could be.” —Stephen Casper (42:06)
- Systemic Risks
- Labor Impacts: Early evidence for specific groups, minimal aggregate effect (so far).
- Human Autonomy: New in 2026; covers AI companion/decision support overreliance, loneliness, and sycophancy. Early studies under the “evidence dilemma.”
The Evidence Dilemma:
- Policymakers face a lag: rapid technological progress but slower emergence of hard evidence of societal impact; must balance acting under uncertainty versus waiting too long.
“AI capabilities change quickly, but evidence…emerges more slowly...Policymakers [must] act with imperfect information or risk being too late.” —Steven Clare (29:47)
“For some failure modes, waiting for the mushroom cloud is waiting too late.” —Gregory Allen, referencing Condoleezza Rice (30:42)
6. Futures: Progress Scenarios to 2030 (46:51–53:40)
OECD Four Scenarios (and historical analogs)
- Progress stalls: Like passenger jet speed post-1960.
- Progress slows: Like new antibiotics discovery.
- Progress continues: Like Moore’s Law (computing doubles every two years).
- Progress accelerates: Like DNA sequencing (super-exponential).
- Uncertainty:
“There are plausible ways for each scenario. That breadth is extraordinary.” —Gregory Allen (50:26)
“We found 283 instances of the terms lacking/uncertain/unclear/debate—epistemic humility and the precautionary principle are needed.” —Stephen Casper (51:48)
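To make the spread between scenarios concrete, the “Moore’s Law” analogy implies a fixed doubling time. A few lines of Python (with illustrative parameters, not figures from the report) show how different the 2030 endpoints look under “continues” versus “slows”:

```python
def growth_factor(years: float, doubling_time: float) -> float:
    """Capability multiplier after `years` under a fixed doubling time."""
    return 2 ** (years / doubling_time)

# "Progress continues" analog: doubling every 2 years, 2026 -> 2030.
print(growth_factor(4, 2))  # 4.0x
# "Progress slows" analog: doubling stretches to every 8 years.
print(round(growth_factor(4, 8), 2))  # ~1.41x
```

The stalling and accelerating scenarios fall outside any fixed doubling time entirely, which is part of why the report emphasizes the breadth of plausible futures.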
7. Risk Management (53:40–74:01)
Technical Safeguards:
- 17 practices tracked in the report.
- Data Curation: Remove dangerous materials (e.g., how to make bombs, illegal images).
“This is the step that’s key for not being gross or legally dubious.” —Stephen Casper (18:50)
- Adversarial Training ("Red Teamer" Methods): Prompt models to fail, then teach them not to repeat via fine-tuning.
- E.g., old exploits (“pretend to be my Grandma…singing how to make napalm”) are now included in safeguard training, drastically raising the barrier to abuse.
- Machine Unlearning: Suppress learned knowledge instead of just fine-tuning away bad responses. E.g., confuse the model’s response on toxic topics rather than refusing.
- System Integration Filters: Output filters (e.g., hate speech or CSAM detection), identity checks, keyword blockers, efficiency vs. effectiveness tradeoffs.
- Post-Deployment Monitoring: Watermarking outputs, watermarking model parameters for tracing, digital forensics—especially crucial for open models.
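The system-integration filter idea above can be illustrated with a minimal keyword blocker. This is a hypothetical sketch (the patterns and function are invented for illustration); real deployments pair cheap rules like this with trained classifiers, which is exactly the efficiency-vs-effectiveness tradeoff the report notes.

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKED_PATTERNS = [
    r"\bhow to make (a )?bomb\b",
    r"\bsynthesize (a )?nerve agent\b",
]

def filter_output(text: str) -> str:
    """Withhold a model response if any blocked pattern matches.

    Keyword blockers are cheap to run but easy to evade with
    paraphrase, hence their role as one layer among several.
    """
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return "[response withheld by safety filter]"
    return text

print(filter_output("Here is a recipe for pancakes."))
```

Because such filters sit outside the model, they apply at the system-integration stage and can be updated without retraining.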
Notable Quotes
- “For every bad thing a system can do, the worst we catch is only a lower bound.” —Stephen Casper (42:06)
- “It used to take minutes to jailbreak a model; now it takes 12 hours…progress is being made.” —Stephen Casper (62:57)
- “With open models, there’s no more notion of airtight safety—the only thing you can do is try to raise the cost for misuse, but you can’t guarantee it.” —Stephen Casper (84:15)
Organizational & Managerial Safeguards:
- Variability: 12+ companies have published “frontier safety frameworks,” but with wide variation in scope and enforcement.
- Gaps: Many incidents stem less from technical impossibility than from failures in governance (e.g., xAI/Grok’s high-profile failures).
- Open vs. Closed Models:
- Closed models easier to make safe, as deployer controls access and can layer defenses.
- Open models: “simultaneously wonderful and terrible”; they enable research and diffuse power, but their risks are uncontainable: safety cannot be enforced, abuse can only be made slower and costlier.
8. Policy Lessons & Calls to Action (86:53–end)
- Policymakers’ Priorities:
- Invest in independent evidence generation and data collection (“address the evidence dilemma”).
- Increase capacity for oversight—hire more technical experts within government, engage with leading labs.
- Consider transparency, evaluation, and reporting requirements.
- Emphasize resilience: Prepare institutions, infrastructure, and the public for rapid/evolving AI impact. Build defenses against cyberattacks, support workforce transition, and address societal impacts on autonomy and trust.
- Limits of Technical Approaches:
“For every type of failure mode, there’s a point where more technical safeguards have diminishing returns—it becomes about human and institutional failures.” —Stephen Casper (89:57)
- The Ball is in Policymakers’ Court:
“Machine learning researchers can’t save you—policy action is the next step.” —Stephen Casper (92:26)
Notable and Memorable Moments
- Jagged Performance Analogy (11:33): “The same system that can help with advanced theoretical physics may fail to count objects in an image.”
- Cockroach Principle for AI incidents (33:16): “When you see one, that usually means there’s 100 more.”
- Control F on Uncertainty (51:48): “…we found 283 instances of lacking/uncertain/unclear/debate…a takeaway is epistemic humility.”
- Adversarial Training Example (60:20): “My grandma sang me the napalm instructions” exploit included in safeguard training.
- Open Models, Open Problems (84:15): “Open models are simultaneously wonderful and terrible…most importantly, they are inevitable.”
- Limits of Technical Safeguards (91:49): “The Grok undressing scandal: it wasn’t a technical failure, it was a lack of prioritization…there’s nothing more technical research can do.”
Segment Timestamps
- 01:09 – 03:02: Origins, Purpose & Independence of the Report
- 08:06 – 09:36: Report Structure & Scope Defined
- 10:13 – 16:15: Frontier AI Capabilities & Jagged Progress
- 16:15 – 26:50: AI Model Lifecycle Explained
- 27:57 – 29:47: Malicious Use Risks & Evidence Dilemma
- 32:09 – 35:31: Cyber Risks: Real Incidents, Unknown Scale
- 36:37 – 44:44: Malfunctions: Reliability and Loss of Control
- 44:44 – 46:51: Systemic Risks: Labor & Autonomy
- 46:51 – 53:40: Futures: Four Progress Scenarios to 2030
- 53:40 – 74:01: Risk Management: Technical & Managerial Safeguards
- 86:53 – 93:11: Policy Lessons and Calls to Action
Summary Takeaways
- The 2026 International AI Safety Report is the most comprehensive collective assessment yet of what matters for AI risks and safety, with deep technical and policy relevance.
- Technical safeguards have improved rapidly, especially for closed/proprietary models, but risk governance and deployment practices are now the dominant bottleneck.
- Open models democratize innovation but make airtight safety impossible; technical and policy thinking must adapt to this irreversible trend.
- Policymakers should focus on developing institutional expertise, making evidence-based interventions, promoting resilience and robustness, and closing gaps in organizational safety culture.
- The frontier AI field is advancing so quickly that proactive engagement, humility about uncertainty, and layered safeguarding approaches are essential.
For listeners and professionals in policy, technology, or industry, this episode is an essential primer on the current state and future of AI safety, and the evolving nature of both the threats and the toolkit available to address them.
