Practical AI Podcast: "AI incidents, audits, and the limits of benchmarks"
Episode Date: February 13, 2026
Host(s): Daniel Whitenack, Chris Benson
Guest: Sean McGregor (Co-founder, AI Verification and Evaluation Research Institute; Founder, AI Incident Database)
Overview
This episode dives deep into the practical realities of AI safety, incident tracking, and the growing need for robust AI audits and trustworthy benchmarks. Featuring Sean McGregor, an expert in AI verification, the discussion explores how to define and document AI incidents, why current evaluation standards and benchmarks may be lacking, and how third-party audits could help society responsibly scale AI deployment.
Key Discussion Points & Insights
1. Sean McGregor’s Journey into AI Safety
- Sean’s journey began in reinforcement learning (RL) for wildfire suppression, which underscored AI's power and brittleness.
- Experiences in hardware at Syntient, founding a company for ML evaluation, and working on edge neural processors solidified the need for safety-focused research.
- Founded the AI Incident Database to systematically document real-world AI failures/harms, inspired by analogous databases in aviation, food safety, and medicine.
Quote:
"You see this in aviation. A plane crashes, you record what happens and you use that to make sure… you have a form of regression test. You don't want that past crash to happen again."
— Sean McGregor [04:59]
2. Defining “AI Incidents” and Incident Reporting Challenges
- Terms like “incident," “accident," “adverse event,” and “harm” have nuanced differences; “incident” is intentionally broad to cover the diverse types of AI failures.
- The AI Incident Database emphasizes harm-causing events—including both impactful rare incidents and the “little harms repeated a million times each day” by AI systems at scale.
- Most database entries come from journalistic reporting, but voluntary reporting models may be insufficient as the field matures; EU regulations may soon mandate reporting.
Quote:
"Incident covers them all… you don't want a bad thing to happen and… that bad thing to produce a harm."
— Sean McGregor [07:30]
Quote:
"If we start ingesting [minor] incidents, are we lassoing an infinity that's just going to pull us into some extreme direction?"
— Sean McGregor [09:51]
3. AI Verification & the Role of Third-Party Audits
- Traditional safety processes assume a narrow, well-defined context—a challenge for “frontier” general-purpose models (e.g., OpenAI, Gemini, Anthropic).
- Current evaluations rarely generalize across every possible application, leading organizations to run their own pilots even after seeing strong benchmarks.
- Third-party audits, akin to financial audits, help establish accountability and trust, especially critical as AI systems impact more sectors.
Memorable Story:
A famous AI incident in the database recounts a person receiving a traffic citation because a surveillance system mistook a woman’s shirt (with the word “KNITTER” distorted by a purse strap) for a license plate.
"The world is hard. The real world is real hard."
— Sean McGregor [17:49]
Quote:
"Having audited financials is table stakes... It's the same thing for the model."
— Sean McGregor [19:55]
4. Benchmarks vs. Audits: “Bench Risk” & Benchmark Limitations
- Many AI benchmarks were created for academic/research purposes, not for deployment decisions—so they might not reflect real-world risks or distributions.
- Sean’s “bench risk” project unearthed systemic failures to link benchmarks to actual deployment value; many operate on “lol, trust me, bro” rather than hard evidence.
Quote:
"A lot of the receipts were just kind of like, lol, trust me, bro, like written on a piece of paper and there were real substantive issues..."
— Sean McGregor [22:07]
- Benchmarks like BBQ (used to track bias) are useful for research, but not tailored to specific real-world deployment contexts.
Quote:
"Most benchmarks have been produced for research purposes, not for practical AI purposes. And this is a problem."
— Sean McGregor [24:13]
5. Underappreciated Risks in AI Deployment
- The difference between “security” (protecting against malicious actors) and “safety” (preventing harm from natural errors or systemic fragility). Each side requires distinct thinking for AI risk management.
- The scale of AI means even minor harms, when repeated at population scale, can become catastrophic—e.g., slightly increasing depression for billions.
Quote:
"The world is its own adversary… bad things will happen regardless."
— Sean McGregor [26:03]
6. Red Teaming LLMs at DEFCON: A Live Security Exercise
- The DEFCON hacker event “Generative Red Team 2” challenged attendees to break the safety guardrails of a large language model, offering cash bounties for reproducible failures.
- The most important takeaway: anecdotal exploits aren’t sufficient; systematic vulnerabilities require evidence and statistical rigor to be meaningful for safety improvements.
Quote:
"Anecdote does not equal data… we need you to show that it’s systematically… underperforming."
— Sean McGregor [31:50]
- Most fruitful attack vector: exploiting loose integration between guardrail models and base LLMs—showing real-world deployments often “muddy the waters” by combining multiple systems.
Quote:
"The interface between [component systems] is very often under tested."
— Sean McGregor [35:27]
7. Lessons Learned & Future Directions in AI Safety
- Need for better tools and industry standards—akin to bug bounty programs—for collecting and adjudicating AI “flaw reports.”
- A hope for more robust institutions and practices to measure, manage, and incentivize safety (because “you manage what you measure”).
- The goal: create a future where AI risk is systematically mitigated, enabling wide, responsible deployment.
Quote:
"You can’t deploy unsafe systems to clients that want safety, that care about outcomes… Our ability to make a safer system is very heavily involved in our ability to ship product."
— Sean McGregor [40:39]
Notable Quotes & Memorable Moments
| Timestamp | Speaker | Quote | |-----------|---------|-------| | 04:59 | D | "A plane crashes, you record what happens and you use that to make sure… you have a form of regression test. You don't want that past crash to happen again." | | 07:30 | D | "Incident covers them all… you don't want a bad thing to happen and… that bad thing to produce a harm." | | 09:51 | D | "If we start ingesting [minor] incidents, are we lassoing an infinity that's just going to pull us into some extreme direction?" | | 17:49 | D | "The world is hard. The real world is real hard." | | 19:55 | D | "Having audited financials is table stakes... It's the same thing for the model." | | 22:07 | D | "A lot of the receipts were just kind of like, lol, trust me, bro, like written on a piece of paper and there were real substantive issues..." | | 24:13 | D | "Most benchmarks have been produced for research purposes, not for practical AI purposes. And this is a problem." | | 26:03 | D | "The world is its own adversary… bad things will happen regardless." | | 31:50 | D | "Anecdote does not equal data… we need you to show that it’s systematically… underperforming." | | 35:27 | D | "The interface between [component systems] is very often under tested." | | 40:39 | D | "You can’t deploy unsafe systems to clients that want safety, that care about outcomes… Our ability to make a safer system is very heavily involved in our ability to ship product." |
Timestamps for Major Segments
- [02:19] Sean’s career trajectory & founding of AI Incident Database
- [06:23] Defining “AI incident” and terminology challenges
- [09:48] Selection & sourcing of incident data; issues with voluntary reporting
- [14:19] The difficulties of evaluating general-purpose models and the importance of audits
- [17:49] Real-world AI incident anecdotes
- [21:46] Difference between audits and benchmarks; limitations of popular benchmarks
- [25:04] Underestimated risks in AI: security vs. safety mindsets
- [28:19] DEFCON “red teaming” LLMs: methodology, findings, statistical rigor
- [34:48] Surprising modes of failure in LLM system integration
- [37:15] Lessons learned, need for flaw-reporting and bug bounty analogues for AI
- [39:43] Sean’s outlook for AI safety and measurements as industry matures
Takeaways for Listeners
- To deploy AI safely, organizations need more than impressive benchmarks; they require robust, independent evaluation and active tracking of where AI goes wrong.
- Current reporting on AI incidents is fragmented but crucial—organizations are urged to participate and learn from collective failures.
- Red teaming and systematic, statistically rigorous evaluation are critical for surfacing exploitable failures in large models—anecdotes are not enough.
- As AI permeates more aspects of life, the demand for transparent, practical, and managed safety evaluation continues to grow.
For more on AI incidents and safety practices, check out the AI Incident Database and Avery Institute.
