Lenny’s Podcast: Product | Career | Growth
Episode: The coming AI security crisis (and what to do about it)
Guest: Sander Schulhoff
Date: December 21, 2025
Main Theme / Purpose
This episode features an in-depth conversation between Lenny Rachitsky and Sander Schulhoff, a leading researcher in adversarial robustness and AI security. Together, they illuminate the looming crisis in AI security, emphasizing that most current AI guardrail defenses (designed to prevent AI models from being tricked into doing malicious things) are deeply flawed and essentially ineffective. Sander shares concrete examples of prompt injection and jailbreaking in the wild, dissects why traditional solutions don’t work, and offers actionable advice for companies deploying AI tools today.
Core message: AI systems are fundamentally vulnerable to adversarial attacks that can bypass guardrails, and the only reason we haven't seen real damage yet is limited adoption—not robust defenses.
Key Discussion Points & Insights
1. The Crisis: AI Security Is Broken — Guardrails Don't Work
-
Sander repeatedly asserts that AI guardrails do not work (00:00, 28:33, 31:04, 85:48).
-
Guardrail providers often claim high attack-blocking rates (“we catch 99% of attacks”), but this statistic is meaningless given the virtually infinite space of possible prompts.
-
Quote:
"AI guardrails do not work. I'm going to say that one more time. Guardrails do not work."
—Sander Schulhoff (00:00, 28:33) -
The attack surface is astronomically large—roughly one followed by a million zeros possible prompts (31:04).
-
Automated red teaming (using LLMs to attack LLMs) frequently reveals how easily models are compromised.
-
Even advanced guardrails or prompt engineering approaches get easily bypassed by determined attackers or clever prompt composition.
-
Most current "solutions" may increase overconfidence rather than actual security.
2. Types and Vectors of AI Attacks
-
Jailbreaking:
When a user alone tricks a model (e.g., ChatGPT) directly into producing undesired outputs (08:38). -
Prompt Injection:
When a user injects a malicious prompt into an application built on top of an LLM, exploiting the difference between system and user prompts to override intended behaviors. -
Indirect Prompt Injection:
When an attacker manipulates an external data source (like a webpage or email) to influence an agent’s behavior. -
Examples:
- ServiceNow (Prompt Injection): Orchestrating a chain of agent actions to access, alter, and leak data despite supposed protections (09:54).
- Remotely IO Twitter Bot (Prompt Injection): Hijacked to make threats against the president on Twitter (11:51).
- Math GPT (Prompt Injection): Trick AI into generating and executing malicious code to leak secrets (11:51).
- Vegas Bombing (Jailbreak): AI-generated step-by-step instructions for real-world harm (13:52).
- CLAUDE "Code Virus": Coordinated multi-step prompt attacks using agents/instances to bypass defenses (13:53).
- Comet Browser (Indirect): Malicious text in a web page prompts agent to exfiltrate private data (63:59).
3. Why Have We Not Seen Major Attacks Yet?
-
The only reason for lack of catastrophic AI-powered attacks is limited real-world adoption and deployment power—not because models are secure.
-
As AI-powered agents, AI browsers, and robots gain capability and autonomy, the risks multiply (18:27).
-
Quote:
"The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure."
—Alex Komarosky (cited by Lenny, 00:16, 11:11, 38:22)
4. Industry Response: The AI Security Ecosystem
- There’s a distinction between "frontier labs" (OpenAI, Anthropic, Google DeepMind) doing research and a growing B2B AI security sector selling products for monitoring, compliance, automated red teaming, and guardrails (20:12).
- Automated red teaming is useful for demonstrating LLM vulnerabilities but rarely reveals anything not already known.
- The seller’s pitch (“our guardrails catch everything!”) is dangerously misleading (28:33).
- Much of the statistics or metrics touted are either fabricated or not meaningful due to the impossibility of exhaustively testing the attack space (31:04).
- Market correction is imminent: companies will realize these solutions don’t actually fix the problem and revenues for these vendors will collapse (82:21).
5. Why Is Securing AI So Hard?
-
Unlike classical software, where vulnerabilities can be patched and verified, AI systems (“patching a brain”) are inherently unpredictable and un-patchable at scale (25:25, 40:49).
-
Even the best frontier lab researchers admit there’s no satisfactory solution—if they can’t solve it, B2B vendors certainly can't (31:04).
-
Many "defenses" are easily bypassed by using other languages, clever splitting of queries, or adaptive attacks (multistep prompt chains, etc.).
-
Human attackers remain far more effective than automated attack systems at breaking AI defenses (31:04).
-
Quote:
"You can patch a bug, but you can't patch a brain...try to do that in your AI system. You can be 99.99% sure the problem is still there."
—Sander Schulhoff (00:25, 40:49)
6. What Actually Works? Advice for Companies
If You’re Deploying Simple LLM Chatbots
- Don’t panic. If your tool only enables basic Q&A or support with no ability to take actions or access sensitive user/company data, your risk is reputational rather than technical (45:38).
- Malicious users can always get similar behavior from the core models (ChatGPT, Claude, Gemini) anyway, so chatbots per se aren’t at unique risk.
When Deploying Agentic or Action-Taking Systems
- Invest in classical cybersecurity and data/action permissioning at the interface between AI and real systems.
- Assume the worst: Treat agents like "an angry god in a box"—design systems so even a malicious agent cannot cause harm (54:23).
- Camel framework (from Google): Assigns minimal necessary permissions for tasks, limiting an agent’s ability to act maliciously if compromise occurs (65:59).
- Not a silver bullet, best for narrow/specific agentic roles.
- Education:
- Build teams that include both AI security researchers and classical cybersecurity experts.
- Avoid over-reliance on “AI security” vendors.
- Ensure your team is genuinely literate in both LLM risks and computer security fundamentals (59:30, 71:18).
What NOT to Waste Time/Money On
- Don’t buy or trust vendor “guardrails” or “automated red teaming” products as a primary line of defense.
- Don’t believe in “prompt engineering” or system prompt-based defenses—they are fundamentally breakable (40:49).
- Don’t waste resources on multiple redundant guardrails (56:19).
Memorable Quotes & Moments
-
Sander Schulhoff’s refrain:
"AI guardrails do not work. I'm going to say that one more time. Guardrails do not work." (00:00, 28:33, 85:48)
-
On the state of mitigation:
"It's really important for people to understand that none of the problems have any meaningful mitigation... The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secured."
—Alex Komarosky (as quoted by Lenny, 11:11) -
On the sheer scale of the attack space:
"For a model like GPT5, the number of possible attacks is one followed by a million zeros... it's basically infinite."
—Sander Schulhoff (31:04) -
On differences from classical security:
"You can patch a bug, but you can't patch a brain."
—Sander Schulhoff (00:25, 40:49) -
On the importance of real cybersecurity, not AI guardrails:
"AI researchers are the only people who can solve this stuff long term, but cybersecurity professionals are the only ones who can kind of solve it short term."
—Sander Schulhoff (58:33) -
On education vs. products:
"We actually want to scare people into not buying stuff."
—Sander Schulhoff (71:44)
Timestamps for Key Segments
- Intro & Thesis: AI guardrails do not work (00:00—00:39)
- Definition of Jailbreaking vs. Prompt Injection: (08:38—09:54)
- Real-world Examples of Prompt Injection: Remotely IO, Math GPT, Claude code attack (11:51—17:56)
- Why the Industry is Unprepared: (18:27—21:09)
- State of the AI Security Industry: (20:12—23:57)
- Automated Red Teaming & Guardrails Explained: (21:18—23:57)
- Adversarial Robustness and Attack Success Rate: (23:57—25:33)
- How Security Vendors Sell “AI Guardrails”: (25:54—28:33)
- Why Guardrails Fail: Scale of attack space, human attackers, marketing myths (31:04—38:22)
- Why There Haven’t Been Major Attacks Yet: (38:22)
- What Actually Works (Advice for CISOs): (45:11—51:34)
- Importance of Classical Security + AI Security: (48:48—54:06)
- On "Control" and the Alignment Problem: (54:23)
- On Camel Permissions Framework: (65:59—68:33)
- Advice for Frontier Labs & the Future: (72:06—77:52)
- Market Correction Prediction for AI Security: (82:21)
- Final Takeaways on Guardrails & AI Security: (85:48—90:13)
- Where to Find Sander & Further Education: (91:01—92:10)
Actionable Advice—What Should You Do?
If You're a Company/Developer:
- For simple conversational agents: Don't overinvest in guardrail products.
- For any agent that can read/send data or perform actions:
- Use principle of least privilege (like the Camel framework).
- Invest in robust permission boundaries.
- Combine classical security professionals with AI security researchers.
- Log all actions and have incident monitoring (forensics when something goes wrong).
- Treat every action the agent can take as a possible attack vector—design as if a determined attacker will break in.
If You’re an AI Security Consumer:
- Focus on education and governance; don't fall for overhyped vendor solutions.
If You’re a Researcher:
- Stop doing only “offensive” adversarial security research (i.e., writing yet more jailbreak papers); focus on defense and education (85:48).
Notable Companies / Tools Mentioned
- Frontier Labs: OpenAI, Anthropic, Google DeepMind
- Compliance/Governance: Trustable
- AI Security Vendor Example: Repello
- Agent Permissioning (Research): Camel framework (Google)
Conclusion: The Coming Crisis
Sander warns that, as soon as agentic AIs and LLMs begin to mediate real-world actions at scale, we will see damaging, possibly catastrophic, prompt injection attacks. Most companies are overly reliant on ineffective solutions and have inflated security trust due to vendor claims. The answer is a renewed focus on security fundamentals, cross-disciplinary education, careful permissioning, and realistic threat modeling—assuming that all current “guardrails” will, sooner or later, be bypassed.
Final Quote:
"Guardrails don't work. They really don't. And they're quite likely to make you overconfident in your security posture, which is a really big problem. The reason I'm here is because stuff's about to get dangerous."
—Sander Schulhoff (85:48)
Links/Resources:
- Sander’s Twitter: @SanderFulhof
- HackAI Maven AI Security Course: Hackai.co
- Learn Prompting: learnprompting.org
- Trustable (compliance): trustable.ai
- Repello (security tools): repello.ai
- [Camel Framework paper/preprints on permissioned agents (Google Research)]
This episode delivers a sobering but practical guide to the real, unsolved state of AI security. If you touch, build, or rely on AI agentic tools—or are thinking about deploying AI in your company—this is essential listening.
