
SANS Stormcast Monday Mar 3rd: AI Training Data Leaks; MITRE Caldera Vuln; modsecurity bypass
A
Hello and welcome to the Monday, March 3, 2025 edition of the SANS Internet Storm Center's Stormcast. My name is Johannes Ullrich and today I'm recording from Baltimore, Maryland. Well, let's start today with some stories about AI training data. And the first one here comes from Truffle Security. Truffle Security, of course, is the company behind TruffleHog, the very frequently used, well-respected tool that allows you to identify API keys and other secrets that you may leak in Git or other repositories and such. So Truffle Security took a big database of AI training data that's being offered by Common Crawl. Common Crawl has been going out and spidering the web for many years now. There's something like 400 terabytes of data that they're offering and, well, it shouldn't really be a surprise, because it's the same thing that we had with Google and other web spiders that then offer the data publicly, that it now becomes relatively straightforward to find things like API keys that people leaked on their websites. A little bit tricky here is that this data is also historic data. I believe they've been doing this for the last 10 years or such. So it is not just current data. Now sites like Google, they offer some historic data, but usually focus more on current data. They found around 12,000 of what Truffle Security considers live keys, which means that they work according to TruffleHog. TruffleHog has a little sort of test feature that allows you to make sure that these are not just simple sample or expired credentials that are being used here. They point out in their paper that this number of roughly 12,000 secrets is of course just an estimate. There are some that they missed just because they were not formatted correctly. And then of course it's always a little bit tricky to figure out if they're actually being used, or if they're just demo credentials and such. They also point out that many of the credentials can be found across a large number of websites in this data repository.
Initially when I read this I first thought that hey, maybe these are just demo credentials and such. We often have like the snake oil secret key that sort of comes with Apache, that of course is all over the place. Well, according to Truffle Security, they believe that this is more a case of multiple websites using the same piece of JavaScript, which could identify suppliers and supply chains and such. So what does this all mean overall? Yes, if data is exposed, it probably got captured by someone. From my own experience, particularly for a smaller website, the vast majority of sort of hits you get is crawlers like this. So no real big surprise if these credentials end up pretty quickly in repositories like this Common Crawl and can then easily be abused. The second story that's also related to training data comes from researchers at Lasso Security. And what they noticed is that the training data being used by GitHub's Copilot, which, well, is Microsoft, contains data from what are now private GitHub repositories. So Copilot uses GitHub as training data and that's publicly known. That's well established, but they only use public repositories. What Lasso Security found here is that, well, if your repository was public even for a relatively short amount of time, or when you initially set it up, well, it's going to be added. And GitHub Copilot doesn't necessarily remove data after it's marked as private by the author of that data. And not only that. Now you may say, hey, you know, if it's part of training data it may not be such a big deal, it may just help people code a little bit or such. But you can actually ask Copilot for, hey, a list of files in that particular repository. And with that you basically get a very direct interface into these files that were public at the time. Again, this is only if these files were public at a particular point in time. But they found literally thousands of these repositories were exposed, including some sort of big name brand companies.
Just like I said earlier, if at any point in time your data was exposed, assume it got grabbed by someone and, well, it has to be considered leaked at this point.
Well, then we've got some vulnerabilities to talk about. MITRE Caldera is a framework to make it easy to simulate adversaries, so your red teamers may use it. It implements a REST API and allows for plugins to be controlled to automate various parts of the attack scenario. Sadly, MITRE announced last week that Caldera itself is vulnerable to an interesting command injection. The vulnerability derives from the Manx and Sandcat agents. These agents are intended to be used to implement a reverse shell, but they do require authentication. However, these agents have the ability to be compiled sort of just in time for a particular platform, and the attacker can actually then supply some compile parameters and with that they can execute arbitrary code. Interesting in part because this is part of an attack framework that's supposed to execute arbitrary code, but not for everybody, only for authorized users here. And that's sort of why you definitely want to update it. There's a great sort of post by MITRE, actually. I like that they go into detail on what really went wrong here. But with that they also did publish a proof of concept exploit. I don't see this as sort of a vulnerability that's very likely to be exploited, but it could certainly be exploited in a more targeted attack.
And we have an interesting vulnerability in ModSecurity. This vulnerability is not super severe, but well, it does allow bypassing of ModSecurity rules, and since that's the point of ModSecurity, it sort of invalidates the tool somewhat. All you have to do is prepend HTML-encoded values with zeros. Luckily it only affects one particular version of ModSecurity, so double check and if necessary update.
Episode: March 3, 2025: AI Training Data Leaks; MITRE Caldera Vuln; modsecurity bypass
Host: Johannes B. Ullrich
Location: Recording from Baltimore, Maryland
In this brisk episode, Johannes B. Ullrich delivers a rundown of recent, high-impact stories in information security. The main themes are the persistent risk of sensitive data leaking through AI training datasets, specific vulnerabilities in the security tools MITRE Caldera and ModSecurity, and the implications for practitioners and organizations.
[00:28 - 03:07]
Truffle Security Research:
Truffle Security (makers of Trufflehog) analyzed Common Crawl’s massive web archive (~400TB).
Historic Exposure Issues:
Common Crawl archives about a decade of web data, so even old and once-leaked information remains accessible.
Supply Chain Connection:
Many exposed secrets appear across multiple sites, implying shared supplier code or JavaScript, not just demo keys.
Security Takeaway:
If sensitive keys or credentials were ever publicly exposed, even temporarily, "assume it got captured by someone." [02:40]
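The kind of scan described above can be sketched in a few lines of Python. This is a minimal illustration, not TruffleHog's actual detector logic: the single regular expression (the commonly cited shape of an AWS access key ID), the function name find_candidates, and the sample pages are all assumptions made for the example. Unlike TruffleHog, it does not test whether a key is live; it only shows how one key reused across several sites surfaces in crawled data.

```python
import re
from collections import defaultdict

# Assumed, illustrative pattern: the widely documented shape of an AWS access key ID.
# Real scanners use many detectors and then verify candidates against the service.
CANDIDATE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_candidates(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each candidate key to the set of URLs whose body contains it."""
    hits: dict[str, set[str]] = defaultdict(set)
    for url, body in pages.items():
        for key in CANDIDATE.findall(body):
            hits[key].add(url)
    return dict(hits)


# Hypothetical crawled pages: two sites ship the same script with the same key.
pages = {
    "https://a.example/app.js": 'var k = "AKIAABCDEFGHIJKLMNOP";',
    "https://b.example/app.js": 'var k = "AKIAABCDEFGHIJKLMNOP";',
    "https://c.example/index.html": "<p>no secrets here</p>",
}

hits = find_candidates(pages)
for key, urls in hits.items():
    print(key, "seen on", len(urls), "page(s)")
```

A key that shows up on many unrelated sites is the pattern Truffle Security attributes to shared supplier JavaScript rather than demo credentials.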
[03:08 - 04:32]
GitHub Copilot Issue:
Research from Lasso Security found that Copilot’s AI training data includes content from now-private repositories if they were public at any time.
Security & Privacy Implication:
There are "literally thousands of these repositories" (including those from big-brand companies) exposed in this way.
Main Message:
Any data public, even briefly, may remain in third-party datasets indefinitely and should be considered leaked.
[04:33 - 05:37]
Background:
MITRE Caldera is a framework for simulating adversarial (red team) attacks.
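Vulnerability:
A command injection in the just-in-time compilation of the Manx and Sandcat agents: an attacker can supply compile parameters and use them to execute arbitrary code.
Risk Perspective:
Widespread exploitation is seen as unlikely, but the flaw could be used in a more targeted attack.
Remediation:
MITRE published a detailed postmortem and a proof-of-concept exploit. Prompt updates are recommended.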
[05:38 - End (~06:10)]
Details:
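A bug in one specific ModSecurity version allows rules to be bypassed by prepending zeros to HTML-encoded values.
Severity:
Not very severe, but it undermines the purpose of the tool. Only one particular version is affected, so check the installed version and update if necessary.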
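To see why zero-padding matters, here is a minimal Python sketch. It is not ModSecurity's code: naive_decode is a deliberately flawed, hypothetical decoder and blocks is a toy rule, both invented for illustration. The standard library's html.unescape stands in for a decoder that handles leading zeros the way a browser would.

```python
import html
import re


def naive_decode(text: str) -> str:
    """Hypothetical flawed decoder: skips numeric entities that start with a zero."""
    return re.sub(r"&#([1-9][0-9]*);", lambda m: chr(int(m.group(1))), text)


def blocks(payload: str, decoder) -> bool:
    """Toy rule: flag the payload if its decoded form contains '<script'."""
    return "<script" in decoder(payload).lower()


plain = "&#60;script&#62;"             # decimal entities for '<' and '>'
padded = "&#0000060;script&#0000062;"  # same characters, padded with zeros

print(blocks(plain, naive_decode))    # flawed decoder still catches the plain form
print(blocks(padded, naive_decode))   # flawed decoder misses the padded form
print(blocks(padded, html.unescape))  # a correct decoder catches it
```

The padded and plain payloads mean the same thing to whatever consumes them downstream, so a filter that decodes only one form can be sidestepped.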
On crawler exposure:
“From my own experience, particularly for a smaller website, the vast majority of sort of hits you get is crawlers like this. So no real big surprise if these credentials end up pretty quickly in repositories like this Common Crawl and can then easily be abused.”
— Johannes B. Ullrich [02:28]
On AI training data permanence:
“If at any point in time your data was exposed, assume it got grabbed by someone and, well, it has to be considered leaked at this point.”
— Johannes B. Ullrich [04:20]
On the ModSecurity vulnerability’s irony:
“It does allow bypassing of ModSecurity rules, and since that's the point of ModSecurity, it sort of invalidates the tool somewhat.”
— Johannes B. Ullrich [05:55]
This concise episode spotlights how persistent sensitive data leaks are in AI training sets and other public datasets, and shows that security tools and attack frameworks carry vulnerabilities of their own. The reminders are clear: anything public, even briefly, may remain accessible permanently, and prompt review and patching remain essential in both defense and attack simulation contexts.