Airlines Sold Your Flight Data to DHS—And Covered It Up - The 404 Media Podcast


The 404 Media Podcast: "Airlines Sold Your Flight Data to DHS—And Covered It Up"
Release Date: June 18, 2025

Introduction

In this episode of The 404 Media Podcast, hosts Joseph, Sam Cole, Emmanuel Mayberg, and Jason Kebler delve into a startling revelation: major airlines have been selling passengers' flight data to the Department of Homeland Security (DHS) while attempting to conceal this arrangement. Additionally, the podcast explores the burgeoning issue of AI scraping bots disrupting digital libraries, archives, and museums. This comprehensive summary captures the key discussions, insights, and conclusions drawn by the hosts.

Main Story: Airlines and DHS Data Sale

Discovery and Investigation

Joseph kicks off the primary discussion by recounting how he uncovered the story. On May 1, he noticed a new contract between Immigration and Customs Enforcement (ICE) and the Airlines Reporting Corporation (ARC) in government procurement databases.

Joseph [04:24]: "I saw that ICE entered some sort of contract with Airlines Reporting Corporation... And I filed a FOIA."

By filing Freedom of Information Act (FOIA) requests, Joseph obtained documents revealing that DHS was purchasing extensive flight data from ARC. This data included passenger names, credit card information, and detailed flight itineraries.

Contractual Secrecy

A crucial aspect highlighted is a contract clause that bars CBP from publicly identifying ARC as the source of the data. Joseph reads the clause from the contract:

Joseph [06:19]: "The contract between ARC and CBP tells the agency to not publicly identify the vendor or its employees... unless compelled by a court order."

This clause ensures that the airlines remain behind the scenes, obscuring their role in supplying the data.

Scope and Usage of Data

The hosts discuss the breadth of the data sale. ARC is owned by at least eight major U.S. airlines, and its board includes representatives from Delta, Southwest, United, American Airlines, Alaska Airlines, and JetBlue, as well as Lufthansa and Air France from Europe and Canada's Air Canada. It acts as a data broker, selling access to the booking data it handles.

Jason Kebler [07:32]: "ARC sits in the middle of the transaction between travel agents and airlines, harvesting valuable data."

According to CBP, the data is used by its Office of Professional Responsibility (OPR), the agency's internal affairs unit, to investigate misconduct by CBP personnel.

Joseph [13:32]: "CBP stated the data is used for their internal watchdog, OPR, to investigate corrupt or criminal activities."

Privacy and Legal Implications

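The conversation shifts to the broader privacy concerns. Because the data is purchased rather than obtained through legal process, no warrant is involved, and Joseph notes that he saw nothing in the contract restricting its use to purposes such as national security or counterterrorism.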

Joseph [16:29]: "This isn't being done with a warrant... DHS is buying bulk access to billions of flight records without clear legal limitations."

Jason adds that this arrangement fosters a surveillance relationship where law enforcement feels empowered to utilize the data without adhering to traditional legal processes like obtaining warrants.

Other Agencies Involved

Beyond CBP, other government agencies such as the Secret Service, SEC, DEA, Air Force, U.S. Marshals Service, TSA, and ATF are also purchasers of ARC's flight data. The exact purposes remain largely undisclosed, pending further FOIA responses.

Joseph [14:44]: "Agencies like the Secret Service, DEA, and others are also buying this data, but we don't yet know how they're using it."

Future Investigations

Looking ahead, Joseph and Jason plan to explore additional contracts and examine whether local police forces have access to this data, raising further concerns about widespread surveillance capabilities.

Joseph [21:35]: "We're waiting on contracts from other agencies and investigating whether local police have access to this data."

Secondary Story: AI Scraping Bots Disrupting Cultural Institutions

The Growing Problem

After an advertisement break, the podcast transitions to a report by Emmanuel Mayberg on AI scraping bots negatively impacting libraries, archives, and museums. The issue, quantified for the first time through a survey by Michael Weinberg of the GLAM-E Lab, a joint project of NYU and the University of Exeter, shows that AI bots are overwhelming these institutions with excessive traffic, leading to service disruptions. The survey drew 43 anonymous respondents.

Emmanuel Mayberg [27:02]: "AI scrapers are flooding open resources with traffic, sometimes taking them offline."

Impact on Institutions

A notable case discussed is the University of North Carolina, Chapel Hill, where AI bots caused significant downtime for their extensive online library resources.

Emmanuel Mayberg [31:26]: "UNC's online library became inaccessible due to constant spam from AI scrapers."

Challenges in Mitigation

The hosts highlight the difficulty small and medium-sized institutions face in combating these scrapers. Implementing technical solutions like CAPTCHAs or login walls contradicts the mission of making cultural heritage freely accessible.

Emmanuel Mayberg [34:09]: "Implementing protections like CAPTCHAs can make resources less accessible, counter to the mission of these organizations."

Current and Future Solutions

Jason discusses potential mitigation strategies, including leveraging services like Cloudflare to deploy firewall protections and actively monitoring for scraper activity to negotiate with offending companies.

Jason Kebler [38:44]: "Cloudflare offers tools to protect against scrapers, but it's a constant battle as new bots emerge."

The conversation underscores the ongoing arms race between cultural institutions striving to preserve accessibility and the relentless advancement of AI scraping technologies.

Conclusion

This episode of The 404 Media Podcast sheds light on two significant and interrelated issues of data privacy and accessibility in the digital age. The revelation that airlines have been covertly selling flight data to DHS raises critical questions about surveillance and individual privacy. Simultaneously, the surge of AI scraping bots disrupting cultural institutions highlights the challenges of maintaining open access in an era of advanced data collection technologies. The hosts emphasize the need for greater transparency, robust legal frameworks, and innovative technical solutions to navigate these complex landscapes.

Notable Quotes:

Joseph [06:19]: "You are not allowed to reveal where this airline data came from... unless compelled by a valid court order."
Jason Kebler [07:32]: "ARC sits in the middle of the transaction between travel agents and airlines, harvesting valuable data."
Joseph [16:29]: "DHS is buying bulk access to billions of flight records without clear legal limitations."
Emmanuel Mayberg [27:02]: "AI scrapers are flooding open resources with traffic, sometimes taking them offline."
Jason Kebler [38:44]: "Cloudflare offers tools to protect against scrapers, but it's a constant battle as new bots emerge."

For those interested in supporting independent journalism and gaining access to additional exclusive content, consider subscribing to 404 Media at 404media.co.


Transcript

[00:00]
Unknown Journalist
So my 404 Media colleagues probably remember when I got doxxed, which was a nightmare for everyone involved, mostly me. My name, address, phone number, Social Security number and a bunch of other information was leaked online, which led to all these spam calls, harassment, threats, etc. Even if you're not a journalist, a sophisticated network of data brokers is making your personal information available to the highest bidder. I fixed my problem with DeleteMe, which is a service that basically looks you up on all these people search websites and data broker websites and formally gets you removed from them. The subscription service removes your personal info from the largest search databases on the web, helping prevent potential ID theft, doxxing and phishing scams. I'm a real DeleteMe customer. I've been using it for more than five years. Signup is so easy. You just go to their website and then they send you personalized privacy reports showing you what info they found, where they found it and how they got it removed. Take control of your data and keep your private life private by signing up for DeleteMe now with a special discount for our listeners today. Get 20% off your DeleteMe plan when you go to JoinDeleteMe.com/404Media and use promo code 404Media at checkout. The only way to get 20% off is to go to JoinDeleteMe.com/404Media and enter code 404Media at checkout. That's JoinDeleteMe.com/404Media, code 404Media.
[01:33]
Joseph
Hello and welcome to the 404 Media Podcast, where we bring you unparalleled access to hidden worlds, both online and IRL. 404 Media is a journalist-founded company and needs your support. To subscribe, go to 404media.co. As well as bonus content every single week, subscribers also get access to additional episodes where we respond to their best comments. Gain access to that content at 404media.co. I'm your host, Joseph, and with me are 404 Media co-founders Sam Cole.
[02:04]
Sam Cole
Hey.
[02:05]
Joseph
Emmanuel Mayberg.
[02:06]
Emmanuel Mayberg
Hello.
[02:06]
Joseph
And Jason Kebler. Hello.
[02:09]
Jason Kebler
Hello.
[02:10]
Joseph
So, real quick, hopefully you're hearing this podcast in time. You should be, because we publish this to subscribers Tuesday evening, and then Wednesday morning free subscribers get it. But on Wednesday the 18th, that's tomorrow or today, depending on where you're listening, at 1:00pm EST, we're going to be having our latest FOIA Forum. This is a live streamed event of an hour, realistically two hours, we usually go over, where we're going to explain to you how to pry records from the government using freedom of information requests and public records requests. Specifically, we're going to be talking about a story that Emmanuel and Jason did a while back: a company called Massive Blue, who were making these AI personas for cops that pose as college protesters. Really, really wild stuff. So if you want to learn how we did that and how you can replicate those requests, please become a paid subscriber, or if you already are one, keep an eye out for an email with a link to a live stream. We've tried to put it in a lot of places. I'll put a link into the show notes here as well. And we also tried to put it at the top of the emails as well. And beyond that, Jason, I think you wanted to talk about merch as well.
[03:33]
Jason Kebler
Yeah, we have merch back in stock. Our 404 code tank tops were incredibly, incredibly popular. So thank you all for ordering them. I ordered a bunch more. We have them in every size. So if you want one of those, you can go to 404media.co and then click Merch and you'll see them there. And also if you pre-ordered them, that means that your pre-order is going out very, very soon. So thank you for the patience there. Should we get into it? Because I believe I'm asking you some questions, Joe.
[04:06]
Joseph
Sounds good.
[04:07]
Jason Kebler
Yeah. So the first story we're talking about this week is "Airlines Don't Want You to Know They Sold Your Flight Data to DHS." This is a really wild story. I didn't know about this at all, that this was happening. Where did you first find the story?
[04:25]
Joseph
So on May 1, I noticed that Immigration and Customs Enforcement, ICE, had a new contract in these government procurement databases. Basically what I do is I have a shortcut on my desktop and I'll click it and every so often I would just check to see the latest contracts ICE has with the government or, you know, I've done it for Customs and Border Protection and other agencies as well. It's just sort of what we're covering at the moment. But on May 1st I saw that ICE entered some sort of contract with Airlines Reporting Corporation and I'm like, well, that sounds interesting. What the hell is that? And I filed a FOIA and then I looked for other agencies that had deals with Airlines Reporting Corporation and we'll get into what those were as well. But the main one that this story is about is Customs and Border Protection, CBP. I filed those FOIAs. Then ICE actually released some more documents about this purchase of data. And The Lever actually reported that about a week or so after. And now what we have are these documents that I got from CBP and they lay out in much more detail the sort of data that DHS is buying, the use cases for it. And I'm sure we'll get into the most important thing, which you highlighted when you were editing the article, the fact that the airlines were basically trying to cover it up. Right. I feel like that stood out to you.
[06:01]
Jason Kebler
It really did stand out to me because I don't know exactly the language that they used. I should have the story up, which I don't. My computer exploded while we were recording this podcast, so I lost my tabs. But they are back up now. But you have it up, so why don't you read it?
[06:19]
Joseph
Yeah. So one part of the documents, this is the contract between Airlines Reporting Corporation, ARC, and CBP, tells the agency to not publicly identify vendor or its employees, individually or collectively, as the source of the reports unless the customer is compelled to do so by a valid court order or subpoena and gives ARC immediate notice of same. In other words, you are not allowed to reveal where this airline data came from that you were using to generate internal reports or whatever else you're going to make with the data.
[06:57]
Jason Kebler
Right. So this is a part of a travel intelligence program through, as you said, arc, and as I understand it, this is like a company slash entity that was spun up by most major airlines in the United States for the purposes of selling customer data. Like more or less like it is a data broker that is owned by American Airlines and by American Airlines, I mean United States based airlines, including American.
[07:32]
Joseph
Airlines and American Airlines. Literally. Yes. Yeah. So they make this data broker and the way it works is that when you book a flight with a travel agent, maybe that's online or maybe you go to a physical one, there has to be some sort of conduit between the travel agent and you and the airline. And ARC sits in the middle of that transaction and it's able to get this data. I mean, it provides a legitimate service there, it routes this information, it allows these bookings to take place. But on the side, what ARC does is it develops products based on that data. So maybe they can see, oh wow, the number of flights went up after Covid or something. That's just a hypothetical, but there's all of these sorts of trends and that sort of thing. But what they're also doing, according to these documents we got and the ones published by ICE, is that ARC has a side hustle basically of selling this data to the government as well. And you mentioned some of the airlines. I mean, there's ones on the board. I'll just double check them. Yes, they have representatives from Delta, Southwest, United, American Airlines, Alaska Airlines, JetBlue, and then you have Lufthansa and Air France from Europe as well, and Canada's Air Canada. And there's a little bit of discrepancy in the documents we got. It says eight major U.S. airlines own it and then another one says nine. I think one probably joined over. We just frame it as at least eight airlines own this data broker.
[09:15]
Jason Kebler
Right. And when you say travel agents, I mean obviously you were like, if you go to a travel agent, that that would be a conduit. But you're talking about sites like Expedia, for example, you know, like really widely used websites. I just want to stress that it's like this is not affecting only people who are going to a specific travel agent. It's like it's third party booking services, of which there, there are many.
[09:41]
Joseph
Yeah, that probably would have actually been a better way to phrase it. Like third party booking services. Yeah, it's not just obscure brick and mortar travel agents in your neighborhood or something like that. It's massively popular sites like Expedia, where this data is being essentially harvested from. In, I think it was the ICE documents or maybe it was the Customs and Border one that we got, interestingly, DHS says ARC does not contain data if somebody books a flight directly with an airline, which is kind of interesting because you go, well, it's with the airlines, won't they just sell it? No, because ARC is not in the middle of that transaction. You're going straight to the airline. You're booking with JetBlue or United or whatever; that doesn't end up in ARC's really big data set of billions and billions of records. And I guess I should say that's passenger names, the credit card used, which I found really interesting. You can search by credit card. And then of course the flight itineraries. So you know where someone has been, where they maybe are going to fly that day or something. I find that really interesting. You kind of know what they're going to do in the future, which isn't really the case with a lot of data we cover, like location or whatever it is. Predicting and showing where someone is going to be at a later date, which is pretty novel.
[11:05]
Jason Kebler
Yeah, yeah. I guess I'm curious, do we know what law enforcement does with this type of data? Because some of the responses I saw to this article, and I don't think they're good responses, but some of the responses I saw were like, well, you have to show your ID when you get to an airport and therefore like DHS will know that you're going to be there. I also assume there's like some sort of roster or something. I actually don't understand exactly how this works and I'd be curious to either read more about it if someone's already reported this, or to do more reporting on it. Like how does DHS know which people are going to be at an airport on any given day? And I would imagine that this is one of the ways. Right. It's as you said, they can then predict who is going to be where and at what times because they have this sort of like future data.
[12:01]
Joseph
Yeah. And I mean, I think an important thing to remember is that DHS is not a monolith. Right. Like TSA is going to know who is in an airport at that time because you're showing your ID to the TSA agent and you're literally right in front of them. You're announcing yourself, basically. Right. And they're going to have access to other data along the way there. Other parts of DHS can get this data and potentially in other ways as well. But again, it's not like a one size fits all solution. The reason that Customs and Border says it's buying this data is it's for the Office of Professional Responsibility, OPR, which is basically like its internal watchdog. It's internal affairs, so that if somebody in Customs and Border Protection is doing something corrupt or criminal or whatever, this internal affairs unit can and is supposed to investigate them. And when I got a statement finally from Customs and Border Protection about this, they said it's just used for that. It is just used for that division or unit to investigate those sorts of people. And that's all well and good. Some people may even say that that's a legitimate and a good use case. But we couldn't have that conversation until now, because we published it and because we found out, and the airlines were trying to cover it up in the first place. You know, like it's really about the sale rather than the use.
[13:32]
Jason Kebler
Well, there's that, which I think we should talk a little bit more about. But then DHS is not the only agency that has bought this sort of data. Like ARC has deals with other agencies as well, right?
[13:46]
Joseph
Yeah. So again, when I first saw the ICE deal, then I did a bunch of FOIAs. And we're still waiting for the vast majority. But beyond Customs and Border Protection, there's the Secret Service, the SEC, DEA, Air Force, U.S. Marshals Service, TSA, funnily enough, and ATF, the Bureau of Alcohol, was it Tobacco and Firearms? Now, I don't know, maybe the SEC is using it for a very different reason to the DEA. You would imagine so because those agencies have completely different mandates. But we don't know specifically what they're using it for yet. And that's why we have all of these Freedom of Information requests out. And again, maybe it comes back and they're using it for fairly innocuous purposes. Maybe some are using it for much more interesting ones. But the sale is happening in the first place and because the data is being sold, there isn't really a legal mechanism there, they're just buying access to it.
[14:44]
Jason Kebler
Right. And I mean, what really stood out to me again is that it's happening through this third party. It's like happening through this umbrella corporation, ARC, again, Airlines Reporting Corporation, which no one has ever, ever heard of because they have an extremely low profile. And then again, no one has heard of it because in its contract it says don't say where the data came from. And that's like one of my favorite things to FOIA, and that's like a really great thing to FOIA if people are listening to this and are interested in it: a lot of times when companies sign contracts with the government, the company will try to put a non-disclosure agreement into the contract. But that non-disclosure agreement itself is subject to FOIA just because of the way that, you know, FOIA works, and that is a public record. It's, you know, it's taxpayer money that's being used to purchase this and therefore it should be available. And so this wasn't a specific non-disclosure agreement, but it was a, you know, a section of the contract that said, hey, don't say that we signed this contract. Don't say that the airlines were the sources of the data. And it's an example of these companies, these airlines, sort of like double dipping. It's just like them finding other ways to monetize other than just like selling you access to a flight. They're figuring out like, okay, well now we have this huge information database about who is flying, where they're flying, what credit cards they're using, that sort of thing. How can we further monetize this? And I think that is a conversation.
[16:29]
Joseph
Worth having yeah, and I think that's why so many people were pissed off at this. You're already paying for a flight where you're going to be crammed into some economy seat with no legroom. You're going to have to pay extra for a bag, you have to pay for wifi or something. And then on top of all of that, we're also going to sell your flight data to the government. And I don't think people are particularly happy about that. You mentioned the non disclosure agreements and it reminds me of when we covered a lot of location data being sold to the government. That is ordinary apps installed on your phone sending location data off to a company and then they sell it directly or it gets sold to somebody else who then sells it to the US government, including Customs and Border Protection. Funnily enough, you'll go through the sort of contracts related to that. And there was one for a tool called LocateX made by Babel Street, I think, and there was sort of a memdum in there where it said you cannot use this information in court and you can't reveal this information. It's supposed to just be used for leads and tips and intelligence. And it reminded me of that basically where you have these government agencies buying data and then there may be no transparency or accountability of where that data came from or how it's being used by design. And I guess that also leads to that. I feel it's obvious. But almost to stress it, this isn't being done with a warrant. I don't think you necessarily need a warrant to get flight data ordinarily. But this isn't just talking about one or two flights. It's talking about Customs and Border Protection and potentially these other agencies buying bulk access to billions of people's flight records, then they can search through basically at their own whim. I didn't see anything in the contract that says you can only use this for national security, you can only use this for combating terrorism or something like that. 
I didn't see any disclaimers like that in the contract. So at least theoretically, until we get more information, it's kind of up to the customer to do what they want with this information. And we see that when law enforcement agencies buy data because that's exactly why they're buying it. They want to be able to do what they want without the legal processes in place.
[18:53]
Jason Kebler
Right. This came up in the context of some of our Flock reporting, which I'll just like quickly run through, but a few commenters on our website were saying, well, why don't the cops need to get a warrant to search, you know, for license plate data or whatever? And the argument that one would make is like, you don't have an expectation of privacy when you're in public. There is nothing stopping a cop from standing on the corner and writing down the license plate of everyone that drives by. But what our laws were like, not really written for was the automation of these sort of things and the privatization of it and also the fact that it's done at scale and in like a historic way. And so there is a really interesting lawsuit in Virginia about Flock, about whether, you know, the automation of this type of technology does change the calculus as to whether cops need a warrant or not. And we're going to be following that. But basically it's like you can stand on the corner and look at a license plate, but can you stand on every corner of every street at the same time with an automated camera, take a picture, log that into a database, you know, make a historic record of where a specific car has gone and do that all over the entire country all at once? And that's a little bit like what we're talking about here, where as private companies get more and more into surveillance, they are deploying technologies and they're doing things that are allowing for like really big, like large scale mass surveillance. And then because the cops are buying access to these databases, the cops feel like they don't need a warrant because the police themselves are not the ones who are doing the surveilling. They are like buying access to a commercial product. And then the commercial company is the one that's actually like doing the surveillance. 
And I think that's a little bit like what's happening here and like what we've seen over and over again with, you know, social media monitoring companies, with, you know, data brokers in general. And I do think it's like a big flaw in our privacy laws and something that, you know, we need to talk more about.
[21:26]
Joseph
Yeah, absolutely.
[21:27]
Jason Kebler
I guess last thing on this is, you know, Future reporting, future FOIAs on this. Like, what are you looking into next here?
[21:35]
Joseph
Yeah, it's really just waiting to get back those contracts from those other government agencies, also looking into whether local police have access. One part, when I originally wrote this story, it was focused on that: the contract says Customs and Border Protection are using this data in part to support state and local police, which is obviously very interesting. You were right when you edited to bring basically the cover-up higher up into the story. But I find that very, very interesting. Do local police have access to this? I mean, I think that would be crazy, but I've seen some pretty wild things over the last few years. So there's that. There may be more emails about it and that sort of thing. And yeah, just who has access to this data on a wider scale, really? All right, should we leave that there?
[22:28]
Unknown Speaker
Yeah, let's leave that there.
[22:30]
Jason Kebler
When we come back, I beat you. After the break, we will talk about AI bots that are scraping museum websites, open libraries, archives, et cetera. It's a story by Emanuel. We'll be right back after this.
[22:58]
Unknown Speaker
You know what doesn't belong in your epic summer plans? Getting burned by your old wireless bill. While you're planning beach trips, barbecues and three day weekends, your wireless bill should be the last thing holding you back. That's why I made the switch to Mint Mobile. With plans starting at 15 bucks a month, Mint Mobile gives you premium wireless service on the nation's largest 5G network. It's the coverage and speed you're used to, but for way less money. So while your friends are sweating over data overages and surprise charges, you'll be chilling, literally and financially. Say bye bye to your overpriced wireless plan's jaw dropping monthly bills and unexpected overages. Mint Mobile is here to rescue you. All plans come with high speed data and unlimited talk and text delivered on the nation's largest 5G network. Use your own phone with any Mint Mobile plan and bring your phone number along with all your existing contacts. Ditch overpriced wireless and get three months of premium wireless service from Mint Mobile for 15 bucks a month. I realized that by sticking with the expensive guys, I was literally throwing money away. Get cell service that works great for much less with Mint Mobile. This year, skip breaking a sweat and breaking the bank. Get your summer savings and shop premium wireless plans at mintmobile.com/404media. That's mintmobile.com/404media. Upfront payment of $45 for 3 month 5 gigabyte plan required, equivalent to $15 a month. New customer offer for the first 3 months only. Then full price plan options available. Taxes and fees extra. See Mint Mobile for details.
[24:37]
Sam Cole
America is starting to talk more about mental health, but for lots of men it still remains a taboo. Just know that it's okay to struggle and that life is full of ups and downs. Whether you're going through a rough period or want to make sure things keep going well, therapy can help you make sure you're at your best for yourself and everyone in your life. There's no shame in therapy, and you're not alone. Therapy is not just for people who have experienced major trauma. Millions of people use BetterHelp to learn coping strategies, work through their depression or anxiety, and learn how to positively deal with the pressures of everyday life. With over 35,000 therapists, BetterHelp is the world's largest online therapy platform, making it really accessible and flexible. You'll definitely find a therapist that works for you and fits into your busy schedule. If you need to switch therapists at any time, cancel or reschedule an appointment or get in touch with your therapist, you can do it with the click of a button. As the largest online therapy provider in the world, BetterHelp can provide access to mental health professionals with a diverse variety of expertise. Talk it out with BetterHelp. Our listeners get 10% off their first month at betterhelp.com/404media. That's BetterHelp, H-E-L-P, dot com slash 404media.
[26:02]
Joseph
All right, and we are back. As Jason said, this is one written by Emmanuel, and the headline is "AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums." Emmanuel, this is based on a survey. Just to lay the groundwork, who made this survey and what did it find at a high level? And then we'll get into some of the really interesting specifics.
[26:26]
Emmanuel Mayberg
So this was written by Michael Weinberg, who works at NYU at something called the GLAM-E Lab. And that is something that NYU and the University of Exeter work on together. And it is basically an organization that helps small libraries, galleries, archives and museums take their collections, digitize them in some way, and make them available for everyone online for free.
[27:02]
Joseph
So we don't know what organizations were surveyed exactly. Right. It's like an anonymous survey, is that right?
[27:10]
Emmanuel Mayberg
Yeah. So we have heard anecdotally, and it's something that we've reported on before, that AI scrapers, which are these bots that kind of trawl across the Internet, look for valuable training data, and then hoover it up so they can train AI models, are flooding all these open resources with too much traffic, more than they can handle, and taking them offline in some cases. And you hear about that happening at this library or that museum or this collection. But this is the first attempt by someone to quantify the problem and see how widespread it is. And the bottom line is that it is very widespread. But there are some limitations that Weinberg acknowledges in the study. First of all, he invited as many organizations as possible to participate. Only 43 of them participated, and it's possible that they are self-selecting in some fashion.
[28:17]
Joseph
But what do you mean by that, by self selecting?
[28:21]
Emmanuel Mayberg
It's possible that some museum or some library saw the request and was, you know, befuddled by it and was like, what are you talking about? We don't have this problem. And they didn't respond. And obviously if the library did experience something, you know, they chose to participate. Right. So those 43 respondents, which we could talk about in some more detail, about the data, they're anonymous so they can speak more freely about what they're seeing and share some more private data and analytics about, like, how much traffic they're getting and what is knocking them offline. And I would say most importantly, and I think this would probably be familiar to you from security reporting, they don't want to be too specific about who they are and what they're doing to stop the AI scrapers, so the scrapers don't learn about the countermeasures and then can better circumvent them. So yeah, that's kind of who's involved in this and why they're anonymous.
[29:29]
Joseph
Yeah. Do we have any idea what sort of scrapers or specific scrapers we're talking about? Like are we talking about ChatGPT or anything like that? Or do the libraries not know? And that's part of the problem, you know what I mean?
[29:45]
Emmanuel Mayberg
There's also an attribution problem. Right. It's hard to say for sure for some of them. So the report doesn't name the specific scrapers, but we do know from experience that Anthropic. Sorry, not Anthropic. Perplexity, for example, in the past has been caught ignoring robots.txt, which is this file that site owners can put on their website to tell bots not to scrape it. And in the past this was sort of like an accepted norm that was respected. But increasingly, as this training data is becoming more valuable, it's ignored. And, like, Perplexity is one that has repeatedly ignored robots.txt. Others self-identify. Whether they ignore the robots.txt file or not, they self-identify what the bot is. And other times the organizations can make a guess based on the IP ranges that are hitting them. They're like, oh, these IP ranges are clearly coming from Alibaba. So it's safe to assume that Alibaba is scraping this website for AI training data based on the behavior, but it's hard to say for sure. It's possible that somebody is using Alibaba infrastructure, but it's actually a different company.
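Editor's note: the two mechanisms described here, a robots.txt file and guessing a scraper's origin from IP ranges, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anything taken from the report: the bot name `ExampleAIBot`, the robots.txt contents, and the `example-cloud` range (a documentation-only address block) are all hypothetical. Note that robots.txt only expresses a request; nothing in it stops a bot that chooses to ignore it.

```python
import ipaddress
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is disallowed, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder range from a documentation-only block, not a real provider's range.
KNOWN_RANGES = {
    "example-cloud": [ipaddress.ip_network("203.0.113.0/24")],
}


def may_fetch(user_agent: str, url: str) -> bool:
    """Return what robots.txt asks of this user agent (a well-behaved bot checks this)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def guess_network_owner(ip: str) -> Optional[str]:
    """Guess whose infrastructure an address belongs to; this is not proof of who runs the bot."""
    addr = ipaddress.ip_address(ip)
    for owner, networks in KNOWN_RANGES.items():
        if any(addr in net for net in networks):
            return owner
    return None
```

As the discussion points out, a match on an IP range only identifies the hosting infrastructure; a different company could be renting it.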
[31:03]
Joseph
Yeah. So what's some of the impact? You say that some get knocked offline or maybe take themselves offline. Somebody here has used a quote of a DDoS attack comparing it to that. What's some of the concrete impact that these scrapers are having on these open databases and archives and kind of ruining it for everybody.
[31:26]
Emmanuel Mayberg
Yeah. So one interesting thing about the report is that in the vast majority of cases, the only reason that an organization knows this is even happening is because their services are degraded to some noticeable degree. The site slows down. It's not accessible at all. This was just a coincidence. But last week, I think on Friday, University of North Carolina, Chapel Hill, which is a big university, a research university, has a very robust kind of online library full of books and papers. And it's something that students use, teachers use, just the public can use.
[32:08]
Jason Kebler
And.
[32:11]
Emmanuel Mayberg
They found out that this was happening to them because nobody could access it, which is very disruptive to the organization and the students and the teachers and all of that. And they have a big IT department, and they solved it by deploying some new kind of firewall that, again, they don't want to talk about in too much detail, so people don't learn how to circumvent it. But that's sort of like a typical example of how people know that it's a problem. The impact is that these resources exist for the public, and the whole goal of these organizations and of GLAM-E, where Weinberg works, is to make this cultural heritage, as he calls it, available to as many people as possible. That's the mission of the organization. It's like, oh, there's a little museum in France that has, like, a bunch of manuscripts that you can go see if you visit it. But wouldn't it be great if they just digitized everything and made it available online? And it's like, yes, that would be great. But then that opens them up to these scrapers, and all that data is very valuable now. So the impact is that the public no longer has access because it's being hoovered up so aggressively by all these different AI companies.
[33:35]
Joseph
Yeah, people want this data to be accessed by the public for the reasons you just laid out. But it seems that the trade off is you make it publicly accessible, you get swarmed by all of these bots which are going to degrade the archive and, you know, potentially knock it offline or whatever. Is there basically nothing to be done with that trade off? Like, is it sort of. I mean, this is a bad way to put it, but, like the cost of doing business, because it's not a business. But you see, what I'm getting at is there just nothing to be done or what?
[34:10]
Emmanuel Mayberg
So there are things that people can do. The response to the story has been very interesting because I feel like I've heard from a bunch of other institutions, which I don't know if they're included in this survey because it's anonymous, but judging by their response, I think they were not. So the problem is, again, demonstrably widespread, and people have kind of been telling me interesting things about what they're doing and what their solutions are. And I hope to have a story in the next couple of weeks about some interesting solutions. I want Jason actually to talk about some solutions that he reported on. Cloudflare has a thing. And there were kind of, like, these funny solutions to trick the scrapers. But I'd also like to talk about this tension. I'm going to make a tortured analogy, but it's something that I talked to Weinberg about. In 2023, this book came out called The Art Thief. Have you guys heard of this at all? Really great book, nonfiction. It's about one of the most prolific art thieves in history. He worked in the early 2000s in Germany and France, and he stole more than 200 pieces. And the way that he did this is he didn't steal, like, gigantic, famous pieces. He wasn't going after the Mona Lisa. He just went to these small regional museums in the countryside and, like, stole tons of small pieces. Altogether, they were worth, like, I don't know, $2 billion or something like that. And it's a really fascinating story about why he did it, how he did it, what happened when he got caught and all of that. But one of the lessons of the story is that the author talked to the owners of these regional museums, and they explained that when something gets stolen from one of these museums, the damage isn't only that the piece is gone, it's that it breaks the social contract of how these museums operate, right? It's like these museums might have a security guard, they might have security cameras, but it's not like Ocean's Eleven, right?
There aren't, like, lasers and, like, heat sensors keeping the pieces safe. The social contract is this art, this history, these texts are part of our collective cultural heritage, and we're putting in the work to make it available to the public because it belongs to everybody, and the public, in return, kind of, like, agrees to be respectful and not fuck with it. And when somebody steals something, they break the social contract. And that forces the museums to lock everything down and make it less accessible. And this is kind of what is happening online as well. So one thing that people can do, right, the people who manage these collections, they can have people log in, they can have CAPTCHAs, they can have all kinds of friction that would make it hard, if not impossible, for an AI scraper to get all the information, but would require a little bit more from human users as well. And the maintainers of these collections are very reluctant to do that because the entire point of doing this, the entire point of digitization and GLAM-E and all this stuff is to make it as accessible as possible, right? It's a very benevolent mission that these people have. So there's that issue, like they're reluctant to do it because they want to make it so available. And then the other thing is, and this is something that Weinberg really emphasized, and he focuses on small and medium-sized organizations, but he says even in, like, a big organization, once they digitize something, there's like one person maybe who is responsible for keeping that stuff online and functional. It might be someone's job on top of a totally different job that they already do. It might be a volunteer. And any change or update that you force them to introduce is very, very difficult to implement, if not impossible, right?
It's like if you go to one of these organizations, you go to one of these small museums and they're like, hey, we need to implement a captcha, or we need to implement a login. We need to implement something like that. They're like, well, we're just going to take this offline because we can't do this at all. It's just impossible for us to put in the work on top of what we already did to digitize it. So we're just not going to have it at all. Jason, do you remember the Cloudflare?
[38:45]
Jason Kebler
Yeah, it's very grim for the reason that you just said, because a lot of these organizations are probably, like, barely financially solvent depending on who they are and what they are. And it's expensive to keep this sort of thing up. And then that's to say nothing about the status of, like, the actual things that are being scraped. I assume some of them are in the public domain by now if they're, like, really old. A lot of them probably are not. But we found that, you know, AI companies don't really care. I did write a long time ago about different types of mazes that have been deployed. There was one that was like a DIY open source one by a specific programmer. We probably talked about it maybe six months ago, or maybe a little longer than that. He called it an AI tar pit, and it was just like an infinitely generating website that a human being would click off of pretty much immediately, but that an AI scraper would scrape over and over and over again, kind of indefinitely. For something like a museum to spin this up, it's like, part of the point of an AI tar pit is to waste a scraper's time by creating an infinite number of pages, which doesn't do anything really to help the museum, because that uses a lot of their own bandwidth, because they are allowing the scraper to hit it. They're just hitting, like, nonsense over and over and over again. But Cloudflare, the gigantic Internet infrastructure company, released something very similar to this. It's like a similar design, and that is something you can put in front of it. Now you can also, as you said, you know, put up a login wall, sort of depending on the scraper. Like, they may or may not want to try to get past that. And, you know, that's something that we did to preserve the cultural works here at 404 Media. We put them behind a login wall sometimes. And I think that that's helped.
I think I'll probably talk about this more later, but I went to a journalism business conference two weeks ago for 404 Media to talk about this, and a lot of big news outlets were there talking about how they are trying to protect their own sites from AI scrapers. And I believe it was the Daily Mail that was there, and they gave a presentation about the fact that you sort of need to stop these scrapers very early on and also catch them in the act, more or less, so that you can then go to the company and say, like, we know that you are scraping this when you should not be. And for the Daily Mail, it was for the purposes of trying to strike a deal with OpenAI or with these different companies, like, say, hey, I know you're trying to steal this stuff. We have stopped you with our login wall or with our, you know, robots.txt, or the various things that they're doing, like, let's strike a deal here. But one of the points that their business person was making was, like, once this stuff is scraped, you kind of lose a lot of your leverage unless you're willing to sue them. And that can be really expensive. We don't even know if it's going to be successful. There's tons of lawsuits out there right now that are still ongoing and that, you know, we have been following and we'll continue to follow. But for something like a museum, it's interesting because I bet their collections don't change all that often. They're getting hit by these scrapers. The value has already been, like, extracted, probably, in many cases. But the way that these scrapers work, they're probably coming back over and over and over again and hitting them over and over and over again, even though they've already gotten what they want, which is, like, really frustrating.
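Editor's note: the "AI tar pit" idea Jason describes, an endlessly generated site that a crawler keeps following, can be sketched as below. This is a hypothetical illustration of the general technique, not the code of the open source tool or of Cloudflare's product; the `/maze/` path prefix and function names are made up for the example.

```python
import hashlib
import re


def tarpit_page(path: str, links_per_page: int = 3) -> str:
    """Return an HTML page for `path` whose links all lead to further generated pages."""
    links = []
    for i in range(links_per_page):
        # Derive each child path from a hash of the current path, so the same URL
        # always yields the same page and the server stores no state.
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f'<a href="/maze/{digest}">{digest}</a>')
    return "<html><body>" + " ".join(links) + "</body></html>"


def extract_links(html: str) -> list:
    """Pull href targets out of a generated page, the way a naive crawler would."""
    return re.findall(r'href="([^"]+)"', html)
```

Every page links only to more generated pages, so a crawler that follows links never reaches an end. As noted in the conversation, each of those requests still costs the host bandwidth, which is why this does little for a small museum serving it from its own infrastructure.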
[42:44]
Emmanuel Mayberg
Just to illustrate that, like the UNC thing that happened last week, the IT people were explaining that the information is easy to get, but they have a search engine, and what the bots were doing was just, like, spamming it with different search terms. So it's like an incredibly inefficient way of extracting the data. You know, if it was just like an agreed-upon, hey, can we please have your data? We'll pay this much for it. The organization could maybe benefit from it. And then also you wouldn't have to, like, DDoS the library in order to get it.
[43:24]
Jason Kebler
I mean, or if it's like, okay, scrape us, like, once a year or once every six months, don't scrape us, like, constantly. I mean, ideally scrape us not at all, but it's like, please don't scrape us daily. And then the other thing is just, like, there's constantly new bots that are doing this, associated with new companies. Companies that already exist are creating new bots to scrape for different purposes. And so it's not like you can just protect against, you know, OpenAI's scraper. You need to protect against all the different types of scrapers that different companies might be running. Remember, like, what the names are, keep up to date with what they're called, so on and so forth, and figure out how to block all that traffic. And it's, like, extremely not trivial. There's a couple different products that have been released to try to automate this, but it's still a permission structure that's, like, really messed up because it's opt out, not opt in. And you don't even know, like, what you're opting out of because there's constantly new ones that you have to think of. And there's, like, new strategies that the AI companies are using to circumvent robots.txt, because they don't care for the most part.
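Editor's note: the opt-out problem Jason describes can be seen in a minimal blocklist sketch. It assumes a site filters requests by matching the User-Agent header against a hand-maintained list of crawler names; the tokens below are illustrative examples of publicly documented crawler names and would need to be checked against each operator's current documentation. `BrandNewBot` is hypothetical.

```python
# Illustrative, hand-maintained list; any real deployment would need to keep this current.
BLOCKED_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def is_blocked(user_agent: str, tokens=BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent header contains any listed crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


# A listed crawler that identifies itself honestly is caught.
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))
# A newly launched crawler that nobody has added to the list yet is not.
print(is_blocked("BrandNewBot/0.1"))
```

The second call shows why an opt-out list is always behind: a new or renamed bot passes until someone notices it and adds it, and a bot that sends a browser-like User-Agent is never matched at all.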
[44:43]
Joseph
Yeah. And opting out is, as you say, not straightforward. You have to, like, fight that through technical or potentially legal means. All right, that was fascinating. I'm definitely interested to hear what else we find about that. If you are listening to the free version of the podcast, I'll now play us out. But if you are a paying 404 Media subscriber, we're going to talk about the frankly casual surveillance relationship between ICE and local cops. And that's according to internal emails that Jason got. You can subscribe and gain access to that content at 404media.co. As a reminder, 404 Media is journalist-founded and supported by subscribers. If you do wish to subscribe to 404 Media and directly support our work, please go to 404media.co. You'll get unlimited access to our articles and an ad-free version of this podcast. You'll also get to listen to the subscribers-only section where we talk about a bonus story each week. This podcast is made in partnership with Kaleidoscope. Another way to support us is by leaving a five-star rating and review for the podcast. That stuff really helps us out. Here is one of those from spell checker: "The 404 team does a fantastic job at everything they do. Independent journalism for the win." I feel like I may have read that one before. I'm sorry if I did. This has been 404 Media. We'll see you again next week.