Unknown Host
If you like 404 Media, you have probably followed our reporting about how big tech has made consuming the news a mess. Social media and search algorithms have made finding and understanding the news exhausting. It's tough scrolling through timelines full of content that has been reverse engineered to perform well in an algorithm or that you're only seeing because of targeted advertising. That's why I like Ground News, a news platform that doesn't use manipulative algorithms and which lets me quickly see how news outlets around the world are covering a story. Ground News helps me get a snapshot of the entire media ecosystem, left, right and center, which I found really useful. As someone who's scanning the Internet all day, I found Ground News to be particularly helpful when I was trying to parse through Donald Trump's massive legislative agenda. On the left, Ground News surfaced an American Prospect article called "10 Bizarre Things Hidden in Trump's Big Beautiful Bill." In the center, it also surfaced a great analysis from Higher Ed Dive called "'One Big Mistake': Higher Ed Sounds Warning Over GOP Budget Law." And then I also was able to find a bunch of news takes and analysis from right-wing outlets without having to dive into the muck on X. Ground News let me do all that research really seamlessly without having to endlessly scroll through social media's echo-chamber-reinforcing algorithms. Go to groundnews.com/media today to get 40% off the Ground News Vantage plan and get access to all of their news analysis features. That's groundnews.com/media for 40% off the Ground News Vantage plan for a limited time only. groundnews.com/media.
Joseph
Hello and welcome to the 404 Media podcast, where we bring you unparalleled access to hidden worlds, both online and IRL. 404 Media is a journalist-founded company and needs your support. To subscribe, go to 404media.co. As well as bonus content every single week, subscribers also get access to additional episodes where we respond to their best comments. Gain access to that content at 404media.co. I'm your host Joseph, and with me are 404 Media co-founders Sam Cole.
Sam Cole
Hello.
Joseph
And Emmanuel Mayberg.
Emmanuel Mayberg
Hello.
Joseph
Jason's not here. He's still on vacation. Listeners will know we all took last week off, which is only possible due to our paying subscribers. We have never done that since we launched in August 2023. At this point it was rejuvenating, and we're ready to get back into it. A ton of stories are all ready to go through now, and then we're going to have a bunch more for next week as well. So let's get straight into it. This first one is one that Emmanuel published, I think today or yesterday: "The Open Source Software Saving the Internet From AI Bot Scrapers." So we spoke about the problem a little bit on an earlier episode of the podcast. Can you just lay out that problem first of all, before we talk about this solution? So what is happening with AI scrapers and who is it impacting exactly?
Emmanuel Mayberg
I think one of the funny things about taking a vacation after not doing it for so long is it warps your perception of time. And I'm like, when did we have that podcast? Was that two weeks ago or the early 90s? I don't remember. But yeah, it was like, I think it was two weeks ago. I wrote this story about the first good study that I've seen about how AI training data scrapers are really messing with libraries, museums and any other form of resource or archive that's open to the public. So you have all these AI companies, they need all the training data they can get in order to train their AI models. So they send these bots to crawl the entire web and hoover up all that data. And there's so many companies doing this now, and they're doing it so rapidly that if you're just a university, or even a big university with an online library, all these bots suddenly hit your website. They're doing it millions of times a day and they crash the websites, eventually making them unavailable for the human users that they were initially made for. So that's kind of largely the problem.
Joseph
Yeah, they're being bombarded basically in these, maybe not intentional, but essentially DDoS attacks where there's all of this traffic coming and it's knocking these systems offline so nobody can really enjoy or read them or get anything from them. So that's the problem. You need some sort of solution, some sort of mitigation. I mean, hopefully the mitigation isn't take them offline, which some people have unfortunately had to do. So enter this tool you've just written about called Anubis. What is it and how does it work? How's it trying to address this problem?
Emmanuel Mayberg
So I wrote that story that I just talked about, and a bunch of people reached out and said, hey, you should really check out Anubis. There are several solutions that people are trying in order to prevent these AI bots. One that we've talked about and that historically has worked is robots.txt, which is a file you can put in front of your website that tells bots not to crawl it. Just like, no automated tools should crawl this website. That used to work. That was pre the generative AI boom and this data mining gold rush. Suddenly people just don't give a fuck and they're scraping the sites anyway. So yeah, go ahead.
Joseph
Is that the reason it doesn't work? Because again, you're totally right and plenty of people brought this up that it used to work pretty effectively. You have that file on your website and tells people, please don't scrape this. Is it just that people don't care now or there's like way more people doing scraping? Or a combination of both? Like why doesn't robots work anymore?
Emmanuel Mayberg
Because it's not a rule. You know what I mean? It's not like if you ignore robots, the Internet police is going to come to your house and arrest you. What?
Joseph
Not yet.
Emmanuel Mayberg
Not yet. Once they give us badges, maybe we will do it. But no, it's just a norm that used to be respected. And I don't know, we should probably talk about this at some point. But the Internet has protocols and it has norms, and the Internet used to run on a lot of norms, and these are just being ignored entirely now because it's going to be very profitable for you to build a successful AI tool. So if you need to jump over the robots.txt fence in order to get that training data, you're going to do it. Who cares? So that no longer works. You have some other big companies that are trying things. Cloudflare, I think they announced a new method that they're trying today where they're blocking AI bots by default. I'm not sure how exactly that works. They've also done this kind of mischievous thing, and Jason wrote about this, and some other people are doing this, where once an AI bot comes to your site, you kind of send it down a rabbit hole of never-ending links to garbage websites and you trap it in this infinite loop. So that's one solution. The reason I think people reached out to me about Anubis and are very happy with it, and that it's getting widely adopted, is that it's lightweight, it's open source, and it's fairly effective in a less expensive way that we can get into later.
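Editor's note: a minimal sketch of why robots.txt is a norm rather than a rule, as described above. It uses Python's standard urllib.robotparser module. The bot name ExampleAIBot and the library.example address are made up for illustration. The file only has an effect if the crawler chooses to run a check like this before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an online library: one named bot is asked
# to stay out, everyone else is welcome.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://library.example/catalog"  # made-up address

# A well-behaved crawler asks first and respects the answer.
bot_allowed = is_allowed(ROBOTS_TXT, "ExampleAIBot", URL)
browser_allowed = is_allowed(ROBOTS_TXT, "Mozilla/5.0", URL)

print("ExampleAIBot may fetch:", bot_allowed)
print("Ordinary browser may fetch:", browser_allowed)

# Nothing on the server enforces the answer. A scraper that never calls
# is_allowed() can still request the page, which is the gap the
# conversation is describing.
```

The design point is that the check lives entirely on the crawler's side, so it protects a site only from crawlers that volunteer to be stopped.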
Joseph
Yeah, and I think the Cloudflare one as well. I mean, I hate to hand it to Cloudflare, a company that enables a lot of bad stuff and is kind of all over the place when it comes to its ideology and its decisions around content moderation or whatever. Oh, okay, they actually did something pretty interesting, I'll give them that. But there's also a way that publishers may be able to get funds, potentially, through that Cloudflare mechanism as well. It will ask people: well, do you want to pay this publisher to scrape this material? Right, yeah.
Emmanuel Mayberg
So you asked me how it works, which I lightly dodged because it's complicated and I don't want to get it wrong. But basically, these bots, the scrapers that are trawling the entire Internet, they don't look or behave like a normal Internet user. When that traffic comes to a website, it doesn't look like a human user who is using a browser in order to access the website, because there's no browser involved. It's just a very lightweight automated tool that is accessing the site and taking the data. What Anubis does is basically put a check before you can access the site, where it is looking for a type of cryptography that JavaScript does, and that all web browsers have been doing since around 2022. Maybe you could talk about this a little bit. A lot of people really hate JavaScript, and there are browsers that don't use it, and there are people who disable JavaScript across the board. So if you're one of those people, you're not going to be able to access sites that use Anubis. But the vast majority of people are using Chrome or some other type of browser that is constantly going through this cryptographic check to make sure that the user is there and that they're using a browser. Basically, all Anubis does is look for that. Once it verifies that it's a browser, it assumes that it's probably a human and it lets you through. The reason that is clever is because there's a cat and mouse game happening between the admins who are trying to protect their websites from scrapers and the AI companies. So the reason that the creator of Anubis, Xe Iaso, was willing to tell me how this check works is because she knows that the AI companies can't really implement the solution. It would be too computationally expensive for all these AI scrapers to pretend to run Chrome before they access a website. And that's why it's kind of a simple, lightweight, clever solution.
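Editor's note: as we understand it from the project's public documentation, the check described above is a proof-of-work challenge, where the browser must find a number that makes a SHA-256 hash begin with a run of zeros before the page is served. The sketch below is a generic illustration of that idea and is not Anubis's code. The function names, the challenge format and the difficulty value are all assumptions made for the example.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random string the visitor must work on."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Visitor side: try nonces until the hash starts with enough zeros.

    Each extra hex zero multiplies the expected work by 16.
    """
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking an answer costs a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Illustrative difficulty: three hex zeros, roughly 4,096 hashes on average.
DIFFICULTY = 3
challenge = make_challenge()
answer = solve(challenge, DIFFICULTY)
print("nonce found:", answer)
print("accepted:", verify(challenge, answer, DIFFICULTY))
```

The asymmetry is the point of the design. One visitor pays a small, one-off cost, the server verifies it with one hash, and a scraper making millions of requests a day has to pay that cost over and over.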
Joseph
Yeah, it is really smart, and I don't want to oversimplify it, but maybe it's just a way for people to understand: it's a more sophisticated CAPTCHA, in a way. A CAPTCHA is, you know, you're doing proof that you're not a robot. You're solving a little puzzle. Of course there are entire companies of people in various countries who will get commissioned. I think commissioned is putting it very charitably. They will be paid very, very low wages to click through CAPTCHAs and then provide the solution to that. And of course plenty of AIs can actually bypass CAPTCHAs now. This seems like a much smarter way of going about it. And you mentioned JavaScript, and I'll just say briefly, it seems like a really good trade-off to me that you have this JavaScript running in your browser. You're using an ordinary web browser, it lets you through. You're right in that some people don't like to use it. For example, when you download the Tor Browser there is a very easy to flick on and off setting to disable or enable JavaScript. And the reason for that is you can leak some personal data about your device to websites that are running certain pieces of code. And I guess it's a little bit less now, but you go even five, six, seven years ago, malvertising was just a massive thing where JavaScript was constantly used to deliver malware to people's devices simply for loading a web page. So it doesn't have the best rap. But I don't know, you're an ordinary user with a fully up to date web browser and you're accessing a website. It seems like a pretty good trade-off to filter out all the AI bots, really. So you mentioned you spoke to the creator. What did they tell you about why they decided to make this? Like, what makes somebody wake up and then just go, I'm going to make Anubis?
Emmanuel Mayberg
The scrapers came for her, that's why she did it. Like all the other libraries and archives that were in this previous story that I wrote about how AI scrapers are messing up their infrastructure. She has a git server where she keeps some of her work. She keeps it open so other people can access it. And one day she was trying to access it and it didn't work, and she looked at what was going on, and some kind of Amazon AI scraper was hitting it hundreds and hundreds of times and forcing it to reset and then eventually go offline. And that's when she set out to make Anubis.
Joseph
That's amazing. So she makes it, it sounds like for personal use initially. Did she upload it publicly at the same time? And then people. I'm just trying to get from how she makes it to. Apparently a ton of people are now using this. Like, what's the. What happens in between those two points?
Emmanuel Mayberg
Yeah, I think it's just word of mouth. She did make it available. It was on GitHub, and I think people started to use it. And it seems like what really made it take off is that, and I know that you were previously a Linux user, so maybe you're familiar with this, but GNOME, which is a popular open source desktop environment for Linux, started to use it to kind of protect its libraries and all of its publicly available data. They picked it up and it kind of went from there. And then a bunch of people in the open source community, honestly, started to use it, because it being open source, I think, was part of the attraction. And then, yeah, it's been downloaded, when I wrote the story, it was 200,000 times. I imagine it's a lot higher now because the story got some pickup and people were really excited about it. UNESCO's website and infrastructure uses it. Some universities use it. You sent me a story after I published where Duke had a case study about them dealing with scrapers, and they also came to Anubis for the solution. I think that's what they ended up doing.
Joseph
Yeah, I'm trying to find it right now. But from what I remember, off the top of my head, it wasn't even just that they were using it. They were almost reviewing it and being like, this is how effective this was for, you know, protecting our property and protecting us from scrapers. And it seems like it worked pretty well. Yeah. And I guess, just to stress, like, 200,000 people downloading something, okay, not the biggest number in the world, but they're not random normal users. They're probably web administrators or something, you know, like that. And 200,000 web admins or sysadmins is a lot of people, I would say.
Emmanuel Mayberg
Yeah, yeah, yeah, right. It's not 200,000 individual users and their little blogs, it's 200,000 users, many of which are the admins of incredibly valuable online resources that are accessed by millions and millions of people.
Joseph
Yeah. So I guess just the last thing is, what do you think happens now? Is it just that more and more people use Anubis or is it also, I mean, obviously both can be true, maybe more and more solutions start cropping up, like Cloudflare's as well, and there's just sort of this ecosystem of all of these "hey, screw you, stop scraping us" solutions.
Emmanuel Mayberg
Yeah, I think it's hard to imagine that Cloudflare is not going to provide some sort of workable solution to its customer base, and other Internet infrastructure companies are probably going to do the same. I think the question is how popular this open source alternative gets, and I don't know the answer to that, but I know that it has a lot of momentum. I know that a lot of people are using it, a lot of notable organizations are using it. The developer says that she's not yet at a place where it can be her full time job, but she wants it to be. And I don't know, I can definitely imagine that happening. I can imagine her getting enough donations to make this a full time job, hire a contributor and have it be like an open source protocol that is very popular and is used more and more as this problem persists.
Joseph
Yeah, the problem isn't going away and it's probably not even getting smaller or plateauing. It's going to get worse if anything.
Emmanuel Mayberg
And the other thing I forgot to mention is that not everyone can or wants to use Cloudflare or another kind of solution from a big company. A lot of people on principle or for practical reasons want an open source free solution.
Joseph
Yeah, that's fair. And I guess the last thing I'll say is, of course, we try to block AI scrapers as well. This is a big reason why we require email address signups on the site. An article will go out there and we'll put it behind the free wall where you have to provide the email, and then some stuff is obviously paywalled as well. But it seems that it has stopped some scraping, and the way that we can see that is fewer AI-generated rip-offs of our articles, essentially, which is not the best metric in the world, but it's a pretty good one. And I'm sure we'll look deeper into it as well, of how much we're being scraped and whether it's going down or plateauing or whatever. But that's what we do. All right, we'll leave that there. And when we come back, we're going to talk about another story of mine, which is about ICE and its new facial recognition app. We'll be right back after this.
Unknown Sponsor
Today's episode is sponsored by BetterHelp. Workplace stress is now one of the largest factors negatively affecting our mental health, with 61% of the global workforce experiencing higher than normal levels of stress. Unfortunately for most of us, work stress is going to stay in our lives, but you can learn how to more healthily deal with it. Lunchtime resets, under-desk treadmills and vacations can help, but they aren't long-term solutions to stress. Therapy can help you navigate whatever challenges the workday or any day might bring. With over 30,000 therapists, BetterHelp is the world's largest online therapy platform, making it really accessible and flexible. You'll definitely find a therapist that works for you and fits into your busy schedule. If you need to switch therapists, cancel or reschedule an appointment or get in touch with your therapist, you can do it with a click of a button. As the largest online therapy provider in the world, BetterHelp can provide access to mental health professionals with a diverse variety of expertise. Unwind from work with BetterHelp. Our listeners get 10% off their first month at betterhelp.com/404media. That's BetterHelp, H-E-L-P, .com/404media.
Unknown Host
I do a lot of online shopping in bed or from the couch. Half the time I want to buy something, I have this sinking feeling my wallet is in the other room. This has actually become less of a problem lately because I've noticed more stores have the purple Shop Pay button. When I see that button, I know checkout's gonna be a breeze because it's powered by Shopify. In fact, Shopify makes everything easy, from buying stuff as a customer to becoming a small business owner and running your own store. Shopify is the commerce platform behind 10% of all e-commerce in the US, from household names like Mattel and Gymshark as well as another brand you might know: 404 Media. Shopify gives you that leg up from day one with hundreds of beautiful, ready-to-go templates to express your brand style, and forget about the code. Shopify helps you tackle everything you'll need in one place, from restocking inventory to managing payments to handling shipping, analytics and more. And don't forget that purple Shop Pay button used by millions of businesses around the world. It's why Shopify has the best converting checkout on the planet. If you wanna see fewer carts being abandoned, it's time for you to head over to Shopify. Sign up for your $1 per month trial and start selling today at shopify.com/media. Go to shopify.com/media. shopify.com/media.
This September 3rd through 5th, check out Inbound, which brings attendees to San Francisco for a one-time-only West Coast event with insights you won't find anywhere else. Inbound 2025 is the only place that you're going to find Sean Evans, the host of Hot Ones; creative force Amy Poehler; tech reviewer Marques Brownlee; and AI pioneer Dario Amodei. They'll each bring their unique approach and expertise to Inbound 2025. And this just in: the Inbound 2025 agenda is finally live. From the Agent AI workshop, From Idea to Agent, to Dwarkesh on AI's Future: Research-Backed Bold Predictions with Dwarkesh Patel, explore more than 200 sessions, all created for your growth. At Inbound 2025, you'll get to network with decision makers in San Francisco's AI-powered ecosystem, where innovative technologies are creating entirely new approaches to business. And you'll get to cut through the noise with focused, actionable takeaways on the latest marketing, sales and AI trends to give your business a competitive edge in today's rapidly changing landscape. At Inbound 2025, you'll experience firsthand how San Francisco's technology ecosystem is revolutionizing content creation, distribution and monetization with AI and innovative tech solutions. So secure your spot right now at inbound.com/register. That's inbound.com/register.
Sam Cole
All right, we are back with one of Joe's stories. The headline is "ICE Is Using a New Facial Recognition App to Identify People, Leaked Emails Show." So, Joe, you've been doing a ton of really good reporting about ICE and about facial recognition, separately and sometimes combined. This to me is something that was not even on my radar yet as far as what police were doing in the field, and then realizing that they were sticking phones in people's faces, and then also using that as a facial recognition app, is just like a one-two. Like, oh my God, this is insane and horrible. So, yeah, why don't you just explain, maybe for a second before we dive into the app itself, for people who might not be familiar: what exactly are police, or ICE agents or officers or whoever, doing in the field that made people go, oh shit, what are they doing? What are they doing with their phones, before we get into the app even?
Joseph
Yeah, yeah, that's fair. So as many listeners will know, ICE is conducting operations all around the United States, but especially in Los Angeles. And of course, there were protests in response to that. And Trump deployed the National Guard and then later the Marines. Those protests have kind of petered out for the moment. But there are all of these sort of individual flashpoints where ICE will raid a Home Depot or a 7-Eleven or somewhere else, and then the community will gather around it. ICE will be wearing masks. They'll be heavily armed. They may not identify what agency they're from, their badge number or anything like that. And there were a few videos coming out around that time of ICE officers repeatedly taking their mobile phones and sort of, not shining, almost shoving them into people's faces, very deliberately and clearly filming or taking a photo of people. And there was one very stark example where a protester was following ICE in his vehicle. From the video that was published, it didn't look like he was interfering with the operation. It didn't look like he was getting in the way of federal law enforcement or anything like that, but he was following them. These apparent ICE officers stop, they get out of their vehicle, they crowd around his car. And this is, it seems, on a busy road. They then start asking him, what are you doing? Why are you following us? And then two, maybe three of them get out their phones and keep pointing them at his face. It looks like they're taking photos of his face. And that's what I thought it was at the time, and it might still be. And we'll get into the nuances of what exactly we know and what we don't know. But people saw that and they were asking on social media and elsewhere, why is ICE, why do they keep taking photos of us? Which, you know, is a fair question to ask. But then I got these leaked emails from inside ICE revealing the agency has a new facial recognition app. So maybe they're using that. We don't know if in all of these videos they're doing that, but it absolutely adds new context to that. And the fact is that ICE does have this app now, as, you know, I'm sure we'll get into.
Sam Cole
Yeah, so what is this app that, like you said, might or might not be what is in the videos? We don't know, but we know that it's being used, or at least ICE has it. What's it called and what does it promise to do?
Joseph
Yeah, so it's called Mobile Fortify, and it promises to allow ICE agents to identify people in the field by simply pointing their mobile phone at them, which obviously is the point of facial recognition. That's the entire purpose of the technology. But it's one thing to have facial recognition, I don't know, on the camera outside a military base to check who's coming in, or facial recognition at a retail store, which is still very controversial. It's different to have it immediately accessible in the hands of federal law enforcement, who are grabbing people without due process and putting them into this black box system, and giving them a capability right in the palm of their hands that can identify people on the street. And of course, ICE has facial recognition from other companies as well. We don't know who made this one, whether it's in house or made by somebody else. And other law enforcement agencies do as well. But it's the immediacy, and it's definitely the context of Trump's mass deportations, that makes this interesting. And, of course, in these emails that I got, ICE was announcing the existence and the availability of this app to all of the personnel in Enforcement and Removal Operations, ERO, and that's the part of ICE that deals with deportations. This isn't HSI, which does child abuse investigations. This is ERO, the deportation part of ICE.
Sam Cole
Okay, gotcha. So you found this through FOIA documents, or how did you figure out the details of this app?
Joseph
I was leaked emails from inside...
Sam Cole
Oh, leaked. Okay.
Joseph
...from inside ICE. And these emails were sent to all personnel inside ERO. So that's a very wide spread of people, which I think is interesting, because that's a very wide spread of people that may or may not be using this tool. It's at least been advertised to them inside the agency. And of course, with this new budget that US lawmakers have just passed, and that Trump then signed into law, ICE just got another 6 billion for surveillance capabilities, and I think 150 overall, 45 for more facilities to store people it's deporting, all of that sort of thing. So it's clear they're ramping up their surveillance tech, and I think this is just one part of it.
Sam Cole
Yeah. Okay, gotcha. Yeah. Sending it to everyone in ERO is very, very crazy. So we know that Clearview, as you've mentioned before, uses facial recognition tech, but how is this similar? How is it different? Is it different? What are we working with here in comparison, since we already have that as kind of like an example.
Joseph
Yeah. So the way that facial recognition tech works really broadly, and obviously there'll be some differences between different companies and that sort of thing, but broadly, is that it has this massive database of images of people's faces, or maybe it has a massive database of the hashes of people's faces, like the cryptographic representation of it. But a lot of the time it's just going to be literal photos. That's the case with Clearview AI, which is a company that Kashmir Hill in the New York Times revealed several years ago at this point. And they had, as she described in that coverage, and as the person who provided the public records to her described, sort of crossed the Rubicon of facial recognition, where to build that database of faces, Clearview scraped social media, it scraped the web. So people's Venmo accounts, social media accounts, web pages, all this sort of thing, and created this massive database of billions upon billions of faces. That's what makes Clearview apparently pretty powerful and apparently very popular with law enforcement at the local, state and federal level. So that's how it usually works and that's how you usually get this sort of data. This new ICE app, Mobile Fortify, is not using images scraped from social media. And again, there are still questions we don't have all of the answers to, but the emails I've got said this ICE app is using the system from Customs and Border Protection that takes a photo of somebody every time they enter or leave the United States. You imagine you go to the border, and that could be LA airport or whatever, or it could be crossing via car from Mexico or wherever. You're almost certainly going to have your photo taken, especially if you're on a travel visa or a work visa or something like that. Your photo will be taken. Now, the emails, again, don't answer all of the questions, but here's one quote which I thought was very telling: "The app uses CBP's Traveler Verification Service and the Seizure and Apprehension Workflow that contains the biometric gallery of individuals for whom CBP maintains derogatory information for facial recognition." There's a lot of buzzwords in there. "Derogatory" comes up, and it's sometimes hard to tell what they mean by that. But they say they have a biometric gallery of individuals, which comes from when people enter or leave the United States. That appears to be, based on these emails, how they got those images and what ICE is now using in the field. You get a photo taken when you arrive at an airport, and now you're getting pulled over in LA and an agent is putting a camera in your face and they know who you are, based on the biometrics taken at the border.
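Editor's note: a generic sketch of the gallery lookup described above, in which a new photo is compared against a stored database and the closest entry is returned if it is close enough. Nothing in the emails describes how Mobile Fortify matches faces, so this is only an illustration of the general approach. The three-number vectors, record names and threshold are invented, and real systems use much longer numeric templates produced by a face recognition model.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two templates: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(probe, gallery, threshold=0.9):
    """Return the gallery record most similar to the probe template.

    Returns None when no record reaches the threshold, which is how a
    system would report "no match" instead of forcing an answer.
    """
    best_id, best_score = None, threshold
    for record_id, template in gallery.items():
        score = cosine_similarity(probe, template)
        if score >= best_score:
            best_id, best_score = record_id, score
    return best_id


# Invented gallery: record name mapped to a toy numeric template.
GALLERY = {
    "record-A": [0.9, 0.1, 0.3],
    "record-B": [0.1, 0.8, 0.5],
}

# A new photo whose template is close to record-A.
print(best_match([0.88, 0.12, 0.31], GALLERY))

# A template that resembles neither record closely enough.
print(best_match([0.5, 0.5, 0.5], GALLERY))
```

The threshold is the sensitive design choice. Set it lower and the system returns a name more often, including for people who are not in the gallery at all.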
Sam Cole
Okay, gotcha. Yeah, I think I expect when I'm traveling, and I'm sure most people do at this point, that you're getting your picture taken in a thousand different ways by a hundred different cameras, which is exhausting to think about and sucks in its own way. But then having ICE agents have access to this app, like on their phones, seems like just a whole other invasion of your privacy, invasion of your personal space, or much worse than that if they decide to harass you or, God forbid, kidnap you. So, yeah, I think the situation with this app is really shocking and crazy. What do you think this kind of says about where we're at with surveillance tech in general in this country?
Joseph
Yeah, I think it's one main thing. And I have another ICE story coming that will touch on this as well. But we're absolutely in the age of all of this existing data that maybe has been collected by private companies, and then it's stored in databases and then ICE gains access to it, or in this case collected by Customs and Border Protection for one purpose, which is to verify who's coming into and who is leaving the country. And it's then being completely recontextualized and repurposed for, well, now we can instantaneously identify people in the field. And potentially, again, we don't know the full parameters of this tool, but if it's getting data from that system, that is presumably linked to immigration status as well, which isn't in a tool like Clearview AI, which scraped a bunch of people's photos from Venmo. Venmo doesn't have your immigration status or anything like that. It doesn't come with the photos. This could be a much, much more powerful tool for that and much more applicable for what ICE wants to do. So I just think it shows that, with all of those surveillance capabilities and biometric capabilities that have been built, I honestly think all bets are off for how they can be used now. And when I spoke to the ACLU about this system and asked them for comment, they said, I mean, DHS was never authorized by Congress to use facial recognition tech in this way and they should stop this immediately. Of course, I have no expectation that ICE or DHS would, but I really think that all of these harebrained schemes you can think about, oh, maybe the data could be used for this or used for that, basically all of them are possible at this point. I'm constantly surprised every time I write a story about DHS and how they're using data, and I just get shocked every time. And that's going to keep happening with these other stories that I have coming as well.
Sam Cole
Awesome. I'm giving two thumbs up for people who are listening. It's just. Yeah, I mean, I don't blame anyone at this point for having a conspiratorial mindset. We try not to here, but it's like if you can think of it, it might probably actually be happening. And it is happening with technology that the cops have their hands on because cops left twice. All right, well, I think, yeah, I.
Joseph
Think that's why I'm so shocked, or consistently shocked, because I don't have a... I strive, obviously we all do, I'm not saying you have this, we all strive to not be conspiratorial. And that's why whenever I get one of these stories or another one, I'm still shocked every single time. Even though I'm sure people will be like, oh, what did you expect? Well, we're doing journalism, not palm reading here at this point, and I'm surprised every single time. All right, if you're listening to the free version of the podcast, I'll now play us out. But if you are a paying 404 Media subscriber, we're going to talk about a bunch of our recent stories about LLMs and how they can be tricked and how they can't understand language from Gen Alpha, I think. You can subscribe and gain access to that content at 404media.co. As a reminder, 404 Media is journalist founded and supported by subscribers. If you do wish to subscribe to 404 Media and directly support our work, please go to 404media.co. You'll get unlimited access to our articles and an ad free version of this podcast. You'll also get to listen to the Subscribers Only section where we talk about a bonus story each week. This podcast is made in partnership with Kaleidoscope. Another way to support us is by leaving a five star rating and review for the podcast. Here is one of those from user BC: "This podcast and associated website break a lot of important news. Punches well above its weight." Thank you very much. This has been 404 Media. We'll see you again next week.
The 404 Media Podcast: How to Fight Back Against AI Bot Scrapers
Release Date: July 9, 2025
In this episode of The 404 Media Podcast, hosts Joseph, Sam Cole, and Emmanuel Mayberg delve into the escalating issue of AI bot scrapers and the innovative solutions emerging to combat them. The discussion centers around Emmanuel's recent investigative report on Anubis, an open-source tool designed to protect online resources from malicious AI scraping activities.
Emmanuel Mayberg opens the conversation by highlighting the severe impact of AI bot scrapers on public resources:
“[04:35] Emmanuel: [...] AI training data scrapers are really messing with libraries, museums and any other form of resource or archive that’s open to the public.”
These AI bots inundate websites with excessive traffic, resembling Distributed Denial of Service (DDoS) attacks, which can crash sites and render them inaccessible to legitimate users. Emmanuel explains that traditional protective measures, such as the robots.txt file—which requests bots not to crawl a website—have become ineffective as many AI scrapers disregard these norms for profit.
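To illustrate why robots.txt only works when a crawler chooses to honor it, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the bot name `ExampleAIBot`, and the example.org URLs are hypothetical and not taken from the episode. The check runs entirely on the crawler's side, so a scraper that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary public archive.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules say this user agent may fetch the URL.

    This is a voluntary, client-side check: the server does not enforce it.
    """
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("ExampleAIBot", "OtherBot"):
        verdict = is_allowed(rules, agent, "https://example.org/archive/item1")
        print(agent, "allowed" if verdict else "disallowed")
```

A well-behaved crawler would call something like `is_allowed` before every request. Nothing forces it to, which is the gap the episode describes.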
Joseph transitions the discussion to Anubis, the tool Emmanuel reported on:
“[05:15] Emmanuel Mayberg: [...] Anubis is lightweight, it’s open source, and it’s fairly effective in a less expensive way.”
Anubis defends against AI bots with a cryptographic check similar to those used by modern web browsers. Traffic that passes the check is probably coming from a real browser, and therefore probably from a human, which filters out most automated scraping attempts. Emmanuel details how Anubis operates:
“[08:30] Emmanuel: [...] All Anubis does is look for that once it verifies that it’s a browser, that it’s probably a human and it lets you through.”
Anubis distinguishes between human and bot traffic by verifying the presence of a browser environment, a task that is computationally expensive for AI scrapers to mimic, thereby preventing them from accessing the site's data.
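The summary does not spell out what the check involves. One common form of a challenge that is cheap to verify but costly to answer at scale is a hash-based proof of work, so the Python sketch below assumes that approach purely for illustration. The function names, the difficulty values, and the challenge format are hypothetical, and this is not Anubis's actual code.

```python
import hashlib
import secrets


def make_challenge():
    """Server side: issue a random, single-use challenge string."""
    return secrets.token_hex(16)


def solve(challenge, difficulty):
    """Client side: search for a nonce whose SHA-256 digest, taken over
    challenge + nonce, starts with `difficulty` zero hex digits.

    Expected work grows by a factor of 16 with each extra digit.
    """
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge, nonce, difficulty):
    """Server side: a single hash is enough to check the client's answer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=3)
    print("accepted:", verify(challenge, nonce, difficulty=3))
```

The asymmetry is the design point: a single visitor pays a small, one-time cost, while a scraper making very large numbers of requests has to pay it over and over.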
The podcast discusses the rapid adoption of Anubis within the open-source community and major organizations.
Notably, organizations such as GNOME and UNESCO, along with universities like Duke, have integrated Anubis into their systems to safeguard their online resources. By the time of the podcast, Anubis had been downloaded over 200,000 times, signaling strong community trust and reliance.
The hosts speculate on the future landscape of AI scraper mitigation:
“[16:22] Emmanuel Mayberg: [...] I think it’s hard to imagine that Cloudflare is not going to provide some sort of workable solution to its customer base.”
While Anubis is gaining traction, larger companies like Cloudflare are also developing their own protective measures. The ecosystem is rapidly evolving, with multiple solutions aiming to outpace the relentless advancements of AI scraping technologies. Emmanuel envisions Anubis potentially evolving into a full-time project supported by donations and community contributions.
Joseph shares how 404 Media is proactively addressing scraping threats:
“[17:51] Joseph: [...] An article will go out there and we’ll put it behind the free wall where you have to provide the email, and then stuff is obviously paywalled as well.”
By implementing email signups and paywalls, 404 Media effectively reduces the likelihood of AI scrapers accessing their content, as evidenced by a decrease in unauthorized content generation based on their articles.
The episode wraps up with reflections on the persistent and growing threat of AI bot scrapers. The hosts underscore the necessity for ongoing innovation and community-driven solutions like Anubis to protect valuable online resources from being exploited by AI technologies.
Emmanuel Mayberg [03:13]: “AI training data scrapers are really messing with libraries, museums and any other form of resource or archive that’s open to the public.”
Joseph [05:57]: “Is that the reason it doesn't work? Because again, you're totally right and plenty of people brought this up that it used to work pretty effectively.”
Emmanuel Mayberg [08:30]: “Basically, all Anubis does is look for that once it verifies that it’s a browser, that it’s probably a human and it lets you through.”
Emmanuel Mayberg [16:22]: “I think it’s hard to imagine that Cloudflare is not going to provide some sort of workable solution to its customer base.”
The 404 Media Podcast provides insightful analysis into the challenges posed by AI bot scrapers and showcases emerging solutions like Anubis that empower website administrators to defend their digital assets. As AI technology continues to advance, the need for robust, community-supported defensive tools becomes increasingly critical.
For more in-depth discussions and exclusive content, subscribers can visit 404media.co.