Intelligent Machines Podcast (Audio)
Ep. 833: The Most Popular S3 Bucket Ever - AI Slop, Clankers, and Shrimp
Date: August 21, 2025
Hosts: Leo Laporte, Paris Martineau, Jeff Jarvis (TWiT Network)
Special Guest: Rich Skrenta (Director, Common Crawl Foundation)
Overview
This episode of Intelligent Machines focuses on the increasingly vital role of openly accessible web data for AI, the growing tensions over content, crawling, and intellectual property, and the sometimes absurd or alarming side effects of rapid AI deployment. The crew is joined by Rich Skrenta, Director of the Common Crawl Foundation—the organization behind perhaps the "most popular S3 bucket ever"—which provides massive, open datasets powering thousands of AI and research projects. The conversation explores in depth the issues of access, copyright, and "AI slop"; discoverability in the age of LLMs; and the bizarre current state of content, ethics, and even radioactive shrimp.
Key Discussion Points & Insights
1. The Mission and Impact of Common Crawl
[07:25–16:43, 25:49–31:04]
- Foundation: Common Crawl is a nonprofit that crawls a vast sample of the web (~5B pages/month from a frontier of ~1T), providing open datasets for AI and research.
- Usage: Used by over 10,000 research papers and nearly every modern LLM (large language model); cited as critical for scientific and AI progress.
- Tension: Recent upsurge in publishers and platforms blocking crawlers, spurred by fears of content exploitation by for-profit AI companies—sometimes even unwittingly, due to "shadow banning" by services like Cloudflare.
- Open vs. Closed Web: The hosts and Rich note the paradox whereby publishers want traffic (and thus want to be discoverable by search and AI), but also seek to block or demand payment from AI crawlers—potentially harming their own discoverability and future business.
- Opt-Out Registry: Common Crawl will begin publishing a registry of sites that have opted out to inform downstream AI/ML users and signal possible dataset bias and incompleteness.
Notable Quote:
"Publishers are going to regret asserting their right to be forgotten... If you opt yourself out...you're denying it to thousands of other efforts that don't have resources—PhDs, researchers, small projects. Opt-outs make me sad. They really do."
– Rich Skrenta [12:55, 14:11]
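For context on how opt-outs work mechanically: sites typically block crawlers via robots.txt, and Common Crawl's crawler identifies itself as CCBot. A minimal sketch using Python's standard urllib.robotparser (the policy text and URLs here are hypothetical examples, not any real site's policy):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of Common Crawl's crawler (CCBot)
# while leaving the site open to all other crawlers.
policy = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(policy.splitlines())

# CCBot is shut out entirely; a generic crawler is still welcome.
print(parser.can_fetch("CCBot", "https://example.com/article"))         # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

This is the same blunt instrument Skrenta laments: a single Disallow line removes a site from the open dataset for every downstream researcher, not just the big AI labs.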
2. The Modern AI Content Economy: Promise & Peril
[02:10–23:36, 33:20–34:29]
- Monetization Madness: Examples like Reddit blocking the Internet Archive after having sold AI rights to OpenAI illuminate the growing "toll road" economy on formerly open content.
- Users, Creators, AI Orgs: The three-way tension: users want quick, free answers; creators want compensation; AI orgs want broad, open (but also "clean") data.
- Corporate Blindness: Companies demand removal yet lament decreasing relevance/discoverability—a self-defeating cycle as LLMs replace search as the main entry point for answers.
- Discoverability Paradox: As Rich notes, prestigious brands now increasingly ask to opt in to LLM datasets to maintain visibility, signaling a shift in attitude.
Notable Quote:
"It’s ultimately a violation of the notion of an open web. If you put something on the web...information wants to be free, you put it there so we can all benefit equally."
– Leo Laporte [14:22]
3. Technical & Ethical Challenges in Web Crawling and AI Training
[39:16–46:10]
- Filtering Content: Common Crawl doesn’t proactively filter for hate speech or other problematic content; Skrenta points to the subjectivity involved and the risk of poor or incomplete filters. Instead, they annotate potential issues and let downstream users filter as needed.
- Handling Sensitive/Illegal Content: Regular removal of flagged data (e.g., revenge porn, CSAM, secret keys) after manual and organizational review.
- Language and Representation: Push to increase low-resource language representation without sacrificing quality (to avoid web "junk").
- Efforts Toward Attribution: Development ongoing in the AI community around content attribution and referral tracking—would help calm publishers' fears and ensure credit.
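The annotate-don't-filter approach described above can be sketched as follows; the record format and label names here are hypothetical illustrations, not Common Crawl's actual schema:

```python
# Hypothetical dataset records: each page carries annotations rather than
# being pre-filtered out. Downstream users decide what to drop.
records = [
    {"url": "https://example.com/a", "labels": set()},
    {"url": "https://example.com/b", "labels": {"possible_lyrics"}},
    {"url": "https://example.com/c", "labels": {"adult", "possible_lyrics"}},
]

def keep(record, blocked_labels):
    """Keep a record only if none of its annotations appear in the caller's blocklist."""
    return not (record["labels"] & blocked_labels)

# One user filters aggressively; another keeps everything.
strict = [r["url"] for r in records if keep(r, {"adult", "possible_lyrics"})]
permissive = [r["url"] for r in records if keep(r, set())]
print(strict)           # ['https://example.com/a']
print(len(permissive))  # 3
```

The design point from the conversation: the dataset provider labels once, and each consumer applies their own policy, instead of one subjective filter being baked in for everyone.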
Notable Quote:
"We want them [AIs] to be aligned with us. I want them to read all these books...I would like to see an expansion of fair use. Robots are people too. A robot should walk into a library and read the books."
– Rich Skrenta [18:16]
4. The Publisher Dilemma: Blocking vs. Benefitting
[12:03–16:33, 29:52–31:34]
- Opt-Out Rates: Up to 25% of sites now block Common Crawl—a huge increase from zero a decade ago (driven by "AI moral panic").
- Consequences: Those who opt out withdraw from the research/distribution ecosystem—hurting visibility, education, and non-commercial use.
- No Re-enrollment: Once content is excluded at a publisher's request, Common Crawl does not allow it to be re-indexed ("You do so at your peril, creators.").
- Data Bias: Ongoing opt-outs risk narrowing the diversity/breadth of available content, making AI models less representative and more biased.
Notable Exchange:
"Q: Do you have a mechanism to get people back in if they change their mind?
A (Rich Skrenta): No. I'll never put them back in."
[31:00]
5. The State of AI Slop, Clankers, & Contemporary AI Culture
[133:03–136:55]
- AI Slop: Term for mass-produced, low-quality or misinformation-laden AI-generated content polluting the web ("slop" = output from spam models or poorly-managed generators).
- New Slang: "Clankers," "Grok Suckers," and other terms are proliferating—often used derogatorily for LLMs or those who overuse them.
- Corporate AI Washing: Routine exaggeration of “AI” features for marketing purposes, even on relatively simple or unrelated tech.
6. News Analysis: Google’s Product Event, Deepfakes, and More
[57:14–83:39, 117:06–120:18]
- Google Pixel Event: Critiqued for cringe-inducing celebrity segments; some genuine new features, especially in translation and photography, but overall thin on technical substance.
- AI Hallucinations & Deepfakes: Ongoing struggle with alignment (models going off the rails), deepfake incidents in politics (Amy Klobuchar, AOC), and legislative responses.
- Leaks & Privacy: Reports of xAI's Grok leaking user data and conversations, sometimes sensitive, via shareable open URLs.
7. AI’s Real World Effects: Power, Drugs, and Radioactive Shrimp
[105:01–129:15]
- Power Consumption: Data centers' electricity use is surging (4% to projected 12% of US grid by 2028), possibly contributing to increases in consumer power bills.
- AI in Pharma: Major deals (e.g., Eli Lilly’s $1.3B partnership with Superluminal) reflect sky-high hopes (and bets) that AI will yield new blockbuster drugs (esp. for obesity).
- Radioactive Shrimp Recall! [126:17]: Bizarrely, the FDA recalled Walmart-branded shrimp after detecting radioactive cesium-137—Paris Martineau, on the food safety beat, vows to get to the bottom of the case.
Notable Moment
"If you have this shrimp… Walmart did receive shrimp from the same supplier that brought this radioactive shrimp from Indonesia. Throw it away. They don't want the radioactive shrimp back!"
– Paris Martineau [128:23]
8. Social & Linguistic Shifts: New Words for a Weird World
[139:01–143:41]
- New Dictionary Additions: “Skibidi,” “delulu,” “tradwife,” and other pop-culture and Gen Z slang terms enter the Cambridge and Merriam-Webster dictionaries, prompting bemusement from the hosts.
- Digital Journaling & AI Memory: Philosophizing about journaling, creating legacies in a digital/AI world—Jeff shares the story of someone building a digital “dad brain” for posterity using AI.
9. Notable Quotes & Memorable Moments
On AI’s Unavoidable Role
"Robots are people too… it's inevitable you'll buy a robot, it'll carry your groceries, and it's absurd to think it won't be allowed to train on what it sees or hears in real time."
– Rich Skrenta [18:15]
On Content Opt-Outs
"If you opt yourself out of this data set because you're mad at a big company, you're denying it to like thousands of other efforts that don't have that resource… there's a lot of collateral damage."
– Rich Skrenta [24:41]
On Ethically Sourcing & Annotating Data
"I get letters about song lyrics...from music publishers. I'm not going to write a song lyric detector… Instead, let's label stuff [with annotations] and you can decide what to filter."
– Rich Skrenta [40:19]
On AI Model Alignment and Hallucination
"What you want after pre-training is to get rid of the misaligned persona... Giving it insecure code or incorrect medical advice can amplify the misaligned persona. But they're all in there."
– Leo Laporte [103:40]
Timestamps for Key Segments
- Common Crawl: Mission & Use Cases: [02:10–16:43]
- Publishing, Opt-Outs, & Dataset Bias: [12:03–16:33], [29:49–31:34]
- News: Google Event Recap & Critique: [57:14–83:39]
- Deepfakes & AI Alignment Issues: [117:06–120:18]
- Power & Pharma: AI in the Grid & Drug Discovery: [105:01–129:15]
- Radioactive Shrimp Recall (Must-Listen!): [126:15–129:15]
- AI Slop, Clankers, & Tech Slang: [133:03–136:55]
- Journaling, Digital Legacy, AI ‘Dad Brains’: [144:36–147:53]
Tone and Style
Conversational, skeptical but enthusiastic, often irreverent (lots of tech in-jokes), with a mix of expert insight and first-person anecdotes. Jeff Jarvis and Leo Laporte provide historical perspective and wry commentary, while Paris Martineau brings critical reporting and generational perspective.
Summary Takeaway
This episode underscores the disruptive tension at the heart of AI’s web-fueled future: Will open data survive the corporations and copyright-holders trying to lock it away? Can society craft fair rules for attribution, compensation, and ethical use before the AIs eat their own (and everyone else's) digital lunch? And in a world with AI-generated slop and radioactive shrimp, is anyone really minding the store (or the S3 bucket)?