Intelligent Machines 833: “The Most Popular S3 Bucket Ever”
Podcast: All TWiT.tv Shows (Audio)
Episode Date: August 21, 2025
Host: Leo Laporte
Co-hosts: Paris Martineau, Jeff Jarvis
Guest: Rich Skrenta (Executive Director, Common Crawl Foundation)
Episode Overview
This episode centers on the current tensions and debates surrounding AI, web data access, and the growing intersection of public data resources and proprietary content. Leo, Paris, and Jeff are joined by Rich Skrenta of the Common Crawl Foundation for a deep dive into how open web indexing is shaping the future of AI, the ethics of content crawling, and why so many companies are reevaluating their relationships with web crawlers in the age of large language models. The episode also explores issues of content discoverability, opt-outs, data ethics, and the importance of maintaining a healthy digital commons. The discussion later branches out to Google's botched Pixel launch, new AI-generated slang, and the viral “radioactive Walmart shrimp” recall.
Main Interview: Rich Skrenta, Common Crawl Foundation
[Start: 07:25]
What is Common Crawl?
- Jeff: “The Common Crawl Foundation was started... as a means to have an open crawl of the web. It was intended really for researchers at the beginning... More than 10,000 papers cite the Common Crawl Foundation as their source of material.” (02:16)
- Rich: Emphasizes that Common Crawl is not a search index but an academic sampling of the web, recrawling some known pages and continuously adding new ones. It crawls billions of pages each month, but that is only a small fraction of the web overall.
Tensions in AI, Content, and Users
- Leo: “There’s a tension between AI companies that want to absorb as much quality information as they can, content companies that say, but this costs us a lot to make... and then we, the users... just want a quality answer.” (08:01)
- The panel discusses Reddit's controversial blocking of the Internet Archive and its monetization of content access for AI training. Rich adds context about government efforts to archive web content and the unexpected technical and administrative hurdles involved.
The Rise of AI and Changing Publisher Attitudes
- Rich: “We are seeing a shift from the sort of knee-jerk opt-out to a desire to opt-in by the most forward-thinking brands and publishers.” (11:39)
The Dilemma of Opt-Outs
- Many news and content publishers are requesting opt-outs from Common Crawl—even after sharing content publicly—primarily out of concern over AI training.
- Rich: “Opt-outs make me sad. They really do.” (14:11)
- Common Crawl honors robots.txt and explicitly allows websites to block its crawler, but warns publishers that opting out may be self-defeating in the long term.
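As a rough illustration of the opt-out mechanism described above, here is a minimal sketch that uses Python's standard `urllib.robotparser` to test whether a given robots.txt would block a crawler. The user-agent name CCBot comes from Rich's description of the crawler later in the episode. The two robots.txt samples, the `crawler_allowed` helper, and the example.com URLs are illustrative assumptions, not Common Crawl's own code or any real site's configuration.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt from a publisher that opts out of CCBot only.
OPT_OUT = """\
User-agent: CCBot
Disallow: /
"""

# Hypothetical robots.txt that stays open except for one private path.
OPEN = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(crawler_allowed(OPT_OUT, "CCBot", "https://example.com/article"))
    print(crawler_allowed(OPEN, "CCBot", "https://example.com/article"))
```

A publisher that wants to remain in the commons while protecting specific sections could use path-level rules like the second sample rather than a site-wide block.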
The Impact on Research and the Commons
- Opting out not only restricts access for large companies but also harms small researchers and non-commercial projects.
- Rich: “If you’re not in Common Crawl, you’re not going to be in... thousands of other efforts that don’t have that resource.” (24:11)
On Ethics, Filtering, and “Robots Are People Too”
- Rich explains that Common Crawl doesn't generally filter the content it collects (except in cases of obvious harm or legal mandate, e.g., CSAM, PII), preferring to annotate content and let users decide what to exclude for their own use.
- Rich: “I personally would like to see an expansion of fair use. Robots are people too. A robot’s gonna walk into a library and it should be able to read the books.” (18:16)
- Leo: “You’re acting as if these robots are people. How can you assign rights to AI? These are big tech machines, these aren’t people.” (22:49)
- Rich argues that an AI “reading” a page is no different from a human reading it, and that this aligns with the principles of an open web.
The Scale and Method of Common Crawl
- Tracks a frontier of about 1 trillion known URLs, but crawls only about 5 billion pages per month.
- Focus: preserve text, not images or multimedia; honors all robots.txt files.
- Rejection/opt-out rate has gone from 0% years ago to about 25% today. (29:03)
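To put those figures in proportion, the short sketch below works through the arithmetic. The two constants are the round numbers quoted in the episode. The "months to cover the frontier" figure is a purely hypothetical upper-level estimate that assumes no page is ever recrawled and the frontier never grows, neither of which matches how Rich describes the crawl.

```python
# Round figures quoted in the episode; treat them as approximations.
FRONTIER_URLS = 1_000_000_000_000  # ~1 trillion known URLs
PAGES_PER_CRAWL = 5_000_000_000    # ~5 billion pages crawled per month


def crawl_fraction(pages: int, frontier: int) -> float:
    """Share of the known URL frontier fetched in one monthly crawl."""
    return pages / frontier


def months_to_cover(pages: int, frontier: int) -> int:
    """Hypothetical months to touch every URL once (no recrawls, no growth)."""
    return -(-frontier // pages)  # ceiling division on integers


if __name__ == "__main__":
    share = crawl_fraction(PAGES_PER_CRAWL, FRONTIER_URLS)
    print(f"Share of frontier per crawl: {share:.1%}")
    print(f"Months to cover frontier: {months_to_cover(PAGES_PER_CRAWL, FRONTIER_URLS)}")
```

On these numbers each monthly crawl samples about half a percent of the known URLs, which is consistent with Rich's framing of Common Crawl as a sampling of the web rather than a complete index.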
Importance of Discoverability in the AI Era
- Companies that opt out lose the ability to be discovered or included in future AI models and research.
- Rich: “You can think of Common Crawl as being a distribution mechanism for web content. We crawl the web... and then we give it away for free. And we’re upstream of all these other projects.” (31:34)
Attribution and Source Ethics
- Panel advocates for better attribution systems for AI-generated answers.
- Rich: “If they could tell that high-quality referral came from the AI answer, I think people would be a lot happier.” (22:34)
On the Future & Data Quality
- Expanding to more low-resource languages and improving data accessibility for small teams.
- Common Crawl as a bulwark against datacenters collapsing under redundant crawlers.
Notable Quotes
- Rich Skrenta:
- “If you push against [more data], the quality might go down, right?... If you go out and just naively try and crawl as much of the web as you can, a lot of times you just get a bunch of junk and you get a bunch of spam or misinfo.” (43:41)
- “Opt-outs make me sad. They really do.” (14:11)
- “Robots are people too. A robot’s gonna walk into a library and it should be able to read the books.” (18:16)
- “When we crawl… our robot, CCBot goes out every month… using a frontier of pages that we know about, URLs, which is about a trillion pages. And then we crawl 5 billion.” (16:53)
- “We had the most popular S3 bucket ever. And it fried the interconnects between Amazon and other large clouds.” (33:54)
- Leo Laporte:
- “There doesn’t seem to be much common ground. That’s why I’m really glad we could get Rich on. Because I think he is at least close to showing a way forward.” (52:19)
Additional Discussion Highlights
[49:46] Recap After Interview and Panel Reflections
- Paris: Questions who should be adjudicating content disputes; ethics and standards missing in regulation.
- Jeff: Proposes industry consortiums or APIs for attribution and fair use compensation rather than piecemeal deals between giants (like the NY Times and Amazon).
- Leo: “We are in that... interregnum where things are just changing dramatically and we don’t know what the rules are going to be.”
[54:00] News and Tech Segment
The Google Pixel Event
- General consensus: “the weirdest product launch yet,” featuring Jimmy Fallon hosting and awkward celebrity integrations (Jonas Brothers, Steph Curry, Alex Cooper).
- Panel laments lack of technical details and reliance on stilted, awkward product demonstrations.
- Jeff: “It was a mess. So much so that at one point… Fallon picks up the wrong phone… and Fallon says, ‘Yeah, oh, the purplish one?’ She says, ‘Well, we call that Moonstone.’ It didn’t play well.” (66:33)
- Leo: “When Apple does it, they have, like, Danny Boyle shoot 28 Years Later on iPhones... This looked like somebody did it on a phone.” (80:58)
[94:18] Brief Stories: AI Controversies & AI Fails
Government and AI: The Grok Fiasco
- The US government walked back plans to introduce xAI's “Grok” chatbot after “Mecha Hitler” and other offensive AI-generated content surfaced.
- Leo: “Wired assures us it’s over. I shouldn’t laugh. There’s nothing funny.” (95:57)
AI’s Potential for Evil
- Quanta Magazine cover story: models trained on “sloppy code” or harmful text easily become misaligned and offer harmful or dangerous advice.
- Panel laughs about absurd AI-generated responses (“bake muffins laced with antifreeze”).
AI’s Environmental Impact
- Data centers’ electricity usage projected to hit 12% of US power demand by 2028.
- Leo: “Americans’ electricity bills are 30% higher than they were five years ago. Thanks to data centers that needs quantification.” (105:04)
[120:00] Tech & Society: AI Slang, Radioactive Shrimp, & More
AI & Youth Slang Review
- Panel reviews Fast Company’s list of new AI slang (e.g., Clankers, Grok Sucker, Slop).
- Delight in new words: “Delulu”, “Skibidi”, “Tradwife”, and absurd AI-generated pop culture.
Food Safety: Nuclear Shrimp
- Paris reports on Walmart’s “radioactive shrimp” recall due to cesium-137 contamination.
- Paris: “You’re supposed to throw them away. And I think it’s kind of funny because… [this time] they don’t want radioactive shrimp back.” (127:19)
[156:00] Closing Thoughts and Picks
Open Data, Resistance, & Technology Inevitability
- Rousing discussion of whether AI’s dominance is “inevitable” or societally determined.
- Mention of union resistance at the University of Michigan, and questioning of the “inevitability” narrative sold by tech CEOs.
A Turing Test For Short Stories
- Leo mentions a writer who challenged both humans and GPT-5 to write 350-word stories on a “demon” prompt—most readers can’t tell the difference.
[170:23] Picks of the Week
- Paris: Project Indigo (Adobe’s computational photography app) for filmic phone photos—“I really like it… the photos just look more film processing style and just kind of effortlessly good.”
- Jeff: MIT/NYT study using AI to analyze public spaces over time, showing changes in how city dwellers move and engage.
- Leo: NYT’s short documentary on “Salt Hank’s,” his son Henry Laporte’s runaway hit NYC sandwich shop.
Most Notable Quotes
- Rich Skrenta: “Opt-outs make me sad. They really do.” (14:11)
- Rich Skrenta: “Robots are people, too. A robot’s gonna walk into a library and it should be able to read the books... This is, in my opinion, inevitable.” (18:16)
- Leo Laporte (to Rich): “I think that’s the mission at this point: convince people that it is to your detriment to not be part of the global information commons.” (31:03)
- Paris Martineau (on radioactive shrimp): “They don’t want radioactive shrimp back.” (127:19)
- Leo Laporte: “You’re acting as if these robots are people. How can you assign rights to AI?” (22:49)
- Jeff Jarvis: “Walking speeds have increased by 15%... I probably changed the average myself.” (176:09)
Major Timestamps
- [02:16] – What is Common Crawl, history, and use cases
- [11:39] – Shift from mass opt-out to selective opt-in by publishers
- [14:11] – On opt-outs and the value of being included in open data sets
- [18:16] – “Robots are people, too”—the ethics of web crawling for AI
- [29:03] – Opt-outs today vs. a decade ago (now 25%!)
- [33:54] – “The most popular S3 bucket ever” and how it fried the interconnects between Amazon and other large clouds
- [43:41] – Expansion plans: more languages, data, but preserving quality
- [54:00] – Google Pixel launch recap: celebrity cameos, awkward moments
- [95:57] – AI misalignment: Grok, Mecha Hitler, and the dangers of unaligned models
- [120:00] – AI slang (“clankers”, “grok sucker”), new youth words, and “Skibidi Toilet”
- [127:19] – Paris on Walmart’s radioactive shrimp recall
- [156:00] – AI resistance, the myth of “inevitability”, and open discussion about how we shape technology
Summary Takeaways
- Common Crawl is a crucial open resource powering AI, but is under threat from publisher opt-outs amid moral panics over copyright and AI training.
- Publishers who block crawling risk irrelevance in a world moving from search engine SEO to “AIO” (AI optimization).
- Centralizing crawling (like Common Crawl) reduces internet strain, but consensus over fair compensation, attribution, and ethics is urgently needed.
- AI can be a force for good or ill, as discussed via both technical failings (bad alignment) and sociopolitical choices (use in pharma, labor, and journalism).
- The episode highlights both the silly (AI slang, radioactive shrimp) and the profound (how we remember, what we leave behind, and how we design digital society).
End
For more about Common Crawl or to read Rich Skrenta’s article “AI Optimization is Here: Are You Ready for Search 2.0?”, visit commoncrawl.org.
Find the full show at twit.tv/shows/intelligent-machines.