Summary

WSJ Tech News Briefing

Episode Summary: The New AI Data Trade, Part 1 – Cashing In on AI

Date: August 17, 2025
Host: Coleman Standifer (The Wall Street Journal)

Overview

The episode examines the emerging market for licensing data to AI companies, focusing on whether smaller content creators can cash in as "AI data providers." The host, Coleman Standifer, unpacks how the relationship between AI models, web data, and media creators is changing—from unauthorized scraping to copyright lawsuits, multimillion-dollar licensing deals, and new monetization tools. The discussion is anchored on the experiences of both independent creators (e.g., Jared Brick of Brick House Media) and big media companies, spotlighting the complex question: In the growing AI data economy, will it pay off for the little guys?

Key Discussion Points & Insights

1. AI’s Insatiable Demand for Data

LLMs (Large Language Models) require massive, ongoing streams of data to operate effectively and stay current.
Quote:

“From the perspective of LLMs, data is essentially words. For LLM designers, words are like the new oil.”
— Bob McMillan, Tech Reporter (02:36)
Traditionally, LLMs have trained on any publicly available data ("wherever humans have created linguistic products"), raising questions of consent and compensation.

2. The Old vs. New Internet Value Exchange

Traditional web crawlers (Google, DuckDuckGo, etc.) drove users to the sites they indexed, benefiting content creators with traffic and ad revenue.
AI crawlers now extract data without necessarily returning users to the source, threatening the old web economy.
Implication: Loss of traffic = loss of revenue for media creators and publishers.
Quote:

“…AI services don’t often send their users back to the websites they pulled from. For many publishers and content creators, this amounts to…unauthorized access…and a loss of traffic and revenue.”
— Coleman Standifer (03:27–04:44)

3. Legal Showdowns & Big-Publisher Deals

Content owners are fighting back both in courts and with commercial agreements:
- New York Times, Reddit, Disney, NBCUniversal, Dow Jones—all mentioned as plaintiffs or parties to lawsuits against AI companies over unauthorized use of data.
- Some, like News Corp (the Journal’s parent) and the New York Times, are cutting major licensing deals with AI firms (OpenAI and Amazon, respectively).
Quote:

“Amazon will be paying the New York Times $20 million a year to access its content. News Corp...signed a deal with OpenAI that could be worth more than $250 million over five years.”
— Coleman Standifer (08:53)

4. Reddit’s Pushback and Landmark Lawsuits

Reddit has enacted a new policy distinguishing commercial vs. non-commercial data use, seeking compensation or blocking scrapers.
Reddit CEO Steve Huffman:

“If they continue to take, then we’ll be forced to file a lawsuit, which is what we did in this case.”
— Steve Huffman, Reddit CEO (07:36)
Ongoing lawsuits, such as Reddit vs. Anthropic, may determine the future legal contours of AI data usage.

5. Small Creators Find Opportunity in the AI Data Trade

Independent production studios like Brick House Media can license archives to AI firms, potentially turning dormant media into revenue.
Jared Brick’s Lightbulb Moment:

“Now they’re coming back to content creators because they realize we have so many terabytes and petabytes of media sitting on hard drives that they have no access to...I’ve got all this media. It now has value. It didn’t have value really before.”
— Jared Brick (10:34)

6. New Tools: Monetizing the Crawl

Cloudflare’s “Pay Per Crawl”
- Enables web publishers, including small sites, to set terms and prices for AI crawlers accessing their content.
- Will Allen, Cloudflare VP:
  
  “You’ll get a certain response back when there’s a payment required and it’ll include the price per crawl. You can decide…great, I want to pay for this content and use it. Or no, I don’t want to.”
  — Will Allen (10:00)
Such tools may shift the balance of power, providing an alternative to litigation or take-it-or-leave-it deals.
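The exchange Will Allen describes (a "payment required" response that carries a price, followed by the crawler's choice to pay or walk away) can be sketched from the crawler's side. This is only an illustration under stated assumptions, not Cloudflare's actual interface: it assumes the response is modeled as HTTP status 402 ("Payment Required"), and the header name `x-price-per-crawl`, the `CrawlDecision` type, and the `decide` function are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Mapping, Optional

OK = 200
PAYMENT_REQUIRED = 402  # standard HTTP status code "Payment Required"
PRICE_HEADER = "x-price-per-crawl"  # hypothetical header name, not from the episode


@dataclass(frozen=True)
class CrawlDecision:
    action: str  # "use", "pay", or "skip"
    price: Optional[Decimal] = None


def decide(status: int, headers: Mapping[str, str], max_price: Decimal) -> CrawlDecision:
    """Decide what a crawler does with a single response.

    200 -> content is freely available, use it.
    402 -> read the quoted price and pay only if it fits the budget.
    Anything else -> skip the page.
    """
    if status == OK:
        return CrawlDecision("use")
    if status != PAYMENT_REQUIRED:
        return CrawlDecision("skip")

    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_price = lowered.get(PRICE_HEADER)
    if raw_price is None:
        return CrawlDecision("skip")

    try:
        price = Decimal(raw_price)
    except InvalidOperation:
        return CrawlDecision("skip")  # unparseable price: do not pay
    if not price.is_finite() or price < 0:
        return CrawlDecision("skip")

    # "Great, I want to pay for this content" versus "no, I don't want to."
    if price <= max_price:
        return CrawlDecision("pay", price)
    return CrawlDecision("skip", price)


if __name__ == "__main__":
    budget = Decimal("0.01")
    print(decide(402, {"X-Price-Per-Crawl": "0.005"}, budget))
    print(decide(402, {"X-Price-Per-Crawl": "0.05"}, budget))
```

Keeping the decision in a pure function separates the pricing policy (here, a simple per-request budget) from the networking code, which is one reasonable way a crawler operator might structure this.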

7. Unanswered Questions – How Much Money, Really?

The episode closes with a preview of Part 2, promising a reality check on the actual economics for smaller content creators.
Quote:

“So how do these smaller AI licensing deals work and how much money is really up for grabs? That’s in the second installment…”
— Coleman Standifer (10:53)

Notable Quotes & Memorable Moments

Jared Brick (On the data opportunity):

“We had no monetization strategy for it other than just archiving it. When I learned that AI licensing…was a thing...I’ve got all this media. It now has value.”
(01:08, 10:34)
Bob McMillan (On LLM priorities):

“Data is essentially words. For LLM designers, words are like the new oil.”
(02:36)
Coleman Standifer (Explaining the shift):

“AI services don’t often send their users back to the websites they pulled from...”
(03:27)
Steve Huffman, Reddit CEO (On enforcing data rights):

“We can cut them off, we can ask them to stop, but if they continue to take, then we’ll be forced to file a lawsuit...”
(07:36)
Will Allen, Cloudflare VP (On pay-per-crawl tools):

“You’ll get a certain response back when there’s a payment required and it’ll include the price per crawl.”
(10:00)

Timestamps for Key Segments

[00:18] — Jared Brick explains why content creators want protection and compensation for their data.
[01:08] — Jared Brick on discovering the potential to license archived content for AI training.
[02:36] — Bob McMillan: Data is the new oil for LLMs.
[04:44–05:28] — Lawsuits from major publishers and the growing legal battle over data usage.
[07:36] — Reddit CEO Steve Huffman on why Reddit is suing AI companies.
[08:53] — Coleman Standifer details big licensing deals (NYT, News Corp, Reddit, etc.).
[10:00] — Will Allen (Cloudflare VP) explains the Pay Per Crawl system to monetize publisher data access.
[10:34] — Jared Brick describes seeing new value in old digital content.
[10:53] — Tease for Part 2: The economics for small players in the AI data trade.

Overall Tone & Takeaways

Analytical, investigative, and occasionally cautious.
The episode is hopeful for smaller creators but clear-eyed: legal, operational, and financial frameworks are still shaking out.
Big money is now being made from data, but the returns for the “smaller players” are yet to be determined—a question the next episode promises to tackle.

For listeners:
This concise episode lays the groundwork for understanding why AI companies need your data, what’s at stake for publishers big and small, and how new tools and deals may pave the way for a fairer, more sustainable AI data economy.


Transcript

[00:01]
Joanne Wright
IBM is on a mission to become the most productive company in the world. Join SVP of Transformation and Operations Joanne Wright at the break to learn how its mission can benefit your enterprise and why AI is the catalyst for success.
[00:18]
Jared Brick
We don't want to just hand over our media to AI companies without some protection and compensation.
[00:24]
Coleman Standifer
That's Jared Brick. He's the founder of Brick House Media, a small digital marketing agency based out of Santa Cruz, California. They make social videos for real estate companies.
[00:34]
Realtor (unnamed)
As a Realtor, getting noticed online is how I generate business.
[00:36]
Coleman Standifer
And video podcasts for local health groups. (Clip: "Hello and welcome to Naturally Well with Nordic Naturals.") Things like that. What you just heard Jared say is something a lot of content creators and small media companies have been talking about lately. They want to get credit, or better yet, paid, when an AI model uses their media for its training. A few months ago, Jared heard about what seemed like an answer. Brick House has been around since 2013, so it had a ton of unused footage sitting on hard drives.
[01:09]
Jared Brick
We had no monetization strategy for it other than just archiving it. When I learned that AI licensing through a colleague was a thing, he said, hey, I bet you have terabytes of data. I said, of course. Yeah, of course we do. And he said, well, do you know, you could license it to, to AI companies. It's just training AI learning models. It was like, let's explore this.
[01:31]
Coleman Standifer
AI companies and the large language models they've built need data and a lot of it. Audio, photo, text and of course video. And Jared has stumbled on a new niche that's cropped up in the world of AI, a small industry of startups that promises to address one of content creators' biggest complaints about AI: fair compensation for using their data. So how did we get to this point where AI companies are paying smaller players like Jared for their video footage? And how much is that data actually worth anyway? I'm Coleman Standifer and this is The New AI Data Trade, a special two-part miniseries from the Wall Street Journal where we ask, can smaller content creators make money from their data? And will it be as much as they hope? This is Part 1: Cashing In on AI.
[02:36]
Bob McMillan
From the perspective of LLMs, data is essentially words. For LLM designers, words are like the new oil.
[02:46]
Coleman Standifer
That's Bob McMillan. He's been reporting on tech for the Journal for over a decade.
[02:52]
Bob McMillan
AI models need a lot of data and the more data they get, the better they are. So the web has a lot of data on it. These LLM models have been built on any place where humans have created linguistic products. They have been interested in sucking those up and using them to make their LLM chatbots better.
[03:15]
Coleman Standifer
You probably already know that LLMs like ChatGPT from OpenAI or Gemini from Google need data for their initial training. But they also need ongoing sources of data to get smarter, to stay current. And if they want access to data in real time, they need the Internet: billions of interlinked devices full of data. But at the same time, LLMs are upending the Internet status quo. Think about it this way. Before LLMs, if people needed something from the Internet, they would use search engines like Google or DuckDuckGo. In doing so, effectively, you're deploying a little Internet tour guide who's picking out the best websites based on your query and guiding you to them. The hard-working program that discovers new websites and feeds them into the search engine is called a web crawler. For years, websites relied on search engines and their web crawlers to drive traffic. For example, take Reddit, a social platform where users can join communities and have conversations about just about anything. Businesses like Reddit were built on the deployment of web crawlers that lead to search results and send over visitors. LLMs also scan the Internet using web crawlers. But there's a difference. AI services don't often send their users back to the websites they pulled from. For many publishers and content creators, this amounts to what they see as unauthorized access on the one hand and a loss of traffic and revenue on the other. LLMs upend the old Internet value exchange where search engines would drive traffic to publishers. Now that's not always the case and publishers aren't happy.
[05:02]
News clips (various speakers)
The New York Times is taking OpenAI and Microsoft to court filing. Reddit has filed a lawsuit against Anthropic for wrongful use of Reddit data. Disney and NBC Universal, which is CNBC's parent company, are filing a joint suit against AI company Midjourney. Dow Jones and the New York Post filed a lawsuit claiming that the AI startup engages in what they called a massive amount of illegal copying.
[05:28]
Coleman Standifer
This is a good time to mention that the Wall Street Journal as a publisher is of course affected by this changing landscape. The Journal is owned by News Corp, and two of News Corp's subsidiaries are suing AI-powered search engine Perplexity. Also, the Journal's owner, News Corp, has a content licensing partnership with OpenAI. Coming up, beyond the lawsuits, deals and new tools to let publishers and creators get paid for their data. That's after the break.
[06:09]
Joanne Wright
IBM set a goal to become the most productive company in the world. It started by asking questions, lots of questions, says Joanne Wright, SVP of Transformation and Operations at IBM.
[06:20]
How can we radically simplify end to end workflow and processes? What can we eliminate? How do we automate everything that we can? And then how do we embed AI into everything we do? So far, over a two year period, we've delivered over $3.5 billion of productivity savings for the company.
[06:45]
Coleman Standifer
Last year, Reddit drew a line in the sand. The company published a post outlining what they called a new public content policy. This policy laid out a distinction between commercial use of Reddit's data and non-commercial use for what the company called the new AI era. Basically, it said no scraping of Reddit unless we say so. This past June, Reddit filed a lawsuit against Anthropic, known for its AI model Claude. Reddit claims that Anthropic is unjustly enriching itself through the unauthorized use of data through scraping. As grounds for the complaint, a Reddit spokesperson cited the company's public content policy as well as its terms of service. Here's Reddit CEO Steve Huffman talking at the Cannes Film Festival in June.
[07:36]
Steve Huffman
Now we've had other folks who just take the content and use it to enrich themselves without regard to our terms or without respect to our users privacy. And in that situation we only have so many options. We can cut them off, we can ask them to stop, but if they continue to take, then we'll be forced to file a lawsuit, which is what we did in this case.
[08:04]
Coleman Standifer
I reached out to Anthropic for this series and they declined to be interviewed. They sent me a statement saying, "We disagree with Reddit's claims and will defend ourselves vigorously." Meghan Bobrowsky is a tech reporter for the Wall Street Journal, and she reported on the lawsuit. She says courts will have to decide on some key questions, including whether scraping Reddit's content and using that information for commercial purposes is legal.
[08:32]
Meghan Bobrowsky
Just because something is publicly on the Internet and it's publicly accessible, if Reddit's saying, "No, you need to have a licensing agreement with us to profit off of this," do they side with Anthropic and say, "This information is publicly on the website. Don't put it publicly on the Internet if you don't want it to be taken by us"?
[08:53]
Coleman Standifer
Publishers have valuable data, data that they want to monetize. And so while the legal issues are being hashed out in court, some big media companies are striking deals. We've reported Amazon will be paying the New York Times $20 million a year to access its content. News Corp, the Journal's parent company, signed a deal with OpenAI that could be worth more than $250 million over five years. And Reddit's deals with OpenAI and Google represented most of a $35 million line item in the company's latest quarterly report. These deals started putting a price tag on using content for AI. But while many big media companies are striking multimillion-dollar licensing agreements, where does that leave smaller publishers? Cloudflare, a content delivery network that provides Internet infrastructure for websites, is working on a product currently in private beta to allow web publishers to monetize crawling itself. The company calls it Pay Per Crawl. Here's Cloudflare VP of Product Will Allen on how it would handle AI crawlers.
[10:01]
Will Allen
Any site that is sitting behind Cloudflare and is using their Pay Per Crawl product, you'll get a certain response back when there's a payment required and it'll include the price per crawl. You can decide, great, I want to pay for this content and use it. Or no, I don't want to.
[10:17]
Coleman Standifer
So by offering a tool that lets small publishers get paid by AI companies, Cloudflare is playing in the middle. And they're not the only ones. Remember Jared from Brick House? He's a small content creator who was an early adopter of AI data licensing.
[10:34]
Jared Brick
Now they're coming back to content creators because they realize we have so many terabytes and petabytes of media sitting on hard drives that they have no access to. It has never been uploaded to the cloud. So that's when I had this moment of like, I've got all this media. It now has value. It didn't have value really before.
[10:54]
Coleman Standifer
So how do these smaller AI licensing deals work and how much money is really up for grabs? That's in the second installment of The New AI Data Trade, coming tomorrow. Until then, The New AI Data Trade was produced by me, Coleman Standifer. Sound design and mix by Jessica Fenton. Aisha Al-Muslim is our development producer, Scott Saloway and Chris Zinsli are our deputy editors, and Philana Patterson is the Wall Street Journal's Head of News Audio. I'm Coleman Standifer. Thanks for listening.
[11:32]
Joanne Wright
It's not just IBM that benefits from its mission to be the most productive company in the world. So do its clients. Joanne Wright, SVP of Transformation and Operations at IBM, explains.
[11:42]
We've created a playbook that's Client Zero for how to do really fast, effective AI. The key has been to drive for progress over perfection. We've built a solid foundation with data and taken the opportunity to really learn from the people who have a role to play in running IBM each and every day. Our own experience has taken us far beyond just doing pilots and theory to real ROI and real productivity. A lot of our clients are very hungry to know what they can learn from us as Client Zero and then obviously how they can avoid perhaps some of the mistakes we've made or some of the failures we've had. The fact that we've been able to derive and deliver our own use cases across everything that we do really transcends our clients' experience.
[12:21]
Visit IBM to learn how AI can drive enterprise wide productivity.
[12:25]
WSJ Advertising Disclaimer
Custom content from WSJ is a unit of the Wall Street Journal Advertising Department. The Wall Street Journal News Organization was not involved in the creation of this content.