A
Hello and welcome to The Vergecast, the flagship podcast of music that sounds eerily but not exactly like other music. I'm your friend David Pierce, and today on the show we're talking about training data. It's the raw material of everything that is AI, and we probably don't understand it or talk about it enough. I'm talking to Alex Reisner, who's a staff writer at The Atlantic, and over the last couple of years he has spent a lot of time investigating training data and how all of these books and all of these articles and all of these YouTube videos and all of these songs are compiled into these gigantic data sets that AI companies then use to form the basis of their models. I think the way that these models are created and the sources of this data have a lot to do with the way that we feel about AI, particularly generative AI as a creative expression. And understanding how this data works, where it comes from and how it gets used, I think is really important. So we're going to have Alex on, we're going to dig into it. I'm very excited about it. But first, here's everything else happening on The Verge today.
This is 90 Seconds on The Verge for Thursday, June 25, 2026. Apple just raised the prices of a huge number of its most important products. iPads and MacBooks in particular are now up anywhere from $100 to $500. And even the Apple TV is way up: it's now $200, roughly the price of 900 Amazon Fire Sticks. We knew this was coming, but this is still a big moment. Apple is as well managed a supply chain company as anybody and operates at really high margins. If it can't keep prices down in this era of AI-driven shortages of memory and storage, nobody can. And I suspect this is going to get even worse. My advice from yesterday still very much holds, which is Prime Day is not over. Go get your deals while you still can. And here's a way to help you pay for all of those more expensive gadgets: go get some money from Disney.
If you have YouTube TV or DirecTV Stream, you might be eligible for a cash payout from a new $50 million settlement. The case started in 2022, and the argument was essentially that because Disney owns ESPN and Hulu, which are both powerful streaming services in their own right, Disney was able to drive up the prices of all of its rivals to unnecessary heights. Live TV is lucrative and competitive and is going to keep being messy in this way, and this will not end things. But hey, go get some of that $50 million. And finally, a quick PSA to advertisers everywhere. Yes, I get it. The whole idea of using AI creative and AI targeting to put your AI ads in front of AI people is exciting, I guess. But maybe check your ads to make sure that you're not, I don't know, advertising a bike with two sets of handlebars like REI did this week. That may not go over so well with people. Just a little free advice from me to you. You can read more about all of this at theverge.com. That's 90 Seconds on The Verge for Thursday, June 25th.
All right, let's talk training data. Alex Reisner from The Atlantic is here. Hi, Alex.
B
Hey, David. Thank you for having me.
A
Very excited to talk to you. I think we have talked a lot on this show about why people feel the way that they feel about AI and the sort of instinctive reactions to the whole idea of AI. And a theory that I have is that a lot of it is about training data, and I just want to talk about that. You've done a lot of work on this stuff and you've been investigating how AI models get trained for a long time now. But I want to lay a little bit of groundwork here and ask the very obvious question up front: why does it matter what data is used to train these products? What difference does it make what's inside of these models?
B
I think what a model is trained on is potentially its most important aspect. I mean, if you take a model, let's say it generates music, and you train it on 1950s jazz, that model will be very good at generating music that sounds a lot like 1950s jazz. If you train it on recent hip hop, it's going to generate music that sounds like recent hip hop. You know, these models have names like ChatGPT and Claude, but I think you could make an argument that the right name for a model is actually the description of the data it was trained on, because that is a description of its capabilities. That's what it can output. And so I think that the training data is really fundamental to the model, maybe more than the architecture, to some degree.
A
Interesting. And I think that kind of answers, to some extent, my next question, which is: why is training data such a closely guarded secret from these companies? It seems to me that there is both a straightforward business answer and maybe a slightly more nefarious answer as to why these companies would be so secretive. But what is your read on why it is such a closely guarded secret what data is being used to train these models?
B
Yeah, I mean, the companies have argued that they need to keep it a secret because the data that they have selected to train on is their competitive advantage. Right. Like Anthropic has done a better job at selecting data than Google and OpenAI. And if they were to let that come out in a court case or be public in some way, they would lose their competitive advantage. There's another pretty obvious reason, which is that they have gone about acquiring a lot of this data in ways that the people who've created the data, the authors of the books and the creators of the videos and the music, would not be happy about. And in a lot of cases, they just don't know that their work is being used. And when they find out, they're not happy about it. And I think it's a conversation that the AI companies have tried to just avoid having.
A
So a big part of what you've been up to recently has been sort of reverse engineering this process, peeling the model apart to figure out what data is inside of it. And it seems to me that you've had to do sort of the reverse of what all of these companies have done, which is go figure out where these mass sources of books or web articles or videos or songs are. So tell me a little bit about your process. How do you go about finding these things that are otherwise closely held secrets?
B
Yeah, my process is actually maybe not that different from the companies' process. I am a programmer. I worked in tech for 20 years. I built websites and apps and did some statistics work. And so I've been aware of these models for a long time. At a certain point I started hanging out in the forums where AI developers were hanging out and talking about the work, and they're talking about what data they're using to train these things. So that was a useful source. And I've also been reading a lot of their research papers. Selecting the data is really challenging. There's a lot of effort that they put into it. They have to come up with some notion of what is high-quality data, like what do and don't we want in the model, and they write papers about that. I think they want to be involved in this conversation that's happening about training data within the industry. One thing that's been really helpful to me is that there's an open source AI development community, and they believe that the work should be done more out in the open. I think a lot of them are doing really good and important work, and they have really interesting things to say about AI, just philosophically and socially. And so places like Allen AI and Eleuther are pretty transparent about what they're using to train. That's been helpful. But then even at the companies, the AI world is a little bit like academia in that it is good for your career to publish papers. The companies don't like that, but they also acknowledge that they have to let the employees publish something, and so the lawyers will go over it and tell them what they can and can't say. Over time they've clamped down more, and so the companies are revealing less through the research papers. But yeah, a lot of my research is just from reading, for example, a Google paper for this last article, where they said they trained on tens of millions of songs.
A
Yeah, I remember not that long ago Apple going through a big cultural issue with this with its AI team, because Apple is so secretive and so reluctant to let any of its work be public that all of its researchers were like, well, if you're not going to let us publish, we don't want to work here. And that push and pull seems like it has kind of morphed in a bunch of different directions over time. But it's interesting to hear that everybody is definitely retrenching a little bit as this space gets hotter and hotter.
B
Yeah, it's not 2021 anymore. The research papers read very differently than they did a few years ago.
A
That's really interesting. But tell me a little bit about these databases. One thing I think I had not given enough thought to until I started really reading your work is that there's a real business in making and maintaining these databases. Like one company you wrote about, Common Crawl, I think is a thoroughly fascinating player in this, and they essentially, as far as I can tell, just crawl the Internet and make it available to whoever wants it. Which is weird and complicated, and we should probably come back and talk about Common Crawl sometime. But it seems like if you look around a little, these gigantic databases of books and articles and videos exist. Are people making these and selling them? Why do these databases exist, mainly?
B
Well, I mean, for AI training. Why else would they exist? It's an extremely labor-intensive process to find the data. In one sense you just go and download all of Library Genesis or all of Anna's Archive. That's one sort of naive thing you can do. But the companies realized pretty early on that they need to filter the stuff pretty carefully. So the organization you just mentioned, Common Crawl, yeah, they've been crawling the web since the late 2000s, maybe 2009 or something like that. And they just make the whole thing available. Every month they've scraped a few hundred million more web pages, and it's available to anyone who wants to do any kind of research with it. In fact, it's mostly AI researchers who are using it. But all the early large language models were trained on Common Crawl. If you go back and read those early OpenAI papers, everyone was training on Common Crawl at first, and the models were terrible, because if you train a model on the whole Internet, it says all the junk that people say on the Internet along with the intelligent things.
A
Yeah, mostly junk. Yeah, I would say statistically mostly junk.
B
Statistically, yeah, mostly junk. And I think the early large language models were proof of that. But yeah, Common Crawl is a nonprofit, so they would argue it's not a big business. They do get a lot of money from AI companies and AI investors. But yeah, I think the challenge of selecting the right data for a model is still really hard. The AI companies, I would say, still have a very primitive understanding of what data will make their models better. It's an area of research that I think even they at this stage are not very good at. They do it mainly by trial and error, as far as I can tell.
A
Interesting.
B
Again, so the reason you're asking, why do these datasets exist? I think it's people trying to share what they've learned from curating data sets in different ways and training models with them.
A
Yeah, I mean, part of the reason I ask is that one of the things I have come to believe about the AI industry is that there was this shift away from AI being fundamentally a research thing. If you go way back, OpenAI was basically a research organization, right? And you talk about Common Crawl, and I think its early users were largely researchers. These were academic things for academic purposes. And from what I understand, the rules of the road are different, right? If you want to make copies of a bunch of things for academic purposes, that's generally considered less problematic. But if you then do it and become a trillion-dollar company on the back of it, people are going to rightly feel differently about the way that you went about getting that information. And AI commercialized so fast, all of these companies moved so quickly from, we are essentially an academic thing, to, oh my God, we're making so much money, everyone's filthy rich, that it feels like they just hoped everybody would ignore the ways in which they got this information. And so that's why I'm particularly fascinated by where these data sets even come from. Because it does seem like in many cases, like you're saying, it's not that there is some gigantic business in being the one to sell the songs to somebody. It's that this stuff is being compiled for other purposes. It's just that now the main purpose, because of the sheer volume of work, is AI research. All this stuff has been sort of co-opted from every other purpose to AI, and it just all feels so concentrated now. Does that feel right to you? Does that make sense?
B
Yeah, I think I agree with most of that. I'm not sure how much these data sets were really collected for other purposes. Common Crawl likes to talk about that. I think they're the one case; they probably have the strongest argument that their data could be used for other purposes. But when you go back, they've been cited by over 10,000 papers. I didn't read all 10,000, but I read a lot of them, and they are mostly AI, right? And early on, a lot of it is stuff that people wouldn't mind as much as they do with generative AI, right? Like, without Common Crawl, AI translation tools might not be as good as they are. I think it was really a huge help, because they scraped web pages, the same page in multiple languages, and people were able to train translation models based on that. So that was helpful. But there is still what I would call a data laundering network that the AI companies are relying on. They'll do a collaboration with a university and they'll have the university download millions of images to train a model or download millions of articles to train a model. And the AI company can say, well, we didn't do it, this was an academic thing. The same goes for nonprofits: Common Crawl is not the only nonprofit that's doing a lot of this scraping for the AI industry. One of the data sets I reported on in the article about music training data is from this organization based in Europe called LAION. They have a data set of 12 million songs from YouTube. So anyway, is it academic? Not really. Technically, yeah, there are universities and nonprofits, but they're all receiving money from the AI industry.
A
This is kind of a diversion, but I'm so struck in reading all of your work by how often YouTube appears. It is everybody's favorite source for everything. It's what OpenAI allegedly used to create Whisper. It's what a lot of the music stuff is using. There's a lot of stuff based on videos. What is your sense of YouTube's role as an AI training force? Because it seems to be everywhere.
B
Yeah, that's accurate. YouTube is an extremely common source. I think one reason is there are tools for downloading from YouTube that work really well. They're really easy to use, and it's pretty common for AI developers to just use those tools. It's become a custom. But also, stuff on YouTube is just less protected, I think is one way of saying it. If you're a musician, your song might be on Spotify, but Spotify's website has digital rights management protections. It's really hard to download from Spotify. It's much easier to get the same song from YouTube, and so many songs are also on YouTube. So I think it's just ease of downloading.
A
YouTube obviously would say out loud that this is not allowed.
B
Right.
A
That it violates its terms of service. And yet it seems that either it can't or it just hasn't done anything to stop this, really.
B
Yeah, that's a question that's been in the back of my head for a long time, and which I've asked YouTube, and which they don't really answer. They have said that they consider it a violation of their terms of service to be downloading their videos. But years have passed, and it's just as easy to download from YouTube now as it was a few years ago.
A
Yeah, I downloaded a YouTube video this morning. It is just a thing you can do. You mentioned the sort of data laundering stuff, but it also seems to me that more and more the people doing the AI training are just completely unapologetic about it. Like, you quoted Rich Skrenta, the CEO of Common Crawl, who literally said to you, if you don't let AI robots crawl your data, you essentially don't exist. You're going to miss out on the future of the Internet. And I think about Marc Andreessen even a couple of years ago being like, none of this would work if we couldn't just take the training data that we needed. There is this almost manifest destiny sense in the AI industry that we can have this data because the thing that we're doing is so important that we must have it no matter what. What is your sense of the trend there? Because it occurs to me that that's happening even as the backlash from people who hate the experience and feeling of AI, in part because of the way that this stuff is trained, just keeps getting worse. These things are just running away from each other at record speeds.
B
Yeah, that's a huge question. I think it has something to do with the fact that we just have not done a great job in this country of establishing the value of data and who should be able to have data, right? This is something that privacy advocates have talked about. Jaron Lanier, I think, was one of the earliest people to be talking about this. I think he wrote in 2007 that you should get paid. Like, you were being surveilled, basically. Companies are monitoring everything you do. They're generating data from your online activity. That data is extremely valuable to them. It might seem like nothing to you, but you should be getting paid for that. People thought that was crazy back then, and I think a lot of people still think that's crazy now. But taking people's music to build models that generate songs that compete with them is just the next version of that. And it's just going to keep going. If we don't acknowledge that this data is incredibly valuable and figure out a way to write laws around that or just have better business practices or something, it's just going to get worse. There's just going to be more exploitation and more mining.
A
The next step, I think a lot of people have perceived for a while, is synthetic data, right? That eventually we're going to get AI models that are so good at making new things that we can then use those things to train new models, and eventually they don't need existing recorded music. That synthetic data is the future and that's how we get to everything. And especially now, I mean, you look at Spotify and there are tons of AI-generated songs on Spotify that some people are listening to. There are a billion AI-generated podcasts out there. The content is being made. Is that the next phase of training data? Is synthetic data coming into its own in such a way that we're going to start to see the next generations of these models built on the things made by the last generations of these models?
B
Absolutely not, really. I don't think that there's any evidence that that actually works. I think when AI companies talk about training on synthetic data, they choose their words very carefully and they always exaggerate the extent to which it's happening. There's a lot of research out there on a phenomenon called model collapse, which is what happens when you train a model on its own outputs. It doesn't get better, it very quickly degrades. And it's not hard to see why that could be the case. AI is kind of an averaging machine. It's finding a statistical average between different types of content and putting that into some new kind of more average type of content. And there's just not enough weirdness or interestingness or something like that. There's some quality in the work that humans do that's not in the work that AI does. And I think that's actually proved by the model collapse phenomenon. So I'm amazed that AI companies are going around talking about synthetic data still when there's so much evidence that it doesn't work.
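The model collapse idea described in this answer can be illustrated with a toy simulation. This is only a minimal sketch under loud assumptions: the "model" is just a Gaussian fitted to ten numbers, and the function name `simulate_collapse` and its parameters are invented for illustration. It says nothing about how any real AI system is trained. Each generation fits a mean and spread to the current data, then replaces the data with samples from that fit, so every later generation learns only from the previous model's outputs.

```python
import random
import statistics


def simulate_collapse(n_samples=10, generations=200, seed=0):
    """Toy 'train on your own outputs' loop with a Gaussian as the model.

    Hypothetical helper for illustration only; returns the spread
    (population standard deviation) of the data at each generation.
    """
    rng = random.Random(seed)
    # Generation 0: stand-in for "human" data, drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spread_history = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train": fit a Gaussian (mean and spread) to the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # "Generate": the next generation sees only the model's own samples.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        spread_history.append(statistics.pstdev(data))
    return spread_history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: spread = {history[gen]:.6f}")
```

In runs of this sketch the spread typically shrinks toward zero over the generations, a toy analogue of the loss of variety described above. Larger sample sizes slow the decline.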
A
So is there a next untapped place filled with data? I mean, these models keep getting bigger, they keep needing more data. They got to go somewhere, right? Is there a next sort of unturned place to go for these companies?
B
I think they just pay people to make it. I think there's already a gigantic industry of writers who are writing for AI and musicians who are making music for AI. After I published this story, I got an email from a company that is doing this. They claim to have paid creators over $10 million just to make things for AI training. It's very strange, but AI as your audience is the next frontier for creators.
A
That is a deeply strange thing to think about, but also like, you know, we're going to put this on YouTube and it's going in there anyway. So who knows? Maybe this is all of our destinies, no matter what.
B
I certainly hope not, and I don't think so. I think we can have a conversation and arrive at a more reasonable future for ourselves and the culture.
A
I'm with you on that. All right, well, Alex, thank you so much for being here. I really appreciate it.
B
Thank you, David. That's great.
A
All right, that's it for the show. Thank you to Alex for being here, and thank you as always for watching and listening. If you have thoughts, questions, feedback of any kind, or if you have a favorite AI song you want to send me, just not the Puerto Rico song, I already know that one, send me all your favorite AI songs. No judgment. I just want to hear all of them. You can always hit up the hotline at 866-VERGE11. You can send us an email at vergecast@theverge.com. We love hearing from you. And as a reminder, the best thing you can do to support all of this is to subscribe to The Verge at verge.com/subscribe. It gets you all of our podcasts ad free, including this one. It gets you all of our exclusive newsletters. It gets you all of our coverage. Terrence O'Brien on our team has been covering Suno a lot and doing a really terrific job. Lots to come on that. Subscribe to The Verge. I think it's a pretty good website. The Vergecast is a Verge production and part of the Vox Media Podcast Network. The show is produced by Josh Kahas, Eric Gomez, Brandon Keefer, Travis Larchuk and Aaron Locasio. We'll see you tomorrow. Rock and roll.
In this episode of The Vergecast, host David Pierce dives deep into the under-examined subject of training data in artificial intelligence. Joined by Alex Reisner, staff writer at The Atlantic and a leading investigative reporter on AI data sourcing, they explore how books, articles, YouTube videos, and songs become the raw material for generative AI models. The discussion unpacks the enormous importance of training data—how it's gathered, why companies keep it secret, the ethics and business models behind large datasets, and contentious issues about ownership, privacy, and the rapidly shifting landscape of AI. The conversation also touches on the limitations of synthetic data and the strange new economies emerging as creators generate data specifically for AI training.
“If we don’t acknowledge that this data is incredibly valuable and figure out a way to write laws... there’s just going to be more exploitation and more mining.” (B, 21:50)
“There’s so much research out there... if you train a model on its own outputs, it doesn’t get better, it very quickly degrades.” (B, 24:00)
The conversation is informed, slightly incredulous, and colored by dry humor and a sense of urgency—especially around the ethical and social implications of how AI models are built. David’s questions are probing and relatable, while Alex provides concise, insightful commentary rooted in investigative experience.
If you want to understand why the training data behind AI models is both the industry's best-kept secret and the heart of current ethical crises, this episode peels back the layers. It shows that as AI gets more powerful and commercially dominant, the foundational question of who owns, controls, and gets compensated for data becomes ever more urgent. The path to a fair digital culture, host and guest agree, requires more public conversation, clearer laws, and perhaps some imaginative new business models.