A
So one definition of intelligence is sample efficiency. That is to say, how much data do you need in a given domain to operate fluently and competently? And it's actually not clear that we've made that much progress in training sample efficiency over the last few years. It seems more like we've just dramatically widened and improved the data distribution. The main way that AIs have been getting better is from adding more and better data and scaling the compute required to develop that data in the first place.
Obviously RL is the main way that this has happened. You can think of RL as basically a kind of synthetic data generation where you dump a ton of compute against a verifier or a rubric (if you have an LLM as a judge), and you do this in order to find out what the good data is in the first place. Then you train your model to predict these correct rollouts, much in the same way that you might train that model to predict the next word in Internet text. For this process to work, the model must have at least some prior probability of anticipating the correct solution in the first place, which is why you need mind-stretching amounts of human expert trajectories in every single field and skill that you want the model to eventually be competent in.
It's hard to overstate how task specific and bespoke this human expert data is. If you want some intuition, I recommend checking out the job descriptions on Mercor's or Surge's websites. There are listings for Word specialists who will convert legacy documents into polished Word files, legal experts who will write realistic M&A diligence or securities filings, and management consultants who will write up template market research. And it is not only that the data have to be so domain specific, but there has to be so much of it. Each skill corresponds to at least hundreds of human experts who are generating example completions, writing rubrics and explaining their chain of thought.
There's a reason that the data industry that is producing these expert labels, and the RL environments in which these meticulously cataloged skills can congeal, is earning billions a year in revenue, soon to be decabillions. Now imagine if it took a couple decades' worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a Word file. Even the task count difference here understates the gap, because the models have to grind each of their far more numerous tasks far harder. Whereas a human student might practice a textbook problem once or twice, with GRPO these models are generating hundreds to thousands of rollouts per task, and they need to solve the credit assignment problem. The correct way to think about these models is not like a human who has learned all these different skills that you see these models displaying. It's more like a Frankenstein's monster, which has been built out of a billion grafts of carefully constructed examples all sewn together.
Epoch recently reported that open models lag state-of-the-art frontier models by four months. I think the reason it is relatively easy for open source and previous laggards to catch up to within months of the frontier is that data is the real driver of progress. And data can be easily distilled from public APIs, whereas hyperparameters and training tricks and architectural optimizations cannot. And if the latter were driving most of the progress, then catching up would be far harder than we are observing it to be.
It is easy to forget how much data these models are trained on and how much more it is than what we humans see in our lifetimes. We see these AIs as a galaxy glittering with capabilities. But at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data. Just a couple of points of comparison to help drive home how big this difference is. Here's one.
If a person sees and hears, on average, let's say generously, 2,000 words an hour, then between the time they're born and the time they're an adult, they'll see about 200 million tokens. Now, by contrast, these frontier models are trained on somewhere between tens to hundreds of trillions of tokens. That is close to a million-fold difference.
Here's another point of comparison. If you wanted to, you could learn to teleoperate any random humanoid or robot arm within hours. And if we could get AIs to learn just as fast, robotics would be a deca-trillion-dollar industry and you'd have an endless army of Unitree G1s doing all kinds of useful work in the world. But the reason we can't do this is that our AIs learn much less efficiently than we do. And even with the millions of hours of demonstrations that we've collected, this is not enough to allow them to perform complex open-ended tasks.
And a final point of comparison. A teenager can learn to drive a car with about 20 hours of practice. And even if we include their 16 years of growing up and understanding how the world works and building physical intuition, that is still three to four orders of magnitude less data than Waymo and Tesla are using to train their self-driving car models.
Now I want to deal with a couple of common responses and objections that people have to these kinds of comparisons. One thing people will say, and I think Karpathy said this when he came onto my podcast, is that for humans, many billions of years of evolution had to go into basically pre-training us. And so we're being unfair when we compare how little data we see within our lifetimes to what these cold-started LLMs, which are just starting off with a totally random initialization, have to learn from. I think this is not the right way to think about it. Our genome is only 3 gigabytes big and only 1 to 2% of it is protein coding.
And that is simply not enough space to store the parameters of this network that supposedly evolution has pre-trained. I think the closer analogy is that evolution found the right hyperparameters and the right loss functions, and that within our lifetime we are still building up from scratch the connectome in our brain, that is to say the analogous thing to the weights and parameters of the neural net itself. And even if you granted this comparison and you said yes, the hundreds of trillions of tokens that these models see in pre-training is similar to just catching up to evolution, that still doesn't explain why any new marginal capability that you want to give these models takes so much data. Again, once you have been educated, you don't need 100 different professors to teach you how to learn a new programming language. But these AIs, even once they're pre-trained, still require enormous amounts of data to learn the next marginal skill and the next marginal skill after that.
Another objection to this kind of comparison is that we're not including the multimodal data that we're seeing in our lifetimes. If you include all the sensory information that we take in from birth to adulthood, that's probably tens to hundreds of billions of tokens of data. And my response to this objection is simply that blind and deaf people who have been cut off from all this sensory information still have general intelligence. And that suggests to me that all these billions of sensory tokens are not really the thing that is making humans smart. And in fact, deaf people who don't have the ability to hear any tokens, who just have to consume them via sign language and reading, are probably ingesting far less than the 200 million language tokens that we ballparked earlier, which suggests that even the million-fold difference that we calculated earlier might be an understatement.
Okay, the third common objection people make is that we just haven't scaled enough.
We have these scaling laws. They tell us that bigger models are more sample efficient. The human brain, we know, has about 100 trillion synapses, and we have frontier models that are currently around 5 trillion parameters. And so maybe we could just achieve human-level sample efficiency if we made these models one to two orders of magnitude bigger. The reason this objection is off the mark is actually quite interesting. If you look at the way the scaling law equations work, they tell you that the parameter and data terms are added to the loss independently. So suppose you have a model and you've trained it compute-optimally, and you say, I want to be sample efficient. I want to use as little data as possible, and I'll throw in as many parameters as is necessary to make that happen. So take the constants from the Chinchilla scaling law paper. Even if you increase the number of parameters to infinity, that would only decrease by a factor of 10 the amount of data that you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. So scaling the size of current models simply can't make up for that discrepancy. And this really does suggest that humans are on a different scaling curve altogether.
As soon as I earn money, I want to put it to work. But I also need to save for things like upcoming expenses and estimated taxes. So to figure out exactly how much I need to set aside, I ask Command. Command is AI that is built into Mercury, which is my banking platform. And since I already use Mercury to run my entire business, Command has access to all the information it needs to get to work. I just tell Command the date I'm interested in and it does the rest. It takes my current balance and adds whatever invoices will be due by the cutoff. Then it reviews my last six months of transaction history so I can subtract out my monthly average expenses along with any scheduled payments.
And if there's anything relevant coming up that's not in Mercury yet, I can just flag it. Things like, heads up, there's a $12,000 contractor payment that's slated for July, and that gets included in the final output. Because this is all happening in chat and every answer has links to the underlying data, I can easily double check Command's work, and once I'm convinced, I can just tell Command, all right, that looks good. Just transfer the surplus to my personal account and it will immediately draft the transfer for me to approve. Command is live now. Visit mercury.com/command to learn more. Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. AI-generated responses and suggested actions may vary and are not guaranteed.
Okay, all these nerdy comparisons aside, you might ask, why do we even care about sample efficiency? Is this actually necessary for the labs to achieve the two overarching objectives they have, which are (1) automate white-collar work and (2) automate AI research itself? The bet that the labs are making with white-collar work is that the tasks that a software engineer or analyst or accountant needs to do are common, and as a result you can bring them into the training distribution quite easily. If you look at the revenue curves of these labs over the last few months, it does suggest that there's an enormous amount of value from bringing these kinds of common tasks into distribution, even if we can't replicate whatever is making human learning so special. And it might be more inefficient to train AIs to do these kinds of tasks than it is to train humans, but so what? Human lifespan simply does not allow for the quantity and the breadth of training that these models experience.
If you as a human had some weird learning disability where you needed to read through every public repository on GitHub before you could be a competent software engineer, then it would simply not make sense to train you up. You'd be on Social Security by the early stages of your education, and even once you were trained you would only be able to work on one project at a time. But AIs can learn these skills by firehosing gigawatts of training at a time, and what they learn can be amortized across billions of sessions at once. So we can be ludicrously inefficient in training them up and still be wildly in the green.
And then there's a question of, well, how much out-of-distribution thinking do white-collar employees need to do that you simply can't train for in advance? This is more a question about the nature of different jobs than it is a question about AI research. And it also depends on which job you're talking about. Some jobs are so mechanical and predictable that we were able to automate them long before the modern era of AI, for example bank tellers or travel agents. But there are other jobs which require dealing on a daily basis with problems that are quite distant from the data distribution. I think software engineering is probably one such job. This is the job that AIs are supposed to take first, but I would be willing to bet that there's overall more demand for human software engineers in 2027 than there is right now, largely due to the complementary input of AI.
The labs' plan for this latter category of jobs is first to automate AI research and then have the automated AI researchers solve the sample efficiency problem. So then the question is, can AIs which do not have human-level sample efficiency nonetheless solve the remaining research problems that stand in the way of human-like intelligence and learning? This is a very complicated question, and I'll have to address it in a much longer future blog post.
But just to tease it a bit, I think that the way that people currently think about an intelligence explosion is very clumsy, because either people dismiss the possibility of AI speeding up AI progress altogether, or they assume that some kind of God pops out the other end. They don't reason carefully about what it looks like to have a period where AI progress is much faster than usual, but have that happen atop LLMs and the particular kinds of intelligences that LLMs are. But I'll save that for next time. In the meanwhile, if you want to read this blog post or all the other blog posts I write, or be alerted when I write a future blog post, go sign up for my newsletter at my website, dwarkesh.com. All right, I'll see you later.
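As a rough check on the token comparison made in the episode, the arithmetic can be sketched as below. Only the 2,000 words per hour and the "tens to hundreds of trillions" range come from the transcript; the 16 waking hours per day, the 18 years to adulthood, and the one-token-per-word conversion are assumptions chosen for illustration.

```python
# Back-of-the-envelope check of the human vs. model token comparison.
WORDS_PER_HOUR = 2_000       # figure given in the episode
WAKING_HOURS_PER_DAY = 16    # assumption, not stated in the episode
YEARS_TO_ADULTHOOD = 18      # assumption, not stated in the episode
TOKENS_PER_WORD = 1          # simplifying assumption

# Language tokens a person is exposed to between birth and adulthood.
human_tokens = (
    WORDS_PER_HOUR
    * WAKING_HOURS_PER_DAY
    * 365
    * YEARS_TO_ADULTHOOD
    * TOKENS_PER_WORD
)

# "Tens to hundreds of trillions of tokens", taken here as 10T and 100T.
MODEL_TOKENS_LOW = 10 * 10**12
MODEL_TOKENS_HIGH = 100 * 10**12

ratio_low = MODEL_TOKENS_LOW / human_tokens
ratio_high = MODEL_TOKENS_HIGH / human_tokens

print(f"human tokens by adulthood: {human_tokens:,}")
print(f"model/human ratio at 10T tokens:  {ratio_low:,.0f}")
print(f"model/human ratio at 100T tokens: {ratio_high:,.0f}")
```

Under these assumptions the human figure lands near the 200 million tokens quoted in the episode, and the ratio runs from tens of thousands at 10 trillion tokens to several hundred thousand at 100 trillion, approaching a million-fold only toward the top of the "hundreds of trillions" range.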
Episode: The Data Black Hole at the Center of AI
Host: Dwarkesh Patel
Date: June 19, 2026
In this episode, Dwarkesh Patel deeply examines what truly powers the rapid advance in AI capabilities—highlighting the central, often underappreciated role played by extraordinary quantities of highly specialized data. Rather than dramatic algorithmic breakthroughs, it is the relentless expansion and precision of the data distribution, especially through human expert input and reinforcement learning (RL), that has accelerated progress. Dwarkesh also tackles common objections about why AI appears far less sample efficient than humans and explores the implications of this disparity for the future of work and research automation.
"It's actually not clear that we've made that much progress in training sample efficiency... we've just dramatically widened and improved the data distribution."
"It's hard to overstate how task specific and bespoke this human expert data is."
"These frontier models are trained on somewhere between tens to hundreds of trillions of tokens. That is close to a million fold difference."
"Our genome is only 3 gigabytes big and only 1 to 2% of it is protein coding... not enough space to store the parameters of this network."
"Blind and deaf people... still have general intelligence. That suggests these billions of sensory tokens are not really the thing making humans smart."
"Even if you increase the number of parameters by infinity, that would only decrease by a factor of 10 the amount of data that you need in order to keep the same loss."
"We can be ludicrously inefficient in training them up and still be wildly in the green."
"The labs plans for this latter category of jobs is first to automate AI research and then have the automated AI researchers solve the sample efficiency problem."
"If the latter were driving most of the progress, then catching up would be far harder than we are observing it to be."
On the Data Gravity Well at AI's Core:
(05:44)
"We see these AIs as a galaxy glittering with capabilities. But at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data."
On Human vs. AI Career Development:
(22:05)
"If you as a human had some weird learning disability where you needed to read through every public repository on GitHub before you could be a competent software engineer, then it would simply not make sense to train you up."
On The Future of Software Engineering Jobs:
(24:42)
"I would be willing to bet that there's overall more demand for human software engineers in 2027 than there is right now, largely due to the complementary input of AI."
On the "God" Misconception in AI Development:
(27:45)
"People... assume that some kind of God pops out the other end. They don't reason carefully about what it looks like to have a period where AI progress is much faster than usual... But I'll save that for next time."
Dwarkesh’s delivery is analytic, evidence-driven, and laced with wry analogies (e.g., “Frankenstein’s monster,” “data black hole,” “fire-hosing gigawatts of training”). He maintains a critical but pragmatic tone when comparing human and artificial intelligence, favoring long-run perspective over hype.
This episode underscores that the transformative power of modern AI is built atop an underappreciated foundation: an enormous, painstakingly crafted and continuously growing “black hole” of data. The path to further progress will likely depend as much on finding new ways to efficiently learn from less data as on technical or algorithmic breakthroughs—a challenge that remains open, and a theme Dwarkesh promises to revisit in future work.
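The scaling-law argument from the episode (infinite parameters buy only about a 10x reduction in data) can be sketched numerically. This assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta and the fitted constants as commonly quoted from the Chinchilla paper; the episode names the paper but does not state the constants, so treat the values, the function names, and the choice of 1 trillion training tokens as illustrative assumptions.

```python
import math

# Assumed constants for L(N, D) = E + A / N**ALPHA + B / D**BETA,
# as commonly quoted from the Chinchilla paper's parametric fit.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss with independent parameter and data terms."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal_params(n_tokens: float) -> float:
    """Parameter count that pairs with n_tokens at the compute optimum.

    Minimizing the loss at fixed N * D gives the condition
    ALPHA * A / N**ALPHA == BETA * B / D**BETA, solved here for N.
    """
    return (ALPHA * A * n_tokens**BETA / (BETA * B)) ** (1 / ALPHA)


def tokens_needed_with_infinite_params(target_loss: float) -> float:
    """Tokens needed to reach target_loss once the A / N**ALPHA term is zero."""
    return (B / (target_loss - E)) ** (1 / BETA)


D = 1e12                              # hypothetical training set: 1T tokens
N = compute_optimal_params(D)         # matching compute-optimal model size
target = loss(N, D)                   # loss of that compute-optimal model
D_inf = tokens_needed_with_infinite_params(target)

numeric_factor = D / D_inf
closed_form_factor = (1 + BETA / ALPHA) ** (1 / BETA)

print(f"compute-optimal params for {D:.0e} tokens: {N:.2e}")
print(f"data reduction from infinite params: {numeric_factor:.2f}x")
```

Because the two terms are additive, the reduction factor works out to (1 + beta/alpha)^(1/beta) regardless of the starting dataset size, which with these constants is a bit under 9x and consistent with the "factor of 10" quoted in the episode.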