Latent Space: The AI Engineer Podcast
Episode: Better Data is All You Need — Ari Morcos, Datology
Date: August 29, 2025
Guests: Ari Morcos (CEO & Co-founder, Datology)
Hosts: Alessio (Partner and CTO at Decibel), swyx (Founder of Smol AI)
Episode Overview
In this episode, Ari Morcos, co-founder and CEO of Datology, joins swyx and Alessio to dig into the underappreciated but crucial world of data curation in AI. Morcos shares insights from his journey, from neuroscience to AI research at Meta, and discusses why better data, not just bigger models or more compute, is the key unlock for the next era of AI. The conversation covers the evolution of data's role in machine learning, technical approaches to data curation, synthetic data, the "bitter lesson" of deep learning, legal and economic aspects of data, and the future of specialized, efficient AI models.
Key Discussion Points & Insights
1. What is Datology? Mission and Approach
- Data Curation as a Service: Datology focuses on optimizing the entire data pipeline for machine learning—from data in storage to the data loader—by automating and improving steps like filtering, sequencing, generating synthetic data, and batching (00:40).
- Guiding principle: "Models are what they eat. If you show them great data, they’re going to be really high quality. If you show them low quality data, they’re going to be low quality." (Ari Morcos, 00:46)
- Goals: Faster model training, higher performance, and enabling smaller models to achieve big-model quality.
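The curation steps described above (filter, deduplicate, sequence, batch) can be sketched as a toy pipeline. Everything here (function names, the word-count quality heuristic, the batch size) is an illustrative assumption, not Datology's actual system.

```python
# Toy curation pipeline: filter -> dedup -> sequence -> batch.
# All heuristics here are invented for illustration only.

def quality_filter(docs, min_words=20):
    """Drop documents failing a crude quality heuristic (word count here)."""
    return [d for d in docs if len(d.split()) >= min_words]

def dedup(docs):
    """Remove exact duplicates while preserving order."""
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def sequence(docs, difficulty):
    """Order documents by a difficulty score (a simple curriculum)."""
    return sorted(docs, key=difficulty)

def batch(docs, batch_size):
    """Group the ordered documents into fixed-size training batches."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

corpus = ["short doc", "a much longer document " * 10, "a much longer document " * 10]
curated = batch(sequence(dedup(quality_filter(corpus)), difficulty=len), batch_size=2)
```

In a real system each stage would be far more sophisticated (learned quality classifiers, semantic rather than exact deduplication), but the composition of stages is the point.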
2. Why Data Has Been Undervalued in AI
- Academic vs. Industry Perception:
- Data work traditionally seen as "grunt work" or low prestige, versus the more glamorous development of algorithms and architectures (10:34).
- "If you talk to the most talented AI researchers and you ask them what’s the secret to your success, they’ll largely tell you that they look at the data." (Ari Morcos, 10:49)
- Historic Incentives:
- The field evolved from a world of limited, labeled, high-quality data (e.g., ImageNet) to massive, unlabeled, and lower-quality web-scale datasets post-2019 (13:09).
- The shift to self-supervised learning and the ability to leverage unlabeled internet data are described as the "real advance" over architectural improvements like transformers (11:22).
3. The Bitter Lesson and Ari’s Path to Data-First Research
- Personal Journey:
- Ari transitioned from neuroscience to AI, bringing an empirical, experiment-driven mindset. He described deep learning as fundamentally an empirical science of large experiments and emergent properties (02:15).
- The Bitter Lesson:
- Multiple research projects convinced him that—at scale—data is more important than architectural tweaks or inductive biases:
- "When you get to enough scale, inductive biases matter not at all. All that really matters is the learned posterior from the data distribution." (Ari Morcos, 07:06)
- The famous phrase: "The bitter lesson was indeed very bitter for me." (07:33)
- Why Data Curation is So Attractive:
- Questions that are scientifically interesting are often practically useful, a rare alignment in ML research (09:05).
4. State of Open Data, Filtering, and Synthetic Data
- Limitations of Human Curation:
- Humans cannot effectively curate AI-scale datasets due to the combinatorial complexity and context needed (18:11).
- DCLM study: even expert human annotators could not reliably match automated quality classifiers, since a document's usefulness depends on redundancy, context, and coverage (17:36).
- "The value of a data point is not just a function of that data point itself. It's rather a function of how that data point relates to every other data point in the training set." (Ari Morcos, 18:43)
- How Much Redundancy Is Right?
- It depends: simple concepts need little redundancy; more varied concepts need far more (the episode contrasts "dogs" with "elephants") (19:34).
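One way to make the "value relative to the rest of the set" idea concrete is greedy diversity-based selection, in the spirit of semantic deduplication: keep an example only if it is not too similar to examples already kept. The 2-D embeddings and the similarity threshold below are toy stand-ins, not a production method.

```python
# Sketch: a data point's value depends on what else is in the set.
# Greedily keep points whose cosine similarity to all kept points is
# below a threshold. Embeddings and threshold are toy assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def diversity_select(embeddings, max_sim=0.9):
    """Keep an example only if its similarity to every kept example is < max_sim."""
    kept = []
    for i, e in enumerate(embeddings):
        if all(cosine(e, embeddings[j]) < max_sim for j in kept):
            kept.append(i)
    return kept

points = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
kept = diversity_select(points)  # the near-duplicate at index 1 is dropped
```

The same data point that is valuable in a sparse region of the distribution would be redundant next to thousands of near-copies, which is exactly why per-example human judgments fall short.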
- Synthetic Data (Rephrasing):
- Rephrased/synthetic data can outperform simply repeating the highest-quality data (41:36–46:03).
- Two types: model-originated (distillation/model collapse risk) and data-originated (rephrasing, less risky).
- "Repeating higher quality tokens is almost always better than seeing net new lower quality tokens." (Ari Morcos, 47:32)
- Curriculum Learning:
- Now practical due to the scale/underfitting regime—can shrink training costs and improve learning efficiency by sequencing data (49:02–51:11).
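A minimal sketch of the curriculum idea discussed above: weight sampling toward easy examples early in training and toward hard ones later. The linear schedule and the difficulty scores are illustrative assumptions, not the approach described on the podcast.

```python
# Toy curriculum: interpolate sampling weights from easy-first to
# hard-first as training progresses. Schedule and scores are invented.
import random

def curriculum_weights(difficulties, progress):
    """progress in [0, 1]; weight easy items early, hard items late."""
    return [(1 - progress) * (1 - d) + progress * d for d in difficulties]

def sample_batch(data, difficulties, progress, k, rng):
    """Draw k examples with curriculum-adjusted probabilities."""
    weights = curriculum_weights(difficulties, progress)
    return rng.choices(data, weights=weights, k=k)

data = ["easy_a", "easy_b", "hard_a", "hard_b"]
diff = [0.1, 0.2, 0.8, 0.9]  # hypothetical difficulty scores in [0, 1]
rng = random.Random(0)
early = sample_batch(data, diff, progress=0.0, k=4, rng=rng)
late = sample_batch(data, diff, progress=1.0, k=4, rng=rng)
```

In the underfitting regime mentioned in the episode, the order in which a model sees data matters because no example is seen enough times for ordering effects to wash out.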
5. Legal, Economic, and Open Data Landscape
- Legal Issues:
- Ongoing lawsuits over dataset provenance (Books3, copyright), and their chilling effect on model development (24:26).
- Ari's take: if books are legally acquired, training on them may be fair use (24:37).
- Open vs. Proprietary Data:
- Open datasets are improving, but headroom remains enormous. Most high-performance work now happens with proprietary or mixed data.
- "I think there's at least another 100x behind this [data curation efficiency] that are still to be done." (Ari Morcos, 31:33)
- Startups and Moats:
- For data-focused startups, science know-how and engineering infra are the only real moats, as open data sources and infrastructure spread quickly (35:14).
6. The Future: Smaller, Faster, More Domain-Specific Models
- Trends:
- "Most of the models that the vast majority of people will be using in say three years will be single digit B or smaller [in parameter count]." (Ari Morcos, 63:37)
- Specialized, efficient models will outpace “mega-models” for most real-world needs.
- Datology’s Impact:
- By curating data, they make it possible for enterprises (e.g., sovereign national projects) to train their own models for less than $1M and achieve SOTA results with fewer resources (55:42).
- "Data is effectively a compute multiplier because all models are underfitting their data sets." (Ari Morcos, 57:39)
Notable Quotes & Memorable Moments
- On the centrality of data: "Data is the most underinvested in area of research relative to its impact. And I don't think it's even close." — Ari Morcos (08:44)
- On the "bitter lesson": "When you get to enough scale, inductive biases matter not at all. All that really matters is the learned posterior from the data distribution." — Ari Morcos (07:06)
- On curriculum learning’s comeback: "Curricula always had to work in the sense that it just made too much sense...I’ve always believed that this has to work." — Ari Morcos (49:11)
- On synthetic data vs. model collapse: "I’m generally quite skeptical that you can get a model that will be better than the teacher that’s generating the synthetic data...But with rephrasing, you can get a model to do much, much better than if you had trained on all of the data, all raw tokens, in the first place." — Ari Morcos (44:24–45:58)
- On future model size: "The vast majority of models people will be using in, say, three years will be single digit B or smaller." — Ari Morcos (63:37)
Timestamps for Key Segments
- 00:40 – Datology’s mission and model: "Models are what they eat."
- 02:15 – Ari’s transition from neuroscience, “deep learning is an empirical science.”
- 07:06 – The "bitter lesson": model architecture tweaks matter less at scale.
- 10:34 – Academic/research incentives in data work and its low prestige.
- 11:22 – The shift from supervised to self-supervised learning.
- 17:36 – DCLM: Why humans can’t effectively curate massive datasets.
- 19:34 – Redundancy: How much is optimal varies by concept; examples.
- 24:26–25:15 – Legality and controversy around Books3 (copyright).
- 29:07 – Benchmarking: Datology’s improvements over open datasets.
- 31:33 – There’s at least 100x left in improving data curation.
- 35:14 – Science know-how and infrastructure as startup moats.
- 41:36–46:03 – Synthetic data, rephrasing, distillation, model collapse.
- 47:32 – Higher quality, repeated tokens vs. new, lower quality tokens.
- 49:02–51:11 – Curriculum learning: why it now works.
- 55:42–57:39 – Why small models and better data matter for sovereign AI and enterprise use.
- 63:37 – The coming dominance of small, efficient models.
- 69:54–70:53 – What data everyone wants (expert data), and customer misconceptions about data.
Fun/Gossip & Lightning Round
- Meta’s Super Intelligence Team Drama:
- swyx asks about the shift at Meta, science “moats,” big bets by Zuck, and the interplay between engineering, data, and research culture (74:24).
- Ari affirms: "When Zuck makes a very big bet, it’s not proven wise to bet against him." (77:09)
- Recruiting at Datology:
- If you love examining datasets for quirks and anomalies, you belong there.
- Datology’s unfair advantage: "Valuing data with respect to a downstream use case." (Additional recruitment pitch at 70:59 and 74:10)
Conclusion
Ari Morcos makes a compelling case that in the race to better AI, smarter and more nuanced data curation eclipses brute-force scaling of models or hardware. Datology aims to "bend the scaling laws," enabling high-quality, efficient, and smaller models accessible to more organizations by radically improving the data those organizations feed their models. If you want to see where "Software 3.0" innovation is headed, pay close attention to the quiet but seismic shifts in data-centric AI.
More Info
- Full show notes and resources at: latent.space
- Blog post on Datology's beyond-web synthetic data work coming soon (45:58)
- For more on foundational model curation, legal issues, and scaling laws, see referenced papers: DCLM, Kimi, DeepSeek, “Beyond Neural Scaling Laws” (28:57).
End of Summary
