Podcast Summary: What Comes After ChatGPT? The Mother of ImageNet Predicts The Future
a16z Podcast | December 5, 2025
Guests: Dr. Fei-Fei Li (Stanford, co-founder World Labs), Justin Johnson (co-founder World Labs)
Hosts: Alessio (Kernel Labs), swyx (Latent Space)
Overview
This episode brings together two leading minds in AI—Fei-Fei Li, the creator of ImageNet, and Justin Johnson, her former student and co-founder at World Labs—to explore the next frontier in artificial intelligence beyond language models: spatial intelligence and world modeling. The conversation centers on World Labs' debut product, Marble, the first model for generating explorable 3D worlds from text or images, and delves into the technical, theoretical, and philosophical challenges and opportunities in building true “spatial intelligence” in machines.
Highlights & Key Discussion Points
1. The Origins of World Labs and Marble
- Background: Fei-Fei Li and Justin Johnson share their academic origins and discuss the evolution from ImageNet and AlexNet to 3D world modeling.
- Main Idea: World Labs focuses on building spatial intelligence because both founders saw that the next leap beyond large language models (LLMs) would be true world models that understand and generate 3D spaces.
Quote:
"We started talking and decided that we should just put all the eggs in one basket and focus on solving this problem. And started World Labs together."
— Fei-Fei Li (02:52)
2. The Importance of Scaling Compute and Data
- Progress in AI: The transformative power of scaling up compute, from GPUs to multi-cluster systems, is highlighted as the backbone of deep learning advances.
- Challenge in Academia: State-of-the-art AI now requires immense compute and data, shifting the role of academia toward foundational and “wacky” research rather than competitive training.
Quote:
"The whole history of deep learning is in some sense the history of scaling up compute."
— Justin Johnson (00:00, 04:42)
3. Open Science & The Ecosystem’s Evolution
- Open vs. Proprietary: The tension between open datasets (like ImageNet) and closed, product-focused industry research is discussed. Fei-Fei emphasizes the importance of resource balance and maintaining robust public academic infrastructure.
Quote:
"Academia by itself is severely under resourced so that the researchers and the students do not have enough resources to try these ideas."
— Fei-Fei Li (11:24)
4. Wacky & Forward-Looking Ideas
- Direction for Academia: Johnson shares his interest in how hardware developments could yield entirely new architectures not tied to current GPU/transformer paradigms. Future neural network primitives may radically diverge from today’s matrix multiplications.
Quote:
"Are there other primitives that make more sense for large scale distributed systems that we could build our neural networks on?"
— Justin Johnson (12:13)
5. Historical Progression: From ImageNet to Neural Captioning
- Pioneering Advances: Li and Johnson recount their pivotal work in image captioning, linking convolutional neural nets for images with LSTMs for text as one of the first successful “storytelling” AI models.
Memorable Moment:
- Andrej Karpathy’s image-captioning work at Stanford arriving alongside Google’s parallel effort, with the New York Times covering both “independent inventions” (16:09).
- Justin's real-time demo for dense captioning and its impact at conferences (20:11).
6. The Distinction—And Interplay—Between Language and Spatial Intelligence
- Fundamental Differences:
- Language is "1D" and amenable to tokenization and sequence models.
- The real world is "3D/4D," with fundamentally different representational needs and embedded physics.
- Lossiness of Language: Translating spatial information to language inevitably loses information (font, layout, physical relationships).
Quote:
"The deeply 3D/4D spatial world has a level of structure that is fundamentally different from a purely generative signal that is one-dimensional."
— Fei-Fei Li (21:45)
7. The “Understanding” Problem: Physical Reasoning in AI
- Caveats of Learning Physical Laws:
- Modern world models, like Marble, generate physically plausible scenes but may not truly "understand" physics unless explicitly trained or endowed with emergent capabilities.
- This matters more for high-stakes applications (architecture/robotics) than for VFX/gaming.
Quote:
"If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it. The model is learning from the data, learning from the pattern."
— Fei-Fei Li (26:00)
8. Technical Deep Dive: Marble’s Architecture & 3D Representation
- What Marble Does: Marble takes as input text, single or multiple images, and generates an explorable, editable 3D world using Gaussian splats as atomic elements.
- Interactivity Is Key: Enables precise camera control and scene editing, unlike frame-by-frame video generation.
- Potential for Robotics: Marble could generate data-rich simulation environments for embodied agent training, a “middle ground” between scarce real-world robot data and uncontrolled Internet video.
- Marble's Business Focus: Initial focus on creative industries (gaming, VFX, interior design), but designed as a broad horizontal technology applicable to robotics, design, and more.
Quote:
"It's the first in-class model in the world that generates 3D worlds in this level of fidelity that is in the hands of the public."
— Fei-Fei Li (30:58)
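The episode describes Marble's scenes as built from Gaussian splats as atomic elements but does not detail the internals. As a rough illustration of what that primitive looks like, here is a minimal sketch of a single 3D Gaussian splat; the field names and structure are assumptions based on how splat-based representations are commonly described, not World Labs' actual implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One anisotropic 3D Gaussian: a common atomic element of splat-based scenes."""
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) per-axis standard deviations of the ellipsoid
    rotation: np.ndarray  # (3, 3) rotation matrix orienting the ellipsoid
    color: np.ndarray     # (3,) RGB in [0, 1]
    opacity: float        # alpha in [0, 1]

    def covariance(self) -> np.ndarray:
        # Sigma = R S S^T R^T is symmetric positive semi-definite by
        # construction, so the splat is always a valid Gaussian.
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

# A scene is simply a large collection of such splats; rendering projects
# each Gaussian onto the image plane and alpha-composites front to back.
splat = GaussianSplat(
    mean=np.zeros(3),
    scale=np.array([1.0, 0.5, 0.25]),
    rotation=np.eye(3),
    color=np.array([0.8, 0.2, 0.2]),
    opacity=0.9,
)
cov = splat.covariance()
```

Because every splat carries an explicit position and covariance, a scene of splats supports the precise camera placement and local edits discussed above in a way that frame-by-frame video generation does not.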
9. The Nature of Spatial Intelligence vs “Traditional” Intelligence
- Definition: Spatial intelligence is the capacity to reason, understand, and act in space—complementary to, not competitive with, linguistic intelligence.
- Fundamental to Human Life: Most daily activities require spatial reasoning, which is hard to reduce to language.
- Innateness: Perceptual and spatial faculties are largely innate and predate language, both in evolution and individual development.
Quote:
"Spatial intelligence... is the capability that allows you to reason, understand, move and interact in space."
— Fei-Fei Li (43:14)
10. Interplay of Language and Spatial Intelligence in AI
- Multimodality will matter: Future AI will be both spatial and linguistic, rather than a winner-take-all contest between modalities.
- Even Marble is deeply multimodal—taking language prompts to generate worlds.
Quote:
"Maybe one day we'll have a universal model."
— Fei-Fei Li (50:51)
11. Inductive Bias, Physics, and Human-Like Understanding
- Discussion of whether large world models, trained purely on patterns and data, can abstract high-level laws (like Newtonian physics).
- The need for new learning paradigms to achieve generalization akin to human science and theory-building.
Quote:
"You can use traditional physics engines to generate data that we then train our models on. And then you're sort of distilling the physics engine into the weights of the neural network that you're training."
— Justin Johnson (29:25)
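Johnson's "distilling the physics engine into the weights" idea can be made concrete with a toy sketch. The engine here is a one-step Euler integrator for projectile motion (an assumption for illustration, not anything from the episode), and a linear least-squares model stands in for the neural network; since these particular dynamics are affine, the fit recovers them exactly, whereas a real network would only approximate richer dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, g = 0.1, 9.8

def physics_step(state):
    """The 'physics engine': one Euler step of drag-free projectile motion."""
    x, y, vx, vy = state
    return np.array([x + vx * dt, y + vy * dt, vx, vy - g * dt])

# 1. Sample random states and label them with the engine -> training data.
states = rng.uniform(-10, 10, size=(1000, 4))
targets = np.array([physics_step(s) for s in states])

# 2. 'Distill' the engine into model weights: least squares on [state, 1].
X = np.hstack([states, np.ones((len(states), 1))])
W, *_ = np.linalg.lstsq(X, targets, rcond=None)

# 3. The learned weights now predict physics without calling the engine.
test_state = np.array([0.0, 5.0, 3.0, 2.0])
pred = np.append(test_state, 1.0) @ W
```

The open question raised in this segment is whether such pattern-fitting can ever abstract the law itself (here, constant gravitational acceleration) rather than merely reproducing its consequences on the training distribution.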
12. Transformer Architectures: Sets Not Sequences
- Johnson clarifies a common misconception: Transformers are fundamentally “set” models—permutation equivariant except for injected positional information.
- As world modeling evolves beyond language, architectures may move past sequence-to-sequence—but attention is still “here to stay.”
Quote:
"A transformer is actually not a model of a sequence of tokens. A transformer is actually a model of a set of tokens."
— Justin Johnson (57:48)
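Johnson's "set, not sequence" point can be verified directly: self-attention without positional encodings is permutation equivariant, meaning shuffling the input tokens merely shuffles the output rows. The sketch below is a minimal single-head attention in NumPy (an illustrative construction, not from the episode) demonstrating this property.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(5, d))  # 5 tokens, order about to be shuffled

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the input permutes the output identically: without injected
# positional information, attention treats its input as an unordered set.
```

This is why sequence order in language transformers comes entirely from the positional information added to the tokens, and why the same attention machinery can, in principle, operate over unordered spatial elements.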
Notable Quotes & Timestamps
- "The whole history of deep learning is in some sense the history of scaling up compute." — Justin Johnson [00:00, 04:42]
- "Spatial intelligence is the next frontier." — Fei-Fei Li [30:58]
- "If you use the word understand the way you understand, I'm pretty sure the model doesn't understand it." — Fei-Fei Li [26:00]
- "A transformer is actually not a model of a sequence of tokens. A transformer is actually a model of a set of tokens." — Justin Johnson [57:48]
- "It's deeply multimodal. And I think in many use cases these models will work together. Maybe one day we'll have a universal model." — Fei-Fei Li [50:51]
- "We are hungry for talent... ranging from very deep researchers... to good business, product thinkers and go-to-market and business talents." — Fei-Fei Li [58:59]
Segment Timestamps
- [00:00–04:09] – Deep learning’s progression and the founding of World Labs
- [04:10–12:08] – The challenge of keeping AI research open; industry vs academic roles
- [12:09–14:14] – Wacky ideas and hardware-driven inspiration for new AI architectures
- [14:15–20:50] – The evolution from image captioning to dense scene understanding
- [21:35–29:57] – Language vs spatial intelligence; the challenge of embedding physics and causality
- [30:58–36:45] – Marble’s technology, interactivity, and potential for robotics/embodied AI
- [41:29–43:14] – Use cases, product focus, and the horizontal breadth of Marble
- [43:15–51:06] – Defining spatial intelligence; human evolution and cognition; multimodal AI futures
- [56:40–58:59] – Transformer models as set models; the architectural future
- [58:59–end] – Talent call to action, UI feedback, and concluding thoughts
Memorable Moments
- Real-time dense captioning demo: Johnson’s early demo ran across continents and wowed conferences (20:11–21:20).
- Marble’s precision and interactivity: Unlike prior models, Marble offers “precise control in terms of placing a camera” (34:32).
- Spirited defense of vision science: Fei-Fei Li explains why spatial intelligence is underappreciated—because it’s effortless for humans, yet foundational for AI (48:05).
- Clarifying transformer architectures: Johnson pushes the audience to grasp the “set” nature of transformers, not just their familiar sequence modeling (57:16–57:48).
Takeaways for Listeners
- Spatial intelligence and world modeling will be critical to the next wave of AI, especially for embodied agents and advanced simulations.
- Foundational progress in AI still relies on open science and academic exploration, even as compute and data scale shift to industry.
- Multimodal systems—combining language with spatial and other senses—are where the future lies, rather than rigidly one or the other.
- Marble’s debut marks a meaningful step toward practical, public, explorable 3D world generation, with broad applications from creative industries to robotics to design.
- The field is wide open: New learning paradigms, representations, and architectures are up for grabs—both for academic researchers and ambitious engineers.
Call to Action
World Labs is hiring:
Researchers, engineers (especially in systems and UI), and business, product, and go-to-market professionals are encouraged to apply as spatial intelligence emerges as the next AI frontier ([58:59]).
Try Marble, explore its advanced editing, and imagine new use cases—your feedback could shape the future of interactive spatial AI.
