Latent Space: The AI Engineer Podcast
Episode: After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs
Date: November 25, 2025
Overview
This episode dives deep into the next frontier of AI beyond large language models (LLMs): spatial intelligence and world models. Hosts Alessio Fanelli and swyx (Latent Space) welcome Fei-Fei Li and Justin Johnson, founders of World Labs and creators of the new model "Marble," to discuss the evolution from language models to 3D spatial world generation, the interplay between academia and industry, the architectural and philosophical challenges ahead, and the future of multimodal AI.
The conversation ranges from the history of deep learning and ImageNet to the practicalities and vision behind Marble, the need for new hardware paradigms, the meaning of "spatial intelligence," and the open questions at the interface of physics, language, perception, and AI productization.
Episode Structure & Key Discussion Points
1. The Genesis of World Labs and the Vision Behind Marble
(Timestamps: 00:00–04:00, 30:20–34:29)
- Background: Fei-Fei Li (Stanford professor, ImageNet visionary) and Justin Johnson (her former PhD student, computer vision researcher) recount how their shared interest in generative 3D models and spatial intelligence led them to co-found World Labs.
- "We started talking and decided that we should just put all the eggs in one basket and focus on solving this problem and started World Labs pretty much." – Fei-Fei Li [02:14]
- Marble: Introduction of Marble as a first-in-class generative world model for 3D environments.
- "It's a generative model of 3D worlds... While Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today." – Justin Johnson [00:26/31:42]
2. From ImageNet to World Models: Why Now?
(Timestamps: 04:00–10:46)
- Tech Timeline: Leaps in compute and available data have enabled a new era of AI models beyond language.
- "The whole history of deep learning is in some sense the history of scaling up compute." – Justin Johnson [00:00/04:04]
- GPUs to clusters: models can now be trained on tens of thousands of GPUs; available compute is roughly a million times greater than in the AlexNet era.
- Academic vs. Industry Tension: Open science remains crucial, but resources are heavily skewed toward industry over academia.
- "I do have concerns about... the imbalanced resourcing of academia... academia by itself is severely under resourced so that, you know, the researchers and the students do not have enough resources to try these ideas." – Fei-Fei Li [07:22/10:46]
3. The Role of Academia & "Wacky Ideas"
(Timestamps: 09:17–13:36)
- Research Freedom: While industry scales up, academia can focus on exploration and foundational science.
- "It shouldn't be about trying to train the biggest model... It should be about trying wacky ideas and new ideas and crazy ideas, most of which won't work." – Justin Johnson [09:17]
- Hardware Foresight: Johnson is fascinated by whether future hardware innovations will unlock new neural network architectures.
- "Are there other primitives that make more sense for large scale distributed systems that we could build our neural networks on?" – Justin Johnson [11:35]
4. The Origins of Vision-Language Models
(Timestamps: 13:36–21:34)
- Image Captioning History: Fei-Fei, Andrej Karpathy, and Johnson pioneered combining CNNs and LSTMs to tell the story of an image, a task Fei-Fei once thought would take “100 years”.
- "We thought we were the first people doing it. It turned out that Google at that time was also simultaneously doing it." – Fei-Fei Li [15:30]
- From basic captioning (one sentence per image) to "dense captioning" – describing distinct objects and regions in images, moving towards real-time demos.
- Data Representation: Early links between visual and linguistic intelligence came from pairing image recognition with sequential language models.
5. The Distinction Between Spatial & Linguistic Intelligence
(Timestamps: 21:02–29:42, 42:11–49:45)
- Are Language & Vision So Different?
- "I think they are different... the deeply 3D/4D spatial world has a level of structure that is fundamentally different from a purely generative signal that is one-dimensional." – Fei-Fei Li [21:02]
- Language is lossy; spatial information is high-bandwidth and uniquely structured.
- "Pixels are this sort of more lossless representation... more matches what we humans see." – Justin Johnson [22:11]
- Limits of LLMs: LLMs can learn to mimic orbits or other physics-like data, but they don't generalize or learn causal physical laws.
- "There's no indication that those latent modeling will get you to a causal law of space and dynamics; that's where today's deep learning and human intelligence actually start to bifurcate." – Fei-Fei Li [24:15]
6. Marble’s Architecture: Generating 3D Worlds
(Timestamps: 30:20–37:40)
- Product Capabilities: Multimodal input (text, images), interactive editing (changing objects and colors), and real-time output via Gaussian-splat rendering.
- "With Marble, we were actually trying to do two things simultaneously... build a model that goes towards the grand vision of spatial intelligence... and build a product that would be useful to people in the real world today." – Justin Johnson [31:42]
- Atomic Units of 3D Worlds: Currently Gaussian splats; these could evolve into new primitives.
- "The model natively outputs splats... Gaussian splats are really cool because you can render them in real time really efficiently." – Justin Johnson [34:29]
- Integrating Physics: Potential pathways to infuse simulated physics into the atomic level of world generation.
- "You can attach physical properties to those splats... treat each one as being coupled with some kind of virtual spring to nearby neighbors and now you can start to do sort of physics simulation on top of splats." – Justin Johnson [36:23]
7. Applications, Use Cases, and Market Strategy
(Timestamps: 38:43–42:12)
- Creative/Designer Focus: Marble is already being used for interior design, VFX, and gaming.
- "I posted this video on Slack of like, oh, who wants to use marble to plan your next kitchen remodel? ... Just take two images of your kitchen, reconstruct it in Marble, and then use the editing features to see what it would look like." – Justin Johnson [41:24]
- Early beta users applying Marble to real design problems.
- Embodied AI and Robotics: Synthetic worlds generated by Marble can help satisfy the data hunger of robot training.
- "Robotic training really lack data... simulation and synthetic data is actually a very important middle ground. Marble actually is a really potential for helping to generate these synthetic simulated worlds for embodied agent training." – Fei-Fei Li [38:51]
8. Defining Spatial Intelligence & Its Importance
(Timestamps: 42:11–49:02)
- Multiple Intelligences: Drawing on psychologist Howard Gardner, spatial intelligence is as critical as linguistic or logical intelligence.
- "Spatial intelligence... is the capability that allows you to reason, understand, move and interact in space... it's very hard to reduce that process into pure language." – Fei-Fei Li [42:47]
- Day-to-day tasks (grasping a mug, navigating a room) and scientific breakthroughs (e.g., inferring DNA's structure) showcase spatial reasoning that resists reduction to pure language.
- Human Cognition & Evolution: Perception and spatial reasoning evolved over hundreds of millions of years; language is orders of magnitude newer.
- "In nature, it took 540 million years to optimize perception and spatial intelligence and language... is probably half a million years." – Fei-Fei Li [49:02]
9. Philosophy of World Models, Theory of Mind, and the Limits of Current AI
(Timestamps: 49:08–55:41)
- Limitations of Language-First AI: LLMs lack a “theory of mind” and embodied, experimental learning that humans use to generalize about the world.
- "The way I put it is almost like this is almost more efficient learning because you have a hypothesis of different possible worlds... then you do experiments to eliminate the worlds that are not possible..." – Zwicks [54:57]
- Multimodality & Emergent Understanding: An open experimental question is whether models, given only data, could rediscover Newtonian laws or distinguish geocentric from heliocentric models.
- "My guess is it probably won't... But F = MA... that's just a whole different abstraction level that's beyond today's LLM." – Fei-Fei Li [53:23/52:36]
10. Architectures Going Forward: Sets, Attention, and Beyond
(Timestamps: 55:41–58:00)
- Architectural Evolution: While transformers and attention remain powerful, truly spatial/temporal world models may require new paradigms.
- "Transformers are actually not a model of a sequence of tokens. Transformer is actually a model of a set of tokens." – Justin Johnson [57:10]
- Fei-Fei suggests sequence-to-sequence is not the end; new architectures are likely needed for world models.
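Johnson's "set of tokens" point can be demonstrated in a few lines: self-attention without positional encodings is permutation-equivariant, so shuffling the input tokens simply shuffles the output rows. This is a minimal, generic sketch (plain single-head attention with random weights), not code from the episode or from any particular transformer library.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                     # 5 tokens, dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
# Permuting the input tokens only permutes the output rows:
# the model sees a set, not a sequence.
assert np.allclose(out[perm], out_perm)
```

Sequence order in practice comes entirely from positional encodings added on top; strip those away and the architecture is order-agnostic, which is what makes it a plausible substrate for spatial data that has no canonical 1D ordering.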
11. Future Directions, Call to Action, and Closing
(Timestamps: 58:05–End)
- Recruiting Talent: World Labs is hiring across research, engineering, and business functions.
- "We are actually hungry for talent ranging from very deep researchers... to engineers... and good business product thinkers..." – Fei-Fei Li [58:21]
- Playing with Marble: Listeners are encouraged to try Marble's advanced features and think about new applications.
Notable Quotes & Memorable Moments
"The whole history of deep learning is in some sense the history of scaling up compute."
– Justin Johnson [00:00/04:04]
"Spatial intelligence... is the capability that allows you to reason, understand, move and interact in space."
– Fei-Fei Li [42:47]
"You lose something if you translate to this, like purely tokenized representations that we use in LLMs. Right? Like you lose the font, you lose the line breaks, you lose sort of the 2D arrangement on the page."
– Justin Johnson [22:11]
"Academia by itself is severely under resourced so that... researchers and students do not have enough resources to try these ideas."
– Fei-Fei Li [10:46]
"If a physics engine was perfect, we would have sort of no need to build models because the problem would have already been solved."
– Justin Johnson [28:46]
"Maybe we've lost something by going straight to that fully abstracted form of language... spatial intelligence is almost like opening up that black box again."
– Justin Johnson [46:28]
"In nature, it took 540 million years to optimize perception and spatial intelligence and language... is probably half a million years."
– Fei-Fei Li [49:02]
"Transformers are actually not a model of a sequence of tokens. Transformer is actually a model of a set of tokens."
– Justin Johnson [57:10]
Notable Segments (Timestamps & Topics)
- [00:26/31:42] – What is Marble? Vision for spatial intelligence and practical applications today
- [13:55/15:30] – Birth of vision-language models, early breakthroughs
- [24:15] – Where deep learning hits the limits of causal “understanding”
- [34:29] – The data representation and atomic units of generated worlds (Gaussian splats)
- [36:23] – Possibilities for incorporating physics into generative 3D models
- [38:51] – Robotics & simulation as key future use cases
- [42:47] – Defining spatial intelligence (and its complementarity to language)
- [46:28] – Limitations of LLMs vs. “opening up the black box” of spatial perception
- [53:23/52:36] – Can AI discover physics? The difference between prediction and abstraction
- [57:10] – Sets, not sequences: a new lens on neural architectures
- [58:21] – Call to action for recruitment and collaboration
Summary
Throughout the episode, the conversation spans technical, philosophical, and product considerations in moving from language-centric AI to agents and models that comprehend, generate, and interact with the spatial world. Fei-Fei Li and Justin Johnson position World Labs as pioneering this shift, with Marble as a first practical foothold. They advocate for open science and better-resourced academia, argue for new research in both algorithms and hardware, and invite thoughtful builders to join the quest for AI that “sees” and “acts” beyond words.
For listeners, the episode illuminates both the challenge and promise of giving AI a sense of “space” — not just language — in how it models and changes the world.
