Podcast Summary: The Frontier of Spatial Intelligence with Fei-Fei Li (a16z Podcast, Nov 13, 2025)
Overview
This episode delves into the next frontiers of artificial intelligence: spatial intelligence and the ability for machines to understand, generate, and interact with the three-dimensional (and four-dimensional: 3D + time) world as fundamentally as they understand language. Fei-Fei Li and Justin Johnson, co-founders of World Labs, join a16z general partner Martin Casado to discuss why unlocking spatial intelligence is critical for reaching the next milestone of AI—and perhaps Artificial General Intelligence (AGI). They reflect on the history of computer vision, the limitations of current models, the technical breakthroughs around 3D representations, and the ambitious vision guiding the formation of World Labs.
Key Discussion Points & Insights
1. The Evolution of AI and Computer Vision
- AI from Its "Winter" to Cambrian Explosion
- Fei-Fei Li reflects on two decades of progress, from the "AI winter" to today's expansion of possible AI applications across text, pixels, videos, and audio (01:59).
- Li: "We're in the middle of a Cambrian explosion... In addition to texts, you're seeing pixels, videos, audios, all coming with possible AI applications and models." (01:59)
- Pivotal Breakthroughs: Data and Compute
- ImageNet: Pioneered by Li and her students, this dataset scaled image data collection to Internet scale and drove the modern era of data-driven modeling (06:00–07:30).
- AlexNet and Compute: The leap in computational power was as crucial as algorithmic advances. An AlexNet-scale training run that took days to weeks on 2012 consumer GPUs would now take mere minutes, a speedup of several orders of magnitude (08:03–09:09).
- Johnson: "That two week training run... comes out to just under five minutes on a single GB200 [GPU]..." (08:58)
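Johnson's comparison is easy to sanity-check. A back-of-envelope sketch, taking only the two durations from his quote (a two-week run vs. "just under five minutes"); the rest is plain arithmetic:

```python
# Durations from Johnson's quote (08:58); "five minutes" is rounded up.
two_weeks_min = 14 * 24 * 60   # 20,160 minutes
gb200_min = 5

speedup = two_weeks_min / gb200_min
print(f"Speedup: ~{speedup:,.0f}x")   # ~4,032x, i.e. over three orders of magnitude
```

A ~4,000x wall-clock speedup from hardware alone is what the speakers mean by growth "by orders of magnitude."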
2. Supervised vs. Generative AI: Shifting Paradigms
- Supervised Learning Era vs. Generative Models
- Early computer vision relied on painstaking human labeling of datasets (ImageNet), while generative models can now learn from less-structured data (10:23–11:26).
- Johnson: "The big algorithmic unlocks... [were] to train on things that don't require human labeled data." (10:56)
- The Rise of Generative AI
- Both Li and Johnson reflect on the progression from data-matching to style transfer and finally to GAN-powered generation—showing a clear continuum rather than abrupt change (12:21–15:00).
- Johnson shares his excitement at seeing artistic style transfer for the first time ("GenAI brainworm") and turning academic ideas into mainstream impact (13:48–14:52).
3. Why Spatial Intelligence? The Next North Star
- What Is Spatial Intelligence?
- Johnson: "It's about machines' ability to perceive, reason, and act in 3D space and time... understanding how objects and events are positioned and interact in the 3D world." (18:48)
- Li: "Visual spatial intelligence is so fundamental. It's as fundamental as language, possibly more ancient..." (16:33, 29:04)
- Scope: Physical, Virtual, and Blended Worlds
- Spatial intelligence is not limited to physical reality; it also encompasses generative, simulated, and augmented environments (19:19).
4. Core Technical Distinctions: Why Not Just Pixels, Language, or 2D?
- Limitations of Language Models for 3D Awareness
- Multimodal language models process images and text, but internally use one-dimensional (sequence/tokens) representations, not native 3D (25:04–26:15).
- Johnson: "Their representation of the world is one dimensional... [but] we're saying the three-dimensional nature of the world should be front and center." (25:04)
- Why Not Just 2D Video?
- Our perception is innately 2D (retina), but we reason and act based on understanding 3D structure (27:53–29:04).
- Li: "The arc of intelligence... eventually enables animals and humans... to move around the world, interact with it, create civilization... That native 3D-ness is fundamentally important..." (29:04)
- Breakthroughs in 3D: Neural Radiance Fields (NeRF)
- NeRF, first-authored by Ben Mildenhall, made it possible to reconstruct 3D scenes from collections of 2D images efficiently, opening a floodgate of 3D computer vision research and applications (21:00).
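The core NeRF idea is that a neural network maps a 3D point and viewing direction to a color and a volume density, and an image is formed by alpha-compositing those predictions along each camera ray. A minimal sketch of just that compositing step (the function name and the toy inputs are illustrative, not from the episode):

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one camera ray, as in NeRF's volume rendering.

    densities: (N,) non-negative volume densities sigma_i at each sample
    colors:    (N, 3) RGB color c_i predicted at each sample
    deltas:    (N,) distances between adjacent samples along the ray
    Returns the accumulated RGB color seen along the ray.
    """
    # Opacity contributed by each ray segment
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas  # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)

# A dense red sample in front of a blue one: the ray comes out mostly red.
color = composite_ray(
    densities=np.array([50.0, 50.0]),
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
    deltas=np.array([0.1, 0.1]),
)
print(color)
```

Because this rendering is differentiable, the network behind it can be trained directly from 2D photographs, which is what made "3D from 2D images" tractable.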
5. Use Cases & Applications
- Virtual Worlds and Interactive Generation
- Next-gen models would allow "up-leveling" AI-generated content—from text-to-image/video to text-to-3D-worlds (30:23).
- Johnson: "Imagine leveling this up and getting 3D worlds out... not just for games, but for virtual photography, education... a new kind of media enabled by spatial intelligence." (30:23)
- Augmented & Mixed Reality
- Li: "Spatial computing needs spatial intelligence... the interface between the true real world and what you can do on top of it." (34:39)
- Johnson envisions AR devices replacing physical screens by blending digital information seamlessly with the physical environment (35:32–35:59).
- Robotics & Real-World Agents
- Spatial intelligence is foundational for robotic perception, navigation, and manipulation.
- Li: "[With] robots, their interface by definition is the 3D world... But their compute is digital... [spatial intelligence] connects learning and behaving..." (36:22)
6. Building World Labs: Vision, Team, and Approach
- Why Now?
- Deliberate convergence of compute, algorithmic breakthroughs (like NeRF), and deeper data understanding (16:33–18:37).
- Four leading researchers united as co-founders: Fei-Fei Li, Justin Johnson, Ben Mildenhall, Christoph Lassner (19:36–20:28).
- Team & Multidisciplinarity
- World Labs was built around deep expertise in systems, data, machine learning, graphics, and 3D modeling (39:32–40:40).
- Li: "The kind of talent we put together here... I've never seen this concentration... all believers in spatial intelligence." (40:40)
- Deep Tech Platform
- The aim is to solve fundamental problems, creating a platform for a wide array of applications—games, AR/VR, robotics, and beyond (37:35–38:24).
- North Stars & Impact
- Success means the widespread, practical use of spatial intelligence models—unlocking new possibilities for businesses and society (41:40–42:33).
- Johnson: "The universe is a giant evolving four dimensional structure and spatial intelligence writ large is just understanding that in all of its depths and figuring out all the applications to that." (42:35)
Notable Quotes & Memorable Moments
- On Cambrian explosion in AI:
- Li: "Now, in addition to texts, you're seeing pixels, videos, audios, all coming with possible AI applications and models. So it's a very exciting moment." (01:59)
- On the leap in compute:
- Johnson: "That two week training run... comes out to just under five minutes on a single GB200." (08:58)
- On supervised to generative learning:
- "The big algorithmic unlocks... [were] to train on things that don't require human labeled data." – Johnson (10:56)
- The philosophical difference:
- Li: "Language is fundamentally a purely generated signal. There's no language out there... 3D world is not. There is a 3D world out there that follows laws of physics..." (26:15)
- On 3D vs. 1D representation:
- Johnson: "Fundamentally their representation of the world is one dimensional... we're saying the three-dimensional nature should be front and center." (25:04)
- Defining Spatial Intelligence:
- Johnson: "Machines' ability to perceive, reason, and act in 3D space and time... understanding how objects and events are positioned and interact..." (18:48)
- Li: "Visual spatial intelligence is so fundamental. It's as fundamental as language, possibly more ancient and more fundamental in certain ways." (16:33, 29:04)
- Use cases beyond gaming:
- Johnson: "If you could have sort of a personalized 3D experience that’s as good and as rich, as detailed as one of these AAA video games… but catered to this very niche thing — that’s a vision of a new kind of media." (31:53)
- On crossing the real/virtual divide:
- Li: "With this technology, the boundary between real world and virtual, imagined world or augmented world or predicted world is all blurry." (33:51)
- Johnson: "Spatial intelligence is about building and understanding worlds." (32:58)
Timestamps for Key Segments
- 00:00 – 02:49: Opening thoughts; era of multimodal AI explosion
- 04:46 – 07:30: Fei-Fei Li’s and Justin Johnson’s background, the ImageNet breakthrough
- 08:03 – 09:09: Compute as the big unlock; AlexNet vs. modern GPUs
- 10:23 – 11:26: The transition from supervised to generative learning
- 13:48 – 14:52: Johnson's style transfer work and real-world impact
- 16:33 – 18:37: North Stars; why spatial intelligence is fundamental
- 18:48 – 19:36: Defining spatial intelligence: Perceiving and reasoning in 3D/4D
- 21:00: NeRF's breakthrough for 3D from 2D data
- 24:49 – 26:15: 1D language models vs. native 3D representations
- 29:04 – 30:03: Why flat pixels aren't enough; the evolutionary arc of intelligence
- 30:23 – 33:51: Use cases: virtual world generation; blending realities
- 34:39 – 36:22: Spatial intelligence as the enabler for AR, robotics & physical agents
- 39:32 – 40:40: Building the multidisciplinary team at World Labs
- 41:40 – 42:33: Measuring impact by model deployment and real-world use
Flow & Tone
The conversation is enthusiastic, visionary, and punctuated with technical depth. Fei-Fei Li and Justin Johnson express both the awe and responsibility of aiming for a new North Star in AI. Their dialogue is candid, often self-reflective, and peppered with references to the academic journey, breakthroughs, and the joy of discovery ("like unwrapping presents on Christmas every day" — Johnson, 00:19).
Conclusion
This episode lays out a compelling case that the next evolution of AI lies in spatial intelligence: enabling machines to truly perceive, reason, and generate within the inherently 3D world, bridging physical and virtual realities. The work of World Labs seeks to unlock not just new technical capabilities but an entirely new kind of digital media and human-machine interaction, with wide-ranging impacts across gaming, AR/VR, robotics, and beyond.
Recommended next steps:
If intrigued by the vision for spatial intelligence, the speakers suggest following World Labs, exploring recent academic breakthroughs like NeRF, and rethinking the boundaries between physical, virtual, and augmented experiences.
