Latent Space: The AI Engineer Podcast
[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton
Date: January 2, 2026
Episode Overview
This episode dives deep into the NeurIPS 2025 Best Paper awarded to Kevin Wang and his collaborators from Princeton, focusing on their groundbreaking work: “1000 Layer Networks for Self-Supervised RL.” The conversation explores the surprising scalability of very deep neural networks in reinforcement learning (RL) through self-supervised objectives, the architectural and methodological innovations that enabled this, and the implications for the future of AI, robotics, and efficient scaling. The hosts are joined by Kevin, his co-authors, and their advisor, who share behind-the-scenes stories and rich technical insights in a lively, accessible manner.
Key Discussion Points & Insights
1. Background: From Skepticism to Best Paper
- Team Introduction (01:11)
- Kevin Wang led the project as a Princeton undergrad, collaborating closely with Ishan and Nicole, with guidance from their advisor.
- Project Origins (01:30)
- Born from an undergraduate research seminar, with initial skepticism about scaling RL networks beyond the traditional 2-4 layers.
- “Historically deep meant like two or three or four layers, not 1,000. When Kevin and Ishan mentioned they wanted to try really deep networks, I was kind of skeptical it was going to work.” — Advisor (02:14)
2. Why Deep RL Hasn't Scaled—And How They Changed That
- RL vs. Language/Vision Scaling (03:35)
- In contrast to language and vision, RL had not benefited from massive network scaling.
- “I was very surprised... why were you just using like a simple like 2 layer MLP for like these frontier sort of, you know, state of the art RL algorithms?” — Kevin (03:35)
- Self-Supervised RL as the Key (04:28)
- Shifted from value-based RL to self-supervised RL, learning rich representations without hand-crafted rewards.
- Introduced contrastive objectives: push together representations along the same trajectory, push apart from others.
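The contrastive idea described above can be sketched as an InfoNCE-style loss. This is a minimal, hypothetical form for illustration; the episode does not spell out the paper's exact objective, and the function name and scalar-similarity interface are assumptions:

```python
import math

def contrastive_loss(sim_pos, sim_negs):
    """InfoNCE-style loss for one anchor state.

    sim_pos  -- similarity score between the anchor and a future state
                sampled from the SAME trajectory (the positive)
    sim_negs -- similarity scores against states drawn from OTHER
                trajectories (the negatives)

    Minimizing this pushes same-trajectory representations together
    and pushes different-trajectory representations apart.
    """
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)
```

Note that this is just softmax cross-entropy where the "correct class" is the same-trajectory pair, which is what lets the objective scale like a classification problem rather than a value-regression problem.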
3. Unpacking the Scaling Breakthrough
- Combining Depth, Residuals, Layer Norm (05:45)
- Simple deepening led to worse performance until layering in residual connections and normalization.
- “If we just made the depth bigger, it makes it worse. If we just had residual connections, didn't make it better. And it was really this combination ... that really made this work.” — Advisor (05:45)
- Experiments with Batch Size and Width (06:07)
- Explored scaling along width and batch size; found depth is more parameter- and sample-efficient for performance gains.
- “With width you're making your network outputs wider... number of parameters grows approximately quadratically... scaling along depth might be better because with fewer parameters...” — Co-author (13:35)
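The residual-plus-normalization combination discussed above can be sketched as a pre-norm residual block in plain Python. This is illustrative only; the paper's actual layer details are not specified in the episode, and the function names are assumptions:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, f):
    """Pre-norm residual block: output = x + f(LayerNorm(x)).

    The identity path (the bare x) lets gradients flow unchanged through
    hundreds of stacked blocks; f is any learned transformation.
    """
    h = layer_norm(x)
    fx = f(h)
    return [a + b for a, b in zip(x, fx)]

def deep_network(x, layers):
    """Stack many residual blocks; depth no longer degrades the signal path."""
    for f in layers:
        x = residual_block(x, f)
    return x
```

The point the advisor makes holds in this sketch: naive stacking of plain layers compounds distortion, while the residual identity path plus normalization keeps very deep stacks trainable.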
4. Architectural and Objective Innovations
- Not Just “Make It Big and It Works” (07:59)
- The success isn’t just about bigger models — it’s about the combination of deep architectures and self-supervised, reward-free objectives.
- “A lot of people reading the title are like, wow, big networks, they're great. I'll take big networks... but that's actually not the main conclusion. ... It requires using a different objective. This objective doesn't actually use rewards in it.” — Advisor (07:49)
- Blurring the Boundaries in ML (09:01)
- Their approach stands at the intersection of reinforcement, self-supervised, and representation learning.
- Referenced Yann LeCun’s ideas on unsupervised vs. supervised vs. RL for intelligence.
5. Practical Implications & Trade-offs
- Impact for Robotics & Industry (12:03)
- Scalability makes the method attractive for robotics — enables highly capable agents even with less data or manual supervision.
- Could serve as an alternative to resource-heavy imitation learning.
- Efficiency Insights (13:09, 15:05)
- Deeper networks grow total parameters linearly, whereas wider networks grow them quadratically.
- Run-time and data efficiency are generally manageable, especially as collecting data—not network evaluation—is often the bottleneck in RL.
- Modern frameworks (JAX, GPU-accelerated environments) enable rapid, parallel data collection, making scaling practical.
- “Everything can be run on one GPU... one single 80 GB H100 GPU.” — Kevin (24:11)
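The linear-vs-quadratic claim above is simple arithmetic over MLP weight matrices; a quick sketch with hypothetical dimensions:

```python
def mlp_param_count(depth, width, in_dim, out_dim):
    """Parameters (weights + biases) of an MLP with `depth` hidden layers
    of size `width`.

    Each hidden-to-hidden block costs width**2 + width parameters, so the
    total grows linearly in depth but quadratically in width.
    """
    params = in_dim * width + width                   # input projection
    params += (depth - 1) * (width * width + width)   # hidden-to-hidden layers
    params += width * out_dim + out_dim               # output projection
    return params
```

For example, doubling depth roughly doubles the parameter count, while doubling width roughly quadruples it, which is the co-author's point about depth being the cheaper axis per unit of capacity.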
6. Theoretical Connections & Future Directions
- RL as Classification & World Models (09:45, 19:01)
- The new paradigm transforms RL from value estimation to a more scalable classification problem, akin to cross-entropy loss in language modeling.
- Hints at implicit world models: “not predicting next state exactly, but classifying trajectories.”
- “We're trying to classify whether future state is along the same trajectory or along a different trajectory...” — Kevin (09:45)
- Deep Teacher, Shallow Student? (20:56)
- Host speculates on a pipeline: train deep networks for maximum capabilities, then distill knowledge into shallower models for efficient deployment.
- “Deep teacher, shallow student would be a good deployment paradigm...” — Host (20:56)
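The host's "deep teacher, shallow student" idea is standard knowledge distillation; the paper itself does not do this, so the following is a generic sketch of the matching loss, with all names hypothetical:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened output distributions.

    A small student network is trained to reproduce the output distribution
    of the large, fully trained teacher network, trading capability during
    training for efficiency at deployment.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s))
```

The loss is zero when the student matches the teacher exactly and positive otherwise, giving the shallow model a dense training signal from the deep one.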
7. Scaling on Multiple Axes
- Unlocking Batch Size Scaling with Depth (22:55)
- Larger networks can better utilize bigger batch sizes — found synergy between scaling depth and batch.
- Accessible Compute (24:11)
- Project is designed to be reproducible on accessible (albeit modern) hardware.
8. Field and Community Reception
- Reception at NeurIPS (27:15)
- Poster/talk sessions saw enthusiastic responses; many considered the results “eye-opening.”
- “People thought it's a very eye opening paper because the objective is quite simple, it's quite elegant, and for us to be able to, I don't want to say overturn, but sort of challenge the conventional wisdom that RL is not super scalable...” — Kevin (27:15)
Memorable Quotes & Timestamps
- On RL’s Unexplored Depths: “Reinforcement learning was like this one anomaly where we continue to use these really shallow networks.” — Advisor (03:17)
- On Architectural Synergy: “If we just made the depth bigger, it makes it worse. If we just had residual connections, didn't make it better. And it was really this combination... that really made this work.” — Advisor (05:45)
- On What Actually Matters: “I think the main conclusion is that using big networks not only requires these architectural tricks, but also... requires using a different objective. This objective doesn't actually use rewards in it.” — Advisor (07:59)
- On Theoretical Shifts in RL: “We're fundamentally shifting the burden of learning from ... regressing to TD errors ... to fundamentally a classification problem.” — Kevin (09:45)
- On Efficiency Trade-offs: “Width is expensive. Exactly. And in general, of course, like more parameters is also going to be more expensive. So that's just like another consideration...” — Co-author (14:47)
- On Practical Experimentation: “All of our experiments, even the thousand layer networks, can be run on one single 80 GB H100 GPU.” — Kevin (24:11)
Important Timestamps
- 00:22 – Kevin shares first impressions and how the team formed.
- 02:14 – Advisor’s skepticism about deep RL networks.
- 03:35 – Kevin’s surprise at shallow RL architectures.
- 04:28 – Shift to self-supervised RL: learning via representation, not value functions.
- 05:45 – Combining architecture choices for breakthrough performance.
- 06:54 – Parameter/sample efficiency: depth vs. width.
- 07:59 – True takeaway: not “just add depth,” but rethink objectives as well.
- 09:45 – Transforming RL tasks into scalable classification problems.
- 12:03 – Implications for robotics and scalable, low-supervision RL.
- 13:35 – Detailed discussion of efficiency trade-offs by scaling depth vs. width.
- 15:05 – Discussing run-time and environment bottlenecks.
- 18:26 – Discussing world models and representation learning.
- 20:56 – Concept of “deep teacher, shallow student” for efficient inference.
- 24:11 – Compute requirements: all results obtained on a single modern GPU.
- 27:15 – Community response at NeurIPS poster session.
Future Directions Discussed
- Distillation and Pruning: Training deep networks for capability, then compressing for deployment (21:13).
- Scaling Multiple Dimensions: Depth, width, batch size—how far can we push RL with more compute? (22:27)
- Vision-Language-Action Models: Integrating RL with representation learning from language and vision (24:40).
- Action Representation Research: New paradigms for planning and hierarchical control (26:14).
Summary Tone & Style
The episode is technical yet energetic, with a distinctly collaborative and inquisitive tone. Speakers often use analogies to other ML domains (e.g., language modeling, robotics), emphasize practical experimentation and openness, and are transparent about the challenges and serendipity involved in research breakthroughs.
Conclusion
This Latent Space episode offers an in-depth and accessible exploration of how and why self-supervised objectives—and careful architecture choices—can unlock a new regime of scalable, deep reinforcement learning. The discussion covers not only the technical “how” but also the broader implications for AI practitioners, robotics, and the very framing of learning problems. RL isn't doomed to tiny nets: with the right recipe, a thousand layers can indeed matter.