#331 Sergey Levine: The Robot Revolution Nobody Is Talking About - Eye On A.I. | Wave AI Podcast Notes

#331 Sergey Levine: The Robot Revolution Nobody Is Talking About

Eye On A.I.

Sun Apr 12 2026

Summary

Eye On A.I. – Episode #331

Sergey Levine: The Robot Revolution Nobody Is Talking About
Date: April 12, 2026
Host: Craig S. Smith
Guest: Sergey Levine, Co-founder at Physical Intelligence & Professor at UC Berkeley

Overview

This episode explores the evolution and future of robotics AI, focusing on foundations for general-purpose robots, data collection strategies, the role of simulation and real-world data, the surge of humanoid robots, and the critical bottleneck in achieving scalable, adaptable, and continually learning robotic systems. Sergey Levine offers a deep dive into current advances, technical nuances, and societal implications, challenging some popular assumptions and painting a picture of the "robot revolution" grounded in research progress rather than hype.

Key Discussion Points & Insights

1. What Are Robotic Foundation Models? (01:34, 02:40)

Definition & Analogy:
Sergey's team focuses on "robotic foundation models," large, general-purpose models that, trained on diverse data, can control various robots to perform wide-ranging tasks—much like how language models digest web text for broad "common sense" abilities (02:40).
Data Diversity as the Secret Sauce:
"A robotic foundation model should be trained on all of the embodied data that we can get our hands on... Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body." – Sergey Levine (05:15)
Physical Intelligence's Approach:
Physical Intelligence distinguishes itself by not being "picky" about which robots’ data to use, maximizing diversity for robustness and generalization (06:15).

2. Real-World Data vs. Simulation (07:48, 10:28)

Limitations of Simulation:
Sergey argues that while simulation can be useful (especially for edge cases like car collisions), it is inferior to real-world data for capturing the diversity of environments and tasks needed for generalizable robots.

"It's not a very appealing tool for getting experience of very diverse environments and objects... Getting real images is so much easier." (07:48)
"Upfront Cost" of Real Data:
There's an initial challenge ("activation energy") in distributing enough robots to collect substantial data, but real-world deployments eventually outpace the benefits simulation can provide (08:50).
Simulation Role:
Best used for rare or dangerous scenarios that are infeasible or unsafe to collect in reality (09:14).

3. Data Collection and Scaling: Teleoperation, Autonomy, and Collective Learning (12:08, 14:37, 15:15)

Teleoperation:
Initial foundation is built with teleoperation—“humans showing robots what to do” (12:08).
Scaling Issues:
Relying on human teleoperators in every home/factory is not scalable, so the future lies in blending sources: teleoperation, instruction via language, and reinforcement learning from autonomous experience (12:57).
Language Feedback:
Once base models are sufficiently capable, human corrections can be supplied as language ("put the plate in the sink")—supervising internal "thoughts" rather than low-level actions (13:46).
Fleet Effect:
"All of our robots share all the experience... The stronger the base foundation model is, the more readily it can incorporate experience from diverse robotic platforms." (15:15)
Early experiments such as Google's robot "arm farm" demonstrated the collective benefit, which is now amplified by foundation models' ability to handle diversity.
Cross-Embodiment:
Little extra "cleverness" is needed for models to adapt to new robot morphologies—the model learns to infer form and intent from camera images and sensor data (16:51).

4. Major Projects & Results (17:29, 19:43)

RT-X Project:
- Combined data from ~30 institutions’ robot arm experiments into a generalist model.
- Result: The generalist model "was about 50% more successful than whatever each individual lab was developing." (18:51)
- Parallels to language models: Generalists can outperform specialists if trained on diverse, large datasets.
Cross-Platform Transfer:
3% of mobile robot data (vs. 97% static arm) sufficed for broad generalization, thanks to foundation model transfer—lower-cost platforms can provide most foundational data (20:10).
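To make the 97% / 3% split above concrete, here is a minimal sketch of drawing training episodes from a weighted data mixture. The source names, the weights as sampling probabilities, and the `sample_mixture` helper are illustrative assumptions; the episode does not describe how Physical Intelligence actually samples its data.

```python
import random
from collections import Counter


def sample_mixture(sources, weights, n, seed=0):
    """Draw n source labels according to the given mixture weights."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.choices(sources, weights=weights, k=n)


# Hypothetical mix echoing the split quoted in the notes above.
SOURCES = ["static_arm", "mobile_robot"]
WEIGHTS = [0.97, 0.03]

batch = sample_mixture(SOURCES, WEIGHTS, n=10_000)
counts = Counter(batch)
mobile_fraction = counts["mobile_robot"] / len(batch)
print(f"mobile-robot share of sampled episodes: {mobile_fraction:.3f}")
```

The point of the sketch is only that the scarcer, more expensive platform contributes a small slice of each batch while the cheaper platform supplies the bulk.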

5. Learning Modes: Imitation, Reinforcement, and World Models (22:11, 23:47, 26:37)

Imitation Learning:
Teleoperation data commonly used for imitation learning, but real progress comes from models learning "what is possible" rather than naive copying—this is where offline reinforcement learning shines (22:11).
Visual-Language-Action (VLA) Models:
Now the standard in robotic learning, with roots in early 2020s research.
Modern VLA models rely on:
- Vision encoders for processing images.
- Language modules for instructions/context.
- Specialized "motor cortex" modules, often using diffusion models, for producing fluid, continuous actions (41:45).
"It's like building a brain piece by piece—language, visual cortex, now a motor cortex." (39:18)
Reasoning & Semantic Knowledge:
High-level reasoning leverages semantic information (like LLMs), enabling common-sense corrections and adaptability (25:10).
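Later in the conversation (23:47), Levine describes first-generation VLAs as turning control into visual question answering: the prompt is the instruction and the answer is the numerical value of the actions. A minimal sketch of that idea follows. The 256-bin discretization, the [-1, 1] action range, and the function names are assumptions made for illustration, not details given in the episode.

```python
def encode_action(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin, joined as text."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)        # clip to the assumed range
        frac = (a - low) / (high - low)   # position within the range, 0.0 to 1.0
        tokens.append(min(int(frac * bins), bins - 1))
    return " ".join(str(t) for t in tokens)


def decode_action(text, low=-1.0, high=1.0, bins=256):
    """Invert encode_action, returning the center value of each bin."""
    width = (high - low) / bins
    return [low + (int(t) + 0.5) * width for t in text.split()]


prompt = "pick up the socks"                    # the "question"
answer = encode_action([0.0, -1.0, 1.0, 0.5])   # the "answer" as text tokens
recovered = decode_action(answer)
print(prompt, "->", answer)
```

Decoding only recovers each value to within half a bin width, which illustrates why treating a continuous trajectory as a discrete answer is lossy and why continuous action modules such as diffusion models are attractive.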

6. The Role of World Models and Abstractions (27:55, 29:56)

World Models:
Levine distinguishes between predictive, latent-space world models and higher-level abstractions but sees less of a dichotomy:

"For a real, capable, embodied intelligent system like a robot, we'll need many different abstractions... what language models do, visual-language models do, video prediction models do, and what world models do isn’t actually that different—they just operate with different abstractions." (28:58)
Blended Reasoning:
Human motor skills blend model-free, reactive behavior with abstract, high-level predictions—robotic models should aim for this blend as well, flexibly using the suitable abstraction for the task (30:14).

7. On-Device vs. Cloud-Based Models (32:32, 33:12, 35:22)

Current Status:
Most inference is cloud-based, but as robots are deployed in the wild, robust on-device components (especially for low-level motor control) will be essential.

"The lowest levels... need to be very fast... but also are not as cognitively demanding... so they can run locally. The natural trajectory... makes [running on device] reasonably straightforward." (33:12, 35:22)
Future Architecture:
Hierarchical, multi-scale models—"instincts" local, reasoning and planning remote, with communication across layers as connectivity allows.
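The split described above can be sketched as a control loop in which a small local policy runs on every tick and a slower remote planner is consulted only when connectivity allows, with the last plan cached in between. The class name, the `replan_every` parameter, and the stand-in policies are hypothetical; this is not Physical Intelligence's architecture, only an illustration of the partitioning idea.

```python
class HierarchicalController:
    """Fast local reflex loop plus a slow planner that may be unreachable."""

    def __init__(self, remote_planner, replan_every=10):
        self.remote_planner = remote_planner    # callable: observation -> subgoal
        self.replan_every = replan_every        # minimum ticks between remote queries
        self.subgoal = "hold_position"          # safe default before any plan arrives
        self.ticks_since_plan = replan_every    # forces a query on the first connected tick

    def local_policy(self, observation, subgoal):
        # Stand-in for a small on-device motor policy.
        return (subgoal, observation)

    def step(self, observation, connected):
        self.ticks_since_plan += 1
        if connected and self.ticks_since_plan >= self.replan_every:
            self.subgoal = self.remote_planner(observation)
            self.ticks_since_plan = 0
        # The local loop always runs, reusing the cached subgoal when offline.
        return self.local_policy(observation, self.subgoal)


calls = []


def fake_planner(observation):
    """Hypothetical remote planner that records when it was queried."""
    calls.append(observation)
    return f"subgoal_{len(calls)}"


ctrl = HierarchicalController(fake_planner, replan_every=3)
# Connectivity drops after tick 3; the robot keeps acting on its last plan.
log = [ctrl.step(t, connected=(t < 4)) for t in range(8)]
```

In this toy run the planner is queried at ticks 0 and 3, and every later tick falls back on the cached subgoal, mirroring the "robot gets a little dumber" behavior Levine describes when connectivity degrades.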

8. Generalization vs. Specialization (43:54)

Necessity of Generality:
Specialized robots quickly fail outside structured environments; the real world is too unpredictable. Even for single-purpose tasks, generalist models are more robust because they handle edge cases and surprises (44:17).

"The gap between a closed world and an open world is enormous. You can't be just a little open world—immediately stuff can happen." (44:12)
Example:
Project with box assembly robots revealed numerous unanticipated scenarios requiring adaptability (44:46).

9. The Humanoid Question (46:03, 47:51, 50:27)

Physical Intelligence’s Stance:
Avoids humanoids for practical reasons (cost, complexity, teleoperation challenge), but supports the idea that the future will feature a diversity of form factors, with software decoupled from hardware (46:03, 48:21).
Humanoid Hype?
Humanoids are emotionally and intuitively appealing, but shouldn’t be the sole vision for future robots.

"I think there's a good reason to be excited about humanoids... it captures the imagination... but I think it's a somewhat limited view to restrict ourselves just to that." (50:27)
Demos vs. Reality:
Impressive demos can mislead—true generalization is harder to show and measure, yet more technically consequential (52:48).

"If you show it doing something simple in a hundred different environments... that's harder to convey." (52:51)

10. What's Next for Physical Intelligence? (54:59)

Future Focus:
Turning foundation models into true continual learners—robots continually improve from every new experience, via RL, language feedback, and self-supervised learning loops (55:01).
Technical Challenge:
Building a “data flywheel” remains the key innovation: ongoing autonomous learning in the real world.

Notable Quotes & Memorable Moments

On Data Diversity:
"Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body." – Sergey Levine (05:15)
On Simulation vs. Reality:
"It's not as easy as taking a camera, going out and taking pictures. And I think this is a little bit of a mistake because actually, if you're serious about building general purpose robots that'll go out into the world and do lots of things, the kind of boundary condition is in your favor..." (08:09)
On Generalists vs. Specialists in Robotics:
"The generalist model, the one that could handle a wide variety of tasks, actually becomes a better specialist because it can deal with all that weird stuff that arises... generality is really essential, even if you really want to do one thing." (44:55)
On Future Robot Form Factors:
"I actually really hope that robots will kind of end up being a little bit like personal computers, where there's like general software and the form factor... can be very different for different jobs." (46:19)
On the Humanoid Hype:
"There's a good reason to be excited about humanoids... but I think it's a somewhat limited view to restrict ourselves just to that." (50:27)
On the Science Communication Challenge:
"If you show it doing something fairly simple, but in like a hundred different environments, well, then each of the videos of that is just a robot doing something simple. So the fact that it can do it in all these different settings is harder to convey." (52:51)

Timestamps for Key Segments

[01:34] – Sergey Levine introduces himself and robotic foundation models.
[07:48] – Discussion of simulation vs. real-world data.
[12:08] – Role and limitations of teleoperation in scaling data collection.
[15:15] – Fleet effect and collective learning across diverse robot platforms.
[17:29] – RT-X project: results from global robot arm data collaboration.
[22:11] – From imitation learning to smarter reinforcement approaches.
[26:37] – Role of world models and abstractions in robot intelligence.
[32:32] – On-device vs. cloud inference for robotic control.
[39:18] – Primer on VLA models and the “brain” analogy.
[43:54] – Why even specialists need generalist robots.
[46:03] – Physical Intelligence’s stance on humanoids and diversity of robot bodies.
[50:27] – Is humanoid fever overblown? Sergey’s balanced perspective.
[54:59] – Sergey’s vision for the coming years—continual, autonomous learning systems.
[56:13] – Unexpected inspiration and science fiction as a “guilty pleasure.”

Final Thoughts

Sergey Levine paints a nuanced, optimistic, and technically informed vision for the next era in robotics, one in which data diversity, adaptable foundation models, and hybrid AI systems enable a broad array of physical forms and capabilities. He underscores that the real revolution lies less in dramatic demonstrations than in the steady accumulation of robustness and generality from messy, heterogeneous data, and in the relentless drive toward autonomy and adaptability.

Recommended Listening for:

AI/robotics researchers
Tech industry strategists
General audiences interested in the real progress (and hype) in robotics

Transcript

A (0:00)

A lot of startups these days say they're building foundation models for robots. What does that actually mean for a non-technical listener?

B (0:08)

I think every time that people try to take robots out of the factory and into open world environments, they very quickly realize that in the real world there's a huge range of things that can happen.

A (0:18)

But since then there's been this explosion of humanoids and everyone's talking about humanoids. I mean, did I get it wrong? Do you think that humanoids are much closer to being in the world?

B (1:34)

My name is Sergey Levine. I'm one of the founders at Physical Intelligence. I'm also a professor at UC Berkeley. And what I work on these days is algorithms for reinforcement learning for optimal decision making, as well as applications in robotics. And something that I've been very interested in lately in particular is robotic foundation models. These are general-purpose models that control any robot in principle to perform any task. And I think we've seen some pretty dramatic transformations in the last few years in the capabilities of these kinds of generalist robotic systems, where we can use very diverse data sources from many different robotic platforms, performing a wide range of different tasks, and acquire a kind of general physical understanding from these data sets that then makes it much more feasible to rapidly acquire effective and robust and highly generalizable robotic skills. So this is something that I've been very interested in the last few years and I think it's an area where we see a lot of progress.

A (2:30)

Yeah, and a lot of startups these days say they're building foundation models for robots. What does that actually mean for a non-technical listener?

B (2:40)

Yeah, it's actually a surprisingly nuanced question because after the success of ChatGPT, the term foundation model became obviously very much a buzzword. So in some cases it's almost synonymous to saying I have a good model, it's a foundational model. But I think that insofar as there's a consistent definition, it's something like this: the principle behind language models, vision-language models, things like this, is that you can use very large and diverse data sources that are not necessarily of extremely high quality. Like it might be just data harvested from the web. And you get a model that digests all this data and acquires a kind of broad and general understanding of how the world works. And this kind of understanding is not enough to be an expert, it's not enough to be extremely proficient, but it gives you that kind of basis of common sense. So the term foundation model was coined by Percy Liang and his colleagues at Stanford for precisely this reason, because this kind of broad basis of knowledge gives you a foundation on top of which you can then put other things. And the key thing about the foundation is that in order to be useful, it needs to be broad. Insofar as there's a deep technical insight here, the insight is that if you need such a large amount of data, it's almost impossible to get in a single domain. But if you are willing to use data from many different sources, maybe all of the text data on the web, all of the image data you can get your hands on, or in our case, data from all of the robots that we've seen, that could be big enough and that can give you that foundation. And on top of that foundation, now you can build up individual skills. In the case of language models, you can fine-tune them for expert-level computer programming.
In the case of our robots, we can fine-tune them for, you know, like assembling things or making coffee or cleaning the kitchen with very high quality data, but a much more limited amount of it. Because once you can put it on top of that foundation, you don't need a huge amount of data of very high quality for the downstream tasks. So that's the really important thing. Now, to come back to your question, you asked, well, there are many startups, many organizations that are building foundation models. I think one of the most important things about a robotic foundation model is to answer the question, where does all of that really broad and diverse data come from that can establish that foundation? This is a place where the answer is very, very delicate. The strategy that I think will be most successful here is to not be too picky. In the same way that language models are trained on all the text data that can be mined from the web, a robotic foundation model should be trained on all of the embodied data that we can get our hands on. And it's one of those things where once you cross a certain threshold of scale, it actually becomes easier to incorporate other data sources. So if you want really, really high quality data of like one particular high-end humanoid robot, that's pretty challenging because now you're very constrained. You have to have that system, you have to get good data for it. You have to figure out how to teleoperate it, put it in the right environments and so on. But if you're willing to pull in everything, then you can pull in lots of robot data from many different kinds of robots. Some of them might be good, some of them might be bad. Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body. So this kind of diversity actually makes it easier to include other data sources. So go ahead.

B (7:48)

Yeah, yeah, for sure. So in regards to simulation, what I would say is this, that simulation is a very appealing tool for kind of very easily acquiring lots of data of a robot doing all sorts of different things. But it's not a very appealing tool for getting experience of very diverse environments and very diverse objects. So if we look at kind of the domains in AI where simulation has been successful and the domains where it has struggled to get adoption, computer vision is an area where simulation has actually been used very little, despite a lot of attempts. Why? Well, it's not actually because rendering images is hard. In fact, like computer graphics is very, very advanced. So we can render very realistic images. It's that getting real images is so much easier. Right. Like you can just take a camera and go and photograph stuff and you get lots of real images. And I think that in robotics, a kind of a mental trap that people sometimes fall into is that they say, well, maybe in robotics it's hard to get data. It's not as easy as taking a camera, going out and taking pictures. And I think this is a little bit of a mistake because actually, if you're serious about building general-purpose robots that'll go out into the world and do lots of things, the kind of boundary condition is in your favor, meaning that the better you get at building generalist robots, the more robots there are and the more data there should be coming in. There's a little bit of like an initial activation energy problem where you have to get over that hump to get enough systems out there, but that's like a transient period. Once you get over that, then you have lots of robots out there and lots of data coming in. So it actually, to me, makes a lot more sense to pay a little bit more of that upfront cost to kind of like force it over that threshold and then get lots of real world data. That doesn't mean that we shouldn't use simulation at all.
It just means that we shouldn't worry so much about how hard it is to get robot data. For robot data, we should treat that as the industrial problem that it is. Get robots out there, get the data coming in, and then simulation can be very useful for addressing other edge cases. For example, you can simulate, you know, this is what the autonomous driving folks do all the time. You can simulate cases that you don't want to experience in the real world. Like you can simulate a car collision, but you don't want to experience a car collision in your car. So there's a lot of these kind of edge cases that you might want to take care of with simulation. But I don't think we should think of it as a substitute for real experience, because in other areas where diversity has been critical, like computer vision and natural language processing, real data has been essential to get there. And further, once you have that real data, it's actually easier to incorporate other data sources.

B (12:08)

Yeah. So this is a very good question. And in fact, in some ways, this is like kind of the big question in modern robotic foundation models, which is, what kind of data can you use? Now, I think it makes a lot of sense to start off building the initial foundation with teleoperation data. But as you pointed out, there are major challenges with this, like, you know, not the least of which is that if you actually want to collect data in deployment scenarios, whether it's somebody's home or a business or a factory or a warehouse or whatever, like, that's an additional kind of inconvenience that you have to deal with, and it's a barrier to scale. I think that the right way to proceed with this is to think of it as a mixture of different data sources, where as your model gets better, it should be able to leverage more accessible and more scalable data sources. So initially, maybe when the model is not very good, we need data from teleoperation from humans. That basically illustrates, like this, how the robot should act. But once the model gets better and the robot can be deployed with at least some degree of autonomy, then we can handle more accessible sources of supervision. One more accessible source of supervision is instructions. This is something that we actually found, not entirely intentionally, like, we were just kind of like, you know, trying out a few things in some of our research projects. But we found that we could actually get improvement in our policies by supervising the robot, essentially through language. And this only started happening once the model became powerful enough that the low level skills were already pretty good. Then you could correct the robot and say, like, maybe it's cleaning up the kitchen it messed up. You could say like, oh, you needed to pick up the plate and make sure you put the plate in the sink. 
And the way the model works internally is it's very similar to how these modern reasoning models, LLMs, work, where there are internal thoughts that are generated and the final action is chosen based on those thoughts. So essentially this kind of language feedback supervises the internal thoughts rather than the low-level actions. But once the low-level actions are good enough, actually supervising the internal thoughts already gives the robot a lot of learning signal and can improve a policy without direct teleoperation. But again, this only emerges once the base model is strong enough. The other thing we can do is leverage autonomous experience, where we can improve the system through reinforcement learning. So there was a research project that we actually published just a few months ago that describes a reinforcement learning system that we built on top of our foundation model. And again, it's the same story that you need the foundation model to be strong enough so that from there it can improve with autonomous experience and reinforce.

B (17:29)

Yeah. So this, what you're referring to, I think, is the RT-X project, which in many ways was actually part of the impetus for starting Physical Intelligence. So this was a project that we did in 2023, and myself and many of my colleagues that worked on this then went on to found Physical Intelligence. The RT-X project was very much an academic research project, but what we did is we contacted academic research labs, about 30 labs in total, and we asked them to basically send us the data from their robotic manipulation experiments. And we limited this to single-arm robot manipulators with parallel jaw grippers, just to pick kind of the most common form factor. And then what we did is we trained one model across all of these different data sets and we sent back that model to some of the labs that had donated data and asked them to essentially evaluate it in comparison to whatever they were developing on their own robot for their own application. So each lab was doing a different research project with a different robot and a different task, and they had their own methods that they were developing. And we just said like, whatever is the best you've got, just measure that against our generalist model. And what we found is that the generalist model on average was about 50% more successful than whatever each individual lab was developing. And that's really, really exciting because this is kind of paralleling a lot of the development that we've seen in language models. With language models, the big result, scientifically, wasn't actually ChatGPT. The scientific result that was so exciting is that the generalist language model could outperform specialized models for machine translation, sentiment analysis. You know, all these NLP tasks that typically would require very specialized data sets and very specialized models could be done better with this more general model.
And what we saw with RT-X was an early hint that something like that was actually happening in robotics. And I think that's actually really important.

B (23:47)

That's right. So our models are based on visual language action models. And this has kind of become essentially like a de facto standard in robotic learning research. This is something that many of the folks on the team here pioneered back in the early 2020s, but now it's basically what everybody uses. VLAs are kind of an interesting thing because initially, like the early, what I refer to as first generation VLAs, they were trained in a very straightforward way. Basically, visual language models are models that answer questions and they can also take in an image, so they answer visual questions. Early VLAs were trained by basically taking this visual question answering paradigm and simply turning robotic control into like a visual question. So in robotic control, the question is the prompt, like, pick up the socks. And the answer is the numerical value of the actions. That's like a fairly straightforward naive way to cram robotic control into a format that vision language models can understand. But there's a lot more that we can do than just that. And there's, broadly speaking, two big buckets where there is room for improvement with VLAs. One is dexterity and the other one is reasoning. So dexterity means go beyond treating actions as an answer to a visual question and actually develop a model design that handles dexterity first and foremost. So control is not a discrete thing. It's not an answer to a question. It's a continuous thing, it's a trajectory. So you can use models that are very well adapted to high dimensional continuous dynamical systems. Diffusion models are really good for this. So incorporating diffusion models into vision language models can give you these kind of much more dexterous VLAs. The second thing is the knowledge that is learned by language models and visual language models from the web. A lot of that knowledge is semantic, it's not physical. So a big way to improve VLAs is to better hook into that semantic knowledge.
Essentially when the robot doesn't know what to do, what it should do, much like a person, is it should pause and think. And that thinking maybe taps into more semantic knowledge that is not yet fully grounded in the physical world, but can lead to reasonable inferences. So maybe it's trying to open a drawer to take out a knife to cut a vegetable, and the drawer isn't opening. Now, maybe the robot's experience is not enough to inform it what to do, but there's a reasonable semantic inference. You can say, well, why isn't this thing open? Maybe I should try a different one. There's kind of this common sense inference you can make, and that common sense, like if you ask, like ChatGPT, it can make that common sense inference and it will tell you something. And the trick then is to digest that inference into a format that the motor control component of this can understand, and that's basically a thinking process. So that's where I think there's room for a lot of improvements for these models and we've seen a little bit of that in some of our work on chain of thought. And I think it's where we'll see a lot of future developments.

B (27:55)

Yeah, it's a really interesting question. I think that some folks tend to present world like, I guess we should nail down what we mean by world models. Typically what people mean when they say world models is some kind of predictive model that operates at the level of raw observations. It doesn't mean that it predicts raw observations. It may be that it is. Like Yann LeCun, for example, I know he advocates for essentially a latent space world model which predicts a sufficient statistic of observations. But roughly speaking, it's something that predicts something about your future observations. It's a very reasonable idea. But I think that something that we should keep in mind is that for human behavior, prediction definitely plays a role. But there are also things that we do that are not grounded entirely in prediction. There's a place where prediction is easy and there's a place where prediction is difficult. And the abstractions that we use are really critical to intelligent behavior. So to give you an example, like if I want to figure out how to get from where I am now in San Francisco to New York City, maybe I'm going to imagine something about it. Maybe I imagine how I get my car keys and I get in my car. I might imagine that I'm taking an airplane. But the further out that I think about this, the more abstract that imagination becomes. I'm not imagining exactly what my seat in the airplane is going to look like. So abstractions are really key to actual effective world modeling. And I think at some level, what language models do, what visual language models do, and what video prediction models do and other kinds of world models is not actually that different. They're just operating with different abstractions. And I think for a real, capable, embodied intelligent system like a robot, we'll need many different abstractions.
And I suspect that once we figure out how to use abstractions in general like that, that general part of the question is actually the important one. And then we'll use kind of the right kind of thing at each level and that'll be fine. So I guess what I would say is that like, definitely I'm very sympathetic to the world models view, but I also suspect that the dichotomy between language models and what people today call world models is not as large as some people might see.

B (33:12)

Yeah, that's a really interesting question. So, you know, so far we're still very much in like the research and development phase. We haven't had to worry about this very much. And you're completely right that currently the models actually live on the cloud. They actually run through an inference API that looks very similar to what someone might imagine using for an LLM. But in the long run, I think you're right that there needs to be an on-device component that is reliable, that is not vulnerable to connectivity issues. Generally, I think that the way to move towards this, which I think is already reflected in current models that we and others have been developing, is to have a system that performs multiple types of inferences in parallel at different levels, where the highest levels are maybe more appropriate to offload to a remote inference server. And the lowest levels, the ones that are really doing motor control and closing the loop very tightly on perception, run locally. Now the good news is that the lowest levels are probably also going to be the smallest ones in terms of the number of parameters, because they're not, you know, you can sort of think of these as like instinctual reactions, reflexes, that sort of thing. Like these are things that are very important, they need to be very fast, but they also are not as cognitively demanding and not as complex, so they can run locally. Now the trick of course is figuring out all that communication so that you still preserve the benefit of end-to-end training. So this whole thing is trained together to act in concert, but at inference time can be partitioned in this way with different size components running in different places. And it's kind of cool to imagine, like you know, you have good Internet connectivity, you get a lot of intelligence, your Internet connectivity starts to degrade. Okay, maybe the robot gets a little dumber.
Like it has to, you know, maybe pull down some nice inferences from the cloud, keep them locally and do some stuff and then, okay, now, now it's time, let's stop and think again. Let's ask the cloud to think some more.
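As a rough illustration of the partitioning described above, here is a minimal Python sketch in which a slow "cloud" level produces subgoals that are cached on the device, while a fast local level keeps producing actions even when the link drops. Everything in it (`HierarchicalController`, `toy_cloud_plan`, `toy_local_reflex`, the subgoal strings) is a hypothetical stand-in invented for this example, not Physical Intelligence's actual system or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HierarchicalController:
    """Toy split of one policy into a remote slow level and a local fast level."""

    cloud_plan: Callable[[str], List[str]]        # hypothetical remote planner: task -> subgoals
    local_reflex: Callable[[str, float], float]   # hypothetical on-device controller
    cached_subgoals: List[str] = field(default_factory=list)

    def step(self, task: str, observation: float, connected: bool) -> Optional[float]:
        # High level: only ask the cloud to "think" when the cache is empty
        # and the connection is available.
        if not self.cached_subgoals and connected:
            self.cached_subgoals = list(self.cloud_plan(task))
        # No cached plan and no connectivity: nothing safe to do, so wait.
        if not self.cached_subgoals:
            return None
        # Low level: always runs locally against the current cached subgoal.
        return self.local_reflex(self.cached_subgoals[0], observation)

    def complete_subgoal(self) -> None:
        # Drop the finished subgoal so the next one becomes active.
        if self.cached_subgoals:
            self.cached_subgoals.pop(0)


def toy_cloud_plan(task: str) -> List[str]:
    # Placeholder for a large remote model; returns a fixed two-step plan.
    return [f"{task}: reach", f"{task}: grasp"]


def toy_local_reflex(subgoal: str, observation: float) -> float:
    # Placeholder for a small on-device controller; simple proportional response.
    return -0.5 * observation
```

In this sketch the robot "gets a little dumber" only in the sense that it cannot fetch a new plan while disconnected; the local loop keeps acting on whatever was pulled down earlier.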

B (36:33)

Yeah, yeah, let me think about how to best describe this. So I think it's often hard to get a complete picture of the robotic learning world just by looking at the kind of results that people present. There's kind of one general truism of the robotics demo, which is that robotics demos can be set up in such a way that they show something really cool, but don't actually provide a general solution to a problem. Because if you want to just stage a robot demo, you kind of make things work in that one setting. So because of that, it's a little hard to figure out what the big clusters of major effective techniques are. But if I were to very soberly look at the current robotic learning environment, I would say that there are actually two big things that work very well. One thing, which we've been discussing, is vision-language-action models, and more generally this idea of learning manipulation skills and things like that from data. Typically it's with imitation learning, but it can also be with reinforcement learning insofar as it can leverage that data. And oftentimes the recipe is actually: use imitation learning to initialize, and then maybe fine-tune it with RL. And then the other big cluster is sim-to-real transfer, using simulation to learn motor skills in a sufficiently randomized way and then run them in the real world. And when you see demos of, you know, robots doing dancing and acrobatics and all that stuff, that's typically doing sim-to-real. And it's kind of a funny unstable equilibrium in the current research world that these two types of techniques are very different, and they are also used to attack very different domains. So the vision-language-action models, they're basically currently the dominant paradigm for robotic manipulation problems where robots need to interact with diverse environments and diverse objects.
The sim-to-real stuff is the method of choice for highly acrobatic and athletic movements, typically for humanoids. And these are situations where the task is physically extremely demanding, but the diversity of the environment is very low. So if you put the robot on stage and it's dancing, you're not really worried about the shape of the stage. You're just worried about the robot's body. So this maybe tells us actually a lot about the strengths and weaknesses of these two types of approaches: the sim-to-real stuff is great for really understanding the physics of the robot, but not so great for generalization. The vision-language-action models are great for generalization, but of course, because they're dealing with the physical world, with real-world data, they can't run these giant RL loops that practice for billions of trials and really overfit to the particular body of the robot. Now, of course, there's a lot more to do in the future, but this hopefully gives you some sense for the lay of the land.
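To make "using simulation to learn motor skills in a sufficiently randomized way" a little more concrete, here is a toy Python sketch of domain randomization. It is only an assumption-laden illustration: the one-dimensional point-mass simulator, the parameter ranges, the fixed damping term, and the function names (`sample_physics`, `rollout_cost`, `randomized_cost`, `select_gain`) are all invented for this example, and a simple search over a controller gain stands in for the RL training a real sim-to-real pipeline would use.

```python
import random
from typing import List, Tuple


def sample_physics(rng: random.Random) -> Tuple[float, float]:
    """Draw one randomized simulator configuration (ranges are arbitrary assumptions)."""
    mass = rng.uniform(0.5, 2.0)
    friction = rng.uniform(0.0, 0.5)
    return mass, friction


def rollout_cost(gain: float, mass: float, friction: float,
                 steps: int = 200, dt: float = 0.05) -> float:
    """Simulate a point mass driven toward x = 0 and return the accumulated squared error."""
    x, v = 1.0, 0.0
    cost = 0.0
    for _ in range(steps):
        u = -gain * x - 2.0 * v               # controller: proportional gain plus fixed damping
        v += (u - friction * v) / mass * dt   # update velocity from the net force
        x += v * dt                           # then update position with the new velocity
        cost += x * x * dt
    return cost


def randomized_cost(gain: float, episodes: int = 50, seed: int = 0) -> float:
    """Average cost over many episodes, each with freshly sampled physics."""
    rng = random.Random(seed)
    total = sum(rollout_cost(gain, *sample_physics(rng)) for _ in range(episodes))
    return total / episodes


def select_gain(candidates: List[float], seed: int = 0) -> float:
    """Pick the gain that does best on average across the randomized simulators."""
    return min(candidates, key=lambda g: randomized_cost(g, seed=seed))
```

The point of the sketch is the selection criterion: the controller is judged on its average performance across many sampled bodies rather than on one exact simulator, which is what lets a policy tolerate the mismatch between simulation and the real robot.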

B (39:18)

Yeah, so maybe one way I could describe this is to start with vision-language models. These are also sometimes called multimodal LLMs. So if you use Gemini or ChatGPT and you upload an image and you ask some questions, that's basically a VLM. And the way that these models work today is you start with a language model. A language model is very, very simple. It's a transformer that takes in text and predicts future text. And to get these things to process images, what we do is we train basically another little piece of neural network that takes an image and puts it into the same space as the language tokens. That's sometimes called a vision encoder. So now you can feed images to this thing the same way that you feed language, because this little virtual visual cortex kind of takes the images and turns them into semantic-looking thingies that the language model knows how to process. So now with VLAs, well, with first-generation VLAs, as I mentioned, people basically just took exactly this model and just changed the output to literally output numbers in text that represent actions. The second-generation VLAs, which is what everyone's basically using today, take inspiration from how VLMs add a visual cortex to the language model, and they also add a kind of virtual motor cortex, a specialized little piece of circuitry whose job is to take the outputs from the language model backbone and decode them into continuous actions. And this is typically done with diffusion. Basically the same kind of technology that's used to generate images and videos is now used to generate trajectories of robot joints. That makes a lot of sense because trajectories of robot joints are a continuous spatial object, the same way that images are continuous spatial objects. So it makes sense that the same technology would be applicable there. And I think this is really cool because it's almost like what we're doing is building a brain piece by piece.
There is the language model backbone, which is, I guess, kind of like a prefrontal cortex almost. There's the little visual cortex part that encodes images, and now there's a little motor cortex part. Anyone familiar with biology would be very offended at this point, because it's backwards, right? Evolutionarily, the motor cortex comes first, then vision, then prefrontal. But here it's the other way around.
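The three pieces described above can be laid out as a data-flow sketch in plain Python. This is only a structural illustration under heavy assumptions: every function here (`vision_encoder`, `text_embed`, `backbone`, `denoise_step`, `action_head`, `vla_policy`) is a hand-written placeholder with no learned weights, the "backbone" is a mean-pool rather than a transformer, and the "denoiser" simply nudges noise toward a context-dependent value, so it shows the shape of a diffusion-style decoding loop, not a trained model.

```python
import random
from typing import List

Vector = List[float]


def vision_encoder(image: List[List[float]], dim: int) -> List[Vector]:
    """Placeholder 'visual cortex': one token per image row, in the same space as text tokens."""
    return [[sum(row) / len(row)] * dim for row in image]


def text_embed(prompt: str, dim: int) -> List[Vector]:
    """Placeholder token embedding: one deterministic vector per word."""
    return [[(len(word) % 7) / 7.0] * dim for word in prompt.split()]


def backbone(tokens: List[Vector]) -> Vector:
    """Placeholder for the language model backbone: mean-pool tokens into a context vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def denoise_step(traj: List[Vector], context: Vector, t: int, steps: int) -> List[Vector]:
    """Placeholder denoiser: move each joint value part of the way toward a context-dependent target."""
    target = sum(context) / len(context)
    alpha = 1.0 / (steps - t)  # step size grows so the final step lands on the target
    return [[a + alpha * (target - a) for a in waypoint] for waypoint in traj]


def action_head(context: Vector, horizon: int, n_joints: int,
                steps: int = 10, seed: int = 0) -> List[Vector]:
    """Placeholder 'motor cortex': start from Gaussian noise and refine it iteratively."""
    rng = random.Random(seed)
    traj = [[rng.gauss(0.0, 1.0) for _ in range(n_joints)] for _ in range(horizon)]
    for t in range(steps):
        traj = denoise_step(traj, context, t, steps)
    return traj


def vla_policy(image: List[List[float]], prompt: str, dim: int = 4,
               horizon: int = 5, n_joints: int = 3) -> List[Vector]:
    """Image and text tokens go through one backbone; the action head decodes a joint trajectory."""
    tokens = vision_encoder(image, dim) + text_embed(prompt, dim)
    context = backbone(tokens)
    return action_head(context, horizon, n_joints)
```

Calling `vla_policy([[0.0, 1.0], [1.0, 1.0]], "pick up the cup")` returns a list of 5 waypoints with 3 continuous joint values each, which mirrors the idea that the output is a continuous trajectory rather than numbers written out as text.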

B (43:54)

So here's maybe how I can try to answer this question. I think every time that people try to take robots out of the factory and into open-world environments, they very quickly realize that in the real world there's a huge range of things that can happen. And it's very, very hard to get a system that is highly specialized and still robust enough to work in the real world. I don't know if it was the first time this lesson was learned, but an early instance of it was actually autonomous driving. In the very early days of autonomous driving, some people thought that, well, okay, driving is complicated, there's lots of stuff that could happen, but if we just instrument the road and prevent other things from getting into the autonomous car lane, maybe then things will work. Like, we'll install magnetic sensors and all that stuff. And that really never took off, because essentially the gap between a closed world and an open world is enormous. And you can't be just a little open-world. As soon as you're out in the wild, immediately stuff can happen. Maybe it happens rarely, but that doesn't save you. Even if it happens rarely, you have to deal with it. So here's an example from our work. We had a project on using our systems to assemble boxes. So you take a flattened cardboard box and you have to build it up, fold it, and it's a little bit like an origami problem, basically. You'd think that's pretty structured. It's one thing, just build a box. But sometimes you grab the boxes off the pile and you get two boxes instead of one, so you have to put one back. And maybe something is torn a little bit, so you have to discard it because it's torn. And maybe someone left their phone on the table, so you have to put the phone away.
So there are just so many things that happen once you cross that boundary from the factory to the real world that you can't actually have a narrow specialist. Even if what you want is to do one thing, you have to handle all these other things that can happen all around. And the generalist model, the one that can handle a wide variety of tasks, actually becomes a better specialist because it can deal with all that weird stuff that arises. So I think that generality is really essential, even if you really want to do one thing. And I think that's a lesson that's been learned time and time again in robotics.

B (52:48)

So I have two things I can say about this. The first is in regard to demos. I think you're right that there's a big difference between setting up a demo and setting up something that works. And, you know, one thing that I've struggled with in my career quite a bit is that I work on robotic learning, and the purpose of robotic learning is to get systems that can generalize and work in open-world environments. But that's very hard to illustrate. If you show somebody a demo in a particular setting and the robot does something cool, then yeah, it's obvious there's cool stuff going on. If you show it doing something fairly simple, but in a hundred different environments, well, then each of the videos of that is just a robot doing something simple. So the fact that it can do it in all these different settings is harder to convey. And that's kind of a science communication challenge that I think we have to be cognizant of when we look at the videos. But I think there's a deeper, more technical thing I want to say about humanoids and about robots in general, which is that conventionally, when somebody thinks about building a really cool robotic system, they naturally start with building the physical robot. So if you start from the premise "I'm going to build one robot," it makes sense. If it's going to do something general, it should be a very general body. And you should really get that right, because once you've committed to that form factor, you're kind of stuck. So I think there is actually a technical point here, which is that the old way of thinking about robotic software as something that drives one robot kind of naturally leads to that. Because if you want one very general robot, then you kind of have to have it have everything.
But if you accept the premise that we're going to have robotic foundation models that can drive lots of different robots that do lots of different things, now it kind of unchains your thinking. And now it's okay to experiment with different form factors, experiment with different applications, different ways of approaching things. It's just that you need this kind of general AI system to be able to do that. And without it, it actually kind of makes sense that you would go for this single, ideal, perfect body plan. And maybe a humanoid is a good choice for that.