A
A lot of startups these days say they're building foundation models for robots. What does that actually mean for a non technical listener?
B
I think every time that people try to take robots out of the factory and into open world environments, they very quickly realize that in the real world there's a huge range of things that can happen.
A
But since then there's been this explosion of humanoids and everyone's talking about humanoids. I mean, did I get it wrong? Do you think that humanoids are much closer to being in the world?
B
My name is Sergey Levine. I'm one of the founders at Physical Intelligence. I'm also a professor at UC Berkeley. And what I work on these days is algorithms for reinforcement learning for optimal decision making, as well as applications in robotics. And something that I've been very interested in lately in particular is robotic foundation models. These are general purpose models that control any robot in principle to perform any task. And I think we've seen some pretty dramatic transformations in the last few years in the capabilities of these kinds of generalist robotic systems, where we can use very diverse data sources from many different robotic platforms, performing a wide range of different tasks, and acquire a kind of general physical understanding from these data sets that then makes it much more feasible to rapidly acquire effective and robust and highly generalizable robotic skills. So this is something that I've been very interested in the last few years and I think it's an area where we see a lot of progress.
A
Yeah, and a lot of startups these days say they're building foundation models for robots. What does that actually mean for a non technical listener?
B
Yeah, it's actually a surprisingly nuanced question, because after the success of ChatGPT, the term foundation model became obviously very much a buzzword. So in some cases it's almost synonymous with saying, I have a good model, it's a foundation model. But I think that insofar as there's a consistent definition, it's something like this: the principle behind language models, vision-language models, things like this, is that you can use very large and diverse data sources that are not necessarily of extremely high quality. Like it might be just data harvested from the web. And you get a model that digests all this data and acquires a kind of broad and general understanding of how the world works. And this kind of understanding is not enough to be an expert, it's not enough to be extremely proficient, but it gives you that kind of basis of common sense. And that's why the term foundation model was coined by Percy Liang and his colleagues at Stanford, for precisely this reason. Because this kind of broad basis of knowledge gives you a foundation on top of which you can then put other things. And the key thing about the foundation is that in order to be useful, it needs to be broad. Insofar as there's a deep technical insight here, the insight is that if you need such a large amount of data, it's almost impossible to get in a single domain. But if you are willing to use data from many different sources, maybe all of the text data on the web, all of the image data you can get your hands on, or in our case, data from all of the robots that we've seen, that could be big enough and that can give you that foundation. And on top of that foundation, now you can build up individual skills. In the case of language models, you can fine tune them for expert level computer programming. 
In the case of our robots, we can fine tune them for, you know, assembling things or making coffee or cleaning the kitchen with very high quality data, but a much more limited amount of it. Because once you can put it on top of that foundation, you don't need a huge amount of data of very high quality for the downstream tasks. So that's the really important thing. Now, to come back to your question, you said there are many startups, many organizations that are building foundation models. I think one of the most important things about a robotic foundation model is to answer the question, where does all of that really broad and diverse data come from that can establish that foundation? This is a place where the answer is very, very delicate. The strategy that I think will be most successful here is to not be too picky. In the same way that language models are trained on all the text data that can be mined from the web, a robotic foundation model should be trained on all of the embodied data that we can get our hands on. And it's one of those things where once you cross a certain threshold of scale, it actually becomes easier to incorporate other data sources. So if you want really, really high quality data of like one particular high end humanoid robot, that's pretty challenging because now you're very constrained. You have to have that system, you have to get good data for it. You have to figure out how to teleoperate it, put it in the right environments and so on. But if you're willing to pull in everything, then you can pull in lots of robot data from many different kinds of robots. Some of them might be good, some of them might be bad. Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body. So this kind of diversity actually makes it easier to include other data sources. So go ahead.
A
No, go ahead. No, no, no, go ahead.
B
I was going to bring this back to your question for the final conclusion, which is how to actually answer what you asked me. I think a big difference between how Physical Intelligence is approaching this and how most other research labs approach this question is that we are not being very picky about which robots we use for this. We're bringing in everything and trying to build this very broad foundation.
A
Yeah, and to build that model, I mean, to collect that data, there are various strategies. One is simulation. And as I recall from our past conversations, you're not a great fan of simulation, but I just saw 60 Minutes a little while back and they had, I'm trying to think, it was Boston Dynamics in a Hyundai factory, and they were showing the simulation of sort of an endless army of Atlas robots performing tasks. Why are you not a fan of simulation? That's one. And then two, you're using vision-language-action models, is that right? And can you talk about how that differs from other strategies like world models or other kinds of foundation models?
B
Yeah, yeah, for sure. So in regards to simulation, what I would say is this: simulation is a very appealing tool for very easily acquiring lots of data of a robot doing all sorts of different things. But it's not a very appealing tool for getting experience of very diverse environments and very diverse objects. So if we look at the domains in AI where simulation has been successful and the domains where it has struggled to get adoption, computer vision is an area where simulation has actually been used very little, despite a lot of attempts. Why? Well, it's not actually because rendering images is hard. In fact, computer graphics is very, very advanced, so we can render very realistic images. It's that getting real images is so much easier. Right. Like you can just take a camera and go and photograph stuff and you get lots of real images. And I think that in robotics, a kind of mental trap that people sometimes fall into is that they say, well, maybe in robotics it's hard to get data. It's not as easy as taking a camera, going out and taking pictures. And I think this is a little bit of a mistake because actually, if you're serious about building general purpose robots that'll go out into the world and do lots of things, the kind of boundary condition is in your favor, meaning that the better you get at building generalist robots, the more robots there are and the more data there should be coming in. There's a little bit of an initial activation energy problem where you have to get over that hump to get enough systems out there, but that's a transient period. Once you get over that, then you have lots of robots out there and lots of data coming in. So to me it actually makes a lot more sense to pay a little bit more of that upfront cost to kind of force it over that threshold and then get lots of real world data. That doesn't mean that we shouldn't use simulation at all. 
It just means that we shouldn't worry so much about how hard it is to get robot data. For robot data, we should treat that as the industrial problem that it is. Get robots out there, get the data coming in, and then simulation can be very useful for addressing other edge cases. For example, you can simulate, you know, this is what the autonomous driving folks do all the time. You can simulate cases that you don't want to experience in the real world. Like you can simulate a car collision, but you don't want to experience a car collision in your car. So there's a lot of these kind of edge cases that you might want to take care of with simulation. But I don't think we should think of it as a substitute for real experience, because in other areas where diversity has been critical, like computer vision and natural language processing, real data has been essential to get there. And further, once you have that real data, it's actually easier to incorporate other data sources.
A
Yeah. Although couldn't you use simulation, as I mentioned, with the Boston Dynamics robot, to get over the hump and then start collecting real life data?
B
Yeah, that's a very good question. I think you could. I think in practice in robotic manipulation, we found it to be a lot easier to do that with real data. Part of it has to do with the fact that simulation is very good for simulating the robot. It's much less good for simulating everything else, because the robot, you only have to model once, whereas if you want to have like thousands of scenes, thousands of objects, then you have to go and model and simulate each one of those. I don't think that's impossible. I think it is a tractable problem potentially, but it's just more costly than just going out and getting lots of real stuff.
A
Yeah. And, you know, there's been this explosion of humanoids. Yours are not necessarily humanoid. I mean, yours are kind of platform agnostic, is that right? But on the humanoids, Joanna Stern at the Wall Street Journal last year invited NEO into her home, and it didn't perform very well. And apparently NEO is available for home use, but it comes along with a teleoperator that spends time in your house, like walking around doing something, and that's to collect that data to train the robot. I mean, they need a lot of these deployments to collect enough data. That doesn't seem like a very scalable solution. So how are you guys approaching it? Are you doing any teleoperation?
B
Yeah. So this is a very good question. And in fact, in some ways, this is like kind of the big question in modern robotic foundation models, which is, what kind of data can you use? Now, I think it makes a lot of sense to start off building the initial foundation with teleoperation data. But as you pointed out, there are major challenges with this, like, you know, not the least of which is that if you actually want to collect data in deployment scenarios, whether it's somebody's home or a business or a factory or a warehouse or whatever, like, that's an additional kind of inconvenience that you have to deal with, and it's a barrier to scale. I think that the right way to proceed with this is to think of it as a mixture of different data sources, where as your model gets better, it should be able to leverage more accessible and more scalable data sources. So initially, maybe when the model is not very good, we need data from teleoperation from humans. That basically illustrates, like this, how the robot should act. But once the model gets better and the robot can be deployed with at least some degree of autonomy, then we can handle more accessible sources of supervision. One more accessible source of supervision is instructions. This is something that we actually found, not entirely intentionally, like, we were just kind of like, you know, trying out a few things in some of our research projects. But we found that we could actually get improvement in our policies by supervising the robot, essentially through language. And this only started happening once the model became powerful enough that the low level skills were already pretty good. Then you could correct the robot and say, like, maybe it's cleaning up the kitchen it messed up. You could say like, oh, you needed to pick up the plate and make sure you put the plate in the sink. 
And the way the model works internally is it's very similar to how these modern reasoning models, LLMs, work, where there are internal thoughts that are generated and the final action is chosen based on those thoughts. So essentially this kind of language feedback supervises the internal thoughts rather than the low level actions. But once the low level actions are good enough, actually supervising the internal thoughts already gives the robot a lot of learning signal and can improve a policy without direct teleoperation. But again, this only emerges once the base model is strong enough. The other thing we can do is leverage autonomous experience, where we can improve the system through reinforcement learning. So there was a research project that we, we actually published just a few months ago that describes a reinforcement learning system that we built on top of our foundation model. And again, it's the same story that you need the foundation model to be strong enough so that from there it can improve with autonomous experience and reinforce.
A
And one time, as I recall, you had, I didn't see it, but you were setting up ranks of robotic arms and they were going to be sort of informing each other. So any arm that learns something, then that learning would be transferred to all of the robots. Is that kind of scale useful? I mean, are you doing that with.
B
Yeah, yeah, that's exactly right. So the big benefit of having robotic learning rather than human learning here is that you can have this fleet effect and you can do collective learning very effectively. So all of our robots share all the experience. And again, it comes back to this idea that the stronger the base foundation model is, the more readily it can incorporate experience from diverse robotic platforms. So the experiment that you're referring to was done at this point maybe about eight years ago, the Google arm farm project. There, every single robot platform at every single robot station was as close as possible to the others. They were virtually identical. And that worked very well with the state of the art learning technology of, you know, 2017, 2018. But these days, when you have a foundation model that can accommodate very diverse platforms, very diverse tasks, very diverse environments, the collective learning stuff becomes much easier because now we can pull in data from many of our own robots that are deployed. We can also work with other companies that have their own robotic platforms. And in fact, initially when we started this, we thought that we would need to do something very special to accommodate this, like maybe we can somehow tell the model, here's the morphology you're controlling, here are the details of this robot platform. It actually turns out that very little cleverness is required in that respect. The model can figure out from looking at the camera image what kind of robot it is dealing with. A lot of careful engineering is still needed to make sure that training is efficient, to make sure the model is set up in the right way. But handling the cross embodiment aspect of this turns out to actually be pretty straightforward.
A
Yeah. And across form factors as well, is that right? Not only tasks.
B
That's right. Other companies we've worked with have actually adapted our models for controlling multi fingered hands, humanoids, that sort of thing. In fact, some of them use them for mobile robots, things like agricultural equipment that we wouldn't conventionally think of as robots in the usual sense.
A
Yeah. Wasn't there a project that you were involved in that, that brought together data from all around the world, from all various robotics labs, and what was the outcome of that? Or is that ongoing?
B
Yeah. So what you're referring to, I think, is the RT-X project, which in many ways was actually part of the impetus for starting Physical Intelligence. So this was a project that we did in 2023, and myself and many of my colleagues that worked on this then went on to found Physical Intelligence. The RT-X project was very much an academic research project, but what we did is we contacted academic research labs, about 30 labs in total, and we asked them to basically send us the data from their robotic manipulation experiments. And we limited this to single arm robot manipulators with parallel jaw grippers, just to pick kind of the most common form factor. And then what we did is we trained one model across all of these different data sets and we sent back that model to some of the labs that had donated data and asked them to essentially evaluate it in comparison to whatever they were developing on their own robot for their own application. So each lab was doing a different research project with a different robot and a different task, and they had their own methods that they were developing. And we just said, whatever is the best you've got, just measure that against our generalist model. And what we found is that the generalist model on average was about 50% more successful than whatever each individual lab was developing, and that's really, really exciting because this is kind of paralleling a lot of the development that we've seen in language models. With language models, the big result, scientifically, wasn't actually ChatGPT. The scientific result that was so exciting is that the generalist language model could outperform specialized models for machine translation, sentiment analysis. You know, all these NLP tasks that typically would require very specialized data sets and very specialized models could be done better with this more general model. 
And what we saw with RT-X was an early hint that something like that was actually happening in robotics. And I think that's actually really important.
A
Yeah, but in RT-X, you're using exclusively robotic arms. Does that data inform other form factors, or, if you're training humanoids, do you need to collect data across various different humanoid platforms?
B
Yeah, so the current models we have at Physical Intelligence are trained on many more robots, and they do vary in morphology. Now, I should say that generalizing to an entirely new morphology that was not seen in the data, that is still kind of at the bleeding edge of current research. So we're not really that concerned with that. We're concerned with the case where someone has a robot, they have some data from that robot, but what they want is to benefit from transfer from other robotic platforms. You know, here's an anecdote that I can tell you that maybe underscores this point. In the first year or so of the company, this was in 2024, we worked almost entirely with static arms. So these were not mobile robots. These were arms that were attached to a table and they were doing manipulation tasks. And then in early 2025, we decided that we wanted to start experimenting with mobile robots. So these are basically arms on a wheeled base. And we had very few mobile robots, so we could only collect a little bit of data with them. And our first kind of publicly released research project on this, which came in April 2025, used a training set of which only 3% of the data was collected on mobile robots. So 97% was from these statically mounted arms. But we could get the mobile robots to actually generalize very broadly. They could go into a home that was never seen in the training data, clean up the kitchen, put away the dishes, that sort of thing. And most of the knowledge in these models came from not the mobile robot, but these static arms bolted to a table. So that kind of underscores the power of this kind of cross embodiment learning, where you can use lower cost, more accessible platforms to get the bulk of your data and then adapt the model to a downstream morphology.
A
Yeah, that's fascinating. Although I imagine when you move to bipedal or legged robots, that data will help in the robot arm manipulation, but not necessarily in mobility. So this is kind of an uninformed question, but when you see teleoperation, what is happening there? And what kinds of things are robots learning? Is that imitation learning? Is that something else?
B
So the standard kind of default way to use teleoperation data is imitation learning. But this is, I think, a place where there's a lot of room for improvement in current research. And we've started studying this a little bit. There's other research groups that are studying this. Basically, what you would like to do with data, ideally, is not just copy it blindly, but actually understand dynamically which parts of what is being demonstrated are good and which parts are not so good. Basically, instead of using the data to answer the question of how should I do this? Use the data to answer the question of what is possible? And then among the things that are possible, pick, like, the best things. And that's basically where a lot of the ideas in offline reinforcement learning can come in. Roughly speaking, the way that this works is instead of supervising the model to produce the same actions that are in the data, what you do is you supervise the model to predict the outcomes. So you train the model so that it can predict, like, if I see this and I do this, will that be good or will that be bad? And if you can do a really good job predicting those outcomes, then you can tell the model, okay, now do whatever will lead to the good outcome. And that's actually potentially a very powerful tool because now you can bring in heterogeneous data of different quality, and now the variety of data quality actually becomes a blessing rather than a curse. Because if you see lots of good things and lots of bad things, then you can figure out how to distinguish good from bad and do better at test time. So this is kind of where a lot of the current, like, research is situated.
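The contrast described here, between copying the demonstrated actions and predicting outcomes and then choosing the action with the best predicted outcome, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the logged data, the observation and action names, and the use of a simple per-pair average as the "outcome model". It is a one-step simplification, not a full offline reinforcement learning algorithm and not the method used at Physical Intelligence.

```python
from collections import defaultdict

# Hypothetical logged data: (observation, action, outcome), where outcome is
# 1.0 for success and 0.0 for failure. The data is deliberately mixed quality.
LOGGED = [
    ("plate_on_counter", "grasp_edge", 1.0),
    ("plate_on_counter", "grasp_edge", 1.0),
    ("plate_on_counter", "grasp_center", 0.0),
    ("plate_on_counter", "grasp_center", 0.0),
    ("plate_on_counter", "grasp_center", 1.0),
    ("plate_on_counter", "grasp_center", 0.0),
]


def imitation_policy(data, obs):
    """Copy the most frequently demonstrated action for this observation."""
    counts = defaultdict(int)
    for o, a, _ in data:
        if o == obs:
            counts[a] += 1
    return max(counts, key=counts.get)


def fit_outcome_model(data):
    """Estimate the average outcome for every (observation, action) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for o, a, outcome in data:
        totals[(o, a)] += outcome
        counts[(o, a)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def outcome_policy(model, obs):
    """Among actions seen for this observation, pick the best predicted one."""
    candidates = {a: v for (o, a), v in model.items() if o == obs}
    return max(candidates, key=candidates.get)


model = fit_outcome_model(LOGGED)
print("imitation picks:", imitation_policy(LOGGED, "plate_on_counter"))
print("outcome-based picks:", outcome_policy(model, "plate_on_counter"))
```

In this toy data the more common action is also the less successful one, so blind imitation and outcome-based selection disagree. That is the sense in which varied data quality can help: the failures are what let the model tell good from bad.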
A
I see. How does. I mean, does teleoperation. Is that using a VLA model in the background? Is that collecting data for vision language action?
B
That's right. So our models are based on vision-language-action models. And this has kind of become essentially a de facto standard in robotic learning research. This is something that many of the folks on the team here pioneered back in the early 2020s, but now it's basically what everybody uses. VLAs are kind of an interesting thing because initially, the early, what I refer to as first generation VLAs, they were trained in a very straightforward way. Basically, vision-language models are models that answer questions and they can also take in an image, so they answer visual questions. Early VLAs were trained by basically taking this visual question answering paradigm and simply turning robotic control into a visual question. So in robotic control, the question is the prompt, like, pick up the socks. And the answer is the numerical value of the actions. That's a fairly straightforward, naive way to cram robotic control into a format that vision-language models can understand. But there's a lot more that we can do than just that. And there are, broadly speaking, two big buckets where there is room for improvement with VLAs: one is dexterity and the other one is reasoning. So dexterity means go beyond treating actions as an answer to a visual question and actually develop a model design that handles dexterity first and foremost. So control is not a discrete thing. It's not an answer to a question. It's a continuous thing, it's a trajectory. So you can use models that are very well adapted to high dimensional continuous dynamical systems. Diffusion models are really good for this. So incorporating diffusion models into vision-language models can give you these kinds of much more dexterous VLAs. The second thing is the knowledge that is learned by language models and vision-language models from the web. A lot of that knowledge is semantic, it's not physical. So a big way to improve VLAs is to better hook into that semantic knowledge. 
Essentially when the robot doesn't know what to do, what it should do, much like a person, is it should pause and think. And that thinking maybe taps into more semantic knowledge that is not yet fully grounded in the physical world, but can lead to reasonable inferences. So maybe it's trying to open a drawer to take out a knife to cut a vegetable, and the drawer isn't opening. Now, maybe the robot's experience is not enough to inform it what to do, but there's a reasonable semantic inference. You can say, well, why isn't this thing open? Maybe I should try a different one. There's kind of this common sense inference you can make, and if you ask, like, ChatGPT, it can make that common sense inference and it will tell you something. And the trick then is to digest that inference into a format that the motor control component of this can understand, and that's basically a thinking process. So that's where I think there's room for a lot of improvements for these models, and we've seen a little bit of that in some of our work on chain of thought. And I think it's where we'll see a lot of future developments.
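The first-generation VLA recipe described above, where the prompt is the question and the numerical action values are the answer, can be sketched as a simple tokenization step. This is a minimal sketch under stated assumptions: the 256-bin resolution, the normalized action range of -1 to 1, and the function names are all hypothetical choices for illustration, and the camera image that a real model would also receive is left out. It is not presented as the scheme used by any particular model.

```python
NUM_BINS = 256           # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0    # assumed normalized range of each action value


def action_to_tokens(action):
    """Map each continuous action value to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens):
    """Map bin indices back to the center value of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


def make_training_example(instruction, action):
    """Format one control step as a text question and a text answer."""
    answer = " ".join(str(t) for t in action_to_tokens(action))
    return {"question": instruction, "answer": answer}


example = make_training_example("pick up the socks", [0.12, -0.5, 0.98])
print(example)
decoded = tokens_to_action([int(t) for t in example["answer"].split()])
print(decoded)
```

The round trip loses up to half a bin width per dimension, which is one concrete way to see the point about control being continuous rather than discrete, and why a trajectory-oriented output such as a diffusion model can be a better fit for dexterous tasks.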
A
Yeah, you know, I had Fei-Fei Li on recently talking about world models, and in the past I've had Yann LeCun on several times talking about world models. Where do world models fit into this stack? Because world models, as I understand it, certainly from Yann LeCun's point of view, it's building an internal representation of the world that then can be used to predict futures or reason through problems. If you don't have that, it seems that it would be much harder to build a robot foundation model that can generalize and operate in the wild. I mean, right now in your lab, in a lot of the industrial deployments, they're very controlled environments and the robots' training is very task specific. But if you want a more generalized model, where do world models fit in?
B
Yeah, it's a really interesting question. I guess we should nail down what we mean by world models. Typically what people mean when they say world models is some kind of predictive model that operates at the level of raw observations. It doesn't necessarily mean that it predicts raw observations. Yann LeCun, for example, I know he advocates for essentially a latent space world model, which predicts a sufficient statistic of observations. But roughly speaking, it's something that predicts something about your future observations. It's a very reasonable idea. But I think that something that we should keep in mind is that for human behavior, prediction definitely plays a role. But there are also things that we do that are not grounded entirely in prediction. There's a place where prediction is easy and there's a place where prediction is difficult. And the abstractions that we use are really critical to intelligent behavior. So to give you an example, if I want to figure out how to get from where I am now in San Francisco to New York City, maybe I'm going to imagine something about it. Maybe I imagine how I get my car keys and I get in my car. I might imagine that I'm taking an airplane. But the further out that I think about this, the more abstract that imagination becomes. I'm not imagining exactly what my seat in the airplane is going to look like. So abstractions are really key to actual effective world modeling. And I think at some level, what language models do, what vision-language models do, and what video prediction models do and other kinds of world models is not actually that different. They're just operating with different abstractions. And I think for a real, capable, embodied intelligent system like a robot, we'll need many different abstractions. 
And I suspect that once we figure out how to use abstractions in general like that, that general part of the question is actually the important one. And then we'll use kind of the right kind of thing at each level and that'll be fine. So I guess what I would say is that like, definitely I'm very sympathetic to the world models view, but I also suspect that the dichotomy between language models and what people today call world models is not as large as some people might see.
A
Yeah, so with a good world model, there isn't anything that a robot couldn't do with being trained with a vision-language-action model. It's just. Yeah, go ahead.
B
I suspect the key is to have a system that correctly uses the right abstraction for the job. So if you are, you know, folding a piece of clothing, probably, you know, that's clearly not a semantic task. You're not thinking in words about like, oh, this fold is here, this fold is there. But you're probably also not imagining literally how all the particles of clothing move. At some level. You're, especially if you're good at the skill, you're mostly kind of using your muscle memory a little bit, but it's reactive, it's using perception, but it's not as model based as just like imagining exactly how every particle will move. So the really cool thing about proficient human motor skills is that they blend prediction and this kind of model free behavior and semantic reasoning. It all kind of comes together with the right tool for the job at each level. And I think that kind of blending is really, really important.
A
So do your models actually imagine possible futures before acting?
B
Right, so the way that our models work is they actually perform inferences at different levels of abstraction. I think there's still a lot of research to be done to figure out exactly how that should be done and what abstraction should be used and where. Right now I would say that our choice of abstraction is quite naive, but you can reason semantically about higher level things. Like if you're cleaning a kitchen, do you pick up the plate or do you pick up the towel first? You can figure out where things go spatially and then you can produce the actions. So there's already a little bit of that multiple abstractions. But I think there's a lot more research that needs to be done to get the abstractions to emerge automatically and to automatically figure out the right abstraction for any given stage of a task.
A
Yeah, so world models, in your view, are not necessary for generalized robots that can operate in messy environments?
B
I guess what I would say is not that it's necessary or not necessary, but that it's not as much of a dichotomy as I think some folks tend to present. Like, I think the view that I would disagree with is that there are model free policies, reinforcement learning, VLAs and world models, and these are totally separate things. I actually don't think they're separate things. And I think that the right answer will be a model that can do all of those things together, that can use world model like prediction when it's necessary and can do the model free stuff when that's more appropriate and smartly decides what kind of abstraction is the right abstraction to use for a given problem.
A
Yeah, your models, the AI brain, so to speak, is not on the robots. It's communicating with them remotely. But when you send robots out into the world, you really can't afford, as with autonomous vehicles, for them to be waiting for instructions from the cloud. So how much of the models that you're building can be on device?
B
Yeah, that's a really interesting question. So, you know, so far we're still very much in the research and development phase. We haven't had to worry about this very much. And you're completely right that currently the models actually live in the cloud. They actually run through an inference API that looks very similar to what someone might imagine using for an LLM. But in the long run, I think you're right that there needs to be an on device component that is reliable, that is not vulnerable to connectivity issues. Generally, I think that the way to move towards this, which I think is already reflected in current models that we and others have been developing, is to have a system that performs multiple types of inferences in parallel at different levels, where the highest levels are maybe more appropriate to offload to a remote inference server. And the lowest levels, the ones that are really doing motor control and closing the loop very tightly on perception, run locally. Now the good news is that the lowest levels are probably also going to be the smallest ones in terms of the number of parameters, because, you know, you can sort of think of these as like instinctual reactions, reflexes, that sort of thing. These are things that are very important, they need to be very fast, but they also are not as cognitively demanding and not as complex, so they can run locally. Now the trick of course is figuring out all that communication so that you still preserve the benefit of end to end training. So this whole thing is trained together to act in concert, but at inference time can be partitioned in this way with different size components running in different places. And it's kind of cool to imagine, like, you know, you have good Internet connectivity, you get a lot of intelligence. Your Internet connectivity starts to degrade, okay, maybe the robot gets a little dumber.
Like it has to, you know, maybe pull down some nice inferences from the cloud, keep them locally and do some stuff and then, okay, now, now it's time, let's stop and think again. Let's ask the cloud to think some more.
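As a rough illustration of the partitioning described above, here is a minimal Python sketch in which a high-level planner is queried remotely only when the connection is up, its last answer is cached, and a small low-level policy always runs locally. The names `Plan`, `TwoLevelController`, `remote_planner`, `local_policy`, and `max_plan_age`, as well as the "hold_still" fallback, are hypothetical and do not describe Physical Intelligence's actual system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    subtask: str       # high-level decision, e.g. which object to handle next
    created_at: float  # time (seconds) at which the remote planner produced it


class TwoLevelController:
    """Remote high-level planning with a cached plan; local low-level control."""

    def __init__(self,
                 remote_planner: Callable[[str], str],
                 local_policy: Callable[[str, str], str],
                 max_plan_age: float = 5.0) -> None:
        self.remote_planner = remote_planner  # assumed to run on a server
        self.local_policy = local_policy      # assumed to run on the robot
        self.max_plan_age = max_plan_age      # seconds before a cached plan is stale
        self.plan: Optional[Plan] = None

    def step(self, observation: str, now: float, online: bool) -> str:
        # Refresh the high-level plan only when the connection is available.
        if online:
            self.plan = Plan(self.remote_planner(observation), now)
        # With no plan, or a stale one, fall back to a safe default subtask.
        if self.plan is None or now - self.plan.created_at > self.max_plan_age:
            subtask = "hold_still"
        else:
            subtask = self.plan.subtask
        # The fast low-level loop always runs locally, online or not.
        return self.local_policy(observation, subtask)


def demo() -> List[str]:
    ctrl = TwoLevelController(
        remote_planner=lambda obs: "pick_up_plate",
        local_policy=lambda obs, subtask: f"{subtask}:{obs}",
        max_plan_age=5.0,
    )
    return [
        ctrl.step("frame0", now=0.0, online=True),   # fresh plan from the server
        ctrl.step("frame1", now=2.0, online=False),  # offline, cached plan still valid
        ctrl.step("frame2", now=9.0, online=False),  # offline, cached plan too old
    ]
```

In `demo`, the robot keeps executing the cached subtask for a short while after connectivity drops and then falls back to a conservative default, which is one simple way to read "the robot gets a little dumber" when the link degrades.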
A
So right now you're not concerned about shrinking models to fit on device. The first step is to get a generalized model and then you'll do the distillation or whatever's required to get it on the device. Is that fair?
B
So that's right. But even separately from any inference concerns, we've sort of had, I guess, convergent evolution in some sense. We already have models that have this kind of multi scale property where the lowest level motor control components are already smaller. So while we're not so worried yet about running things on device, the natural trajectory of that technology is leading to a place that makes that actually reasonably straightforward.
A
Yeah, so I have some questions about generalization, but before I get to that, I get very confused about all the different ways that people are training robots. So there are vision-language-action models, but, you know, there is one shot, two shot, few shot, vision-language-action, world models. Can you sort of give me a little primer on the different ways that people are training robots and why you've settled on VLAs and RL?
B
Yeah, yeah, let me think about how to best describe this. So I think it's often hard to get a complete picture of the robotic learning world just by looking at the kind of results that people present. There's kind of one general truism of the robotics demo, which is that robotics demos can be set up in such a way that shows something really cool, but doesn't actually provide a general solution to a problem. Because if you want to just stage a robot demo, you kind of make things work in that one setting. So because of that, it's a little hard to figure out what are the big clusters of major effective techniques. But if I were to very soberly look at the current robotic learning environment, I would say that there are actually two big things that work very well. One thing, which we've been discussing, is vision-language-action models, and more generally this idea of learning manipulation skills and things like that from data. Typically it's with imitation learning, but it can also be with reinforcement learning insofar as it can leverage that data. And oftentimes actually the recipe is: use imitation learning to initialize and then maybe fine tune it with RL. And then the other big cluster is sim to real transfer, using simulation to learn motor skills in a sufficiently randomized way and then run them in the real world. And when you see demos of, you know, robots dancing and doing acrobatics and all that stuff, that's typically doing sim to real. And it's kind of a funny unstable equilibrium in the current research world that these two types of techniques are very different and they are also used to attack very different domains. So the vision-language-action models, they're basically currently the dominant paradigm for robotic manipulation problems where robots need to interact with diverse environments and diverse objects.
The sim to real stuff is the method of choice for highly acrobatic and athletic movements, typically for humanoids. And these are situations where the task is physically extremely demanding, but the diversity of the environment is very low. So if you put the robot on stage and it's dancing, you're not really worried about the shape of the stage. You're just worried about the robot's body. So this maybe tells us actually a lot about the strengths and weaknesses of these two types of approaches: the sim to real stuff is great for really understanding the physics of the robot, but not so great for generalization. The vision-language-action models are great for generalization, but of course, because they're dealing with the physical world, with real world data, they can't run these giant RL loops that practice for billions of trials and really overfit to the particular body of the robot. Now, of course, there's a lot more to do in the future, but this hopefully gives you some sense for the layout.
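The phrase "in a sufficiently randomized way" usually refers to what is commonly called domain randomization: each simulated episode draws different physics parameters so the learned skill does not overfit to one simulator configuration. The sketch below shows only the sampling step. The parameter names (`friction`, `mass_scale`, `motor_latency_s`) and their ranges are illustrative assumptions and are not taken from the conversation.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SimParams:
    friction: float         # ground friction coefficient (assumed range)
    mass_scale: float       # multiplier on the robot's nominal link masses
    motor_latency_s: float  # simulated actuation delay in seconds


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized physics configuration for a training episode."""
    return SimParams(
        friction=rng.uniform(0.5, 1.5),
        mass_scale=rng.uniform(0.8, 1.2),
        motor_latency_s=rng.uniform(0.0, 0.03),
    )


def randomized_episodes(num_episodes: int, seed: int = 0) -> List[SimParams]:
    """One parameter draw per episode; a fixed seed keeps the run reproducible."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(num_episodes)]
```

A training loop would reset the simulator with a fresh `SimParams` before every episode. Note that this randomizes the robot's own physics, which fits the point above that sim to real is strong on the robot's body and weaker on environment diversity.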
A
And with VLAs, just walk me through the process of where the vision comes in, where the language comes in to get to the action.
B
Yeah, so maybe one way I could describe this is to start with vision-language models. These are also sometimes called multimodal LLMs. So if you use Gemini or ChatGPT and you upload an image and you ask some questions, that's basically a VLM. And the way that these models work today is you start with a language model. A language model is very, very simple. It's a transformer that takes in text and predicts future text. And to get these things to process images, what we do is we train another little piece of neural network that takes an image and puts it into the same space as the language tokens. That's sometimes called a vision encoder. So now you can feed images to this thing the same way that you feed language, because this little virtual visual cortex kind of takes the images and turns them into semantic looking thingies that the language model knows how to process. So now with VLAs, well, with first generation VLAs, as I mentioned, people basically just took exactly this model and just changed the output to literally output numbers in text that represent actions. The second generation VLAs, which is what everyone's basically using today, take inspiration from how VLMs add a visual cortex to the language model, and they also add a kind of virtual motor cortex, a specialized little piece of circuitry whose job is to take the outputs from the language model backbone and decode them into continuous actions. And this is typically done with diffusion. Basically the same kind of technology that's used to generate images and videos is now used to generate trajectories of robot joints. That makes a lot of sense because trajectories of robot joints are a continuous spatial object, the same way that images are continuous spatial objects. So it makes sense that the same technology would be applicable there. And I think this is really cool because it's almost like what we're doing is building a brain piece by piece.
Like there is the language model backbone, which is, I guess, kind of like a prefrontal cortex almost. There's the little visual cortex part that encodes images, and now there's a little motor cortex part. Anyone familiar with biology would be very offended at this point, because it's backwards, right? Evolutionarily, the motor cortex comes first, then vision, then prefrontal. But here it's the other way around.
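To make the data flow concrete, here is a deliberately tiny structural sketch of the three pieces described above: a vision encoder that turns an image into tokens, a backbone that pools image and text tokens, and an action decoder that starts from random noise and iteratively refines a short joint trajectory. Everything here is a hypothetical toy. In particular, `action_decoder` only mimics the shape of a denoising loop by pulling noise toward a value derived from the context; it is not a trained diffusion model, and none of these functions correspond to a real library.

```python
import random
from typing import List


def vision_encoder(image: List[List[float]]) -> List[float]:
    """Toy 'visual cortex': one scalar token per image row (the row mean)."""
    return [sum(row) / len(row) for row in image]


def text_tokens(instruction: str) -> List[float]:
    """Toy text embedding: one scalar token per word (word length / 10)."""
    return [len(word) / 10.0 for word in instruction.split()]


def backbone(tokens: List[float]) -> float:
    """Toy backbone: pools image and text tokens into one context value."""
    return sum(tokens) / len(tokens)


def action_decoder(context: float, horizon: int = 5, num_joints: int = 2,
                   steps: int = 20, seed: int = 0) -> List[List[float]]:
    """Toy 'motor cortex': refine a noisy joint trajectory step by step."""
    rng = random.Random(seed)
    # Start from pure noise, as a diffusion-style sampler would.
    traj = [[rng.gauss(0.0, 1.0) for _ in range(num_joints)]
            for _ in range(horizon)]
    # Each step moves halfway toward a context-dependent target; this
    # stands in for a learned denoiser and is NOT real diffusion.
    for _ in range(steps):
        traj = [[a + 0.5 * (context - a) for a in row] for row in traj]
    return traj


def vla_step(image: List[List[float]], instruction: str) -> List[List[float]]:
    """Image + instruction in, short trajectory of joint targets out."""
    tokens = vision_encoder(image) + text_tokens(instruction)
    return action_decoder(backbone(tokens))
```

Calling `vla_step([[0.0, 1.0], [1.0, 1.0]], "fold the shirt")` returns a 5-step, 2-joint trajectory. The point of the sketch is only the shape of the pipeline: images and text share one token stream, and actions come out as continuous values rather than as text.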
A
Yeah. And then to get the continuous action, it's outputting a stream of numbers that are telling the actuators how far to move, right? And the control theory comes in at the end. Is that right?
B
Well, and I should be very upfront about this: the amount of control theory in this is pretty minimal. So currently the way that most of these systems work, including ours, is that the model outputs a short trajectory of future target joint angles. So roughly half a second of joint angles to hit. And the actual actuation, the motor commands to reach those joint angles, are computed with a very simple PD controller. This is kind of, you know, the most basic type of controller you can put on a robot, basically.
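The division of labor described here, where the model emits target joint angles and a very simple controller turns them into motor commands, can be sketched with a proportional-derivative (PD) law on a single joint. Reading the "very simple" controller as PD is an assumption, as are the gains, the unit inertia, the 50 Hz target rate, and the 1 kHz inner loop used below.

```python
from typing import List


def pd_torque(target: float, angle: float, velocity: float,
              kp: float = 100.0, kd: float = 20.0) -> float:
    """PD law: push toward the target angle and damp the joint velocity."""
    return kp * (target - angle) - kd * velocity


def track_trajectory(targets: List[float], inner_steps: int = 20,
                     dt: float = 0.001, inertia: float = 1.0) -> float:
    """Follow a sequence of target joint angles; return the final angle.

    Each target is held for inner_steps control ticks of length dt, so
    20 ticks of 1 ms correspond to an assumed 50 Hz target rate.
    """
    angle, velocity = 0.0, 0.0
    for target in targets:
        for _ in range(inner_steps):
            torque = pd_torque(target, angle, velocity)
            # Semi-implicit Euler integration of a single rigid joint.
            velocity += (torque / inertia) * dt
            angle += velocity * dt
    return angle
```

Under these assumptions, a half-second chunk is 25 targets (25 targets x 20 ticks x 1 ms), for example `track_trajectory([0.02 * i for i in range(1, 26)])`. The learned model never computes torques; it only needs to produce sensible joint targets.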
A
Yeah. You know, past robots were often built to do one thing really well, certainly before AI was involved. How realistic is it that one robot with one of these models controlling it can do many different things? I had a conversation with a guy, interesting guy, maybe you've run across him, Mike LeBlanc, I think his name is. He's got a startup called Foundation, and he's doing humanoids for the military. And they're training them to do one thing, like, you know, put an explosive charge on a door, which is a very dangerous thing for a soldier to do. So they drop it from the Humvee, it walks, slaps this thing on the door and comes back, hopefully in one piece. And that made a lot of sense, that rather than trying to build this generalized model, you just train it to do one thing. But how much generalization is developing from yours and other robots in the field?
B
So here's maybe how I can try to answer this question. I think every time that people try to take robots out of the factory and into open world environments, they very quickly realize that in the real world there's a huge range of things that can happen. And it's very, very hard to get a system that is highly specialized and still robust enough to work in the real world. I don't know if this lesson was learned for the first time there, but an early instance of this lesson was actually autonomous driving. In the very early days of autonomous driving, some people thought that, well, okay, driving is complicated, there's lots of stuff that could happen, but if we just instrument the road and prevent other things from getting into the autonomous car lane, maybe then things will work. Like, we'll install magnetic sensors and all that stuff. And that really never took off because essentially the gap between a closed world and an open world is enormous. And you can't be just a little open world. As soon as you're out in the wild, immediately stuff can happen. Maybe it happens rarely, but that doesn't save you. Even if it happens rarely, you have to deal with it. So here's an example from our work. We had a project on using our systems to assemble boxes. So you take a flattened cardboard box and you have to build it up, fold it, and it's a little bit like an origami problem, basically. You'd think that's pretty structured. It's one thing, just build a box. But sometimes you grab the boxes off the pile and you get two boxes instead of one. So you have to put one back. And maybe something is torn a little bit, so you have to discard it because it's torn. And maybe someone left their phone on the table, so you have to put the phone away.
So there's just so many things that happen once you cross that boundary from the factory to the real world that you can't actually have a narrow specialist. Even if what you want is to do one thing, you have to handle all these other things that can happen all around. And the generalist model, the one that could handle a wide variety of tasks, actually becomes a better specialist because it can deal with all that weird stuff that arises. So I think that generality is really essential, even if you really want to do one thing. And I think that's a lesson that's been learned time and time again in robotics.
A
And are you going to move into humanoids, or have you?
B
Yeah, that's a good question. So the main reason why we've stayed away from humanoids ourselves for the data collection that we're doing here is actually not ideological. It's just because humanoids are expensive and complicated, and teleoperating humanoids is harder. So we've basically stuck to robotic platforms that can give us the most data, the most data diversity. But I think humanoids are really cool. We've worked with other companies that have humanoids. The model runs on humanoids. And I think that it's something that I'm sure we'll want to explore more in the future. My own take when it comes to robot morphology, though, and maybe this is a little bit idealistic, is that I actually really hope that robots will kind of end up being a little bit like personal computers, where there's like general software and the form factor of the device can be very different for different jobs. So when you get a computer, maybe you get a laptop, maybe your computer is your cell phone, basically. Maybe it's a big desktop, maybe it's a big server machine. It's whatever is right for your job. And I think robots will be the same way. Maybe if you live in a small apartment in New York, you have a little home robot that kind of attaches to the ceiling and pivots around and cleans up. And maybe if you live on a farm or something, you have a big mobile robot, you know, a tractor with a bunch of arms attached that can drive around and do stuff. And maybe there'll be one software stack that can accommodate these things with different applications and prompt engineering that people do for particular domains. And there'll be this heterogeneity of platforms, like physical intelligence everywhere.
A
Yeah, well, how do your models work on humanoids? Because your models have not been trained on operating legs and walking.
B
Yeah. So typically right now when we work with folks that have humanoid robots, they have their own software stack that takes care of balance and all those things. And then the model drives basically the manipulation behaviors.
A
Yeah, and on what you were just saying, do you think we'll end up with an iPhone of robots that everyone uses? And maybe there are specialized robot bodies for different tasks. And do you think there will be a dominant model?
B
I don't know. But I think that one vision of this that's very appealing to me is that, you know, with robots, since we get to build them, we can build them however we want. And I think that's one of the advantages of robots, that they can be designed not to replace people, but to do things differently from people. And so you could design form factors that are suitable for particular domains. They can be much smaller, much bigger. They can have, you know, five arms, seven arms, one arm, whatever is appropriate for the cost point that you have, whatever's appropriate for your application. And people can be pretty creative about it. It's just that the thing that's been preventing folks from experimenting with these things is that if you want a robot to do anything, you have to solve the intelligence problem. And if there isn't a ready made, off the shelf thing that at least gives you a rough prototype, you just can't get started. So what I really hope is that good robotic foundation models will provide that kind of middle layer. In a computer, that layer is basically taken up by the operating system. The operating system is the thing on top of which you put applications. So if there's an intelligence layer that's pretty general, on top of which somebody can essentially prompt-engineer their task, they can experiment with all the aspects of that application, really nail it correctly, design the right form factor. Then we'll see a lot of experimentation, a lot of creativity, and then I think we'll actually see what robots should really look like instead of just thinking of them as like a metal version of a person.
A
Yeah, yeah. And on the humanoid question, I was asked to sort of write a brief for somebody late last year or mid last year about humanoids. And I wrote this very bearish thing about, you know, all of the different problems, from dexterity in the hand to models that can deal with unpredictable environments and things like that. But since then there's been this explosion of humanoids, and everyone's talking about humanoids. I mean, did I get it wrong? Do you think that humanoids are much closer to being in the world than I thought?
B
It's a good question. I mean, I think there's a good reason to be excited about humanoids, in the sense that just emotionally this really kind of captures the imagination. It also makes it easier for people to think about because we know what people can do. So if we kind of build a robot that does what people do, that kind of makes sense. But I do think that it's a somewhat limited view to restrict ourselves just to that. My sense is that humanoids will have a place. I think there will be humanoids that do something that people find useful, but I think there will also be a place for all sorts of other things. And as with any new technology, oftentimes what's most important for figuring out the right way for it to exist in the world is the tools and machinery and structures that allow people to experiment, to prototype, to try out all sorts of different things. And I think the trouble with robots is just that the barrier to entry for serious open world robotic systems is extremely high. Like, you have to solve open research problems just to get your prototype out the door. And I think that's kind of what's actually limiting a lot of creativity.
A
Yeah, and for example, with Boston Dynamics, in the past they were pure control theory, but now they have an AI partner and they're using AI models to control the robots. And as I said, there's been a lot of talk of them being used in factories. I think Hyundai is deploying them. And I understand the argument that, well, it's the human form factor. The world is built for that. And if we can have robots operate with that form factor, they can presumably eventually do everything that humans can do. How much of this do you think is hype? It's very hard for a layman when you watch videos, or as I said, the 60 Minutes piece, to know how real it is and how much are sort of fascinating demos where in two years, people say, oh, yeah, yeah, we had that pilot but it didn't work out.
B
So I have two things I can say about this. The first is in regard to demos. I think you're right that there's a big difference between setting up a demo and setting up something that works. And, you know, one thing that I've struggled with in my career quite a bit is that I work on robotic learning, and the purpose of robotic learning is to get systems that can generalize and work in open world environments. But that's very hard to illustrate. If you show somebody a demo in a particular setting and the robot does something cool, it's obvious there's cool stuff going on. If you show it doing something fairly simple, but in like a hundred different environments, well, then each of the videos of that is just a robot doing something simple. So the fact that it can do it in all these different settings is harder to convey. And that's kind of a science communication challenge that I think we have to be cognizant of when we look at the videos. But I think there's a deeper, more technical thing I want to say about humanoids and about robots in general, which is that conventionally, when somebody thinks about building a really cool robotic system, they naturally start with building the physical robot. So if you start from the premise "I'm going to build one robot," it makes sense. If it's going to do something general, it should be a very general body. And you should really get that right, because once you've committed to that form factor, you're kind of stuck. So I think there is actually a technical point here, which is that the old way of thinking about robotic software as something that drives one robot kind of naturally leads to that. Because if you want one very general robot, then you kind of have to have it have everything.
But if you accept the premise that we're going to have robotic foundation models that can drive lots of different robots that do lots of different things, now it kind of unchains your thinking. And now it's okay to experiment with different form factors, experiment with different applications, different ways of approaching things. It's just that you need this kind of general AI system to be able to do that. And without it, it actually kind of makes sense that you would go for this single ideal perfect body plan. And maybe a humanoid is a good choice for that.
A
Yeah. And you've made tremendous progress since I first met you. What's on the horizon? What are you guys working on?
B
So, for me, one of the things that I really want to figure out over the next year, or maybe the next few years, both in my academic work and in the work here at Physical Intelligence, is how to go from the foundation, which I think at this point we have a pretty good understanding of, to something that is a true data flywheel, a true continual learning system where the robot experiences more and more tasks, and the more it experiences them, the better it gets. And that involves, I think, a lot of autonomous learning, a lot of reinforcement learning, a lot of learning from other supervision signals, like the language feedback that I mentioned before, and turning that into a continuous cycle where the more the system is deployed, the more capable it gets. And I think there are some very deep technical challenges there that'll stress kind of all aspects of this. So that's what I'm most excited about.
A
Yeah. Okay, Sergey, I'm going to ask one last question. I asked Fei-Fei Li this, and David Ha and some others. What's your guilty pleasure? Do you have one?
B
Oh,
A
Something you do to relax that, you know, is kind of silly or a waste of time, but you enjoy it nonetheless. I remember you were a big video game guy.
B
Yeah. Back when I was in college, I actually thought that my career would be making video games. I think these days it's probably science fiction. I'm a big science fiction nerd. I've made sure to put a little epigraph with quotes from science fiction stories that I like into a lot of the papers that we've been putting out. I like that very much because I think it's very good, especially for somebody working in science or engineering. And this is maybe the bit of advice that I'd have for folks listening to this too: it's good to not feel too constrained. Sometimes the rigor of doing serious engineering and scientific work forces people into very constrained thinking. So it's very important, I think, to try to get out of that mindset sometimes. Not to get too carried away with it, but it's good to let yourself think things that are maybe not good to do in polite circles in science and engineering, to think about things that are more out there, more fanciful, more creative, or more improbable, let's say.
A
Yeah. Is there a favorite work of science fiction that you've read that you advise people to read?
B
When I was younger, I didn't read very much Robert Heinlein. So I think I only discovered Robert Heinlein's work in the last few years. So I've been really getting.
A
That's wonderful.
B
Classic, classical American sci-fi.
A
Yeah.
B
This very optimistic aspect of American culture that I think is very refreshing, especially in today's day and age.
A
Yeah. Okay, Sergey, well, I hope we have a chance to talk again and I'll see you at the conferences. And this has been fascinating.
Sergey Levine: The Robot Revolution Nobody Is Talking About
Date: April 12, 2026
Host: Craig S. Smith
Guest: Sergey Levine, Co-founder at Physical Intelligence & Professor at UC Berkeley
This episode explores the evolution and future of robotics AI, focusing on foundations for general-purpose robots, data collection strategies, the role of simulation and real-world data, the surge of humanoid robots, and the critical bottleneck in achieving scalable, adaptable, and continually learning robotic systems. Sergey Levine offers a deep dive into current advances, technical nuances, and societal implications, challenging some popular assumptions and painting a picture of the "robot revolution" grounded in research progress rather than hype.
"It's not a very appealing tool for getting experience of very diverse environments and objects... Getting real images is so much easier." (07:48)
"It's like building a brain piece by piece—language, visual cortex, now a motor cortex." (39:18)
"For a real, capable, embodied intelligent system like a robot, we'll need many different abstractions... what language models do, visual-language models do, video prediction models do, and what world models do isn’t actually that different—they just operate with different abstractions." (28:58)
"The lowest levels... need to be very fast... but also are not as cognitively demanding... so they can run locally. The natural trajectory... makes [running on device] reasonably straightforward." (33:12, 35:22)
"The gap between a closed world and an open world is enormous. You can't be just a little open world—immediately stuff can happen." (44:12)
"I think there's a good reason to be excited about humanoids... it captures the imagination... but I think it's a somewhat limited view to restrict ourselves just to that." (50:27)
"If you show it doing something simple in a hundred different environments... that's harder to convey." (52:51)
On Data Diversity:
"Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body." – Sergey Levine (05:15)
On Simulation vs. Reality:
"It's not as easy as taking a camera, going out and taking pictures. And I think this is a little bit of a mistake because actually, if you're serious about building general purpose robots that'll go out into the world and do lots of things, the kind of boundary condition is in your favor..." (08:09)
On Generalists vs. Specialists in Robotics:
"The generalist model, the one that could handle a wide variety of tasks, actually becomes a better specialist because it can deal with all that weird stuff that arises... generality is really essential, even if you really want to do one thing." (44:55)
On Future Robot Form Factors:
"I actually really hope that robots will kind of end up being a little bit like personal computers, where there's like general software and the form factor... can be very different for different jobs." (46:19)
On the Humanoid Hype:
"There's a good reason to be excited about humanoids... but I think it's a somewhat limited view to restrict ourselves just to that." (50:27)
On the Science Communication Challenge:
"If you show it doing something fairly simple, but in like a hundred different environments, well, then each of the videos of that is just a robot doing something simple. So the fact that it can do it in all these different settings is harder to convey." (52:51)
Sergey Levine paints a nuanced, optimistic, and technically informed vision for the next era in robotics—one in which data diversity, adaptable foundation models, and hybrid AI systems enable a broad array of physical forms and capabilities. He underscores that true revolution isn’t always dramatic demonstrations but rather the accumulation of robustness and generality from messy, heterogeneous data, and the relentless drive toward autonomy and adaptability.
Recommended Listening for: