Transcript
A (0:00)
A lot of startups these days say they're building foundation models for robots. What does that actually mean for a non-technical listener?
B (0:08)
I think every time that people try to take robots out of the factory and into open world environments, they very quickly realize that in the real world there's a huge range of things that can happen.
A (0:18)
But since then there's been this explosion of humanoids and everyone's talking about humanoids. I mean, did I get it wrong? Do you think that humanoids are much closer to being in the world?
B (1:34)
My name is Sergey Levine. I'm one of the founders at Physical Intelligence, and I'm also a professor at UC Berkeley. What I work on these days is algorithms for reinforcement learning and optimal decision making, as well as applications in robotics. Something that I've been very interested in lately in particular is robotic foundation models. These are general-purpose models that can, in principle, control any robot to perform any task. And I think we've seen some pretty dramatic transformations in the last few years in the capabilities of these kinds of generalist robotic systems, where we can use very diverse data sources from many different robotic platforms performing a wide range of different tasks, and acquire a kind of general physical understanding from these datasets that then makes it much more feasible to rapidly acquire effective, robust, and highly generalizable robotic skills. So this is something I've been very interested in over the last few years, and I think it's an area where we're seeing a lot of progress.
A (2:30)
Yeah, and a lot of startups these days say they're building foundation models for robots. What does that actually mean for a non-technical listener?
B (2:40)
Yeah, this is actually a surprisingly nuanced question, because after the success of ChatGPT, the term foundation model obviously became very much a buzzword. So in some cases it's almost synonymous with saying "I have a good model, it's a foundational model." But insofar as there's a consistent definition, it's something like this: the principle behind language models, vision-language models, things like that, is that you can use very large and diverse data sources that are not necessarily of extremely high quality. It might just be data harvested from the web. And you get a model that digests all this data and acquires a kind of broad and general understanding of how the world works. That kind of understanding is not enough to be an expert, it's not enough to be extremely proficient, but it gives you that basis of common sense. The term foundation model was coined by Percy Liang and his colleagues at Stanford for precisely this reason: because this kind of broad basis of knowledge gives you a foundation on top of which you can then put other things. And the key thing about the foundation is that in order to be useful, it needs to be broad. Insofar as there's a deep technical insight here, the insight is that if you need such a large amount of data, it's almost impossible to get in a single domain. But if you're willing to use data from many different sources, maybe all of the text data on the web, all of the image data you can get your hands on, or in our case, data from all of the robots that we've seen, that could be big enough, and that can give you that foundation. And on top of that foundation, now you can build up individual skills. In the case of language models, you can fine-tune them for expert-level computer programming.
In the case of our robots, we can fine-tune them for, you know, assembling things or making coffee or cleaning the kitchen with very high-quality data, but a much more limited amount of it. Because once you can put it on top of that foundation, you don't need a huge amount of very high-quality data for the downstream tasks. So that's the really important thing. Now, to come back to your question: there are many startups, many organizations that are building foundation models. I think one of the most important things about a robotic foundation model is to answer the question, where does all of that really broad and diverse data come from that can establish that foundation? This is a place where the answer is very, very delicate. The strategy that I think will be most successful here is to not be too picky. In the same way that language models are trained on all the text data that can be mined from the web, a robotic foundation model should be trained on all of the embodied data that we can get our hands on. And it's one of those things where, once you cross a certain threshold of scale, it actually becomes easier to incorporate other data sources. If you want really, really high-quality data for, like, one particular high-end humanoid robot, that's pretty challenging, because now you're very constrained: you have to have that system, you have to get good data for it, you have to figure out how to teleoperate it, put it in the right environments, and so on. But if you're willing to pull in everything, then you can pull in lots of robot data from many different kinds of robots. Some of it might be good, some of it might be bad. Once your model understands diverse physical embodiments, you can also start adding in data from humans, because to the model, the human body will look like yet another robot body. So this kind of diversity actually makes it easier to include other data sources. So, go ahead.
