
A
Hi listeners, as you may know, I recently wrapped up the AIE Code Conference in New York, and while I'm traveling I like to visit top AI startups in person to bring you interviews that you won't find on any other podcast that just does a Zoom call. General Intuition, or GI for short, is a spin-out of a 10-year-old game clipping company called Medal, which has 12 million users. By comparison, Twitch only has 7 million monthly active streamers. Medal collects this data by building the best retroactive clipping software in the world. In other words, you don't need to be consciously recording; you just have Medal on in the background while you're playing, and you hit a button to clip the last 30 seconds after something interesting happens. It's very similar to how Tesla does self-driving bug reporting, if you have ever done a self-driving bug report in a Tesla. The result is that Medal has accumulated 3.8 billion clips of the best moments and actions in games, resulting in one of the most unique and diverse data sets of peak human behavior, actively mined for the interesting moments. They were also very prescient in navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes. As you saw on our Fei-Fei Li and Justin Johnson episode with World Labs, and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs to improve on spatial intelligence and to work on embodied robotics use cases. DeepMind has been working on this with Genie 1, 2 and 3 and SIMA 1 and 2, and this year OpenAI seems to finally agree, because they have been depending on LLMs a lot and they made the news by offering $500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and chose to build an independent world model lab instead.
Khosla Ventures led the $134 million seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get an exclusive preview of GI's models, which unfortunately we cannot show you directly, but I can confirm they were incredibly human-like, and we chose to include the first 11 minutes of the demo discussion even though I can't show it to you. It may be hard to follow, but I tried to call out what was noteworthy, so you know what your likely reaction would be if you were watching along with us. Now enjoy the world's first look at my first look at General Intuition.
B
So what I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would. And so yeah, what I'll show you here is what this looked like four months ago. So again, this is just an agent that's receiving frames and it's just predicting actions. So you can see it has a decent sense of being able to navigate around. It tabs the scoreboard, just like gamers always tab the scoreboard. So this is pure imitation learning.
C
I see.
A
So the C is slicing the knife.
B
Yeah, exactly. So it's doing everything that humans would. In this case, here was the first interesting part that we saw: it gets stuck, and then it has memory as well, so you see it can get unstuck. How long is the memory? Four seconds? Yeah, four seconds. Okay, so this was four months ago. This was maybe a few weeks after that. So you can see it's still doing the scoreboard thing, but they're still quite like. And these are bots too.
A
So you can see that it's very human. Let's just say that.
B
Yeah. And then. Right, so this was really like the early days of research where you can see, right, it does one thing and then goes for another. And then we've been scaling, right, on data and compute, and also we've just been making the models better. And this is where we are now. So what you're seeing is, like I said, pure imitation learning. This is just a base model. There's no RL, no fine-tuning. This model sees no game state. It is purely capable. Not sequence, exact sequence. It's purely predicting the actions from the frames. That's it. And this is playing against real humans, just like a human would play. And also it's running completely in real time. So everything here plays exactly like a human.
C
Do you give it a goal? No, it just figures out it's on.
A
A goal because obviously it's trained on the same thing.
B
And I picked a sequence where also it doesn't do well initially. So you can see like this is just like a sequence, a random sequence.
C
But this is the.
A
I mean it looks like it's doing well, so.
C
Oh, okay.
B
Yeah, it was.
C
Yeah, this is pretty good.
A
It'd be too good.
B
This is my favorite part. So you can see it does something here that a human would never do. It then gets unstuck, then has four realizes which. And then in the distance.
C
So you're saying one, it makes a.
A
Mistake that a human will never make, but it unsticks itself. And two, what we just saw is it is doing superhuman things.
C
Yeah, okay. Yeah.
B
I mean, these are things that humans did, obviously, but because it is trained on the highlights, all the exceptional things, it has inherited those. Yeah, so it's not like move 37, where we RL'd our way into something, but it's.
A
Yeah, we're replicating superhuman.
C
Yeah, exactly.
B
The baseline of our data set is peak human performance.
C
Yes. Yeah.
B
Okay, so that's the agent. So now what I'm going to show you is we then are able to take those action predictions and we're able to label any video on the Internet using those actions. And so this is just frames in, actions out. Yellow is the model prediction, or sorry, yellow is ground truth, purple is the model prediction. And then bottom left is compound error over the entire sequence. And then this is reset per prediction.
C
Reset, meaning every now and then you reset.
B
Yeah, so this just means it resets the baseline. So basically a single error in the entire sequence compounds here, but it doesn't compound here, if that makes sense.
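The two error panels described here can be read as two ways of scoring the same predictions. A minimal sketch, assuming the actions are per-frame mouse deltas and that the compound panel compares running totals (both are assumptions; the demo's exact metric is not stated, and the function names are hypothetical):

```python
from itertools import accumulate


def per_step_error(truth, pred):
    """Error measured fresh at every step, so the baseline resets per prediction."""
    return [abs(p - t) for t, p in zip(truth, pred)]


def compound_error(truth, pred):
    """Error between running totals, so one early miss carries forward."""
    return [abs(p - t) for t, p in zip(accumulate(truth), accumulate(pred))]


# Illustrative values only: the prediction is wrong at step 2 and nowhere else.
truth = [1, 1, 1, 1]
pred = [1, 3, 1, 1]

step_errors = per_step_error(truth, pred)
drift = compound_error(truth, pred)
```

With these toy values the per-step error returns to zero right after the miss, while the compound error stays offset for the rest of the sequence.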
C
Yeah.
B
So, and again, this is just seeing frames, right. It's not seeing any of the actions. And so, you know, so what we did right, is we trained it on less realistic games and we transferred it over to a more realistic game. And then, and this is where it gets really exciting, we transferred it over to a real world video, which means that you can use any video on the Internet as pre training.
A
What was it predicting?
B
It's predicting it as if you were controlling it using keyboard and mouse, so as if you're basically playing this sequence as a human. Is there some sense of error? So that's why you transfer to more realistic games first and then you transfer to real-world video, because you can't get a sense of ground truth from the real-world video yet. Let's see. And then, so I'll show you here. This one is also. This is the same agent that I just showed you.
C
This is playing against other AIs.
B
This one's playing against bots. Yeah. The previous one was against players, but with the sniper it doesn't really matter that much. I shall say it's like. So one thing that's really interesting is you notice that it behaves differently as it has different items.
C
Right, that makes sense.
A
Yeah, intuitively.
C
Yeah.
A
I think there's also a question about egocentricity versus like so the third person.
C
Does it matter?
B
The third person I think will be very, very helpful if you're for instance Trying to control multiple objects in an environment later on. Right now I think having fully imperception first person is quite helpful. This one's also. This is the policy itself. What do you mean? This is the tallest agent. Yeah. Same constraints that I just told you about. Yeah. Like this where. Right where it hides. That to me was just incredible. Like just from knowing. Being able to predict. Also high when you see it. Exactly. Yeah, yeah.
A
And it needs the spatial intuition to go, well this is hiding and that's not hiding.
B
Exactly. Right. While it was reloading. Yeah. Okay, so that's the policy. And this is a completely general recipe, meaning we can scale this to any environment.
C
Is this work closest?
A
Okay, no, let's keep going on demos.
B
Until I was going to go into research. Yeah, yeah, sounds good. Okay. And then so what I'm about to show you are the world models. There's a few really, really interesting parts about our world models. So the first is we made the decision to pre-train world models from scratch. But also we've actually been able to fine-tune open-source video models to get a better sense of physical transfer. And so one of the things that you'll notice here is our world models have mouse sensitivity, which is something that gamers absolutely want, right? So you can have these very rapid movements, which you couldn't do in any other world model. And so this is a holdout set, so this clip was never seen before at training time. As you can see, it has spatial memory. This is about a 20-second-ish generation. And here's what's fascinating. This is an explosion that occurs, and you can see that in the physical world the camera would shake and in the game that would never happen. So you see the world model inherits the physical-world camera shake, but the actual game never does that. That to us was quite fascinating. Also did the models that I just showed you that we used to transfer over from video. The two of those combined will allow us to push way beyond games in terms of training. There's another interesting. So this is a world model. This is rapid camera motion. Again, this is stuff where we're literally just taking one second from here in the context and the actions and replaying it here, right? And so what we're saying is the skill that you see in the clips, like the speed and the movement, that also pays off at training time when you're doing world models. This is my favorite example. So this shows that the world model is capable of performing with partial observability.
So what you're going to see is, again, you're replaying the actions from here and here, just using one second of video context. Everything after that is completely generated. So what you're going to see is the model is going to encounter, in this case, smoke. Normally models break down. What you actually see is it comes out at the same place. And so it's capable of, even with partial observability, still maintaining its position in the world. And then here it is also interesting. So this is sniping. So this gives you a reaction time. The fact that it can do depth and sequences in completely different views. So this is a completely different view than if you were to be outside of that view, right? And so it's able to maintain consistency while zooming in. Yeah, exactly. And so, yeah, so you can see. So even while this goes out of scope, right, watch. And then it comes back and you'll see it's still there. Yeah. And so, yeah, this is the work that Anthony Hu has been working on.
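The replay setup described here (a short window of real frames as context, recorded actions fed in, every later frame generated) has the shape of an autoregressive rollout. A minimal sketch under stated assumptions: `rollout` and `toy_world_model` are hypothetical names, and frames are plain integers standing in for images; GI's actual model interface is not described in the conversation.

```python
def rollout(world_model, context_frames, actions):
    """Generate one frame per recorded action, conditioning on all frames so far."""
    frames = list(context_frames)  # real frames, e.g. one second of video
    generated = []
    for action in actions:
        next_frame = world_model(frames, action)  # predict from history + action
        frames.append(next_frame)  # the model's own output becomes context
        generated.append(next_frame)
    return generated


def toy_world_model(frames, action):
    """Stand-in dynamics: a 'frame' is a position and an action is a displacement."""
    return frames[-1] + action


generated = rollout(toy_world_model, context_frames=[0, 1, 2], actions=[1, 1, -2])
```

Because each generated frame is appended to the context, any inconsistency feeds into later frames, which is why coming out of the smoke at the same place is a meaningful check.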
C
I'm just wondering how much game footage you have to watch in order to find these things.
B
We can ask Anthony. I'm sure he's not going to be too excited to play these games afterwards.
C
You're not playing, you're just watching.
B
Yeah, great.
C
Okay.
B
So those were the models. These are interesting. So we also were able to distill into really, really tiny models. So this is for instance, a long sequence on a very, very tiny one. You can see it makes a bit more stupid mistakes. Like it does things that are not as optimal.
C
But I haven't seen anything yet.
B
At the beginning it was running into a wall for free. Exactly.
C
I mean, I do that too.
B
Yeah, yeah, it's looked.
A
I mean, it's doing pretty well.
B
Yeah. And again, all these models are running completely in real time. There's no.
C
Okay, so I was thinking your main model does real time anyway.
A
What's the goal of distilling?
C
Is it cost or.
B
Yeah, parameters. Yeah, yeah, yeah. This is the interesting one. It peeks the corner. That's what we mean by the spatial-temporal reasoning aspect: humans actually sort of simulate the optical dynamics of their eyes and how to actually.
A
Spatially reason all the data.
C
Yeah, right. You've seen all this.
B
Yep, exactly. And this is kind of interesting: even in the real world with, for instance, YouTube data, you have to first solve for pose estimation. Then once you have pose estimation, maybe you do something like inverse dynamics, where you are somehow able to label some of the actions that you're seeing. And then you still have to account for the optical dynamics of where your eyes are actually looking before the decision. So there's three levels of information loss. When you're playing video games, you're actually simulating the optical dynamics with your hand. And that's, I think, why games are a better representation of spatial-temporal reasoning initially than YouTube videos, for instance.
C
Okay, we're in the GI offices with the CEO, Pim de Witte. Welcome.
B
Thank you.
C
Thanks for having us in your office.
B
Yeah, excited to be here.
C
If I'm in New York and you're one of the hottest raises of the year, I have to come and visit, and thanks for taking some time on the weekend.
B
Yeah, yeah.
C
So you've raised a $133 million seed for General Intuition. Most people hadn't heard about you, I guess, since GI is new, but more gamers would have known Medal, and before that you ran probably the largest RuneScape private server. What's your reflection on just that journey, now that you're an AI founder and you started off with RuneScape?
B
Yeah, I think I grew up with threats. I spent most of my time as a teenager coding and playing video games. So in that sense it doesn't feel that much different. So I started the largest private server on RuneScape, worked at Doctors Without Borders for three years, first on Ebola and then on satellite-based map generation for disaster response, which was already very AI-adjacent. I built some models back then and then started Medal, which became one of the largest social networks in video games. I've always been kind of AI-adjacent. I'm a self-taught engineer, so for me the modeling itself always felt a little foreign. I actually had to take tons of classes over the summer and early this year to get better at it, because I was really, really good at the infrastructure side. I had written our transcoders for Medal myself, so I was very, very familiar with CUDA and the GPU side and all the video infrastructure that we were using for this stuff. But the modeling side itself was still quite foreign. Luckily, I have really, really good co-founders, and they essentially put a bunch of coursework together for me to complete, to get really, really good at understanding the fundamentals. I had seen, inside the labs, the ones that had really, really good leadership with fundamentals on top and also the ones that didn't, and I think the ones that did were just much better. And so for me, yeah, I wanted to be more like that. So it was first very foreign, and now I feel pretty comfortable with everything. But yeah, there's a lot to be explored starting in video games and also reverse engineering. I think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently. It's like the ultimate form of deductive reasoning in a way. And so this is just how I think, how I operate.
And so for me it's been a really, really interesting journey. You know, I don't claim to have any of the credentials or skills that some of the other guests you have had on do, but hopefully it will make for a good time.
C
Yeah, well, your co founders definitely bring a lot of that different ability and you bring a lot of the, I guess, gaming expertise.
B
Well, trueties, we'll see what I bring to the table.
C
Just a little bit of history on Medal. Let's establish Medal for those who don't know it. You have more active users, concurrent users, than Twitch, something like that.
B
Yeah. On the creator side, I think. And the reason is because Medal is a lot more like Instagram than it is like Twitch. So the way to think about Medal is it's a native video recorder. Unlike something like Twitch, where you actually have to use other software to record and stream to Twitch, it's not streaming software, it's actually video recording software. And a lot of gamers love to put things like overlays on top of their videos. And as a result of that, we have sort of the largest data set of ground-truth action-labeled video footage on the Internet by maybe one or two orders of magnitude. Yeah.
C
What's an example of an overlay, like Naomi overlay, I usually think of as like the case cad.
B
Yeah, yeah. Also controller overlays. For instance, if you're playing like, let's say you're playing console.
C
Yeah.
B
Like flight simulator, you get, you know, the joystick and all the things. So you get the actual actions that people take inside the games as well as the frames of the games themselves. Which is a loop, right? Because essentially you perceive, then you act, and there's a state update; then you perceive again, you act, state update. Which is precisely what you use in order to train these agents.
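The perceive, act, state-update loop can be written down directly. A minimal sketch, assuming a toy one-dimensional world; the function names, the string "frame" format, and the two-action vocabulary are all hypothetical and only illustrate how (frame, action) pairs fall out of the loop:

```python
def record_trajectory(policy, step, state, n_steps):
    """Run perceive -> act -> state update and log the (frame, action) pairs."""
    trajectory = []
    for _ in range(n_steps):
        frame = f"pos={state}"  # perceive: toy stand-in for a rendered frame
        action = policy(frame)  # act: the policy only ever sees the frame
        trajectory.append((frame, action))
        state = step(state, action)  # state update, hidden from the policy
    return trajectory


def toy_policy(frame):
    """Walk right until position 2, then stay put."""
    position = int(frame.split("=")[1])
    return "right" if position < 2 else "stay"


def toy_step(state, action):
    """Environment dynamics: 'right' moves one unit, anything else is a no-op."""
    return state + 1 if action == "right" else state


pairs = record_trajectory(toy_policy, toy_step, state=0, n_steps=4)
```

The logged pairs are the kind of supervision imitation learning consumes: frames as input, the action taken as the label.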
C
Yeah, it's almost perfect training data. You were showing me in the demo, and we'll show some B-roll here, how you don't log keys. It's very important for you to log actions. When did you figure this out?
B
Oh, maybe starting a year and a half ago. Yeah. And we realized that, in figuring out this side of the research, we very much never wanted to be in a position where we eroded privacy or something like that. So we never wanted to actually log a W or an A or an S or a D, which to researchers often sounds strange, like why wouldn't you do that? But I think for us, the privacy. Yeah, I think a lot of the researchers hadn't quite understood yet that you can actually get away with just doing the actions. And the reason is, at training time, having the actual keys is noise anyway. If there is text on the screen and you would want to, in theory, make that part of the training, then reading text from a frame is really easy. So basically, you hit the input, and we convert it to the actual action. We had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels. Yeah. So when you act, we get the actual action itself. That being said, at training time you can, for the general set of that game, convert back into computer inputs if you want to, but you can never do it for any individual person. And so that, for us, from a design perspective, was important. So we figured all that stuff out. Then we actually started pushing. We already had features as well with this. So for instance, gamers already love to be able to navigate their clips by things that happened. So we have an events capture system, and then we also have the overlays, where you actually just want to overlay and render the actions on top of your clip.
We developed this kind of in tandem with the feature set itself, and then obviously when world models became a thing, and it was very, very clear that the data for this was precisely that sequence, we were able to sort of be first to market, recruit the best researchers and start a lab.
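One way to picture logging actions instead of keys is a per-game binding table that is applied at capture time, so only the action label is stored. A minimal sketch: the game name, the binding table, and both function names are hypothetical, and nothing here is confirmed as how Medal implements it.

```python
# Hypothetical default bindings for one made-up game.
DEFAULT_BINDINGS = {
    "example_shooter": {"w": "move_forward", "space": "jump", "r": "reload"},
}


def to_action(game, key, user_bindings=None):
    """Convert a raw input to an action label; only the label would be logged."""
    bindings = user_bindings if user_bindings is not None else DEFAULT_BINDINGS[game]
    return bindings.get(key)  # None: the key maps to no game action


def to_default_key(game, action):
    """Map an action back to the game's default input, not any one player's."""
    for key, bound_action in DEFAULT_BINDINGS[game].items():
        if bound_action == action:
            return key
    return None


# A player with custom bindings presses a side mouse button to reload.
custom = {"e": "move_forward", "space": "jump", "mouse4": "reload"}
logged = to_action("example_shooter", "mouse4", custom)
replayed = to_default_key("example_shooter", logged)
```

The stored label is "reload" and converting back gives the game-level default "r"; the player's own "mouse4" is not recoverable from the log, which matches the per-game but not per-person property described above.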
C
Yeah, that's incredible. One more question on Medal before I really move over to GI.
A
It's been 10 years.
C
Yeah. What is the. I don't even know how you grow something like this. You know what I mean? I'm just kind of curious, and I'd like the opportunity to ask you what really worked.
B
Yeah.
C
That you became so, so huge. Because you're not the only one.
B
Yeah.
C
But I'm sure it's performance and everything.
B
But a few things that really worked. I think the first was a lot of our competitors were focused on solving the social network and the recorder at the same time. Our bet was really that we could get so many people to record with us that we could bootstrap the network on top of that. And that worked. So while everyone was sort of distracted trying to bootstrap a social network, we were just focused on building a really, really good capture tool. And then we got tens of millions of people to use that, and we bootstrapped a network on top of the shared behaviors we already had, like the profile behaviors and the share behaviors obviously, but the actual content consumption piece and the sharing piece really only came after we hit critical mass. It was actually early days during COVID when the network really accelerated. Fortnite happened, which was really important. And I think also the fact that Discord existed made it quite a different time than when other networks of this type had launched, because Discord essentially was the connective tissue between gamers that never really existed before. And so I think that combination of things really, really made it. I think we also built a product that. For instance, with most video recorders, you have to remember to start and stop the recorder. So you have to go into the application, then hit start, then start your game. And then, you know, maybe you'll play games for three hours and you'll close the game. Then you have to close your video application, then you have to process a multi-gigabyte file, then you have to upload it somewhere. And so this was a pain for people. And so what we did is we just ran this kind of recorder where, when you hit that button, it does a retroactive video record. So all the recording initially is in memory. And then when you hit that button, it exports only that sequence to disk and syncs it to your phone.
And so that became super popular. What was interesting about it is that it also means you're not behaving or acting differently, because it's always there and you can just export whatever happens, which is also very, very helpful for training, obviously.
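Retroactive recording of this kind is commonly built on a fixed-size ring buffer: frames are kept in memory, the oldest are dropped, and a clip is just a snapshot of the buffer. A minimal sketch, assuming a hypothetical `RetroactiveRecorder` class and integer stand-ins for frames; it is an illustration of the idea, not Medal's recorder.

```python
from collections import deque


class RetroactiveRecorder:
    """Keeps only the most recent `seconds` of frames in memory."""

    def __init__(self, fps, seconds):
        # deque with maxlen silently discards the oldest item when full.
        self.buffer = deque(maxlen=fps * seconds)

    def on_frame(self, frame):
        """Called for every captured frame while the game is running."""
        self.buffer.append(frame)

    def clip(self):
        """Called when the clip button is hit: snapshot what is in memory."""
        return list(self.buffer)


# Toy numbers: 2 fps and a 3-second window, so the buffer holds 6 frames.
recorder = RetroactiveRecorder(fps=2, seconds=3)
for frame_index in range(10):
    recorder.on_frame(frame_index)

clip = recorder.clip()
```

Only the snapshot would need to be encoded and written to disk, which avoids the multi-gigabyte file of a full session.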
C
But thank goeth. You weren't the first to do that. Yeah, the thing you were explaining just before this was similar to how Tesla does the bug reports. Right. You're driving and suddenly having disengaged autopilot, they're like, well, tell us what happened.
B
Exactly, exactly. Say you're driving. Tesla doesn't want to train on the 10 hours of you driving through a desert where nothing interesting happens. You have the clip button on the steering wheel. Something interesting happens while FSD is engaged, and I'm not sure if you can use it without FSD as well, but you hit the clip button and it basically uses that precise sequence to mark, which is then more helpful for training because it's more unique at training time.
C
Yeah, yeah. I mean so one thing we're going to get to this on the eating side. One thing that I does that does pop up is well, a lot of life is boring. A lot of life is going for me, a lot of playing games is doing the boring stuff that is not capable somehow using the generalized fight.
B
Yeah, yeah, yeah. It makes you think, right? It makes you think.
C
Yeah, yeah.
B
It's also quite interesting like I showed you in the models, like what happens when you increase the size of the context window and how behaviors actually are largely shaped by the size of the context window. That to me was like one of the most interesting parts about the research made me think about our own behaviors in a way.
C
Yeah. Let's talk also about forming the team. On your website you have 12, I don't know if that's changed. These four, three co-founders.
A
Yeah.
C
And just let's talk about how this team comes together because you may not physios of thought, you don't have that metadata network, but you managed to get all these people.
B
Yeah, I started reading all the research papers. By that time I was already pretty deep into having a decent understanding of, not world models in particular, but LLMs and transformer-based models. And so there was Genie, there was SIMA; those two were really, really interesting. And SIMA in particular was interesting because what they do is they basically take 10 games, and they have a graphic in SIMA where you can see the precise actions inside of those games that they mapped. And I believe they found something like 100 which are actions that also exist in the real world. And what they did then, I believe specifically for navigation, was a nine-to-one holdout set. So they trained an agent on the nine games and then it had to play the tenth game, the holdout game. But then they also trained a specialized agent just on the tenth game, and they compared how well they did. If I recall correctly, the nine-game agent did roughly as well playing the holdout game, on navigation specifically, as the one-game agent did. And that to me was really interesting because that's precisely the type of data that we had, right? And so for us, the thinking was, okay, what if we did exactly what LLMs did? So LLMs were trained on predicting text tokens on words on the Internet. What if we predict action tokens on essentially what is the equivalent of the Common Crawl data set, but for interactivity? Vision input. Yeah.
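The nine-to-one holdout described here is a leave-one-game-out split: a generalist trains on nine games and is compared on the tenth against a specialist trained only on that game. A minimal sketch of building such splits, with hypothetical game names and a hypothetical `leave_one_out` helper; SIMA's actual protocol may differ in its details.

```python
def leave_one_out(games):
    """Return (training games, held-out game) for every possible holdout."""
    return [
        ([game for game in games if game != held_out], held_out)
        for held_out in games
    ]


games = [f"game_{i}" for i in range(10)]  # placeholder names, not SIMA's titles
splits = leave_one_out(games)

# One concrete split: train the generalist on nine games, evaluate on the tenth.
train_games, holdout_game = splits[-1]
```

The comparison then runs two agents on `holdout_game`: one trained on `train_games` and one trained only on `holdout_game`.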
C
Action output.
B
Correct.
C
That's it. Well, I think. Well, actually I'm going to double back a little bit to you. Thanks. A question I had, which is one of the reasons why I thought you would want to prefer keyboard and mouse over actions, is the action space is potentially unbounded. Right. You can jump, walk left, walk right, but then also look up, look left, look bench. It's unbounded. So it's huge, isn't it?
B
Yeah, I think, yeah, there's benefits to the action space being small to start with. So I think we're going to start with anything that you can control using a game controller. But yeah, long term we want to actually predict maybe like action embeddings and have models sit inside a general action space to be able to transfer out to other inputs as well.
C
Yeah. Okay. And then let's keep going on the research side. So Genie, SIMA. Yeah. And then the co-founders.
B
Yeah. So there was the Diamond paper, there was Genie, and then there was SIMA. The Diamond paper for me was really interesting because they had actually managed to get this world model called Diamond running on a consumer GPU, I believe it was a 4090, at 10 fps, and you could play it. And they did that on like 90 hours of data, like 95 hours. I think it was 87 hours and, I think, eight in the holdout set or something like that. That was just incredible, right, that they had something playable on that little data. So I actually cold emailed the entire group of students and I told them, hey, I think we have this thing. And then it was pretty interesting. Right when that happened, a lot of the labs also started understanding what we had. Multiple labs tried very aggressively to bring us in in various ways, and they were part of that. They basically were seeing that happen. And I think for them that also kind of solidified how real it was. And then when we chose to do our own thing, initially we thought that we were going to have to just work on world models, right? So we thought, okay, the main benefit of this data set is, like Genie, world models. What we didn't realize at the time is that we have so much of this data that we can essentially do these world models in parallel, and take the equivalent of the LLM bet, mostly on imitation learning, and then use the world models after that to get into the RL stage, right? And so for us, and eventually getting.
C
Rid of a world balance, is this.
B
Something that you can, I mean, ideally you get rid of the imitation. Yeah, the imitation learning. But yeah, we essentially realized that we could get so far on just imitation learning. The way to look at it is, let's take the LLM analogy: we essentially have sort of the Internet, or Common Crawl if you will, and every single lab is trying to simulate that in order to get similar data to train their agents. And so for us, the reason why we stayed independent and just did our own thing was we think we could essentially leapfrog every single company that's forced to either be consumers of world models or build world models, and take this foundation model bet for spatial-temporal agents and be in a place where we have a lot of customers years before any of the labs even get there. And maybe the most similar comparison is what Anthropic did with code, right? Anthropic just focused really, really hard on nailing the code use case. Their models are incredible for it. A lot of their customers use it for it. So we just want to become incredible at this spatial-temporal agent use case. And likely that starts in game simulation, and then using world models, we can start expanding out to other areas.
C
Would you show me a little bit of how does generalize object themes? But although games is kind of the common player.
B
Yeah, games and simulation, I would specify it as game engines in particular. So even if you're for instance, simulating human behavior in Omniverse, because you're trying to create better training data for factory floors, you can use it.
C
Yeah, maybe Meta has a similar data set because of the quest.
B
I never really asked them and I never really looked into the Meta Quest specifically. So you need a few things. There's lots of companies that have maybe recorders, but you also need the public graph, otherwise you can't train on the data, right? You can't train on people's private videos that they have saved somewhere. And so I think you need the social network graph components, because these videos need to be on the Internet to rank. No, to train on them. Yeah, I mean, I think generally people don't want to train on, like, because these things live on your device usually, right? And you can't train on anything that lives on your device. Like you actually need to go and upload and your thing. Right. For Meta specifically, I think also the scale of VR is still pretty small. The number of environments in VR that have consumption at scale is probably in the hundreds, whereas on PC it's probably in the tens of thousands. And so you get a lot less diversity. The three-dimensional input space of VR is pretty interesting. We see some of this too, obviously. And so yeah, I do suspect Meta starts using these types of things, but it's unclear to me whether they can get to a similar scale of data or diversity of environments as we can.
C
Yeah, there are a lot of challenges there. Yeah. Okay. I want to take this in a few different ways, but I guess let's fill out the papers. Maybe one more to mention is Gaia; I actually interviewed the Gaia authors, but that too seems like the particular insight that brought it overseeing.
B
Yeah. So Anthony Hu, who led the research on Gaia 2, is also one of the engineers that joined our team. So it's all the core contributors for Diamond, and then Anthony, and we just had three more researchers join this week. It's been a good week. And yes, I think a lot of the approaches in Gaia 2 were heavily inspired by Diamond. And then Vin Sa, who was one of the authors of Diamond, also already was at Wayve by the time that I emailed them. Anthony also realized what this was, and realized that you could scale world models to a much larger scale, and decided just to make the leap as well. So I think everybody that sees the data set makes the leap. But it takes a while to wrap your head around it because it's like, oh, it's video games, right? Intuitively it doesn't make sense. And then when you actually understand and you see how we've been able to transfer it to physical-world video and things like that, then it makes sense. And then everybody tends to jump.
C
If they don't call it video, they install the rlm.
B
Yeah, if I lived in San Francisco, maybe I would.
C
Yeah. Just a quick note. We actually cover all these papers in the Latent Space paper club. SIMA 2 did not seem to have as much impact as SIMA 1 and I don't really know why; they did a lot more work. Genie 3 had a ton of impact, but I also felt like because you could play with the model or it just seems an extension of all those things, I guess. Any quick takes on SIMA 2 and Genie 3, which were both this year?
B
Yeah. I'll talk about SIMA 2. The steerability of SIMA 2 was to me the most impressive part because lining up the action sequences and the text conditioning is quite hard to do. Right. It's also quite interesting: that means that they can sort of use Gemini as part of the flywheel. Right. Where you can sort of scale this orchestrator as an independent, almost like a puppet master, if you will. And then in theory, Gemini could orchestrate many instances of SIMA. Right. That, to me, is the most interesting part, and it's where I tend to agree with this, where I think our models will initially be used as you'll have an orchestrator VLM of sorts that's kind of managing instances and instructing them. And I think for SIMA, showing that you can do this was fascinating. Also the fact that they didn't just have text conditioning, but they also were able to do drawings and markings of where to go. They really took an interesting end to end approach, to me, that I look forward to seeing a lot more of.
C
Are you talking to them like you said it? Is everyone collaborating?
B
Yeah, I think we're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago and I think big fans of their work.
C
The headline that I kind of took from LS Heath's coverage of you is that you are the biggest bet that Vinod Khosla has made since OpenAI.
B
Yeah.
C
How did that conversation start?
B
Okay, so Vinod's style, and maybe I'll get slapped on the fingers for revealing this or whatever, but forgive me if it's bad. He asks you to draw a 2030 picture of your company, and I think he just picks N plus five years, whatever. I don't know.
C
I did the same for you. Yeah.
B
He asks you to walk that back from first principles all the way from today. And he expects you to do that flawlessly, where he can challenge any assumption, any part of the vision. And he asks the questions. Right. He has a very technical background. He also has a bunch of technical people on his team, and he truly backs people that have these very large visions on that vision and the ability to defend it alone. And that's what he did for us. And I think that's why he made that bet. So I think also through this question, he gets to know a lot of things about how technical you are. He gets to know how well you think from first principles. Because if that vision is not connected to something real, it's very easy to suss it out by asking good questions, and then he just backs fully. I think he really gets in your corner if it's the right fit. And yeah, they've been incredible partners. They've opened so many doors for us.
C
I had to ask the question. I think it's a very notable story. Obviously a lot of work went into it, but it's also worth it when it comes out the other side, for sure. I kind of ask this question out of sequence, but one of the things that excites me about talking to you is there are a lot of people like you who are founders of businesses that along the way have a ton of data, and yours happens to be highly valuable. Before deciding to do an independent journey, you also talked to other companies about potential licensing or acquisition and stuff like that. What are your learnings from those periods? One version of this is very simply, how do you value data?
B
Yeah, I don't think you can value it unless you actually model it yourself and see what the capabilities are. That's my real outcome.
C
You say model like train a model.
B
Yeah, but that's obviously not doable for everyone. And also I think my general advice would be, as model capabilities increase, and these VLMs are very, very good at labeling as well, generally. Right. What I was afraid of when I was having some of these conversations was, okay, as the capabilities increase, you're just going to need less ground truth data and you can do more model based data generation or synthetic data generation. I would recommend, if you're going to do large data deals, just try to get a large chunk of equity in the company that you're doing it with if you can. Now a lot of them won't do this. Or just go do the research, figure out what's actually possible. In our case, we were quite lucky in the sense that this is actually the foundation data. Right. And I think.
A
Right.
B
That's not true for every data set. I think we just happen to hit a particular gold mine.
C
But you also. Did you read Clip Radi. You did the action thing 1.5 years ago.
B
Yeah.
C
Even word.
B
Yeah. That's the thing. You have to be grounded. Right. And I think that's the hard part, and I think a lot of what's interesting is you can also kind of look for whether scaling laws already exist on your data type, which for video there were some. But for these input action labeled sets, there really weren't any. The other question is, does it go into LLMs, does it go into world models? What type of model is it going to be used for? And I think that's an important thing to know. And so if you're having these conversations with labs about data, just make sure that you actually understand what it's going to be used for, because that's a very, very good way for you to make that decision yourself about whether you want to pursue that. Now a lot of them won't tell you that, and I think in that case you generally just don't want to do it. Because I think for our case we really cared that, for instance, there weren't going to be products built that compete with game developers. Right. Because we didn't want to bite the hand that feeds us. And I think we are part of the games industry. So those questions I think are normal. And then we eventually decided we just have the data, we're just going to go do it ourselves. And that's when the rest happened and.
C
He assembled the team. Think about into that.
B
I feel like that's.
C
You've aligned a lot of stars in order to make GI happen. For other data founders, they're at the beginning of this journey. Yes, data founders, founders who happen to have data, but they have a main business. Right. I don't know if you have.
B
There's two sides to this. Right. It's really easy to be super naive about it. And I had a lot of people tell me initially, oh, it's not that valuable, you're just making this up. And so for me, doing the work and actually understanding it myself was a really, really big part of building that confidence to go start a company. But a lot of times it is true that model capabilities increase so quickly that certain data you just don't need anymore. And so I think it's really important to get people to do the work such that you can make these types of distinctions. So my recommendation would be go build models with your data, see if you can create any sort of capabilities that aren't clearly already there or on path to being there, and then figure out where you go.
C
Yeah, I did want to ask this earlier, but you gave me an opportunity to. When you say do the learning thing, you do coursework and all that and your co founders gave you some homework.
B
Yeah.
C
Is this like some books? I mean Coursera.
B
No, this was François Fleuret. So he has The Little Book of Deep Learning and then he also has a full course that he's published on his website. I went through the entire course over the summer. I believe it's something like 30 or 40 lectures, with also take home projects and things like that. And I would recommend anybody does this. It goes through the history of deep learning, the topology. It takes you through the linear algebra, the calculus, you eventually end up with the chain rule, and by this time you've done all the more important concepts. It takes you through how you create neural networks using these concepts that you've learned.
C
Wow, this is super first principles, this guy.
B
And I've had the opportunity to spend some time with him as well. He is one of the most first principles people I've met in my entire life. I'm convinced. I actually asked him, why did you do this course? He said, oh, because I thought all the other courses weren't right. Because he is so first principles. And how he explains this thing, everything is from first principles, including the history of deep learning itself, which was part of the course. And yes, he goes through everything. And by the end of it, I now have a pretty good intuitive understanding of how everything works. But obviously still, I like to describe it as I'm like the guy who just got his driver's license. I can drive the car. And my co founders are the F1 drivers that have done this for years. They know where all the gaps are. And so I enjoy getting to learn from them. The cool thing is also that world models is just a very, very new space. And so I get to bring ideas to the table that not everyone's thought of. And not because I'm great at this, just because it's such a new space that people just haven't tried it yet.
C
Let's get a hit on definition. What are world models to you?
B
In a video model, you might predict the next likely sequence or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state. And based on the action that you take, generates the next state, right? So the next frame. And so it is a much more sort of complex problem than traditional video models. So to me, it is a world that is accurately generated based on the actions that you take as a result of what's already been generated.
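A toy sketch may help pin down the definition given here: the next state is generated from the current state and the action taken. Everything in it is an assumption made for illustration (the one-dimensional track, the `State` class, the `world_model_step` function); a learned world model would predict frames, not integers.

```python
from dataclasses import dataclass
from typing import List

# Toy action set: each action shifts the agent along a 1-D track.
ACTIONS = {"left": -1, "stay": 0, "right": 1}


@dataclass(frozen=True)
class State:
    """Stand-in for a frame: the agent's position on a short track."""
    position: int
    track_length: int = 5


def world_model_step(state: State, action: str) -> State:
    """Generate the next state from the current state AND the chosen action."""
    new_position = state.position + ACTIONS[action]
    # Walls at both ends clamp the movement.
    new_position = max(0, min(state.track_length - 1, new_position))
    return State(new_position, state.track_length)


def rollout(state: State, actions: List[str]) -> List[State]:
    """Apply a sequence of actions, each conditioned on what was generated so far."""
    states = [state]
    for action in actions:
        state = world_model_step(state, action)
        states.append(state)
    return states
```

The point of the sketch is only the signature: the same starting state leads to different futures depending on the action, which is what separates this from predicting a single most likely next frame.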
C
And just to fact check that is, it needs to understand physics. It needs to understand, if I'm building a type of material, how it interacts with Some type of material.
B
Yeah. I think the interactions are the most important part. I think the reasons why world models are so fascinating. One of the things that I did when I was studying over the summer was I tried to actually build a super rudimentary PyTorch based physics engine, and I would not recommend writing a physics engine in PyTorch, for obvious reasons. But I wanted to be able to, because it's differentiable. So you can generate the little bad. Yeah, exactly. And then you can train. I got so many people asking me, why aren't you just simulating or generating this data? And I really wanted to understand from first principles why. And I think the most important thing that I figured out was the compute complexity of simulation goes up really, really rapidly with three variables. First, the number of agents in an environment. Second, their DoF, so their individual degrees of freedom. Yeah. And then third, the information that each action reveals. So for instance, if you have a text action or a speech action, the environment can change so much based on whether you say water or fire that the outcomes are going to be completely different in how a human would behave in that type of situation. And so it goes up so quickly with those three variables that at some point you just hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models. Because that type of stochasticity is just incredibly difficult. But it's already very, very present in a lot of the video pre training that goes into these world models. Right. And so I think for us it is more so about making a maximal bet on video transfer and interacting with things that are difficult to simulate, and the steerability is also really interesting with text, than it is about betting against simulation or something like that.
And so I think there's still a large market for traditional simulation engines, specifically in areas where video is really hard to get.
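A back-of-the-envelope calculation can make the first two variables concrete. The speaker gives no formula, so this sketch assumes one: each degree of freedom is discretized into a fixed number of choices per step, and the simulator has to account for every joint combination. The function name and the discretization are hypothetical, and the third variable (how much information each action reveals) is not modeled.

```python
def joint_action_space(num_agents: int, dof_per_agent: int, choices_per_dof: int) -> int:
    """Count joint action combinations per step under a simple discretization.

    Assumption for illustration only: every degree of freedom takes one of
    `choices_per_dof` values, independently for every agent.
    """
    per_agent = choices_per_dof ** dof_per_agent
    return per_agent ** num_agents


if __name__ == "__main__":
    for agents, dof in [(1, 2), (2, 2), (4, 6), (10, 6)]:
        size = joint_action_space(agents, dof, choices_per_dof=3)
        print(f"{agents} agents x {dof} DoF -> {size} joint actions per step")
```

Under this assumption the count is exponential in the product of agents and degrees of freedom, which is one way to read the claim that simulation cost climbs very quickly.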
C
Is this exactly what the big labs are also saying when they're talking to.
B
That I honestly haven't talked about with the big labs since we started working on them ourselves. I think people are more reserved with what they share with us.
C
Of course, it makes sense. That's from your question. How would you contrast your version of world models with Yann LeCun's?
B
Yeah. So I don't know exactly what Yann LeCun is doing today. My understanding is it's based on a JEPA-like, LeJEPA approach, which is. So I'll start with Fei-Fei Li. I think what's really interesting about Fei-Fei Li's approach is that you in some way are able to reuse the splats, right, in game engines and in things that let you stay in a verifiable domain. Which I think is a really interesting approach. However, my understanding is they're currently not interactive, which in my opinion is the whole point of world models. Right. It's environments, they're great environments. And I think from a business perspective they picked a really important part of the tool chain. But to me that's not really a world model. But my guess is they'll get there. Right. They'll start generating.
C
Yeah. They just haven't released it.
B
Yeah, exactly, exactly. And I think. Right. Fei Fei is one of the like founders of the entire space. So I think it's going to be really interesting to me on what maybe that interactive piece looks like for me to really judge their approach.
C
I think we interviewed, just before we move to Yann, we interviewed her with Justin Johnson, her co founder. He was more focused on the physics side of things and the interactivity and they just haven't been instantly effect. I do think that basically with the splats, if you just add more dimensions on, I guess, the forces acting on them, then you get interactivity out of the box. Basically these are virtual atoms that then have all the normal physics applied to them.
B
Yeah. I'm excited to see what that looks like when they actually release it. It's really hard for me to comment on anything. I really like the frame based approach because all of our video or all of our training data is in this format.
C
Yes. So we actually asked them about this and they were like, yeah, it's possible but they're choosing the splatter. Yeah, yeah.
B
And you can also go from splat to frames. Right. I'm sure you can, it wouldn't be easy. Like you'd have to actually render out the environment. Sure. It's not going to be a simple problem, but in theory it has to be something that you can do if you really wanted to. Because it's almost like having a more sort of ground truth three dimensional representation of the underlying world. Right. So I think it's an interesting approach. It might be overkill. Right. You're also dealing with much larger degrees of freedom on the output space. Right. So who knows how well it scales. I like the fact that I think these video models also use things like autoencoders. Right. You can actually have the world models predict a much smaller, maybe like a resolution or size. Yeah, exactly. And then you can use diffusion, upscaling or methods like this to actually enrich. And so I think that world models, in my sense, just allow for a much more controlled space that we know really well. I'm not suggesting their approach is wrong, I'm just, you know, this is I think what we really like about it. Honestly, Yann's podcast that he did, I don't remember which one it was, but a long time ago, where he basically proclaimed LLMs to be a dead end, was one of the things that inspired me to do this.
C
I think this is very consensus among world models people. Basically every one of the practitioners stops with their LLMs and just goes over to world models. I would say the main pushback, I asked this exact question to Noam Brown from OpenAI, and he was like, well, LLMs learn implicit world models.
A
Right.
C
So there's basically the difference between. Yes.
B
So yeah, I'm not one to proclaim LLMs are dead ends, personally. I think they're actually quite useful, in particular as orchestrators. The way I think about it is as humans we had sort of a three dimensional world, then we invented text, in a way, as a compression method, right. So we invented text in order to communicate with each other in a common way, in a way that actually compresses all this information that we are perceiving in three dimensional space into just a single sequence. And I think that allowed sciences to emerge, it allowed so much literature, so many parts of the world that we cherish. So I think it's a critical part of the whole picture. I also agree that it's very, very clear that they do build sort of internal implicit world models inside LLMs. And so I think they'll be very helpful as things like orchestrators. The problem is when it comes to the generalization, I think text as a generalization backbone, when most of the pre training is text or largely text sequences, then I think you want that backbone to be kind of more spatial-temporal in nature and then also just have text as part of that. And I think the actual argument against LLMs is also, for instance, the autoregressive nature of the prediction itself. So the fact that it's running the entire output through the transformer in order to predict the next token, whereas the environment in the real world is continuous. Right. It's always changing and LLMs kind of just forget about that. Right. I think a lot of the argument isn't first. Right. So I think the fact that text doesn't necessarily generalize well to spatial-temporal context, and then the autoregressive nature of the prediction and using text for that. Right. So I think those are the two main arguments. I think text prediction is just one of the actions that is going to come out of these policies and world models.
I think speech and text generation will just be one of the actions that can be a part of that. I think that there will just be labs coming at this problem from both sides and everyone ends up in roughly the same place. And the same place will be whatever people think is cool. Right. Like whatever the consumer needs, whatever is closest to AGI. Yeah. And so I don't think there's like a clear answer. I think it's really interesting to come at it from the world modeling side. But it's also because we have to. Right. Because like text is largely commoditized. We can import all the text.
C
I think it's interesting. And chanting limit tempting. It makes sense that you can probably recover. It's sort of like you're taking a step back, you're starting a new branch of the ML research tree, but you might actually just end up recovering all the other text stuff emergently.
B
Yeah, yeah, we can import a lot of that research. Right. A lot of that is.
C
That's really cool. On the research side, let's talk about the stuff that GI is producing or like I guess the sort of research and product output. You mentioned the word customers. What are your target customers?
B
Yeah, so we're already working with some of the largest game developers in the world. Yeah, we're also working with game engines directly. And so really what we're doing at the moment is replacing essentially the player controller inside of a game engine. So anything that you're currently, maybe like behavior trees or things that you're deterministically coding, we hope to replace with a single API, which is just: you stream us frames and we predict actions, and that can be inside an engine or it can be eventually even inside the real world. Hopefully those are then also steerable. So the models that you saw weren't text steerable yet. But I think we want to get to a point where they're fully text steerable.
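The "stream us frames and we predict actions" shape can be sketched as a simple loop. This is not GI's API, which is not described in the conversation; the type aliases, `run_player_controller`, and the placeholder `brightest_side_policy` are all hypothetical names, and the hand-written policy merely stands in where a learned model would go.

```python
from typing import Callable, Iterable, Iterator, Sequence

Frame = Sequence[Sequence[int]]  # toy grayscale frame: rows of pixel values
Action = str                     # toy action label, e.g. a controller input


def run_player_controller(
    frames: Iterable[Frame], policy: Callable[[Frame], Action]
) -> Iterator[Action]:
    """Consume a stream of frames and emit one predicted action per frame."""
    for frame in frames:
        yield policy(frame)


def brightest_side_policy(frame: Frame) -> Action:
    """Placeholder policy: steer toward the brighter half of the frame."""
    mid = len(frame[0]) // 2
    left = sum(sum(row[:mid]) for row in frame)
    right = sum(sum(row[mid:]) for row in frame)
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "forward"


if __name__ == "__main__":
    stream = [
        [[9, 0], [9, 0]],
        [[0, 5], [0, 5]],
        [[1, 1], [1, 1]],
    ]
    print(list(run_player_controller(stream, brightest_side_policy)))
```

Keeping the policy behind a plain callable is the design point: the engine-side loop does not change whether the callable wraps a behavior tree, a local model, or a remote service.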
C
But to say steerable means like well I want you to go to share figure anything else out in the pre.
B
Yeah, I think it's text conditioning on the generation. So yeah, the ability to. You're right. We want to get to a point where you can genuinely, and that's why it's called General Intuition, where we can sort of mimic the intuition of all these gamers into human like behaviors in any situation. As I mentioned also the lab is named after AlphaFold, which is. Wouldn't it be amazing if we could mimic the intuition of these gamers who are by the way only amateur biologists on his path to. He tried to get an AI to train Foldit to generate a lot of data for AlphaFold. And so for us, really the North Star, what we hope to get to one day, is being able to represent scientific problems in three dimensional space and then have a spatial agent capable of perceiving that space and using hopefully also the text reasoning capabilities that LLMs have today, in addition to the spatial-temporal capabilities, to be able to work on the other side of that problem. So that for us is sort of the North Star. That's why we're sort of trying to be hyper focused on spatial-temporal workloads. The same way that Anthropic was hyper focused on code and used that to then get into organizations and expand from there.
C
Yeah. Just as a side note, since you mentioned Anthropic, any idea what they did on this to solve code?
B
No. Out of any lab. I probably know Anthropic the least, to be honest.
C
Yeah.
B
I admire them though.
C
Yeah. The current working theory is that they had a super lucky roll of the dice, but, well, and then it compounds from there.
B
That sounds like a nice story. I'm sure it's not that.
A
Yeah.
C
Okay, so why do game developers want this?
B
So if you're a game developer, how well you're actually retaining players is like if you have a game that's already at scale, it's decently dependent on how good your bots are. So if you're logging in at an obscure time, let's say 3am in America and your player liquidity is low, then you need really, really good bots to keep those players engaged.
C
Is this a thing?
B
Yeah, for sure.
C
For like Fortnite and whatever.
B
A lot of giving words. It is.
C
Yeah.
B
Um, and so if, if you're like.
C
As a human, do I want to play against bots?
B
Usually it's not just bots. It's like players mix in with bots because you don't want to play just against bots. But it's better to have a full game than to have like an empty game. Yeah. And so I think as long as it's part of the environment, I think it's okay.
C
That means you also have to sort of grade that skill level.
B
Yeah, yeah. Which we can do because we have. We know exactly how good people are at these games. Yeah, yeah. I think for us bots is kind of like step one. Right. So what I was showing you is we're building a general agent that can sort of play any game in real time. But really that extends into all of simulation. Right. Like in GTA 5 for instance, people are genuinely role playing real life. Right. And so they're actually behaving in quite aligned ways with the goals they set for themselves. So you have all these examples represented in video games. Right. You have Truck Simulator, Power Wash Simulator.
C
Power Wash Simulator.
B
There's Power Wash Simulator where like actually the behaviors that you'd want an agent to be able to perceive, they're all there.
C
Mysterious. Yeah. It's really learning how seriously some gamers take Truck Simulator. If you haven't seen these tips, you should watch it. Yeah, they buy the whole truck driving set and they're doing the job of a truck driver.
B
Yeah. What I mentioned to you: we have more people at any given time on Metal playing with steering wheels in Truck Simulator and these types of games than Waymo has cars on the road. It's a ridiculous stat, but it's true.
C
Yeah, yeah. I mean so you know I used to think that well to solve self driving you kind of just need to play a lot of GTA 5. Yeah. I mean it's not bad for this.
B
Yeah. Our bet is not that we can zero shot any of these things, it's just that the next self driving company can maybe collect 1% of the data. Right. Also, for instance, clips already self select into negative events and adversity. Right. And so a lot of our data set, because it's already highlights, is really precisely what a lot of these companies spend their last 20% doing. Right. And I think that's the main argument if you're another company that's looking at what we're doing. I think the thing that people won't understand is that anything that you're currently doing in pre training, as long as your robot can be controlled using a game controller, we hope that we can move that to post training for you. So our bet is not that we can create the next self driving car company. It's just that the next self driving car company hopefully only needs 1% of the data, or maybe 10% of the data, I don't know, to be able to deliver a really good product. Yeah.
C
It's also the term that comes to mind a lot is active learning. I Don't know if you've used to identify with that. It got less cool for a bit and now it seems like not the uptrend, which obviously you have the best data set for the PI intensity. You said negative, but I feel like you found negative. It could be negative or part of it.
B
Yeah, for sure. I think negative events is just because it's the most common term that people use for like if you're Tesla, you want the crashes, you want like.
C
Right, yeah, right, right, right. But it's only gaming.
B
So the model that you saw obviously had really, really incredible moments and that was largely that it had a large representation of people at their best.
C
Yes.
B
And worst.
C
Yeah, amazing. Okay, cool. Anything else on the customer development side that you want to sort of flesh out?
B
Yeah, we're also already working with robotics companies, and manufacturing. But again, the key is that the robot has to have gaming inputs. So our bet is not that we can transfer over to higher DoF robots than the keyboard and mouse. It's really just that we can move the hard work of pre training, hopefully, to post training.
C
Yeah, it's kind of like the foundation model. That is a very good basis to start.
B
Yeah. You're going to give us frames and likely some text or you'll license the.
C
Model because they're going to want to post training. Yeah.
B
Our business model is initially going to be an API, again like the anthropic API. But you also saw for instance some of the video labeling models that we've been able to develop. So the goal is for any company to be able to take in their video data as well. And we can create first obviously custom versions of the policy for you, the agent. If that doesn't work, then we're already working with a customer that is doing. We distill a model and they turn that into a product for themselves.
C
So people can engage with you on the agent level, API level. People can engage with you on the sort of model level. Can you also buy data?
B
No, we don't sell data.
C
Okay, cool. So that's the business. And is there a world in which, I mean, I think this is on your landing page, that you are a frontier lab for world models. Is there a world in which there is a more sort of application layer thing that comes up, like ChatGPT for whatever?
B
Yeah, you're going to see us launch a few things on Metal itself that are going to blow your mind as a result of this, this, this agent. I'll. I'll leave the imagination for now.
C
If people to figure it out. Email and.
B
Yeah. On the world modeling side, I think one thing people underestimate is that Metal is already one of the largest video consumption platforms as well. People watch millions and millions of videos a day. So world model based entertainment and things like that, while it's not a focus for us right now, I think on the consumer side we have the ability to move very, very quickly here and get it integrated in a way that I don't think anyone else can.
C
Yeah, you could theoretically do video gen, like the Sora. What is that? Instagram 1. What's the meta one? Meta meals. Not RIAs.
B
Vibes. Yeah.
C
You could theoretically generate clips that nobody played. But you know, it's going to be vibes.
B
Yeah. I think for us, the games being so human centric is like a really big part of what makes it special. Like I actually, actually just don't think that would work. Like, one thing that we are really excited about though, I'll give you one sneak peek of what we're thinking about is what if you could literally replay any of the clips that you have inside a world model or your friends can play them like I showed you, a model that already took part of your clip as a context since the.
C
Replay entered that world.
B
But it's also how we go from imitation learning to rl, Right. Because it's part of a research roadmap anyways to make every single, every single clip on metal playable. So yeah. Who is to say that that doesn't apply to just the actual clips that you take?
C
Yeah, yeah. Can you say more about the RL potential?
B
We describe Metal as the episodic memory of humanity in simulation. So when you take a clip, really the way to think about it is you get the highlight of what is maybe three hours of playtime. Right. You maybe get two to three minutes of the things that were the most out of distribution. Right. It is genuinely your episodic memory of that playtime in simulation. The things that you most want to remember and share. We want to be able to load, and this is the work that Anthony Hu is doing, the reason why we built world models, every crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game. And again, these are ground truth labels. So we know precisely the actions that lead up to the negative events. They're also title labeled. When people upload it onto the platform, they say, okay, it's a crash. Right. And so we can select all these events, and if we can put them inside a world model, we can train reward models to then reward based on how you perform in clips that actually contain negative events, for example. And so for us it's very much about, right, we can create this LLM moment on imitation learning. But actually making every single clip on the platform playable at billions of clips scale is how we go from imitation learning to RL.
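The selection step described here (title labels pick out negative events, then a reward scores how a replay inside a world model turns out) can be sketched in a few lines. All names are hypothetical, the keyword list is an assumption, and the hand-written `toy_reward` only stands in for the trained reward model the speaker mentions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed keyword list; the conversation only gives "crash" as an example title label.
NEGATIVE_KEYWORDS: Tuple[str, ...] = ("crash", "collision")


@dataclass
class Clip:
    title: str           # user-supplied title at upload time
    actions: List[str]   # ground-truth inputs recorded alongside the video


def is_negative_event(clip: Clip) -> bool:
    """Treat a clip as a negative event if its title contains a known keyword."""
    title = clip.title.lower()
    return any(keyword in title for keyword in NEGATIVE_KEYWORDS)


def select_negative_clips(clips: List[Clip]) -> List[Clip]:
    """Keep only the clips whose titles mark them as negative events."""
    return [clip for clip in clips if is_negative_event(clip)]


def toy_reward(clip: Clip, replay_ended_in_crash: bool) -> float:
    """Placeholder reward for replaying a clip inside a world model.

    +1.0 if a clip that originally contained a negative event is replayed
    without one, -1.0 if the replay crashes again, 0.0 for other clips.
    """
    if not is_negative_event(clip):
        return 0.0
    return -1.0 if replay_ended_in_crash else 1.0
```

A policy trained by imitation would supply the replay actions; rewarding the outcomes of those replays is the step that turns the setup into reinforcement learning.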
C
Cool. We covered a lot of it. Is there anything else that you want to do before we sort of grapple with the long term vision stuff?
B
Yeah, yeah. I think for us this is a very, very ambitious long term bet. We need the best researchers in the world that want to work on this stuff. It's really exciting not being extremely data constrained. We get so many learnings every week that we didn't think were possible, and it makes it a joy working here. Also, the other thing is because we have such a large data moat, we don't have to be as concerned as the LLM companies about publishing because we don't.
C
Even want to be able to.
B
Exactly. No one can replicate the models. Right. And so for us, we really want to bring back the original culture of open research, which is why we did the partnership with Kyutai in France.
C
I actually didn't.
B
Yeah, we just announced our partnership with Kyutai in France, which is an open science lab in Paris, one of the best research labs in the world. Eric Schmidt, I believe, funded it, in addition to some French people. They are essentially acting as the partner that's currently doing a lot of open research on the data. We also want to partner with universities because we do believe this is the frontier, but it's so data constrained that really everyone has their hands tied behind their back right now. And so we want to help fix that. So for instance, we want to work with universities to build negative event prediction models for maybe trucks in India, on all the truck data where all these crashes occur. We have all these things that we know we can do that we just haven't had the time to do. And so if you're listening to this and you're maybe an academic institution or something and you want access to some of this data in an educational research fashion, I think we're quite open to doing that because we want to educate people. Yeah. And other than that we just want to work with the best infrastructure and research engineers on the planet as we're going into scaling runs that have thousands, tens of thousands, eventually hundreds of thousands of GPUs.
C
Yeah. Amazing. I primed you on this as the closing question. It's a little bit the Khosla 2030 question. I didn't know. So what does GI become by 2030?
B
Yeah. In 2030, we want to be the gold standard of intelligence. And any sequence long enough is fundamentally spatial and temporal. Right. Which I think is. So by nailing spatial temporal reasoning, you go after the root killer problem of intelligence itself. What the world looks like is we want to have eight. So I sort of group the sequences of AI in three stages, and I credit Andrej Karpathy for teaching us: bits to bits, atoms to bits, and bits to atoms, and then atoms to atoms. In the atoms-to-atoms stage, I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by AI models. And the reason for that is because we were able to unblock intelligence so quickly. In robotics, like, intelligence is the bottleneck that supply chains actually converged on gaming inputs as their primary input methods, and they converged on essentially simpler systems that let us do a lot more, a lot quicker. So we are essentially the 80% market approach. And then you have lots of companies that have kind of like specialized, maybe unit robot OS stacks that are the other 20. And then so I want to be responsible for 80% of all the atoms-to-atoms interactions driven by these models and be the gold standard for intelligence. And maybe 100x more in simulation, because I think simulation will actually be the larger market initially. So I think in simulation, because you have very little constraints, also from a safety perspective, simulation is much easier. So I think a lot of the takeoff initially is in simulation. So a lot of the simulation use cases, like what I mentioned, scientific use cases, I'm really, really excited about. And so, yeah, 80% of atoms-to-atoms interactions coming downstream from these types of spatial-temporal foundation models, and then 100x more in simulation. Yeah.
C
It reminds me a lot of what Mark and Priscilla from the Chan Zuckerberg Initiative are doing with virtual biology, because you can do a lot more in simulation than you can do.
B
Yeah.
C
Or you can do a lot faster with interest. Amazing. Thank you for inviting us to your office.
B
Yeah.
C
And thank you for sharing a little bit about your training.
B
Thank you.
C
Yeah.
Date: December 6, 2025
Host: Latent.Space
Guest: Pim de Witte (CEO, General Intuition "GI")
Length: Key content detailed through 01:04:06
This episode features an in-depth, first-ever interview with Pim de Witte, founder of General Intuition (GI), an ambitious world model lab spun out from Metal—the company behind the largest dataset of labeled video game action clips. The conversation explores the rise of foundation models beyond LLMs, the massive $134M seed led by Khosla Ventures (its biggest since OpenAI), and the role of high-fidelity human behavior data in the future of spatial-temporal AI agents and robotics. Listeners are taken through technical demos, the origins of GI, privacy-conscious data approaches, and the long-term ambition to define the gold standard for machine intelligence in both simulation and the real world.
Metal’s Background: A retroactive video clipping platform, Metal amassed 3.8B game clips focused on “peak” moments—essentially mining for interesting, high-skill instances of human behavior (00:00–02:07).
Privacy-First Design: Rather than logging raw key presses (e.g., WASD), Metal captures abstracted action labels—ensuring privacy while retaining essential behavioral signals for training (17:56–18:08).
Data Application: This action-labeled video enables training world models that predict actions purely from pixels, transferring game-learned intuition to real-world scenarios (06:10–06:33).
Early Demos: GI’s agent “sees” only game frames and predicts next actions, imitating human navigation, getting “stuck” and unsticking itself, revealing both human-like error and superhuman moments due to being trained on user-curated highlights (02:08–05:14).
Technical Approach:
Impact:
Why GI stayed independent:
Seed Round & Vision:
Foundation Model Analogy:
Distillation for Scale: Tiny models can act in real time, sacrificing some performance for deployment at the edge (12:01–13:41).
Real-World Use Cases:
Business Model:
| Timestamp | Speaker | Quote |
|-----------|---------|-------|
| 02:08 | Pim | "What I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would." |
| 05:15 | Pim | "The baseline of our dataset is peak human performance." |
| 27:57 | Pim | "We think we could essentially leap every single company... and take this foundation model bet for spatial-temporal agents." |
| 33:05 | Pim | "He [Khosla] asked you to draw a 2030 picture of your company... and expects you to do that flawlessly, challenging any part of the vision." |
| 46:37 | Pim | "[Yann LeCun] proclaimed LLMs to be a dead end. That was one of the things that inspired me to do this." |
| 54:12 | Pim | "We have more people at any given time on Metal playing with steering wheels in Truck Simulator than Waymo has cars on the road." |
| 62:09 | Pim | "In 2030, we want to be the gold standard of intelligence… nailing spatial temporal reasoning to go after the root killer problem of intelligence itself." |
This interview offers a rare, “in-the-lab” account of the frontier in world models, a new AI paradigm that promises to extend foundation models from text and code into simulated and embodied intelligence. Pim de Witte and the GI team are betting that their privacy-conscious, action-rich video gaming dataset will allow agents not just to play games, but to develop the core spatial and temporal intuitions required for robotics and scientific breakthroughs. Backed by Khosla Ventures, which led the seed round, GI is setting out to define the industry benchmarks for machine intelligence in both simulation and real-world environments: a true “gold standard” for embodied learning.
For more, check the full show notes at Latent Space.