
Loading summary
A
Okay. We're here in the studio with Ethan. He most recently of xai. Welcome.
B
Yeah, thank you. Glad being here.
A
We're also here with Vibu. You were first coming to us or joining the latent space world because you were working on Cosmos and Nvidia and you did a great paper. We loved it. You presented it as well. So thank you for doing that. Yep.
B
I also presented the moes twice.
C
Yeah.
A
Yeah. How did you actually hear about us? Did we reach out to you? Is that how it worked?
B
No, actually the community, like I realized, oh, there's this online community that people talk about AI and also learn from each other through papers every week through the paper club. It's very nice.
A
Yeah.
B
I learned a lot.
A
I think three years nonstop. We haven't stopped even on Christmas and New Year's. Many weeks I want to stop.
C
I think you had posted that you worked on a paper and I was like, oh, very cool. We have a paper club presented. But I might have reached out to you after.
A
Yeah. Because it's an amateur club, right?
B
Yeah.
A
So it's very unusual, but we have sometimes paper authors come by and actually explain the paper. Today we just did the poolside paper, which is apparently very good.
C
Came out yesterday. Pretty interesting, right? Fully open. They talk about everything system. So it's a good one. We'll recommend people to read it.
A
Bring us up to speed on your transition to xai because I actually don't even know when you joined. Just tell the story about the transition.
B
Before xai, I was working on Cosmos world model at Nvidia. So Cosmos is a giant video foundation models that aims to simulate the world. And it serves as a foundation for all of the roboticists to build on top of there. Once I built the Cosmos one, I realized this thing also has a scaling law similar to language model. We need to scale up the video models further. That's why I realized I need to move to somewhere with much more computer resources.
A
That's how I than Nvidia GPU rich game themselves.
B
Yeah.
C
And timeline wise, when was Cosmo? It was pretty early, right? It was open world model, open paper.
B
Yeah. It was like end of 2024. End of 2024, yeah. Then at mid-2025, I moved to Xai. At that time, I joined about the time when Xai was about to build video models and multimodal models. There were no infra, no data and no model and just a few engineers. We built it in three, released the first model, Grok Imagine 0.9. And since then I keep working on video models and move more from pre training and to post training of the video models. For example like reference to videos kind of like the cameo feature and video extensions. And before I left I at work on a world model leading a small team to focus on the real time long hours and video generation.
A
Can you give a rough roadmap of like okay, you're on a brand new team. Grok previously was only tech so they partnered with BFL for their image gen stuff. What are the building blocks? Right. You have compute data you can procure somewhere. What are the sequence of things that people should think about when you're setting up a new team?
C
Actually even deeper. Not just data you can procure. You guys had to go through getting the data too, right. So you shipped it pretty fast. But yeah.
A
Yeah. Three months is like actually like very surprisingly fast.
B
Yeah. One thing I say like thanks to my experience at Nvidia because first time when we were building Cosmos together we built it for about a year. So this is like the second time I do it roughly have an idea like what to do. I say the most important thing is is a talent. Everyone were very strong and clever. Very close with each other towards a common goal. So that speed up things a lot. So you reduce the communication bandwidth among people and everyone can work towards the same goal. It's like every day there's not that much meetings on the calendar. Like maybe a sync a day and after that it's just all building. It was pretty fun at that time. And another thing is that XAI has very strong foundations of like data infrast model infras and the supporting there can help the model develop a lot. When I look at training models. So actually the top important thing is how many, how many iterations can you do per day? And the more iteration can you do, you can train the model much faster. So if you have very strong infra and you have a lot of compute, you can train these models in very short period of time. That can give you a much larger bar flow to for arrows. And it also gives you the opportunity to spot more bugs. Yeah.
A
What is an iteration? Is it like a few hundred steps
B
or what are you let's say just training the model like from acquire new data and maybe design new algorithms and train a new model. Maybe a smaller.
A
Yeah. So cycle time for any hyperparam that
B
you're cycle time end to end to evaluate this model. Is this model better than my previous iteration?
A
So it's like before you someone had already set this up that you can. Iterate very quickly.
B
Yeah. I think the foundation there is extremely good for developing and research models. And often I find this is kind of boring. But a lot of the improvements does not come from new algorithms. It comes from finding small bugs here and there in the data pipeline, in the, in the model training pipeline. Those give the biggest boost to the model quality.
C
It's interesting, right? So you say it's like small team, less communication bandwidth, but also a lot of quality is like find little bugs. It seems counterintuitive, right? You have a lot of people, you can iron out more of those. But it's interesting to see the other side, right?
B
Yeah, yeah.
A
I also wonder, have you. Did you try using LLMs to look for bugs? I don't know.
B
I remember at that time it was mid-2025. So it's. The coding model wasn't quite there yet. I remember like December 2025, it was extremely good. Yeah, I've been using it at that time. It's helpful sometimes it produce codes that are kind of difficult to maintain. Even though like the first time it built something extremely fast. But it gave the like a spaghetti code, thousands of lines that I couldn't maintain and the OM itself couldn't figure out what's wrong and how to improve on top of. But now I find it much, much, much better. Yeah, I want to bring up another point here is like now coding models are much more efficient and can help us implement stuff much faster. Compute might become a bottleneck again because previously if you want to train a new model, say you want to generate new synthetic data and then or write a new algorithm, it might take a few weeks and during that period of time you might not have experiments to run. And now you can build that thing within a few hours. Then you can immediately train a model. Now you have to have enough compute to try all of the ideas. So compute might be the bottleneck of iterating speed again.
A
Yeah, actually, honestly, I think it's kind of a stressful job because you're like, well, I should be trying everything and if I'm not, then I'm not doing my job well.
C
There's also the stress of your eating thousands of GPUs per hour, which is very expensive and compute can go down Daddy Elon. But there's still finite amount of computer. You want to use it, you want to use it well, you want more of it.
B
That was quite stressful indeed. Yeah. I think one thing is with coding models now, a lot of these jobs can be automated, which is much better. Second, it's a marathon. So you got to maintain good health and regular schedule.
C
It's hard to hear that when you shift from zero to nothing in three months.
A
Yeah, I think obviously the culture, famously, people work very hard. One thing I did want to dive into, in the notes that you sent ahead of time, you had specific comments about the cost of Videogen training. Presumably this is on the Colossus 1, right. The 200 megawatt cluster and whatever you want to.
C
I think there's three things we're talking about, right? So there's videogen, there's also the ImageGen model that you put out. Do you want to like complete the. Okay, so 0 to 1, you have a few months. Just what are the stages of create?
A
Oh yeah, maybe I got distracted.
C
Sorry. And then, you know, from there there's video gen, there's audiogen. I'd love to get into those next. But what is that first few months like so small team, a lot of bugs, iterations. But like, you know, what does it look like? Do we take something off the shelf? Do we just get data, compute? What's the few months like? How do you go to state of the art image gen model? How do you just start?
B
Yeah, I cannot comment specifically how excited, but it's quite a standard process. I can draw some examples from Cosmos. So mainly it's like building a video model. You actually need to build an image model first. And building these two models, the data you need is 100% synthetic pair of language and image or language to video. Because on the Internet actually the videos don't naturally associate with text. So you can say, oh, like on YouTube you have the title and you have the description and the comments of a video, but usually they're not relevant to the video itself. And say, maybe the video is a natural scene of mountains or something. And the title is I'm so happy today. They have no correlation at all. So the first step is to. You have to generate synthetic pair of language with videos. So you get the videos from the Internet and you use a VLM to caption the videos. So that part. Here's a question. How do you get the VLM to begin with? So if there's no.
A
You fuse the model, right?
B
Say if there's no VLM exists, how do you generate the text to the beginning? It's impossible.
A
I see.
B
In the beginning it's like you ask human to describe the video as detailed as possible. For example, you ask them to describe everything, like all objects, all characters and all interaction and dialogues in the videos. So that's in the protocol of cosmos labeling we require. The objective it gave to the labelers was that you have to describe the video as detailed as possible, such that a blind person here, the blob of text can reconstruct what the video is from their head.
A
Video or image?
B
You're talking about image. Video or image. Either one of them.
A
Okay.
C
This was pretty common when we went from like clip and Dall E. Right. It's all training on really detailed captioning of images. So same is applied to video. But instead of using multimodal model to pass in video or images and write rich descriptions, you can also.
A
I think that's the traditional perspective of supervised or very highly human curated thing. I feel like there's an unlock with unsupervised where you have enough to bootstrap that you can just throw common corpus on it or whatever. Unsupervised vision and language pairing where you just have interspersed image and text. And it just learns to me that is the VLM breakthrough that is different from the clip, different from the pre LLM era.
B
Yeah. Yeah. It's interesting to see that you kind of need both data.
A
Yes.
B
For example, you needed to bootstrap it. Yeah. For the generative model training, there's also really like a small percentage of unlabeled data. So the model is instructed to generate a video without any text instruction. That can also help some model generalize. So after. After this stage of generating the synthetic pair. So one. One important common step is to train a compressor or a tokenizer of the image or videos. So because if you train. If you can technically theoretically train image or video models on pure pixels. But the, the problem is that the. It's a lot of tokens. So like one image, like it's 1000 by 1000, it's like 1 million tokens. 1 million pixels. It's impossible to train transformer on that. So you need to train a tokenizer which can go from image to latent space and latent space back to image.
A
That's why we named the podcast.
B
Exactly.
A
But basically you're talking about vocabulary science. What is immune like a million is
B
impossible in generative models. The vocab is continuous. It's a continuous space you can think about like you map an image to a vector. It's a fixed length vector of like 16 or 48, something like that. And then you map that vector back to the image space and the mapping has. The mapping is patch based. So you say you have a 16 by 16 patch and you map that patch of pixels into this latent space we've covered.
C
This is vision transformer.
B
Yeah.
C
Vaes, you basically compress your input, you do your generation, you're reasoning all that generation in smaller dimension and then you project back out.
B
Yeah.
A
VAES is a form compression. But I think for me, the patching thing is from vit. Right. You can make literally the paper is titled like 16 by 16 is all you need. Something like that. And then I think also people make a lot of comparisons with this kind of patching with convolutions.
B
Yes.
A
Which is you are kind of reconstructing the old paradigm with the new.
B
Yeah. Actually in vaes there are both convolution networks and transformers. You can actually do both after this vae. So what you've got is you've got latent space tokens and you've got the language tokens. So now the training of the diffusion transformer, usually generated models use diffusion transformers. It's actually quite standard. It's very similar to how you train language transformer models. It's not that much difference. It's just the visual tokens in, visual tokens out. The only difference is there's a denoising process. So you train the model to unmask some of the noise. So you add random noise to the visual tokens and then you train the model to remove those noise to generate the clean tokens. And in inference, the model can iteratively remove noise from 100% noise.
A
Yeah. And then there's also to speed things along on the tech tree of diffusion, there's CFG and then there's also, I guess latent diffusion. That is someone in there, I think somewhere along the line, obviously like stability. And all these other guys pioneered a lot of this architecture. I don't know if you want to get into that or just do the video side after you.
B
After you train such model, Such image model. The reason it's a foundation for video models is that image models are cheaper to train and they have much denser connection between language and text. Sorry, language and images. For example, you train a billion, you turn on a billion images and there's a mapping from the text to the image and the cost to trend the same like a billion text to a billion videos, that's much more expensive because videos naturally have more tokens than images. Because the diffusion models, their understanding of language purely come from this mapping. So if you don't have enough mapping, so if you only train on like a 10 million videos or something there, you might not see enough language. Tokens in your training. So your model does not understand human intention enough. So that's why you usually you first train this image diffusion models and then you bootstrap the video model from there.
A
One thing I did want to ask because I actually. I think you're the first video model person I've ever talked to. I think we've talked to Luma and all those folks. There's all these tricks in video compression where basically frame by frame, there's not that much difference. So actually you don't have to regenerate or resave the whole frame. Right. I think mp4 compression or something else like that. Is it tempting to use that or as far as I can tell, everyone just treats it as no, we will just generate every frame. Is that roughly the state of the art?
B
There are a few different approaches. Let's say first you want to just directly use MP4 compression and use that as the tokens for the transformers to train. Right. So people actually have tried that. But the main challenge is the latent space for the MP4 tokens or not were not very comprehensible for the models. It's extremely hard to train on that. And there's a. So that's why we created vaes, which creates more continuous latent space. So the models can understand that latent space and learn from it much easier. Even within the VAEs there are different difficulties of the latent space. So you can imagine something that the simplest, the most naive ve is you have an image and you just shuffle all of the images into a vector. So you don't need to train naves. Right. But that latent space is extremely hard for models to train on top of. That's why there are some debate on how do you compress the tokens. So you mentioned you can compress frame by frame. Also you can compress the temporal dimension.
A
Yes.
B
The difference is if you compress temporal dimension, you get a much higher compression rate because there's temporal redundancy between frames. Because this frame and the last frame, likely they are mostly similar. So there's only some small difference. For example, I think in 1.2.1 ve they have like a 8 by 8 by 4 compression rate. So the four temporal tokens are compressed into one tokens. That can save a lot of the context length. If you do it frame by frame, you have to do maybe like eight by eight by one, your context length will be four times larger. That being said, the benefit of the per frame compression, we might come back to this later, is real timeness. And interactivity. Because if you strain the output of the model frame by frame, the model can respond to any user request immediately. So if you have a temporal four compression four times compression, then it might be laggy. Yeah, there's a lag there in nature.
A
So you're very pilled on this. Let's just go ahead and bring it out because we have the visual prepared anyway. There's some frontier applications of real time video gen. So Flipbook is one of the examples that went viral recently. Right. What is Flipbook?
B
Flipbook is kind of like a web browser. You can see like it has the web browser UI on top. The difference is all of the UIs are generated by generative image model in real time and anything here are fake. But you can explore inside this imaginary world. Here we have engineering the great pyramid. The model generates this for us to understand how it works. And if we want to navigate around and understand further, we can click on some of the description here and the model will generate a new page, new sub page describing the details we want to know about.
A
So it's basically kind of we are playing a video, but it's pausing for our next interaction and then it just plays the next thing based on our interaction.
B
It's kind of cool.
C
Yes, you kind of decide your story. So this was how do you make a pyramid levering technique seemed interesting, right? It shows how do you take. Okay, I want to know what is
A
the demo Tweet had more animation between frames.
C
I think it's just skipping.
A
I was just skipping a lot of frames.
B
Yeah, they also have a lot of video mode, but I guess a lot of people are using it.
A
Yeah, I see it.
C
There's a live video stream we can try.
A
Yeah. So this is an example of the kind of future that you see at the extreme. We're obviously not in it today, but in a world where inference is completely free. Yeah, this is better than generating code in text.
B
Yeah. So this is the final state of where we will be at for word model. I think imagine Internet doesn't exist. And then you type in google.com, what should a model show you? The model can imagine something and this is what the model imagined. And these web pages, they completely do not exist. So I think as the inference costs come down, we are going to have generative UI for everything. If you think about how the coding model works, so they write code for a web page and they render the code might be converted into binary and the binary render the pixels on the screen. So in machine learning, every time we have some breakthrough. Obviously it's more end to end. So why don't we have user instruction to the pixel directly? So the generative UI will be user intention to the pixels directly. And say even if I want email, let's say everyone have the same interface, but I want it slightly different. I want the email to show to me like a TikTok so I can swipe that for the emails. Or maybe you want something else. We can have completely different things. Or like I have, I'm looking at Instagram stories. I don't like the like button. I always make click and generate the UI without it. So it's going to be a revolutionary replacement of the interface. So in the future we might have much more powerful LLMs and coding models running behind the scene and in the front end, the diffusion model will actually be the front end to show stuff to you. That's how I imagine it.
A
Yeah. Diffusion front end deterministic backend. Yes, something like that. I find that very expensive. But you know, I find it interesting
C
you called LLMs writing code on the backend deterministic.
A
But okay, yeah, you write it once and then you execute.
B
If you think about the cars, say, let's say H100 costs $1 per hour. And if you use this eight hours a day and 30 days. So every month you're paying this 240, you're likely not want to pay for that. That's even more expensive than cloud code, max. But if you think about the compute costs come down like two times every year. I think that the future will likely arrive.
C
Compute cost comes down, compute gets faster, model gets smarter, gets smolder.
A
Yeah, I don't know why you say two times, because I think it's like a hundred times. In language models it is roughly 100 to 1000 times every 12 to 18 months for the same given level of LMSIs. Elo.
C
That's a net of everything.
B
Right.
C
That's model performance alongside computer. So different than just compute costs come down. But a very interesting future.
A
Yeah. For the web, designers will have to shout out that accessibility is an issue. How do you deal with screen readers or whatever. But yes, this is higher bandwidth storytelling than anything you can possibly generate with code.
B
Right.
A
So I think that's the rough idea
B
and I'd like to add a little bit. So human naturally have the maximum bandwidth when we are looking at things, look at videos and we also have maximum output bandwidth when we're talking. So in the future it might be something like we talk to AI models and the AI model responds back with the generative ui. So that would be the maximum input and output bandwidth to interact with AI models before neuralink happens.
C
And I mean, it's also very custom. Right. Some people are very visual, some people are not as visual. Right. They prefer the text. But the best thing about generative ui. Right. Can also be text.
B
Yes.
A
There's another project that we wanted to highlight, which is the neural os. Kind of similar idea, but here you're literally simulating an operating system with a video model.
B
Yes.
A
And you can play Doom. You can do Firefox. I find this, like, mildly less impressive, obviously, because it's an OS that I can run. But here everything is imagined.
C
I was, you know, used to the command W to close the Firefox tab that didn't crash.
A
Too immersive.
C
It's too immersive for me. I wanted to close the tab. But, yes, I can play Generated diffusion.
A
This is shockingly fast.
B
Yeah.
A
Because I remember there was a demo about like, maybe one or two years ago, someone tried to do the first person shooter with an image model. There was no consistency. It was very slow. But here, realistically, this is doom.
C
I mean, I think there's two sides to that. Right. There's like, okay, what is running a game? The heavy part of it is actually the game engine. All the lighting, all that stuff, the graphics. This is just kind of video. Right. Like, we've solved consistency. This is still, you know, it looks like a few years old. Image generation. There's some temporal consistency, but it's kind of just images stitched together as frame video. But it's a good visual representation to pitch to picture the future you want to see.
B
Right.
C
That's what I see in these more.
B
So this reminds me of how the video models gets better and better. So neural OS is kind of, if you just look at it, it feels like it's just a crappy version of the Windows we could have. Right. But the difference is the model. This model is overfitted on the existing operating systems. It can generate nothing different than that, but it's actually also similar to video models. So when we're training these video model, image model, we train them on Internet. There's no imaginary supernatural stuff on the Internet. But once we train this model, you can prompt the model to generate something supernatural that have never existed in the data set. So if you train your neural OS or neural computer on the standard screen recordings on the entire Internet, the model can imagine completely new interface to interact with the computer.
A
Yeah, this is one of those Things that is magical to me. Usually generalizing out of distribution is bad, but somehow we have learned some kind of internal world model.
B
Yes.
A
That you say this plus but it looks like rainbows and butterflies. It'll do it and it'll kind of make sense. So yeah, that's kind of cool. Yeah. I don't know if there's any comment more on there. I did wanted to touch a little bit more on the model architecture stuff which I think you were getting. It's really fascinating. We don't get a chance to talk about this enough. So one of the papers that we covered, we've covered every annual segment, anything release. And I don't know if you follow. I mean you're a computer vision guy. So they did memory attention which is kind of interesting. And I always think anything where you can across the temporal dimension keep some consistency. I think it's very fascinating. And I don't know if basically the CV side bleeding into videogen side I think is underexplored. Right. We talk about it for labeling but actually you can borrow the architecture itself.
C
And there's also complete different approaches. Right. Like you brought up the term world model. So we went from video model to world model. There is diffusion, but there's also other approaches that people are doing. So maybe we get into those after as well. Yeah, yeah.
A
He has a whole definition of world models and stuff. I feel like we threw a lot at you. Whatever you want to comment on.
C
I think one thing that we should actually comment back on is like okay, so we were talking about the steps to train image gen to video model. One thing we don't see as much of is like okay, you brought up the delta in training data, right. So you won't have as much. A video model might not generalize, but what is the cost of training a large video model? So we know for LLMs roughly. Okay, even like the poolside thing that came out today, right. It's a GEMMA level model trained on roughly 40 trillion tokens at this many H2 hundreds over this much time.
A
Right.
C
You can see what is the exact cost of that. So how many GPU hours over how much H200 costs. So how do we do the backend math of, you know, same thing for video models. Image models. How do you kind of break that down?
B
I can share some backseat envelope calculation. So surprisingly video models is like the cost is very comparable to language models and obviously the largest scale is language model, maybe like a medium scale language models. I said just storing the videos alone it costs a lot you can maybe look up on AWS or something. Usually like say if you have a billion videos and let's just say like each video like 5 megabyte, then you would need like 5 petabyte to just store those videos. And also remember we talk about you use a VAE to compress the videos and you also need to store. Typically you need to store those continuous feature and also in your storage. That's also comparable size with the videos themselves. So just storing these videos and the features is tens of petabytes alone.
A
I just looked up the calculation. 5 petabytes on S3 standard is 100k per month.
B
Okay.
C
And you need it.
B
And then like a tensor petaby is 200k. And even more expensive is you have the ingress and egress through the Internet. You have to just to download those videos. I believe it's more expensive on AWS than just storing those videos. And each training runs, you probably need to pull them once. If you're trained multiple times. It's even more than that. So it's like just storing on the network those costs. I guess it would be a few millions per month to just storing everything. Not to mention the GPU cost.
C
My side, tangent compute rental, like GPU rentals, very efficient. There's one side. Okay, you can be xai and build your data center. Should we not just build our storage compute as well? Like cloud cost compared to just you save so much? Yeah, exactly. Especially with like egress and stuff.
B
That's a good idea. But it also comes to there are some of its own challenges, of course. Like people who build the GPU data centers, they might not expect this much storage. And yeah, people build storage. Typically they just build it somewhere. It's just CPUs.
A
I just looked up AWS only charges for egress, not ingress. Tier 5 for 5 petabytes is 230k.
B
Yeah, even more expensive.
A
But storing is per month. Right. You check in and you cannot check out. It's cool. It's okay. So my data is larger than you think.
B
Yeah.
C
My backhand math of GPU hours times GPU cost is also very much. I'm missing some storage.
A
You're basically also more IO bound than normal training. Yes, because data loading, caching, everything becomes super important.
B
Yeah. So in Cosmos we did a lot of optimizations to make it not I O bound. Speaking of the training, actually training the model, the GPU costs. If you look up the open source model, how big these video models are, I think like LTX has 19B parameters, that's a dense model and people are also exploring MOEs. So it might be like 20B active and like 100B total. So that's similar size as medium sized LLM models. And if you look at number of tokens, we disclose that in Cosmos it's also like tens of trillions of tokens on the visual tokens. So putting this together, the cost of training these video models is actually comparable with LLMs. Not to mention the infra is slightly different from LLM so it might be less efficient to train these models.
C
Do you get the benefits of traditional diffusion speed up? So for images there's LCM loras for, you know, fine tuning, there's. There's a lot of stuff. Yeah, there's flow matching, there's a lot of stuff that's been done. There's some overlap that applies to diffusion on the inference side and stuff or.
B
Yeah, so. So the difference, the inference side is a completely different story. I think for the training side it might be a little bit hard to reduce that cost. And for, for the inference side the biggest scan is from the desolation of these models. It's called step distillation. Slightly different from knowledge distillation in OLMs. So typically for flow matching models you need like 100 steps or something like diffusion model, even more like 1000 steps to generate a good image or video. Step distillation is try to learn to generate fewer steps from the model itself. It's kind of like now you use the full model to generate in 100 steps and then you take a model that only generate 10 steps and let that model learn from the perfect one. Why this work?
A
Strong to weak.
B
Kind of like strong to be. I guess from the modeling perspective the strong model, the teacher model is trying to model the image and videos of Internet and that distribution is extremely complex as a step distilled model is just trying to learn from the teacher. The teacher is a model and the size is fixed. The distribution is much simpler than the whole Internet. That's the intuition I have why step distillation can work. So usually these models serve in productions they only run in a few steps. In Cosmos I believe we have four step and eight steps. If you do some simpler task like image to image translation, it can even run inferior step by one step in cosmos transfer.
A
Yeah, I think this is the same intuition that guides a lot of the consistency model work. I sent you a link for scm. I don't know if you covered that to me that was actually one of the most impressive papers I've ever seen from OpenAI that this is the unifying grand concept of consistency models. I don't know if you have any comments on this.
B
So there are a few different approaches.
A
Oh yeah, here it is two steps versus 20, 100 steps. Whatever, it's already done.
B
So there are a few different approaches, for example consistency model and there are also. Actually we shouldn't forget gan. So Gan actually that was the OG step distillation because it trained just one step to begin with. So actually a lot of. For example, there's a distribution matching distillation which use, which uses GAN as one of the loss for the solution. GAN just tells you hey, generate an image. And then it has a discriminator to tell is this image real or not. So the model just need to learn one of the distribution, not the full distribution. Because in training the model is asked to reconstruct the ground truth image from the Internet, which is extremely hard. And when you're training again, it's a one step process. It's just hey, you generate image. Does this image look as real as the image from the Internet? Which is a much simpler task. And combining a lot of these approaches together, people typically do that like consistency model and distribution matching. And again we can get these few step models.
A
Okay, then there's one step I wanted to add which is audio.
B
Yes.
A
And video.
B
Yeah. So Garcia, imagine 0.9 I believe is a first audio video transmodel deployed at a large scale.
A
And that was your first model?
B
Yes, I was Grokimagine's first model. It's audio video joint generation. I think the hard part is the modality alignment because before this joint model we have text to video alignment. We have this corresponding vision, text and video. Typically Most of the VLMs, they understand images and videos. Videos very rare. And they don't understand audio mostly. And if you look at the audio generation on the LM side, you can talk to them perfectly fine. But if you ask them to sing a song or something, it typically is not very good. Also they don't have music either. The hard part is that actually audio has two components. It has a discrete component, a continuous component. The discrete component is a language. So when we speak it's just some.
A
It's an ASR issue.
B
Yeah, yeah, it's text token with some characteristics, I would say, but music, I
A
think the speech guys would disagree. Dissonances and you know, I say largely,
B
but the music is completely different. It's very continuous and you cannot model them like discrete tokens in language models. This is like the hard part for model this. Not to mention we have to align text, video and audio together.
A
Yeah.
B
So how so significant. Some significant challenges are like. So first, like we talk about as VLMs, they cannot understand and most of them cannot understand audio. So you have to have some way to do the synthetic data generation for audio. You have to caption the model and that involves synthetic data and human data effort a lot. And not too surprisingly, Most of the LLMs are very bad at recognizing like the beat, tone and the details of music. They can give some general prediction of which song is this, but it's very hard to describe the details of the music. Like we mentioned in image generation, you have to describe image as details as possible so that someone blind can reconstruct that. So here is like someone deaf. Someone deaf can reconstruct how the music sounds like without actually listening to it. Maybe like you can think of it need to have the, what they call the subtitles. Yeah, you gotta have all the details of the music and the dialogue.
C
So is the challenge there typically stuff like music and audio or is it just like, is there a baseline? Okay, there's enough data where we can understand narration, conversation, but there's nuances in audio that, that's where you hit all the data issues or is it just from state zero, you just do it. All right.
B
So one important thing is like the alignment. So as a model, the model has to know the video and audio. It has to have a time based alignment at which time step the video and the audio token correspond to each other. We actually don't have these kind of alignment for most of the other modalities. If you think about text and image, text and video, they are loosely aligned. So you can have a description of what's going on in the video, but you don't have to. Exactly. You typically don't have exact description. Oh, at timestamp one second, what happens? It's very two seconds. Yeah, yeah.
A
So what was the ideal time step? You have to ablate it and then it's like four seconds or something.
B
So that comes down to how you design the model for the model to be aware of as a time modality. So the model is like a time aware and that's something pretty unique if you think about LLMs. So if you ask LLM to complete a task, say you ask them and they will say, oh, this task will probably take 12 hours to complete and they come back in one hour, say, I've already spent two days on this and I've exhausted Everything. Yeah. So the LLMs themselves, they don't have a sense of time there.
C
I actually don't think that's just them not having a sense of time. I think it's somewhat based. Right. Like you tell someone, okay, go work on this feature, go implement this. There's a general understanding you would have of how long that would take without LLMs working at LLM speed. Right. So you think back like two years ago. If I tell you to like build me like a new front end for latent space, have a search bar, have all this, you'll estimate that'll take a few days.
B
Right.
C
So you tell an LLM, go build this. It'll take me a few days. But you know, I think it's somewhat grounded as opposed to them not having the best. Not saying that they have a great understanding. But I think that example is like, you can see where it comes from. Right. You're trained on all over the text.
A
They're trying to estimate what a human would say.
C
Yes. Because that's what. That's what the data kind of represents
B
the corpus on the Internet, people have a estimate.
C
Yeah. And not even just in direct training samples. Right. Just your world understanding of tokens of how long stuff takes. Right. Go read a book. It'll take you a while. Right. Even if you do nothing but read a book, it takes a few days. So yeah, I'll let my Reddit. It took me a few hours. It'll take me a few hours to go through this research. But somewhat a tangent.
A
Yeah, this is a train of thought I haven't really expressed until now, which is basically like a full world model must also be recursive, meaning that the participant in the world model must also be aware that they have a world model which is like this whole recursive thing down the line. But yes. And that the world model can be wrong and that they need to update it and blah, blah, blah. Yeah, we've argued this on the newsletter as well, that there needs to be sort of recursive or adversarial world models.
B
Okay.
C
I mean, just to ask, how do you define world model?
A
Oh, yeah, let's go there.
B
Yeah.
C
So just for context, we talked about video generation. And then if you say there's a distinction between world models, what's your definition? How do you see the two?
B
Yeah. So disclaimer. I'm not going to debate, like, what is world model? There are many definitions. So I'll just talk about my definition since I came from the multimodal domain, so mainly talking from Video. So word model is like real time interactive long horizon videos. So there are three parts. So let's talk about them one by one. So interaction. So we just look at Playbook and neural computer. So the interaction part of it so worth model can allow you to interact with them through keyboard, mouse and maybe also voice. So these also modality you can interact with the model and the model should respond reasonably. Second part is real time. So once you move your mouse if say the word model generates a game, how fast can the game respond? So if you're like professional CS go players might say oh, you have to respond in sub 10 milliseconds or even less. So that's not I guess most of the oh 60fps, let's go 300fps, 500fps.
A
Wait, okay yeah, I didn't do the
B
math but yeah, okay yeah 300 FPS, that's a 3 million seconds. So you have to respond. Most of the video models cannot do that. But if you have a video model that is say like a digital human, the response time might be more generous maybe typically for real time voice interaction it's like 200 milliseconds. So that's much more generous. But even 200 milliseconds is pretty, it is pretty tricky because like remember we mentioned you have this temporal compression coming from the vae. So if you don't compress the temporal dimension your sequence lens is going to explode. So if you want to have this real time real timeness in your model you have to deal with long context problem. And the third part is long horizon because we are not going to just play with video games just like a few seconds. Most of video models only a few seconds, we're going to play with minutes, hours. The model have to be able to generate long form content. So putting these three together it's a real time long horizon interactive videos. I think the final state will be for example like a video version of playbook where you can interact with a neural computer. You move your mouse and you click on the generative interface and it will reply to you through pixels generating in real time. But getting there, it's a very long way to get there. So one of the first step at Grok imagine where I led a small world model team there was to build video extension. So video extension, it's the first step of interactivity. It's the first step. Yeah, we have it here.
A
Video editing.
B
Yeah. So the first step it's because this unlocks non horizon videos. Typically for most of the video generation models you Give it a prompt or an image as an initial frame, you generate video. That's it. That's just one time done. And some creators would try to use the last frame as a first frame for the second video. Sometimes it works, but if you do it a few times, it says a call to a degree and it doesn't
C
have that context over the full video. So the temporal.
A
Yeah, because you only gave it the last frame, of course, right?
B
Yeah, exactly.
C
It's actually a pretty fun hack. Like if you've seen like, oh no,
A
he has something better.
C
Yeah, yeah, yeah.
B
And for example, like a view, I remember Vue3 has like a 1 second context of the last video. It's slightly better than using the last frame, but it has the same problem. Similar problem that the quality would degree. Like if you extend a few times to like one minute, the video quality will look much worse than the first video second. Another problem is as a model doesn't have long range knowledge of like what's happening before. So if they generate some dialogue to people speaking and their voice might change over some time, especially if the 1/2 conditioning does not cover the previous context. So these are the core challenges. So the G imagine video extension, it has historical context of all of the previous generative videos. It has a context of who's speaking and what objects have appeared and everything having that to generate the next video. So if we naively do this, you can imagine just put all of the previous history video tokens into the context. The context lens will easily explode. It's actually for video models that can be like a few million contacts, I would imagine. Context lens. Yes.
C
What's wrong with that?
B
Yeah. For example, in Cosmos, I think just 5 seconds of video is like 50, 50k or 60k number of tokens. So if you do 50 seconds as 500k tokens, if you do longer than that, easily explode. This long horizon problem was the first step we're trying to solve Word model. It turns out people love video extension. A lot of creators love using video extension to create longer form videos. This is the part I like that you have an intermediate step toward the final goal instead of just a straight shot to the final version. Very Mac.
A
Yeah. But I can see you have a strong vision of where we want to end up.
B
Yeah.
C
Does it seem like it's an efficiency issue? Like, okay, we're at a few million tokens context. If you draw the parallel to language models, we had very short context, 2000, 8000. Then you scale it up 1 million, 10 million. Sure. There's effective context. But at the end of the day it's just what's it worth. Sure, there's a whole training data side in video. It might be slightly easier because we have 100 million token video. Just take a movie with the full context. There is this efficiency from an inference standpoint that it's expensive, but we know how to solve it or why is this not the approach? So like my broader point was on your second point of world models, you say it needs to be interactive and live, right? You should be able to play a game and see the interaction live. So one thing I see with research is a lot of what you actually serve is different than what you build, right? So we talked about distillation. You train big model, you distill it, you do quantization, speculative decoding. We do all this stuff to serve it efficiently. Should we not just have a solution like a world model that can interact well, do inference optimization, serve it, distill it secondary. So make it real time after you solve it. So like another parallel is say continual learning, right? What we need is someone to solve it and show it works inefficiently. Give it a few years, people will make it efficient. Same thing with regular attention, right? It worked over a few years. People have different forms of attention. And we've scaled it to be efficient at log context. You know, so kind of two things there, right? One is it seems like it works, you've scaled it. Can we not just scale it a lot more efficiently over time? Do we need a separate approach if this works? And same thing with interaction, right? If we can get it done, if we can solve some way that it works, we can solve making it more efficient from an inference standpoint later.
B
Yeah, that's actually a very good point. So in videos there's actually a lot of redundancies. So we solve a lot of the pixel redundancy from vae, but there's more redundancy in long range and long horizon videos. Say if a character appear in the first clip and then it disappeared, it only reappear like at the end of the video. You probably don't need the context like in the middle of the generation. So you only need that character where you need. So that's why I helped build another feature. It's a reference video.
C
Is it here?
A
Is it the same model release or different one?
B
It's a different one. You probably need to search on app. Reference to video. Okay, so reference video allow you to like upload up to seven images as condition and generate a video Say if like I want it can. It can be characters or objects or even scenes. Say like I want condition. Sean's selfie. And holding. Holding a blade. Yeah, we have a dog.
A
We put the dog in the thing.
B
Yeah, you can put them there. And the video models will generate the video and copy the context over. So that can solve a lot of the problems there. Like the long context problem. It doesn't need to have a very long context, but I feel like it's an intermediate solution.
A
It's cheating.
B
Yeah. The model should be able to selectively know where should I draw references. So say if I want to generate a movie, I generate it autoregressive like 10 seconds at a time or something. And now this character appear. I can look back to where it first appear and bring that back. Yeah, this one, I put the references. Yeah, that's Optimus, Einstein, myself.
C
Oddly enough, I used Grok search to find it and it pulled your LinkedIn post. But you know, we found it.
A
Okay, this is a problem. This is not your fault. But like XAI doesn't communicate all this work that you do very well because they just have the model release and then that's it. But actually these details are very, very good. As far as I understand, everything you just described is state of the art. Like no one else has done it.
B
Thanks
C
a lot. Yeah.
A
And then you just put this blog post with the cookies. I'm like, this is not enough. But obviously this is like the high level numbers that people want to know.
C
But no, I wonder part of that is also some labs don't share research, research into what happens.
A
But this is literally bragging about how good they are. Right. Why would you not say that you are capable of extending with full context? This is not a secret sauce. This is like we did the work. I don't know.
B
Yeah, I guess different labs have slightly different communic.
C
Yeah.
A
Anyway, if anyone from XAI is listening, we are always happy to help you tell your story. Yeah. Okay, so you did references and I think kind of the point here you're making is like it is sort of like a Kluge. Right. Like this is. You can do seven, but what about 100? Right. Then you need a completely different thing.
B
So I think it's like this is like a mechanism to select the context from the history. And you might not put the entire history into the context. For example, there's a paper called Frame Pack which have a heuristic that the latest history, like the last one second I put the entire history and the history before that I would compress it and make the video smaller. So I follow this pattern, this builder pattern that the maximum sequence length is fixed. So the further you are from the current frame you have a smaller image. So this is just a heuristic. I think it can be more automatic. The model is aware like which history part of it can be selected. So this part of the research is actually being actively worked on by a lot of people. It's also quite interesting. I feel this is actually this part of long context is a little bit ahead of the LLM part. So for example, in LLMs the context keep growing. Let's say if you call tool and the tool call history is extremely long, that's still in context and keep growing, keep growing. Even if you switch the topic to something else, the whole context was there. There are some agentic harnesses that help you to say prune the tool results and prune like when you query a file only show like the top 200 lines or something. Those are very heuristic driven for listeners.
A
We did a write up on the cloud code leak where there are eight different kinds of pruning including like you prune the tool results and all that. So you can read up on that kind of thing.
B
Yeah. I think one breakthrough in continual learning might be a way to automatically manage its own.
A
These are all heuristics and they will be replaced by machine learning.
B
Yes. Interestingly, the same thing is being researched in both oms and video models.
C
The interesting thing is also like in the paper you showed it's actually happening at the model level. Right. Compared to language models. Sure. We have base attention but we'll do our own compression, we'll do our own pruning which is separate from model error. Eventually it all just boils in. Hopefully.
A
I think this is a form of attention but also no sort of reasoning attention. I feel like that's different than normal attention. Does that make sense?
B
Yeah, it's different in the sense that attention, not to mention set sparse attention aside like normal attention, like ukv, you have to attend to all of the tokens.
A
Yes.
B
So you don't have high level mechanism to drop which tokens you don't want to attend to as humans humans attention span is surprisingly small. Yes. You can only remember 11 digit of phone number.
A
Well, I have feature detection. Right. I can detect, oh, that's a sequence of 1234 in a phone number that is 11 digit.
C
Very good pattern matches.
B
But humans contacts can like attention can work because we can dynamically pull in contacts from different Places. The same mechanism, I think is going to happen for LLMs and video models.
A
Yeah, RLMS is on the recent work there, which is not that crazy, but it's just recursive.
C
I think it's somewhat inherent in models too.
B
Right.
A
Here's a nice example here.
C
You pull up these, you can read it fine. But language models are also very good at slop parsing.
A
I throw my typos in there, it doesn't matter.
C
Yeah, you have a transcript, you have one, whatever, just throw it in. And it's very good at parsing through noise. That may be a brute force. It can look over a reason over it. But there's parallels to both.
A
I think it's just really fascinating how you relate the world model stuff to the video generation, which I don't think a lot of people hear directly from people like you. So I think that's really helpful. Any other work do we cover, like video, audio, world models? Any other stuff in that Omni team,
C
I guess, or any other work at XAI you want to talk about? Seems like everything we see publicly announced. Oh, cool, cookies. And then there's so much more to it.
A
There's a lot of depth.
C
Any underrated stuff just at the time there?
B
Yeah, I feel the culture is quite interesting and a bit underrated. So the culture is. The culture is three sentences. Move fast, build. No goal is too ambitious. And the first principle, like usually the goal set was very ambitious. It wasn't possible to achieve when I was first thinking about it, for example, build something in three months.
C
And was that like, okay, we're starting team, we want image, we want video. Do it by this deadline or, you know, how do you work back? Like, was it just, okay, we have a rough by, you know, this date, we want something out, or is this like.
B
That's a very good point. So it's from first principle thinking. If you think about. People might say, like first principle thinking applied more to the physical world than the models. I would say, for example, if you think about some limitation, for example, acquiring data, how fast can we acquire the videos? And if you think about training the models, what's the iteration speed for training a model end to end? And how would adding more GPUs accelerate that timeline? And maybe if you need human data, like, what's the turnaround time for human data to arrive? If you put all of those together, that is first principle thinking.
A
Where?
B
Oh, you know, what is the timeline? What's the minimum number of days that is possible to achieve something?
A
I think this is A lot of Elon's type of thinking.
B
Right.
A
I think he's famous for saying that the only law you can't break is the laws of physics. Something like that. Just broadly. You worked a lot with Elon.
B
Yeah. I guess one benefit is working at xai you got a chance to interact more with Elon. So I was very fortunate to get a few retweets from him and that was quite fun. And he also worked very closely with people, like people Imagine online. He's very hands on.
C
There are two things. One, so I was actually looking up Elon retweeting you. I'll pull it up. He talked about you tweeting that you have a really good voice mode. I don't know.
A
No, no, no. Him. Oh, I also did it.
C
I actually. So I would DM you feedback on voice mode because I was like, wow, really good. And then I'm like, oh, this sounds sucks. But I don't know. Anything you want to talk about about your voice mode building it? Was it a team you worked on as well?
B
That's actually not part of the team I worked on.
A
Yeah. Probably worked on more of the video. No, but Grok voice actually very good. This is one of those things where first of all you can speak at 2x, which is fun. Which I listen to 2x. So I like to speak at 2x. But also I think the interruption was better than Gemini. I don't know how it compares to ChatGPT real time now, but as far as driving was concerned, having Grok in my Tesla and driving, I think it was a really good experience.
C
Yeah, he likes voice mode, but also just the crazy reach by your 50
A
million viewers are just saying yes, true. Oh my God.
C
But it's pretty cool how fast it came out. I guess the other thing is the safety aspect of video mode. Anything interesting to talk about there? So spicy, spicy question.
B
A lot of the countries where they don't allow generative data, generative AI videos without watermarks. So in all of those countries Grok Imagine had watermarks and a lot of the takedowns of if the videos were also happening extremely fast.
A
I mean it's part of running a social platform, but also it transfers nicely to the Genai side. Do you have a perspective on synth ID versus other kinds of watermarking?
B
Yeah, I guess it's going to be harder and harder to detect these things. So since id one thing is previously it was only Google and now a lot of different apps are also adapting it. A limitation is the technology. The paper was out there and people can reverse engineer how to get rid of it. And I think even as it advanced, it's still possible to reverse engineer it.
A
Yeah. So if you are interested, you can go onto Reddit and people have taken out the exact like, I don't know, what do you call it, mask or pattern that Google applies and then you can apply it onto any Google generated photo and you can reverse out the synth id.
C
Yeah.
B
And it's also harder and harder to just judge by eyes. I remember like a couple years ago there are like six fingers or something. It's very obvious.
C
My current is actually the audio. I feel like the audio is really lacking my way to tell if something is AI generated outside of like okay, I think I've seen enough. I have a decent eye. The audio matchup especially of Sora is not great. It's all similar style but there's.
A
Those are minor imperfections. I think the point is that actually my closest reference to this is also Ian Goodfellow because I think he did the adversarial Gan thing where it's like okay, here's a picture of a zebra. Then you change one pixel and it becomes a panda. This is like a classic computer vision issue.
B
Yeah. If you think about how these models were trained, like I mentioned before, Gan was in the training process. The objective of GAN is the model generates an image and the model, there's a judge to tell if the image is real or not. The model is trained to make the image more real. So as the model become more and more advanced, it's going to be harder and harder for me personally. Now I have to judge by if these videos have logical sense. This video have a world model. Yeah, yeah, yeah.
A
No, I also like the audio is too nice, like too studio quality. The lighting is too good, the skin is too clear, basically the lack of imperfections.
B
Yeah.
C
Do we have a good way to do reasoning in diffusion? Like is that what separates video generators from world models or in, you know, we really know how to apply it to autoregressive language models. Is there a parallel for diffusion video gen world models? Like on that point. Right.
A
He has a thing on video agents.
B
Yeah, that's a good question. Yeah. Actually I have a. I have a pretty big claim. The visual intelligence are actually mostly coming from language. These video models, especially from now since the diffusion model technology is more mature every time. You see there are some improvement on these models. I would say mostly there's again comes from language model, not coming from the video model itself. Like the video distribution Models themselves in Cosmos, typically these models, they have two parts. There's a prompt rewriter or the prompt up sampler part. I think in Cosmos we use llama or they use mix mixed row. And the Cosmos video model itself is only 7b. And the model, the language model is a prompt rewriter. It's bigger than that. So the prompt rewriter's task is to take user instruction and convert it to extremely detailed description of the video. So because the video diffusion models I would describe their kind of dumb because they, they take the input instruction literally. Because in the training process, remember that we have to describe the video as as detailed as possible when we are creating the the synthetic text pair. So this model they take those kind of instruction to generate the videos. So when you're taking the user instructions, the user instruction really are simple, just say a cat or something. If you put a cat in the video model, they would take that instruction literally. They will literally show a cat in maybe a white background because you didn't describe the background. The cat is not moving because you didn't describe takes the instruction quite lazily. It's kind of dumb. And the prompt rewriter is actually a much bigger model which a language model that takes the using instruction and expand it. So the thinking process you mentioned is from there. So if you look at GPT image like you generate an image in three minutes. Three minutes. It's not all like a pixel generation. A lot of time is spending thinking. So prompt rewriting now have evolved to not only just thinking, it can also be agentic model. For example, say you wanted to generate the image of today's news. So it's likely we'll go to fetch today's news online and then process some data, some then organize the layout and generate it. Another thing quite interesting is if I'm
C
not mistaken, these are. It's no longer a diffusion model though, right? It's auto regressively or is there still.
B
There are different approaches. For example, like Chamney Omni, since they said it's Omni, I believe it's a single model. Maybe it's something like it's a language model with a diffusion head or something like the language model. Do the thinking, do the agent take two calling and then it would use the diffusion head to generate the image in the end. There were also approaches like Cosmos where you have a sec pair language model and separate diffusion models. And there are also like a purely language model like you discretize the images and then you generate the image as discrete tokens. So there are different approaches.
C
I would say one of the claims I've seen for why these approaches struggle is because a lot of the benefits for how we currently learn reasoning with language models is you basically iteratively generate reason, you have your thought and then you work on that answer. Right. So if you have omni model and then diffusion head, you can't feed that back in to continue reasoning. Right. So you can't go like text image, text image, you can't reason on the output it and then go back to diffusion. But I guess in the new Gemini Omni you would be able to as long as you have diffusion.
B
I'm not sure if they have that process. I guess it's definitely possible in the omni paradigm. So if you think about traditional multimodal language model, they would have a VIT encoder that can encode the image. So if they have a diffusion head, they can generate the image and then put that back into the VIT encoder, encode that and then do the iterative refinement of the result.
A
Yeah, I think you have to jointly train the VIT and the diffusion to make that somewhat reasonable because otherwise you're kind of like mismatching or feeding in slop.
C
Yeah, I think it depends on the stage of training. You might be able to freeze it. But anyway, also, just on your earlier,
A
I wanted to also make explicit we do know that nanobanana and GPT image are autoregressive language model with diffusion head. As far as I can tell from your description of GROK image, it is not. It is end to end.
B
I cannot comment on that the way
A
that you described it, but I think there's different approaches. Right. You started off saying prompt rewriter is a big part of the intelligence.
C
And even on that I think everyone should try using an early diffusion model. If you've used stable diffusion one or whatever. If you've seen the prompts like ultra high res 4k this style. Oh my God, the first time I tried one, you don't talk to them like language models. Right. Your prompting is very comma separated, literally
A
talking in the labels that were in the data set. Right? Yeah, but basically I'm just trying to make the point that prompt writer and then image is different from autoregressive language model with diffusion hit. Right? They're different things.
B
Yes, they're different. I just wanted to establish, I think the common part is the, the image part. So it's quite surprising that a lot of the improvement came from the language side, the thinking. The tool Calling. So I still remember in Cosmos I generate a happy sheep without any rewriting it looks so CGI and after rewrite looks so beautiful.
A
I think without any joint training.
B
Yeah, actually without any joint training with rewriting it's already much better. A very interesting thing I guess will happen is the video agents, mostly language models will call these generative model. Either it's a separate model or a diffusion head or whatever as tool. So this model can iteratively refine the results or even generate longer content through a very long trend of thought. It's actually very similar to how human create art. So we don't generate the pixels directly, we literally draw something. And I think through this process these models not only use diffusion as one of the tool, it can also use traditional tool. It can also use image editing tools from Photoshop, it can use video editor, ffmpeg, whatever to take combination of these and the generative AI technology as a set of tool. And they can iteratively create a better, much better video for production grade quality. If you look at existing professional creators, they don't end at generating a video from these models. They will take this video to their editor and edit here and there so
A
much post production and sometimes actually the reason the video is good is not really the video model, it's actually the editing. And yes, we also are engaged in the same process as well. Would you love to use a video editing model?
B
Yeah, actually there's Grok Imagine Agent Beta that was the first attempt in that direction.
A
Yeah.
B
So I think the process would be similar to agent mode. Yeah, you can ask it to.
A
There's no blog post for it.
B
Maybe generate a one minute video which is not possible if you ask the same prompt to video models. But this model will literally call different tools to do that. So yeah, this is actually an interesting thing. So when we first release a video editing model I see on X some people try the video editing feature with added this video to be one minute because they didn't understand how video editing work. Video editing typically is just a removal, add, replace, style transfer, this kind of thing. But that's actually a valid request under the assumption of video agents. So these agents should be able to understand these kind of long hours and tasks to be able to essentially create a long form video. I think this is really fascinating because it's taking the same direction as first you have this AI assisted coding kind of like tab completion, GitHub, Copilot and from there you gradually evolve to codecs and cloud code where you do things fully automated. So In Garcia Magic Asian mode, you can still go in there and do stuff by yourself. Gradually, as the model capability increase, it will be able to do everything fully automated.
A
Yeah, I like that. Okay, so it looks like it's still generating.
C
Also I did notice the Crockett mid gen was always very very fast. I don't know if this is something you guys benchmark but like this is just a tangent compared to when I used to use before the latest OpenAI's image gen and same with Gemini Nano Banana. I would oftentimes use GRO just for the speed.
A
It's in the benchmark. Somewhere in the Imagine API blog post that they have all the speed things mostly combination of distillation plus inference.
B
Yeah, there are a bunch of things like we talk about isolation, then we talk about thinking. If you don't have any thinking budget, the model can just think three minutes and then come back to you. And also like inference, the inference infra team was very talented and they were able to accelerate a whole lot of these models.
A
Yeah, I mean my comment on the video agents things I'm trying to figure out when people say video agents. When you initially told me about your bet on video agents or your vision for video agents, I was a little bit disappointed. I was like, oh, you mean models are tapped out now we have to do agents. But I think you have to. Right? The question now is how much model training is it really going to make a difference versus just building a better harness? Like you said, the models don't have to be jointly trained. You can just take an off the shelf frontier reasoning model, slap it on a harness, give it GROK as a tool. That's it. That's your video agent. Doesn't seem super satisfying. Obviously you can co train and get some more percentage points of proper performance, but if your central claim that the majority of video or generative media alpha or whatever is actually coming from language intelligence and not image diffusion or video diffusion, then that is the future primarily. Just wait.
C
If you pop back at the example, it generated frames. Sorry to interrupt. It's been saying, okay, I'm going to start stitching these frames together. It's using FFmpeg, using this is what
A
GPT Image Pro as well is doing. Right. This is also just writing code in the background and then just stitching, doing an image pass on the final output. It feels dissatisfying for the people who want to just train models.
C
It's interesting, right? It's also somewhat exciting. Like you brought up earlier, a lot of the gains don't come as much from the video. I think you can see that in the language model space too. Right. Anthropic, Very, very good at coding. They're multimodal, not the best. Right. They have basic input PDF. But there's clearly a disconnect in the quality of their image. Video processing, audio processing, yet intelligence, very top tier. Other labs. Gemini OpenAI xai. You can add modalities, but it's not like they're unlocking crazy capabilities. Right. So it's interesting.
B
Yeah, it's interesting to see that the video model's capability increase actually come from language model being more intelligent. I think video agent, it can unlock more stuff than you might imagine. So there's a few things. So one thing is, is when we are prompting these models. So most of the people were actually not very good at prompting. Actually language models have a better sense of how to prompt AI models. AI models know AI models better. So if you jointly train these models, maybe there's a model have a better sense of how to prompt each model. Like different models might be different. Another thing is it might not as simple as just generate a few clips and slap them together using FFmpeg. There might be more image and video editing tool appear in this process. Say if you want to exactly add a blob of text at this time step. The video models might not get that intentional. And very precisely these are possible using these deterministic tools. The video agents can use all sorts of tools. So you don't have to put all of the capabilities into the transition model itself.
A
Yeah, I think that's very true. No. So for what it's worth, I think you're right. I think that this will be a big category. I think probably you are predicting like the next one year in video is going to be all this.
C
Do you have a time time prediction for how when this stuff ramps up?
A
Like, I mean, they already started. He's not very good yet.
C
No, it's so good. I think the last one's just longer. It didn't give me a minute, it gave me 36 seconds. But you know, are we feeling it now? Is there going to be inflection? Is there any timeline predictions you want to make?
B
I guess by the end of this year is this is going to be a big hit. So the inflection point will be where the videos generated by video agents can get to production great quality. It can be presented and it can be distributed in ads. And once that happens, I think the enterprise will have much more budget for video models because the Agents are inherently more expensive than the video models themselves because they do this iterative process. They generate many variations. But once these models have this past, this usability threshold, I think it's going to be exponential growth. Beyond that.
A
No, I would fund a company right now based on this thing. So I think you're right. One thing I'm surprising, I'm reflecting on the whole past hour or so conversation. I think you're into world models and video generation for video generation's sake. I think that a lot of other world models, people we've interviewed, a lot of them genuine intuition and Fei, Fei Li and all those guys and Moondream, which I think I told you about. Moonlake. I keep saying Moon Dream. God damn it. Moon Lake. A lot of them actually say, like, robotics is the endgame. Like embodied robotics. Like you want real time, you want interactive. It is to interact with the physical world. You're not that concerned about it.
B
I think robotics will be a big part of it for sure. I guess the process may happen naturally. So my prediction on robotics is that the problem of physical AI might be solved, like without actually need to be
A
in the real world.
B
Need to be in the real world. So it might get solved by a video. LLM is very strong video capability. So remember we talk about the real time interactive long horizon video once these models. So now these models are just training on screen recordings and computer screens. Once these models can use computers and understand the future state of computer extremely well. The robots might be one of the tools a very powerful AI can use. So the powerful AI might just be able to control the physical embodiment naturally.
A
I see that for sure. Cool. I know we are coming up on time. You left one more spicy topic, which is why you left XAI for me.
B
There's a lot of research you want to do, but you cannot do as a company. And also the priorities and objectives for company typically can change very fast. It's also the same for xai. So now is kind of like the time. So there are some research I want to do, especially more on language model side. I cannot do edx, AI.
A
Oh, okay. Yeah. You're basically leaving. You had this whole transition from computer vision to world models, video generation. Now you're focusing on LLMs.
C
But it seems like a lot of you saying focusing on LLMs, you really, in the past hour described how it all ties together, right? Yeah, but I don't know what do you mean by focusing on LLMs?
B
I realized the fact that the video models even like in the beginning, the Gain might come from improvement on diffusion technology, but this is the point where actually most of the gain come from the language models themselves.
A
It's a huge black pill for anyone who has spent their career in generative media.
C
I mean that's an extreme view, right? You still definitely need a bit of both, right? Yeah, it seems like more pressing, impactful work to do. Now on language model side, do you
A
have any similar predictions? So you predict the video agents? I think you'll be right. On the language side, what are you looking for in the next one year?
B
I think one thing pretty interesting, I think my behaving soon is the language models will be context aware and manage its own context. So some like from the video model side, we've been suffering from the long horizon issue, like we want to generate video longer and longer and we've been trying to solve the context length issues through various ways. One thing is just brute forcing train longer context lens. Another is to manage the context better. I think the same thing in language model is also going to be happening soon. So for example, the language models, they're not aware of how long their own context length is. Once they hit like 80% or something, the automatic context comparison is getting triggered and the model is not aware of that when it's working. And maybe it's good for the models to know, oh I'm approaching like 80% or something. And something also pretty interesting, for example, in openclaw, every time you type in something, the current local time is automatically attached to your message. So the model actually know what time is it. So this is making the model time aware. And also in tool calling, a lot of the intermediate tool call results automatically prune. So there's like context removal, context addition and context compaction. So all of these are from the harnesses themselves. From our experience, the heuristic engineering all have the models get absorbed into the models themselves. I guess that's something very interesting to explore.
C
So infinite context maybe? No, but it's interesting, right?
A
It isn't a space of memory and continual learning.
C
I don't know. It's also like in the space of agent harness use, right?
A
He's saying he doesn't want to do it in a harness, right?
C
No, no. But models are also being trained on using harnesses, right. So some of it is you could say implicitly leaking in. Right. Part of that post training of language models is okay using it in coding harnesses, in which case when our sub agents spawned. When is convection going to happen? It's not explicit. You have this much Token window, which I don't know if you want it to be as that'll change but it's somewhat leaking in there.
B
I mean imagining what if the model have access to the whole code of the agent harness itself and be able to modify it whatever you want. Say if the agent harness is short enough, you can just put in the context length in the system prompt and then the model say when I want to spawn a future version of myself, I can modify the agent harness. For example, if the agent harness can be when I'm reading a long document, I can choose to read the whole thing in chunks and come back smash the summary together or I can just read the first 200 lines and discard the rest and all kind of choices if they can be made by the model themselves. It might be very interesting to see that the model can like a program. The model can program itself online in test time.
C
Yeah.
A
So the self modifying harness is also part of OpenCloud and PI, but I think there's a lot more work to do there. Very cool. I think part of me is kind of curious. I think you are part of big lab, right. And there's this career path of a researcher at a big lab, which is you are. You train models, you get more compute, you train better models, you keep going. And somewhat I feel like you're opting out of that. And if I were you, I'd be like, oh, I think this is like a bit of a career risk, you know what I mean? I don't have any comment. Apart from you're very strongly convicted. I think that a lot of people in your shoes would not be doing what you did.
B
Yeah. Speaking of my career, if I look back actually there was ever there were a lot of huge transitions. So 10 years ago I was doing research with the RESNET authors Shan Yujiang and Chen Sun. At that time the research were completely different. It was mostly computation, like image recognition, object detection, object tracking. I was also doing neural net compression at that time. It was quite different from knowledge dissolution these days. And at that time I wanted to be a professor and I applied. When I applied for a PhD I already had a few first author papers at top conferences, so I confidently applied the top schools. It turns out I got rejected by all of the top PhD programs. So I had to, I had to go to the industry. At that time I was at Facebook Hair Research fair led by Yann Le Kuhn.
A
I wanted to talk about V. JEPA but it's different.
B
Anyway. Yeah, we can leave it for another time. Yeah. At that time I switched to self supervised learning. It was quite different from what I was doing in contribution and, and after that it's Nvidia Cosmos. So I realized scaling up was extremely important. So at Nvidia I was mainly focusing on scaling. So one thing is Cosmos scaling the video description models to a few billion parameters. And another thing is I was working on moes. The Megatron moes was the first, was the first framework open source to be able to train these MOEs at very large scales like 100 billions parameters to even trillions parameters efficiently at like 40% MFU. And going to switching to XAI was trying to work on even larger compute scale even further. And yeah, looking, looking at this trajectory I actually work worked on a lot of different things. So I feel actually within, within ML it's actually easier to switch than, than you think. Like a lot of people might have manifested that. Oh, I work on, I work on computerivation, I always have to work on computer vision and I cannot switch to language. But from my experience at least at Nvidia I worked on both language model MOEs and also video models. It's actually not the case. A lot of the core principles, how to train large models are largely the same. And yeah, for me I feel right now the bottleneck for video models is actually the language part, the agent. Which is why I want to go to work more on LLMs. One thing is it's a bit of a challenge. I don't think it's a huge jump.
A
Yeah, I mean kudos to you. I think you have a lot of strong vision there. Yeah, I think that was mostly everything that we wanted to. However, you've been very generous with your time and it's really nice that you are able to share all these things now. We don't have to go through XAI to clear everything. But also I think we didn't get you in trouble.
C
It's a lot of good stuff about XAI compared to what you just see in the releases. Right. You don't realize how many more levels
B
there are to it.
A
Xai, please do more podcasts anyway, but thank you for sharing, it's been very kind and also like I want to hear more from you. I think you are going to embark on your next phase. You haven't announced what you're doing next, but clearly you have more vision and more ambition on this path. And I think you're basically kind of gradient descending to whatever your final form is.
B
Thank you. Yeah, I'll share more about my next chapter soon. Thank you for having me.
A
Thanks for coming.
Episode: Why Video Agent models are next — Ethan He, xAI Grok Imagine
Date: June 1, 2026
Guests: Ethan He (ex-xAI, Nvidia Cosmos), hosts Shawn “Swyx” Wang (A), and Vibu (C)
In this wide-ranging conversation, the Latent Space team goes deep with Ethan He, a foundational engineer behind xAI’s Grok Imagine, about the fast-moving world of video foundation models, video agents, and the definition and roadmap for “world models” in AI. The episode covers Ethan's journey from Nvidia’s Cosmos video model to xAI, the technical and practical challenges behind scaling generative video and video agents, how language models are increasingly the key driver of visual intelligence, and predictions for where interactive, real-time generative video is headed.
This episode provides a rare behind-the-scenes technical window into how modern video “world models” and video agents are built, and why the next wave of progress will emerge at their intersection with language reasoning. Ethan He’s frank technical depth and forward-looking vision make this a must-listen—and now, a must-read—for anyone tracking the future of generative media, multimodality, and the emerging AI engineer stack.
For show notes, papers, and the full transcript, visit https://latent.space