
Chris Manning
I think this whole space is extremely difficult as things are emerging now. And I mean, it's not only for world models. I think it's for everything, including text based models. Right, because, you know, in the early days it seemed very easy to have good benchmarks because we could do things like question answering benchmarks. But you know, these days, so much of what people are wanting to do is nothing like that. Right. You're wanting to get some recommendations about which backpack would be best for you for your trip in Europe next month. It's not so easy to come up with a benchmark. And it's the same problem with these world models.
Host 1
Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science and entertainment content that you so clearly want if you didn't choose to click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way.
But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring the show to you each and every week. If you do it, I promise you we'll never stop working to make the show even better. Now let's get into it. Okay, we're back in the studio with Moonlake's two leads. I guess there are other founders as well, but Sun and Chris Manning. Welcome to the studio.
Chris Manning
Thanks.
Sun
Thanks for having us.
Host 1
You guys have burst onto the scene with a really refreshing new take on world models. I would just want to sort of, I guess, ask how the two of you came together. Chris, you're a legend in NLP and just AI in general. You're his grad student, I guess.
Sun
Actually my co-founder. Oh yeah, I should give a lot of credit to my co-founder, Sharon. She was actually working with Professor Fei-Fei Li and Jiajun, and then she ended up working with Ron and Chris Manning here. And so I got connected to Chris initially through my co-founder.
Host 1
What is Moonlake? Actually, I'm also very curious about the name, but why go into world models?
Sun
So I was working a lot with Nvidia research during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. And there were two observations, one in academia and one in industry. In industry, folks at Nvidia were actually paying a lot of dollars to purchase these types of interactive worlds, whether for the sake of evaluation or for training the robots or policies or models. And in academia, the same thing was happening. More specifically, when I was working with Nvidia on the synthetic data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. But then, like I said, there's a lot of dollars being paid out to external vendors or other folks to manually curate these types of data. It was very clear to us that, okay, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means they need interactive data. And the demand for those types of data is growing exponentially. But everybody's sort of thinking about it from a pure, say, video generation perspective or something else. We feel like the true opportunity is actually building reasoning models that can do these things, like how humans do these things today. So that's a little bit on the genesis of Moonlake. And I think the reason I got into world models was partly a philosophical take on the world, where I believe in simulation theory and stuff like that. But on the other hand, it's really just, oh, there's an opportunity there, and I feel like nobody's doing it the way I think it should be done.
Chris Manning
I can say a little bit about that. So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space. That's been extremely productive. We all know the story of the last few years; I don't have to tell you how much we've achieved with large language models. But although they've been extremely effective for language and general intelligence, language is clearly not the whole world. There's this multimodal world of vision, sound, taste that you'd like to be dealing with, more than just language. And then the question is how to do it. Despite a huge investment in the computer vision space (as a research field, computer vision has for decades been far, far larger than the language space), I think it's fair to say that vision understanding sort of stalled out, right? You got to object recognition, and then progress just wasn't being made. If you look at any of these vision language models, it's the language that's doing 90% of the work and the vision barely works. So there's really an interesting research question as to why that is. At heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection to a more symbolic layer of abstracted understanding of visual domains, which isn't in mainstream vision models, which are still trying to operate at the surface level of pixels.
Host 1
I think one of your blog posts, you put it as structure, not scale. Is that a general thesis?
Chris Manning
Yeah. Well, scale is good too.
Sun
Scale is good too.
Chris Manning
Lots of data is good as well. But nevertheless, you want the structure to be able to much more efficiently learn.
Host 1
Yeah. The other thing I really liked is you put out an example of what your kind of reasoning traces look like. "Distill" is the word that comes to mind, though I don't even think that's a good description. But it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings and what have you. That is the kind of example that involves, let's call it, spatial reasoning, world model reasoning, as compared to normal LLM reasoning.
Sun
Yeah.
Host 2
But also, taking a step back, how do you guys define world models? A lot of people see, okay, you can do diffusion, you can do video generation, but you guys have put out quite a few blog posts, you put out an essay recently, we can even pull it up, about efficient world models. You have a pretty structural definition here, but for the general audience that doesn't super follow the space, what's the difference between what we see from, like, a video generation model and a world gen, a simulator? How do you kind of paint that?
Chris Manning
Yeah, so I think this is actually a little bit subtle, because people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing, we've sort of solved understanding the world because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals aren't accompanied by an understanding of the 3D world, an understanding of how objects can move and what the consequences of different actions are, and that's what's really needed for spatial intelligence. A term we sometimes use is that you need action-conditioned world models: you only actually have a world model if you can predict, given some action is taken, what is going to change in the world because of it. And in particular that becomes hard over longer timescales. If you're simply trying to predict the next video frame, that's not so difficult. But what you actually want to do is understand the likely consequences of actions minutes into the future, and to do that you need much more of an abstracted semantic model of the world.
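The "action-conditioned" criterion Chris describes can be made concrete: a world model is a transition function over (state, action), not just a frame extrapolator. Here is a minimal illustrative sketch in Python; all names are hypothetical stand-ins, not Moonlake's actual API or representation.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    """Abstracted semantic state of the scene, not raw pixels."""
    objects: dict   # e.g. {"ball": {"pos": (x, y), "vel": (vx, vy)}}
    t: float

def step(state: WorldState, action: str, dt: float = 0.1) -> WorldState:
    """Action-conditioned transition: the next state depends on which
    action was taken, not just on extrapolating past frames."""
    objs = {name: dict(attrs) for name, attrs in state.objects.items()}
    if action == "push_ball":
        vx, vy = objs["ball"]["vel"]
        objs["ball"]["vel"] = (vx + 1.0, vy)   # the push changes velocity
    # Integrate simple kinematics for every object.
    for obj in objs.values():
        x, y = obj["pos"]
        vx, vy = obj["vel"]
        obj["pos"] = (x + vx * dt, y + vy * dt)
    return WorldState(objects=objs, t=state.t + dt)

s0 = WorldState(objects={"ball": {"pos": (0.0, 0.0), "vel": (0.0, 0.0)}}, t=0.0)
s1 = step(s0, "push_ball")    # pushing moves the ball
s2 = step(s0, "do_nothing")   # taking no action leaves it in place
```

The point of the sketch is the signature: without the `action` argument, you can only extrapolate observations, which is the video-prediction setting Chris contrasts against.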
Host 1
Yeah, the question comes where you want to have more structure than is available in just predicting the next token. And typically, let's call it the experience of the last five years, that structure just gets washed away by scale, right? So what is the right middle ground here, where you don't ignore the bitter lesson, but you can also be more efficient than what we're doing today?
Chris Manning
One possibility is: look, if we just collect masses and masses of video data, this problem will be solved. Under certain assumptions that could be true, but there are multiple avenues in which it could not be true. The first is that what's really essential is understanding the consequences of actions, producing an action-conditioned world model. If you're simply collecting observational video data, which is the easy stuff to collect when you're mining online videos, you don't actually know the actions that are being taken as the video changes. And if you're never collecting actions directly, you have to try to infer them from what happened in the observed video. That's not impossible, but it's very hard, and it's not really established that you can get that to work at any scale yet. So there's a lot of premium on collecting action-conditioned video data, which is in quite limited supply, and that's part of why there's been a lot of interest in using simulation, so that you can be collecting data where you do know the actions. But maybe, in the limit of as much data as you could possibly have, the problem is eventually solvable. Even so, note that although we collect huge amounts of text data, text data is always at a great level of abstraction. Language is a human-designed abstracted representation where there's meaning in each token, representing an abstraction of the world. As soon as you're describing someone as a professor, and as soon as you're saying that they're condescending, these are very abstracted descriptions of the world. It's not at the pixel level of what you're observing. So to get to that kind of degree of abstraction starting from pixels takes orders of magnitude of extra data and processing. So although we absolutely want to exploit as much data as possible and use the bitter lesson.
Nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you're going to be able to make a lot more progress a lot more quickly. And that's the bet here. You could say that's only wanting to do it more efficiently, more quickly, more cheaply, but I think it's actually more than that. I think one should be making the analogy to how human beings work. At one level, yes, we have these high resolution eyes and we can look and see a scene like a video. But all of the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed. You're doing fairly fine processing of exactly what you're focusing on, but as soon as it's away from that (yeah, there's another guy over there), you're only processing top down, this very abstracted semantic description of the world around you. So that's what human beings are doing: they're working with semantic abstractions. And I think it's just the right representation, because we also have other goals. We want to be able to do real-time worlds, and that means there's a limit to how much processing you can do. And we want to do long-term planning and consistency, and again, that favors abstraction. There was actually a recent blog post that came out from our friends at Physical Intelligence, and they were sort of heading in the same direction. They were saying, oh, for the π model, to maintain a long-term memory of what's happening in the world so we can do longer-term planning, we're actually storing text of what has been happening in the world.
Host 1
Right.
Chris Manning
It's not such a successful strategy of trying to keep it all at a pixel level.
Host 2
And yeah, I mean, you can see it in video models, that temporal consistency. We're at the scale of train on all the video data we have, and we get consistency for maybe 30 seconds, a few minutes. That's not the same as a game state played for half an hour, right? I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I don't know if you guys want to talk about this. This is one of the things I read.
Host 1
I thought, yeah, this is the thing I talked about with the reasoning chain.
Sun
Yeah.
Host 2
So there are different phases to this. It seems like it's more of an agent, a scaffold, a very different approach than just, you know, typing in a prompt, where you don't get the same consistency. For people that are listening, I would highly recommend reading it. It breaks down the problem in a different light. So what do you need to consider when you're talking about video, like world game models? What are the factors? What are the elements? What's the state? So I don't know if you guys have stuff to talk about for this one.
Sun
Yeah, actually I wanted to add a little bit on our previous point, which is that I do feel like sometimes people confuse things: oh, you're taking a method with abstraction, that means you don't believe in the bitter lesson. That's just false, right? We are believers in the bitter lesson. But the question we always discuss is: what is the right abstraction level today? The analogy I like to make is, let's say we can encode, decode and represent all images, videos and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model, as opposed to a next-token prediction model, where it's natively multimodal. But, to Chris's point, it's the scale and compute needed to achieve that. So that's why we always come back to: what is the most efficient way to do it? And reasoning models, to the point of this blog post, are a showcase of, hey, we're actually just reasoning about the world, reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
Host 1
Yeah, it's like you're improving the encoder of whatever you're trying to model. And a better representation would just represent the important things in less space, which would be more efficient. So yeah, I fully agree that it is not antagonistic to the bitter lesson. I do want to mention one more thing. Are there any philosophical differences with the JEPA stuff that Yann LeCun is working on? I gotta go there. You're mentioning some latent abstraction. I'm like, okay, fine, let's talk about it. It's the elephant in the room.
Chris Manning
Yeah, there are philosophical differences. Yann LeCun is a dear friend of mine, but he has never appreciated the power of language in particular or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually and there are no words, symbols, or math in his head. Maybe that's true of Yann; it's certainly not the way I think. But at any rate, the world according to Yann is that the basic stuff of the world and of intelligence is visual, and language is just this low-bit-rate communication mechanism between humans that doesn't have much other utility and is far inferior to the high-bit-rate video that comes into your eyes. I think he's fundamentally missing a number of important things there. Think of the evolutionary argument, looking at animals: the closest analogies are with chimps, right? Chimpanzees have fairly similar brains to human beings. They have great vision systems, they have great memory systems, they've got better short-term memory than we do. They can plan, they can build primitive tools. But humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level which just gave this sort of vaulting of what could be done with the intelligence in brains. So the philosopher Dan Dennett refers to language as a cognitive tool and argues that humans, unique among the creatures in the world, have managed to build their own cognitive tools. Language is the famous first example, but other things like mathematics and programming languages are also cognitive tools. They give you an ability to think in abstractions, in extended causal reasoning chains, and that allows you to do much more. And we use that for spatial representation and intelligence and planning and gameplay as well.
So we believe, and this underlies the specific technologies that Moonlake is making, that symbolic representations are powerful, and you want to use them in your understanding of the visual world when you want a causal understanding, when you want to maintain long-term consistency and prediction. And as I understand it, that's just not in Yann LeCun's worldview. So I think that's the fundamental philosophical difference. Then there's the specific model he's been advancing, JEPA. That's a reasonable research bet as a direction to head in for building out a model of the visual world. To my mind, it's one reasonable research bet; it's not really established that it's the best one that everyone should be following,
Host 1
at least developed at scale, at Meta. But it's not just vision, right? I mean, JEPA is just joint embedding prediction; it can be applied to anything really, and people have done it. If the argument is that there is a latent representation that is probably more suited to the task, then why not let machines learn it for us instead of predefining it at all? And isn't something like a JEPA-shaped thing the right answer? And if not, why not?
Chris Manning
So I think there's a part of JEPA that's right, which is that you do want a joint embedding that gives you a consistent model of the world. Yann's argument is that you can never get that from autoregressive language models because they're sort of left to right, churning out one token at a time. I guess this is where the research arguments of the field come in. I'm not actually convinced that's right, because although the token production is this autoregressive process that's heading left to right (I guess it doesn't have to be left to right; in a sequence of tokens we could have right to left for Arabic), although that's true, all of the weights of the model that are internal to the transformer are a joint model of the model's understanding of the world. So I think you can think of the weights of the model as a form of joint representation, and therefore it is plausible to think that that could be the basis of a world model which avoids Yann's objections.
Host 1
I think I follow. And obviously that touches on what Moonlake eventually ends up doing as well, right? Which is hard to tell, because you put out the end results, but we don't know the inputs that go into it. So that's something that we have to figure out over time.
Host 2
Yeah, I mean, I guess this kind of breaks down some of the outputs. Do you want to walk us through it?
Sun
Yeah. So this really just walks us through the reasoning traces. Let's say we want to build a world; in this context, it's really just a game demo that shows the variety of interactions this world model can build. And it's the reasoning trace of: okay, it was prompted to create a bowling game, so how did it achieve the level of causality, interaction and consistency that you saw? So yeah, this is almost just an example of the reasoning traces.
Host 1
Very detailed, very, very detailed.
Host 2
You don't even realize it, right? Like, when a video is generated, what happens when a ball strikes a pin? First there's audio: audio triggers happen, the score increments, the world changes, pins have to start dropping, there's a timer that goes on. It's very similar to how we're now used to reasoning for language models. There's a whole state of what happens: geometry, physics, all this stuff. And then there's kind of that single prompt, asset specification, all this stuff. It's a nice view to see what's going on.
Host 1
I think Sun is also too polite to point out that both Google's Genie demos and World Labs' Marble do not have interactive worlds.
Sun
That's the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context, I want to learn how to bowl. And then you can ask: what is important when it comes to learning how to bowl? Maybe it's: I need to understand the basics of physics, and how to throw the ball at the pins. I want to know that when it resets, it's a new game. So I know to pick up the ball, I know that the ball is going to cause the pins to fall down, I know that what's important to this particular bowling game is to score, and I know that the score corresponds to the number of pins that fell down. If it's a model that sort of knows what a bowling game looks like, but doesn't actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it doesn't allow you to learn what you set out to learn within the world model, right? And I think this is really just one example showing the advantages of the approach that we're taking over most of, let's call it, the zeitgeist today when people talk about, quote unquote, world models.
Chris Manning
So it sort of seems like the question to ask when there's a world model is: can I not only wander around the world and look at the beautiful graphics, but can I interact with the objects in the world and see the right consequences of actions?
Host 2
And can you also understand what the consequences would be if you do something, right? So it's not just, okay, there's one thing, and if I pick it up, something will happen. There are 50 options, and I can infer what would happen if I do any of them. So it's very different when you can actually see it and play around with it.
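The causal chain in the bowling example (the ball knocks pins, the score follows from which pins fell, reset starts a new game) can be sketched as explicit symbolic game state that triggers events a renderer or audio layer could consume. This is only an illustration of the idea, not Moonlake's actual representation:

```python
class BowlingWorld:
    """Tiny symbolic game state: score is *caused* by pins falling,
    and reset starts a fresh game. These are the causal rules a world
    model must respect, independent of how the scene is rendered."""
    def __init__(self, n_pins: int = 10):
        self.n_pins = n_pins
        self.standing = set(range(n_pins))
        self.score = 0
        self.events = []  # triggers for a renderer / audio layer

    def throw(self, pins_hit: set) -> None:
        knocked = pins_hit & self.standing   # can only knock standing pins
        self.standing -= knocked
        self.score += len(knocked)           # score follows from the physics
        if knocked:
            self.events.append(("pins_down", sorted(knocked)))
            self.events.append(("play_audio", "crash"))

    def reset(self) -> None:
        """New game: pins back up, score cleared, event history kept."""
        self.standing = set(range(self.n_pins))
        self.score = 0

game = BowlingWorld()
game.throw({0, 1, 2})
game.throw({2, 3})   # pin 2 is already down, so only pin 3 counts
game.reset()         # a new game starts with all pins standing
```

A model that only generates bowling-looking pixels has no equivalent of `standing`, `score`, or the event triggers, which is why repeated practice toward a high score is impossible there.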
Host 1
There are two cheeky elements of that. The, I guess, less ambitious one: let's really establish for listeners why this is fundamentally different from writing Unity code, right? Just creating a model to translate a prompt into Unity code.
Sun
So there is an underlying physics engine, and in that sense there's some overlap with Unity. But the way we think about it is that physics engines or tools or code are cognitive tools, borrowing Chris's term, right? Tools that the model can employ as means to an end. So today maybe you say, okay, in this particular context we care about physics, we care about the long-term causal consequences; then yes, we employ a physics engine. And maybe tomorrow we say, okay, we're training, let's say, drones, where we only really care about fluid dynamics and the visual aspect of the world; then maybe the model doesn't have to use a physics engine at all, or maybe it employs other types of representations or physics engines to achieve the task. So yes, writing code for Unity is similar to a tool that a model can employ, but our goal is for the model to take a representation-conditioned reasoning approach or process internally.
Host 1
Yeah, using these things is just like general tool calls, right? Which I think is very interesting. The other, more ambitious one is some kind of recursive element where it becomes multiplayer.
Sun
Right.
Host 1
Like here there's a single player element. You're not modeling any other people involved. And that is a whole other thing.
Sun
But in fact we can already do multiplayer.
Host 1
Oh yeah, Okay. I haven't seen any demos.
Sun
You actually just prompt our model to say, hey, configure multiplayer, and it'll do it. You'll be able to configure multiplayer; it can provision a database for you.
Host 1
Easy. Yeah.
Host 2
So what are some of the current limitations of where we're at? There's one approach of scaling up video predictors, and obviously there are data issues with approaches like this. Is it data constraints? What are the next steps? Is it real time? There's one side of writing an agent to write Unity code, but okay, I want to be streaming a game in real time, I want to have characters that are also agentic. Where do we see this scaling up?
Sun
Yeah, there's definitely a data constraint: the more data, the better. This reasoning model can basically act as humans do, operating a variety of tools and software to build whatever is necessary. And then there's a sort of fidelity constraint, which we're actually solving with another model, Reverie, which we can talk about later. It's not as easy to get to photorealism with the approach we're taking, but we think there are better solutions to that, which we could dive into later.
Host 2
One thing you note here is it's a diffusion model, right? So there are a few approaches: diffusion, Gaussian splatting. So Reverie, the diffusion model: do you guys want to introduce it?
Sun
Yeah, totally. So within our world modeling framework, we think there are two models that we train.
Host 1
Right.
Sun
Like, there's the multimodal reasoning model that we just talked about, which essentially handles the causality, the persistency, the logic and the determinism of the world. And then Reverie is our bet on saying: okay, while that model can take care of all the things we just talked about, its limitation compared to existing, say, video models is that it doesn't have as high a pixel fidelity right out of the gate. Reverie says, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles, or arbitrary styles you want. So this model is almost saying: I'm going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want.
Host 1
Yeah, you cap the KL divergence, right? I mean, this is the classic thing: you don't stray too far from the source material because you cap the KL, which is kind of cool.
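The "cap the KL" intuition mentioned here, keeping a restyled output distribution from drifting too far from the source, is commonly written as a KL-regularized objective. A toy sketch over discrete distributions follows (illustrative only; the distributions and the `beta` weight are made-up stand-ins, not Reverie's actual training loss):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(style_loss, p_restyled, p_source, beta=0.1):
    """Pursue the new style, but pay a penalty proportional to how far
    the restyled pixel distribution drifts from the source world."""
    return style_loss + beta * kl_divergence(p_restyled, p_source)

p_source = [0.5, 0.3, 0.2]    # stand-in for the source pixel distribution
p_close  = [0.45, 0.35, 0.2]  # small drift from the source: small penalty
p_far    = [0.05, 0.05, 0.9]  # large drift: large penalty
```

A hard cap, as opposed to this soft penalty, would instead reject or project any update whose `kl_divergence` from the source exceeds a fixed threshold.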
Sun
Yeah.
Chris Manning
And the difference is, and Sun was pointing at this, that it's in one way a more difficult path, but a better path. Typically the diffusion models are producing the whole scene, and it looks lovely, but there isn't the spatial understanding behind it that allows for real-time graphics, gameplay, spatial intelligence, understanding the consequences of actions. Here we're taking a path that assumes an abstracted semantic model of the world, the world state, and then the diffusion model is used on top of that to produce the high-quality graphics.
Host 1
Is there an intended practical or business use for this or is it like a demonstration of capabilities?
Sun
We actually believe that this is going to be the next paradigm of rendering. It's going to replace rasterizers, it's going to replace DLSS today, because it not only has this pixel prior that's learned from the world, such that you can literally play any game in photorealistic styles, which is what a lot of people want when they play GTA, right?
Host 2
All the mods, all the people adding perfect lighting and all this.
Host 1
So skins for worlds, let's call it
Sun
skins, let's call it skin.
Host 2
You can call it skins, you can call it customization, you can play it how you want, right?
Sun
Yeah, exactly. And I think another thing that we really pointed out specifically in this blog is the programmability of it, right? Historically, the renderer is always a derivative of the game state: you're saying, here's the game state, I'm rendering out a frame. But here I'm saying this renderer can actually be part of the gameplay loop. I can say something along the lines of: upon getting 10 apples, my weapon of choice, my bullets, are going to turn into apples. And that's possible because we can dynamically have certain game states trigger preconditions in the renderer, such that the rendering is now part of the game loop too. One thing is just the appearance, but the second thing is that there are these novel interactions that are possible because this renderer now actually has priors of the world.
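Sun's "upon getting 10 apples, my bullets turn into apples" example amounts to the renderer checking game-state preconditions each frame instead of being a pure function of fixed assets. A tiny sketch of that pattern (hypothetical names; not Moonlake's interface):

```python
def render_style(game_state: dict, rules: list) -> dict:
    """Classically, rendering is a pure derivative of game state.
    Here, rules let game-state preconditions rewrite how things
    render, making the renderer part of the gameplay loop."""
    style = {"bullet": "default"}
    for precondition, restyle in rules:
        if precondition(game_state):
            style.update(restyle)
    return style

# "Upon getting 10 apples, my bullets turn into apples."
rules = [(lambda state: state["apples"] >= 10, {"bullet": "apple"})]

early = render_style({"apples": 3}, rules)   # precondition not met yet
late = render_style({"apples": 12}, rules)   # precondition fires
```

The design point is that `rules` live alongside the game logic, so the visual layer can react to state just like any other game system.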
Host 1
It is up to the artist to figure out what to do with it.
Sun
It is up to the creators, yes. And I also think that's actually another big argument we're making. The reason we're taking the bet we're making is that a lot of the time, whether it's for embodied AI or gaming, you want a layer where humans can inject their intentions, right? For example, in the context of gaming, it's obviously my creative intent. But maybe in the context of embodied AI, it's: oh, I take this foundational policy and I want to fine-tune it to deploy in my house. So you almost want a layer where a human can say, here's the distribution of things I want to create to achieve my goal. And I think 3D graphics, as it is today, is basically the layer for people to say, hey, what do I care about in this world? It allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I'm going to generate arbitrarily, with just prompts.
Host 1
It's one of those things where I think you're going to build up a series of models, right? This is just one of them, probably the highest-utility or highest-frequency one, I don't know what to call it, where you can immediately drop this in on any game and you don't need anything else that you guys do. But I could see that. I think the human intent part is something people are not even used to, because we're so used to static worlds, or worlds that just don't react. I don't know, you're kind of blowing my mind right now. I wonder if you've talked to people at GDC and what they're going to do with it.
Sun
Yeah. Now the stance that we take on this front is like, we're not going to be more creative than our users.
Host 1
Ship it out.
Sun
Yeah. But we want to make sure that we're building things in a way that really allows them to express their intent.
Host 1
The thing that you said about "here's the distribution that I want": I think text may be too low-bandwidth to really express that, because I'm probably just going to want to drop in a bunch of reference assets and have you figure it out from there.
Host 2
You probably want to do a mixture of both, Right. Like you throw in a few images. I want it this style. I want it to look like this. It's a mixture, right?
Chris Manning
I think it's a mixture. I mean, there's clearly a visual component of this, and it's not that everything can be text, because of course you want to give a visual look. But there's also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text, which would be very time-consuming and difficult to do via visual means. So yeah, you want a combination of both.
Host 2
So one question I have is: how do we go about evaluating world models? There are many axes, right? One is: okay, I have preferences, how well do we adhere to prompts? One is the simulation: is there core logic that's broken? Coming from diffusion, we know how to evaluate it: there's fidelity, there's stuff like that. But what are some of the challenges that most people probably aren't thinking about?
Sun
Yeah, I think this is a great question, and probably one of the hardest questions in world models, because it always comes back to: what are you building this world model for? Depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people actually spend in this world that you create. And if your goal is, for example, in the context we just talked about, deploying an embodied agent, then your end metric is: after training in these worlds that you generate, how robust is the agent when you actually deploy it to the target environment? But these end metrics are hard to measure. So today people have what I call proxy metrics, which basically try to approximate what we really care about, the end metrics. But then, frankly, it's different for every
Host 2
use case, which seems like quite a challenge, right? In language models or video models or image models, your benchmarks are proxies too: people aren't actually asking instruction-following or tool-use questions; those are proxies for how well the model will do downstream. But for this... should teams, should companies, have their own individual benchmarks? Outside of games, if you think of things like video production and movies that also want to use world models, should they internalize their own proxies? Is this something you guys do? Where does that connect?
Chris Manning
I think this whole space is extremely difficult as things are emerging now. And it's not only for world models; I think it's for everything, including text-based models. Because in the early days it seemed very easy to have good benchmarks: we could do things like question answering benchmarks (could you answer the question based on these documents?) and various other kinds of logical reasoning or math, and there are visual equivalents, things like object recognition, for these small component tasks. But these days, so much of what people want to do, also with language models, is nothing like that. You want to have an interaction with the language model and get some recommendations about which backpack would be best for your trip in Europe next month. It's not the same kind of thing, and it's not so easy to come up with a benchmark for whether a large language model gives you an effective interaction that guides you well in shopping. And it's the same problem with these world models. So if we take the game design case: success is that a game designer can produce what they are imagining in a reasonable amount of time. That's really the macro task, but it's a very hard thing to turn into a benchmark. And I think a lot of this is actually going to come down to people voting with their feet. I guess that's what's happening at the large language model level: when people are choosing to use GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers GPT-5 gives me, or no, I feel like I get more accurate detail from Claude.
Host 2
It's a lot of vibe checking.
Chris Manning
It's sort of like vibe checking, I realize, but it's really about whether people feel it's giving them utility and what they want. Right.
Host 2
And the interesting thing there is that a lot of people prefer the visual, right? "This looks pretty," which is not the objective of what this is for. If a game designer is working on something, they care about the game engine state. It can look like whatever; you can fix that up later. Or you can have a really good game state and quickly edit it into 20 different versions that keep state.
Chris Manning
Right, so that's a really important distinction, and it speaks to Moonlake's strength. Great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay, and a lot of the time that doesn't actually require great visuals. There are lots of very successful games with relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals and the game sucks. So keeping those two axes apart is really important in thinking about what matters in a world model for different uses.
Host 1
This conversation is reminding me of some game review and fiction discussions I've had in my non-AI life. Some people might know Brandon Sanderson, the very famous fiction author; he's a big game reviewer, and he's a big fan of video games where you change one thing about what you might normally assume about the world. For example, Baba Is You, I don't know if you've come across it, where the rules change as you play the game; or games where you can reverse time selectively or change gravity selectively. This also reminds me of other kinds of world models created by authors: Ted Chiang is my typical example, where he'll take the world you know today, change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask: is it easy to create alternative worlds that don't exist, where you change one thing, and then run a whole bunch of people through them to see if they work?
Chris Manning
My first answer is that this seems a lot easier and more conceivable to do using technology like Moonlake's than with some of the other world models out there, and Sun can actually make it happen. I'll let him give the second answer.
Host 1
I guess for you, you're constrained by the game engine tool, right? At the end of the day, that's the thought partner you have. If I ask for something where time is never allowed to reverse, or gravity only ever works one way, then that's it. But sometimes gravity might change, but it's
Sun
a lot easier to change with code, as opposed to a model that is learned primarily on data of real and virtual worlds. Take Genie, for example: it's actually trained on a lot of real-world data and a lot of virtual gaming data. And it's hard to say... well, maybe it's easy to say, okay, I want to change the visuals or the time period of the world. But you can't change gravity, for example,
Host 2
I feel like you can, to a limited extent. Everything comes down to code being a better way to execute it. The models aren't that diverse and creative: you can say, okay, make gravity weaker, and they can do that, but it's limited by how you express it in text, and they're only going to do a few iterations. Whereas programmatically, if there's a game engine under the hood, you can go wild. One of the limitations of most models is that they're very overtrained to one style, and extracting diversity is pretty difficult. At least that's something we've seen.
Sun
I mean, are there examples you have in mind where, with existing models, it would be easier to do that without using code? Like certain types of creative intent, or state transitions?
Host 1
Other world models are very good at clipping through things: clipping through my legs, clipping through a rock, because it's just bad. You would have to struggle very hard with your stuff to actually make that happen. Which I think is maybe a topic you actually prepared on: Gaussian splatting versus the other stuff.
Host 2
Yeah. For those not super familiar: there's Gaussian splatting, there's diffusion. What works, what scales up? I feel like in February when Sora 1 came out, the blog post was literally titled...
Host 1
Bring it up. You never know.
Host 2
"Video generation models as world simulators." It's super bitter-lesson-pilled; a lot of it is emergence. Not to go through their whole blog post, but basically their thing was: as you scale up, all this consistency just kind of gets solved. It's a very simple premise; they just scaled up diffusion. That was February 2024. It's already been two years, which is basically five years in AI time. How much more do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot; this is back to the beginning discussion of what's appropriate for the time. And that seems like your approach, right?
Sun
Yeah. The point I'm trying to make is that there are many, many different types of world simulators. Having a world simulator that can produce pixel coherency is very useful for games and marketing and all these things, but it's not as useful as people think when it comes to causal reasoning or embodied AI. And yes, that title is true; we're not saying it's not a great world simulator. But in the blog that we wrote, the bet is that there is going to be a disproportionately large share of value in real-world tasks and virtual tasks where high-resolution pixel fidelity is not needed. And yes, video models have their value.
Host 1
Yeah, this is at the absolute limit of my physics understanding. But one example that comes to mind is basically having to solve the equivalent of a three body problem in a deterministic world, whereas the video models would just approximate it. Good enough.
Sun
Yeah. Right.
Host 1
There's some point at which your approach kind of runs into, well, you now have to simulate the world. Please. Thank you very much. And you're trying to do that, but only to the extent that the game engine lets you. And game engines cannot do some things.
Sun
Yeah, no, I think the interesting, more technical question here is: where do you draw the boundary between what's handled with, let's say, a diffusion prior and what's handled with symbolic priors? Because this boundary can actually be fluid. Maybe what you're getting at is: people are saying pixel-prior everything, but what we're saying is, there's a boundary we draw where we think this provides the most economic value for the domains we care about today. And I do think, and it's something we do internally all the time: given new equations we learn, or new elements of the world, or other knowledge we acquire in the process of developing the models, should we still maintain this line exactly as it is today, or should we move it a little to the left or right? Sometimes we realize that customers or folks want certain things that are better handled with a pixel prior as opposed to a symbolic prior.
Host 1
Your skin thing is an example of moving it right or left? I don't know which direction left or right is. Yeah.
Sun
The Reverie model.
Host 1
Yes.
Sun
Actually we have a few iterations of them. They're actually at slightly different.
Host 1
I know, you should do that. That's a cool dimension to show.
Sun
Yeah.
Host 1
Is quantum mechanics the diffusion prior of our world?
Chris Manning
Right.
Host 1
That's the boundary of classical mechanics versus quantum, right? That's it: at one point God plays dice, and at the other he doesn't.
Sun
I don't know, Chris, if you want to weigh in, but I think generally physics is better with symbolic priors.
Chris Manning
Even quantum physics.
Sun
Even quantum physics, yeah.
Host 1
This is where it gets into MLST territory, as I call it, where he likes to get philosophical. So we're quite friendly.
Host 2
I mean we need to get singularity. I heard some of that.
Host 1
No, I think that is actually really helpful. And man, I just want you to productize this. As a product guy, I'm just like: researchers, it's cool, this sort of theoretical thinking. You have a very good way of thinking about these things, but I just want to see you express it. I do think, fundamentally, when you open up new tools like "use human intent and incorporate it into how you render," well, artists are going to take two to three years to figure out what to do with it, and you just don't know.
Chris Manning
But I think this gives a much more approachable and controllable world, which is
Host 1
the beauty of NLP, which will
Chris Manning
enable it to be adopted and used. And we're very hopeful about that.
Sun
Yeah, we are very focused on commercialization, in the sense that we really believe in the data flywheel approach: we put this in the hands of creators and users, and they will teach us what capabilities our model should improve. And that's why our product is actually in beta, focusing on gaming.
Host 1
What's like the adjacent thing to gaming
Sun
Embodied AI, basically. So maybe I'll start with where we see the platform in three years: the users tell us what they want to achieve. The end goal could be, hey, I want to make something to teach my kids the value of humility. Or it could be, hey, I want to fine-tune my drones to be really good at rescue situations. Or it could be vacuum robots: I want to train my manipulation or vacuum robot to be very robust in my office. Whatever the end goal is, our world model will say: okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want. Maybe for games, the simulation itself is the end product. For certain policies, it's: I can train them within these environments and then help you see where your policy is failing.
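[Editor's note: Sun's "generate a distribution of environments, then train and evaluate your policy" loop can be sketched as a toy domain-randomization example. This is purely illustrative, not Moonlake's actual pipeline; the ball-drop environment, the policies, and all function names are hypothetical.]

```python
# Toy sketch of "generate a distribution of environments, then see where
# your policy fails" (illustrative only; not Moonlake's system).

def make_env(gravity):
    """A 'generated environment': a ball dropped from 10 m under `gravity`."""
    def rollout(policy):
        t = policy(gravity)                   # policy picks when to catch
        height = 10.0 - 0.5 * gravity * t**2  # ball height at that moment
        return height > 0                     # success: caught before landing
    return rollout

def evaluate(policy, gravities):
    """Run the policy across the environment distribution; return failures."""
    return [g for g in gravities if not make_env(g)(policy)]

naive = lambda g: 1.0                   # ignores gravity, always waits 1 s
aware = lambda g: (2 * 9.0 / g) ** 0.5  # waits until the ball is 1 m up

print(evaluate(naive, [9.8, 25.0]))  # naive fails under strong gravity
print(evaluate(aware, [9.8, 25.0]))  # gravity-aware policy survives both
```

Varying gravity here stands in for the "distribution of environments"; the failure list is exactly the "help you see where your policy is failing" output Sun describes.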
Host 1
So in that case, it's much more of a training tool than an evaluation tool.
Host 2
Both, right?
Host 1
Sure, same thing.
Sun
I think it's just this world model that allows people to train any policy that can act in any multimodal environment.
Host 1
Would it be harder to reward hack? Is there an angle here where it's harder to reward hack? I'll put it generally, because that's obviously a key problem a lot of people face when training agents in these environments. And I don't know, can you solve it?
Chris Manning
I think not necessarily. To the extent that there's a misspecified reward, it seems like it could be hacked in a more symbolic world just as in a more pixel-based world. I don't know if Sun's got any thoughts, but I don't think that's really been solved.
Host 1
The other thing that comes to mind is that you could just build a better Sora as a video generation model.
Sun
Right.
Host 1
Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct.
Host 2
It'd be better on some dimensions, right? Like consistency over an hour: whether something exists versus doesn't.
Host 1
Yeah.
Sun
Is the question more like...
Host 1
I'm just riffing on what you can build with the stuff that you have. I do think that the mind of the academic goes immediately to training and evaluation. But art tends to take unusual directions; you might end up... okay.
Chris Manning
Yeah, but the question is: can you use this piece of software to develop compelling gameplay? And I don't think you can take Sora and produce compelling gameplay. If you want a world that you can wander around in a bit, you're good. But what are your abilities to have gameplay mechanics implemented the way you'd like them to be, and to have things stay with the long-term history of your gameplay that influences future actions? I think there's just nothing there for that.
Host 1
Yeah, I do tend to agree; I'm just trying to test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred, and you do end up basically producing a two-hour movie as part of your game.
Sun
No, honestly, there are actually so many applications in adjacent markets that our world model can go into. It's fun to riff on, although on the execution side we need to stay focused: what are the capabilities we want to unlock over time? There's a roadmap for that. But if we're just riffing on the possibilities, I feel like it's endless.
Host 1
The embeddings for "possibility" and "endless" in my mind are very close.
Sun
Yeah.
Host 1
I do want to focus on one weird choice. I don't know if it's weird. Maybe I got something here. Audio.
Sun
Right.
Host 1
You could have just said no audio. And audio, in my mind, has a lot of recursion, whereas in video you can just do ray casting, which is computationally much simpler. Audio just seems way harder. I don't know if you want to comment on the spatial 3D audio problem. Did you really have to do it? I guess you do, to be immersive. But a lot of people treat it as: well, we just stick a TTS model on top.
Host 2
Well, there's a lot more to game audio than just speech.
Chris Manning
Right.
Host 1
It's not just TTS: there's TTS, SFX, spatial audio in my mind, echoes and reflections, and I don't even know what else. I don't know what other problems are in this space.
Sun
Yeah. I think this point speaks more to the benefits of using a game engine as a tool that's available to the model, because part of the spatial audio comes from the code underlying the simulation. And while we do give our model access to other types of audio models
Host 1
as tools, none of them would be spatial, I think.
Sun
Right, but that speaks exactly to the point: we're giving our model an abstraction, a suite of tools, such that it's able to achieve that. And you can argue that spatial audio is an emergence out of the tools and abstractions that we provide to the agents. I think that's the beauty of this approach. It's a lot like how humanity has built technology: Lego blocks that build on top of each other. And it's the same thing here; there are going to be things that just emerge from being able to put these pieces together in combinatorially interesting ways.
Chris Manning
This integrated audio model exploits the understanding and semantics of the Moonlake world. Whereas in general, for the GenAI video models, there's no actual integration with audio at all. Someone might stick some music or a soundscape on top of their video so it's not a silent video, but they're in no way connected into a consistent world model. There's nothing that says: okay, an action is happening in the video, therefore there should be a sound coming from this part of the visual field.
Host 1
Yeah.
Host 2
Is that different than Sora 2? Does it not have audio?
Host 1
It's not that there's no audio; it's that there's no spatial audio.
Host 2
It doesn't, no.
Host 1
I've played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and tried to do the lip sync.
Host 2
I've seen: okay, generate a dog at the beach, reacting to a big wave and moving around.
Host 1
The test is definitely: have the dog move away from the camera and see if the sound goes down. It doesn't, right? Because they don't have spatial audio.
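[Editor's note: the "dog moving away" test comes down to distance attenuation, which a game engine gets for free from its code-level audio model. A minimal sketch of inverse-distance rolloff, a standard game-audio formula; the function name and parameters are illustrative, not any particular engine's API.]

```python
def attenuate(source_gain, distance, ref_distance=1.0):
    """Inverse-distance rolloff: perceived gain falls as the source moves
    away, clamped so sources inside ref_distance aren't amplified."""
    return source_gain * ref_distance / max(distance, ref_distance)

# The dog barks at the same source gain; the heard gain drops with distance.
print(attenuate(1.0, 1.0))  # dog next to the camera  -> 1.0
print(attenuate(1.0, 4.0))  # dog four times farther  -> 0.25
```

A pixel-only video model has to learn this regularity from data; an engine computes it directly from simulation state, which is the point being made here.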
Sun
We do want to. Basically, our world model, the one we're training, is working towards the goal of having a combined latent representation across all these different modalities, such that you can reason across them. For example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. That's the type of capability we want our model to have: to be able to reason.
Host 1
Right.
Sun
And that's the reason that we're sort of taking this multimodal reasoning approach. It's like we want this combined latent space that can. Yeah.
Host 1
Oh, you said latent space, and we like that. Here we have to play the bell every time someone says "latent space." No, you've got to train Daredevil 1, where it's only audio, but you have to work out where everything is. Cool, I think that's about it for our Moonlake coverage. I do think we have a couple of Chris Manning questions on IR and any other attention or NLP topics. Okay, go ahead.
Host 2
Yeah, it's just fun. We talked a bit about how you guys met, but you are basically the godfather of NLP, per se. You've spent a whole career from early embeddings to early attention: you did the 2015 attention for machine translation, everything. You had information retrieval, so RAG before RAG. We just want to shout that out; we admire a lot of that. So what prompted the switch over to world models? How did all that come about?
Chris Manning
To some extent, the enthusiasms and creativity of students. But there's a bit of a history there. Clearly, most of my career has been doing stuff with language, and how I got into research was thinking: it's just so amazing how humans can produce speech and understand each other in real time, and somehow they manage to learn languages when they're kids. How could this possibly happen? Starting off, I was very focused on language, but as it got into the 2010s, I started working on question answering, and then I got interested in visual question answering. And that was an area where it was very noticeable that the visual understanding was bad. These were the days when it seemed like there was almost no visual understanding; you were just getting answers that came from priors. So if you asked how many people are sitting at the table, it always answered 2, regardless of how many people you could see in the picture. So it seemed like these models actually weren't able to get semantic information out of images. I was interested in that problem and tried to work more on it, and that required knowing more about what's happening in vision and how you can represent visual information. And then there started to be this revolution of doing generative AI images, and I had students that started looking at that. Before the era of Moonlake, I was also working with Demi Guo, who founded Pika,
Host 1
and Ian, obviously, with GANs.
Chris Manning
Yeah. Though Ian was never my student. But yeah, I was very aware, for that whole decade, of Ian and GANs. And I mean, Ian was a Stanford
Host 2
undergrad. But yeah, Richard of You.com, I believe he was your student.
Chris Manning
Yeah. And there were links across at that stage as well. There were several papers in that era: Andrej Karpathy was a PhD student at the same time as Richard, so there was some joint language-vision work then too. It seems kind of ancient by modern standards, but we were trying to go from textual dependency graphs to visual scenes at the time.
Host 2
The GloVe embeddings really took over from a lot of TF-IDF and one-hot encoding, all that. And the early vision-language models we saw were like LLaVA-style adapters, right? It's technically still just embeddings in a latent space: let's add images, let's mix modalities. And that's one of the things you put out there too, right? Yeah.
Host 1
Well, thank you for all of that. Thank you for advancing the world on world modeling. I honestly think that if people deeply understand everything we just covered, they will see what's coming. And I think you guys have made some really significant contributions here. What are you hiring for?
Host 2
Where do people find.
Host 1
We agreed that the CTA was a hiring call. Yeah. I mean, don't we have AGI? You don't need engineers anymore, right?
Sun
Yeah. On the model side, we are striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. More specifically, people who have knowledge at the intersection of code generation, computer vision, and graphics. That's the core research background we look for within our team, and the majority of the team today does have both backgrounds.
Host 1
When you say computer vision and graphics, are they the same thing or is it computer vision one thing, graphics another thing? How intertwined are they?
Chris Manning
They're intertwined, but different. And I think this relates to some of the themes we've been talking about: the more explicit underlying world models being constructed inside Moonlake really draw on the computer graphics tradition, and it's then a matter of combining that with the visual understanding of vision.
Host 1
Got it.
Sun
Yeah.
Host 1
All right. So if you've written a game engine, come talk to us, right?
Sun
Oh, yeah, definitely. But I do think that line is increasingly blurred these days, where if you have a general understanding of vision and graphics...
Host 1
I think for your standards it is. For me, it feels like vision, I'll leave that to the big labs; graphics, I can get that. You'd want to do graphics from more first principles. But vision, there are so many vision models off the shelf that I can take, though probably not good enough for yours.
Sun
I see. If you're making that distinction, then maybe we care a little bit more about graphics knowledge.
Host 1
Sometimes a hiring call can be as simple as: if you know the answer to blah, you should talk to me. The sort of core known-hard problem in your world.
Sun
Ah, I see. Yeah. In that case, definitely: if you've written a game engine before, or if you've RL'd a variety of coding models on different objectives. There are many of those.
Host 1
Yeah.
Sun
If you've done multimodal latent space alignment. I intentionally included "latent space alignment."
Host 1
Poor editor has to edit that in every time. Yeah, latent space alignment. Honestly, is it that hard? Well, there are some scripts out there that I've saved for the day I have to do it someday, but I haven't had to do it, but
Sun
it's done, I think. Yeah, there are versions of that that are done. But we are aligning audio, text, language, and video, and basically we have these world models that are able to act as agents in these worlds, extract long-horizon videos, and encode that back into the model to self-improve. So it's an insanely exciting but also technically challenging problem. For people who want to do their life's best work, Moonlake is the place.
Host 2
How big are you guys? Where are you guys based?
Sun
We're currently based in San Mateo, although we're moving up to SF. We're about 18 folks right now.
Host 1
My ending question was going to be what is the name? What's behind the name?
Host 2
Very cool graphics and design, by the way.
Sun
Actually, at the time we started the company, we were thinking a lot about how to make a company name that gives people the vibe of OpenAI, but with almost Industrial Light & Magic vibes, because we care about creativity and using that as a funnel to solve AGI. So we brainstormed a lot around DreamWorks, Industrial Light & Magic; there's basically a space of things that we feel are very semantically close to the company's identity. And it ended up being Moonlake, partly because of the DreamWorks vibe: the DreamWorks moon, the lake, exactly. So that was a little bit of the inspiration. And then the moon was basically about the reflection, and the reflection part also implies the self-improvement loop that we really believe in, and that's the path towards multimodal general intelligence. So that's that.
Host 1
I'll leave that. I love a good name.
Sun
I love a good name.
Host 1
This is great.
Host 2
This is a very good name.
Host 1
It's very good lore; I'm glad I asked the question. I will also say, one of my favorite books or biographies ever is Creativity, Inc., Ed Catmull's story about Pixar: how he was rejected as a Disney animation artist, so he went into computing and brute-forced his way back into Disney.
Sun
And Walt Disney is also one of my favorite founders. His story at the time was like: okay, I'm going to create this immersive park. People didn't even have the technology to create it virtually. But, you know what, let's just build it physically so that people can experience it.
Host 1
So he's the first world modeler?
Sun
No, I tell people that theme parks are world models too.
Host 1
Yeah, yeah, yeah. I mean, you know, It's a Small World, or Epcot with all the little replicas of the countries. Those are very interesting. Okay, well, thank you. We've covered a huge amount. Thank you for your time, and thank you for inspiring us.
Sun
Thank you for having us.
Chris Manning
It's fun chatting.
Sun
Yeah, it's been a good time.
Guests: Chris Manning, Fan-yun Sun (Moonlake)
Hosts: Latent Space team
Date: April 2, 2026
This episode dives into next-generation "world models"—AI models that go beyond language and video to simulate, reason, and interact within multimodal environments. The focus is Moonlake, a startup led by Stanford AI legend Chris Manning and engineer Fan-yun Sun, which proposes that world models for virtual agents, gaming, and embodied AI should be interactive, multimodal, and built around symbolic reasoning, not just raw data scale or pixel-level outputs.
The conversation explores foundational differences in model design philosophies, the challenges in building truly interactive and efficient world models, the limits of diffusion/video-based approaches (like Sora), and Moonlake’s unique architectural and practical strategies—especially around “structure over scale” and abstracted, agentic reasoning.
How the Team Came Together
Sun describes connecting to Chris Manning via industry and academic collaboration, and explains how joint experiences in generating interactive worlds, especially for reinforcement learning agents, inspired Moonlake.
[02:25] "It was very clear to us that, on our way to, let's call it, embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data. ...But everybody's sort of thinking about it from a pure video generation perspective or something else. But we feel like the true opportunity is actually building reasoning models that can do these things, like how humans do these things today." –Sun
Philosophical and Economic Drivers
[02:25] "A lot of dollars being paid out to external vendors to manually curate these types of data ...There's an opportunity there that I feel like nobody's doing it the way I think should be done." –Sun
Chris Manning’s Perspective
[04:04] "Vision understanding sort of stalled out, right? ...all these vision language models, it's the language that's doing 90% of the work and the vision barely works." –Chris Manning
He advocates bringing a symbolic, higher-level structure to vision and world modeling—vs. brute-forcing data and pixels.
From Video Generation to Action-Conditioned Models
[07:04] "People look at these amazing generative AI video models like Sora... but those visuals aren't accompanied by an understanding of the 3D world, ...and that's what's really needed for spatial intelligence." –Chris Manning
Key Term: Action-conditioned world models: Models that can predict, given an action, what will change in the world (not just generate the next video frame).
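The distinction can be sketched as a toy interface: an action-conditioned model maps (state, action) to the next state, while a pure video model extrapolates the next frame from history alone. All names and dynamics below are illustrative assumptions, not Moonlake's actual models or API.

```python
# Illustrative sketch (hypothetical names, NOT Moonlake's API):
# an action-conditioned world model predicts the next state given
# the current state AND an agent's action; a pure video model only
# extrapolates the next frame from past frames.
from dataclasses import dataclass

@dataclass
class WorldState:
    # toy 1-D world: one object's position and velocity
    position: float
    velocity: float

def video_model_step(frames: list) -> WorldState:
    # unconditional next-frame prediction: extrapolates from history,
    # with no notion of what the agent chose to do
    last = frames[-1]
    return WorldState(last.position + last.velocity, last.velocity)

def action_conditioned_step(state: WorldState, action: float) -> WorldState:
    # the action (e.g. a push force) changes what happens next
    new_velocity = state.velocity + action
    return WorldState(state.position + new_velocity, new_velocity)

s0 = WorldState(position=0.0, velocity=1.0)
pushed = action_conditioned_step(s0, action=2.0)   # position 3.0, velocity 3.0
coasted = action_conditioned_step(s0, action=0.0)  # position 1.0, velocity 1.0
```

The point of the sketch: `pushed` and `coasted` diverge because the model is conditioned on the action, which is exactly the supervision signal plain observational video lacks.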
Why Simulation Matters
[09:02] "If you're simply collecting observational video data, you don't actually know the actions that are being taken ...so there's a lot of premium on collecting action-conditioned video data—which is part of why there's been a lot of interest in using simulation." –Chris Manning
Balancing Scale and Structure
[05:49] "Scale is good too...but you want the structure to be able to much more efficiently learn." –Chris Manning
"What is the right abstraction level today?" is a recurring question for Moonlake: not discarding the “bitter lesson” (data scale wins), but focusing on meaningful, efficient representations.
Analogy to Human Cognition
[12:47] "Human beings are doing ...very abstracted semantic description of the world around you... all the evidence from neuroscience and psychology is that most of what comes into people's eyes is never processed." –Chris Manning
Reasoning Traces and Multimodal Agents
[14:36] Sun describes providing not just raw visual/audio output, but a chain of reasoning involving geometry, physics, affordances, symbolic logic, all mapped to an interactive state—a big contrast to current LLM and video models.
[22:08-22:44] Case study: Reasoning traces for a bowling game. Moonlake models do not just show bowling, but reason through physics, scoring, and event consequences.
Comparison to Unity Code Generation
[25:38] "Physics engines or tools or code are cognitive tools ...Tools that the model can employ as means to an end." –Sun
The goal is for the model to reason and decide when to use which tools—not just translate prompts into code.
Symbolic vs. Purely Visual (JEPA)
[16:16] "Yann is a very visual thinker. ...he thinks language is just a low bit rate communication mechanism ...but humans massively ahead [of animals] in what we understand about the world... what took off for us was that humans managed to develop language and that gave a symbolic knowledge representation and reasoning level..." –Chris Manning
How Moonlake Keeps Worlds Interactive, Not Just Pretty
[29:44] "Typically the diffusion models are producing the whole scene and it looks lovely, but there isn't spatial understanding behind it." –Chris Manning
Moonlake’s renderer (Reverie) receives semantic, persistent representations and then applies style/fidelity—ensuring interactive state and logic remain intact.
[30:35] "We actually believe that this is going to be the next paradigm of rendering. It's going to replace how rasterizers...because ...you can literally play any game in photorealistic styles."
[31:03] "One thing is to just say, okay, it's the appearance. But the second thing is also to say there's these novel interactions that are possible because this renderer now actually has priors of the world." –Sun
Programmable/Interactive Rendering
The renderer is part of the gameplay loop and can respond dynamically to game events (e.g., "bullets turn into apples after collecting 10 apples").
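A minimal sketch of what "rendering inside the gameplay loop" could look like, using the bullets-to-apples example. Everything here is hypothetical and simplified; it is not Reverie's actual design or API. The key property it illustrates is that the persistent semantic state stays authoritative while appearance is remapped by events.

```python
# Toy sketch of event-aware rendering (hypothetical names; NOT Reverie's API).
# The semantic game state is persistent and authoritative; the renderer only
# maps entity kinds to appearances, and game events can remap that styling.
game_state = {"apples_collected": 0,
              "entities": [{"id": 1, "kind": "bullet"}]}
style_map = {"bullet": "metallic sphere", "apple": "shiny red apple"}

def on_apple_collected(state, styles):
    """Game event handler: after 10 apples, bullets are *rendered*
    as apples, while game logic still sees them as bullets."""
    state["apples_collected"] += 1
    if state["apples_collected"] >= 10:
        styles["bullet"] = styles["apple"]

def render(state, styles):
    """Map the persistent semantic state to appearances."""
    return [styles[e["kind"]] for e in state["entities"]]

for _ in range(10):
    on_apple_collected(game_state, style_map)

appearances = render(game_state, style_map)   # ["shiny red apple"]
```

After ten collected apples the bullet entity is rendered as an apple, yet its semantic kind (and therefore collision logic, scoring, etc.) is unchanged; that separation is what keeps restyled worlds interactive rather than just pretty.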
Moonlake's abstraction layer allows human creators to inject intent at both high- and low-level world parameters: [32:02] "A lot of the times, whether it's for embodied AI or gaming, you want a layer where human can inject their intentions, right? ...it allows basically human intent to be expressed in these worlds much more explicitly and distributionally." –Sun
"We're not going to be more creative than our users...our job is to let them express their intent." –Sun [33:37]
Benchmarks are Outdated
[36:39] "This whole space is extremely difficult...in the early days it seemed very easy to have good benchmarks ...But these days, so much of what people are wanting to do ...is nothing like that ...and it's the same problem with these world models." –Chris Manning
[39:00] "It's sort of like vibe checking ...but it's actually whether people feel it's giving them utility." –Chris Manning
The Symbolic-Pixel Boundary
Where do you split what should be modeled symbolically vs. at the pixel/data level? [45:57] "Where do you draw the boundary between what's handled with diffusion prior and what's handled with symbolic priors? ...this boundary can actually be fluid." –Sun
Sometimes a customer need or new knowledge shifts the boundary.
True Multimodal Integration
[54:06] "Part of the spatial audio is from the code that's underlying the simulation. ...But that's exactly sort of more point to we're giving our model an abstraction or a suite of tools such that it's able to achieve that." –Sun
Games First, then Embodied AI, then Beyond
World Model as a Creative, Open Platform
[50:30] "Just this world model that allows people to train any policy that can act in any multimodal environment." –Sun
What Moonlake Looks For
[61:25-63:49] Moonlake seeks candidates with expertise at the intersection of code generation, computer vision, and graphics: practical experience writing game engines, plus reinforcement learning, multimodal/fusion models, and space alignment.
About the Name ("Moonlake")
[64:35] Inspired by DreamWorks’ moon/creativity vibes; the “lake” reflects self-improvement, iteration, and the ambition to be the "Pixar/OpenAI of world modeling."
Notable Quotes
"Vision understanding sort of stalled out, right? ...all these vision language models, it's the language that's doing 90% of the work and the vision barely works." — Chris Manning [04:04]
"What is the right abstraction level today? ...The most bitter lesson approach is to train a next byte prediction model...but the scale and computing need to achieve that [are immense]. So that's why we always come back to like, okay, what is the most efficient way to do it?" — Sun [14:36]
"Yann LeCun is a dear friend of mine, but he has never appreciated the power of language in particular or symbolic representations in general." — Chris Manning [16:16]
"Games are really all about the concept, the gameplay ...there are just lots of very successful games which have relatively primitive visuals ...and other games where people have spent millions producing photorealistic visuals and the game sucks." — Chris Manning [39:30]
"We're not going to be more creative than our users...our job is to let them express their intent." — Sun [33:37]
| Timestamp | Segment | Summary |
|-----------|---------|---------|
| 02:25 | Genesis of Moonlake & Motivation | Why world models, origins, sim theory, embodied intelligence |
| 04:04 | Structure vs. Scale | Vision/language dichotomy, the need for a new approach |
| 07:04 | Action-Conditioned World Models | Why video generation isn't enough |
| 14:36 | Reasoning Traces and Model Design | Multimodal reasoning; blog post discussion |
| 16:16 | Contrasting JEPA (Yann LeCun) vs. Moonlake | Symbolic reasoning importance |
| 29:44 | Rendering: "Reverie" and Interactive Fidelity | How Moonlake renders worlds, keeping logic/causality |
| 32:02 | Human Intent and Creator Involvement | Programmable worlds, user agency |
| 36:39 | Evaluating World Models | Benchmarks vs. user-centric evaluation |
| 45:57 | The Symbolic-Pixel Boundary | Fluid split in model architecture |
| 54:06 | Audio and Multimodality | Integrated, spatial audio; true multimodal state |
| 61:25 | Team, Hiring, Company Philosophy | Who should apply and what Moonlake values |
| 64:35 | Branding/Name: "Moonlake" | Why the name and inspiration |
Moonlake positions itself as the future of world modeling: not just "pixel soup" or video generation, but agentic, efficient, reasoning-based, and multimodal. The team is betting on a structured hybrid of symbolic and data-driven methods, merging the computer graphics tradition with modern foundation models, to offer deeper interaction and user creativity, and potentially to serve as both a rendering/game engine and a simulation/training tool for AI agents.
Moonlake’s challenge to the field: Don’t be blinded by scale and photorealism alone—structure, reasoning, and the right semantic abstractions are key for next-generation interactive AI.
For the full transcript, related blog posts, and hiring info, visit latent.space.