Podcast Summary: Latent Space – World Models & General Intuition: Khosla's Largest Bet Since LLMs & OpenAI
Date: December 6, 2025
Host: Latent.Space
Guest: Pim de Witte (CEO, General Intuition "GI")
Length: Key content detailed through 01:04:06
Episode Overview
This episode features an in-depth, first-ever interview with Pim de Witte, founder of General Intuition (GI), an ambitious world model lab spun out from Metal—the company behind the largest dataset of labeled video game action clips. The conversation explores the rise of foundation models beyond LLMs, the massive $134M seed led by Khosla Ventures (its biggest since OpenAI), and the role of high-fidelity human behavior data in the future of spatial-temporal AI agents and robotics. Listeners are taken through technical demos, the origins of GI, privacy-conscious data approaches, and the long-term ambition to define the gold standard for machine intelligence in both simulation and the real world.
Key Discussion Points & Insights
1. General Intuition’s Origin & Unique Data Advantage
- Metal’s Background: A retroactive video clipping platform, Metal amassed 3.8B game clips focused on “peak” moments—essentially mining for interesting, high-skill instances of human behavior (00:00–02:07).
- "We now have the largest dataset of ground-truth, action-labeled video footage on the internet by maybe one or two orders of magnitude." (16:54, Pim)
- Privacy-First Design: Rather than logging raw key presses (e.g., WASD), Metal captures abstracted action labels—ensuring privacy while retaining the essential behavioral signals for training (17:56–18:08).
- Data Application: This action-labeled video enables training world models that predict actions purely from pixels, transferring game-learned intuition to real-world scenarios (06:10–06:33).
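The privacy-first design described above can be illustrated with a minimal sketch: raw input events stay in the capture layer and only coarse action labels are stored. The mapping table and function names here are hypothetical, not GI's or Metal's actual schema.

```python
# Hypothetical sketch of privacy-preserving action abstraction: raw key
# presses (e.g., WASD) are mapped to coarse action labels, and anything
# unrecognized (such as chat keystrokes) is dropped rather than logged.
# All names are illustrative, not Metal's real pipeline.

RAW_TO_ABSTRACT = {
    "w": "move_forward",
    "a": "strafe_left",
    "s": "move_backward",
    "d": "strafe_right",
    "space": "jump",
    "mouse_left": "primary_action",
}

def abstract_actions(raw_events):
    """Map a stream of raw input events to abstract action labels,
    discarding any event without a known mapping."""
    return [RAW_TO_ABSTRACT[e] for e in raw_events if e in RAW_TO_ABSTRACT]

print(abstract_actions(["w", "w", "space", "x", "mouse_left"]))
# ['move_forward', 'move_forward', 'jump', 'primary_action']
```

The key design property is that unmapped events (like the stray "x" above) never reach the stored dataset, so sensitive free-text input cannot leak through.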
2. Demo: Imitation Learning Agents and World Models
- Early Demos: GI’s agent “sees” only game frames and predicts next actions, imitating human navigation, getting “stuck” and unsticking itself, revealing both human-like errors and superhuman moments (a consequence of training on user-curated highlights) (02:08–05:14).
- Notable Quote: "The baseline of our dataset is peak human performance." (05:15, Pim)
- Technical Approach:
- Agents trained through pure imitation learning—no RL or game state access, running in real time, and able to generalize across humans, bots, and different game environments (03:07–04:59).
- World models go further, using pretraining from scratch or open-source videos, modeling real-world physics, partial observability, and camera effects—a major step towards agents with spatial and temporal intuition (08:36–11:44).
- Impact:
- Transfer learning: “We trained it on less realistic games and transferred it over to a more realistic game. Then... to real-world video, which means you can use any video on the Internet as pretraining.” (06:10–06:33, Pim)
- Partial observability: The model can handle occlusions (e.g., smoke) and maintain spatial consistency (10:33–11:44).
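The pure-imitation setup described above (frames in, human action out, no RL and no game-state access) amounts to behavior cloning: minimize cross-entropy between the policy's predicted action distribution and the action the human actually took. This is a toy sketch with a linear stand-in for the video encoder; GI's actual architecture is not public, so every name here is a placeholder.

```python
# Minimal behavior-cloning sketch: a policy scores each action from frame
# features and is penalized by cross-entropy against the human's choice.
# The linear model stands in for a real video encoder + policy head.
import math

ACTIONS = ["forward", "left", "right", "jump"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(frame_features, weights):
    # One linear score per action; weights is {action: [w1, w2, ...]}.
    logits = [sum(w * f for w, f in zip(weights[a], frame_features))
              for a in ACTIONS]
    return softmax(logits)

def bc_loss(frame_features, human_action, weights):
    # Cross-entropy against the human-chosen action: pure imitation,
    # no reward signal and no access to internal game state.
    probs = predict(frame_features, weights)
    return -math.log(probs[ACTIONS.index(human_action)] + 1e-12)
```

Because the dataset is biased toward user-curated highlights, the cloning target is "peak human performance" rather than average play, which is consistent with the superhuman moments seen in the demo.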
3. Building a Specialized World Model Lab
- Why GI stayed independent:
- GI declined a $500M acquisition offer for Metal’s dataset to build a unique, world-model-focused lab, believing their imitation learning–first approach could leapfrog others (05:20–06:10, 27:16–28:22).
- "We think we could essentially leap every single company that's forced to either be consumers of world models or build world models and take this foundation model bet for spatial-temporal agents." (27:57, Pim)
- Seed Round & Vision:
- Khosla Ventures’ $134M seed is its biggest since OpenAI.
- Khosla’s funding philosophy: drill into the founder’s 2030 vision and work backward from first principles (33:05–34:23).
4. Research Context: Papers and Collaborations
- Inspiration from the Genie, SIMA, and DIAMOND world-modeling papers.
- Hired co-authors from landmark projects (30:16–31:07).
- Comparison with DeepMind, Meta’s Quest, and ongoing collaborations (32:45–33:05, 43:30–45:16).
5. Technical and Product Roadmap
- Foundation Model Analogy:
- "LLMs were about predicting text tokens... what if we predict action tokens on essentially what is the equivalent of the Common Crawl dataset, but for interactivity, vision input." (24:34–25:10, Pim)
- Distillation for Scale: Tiny models can act in real time, sacrificing some performance for deployment at the edge (12:01–13:41).
- Real-World Use Cases:
- Early customers: Major game developers and engine providers, replacing scripted behavior trees with API-based, real-time action agents (50:08–50:45).
- Applications in robotics/manufacturing, provided the robot is “game controller” compatible (55:59–56:55).
- Business Model:
- Primarily an API: Game/frame inputs in, action outputs out. No plans to license data, but potential for custom models and RL-based feedback loops (56:29–57:06).
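The "distillation for scale" idea mentioned earlier (tiny models running in real time at the edge) is commonly implemented with a temperature-softened soft-target loss, in the style of Hinton et al.'s knowledge distillation. GI's actual recipe is not public, so the sketch below is illustrative: a small "student" policy is trained to match a large "teacher" policy's action distribution.

```python
# Sketch of knowledge distillation for a compact action policy: the student
# matches the teacher's softened action distribution via cross-entropy.
# Temperature > 1 exposes the teacher's relative preferences among actions.
# All names and values here are illustrative, not GI's training setup.
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher distribution and the
    softened student distribution over the action vocabulary."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(teacher, student))
```

The loss is minimized when the student's softened distribution matches the teacher's exactly, which is what lets a much smaller model retain most of the teacher's behavior while meeting real-time latency budgets.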
6. Long-Term Vision & Market Strategy
- Simulation First, Real-World Next:
- "Our north star is to represent scientific problems in 3D space... and have a spatial agent capable of perceiving that space, using LLM reasoning plus spatial intuition." (51:10–51:56, Pim)
- Ambition:
- By 2030, GI aims to power "80% of all atoms-to-atoms interactions driven by AI models" and "100x more in simulation" (62:09–63:49).
7. Memorable Quotes & Moments
| Timestamp | Speaker | Quote |
|-----------|---------|-------|
| 02:08 | Pim | "What I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would." |
| 05:15 | Pim | "The baseline of our dataset is peak human performance." |
| 27:57 | Pim | "We think we could essentially leap every single company... and take this foundation model bet for spatial-temporal agents." |
| 33:05 | Pim | "He [Khosla] asked you to draw a 2030 picture of your company... and expects you to do that flawlessly, challenging any part of the vision." |
| 46:37 | Pim | "[Yann LeCun] proclaimed LLMs to be a dead end. That was one of the things that inspired me to do this." |
| 54:12 | Pim | "We have more people at any given time on Metal playing with steering wheels in Truck Simulator than Waymo has cars on the road." |
| 62:09 | Pim | "In 2030, we want to be the gold standard of intelligence… nailing spatial temporal reasoning to go after the root killer problem of intelligence itself." |
8. Collaboration & Open Research
- Open Science Partnership:
- New partnership with QTAI (France)—Eric Schmidt–funded, supports university research using Metal's data, aiming to democratize spatial-temporal RL research for negative event prediction and beyond (60:43–61:57).
Important Segment Timestamps
- 00:00–02:07 — Host intro, Metal, and the unique data asset
- 02:08–11:44 — GI demo: imitation agents, world models, transfer to real-world video
- 12:01–18:08 — Data privacy, overlay strategies, action abstraction
- 27:16–34:23 — Staying independent, funding story and Khosla’s approach
- 38:39–41:07 — Self-taught AI learning journey
- 43:30–46:37 — World models definition, comparisons with Fei-Fei Li, Yann LeCun, and others
- 50:08–57:06 — Product direction: APIs, use cases, and business model
- 60:43–62:09 — Open research vision, university collaboration
- 62:09–64:06 — Closing vision: GI’s 2030 “gold standard” goal
Conclusion
This interview offers a rare, “in-the-lab” account of the frontier in world models—a new AI paradigm that promises to extend foundation models from text/code into simulated and embodied intelligence. Pim de Witte and the GI team are betting that their privacy-conscious, action-rich video gaming dataset will allow agents not just to play games, but to develop the core spatial and temporal intuitions required for robotics and scientific breakthroughs. Backed by Khosla Ventures’ unwavering support for the vision, GI is setting out to define the industry benchmarks for machine intelligence in both simulation and real-world environments—a true “gold standard” for embodied learning.
For more, check the full show notes at Latent Space.
