Latent Space: The AI Engineer Podcast
Episode: NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light"
Guests: Nader Khalil (Brev), Kyle Kranen (Dynamo, NVIDIA)
Host(s): Latent.Space team, with guest host Vivu
Date: March 10, 2026
1. Overview
This episode dives into NVIDIA’s evolving approach to developer experience, data center-scale inference, and the cutting-edge world of AI agents. Featuring Nader Khalil, founder of Brev (now part of NVIDIA), and Kyle Kranen, Dynamo architect at NVIDIA, the discussion spans the journey from quirky startup stunts (surfboards and foil-pressed GPU cards) to data-center-scale inference engines, agent orchestration, and the cultural mantras (“SOL,” or “Speed of Light”) that infuse NVIDIA’s distinctive engineering ethos.
Core Themes:
- The evolution of developer tooling from Brev to NVIDIA
- Scaling inference (Dynamo) to support agent-driven, large-context applications
- Internals of NVIDIA’s culture, research philosophy, and approach to developer UX
- The role and risks of AI agents in enterprise and engineering settings
- Hardware/software/agent co-design, and a peek into where multi-agent systems, context length, and AI infrastructure are heading
2. Brev’s Journey and Developer Experience at NVIDIA
Brev’s Origin Story & Conference Stunts (02:09–06:15)
- Brev began as a startup focused on simplifying GPU access for developers (“one-click deploys for any software on GPU”) — think big, visible GPU icons and SSH simplicity rather than cloud-provider drop-down hell.
- Marketing stunts like the surfboard booth and foil-pressed GPU cards at NVIDIA’s GTC helped Brev “stand out” and signaled their developer-first ethos.
- Memorable quote: "Why are we spending time doing these stunts for GPUs? ...I do think it just shows the level of care throughout Brev and also Dynamo and NVIDIA." — Host (06:17)
- On printing the cards: "It's a third-generation San Francisco shop...they poked out over the walls. So you could see the Brev booth and no one else, just from very far away." — Nader (02:16–05:00)
- The acquisition by NVIDIA preserved Brev’s soul: Brev.Nvidia.com is now “the front page for GPUs” (08:14–08:21).
Developer Experience Scaling (09:13–11:05)
- Brev’s democratization of GPU access aligns with NVIDIA's "widening developer base," from data scientists to total beginners — including Nader’s own family.
- "AI is a big equalizer and you're seeing a more technologically literate society. ...You really understand who your end user is...you have to almost reinvent the practice." — Nader (09:46–10:31)
- NVIDIA’s internal culture: deep technical curiosity is prized; even VPs download and try new tools personally.
3. "Speed of Light" (SOL): NVIDIA’s Cultural Operating Model (13:48–18:38)
- SOL: Speed of Light — cultural shorthand for “what’s the physical (theoretical) limit?”, applied to product delivery, experimentation, and hardware.
- Memorable definition: “SOL is essentially like: what is the physics? Right. ...let's just understand the physics. What is the theoretical limit to how fast this can go? And then start to tell me why.” — Nader (14:03)
- Memorable definition: “SOL is a term at NVIDIA used to instigate a compelling event. ...What is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here.” — Kyle (15:02)
- Application: everything from hardware design to product launches; stability and ops/maintenance are weighed in as well.
- Jensen Huang (NVIDIA CEO) and frontline engineers alike use SOL to cut through noise, “create urgency,” and focus on what truly gates progress.
4. NVIDIA Research Culture & Organizational Dynamics (21:03–24:27)
- NVIDIA encourages choose-your-own-adventure engineering — engineers “index into passion,” jump teams, email “out of chain,” and organize in email mosh pits.
- “The mission is the boss. ...honestly for every new initiative, that's what it feels like — a game of pickup basketball.” — Nader (19:14, 21:20)
- "Momentum is the only authority." — Nader (23:59)
- If you build something, show progress, and get people to use it, support and resources follow.
- Jensen: “We're completely happy investing in zero-billion-dollar markets.” — invention and research are driven by expected future market importance, not short-term ROI.
5. Data Center-Scale Inference & Dynamo (26:32–43:44)
Why Scaling Inference Is Hard
- Inference at planetary scale — especially for multi-agent or long-context applications — faces hardware and algorithmic scaling ceilings.
- Key challenge: scaling up (adding more GPUs to a big model) hits hardware boundaries (e.g., NVLink domain limits).
- "The maximum NVLink domain for most DGX H100s is 8 GPUs. Beyond that, you have to use InfiniBand, which is still fast, but not as fast as NVLink." — Kyle (29:07–29:59)
- Tradeoffs across three axes:
- Quality (accuracy/completeness of results)
- Cost (efficiency, $$$)
- Latency (speed/SLA)
- "When you start this journey of trying to figure out how you want to host a model, you think about three things: what is the model I need to serve, how many times do I need to call it, what does the workflow look like..." — Kyle (33:00–34:26)
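The sizing questions in Kyle's quote can be turned into a back-of-envelope calculation. The sketch below is illustrative only: the function name, the 20% headroom factor, and all the numbers are assumptions, not NVIDIA or Dynamo figures.

```python
import math

def replicas_needed(
    requests_per_sec: float,       # how many times the model is called
    tokens_per_request: float,     # avg prompt + completion tokens per call
    replica_tokens_per_sec: float, # measured throughput of one replica
) -> int:
    """Minimum replicas to sustain the load, with 20% headroom for bursts."""
    demand = requests_per_sec * tokens_per_request
    return math.ceil(1.2 * demand / replica_tokens_per_sec)

# Example: 50 req/s at ~1,500 tokens each, one replica sustaining 20k tok/s.
print(replicas_needed(50, 1500, 20_000))  # -> 5
```

Quality, cost, and latency enter through the inputs: a bigger model lowers `replica_tokens_per_sec` (raising cost), while a tighter latency SLA effectively demands more headroom.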
Enter: Dynamo (26:38–44:00)
- Dynamo: a data-center-scale inference engine that optimizes scaling out (rather than up), sits on top of frameworks like vLLM, SGLang, and TensorRT-LLM, and enables efficient inference for large, agentic workloads.
- Modular design—integrates optimizations like disaggregation, KV-cache sharing, and specialized scheduling for prefill and decode workloads.
- "There are tiers of developer base that were added. ...The amount of layers that are added to that developer stack has just exploded because AI has become ubiquitous." — Kyle (10:34–11:05)
- "You actually have to scale out. ...We kind of realized there was a lot of potential optimization that we could do in scaling out and building systems for data center scale inference." — Kyle (28:53)
- Disaggregation — splitting “prefill” and “decode” onto different hardware/resources to match their differing compute/memory profiles, managed via a Kubernetes-based scheduler (Grove).
Dynamo’s Optimizations (38:10–45:32)
- Prefill (long sequence encoding) = compute-bound, quadratic scaling.
- Decode (token generation) = memory-bound, linear scaling.
- Disaggregation, machine stratification, dynamic pool balancing: Assign dedicated resources to each phase, adapt pool sizes as workload changes.
- "Dynamo...provides a scheduling API for Kubernetes that allows you to actually represent and affect this scheduling on your actual hardware." — Kyle (42:29)
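The prefill/decode split above can be sketched as a toy pool balancer: prefill work grows quadratically with prompt length, decode work linearly with tokens generated, and the two pools are resized in proportion to outstanding work. All names, the cost model constants, and the heuristics are illustrative assumptions, not Dynamo's actual API.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int  # tokens to prefill (encode)
    new_tokens: int     # tokens still to decode (generate)

def prefill_units(r: Request) -> float:
    # Compute-bound: every prompt token attends to all prior tokens.
    return float(r.prompt_tokens ** 2)

def decode_units(r: Request) -> float:
    # Memory-bound: one step per generated token.
    return float(r.new_tokens)

def split_pools(total_gpus: int, prefill_work: float, decode_work: float):
    """Split GPUs between pools in proportion to work, keeping >=1 GPU each."""
    share = prefill_work / (prefill_work + decode_work)
    prefill = min(total_gpus - 1, max(1, round(total_gpus * share)))
    return prefill, total_gpus - prefill

reqs = [Request(4096, 256), Request(1024, 512)]
pools = split_pools(
    8,
    sum(prefill_units(r) for r in reqs),
    1e4 * sum(decode_units(r) for r in reqs),  # 1e4: assumed cost per decode step
)
print(pools)  # -> (6, 2)
```

A real scheduler (Grove, per the quote above) also has to migrate KV caches between pools and rebalance dynamically as the prompt/generation mix shifts.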
6. Hardware/Model/Context Co-Design and Scaling Context Length (43:44–51:21)
- The push for longer context lengths is mostly attention-limited (“quadratic” scaling). Hybrid models (e.g., Kimi, DeepSeek) try to manage context size via architectural tweaks (e.g., attention heads, expert sparsity).
- "Kimi has more experts but fewer attention heads...they did an experiment: attention scales with the number of heads. If you have 64 heads versus 32, you do half the work..." — Kyle (45:55)
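The head-count arithmetic in the quote follows from textbook attention FLOP counting: with a fixed per-head dimension, the score (Q·K^T) and weighted-sum (A·V) matmuls are each proportional to the number of heads. The shapes below are generic attention math, not figures from the episode.

```python
def attention_flops(seq_len: int, n_heads: int, head_dim: int) -> int:
    """Approximate FLOPs for one self-attention layer's core matmuls."""
    per_head = 2 * seq_len * seq_len * head_dim   # Q·K^T multiply-adds
    per_head += 2 * seq_len * seq_len * head_dim  # A·V multiply-adds
    return n_heads * per_head

full = attention_flops(seq_len=8192, n_heads=64, head_dim=128)
half = attention_flops(seq_len=8192, n_heads=32, head_dim=128)
print(half / full)  # -> 0.5, i.e. halving heads halves attention work
```

The quadratic `seq_len` factor is also why this term dominates at long contexts, motivating the architectural tweaks above.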
- Co-design: hardware and model architectures (and even agent harnesses) are increasingly developed together — “model/hardware/context co-design” (46:45).
- “Unhobblers”: critical architectural breakthroughs (“scientific discoveries”) that unlock orders-of-magnitude gains — e.g., multi-head latent attention, grouped-query attention.
- Leopold Aschenbrenner’s “Situational Awareness” essay is cited as modeling this (49:01–50:38).
- The current “hard limit” sits at roughly 1-million-token contexts; the next leaps likely hinge on “unhobblers.”
7. Agent Inference at Planetary Scale: Practical Engineering & Security (54:44–67:41)
Agent Infrastructure, Security, and Internal Rollouts
- Agents at NVIDIA perform tasks touching files, the internet, and custom code execution. Security principle: never let an agent have all three powers at once.
- "You should really only let an agent do two of those three things. ...If you have access to Internet and your file system, you should know the full scope...Otherwise, malware can get injected." — Nader (58:04)
- Massive internal adoption (e.g., of OpenAI's Codex and Claude Code); tools spread rapidly through the “mosh pit” email culture.
- Security reviews are robust, balancing progressive adoption with enterprise caution.
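The "two of three" rule above can be expressed as a simple capability gate. The class and method names below are hypothetical, a minimal sketch of the principle rather than any NVIDIA tool.

```python
FILES, INTERNET, EXEC = "files", "internet", "exec"

class AgentSandbox:
    """Grants an agent at most two of {files, internet, code execution}."""

    def __init__(self, capabilities: set[str]):
        risky = {FILES, INTERNET, EXEC} & capabilities
        if len(risky) >= 3:
            raise PermissionError(
                "refusing to grant files + internet + code execution together"
            )
        self.capabilities = capabilities

    def allows(self, cap: str) -> bool:
        return cap in self.capabilities

# A research agent that browses and writes files but cannot run custom code:
agent = AgentSandbox({FILES, INTERNET})
print(agent.allows(EXEC))  # -> False
```

The point of the rule is containment: any two capabilities limit the blast radius, while all three let injected content exfiltrate or execute itself.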
Agents, CLI, and the OS Shell
- CLI “wrappers” make agent integrations manageable, discoverable, and secure; possibility of open-sourcing a wider “open CLI foundation” for core business tools.
- "Everything needs some CLI tool. ...Computing began with a terminal. ...Now LLMs are navigating user interfaces, but ironically we're not empathetic to the machine anymore. Just give the LLM access to the shell." — Natter (67:13–67:41)
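One way to read the CLI-wrapper idea above is an allowlisted shell surface: the agent gets one function that runs approved CLI tools instead of arbitrary shell access. The allowlist contents and function name here are invented for illustration.

```python
import shlex
import subprocess

ALLOWED = {"git", "ls", "grep"}  # assumed allowlist of approved CLI tools

def run_cli(command: str) -> str:
    """Run one allowlisted CLI command and return its stdout to the agent."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"tool not allowlisted: {argv[:1]}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return result.stdout

# The agent can list files, but anything outside the allowlist is refused.
listing = run_cli("ls .")
```

Discoverability comes cheap with this pattern: `tool --help` output is itself text the LLM can read.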
8. Multi-Agent Systems, Subagents, and “System as Model” (73:24–76:07)
- Architectural vision: future AI systems will look like “systems as models” — many models (or agents) collaborating under the hood, even as the API remains a simple single-model call on top.
- "Instead of having a single model, you have a system of models and components working together to emulate the black box model." — Kyle (74:41)
- Dynamo’s roadmap includes supporting multi-agent orchestration and model routing (local and foundation models, context-specific selection).
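The routing idea can be sketched as one entry point that looks like a single model call but dispatches to a local model for small requests and a foundation model otherwise. The threshold, names, and stub backends are invented for illustration, not from Dynamo's roadmap.

```python
from typing import Callable

def make_router(
    local: Callable[[str], str],
    foundation: Callable[[str], str],
    max_local_chars: int = 2000,  # assumed routing threshold
) -> Callable[[str], str]:
    """Return a single-model-shaped callable that routes by prompt size."""
    def route(prompt: str) -> str:
        target = local if len(prompt) <= max_local_chars else foundation
        return target(prompt)
    return route

# Stub backends standing in for real inference clients:
chat = make_router(
    lambda p: "local:" + p[:10],
    lambda p: "foundation:" + p[:10],
)
print(chat("short question"))  # -> local:short ques
```

Production routers would select on more than length (task type, cost budget, required quality), but the caller-facing shape stays a single call, matching the "system as model" framing above.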
9. The Year of the Subagent & Ongoing Scaling Challenges (76:19–79:36)
- Subagent trend: main agents kicking off subordinate “tool agents,” each specialized for different tasks and context windows.
- Ongoing tension between scalability (long-running agents), efficiency, and cost:
- "There’s insatiable demand for tokens. …Every improvement just makes demand even higher." — Nader (76:11)
- Varied agent autonomy: practical agents today commonly run for 20–45 minutes; longer (“all-day,” “all-week”) agents will need continued architectural and scientific breakthroughs.
10. San Francisco, Community, & AI Engineer Culture (79:36–82:57)
- A segment of the show reminisces about the unique energy of San Francisco’s AI builder community (“the city believes in you more than you do,” cheap rent, collaborative neighborliness).
- "Imagine some random person DMs you feedback on this blog post, and you do a Zoom call. …People are trained to write a certain way in school, and never see the broader world." — Host (82:48–82:55)
11. Notable Quotes & Moments
- "SOL is a term at NVIDIA used to instigate a compelling event...what is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here." — Kyle (15:02)
- "Momentum is the only authority." — Nader (23:59)
- "Amazon Ads...talked about using Dynamo for generative recommendation, which was super weirdly cathartic for me...I've supplanted what I was working on." — Kyle (26:10)
- "I feel a little embarrassed for being proud of my SVG function earlier." — Nader (43:01)
- "Agents can do three things: access files, access the Internet, and write custom code… you should only let an agent do two of those." — Nader (58:04)
- "The model/hardware/context co-design thing is super interesting. It's my secret side passion." — Kyle (43:55)
12. Key Timestamps
- [02:09–06:15] – Brev’s conference stunts, startup beginnings
- [08:13–09:46] – NVIDIA acquisition, developer experience, Launchables
- [13:48–18:38] – SOL (“Speed of Light”) as a NVIDIA culture code
- [26:32–34:26] – Technical breakdown: Dynamo, inference scaling, cost-quality-latency axis
- [38:10–45:32] – Dynamo optimizations: disaggregation, prefill vs. decode, scheduling
- [54:44–61:13] – Agent adoption, CLI interfaces, security practices
- [73:24–76:07] – Subagents, “system as model,” Dynamo roadmap for multi-agent orchestration
- [79:36–82:57] – Builder culture in San Francisco and the AI engineer movement
13. Final Thoughts
This episode offers a rare inside look at NVIDIA’s AI transformation, from the developer’s perspective on up. Hear how deep technical culture, quirky beginnings, “zero-billion-dollar” bets, and a relentless focus on UX and infrastructure have made NVIDIA a crucible for planetary-scale AI engineering. Dynamo and Brev are more than tools — they represent a philosophy of making complexity simple (and electrifyingly fast), the agentic future of software, and the beating, mission-driven heart of Silicon Valley’s AI innovators.
For more technical detail, check out NVIDIA’s Dynamo and Brev documentation, and the upcoming GTC sessions referenced in the episode. Full show notes.
