Summary (9 min read)

Latent Space: The AI Engineer Podcast

Episode: NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light"
Guests: Nader Khalil (Brev), Kyle Kranen (Dynamo, NVIDIA)
Host(s): Latent.Space team, with guest host Vivu
Date: March 10, 2026

1. Overview

This episode dives into NVIDIA’s evolving approach to developer experience, data center-scale inference, and the cutting-edge world of AI agents. Featuring Nader Khalil, founder of Brev (now part of NVIDIA), and Kyle Kranen, Dynamo architect at NVIDIA, the discussion spans the journey from quirky startup stunts (surfboards and shiny GPU cards) to data center-scale inference engines, agent orchestration, and the cultural mantras (“SOL” or “Speed of Light”) that infuse NVIDIA’s distinctive engineering ethos.

Core Themes:

The evolution of developer tooling from Brev to NVIDIA
Scaling inference (Dynamo) to support agent-driven, large-context applications
Internals of NVIDIA’s culture, research philosophy, and approach to developer UX
The role and risks of AI agents in enterprise and engineering settings
Hardware/software/agent co-design, and a peek into where multi-agent systems, context length, and AI infrastructure are heading

2. Brev’s Journey and Developer Experience at NVIDIA

Brev’s Origin Story & Conference Stunts (02:09–06:15)

Brev began as a startup focusing on simplifying GPU access for developers (“one-click deploys for any software on GPU”) — think big, visible GPU icons and SSH simplicity rather than cloud provider drop-down hell.
Marketing stunts like the surfboard booth and foil-pressed GPU cards at NVIDIA’s GTC helped Brev “stand out” and signaled their developer-first ethos.
- Memorable quote:
  "Why are we spending time doing these stunts for GPUs? ...I do think it just shows the level of care throughout Brev and also Dynamo and NVIDIA." — Host (06:17)
- On the booth's palm trees:
  "They actually poked out over the walls. So you could see the Brev booth and no one else, just from very far away." — Kyle (02:10)
- On printing the cards:
  "It's a third-generation San Francisco shop." — Nader (05:18)
The acquisition by NVIDIA preserved Brev’s soul: brev.nvidia.com is now “a front page for GPUs” (08:14–08:21).

Developer Experience Scaling (09:13–11:05)

Brev’s democratization of GPU access aligns with NVIDIA's "widening developer base," from data scientists to total beginners, including Nader’s own family.
- "AI is a big equalizer and you're seeing a more technologically literate society. ...You really understand who your end user is...you have to almost reinvent the practice." — Nader (09:46–10:31)
NVIDIA’s internal culture: Deep technical curiosity is prized; even VPs download and try new tools personally.

3. "Speed of Light" (SOL): NVIDIA’s Cultural Operating Model (13:48–18:38)

SOL: Speed of Light — a cultural shorthand for “what’s the physical (theoretical) limit?” applied to product delivery, experimentation, and hardware.
- Memorable definition:
  “SOL is essentially like: what is the physics? Right. ...let's just understand the physics. What is the theoretical limit to how fast this can go? And then start to tell me why.” — Nader (14:03)
- “SOL is a term at NVIDIA used to instigate a compelling event. ...What is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here.” — Kyle (15:02)
Application: Everything from hardware design to product launches. Stability and ops/maintenance are also weighed in.
Jensen Huang (NVIDIA CEO) and frontline engineers alike use SOL to cut through noise, “create urgency,” and focus on what truly gates progress.

4. NVIDIA Research Culture & Organizational Dynamics (21:03–24:27)

NVIDIA encourages choose-your-own-adventure engineering—engineers “index into passion,” jump teams, email “out of chain,” and organize in email mosh-pits.
- “The mission is the boss. ...honestly for every new initiative, that's what it feels like—a game of pickup basketball.” — Nader (19:14, 21:20)
"Momentum is the only authority." — Nader (23:59)
- If you build something, show progress and get people to use it, support and resources follow.
Jensen: “We're completely happy investing in zero-billion-dollar markets.” — Invention and research are driven by expected future market importance, not short-term ROI.

5. Data Center-Scale Inference & Dynamo (26:32–43:44)

Why Scaling Inference Is Hard

Inference at planetary scale—especially for multi-agent or long-context applications—faces hardware and algorithmic scaling ceilings.
- Key challenge: Scaling up (adding more GPUs to a big model) hits hardware boundaries (e.g., NVLink domain limits).
  - "The maximum NVLink domain for H100 for most DGX H100s is 8 GPUs. Beyond that, you have to use Infiniband, which is still fast, but not as fast as NVLink." — Kyle (29:07–29:59)
Tradeoffs across three axes:
1. Quality (accuracy/completeness of results)
2. Cost (efficiency, $$$)
3. Latency (speed/SLA)
- "When you start this journey of trying to figure out how you want to host a model, you think about three things: what is the model I need to serve, how many times do I need to call it, what does the workflow look like..." — Kyle (33:00–34:26)

Enter: Dynamo (26:38–44:00)

Dynamo: A data center-scale inference engine that optimizes scaling out (rather than up), sits on top of frameworks like vLLM, SGLang, and TensorRT-LLM, and enables efficient inference for large, agentic workloads.
- Modular design—integrates optimizations like disaggregation, KV-cache sharing, and specialized scheduling for prefill and decode workloads.
- "There are tiers of developer base that were added. ...The amount of layers that are added to that developer stack has just exploded because AI has become ubiquitous." — Kyle (10:34–11:05)
- "You actually have to scale out. ...We kind of realized there was a lot of potential optimization that we could do in scaling out and building systems for data center scale inference." — Kyle (28:53)
Disaggregation: splitting “prefill” and “decode” onto different hardware/resources to match their differing compute/memory profiles, managed via a Kubernetes-based scheduler (Grove).

Dynamo’s Optimizations (38:10–45:32)

Prefill (long sequence encoding) = compute-bound, quadratic scaling.
Decode (token generation) = memory-bound, linear scaling.
Disaggregation, machine stratification, dynamic pool balancing: Assign dedicated resources to each phase, adapt pool sizes as workload changes.
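As a rough illustration of the split described above, the sketch below uses a toy cost model. It is not Dynamo's scheduler or API: the names `Request`, `prefill_cost`, `decode_cost`, and `split_pool`, the unitless costs, and the proportional split rule are all assumptions made for this example. Prefill attention work is modeled as growing with the square of the prompt length, per-token decode work as growing linearly with the tokens already in the KV cache, and a fixed GPU pool is divided in proportion to the estimated load of each phase.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # tokens processed during prefill
    output_tokens: int  # tokens generated during decode


def prefill_cost(prompt_tokens: int) -> int:
    """Toy prefill cost: every prompt token attends to every prompt token."""
    return prompt_tokens * prompt_tokens


def decode_cost(prompt_tokens: int, output_tokens: int) -> int:
    """Toy decode cost: each new token reads the KV cache built so far."""
    return sum(prompt_tokens + i for i in range(output_tokens))


def split_pool(requests, total_gpus, prefill_weight=1.0, decode_weight=1.0):
    """Divide GPUs between prefill and decode in proportion to estimated load.

    The weights are hypothetical knobs standing in for the different
    hardware bottlenecks of the two phases (compute versus memory).
    Each phase always keeps at least one GPU.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per phase")
    prefill_load = prefill_weight * sum(
        prefill_cost(r.prompt_tokens) for r in requests
    )
    decode_load = decode_weight * sum(
        decode_cost(r.prompt_tokens, r.output_tokens) for r in requests
    )
    total = prefill_load + decode_load
    if total == 0:
        raise ValueError("no work to schedule")
    prefill_gpus = round(total_gpus * prefill_load / total)
    prefill_gpus = min(max(prefill_gpus, 1), total_gpus - 1)
    return prefill_gpus, total_gpus - prefill_gpus


if __name__ == "__main__":
    long_prompt = [Request(prompt_tokens=8000, output_tokens=100)]
    long_output = [Request(prompt_tokens=100, output_tokens=2000)]
    print(split_pool(long_prompt, total_gpus=8))
    print(split_pool(long_output, total_gpus=8))
```

Under these assumptions, a long prompt with a short answer pulls most of the pool toward prefill, while a short prompt with a long answer pulls it toward decode. Rerunning `split_pool` as the request mix changes is the toy version of adapting pool sizes to the workload.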
"Dynamo...provides a scheduling API for Kubernetes that allows you to actually represent and affect this scheduling on your actual hardware." — Kyle (42:29)

6. Hardware/Model/Context Co-Design and Scaling Context Length (43:44–51:21)

The push for longer context lengths is mostly attention-limited (“quadratic” scaling). Hybrid models (e.g., Kimi, DeepSeek) try to manage context size via architectural tweaks (e.g., attention heads, expert sparsity).
- "Kimi has more experts but fewer attention heads...they did an experiment: attention scales with the number of heads. If you have 64 heads versus 32, you do half the work..." — Kyle (45:55)
Co-design: Hardware and model architectures (and even agent harnesses) are increasingly developed together—“model/hardware/context co-design” (46:45).
“Unhobblers”: Critical architectural breakthroughs (“scientific discoveries”) that unlock orders-of-magnitude gains, e.g., multi-head latent attention and grouped-query attention.
- Leopold Aschenbrenner’s “situational awareness” essay is cited as modeling this (49:01–50:38).
Current “hard limit” is at roughly 1-million-token contexts. Next leaps likely hinge on “unhobblers”.
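The two scaling claims in this section (attention work grows with the number of heads and quadratically with context length) can be checked with a back-of-envelope count. The sketch below assumes standard multi-head attention with a fixed per-head dimension and counts only the score and value matrix multiplies. The function name and the `head_dim` default are illustrative assumptions, not figures from any model discussed in the episode.

```python
def attention_flops(seq_len: int, num_heads: int, head_dim: int = 128) -> int:
    """Approximate FLOPs for the score and value steps of one attention layer.

    Per head, Q @ K^T takes seq_len * seq_len * head_dim multiply-adds, and
    weights @ V takes the same again. Each multiply-add counts as 2 FLOPs.
    Projections, softmax, and causal masking are ignored in this estimate.
    """
    per_head = 2 * (2 * seq_len * seq_len * head_dim)
    return num_heads * per_head


if __name__ == "__main__":
    base = attention_flops(seq_len=4096, num_heads=32)
    print(attention_flops(seq_len=4096, num_heads=64) / base)  # head count doubled
    print(attention_flops(seq_len=8192, num_heads=32) / base)  # context doubled
```

In this model, halving the head count at a fixed head dimension halves the attention term, while doubling the context length quadruples it, which is why long contexts are the harder limit.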

7. Agent Inference at Planetary Scale: Practical Engineering & Security (54:44–67:41)

Agent Infrastructure, Security, and Internal Rollouts

Agents at NVIDIA perform tasks touching files, the internet, and custom code execution. Security principle: never let an agent have all three powers at once.
- "You should really only let an agent do two of those three things. ...If you have access to Internet and your file system, you should know the full scope...Otherwise, malware can get injected." — Nader (58:04, 00:00)
Massive internal adoption (e.g., using OpenAI's Codex, Claude Code); tools spread rapidly through “mosh pit” email culture.
Security reviews are robust, balancing progressive adoption and enterprise caution.
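The two-of-three rule described above can be expressed as a small policy check. This is a minimal sketch under stated assumptions: the `Capability` enum, `MAX_CAPABILITIES`, and `validate_grant` are hypothetical names invented here, not part of any NVIDIA tool or agent framework mentioned in the episode.

```python
from enum import Enum


class Capability(Enum):
    FILES = "files"
    INTERNET = "internet"
    CODE_EXECUTION = "code_execution"


# Hypothetical policy constant encoding the "two of three" rule.
MAX_CAPABILITIES = 2


def validate_grant(requested: set) -> set:
    """Return the requested capabilities, or raise if all three are combined."""
    if len(requested) > MAX_CAPABILITIES:
        names = ", ".join(sorted(c.value for c in requested))
        raise PermissionError(f"refusing to grant {names} together")
    return set(requested)


if __name__ == "__main__":
    # Files plus code execution, with no Internet access, is allowed.
    print(validate_grant({Capability.FILES, Capability.CODE_EXECUTION}))
    try:
        validate_grant(set(Capability))
    except PermissionError as err:
        print(err)
```

A check like this only records the intent. The enforcement points mentioned in the episode (for example, running the agent in an isolated VM off the corporate network) still have to back it up.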

Agents, CLI, and the OS Shell

CLI “wrappers” make agent integrations manageable, discoverable, and secure; possibility of open-sourcing a wider “open CLI foundation” for core business tools.
- "Everything needs some CLI tool. ...Computing began with a terminal. ...Now LLMs are navigating user interfaces, but ironically we're not empathetic to the machine anymore. Just give the LLM access to the shell." — Nader (67:13–67:41)

8. Multi-Agent Systems, Subagents, and “System as Model” (73:24–76:07)

Architectural vision: future AI systems will look like “systems as models” — many models (or agents) collaborating under the hood, even as the API remains a simple single model call on top.
- "Instead of having a single model, you have a system of models and components working together to emulate the black box model." — Kyle (74:41)
Dynamo’s roadmap includes supporting multi-agent orchestration and model routing (local and foundation models, context-specific selection).
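The “system as model” idea can be sketched as a router hidden behind a single call. Everything here is an assumption for illustration: `SystemAsModel`, `length_router`, and the stand-in lambda "models" are hypothetical and do not reflect Dynamo's roadmap or API. The point is only that the caller sees one model-like function while several components sit behind it.

```python
from typing import Callable, Dict

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


class SystemAsModel:
    """Expose several components behind one model-like call."""

    def __init__(self, route: Callable[[str], str], models: Dict[str, Model]):
        self._route = route
        self._models = models

    def __call__(self, prompt: str) -> str:
        # Pick a component per request, then delegate to it.
        name = self._route(prompt)
        return self._models[name](prompt)


def length_router(prompt: str) -> str:
    """Hypothetical routing rule: long prompts go to the larger model."""
    return "large" if len(prompt.split()) > 50 else "small"


# Stand-in models; real ones would call local or hosted inference endpoints.
system = SystemAsModel(
    length_router,
    {
        "small": lambda p: f"[small] {p}",
        "large": lambda p: f"[large] {p}",
    },
)

if __name__ == "__main__":
    print(system("hello"))
```

A production router would more likely use task type, context needs, or cost, but the calling code would not change.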

9. The Year of the Subagent & Ongoing Scaling Challenges (76:19–79:36)

Subagent trend: main agents kicking off subordinate “tool agents,” each specialized for different tasks/context windows.
Ongoing tension between scalability (long-running agents), efficiency, and cost:
- "There’s insatiable demand for tokens. …Every improvement just makes demand even higher." — Nader (76:11)
Varied agent autonomy: practical agents today commonly run for 20–45 minutes; longer (“all-day”, “all-week”) agents will need continued architectural and scientific breakthroughs.

10. San Francisco, Community, & AI Engineer Culture (79:36–82:57)

A segment of the show reminisces about the unique energy of San Francisco’s AI builder community (“the city believes in you more than you do,” cheap rent, collaborative neighborliness).
- "Imagine some random person DMs you feedback on this blog post, and you do a Zoom call. …People are trained to write a certain way in school, and never see the broader world." — Host (82:48–82:55)

11. Notable Quotes & Moments

"SOL is a term at NVIDIA used to instigate a compelling event...what is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here." — Kyle (15:02)
"Momentum is the only authority." — Nader (23:59)
"Amazon Ads...talked about using Dynamo for generative recommendation, which was super weirdly cathartic for me...I've supplanted what I was working on." — Kyle (26:10)
"I feel a little embarrassed for being proud of my SVG function earlier." — Nader (43:01)
"Agents can do three things: access files, access the Internet, and write custom code… you should only let an agent do two of those." — Nader (58:04)
"The model/hardware/context co-design thing is super interesting. It's my secret side passion." — Kyle (43:55)

12. Key Timestamps

[02:09–06:15] – Brev’s conference stunts, startup beginnings
[08:13–09:46] – NVIDIA acquisition, developer experience, Launchables
[13:48–18:38] – SOL (“Speed of Light”) as an NVIDIA culture code
[26:32–34:26] – Technical breakdown: Dynamo, inference scaling, cost-quality-latency axis
[38:10–45:32] – Dynamo optimizations: disaggregation, prefill vs. decode, scheduling
[54:44–61:13] – Agent adoption, CLI interfaces, security practices
[73:24–76:07] – Subagents, “system as model,” Dynamo roadmap for multi-agent orchestration
[79:36–82:57] – Builder culture in San Francisco and the AI engineer movement

13. Final Thoughts

This episode offers a rare inside look at NVIDIA’s AI transformation from the developer’s perspective, from the ground up. Hear how deep technical culture, quirky beginnings, “zero-billion-dollar” bets, and a relentless focus on UX and infrastructure have made NVIDIA a crucible for planetary-scale AI engineering. Dynamo and Brev are more than tools: they represent a philosophy of making complexity simple (and electrifyingly fast), the agentic future of software, and the beating, mission-driven heart of Silicon Valley’s AI innovators.

For more technical detail, check out NVIDIA’s Dynamo and Brev documentation, and the upcoming GTC sessions referenced in the episode.


Transcript

[00:00]
Nader
Agents can do three things. They can access your files, they can access the Internet, and then now they can write custom code and execute it. You really only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want Internet access because that's one. It's a vulnerability. Right. If you have access to Internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise, malware can get injected or something that can happen. And so that's a lot of what we've been thinking about is like, you know, how do we both enable this because it's clearly the future. But then also, you know, what are these enforcement points that we can start to, like, protect?
[00:39]
Host
All right, welcome to the Latent Space podcast in the Chroma Studio. Welcome to all the guests here. We're back with our guest host, Vivu. Welcome. Good to have you back. And our friends Nader and Kyle from Nvidia.
[00:50]
Nader
Welcome.
[00:50]
Kyle
Yeah, thanks for having us.
[00:51]
Nader
Yeah, thank you.
[00:52]
Host
Actually, I don't even know your titles. I know you're like architect, something of Dynamo.
[00:58]
Kyle
Yeah, I'm one of the engineering leaders and architects of Dynamo.
[01:02]
Host
And you're director of something. Developers. Yeah, you're the developers. Developers. Developers guy at Nvidia.
[01:08]
Nader
Open source agent marketing, Brev and dev tools and stuff in the focus.
[01:12]
Host
And we're kind of recording this ahead of Nvidia GTC, which is coming to town again, taking over town, which we'll all be at, and we'll talk a little bit about your sessions and stuff.
[01:23]
Nader
Yeah, we're super excited for it.
[01:24]
Host
One of my favorite memories of Nader, you always do, like, marketing stunts. And while you were at Brev, you had this surfboard that you went down to GTC with, and Nvidia apparently liked it so much that they bought you. What was that like?
[01:41]
Nader
Yeah, yeah. Our logo was a shaka. We were always just kind of like, trying to keep true to who we were. I think so much of startups, you're trying to pretend that you're a bigger, more mature company than you are. And it was actually Evan Conrad, SF Compute, who was just like, you guys are really amazing. Yeah, he was just like, guys, you're two dudes in a room. Why are you pretending that you're not? And so then we were like, okay, let's make the logo a shaka. We brought surfboards to our booth at GTC and the energy was great. Some palm trees too.
[02:10]
Kyle
They actually poked out over, like, the walls. So you could see the Brev booth and no one else, just from very far away.
[02:17]
Host
Oh, so you remember it back then?
[02:18]
Kyle
I remember it pre-acquisition.
[02:20]
Nader
I was like, oh, those guys are cool, dude. That makes sense because we signed up really last minute. And so we had the last booth. It was all the way in the corner. And so I was, I was worried that no one was going to come. So that's why we had like the palm trees. We really came in with the surfboards. We even had one of our investors bring her dog and then she was just like walking the dog around to try to like bring energy towards our booth. Yeah, Steph, Yeah, yeah, she's the best.
[02:42]
Host
You know, as a conference organizer, I love that.
[02:44]
Kyle
Right?
[02:44]
Host
Like, it's like everyone who sponsors a conference comes, does their booth. They're like, we are changing the future of AI or something. Some generic bullshit. And like, no, like actually try to stand out, make it fun. Right. And people still remember after three years.
[02:56]
Nader
Yeah, yeah. You know what's so funny? I'll give you this clip if you want, if you want to add it in. But my wife was at the time, fiance, she was in medical school and she came to help us because it was like a big moment for us. And so we bought this Cricut. It's like a Vine, like a vinyl printer. Because like, how else are we going to label the surfboard? So we got a surfboard, luckily was able to purchase that on the company card. We got a Cricut. And it was just like fine tuning for enterprises or something like that that we put on the. On the surfboard. And it's 1am the day before we go to GTC. She's helping me put these like vinyl stickers on. And she goes, you son of. She's like, if you pull this off, you son of a bitch. And pretty much after the acquisition, I stitched that within the news of the acquisition. I sent it to our family group chat.
[03:38]
Host
Well, she made a good choice there. Was that. Basically the origin story for Launchables is that maybe we should explain what Brev is.
[03:45]
Nader
Yeah, Brev is just a developer tool that makes it really easy to get a GPU. So we connect a bunch of different GPU sources. So the basics of it is how quickly can we SSH you into a GPU? And whenever we would talk to users, they wanted a GPU, they wanted an A100. And if you go to any cloud provisioning page, usually it's like three pages of forms, or in the form somewhere there's a dropdown. And in the dropdown, there's some weird code that you know to translate to an A100. And I remember just thinking like every time someone says they want an A100, like the piece of text that they're telling me that they want is like stuffed away in the corner. And so we're like, what if the biggest piece of text was what the user's asking for? And so when you go to Brev, it's just big GPU chips with the
[04:22]
Host
type of beautiful animations that you worked on Pre. Like pre you can do like now you can just prompt it. But back in the day, artisanal code,
[04:31]
Nader
I was actually really proud of that because it was. I made it in Figma.
[04:35]
Host
Yeah.
[04:35]
Nader
And then I found I was like really struggling to figure out how to turn it from like Figma to React. So what it actually is is just an SVG and I have all the styles. And so when you change the chip, whether it's like active or not, it changes the SVG code. And that somehow like renders, like looks like it's animating, but we just had the transition slow. But it's just like a JavaScript function to change the like underlying SVG. That was how I ended up like figuring out how to move it from, from Figma. But yeah, that's artisan.
[05:01]
Kyle
Speaking of marketing stunts though, he actually used those SVGs or kind of used those SVGs to make these cards. Oh yeah, like a GPU gift card that he handed out everywhere. That was actually my first impression of that.
[05:15]
Host
Yeah, Yeah, I think I still have one of them.
[05:17]
Kyle
They look great.
[05:18]
Nader
Yeah, I have a ton of them still actually in our garage, but just they don't have labels. We should honestly like bring, bring them back. But I found this old printing press here actually just around the corner on Venice. And it's a third generation San Francisco shop. And so I come in, an excited startup founder trying to like. And they just have this crazy old machinery and I'm in awe because the whole building is so physical. Like you're seeing these machines, they have like pedals to like move these saws and whatever. I don't know what this machinery is, but I saw all three generations. Like there's like the grandpa, the father and the son. The son was like around my age.
[05:48]
Host
It's like a holy, holy trinity.
[05:49]
Nader
Yeah. It's funny because we. So I just took the same SVG and we just like printed it and it's foil printing. So they make a mold that's like an inverse of like the A100. And then they put the foil on it and then they press it into the paper. And I remember once we got them, he was like, hey, don't forget about us. You know, I guess like early Apple and Cisco's first business cards were all made there. And so he was like, yeah, we get like the startup businesses, but then as they mature they kind of go somewhere else. And so I actually, I think we were talking with marketing about like using them.
[06:16]
Kyle
You should go back and make some cards.
[06:18]
Host
Yeah, yeah, yeah, yeah. You know, I remember, you know, as a very, very small Brev investor, I was like, what? Why are we spending time like doing these like stunts for GPUs? Like, you know, I think as a typical cloud hardware person, you go into AWS, you pick like T5XXL, whatever, and from a list and you look at the specs, like, why animate this GPU? And I do think it just shows the level of care that goes throughout Brev and also Dynamo and Nvidia.
[06:44]
Nader
I think that's the thing that struck me most when we first came in was the amount of passion that everyone has. I think you talked to Kyle, you talked to every VP that I've met at Nvidia goes so close to the metal. I remember it was almost a year ago and like my VP asked me, he's like, hey, what's Cursor? And like are you using it? And if so, why? And I'm just like surprised at this. And he downloaded Cursor and he was asking me to help him like use it and I thought that was. Or like just show him what, you know, why we were using it. And so the amount of care that I think everyone has and the appreciation, passion and appreciation for the moment. Right. This is a very unique time. So it's really cool to see everyone really like appreciate that.
[07:19]
Host
Yeah. One thing I wanted to do before we move over to sort of like research topics and the stuff that Kyle's working on is just tell the story of the acquisition. Right. Like not many people have been through an acquisition with Nvidia. What's it like? Yeah, just anything you'd like to say?
[07:35]
Nader
It's a crazy experience. I think the thing that was the most exciting for us was our goal was just to make it easier for developers. We wanted to find access to GPUs, make it easier to do that. And then actually your question about Launchables. So Launchables was just make one click deploys for any software on top of the GPU. And so what we really liked about Nvidia was that it felt like we just got a lot more resources to do all of that. I think, you know, Nvidia's goal is to make things as easy for developers as possible. So there was a really nice like synergy there. I think, you know, when it comes to like an acquisition, I think the amount that the soul of the products align I think is going to be, is going to speak to the success of the acquisition.
[08:14]
Host
Yeah.
[08:14]
Nader
So in many ways feels like we're home. This is a really great outcome for us. Like we, you know, I love brev.nvidia.com like you should, you should use it.
[08:22]
Kyle
It's a front page for GPUs. Yeah, you want GPUs, you go there
[08:24]
Host
and it's like internally is growing very quickly. I remember you said some stats.
[08:28]
Nader
Yeah. Yeah, yeah, yeah, it's. I wish I had the exact numbers. But like internally, externally it's been growing really quickly. We've been working with a bunch of partners with a bunch of different customers and ISVs. If you have a solution that you want someone that runs on a GPU and you want people to use it quickly, we can bundle it up in a Launchable and make it a one click run. If you're doing things and you want just like a sandbox or something to run on. Right. Like OpenClaw, huge moment, super exciting and we'll talk into it more. But you know, internally people want to run this and we know we have to be really careful from the security implications. Do we let this run on the corporate network? Security's guidance was hey, run this on Brev. It's in, you know, it's a VM, it's sitting in the cloud, it's off the corporate network, it's isolated. And so that's been our stance internally and externally about how to even run something like OpenClaw while we figure out how to run these things securely.
[09:14]
Host
But yeah, I think there's also like you almost like we're the right team at the right time when Nvidia is starting to invest a lot more in developer experience or whatever you call UX or I don't know what you call it. Like software. Like obviously Nvidia is always invested in software but like there's like this is like a different audience, it's a wider developer base.
[09:34]
Kyle
Yeah, right.
[09:35]
Nader
Yeah, yeah. You know it's funny, it's like it's
[09:37]
Host
not so like what is it called internally? What is this that people should be aware that is going on there?
[09:41]
Kyle
Like developer yeah, yeah.
[09:43]
Host
It's called developer experience. Or is there like a broader strategy here?
[09:46]
Nader
Nvidia. Nvidia always wants to make a good developer experience. The thing is, a lot of the technology is just really complicated. Like, it's not. It's. You know, I think the thing that's been really growing, or AI is growing is having a huge moment. Not because, like, let's say data scientists in 2018 were quiet then and are much louder now. The pie is there's a whole bunch of new audiences. My mom's wondering what she's doing. My sister's taught herself how to code. I actually think, just generally AI is a big equalizer and you're seeing a more technologically literate society. I guess everyone's learning how to code. There isn't really an excuse for that. And so building a good UX means that you really understand who your end user is. And when your end user becomes such a wide variety of people, then you have to almost reinvent the practice and
[10:31]
Kyle
actually build more developer ux.
[10:34]
Host
Right.
[10:34]
Kyle
Because there are tiers of developer base that were added. You know, the hackers that are building on top of OpenClaw, right. For example, have never used GPU. They don't know what CUDA is. They just want to run something. You need new UX that is not just, hey, how do you program something in CUDA and run it? And then we built, like when Deep Learning was getting big, we built Torch. But recently the amount of layers that are added to that developer stack has just exploded because AI has become ubiquitous. Everyone's using it in different ways.
[11:05]
Nader
It's moving fast in every direction, vertical, horizontal.
[11:09]
Dan
You even take it down to hardware like the DGX Spark. You know, it's basically the same system as just throwing it up on big GPU clusters.
[11:15]
Kyle
Yeah, yeah, it's a Blackwell.
[11:19]
Host
Yeah. We saw the preview at last year's GTC and that was one of the better performing videos of our Nvidia coverage so far.
[11:24]
Dan
Awesome.
[11:25]
Host
This will be the.
[11:27]
Nader
That was actually.
[11:27]
Kyle
Fingers crossed. Yeah.
[11:28]
Nader
Even when Grace Blackwell or when DGX Spark was first coming out, getting to be involved in that from the beginning of the developer experience and it just
[11:36]
Host
comes back, you were involved.
[11:37]
Nader
Yeah, yeah, yeah. I mean, from. It was just like I got an email. We just got thrown into the loop and suddenly. Yeah, it was actually really funny because I'm still pretty fresh from the acquisition and I'm getting an email from a bunch of the engineering VPs about like the new hardware GPU chip or not chip but just GPU system that we're putting out. And I'm like, okay, cool. Nader is now involved with this for the UX. I'm like, what am I going to do here? So I remember the first meeting. I was just like, kind of quiet as I was hearing engineering VPs talk about what this box could be, what it could do, how we should use it. And I remember one of the first ideas that people were ideating was like, oh, the first thing that it was like, I think a quote was like, the first thing someone's going to want to do with this is get two of them and run a Kubernetes cluster on top of them. And I was like, oh, I think I know why I'm here. I was like, the first thing we're doing is easy SSH into the machine and then just kind of like scoping it down. Of like, once you can do that. The person who wants to run a Kubernetes cluster on 2 Sparks has a higher propensity for pain than someone who buys it and wants to run OpenClaw right now. Right. If you can make sure that that's as effortless as possible, then the rest becomes easy. So there's a tool called Nvidia Sync. It just makes the SSH connection really simple. So if you think about it, if you have a Mac or a PC or whatever, if you have a laptop and you buy this GPU and you want to use it, you should be able to use it like it's a GPU in the cloud.
[12:56]
Host
Right.
[12:57]
Nader
But there's all this friction of, like, how do you actually get into that? That's part of Brev's value proposition, is just there's a CLI that wraps SSH and makes it simple. And so our goal is just get you into that machine really easily. And one thing we just launched at CES, it's still in, like, early access. We're ironing out some kinks, but it should be ready by GTC. You can register your Spark on Brev. And so now if you like remote managed local thing.
[13:20]
Host
Yeah, because Brev can already manage other clouds anyway.
[13:24]
Nader
Right.
[13:24]
Dan
And you use the Spark on Brev as well, right?
[13:26]
Nader
Yeah, yeah, exactly. So you set it up at home, you can run a command on it and then it gets. It's. Essentially, it'll appear in your Brev account and then you can take your laptop to a Starbucks or to a cafe and you can continue to use your Spark just like any other cloud node on Brev.
[13:40]
Host
Yeah, yeah.
[13:40]
Nader
It's just like a pre provisioned. So your little data center in your home. Yeah, exactly.
[13:45]
Host
Yeah, yeah.
[13:45]
Dan
Tiny little data center, tiny little size of your phone.
[13:49]
Host
One more thing before we move on to Kyle. Just have so many Jensen stories and I just love mining Jensen stories. My favorite so far is SOL. What is SOL?
[13:58]
Nader
SOL is actually I think of all the lessons I've learned, that one's definitely my favorite.
[14:03]
Kyle
It'll always stick with you.
[14:03]
Nader
Yeah, yeah. You know, when you're a startup, everything's existential, right. Like we've run out of money. We were like on the risk of losing payroll. We've had to contract our team because we ran out of. And so like because of that you're really always forcing yourself to like understand the root cause of everything. If you get a date, if you get a timeline, you know exactly why that date or timeline is there. You're. You're pushing every boundary. And like you're not just say you're not just accepting like a. No just because. And so as you start to introduce more layers, as you start to become a much larger organization, SOL is essentially like, what is the physics? Right. The speed of light moves at a certain speed. So if light's moving some slower, then you know something's in the way. So before trying to layer reality back in of like, why can't this be delivered at some date? Let's just understand the physics. What is the theoretical limit to how fast this can go? And then start to tell me why. Because otherwise people will start telling you why something can't be done. But actually I think any great leader's goal is just to create urgency.
[15:00]
Kyle
There's an integrity, create compelling events.
[15:01]
Nader
Right.
[15:03]
Kyle
SOL is a term at Nvidia that's used to instigate a compelling event. You say this is done. How do we get there? What is the minimum, as much as necessary, as little as possible thing that it takes for us to get exactly here. And it helps you just break through a bunch of noise.
[15:19]
Host
Yeah.
[15:19]
Kyle
Instantly.
[15:20]
Host
One thing I'm unclear about is can only Jensen use the SOL card? Like get the bullshit out. Because obviously it's Jensen. But can someone else be like.
[15:29]
Kyle
No, frontline engineers use it.
[15:32]
Natter
I think it's not so much about get the bullshit out. It's like, give me the root understanding, right? If you tell me something takes three weeks. Yeah. The first principles, it's like, what's the. Is it three weeks? What is the actual. Yeah. What's the actual limit of why this is going to take three weeks? If you're going to. If you. If let's say you wanted to buy a new computer and someone told you it's going to be here in five days, what's the SOL? Well, like, the SOL is like, I could walk into a Best Buy and pick it up for you. Right? So then anything that's like, beyond that is. And is that practical? Is that how we're going to, you know, let's say, give everyone in the company a laptop? Like, obviously not. So then, like, that's the SOL. And then it's like, okay, well, if we have to get more than 10, suddenly there might be some. Right. And so now we can kind of piece the reality back.
[16:09]
Host
So this is the Paul Graham "do things that don't scale." And this is also what people would now call "be high agency."
[16:16]
Kyle
It's actually really interesting because there's a second hardware angle to SOL that doesn't come up for all the Org. So SOL is used culturally at Nvidia for everything.
[16:25]
Host
I'm also mining for. I think that can be annoying sometimes when someone keeps going SOL and you're like, guys, we have to be stable. We have to fucking plan.
[16:35]
Natter
It's an interesting balance. Yeah, I encountered that actually just with Alec, because we have a new conference, so we need to launch. We have, we have goals of what we want to launch by the conference. And like, yeah, at the end of the day,
[16:44]
Host
Is this GTC?
[16:46]
Natter
Well, this is like. So we, I mean, we did it for CES, we did it for GTC DC before that, we're doing it for GTC San Jose. So I mean, like every, you know, we have a new moment and we want to launch something and we want to do so at SOL. And that does mean that some, there's some level of prioritization that needs to happen. And so it is difficult. Right. I think you have to be careful with what you're pushing. You know, stability is important and that should be factored into SOL. SOL isn't just like, like, build everything and let it break. You know, that's part of the conversation. So as you're layering in all the details, one of them might be, hey, we could build this, but then it's not going to be stable for XYZ reasons. And so that was like one of our conversations for CES was, you know, hey, like, we, we can get this into early access, registering your Spark with Brev. But there are a lot of things that we need to do in order to feel really comfortable from a security perspective. Right? There's a lot of networking involved before we deliver that to users. So it's like, okay, let's get this to a point where we can at least let people experiment with it. We had it in a booth, we had it in Jensen's keynote, and then let's go iron out all the networking kinks. And that's not easy. And so that can come later. And so that was the way that we layered that back in.
[17:50]
Kyle
But it's not really about saying you don't have to do the maintenance or operational work. It's more about saying, it kind of highlights how progress is incremental.
[18:02]
Natter
Right.
[18:02]
Kyle
What is the minimum thing that we can get to? And then there's SOL for every component after that. But there's the SOL to get you to the starting line. And that's usually how it's asked. On the other side, SOL came out of hardware at Nvidia. So SOL is literally, if we ran the accelerator or the GPU at basically full speed with no other constraints, how fast would we be able to make a program go.
[18:26]
Host
Yeah.
[18:26]
Kyle
Right.
[18:27]
Host
So in training that, like, you know, then you work back to like some percentage of like MFU, for example.
[18:33]
Kyle
Yeah, that's. That's a great example. So like there's an. There's an SOL MFU and then there's like, you know, what's practically achievable.
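The hardware sense of SOL and MFU described here can be written down as simple arithmetic. The following is a minimal sketch, assuming MFU (model FLOPs utilization) means achieved throughput divided by peak throughput; the function names and all numbers are hypothetical and chosen only for illustration, not taken from any data sheet.

```python
def sol_step_time(flops_per_step: float, peak_flops_per_s: float) -> float:
    """Speed-of-light time for one step: work divided by peak throughput,
    ignoring every other constraint (memory, network, scheduling)."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return flops_per_step / peak_flops_per_s


def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: fraction of peak throughput actually achieved."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical numbers for illustration only.
peak = 1.0e15           # assumed peak throughput: 1 PFLOP/s
work = 4.0e15           # assumed FLOPs needed per training step
measured_step_s = 10.0  # assumed measured wall-clock time per step

sol_s = sol_step_time(work, peak)                # the SOL bound for one step
utilization = mfu(work / measured_step_s, peak)  # what was practically achieved

print(f"SOL step time: {sol_s:.1f} s, measured: {measured_step_s:.1f} s")
print(f"MFU: {utilization:.0%}")
```

The gap between the SOL step time and the measured step time is where the "layer reality back in" conversation starts.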
[18:38]
Host
Cool. Shall we move on to sort of Kyle's side? Kyle, you're coming more from the data science world. And I mean, I always. When. Whenever I meet someone who's done work in tabular stuff, graph neural networks, time series. These are basically when I go to NeurIPS, I go to ICML, I walk the back halls. There's always like a small group of graph people, small group of tabular people, and there's no one there. And it's very. You know what I mean? It's important, interesting work if you care about solving the problems that they solve. But everyone else is just LLMs all the time.
[19:13]
Kyle
Yeah, it's like the black hole, right? Has the event horizon reached this yet in NeurIPS?
[19:19]
Host
But those are transformers too, and those are also interesting things. Anyway, I just wanted to spend a little bit of time on that background before we go into Dynamo proper.
[19:30]
Kyle
Yeah, sure. I took a different path to Nvidia than Natter. I joined six years ago, seven if you count when I was an intern. So I joined Nvidia right out of college and the first thing I jumped into was not what I'd done during internship. Which was some stuff for autonomous vehicles, like heavyweight object detection. I jumped into something, I'm like recommenders. This is popular.
[19:51]
Host
Yeah. You did RecSys.
[19:52]
Kyle
Yeah, RecSys. That was the tabular data at the time, right? You have tables of audience qualities and item qualities and you're trying to figure out which member of the audience matches which item or more practically which item matches which member of the audience. And at the time, really it was like we were trying to enable recommenders, which had historically been a little bit of a CPU-based workflow, into something that ran really well on GPUs, and it's since been done. There are a bunch of libraries for RecSys that run on GPUs. The common models like the Deep Learning Recommendation Model, which came out of Meta, and the Wide & Deep model, which was used or was released by Google, were very accelerated by GPUs, using the fast HBM on the chips especially to do vector lookups. But it was very interesting at the time and super, super relevant because we were starting to get this explosion of feeds and things that required recommenders to just actively be on all the time. And sort of transitioned that a little bit towards graph neural networks when I discovered them, because I was like, okay, you can actually use graph neural networks to represent relationships between people, items, concepts. And that interested me. So I jumped into that at Nvidia and got really involved for like 2ish years.
[21:04]
Host
And something I learned from Bryan Catanzaro is that you can just kind of choose your own path in Nvidia.
[21:09]
Kyle
Oh my God.
[21:09]
Natter
Yeah.
[21:10]
Host
Which is not a normal big corp thing. Like you have a lane, you stay in your lane.
[21:14]
Natter
I think probably the reason why I enjoy being in a big company, coming from a startup guy.
[21:20]
Host
Yeah, the mission is the boss.
[21:21]
Natter
Yeah, it feels like a big game of pickup basketball. Like, you know, if you play one, if you want to play basketball, you just go up to the court and you're like, hey, we're going to play this game and we need three. And you just like find your three. That's honestly for every new initiative. That's what it feels like.
[21:32]
Dan
Yeah, it also like shows, right? Like Nvidia is just releasing state of the art stuff in every domain. Like okay, you expect foundation models with Nemotron. Voice just randomly, top-tier Parakeet just comes out. Another one the Nvidia voice team has always been producing. There's always just every other domain of paper that comes out, data set that comes out. And it's like, I mean it also stems back to what Nvidia has to do. Right. You have to make chips years before they're actually produced. Right. So you need to know. You need to really forecast.
[22:00]
Kyle
Design process starts like three to five years before the chip gets to the market.
[22:05]
Dan
Yeah. I'm curious more about what that's like. Right? So like, you have specialist teams. Is it just like, you know, people find an interest, you go in, you go deep on whatever and that kind of feeds back into, you know. Okay. We expect predictions like the internals at Nvidia must be crazy. Right. You know, you must not. Even without selling the people, you have your own predictions of where things are going and they're very based, very grounded, right?
[22:29]
Kyle
Yeah, it's really interesting. So there's like two things I think that Nvidia does which are quite interesting. One is we really index into passion. There's a big sort of organizational top-down push to ensure that people are working on the things that they're passionate about. So if someone proposes something that's interesting, many times they can just email someone way up the chain that they would find this relevant and say, hey, can I go work on this?
[22:52]
Natter
That's actually. I worked at a big company for a couple of years before starting on my startup journey and it felt very weird if you were to email out of chain, if that makes sense. The emails at Nvidia are like mosh pits. It's just like 60 people just whatever.
[23:07]
Host
And like, messy, like reply-all?
[23:09]
Natter
Oh, it gets. It's insane. It's insane.
[23:11]
Kyle
It does help, you know, manage the context.
[23:14]
Natter
But that's actually like I've actually. So this is a weird thing where I used to be like, why would we send emails? We have Slack. I am the entire. I'm the exact opposite. I feel so bad for anyone who's like messaging me on Slack because I'm so unresponsive.
[23:25]
Host
You're email maxing.
[23:26]
Natter
I'm email maxing.
[23:27]
Kyle
Email is a different.
[23:27]
Host
Email is perfect because we can't work together. I'm stuck.
[23:31]
Natter
You know, it's great because important threads get bumped back up, right? Yeah. And so Slack doesn't do that. So I just have like this casino going off on the right or on the left and like, I don't know which thread was from where or what. But like the threads get. And then also just like the subject. So you can have like working threads. I think what's difficult is like when you're small, if it's not 40,000 people, I think Slack will work fine. But there's. I don't know what the inflection point is there is going to be a point where that becomes really messy and you'll actually prefer having email because you can have working threads. You can CC more than nine people in a thread.
[23:59]
Kyle
You can fork stuff.
[24:00]
Natter
You can fork stuff, which is super nice. And just like. Yeah. And so. But that is part of where you can propose a plan. You can also just like start. Honestly, momentum is the only authority, right? So like, if you can just start to make a little bit of progress and show someone something and then they can try it, that's I think what's been, you know, I think the most effective way to push anything forward. And that's both at Nvidia and I think just generally.
[24:20]
Host
Yeah.
[24:20]
Kyle
There's the other concept that like, is explored a lot at Nvidia, which is this idea of a $0 billion business. Like market creation is a big thing
[24:28]
Host
at Nvidia, or you want to go and start a $0 billion business?
[24:32]
Kyle
Jensen says we're completely happy investing in $0 billion markets. We don't care if this creates revenue. It's important for us to know about this market. We think it will be important in the future. It can be $0 billion for a while. I'm probably mangling his words here, but I'll give an example. Nvidia has been working on autonomous driving for a long time.
[24:52]
Host
Like an Nvidia car.
[24:53]
Dan
No, they've used the Mercedes. Right. They're around the HQ and I think it finally just got licensed out. Now they're starting to be used. Qu.
[25:00]
Kyle
Yeah, yeah.
[25:01]
Dan
For 10 years you've been seeing Mercedes with Nvidia logos.
[25:06]
Kyle
If you're in like the South Santa Clara, it's from South. Yeah. So zero billion dollar markets are. Are a thing like, you know, Jensen,
[25:17]
Host
I mean, okay, look, cars are not a zero billion dollar market, but yeah,
[25:20]
Natter
that's a bad example. I think, I think he's messaging zero today, but. Or even like internally. Right? Like, like it's like an org doesn't have to ruthlessly find revenue very quickly to justify their existence. Right. The important research, a lot of the important technology being developed, that's kind of
[25:36]
Kyle
where research is very ideologically free at Nvidia, like they can pursue things that they were you research officially, I was never in research. Officially I was always in engineering. I'm in an org called Deep Learning Algorithms, which is basically just how do we make things that are relevant to deep learning go fast.
[25:50]
Host
That sounds freaking cool.
[25:51]
Dan
And I think a lot of that is underappreciated. Right? Like time series. This week Google put out TimesFM, a new time series paper. RecSys: Semantic IDs started applying Transformers, LLMs, to RecSys. And when you think of the scale of companies deploying these. Right. Amazon recommendations, Google web search, it's huge scale and you want fast.
[26:11]
Kyle
Yeah. Actually there's a fun moment that brought me full circle. Amazon ads recently gave a talk where they talked about using Dynamo for generative recommendation, which was super weirdly cathartic for me. I'm like, oh my God, I've supplanted what I was working on. You're using LLMs now to do what I was doing five years ago.
[26:32]
Host
Yeah. Amazing. Let's go right into Dynamo, maybe introduce to the top down and yeah, I
[26:39]
Kyle
think at this point a lot of people are familiar with the term of inference. Funnily enough, I went from inference being a really niche topic to being something that's discussed on normal people's Twitter feeds.
[26:49]
Natter
It's on billboards here.
[26:50]
Kyle
Very strange driving, seeing just an inference ad on 101. Inference at scale is becoming a lot more important. We have these moments like OpenClaw where you have these agents that take lots and lots of tokens but produce incredible results. There are many different aspects of test time scaling so that you can use more inference to generate a better result than if you were to use a short amount of inference. There's reasoning, there's requerying, there's adding agency to the model, allowing it to call tools and use skills. Dynamo sort of came about at Nvidia because myself and a couple others were sort of talking about these concepts that you have inference engines like vLLM, SGLang, TensorRT-LLM and they have one single copy, they sort of think about things as one single copy, like one replica, one version of the model. But when you're actually serving things at scale, you can't just scale up that replica because you end up with performance problems. There's a scaling limit to scaling up replicas. So you actually have to scale out, to use maybe some Kubernetes terminology. We kind of realized that there was a lot of potential optimization that we could do in scaling out and building systems for data center scale inference. So Dynamo is this data center scale inference engine that sits on top of the frameworks like vLLM, SGLang, and TensorRT-LLM and just makes things go faster because you can leverage the economy of scale. The fact that you have KV cache, which we can define a little bit later, in all of these machines, that is unique, and you want to figure out, like, the ways to maximize your cache hits. Or you want to employ new techniques in inference, like disaggregation, which Dynamo introduced to the world in March. Not introduced. It was an academic talk beforehand. But we're one of the first frameworks to start supporting it. And we want to combine all these techniques into sort of a modular framework that allows you to accelerate your inference at scale.
[28:47]
Natter
By the way, Kyle and I became friends on my first day at Nvidia, and I always loved it. Because he always teaches me new things.
[28:53]
Host
By the way, this is why I wanted to put two of you together. I was like, yeah, this is going to be good.
[28:57]
Kyle
It's very different. We've talked to each other a bunch.
[29:00]
Host
Actually.
[29:00]
Kyle
You asked, why can't we scale up?
[29:02]
Natter
Yeah. You said model replicas.
[29:04]
Kyle
Yeah. So scale up means assigning more.
[29:06]
Host
Heavier.
[29:07]
Kyle
Yeah, heavier. Like making things heavier. Adding more GPUs, adding more CPUs. Scale out is just like having a barrier, saying, I'm going to duplicate my representation of the model or representation of this microservice or something, and I'm going to replicate it many times to handle the load. And the reason that you can't scale up past some points is there are sort of hardware bounds and algorithmic bounds on that type of scaling. So I'll give you a good example that's very trivial. Let's say you're on an H100. The maximum NVLink domain for H100, for most DGX H100s, is 8 GPUs. Right. So if you scaled up past that, you're going to have to figure out ways to handle the fact that now for the GPUs to communicate, you have to do it over InfiniBand, which is still very fast, but is not as fast as NVLink.
[29:54]
Host
Is it like one order of magnitude? Like hundreds?
[29:56]
Kyle
It's about an order of magnitude.
[29:59]
Host
Not terrible.
[30:00]
Kyle
Yeah. I need to remember the data sheet here. I think it's about 500 gigabytes a second unidirectional for NVLink and about 50 gigabytes a second unidirectional for InfiniBand. It depends on the generation.
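To make the "about an order of magnitude" point concrete, here is a small back-of-the-envelope sketch using the approximate figures quoted from memory above (about 500 GB/s for NVLink and about 50 GB/s for InfiniBand, both unidirectional and generation-dependent). The payload size is a made-up example, and the model ignores latency and protocol overhead.

```python
GB = 1e9  # bytes per gigabyte (decimal)

# Approximate unidirectional bandwidths as quoted in the conversation.
# Real values depend on the hardware generation.
NVLINK_BYTES_PER_S = 500 * GB
INFINIBAND_BYTES_PER_S = 50 * GB


def transfer_seconds(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: payload size divided by link bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bandwidth_bytes_per_s


payload = 10 * GB  # hypothetical amount of data moved between GPUs

nvlink_s = transfer_seconds(payload, NVLINK_BYTES_PER_S)
infiniband_s = transfer_seconds(payload, INFINIBAND_BYTES_PER_S)
slowdown = infiniband_s / nvlink_s

print(f"NVLink:     {nvlink_s * 1000:.0f} ms")
print(f"InfiniBand: {infiniband_s * 1000:.0f} ms")
print(f"Leaving the NVLink domain is roughly {slowdown:.0f}x slower in this model")
```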
[30:18]
Host
I just want to set this up for people who are not familiar with these kinds of layers and the transfer
[30:23]
Dan
speeds also, maybe even just going a few steps back. Before that, most people are very familiar with, you see, you can use on your laptop, whatever, these SGLang, vLLM. You can just run inference.
[30:36]
Kyle
You can run it on that laptop.
[30:37]
Dan
You can run on a laptop. Then you get to. Okay, models got pretty big, right? GLM-5, they doubled the size. So what do you do when you have to go from, okay, I can get 128 gigs of memory, I can run it on a Spark. Then you have to go multi-GPU. Okay, multi-GPU. There's some support there. Now, if I'm a company and I don't have like, I'm not hiring the best researchers for this. Right. But I need to go multi-node. Right. I have a lot of servers. Okay. Now there's efficiency problems, right. You can have multiple 8xH100 nodes. But, you know, is that as a. Like, how do you do that efficiently? Yeah.
[31:10]
Kyle
How do you like represent them? How do you choose how to represent the model? Right. That's like a hard question everyone asks, how do you size? Oh, I want to run GLM5, which just came out.
[31:19]
Natter
New model.
[31:19]
Kyle
There have been like four of them in the past week, by the way. Like a bunch of new models.
[31:23]
Host
You know why, right? DeepSeek.
[31:25]
Kyle
No comment. Yeah, but GLM-5, right. We, we have this new model.
[31:30]
Natter
It's.
[31:30]
Kyle
It's of like a large size. And you have to figure out how to both scale up and scale out.
[31:34]
Natter
Right.
[31:35]
Kyle
Because you have to find the right representation that you care about. Everyone does this differently. Let's be very clear. Everyone figures this out in their own path.
[31:41]
Natter
I feel like a lot of AI or ML even is like, is like this. I think people think, you know, there was some tweet a few months ago that was like, why hasn't fine tuning as a service taken off?
[31:49]
Host
And you know, that might be me,
[31:52]
Natter
it might have been you. Yeah. But people want it to be such an easy recipe to follow. But even like if you look at an ML model specific to you. Yeah, yeah.
[32:00]
Kyle
And the model.
[32:00]
Natter
And there's so much, there's so much tinkering. Like when you see a model that has however many experts in the MOE model, it's like, why that many experts person, they tried a bunch of things and that one seemed to do better. And I think when it comes to how you're serving inference, you have a bunch of decisions to make. And you can always argue that you can take something and make it more optimal. But I think there's this internal calibration and appetite for continued calibration.
[32:21]
Kyle
Yeah.
[32:21]
Dan
And that doesn't mean people aren't taking a shot at this, like Tinker from Thinking Machines. RL as a service. It also gets even harder when you try to do big model training.
[32:31]
Host
Right.
[32:31]
Dan
We're not the best at training MoEs when they're pre-trained, like we saw this with Llama 3, right? They're trained in such a sparse way that Meta knows there's going to be a bunch of inference done on these, right? They'll open source it, but it's very trained for what Meta infrastructure wants, right? They want to inference it a lot. Now the question to basically think about is, okay, say you want to serve a chat application, a coding copilot, right? You're doing a layer of RL, you're serving a model for X amount of people. Is it a chat model? A coding model? Dynamo? You know, back to that, sorry.
[33:00]
Kyle
So we sort of like jumped off of, you know, on that topic. Everyone has like their own journey and I like to think of it as defined by like what is the model you need? What is the accuracy you need? Actually I talked to Natter about this earlier. There's three axes you care about. What is the quality that you're able to produce? So like are you accurate enough or can you complete the task with enough performance? High enough performance? Yeah, there's cost. Can you serve the model or serve your workflow? Because it's not just the model anymore, it's the workflow, it's the multi-turn with an agent, cheaply enough, and then can you serve it fast enough? And we're seeing all three of these play out. We saw new models from OpenAI that are faster. You have these new fast versions of models. You can change the amount of thinking to change the amount of quality, produce more tokens, but at a higher cost and a higher latency. And really when you start this journey of trying to figure out how you want to host a model, you think about three things. What is the model I need to serve? How many times do I need to call it? What is the input sequence length? What does the workflow look like on top of it? What is the SLA? What is the latency SLA that I need to achieve? Because this is usually a constant, you know, the SLA that you need to hit, and then you try and find the lowest cost version that hits all of these constraints. Usually you start with those things and you say you kind of do a bit of experimentation across some common configurations. You change the tensor parallel size, which is a form of parallelism.
[34:27]
Dan
I'd say it goes even deeper. First you got to think, well model,
[34:31]
Kyle
it's like a multi-step design process because as you said, you can choose a smaller model and then do more test time scaling and it'll equal the quality of a larger model because you're doing the test time scaling or you're adding a harness or something. So yes, it goes way deeper than that. But from the performance perspective, once you get to the model you need to host, you look at that and you say, hey, I have this model, I need to serve it at this speed. What is the right configuration for that?
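The selection process described here (fix the latency SLA and quality bar, then find the lowest-cost configuration that meets them) can be sketched as a simple filter-then-minimize step. This is an illustrative sketch only: the `Config` fields, the candidate names, and every number are hypothetical stand-ins for measurements you would gather by benchmarking, and this is not how Dynamo or any specific tool performs the search.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One benchmarked serving configuration (all fields are hypothetical)."""
    name: str
    tensor_parallel: int
    p95_latency_ms: float
    quality: float              # e.g. task success rate in [0, 1]
    cost_per_1k_requests: float


def cheapest_feasible(
    configs: Iterable[Config], latency_sla_ms: float, min_quality: float
) -> Optional[Config]:
    """Return the lowest-cost config that meets both the latency SLA and the
    quality bar, or None if no candidate is feasible."""
    feasible = [
        c for c in configs
        if c.p95_latency_ms <= latency_sla_ms and c.quality >= min_quality
    ]
    return min(feasible, key=lambda c: c.cost_per_1k_requests, default=None)


# Made-up measurements: larger tensor parallel sizes are faster but cost more.
CANDIDATES = [
    Config("tp2", 2, 900.0, 0.82, 1.10),
    Config("tp4", 4, 450.0, 0.82, 1.60),
    Config("tp8", 8, 250.0, 0.82, 2.40),
]

best = cheapest_feasible(CANDIDATES, latency_sla_ms=500.0, min_quality=0.80)
print(best.name if best else "no configuration meets the constraints")
```

Treating latency and quality as hard constraints and cost as the objective mirrors the framing in the conversation, where the SLA is "usually a constant" and cost is what gets minimized.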
[34:55]
Natter
You guys see the recent. There's a paper I just saw like a few days ago that if you run the same prompt twice, you're getting like double.
[35:02]
Dan
Try it again.
[35:02]
Natter
Yeah, exactly.
[35:03]
Dan
And you get a lot.
[35:04]
Natter
Yeah.
[35:04]
Dan
But the key thing there is you give the context of the failed try. Right. So it takes a shot. And this has been like, you know, basic guidance for quite a while. Just try again. Because you know, try.
[35:15]
Natter
Just try again.
[35:15]
Host
Did you try again?
[35:16]
Natter
All advice in life.
[35:18]
Dan
It's a paper from Google if I'm not mistaken. Right. I think it's like a seven little short paper. The title is very cute. And it's just like, yeah, just try again.
[35:25]
Kyle
Give it.
[35:25]
Dan
It has context.
[35:26]
Host
Multi shot.
[35:27]
Kyle
You just like say like, hey, like, you know, like take, take a little bit more. Take a little bit more information. Try and fail.
[35:32]
Dan
And that basic concept has gone pretty deep. There's like self distillation RL where you, you do self distillation, you do RL and you have past failure. And you know, that gives some signal. So people take. Try it again.
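The "try again, but give it the context of the failed try" idea can be sketched as a retry loop that accumulates earlier failed attempts and passes them back in. This is a minimal sketch under stated assumptions: `attempt` and `check` are hypothetical callables standing in for a model call and a verifier, and the toy attempt function below is not a model and not the method from the paper mentioned in the conversation.

```python
from typing import Callable, List, Optional


def retry_with_context(
    task: str,
    attempt: Callable[[str, List[str]], str],
    check: Callable[[str], bool],
    max_tries: int = 3,
) -> Optional[str]:
    """Call `attempt` up to `max_tries` times. Each call sees the list of
    earlier failed answers, so it can avoid repeating them."""
    failures: List[str] = []
    for _ in range(max_tries):
        answer = attempt(task, failures)
        if check(answer):
            return answer
        failures.append(answer)  # keep the failed try as context for the next one
    return None


def toy_attempt(task: str, failures: List[str]) -> str:
    """Stand-in for a model call: guesses the next integer not yet tried."""
    return str(len(failures) + 1)


result = retry_with_context("guess the number", toy_attempt, lambda a: a == "3")
print(result)
```

In a real system the failure list would typically be rendered into the prompt, and `check` would be a test suite, a verifier model, or some other signal of success.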
[35:45]
Host
Not strong enough. For listeners who are listening here, Vibu and I actually run a second YouTube channel for our paper club where.
[35:55]
Natter
Oh, that's.
[35:56]
Host
We would just cover this self-distillation and all that. That's why he's so up to speed on it.
[36:00]
Kyle
I'll have to check it out.
[36:01]
Host
Yeah, it's just a good practice. Everyone needs a paper club where you just read papers together and the social pressure just kind of forces you.
[36:08]
Kyle
There's like a big inference reading group.
[36:11]
Natter
I feel so bad every time he put it on. Like on our. He shared it.
[36:14]
Host
One of your guys is big in that. I forget.
[36:17]
Natter
Yeah. Ishan.
[36:17]
Host
Ishan.
[36:18]
Kyle
Ishan's on my team actually. Funny, funny. There's a, there's a, there's an employee transfer between us. Ishan worked for Natter at Brev and now he's.
[36:25]
Natter
He was our head of AI and then yeah, once we got in.
[36:28]
Host
Because I'm always looking for like, okay, can, can I start another podcast that only does that thing? And Ishan was. I was trying to like nudge Ishan into like, is there something here? I mean I don't think there's, there's new inference techniques every day. So it's like it's.
[36:40]
Kyle
You would, you would actually be surprised the amount of blog posts you see
[36:45]
Host
and if there was a period where it was like Medusa, Hydra, what, EAGLE.
[36:49]
Kyle
Now we have new forms of decode, we have new forms of speculative decoding.
[36:53]
Host
What are you expecting?
[36:54]
Dan
It's exciting when you guys put out something like Nemotron. Because I remember the paper on this Nemotron 3, the amount of post training, the amount of tokens that the GPU rich can just train on. And it was a hybrid state space model, right?
[37:07]
Kyle
Yeah, it's co designed for the hardware.
[37:09]
Dan
Yeah, co-designed for the hardware. And one of the things was always the state space models don't scale as well when you do a conversion or whatever the performance and you guys are like no, just keep training. And Nemotron showed a lot of that.
[37:21]
Natter
Also something cool about Nemotron, it was released in layers, if you will. Very similar to Dynamo. It was released disaggregated. The pre-training, post-training data sets are released, the recipes on how to do it are released, the model itself is released. So you can benefit from us turning on the GPUs. But there are companies like ServiceNow that took the data set and they trained their own model. And we were super excited and like you know, celebrated that work.
[37:43]
Host
Zoom, the frontier model, Labs. Zoom is, Zoom is AGI.
[37:47]
Dan
I think you know, also just to add like a lot of models don't put out base models. And if there's that, why is fine tuning not taken off? You know, you can do your own training but you guys put out base model. I think you put out everything.
[37:59]
Host
I believe, I don't know about base. Base can, base can be cancelable.
[38:04]
Dan
Base can be cancelable.
[38:05]
Host
Yeah.
[38:06]
Dan
Safety training.
[38:08]
Host
Do we get a full picture of Dynamo? I don't know if we.
[38:10]
Natter
What I'd love is you mentioned the three axes like break it down of like, you know, what's prefill, decode and like what are the optimizations that we can get with Dynamo.
[38:18]
Kyle
Yeah, that's a great point. So to summarize on that three axis problem, there are three things that determine whether or not something can be done with inference: cost, quality, latency. Dynamo is supposed to be there to provide you the runtime that allows you to pull levers to mix it up and move around the Pareto frontier or the Pareto surface that determines, is this actually possible with inference and AI today? Gives you the knobs, yeah, exactly. It gives you the knobs. And one thing that we use a lot in contemporary inference and is starting to pick up in general knowledge is this concept of disaggregation. So historically models would be hosted with a single inference engine and that inference engine would ping-pong between two phases. There's prefill, where you're reading the sequence and generating KV cache, which is basically just a set of vectors that represent the sequence, and then using that KV cache to generate new tokens, which is called decode. And some brilliant researchers across multiple different papers essentially made the realization that if you separate these two phases, you actually gain some benefits. Those benefits are basically: (a) you don't have to worry about step-synchronous scheduling. So the way that an inference engine works is you do one step and then you finish it and then you start scheduling the next step. It's not like fully asynchronous. And the problem with that is, essentially, prefill and decode are actually very different in terms of both their resource requirements and sometimes their runtime. So you would have prefill that would block decode steps because you'd still be prefilling and you couldn't schedule because the step has to end. So you remove that scheduling issue and then you also allow yourself to split the work into two different types of pools. Prefill typically, and this changes as model architecture changes, prefill is right now compute bound most of the time.
If the sequence is sufficiently long, it's compute bound. On the decode side, because you're doing a full pass over all the weights and the entire sequence every time you do a decode step, and you don't have the quadratic computation of KV cache, it's usually memory bound, because you're retrieving a linear amount of memory and you're doing a linear amount of compute, as opposed to prefill, where you retrieve a linear amount of memory and then use a quadratic amount of compute.
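The linear-versus-quadratic contrast between decode and prefill can be illustrated with a toy count of attention work. This is a deliberately simplified sketch: it counts only query-key pairs under causal attention, ignores the weight matrices, heads, and hidden size, and the function names are hypothetical. It is meant to show the scaling shape, not to estimate real FLOPs.

```python
def prefill_attention_pairs(seq_len: int) -> int:
    """Causal prefill: token i attends to itself and all earlier tokens,
    so the total is 1 + 2 + ... + seq_len pairs (quadratic in seq_len)."""
    return seq_len * (seq_len + 1) // 2


def decode_step_attention_pairs(context_len: int) -> int:
    """One decode step: the single new token attends to every cached token
    plus itself (linear in the current context length)."""
    return context_len + 1


for n in (1_000, 2_000, 4_000):
    print(
        f"n={n:>5}: prefill pairs={prefill_attention_pairs(n):>10,} "
        f"one decode step={decode_step_attention_pairs(n):>6,}"
    )

# Doubling the sequence roughly quadruples prefill work but only doubles
# the work of a single decode step.
prefill_ratio = prefill_attention_pairs(2_000) / prefill_attention_pairs(1_000)
decode_ratio = decode_step_attention_pairs(2_000) / decode_step_attention_pairs(1_000)
print(f"prefill grows {prefill_ratio:.2f}x, decode step grows {decode_ratio:.2f}x")
```

Prefill does all of that work in one large batch, which keeps the compute units busy, while each decode step does a small amount of compute per byte of cache and weights it has to read, which is why decode tends to be limited by memory bandwidth.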
[40:35]
Natter
You know what's funny? Someone at Exo Labs did a really cool demo where, for the DGX Spark, which has a lot more compute, you can do the compute-hungry prefill on a DGX Spark and then do the decode on a Mac. And so that's faster.
[40:50]
Kyle
Yeah, so you can do machine stratification. And like with our future generations of hardware, we actually announced, like with Rubin, this new accelerator that is prefill-specific. It's called Rubin CPX.
[41:05]
Natter
I have a question. When you do the scale out, is scaling out easier with Dynamo. Because when you need a new node, you can dedicate it to either the pre fill or decode.
[41:14]
Kyle
Yeah, so Dynamo actually has a Kubernetes component in it called Grove that allows you to do this crazy scaling specialization. It's a representation that. I don't want to go too deep into Kubernetes here, but there was a previous way that you would launch multi-node work. It's called LeaderWorkerSet. It's in the Kubernetes standard. And LeaderWorkerSet is great. It served a lot of people super well for a long period of time. But one of the things that it struggles with is representing a set of cases where you have a multi-node replica that has a paired prefill and decode, or it's not paired, but it has a second stage that has a ratio that changes over time. Prefill and decode are two different things. As your workload changes, the amount of prefill you'll need to do may change. The amount of decode that you'll need to do might change. Let's say you start getting insanely long queries. That probably means that your prefill scales harder because you're hitting this quadratic scaling growth.
[42:12]
Host
For listeners, prefill will be long input, decode will be long output, for example.
[42:16]
Natter
Yeah.
[42:16]
Kyle
So decode scales... I mean, decode is funny, because the number of tokens that you produce scales with the output length, but the amount of work that you do per step scales with the number of tokens in the context.
[42:27]
Host
Yes.
[42:27]
Kyle
So it scales with both the input and the output.
[42:29]
Host
That's true.
[42:30]
Kyle
But on the prefill/decode side, if suddenly the amount of work you're doing on the decode side stays about the same or scales a little bit, and then the prefill side jumps up a lot, you actually don't want that ratio to be the same. You want it to change over time. So Dynamo is a set of components that, A, tell you how to scale: it tells you how many prefill workers and decode workers it thinks you should have. And it also provides a scheduling API for Kubernetes that allows you to actually represent and effect this scheduling on your actual hardware, on your compute infrastructure.
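To make the changing prefill-to-decode ratio concrete, here is a minimal Python sketch. It is not Dynamo's planner or API; the function names and the cost model are assumptions made for illustration. It counts only attention work (prefill roughly quadratic in input length, each decode step linear in the current context) and ignores that decode is usually memory bound in practice.

```python
def prefill_work(n_in: int) -> float:
    """Toy attention cost of prefill: each input token attends to all earlier ones."""
    return n_in * (n_in + 1) / 2


def decode_work(n_in: int, n_out: int) -> float:
    """Toy attention cost of decode: step t reads a context of n_in + t tokens."""
    return n_in * n_out + n_out * (n_out + 1) / 2


def suggest_split(total_workers: int, n_in: int, n_out: int) -> tuple[int, int]:
    """Split workers between prefill and decode in proportion to estimated work.

    Hypothetical helper: assumes total_workers >= 2 and keeps at least one
    worker on each side.
    """
    p = prefill_work(n_in)
    d = decode_work(n_in, n_out)
    prefill = round(total_workers * p / (p + d))
    prefill = min(max(prefill, 1), total_workers - 1)
    return prefill, total_workers - prefill


if __name__ == "__main__":
    # Same output length, but the input grows from 1k to 100k tokens.
    print(suggest_split(16, 1_000, 1_000))
    print(suggest_split(16, 100_000, 1_000))
```

Under this toy model, growing the input from 1,000 to 100,000 tokens while holding the output at 1,000 moves the suggested split from mostly decode workers to almost all prefill workers, which is the kind of drift a fixed ratio cannot follow.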
[43:02]
Natter
Not gonna lie, I feel a little embarrassed for being proud of my SVG function earlier.
[43:07]
Kyle
It was really cute.
[43:09]
Host
It's all engineering. It's all engineering, sort of technical. One thing I'm kind of curious about: you see, at a systems level, everything going on here, and we're scaling it up in multi distributed systems. I think one thing that's kind of of the moment right now is people are asking, is there any SOL, sort of upper bound, in terms of, let's just call it context length, for want of a better word? But you can break it down however you like. I mean, clearly you can engage in hybrid architectures and throw in some state space models in there all you want, but it still looks very attention heavy.
[43:45]
Kyle
Yes, yeah. Long context is attention heavy. I mean we have these hybrid models
[43:50]
Host
and most models cap out at a million tokens of context, and that's been it for the last two years.
[43:55]
Natter
Yeah.
[43:55]
Kyle
The model-hardware-context co-design thing that we're seeing these days is actually super interesting. It's my passion, my secret side passion. We see models like Kimi or GPT-OSS. I'm going to use these because I know specific things about these models. So Kimi K2 comes out, right? And it's an interesting model. It's a DeepSeek-style architecture. It's MLA. It's basically DeepSeek scaled a little bit differently, and obviously trained differently as well. But they talked about why they made the design choices for context. Kimi has more experts but fewer attention heads and, I believe, a slightly smaller attention dimension. I need to check that, but it doesn't matter. They discussed this at length in a blog post on Zhihu, which is like Chinese Reddit. It's actually an incredible blog post. All the MLSys people that I've seen on Zhihu are very brilliant. The creators of Kimi K2 talked about it on there in a blog post, and they say, we actually did an experiment. Attention scales with the number of heads. Obviously, if you have 32 heads versus 64 heads, you do half the work of attention. You still scale quadratically, but you do half the work. And they made a very specific sort of barter in their system, in their architecture. They basically said, hey, what if we gave it more experts? So we're going to use more memory capacity, but we keep the number of activated experts the same. We increase the expert sparsity, so the ratio of experts activated to total number of experts is smaller, and we decrease the number of attention heads.
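The trade Kyle describes can be sketched with two small formulas. This is a rough Python illustration only: the FLOP count is the usual back-of-the-envelope estimate for the attention score and value matmuls, and the head, dimension, and expert counts in the usage lines are made-up numbers, not the configuration of any particular model.

```python
def attention_flops(seq_len: int, num_heads: int, head_dim: int) -> int:
    """Approximate FLOPs for QK^T plus scores @ V over a full sequence.

    Each of the two matmuls costs about 2 * seq_len^2 * head_dim per head.
    """
    return 4 * seq_len * seq_len * num_heads * head_dim


def expert_sparsity(active_experts: int, total_experts: int) -> float:
    """Fraction of experts activated per token; smaller means sparser."""
    return active_experts / total_experts


if __name__ == "__main__":
    seq = 100_000
    # Illustrative values only: halving the heads halves the attention work...
    print(attention_flops(seq, 64, 128) / attention_flops(seq, 32, 128))
    # ...while doubling the context still quadruples it.
    print(attention_flops(2 * seq, 32, 128) / attention_flops(seq, 32, 128))
    # More total experts with the same number active means a smaller ratio.
    print(expert_sparsity(8, 256), expert_sparsity(8, 384))
```

The point of the sketch is that head count is a linear lever on a quadratic curve, while adding experts at a fixed active count spends memory capacity rather than per-token compute.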
[45:39]
Dan
And kind of for context, what we had been seeing was you make models sparser. No one was really touching heads, you were just having...
[45:46]
Kyle
Well, they implicitly made it sparser.
[45:48]
Dan
Yeah. For Kimi they did. They also made it sparser. But basically what we were seeing was people were at the level of, okay, there's a sparsity ratio. You want more total parameters, fewer active, and that's sparsity. But what you see from papers from labs like Moonshot and DeepSeek is they go to the level of, okay, outside of just the number of experts, you can also change how many attention heads, and fewer attention layers or more attention layers.
[46:12]
Kyle
Yes, yes.
[46:12]
Dan
And that's all basically coming back to, just tied together, it's like hardware
[46:16]
Kyle
model co-design, which is hardware-model-context co-design. Yeah. Right. If you were training a model that was really, really short context, or is really good at super-short-context tasks, you may design it in such a way that you don't care about attention scaling, because it hasn't hit the turning point where the quadratic curve takes over.
[46:36]
Natter
How do you consider attention or context as a separate part of the co-design? The way I would have thought of it, hardware-model co-design would already be hardware-model-context co-design.
[46:45]
Kyle
Because the harness and the context that is produced by the harness is a part of the model once it's trained
[46:52]
Dan
in like, even though towards the end you'll do long context, you're not changing architecture through training.
[46:57]
Kyle
I mean you can try.
[46:59]
Host
You're saying everyone's training the harness into the model.
[47:03]
Kyle
I would say to some degree.
[47:04]
Natter
Or there's co design.
[47:05]
Host
I know there's a small amount, but I feel like not everyone has gone full send on this.
[47:10]
Kyle
I think it's important to internalize the harness that you think the model will be running into the model.
[47:15]
Dan
Yeah.
[47:15]
Natter
Interesting. Okay.
[47:16]
Host
And like Bash is like the universal harness.
[47:20]
Kyle
I'll give an example here. Or just an easy proof, right? If you can train against a harness and you're using that harness for everything, wouldn't you just train with the harness to ensure that you get the best possible quality out of it?
[47:35]
Host
Well, I can provide the counterargument, which is that you want to provide a generally useful model for other people to plug into their harnesses.
[47:43]
Kyle
Harnesses can be open source, right?
[47:45]
Host
Yes. I mean, that's effectively what's happening with Codex. But you may want a different search tool, and then you may have to name it differently.
[47:52]
Natter
I don't know how much people have pushed on this, but have people compared training a model for the harness versus post-training for it?
[48:01]
Host
I think it's the same thing. It's just extra post training.
[48:04]
Natter
I see.
[48:05]
Host
And so, I mean, Cognition does this, Cursor does this, where, if your tool is slightly different, you have to either force your tool to be like the tool that they trained for, or undo their training for their tool and then retrain. It's really annoying.
[48:17]
Kyle
And like, I would hope that eventually we hit like a certain level of generality with respect to these new tools.
[48:23]
Host
It's not AGI. It's really stupid. Like, learn my tool, bitch. I don't know if I can say that. But, you know, I think what my point kind of is, is that I look at the slopes of the scaling laws and, like, this slope is not working, man. We're at a million-token context. Okay, maybe next year 2 million. We're not going to 100 trillion. You know, this just...
[48:44]
Kyle
Oh, there's so many interesting ways.
[48:46]
Host
This doesn't work. This doesn't work.
[48:48]
Natter
What's kind of funny is, I feel like we always want to see a trend that we can predict, but every time something's come, it's been like a leapfrog. So I don't know how we go from one to two, but I imagine what's likely to happen is we break through that from some new...
[49:02]
Kyle
Yeah, there's an interesting formalization of this. There's a pretty interesting essay by Leopold Aschenbrenner called Situational Awareness.
[49:10]
Host
No kidding. Yes.
[49:11]
Kyle
He introduces a concept called an unhobbler. So Leopold, in this essay, details: hey, I want to get to this point in intelligence, and I think that it is four orders of magnitude worth of compute and data and training away. And he says, oh yeah, I think data centers can scale up by about this much. I think that you can scale up the data and some other things by this much. But one of the things that makes the rest of that order-of-magnitude growth possible is these unhobblers, these scientific discoveries, made during the model architecture search or training, that really, really impact how you are able to scale. A good example of this, and this is probably a very tiny unhobbler, but it's important from the performance perspective: we see a lot of models that are trained with multi-token prediction natively during pre-training, and per DeepSeek, in their paper they say, hey, this actually helped us ensure more stable convergence. So there are unhobblers that are like that. And then there are rather large unhobblers. Architecturally, in a lot of our models we had different types of attention. And one of the problems with attention is you have a lot of KV. But people have found different forms of attention, like grouped-query attention and MLA, multi-head latent attention, in DeepSeek, that decrease the burden that KV has on the model, which allows you to grow longer in context.
[50:39]
Host
Yeah. And that was very drastic for DeepSeek.
[50:41]
Kyle
Yeah, for context, I think the total context length of DeepSeek is 128,000 tokens, or it might be 256,000 with RoPE extension. That entire context, I think it's 128,000, fits into 8 gigabytes. And previously, I think the Llama 405B context of a similar size was like 40 or 80 gigabytes in the same precision. So those unhobblers really decrease the size of that stuff. And I wouldn't be surprised if we do see the ability to break through to 10 million, 20 million, 100 million context through an unhobbler showing up.
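The KV cache figures Kyle quotes from memory can be sanity-checked with a short calculation. The Python sketch below is a rough estimate under stated assumptions: 2-byte (16-bit) cache entries, and layer and dimension values as I recall them from the publicly released Llama 3.1 405B and DeepSeek-V3 configurations. Treat those numbers as assumptions to verify against the model cards, not as authoritative.

```python
GIB = 1024 ** 3


def kv_bytes_gqa(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_elem: int = 2) -> int:
    """KV cache size for grouped-query attention: one K and one V vector
    per KV head, per layer, per token."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem


def kv_bytes_mla(tokens: int, layers: int, latent_dim: int, rope_dim: int,
                 bytes_per_elem: int = 2) -> int:
    """KV cache size for multi-head latent attention: one compressed latent
    plus a small positional key per layer, per token."""
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_elem


if __name__ == "__main__":
    tokens = 128 * 1024
    # Assumed config values (recalled from public configs, not from the episode):
    # Llama 3.1 405B: 126 layers, 8 KV heads, head dim 128.
    llama = kv_bytes_gqa(tokens, layers=126, kv_heads=8, head_dim=128)
    # DeepSeek-V3: 61 layers, 512-dim KV latent plus 64-dim RoPE key.
    deepseek = kv_bytes_mla(tokens, layers=61, latent_dim=512, rope_dim=64)
    print(f"GQA example: {llama / GIB:.1f} GiB")     # about 63 GiB
    print(f"MLA example: {deepseek / GIB:.1f} GiB")  # about 8.6 GiB
```

With those assumed values, a 128K-token context comes out near 63 GiB for the grouped-query layout and under 9 GiB for the latent layout, which is in line with the "40 or 80 gigabytes" versus "8 gigabytes" comparison above.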
[51:20]
Host
I see.
[51:21]
Kyle
And it's just science.
[51:21]
Host
More deep learning algorithms is what it is.
[51:26]
Natter
A frame pickup and he has room for two.
[51:28]
Kyle
I could actually give you an example of a theory. Not a theory here, but something theoretically
[51:34]
Natter
an unhobbler that you're excited about.
[51:36]
Kyle
An unhobbler that, I mean, I haven't seen. So it could be a tar pit and it could just not work. But I would be really excited to see a model that does prefill and decode differently. So a model that does prefill locally, document-wise: it does prefill in chunks. And then you do decode globally across the entire sequence. Logically, to me, it doesn't seem like you would necessarily need to have KV be associative between documents that have no mutual association. But that places a lot of burden on decode, and on pure attention within the decode phase, to make those connections, since the KV is static at that point. You see other techniques that are interesting like this too. But if you're able to do that, if prefill becomes local and decode is still global, you solve that prefill quadratic scaling problem, because you have a bunch of small chunks that you prefill independently.
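The scaling argument behind this hypothetical "local prefill, global decode" idea is easy to check numerically. The Python sketch below only compares attention cost under a toy quadratic model; it says nothing about whether such a model would train or perform well, and the function names are made up for illustration.

```python
def global_prefill_cost(doc_lengths: list[int]) -> int:
    """Toy cost when all documents are prefilled as one sequence:
    quadratic in the total length."""
    n = sum(doc_lengths)
    return n * n


def local_prefill_cost(doc_lengths: list[int]) -> int:
    """Toy cost when each document is prefilled independently:
    a sum of small quadratics."""
    return sum(n * n for n in doc_lengths)


if __name__ == "__main__":
    docs = [1_000] * 100  # 100 unrelated documents of 1,000 tokens each
    ratio = global_prefill_cost(docs) / local_prefill_cost(docs)
    print(f"global prefill is {ratio:.0f}x the cost of local prefill")
```

For k equal-sized chunks the saving is a factor of k, so local prefill grows linearly with the number of documents instead of quadratically with the total length. Decode would still have to attend across everything.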
[52:30]
Host
Okay, all right. Well, let's wait and see. But I think it'll be pretty exciting.
[52:33]
Kyle
Fingers crossed.
[52:34]
Host
Yeah, fingers crossed.
[52:35]
Kyle
Yeah.
[52:35]
Dan
I'm excited for prefill and decode on separate hardware. So, like, the Groq acquisition, right? Can we decode on the Groq, can we get super fast?
[52:43]
Kyle
I don't think I'm allowed to comment on this.
[52:46]
Host
Mark is going to shoot arrows at us.
[52:48]
Natter
He's got a blow dart. Yeah, he's at the side of the room.
[52:50]
Kyle
Just like go to sleep.
[52:53]
Natter
I'm super excited to see the team come in and, you know, I've gotten the pleasure of working with some of the Groq people coming in. So, you know. Yeah, I know Sunny.
[53:00]
Host
We've had him at the same conference that you were at.
[53:02]
Natter
Yeah.
[53:03]
Host
And I think you guys are going to be doing some sessions at GTC. I don't know if this is a good place to plug them.
[53:08]
Kyle
Yeah, yeah. So I can't speak to any LPU-related sessions at GTC. I have no idea about that on the Groq side. Yeah, I use the associative Nvidia U.
[53:19]
Natter
On the.
[53:20]
Kyle
On the Nvidia Dynamo side, there are a large number of sessions. For those that aren't aware, you can actually search all of these sessions for GTC online. Just go to the GTC website. I don't know what the URL is, but go there and you can just look up Dynamo and you'll get all the sessions. There are about 20. There are a couple that are hosted by the Dynamo team. There are a couple that are hosted by people that use Dynamo that want to show off the results they've been able to get. But there are two that I'm really excited about. One is just the general Dynamo tutorial. I'm doing that with Harry, who's our lead product manager for Dynamo, and we're talking about how to use Dynamo to get better performance and also where we see Dynamo going in the future. And then there's another session that I'm doing with one of our agents teams at Nvidia to talk about the future of agents in production inference. There's this new horizon with respect to agents, because we have these harnesses that actually impart structure upon calls. If you compare the past and the present with respect to how LLM calls work: in the early days, when they were chatbots, every call was very different. There was basically no structure. You could assume that if it was conversational there might be some implicit structure, because you have a multi-turn conversation. But with agents, you have this harness that abides by rules, so it imparts direct structure onto the context. And you see this. There was an interesting Twitter post about how Claude Code structures its context so that you get as many cache hits as possible. I think it was by one of the PMs for Claude Code, and he wrote about it. And that type of structure that the harness can impart actually goes hand in hand with the inference co-design. So I'm doing a talk. I don't know the session name or the session number, but you can look me up by name on the GTC website. It's on how we accelerate agents and where we see specific optimizations for agents going in Dynamo and in inference in general.
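Why harness structure matters for cache hits can be shown with a toy prefix-reuse measurement. This is a simplified Python sketch, not how Claude Code or Dynamo actually implement caching: contexts are lists of string "tokens", and a call is assumed to reuse exactly the longest prefix it shares with any earlier call.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuse_fraction(calls: list[list[str]]) -> float:
    """Fraction of all tokens that could be served from a prefix cache,
    assuming each call reuses its longest shared prefix with an earlier call."""
    reused = 0
    total = 0
    for i, call in enumerate(calls):
        best = max((shared_prefix_len(call, prev) for prev in calls[:i]), default=0)
        reused += best
        total += len(call)
    return reused / total if total else 0.0


if __name__ == "__main__":
    # A harness that keeps a stable prefix and only appends new turns.
    stable = [
        ["sys", "tools", "u1"],
        ["sys", "tools", "u1", "a1", "u2"],
        ["sys", "tools", "u1", "a1", "u2", "a2", "u3"],
    ]
    # A harness that puts a changing value (say, a timestamp) first.
    unstable = [
        ["t=1", "sys", "tools", "u1"],
        ["t=2", "sys", "tools", "u1", "a1", "u2"],
    ]
    print(reuse_fraction(stable), reuse_fraction(unstable))
```

In the append-only layout more than half of the tokens are reusable, while a single changing token at the front of the context drops reuse to zero. That is the sense in which the harness and the inference engine are worth designing together.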
[55:19]
Host
Yeah, I think there's only one PM for Claude Code, and it's Cat. The rest, there's DevRel, there's Boris.
[55:24]
Kyle
Maybe it's DevRel.
[55:25]
Host
Exactly. I mean, let's go into agents. I think this was like the last part of this discussion we planned.
[55:29]
Kyle
How have we not talked about agents
[55:31]
Dan
also with you guys?
[55:32]
Host
We scheduled it. I was like, okay, let's have cohesive sections.
[55:36]
Dan
I mean, there's big news, right? Nvidia has a huge deployment of Codex.
[55:42]
Host
Nvidia uses everything. We use Cursor and we use this.
[55:44]
Dan
But that's a pretty big deployment, right? That's tens of thousands of people.
[55:49]
Natter
Totally. Yeah. We were super curious. It goes back to the mosh pit of emails we kind of mentioned earlier, or just how fluid the org feels. So when there's new technology, people will just email it out and everyone will try it. And if it's making people's lives easier, it'll spread like wildfire.
[56:03]
Kyle
A lot of times Jensen will get it and he'll be like, let's make this work across the company. Let's make this work right now.
[56:08]
Natter
Honestly, if I was a startup, I feel like a cool hack is: if you have something that's going to save an Nvidian's time, they'll spread it to a couple of people, and the same thing.
[56:15]
Kyle
Right.
[56:15]
Natter
It'll just spread like wildfire.
[56:17]
Dan
Careful before your email blows up from startups, by the way.
[56:20]
Natter
Well, you have to know the person, right? But no, I love using Codex. It's been a ton of fun. I've been using it personally, been using it at work. It's been great to see the rollout. Something really funny: on the day that we got Codex and Claude Code access, I found this person at the company, his name's Carlos, who wrote an Outlook CLI.
[56:40]
Kyle
Oh yeah.
[56:41]
Natter
Just a CLI for email. And I've been using that. This was maybe four or five weeks ago. So once I got Codex access, I installed the CLI. It had a skill, and I just asked it to go through all of my emails, which are very messy, so if I don't respond to your email, I'm really sorry. But I asked it to give me a summary, highlight any escalations that I should look at, put any thread that it thinks I should respond to in a folder, and then archive everything. And it did. So if I missed your email, it's because it didn't get...
[57:09]
Host
So I should put a prompt injection in my emails to. Yeah, what you should do is just paste the OSH's.
[57:15]
Natter
Yeah, yeah, yeah. My SLA is highest on FaceTime. But it was magic. And so I sent it in a big email thread to like 500 people. A bunch of folks tried it out. I started FaceTiming whoever I could at the company to get them set up with this. Yeah.
[57:28]
Host
That specific example, you guys deal with, like some pretty sensitive emails.
[57:33]
Natter
Yeah.
[57:33]
Host
Is there a security review with this? Because, like, one guy made it for himself, but, like, it's not meant for all.
[57:38]
Kyle
The security team at Nvidia is incredible. Like, shout out to them.
[57:41]
Natter
They're.
[57:41]
Kyle
They're, they're trying to.
[57:42]
Natter
We have an amazing security team, because they're progressive and they know that this is really important technology. We have to bring it in. If you think about it, if you work at a big company, your laptop's usually very locked down and you can only access certain things. For Nvidia engineers, those restrictions aren't there. So you're expected to understand the risks when you try things out. And so very quickly, you know, I made sure to loop in security on what we were doing. There's actually a lot that we've been thinking about, especially with OpenClaw.
[58:04]
Host
Right.
[58:04]
Natter
Like, you know, agents can do three things. They can access your files, they can access the Internet, and now they can write custom code and execute it. And you should really only let an agent do two of those three things. If it can access your files and it can write custom code, you don't want Internet access, because that's a vulnerability. If it has access to the Internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise malware can get injected, or something like that can happen. And so that's a lot of what we've been thinking about: how do we enable this, because it's clearly the future, but then also, what are these enforcement points that we can start...
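The "two of three" rule described here is simple enough to express as a policy check. The Python sketch below is purely illustrative: the class and function names are hypothetical, and it is not a description of any enforcement mechanism Nvidia actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentPermissions:
    """Hypothetical capability flags for an agent session."""
    file_access: bool
    internet_access: bool
    code_execution: bool

    def granted(self) -> int:
        """Count how many of the three capabilities are enabled."""
        return sum((self.file_access, self.internet_access, self.code_execution))


def is_allowed(perms: AgentPermissions, max_capabilities: int = 2) -> bool:
    """Allow a session only if it holds at most two of the three capabilities."""
    return perms.granted() <= max_capabilities


if __name__ == "__main__":
    offline_coder = AgentPermissions(file_access=True, internet_access=False, code_execution=True)
    everything = AgentPermissions(file_access=True, internet_access=True, code_execution=True)
    print(is_allowed(offline_coder), is_allowed(everything))
```

A check like this would sit at an enforcement point such as session creation, so that the combination of capabilities is decided ahead of time rather than negotiated by the agent at run time.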
[58:41]
Host
to, like, protect. And is there any directive of, like, hey, we have a company account or company agreement with OpenAI, we use OpenAI models here, or is it choose whatever?
[58:49]
Natter
No, no. So I would never put any company data in a model that's not either that we don't even.
[58:55]
Kyle
It has the most security. Yes. Yeah, like how that goes.
[58:58]
Host
You know, obviously you could run your own models. You have Nemotron and we do.
[59:03]
Natter
We have an internal cluster. So you know, of course an English. Yeah, yeah, yeah. I think we're Dynamo's first customers.
[59:10]
Kyle
Actually, there's a funny story about how I got the experience that informed what we needed for Dynamo. At one point, there's a website called build.nvidia.com (and also, for us, inference Nvidia.com) that allows people to try models. It gives an API service. You can call the model with a REST API and, you know, you get a response. I ran the model side for that, and it was at one point the largest inference deployment, and still may actually be the largest inference deployment, at Nvidia. I've since handed it off to some people, and they're doing a wonderful...
[59:39]
Natter
This is an extremely under-known resource: build.nvidia.com. You can get any of these open source models, and it's rate limited, but it's free, so it's perfect for hackers and...
[59:49]
Kyle
and the SLA on getting day-zero models up is like a day. They're incredibly good at figuring out the right way to host the model to get it up there as soon as it comes out. Yeah, I ran it a long time ago. It was originally called Nvidia AI Playground. Then it was called AI Foundation, and then it was called build.nvidia.com, and I ran the model side of it. So there was a large multi-organizational team. I ran: which models should we host, how should we host them, and what's the proportion of them? And then of course there was an SRE team that made sure that things ran well and scaled the models as well. But I ran, you know, how do we get the model to silicon, and also worked with our product team to determine which models were important, a very long time ago.
[60:40]
Dan
Yeah, there's also like a middle ground in between, right? This is for the hacker, try anything. There's the Brev console, then there's Dynamo. There were also NIMs, right? Yeah, I remember it had its little moment like a year or two ago. Is it still...
[60:53]
Natter
Yeah, no, NIM is, you know, inference microservices. I think it stood for something.
[60:58]
Kyle
It's no longer an acronym, it's just a nim.
[61:01]
Natter
But yeah, NIM is how enterprises can take any of this technology and run it with support and all of that. And so that includes Dynamo. That includes, I don't know, all of our other optimizations that are packaged for enterprise.
[61:14]
Kyle
Yep.
[61:15]
Host
Anyway, so you got a bunch of experience running the sort of internal inference gateway playground.
[61:19]
Kyle
Yeah. I also helped build Nvidia's first internal VS Code thing. We called it NV Code.
[61:26]
Host
It's like the extension, right?
[61:28]
Kyle
Yeah, it was a VS first.
[61:30]
Dan
Like a fork of VS Code?
[61:31]
Natter
Agree.
[61:32]
Host
We joked... absolutely not. Just a while back we were like, we should have a fork-VS-Code
[61:36]
Kyle
hackathon, where you make the best fork of VS Code.
[61:39]
Dan
Earlier we were doing a How do
[61:41]
Host
you make a billion dollars?
[61:42]
Dan
Someone from VS Code was there, and he was like somewhat down to get involved.
[61:46]
Natter
And I was like, oh, you should do that.
[61:47]
Host
That's all I said. Then the cool thing became a fork-Chrome hackathon. No, no, no. IDEs are not cool anymore.
[61:52]
Dan
I.
[61:53]
Natter
What's it called? I was talking to Joseph from Roboflow, and your partner in crime, and we were talking about the new Alpamayo model. So Nvidia just released an open source... the Mercedes cars that you saw drive.
[62:03]
Kyle
Shit, sounds crazy.
[62:04]
Natter
Yeah.
[62:04]
Dan
Released.
[62:05]
Kyle
Will you open source an autonomous driving model?
[62:09]
Natter
Yeah. So we were thinking like, could we hackathon a driverless car? Like I have my old car. Let's just try it.
[62:15]
Host
We'll take it.
[62:16]
Dan
Take it to like Click Trail with
[62:18]
Natter
Treasure Island, in the middle of the bay. Just see it, let it roam. Yeah. Like, how many cameras do we need? Right? Like 1, 2, 3, 4, I don't know, maybe 5, 6. I don't know. Yeah, but I think we're going to try. You should do it with us. We could even have a race. It's like, the first person to automate their driving, I mean, over a weekend.
[62:35]
Host
We do have an autonomy track at World's Fair. Waymo was there. Yeah, Nvidia did send people, though for GR00T, because you didn't have the driving thing yet. Yeah, that's cool.
[62:45]
Dan
I think Comma also has a version of this. Comma, they have open source driving. They've done a fun hackathon on he
[62:50]
Host
and I as host. Because what I really want is a Tesla. It was Tesla level self driving.
[62:55]
Natter
Yeah.
[62:55]
Host
But as a smart car like a two seater that's basically a wheelchair with a roof.
[63:00]
Dan
And I think they make them in the demand has DNA.
[63:06]
Host
They're like this for like five years.
[63:08]
Natter
Yeah.
[63:08]
Host
Really? Yeah.
[63:09]
Dan
They were different manufacturer.
[63:12]
Kyle
I feel like it's one of those things where we'll see someone buy the brand and it'll be revived. I would buy it like a private go. Someone hears this. Go buy your car.
[63:22]
Host
Yeah. That's crazy Mercedes because they're like I think Mercedes.
[63:28]
Dan
Mercedes used to make them.
[63:29]
Kyle
Yeah.
[63:30]
Dan
I don't know.
[63:30]
Natter
I feel like they own the brand
[63:32]
Host
and you that's your dream might come true, you know. Okay.
[63:37]
Natter
We're
[63:40]
Host
like every time I try to park in San Francisco I have to buy a smart car because like 20% of the parking lots in San Francisco only fit smart cars.
[63:48]
Natter
Yeah. Really? That's what I mean. Even though it was late here trying to.
[63:53]
Kyle
This comes from someone that like basically does not drive.
[63:56]
Natter
That's where the Vespa was a life hack.
[63:57]
Host
Yeah, exactly.
[63:58]
Natter
Yeah.
[63:58]
Kyle
You know what happened to the Vespa?
[63:59]
Natter
I used to have this yellow Vespa. I left it outside the hacker house when we moved out. It was always there, and then like a month ago it's not there anymore. I've been meaning to. I don't know. You could light so it's actually been like a db.
[64:13]
Kyle
You forgot about it.
[64:14]
Natter
Yeah.
[64:14]
Dan
Unless.
[64:15]
Host
Yeah, yeah, yeah. No, this is probably has it. And speaking of hackathons, I also wanted to give a big shout out to the world's shortest hackathon. Let's go. You did it twice.
[64:23]
Natter
A handful of times. Yeah. There's going to be one at GTC.
[64:25]
Kyle
Oh, we're doing another one.
[64:26]
Natter
Pretty much. We have a bunch of challenges that no, we haven't released and you get to bring your agent to come and attempt to go through those challenges.
[64:34]
Kyle
It's like the zero minute hackathon idea. You just bring your I promised eight night a long long time ago. You just bring your agent and then you press the go button. You're not allowed to code.
[64:44]
Natter
It's just the agent doing it.
[64:46]
Dan
It's a good hidden eval, right?
[64:48]
Kyle
Yeah.
[64:48]
Dan
You make a J rope and you
[64:49]
Kyle
make this something I would love to see from Cognition or someone else, be like, come bring your agent.
[64:56]
Host
Like drop it in.
[64:57]
Dan
Because you don't know you like supervisor. Will it be a. You know, operate a browser, order a Pizza. Will it just seem like that Snake
[65:03]
Host
game, you know, and you don't know what the task is?
[65:05]
Kyle
Yeah, I don't know what the task is. Like, we're just like, you don't even know what the judging categories are. And then you give it the judging categories, like try and win as much as possible.
[65:12]
Dan
It's great though. It turns into like, like. Yeah. So let's build something on Dynapod.
[65:16]
Host
It's a great business.
[65:19]
Kyle
Funny story, actually. We have a couple of people at Nvidia, and we've been working with security to bring agents really close to compute. So we now have stuff where we can tell an agent, like, go run some experiments with Dynamo on X cluster and just try it right now: queue up, and once you get queued, send this request load. And we've actually been able to just one-shot problems. We used to have this problem where with Dynamo you have to find the right configurations, and we do it automatically for some parts of it. But you have to have a good initial configuration that you want to use. And we've just had an agent completely one-shot that. It goes, it gets the compute, it runs a couple of experiments. It's like, this is the best; these are part of the Pareto frontier; go run this. And then we just give that to people, and it's faster than anything that they have.
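For readers unfamiliar with the term, picking "part of the Pareto frontier" from a set of experiment results can be sketched as follows. The Python below is a minimal illustration, assuming each experiment reports a throughput (higher is better) and a latency (lower is better). The configuration names and measurements are invented for the example and are not Dynamo results.

```python
def pareto_frontier(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return the names of configs not dominated by any other config.

    Each entry is (name, throughput, latency). A config is dominated if
    another one is at least as good on both metrics and strictly better
    on at least one.
    """
    frontier = []
    for name, tput, lat in configs:
        dominated = any(
            (t2 >= tput and l2 <= lat) and (t2 > tput or l2 < lat)
            for _, t2, l2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Made-up measurements: (name, tokens per second, latency in ms).
    results = [
        ("a", 100.0, 50.0),
        ("b", 200.0, 80.0),
        ("c", 150.0, 90.0),  # worse than "b" on both metrics
        ("d", 90.0, 60.0),   # worse than "a" on both metrics
    ]
    print(pareto_frontier(results))
```

Here "a" and "b" survive: neither beats the other on both metrics, so choosing between them depends on whether the deployment cares more about latency or throughput.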
[66:07]
Natter
Agent UX and agent marketing are super important. They're stuff that we've been thinking a lot about. Alec is redoing the entire Brev CLI so that you can fetch all the different compute types that are available. It's going to be really soon, but then you can just browse what GPUs are available, provision one, SSH into it right there, and you can pipe all the commands. But I think it goes back to the Outlook CLI. It's kind of funny, I feel like coding agents have been so much more effective than general-purpose agents. And I think a large part of that is that it just has access to the terminal, like you said. And that means it has access to everything that you've installed into your terminal. It can write code and it can compile the code, and if there are errors, it can fix them. It can run your suite of tests, because that's all just in your terminal. And so that's what got me really excited about the Outlook CLI. We're now just churning through building CLIs for the entire business suite: Slack, Workday CLI, SAP. Go.
[66:57]
Host
I've also done that for myself. Really?
[66:58]
Natter
Yeah, yeah, we're going to open source all of this. I mean, they're just CLIs for the business applications. We would love for someone to run with this and build, I don't know, like an Open CLI Foundation or something. Yeah, Nvidia would love to support anyone that's doing this.
[67:13]
Dan
Like, every dev tool should really have good CLI support at this point. At one point it was, you want your docs to be accessible by an LLM, right? You want LLM-good docs. Now everything needs some CLI tool.
[67:25]
Natter
Yeah. It's kind of funny, right? Computing began with a terminal, with a shell, but we said that it's not empathetic to humans. So we built these nice user interfaces, and now we have LLMs navigating our user interfaces, and ironically we're not empathetic to the machine anymore. Just give the LLM access to the shell.
[67:42]
Host
One thing that slightly makes me uncomfortable is, why do we have to build CLIs? Why can't we just expose APIs?
[67:48]
Kyle
I have an interesting answer to this. So there are a couple of reasons. Portability is one issue. Sometimes APIs are not discoverable or reachable by some, you know, types of things. There's some element of locality, right? The CLI is literally you interfacing with your local system, which is a little bit different. You could still do it by API, but there's this highlighting of, what is the difference between a CLI and an MCP, right? They kind of occupy the same purpose: you call them, it does something on the system, and that's done. I think that in pre-training there's just an enormous amount of command-line data. Even if we ignore RL, say you're doing no harness, you're doing no harness post-training, just the amount of CLI versus API documentation for navigating this world of the CLI and your file system is just enormous.
[68:41]
Natter
I think there's a couple of things too. Like if, let's say we want to. So one, I think your intuition is right. The CLI is just wrapping the API. Right.
[68:48]
Host
Functionally.
[68:49]
Natter
Functionally, right. And I think it's nice because, one, you're being very specific and pedantic, and that's really good because you're describing the problem space. So you know the, I don't want to call it the space for vulnerability, but you know what network calls you're making. It's not arbitrary, and it's not decided on the fly. It's pre-decided, which is important from a security perspective. But then if you were to write a bunch of API requests, I don't know, would the model use Python to do so? I kind of like that with a CLI everything is just bash, because it's ubiquitous. It's just there. And you don't have to make sure that certain environment variables are set up. If your Python version is different than my Python version and we're using the same model to go do the same thing, is it going to write different code? It probably would.
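[Editor's sketch] The point that a CLI is functionally a wrapper over the API, with its network calls decided ahead of time, can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not any real NVIDIA or Brev tool: the endpoint `api.example.com`, the program name `democli`, and both subcommands are hypothetical, and the tool only prints the call it would make rather than sending it.

```python
import argparse
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only for illustration.
API_BASE = "https://api.example.com/v1"

# Every call the tool can make is declared here, ahead of time.
# The agent picks a subcommand; it never composes an arbitrary request.
ALLOWED_COMMANDS = {
    "list-instances": ("GET", "/instances"),
    "stop-instance": ("POST", "/instances/stop"),
}


def build_request(command: str, params: dict) -> urllib.request.Request:
    """Map a subcommand onto one predeclared HTTP call."""
    method, path = ALLOWED_COMMANDS[command]  # KeyError for anything undeclared
    query = urllib.parse.urlencode(params)
    url = f"{API_BASE}{path}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, method=method)


def parse_args(argv: list) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="democli")
    parser.add_argument("command", choices=sorted(ALLOWED_COMMANDS))
    parser.add_argument("--id", default=None, help="instance id, if needed")
    return parser.parse_args(argv)


def main(argv: list) -> urllib.request.Request:
    args = parse_args(argv)
    params = {"id": args.id} if args.id else {}
    request = build_request(args.command, params)
    # Dry run: show the call instead of sending it.
    print(request.get_method(), request.full_url)
    return request
```

Calling `main(["stop-instance", "--id", "abc"])` prints the one request that subcommand is allowed to produce. Wiring `main` to `sys.argv` and actually sending the request are left out on purpose.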
[69:31]
Kyle
And so it's kind of nice to go work. Right.
[69:33]
Natter
With human as well, I think. Just like making those decisions happen ahead of time versus yeah.
[69:38]
Host
One last thing on this sort of agent, I guess, maybe colocation or whatever you call it. One pattern I'm tracking for this year: I always try to think about what the theme of the year is going to be. Last year, definitely coding agents. This year is definitely coding agents breaking out of containment into broader... I don't know if you've seen Rent a Human. Yeah, I'm on.
[69:59]
Natter
Are you really?
[69:59]
Kyle
When I say I'm like $5,000, I'll do anything, really. I think so.
[70:04]
Host
I need my bowels from Costco.
[70:07]
Dan
But I think the best part is only the agent can book me. You know, it's very.
[70:11]
Kyle
Usually it's just like another labor marketplace
[70:14]
Host
Mechanical Turk was this. So this.
[70:16]
Dan
I have a weird story with why I did it. So back to your example of just giving agent access to compute. Right?
[70:22]
Kyle
Yeah.
[70:22]
Dan
You guys are GPU rich at Nvidia. I hooked up.
[70:25]
Natter
He's not shy about it.
[70:26]
Dan
I have a 24/7 agent running. I hooked it up to RunPod. It doesn't shut down instances. And I'm like, I've tried prompting it. I've given it instructions: shut down when you're done. It's like, I need to keep it warm. I'll need it soon. It's horrible on time estimates too, because it's like, yeah, I'll need it in 45 minutes. In 45 minutes, I'll shut it down. 45 minutes of human time is actually three minutes of agent time. So it's like, I'm booting it up, I'm waiting, I'll just leave it on all night. And Modal is good at shutting down after some inactivity. I had it on my local server, like a little dual GPU thing. It just stays on. I have a little space heater at home now. But careful. So basically, you know, they don't care about the concept of money. Just burn it. I need it.
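[Editor's sketch] One way to take the shutdown decision away from the agent is an inactivity timer enforced outside the agent loop, in the spirit of the "shuts down after some inactivity" behavior mentioned above. This is a minimal sketch, not how RunPod or any other provider implements it: `IdleWatchdog`, `stop_instance`, and the 10-minute limit in the usage note are all hypothetical names and values.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Stops an instance after a fixed idle period, regardless of what the agent 'plans'."""

    def __init__(
        self,
        idle_limit_s: float,
        stop_instance: Callable[[], None],  # hypothetical hook into your provider
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.idle_limit_s = idle_limit_s
        self.stop_instance = stop_instance
        self.clock = clock
        self.last_activity = clock()
        self.stopped = False

    def touch(self) -> None:
        """Record that the agent just used the instance."""
        self.last_activity = self.clock()

    def check(self) -> bool:
        """Stop the instance if it has been idle too long. Returns True once stopped."""
        idle_for = self.clock() - self.last_activity
        if not self.stopped and idle_for >= self.idle_limit_s:
            self.stop_instance()
            self.stopped = True
        return self.stopped
```

In use, the harness would call `touch()` whenever the agent runs a job and call `check()` on a schedule, for example `IdleWatchdog(idle_limit_s=600, stop_instance=my_stop_fn)`. The agent's own estimate of when it will next need the GPU never enters the decision.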
[71:04]
Natter
It's useful, and DGX Spark will be really nice. I'm looking at it as super useful for agents because, yeah, you buy it once, you plug it in, and then you let it rip.
[71:14]
Kyle
I'm gonna make an Nvidia ad here. Okay. The Blackwell RTX Pro 6000 cards are only, I think it's $8,000, slightly cheaper. Yeah, well, it's much cheaper than the data center cards.
[71:29]
Natter
Yeah.
[71:30]
Kyle
And it's got 96 gigabytes of VRAM. So if you and your crew want to go run a local agent for you in the home, I feel like it's got a significant amount of VRAM. I've thought about purchasing this and running it in my basement, except my neighbors would hate me.
[71:47]
Dan
It's just a single like two, three slot GPU.
[71:49]
Natter
It's mostly.
[71:49]
Kyle
Yeah, it's a PCIe GPU. You can go buy that. I mean, the big difference against the RTX gaming GPUs is, I mean, obviously it's Blackwell. It's a pro GPU and it has a lot of VRAM, which means you can run pretty large models on it.
[72:03]
Dan
You can stack four of them, for the Max-Q, in a system.
[72:06]
Kyle
Yeah, that's a beast. It's beefy.
[72:08]
Natter
You can run.
[72:09]
Kyle
What is that? 96 gigs, and with that many 96s you can run DeepSeek.
[72:13]
Dan
But also they are slow. I mean, performance on speed will be somewhat slower than an API.
[72:22]
Kyle
Oh yeah, that's true. So again, big learning: economy of scale allows you to do things that get you both speed and throughput. I'll give you an example. There's an optimization called Wide EP. I'm not going to go into it fully, but it featured heavily in InferenceMAX for DeepSeek, and there's a great set of stories from Nvidia and from SemiAnalysis about why Wide EP is important. For MoE models it's basically essential. And the level of scale-up parallelism used for it is like 32. So it goes beyond that 8 barrier. And it really, really, really is important to have that NVL72 GB200 NVLink to serve at scale. I don't remember the exact cost improvement, I think against Hopper. Right, against Hopper. With this NVL72 system, you're getting like 35 times cheaper per token for a lot of the curve. Yeah. Which is crazy. Yeah. And that's normalized per GPU, obviously, because the GPU is part of the cost.
[73:25]
Host
And one thing I'm exploring is that this year is also the year of the subagent, where you have the main agent, but then that also kicks off tools which are in themselves agents that have limited agency and such. Low context, locally, whatever.
[73:39]
Kyle
Right.
[73:39]
Host
Different prompts. So for example, one thing that Cognition does is, before you kick off a search, they do like a fast context model, where you kick off a pass that just searches across the codebase. That is better than indexing a lot of the time, not all the time, and you should still index for some things. But the idea is that agents should be able to command subagents and probably run them maybe close to inference as well. I don't know if that's architecturally possible or even.
[74:05]
Kyle
Yeah, we're thinking about that for Dynamo. That's our big theme for the year.
[74:09]
Host
Because if you can design that into your stuff, then a lot more people will use it. Right now it's just kind of theoretical because you do pay a lot of back and forth coordination costs.
[74:18]
Dan
I think you'll net speed up, though. Right? Even at a basic level, with speculative decoding, you're running a small model, you're running two instances, but it's net faster. That is one example.
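[Editor's sketch] The speculative decoding idea mentioned here, a small draft model proposing tokens that the large model then checks, can be sketched in its greedy form as below. This is a toy illustration under stated assumptions: both "models" are plain functions from a token prefix to the next token, the verification loop stands in for what a real engine does in a single batched forward pass, and the usual extra token gained on a fully accepted proposal is omitted for brevity.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # First mismatch: keep the agreed prefix plus the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token was accepted.
            tokens.extend(proposal)
    return tokens


# Toy stand-ins: the draft agrees with the target except at every third position.
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def draft_model(tokens: List[int]) -> int:
    guess = (tokens[-1] * 3 + 1) % 17
    return guess if len(tokens) % 3 else (guess + 1) % 17
```

The design point is that the draft can be wrong without changing the result: every token kept is one the target would have produced itself, so the speedup depends only on how often the draft's guesses are accepted.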
[74:27]
Host
Yes.
[74:27]
Kyle
Yeah, but this is like a little bit like different with like agents. Agents.
[74:31]
Host
Yeah. This is not speculative.
[74:33]
Kyle
I think there's a summarization of that trend that I like to say to my team. So there are two things. This is the year of system as model.
[74:41]
Natter
Right.
[74:42]
Kyle
Where instead of having a single model be a thing, you have a system of models and components that are working together to emulate the black-box model. So when you make an API call to something that's multi-agent in the background, it still looks like an API call to a model. You're still getting back...
[74:58]
Natter
But under the hood.
[74:59]
Kyle
Yeah, under the hood it's like a billion different models, and that's a lot of complexity. With Dynamo and with other libraries at Nvidia, we're looking to help manage that complexity.
[75:06]
Natter
It's funny, for CES we actually just released the model router for DGX Spark, where you can have a local model that's running on the Spark and then also a foundation model, and then the model router decides when to send queries to which one. So it's no longer this either/or. It's use the best of everything that's available to you. You have a good post-trained model
[75:23]
Host
that's running on... Does it also lead to the Brev functionality of being able to manage the Spark?
[75:26]
Kyle
Oh, that'd be cool.
[75:27]
Natter
Oh yeah, I did be able to request. Yeah, there we go.
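[Editor's sketch] The routing idea described above, a local model on the device plus a larger foundation model with a router choosing between them per query, might look roughly like this. The transcript does not say how NVIDIA's model router makes its decision, so the word-count and keyword heuristic here is purely an assumption for illustration, as are the names `Route` and `make_router` and the keyword list.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Route:
    name: str
    handler: Callable[[str], str]  # stand-in for a call into a model


def make_router(
    local: Route,
    remote: Route,
    max_local_words: int = 200,  # assumed threshold, not from the source
    hard_keywords: Tuple[str, ...] = ("prove", "refactor", "multi-step"),
) -> Callable[[str], Route]:
    """Return a function that picks a route for each query using a toy heuristic."""

    def route(query: str) -> Route:
        text = query.lower()
        too_long = len(text.split()) > max_local_words
        looks_hard = any(word in text for word in hard_keywords)
        # Send long or hard-looking queries to the larger remote model.
        return remote if (too_long or looks_hard) else local

    return route
```

A caller would build the two routes once and then dispatch each query with `route(query).handler(query)`. In practice the decision function is the interesting part and could itself be a small model; the heuristic is only a placeholder.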
[75:32]
Kyle
I actually have a question, like I'd like to like extend and flip over how much longer do you guys think like agents are going to be running? Because that's one thing I've been throwing around. Like what happens when.
[75:39]
Host
I mean always on it even affects
[75:41]
Dan
the, like, back to the prefill/decode, right? Codex, I'd say, compared to Claude Code, runs much longer on tasks. That thing will run six, seven, eight hours. I'll run it overnight and go back, and I have a little crappy logging software I use, and there are just times where it's like, I'm going to go deep on research, and it'll eat up 80,000 tokens. Go on another, go on another, just eat through tokens, and, you know, that's part of it. At the end it does hit a long-time task, and I think you only see that there's
[76:12]
Natter
insatiable demand for tokens and every improvement that comes kind of just makes our demand even higher. It's kind of funny, right? If you have a teammate and you ask them to do a task and they're like, should I save some effort and not think too hard about this task? Fuck no.
[76:23]
Dan
I'm in my favor level.
[76:25]
Natter
Too bad.
[76:25]
Dan
You can have four shots, right? Like the original Codex before the app. Why do one call, Give it four attempts. Just use all the tokens. Try more, try again, try more.
[76:37]
Kyle
It's like the METR index, right? It's the thing that tracks how long models are able to run. I expect that we'll just see log-linear, if not log-superlinear, growth. We will see, before the end of the year, an agent that is capable of running for longer than 24 hours with self-consistency the entire time.
[76:57]
Dan
I would also poke at different domains having different desires. At a consumer level I'm getting slightly frustrated at 20 minutes per basic query. Sure you can optimize, you know six, eight hour. I don't see myself shooting off many one week agents. Right. Someone doing like okay GPU kernel research or medical or biological like you know in those domains. Sure shoot off a lot that take a lot of. So like I think it will be somewhat domain specific because you also really need to train that in.
[77:25]
Kyle
Right. You know what's funny one thing is doing your taxes right. Like that's tax. Get it right.
[77:31]
Natter
I wonder if a major use case that's sort of like speculative decoding is your agent figuring out at night what you might be prompting it the next day, and pre-fetching.
[77:40]
Host
Yeah, you can already do that.
[77:41]
Natter
Yeah. Really?
[77:41]
Kyle
Branch. Branch prediction.
[77:42]
Host
Oh well, no, that's too low level, but yes. Sorry. Yeah, yeah, yeah. One question I've got to get in. So we actually did record a pod with the METR folks, who sat right here. Their chart is the human-equivalent hours of work rather than how long the agents themselves are being autonomous, and there's a huge difference. Right? Like human work five hours, agent work 30 minutes. It's actually 30 minutes, not five hours. So that chart that you see is them estimating what the human-equivalent replacement is. I think actually Anthropic released a more recent chart that showed Claude Code autonomy from their production traffic numbers, and that was 20 to 45 minutes. That's roughly where we are. So that's the sort of realistic number. I mean, I do think there are experimental setups, like Ralph Wiggum, where you just prompt it to keep going when it stops. And obviously that can go arbitrarily long.
[78:33]
Natter
I feel like from my experience, yeah, I guess 20 to 40 minutes seems right for when I'm using Codex or Claude Code. But if there's a net new project, I'll often start with Replit, and like it only happened for bleeding. Yeah, like their new V3 agent, it'll spin up a web browser and click around and discover new bugs and just keep churning. So I think my longest was over an hour that it had been churning.
[78:59]
Dan
I think before we see super long running, I think there's going to be a bit of an efficiency hit. So sure, you can take an hour and go down paths, but you also want, you want to be more efficient, you want to be smarter in your reasoning.
[79:11]
Host
Right.
[79:12]
Dan
So I think that'll actually go down before we go back up. Like you don't want to scale non optimized systems just for the heck of it. As much as I love saying use all the tokens. Tokens, they are expensive. Like going from dense to reasoning models, that's an added cost.
[79:26]
Host
Right.
[79:26]
Dan
You're paying for a lot of tokens and it doesn't make sense to just scale stuff that's not optimized. So there's always that little balance. But I think you'll see both sides of it.
[79:36]
Natter
Yeah. So 2023 was super exciting. I think if you were in SF you were like, okay, I know this is going to be a huge world changing moment, but it seemed like no one had known yet and maybe even before. Was it 2022? Maybe? Yeah.
[79:48]
Host
I would say Roon had this tweet where everyone who was in SF from 2021 to 2023 understood what it was like to be already totally unhappy.
[79:56]
Natter
Yeah, 2021, that's when I made my first OpenAI account. Yeah, it was crazy. And I remember it was so funny, because at the time SF had not been doing well. So pretty much what it felt like was the concentration of founders in the city had risen, because where my neighbors were used to doing a bunch of stuff, those people had all left. So the only people that were still in the city were people that really wanted to build. It was cheap tech.
[80:15]
Host
It was. Yeah.
[80:16]
Natter
It was also way cheaper. I feel really bad for anyone who is trying to get rent now. But there was Celo. They had a huge office.
[80:24]
Host
So Blockchain. It like took over the. The old Casper building.
[80:28]
Natter
Yeah, they had the showroom and they had the. Like the. What would I think was like the back warehouse. It was, it was a huge office
[80:33]
Host
and it's right across from OpenAI and Neuralink.
[80:36]
Dan
Yeah, it was in the original arena.
[80:38]
Host
I named the arena because of it.
[80:39]
Natter
Yeah, yeah. And so it was really exciting because, like, Roboflow. I think I forgot Mintlify. Yeah, Mintlify. Brev was there. You guys were there. I remember it was actually there that you bought the AI Engineer domain.
[80:51]
Host
Yeah. I didn't know what I was going to do in AI. I want to do something.
[80:54]
Natter
But it was a really fun moment where we were kind of all in this Celo space. And I don't know, it was a really cool community, especially being so early.
[81:03]
Host
Yeah. And so, Dan, you got me early Cruise access.
[81:05]
Natter
Oh, yeah.
[81:06]
Host
So there was a good period of time with both Cruise and Waymo, which was free.
[81:09]
Kyle
Yeah, always.
[81:11]
Natter
If you had it.
[81:12]
Dan
I mean, they're so back. Celo's open again.
[81:16]
Host
So nature.
[81:17]
Natter
Zoox.
[81:17]
Host
Zoox is doing Zoox and robotaxi. Yeah. So totally.
[81:22]
Natter
But yeah. And so it's actually really cool that you guys have this studio so close to Celo, with this rock climbing gym right around the corner. So, yeah, it's an awesome block.
[81:33]
Host
Yeah.
[81:33]
Natter
Just. And
[81:36]
Host
I do think one thing I try to do with the podcast is like, bring, like, what it's like to be in San Francisco to the rest of the world. And also just like, maybe give El Tepo taqueria.
[81:46]
Natter
Yeah. My favorite tacos in the city. And.
[81:49]
Host
Yeah. Steak and shrimp. I know, it's very good.
[81:51]
Natter
Yeah. And I guess what it's like to be in San Francisco, I think, is just everyone seems to be super supportive sometimes. I feel like the city believes in you more than you do. And even. I don't know if you remember, but I remember posting my first blog post and. And I had met you on Twitter and you gave me like an hour of your time super randomly. And you kind of coached me through writing content for developers. And I was trying really hard not to come off salesy or plug myself. And so I kind of stripped all personality out of the blog post. And you, you brought that out. You're like, people don't. It's okay to talk about what you're doing. Like, you don't have to be weird about it. And I remember just that. I think that really helped me kind of figure out what our voice is and not shy away from it. And so always really grateful for you.
[82:28]
Kyle
Hey, you inject your voice into, like, everything.
[82:32]
Host
Huge advance.
[82:33]
Kyle
Manage to be very genuine about what you care about. Yeah.
[82:36]
Host
Imagine some random person DMs you: can you give me feedback on this blog post? And it's pretty boring. And you're like, fine, he looks interesting. I'll just do a Zoom call. And then you meet this guy. He's so energetic.
[82:49]
Natter
Just be right there.
[82:50]
Host
But I think people are trained to write a certain way in school, and they never see there's a broader world
[82:56]
Natter
out there not to unlearn.
[82:57]
Kyle
Writing is thinking. And everyone thinks differently. So you might as well just write your way.
[83:03]
Dan
Cool.
[83:04]
Host
Well, thank you for indulging us. Really broad-ranging discussion. But I love that you guys are the young faces of Nvidia, with so much energy but also a lot of technical depth. And I think people learned a lot from this session, so thank you.
[83:16]
Natter
This is awesome. Thank you, guys. Thank you for everything that you've done.
[83:20]
Host
Yeah.
[83:20]
Natter
Nga, the podcast, all the above and
[83:22]
Host
See you at GTC.
[83:24]
Kyle
Looking forward to it.
[83:25]
Natter
Yeah.
[83:25]
Host
Cool.
[83:26]
Natter
It's awesome. Thank you, thank you.