NVIDIA AI Podcast – Ep. 284
Lowering the Cost of Intelligence With NVIDIA’s Ian Buck
Date: December 29, 2025
Host: Noah Kravitz
Guest: Ian Buck, VP of Hyperscale and High-Performance Computing, NVIDIA
Episode Overview
This episode explores how mixture of experts (MoE) architectures enable leading “frontier” AI models to be smarter, more cost-effective, and more scalable. Host Noah Kravitz and guest Ian Buck discuss the technical and strategic advances—especially in NVIDIA hardware and software—that have driven the current AI landscape, focusing on how advances in infrastructure and “extreme co-design” are making intelligence dramatically cheaper and more accessible.
Key Discussion Points & Insights
1. Mixture of Experts (MoE) Explained
(00:43 – 08:26)
- Definition & Analogy
- MoE models divide a large neural network into many “expert” sub-networks (experts), each specializing in certain types of knowledge.
- Only the relevant experts are activated in response to a specific input or question, much like routing a question to a domain specialist within a company or research team.
- Why MoE Matters
- As AI models grew from roughly 1 billion to more than 405 billion parameters, activating every parameter for every query became slow and expensive.
- MoE enables larger, smarter models while dramatically reducing computation and cost, because only a small subset of the network is activated per query.
- Example: The dense Llama 405B model uses all 405 billion parameters for every answer and cost roughly $200 to benchmark (intelligence score 28); a newer MoE model such as GPT-OSS (120B total parameters, only ~5B active per query) scores higher (61) at a fraction of the cost (~$75). A back-of-the-envelope comparison of per-token compute follows at the end of this section.
- Notable Quote:
"Instead of having one big model, we actually split the model up into smaller experts... Now we only ask the... experts that probably know that information." — Ian Buck (03:22)
2. How Experts Are Organized & Routed
(05:39 – 08:26)
- Self-organization, Not Hard-coding
- Experts are not hard-coded for subjects (e.g., math, science); rather, the AI “learns” to group knowledge naturally during training.
- An internal “router” module directs each query or token to the relevant experts at each layer, potentially consulting multiple experts per layer.
- Architecture Details
- The outputs of the selected experts are combined, much like weighing the advice of several specialists before making a final decision (a minimal routing-and-combining sketch follows after the quote below).
- Quote:
"The model as it comes up to the answer asks only the right experts... And that's actually how we work today. One person is not a company.” — Ian Buck (07:09)
3. The Rise and Impact of MoE
(08:26 – 11:03)
- While MoE isn't a new idea in machine learning, its application to large-scale AI has recently become industry-standard.
- The pivotal moment: the DeepSeek model, released about a year before this episode, showed what MoE could do at scale: 256 experts per layer, heavily optimized, and extremely cost-efficient, though technically complex to train and serve.
- DeepSeek’s open research and code demonstrated how to implement and deploy MoE at scale, catalyzing industry-wide adoption.
- Quote:
"Deepseek sort of shined a light on how to do it, how to train it, how to do inference and deploy it and sort of kicked off that revolution of MOEs..." — Ian Buck (10:15)
4. Why Not All Models Use MoE
(11:03 – 13:17)
- Large, general-purpose, agentic, or reasoning-intensive AIs favor MoE for cost-effective intelligence and scalability.
- Small, task-specific models (e.g., object detection in a camera) may not benefit from MoE’s complexity.
- The cost and intelligence advantages push researchers to use MoE for nearly all cutting-edge models.
- Quote:
"Anything that wants to be agentic... and pretty much most of the AIs that we interact with purposefully... they're all MOEs because... they need to be able to reason about a wide variety of different stuff." — Ian Buck (12:18)
5. Tokenomics: Cost, Performance & the Hardware-Software Cycle
(13:17 – 20:58)
- Balancing Cost and Complexity
- Tokenomics: Not crypto, but measuring the economic cost per AI-generated token/word.
- MoE makes running inference (using the model) much cheaper, but building MoE systems is technically more demanding.
- Hardware Evolution
- GPU advancements (compute, memory, interconnects) directly enable bigger, smarter, and cheaper-to-run models.
- Key innovation: NVLink and NVSwitch enable massive GPU-to-GPU communication—critical for efficient MoE operation.
- Generational Improvements
- Example: DeepSeek R1 on Hopper-generation hardware (8 GPUs per server, unified with NVLink) versus the newer GB200 NVL72 (72 GPUs unified in a single rack). The larger system costs more up front, but faster GPU-to-GPU communication yields a 10–15x reduction in cost per token, e.g., from roughly $1 to $0.10 per million tokens (a simple cost-per-token calculation follows at the end of this section).
- Notable Quotes:
"Because as those experts all had to talk to each other, they would do that over NVLink. That was very important..." — Ian Buck (17:53)
“That actually generated a 10x reduction in the cost per token.” — Ian Buck (19:53)
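As a back-of-the-envelope illustration of what "cost per token" means in practice, the snippet below divides a system's hourly cost by its token throughput. All input numbers are hypothetical placeholders; the episode only cites the end result of roughly $1 dropping to $0.10 per million tokens.

```python
# Hypothetical "tokenomics" illustration: cost per million generated tokens.
# The hourly rates and throughputs below are made-up placeholders; the episode
# only cites the end result of roughly $1 -> $0.10 per million tokens.

def cost_per_million_tokens(system_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars to generate one million tokens at the given hourly system cost
    and aggregate token throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return system_cost_per_hour / tokens_per_hour * 1e6

# A system at $98/hour serving 27,000 tokens/s lands near $1 per million tokens.
print(round(cost_per_million_tokens(98.0, 27_000), 2))    # ~1.01
# A rack costing ~3x more per hour but delivering ~30x the throughput lands
# near $0.10 per million tokens -- roughly the 10x drop described on the show.
print(round(cost_per_million_tokens(294.0, 820_000), 2))  # ~0.10
```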
6. NVIDIA’s Unique Value for MoE & Extreme Co-Design
(22:11 – 31:19)
- What Makes NVIDIA Hardware Special
- Purpose-built interconnects (NVLink, NVSwitch) minimize communication bottlenecks in large MoE deployments.
- Within a rack, each GPU can communicate at full speed with every other GPU ("no compromises").
- Innovations in signaling (e.g., PAM4), engineering, and rack-level design push the edge of physics while managing cost.
- Extreme Co-Design
- NVIDIA collaborates closely with AI developers, co-optimizing hardware, software, and models to extract maximum performance and minimize hidden costs.
- Teams optimize everything from PyTorch integration down to low-level memory and communication kernels, often doubling performance for customer models within weeks; a generic sketch of the expert-to-expert communication pattern follows after the quote below.
- Quote:
“This is the extreme co design that we do at Nvidia... not just to have the fastest and be the fastest, but also to reduce the cost because…if just our software alone could increase performance by 2x you've now reduced the cost per token by 2x…” — Ian Buck (30:27)
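The communication the experts need ("experts all had to talk to each other... over NVLink") is typically an all-to-all exchange: each GPU hosts a few experts, and every GPU sends each token to whichever GPU holds that token's selected expert. The torch.distributed sketch below shows that generic dispatch-and-return pattern; it is a heavily simplified illustration (equal-sized, pre-bucketed sends, a placeholder expert), not NVIDIA's or any model maker's actual kernels.

```python
import torch
import torch.distributed as dist

def dispatch_tokens_to_experts(local_tokens: torch.Tensor) -> torch.Tensor:
    """Generic expert-parallel dispatch: every rank exchanges an equal-sized
    bucket of tokens with every other rank (all-to-all), applies its local
    expert to what it received, then exchanges the results back.

    Assumes local_tokens has shape (world_size * tokens_per_rank, d_model) and
    is already sorted into per-destination buckets -- a simplification; real
    systems exchange variable-sized, router-selected buckets.
    """
    received = torch.empty_like(local_tokens)

    # Over NVLink/NVSwitch-connected GPUs this exchange is the step that makes
    # MoE inference fast: every GPU talks to every other GPU at full bandwidth.
    dist.all_to_all_single(received, local_tokens)

    # Run this rank's local expert on the tokens it received (placeholder op).
    processed = torch.tanh(received)

    # Send each processed token back to the rank that originally held it.
    returned = torch.empty_like(processed)
    dist.all_to_all_single(returned, processed)
    return returned

if __name__ == "__main__":
    # Typically launched with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    tokens = torch.randn(dist.get_world_size() * 128, 1024, device="cuda")
    out = dispatch_tokens_to_experts(tokens)
    dist.destroy_process_group()
```

Because these two exchanges happen at every MoE layer, link bandwidth between GPUs directly shows up in the cost per token, which is the point Buck makes about NVLink and NVSwitch.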
7. Future-Proofing: Beyond MoE and Towards Universal Intelligence
(31:19 – 35:52)
- MoE is an optimization, not a destination—future model architectures may look different, but the core principle (sparse, expert-driven computation for efficiency) will remain relevant.
- Similar methods are already appearing in vision, scientific, multi-modal, and supercomputing contexts—beyond language models.
- Reducing token cost enables models to be smarter and more widely applied—such as drug discovery, protein modeling, robotics, etc.
- NVIDIA’s infrastructure is designed to be flexible and extensible for whatever shapes future AIs take.
- Quote:
“We see MOEs happening not just in chatbots, but similar sparsity MOE expert applications being done in vision models and video models… The ability to revolutionize biology… and drug discovery for cancer research alone is an investment that the whole world's making right now.” — Ian Buck (33:50)
8. Resources and Further Learning
(35:52 – End)
- Recommended Resource:
- The GPU Technology Conference (GTC) is NVIDIA’s primary venue for technical deep-dives, keynote presentations, and community innovation—covering MoE architectures, hardware, and future trends.
- Quote:
“If you want to learn more, go check out GTC. We put all the presentations online. Jensen’s keynote is wonderful. He has a... he'll explain it even better than I can…” — Ian Buck (36:30)
Notable Quotes & Memorable Moments (with Timestamps)
- “Instead of having one big model, we actually split the model up into smaller experts... Now we only ask the experts that probably know that information.” — Ian Buck (03:22)
- “One person is not a company. Companies exist because we have all this expertise around and the MOE method is basically applying that to AI.” — Ian Buck (07:09)
- “Deepseek sort of shined a light on how to do it, how to train it, how to do inference and deploy it and sort of kicked off that revolution of MOEs…” — Ian Buck (10:15)
- “That actually generated a 10x reduction in the cost per token.” — Ian Buck (19:53)
- “This is the extreme co design that we do at Nvidia...if just our software alone could increase performance by 2x you've now reduced the cost per token by 2x…” — Ian Buck (30:27)
- “The ability to revolutionize biology… and drug discovery for cancer research alone is an investment that the whole world's making right now.” — Ian Buck (33:50)
- “If you want to learn more, go check out GTC. We put all the presentations online. Jensen’s keynote is wonderful. He has a... he'll explain it even better than I can…” — Ian Buck (36:30)
Key Takeaways
- MoE has enabled AI models to become both more intelligent and more scalable by leveraging specialist networks—dramatically reducing the cost of generating “intelligence.”
- NVIDIA’s hardware and software, especially innovations like NVLink and extreme co-design with model makers, are crucial for the rapid pace of AI progress.
- Lowering the cost per token is the central metric for measuring progress—not just for language models but across the full spectrum of AI applications.
- The infrastructure and techniques being built today are flexible and extensible, aimed at whatever direction AI takes next.
Explore NVIDIA’s GTC resources for deeper dives and keep an eye on the rapidly evolving landscape of AI model innovation.
