
A
Hello and welcome to the Nvidia AI Podcast. I'm your host, Noah Kravitz. Ian Buck is here with us today. Ian is vice president of hyperscale and high-performance computing here at Nvidia, and he's here to discuss mixture of experts, the architecture powering the world's leading frontier models, and how extreme co-design can both drive down the cost of generating intelligence today and future-proof your AI platform for whatever advances come tomorrow. Ian, welcome. Thanks so much for taking the time to join the podcast.
B
Thanks, Noah. Glad to be here.
A
So let's jump right into it. What is mixture of experts, or MoE, as we call it? Why does it matter? If you look at the top 10 open models on the Artificial Analysis leaderboard right now, they all share the MoE architecture. So can you explain, in lay terms, what MoE is and why it's suddenly become the standard for frontier AI?
B
Yeah, it's a great question. It's a term that's used in industry and among AI researchers, but it's not really understood. Like, what does mixture of experts mean? We've all heard of neural networks, and that's what these models are. They're neurons, they're parameters, they're the components of an AI model. When AI got started and really entered the zeitgeist, the neural network was simple: each parameter represented a neuron of the model. We heard about a 1 billion parameter model, then 10 billion, now 100 billion, now trillion parameter models. Those are basically the neurons of the AI brain that you activate when you ask ChatGPT a question. But something happened along the way. As these models got smarter and smarter, they naturally got bigger and bigger. In fact, two years ago, when Llama first came on the scene, there was a 7B Llama and then there was a 70B Llama, and now we have a 405B, a 405 billion parameter model. And that makes them smarter. They have more information, they understand more things, and they give you better answers. But there was a problem. As they got smarter, to get the answer you actually had to activate every neuron in that brain. So while the models were getting more and more intelligent, they were also getting slower and slower, because you had to calculate every neuron and perform all the math on every neuron on a GPU. And then it wasn't one GPU, it was lots of GPUs, and even more. Along the way, researchers came up with this idea. They realized that, just like a human brain, we probably don't need all of these neurons to answer every question. Simple questions probably need just a few neurons, or different parts of the brain may encode different information. Let's just activate those.
So to make the AI cheaper, to make the tokens cheaper (a token is the piece of data that's flying through and eventually becomes a word on the screen), let's only activate the neurons we need to activate. And that's what mixture of experts is. Instead of having one big model, we actually split the model up into smaller experts. Same number of total parameters, but now we train the model to only ask the experts that probably know that information. That's part of the training process to build the model. Once you do that, you can have a model which has maybe 100 billion parameters, 100 billion neurons, but we only ask or activate about 10 billion. That's a compression mechanism. That's a way of making AI cheaper while still being able to encode all the possible information and answer all the questions. So today, most models are achieving higher and higher intelligence scores by having lots of experts and having the model, as it comes up with the answer, ask only the right experts. To put some numbers behind it: we have that Llama 405B, 405 billion parameters. That's one big model. On leaderboards like Artificial Analysis, which you mentioned, it gets an intelligence score of about 28. That 28 is just a weighted score of the benchmarks they tested.
A
Sure.
B
But all 405 billion parameters are getting activated. Fast forward to a modern open model, like OpenAI's GPT-OSS model. It has 120 billion parameters, actually a little bit smaller in total parameters. But when you ask it a question, it only activates on the order of about 5 billion parameters. So instead of 405 billion parameters and all that math and all that cost, it actually only needs to activate about 5 billion parameters. That's like a 10-to-1 or beyond compression, making it cheaper. And it gets an intelligence score of 61. So it's going from 28 to 61 while going from 405 billion active parameters to 5 billion. Way cheaper. It's not 10x cheaper; it's still complicated, and we can talk about why these MoEs are complicated to run. But Artificial Analysis does measure the cost to run the benchmark. To run and calculate the intelligence score for Llama 405B, I think it currently costs about $200 for them to ask a cloud service for all the answers needed to create that score. They asked GPT-OSS the same thing; its tokens are cheaper, and it only cost about 75 bucks. So MoEs are allowing models to get bigger and smarter and also cheaper, and as a result, advancing AI. Now, of course, across the board on all the leaderboards, they're all these mixture-of-experts models.
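Editor's sketch: the comparison above can be put into a few lines of arithmetic. The figures are the approximate ones quoted in the conversation, not independently verified, and the dictionary layout and helper names are hypothetical, chosen only for illustration.

```python
# Figures as quoted by the speaker (approximate; treat as illustrative).
MODELS = {
    "Llama 405B (dense)": {"total_b": 405, "active_b": 405, "score": 28, "bench_cost_usd": 200},
    "GPT-OSS (MoE)": {"total_b": 120, "active_b": 5, "score": 61, "bench_cost_usd": 75},
}


def active_fraction(model):
    """Share of the model's parameters that do math for each token."""
    return model["active_b"] / model["total_b"]


def ratio(dense, moe, key):
    """How many times larger the dense model's value is for a given field."""
    return dense[key] / moe[key]


dense = MODELS["Llama 405B (dense)"]
moe = MODELS["GPT-OSS (MoE)"]

print(f"MoE active fraction: {active_fraction(moe):.1%}")
print(f"Active-parameter ratio (dense / MoE): {ratio(dense, moe, 'active_b'):.0f}x")
print(f"Benchmark cost ratio (dense / MoE): {ratio(dense, moe, 'bench_cost_usd'):.1f}x")
```

With these quoted numbers, the MoE model does math on roughly 81 times fewer parameters per token, yet the benchmark run is only a little under 3 times cheaper, which is the gap the speaker attributes to MoE being complicated to run.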
A
Right. Correct me, bring me back on track if I get off here with the questions, but from kind of a layperson's standpoint, to use that word, if I'm trying to wrap my head around this idea of mixture of experts, are the experts divided up in ways that I might think about knowledge? You know, this expert handles math, and this one handles science, and this one handles, I don't know, visual understanding.
B
Yeah, it's a great question. You know, that is the art of training these things. In fact, it's not hard coded in there. They don't train a separate model for doing math questions and a separate model for telling you how to make a pizza. The beauty of AI is that, with the algorithms that these researchers and scientists and companies like Anthropic and OpenAI and everybody else have figured out, they can just give it the data and encourage the model to sort of clump, to identify and create these little pockets of knowledge. It's not prescriptive. It's driven by the data that it's seeing. It naturally clumps the activity of these different questions onto different experts. And then in front of those experts, there's this thing called a router. The router is able to just look at the string so far, like what the question is, what answer is forming, how it's thinking, and then predict: you know what, this one probably goes to that guy, or this other guy. In fact, today's models may have on the order of dozens of experts on every layer of the model, and there's a little router in between. And they may actually ask not just one expert; at every layer, they may ask two experts or eight experts. And then there's another unit in the model which listens to all the experts. This guy says, "I'm pretty sure I got the right answer." Another says, "Maybe I got the right answer." Others say, "I don't know, I don't know." It combines the answers and then goes on to the next layer. So that's actually the architecture of it. You know, you could train one person, one brilliant scientist. You train an Einstein to be able to answer any question. That's really hard and takes a lot of energy. That's a very expensive person to hire and have on staff.
Instead, maybe I can hire a couple of domain experts, or teach a couple of different people some stuff, and I just give them all that question. They can all answer it very quickly in parallel, and you combine the knowledge. That's actually how we work today. One person is not a company. Companies exist because we have all this expertise around, and the MoE method is basically applying that to AI.
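Editor's sketch: the router, "ask two or eight experts", and combiner described above correspond to what is usually called top-k routing. The toy layer below uses only the Python standard library. The class name, the random weights, and the use of a single linear map per expert are assumptions made for illustration; real models use trained weights and larger expert networks.

```python
import math
import random


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class ToyMoELayer:
    """Hypothetical, untrained MoE layer: a router picks top_k experts per token."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one scoring row per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
        # Each expert is a small dim x dim linear map standing in for an MLP.
        self.experts = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def forward(self, x):
        scores = matvec(self.router, x)
        # Keep only the top_k highest-scoring experts; the rest do no math.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.top_k]
        weights = softmax([scores[i] for i in chosen])
        # Combiner: weighted sum of the chosen experts' outputs.
        out = [0.0] * len(x)
        for w, i in zip(weights, chosen):
            y = matvec(self.experts[i], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, chosen, weights


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen, weights = layer.forward([1.0, 0.5, -0.3, 2.0])
print("experts used:", chosen, "of", len(layer.experts))
```

Only 2 of the 8 experts run for this token, so the layer holds all 8 experts' parameters while doing roughly a quarter of the expert math.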
A
Right.
B
So the models are all trained that way. There are all sorts of training methods to create the conditions where activations can start grouping and gathering together, and you can train these little routers and combiners. Then you just do that across multiple layers, and sure enough, at the end of it you've got a chat model like GPT-OSS or Kimi K2.
A
Yeah. Now, MoE isn't new to 2025. The idea of the architecture has been around for a few years. So has it been in use all along and we just weren't so aware of it? And why has it come to prominence lately?
B
Yeah, the idea of experts is not new in machine learning. You know, before AI there was an idea of combining multiple machine learning models together, and of how to do that statistically to improve the accuracy. There's all sorts of history and math around that.
A
Yeah.
B
Applying it to AI, though, is relatively new. We now know the early versions of ChatGPT were mixture-of-experts models, but that was not publicly known. It really wasn't until the DeepSeek moment, which was about a year ago, that the doors really blew open, because the DeepSeek researchers were the first to really build a world-class MoE-based model in the open. People had written papers about it, but this was one that actually competed and demonstrated intelligence scores that could be level with the closed-source models. And it was a beast. It was awesome. It had 256 experts in every layer. I mean, it did every single optimization, and as a result it was extremely cheap to run. Incredibly complicated, but cheap to run, because it took MoE all the way to the extreme, and many people think that's kind of where OpenAI was, you know, with the original GPT. So once we had that moment, you know, the first time DeepSeek was run on GPU systems, it actually didn't run that well, because we didn't have the infrastructure or even the software to run it that well. The DeepSeek engineers had written all this custom code to make it run. Awesome. But at that point, every researcher realized, hey, this thing's real. We now can see how to do it. They made the whole thing open, and they published the paper. It's a brilliant paper and it shows the opportunity for MoE. Since that moment, you can see that every model has shifted to building MoEs. DeepSeek sort of shined a light on how to do it, how to train it, how to do inference and deploy it, and kicked off that revolution of MoEs that we've been enjoying.
A
Right. So we know the DeepSeek moment was huge, as you just said, for many reasons. Are we going to look back and say, hey, the lights went on then, and, you know, new things will come? But for the moment, is everything MoE? And if not, why? What's the decision-making process? When would you train a model to be MoE and when would you not?
B
You know, I think for all the models that are really focused on providing an intelligent response, it makes a lot of sense why they're MoE. You want to do your best to encode as much knowledge into the neural network as you can, so it just knows things. You don't need to write two plus two on pencil and paper to work out that it's four. You just know two plus two is four. So the more neurons you can throw into a holistic model, the more innate knowledge it has. It doesn't have to work that out in a reasoning chain or other such things. So there's a huge advantage to having models be bigger, as long as we don't increase the cost. And that's why, with MoEs, we want to be able to push the limits of only activating 10%, 5%, 3% of the neurons. More and more experts. You can see that in the research and the way the models are evolving. They're really pushing the limits. Some of the modern models will have 300, 400 experts they're trying to combine. Now, getting all those experts and all that communication to work is complicated. We'll talk about that. But having that foundation model with all of those experts allows them to then apply all the other techniques of inference and reasoning. It allows smaller models to be distilled and fine-tuned for specific tasks. It creates a foundation for the rest of the AI models around the world. Certainly there are some of the smallest models for the more dedicated individual use cases: I've got to put a box around a stop sign, or I've got a Ring doorbell that uses AI to detect if it's a squirrel or not a squirrel. Those small models may not be MoEs. They need to do one specific thing. Probably I can squeeze it down.
I don't need to go to the complexity of an expert system. But anything that wants to be agentic, any kind of agent, and pretty much most of the AIs that we interact with purposefully, they're all MoEs, because they can be thrown at a wide variety of different stuff, and they need to know about it and be able to reason about it. And it makes AI cheaper. It lowers the cost per token. So there's always a drive on cost, a continuous "let's increase intelligence and let's lower cost."
A
I was going to ask you about that, because it seems like there's this focus happening now. You know, generative AI has progressed far enough, and certainly it's everywhere, including the news, the business section if you will. There's this shift from the biggest models, raw speed, the highest scores, to, as you said, how much does this cost, and can we get it to be cheaper while being just as smart, if not more intelligent? So we're calling it tokenomics. Right. Not in the sense of blockchain or crypto tokens, but, as you mentioned, AI systems generating tokens: reasoning tokens, output tokens, what have you. So if we're focused on bringing the costs down, how does a more complex system help? I'm inferring here a little bit, but I would imagine it's more expensive to architect and to train, perhaps not to run, but in total cost. How does a more expensive, kind of premium system actually drive the total cost down?
B
Yeah, there's a wonderful symbiotic relationship that happens in the market between the AI hardware and the models that are being created to serve AI. They inherently kind of have to make sense together. If the hardware offers a certain level of connectivity, a certain GPU performance, a certain memory size, then obviously an AI model that's even bigger is going to be hard to take to market, or even not possible to train efficiently. So from the original Kepler GPUs that were used for those first cat AIs to today's modern GB200 and GB300 NVL72, you can see a pattern where with every new platform we advance the state of the art of what Nvidia is able to offer: the compute performance, the memory performance, the connectivity, the I/O. We'll talk about NVLink. Those things enable the next wave of building, to train the next model, but also to do inference. They add complexity. When we started, we were doing PCIe cards, little basic graphics cards that plugged into the server equivalent of your PC and used the floating-point calculations and the graphics memory to do that computation. And they were great. When the AI revolution took off, we saw that by adding more floating-point calculations and building a bigger GPU, adding things like HBM memory, and increasing the power beyond what a typical PCIe slot will do, we often would increase what was possible in AI not just by the percentage of more FLOPS or more memory bandwidth, but by X factors. And that's really because the AI models they were able to build were bigger, smarter, could run more efficiently, and could do more things. You know, people talk about TCO as the cost. TCO is actually not the goal in itself. If you just want the lowest cost, buy one GPU.
A
Sure.
B
The goal is actually to improve intelligence, and intelligence per dollar, the cost of that intelligence. Or, if we're at the same level of intelligence, say this score of 60 from Artificial Analysis, are we reducing the cost of that intelligence over time? The tokens that people need to buy, or the cost to run it? That's really the goal. In every generation of Nvidia architecture, we're looking to figure out what technologies we can incorporate, expand, double down on, invest in, or pull from the community or from our partners, in order to deliver X factors of performance improvement, where even the existing models, like the current MoEs, could get an X factor of performance improvement. We're not afraid to add more cost and more technology on a per-GPU basis. HBM memory is a lot more expensive than the old-school graphics memory. But it only increases the cost by percentages, whereas because you now have HBM, and the bandwidth it offers, and can connect it to that much floating point, you can deliver an X factor in total end-to-end performance. And we saw that, actually. When DeepSeek R1 came out, the GPU of the time was the Hopper H200 system. Hopper had eight GPUs in a server. They were all connected with NVLink through an NVLink switch, so we could effectively build one giant GPU out of eight GPUs working as one.
A
Right.
B
That was really important. The model was so large it couldn't really fit on a single GPU. It had to use multiple GPUs, and the researchers that built DeepSeek took great advantage of that. The system also had NVLink capability, so we could actually put every expert on a different GPU, and you could parallelize the work, running things even more efficiently, even faster. And because those experts all had to talk to each other, they would do that over NVLink. That was very important. Before we had NVLink, you would have to send things over a PCIe bus, and only one could talk at a time, and it was much slower. Because we have NVLink, all those GPUs can talk to every other GPU at full speed. It's totally unblocked, literally at gigabytes and terabytes a second of bandwidth, without any concern for collisions. It was critical for those DeepSeek researchers to get good performance. If you fast forward: it also happened at a time when, as we can now say, we were in the heart of building what is now the GB200 NVL72, where we scaled up the number of GPUs we can connect from just eight GPUs in a server to 72 GPUs in an entire rack. A 9x multiple. Now, that's a lot more GPUs. So did the cost go up? Certainly. Obviously 9x as many GPUs, an entire rack's worth of GPUs versus a server, is a lot more money.
A
Sure.
B
In fact, we actually even had to add more technology, because we needed to take all those NVSwitches and build a separate NVSwitch plane. It does cost more. But because we did that, we can actually parallelize and improve the performance of DeepSeek R1 even more. We can take all those experts, and instead of having to try to make it all fit and work within only eight GPUs, we can get all 72 GPUs working as one. Generation over generation, being able to further parallelize and run all those experts across the rack increased the performance so much that we got a 15x improvement on running DeepSeek R1, while only adding percentages, about 50% more total cost on a per-GPU basis.
A
Wow. Okay.
B
That actually generated a 10x reduction in the cost per token.
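Editor's sketch: the step from "15x performance at about 50% more cost per GPU" to "10x lower cost per token" is a single division. The function name is hypothetical, and the inputs are the round figures quoted in the conversation.

```python
def relative_cost_per_token(perf_gain, cost_multiplier):
    """New cost per token as a fraction of the old one.

    perf_gain: how many times more tokens each GPU produces.
    cost_multiplier: how many times more each GPU costs (1.5 means 50% more).
    """
    return cost_multiplier / perf_gain


fraction = relative_cost_per_token(perf_gain=15.0, cost_multiplier=1.5)
print(f"New cost per token is {fraction:.0%} of the old cost")
print(f"That is a {1 / fraction:.0f}x reduction")
```

The same relationship explains why adding cost in percentages is acceptable when the performance gain comes in multiples.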
A
Right, right, right, right.
B
So we do have to add more technology. We want to keep adding more technology. Nvidia is a technology company. But we turn that technology back into performance, which, in the net of it, reduces the cost per token, because those 72 GPUs are that much faster, and as a result you can get more out of that rack on a per-GPU basis. And we've taken it down from what was Hopper, where it cost about $1 to generate a million tokens.
A
Okay.
B
Roughly a million words. It's now down to about 10 cents. So people look at this rack and they see it's really expensive.
A
Right.
B
But the way you do that is you put all that investment into NVLink, into all the connectivity and all the next-generation hardware, and you also do all the software work to make it all work really well. And generation over generation, you get that multiple, the 10x multiple and the reduction in cost. That's just one model. That same story is playing out for GPT-OSS and everything else. And those are models that were built and trained and designed for Hopper.
A
Right.
B
You know, we're starting to see some models come out that are trained on Blackwell, and you're going to see that raise the bar and go even further. So this is the virtuous cycle that we've been working so fervently to help make happen. We might add percentages in terms of cost and complexity on a per-GPU basis, but we aim in every generation to deliver X factors of performance, and as a result we dramatically lower the cost per token, by that 10x.
A
As I'm listening to you describe NVLink and the advances in getting the experts, getting the GPUs, to communicate and kind of act as one, I can't help but think we need NVLink for, like, Teams meetings, so that instead of talking over each other, everybody can just communicate as one at the speed of light.
B
That's right.
A
I'm speaking with Ian Buck. Ian is vice president of hyperscale and high-performance computing at Nvidia, and we're discussing mixture of experts and why it's become the architecture (well, it has been for a while, but it's now gaining public prominence, if you will) behind so many leading frontier models, and what goes into not only architecting and training the models, but the infrastructure that really makes them...
B
Hum.
A
And Ian, I wanted to ask you, and you talked about this a little bit with NVLink and all of the technologies you alluded to as you were describing the MoE architecture: what is it specifically about these Nvidia systems that makes them such a good and such a unique fit for these complex MoE models, and able to achieve, as you just described, this lowering of the cost of intelligence measured per token?
B
Yeah, it's interesting and understandable. It goes back to the original idea about having experts. We're reducing the cost per token by not turning on every neuron, but only turning on the ones we need. It's a cost savings. And we talked about Llama, the 405 billion parameter Llama model. In order to use it, you've got to activate all 405 billion of those neurons, even though they're not all needed.
A
Right.
B
Look at GPT-OSS. It's 120 billion parameters, still a lot, but you only need about 5 billion of them. It's smart, and it's a cost-saving measure: it only does five. You'll also notice, though, that's like 10x less, actually more than 10x. It's about 1% of the number of neurons we're actually doing math on. Unfortunately, the cost isn't. On GPT-OSS, the cost is not 1%. It is an X factor lower, about 3x less cost, but it's not, you know, 1% of the cost.
A
Sure, yeah.
B
There's a hidden tax to MoE, and it's all about how those experts need to communicate with each other in order to get MoEs to run efficiently. Those experts are all doing their math very, very fast, and they all need to communicate with each other very, very quickly. One of the challenges with MoEs, as we get sparser and sparser, which makes the models more and more valuable and saves more and more cost, is this: can we make sure that all that math is happening and all those experts can talk to each other without ever going idle, without ever waiting for a message? You're buying those GPUs. You're paying for them so they can do the math they need to do, not to sit around and wait for someone else to send them something. Or worse, the network that connects all these GPUs gets gummed up, and now everyone is sitting idle, and that's going to go straight to the bottom line of the cost. So that's the key part: the hidden cost of MoE is communication. We've looked at whether we can make it work with just point to point. Like, maybe I can just connect this GPU with this GPU, and this GPU with that GPU. It would be a much lower cost to just directly wire them up. But there's a limit to how much I can do that. If I take one GPU and I connect it to four, well, this GPU's I/O is now split four ways, and I can only take that so far. Even with our Hopper systems, we had eight, and there was an NVSwitch chip. We built another chip specifically for this, but we couldn't scale beyond those eight, because that's the chip. So if you have a point-to-point or a torus-like network, you're fundamentally limited in how far you can take MoE and how cheap you can make those tokens, because the hidden cost is communication. And that's also true if you try to go bigger with a neighboring or point-to-point connection, or some kind of loop or message-passing scheme, or a fabric like Ethernet.
They weren't designed for this. The best answer is no compromises. I want this expert, this GPU, to be able to talk to every other expert at full speed. No limitations, no worry about congestion. I want a network connecting these things where there's nothing blocking.
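Editor's sketch: a simplified model of the two points made above, namely that direct wiring splits a GPU's I/O across its links and that time spent waiting on communication is paid for but idle. All numbers in the usage lines are hypothetical placeholders, not Nvidia specifications, and the functions ignore real-world effects such as congestion and overlap of compute with communication.

```python
def direct_link_bandwidth(total_io, num_peers):
    """Bandwidth to one peer when a GPU's I/O is statically split across direct links."""
    return total_io / num_peers


def switched_link_bandwidth(total_io, num_peers):
    """Peak bandwidth to one peer through an idealized non-blocking switch.

    The GPU's full I/O is available to whichever peer it is talking to,
    regardless of how many peers are attached.
    """
    return total_io


def gpu_utilization(compute_ms, comm_wait_ms):
    """Fraction of wall-clock time a GPU spends doing math rather than waiting."""
    return compute_ms / (compute_ms + comm_wait_ms)


TOTAL_IO = 1000.0  # hypothetical units of bandwidth per GPU
for peers in (4, 8, 72):
    direct = direct_link_bandwidth(TOTAL_IO, peers)
    switched = switched_link_bandwidth(TOTAL_IO, peers)
    print(f"{peers:>2} peers: direct {direct:7.1f} per link, switched {switched:7.1f}")

# Hypothetical timings: the same math with more or less time spent waiting.
print(f"utilization with 1 ms wait: {gpu_utilization(4.0, 1.0):.0%}")
print(f"utilization with 4 ms wait: {gpu_utilization(4.0, 4.0):.0%}")
```

In this model the per-link bandwidth of direct wiring shrinks as peers are added, while the switched figure stays flat, which is the property the speaker describes as "nothing blocking".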
A
Yeah.
B
And that's what NVLink is. In fact, that chip that we built is specifically designed to make sure that every GPU, with all of its terabytes a second of bandwidth, can talk to every other chip at full speed and never compromise on the maximum I/O bandwidth we can get out of every GPU. We did that with Hopper, eight-way. One of the big innovations, and obviously it took a lot of engineering, was to make that 72 in a rack. Every one of those 72 GPUs can talk to every other one at full speed, no constraints. And you can see that taking off. You can see the benefit. That allows people to go even further and build even bigger models. The Kimi K2 model is even bigger than the GPT one. We now have an open-source trillion-parameter model, Kimi K2, yet it only uses 32 billion parameters when you ask it a question. That's like a 3% activation of the brain. But it's incredibly complicated: 61 layers, over 340 experts. They've all got to talk to each other. As a result, we now have open models at trillion-parameter scale and that level of intelligence, and the cost is comparable to, and even lower than, what we could ever possibly have with a fully dense model. It's possible because of that NVLink connectivity. Nvidia is committed to, like, let's keep going down that path. We have some of the world's best SerDes engineers, signal processing engineers, wire engineers, and mechanical engineers to make all that work without having costs explode, and to make it all connected. Every one of those GPUs, by the way, is connected with a copper wire to one switch and to another switch. There's a reason why it all sits in the rack. It's because we're running at 200 gigabits per second on every one of those wires. It's PAM4 signaling, so it's like four levels per wire. It's a 0, 1, 2, 3, not a 0, 1. We've gone past binary at this point. And it's going so fast that its wavelength is about a millimeter, I think.
So, you know, we're pushing the limits of physics.
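Editor's sketch: PAM4, as commonly defined, uses four amplitude levels, so each symbol on the wire carries two bits instead of one. The encoder below illustrates that idea only. The Gray-coded mapping is a common convention and is assumed here for illustration; the conversation does not say which mapping NVLink uses, and the function names are hypothetical.

```python
# Assumed Gray-coded mapping from bit pairs to the four PAM4 levels.
GRAY_MAP = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
INVERSE_MAP = {level: pair for pair, level in GRAY_MAP.items()}


def pam4_encode(bits):
    """Pack a bit sequence into PAM4 levels, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_MAP[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def pam4_decode(levels):
    """Recover the original bits from a sequence of PAM4 levels."""
    bits = []
    for level in levels:
        bits.extend(INVERSE_MAP[level])
    return bits


message = [0, 0, 0, 1, 1, 1, 1, 0]
symbols = pam4_encode(message)
print(f"{len(message)} bits sent as {len(symbols)} symbols: {symbols}")
```

Sending two bits per symbol means a link can carry a given bit rate at half the symbol rate that two-level signaling would need.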
A
Yeah.
B
Keeping it all nice and tight, and also doing everything in copper for low cost. We're super happy with GB200 and what it's been able to do for inference, driving the cost of tokens down, down, down while intelligence goes up, up, up.
A
So is this getting into what we call extreme co-design?
B
Yeah. One of the joys of working at Nvidia is that we're the one company that works with every company in AI, right?
A
Yes.
B
And you know, we work with them on building their data centers, getting the latest GPUs to them, and explaining the NVL72 architecture, and we help build a lot of the software that they use. We have teams working on PyTorch, on JAX, on SGLang, on vLLM, and all the software that's out there. And as these model makers are building new models and pushing the limits, some inside Nvidia now, actually, but all around the world, we can co-design with them how to get the maximum utility out of those 72 GPUs and manage that hidden cost of communication. To make sure every GPU is running at 110%, computing on the fewest possible neurons, and doing that seamlessly and incredibly fast. All the while thinking about the next model: what's that next GPT, that next vision model, the next video model, the next Sora? And making smart decisions about how to add more bandwidth, more communication, more NVLink, and the right kind of floating point, all without blowing out cost or blowing out power, and leveraging all the work that they've done to date so that it can be applied moving forward. This is the extreme co-design that we do at Nvidia, and that the folks I get to work with, who are probably watching this, get to enjoy. We work really, really hard to continuously improve performance, not just to be the fastest, but also to reduce the cost. You talked about tokenomics. If our software alone can increase performance by 2x, you've now reduced the cost per token by 2x, direct to the user and the customer or whoever's going to deploy this AI. I was on a call this morning. We got a model from a customer. They wanted some help. We applied the latest NVFP4 techniques, the latest kernel fusions, the latest NVLink communication and I/O overlaps. Within two weeks we hit 2x on their model and gave them the code back. And, you know, we're not done.
There are so many places where we can optimize. I think a lot of people get confused. They see a GPU with a certain number of FLOPS and they say, oh, that's better, that's faster. I'll tell you, this stuff's pretty complicated. Managing and running 72 GPUs with 348 experts and all the different kernels and all the different AI and all the different math. We didn't even talk about KV caching and reasoning models and all the tricks and techniques. That's an end-to-end problem. It requires extreme co-design between the hardware, what's already possible, the model builders themselves, and the dense and deep software stack that runs on it. Nvidia actually has more software engineers than hardware engineers, specifically for that purpose.
A
Right. Yep. So to kind of zoom out for a second, and harkening back to what you just said about thinking about what's next: we've been talking about MoE predominantly in the context of language models, and the GB200 NVL72 is really well suited to that architecture. But is there a risk of focusing too narrowly on this single model trend of MoE? What happens when we get, you know, sort of beyond MoE? Is the architecture still well suited? Is the cost of tokens still going down? How do you think about that going forward? And how is the design that Nvidia has today ready for whatever the next trend might be?
B
Well, one clear trend in AI is that intelligence creates opportunity. As the models get smarter, as they start to learn new things or as they specialize in certain areas, they create opportunities to advance that industry, that science, that application. Or just make computers more productive for you and me every day.
A
Yeah.
B
And in order to do that, we need to make the models themselves smarter. We need to use techniques like reasoning, which is only going to generate more tokens. And the only way to advance the state of the art of AI... well, there's lots of ways. One way Nvidia can help is just to reduce the cost of tokens. And for doing that, MOE is just an optimization technique: if you don't need all the neurons, don't waste time computing on them. That's an idea that's not unique to LLMs and chatbots. That's just a good idea. So we see it may materialize in different ways, and how these networks and experts want to communicate, or the shape of the models, is actually diversifying in lots of ways. There's lots of different techniques. Mixture of experts is certainly one of them that will stick around for a while. There's lots of other hybrid approaches and other things that people are talking about, or trade-offs that you can make in order to reduce cost. But we see MOEs happening not just in chatbots, but similar sparsity MOE expert applications being done in vision models and video models. The models are expanding into science, not just generating tokens which turn into words that you and I talk about, but working on proteins or material properties, or on things like robotics, path planning, logic, or business applications. All of those will benefit from having a large intelligent model that can be sparsely optimized to use only the part that is needed for that particular question in that particular use case. You can always go back down to the squirrel detector in a doorbell. Yeah, but there's usually a benefit to having a model that is actually able to reason, or has some multimodal aspects, maybe listens to what's going on, sees the things around it, and is able to make intelligent decisions. That is going to continue to grow. And Nvidia is not just working on MOEs.
We've got lots of different irons in the fire. There's lots of different models. The models are diverse. I get to work in HPC as well. And the whole supercomputing community has now embraced AI, building all sorts of models for simulating physics and simulating the weather and things that look nothing like chatbots. But they're going to use MOEs. They're going to use every trick in the book because the opportunity is huge. The ability to revolutionize biology, to do drug discovery for cancer research alone, is an investment that the whole world's making right now. And they can take these ideas and take our platform and apply them to their domain, their problem. They can take an open source model or a general model and fine-tune it to be a science model or an application-specific model or a business model. That is possible because they're starting from a really intelligent model that can learn or be used to teach another model. So I'm super excited about MOEs. And we'll continue to work on reducing the cost of every token. And while that may make our technology bigger, smarter, more complicated at times, and will make it more expensive, it is going to deliver X factors of capability and intelligence improvement and, as a result, dramatically lower the cost per token.
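The idea of computing on only the neurons a question needs can be sketched as top-k expert routing. This is a toy illustration under stated assumptions, not the routing used by any particular model: the router scores are random stand-ins for a learned gating layer, each "expert" is a simple scaling function rather than a feed-forward network, and all names (`route_top_k`, `moe_forward`, `NUM_EXPERTS`) are hypothetical. Renormalizing the weights over the selected experts is one common choice among several.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k):
    """Run only the selected experts and combine their outputs by weight."""
    selected = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in selected)
    return output, [i for i, _ in selected]


# Toy setup: 8 experts, each one just scales its input by a different factor.
NUM_EXPERTS = 8
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

# Random stand-in for the scores a learned router would produce for one token.
rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(2.0, experts, router_scores, k=2)
active_fraction = len(used) / NUM_EXPERTS
print(f"experts used: {used}, fraction of experts computed: {active_fraction}")
```

With 2 of 8 experts selected, only a quarter of the expert computation runs for this token, while the full set of experts remains available for other inputs.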
A
Ian, for listeners who want to dive in further, we could talk about this all day, but you have things to go build and customers to take care of and all that good stuff. Where can listeners go online? What's the best place to start to dive into MOEs, to the infrastructure you've been talking about, to any and all of it.
B
I'd check out GTC. You know, we started this conference a few years ago, over a decade ago, I guess. I was there for the first one. It's called the GPU Technology Conference.
A
Right.
B
It's not a business conference, although obviously many business people show up. It's not a demo conference, it's a developer conference.
A
Yeah.
B
And if you want to learn more, go check out GTC. We put all the presentations online. Jensen's keynote is wonderful. He'll explain it even better than I can. We actually do a few a year now. I encourage you to check out GTC, go see the old ones, and if you're going to be in San Jose in March, please come and attend. There's tons of sessions at every level, from beginner to deep dive. If you want to go down to the hardware, all the Nvidia experts will be there, all of the different developers are going to be there. It is kind of the go-to place to go learn and also present your work on what you can do with GPUs and the state of the art of AI. Check it out.
A
Perfect. Ian Buck, again, thank you. And you know, for what it's worth, Jensen's an amazing presenter. You did a great job explaining all this, so we appreciate you taking the time. And as always, all the best to you and your teams on continued progress.
B
Thank you.
Lowering the Cost of Intelligence With NVIDIA’s Ian Buck
Date: December 29, 2025
Host: Noah Kravitz
Guest: Ian Buck, VP of Hyperscale and High-Performance Computing, NVIDIA
This episode explores how mixture of experts (MoE) architectures enable leading “frontier” AI models to be smarter, more cost-effective, and more scalable. Host Noah Kravitz and guest Ian Buck discuss the technical and strategic advances—especially in NVIDIA hardware and software—that have driven the current AI landscape, focusing on how advances in infrastructure and “extreme co-design” are making intelligence dramatically cheaper and more accessible.
(00:43 – 08:26)
"Instead of having one big model, we actually split the model up into smaller experts... Now we only ask the... experts that probably know that information." — Ian Buck (03:22)
(05:39 – 08:26)
"The model as it comes up to the answer asks only the right experts... And that's actually how we work today. One person is not a company.” — Ian Buck (07:09)
(08:26 – 11:03)
"Deepseek sort of shined a light on how to do it, how to train it, how to do inference and deploy it and sort of kicked off that revolution of MOEs..." — Ian Buck (10:15)
(11:03 – 13:17)
"Anything that wants to be agentic... and pretty much most of the AIs that we interact with purposefully... they're all MOEs because... they need to be able to reason about a wide variety of different stuff." — Ian Buck (12:18)
(13:17 – 20:58)
"Because as those experts all had to talk to each other, they would do that over NVLink. That was very important..." — Ian Buck (17:53)
“That actually generated a 10x reduction in the cost per token.” — Ian Buck (19:53)
(22:11 – 31:19)
“This is the extreme co-design that we do at Nvidia... not just to have the fastest and be the fastest, but also to reduce the cost because… if just our software alone could increase performance by 2x you've now reduced the cost per token by 2x…” — Ian Buck (30:27)
(31:19 – 35:52)
“We see MOEs happening not just in chatbots, but similar sparsity MOE expert applications being done in vision models and video models… The ability to revolutionize biology… and drug discovery for cancer research alone is an investment that the whole world's making right now.” — Ian Buck (33:50)
(35:52 – End)
“If you want to learn more, go check out GTC. We put all the presentations online. Jensen’s keynote is wonderful. He has a... he'll explain it even better than I can…” — Ian Buck (36:30)
Explore NVIDIA’s GTC resources for deeper dives and keep an eye on the rapidly evolving landscape of AI model innovation.