BG2Pod Summary: AI Semiconductor Landscape feat. Dylan Patel
Episode Title: AI Semiconductor Landscape feat. Dylan Patel
Host/Author: BG2Pod
Release Date: December 23, 2024
Participants: Dylan Patel (SemiAnalysis), Bill Gurley, Brad Gerstner
1. Introduction
In this episode of BG2Pod, hosts Bill Gurley and Brad Gerstner sit down with Dylan Patel of SemiAnalysis. The conversation covers the evolving AI semiconductor landscape: the technical architectures, market dynamics, and strategic investments shaping the industry. The panel aims to provide a snapshot of semiconductor activity amid the AI surge, offering perspectives for investors and tech enthusiasts alike.
2. Scaling Narrative and Data Center Build-outs
Dylan Patel opens the discussion by challenging the prevailing narrative that scaling in the semiconductor industry is declining. He asks, "Is scaling dead? Then why is Mark Zuckerberg building a 2 gigawatt data center in Louisiana?" [00:00]. Patel points to the extensive investments by tech giants like Amazon, Google, and Microsoft in multi-gigawatt data centers and high-bandwidth fiber connections, geared toward unprecedented scale and connectivity.
Key Points:
- Massive investments in data centers indicate that scaling remains a critical focus.
- Super high bandwidth connections allow multiple data centers to function as a unified entity for large-scale AI tasks.
- The perceived decline in scaling is contradicted by the strategic expenditures of leading tech companies.
3. Nvidia's Dominance and Moats
A significant portion of the discussion centers on Nvidia's predominant role in the AI semiconductor market. Dylan Patel attributes Nvidia's success to its superior integration of hardware, software, and networking capabilities, describing it as a "three-headed dragon" [07:02]. He emphasizes that Nvidia not only leads in GPU hardware but also excels in software (like CUDA) and networking (such as the Mellanox acquisition).
Notable Quotes:
- "Every semiconductor company in the world sucks at software except for Nvidia." [07:02] – Dylan Patel
- "Jensen is probably the most paranoid man in the world." [11:28] – Dylan Patel
Key Points:
- Nvidia's comprehensive approach provides a competitive moat that is difficult for other companies to replicate.
- Continuous innovation and rapid deployment of new technologies keep Nvidia ahead.
- Superior software and networking solutions enhance the performance and scalability of Nvidia's hardware.
4. The Debate on AI Scaling Laws
The conversation shifts to the scaling laws of AI, particularly the balance between model parameters and data. Dylan Patel references DeepMind's Chinchilla paper, which derives the compute-optimal ratio of training data to model parameters [23:31]. He argues against the notion that data scarcity will halt AI advancements, suggesting that synthetic data generation and new methodologies can sustain growth.
Notable Quotes:
- "Pre-training scaling laws are pretty simple, right? You get more compute and then I throw it at a model and it'll get better." [23:31]
- "Scaling laws are a log, log axis... we're not here, we haven't pushed it to billions of dollars spent on synthetic data generation." [27:37]
Key Points:
- Optimal scaling involves a balanced increase in both model size and data.
- Synthetic data and augmented training techniques can compensate for data limitations.
- The debate centers on whether scaling efficiency has reached its peak or can continue through innovation.
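The Chinchilla trade-off discussed above can be sketched numerically. A commonly cited rule of thumb from the paper is that training compute is roughly C ≈ 6·N·D FLOPs (N parameters, D tokens) and that the compute-optimal dataset is roughly 20 tokens per parameter. The sketch below, under those two assumptions (the 1e24 FLOP budget is an illustrative figure, not one from the episode), derives the optimal model and dataset size for a given budget:

```python
import math

def chinchilla_optimal(compute_flops):
    """Rough compute-optimal split under the Chinchilla heuristics:
    training FLOPs C ~= 6 * N * D, and optimal D ~= 20 * N.
    Substituting gives C ~= 120 * N**2, so N = sqrt(C / 120)."""
    params = math.sqrt(compute_flops / 120)
    tokens = 20 * params
    return params, tokens

# Illustrative budget of 1e24 training FLOPs (a hypothetical figure)
n, d = chinchilla_optimal(1e24)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
# → ~91B parameters, ~1.8T tokens
```

The log-log shape of the scaling curves means each constant-factor gain in quality requires a multiplicative increase in compute, which is why the spend escalates so quickly.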
5. Inference Time Reasoning and Compute Intensity
Dylan Patel introduces the concept of inference time reasoning, highlighting its computational demands compared to traditional pre-training. He explains that reasoning processes require generating and evaluating numerous potential outputs, leading to significantly higher compute costs.
Notable Quotes:
- "Inference time compute is actually a lot bigger on software, but it's a lot bigger on, hey, they just have the best hardware now." [14:05] – Dylan Patel
- "Inference time compute requires you to have multiples more compute." [50:30] – Dylan Patel
Key Points:
- Reasoning models, like OpenAI's o1, generate extensive intermediate chains of thought, increasing computational load.
- The cost of operating reasoning models is substantially higher due to longer token generation and increased memory usage.
- Enhanced inferencing demands drive up the necessity for advanced and efficient hardware solutions.
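The cost multiplier described above can be illustrated with simple token arithmetic. The sketch below compares a single short answer against a reasoning-style workload that samples several long chains of thought; all counts and the linear KV-cache cost model are illustrative assumptions, not figures from the episode:

```python
def decode_cost(tokens_generated, context_tokens, cost_per_token=1.0):
    # Simplified model: cost scales with tokens generated, and attention
    # over a longer context (KV cache) adds overhead, approximated
    # linearly here relative to an 8192-token baseline window.
    return tokens_generated * cost_per_token * (1 + context_tokens / 8192)

# Single direct answer: ~300 output tokens over a short context.
direct = decode_cost(tokens_generated=300, context_tokens=1000)

# Reasoning workload: 8 sampled chains of ~4000 tokens each, long context.
reasoning = 8 * decode_cost(tokens_generated=4000, context_tokens=8000)

print(f"cost multiple: {reasoning / direct:.0f}x")
# → cost multiple: 188x
```

Even with these toy numbers, serving a reasoning query costs two orders of magnitude more than a direct answer, which is the dynamic Patel argues drives demand for more and better inference hardware.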
6. Competition and Alternatives: AMD, Google TPU, Amazon Trainium
The discussion broadens to explore competitors in the AI semiconductor space, including AMD, Google's TPU, and Amazon's Trainium.
AMD:
- Dylan Patel criticizes AMD for lacking robust software development capabilities, which hampers its ability to compete effectively with Nvidia.
- Despite strong silicon engineering, AMD struggles with system-level design and integrating comprehensive software suites.
- Notable Quote: "AMD is missing software... they've got very few developers on it." [67:35]
Google TPU:
- Google’s Tensor Processing Units (TPUs) are acknowledged for their strong system integration and custom architecture optimized for AI workloads.
- TPUs benefit from Google's extensive networking and cooling solutions, making them highly reliable for internal use.
- Notable Quote: "TPUs are vastly integrated with Google's software and networking, providing competitive performance." [70:46]
Amazon Trainium:
- Amazon’s Trainium is highlighted as a cost-effective alternative, offering high memory bandwidth per dollar through strategic partnerships and optimizations.
- Although less efficient per chip, Trainium compensates for its lower performance through scale and cost.
- Notable Quote: "Trainium 2 is very cost-effective per HBM and memory bandwidth." [74:59]
Key Points:
- AMD faces challenges due to limited software support and system-level expertise, despite strong hardware.
- Google's TPU remains strong internally but struggles with commercial expansion due to pricing and software limitations.
- Amazon's Trainium offers a competitive edge in cost and scalability but lacks the integrated system prowess of Nvidia.
7. Market Projections for 2025 and 2026
Looking ahead, Dylan Patel offers projections for the AI semiconductor market, emphasizing sustained investment and the critical role of model improvements.
Key Points:
- 2025: Continued significant investments by hyperscalers, driven by the need to maintain competitive advantage through scaling AI capabilities.
- 2026: Potential consolidation in the market as only the most efficient and innovative players sustain growth. The industry's long-term success hinges on ongoing model advancements and the influx of new capital from sources like sovereign wealth funds.
- Notable Quote: "2026 is where the reckoning comes, right? Will people keep spending like this?" [81:19]
Key Takeaways:
- Sustained growth is expected in the near term, supported by aggressive scaling and deployment of advanced AI models.
- Long-term stability depends on continuous innovation and the ability to translate compute investments into tangible revenue gains.
- Market consolidation may occur as only the top performers can manage the escalating costs and complexity of AI scaling.
8. Final Insights and Conclusions
The episode concludes with reflections on the dynamic nature of the AI semiconductor market. Dylan Patel underscores the importance of balancing compute investments with model performance and revenue generation. The panel acknowledges the potential for both significant growth and market consolidation, contingent on technological advancements and strategic investments.
Notable Quotes:
- "There is a game of chicken here... overshoot goes up and every bubble ever, we overshoot." [86:11] – Bill Gurley
- "The scaling of models is not just about getting bigger, it's about getting smarter and more efficient." – recurring theme throughout the episode
Key Points:
- Nvidia remains a dominant force due to its integrated approach, but competition is intensifying with bespoke solutions from major tech players.
- The balance between scaling compute resources and improving AI model efficiency is crucial for sustained market growth.
- Future success in the AI semiconductor landscape will depend on the ability to innovate and meet the evolving demands of AI workloads.
Conclusion
This episode of BG2Pod provides a deep dive into the AI semiconductor landscape, highlighting the critical balance between hardware advancements, software integration, and strategic investments. Dylan Patel's expertise offers valuable insights into the competitive dynamics and future projections of the industry, emphasizing that while Nvidia currently leads, the market is poised for continued evolution driven by technological innovation and strategic capital allocation.
Disclaimer:
The views and opinions expressed in this summary are based on the podcast transcript and do not constitute investment advice. Always conduct your own research or consult with a financial advisor before making investment decisions.
