Transcript
A (0:00)
This is the biggest change in human history, maybe ever. What's about to happen with AI? This is the biggest revolution, bigger than the Industrial Revolution. Jensen is very paranoid about losing. If he just kept making his mainline chip, people would crush him on cost and performance. Acquiring Groq is how you get those resources to make more solutions for different parts of the market, to stay king. At the end of the day, this is an economic war. If the US and the West win in AI, China will not rise to be the global hegemon. But without AI, China definitely will rise. They're just going to outrun America.
B (0:29)
Hi, I'm Matt Turck. Welcome back to the MAD Podcast. Today I'm joined by the one person Wall Street and Silicon Valley turn to when they need to cut through the hardware hype: Dylan Patel of SemiAnalysis. We dove into many of the most important topics: Nvidia's massive move to acquire Groq, the truth about the capex bubble, whether the US power grid can actually handle the AI boom, and the geopolitical chess match playing out between the US and China. But I have to warn you, this conversation went off the rails in the best possible way, and we ended up going off on all sorts of fun tangents, like the strange phenomenon of Chinese romance dramas set inside semiconductor factories and what it's really like when three AI-famous roommates live together in SF. Please enjoy this fantastic conversation with Dylan. Hey Dylan, welcome.
A (1:15)
Hello. How are you?
B (1:16)
I'm great. I'd love to start with Groq and Nvidia, since it's still fresh. So not so long ago Nvidia was saying that one GPU could do it all, and now they're doing this acquisition-slash-non-exclusive deal with Groq. What does that mean from your perspective?
A (1:33)
We're not sure where AI models are headed in terms of architecture over the next few years. The one thing everyone has sort of agreed on is that models are pretty autoregressive, right? Next-token generation is the thing. But beyond that, attention mechanisms change how it all works; everything could change. And what's interesting is the reason Nvidia won is they took the widest-surface-area bet, people kept developing models on that, and that shape worked.

But now the workload is such that there's room for specialization that gives you 10x increases in certain domains. A chip like that doesn't work for a general-purpose workload, right? It can't train, it can't inference really, really large models cost-efficiently, it can't serve many, many users. But what it can do is go screamingly fast. Same with the Cerebras-OpenAI deal. But that's one workload, right? Very decode-focused: generating autoregressive tokens in a single stream, super fast.

Another direction AI models could head, and we don't know: are models going to think in one token stream, or are they constantly context switching? They have this humongous context and they're generating in multiple parallel streams. Google and OpenAI have both released versions of this with their pro models, where the model doesn't have one single chain of thought for reasoning, it has multiple. Exactly how they choose which one, and what final answer gets delivered to you, is an area of research. But there's room for that kind of chip, right? Something that works on a lot of parallel streams of chain of thought. And maybe the latency requirements are not as crazy. Maybe you don't need to go blindingly fast, because I can spin up 100 parallel streams of thought, or agents, or whatever you want to call them, and maybe I care a lot about cost there. Because it's 100 in parallel instead of one going super fast, the tree search, the depth of the inference, is not as deep, but it's much wider.

And there are other parts of inference. Prefill, creating the KV cache? Nvidia has a chip for that, right? That's the CPX. So they've made the CPX, they bought Groq for decode, and they still have their general-purpose GPU. They're trying to cover their bases, because unlike the first wave of AI chip companies, who sort of just made chips and then tried to figure out where they would work. Well, they did have a thesis, Groq and Cerebras both, as well as SambaNova: put a lot of memory on the chip, and, in the case of Cerebras and Groq, no memory off chip; in the case of SambaNova, less or slower memory off chip with higher capacity. They all made similar bets in that direction, and it didn't work for a while, until it kind of did, because there's a workload now that necessitates it.

Nvidia recognizes they're the leader, they're the tentpole. In one respect they can just run faster than everyone. But it's kind of hard to be 2x better than Google's or OpenAI's or whoever else's internal chip, and they have to be 2x to 4x better to justify their 75%-plus margins, because roughly 4x over COGS is what they're charging.
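For concreteness, the arithmetic behind that margin claim, writing $P$ for the sale price, $C$ for the cost of goods sold, and $m$ for the gross margin:

\[
m = \frac{P - C}{P}
\;\Longrightarrow\;
P = \frac{C}{1 - m},
\qquad
m = 0.75 \;\Rightarrow\; P = 4C .
\]

So a buyer whose internal chip comes in at roughly cost pays about 1x COGS for its own silicon and about 4x COGS for Nvidia's, which is why Nvidia's part has to be on the order of 2x to 4x better for the purchase to pencil out.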
So the question is, what architecture will deliver that? Yes, keeping the programmability of their GPUs is great for training and for a lot of workloads. But guess what, I think a lot of people will just be downloading an open-source model, downloading an inference framework, and pressing go. It's a little more complicated than that, but that's going to be the consumption method for a lot of enterprises, a lot of startups, a lot of tech companies: they'll rent the GPUs or rent the chips, then download an open-source framework and model, and go.

Nvidia recognizes this. And hey, there is room for products that aren't general purpose. The general-purpose GPU will probably still be the mainline for training, for a lot of inference, and for cost-efficient inference. But blindingly fast serving, or workloads that have a ton of prefill, that is, creating the KV cache, those workloads could be different chips. The CPX chip they announced, they say it's for context processing, creating the KV cache. It's also really useful for video models, because video models don't care about memory bandwidth, so why pay for the expensive memory the general-purpose chip has? Or why do what Groq is doing, which is tying hundreds or thousands of chips together and keeping the entire model on chip instead of in off-chip memory? The trade-off, of course, is that you need thousands of chips and you have less compute per chip. So Nvidia is trying to capture the whole surface area, because again, you don't know where models are headed, and it's hard to say where the research is headed.
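To make the prefill/decode split above concrete, here is a minimal NumPy sketch. It's purely illustrative, not any vendor's stack: `attend` is a toy single-head attention and the key/value "projections" are faked. Prefill runs one big parallel matmul over the whole prompt and produces the KV cache, the compute-heavy, CPX-style job; decode then generates token by token, with every step re-reading the ever-growing cache, the bandwidth-bound, Groq-style job.

```python
import numpy as np

def attend(q, K, V):
    """Toy single-head attention: q is (t, d); K and V are the (T, d) cache."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d, prompt_len = 64, 512
prompt = rng.standard_normal((prompt_len, d))

# PREFILL: one large, highly parallel pass over the whole prompt that
# builds the KV cache. Compute-bound: the CPX-style workload.
K_cache = prompt.copy()   # stand-in for the key projection
V_cache = prompt.copy()   # stand-in for the value projection
_ = attend(prompt, K_cache, V_cache)

# DECODE: one token per step. Each step is a tiny matmul that has to
# stream the entire, growing cache through memory again.
# Bandwidth-bound: the Groq-style, model-kept-on-chip workload.
x = rng.standard_normal((1, d))
for _ in range(16):
    x = attend(x, K_cache, V_cache)    # toy "next token" hidden state
    K_cache = np.vstack([K_cache, x])  # cache grows every decode step
    V_cache = np.vstack([V_cache, x])
```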
