20VC: Why Google Will Win the AI Arms Race & OpenAI Will Not | NVIDIA vs AMD: Who Wins and Why | The Future of Inference vs Training | The Economics of Compute & Why To Win You Must Have Product, Data & Compute with Steeve Morin @ ZML - The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch


Podcast Summary: 20VC Episode Featuring Steeve Morin on AI Infrastructure and the Compute Race

Release Date: February 24, 2025

Introduction

In this episode of The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch, host Harry Stebbings engages in an in-depth conversation with Steeve Morin, the founder of ZML, a pioneering inference engine designed to optimize performance across various chip architectures. The discussion delves into the competitive landscape of AI compute, the dynamics between major chip manufacturers like NVIDIA and AMD, and the future trajectory of AI training versus inference.

1. The Compute Arms Race: NVIDIA vs. AMD

Steeve Morin opens the conversation by critiquing NVIDIA's dominance in the AI compute space:

"The thing with Nvidia is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like who gives a shit about CUDA?" ([00:00])

Morin emphasizes the critical importance of owning compute infrastructure, asserting that without it, organizations are handicapped. He predicts a future where Google emerges as the dominant force in AI inference due to its comprehensive ecosystem encompassing products, data, and compute capabilities.

Key Points:

NVIDIA's Dominance: NVIDIA's investment in CUDA has entrenched it deeply in the AI ecosystem, making it the default choice despite potential inefficiencies compared to competitors like AMD.
AMD's Potential: Morin highlights AMD's superior cost efficiency, citing scenarios where switching from NVIDIA to AMD can yield up to four times better efficiency in compute spend.
Google's Edge: With its vast array of products like Android and Google Docs, Google possesses the requisite components—product, data, and compute—to dominate the AI inference landscape.

2. The Shift from Training to Inference

The discussion pivots to the divergent needs of AI training versus inference. Morin illustrates that while training requires massive compute resources and benefits from scaling "more is better," inference demands efficiency and reliability, embodying a "less is more" philosophy.

"Training is research and inference is production. They are fundamentally different in terms of infrastructure." ([18:16])

Key Points:

Infrastructure Differences: Training involves iterative processes with a focus on speed and scale, whereas inference prioritizes stable, efficient deployment without the need for extensive interconnects between hardware.
Efficiency in Inference: Morin underscores the importance of provisioning compute resources dynamically to avoid the high costs associated with maintaining idle or over-provisioned infrastructure.

3. Economic Implications of Compute Choices

Morin delves into the economics behind choosing different compute providers, highlighting the exorbitant margins imposed by NVIDIA and the strategic advantages of more cost-effective alternatives.

"NVIDIA is here to stay at least if not for the H100 bubble bust. Because these chips are going to be on the market and people will buy them and do inference with them." ([25:18])

Key Points:

Cost Efficiency: NVIDIA's GPUs are priced with high margins, making alternatives like AMD or emerging chipmakers like Etched and Vsora more attractive for cost-conscious organizations.
Supply Constraints: Morin points out the challenges related to GPU supply, noting that high demand for NVIDIA products can lead to order cancellations and supply chain bottlenecks.
Alternative Solutions: Companies like ZML aim to abstract the complexities of hardware dependencies, allowing seamless switching between different compute providers to optimize costs and performance.

4. The Future of Chip Architecture in AI

Exploring beyond traditional GPU architectures, Morin discusses emerging technologies and their potential to revolutionize AI compute.

Emerging Technologies:

SRAM vs. HBM: Morin explains that while SRAM offers superior speed essential for high-performance inference, its high cost and chip surface consumption present significant challenges. In contrast, HBM (High Bandwidth Memory) remains slower and less efficient for AI tasks.
Compute-in-Memory: Innovations like Rain AI and Fractile are developing compute-in-memory technologies, which integrate computation directly with memory to enhance efficiency and speed.

"Compute-in-memory means bringing the CPU to the memory and doing everything. It’s crazy stuff, but it’s coming." ([32:39])

Key Points:

Specialized AI Chips: The limitations of GPUs for certain AI tasks, especially inference, necessitate the development of more specialized chip architectures.
Latency and Reasoning: Agents and reasoning workloads shift the focus from throughput to latency, that is, the time between the start and the end of a single request, which GPUs are inherently less suited for.
Monolithic vs. Dynamic Models: There is a trend towards smaller, more efficient models that can dynamically adapt, moving away from large, monolithic AI models that are resource-intensive.

5. Strategic Advice for AI Startups

When asked to offer advice to AI startups navigating the complexities of training, inference, and hardware choices, Morin stresses the importance of avoiding reliance on reselling compute.

"Do not resell compute if you can. A lot of AI startups are trying to make a margin on top of a very big cake." ([66:01])

Key Points:

Verticalization: Startups should focus on verticalizing their products rather than building business models that depend heavily on reselling compute, where the startup's own margin is a thin layer on top of the margins already taken by chipmakers and cloud providers.
Software Abstraction: Investing in software solutions that abstract hardware dependencies can provide greater flexibility and cost savings, allowing startups to switch compute providers as needed without substantial overhead.

6. Challenges Facing Industry Leaders

Morin candidly discusses the challenges faced by NVIDIA and other industry giants in maintaining their dominance amidst evolving technological demands and supply chain issues.

"The biggest challenge that Jensen Huang faces today... is navigating the downslope. Blackwell is probably something that keeps them awake at night." ([66:51])

Key Points:

Supply Chain Issues: NVIDIA is grappling with supply constraints and hardware issues, such as the problematic Blackwell chips, leading to order cancellations and customer dissatisfaction.
Sustainability of Dominance: While NVIDIA remains a strong player, Morin suggests that its dominance might not be sustainable if alternative, more efficient solutions emerge and gain market traction.

7. Broader Industry Trends and Innovations

The conversation touches upon broader trends in AI infrastructure, including synthetic versus real data, model scaling, and the potential obsolescence of transformer-based models.

Synthetic vs. Real Data:

Mixed Perspectives: Morin expresses skepticism about injecting synthetic data back into models, citing potential deterioration. However, he acknowledges success stories like AlphaGo, which benefited from synthetic game data.

Model Scaling and Efficiency:

Beyond Transformers: Emerging architectures that do not rely on transformer models could significantly reduce compute requirements and enhance efficiency.
Energy-Based Models: Morin is bullish on energy-based models, which aim to understand the world fundamentally by minimizing energy transitions between states, potentially offering more efficient reasoning capabilities.

"What pushes smaller models are efficiency, roughly speed... less is better." ([56:17])

Conclusion

In concluding the episode, Steeve Morin reiterates the critical importance of aligning compute choices with business objectives and the necessity of flexible, software-driven solutions to navigate the rapidly evolving AI landscape.

"If you do AI as much as you can, try to verticalize on the product, but not on the compute. If your business model implies buying a lot of tokens, it's a very hard circle to square to put that into $20 right a month." ([66:01])

Final Takeaways:

Ownership of Compute: Owning and optimizing compute infrastructure is paramount for long-term success in AI.
Flexibility and Efficiency: Emphasizing software solutions that allow for flexible compute switching can unlock significant cost and performance advantages.
Future-Proofing: Staying abreast of emerging technologies and being willing to adopt new architectures will be key to maintaining competitive edge.

Notable Quotes:

"Who gives a shit about CUDA?" - Steeve Morin ([00:00])
"Training is research and inference is production." - Steeve Morin ([18:16])
"If you do AI as much as you can, try to verticalize on the product, but not on the compute." - Steeve Morin ([66:01])

Listen to the full episode on 20VC to gain deeper insights into the future of AI infrastructure and the competitive dynamics shaping the industry.


Transcript

[00:00]
Steve Morin
The thing with Nvidia is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like who gives a shit about CUDA? OpenAI is amazing, but it's not their compute. Ultimately, if you don't own your compute, you're starting with something at your ankle. In five years, I would say 95% inference, 5% training. You have the products, the data and the compute. Who has all three? Google has, like, Android, Google Docs. They have everything they can sprinkle everywhere. This is the sleeping giant in my mind.
[00:31]
Harry Stebbings
This is 20VC with me, Harry Stebbings, and our show with Jonathan Ross at Groq went so well last week, but I had so many more questions on two things: the future of chips and the future of inference. So today we dig deep on both and there's no one better to join me than Steve Morin. Steve is the founder of ZML, a next generation inference engine enabling peak performance on a wide range of chips. Literally the perfect speaker for this topic. And this was a super nerdy show. It was probably the most information-dense episode we've done in a long time. So do slow it down, pause it, get a notebook out. But wow, there is so much gold in this one.
But before we dive in today, turning your back of a napkin idea into a billion dollar startup requires countless hours of collaboration and teamwork. It can be really difficult to build a team that's aligned on everything from values to workflow. But that's exactly what Coda was made to do. Coda is an all in one collaborative workspace that started as a napkin sketch. Now, just five years since launching in beta, Coda has helped 50,000 teams all over the world get on the same page. Now at 20VC, we've used Coda to bring structure to our content planning and episode prep, and it's made a huge difference. Instead of bouncing between different tools, we can keep everything from guest research to scheduling and notes all in one place, which saves us so much time. With Coda, you get the flexibility of docs, the structure of spreadsheets, and the power of applications, all built for enterprise. And it's got the intelligence of AI, which makes it even more awesome. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time. To try it for yourself, go to coda.io/20VC today and get six free months of the team plan for startups. That's coda.io/20VC, to get started for free and get six free months of the team plan.
Now that your team is aligned and collaborating, let's tackle those messy expense reports. You know, those receipts that seem to multiply like rabbits in your wallet. The endless email chains asking, can you approve this? Don't even get me started on the month-end panic when you realize you have to reconcile it all. Well, Pleo offers smart company cards, physical, virtual and vendor specific, so teams can buy what they need while finance stays in control. Automate your expense reports, process invoices seamlessly, and manage reimbursements effortlessly, all in one platform. With integrations to tools like Xero, QuickBooks and NetSuite, Pleo fits right into your workflow, saving time and giving you full visibility over every entity, payment and subscription. Join over 37,000 companies already using Pleo to streamline their finances. Try Pleo today. It's like magic, but with fewer rabbits. Find out more at pleo.io/20VC.
And don't forget to revolutionize how your team works together: Roam. A Company of Tomorrow runs at hyperspeed with quick drop-in meetings. A Company of Tomorrow is globally distributed and fully digitized. A Company of Tomorrow instantly connects human and AI workers. A Company of Tomorrow is in a Roam virtual office. See a visualization of your whole company. The live presence, the drop-in meetings, the AI summaries, the chats. It's an incredible view to see. Roam is a breakthrough workplace experience loved by over 500 companies of tomorrow, for a fraction of the cost of Zoom and Slack. Visit Roam, that's ro.am, for an instant demo of Roam today. Nobody knows what the future holds, but I do know this. It's going to be built in a Roam virtual office. Hopefully by you. That's ro.am for an instant demo.
[04:14]
Harry Stebbings
You have now arrived at your destination. Steve, dude, I am so grateful to you for joining me today. I've wanted to make this one happen for a while, but when we were discussing who would be best for this topic, I was like, we've got to have Steve on. So thank you for joining me, man.
[04:28]
Steve Morin
Well, thank you. I feel humbled. I appreciate it. Thank you.
[04:32]
Harry Stebbings
Dude, I want to start: can you just give us a quick overview of ZML and specifically your role in the infrastructure strategy today and where you sit?
[04:42]
Steve Morin
So at the very bottom of things, ZML is an ML framework that runs any model on any hardware. We sit ultimately at the infrastructure layer. We enable anybody to run their model better, faster, more reliably, but on any compute whatsoever; it doesn't really matter. It could be Nvidia, it could be AMD, it could be TPU and whatnot. And we do all that without compromise. That's the key point. Because if there's a compromise, then it's not really agnostic.
[05:13]
Harry Stebbings
Can I ask you then, if we think about sitting between any model and any provider there in terms of AMD, Nvidia, do you think then we will be existing in a world where people are using multiple models simultaneously and that is concurrently running?
[05:29]
Steve Morin
Yes, you actually can see it. It's been happening for a while. Models now are not the right abstraction. At least if you look at closed-source models, they're not really models, they're more like backends. And there are a lot of tricks so that you feel like you're talking to one model, but ultimately you're talking to a constellation, an assembly of backends that produces, you know, a response. Probably the number one, you know, I would say obvious thing would be that if you ask a model to generate an image, then it will, you know, switch to a diffusion model, right? Not an LLM. And there's many, many more tricks. The turbo models and OpenAI do that. There's a lot of tricks. So definitely models in the sense of getting weights and running them is something that is ultimately going away in favor of full-blown backends. You feel like you're talking to a model, but ultimately you're talking to an API. The thing is that API will be running locally in your own cloud instances and so on.
[06:28]
Harry Stebbings
So we will have a world where we're switching between models and there's kind of this trickery around them. Okay, perfect. So we've got that at the top, then we've got ZML in the middle and then you said, and then on any hardware. So will we be using multiple hardware providers at the same time or will we be more rigid in our hardware usage?
[06:45]
Steve Morin
No, absolutely. You can get probably an order of magnitude more efficiency depending on the hardware you run on. That is substantial. Not a lot of people have that problem at the moment. Things are getting built as we speak. But a simple example is if you switch from Nvidia to AMD on a 70B model, you can get four times better efficiency in terms of spend.
[07:10]
Unnamed Speaker
Right?
[07:10]
Steve Morin
So that is substantial. That is very much substantial. Now the problem is getting some AMD GPUs.
[07:16]
Unnamed Speaker
Right?
[07:17]
Harry Stebbings
I'm really sorry. If there is such a cost efficiency, four times, why does everyone not do that?
[07:23]
Steve Morin
So there's a few reasons. Probably the most important one is the PyTorch-CUDA, I would say, duo, and that's very, very hard to break. These two are very much intertwined.
[07:35]
Harry Stebbings
Can you just explain to us what PyTorch includes?
[07:37]
Steve Morin
Oh, yes, absolutely, yeah. PyTorch is the ML framework that people use to build, actually train, models.
[07:44]
Unnamed Speaker
Right.
[07:44]
Steve Morin
You can do inference with it. But by far the most successful framework for training is PyTorch. And PyTorch was very much built on top of CUDA, which is Nvidia software.
[07:57]
Unnamed Speaker
Right.
[07:57]
Steve Morin
Let's just say the strings of PyTorch make it ultimately very, very bound to CUDA. So of course it runs on, you know, it runs on AMD, it runs on, you know, even Apple and so on, but there was always, you know, the tens of little details that do not run exactly like, you know, you would expect, and there's work involved. But then also there's supply. So probably that's the number one thing. The second thing is there's a lot of GPUs on the market. Pretty much all of them are Nvidia. The reason being that if you think, you know, in layers and you say, all right, I'm going to buy, let's say, GPUs and I'm going to sell them to folks to maybe not even do training, right, just do inference, then most likely, if you look at it that way, you'll end up buying Nvidia because everybody will want to run on Nvidia because nobody knows really how to do whatever. And they've trained on Nvidia so they're like, I can just reuse my code and so on. So there's like this self-perpetuating circle of people just buy Nvidia because they want to resell and people just use Nvidia because it's there.
[09:03]
Unnamed Speaker
Right.
[09:03]
Steve Morin
But it's by far not the most efficient platform and arguably even in terms of software, it's not the best software platform. So that is probably two of the most, I'd wager the most important reasons.
[09:18]
Harry Stebbings
Can I ask before, you know, we were chatting about Nvidia and AMD when DeepSeek obviously happened and the stock crash that happened. Why did Nvidia rebound, do you think, in a way that AMD didn't?
[09:32]
Steve Morin
Because the chips are there. There's a lot of things, but in my opinion there's going to be a need for inference. Very hard to say whether it will be worth, you know, everybody's money to do it on H100. That is a bubble that I think will blow some time. I'm kind of afraid of that, to be honest.
[09:50]
Harry Stebbings
Why do you think that's a bubble that will blow sometime? Why is that not legitimate?
[09:53]
Steve Morin
Because it was built on the A100, I would say, financial model, which was: at generation zero we do training, but when it's last generation we do inference. And it worked beautifully, right, for A100. Then H100 comes along, and for inference it's worth five times the price and it maybe runs twice as fast. On inference, that is; on training it's a lot better. But on inference it's like maybe twice as fast. Actually, when it came out, it ran at the same speed as the A100. So there's a money gap that's going to have to be bridged sometime.
[10:29]
Unnamed Speaker
Right.
[10:30]
Steve Morin
And the part that worries me is that I see amortization plans in like six, seven years, right, with the GPUs as the collateral, and I'm like, well, I'm not sure how it's going to work because at least when they came out they were five times the price and they're just two times faster. Something has got to give.
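The "money gap" described in this answer can be sketched as simple arithmetic. The snippet below is only an illustration: it normalises the A100 to a price and speed of 1.0 and plugs in the rough ratios quoted in the conversation (five times the price, twice the inference speed). The function and variable names are made up for the example, and none of the figures are measured data.

```python
# Illustrative only: prices and speeds are normalised to the A100 (= 1.0),
# using the rough ratios quoted in the conversation, not measured data.

def cost_per_unit_of_inference(relative_price: float, relative_speed: float) -> float:
    """Hardware cost carried by one unit of inference work."""
    return relative_price / relative_speed

a100 = cost_per_unit_of_inference(relative_price=1.0, relative_speed=1.0)
h100 = cost_per_unit_of_inference(relative_price=5.0, relative_speed=2.0)

# How much more hardware cost each unit of inference carries on the newer chip.
gap = h100 / a100
print(f"H100 hardware cost per unit of inference vs A100: {gap:.1f}x")
```

Under these assumed ratios, each unit of inference on the newer chip carries 2.5 times the hardware cost, which is the gap an amortization plan would have to absorb.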
[10:49]
Harry Stebbings
Is speed of development trumping chip development speeds, where it's now becoming a real problem where, as we say, models are far outpacing the speed of chip deployment.
[11:00]
Steve Morin
Not much. Ultimately, the two things that could really very much shake the industry, the chip industry, in my opinion, are agents and reasoning.
[11:11]
Harry Stebbings
Number one, agents. Why does that change chips?
[11:14]
Steve Morin
I think this is where Nvidia can be attacked. I mean, why agents and why reasoning? The difference is, for agents and reasoning, you need to wait until the end of the request to get whatever it is you came for. You don't really care about the speed at which the text outputs, which is what you want in a chat, right? You only care about how much time it takes between the beginning of my request and the end. And so that fundamentally changes the incentives from throughput bound to latency bound. And so GPUs, let's say you're running a GPU at, let's say, 10,000 tokens per second, you very much like to do it 100 times 100, right? And they can do that, but they cannot give you 10,000 tokens per second only to you, per stream, as we say. But in terms of agents or reasoning, this is exactly what you want, because you don't want to wait like 50 seconds for whatever thinking, right? And agents, it's the same. So these two, I think, are the shot that might make Nvidia change its course with respect to chips. I mean, they're not idiots, right?
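The throughput-versus-latency point in this answer can be made concrete with a small calculation. This is a sketch under stated assumptions: the 10,000 tokens per second and the 100 concurrent streams come from the conversation, while the 5,000-token length of a reasoning trace and the helper name `seconds_to_finish` are hypothetical values chosen for illustration.

```python
# Sketch: the same aggregate throughput feels very different per request
# depending on how many streams share it.

def seconds_to_finish(tokens_needed: int,
                      aggregate_tokens_per_s: float,
                      concurrent_streams: int) -> float:
    """Wall-clock time for ONE request when throughput is split evenly across streams."""
    per_stream_tokens_per_s = aggregate_tokens_per_s / concurrent_streams
    return tokens_needed / per_stream_tokens_per_s

REASONING_TOKENS = 5_000  # hypothetical length of one reasoning trace

# Throughput-oriented: 10,000 tok/s delivered as 100 streams of 100 tok/s each.
batched = seconds_to_finish(REASONING_TOKENS, 10_000, concurrent_streams=100)

# Latency-oriented: the same 10,000 tok/s delivered to a single stream.
single = seconds_to_finish(REASONING_TOKENS, 10_000, concurrent_streams=1)

print(f"shared across 100 streams: {batched:.1f} s per request")
print(f"single stream:             {single:.1f} s per request")
```

With these numbers the batched setup takes 50 seconds per request and the single-stream setup half a second, even though both report the same aggregate tokens per second.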
[12:25]
Harry Stebbings
How should agents change Nvidia's strategy?
[12:29]
Steve Morin
Hard to say, because Nvidia has a very, very vertical approach. They do more of more.
[12:35]
Unnamed Speaker
Right.
[12:36]
Steve Morin
Like, if you look at Blackwell, it's actually crazy what they did for Blackwell. They assembled two chips, but the surface was so big that the chips started to bend a bit, which further perpetuated the problem because it then didn't make contact with the heatsink and so on. So they are very much. And you know, the power envelope, they push it to a thousand watts, it requires liquid cooling and so on. So they are very much in a very vertical foot to the pedal in terms of GPU scaling. But the thing is, GPUs are a good trick for AI, but they're not built for AI. It's not a specialized chip. It is a specialization of a gpu, but it is not an AI chip.
[13:18]
Harry Stebbings
Forgive me for continuously asking stupid questions. Why are GPUs not built for AI? And if not, what is better?
[13:26]
Steve Morin
So the way it worked is that you can think of a screen as a matrix, and if you have to render pixels on a screen, there's a lot of pixels and everything has to happen in parallel, right, so that you don't waste time. Turns out matrices are a very important thing in AI. So there was this cool trick, back, that was like probably 20 years ago, in which we would trick the GPU into believing it was doing graphics rendering, where actually we were making it do parallel work.
[13:58]
Unnamed Speaker
Right.
[13:59]
Steve Morin
It was called GPGPU at the time.
[14:01]
Unnamed Speaker
Right.
[14:01]
Steve Morin
So it was always a cool trick, but it was not dedicated for this. The pioneers probably were, of course, Google with TPU, which are very much more advanced on the architectural level. But essentially the way they work, it kind of works for AI, but for LLMs that starts to crack because they're so big and there's a lot of memory transfers and so on. Actually, that's why Groq, not Grok but Groq, Cerebras and all these folks achieve very high single-stream performance: the data is right in the chip. They don't have to get it from memory, which is slow, which GPUs have to do. So there's a lot of these things that ultimately make it a good trick, but not, I would say, a dedicated solution per se. That said, though, the reason probably Nvidia won, at least in the training space, is because of Mellanox, not because of the raw compute, because you need to run lots of these GPUs in parallel. So the interconnect between them is ultimately what matters. How fast can they exchange data? Because remember, when you do a matrix multiplication, let's say, the matrix is read like hundreds of times during the multiplication. So there's a lot of transfers going on. And so far Mellanox with, you know, InfiniBand had the best technology. So that's why, you know, a lot of people... And when you do training, by the way, it is the name of the game, the interconnect. When you do inference, not so much. You don't care when you do inference.
[15:39]
Harry Stebbings
Before we move to inference, I do just want us to stay on chips and just say, okay, so we have TPUs, we have Nvidia, we have AMD. Is this in terms of distribution of gains? Is this a winner take all market is this cloud where you have several providers who are dominant. What does the distribution of gains look like in the chip market?
[16:00]
Steve Morin
So I would divide it in two categories. Well, three categories. The GPUs you can buy or rent, the TPUs you can rent and the GPUs you can buy. This is how the market is structured today. Right now, if you want to go dedicated, at least in the cloud there's two options, TPUs and Trainium: TPUs on Google, Trainium on Amazon. So these are, you know, available chips; you can rent them today. If you want to buy GPUs or rent GPUs, you know, they're GPUs. We know it all the time. And there's this new wave of computing which are dedicated, you know, chips you can actually buy: the Tenstorrent, the Etched, the Vsora. So I think it will be a mix of, you know, for instance, let's say you are in Google Cloud, of course you don't want to do Nvidia, you get ripped off. Here's the dirty secret: like, TSMC sells you at 60% margin. Nvidia sells you at, you know, 90% margin. And on top of that there's Amazon that takes, let's say, a 30% margin. So you are a very thin crust on a very big cake. It's a bit of a losing game if you go all in on one provider.
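The "thin crust on a very big cake" image can be illustrated by stacking the margins quoted in this answer. The sketch below makes two simplifying assumptions that are not in the conversation: each figure is treated as a gross margin on the selling price (so price = cost / (1 - margin)), and each layer's price is treated as the next layer's entire cost. The function names are hypothetical.

```python
# Sketch: how stacked margins inflate one unit of manufacturing cost.
# Assumes gross margin on selling price and that each layer's price is
# the next layer's whole cost, which is a deliberate simplification.

def price_after_margin(cost: float, gross_margin: float) -> float:
    """Selling price when `gross_margin` is the share of the price kept as margin."""
    return cost / (1.0 - gross_margin)

def stacked_price(base_cost: float, margins: list[float]) -> float:
    """Apply each layer's margin in turn, starting from the base cost."""
    price = base_cost
    for margin in margins:
        price = price_after_margin(price, margin)
    return price

# Foundry (60%), GPU vendor (90%), cloud provider (30%), as quoted above.
full_stack = stacked_price(1.0, [0.60, 0.90, 0.30])

# Same stack with the 90% layer removed, e.g. a cloud using its own chips.
without_gpu_vendor = stacked_price(1.0, [0.60, 0.30])

print(f"price with all three layers:  {full_stack:.1f}x base cost")
print(f"price without the 90% layer:  {without_gpu_vendor:.1f}x base cost")
```

Under these assumptions, one unit of manufacturing cost reaches the end customer at roughly 36 units, and removing the 90% layer alone cuts that by a factor of ten.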
[17:13]
Harry Stebbings
You want optionality with increasing competitiveness within each of those layers. Do we not see margin reduction?
[17:21]
Steve Morin
Absolutely, yes. Yeah, yeah. Here's the problem though. Let's say you are on Google Cloud and you're on TPUs. Suddenly you just remove that 90% chunk on the spend. The problem is that for multiple software reasons, which we are solving at ZML, they're not really, I would say, a commercial success. They are very much successful inside of Google, but not much outside of Google. Amazon, same, is pushing very, very hard for their Trainium chips. So the future I see is that you use whatever your provider has because you don't want to pay 90% outrageous margin and try to make a profit out of that.
[18:02]
Harry Stebbings
Okay, so when we move to actually inference and training, everyone's focused so much on training. I'd love to understand what are fundamental differences in infrastructure needs when we think about training versus inference.
[18:16]
Steve Morin
So these two obey fundamentally different, I would say, tectonic forces. So in training, more is better. You want more of everything, essentially. And the recipe for success is the speed of iteration. You change stuff, you see how it works and you do it again. Hopefully it converges. And it's like, you know, changing the wheel of a moving car, so to speak. So that is training. On inference, this is a complete reverse. Less is better: you want less headaches, you don't want to be woken up at night. Because inference is production. You could say that training is research and inference is production. And it's fundamentally different in terms of infra. Probably the number one difference between these two is the need for interconnect. So if you do production, if you can avoid having interconnect between, you know, let's say a cluster of GPUs, of course you will, you know, avoid that.
[19:14]
Unnamed Speaker
Right.
[19:14]
Steve Morin
If you can. And this is why models have the sizes they have: so that people can run them without the need to connect multiple machines together. It's very constraining in terms of the environment. So that is probably the fundamental difference, the need for interconnect. And number two is ultimately, do you really care about what your model is running on as long as it's outputting whatever you want it to output?
[19:39]
Harry Stebbings
Can you just help me understand, Sorry, why is training more is more, and that's great. And in inference, less is more. Why do we have that difference?
[19:49]
Steve Morin
Think of it like doing a painting and doing a million paintings. The tools you will use, the process you will do. If you do one painting, what you favor is the speed at which you can do a stroke and do some iteration. If you do a million, what you want is a process, a process that is reliable, that can deliver you efficiently a million paintings. So that is the same for, for training versus inference. If you run around, you know, millions of instances of a model, you cannot, you know, hack your way to do that. By the way, people do hack their way today, but this is probably the fundamental difference.
[20:25]
Harry Stebbings
How do people then put inference in production today? You know, we've seen with training, that's really where Nvidia have dominated so heavily, right? How do people put inference in production?
[20:37]
Steve Morin
There's a lot of duct tape. Here's also probably one of the problems: training, on first principles, is actually two passes, forward and backward, right? It's called forward pass and backward pass. Inference is running only the forward pass. So that's how things are today. There are people who are trying to specialize a bit because at some point duct tape doesn't really work out. And when you run at big scale, that becomes a problem. And it's a problem that's growing because a lot of people are coming on the market with needs for inference. That wasn't the case a year and a half ago. Or a year ago. OpenAI had this problem, right? Maybe Anthropic had this problem, but it wasn't a universal problem yet, and now it's becoming a universal problem.
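The forward/backward distinction in this answer can be shown on a toy model. The sketch below uses a one-parameter model y = w * x with a squared-error loss and a hand-written gradient; it is an illustration in plain Python, not how PyTorch or ZML implement anything, and every name and number in it is chosen for the example.

```python
# Toy illustration: training runs a forward AND a backward pass,
# while inference runs the forward pass only.

def forward(w: float, x: float) -> float:
    """Forward pass: compute the model's prediction."""
    return w * x

def backward(w: float, x: float, y: float) -> float:
    """Backward pass: gradient of the loss (w*x - y)**2 with respect to w."""
    return 2.0 * (forward(w, x) - y) * x

def train_step(w: float, x: float, y: float, lr: float = 0.1) -> float:
    """One training iteration: forward + backward, then a gradient update."""
    return w - lr * backward(w, x, y)

# Training: iterate quickly, change the weight, see how it works, repeat.
w = 0.0
for _ in range(50):
    w = train_step(w, x=1.0, y=3.0)

# Inference: the weight is frozen and only the forward pass is executed.
prediction = forward(w, 2.0)
print(f"learned w = {w:.4f}, prediction for x=2: {prediction:.4f}")
```

The training loop needs gradients and repeated updates, whereas serving the model calls `forward` alone, which is part of why the two have different infrastructure needs.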
[21:22]
Harry Stebbings
Can you articulate what problem did OpenAI and Anthropic have with regards to inference?
[21:27]
Steve Morin
So, for instance, probably the number one thing, depending on how you deploy. But if you deploy inference, the number one thing that will get you is what's called autoscaling. So as your systems get more and more loaded, you want to provision. Because these things are tremendously expensive, you want to provision them as you scale, right? So you don't want to say, I have a thousand GPUs, 24 hours, even if there's nobody on the production, I will pay for them. Which is, mind you, what people are doing today. This is crazy. So what you want to do is you want to provision compute as you grow your needs, right? And you want to do it up and you want to do it down. Probably the number one thing that gives you a lot of efficiency in terms of spend. Like, we're talking multiples, like 5, sometimes 10x improvement. The thing is, this is a problem that, at least in, I would say, regular backend engineering, everybody knows, right? Everybody is doing it because the savings are so huge. But on AI, nobody really had the problem, so now they're coming up to it. So this is one example.
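The savings from autoscaling described in this answer can be sketched by comparing GPU-hours under two provisioning policies. The hourly demand profile below is entirely hypothetical (a short daily peak of 1,000 GPUs and long quiet stretches), and the function names are made up; the point is only to show where a multiple in the range mentioned can come from.

```python
import math

# Sketch: GPU-hours paid for under static peak provisioning versus
# scaling the fleet up and down with demand.

def gpu_hours_static(hourly_demand: list[float]) -> float:
    """Keep enough GPUs for the peak hour running around the clock."""
    return max(hourly_demand) * len(hourly_demand)

def gpu_hours_autoscaled(hourly_demand: list[float], min_gpus: int = 0) -> int:
    """Each hour, run just enough GPUs for that hour's demand (never below min_gpus)."""
    return sum(max(min_gpus, math.ceil(d)) for d in hourly_demand)

# Hypothetical day: 16 quiet hours, a 2-hour peak, 6 moderately busy hours.
demand = [50] * 16 + [1000] * 2 + [200] * 6

static = gpu_hours_static(demand)
autoscaled = gpu_hours_autoscaled(demand)
print(f"static: {static} GPU-hours, autoscaled: {autoscaled} GPU-hours, "
      f"ratio: {static / autoscaled:.1f}x")
```

For this made-up profile, static provisioning pays for 24,000 GPU-hours against 4,000 when autoscaled, a 6x difference; real savings depend on how spiky the load is and on how quickly capacity can actually be added or released.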
[22:39]
Jonathan Ross
So the problem is that they're not doing provisioning. They're paying a shit ton more because they are fully in production all the time, versus provisioned as needed.
[22:47]
Steve Morin
That's one example. Yeah. Another one is choosing the right compute. It's kind of, I would say, a vicious circle, because provisioning compute is very hard. So if you lose compute, it's very bad. You are essentially incentivized to overbuy. In the case of Amazon or Google, that would be buying reserved compute, which you're not going to use, because if you buy it on demand, you will get tremendously ripped off. So that creates this fake scarcity of compute that people buy preemptively because they have a shit ton of money, and they're not using it, right? So this is a major problem too.
[23:25]
Jonathan Ross
When you buy compute preemptively, does it not become outdated by the time you use it though?
[23:31]
Steve Morin
It might well be, yes. We are being spared a bit because Blackwell is late and others are getting canceled. And so H series I would say are still, you know, in the active, but. Yes, absolutely. But, you know, what choice do you have? This is the thing.
[23:49]
Jonathan Ross
Will we have a moment in time where there is this massive overhang or oversupply of compute which we've proactively bought ahead of time, but then actually the hyperscalers go, we'd rather just burn it and buy fresh. And we have the money to do that.
[24:05]
Steve Morin
So I might tell you that I think it already started. I'm getting cold emails for, you know, discounts, you know, from services I never heard about. And I started getting these emails probably around October, November. Some people are left with a lot of capex that they don't know what to do with. You know, it's a different thing to build a cluster and run a training and do a training run than it is to build literally a cloud, you know, provider or hyperscaler or, you know, whatever you want to call it. So there are a lot of people who do their training runs on the regular, you know, providers, but then move to regular hyperscaler when they do production. So I very much worry there will be an oversupply of these chips. The problem is that, remember, the chips are the collateral. So somewhere in the US or whatever there's going to be a data center with like 1000 GPUs that people may buy 30 cents on the dollar. This is what might happen.
[25:05]
Jonathan Ross
What is the time frame for that happening?
[25:07]
Steve Morin
Probably this year.
[25:09]
Jonathan Ross
Jensen has made it very clear that inference opens up more revenue opportunity for Nvidia. He said that 40% of their revenues today comes from inference.
[25:18]
Steve Morin
Right.
[25:19]
Jonathan Ross
To what extent is that correct? Or actually, as Jonathan at Groq said on the show, Nvidia is not meant for inference. Definitely not. And actually that market won't be won by Nvidia.
[25:32]
Steve Morin
Technically speaking, he's right, but realistically speaking, I'm not sure I agree. The thing is, these chips are on the market. They're here. I can alt-tab to Chrome and get one. That is something that I don't take lightly: availability. I think Nvidia is here to stay, if only for the H100, you know, bubble bust, because these chips are going to be on the market and people will buy them and do inference with them. It remains to be seen, you know, the opex and the electricity, etc. But the only chips that are really, you know, frontier in that sense are probably TPUs and then the upcoming chips. They're great chips, but they're not on the market. Or they're at outrageous prices, like millions of dollars to run a model.
[26:23]
Jonathan Ross
So what chips are great and why aren't they on the market?
[26:26]
Steve Morin
Let's say, for instance, Cerebras: incredible technology, incredibly expensive. So how will the market value the premium of having single-stream, very high tokens per second? There is a value in that, right? As we saw with Mistral and Perplexity. But I think that was done at a loss. I don't know, I don't have the details, but I think it was done at a loss that Cerebras put it out. So today there's three actors on the market that can deliver this. I think this will be, I would say, the pushing force for change in the inference landscape: agents and reasoning. So that is very high tokens per second, only for you.
[27:06]
Jonathan Ross
What is forcing the price of a Cerebras to be so high? And then you heard Jonathan at Groq on the show say that, hey, they're 80% cheaper than Nvidia.
[27:15]
Steve Morin
So there's this trick, because here's the thing, there's no magic. This little trick is called SRAM. SRAM is memory on the chip directly, so that is very, very fast memory. But here's the problem with SRAM: it consumes, you know, surface on the chip, which makes it a bigger chip, which is very hard in terms of yield, right? Because the chances of problems are higher, and so on. So SRAM is, I would say, very, very, very fast memory, which gives you a lot of advantage when you do very, very high inference. But it's terribly expensive. If you look at, for instance, Groq, on this generation they have 230 megabytes of SRAM per chip. A 70B model is 140 gigabytes. So you do the math, right? Cerebras has 44 gigabytes of SRAM in what they call their wafer-scale engine, which is a chip the size of a wafer. I mean, most likely it's interconnected, but it's huge, right? And it has to be water cooled. They have copper, I would say, needles that touch the chip. It's crazy stuff. Very, very impressive technology, mind you, but very, very expensive. So my bet is there will be chips on the market that do that at a much lower price. And there are two companies I see going in that direction. One is called Etched and the other one is called VSORA. Those are the two I see. Because if you can deliver this at, I would say, a price that is comparable to GPUs, you've won.
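The "do the math" step can be spelled out. The sketch below assumes a 70-billion-parameter model stored at 2 bytes per parameter (which gives the 140 gigabytes quoted), decimal units, and weights only; it ignores KV cache, activations, and any replication across chips, so real deployments would need more. The function names are hypothetical.

```python
import math

# Back-of-the-envelope: how many SRAM-only chips are needed just to hold
# the weights of a model. Weights only; KV cache and activations ignored.

def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Size of the model weights, assuming 2 bytes (16-bit) per parameter."""
    return num_params * bytes_per_param


def chips_needed(model_bytes: int, sram_bytes_per_chip: int) -> int:
    """Minimum number of chips whose combined SRAM fits the weights."""
    return math.ceil(model_bytes / sram_bytes_per_chip)


model = weight_bytes(70 * 10**9)                   # 140 GB of weights
groq_chips = chips_needed(model, 230 * 10**6)      # 230 MB of SRAM per chip
cerebras_wafers = chips_needed(model, 44 * 10**9)  # 44 GB of SRAM per wafer

print(f"weights: {model / 10**9:.0f} GB")
print(f"230 MB chips needed: {groq_chips}")
print(f"44 GB wafers needed: {cerebras_wafers}")
```

Under these assumptions the weights alone need on the order of six hundred 230 MB chips, or four 44 GB wafers, which is the cost pressure being described.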
[28:49]
Jonathan Ross
Is minimizing SRAM the only way to reduce unit cost on these chips? Really?
[28:56]
Steve Morin
It's hard to say. I mean, you need some SRAM, but if you can have a smaller process node, and if you can hook yourself up with external memory, then yes, you can do that a lot better. But the thing is, if you go full-blown SRAM, then there's no magic. You will have to pay the price.
[29:15]
Jonathan Ross
I'm so enjoying this. I'm also learning. My notes here are just expanding by the day. If that's today, how do you think the inference market evolves over the next three to five years?
[29:25]
Steve Morin
Pushed by reasoning. So reasoning not in the sense that you see on DeepSeek and whatever, right? Reasoning in what's called latent space. Latent space reasoning and agents will push the market towards different types of compute.
[29:40]
Jonathan Ross
Can I just ask, what's latent space reasoning?
[29:43]
Steve Morin
So the way models reason today is they reason in tokens. It's as if, when you think to yourself, you would say out loud what you're thinking. So yes, it works, but it is a bit inefficient, right? And you lose information doing this. Latent space reasoning is this without going, I would say, to English or whatever, right? So staying in what's called the latent space, which is where all the information of, let's say, an LLM lives, right? So this is very much how we work as humans, and we move toward what Yann LeCun calls energy-based models, in which we have different types of longer or shorter, I would say, thinking times, if you will.
[30:29]
Unnamed Speaker
Right?
[30:29]
Steve Morin
So fundamentally, GPUs cannot deliver this, plain and simple, at scale.
[30:35]
Jonathan Ross
Why can't GPUs deliver it?
[30:37]
Steve Morin
Because the access to external memory prevents it. So HBM is all the rage, right? But HBM compared to SRAM is absolutely, you know, dog slow. So this is the problem you get. HBM is like the best we can do, but it's still slow versus SRAM.
[30:56]
Jonathan Ross
So when I had Jonathan on, he was like, actually Nvidia have such a stronghold because they're one of the only buyers of HBM, and that gives them this unique position. Actually, is being a sole buyer of HBM irrelevant if the world needs SRAM instead?
[31:11]
Steve Morin
No, you want HBM. To be clear, SRAM alone will not deliver. It's a dead end in terms of scaling. SRAM means scaling the surface, means you get depreciating problems. It explodes everywhere, right? So you need some SRAM, right? So, you know, we'll have bigger amounts of SRAM in chips and of course bigger, what's called external memory in chips. The issue with HBM is that it's still slow. And yes, maybe Nvidia has a stronghold and they can prevent you from getting some. So that would be like, I call it the Nutella situation, in which, you know, Nutella, they own 80% of the hazelnut market, right? So yes, you can do a competitor, but who will you buy the nuts from?
[31:56]
Unnamed Speaker
Right?
[31:56]
Steve Morin
So there will be a need for HBM, there will be a need for SRAM. I would say better, more dedicated architectures will be able to deliver these things. And then there's, like, the next frontier after that, which is called compute in memory. There's two companies that are on that market. One is called Rain, Rain AI. Sam Altman is one of the investors. There's no surprise. The other one is called Fractile. So this is the next frontier. And the idea is that instead of transferring the data between external memory and the CPU and doing the compute there, you actually bring the CPU to the memory and you do everything there. It's crazy stuff, but it's coming. Maybe not this year.
[32:39]
Jonathan Ross
How does that change the situation? It makes it much more efficient. But what does that actually mean in reality?
[32:44]
Steve Morin
It means you get, maybe not SRAM level performance, but you get a lot faster performance in terms of compute. And if you translate that to LLMs, let's say you get much, much higher tokens per second in a single stream, which is exactly what you want when you go into reasoning, you want your model to maybe think, let's say for like half a second and then boom, you don't want to wait 50 seconds and context switch to some other thing, which is the problem everybody has today, mind you. So, yeah, I think inference will be pushed, the compute landscape will be pushed to change because of these two constraints. I know, I'm working on it.
[33:22]
Jonathan Ross
If you were to ascribe value between training and inference, out of a pool of 100, is it 80 inference, 20 training, what does that look like in five years?
[33:33]
Steve Morin
I would say 95% inference, 5% training.
[33:38]
Jonathan Ross
Do you think Nvidia owns both of those markets in five years' time?
[33:41]
Steve Morin
Depends on the supply. I think that there's a shot that they don't. Because here's the thing, you know, even if we take, you know, the same amount of, you know, let's imagine we have a new chip from Amazon, right? That is the same amount. Oh, wait, we do. It's called Trainium. You know, why would I pay Nvidia's 90% margin if I can freely change to Trainium? My whole production runs on AWS anyway. If you run on the cloud and you're running on Nvidia, you're getting squeezed out of your money, right? So if you're in production on dedicated chips, of course, so maybe through commoditization. But hey, I'm on AWS, I can just click and boom, it runs on AWS's chips. Who cares, right? I just run my model like I did two minutes ago.
[34:34]
Jonathan Ross
With that realization, do you think we'll see Nvidia move up the stack and also move into the cloud and model?
[34:39]
Steve Morin
They are. They have a product called NIM that sort of does that. The thing with Nvidia is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like, who gives a shit about CUDA? I'm sorry, but I don't want to care about that, right? I want to do my stuff. And Nvidia got me into saying, hey, you should care about this because there's nothing else on the market. Well, that's not true. But ultimately this is the GPU I have in my machine. So, you know, off I go. If tomorrow that changes, why would I pay 90% margin on my compute? That's insane. This is why I believe it ultimately goes through the software. Because the software is my entry point to the ecosystem. So if the software abstracts away those idiosyncrasies, as it does on CPUs.
[35:26]
Unnamed Speaker
Right.
[35:27]
Steve Morin
Then the providers will compete on specs and not on fake moats or circumstantial moats.
[35:35]
Unnamed Speaker
Right.
[35:35]
Steve Morin
So this is where I think the market is going. And of course there's the availability problem. You know, if you piss off Jensen, you might need to kiss the ring, you know, to get back in line, right? But I mean, ultimately, I don't see this as being sustainable.
[35:53]
Jonathan Ross
When we chatted before, you said about AMD and I said, hey, I bought Nvidia and I bought AMD and Nvidia. Thanks, Jensen. I've made a ton of money and AMD, I'm up 1% versus the 20% gain I've had on Nvidia. You said that AMD basically sold everything to Microsoft and Meta and had a GTM problem. Can you just unpack that for me?
[36:17]
Steve Morin
So all, I would say, chip makers have a GTM problem, all of them. Whether, you know, it's Google, whether it's AMD, whether it's Tenstorrent. There are, I would say, probably two fundamental problems. Number one is that maintaining multiple stacks today is very, very, very hard. So you don't. So let's say I want to buy AMD, right? That means I'm going to abandon Nvidia. Oh crap, you know, I have a six-year amortization plan on that. Oh man, what do I do? So do I need to support both stacks? Unclear. Maybe. Until AMD tells me, hey, you know, you have, I don't know, let's say a thousand Nvidia GPUs. You're about to buy 100,000 of AMD. I mean, come on, right? And I'm like, okay, that, you know, makes it worth my while.
[37:07]
Unnamed Speaker
Right.
[37:07]
Steve Morin
But that is ultimately the fundamental problem is that the steps are very high.
[37:11]
Unnamed Speaker
Right?
[37:12]
Steve Morin
I need to have a lot of incentives to buy into that ecosystem, so I need to buy a lot of them. So if you're amd, that is already a problem. But then Microsoft comes along and buys it all. Makes, by the way, OpenAI, at least on the inference side, puts OpenAI in the green because of the efficiency gains.
[37:30]
Jonathan Ross
I'm just trying to understand. So are you saying the switching costs are really high from one provider to another?
[37:35]
Steve Morin
Oh yeah, absolutely.
[37:36]
Jonathan Ross
Which is why you don't. Or are you saying that to get into one of these buy processes, you have to buy so much that it prohibits you.
[37:43]
Steve Morin
It's actually both. The buy-in is very high. So to make it worth it, you have to buy a lot. And if you buy a lot, you know, we talk to all of them, they always have the same questions, and it's completely understandable. They say, this is great, but who's the customer? Because on the other side, let's take Amazon, for instance, with Trainium. Apple just came and said, hey, we're going to buy 100,000 of them. So you want to buy 10,000, you feel like the big shot, right? Yeah, but go back to the queue, because there's Apple before you.
[38:15]
Unnamed Speaker
Right.
[38:16]
Steve Morin
So they have to have very high commitments. You cannot be incrementally better. It's very hard.
[38:21]
Unnamed Speaker
Right.
[38:22]
Steve Morin
And also very hard. I can give you one metric if you want. I know for a fact that being seven times better, and take whatever metric you want, whether it's spend, whether it's whatever, is not enough to get people to switch. People will choose nothing over something. So this is a very hard market to enter, because you also cannot compete on incremental gains. It's very hard, right? So you have to convince a lot of people. Maybe you can go the Middle East route, in which, you know, they sprinkle everything and they, you know, evaluate everything. That's not, you know, a very sustainable strategy, I would say, in the long term, or at least in the midterm.
[39:01]
Jonathan Ross
What's the right sustainable strategy then? You don't want to go so heavy that you can't ever get out and you have that switching cost.
[39:07]
Steve Morin
Right.
[39:08]
Jonathan Ross
But you also don't want to sprinkle it around and do, as you said, multiple.
[39:11]
Steve Morin
Absolutely. The right approach to me is making the buy in 0. If the buy in is 0, you don't worry about this. You just buy whatever is best today.
[39:20]
Jonathan Ross
How do you do that? By renting.
[39:21]
Steve Morin
Oh, because this is what we do. This is our promise. Our thesis is that if the buy in is zero, you know, you completely unlock that value because you're free.
[39:31]
Jonathan Ross
When you say the buy in is zero, what does that actually mean?
[39:34]
Steve Morin
It means that you can freely switch from compute to compute. You just say, hey, now it's AMD, and boom, it runs. You just say, oh, it's Tenstorrent, and boom, it runs.
[39:45]
Unnamed Speaker
Right.
[39:46]
Jonathan Ross
How do you do that then? Do you have agreements with all the different providers?
[39:50]
Steve Morin
Oh yeah, yeah, yeah. Not agreements, but we work with them to support their chips. But the thing is, at least as, you know, I would say, a user myself of our tech, if it's free for me to switch or to choose whichever provider I want in terms of compute, right, AMD, Nvidia, whatever, then I can take whatever is best today and I can take whatever is best tomorrow, and I can run both. I can run three different platforms at the same time, I don't care. I only run what is good at the moment, and that unlocks, to me, a very cool thing, which is incremental improvement. If you are 30% better, I'll switch to you.
[40:29]
Jonathan Ross
So are you taking the risk on that hardware then? If you're the one providing them to turn off and on, on demand provisioning, you name it, who takes the risk?
[40:39]
Steve Morin
This is actually a great question. I think that if you are doing it bottom up infra to applications, you will lose because nobody will care as they don't today, right? If you look at TPUs, they're available, they're great, nobody cares.
[40:53]
Jonathan Ross
Why does Nobody care about TPUs? Sorry?
[40:55]
Steve Morin
Because of the cost of buy-in. It's always the same, right? You have to spend six months of engineering to switch to TPUs. And mind you, TPUs do training. They are the only ones, with Trainium now. AMD can do training, but it's so-so. In terms of maturity, by far the most mature software and compute is TPUs, and then it's Nvidia, right? So the buy-in is so high that people are like, we'll see, right? I'm not on Google Cloud, I have to just, you know, sign up. Oh my God. Right, so these are tremendous, you know, chips, these are tremendous assets. Now, in terms of the risk, I think if you want to do it, you have to do it top to bottom. You have to start with whatever it is you're going to build and then permeate downwards into the infrastructure. Take, for example, Microsoft with OpenAI. They just bought all of AMD's supply and they run, you know, ChatGPT on it. That's it. And that puts them in the green. That's actually what makes them profitable on inference, or at least, let's say, not lose money, right?
[41:58]
Jonathan Ross
I'm sorry, how does Microsoft buying all of AMD's supply make them not lose money on inference? Just help me understand that.
[42:03]
Steve Morin
I can give you actual numbers. If you run 8 H100s, you can put two 70B models on them because of the RAM, right? That's number one. Number two is if you go from one GPU to two, you don't get twice the performance. Maybe you get 10% better performance. Yeah, that's the dirty secret nobody talks about. I'm talking inference, right? So you go from, let's say, 100 to 110 by doubling the amount of GPUs. That is insane. So you'd rather have 2 by 1 than 1 by 2, right? So with one machine of H100s you can run two 70B models if you do 4 GPUs and 4 GPUs.
[42:44]
Unnamed Speaker
Right.
[42:45]
Steve Morin
That's number one. If you run on AMD, well, there's enough memory inside the GPU to run one model per card. So with 8 GPUs you get 8 times the throughput, while on the other hand, you get eight GPUs to maybe two and a half times the throughput. So that is a 4x right there, just by virtue of this. So that is the compute part. But if you look at all of these things, there's a tremendous amount of, you know, we talk to companies whose chips are coming with almost 300 gigabytes of memory on them, right? So that is one chip per model. This is the best thing you want if you run 70B, right? Which is, I would say, not the state of the art, but this is the regular stuff people will use for serving. So if you look top to bottom and you know what you're going to build with them, then it's a lot better to do the efficiency gains, because four times is a big deal.
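The replica arithmetic in this answer can be sketched as follows. The model is a simplification under stated assumptions: each doubling of GPUs behind a single model replica adds about 10% throughput (the figure quoted in the conversation), replicas scale linearly, and one replica on one GPU is the unit of throughput. The function names are hypothetical.

```python
import math

# Sketch: node throughput when GPUs are split into independent model replicas.
# Assumption (from the conversation): doubling the GPUs behind ONE replica
# yields only ~10% more throughput, while adding replicas scales linearly.

def replica_throughput(gpus_per_replica: int, gain_per_doubling: float = 0.10) -> float:
    """Relative throughput of one replica (1.0 = one model on one GPU)."""
    doublings = math.log2(gpus_per_replica)
    return (1 + gain_per_doubling) ** doublings


def node_throughput(total_gpus: int, gpus_per_replica: int) -> float:
    """Relative throughput of a node running as many replicas as fit."""
    replicas = total_gpus // gpus_per_replica
    return replicas * replica_throughput(gpus_per_replica)


four_gpus_per_model = node_throughput(8, 4)  # model needs 4 GPUs: 2 replicas
one_gpu_per_model = node_throughput(8, 1)    # model fits on 1 GPU: 8 replicas

print(f"2 replicas x 4 GPUs: {four_gpus_per_model:.2f}")
print(f"8 replicas x 1 GPU:  {one_gpu_per_model:.2f}")
print(f"advantage: {one_gpu_per_model / four_gpus_per_model:.1f}x")
```

Under this toy model the eight-GPU node reaches roughly 2.4 units when each model needs four GPUs versus 8 units when each model fits on one card, a gap of a bit over 3x, in the same ballpark as the "4x" quoted.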
[43:42]
Unnamed Speaker
Right.
[43:43]
Steve Morin
And mind you, these chips are 30% cheaper than Nvidia's. It's like a no-brainer. But if you go bottom up and say I'm going to rent them out, people will not rent them. Simple. So that's why, you know, I think it's a good way to attack it from the software, because ultimately, do you really care whether your MacBook, let's say, is an M2 or an M3? It's like, it's the better one, and that's it.
[44:08]
Unnamed Speaker
Right?
[44:08]
Steve Morin
And imagine if you had to care about these things, that would be insane.
[44:12]
Jonathan Ross
When I listen to you now I'm like shit, I should sell my Nvidia and buy more AMD if you were forced to buy one, I'm not saying sell the other, I'm not saying like this on the other, but buy one. Which would you buy and why?
[44:27]
Steve Morin
Stock? Yeah, I used to think the market was efficient, so probably today, at least, I would go with Nvidia still, because of the supply. But you know, if we play our cards right and we ship our stuff, hopefully I will come back and tell you to buy AMD as much as you can, or Tenstorrent, you know, if they go public, or whoever else. These chips are amazing, by the way.
[44:50]
Jonathan Ross
What does everyone think they know about inference? That they actually don't? Or what does everyone get wrong about inference?
[44:58]
Steve Morin
Probably that not a lot of people are accustomed to what it entails to run production. Inference is production, and production is hard. Somebody has to wake up at night, and I used to be that guy, right? I don't want to do it again. So production is hard. Thankfully we have a lot of software nowadays to do that a lot better. But there's not a lot of reuse, because the AI field, at least, is not really accustomed to that yet. It's changing. But you know, the discussions I had a year ago and the discussions I have today are not the same. They're going in the right direction, but they're not exactly there yet. So probably that would be the number one thing: the idea that it's only, you know, training code running only the forward pass, right? This is not what it is.
[45:46]
Jonathan Ross
Can I ask, how did you evaluate the data center investment that we're seeing being made? When you look at Facebook doing 60 to 65, Microsoft doing 80, and some of the intense capex expenditure that you're seeing, how do you think about that on the data center side?
[46:01]
Steve Morin
I mean, they're still going after training, so there's still this frontier. It's probably also why Nvidia is the better buy right now, because on the Nvidia side, if you do training, it's incremental. If you have bought a thousand Nvidia GPUs and you buy 1,000 new Nvidia GPUs, that gives you 2,000 GPUs, right? But if you buy 1,000 Nvidia and 1,000 AMD, that gives you twice a thousand, right? It's a bit different. So they're still going after training, definitely, and they're very pragmatic in doing so. But I mean, they have the capex to spend, they're not making their money out of it. Probably the only one, by the way, that owns their compute is Google. There's like this triangle, I would say, of win. This is my mental model, mind you. You have the products, the data and the compute. Who has all three? Everything flows from there: products, data, compute.
[46:53]
Jonathan Ross
Who has all three? Google, Amazon?
[46:56]
Steve Morin
Amazon, they don't have products. They have Amazon, right? They have AWS, but they don't have actual products. Google has, you know, Android, Google Docs, whatever. They have everything. They can sprinkle everywhere. This is the sleeping giant in my mind. If they're not busy doing a reorg, they might.
[47:13]
Jonathan Ross
It's fascinating because everyone, if you're a shallow thinker, you think that OpenAI challenges their golden goose, which is search. And Google is threatened more than ever now.
[47:23]
Steve Morin
I mean, OpenAI is amazing, but it's not their compute, it is Microsoft's compute.
[47:29]
Jonathan Ross
And if you own your compute, you own your margin is essentially what you're saying.
[47:33]
Steve Morin
Yeah, even Microsoft, they bought, when they were running Nvidia, they bought Nvidia at some outrageous margins. I talked to a lot of people that build data centers and I tell them, mind you, these people buy tens of thousands of GPUs. And I ask them, hey, do you get at least a discount or something? And they're like, no, the only thing we get is the supply. So, I mean, ultimately, if you don't own your compute, you're starting with something at your ankle. Definitely. And so this is why I like to think in this, like this triangle, product, data, compute. And you can see where everybody sits and their weaknesses and their strengths.
[48:12]
Jonathan Ross
Can I ask you if we move a little bit? You said it's totally rational that everyone's focusing on training still. When we think about that, it's rational. If you think that efficiency and scaling laws continue to continue, place such emphasis on it, how do you think about model scaling and scaling laws coming into place? How do you think about that?
[48:32]
Steve Morin
There's like a brute force approach to this. It is a very American approach: more and more and more. But the thing is, you look at, for instance, the xAI cluster. It's not 100,000 GPUs. It is four times 25,000. You're starting to see some limits, because with InfiniBand, and in this case RoCE, which is anyway the technology they used to bridge their GPUs together, you have an upper bound, right? At some point you're fighting physics, so you can push. It's like, you know, trying to get to the speed of light. As you approach it, the amount of energy you need is a lot higher and a lot higher, and it grows and grows. So there's two, I would say, counters to that. Number one is we still scale, but there's a lot of waste and excess, you know, spending on the engineering side, which is the DeepSeek approach.
[49:21]
Unnamed Speaker
Right.
[49:22]
Steve Morin
Very successful at that, mind you. They said, yeah, if we do this and this differently, then we get multiples sometimes, right? So virtually you increase your compute capacity because you're more efficient. And the other approach is Yann LeCun's approach, which is: this is not scaling, and at some point we need to look the problem in the face and do something better.
[49:44]
Unnamed Speaker
Right?
[49:44]
Steve Morin
So of course we push and push and push because there's capital still. But I'm more of these two approaches, I think you can do more with less.
[49:53]
Jonathan Ross
At what point do we stop and say, hey, there is a lot of wastage and we could do more better?
[49:59]
Steve Morin
I think until somebody does it. DeepSeek was a good wake-up call, right? Suddenly efficiency is in. That's number one. And number two is until there's a new architecture that comes out and changes the game. So in the case of LLMs, for instance, you have what are called non-transformer models, which fundamentally change the compute requirements. So that might be a frontier that completely obsoletes the transformers. And the transformers, sorry, are, I would say, the building block by which current models work.
[50:30]
Unnamed Speaker
Right.
[50:30]
Steve Morin
So the way they work is that for each token, or syllable, if you will, the model will look at everything behind it. So you can see that as you add more text, you have more work to do. There are these new architectures that do not require this, which might change, you know, these things and probably shift the amount of compute needed to do training or to do inference. And then there's the new thing, which is Yann's thesis, which is the world model, as in LLMs are a dead end. What we need is something that understands the world fundamentally. And this is the JEPA thesis, it's called. I'm very bullish on this, but it's very frontier.
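The "more text, more work" point can be made concrete with a toy count. This sketch assumes plain causal attention, where each token looks at itself and every token before it; it counts token-to-token lookups only and ignores everything else in a real model. The function name is hypothetical.

```python
# Toy count of how attention work grows with sequence length, assuming each
# token attends to itself and all tokens before it (causal attention).

def attention_lookups(num_tokens: int) -> int:
    """Total token-to-token lookups for a sequence of num_tokens tokens."""
    # Token i (1-indexed) looks at i tokens, so the total is 1 + 2 + ... + n.
    return num_tokens * (num_tokens + 1) // 2


for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {attention_lookups(n)} lookups")
```

Doubling the text roughly quadruples the lookups, which is the cost that architectures without this all-pairs step try to avoid.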
[51:08]
Jonathan Ross
Why are you bullish on it and why is it so frontier?
[51:12]
Steve Morin
Because it's, it's hard to. He's no bullshit, right? So he explained to me how it worked and I was blown away. But it makes a lot of sense. We are creeped out because the machine talks back to us. But it's not a new thing, right? It used to, you know, this is not new technology. When it came out. Well, like when it, when it exploded, it was in new technology, but suddenly it was talking back and that freaked us out and we got crazy on it.
[51:38]
Unnamed Speaker
Right?
[51:39]
Steve Morin
But language is one form of communication, but it is ultimately a very narrow window into, you know, the world. We use it to describe the world, arguably with some loss.
[51:50]
Unnamed Speaker
Right.
[51:50]
Steve Morin
And so the JEPA approach is, long story short, is that you have essentially two things you want to do and you try and minimize the energy to do them. And from this understanding emerges, physics emerges and etc. Because you're trying to minimize the amount of energy to go from one state to the other. And that actually makes sense. Like if you try and pick this AirPod case, I'm not going to go round trip around the block to get it right, I just get it and in my brain it's wired to just do the thing. If I go and talk to myself out loud, put the hand down, move to the left and whatever, that feels very inefficient. So probably this will be something that changes. And in the case of LLMs, there's good work also on what's called diffusion based LLMs, which means instead of thinking what's called autoregressively, that means you get a new token, you re inject and you redo, et cetera. They think more like what we do, which is in patches.
[52:49]
Unnamed Speaker
Right.
[52:50]
Steve Morin
Imagine a paragraph of text and words appear until it's done.
[52:54]
Jonathan Ross
Is distillation wrong? And if we're all progressively moving towards a better future for humanity and more efficient models, is distillation not effectively open source in another wrapper?
[53:10]
Steve Morin
I think it's fair game, to be honest. I will not shed a tear. It's fair game if you. There were like some people who tried to ask, I think it was, I don't remember if it was an OpenAI model. So a diffusion model image.
[53:21]
Unnamed Speaker
Right.
[53:22]
Steve Morin
They asked it to generate an image from a Star Wars movie at whatever timestamp, and it came out with the Star Wars movie screenshot. Obviously it was trained with it. I think it's fair game because there's no free lunch.
[53:35]
Unnamed Speaker
Right?
[53:35]
Steve Morin
It was trained with data, you had a good ride, somebody was sneaky and took it, but you took it from the beginning too. So let's just accept it's fair game.
[53:48]
Jonathan Ross
And you also learn from their advancements.
[53:51]
Steve Morin
Absolutely, absolutely. I take my cup and enjoy it very much, that movie every single day.
[53:59]
Jonathan Ross
You mentioned the training there. Obviously data and data quality dictates a lot of training ability. When you think about the future of data that feeds into training, how do you think about how that will be between synthetic data versus real data?
[54:15]
Steve Morin
I'm a bit split on this. There's a part of me that says that if you re-inject data into the system, the system deteriorates. That feels a bit, I would say, intuitive. But if you look at AlphaGo, for instance, the moment it, you know, ramped up in its skills is when they started generating games, synthetic games.
[54:34]
Unnamed Speaker
Right.
[54:34]
Steve Morin
So I'm a bit, you know, split. But there are some verticals that very much benefit from this. Code LLMs, for instance: we can run code, right? So this is the Poolside thesis.
[54:45]
Jonathan Ross
Just so I understand, why does it work for coding and not for other things?
[54:49]
Steve Morin
Because you don't use the AI model to generate output. You use the machine. You just run the code, right? And you see what it makes, and you run all this code and you create data out of it. Whereas if you run an LLM and you say to an LLM, all right, generate me 2 trillion tokens of text. It will do it with its. So you may inject and stuff. So there's a lot of tricks, but ultimately my guts tell me that it feels wrong, because you re-inject data that was there and so it will deteriorate, there's loss. So yeah, I'm a bit bullish. I'm not sure exactly on what vertical. Code is one. We'll see. Distillation is in some sense a bit like that. You create synthetic data from a bigger model into a smaller one. Probably the most, I would say, mind-blowing thing about distillation is that sometimes the smaller models become better than the bigger model through distillation.
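[Editor's sketch] The "you just run the code" idea can be sketched as a filter that executes candidate programs and keeps only the ones whose behavior checks out. Everything here is an illustrative assumption: `run_candidate`, the candidate strings (stand-ins for model outputs), and the test cases are hypothetical, and a real pipeline would execute untrusted code in a sandbox rather than with a bare `exec`.

```python
def run_candidate(source, func_name, cases):
    """Execute a candidate snippet and check it against input/output cases.

    Returns True only if the code runs and every case matches.
    NOTE: exec on untrusted code is unsafe; this is a toy without sandboxing.
    """
    namespace = {}
    try:
        exec(source, namespace)
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False

# Hypothetical model outputs: one correct, one wrong, one that does not parse.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b)\n    return a + b\n",
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# The machine, not the model, decides which samples become training data.
verified = [src for src in candidates if run_candidate(src, "add", cases)]
print(len(verified))
```

The signal comes from execution rather than from the model grading itself, which is the distinction the speaker draws between code and free-form text.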
[55:46]
Jonathan Ross
Smaller models become better than bigger models purely because of the quality of the data that's inputted through them.
[55:53]
Steve Morin
One theory is that the smaller model is better at generating output that you would want it to generate. Essentially it's not better in the general sense. It's better at the task at which you were measuring it. This is what it learned to imitate.
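[Editor's sketch] "What it learned to imitate" can be made concrete with one common formulation of distillation, a soft-target loss in which the student is trained to match the teacher's output distribution. This is an assumption about which variant is meant; the speaker may equally be describing training on teacher-generated text. The function names and the temperature value are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))          # perfect imitation
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))  # student has not learned yet
```

The loss only rewards agreement with the teacher on whatever inputs are used, which fits the point that the student ends up better at the measured task rather than better in general.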
[56:09]
Jonathan Ross
How do you think about the future in terms of large monolithic models versus more dynamic architectures, smaller models?
[56:18]
Steve Morin
Sometimes it's wasteful to run big models. A lot of times it's actually wasteful to run big models. I think there's going to be a lot of smaller models for efficiency reasons. But, there's a but, which is you talk to people at DeepMind and they don't even fine-tune anymore, because they have such a big, what's called, context window, which is the data you inject into the model at runtime, that nowadays they just dump data into it and just say, do whatever that data tells you to do, instead of fine-tuning as we used to do. So if the efficiency gains, we're not there yet.
[56:56]
Unnamed Speaker
Right.
[56:56]
Steve Morin
But if the efficiency gains, I would say, pass that threshold, we'll just do it at runtime. We'll just have a great model that will just specialize at each request. But that's not for tomorrow.
[57:07]
Jonathan Ross
I think, what is retrieval-augmented generation, first?
[57:12]
Steve Morin
It's a very, very clever trick. What you do is you represent knowledge in what's called the vector space, or latent space, and you look it up through what's called vector search. So imagine you have, let's say, a 3D space that represents all knowledge, all of everything. And let's say a cat sits here, a dog sits close because it's an animal, but it's far from some other property, and so on. So what you do is you run the user's request through this same system. It's called an embedding. And that will give you a vector, and you will take whatever is closer to you, what's called semantically close. And then it's actually very clever: you insert those pieces of text before the request. So it's as if you would say, "Knowing the following," and you give the data, let's say it's law or whatever, "please answer my request." And that's it. So that's a bit of a clever trick. It's a bit dirty, because of course you are limited by the amount of data you can input, right? So there's this problem, which is how do you chunk the data that you input?
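[Editor's sketch] The steps described above (embed the chunks, embed the request, take the semantically closest chunks, insert them before the request) can be sketched as follows. The `embed` function is a toy word-count stand-in, not a real embedding model, and the chunk texts and prompt wording are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a sparse word-count vector (real systems use a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Vector search: return the k chunks closest to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=1):
    """Insert the retrieved text before the request, as a preamble."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Knowing the following:\n{context}\n\nPlease answer my request: {query}"

chunks = [
    "a cat is a small animal that purrs",
    "a dog is an animal that barks",
    "a contract is a binding legal agreement",
]
print(build_prompt("what is a contract", chunks))
```

How the source text is split into `chunks` is exactly the chunking problem raised at the end of the answer; here the split is simply given.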
[58:27]
Jonathan Ross
Are a lot of things we do not retrieval-augmented generation, then, where we say, here's a link, summarize it to the key points? Is that not RAG, because we're inputting the data?
[58:37]
Steve Morin
It is, it is. Depends on how it works. But yes, sometimes it is. But think of it as in, it's like a preamble to your question, knowing the following. And the following is a tiny window into the content. Please answer my question. And of course as you talk more and more it will forget because that window is fixed.
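[Editor's sketch] The "it will forget because that window is fixed" behavior can be sketched as a budget on how much recent conversation fits. This is a simplification under stated assumptions: tokens are approximated by whitespace-separated words, `fit_to_window` is a hypothetical helper, and real systems use model-specific tokenizers and often pin the preamble instead of dropping it.

```python
def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit in a fixed token budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.split())  # crude token count: one token per word
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

conversation = [
    "knowing the following contract text",  # the retrieved preamble
    "what is clause one",
    "and clause two",
    "and clause three",
]
print(fit_to_window(conversation, max_tokens=8))
```

With a budget of 8 word-tokens, only the last two messages survive, so the preamble that grounded the earlier answers is no longer visible to the model.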
[58:57]
Jonathan Ross
And so how does that shift the movement from large generalized model to smaller, more advanced models?
[59:04]
Steve Morin
What pushes smaller models is efficiency, roughly speed. You know, less is better. So if we can do with less, then less. It is as simple as this, right? In terms of RAG, the key frontier is what we call attention-level search. But this is something we're working on. You have the exclusivity now, I'm putting it out there. It doesn't push, I would say, model sizes. What really pushes model sizes is the efficiency rather than specializing. Meaning that if you can do the same performance with a smaller model that is fine-tuned, with RAG or whatever, then you'll do it with the smaller, because again, less is better.
[59:39]
Jonathan Ross
Can I ask you, before we move into a quick-fire round, I do just want to ask you, when we had DeepSeek, as we mentioned, to what extent were you surprised that such innovation, I would argue, and I think many would agree with me, came from a Chinese competitor, not from a Western competitor?
[59:56]
Steve Morin
Oh, I love it. Constraint is the mother of innovation. Yes, we can troll a bit about the Singapore gray market and all of these things, but ultimately they had no choice. Here's the thing, if you can buy more, why would you give a damn, right? You can just buy more. So if you are pushed to efficiency, then you will deliver efficiency. These are very, very skilled people. This is the coolest thing to me about AI, honestly is the geography doesn't matter anymore. You can just do things, you appear out of nowhere, boom, you're on the map. And so I'm very, very glad that they did. I found the reaction very entertaining, to be honest. So, yeah, I mean, constraint is a very good driver of efficiency.
[60:41]
Jonathan Ross
Do you think it is a meaningful threat to OpenAI and ChatGPT? Bluntly, they still have the consumer loyalty, the consumer brand. To what extent is it actually a long term threat?
[60:52]
Steve Morin
I'm not sure who is a threat to OpenAI at the moment. Here's why: you look at the numbers. I mean, we live in a bubble. We follow every new episode, whatever, new model, whatever, who said what, and so on. But you know, I go to my mother and I ask her, you know, do you know ChatGPT? And she says yes. And do you know, I don't know, I don't want to dunk on anybody, but do you know some other model? And she says, what is it?
[61:16]
Unnamed Speaker
Right?
[61:16]
Steve Morin
Even Gemini, right, like Google, right? So they have a strong brand, they have a strong product, but there's a balance between the product and the models, honestly. So this is Gary from Fluidstack, actually, who told me that his mental model, in terms of model providers, they'll be like car makers. There's no winner take all. Everybody will have their own, because ultimately also human knowledge is, everybody has everything. So we're converging. But I like that analogy. Yes, DeepSeek made a very good, you know, made waves. But it was, it was, you know, waves that were amplified by the media and the narrative and the drama.
[61:53]
Jonathan Ross
Do you think export regulations inhibit China's ability to compete in any way today, maybe tomorrow?
[62:01]
Steve Morin
I'm not sure. They're a bit late in terms of, you know, ASIC. They are like at A100 level. But they have probably, I would say, one of their unfair advantages is that it's like, you know, when you do exercise in the water, right, it's like this. So this is their state. They are constrained, so they are bound to do better. They just cannot buy their way into better compute. So I think it hinders their success, but I think it's short-term to think that way.
[62:31]
Jonathan Ross
Are you fearful that Europe are going to regulate ourselves into constraints in a world of AI?
[62:38]
Steve Morin
No, I don't care. This is something I. It makes me wonder sometimes. I understand the narrative and so on, but I am absolutely not fearful. Let's be successful first and then we'll talk about the politics I have so far. But again, I'm not Mistral. I'm not building gigawatt data centers and so on. So if you build gigawatt data centers, you run into these problems, but maybe you run into these problems. But the thing is, if you're successful, everything flows from there. From there.
[63:08]
Jonathan Ross
Steve, I'm being direct here, but I'm asking you for the pros. Everyone says Mistral just doesn't have enough money to compete. That is kind of word on the street. To what extent is that fair?
[63:19]
Steve Morin
They are very competent. I think it's easy to spread FUD. There's a lot of FUD going around, especially about regulation and everything. But here's the thing. I look around me and I don't see, you know, what I read, right? So I am hardly convinced about, you know, everybody was saying that they were dead and boom, they came out with their release and it was insane. So what I know is that I hope they don't have too much money. That's for sure. You want to be clever, right?
[63:46]
Jonathan Ross
Final one before we do a quick-fire. So enjoyed this, Steve. Final one before we do it. Stargate was, you know, a $500 billion announcement. How did you evaluate that?
[63:56]
Steve Morin
My first impression was that I don't buy it. I would say, you know, American style, right? You start with the claim and we'll figure it out later. I don't buy it. Ultimately, I'm not sure I care that much about it. Let's imagine it's true, right? Congratulations. Amazing. But it is more of the same. It is a vertical scaling. And as you know, my days are spent on efficiency. So I look at these things as being like, all right, this is an American car of AI. It's big, it consumes a lot of gas, but ultimately it's not a good car. Right. I think there has to be sufficient capital. But at some point, I'm not sure it is really a differentiator. That was prior to DeepSeek. Then DeepSeek came. That was always my thesis. But you need money, you need infrastructure. But what is ultimately probably the two limiting factors today is talent and energy. That's it. The rest, yes, of course you can buy 500 billion of GPUs. I think, by the way, 90% margin. So if we work on that margin, we can shrink that number, probably. So I'm not easily entertained by these numbers. I've seen how the sausage is made way too many times.
[65:14]
Jonathan Ross
Dude, I want to do a quick fire with you. So I say a short statement. You give me your immediate.
[65:18]
Steve Morin
Sure.
[65:19]
Harry Stebbings
If you had to bet on one.
[65:21]
Jonathan Ross
Major shift in AI infrastructure over the next five years, what would it be?
[65:26]
Steve Morin
Oh yeah, latency reasoning. Definitely this year.
[65:29]
Jonathan Ross
What does that mean?
[65:31]
Steve Morin
So the shift from throughput, so how fast my answer is, to how long it takes for my complete answer to appear. That is probably one of the fundamental. Like this year, right. Longer term, I'm very much rooting for non-transformer models that will change the compute landscape also, and of course world models, right?
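[Editor's sketch] The distinction between aggregate throughput and the time for one complete answer to appear can be shown with simple arithmetic. All numbers and both helper names below are hypothetical, chosen only to show that a serving setup can win on one metric and lose on the other.

```python
def completion_latency_s(output_tokens, tokens_per_second_per_request,
                         time_to_first_token_s=0.0):
    """Seconds until one user's complete answer has appeared."""
    return time_to_first_token_s + output_tokens / tokens_per_second_per_request

def aggregate_throughput(batch_size, tokens_per_second_per_request):
    """Total tokens per second across all requests served together."""
    return batch_size * tokens_per_second_per_request

# Hypothetical setup A: large batch, slow per-request decoding.
a_throughput = aggregate_throughput(batch_size=64, tokens_per_second_per_request=20)
a_latency = completion_latency_s(1000, 20)

# Hypothetical setup B: small batch, fast per-request decoding.
b_throughput = aggregate_throughput(batch_size=4, tokens_per_second_per_request=200)
b_latency = completion_latency_s(1000, 200)

print(a_throughput, a_latency)
print(b_throughput, b_latency)
```

Setup A serves more tokens per second overall, but a 1,000-token answer takes ten times longer to finish than in setup B, which matters when the answer is a long chain of reasoning.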
[65:51]
Unnamed Speaker
Yes.
[65:52]
Steve Morin
And. Or energy based models.
[65:54]
Jonathan Ross
What's one piece of advice you'd give to AI startups navigating the changing landscape of training, inference and hardware?
[66:01]
Steve Morin
Probably the number one thing I would say is do not resell compute if you can. A lot of AI startups that are building on top of AI are trying to make a margin on top of a very big cake. And ultimately what they sell is compute. If you look at the dollar of spend, for $1 of spend, maybe 98% of it goes to somebody else's margin. So if you do AI, as much as you can, try to verticalize on the product, but not on the compute. If your business model implies buying a lot of tokens, it's a very hard circle to square to put that into $20 a month, right? So I always say, please look at it from that angle and if you can, try and avoid it.
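[Editor's sketch] The "hard circle to square" of fitting token spend into $20 a month can be worked through with a back-of-the-envelope calculation. The token price and usage figures are hypothetical placeholders, not numbers from the conversation; only the $20 subscription comes from the answer above.

```python
def monthly_margin(subscription_usd, tokens_per_month, usd_per_million_tokens):
    """Subscription revenue minus the cost of the tokens bought for one user."""
    compute_cost = tokens_per_month / 1_000_000 * usd_per_million_tokens
    return subscription_usd - compute_cost

def break_even_tokens(subscription_usd, usd_per_million_tokens):
    """Monthly tokens per user at which compute cost equals the subscription."""
    return subscription_usd / usd_per_million_tokens * 1_000_000

# Hypothetical: a heavy user consuming 3M tokens at $10 per million tokens.
print(monthly_margin(20, 3_000_000, 10))
print(break_even_tokens(20, 10))
```

Under these assumed prices, any user past two million tokens a month costs more in purchased compute than the subscription brings in, before any other expense.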
[66:48]
Jonathan Ross
What's the biggest challenge that Jensen Huang faces today?
[66:52]
Steve Morin
The highs are very high, but they don't last forever. So probably it's how to navigate the downslope. Blackwell is probably something that keeps them awake. That keeps him awake at night.
[67:03]
Jonathan Ross
Why would that keep him awake at night? Would that not re-energize him? More orders, new enthusiasm, new product, baby.
[67:10]
Steve Morin
Because orders are getting cancelled.
[67:12]
Jonathan Ross
Why are they getting cancelled?
[67:13]
Steve Morin
They have a lot of problems with this chip. So a lot of people are canceling their orders. These chips are like on the frontier of scaling. And so they were supposed to come out last summer. But that heat dissipation and matter bending problem it used to be called. The people who are very privy to silicon told me this is what we call a pretty big fucking problem, right? End quote. Probably how to navigate the downslope, maybe you don't know. But the supply of H100 was actually smoothed out over the year, so that they decided so that they didn't have like a big spike in deliveries and then a quarter less.
[67:53]
Unnamed Speaker
Right.
[67:54]
Steve Morin
Which pissed a lot of people, mind you, who bought a lot of them. Some of them even haven't received their order from last year and they already see like the new chip, the B200 and then the one after, you know, and they're super pissed. There will be a downslope at some point. The question is when? How? Like if there's like the H100 bubble, of course it will impact Nvidia, but Blackwell is. I'm probably going to get a lot of flack for this, but I've seen some very worrying numbers about it and varying testimonies about people who operate these things right. So that ride will stop or at least slow down.
[68:32]
Jonathan Ross
Steve, I'm not sure I've ever learned quite as much in one episode. Seriously, we said before, oh, wow. No, I love what I do because I'm able to ask anything to the smartest people in the business. And I so appreciate you unpacking so much for me today, man. I'm thrilled to say that I actually finally get what you do after years.
[68:54]
Steve Morin
Of investing with you.
[68:57]
Jonathan Ross
But you've been a star, so thank you, man.
[68:59]
Steve Morin
Thank you. Appreciate it. Thank you.
[69:03]
Jonathan Ross
I mean, I said it there.
[69:04]
Harry Stebbings
I think I learnt more in that episode than I have done in the last 1000. When it comes to technical specs and the future of AI, Steve was incredible. If you want to watch the episode, you can find it on YouTube by searching for 20VC. That's 20VC. As always, I so appreciate all your support and stay tuned for an incredible episode coming on Wednesday with Oscar, founder at Glovo, on turning Glovo into a two billion dollar business.