
A
Hello, and welcome to a free preview of Sharp Tech. Hello, and welcome back to another episode of Sharp Tech. I'm Andrew Sharp, and on the other line, Ben Thompson. Ben, how you doing?
B
Well, the question is, how are you doing, Andrew? We unfortunately did not podcast last week. There was no chance it was happening. My travel plans actually worked out, so good thing, you know, we did the makeup in Taiwan. Just reminding everyone we're very conscientious of your time and attention. But in the meantime, your Washington Wizards: number one pick in the NBA draft.
A
Oh, my God. Yeah.
B
Is this making up for the fact that the NBA let Dallas win last year so that Luka could go to the Lakers, thus demoting you, so they made it up to you the following year?
A
Um, it doesn't quite make up for it, because there's no Cooper Flagg-level prospect available this year. But look, I can't get greedy. Does it make up for the last 25 years of Wizards fandom? Probably not. It's not making up for that either. However, it's nice to have some hope in Washington, D.C. for the first time in about 10 years or so. I was very, very happy on Sunday afternoon. Now I'm, I would say, pretty anxious about what direction the Wizards are going to go over the next.
B
I'm glad you had a few moments of happiness. I mean, hopefully that's not the peak level of happiness in like, your entire adulthood of being a Wizards fan, but I'm happy that it happened.
A
You know what? That's all that matters. There was a window of pure happiness. Now we're back to anxiety. We'll see where we end up by the end of June.
B
But look, here's the deal. Here's the deal. I hope they choose someone that you don't like, because my experience of the current Wizards regime is that every time they do something you disagree with, they end up right and you end up wrong.
A
So, look, we're not going to relitigate the Bradley Beal trade at the top of this Sharp Tech episode.
B
Hey, you've become an Alex Sarr guy. I mean, like, you wanted, what's his name in Houston? Reed Sheppard. I think you'd rather have Sarr than Sheppard these days.
A
I'm not giving up hope on Reed Sheppard. I'm not fully in on Alex Sarr, but I'm.
B
But you'd rather have Sarr than Sheppard.
A
I would definitely rather have Sarr, and he looked great this year, so things are looking up in our nation's capital.
B
I'm happy for you. I'm happy for Charles. You know what I'm a big believer in? Raise your son to support the hometown team. Like, you don't want your son to end up like Andrew, I guess? No, if you want to be like Andrew and not cheer for the hometown team just because you're a contrarian, look, potential big history for you in podcasting. So that's fine as long as you did it yourself. But by and large, either raise your son to support the hometown team or let him rebel on his own, you know. So we'll see.
A
Well, and that's one of the reasons I'm excited. I can actually take my son to some Wizards games over the next several years with a clear conscience. I don't have to worry about consigning him to decades of mediocrity, or at least the next decade of mediocrity. But in any event, it's great to see you. You know, I did miss you last week. It's good to see you on the other end of the video call here and we have a lot to cover. So we're going to begin with your article on Monday that was headlined the Inference Shift, and you mentioned that there have been three inflection points over the past three years of AI development. I'll list those three inflection points for anybody who's been asleep for the past few years. ChatGPT demonstrated the utility of token prediction. O1 introduced the idea of reasoning where more tokens meant better answers, and then Opus 4.5 and Claude Code introduced the first usable agents which could actually accomplish tasks using a combination of reasoning models and a harness that utilized tools, verified work, etc. So reading your article Monday, it seemed like the kernel of insight that spawned the article was that fast inference for coding is ultimately going to be a temporary use case. Can you explain what you mean by that? Because it was a bit of a light bulb moment for me that seems obvious, but hasn't really been articulated, at least from what I've seen.
B
Well, I mean, I don't know. When it comes to AI, I think everything has probably been articulated to some extent. You know, this is where the doomers get credit. A lot of stuff they've talked about has come true. But this idea overall, to set the stage: when it comes to computing, speed is always important. So I'm going to make some assertions about speed and quickness where some people are like, why would you want a slow computer? No, this entire discussion is about trade-offs, scaling, all those sorts of things. So let's have that up front. If you're coding, of course you want the computer to be fast. But everything that we've done with computing, by and large, has been, the human's in the loop. And as long as the human is in the loop, like, we care how fast; computers can basically never be fast enough. Right. We're always looking for them to be faster, or at least fast enough for the human interaction speed. If you think about it from an enterprise perspective: why would an enterprise update computers quite frequently back in the '80s or '90s, even though they were much more expensive? Because the more expensive asset is the human workers, and if they're waiting around for the computer, you're sort of wasting time and money.
A
You're losing productivity.
B
Sure, that's right. And so enterprise has always been willing to pay for productivity. And so at least for, like, your coder or whatever, you'd typically get a very good computer, and it would be updated fairly regularly so that you could work and the computer could respond and you could continue your work. And there's a bit now where more and more of that work is obviously being done by the computer. And the more capable these agents are, there's a couple weird things going on. One, you're having the agent go and do something, and then, like, what are you doing in the meantime? I mean, you could spin up another agent to do something else. But are you losing your own context, your mental context? I know for me, I certainly have a hard time sort of switching gears sometimes. There's also this weird bit where people have token budgets, and like, they, or Claude Code, will have their limitations on how much usage, and then they'll use it up. It's like, oh, guess I gotta go home for the day. You want me to.
A
What do I do now? This is my new workflow.
B
Sure, that's right. Am I gonna like just go back to working the way I used to for the next three hours? No, of course not. This is ridiculous. Just wait till tomorrow. But so you have this idea of the, you know, even in today, like there's a measure of how much work these can do and it's getting longer and longer and longer. I think that's actually one of the more interesting benchmarks of these programs is how long can they do an autonomous task before they sort of like lose the thread? And that's getting longer and longer and longer, but still it comes back to the human. And then the human has to like
A
The human decides what to do next, hands over tasks.
B
That's right. All these sorts of things. And obviously it makes sense, that's where we are. But, and as long as that's the case, I'll get back to the "but" in a second. As long as that's the case, of course we want faster and faster inference, and it's worth paying to get that inference, because the more quickly you can get that response, the better. And I think, you know, I've been focusing on the agent bit, but the thinking part is super important. Like, you know, ChatGPT, before they got the Spud model, was still running like a GPT-4-class model. Like, the base model was horrible, but their reasoning was so good that you'd still get really good answers. It just took forever.
A
Yeah, you just have to wait like 55 seconds for it to come back with an answer.
B
55 seconds if you're lucky. Right. Some of this stuff, it would just take a good, like, minutes, but it'd come back with a really good answer. But like, what am I doing here? Like, what am I waiting around for? And so you think, oh man, wouldn't it be great if that were faster. You see things like Cerebras or Groq or whatever, and it's not just amazing in terms of spitting out an answer, but if you're reasoning, where the more you reason, the more tokens you use, the smarter it gets, wow, wouldn't it be great if that could be faster and faster and faster? And absolutely, that is the case.
A
And Cerebras and Groq, just for anybody who's not familiar, those two are chip companies that specialize in inference and specialize in speed as responses
B
Accidentally specializing in inference. Both of them started kind of before the LLM moment and sort of retrofitted what they'd been working on to this. I think the next-generation chips for both of them are going to be super interesting in terms of, like, how would you change things now that you have that in mind. They're architected a little bit differently. Cerebras is actually really interesting architecturally. A wafer is 300 millimeters, and usually on a wafer you're limited by the reticle limit. The reticle limit is the lens, basically, for your lithography, and how much of the chip it covers, which, you know, usually that's the size of a chip or something. I don't have the numbers in front of me. They're in my article.
A
There was a tiny, tiny number in your article. It's mind boggling how infinitesimal all these measurements are.
B
Well, so what you have to do though is, if you want a bigger and bigger chip, like, Blackwell is actually two chips fused together. And those two chips, the size of them is defined by the reticle limit. And then they have to put in an interposer to let them communicate and expose themselves to the system as one chip, even though they're actually two chips linked together. Apple's, I think their Ultra chips are like sort of something different, and they're limited by the reticle limit. And so the idea is, on a big wafer, the bigger the chips, the more expensive, because you're more likely to have yield problems because there's a defect on the chip. But you're gonna have a number of chips that are defined by the reticle limit. What Cerebras has done is basically, they developed a technology to... now I'm forgetting the name. The lines between all the reticle exposures are scribes, I think, scribe lines. I should have had this in front of me. But they basically run wires across that, so that you do a bunch of exposures limited by the reticle limit all over the chip, of all the different parts of the chip. And then they do this additional step of adding in all these lines across these sort of boundaries so that the entire wafer is one chip. It's wild stuff. It's a really interesting approach to get a lot of compute and a lot of SRAM, like the super-fast on-chip RAM, which is what Groq does also. But Groq is still limited by the... So Groq is more about systematizing, putting different ones together; Cerebras is like, no, one wafer, one chip. And then they're, they're.
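Editor's note: Ben's remark that bigger chips are more likely to have yield problems can be illustrated with the simple Poisson yield model, in which the chance that a die has zero defects falls off exponentially with its area. This is only a rough sketch. The defect density and die areas below are hypothetical values chosen for illustration, not figures from the episode or from any foundry.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Expected fraction of defect-free dies under a simple Poisson model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


# Hypothetical defect density, assumed purely for illustration.
DEFECT_DENSITY = 0.001  # defects per square millimeter

small_die = poisson_yield(100.0, DEFECT_DENSITY)  # a small die
large_die = poisson_yield(800.0, DEFECT_DENSITY)  # a die near a typical reticle-sized area

print(f"small die yield: {small_die:.1%}")
print(f"large die yield: {large_die:.1%}")
```

Under this model a single defect kills the whole die, which is why a design that spans an entire wafer has to tolerate defects in some other way rather than discard the part.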
A
And that allows them to serve stuff faster than...
B
Unbelievably fast. Like, this solves... So there's different aspects of the inference process, but there's parts of it that are just extremely limited by bandwidth, of like how fast you can get memory into the processor and move on to the next step. And they're unbelievably fast at that. Like orders of magnitude faster than other sorts of approaches. But there's limitations. Like, you're limited by how much memory you can fit on that chip, and the moment you're going off the chip, your performance is totally plummeting. Right. So it's definitely a narrow use case, but there's situations where if you want an immediate response, and not just an immediate response, but an immediate, sort of thinking-through-things response, it makes a lot of sense. But then if you're reasoning and you're doing stuff, it's not just the size of the model fitting on the chip, but the KV cache, which is like all the context of the conversation, and that gets large very quickly. There's lots of limitations, but the larger the market is, the more room there is for different sorts of approaches. Yeah. So we'll see how it turns out. They're IPOing this week, or I thought it was gonna be this week. It hasn't come out yet, I don't think. Maybe it's today. And of all the times to IPO, like, right now is not that.
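Editor's note: the point that the KV cache "gets large very quickly" can be made concrete with back-of-the-envelope arithmetic. The formula below is the usual accounting for a transformer that stores one key and one value vector per token, per layer, per KV head. The model dimensions (80 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical, picked for illustration, and do not describe any specific model or chip discussed in the episode.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold the KV cache for one sequence."""
    # The factor of 2 covers one key vector plus one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model shape, assumed for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for tokens in (8_000, 128_000, 1_000_000):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens)
    print(f"{tokens:>9,} tokens of context -> {size / GIB:8.1f} GiB of KV cache")
```

The cache grows linearly with context length and has to be held per conversation, so long-running sessions can outgrow fast on-chip memory well before the model weights themselves do.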
A
Yeah, stretch, sure, we'll see. Well, yeah, you can envision, like I think you mentioned in your article, if there's voice interactions with AI, that's the biggest part. Speed is gonna matter. And for the consumer market, speed will always matter. But to the extent that we expect a lot of computing to just be done by the computers, certainly in the enterprise that will probably be independent of humans and then optimizing for speed just doesn't make as much sense. And the obsession with speed is sort of immaterial to the conversation. Nobody has to care about the speed for the robot users. And that was sort of the secondary implication of the agentic shift that seems inevitable, but was not immediately apparent to me over the last couple months as we're all sort of obsessing over agents and what they mean in the enterprise.
B
That's. Yep. So if you think about agents are like, what's the upside? They never sleep, they're always working. Right.
A
And that's like 24 hour employees. Yeah.
B
Right. And so we're gonna need all this compute. And I think that's all completely true. But part of the implication of them always being awake and always being available is they can sit around for compute. It's fine. Right. Like, there's no loss in terms of them waiting around. And particularly for these agentic workflows, at least right now, and I think there will be breakthroughs, there will be algorithmic breakthroughs, there'll be architectural breakthroughs, but for now, for a lot of these agentic workloads, the real limiter is memory. It's this KV cache issue. It's pulling in all this context. It's remembering state. And if you want these sorts of things that aren't just useful for a task you define right now, but can be spun up, like, suddenly something comes up, you know, in a week, and it's spun back up and it has all the right context and it knows what needs to be done and it executes a job and then it goes back to sleep or whatever it might be, all that stuff needs memory. And the memory question is also interesting, because everything's been about HBM, high-bandwidth memory. The reason we want high-bandwidth memory is because we want... What do you think we want from high-bandwidth memory?
A
Is that for training?
B
We want high bandwidth, is the answer. So.
A
Well, I gave it a shot there.
B
Okay, so.
A
So I know that China lacks high bandwidth memory and training is a problem in China. So that's how I landed on that guess.
B
No, you're totally right, actually, because the reason we need it for training is, everything in training is this highly distributed problem where we want the GPUs to do these calculations super fast. We want to keep the GPUs fed. And so it's this multivariate problem that Nvidia has solved way better than everyone else. It's not just the fast processors. It's also loading them up with tons and tons of HBM, high-bandwidth memory, but then also developing all this crazy networking to tie all this stuff together. So it's not just that you have chips acting as one system, but you have fleets of them, tens of thousands of them, acting as sort of one chip, and it's going to be like hundreds of thousands of them. And so there's a lot about the way development for AI has gone that has been very focused on this problem: how do we execute stuff quickly, but then keep the executors full so they're being utilized all the time? And it turns out, because they're GPUs, and GPUs are fairly flexible, not as flexible as a CPU, but more flexible than, like, an ASIC, this is also an architecture that works for inference, right? What's the inference problem? You need to get the model into memory, and then you also need to house this KV cache. So if you have all these GPUs linked together, you can solve both problems. You can get large models into, like, a pod, and you can handle the KV cache issue. But as it gets larger and larger, the KV cache is actually getting to be a problem even for GPUs. So Nvidia's announced their own thing called Dynamo for inference in general, but they've also announced, like, a whole rack for their systems that's just memory, it's just SSDs. And the whole point of that memory is for KV cache. But even then, if you fast forward five years, fewer, I don't know, you have these agents where, again, there's no limit to how many agents you might want. 
There is a limit to how much compute you might want. For humans, the limiter is how many humans there are and, like, how much stuff they can come up with. But at least in theory, the limit for computers doing computing, and especially once they're doing their own programming and spinning up their own sorts of things, is effectively infinite. You know, you're going to need to store all this context. And again, there's lots of innovations around here, whether it's batching or caching or, like, you know, some prompts have the same sort of context and so you can put those together. There's all sorts of things that people are going to figure out, and they're going to innovate on these in lots of ways. But the idea is, there's a concept called a memory hierarchy. So with traditional computing, like your typical CPU, you have registers, which is like the actual data that's being processed. Then you have, like, L1 cache, L2 cache, maybe L3 cache. This is all storage on the chip itself. One of the reasons Apple's M chips are really fast is they have a lot of cache. And so it's right there on the chip, like, a lot of the core foundational operating system stuff. This is where their integration has really paid off, to make sure everything is super available and executes very, very quickly. But then from there you go out to RAM. RAM is super fast relative to your hard drive, astronomically slow compared to cache. Right? You go past RAM, you go to your SSD. Remember when we got our first SSDs? Unbelievably fast compared to spinning disks, but SSDs are way slower than RAM. And then you can go out to spinning disks, you could go out to tape machines. Like, there's still storage using magnetic tape: huge capacity, very, very slow. In general there's this capacity-speed sort of trade-off with memory. 
And so a part of designing a computer is designing the memory hierarchy, like figuring out in general what stuff should be super close to the processor and thus super fast, but knowing you have a limited amount of space there, and what stuff gets bumped down. And this is stuff.
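Editor's note: the memory hierarchy idea described here, deciding what sits close to the processor and what "gets bumped down," can be sketched as a greedy placement: put the most frequently accessed data in the fastest tier that still has room. Everything in this sketch is an illustrative assumption, including the tier names and capacities, the item names and sizes, and the greedy policy itself. Real systems use far more sophisticated policies.

```python
def place(items, tiers):
    """Greedy placement of data into a memory hierarchy.

    items: list of (name, size_gb, accesses_per_sec)
    tiers: list of (tier_name, capacity_gb), ordered fastest first
    Returns a dict mapping item name to tier name (None if nothing fits).
    """
    free = {tier_name: capacity for tier_name, capacity in tiers}
    placement = {}
    # The hottest data gets first pick of the fastest tier.
    for name, size_gb, _rate in sorted(items, key=lambda it: it[2], reverse=True):
        placement[name] = None
        for tier_name, _capacity in tiers:
            if free[tier_name] >= size_gb:
                free[tier_name] -= size_gb
                placement[name] = tier_name
                break
    return placement


# Hypothetical capacities and workloads, for illustration only.
tiers = [("HBM", 80), ("DRAM", 512), ("SSD", 4000)]
items = [
    ("model_weights", 60, 1000.0),
    ("active_session_kv", 15, 500.0),
    ("idle_agent_kv", 200, 0.01),
    ("archived_context", 1500, 0.0001),
]

result = place(items, tiers)
for name, tier in result.items():
    print(f"{name:18s} -> {tier}")
```

The idle agent's context lands in a slower, larger tier, which matches the argument in the episode: if no human is waiting, the slower tier is an acceptable place for it.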
A
Are we just talking about tasks and different applications on a computer?
B
No, like literal ones and zeros, like the actual bits that go into calculating this sort of thing, which ultimately is everything. Everything is a one and a zero at the end of the day. And so this is already happening for inference, and it's going to happen even more. So right now, mostly everything's in the HBM of a bunch of GPUs tied together. There's not much of a memory hierarchy. There is a little bit, but we don't need to overcomplicate it. Generally speaking, there's been this one-size-fits-all: a bunch of GPUs tied together with a bunch of high-bandwidth memory, and put everything in there and it'll do everything. But we're already seeing this shift, particularly if you want a ton of context, to having other places to put stuff in memory that are slower, but you get way more room, way more capacity. And as that increases over time, where the memory aspect becomes more important and stuff gets slower, that's okay, because with the agents, there's no human in the loop. That's right, that's right.
A
And if that's the world that we live in, who wins in that sort of world? Like how does that change how AI is served and what sort of infrastructure is best to serve it? Does the infrastructure get more affordable? Is that bad news for Nvidia in that scenario? What do you think?
B
I think the biggest winner is China.
A
Okay, why is that?
B
Well, for a few reasons. Number one, if a lot of the most important AI is done on, you don't necessarily need HBM, just regular RAM is fine, and then you need, like, a lot of storage beyond that, and relatively slower chips are okay because they're waiting around on memory anyway, China can make all that stuff. Number two, they can make all that stuff and start selling it abroad, alleviating the sort of memory shortage that we're facing, particularly if, you know, SK Hynix and Samsung and Micron are all focused on HBM. So who's going to make DRAM? Right. And so it's not just that they can probably self-supply for more AI workloads than you might think in this agentic workload, but also there's going to be a large market for their companies to expand and sell stuff, even if it's relatively lower end compared to Western companies. Overall, for the hyperscalers in general, buying cheaper stuff is always better. And I do think Nvidia is running as fast as they can. They've completely... this whole Dynamo approach, it's basically like an operating system for inference, to balance different loads, adding on, like, a Groq for the super-fast stuff and their regular GPUs for other stuff. And the way they tie that together is really interesting. They have Groq just doing the sort of speed-related aspect of inference, but then they're also shipping, like, these new memory sort of racks. Like, they're certainly out there trying to get ahead of it. But it's going to be a challenge in the long run, which we saw with the cloud. Back in the day, in the dot-com era, all the money you raised for a startup went to buy Sun systems. Like, you would buy these incredible servers that were fully integrated, super reliable. Best there were. SPARC, operating system, and you had to buy that to actually run your website so you could then build a business. So what happened was, it's funny actually, Hotmail was one of the first ones to do this, Hotmail and Yahoo. 
But Google is the one that really did it at scale, which was taking commodity hardware, like Intel-based systems, and just building a ton of them and saying, like, these are way less reliable, way flakier, much cheaper. But that's fine, because once you get to our scale even Sun systems are going to break down, and also we're not going to pay Sun all that money. And so they designed an entire way of computing that assumed stuff was just way slower, way more fragile. But in software we can work around that. We can have built-in resiliency, built-in fault tolerance, all these sorts of things. Amazon took that concept and made it available to everyone. So now if you're a startup, you could start a website right away with nothing up front. And that was a. Yeah. And, you know, that seems like sort of what I'm talking about. You start out at the beginning. You have these dedicated super high-end systems that sort of do it all. And in the long run every piece of that system is going to get disaggregated and become sort of commodity markets in their own right.
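Editor's note: the argument that software can work around flaky commodity hardware rests on simple arithmetic. If a request can be served by any one of several replicas, the service is only down when all of them are down at once. The sketch below assumes replica failures are independent, which real deployments only approximate, and the availability figures are hypothetical.

```python
def availability(single_node_availability: float, replicas: int) -> float:
    """Chance that at least one replica is up, assuming independent failures."""
    return 1.0 - (1.0 - single_node_availability) ** replicas


# Hypothetical figures: one premium server versus several cheap ones.
premium = availability(0.999, 1)
commodity = availability(0.99, 3)

print(f"one premium server:      {premium:.6f}")
print(f"three commodity servers: {commodity:.6f}")
```

Under these assumptions, three machines that are each individually less reliable still beat the single premium machine, which is the trade that the commodity-hardware approach described above relies on.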
A
Right. So it goes from integrated to modular in terms of what exactly they're building.
B
Exactly. And because there's not, like, a user. I think holding on to the integration in the long run depends a lot on owning the user to a certain extent. Like, Apple still owns the user, and the user cares about the benefits of integration. No one cares about how their inference stack is constructed as long as it does the job right.
A
So cost will obviously matter in the long run if there are more affordable, more durable options than the Nvidia GPUs. Yeah, well, it'll be very interesting.
B
This isn't necessarily totally bearish for Nvidia. Nvidia's approach is still by far the best for training and it's not like we're going to suddenly stop training.
A
All right, and that is the end of the free preview. If you'd like to hear more from Ben and I, there are links to subscribe in the show notes, or you can also go to sharptech.fm. Either option will get you access to a personalized feed that has all the shows we do every week, plus lots more great content from Stratechery and the Stratechery Plus bundle. Check it out, and if you've got feedback, please email us at email@sharptech.fm.
Date: May 15, 2026
Hosts: Andrew Sharp and Ben Thompson
This episode explores the evolving role of inference speed in AI, especially as autonomous “agents” shift much AI operation away from real-time human oversight. The hosts examine inflection points in AI development, the implications of agentic computing, and how new hardware (like chips and memory) will reshape infrastructure, with special attention to winners and losers in this coming agentic future—including a surprising argument for China's strategic position.
“If you're coding, of course you want the computer to be fast. But everything that we've done with computing ... has been, the human's in the loop. And as long as the human is in the loop ... computers can basically never be fast enough.” (Ben, 04:22)
“Part of the implication of them always being awake and always being available is they can sit around for compute. It's fine.” (Ben, 13:43)
On why agentic AI weakens the need for speed:
“Nobody has to care about the speed for the robot users. And that was sort of the secondary implication of the agentic shift that seems inevitable, but was not immediately apparent to me...”
(Andrew, 13:36)
On China as a strategic AI hardware winner:
“I think the biggest winner is China... They can make all that stuff and start selling it abroad, alleviating the sort of memory shortage that we're facing...”
(Ben, 21:03)
On infrastructure transitions:
“You start out at the beginning. You have these dedicated super high-end systems that sort of do it all. And in the long run, every piece of that system is going to get disaggregated and become sort of commodity markets in their own right.”
(Ben, 24:36)
The discussion is fast, deep, and sometimes dense but lively—reflecting Ben’s enthusiasm for the technical weeds and Andrew’s skill at drawing out big, strategic implications. There’s a mix of lightheartedness (sports banter) and sharp, accessible explainer content.
For full episodes and subscriber content, visit SharpTech.fm.