
Shifting AI inference from hyperscale data centers to smaller edge data centers – and even consumer devices – could have big implications for energy.
Podcast Host / Announcer
A very brief word before we start the show. We've got a survey for listeners of Catalyst and Open Circuit, and we would be so grateful if you could take a few moments to fill it out. As our audience continues to expand, it's an opportunity to understand how and why you listen to our shows, and it helps us continue bringing relevant content on the tech and markets you care about in clean energy. If you fill it out, you'll get a chance to win a $100 gift card from Amazon, and you can find it at latitudemedia.com/survey or just click the survey link in the show notes. Thank you so much.
Latitude Media Announcer
Latitude Media: covering the new frontiers of the energy transition.
Shayle Khan
I'm Shayle Khan and this is Catalyst.
Dr. Ben Lee
We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting for the data center cloud. Of the 80%, I would say most of that will be on the edge. I think maybe on the order of 1% ends up being put on your consumer electronics.
Shayle Khan
Coming up, could the age of edge inference blunt the big data center boom?
Podcast Host / Announcer
Surging electricity demand is testing the limits of the grid, but Energy Hub is helping utilities stay ahead. Energy Hub's platform transforms millions of connected devices like thermostats, EVs, batteries and more into flexible energy resources. That means more reliability, lower costs and cleaner power without new infrastructure. Energy Hub partners with over 120 utilities nationwide to build virtual power plants that scale. Learn how the industry's leading flexibility provider is shaping the future of the grid. Visit energyhub.com.
Latitude Media Announcer
Clean energy is under attack, and it's more important than ever to understand why projects fail and how to get them back on track. The center-left think tank Third Way surveyed over 200 clean energy professionals to answer that exact question. Their newest study identifies the top non-cost barriers to getting projects over the finish line, from permitting challenges to NIMBYism. To read Third Way's full report and to learn more about their PACE initiative, visit thirdway.org/pace or click the link in the show notes. The AI boom is here, but the grid wasn't built for it. Bloom Energy is helping the AI industry take charge. Bloom Energy delivers affordable, always-on, ultra-reliable on-site power. That's why chip makers, hyperscalers and data center leaders are already using Bloom to power their operations with electricity that scales from megawatts to gigawatts. This isn't yesterday's fuel cell. This is on-site power built to deliver at AI speed. To learn more about how fuel cells are powering the AI era, visit bloomenergy.com or click the link in the show notes.
Shayle Khan
I'm Shayle Khan. I lead the early stage venture strategy at Energy Impact Partners. Welcome. Okay, so here's an energy question disguised as an AI infrastructure question. What proportion of the world's AI compute in 2035 will be cloud, that is, in large centralized data centers, versus edge, versus "edge edge," that is, on device? It's an energy question because the answer today is effectively 100% in that first category, cloud. And that's why we have this crazy dynamic in the electricity sector, and actually in the natural gas sector too, where hyperscalers and neoclouds and developers and real estate speculators and crypto miners turned AI companies and more are hunting for sites that can accommodate hundreds of megawatts or gigawatts of power. And the whole thing, as we know, is crashing through the electricity sector, affecting generation and transmission, distribution, prices, now politics, and so on. But there's a narrative that I've heard a number of times that, if borne out, would potentially present a very different future from the present. This is one where AI workloads first of all shift significantly from training to inference, and then where those inference workloads become highly latency sensitive and are also able to be executed in a more distributed fashion. And as a result, much of that compute, and thus the power demand, shifts from these big centralized data centers to the edge. That could mean it shifts to 10 megawatt data centers clustered around an urban core or an autonomous vehicle corridor. Or at the limit, it could mean inference compute happens on device and centralized data centers fall back into a pure training position. Any version of this that takes significant share of the market would have profound implications for the energy question and for the grid. So it's worth exploring, which is what I'm doing today with my guest, Dr. Ben Lee. Ben is a professor of electrical engineering and computer science at the University of Pennsylvania. 
He's also a visiting researcher at Google. By the way, this edge AI infrastructure world and the energy implications thereof is super interesting to me as you will hear. So if you are building something in the space, please come get in touch. In the meantime, here's Ben. Ben, welcome.
Dr. Ben Lee
Great to be here. Thanks so much.
Shayle Khan
I'm very excited for this conversation because this is a topic that, in the energy circles I travel in, I've heard scuttlebutt about a bunch of times, but I've never actually spent the time to really try to understand. The topic, basically, is how much of inference compute might move from central cloud infrastructure to the edge, with how far to the edge, of course, being another question. I think we should start by actually defining those categories a little bit. How do you think about the categorization of where compute can occur? Then we'll talk about each of those categories individually.
Dr. Ben Lee
Right. So even before we talk about generative AI, for classical compute, cloud computing in general, all the services we love that have changed the way we live and work today, there are three levels I generally think about for compute. The first is massive hyperscale data centers, the ones run by Microsoft and Google and Amazon. Hundreds of thousands of machines, massive facilities. That's what most people think about when they think about cloud computing. At the other extreme would be personal devices, consumer electronics. So you think about your phone, you think about your tablet, your laptop. Plenty of compute can happen there as well. There is a perhaps less understood middle layer or intermediate layer called edge computing. And edge computing really means that there are times where you don't want to go all the way to this remote massive facility and wait for the data to go out to that data center and then come back. You might want to access some compute that's a little bit closer to you, maybe in the same city, maybe the same geographic region. That's edge computing. So they're still going to supply really capable high performance machines, these servers. But you don't suffer those longer communication times or latencies that you might if you were to go to that remote massive data center.
Shayle Khan
And my recollection is that there was, I think, okay, so the advent of cloud computing meant the build out of lots of big centralized data centers. There was a fair amount of conversation some number of years ago in the kind of first wave of excitement around autonomous vehicles in particular, that you might see a fair amount of edge infrastructure get built because of the latency tolerance requirement for AVs. I mean, I'm on the outside, so tell me if I've got the kind of narrative wrong here. But then it seems to me that because AVs were generally delayed, or maybe the need wasn't as high, like what we've got today. If you just look at the infrastructure today, it seems like the vast, vast majority of classical compute, even except for stuff that's sitting in like mainframes at companies, is in the cloud, in the big centralized data centers. Do I have that right?
Dr. Ben Lee
That's right. And this is a decades-long trend. I mean, we've seen this progression, this adoption of cloud computing, over the last 15 to 20 years. And there are a couple of reasons we have seen that shift. The first is that computing in a massive data center run by the hyperscaler companies, these big tech companies, is much more energy efficient. They know how to deploy these facilities, they know how to cool them and build HVAC systems efficiently. So they're incurring very small overheads per watt of compute. There's this industry standard metric called power usage effectiveness, or PUE. And that's the ratio of the total power you're using to the power that's going to compute. So Google's PUE is close to 1.1, which is to say for every watt going to compute, there's an additional 0.1 watts going to the overheads of power delivery or cooling or whatever. So that's really incredibly efficient. And most mom and pop data center operators, most enterprise data center operators, don't get the scale and efficiency that these hyperscalers do. The scale also gives a second key advantage, which is the ability to share hardware. So you buy the hardware once and you have lots of users sharing the same physical hardware. That again allows the hyperscaler operators to drive the costs down, and that essentially gets a massive increase in efficiency. So most compute now is being done in these large data centers and in the cloud.
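To make the PUE arithmetic above concrete, here is a minimal sketch in Python. The function names and the 10 MW compute load are hypothetical, and the 1.6 comparison figure is an assumption chosen only for contrast; only the 1.1 figure comes from the conversation.

```python
def pue(total_facility_watts: float, compute_watts: float) -> float:
    """Power usage effectiveness: total facility power divided by power reaching compute."""
    if compute_watts <= 0:
        raise ValueError("compute_watts must be positive")
    return total_facility_watts / compute_watts


def overhead_watts(compute_watts: float, pue_value: float) -> float:
    """Power spent on cooling, power delivery, and similar overheads for a compute load."""
    return compute_watts * (pue_value - 1.0)


# Hypothetical 10 MW compute load. 1.1 is the figure quoted in the conversation;
# 1.6 is an assumed value for a less efficient facility, used only for comparison.
compute_load = 10_000_000.0
for label, value in [("PUE 1.1", 1.1), ("PUE 1.6 (assumed)", 1.6)]:
    megawatts = overhead_watts(compute_load, value) / 1e6
    print(f"{label}: about {megawatts:.1f} MW of overhead")
```

The point of the sketch is only that overhead scales with (PUE - 1), so small differences in PUE become large absolute numbers at data center scale.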
Shayle Khan
Okay, so let's talk about the world of AI now, which is where all this growth in compute is happening. You know, AI workloads are of course divided into two major categories, one being training of models and the other being inference. I think we'll spend most of our time today talking about inference, probably, but let's spend one minute on training. Is there any movement or argument that training should take place anywhere other than large centralized data centers? It seems very clear to me that the trend right now is just build the largest possible data center to train the largest possible model. So is there anyone who thinks that might turn in the other direction?
Dr. Ben Lee
Some, but that really hasn't gotten much traction. I think the reason why we see most training happening in massive data centers is because of the scale. You need to communicate large data sets, you need lots of GPUs all closely coordinated, learning the model parameters. The only scenario that some people have explored for training away from the data center is if you've got private data and you want to refine or fine-tune your model with that private data, and you don't want to share it with the hyperscalers. That has been primarily a research question rather than a production system that people have deployed.
Shayle Khan
Okay, so let's assume then that the vast majority of training compute is still going to happen in centralized data centers as it stands today. I don't know if you know the numbers, but just high level of all AI workloads, how much is training versus inference? Because I think the other big point people have made is like over time, the proportion of workloads going toward inference is going to increase. The proportion of workloads going toward training may decrease as we sort of asymptote the next model or something like that. But today it's mostly training still.
Dr. Ben Lee
I would agree with that. I think to first order, the training costs are historically what people have cared about the most because the data sets are massive and you're talking about these massive 1,000 megawatt data centers for the training workloads. There was a study we did when I was a visiting research scientist at Meta where we found that energy costs for AI were roughly broken into three categories. There's a data pre-processing aspect as well, that's about a third, the training is another third, and then the inference or the use of the model is the last third. But clearly those fractions are evolving rapidly. And I would agree with you when you're saying that the training costs are probably flatlining; we're reaching a plateau in how quickly they are growing, perhaps. And if the optimism about AI is to be justified, we're going to have to see inference costs go way up, because that will be an indicator that adoption has gone up in a fairly significant way, both among individual users, but also among companies and enterprise users. So I think it's true to say that inference costs are large and potentially will grow very rapidly.
Shayle Khan
Okay, so then we're getting to the sort of crux of our question today, which is inference workloads, inference costs increase over time, usage of the models increases over time. That's the presumption of everything going on in AI world. And then the question is, will that inference compute predominantly still take place in these big centralized cloud data centers, or will some or much of it potentially shift either to one of the other two categories you described, sort of edge localized or fully localized on device. So let's talk about the edge version first, which is essentially smaller data centers, still data centers, but smaller and more local. What's the argument for why that might happen and what are the limitations?
Dr. Ben Lee
So the argument in favor of edge computing is mainly the proximity to the end user, right? We have been conditioned, in an era before generative AI, to expect that when we access Internet-based services like a search engine, the answer comes back on the order of 100 milliseconds. That is the order of magnitude that we're talking about. And as a result, to get those 100 millisecond latencies, oftentimes you require computation closer to the user, so the data doesn't have to travel across the Internet, from the west coast out to the east coast and back again, and you get that answer back in a timely way. What is interesting with generative AI is that we are being reconditioned to tolerate much longer delays. So if you use something like GPT or you use something like Claude or your favorite chatbot, oftentimes it's just sitting there thinking for seconds and seconds, maybe tens of seconds, before it gets you the first token. So the question there is to what extent we care about that latency and need that really fast responsive access to the answer.
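As a rough illustration of why distance matters against a 100 millisecond budget, here is a minimal Python sketch of best-case round-trip propagation delay. The fiber speed (about two-thirds of the speed of light) and both distances are assumptions for illustration, not figures from the conversation, and real networks add routing, queuing, and processing time on top.

```python
# Assumed constant: light in optical fiber travels at roughly two-thirds of
# its vacuum speed, i.e. about 200,000 km per second.
FIBER_SPEED_KM_PER_S = 200_000.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over fiber, in milliseconds.

    Ignores routing detours, queuing, and server processing time.
    """
    one_way_seconds = distance_km / FIBER_SPEED_KM_PER_S
    return 2.0 * one_way_seconds * 1000.0


# Both distances are hypothetical round numbers.
for label, km in [("cross-country, ~4,000 km", 4000.0), ("same metro, ~50 km", 50.0)]:
    print(f"{label}: at least {round_trip_ms(km):.1f} ms round trip")
```

Under these assumptions, propagation alone consumes a meaningful share of a 100 millisecond budget for a cross-country request, while a nearby edge site leaves almost all of it for compute.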
Shayle Khan
Yeah, and I think we've been especially trained even further in that direction with the introduction of things like deep research, where even in the name you sort of think, well, of course that has to take time. It is deep research that they are doing. So it's an interesting point that maybe we are becoming reconditioned to allowing more latency. The argument that I've heard for why latency is really going to matter, apart from just wanting search queries or chat queries to come back quicker, is the next wave of applications for AI. Right. And so maybe we go back to the autonomous vehicle world and things like that, where like latency making decisions in near real time does become really important. Robotics being another category that could be a major user of AI compute, but needs really, really low latency. Is that part of the argument for shifting some compute to the edge?
Dr. Ben Lee
Yes, absolutely. So the class of compute you mentioned, autonomous vehicles, robotics, fit into what we call cyber physical AI. So cyber physical systems are those that have a cyber component, a computational component, but also interact with the physical world. And once those interactions with the physical world arise, then we care about responsiveness because that underpins safety guarantees and the ability to make sure that your robotic arm is able to respond quickly enough to hazards, your autonomous vehicles are able to do so. So I agree that there will be cases where we will need those really low latencies. And that is going to require edge computing much closer to the user. So we have much shorter Internet delays, network delays.
Shayle Khan
I'm curious to understand the trade-offs here. Right. Like I know with model training there are technical reasons why you want all your compute clustered together as closely as possible. You want every GPU as close to every other GPU as you can make them, minimizing the copper between them or the optics or whatever it is that's communicating between them. And that, for some reason that you can explain to me, makes model training more effective. Is there a similar dynamic in inference? Is there a technical reason why you are paying a penalty if you shift to smaller data centers at the edge, or is there no technical reason why it's suboptimal?
Dr. Ben Lee
Right, yeah. Let's talk about the training piece first. The reason why we need thousand megawatt data centers where we have hundreds of thousands of GPUs connected so closely together is because the data sets are massive and the models are massive. We're trying to learn on the order of a trillion parameters for these machine learning models, these AI models, and we're trying to do it on the wealth of data we find on the Internet. There's no way that any single GPU can handle that much data. So what we end up doing is partitioning the data into smaller pieces and then handing each GPU a slice or a partition of this data. And each GPU will churn on its own share, its own partition of the data, and learn the model weights that work best for its piece of the data. And all the other GPUs in the data center are doing the same thing on their partitions of the data. Periodically, what they will do is they will compare notes, they will share the weights that they've learned. And this sharing is really, really expensive. And some of the people in the energy space may know that there are massive energy fluctuations or power fluctuations we will see in data center usage when the GPUs go from this computationally intensive phase where you're learning the model weights, to this communication intensive phase where they're comparing notes and sharing their intermediate results with each other. So as a result, that's why we're talking about these massive data centers for training. They all need to communicate frequently to share what they've learned from their own data sets. For inference, we don't see that effect.
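The partition, compute, then compare-notes pattern described above can be sketched as a toy in plain Python. This is a simplification under stated assumptions: a one-parameter model, synthetic noise-free data, and periodic weight averaging standing in for the collective communication real training systems use. All function names and hyperparameters are hypothetical.

```python
def local_steps(w, shard, lr, steps):
    """Compute phase: a few gradient-descent steps on one worker's shard."""
    for _ in range(steps):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def train_data_parallel(data, num_workers=4, rounds=5, steps=3, lr=0.05):
    """Toy data-parallel training with periodic weight averaging."""
    # Partition the data so each worker sees only its own slice.
    shards = [data[k::num_workers] for k in range(num_workers)]
    w = 0.0
    sync_count = 0
    for _ in range(rounds):
        # Compute phase: every worker trains independently from the shared weight.
        local_weights = [local_steps(w, shard, lr, steps) for shard in shards]
        # Communication phase: workers "compare notes" by averaging their weights.
        w = sum(local_weights) / num_workers
        sync_count += 1
    return w, sync_count


# Synthetic data with a true weight of 3.0 (an assumption for illustration).
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 41)]
w, syncs = train_data_parallel(data)
print(f"learned weight is about {w:.4f} after {syncs} synchronizations")
```

Every round ends with a step in which all workers must exchange results before anyone can continue, which is why training benefits from tightly coupled hardware. A single inference request has no equivalent all-to-all step across the whole facility.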
Shayle Khan
Just to add: the craziest thing to me about how model training data centers operate right now, the absolute craziest thing, is, as you said, that there are surprisingly large spikes in power demand as a result of how the models are trained. Those spikes are actually problematic, not just to the grid but to the equipment inside the data center as well. So what they do, at least sometimes, to manage that is create dummy workloads so they keep the power profile basically flat. But you are literally just wasting energy on absolutely nothing. Nothing is happening during those times. They're dummy workloads. At that scale, the fact that that is happening is wild to me.
Dr. Ben Lee
Absolutely. And I think we've seen this in other contexts as well, but perhaps not at this scale. In electrical engineering we call it the dI/dt problem: the change in current divided by the change in time. If there are large current swings over very short periods of time, you could imagine building batteries to sort of damp things out or decouple them. And certainly a lot of people are thinking about that. But the easiest thing to do might be to just modulate the software, as you say, because we have very precise control over what the software does. So that is an active and ongoing area of research that needs to further develop.
Podcast Host / Announcer
The grid wasn't built for what's coming next. Electricity demand is surging, from data centers to EVs, and utilities need reliable, affordable solutions that don't require building expensive new infrastructure. Energy Hub helps by turning everyday devices like smart thermostats, EVs, home batteries and more into virtual power plants. These flexible energy resources respond in near real time to grid needs, balancing supply and demand. Plus they can be deployed in under a year and at 40 to 60% lower cost than traditional infrastructure. That means more reliability, lower costs, cleaner energy. You can't get much better than that. And that is why over 120 utilities across North America trust Energy Hub to manage more than 1.8 million devices. Learn more at energyhub.com or click the link in the show notes.
Latitude Media Announcer
Do you ever wonder why it takes so long to get clean energy projects up and running? Do you have permitting reform on the brain? Are you NEPA reform curious? The new PACE study from Third Way pinpoints the non-cost barriers that stand in the way of clean energy deployment and keep new solar and transmission projects in limbo. They surveyed more than 200 industry professionals to understand what's slowing down deployment and offer practical solutions on how to fix it. To read Third Way's full report and to learn more about the PACE initiative, visit thirdway.org/pace. You've heard the phrase "speed to power" a lot lately, but here's what it really means. AI data centers are being told that it will take years to get grid power. They can't afford to wait, so they're turning to on-site power solutions like Bloom Energy. Bloom can deliver clean, ultra-reliable, affordable power that's always on in as little as 90 days. Bloom's fuel cells offer data centers other important advantages. They adapt to volatile AI workloads. They have an ultra-low emissions profile that usually allows for faster and simpler permitting. And they're cost effective too. That's why leaders from across the industry trust Bloom to power their data centers. Ready to power your AI future? Visit bloomenergy.com to learn more.
Shayle Khan
Okay, so then on to inference. So you're saying inference does not contain that same challenge. So what is the downside to shifting inference workloads to the edge?
Dr. Ben Lee
To my knowledge, there isn't much of a downside. The reason why inference is amenable to edge computing is that when you send a prompt for processing by a large language model, that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine. And the reason for that is the model sits in that machine, the data sits in that machine, and all of your prior conversations with that bot are sitting in that machine. And it's a very localized piece of compute that needs to be done. And you don't need tens or hundreds of GPUs to be coordinating to give you an answer back. You've got that one GPU or a few tightly coupled GPUs giving you that answer back. And that is amenable. That is great for edge computing and we can certainly supply that.
Shayle Khan
So a thought experiment that I've given people recently in thinking about this is, let's just say that you need a gigawatt of inference compute five years from now, or seven years from now, something like that. You think you need a gigawatt, wherein the demand for that gigawatt is geographically centralized somewhere. Let's just say you think you're going to need a gigawatt of inference compute to serve the Dallas metropolitan area, whatever it might be, at that point a few years from now. This is back to the power perspective. Is it going to be easier for you to find and site one 1-gigawatt site or 100 10-megawatt sites within that geographic region? Today, I think it is still probably easier to find the gigawatt site, or at least the past couple of years it has been. But there aren't that many gigawatt sites out there from a power availability perspective. So at some point is that going to flip, and is it going to be easier to build 100 10-megawatt sites, which sounds really hard to do and indeed is, but these are all hard problems. So if that happens, do you think that we are going to see a significant portion of that inference workload move to that type of scale? Is that the right scale? Like, should we be looking at 10 megawatt sites, 100 megawatt sites, 1 megawatt sites? Like, how far to the edge do we want to go?
Dr. Ben Lee
Yeah, absolutely. And I agree with the premise of that question 100%. I think that there are two reasons to go to many smaller data centers. The first is the one you mentioned, power provisioning and connections to the grid. The second is the fact that you don't need massive GPU coordination for an inference workload. I guess the catch might be that if you are thinking about your existing edge data centers, maybe you've got data centers in downtown Los Angeles or something like that already serving workloads. Those facilities may not be configured to handle GPU and AI compute. They may have power delivery infrastructure that was optimized for CPUs. They might have HVAC systems optimized for the much lower power density of CPUs. So it's not simply a matter of pulling out your CPUs and replacing them with GPUs. You may have to retrofit the facility itself to support that. But I agree, I think finding capacity there may eventually become easier than finding the next thousand megawatts.
Shayle Khan
Is there any limitation? I'm trying to think of why you wouldn't do that. You need to have a fair amount of memory. You need to house all the model weights and so on in every individual data center if you're going to do that at the edge, right? So there's got to be some minimum viable scale, I assume.
Dr. Ben Lee
Right. And maybe to give you a sense of the type of data centers we were talking about in the past, again in a study that we had done with Meta, we looked at 15 of their data centers before generative AI, and the scale of those facilities was somewhere between 15 and 50 megawatts. Right. So less than 100 megawatts. And certainly it was fairly conventional, uncontroversial, to build those sizes of data centers in the past. So that's the starting point, I think, in terms of the scale. Now as you scale down towards, for example, 1 megawatt, it's not clear at what point things start making less sense.
Shayle Khan
I guess the other point here is the way that the data center build-out has gone historically, the cloud data center build-out: it's been fairly clustered in these regions. Right. And there's a reason why Northern Virginia is the data center hub of the world. And there are others as well: Chicago, Dallas, Phoenix, et cetera. And that, as I understand it, is largely because the cloud providers needed to offer a certain level of reliability to their customers, and so they could have redundancy within a given region and that was helpful to them in terms of what they were offering. Do you think that this future world wherein a bunch of inference compute moves to the edge, let's call it 15 to 50 megawatt data centers instead of hundreds or thousands of megawatt data centers, looks similar? Is it that you have a small number of regions that have a really high concentration of those 15 to 50 megawatt data centers? Or could it be much more dispersed, because the whole point of this is really low latency and local, and you don't need them to be as clustered?
Dr. Ben Lee
I think there are lots of different aspects at play in terms of data center siting. I think the redundancy is definitely one of them. And I have trouble disentangling the role that some of these other factors play as well. Some people talk about tax breaks and incentives from local companies and local states. Some people talk about proximity to Internet exchange points. So not only are you talking about congestion-free power movement, but you're also talking about congestion-free data movement into and out of the data centers. Northern Virginia has that. And then of course, the availability of the power itself. I guess I would say that when you start talking about many of these smaller data centers, from a redundancy perspective, it might be okay that they're not all geographically clustered as long as you have a strategy for rolling over the compute or rolling over the workload to spare capacity somewhere within that region that has a similar performance profile or some sort of similar latency or delay characteristic. So that's really the concern, whether you have robust geographical redundancy and resilience.
Shayle Khan
Is this happening? It's interesting. I was thinking, okay, so it sounds like you're saying there's not a big downside. We already have significant inference workloads, so it's not like we're waiting on workloads to show up that could accommodate this. And yet if you look at most everybody building data centers, certainly the hyperscalers and I think the colos and folks as well, you know, the focus continues to be on: we gotta find big sites for big data centers. Why don't we see more development of this smaller scale edge AI inference world?
Dr. Ben Lee
I think it really depends on the workload and the application, and we don't know. I view AI as a more fundamental basic technology and we don't necessarily know what application or capability will be layered on top of it. I'd say that we've been talking about edge data centers a lot. There are other words for this type of data center. A content distribution network, a CDN, is one of those examples, or a point of presence, a PoP, as the facilities are sometimes called, and they exist in fairly significant numbers. Content distribution networks ensure that when you want to access, e.g., nytimes.com or wsj.com, your web page is not being served from the other end of the country. Those web pages are sitting close to you because the content distribution network took those updated web pages and moved them to facilities near you, data centers near you. Likewise, companies like Meta, when they have Instagram or when they have these social media applications, they also have these points of presence that supply data locally rather than retrieving content for your feed from across the country. So we already see that. But these are application level performance requirements, whether they be for social media or for other sorts of news content. Once it becomes clear what applications of AI really drive further inference deployments, then we'll know what sort of performance requirements are needed, and what sort of what we call caching techniques or strategies might be useful so that we can keep fresher data or more recent, more frequently used models closer to these users and then serve them more quickly. I think it'll become clearer as we see which models really get traction, which applications really get traction.
Shayle Khan
Right. So maybe the state of affairs today is, look, anybody who's developing data centers, we know we need the big centralized data centers because there is currently essentially endless demand to train models, at least relative to the availability of compute today. And so we know we need to build the big centralized ones. We might as well use those big centralized ones that we know we need right now for inference workloads, such as they are today. But we don't have enough certainty yet about what the inference workloads are going to be long term to invest the kind of capital and time that it would take to build out the network of 100 10-megawatt data centers in a particular geographic region. Something like that.
Dr. Ben Lee
That's right. And I would say maybe that my crystal ball is as clear as anyone else's crystal ball. But I feel like there's a huge amount of GPU capacity being discussed in the pipeline in these large data centers. And if it turns out that maybe there are diminishing returns from training larger and larger models, or maybe we run out of data because we've exhausted all data that's available on the Internet, when those things happen, it may be that demand for these GPUs in these largest data centers will flatten out and we're going to have spare capacity. At which point, as you say, they will be used or repurposed to serve inference. And then it will be hard to make the case for building yet more data centers, smaller ones, with GPUs closer to the users. I think the catch there will be if one of these model providers or one of these application developers makes performance a distinguishing feature of their offering. If they start competing on performance rather than on capability, then we're going to see, well, I may have a thousand GPUs in the middle of Nebraska that are already deployed, but if I really want to break into the San Francisco market, I've got to build my GPUs right there and have them available.
Shayle Khan
All right, so speaking of performance, let's transition to the full extreme version of this, which is also, I think, theoretically the most disruptive from an energy perspective, which is shifting any significant portion of these inference workloads all the way onto the device. Either skip the middle ground of edge 5 megawatt data centers or 15 or 50, or include them, but shift workloads that would have gone to a big data center that requires a lot of power, straight onto your iPhone or your iPad or whatever it is. And we've heard some glimmers of this as well. Give me the similar sort of like pros and cons of shifting that workload straight onto the device.
Dr. Ben Lee
Right. Pros, primarily two things. One is performance, right? You don't have to go across the Internet. The model is right there and the compute is right there. Assuming that you get really capable hardware on your device as well, you get really quick, responsive answers from your AI. The second is something we mentioned earlier, which is the notion of privacy. You don't necessarily need to send your data out into this hyperscale data center where it gets blended with lots of other user data and you have fewer guarantees about what happens to it. Localized compute is certainly more private than compute on shared systems. So those are the two key advantages. And then I guess a third would be that it gets more tightly integrated with the capabilities of a particular platform. So for example, Apple's ecosystem.
Shayle Khan
Right. And Apple seems like the obvious candidate to do this. You mentioned privacy, and Apple is particularly focused on privacy. They have the hardware, the device. And Apple is notoriously, or at least reputationally, behind in the AI race. So it's not hard to picture that if somebody's going to move a lot of this inference on device, it's going to be Apple. Okay, but there is a real trade-off here, I assume.
Dr. Ben Lee
Yes, and the trade-off is primarily with respect to the capabilities of the device. So if we have a very large model, we're going to have to deploy that model on a much more capable hardware platform than we've got today. This means having some number of gigabytes of memory to hold the model weights and then also some additional gigabytes of memory to hold the context as you develop this conversation with the model. In addition to the memory, you're also going to need the compute. You're not going to have this high-performance GPU sitting inside your phone, so you're going to have to have specialized chips. Those specialized chips in your hand are going to be less powerful or less capable than the ones in the data center. So all of this speaks to not getting exactly the same model that you would get in the data center. You would get a shrunk-down model. Maybe in the data center you would have a trillion parameters, this massive GPT-5 model, for example. But on a personal consumer electronics device, you might only have 7 billion parameters. So orders of magnitude smaller. That smaller model will be less capable. It will give you less capable answers. It will be capable of doing fewer tasks. But maybe that's okay because you've identified only a handful of tasks that you really care about on your personal device. So that is really the trade-off. As you go towards the device, you're going to have to shrink the size of the model down, and you're going to get less and less capability out of your AI. The final thing, of course, is the power and energy profile. At data center scale, we care primarily about power because power influences infrastructure and power delivery and influences thermal management and so on. For device-level compute, there are two considerations. We care about energy rather than power because that affects battery life. Right?
So even if you could deliver a really capable GPU chip onto your phone, the question is, how long would your phone last if you were using that chip on a fairly consistent basis? So the energy aspect will continue to be challenging, and then the thermal aspect will also be challenging. If you have a really powerful device, that's going to be a hot brick inside your pocket, and that's going to be a deal breaker as well.
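The memory and battery constraints Dr. Lee outlines come down to simple arithmetic, sketched below in Python. The 7 billion and one trillion parameter counts come from the conversation; the bit widths, the 15 Wh battery, and the 5 W chip draw are hypothetical values chosen only for illustration. The estimate covers model weights only and leaves out the extra memory needed for conversation context.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just for the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def battery_hours(battery_wh, draw_watts):
    """Hours of runtime if the chip drew `draw_watts` continuously."""
    return battery_wh / draw_watts


small_16bit = weight_memory_gb(7e9, 16)    # 7B parameters at 16 bits each
small_4bit = weight_memory_gb(7e9, 4)      # same model stored at 4 bits (assumed)
large_16bit = weight_memory_gb(1e12, 16)   # trillion-parameter model

# Hypothetical phone: 15 Wh battery, inference chip drawing 5 W continuously.
hours = battery_hours(15, 5)
```

Under these assumptions the 7 billion parameter model needs about 14 GB at 16 bits or 3.5 GB at 4 bits, the trillion-parameter model needs about 2,000 GB, and the hypothetical phone would last about 3 hours of continuous inference.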
Shayle Khan
So when you say deal breaker, is there progress toward on-device inference? I mean, to your point on performance, that strikes me as, okay, we're now again in the context of specific workloads. For certain types of workloads, a 7 billion parameter model might be fine, and for others it wouldn't be. And so maybe there will be an on-device chip and some inference that you could do on device. But you pull up your ChatGPT app or whatever, and of course it's going to send you back out to the cloud or maybe to the edge. But these other challenges of thermal management and things like that are hardware challenges. Where are we in the progression of on-device inference? Is it coming, is it not coming? Do we not know?
Dr. Ben Lee
I think the assumption with on-device inference is that you'll be able to shrink the model without loss in performance for the tasks you care about. That is the primary strategy that computer scientists have been taking. On the hardware side, we have made strides in developing custom chips, custom silicon, for the specific types of tensor algebra that are required for machine learning models. So we know how to build those chips, and that gives us energy-efficient compute and higher performance. We know how to build really capable memory systems or solid state disks. Your phone now has hundreds of gigabytes of storage on it, so there's a question of, well, maybe you'll end up using less of it for your photos and more of it for your AI model, something like that. So I think there are fairly significant resource constraints, but I don't think that they are insurmountable, in the sense that more intelligent hardware design and more intelligent hardware management could go some way toward making these AI models feasible on the device.
Shayle Khan
Okay, so I'm going to put you on the spot, and we promise not to hold you to these numbers, but just to give a sense of where we think things are heading. If we're fast forwarding 10 years, let's just say we're in 2035, and imagine there's a total volume of inference compute in the world, let's just say it's 100 megawatts total. What would be your guess of the ranges of how much of that compute is going to take place in large centralized data centers versus at the edge? We'll draw a line. Let's say 100 megawatts and above is large centralized; sub-100 megawatts but not on device is edge. And then the third category, of course, being on device. How much of it can go anywhere but the centralized data centers?
Dr. Ben Lee
So I would go straight to this idea of having an 80/20 rule, because we see this all the time in computer systems, where you have 20% of your tasks being extremely popular. Maybe there are 20 things that you always want to do and you spend 80% of your AI compute doing those things. That could be email processing, that could be photo analysis. So we can identify what those really compelling applications and tasks are, and we're going to be spending most of our time doing that. And then for the remainder, the long, heavy tail of other tasks that people might want to do, there will always be backup capabilities residing in the cloud data center. So I would say that we could be getting 80% of our compute done locally and leaving 20% of the heavy lifting, or the more esoteric, more corner-case compute, for the data center cloud. That is of course excluding the training. The training will all continue to reside in the massive facilities. But in terms of the inference, I think there's huge potential.
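The 80/20 intuition above can be checked against an assumed popularity curve. The Python sketch below is a hedged illustration: it assumes task popularity follows a Zipf-like law (request share proportional to 1 / rank raised to an exponent), which is a common modeling choice and not something stated in the conversation. The function name `top_share` and the choice of 100 tasks are hypothetical.

```python
def top_share(num_tasks, top_fraction, exponent=1.0):
    """Share of requests going to the most popular `top_fraction` of tasks,
    assuming task popularity follows a Zipf-like 1 / rank**exponent law."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_tasks + 1)]
    top_k = max(1, round(num_tasks * top_fraction))   # number of "popular" tasks
    return sum(weights[:top_k]) / sum(weights)


share = top_share(num_tasks=100, top_fraction=0.2)                    # exponent 1.0
steeper = top_share(num_tasks=100, top_fraction=0.2, exponent=1.5)    # more skewed
```

With 100 tasks and an exponent of 1.0, the top 20% of tasks draw a bit under 70% of requests; a steeper curve pushes that share higher. How skewed real AI inference demand turns out to be is exactly the unknown Dr. Lee is estimating.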
Shayle Khan
Right. But yeah, that's actually a very significant shift. I appreciate that doesn't include training, but still, if 80% of the inference workload could end up local, that's a significant shift and has pretty profound implications for the energy picture as well. Are you saying that 80%, just to pin you down even a little bit more, is local in the sense of being at the edge or local in the sense of being on device? What do you think the split ends up being there?
Dr. Ben Lee
Yeah, so of the 80%, I would say most of that will be on the edge, like I suspect it is today. If you look at what we talked about earlier, content delivery networks and points of presence, they have probably identified the 20% of the content that 80% of the people will be looking at most of the time, and they're putting it at the edge. I think maybe on the order of 1% ends up being put on your consumer electronics. Actually, even for today's compute, when we set aside AI, there is a trend towards consumer electronics hiding that flow of data back and forth between the device and the edge for you. Right. So if you use a cloud storage service like Dropbox, or if you're using a photo storage service, they will let you pretend that you have access to all of your videos, all of your photos, and all of your documents. And they will transparently, behind the scenes, move things back and forth between the data center and your local device. So you may think you have all of it, but maybe you've only got a tiny sliver, less than 1%, on your local device.
Shayle Khan
Right. In my Box instance, certain things open up much faster than others when I try to open them, and it's occurred to me that that is why. If I step back, then, okay, it sounds like what you're saying, in this scenario you're painting of the future, is that roughly 80% of the inference workloads are at the edge, very little of it actually on device, and then the other 20% or so sitting in big cloud data centers. So when I think about the energy implications of that, there are, I think, a couple of ways to think about it that are pretty interesting. One is this. Okay, so maybe a fair amount of the energy consumption of at least inference compute is going to shift to these 5 megawatt, 15 megawatt, 50 megawatt type local sites. That has big implications for the grid in ways that are, I don't know, both good and bad, probably harder to manage in some ways, easier to manage in other ways. But the overall energy consumption of inference compute, I would expect, and you can tell me if I'm wrong, would actually be higher in this scenario than it would be if it was all centralized. Because I assume the PUE that you get for these edge data centers isn't quite as good as it is for the large centralized data center. So, on balance, this probably means more overall AI energy consumption. Do you think that's right?
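The PUE argument in this question can be made concrete with a small Python sketch. PUE (power usage effectiveness) is total facility energy divided by IT equipment energy, so facility energy is IT energy times PUE. The 80/20 split comes from the conversation; the specific PUE values (1.1 for hyperscale, 1.5 for edge) and the 100 MWh IT load are assumptions picked purely for illustration, not figures from the episode.

```python
def facility_energy_mwh(it_energy_mwh, pue):
    """Total facility energy = IT equipment energy x PUE."""
    return it_energy_mwh * pue


IT_LOAD_MWH = 100.0      # hypothetical inference IT energy over some period
PUE_HYPERSCALE = 1.1     # assumed PUE for a large centralized facility
PUE_EDGE = 1.5           # assumed PUE for a smaller edge facility

# Scenario A: every inference request is served from hyperscale facilities.
all_central = facility_energy_mwh(IT_LOAD_MWH, PUE_HYPERSCALE)

# Scenario B: 80% of inference at the edge, 20% in hyperscale facilities.
split = (facility_energy_mwh(0.8 * IT_LOAD_MWH, PUE_EDGE)
         + facility_energy_mwh(0.2 * IT_LOAD_MWH, PUE_HYPERSCALE))
```

With these assumed values, the all-centralized case uses about 110 MWh and the 80/20 split about 142 MWh for the same IT work. The size of the gap depends entirely on the PUE values chosen.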
Dr. Ben Lee
Yes, yes. I think you get economies of scale. When you go to a gigawatt or 2 gigawatts, you have a single facility, you're managing it in a highly optimized, coordinated way, and you've got hundreds of thousands of these machines all managed very precisely. I think as you shrink the system down, you will lose efficiency. You will be trying to build these 20 megawatt data centers in footprints or facilities that maybe weren't designed initially for those workloads. So, yes, I think total energy costs may go up as a result.
Shayle Khan
We're talking about inference workloads to some extent as a monolith. I'm sure they are not. So are there big distinctions in your mind in terms of the different types of inference workloads and how that influences where they should be housed?
Dr. Ben Lee
Right, yes. So that's a really great question, actually. I would say that there are fundamental limits on the number of inference queries a human user can actually produce, because we're ultimately limited by the speed of our typing or the number of tokens we can actually produce to query the models. So there is some of that, where humans will continue to send requests to agents, but I think increasingly most of the inference workload will come from other software agents. This could be a search engine retrieving web pages and then asking the large language model to summarize them into a coherent discussion for you, this could be your photo app learning something about your images, or this could be your mail app doing something with your mail and helping you compose messages. So all of that is done behind the scenes. And those inference workloads are potentially much larger because, of course, software can generate those requests at much, much higher rates. From the perspective of where that computation happens, to the extent that the data center already has servers running your mail workloads, or to the extent that your search engines are already running in the same data center, the communication to the model will be a bottleneck. Right. So if you have a data center in Nebraska running your search engine for you or doing some of these other big, heavy software jobs, then potentially they could query and execute inference in these largest hyperscale data centers.
Shayle Khan
All right, Ben, this was super interesting. Really appreciate your time.
Dr. Ben Lee
It was my pleasure. I really enjoyed the conversation. Thanks so much.
Shayle Khan
Dr. Ben Lee is a professor of electrical engineering and computer science at the University of Pennsylvania. He's also a visiting researcher at Google. This show is a production of Latitude Media. You can head over to latitudemedia.com for links to today's topics. Latitude is supported by Prelude Ventures. This episode was produced by Daniel Waldorf. Mixing and theme song by Sean Marquis. Stephen Lacy is our executive editor. I'm Shayle Kann and this is Catalyst.
Episode Title: Will inference move to the edge?
Date: December 18, 2025
Host: Shayle Kann
Guest: Dr. Ben Lee, Professor of Electrical Engineering and Computer Science at University of Pennsylvania, Visiting Researcher at Google
This episode explores a provocative and timely question: As artificial intelligence (AI) workloads—especially inference tasks—grow, will they remain the domain of massive, centralized cloud data centers, or could they decentralize toward the “edge,” closer to users or even directly onto devices? Shayle Kann and Dr. Ben Lee discuss the technical, economic, and energy implications of this possible shift, examining what might push AI inference toward the edge, how that could fundamentally alter energy demand and grid management, and what obstacles stand in the way.
Three Layers of Compute:
Current State:
Training:
Inference:
Latency:
Technical Feasibility and Trade-Offs:
Data Center Siting and Grid Interactions:
Infrastructure & Economics:
Technological and Energy Trade-Offs:
On-device Inference: Pros & Cons:
Dr. Ben Lee’s 80/20 “Rule of Thumb”:
Implications for Energy and the Grid:
The conversation balances deep technical insight with practical, market-oriented perspective. Shayle’s energetic curiosity complements Dr. Lee’s clarity and expertise. Both are cautiously optimistic about edge inference’s potential but realistic about economic bottlenecks and energy trade-offs.
Bottom line:
While AI inference currently clusters in mega data centers, technical and market signals suggest a future with much more decentralized compute—at the edge, if not (yet) on-device. This shift will fundamentally reshape where energy for AI is consumed, how efficiently it's used, and what investments get made in infrastructure across the grid and technology stack.