Today virtually all AI compute takes place in centralized data centers, driving the demand for massive power infrastructure.
But as workloads shift from training to inference, and AI applications become more latency-sensitive (autonomous vehicles, anyone?), there’s another pathway: migrating a portion of inference from centralized computing to the edge. Instead of a gigawatt-scale data center in a remote location, we might see a fleet of smaller data centers clustered around an urban core. Some inference might even shift to our devices.
So how likely is a shift like this, and what would need to happen for it to substantially reshape AI power?
In this episode, Shayle talks to Dr. Ben Lee, a professor of electrical engineering and computer science at the University of Pennsylvania, as well as a visiting researcher at Google. Shayle and Ben cover topics like:
- The three main categories of compute: hyperscale, edge, and on-device
- Why training is unlikely to move from hyperscale
- The low latency demands of new applications like autonomous vehicles
- How generative AI is training us to tolerate longer latencies
- Why distributed inference doesn’t face the same technical challenges as distributed training
- Why consumer devices may limit model capability
Resources
- ACM SIGMETRICS Performance Evaluation Review: A Case Study of Environmental Footprints for Generative AI Inference: Cloud versus Edge
- Internet of Things and Cyber-Physical Systems: Edge AI: A survey
Credits: Hosted by Shayle Kann. Produced and edited by Daniel Woldorff. Original music and engineering by Sean Marquand. Stephen Lacey is our executive editor.
Catalyst is brought to you by EnergyHub. EnergyHub helps utilities build next-generation virtual power plants that unlock reliable flexibility at every level of the grid. See how EnergyHub helps unlock the power of flexibility at scale, and deliver more value through cross-DER dispatch with their leading Edge DERMS platform, by visiting energyhub.com.
Catalyst is brought to you by Bloom Energy. AI data centers can’t wait years for grid power—and with Bloom Energy’s fuel cells, they don’t have to. Bloom Energy delivers affordable, always-on, ultra-reliable onsite power, built for chipmakers, hyperscalers, and data center leaders looking to power their operations at AI speed. Learn more by visiting BloomEnergy.com.
Catalyst is supported by Third Way. Third Way’s new PACE study surveyed over 200 clean energy professionals to pinpoint the non-cost barriers delaying clean energy deployment today and offers practical solutions to help get projects over the finish line. Read Third Way’s full report, and learn more about their PACE initiative, at www.thirdway.org/pace.
Transcript
Stephen Lacey: A very brief word before we start the show. We’ve got a survey for listeners of Catalyst and Open Circuit and we would be so grateful if you could take a few moments to fill it out. As our audience continues to expand, it’s an opportunity to understand how and why you listen to our shows, and it helps us continue bringing relevant content on the tech and markets you care about in clean energy. If you fill it out, you’ll get a chance to win a $100 gift card from Amazon and you can find it at latitudemedia.com/survey or just click the survey link in the show notes. Thank you so much. Latitude Media covering the new frontiers of the energy transition.
Shayle Kann: I’m Shayle Kann, and this is Catalyst.
Ben Lee: We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting for the data center cloud. Of the 80%, I would say most of that will be on the edge. I think maybe on the order of 1% ends up being put on your consumer electronics.
Shayle Kann: Coming up, the age of edge inference, blunting the big data center boom.
I’m Shayle Kann. I lead the early stage venture strategy at Energy Impact Partners. Welcome. Okay, so here’s an energy question disguised as an AI infrastructure question. What proportion of the world’s AI compute in 2035 will be cloud, i.e. in large centralized data centers, versus edge, versus on-device? It’s an energy question because the answer today is effectively 100% in that first category, cloud. And that’s why we have this crazy dynamic in the electricity sector and actually in the natural gas sector too, where hyperscalers and neo-clouds and developers and real estate speculators and crypto miners turned AI companies and more are hunting for sites that can accommodate hundreds of megawatts or gigawatts of power. And the whole thing, as we know, is crashing through the electricity sector affecting generation and transmission, distribution, prices, now politics, and so on.
But there’s a narrative that I’ve heard a number of times that if borne out would potentially present a very different future from the present. This is one where AI workloads first of all shift significantly from training to inference and then where those inference workloads become highly latency sensitive and are also able to be executed in a more distributed fashion. And as a result, much of that compute and thus the power demand shifts from these big centralized data centers to the edge. That could mean it shifts to 10 megawatt data centers clustered around an urban core or an autonomous vehicle corridor, or at the limit it could mean inference compute happens on device and centralized data centers fall back into a pure training position. Any version of this that takes significant share of the market would have profound implications for the energy question and for the grid. So it’s worth exploring, which is what I’m doing today with my guest Dr. Ben Lee. Ben is a professor of electrical engineering and computer science at the University of Pennsylvania. He’s also a visiting researcher at Google. By the way, this edge AI infrastructure world and the energy implications thereof is super interesting to me as you will hear. So if you are building something in this space, please come get in touch. In the meantime, here’s Ben. Ben, welcome.
Ben Lee: Great to be here. Thanks so much.
Shayle Kann: I’m very excited for this conversation because this is a topic that, in the energy circles I travel in, I’ve heard scuttlebutt about a bunch of times, but I’ve never actually spent the time to really try to understand: basically, how much of inference compute might move from central cloud infrastructure to the edge, and then how far to the edge, which is of course another question. But I think we should start by actually defining those categories a little bit. How do you think about the categorization of where compute can occur? Then we’ll talk about each of those categories individually.
Ben Lee: Right. So even before we talk about generative AI, for classical compute, cloud computing in general, all of the services we love and change the way we live and work today, there are three levels generally I think about for compute. The first is massive hyperscale data centers. The ones run by Microsoft and Google and Amazon, hundreds of thousands of machines, massive facilities. That’s what most people think about when they think about cloud computing. At the other extreme would be personal devices, consumer electronics. So you think about your phone, you think about your tablet, your laptop, plenty of compute can happen there as well. There is a perhaps less understood middle layer or intermediate layer called edge computing. And edge computing really means that there are times where you don’t want to go all the way to this remote massive facility and wait for the data to go out to that data center and then come back. You might want to access some compute that’s a little bit closer to you, maybe in the same city, maybe in the same geographic region. That’s edge computing, so there’s still going to be really capable high performance machines, these servers, but you don’t suffer those longer communication times or latencies that you might if you were to go to that remote massive data center.
Shayle Kann: And my recollection is that there was, I think, okay, so the advent of cloud computing meant the build out of lots of big centralized data centers. There was a fair amount of conversation some number of years ago in the first wave around autonomous vehicles in particular that you might see a fair amount of edge infrastructure get built because of the latency tolerance requirement for AVs. I mean, I’m on the outside, so tell me if I’ve got the kind of narrative wrong here, but then it seems to me that because AVs were generally delayed or maybe the need wasn’t as high, what we’ve got today, if you just look at the infrastructure today, it seems like the vast, vast majority of classical compute even except for stuff that’s sitting in mainframes that companies have is in the cloud and the big centralized data centers. Do I have that right?
Ben Lee: That’s right. And this is a decades long trend. I mean we’ve seen this progression, this adoption of cloud computing over the last 15 to 20 years, and there are a couple of reasons we have seen that shift. The first is that computing in a massive data center run by the hyperscaler companies, these big tech companies, is much more energy efficient. They know how to deploy these facilities, they know how to cool them and build HVAC systems efficiently, so they’re incurring very, very small overheads per watt of compute. There’s this industry standard metric called power usage effectiveness, or PUE, and that’s the ratio of the total power the facility draws to the power that’s actually going to compute. So Google’s PUE is close to 1.1, which is to say for every watt going to compute, there’s an additional 0.1 watts going to the overheads of power delivery or cooling or whatever.
So that’s really incredibly efficient, and most mom and pop data center operators, most enterprise data center operators, don’t get the scale and efficiency that these hyperscalers do. The scale also gives a second key advantage, which is the ability to share hardware. So you buy the hardware once and you have lots of users sharing the same physical hardware. That again allows the hyperscale operators to drive costs down, and that essentially gets a massive increase in efficiency. So most compute now is being done in these large data centers and in the cloud.
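[A quick back-of-envelope sketch of the PUE arithmetic Ben describes, for readers who want the numbers spelled out. The 1.1 figure is the Google number he cites; the 1.5 figure for a smaller operator is an illustrative assumption, not from the episode.]

```python
# PUE = total facility power / power delivered to the IT (compute) equipment.
# A quick comparison of what the same compute load costs at different PUEs.

def total_facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility draw for a given IT load at a given PUE."""
    return it_power_mw * pue

it_load_mw = 100.0          # 100 MW of compute (IT) load
hyperscale_pue = 1.1        # roughly the Google figure cited above
enterprise_pue = 1.5        # assumed value for a less-optimized enterprise facility

print(f"hyperscale: {total_facility_power_mw(it_load_mw, hyperscale_pue):.0f} MW total")
print(f"enterprise: {total_facility_power_mw(it_load_mw, enterprise_pue):.0f} MW total")
# Same 100 MW of useful compute, roughly 40 MW of extra overhead at the higher PUE.
```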
Shayle Kann: Okay, so let’s talk about the world of AI now, which is where all this growth in compute is happening. AI workloads of course divided into two major categories, one being training of models and the other being inference. I think we’ll spend most of our time today talking about inference probably, but let’s spend one minute on training. Is there any movement or argument that training should take place anywhere other than large centralized data centers? It seems very clear to me that the trend right now is just build the largest possible data center to train the largest possible model. So is there anyone who thinks that that might turn in the other direction?
Ben Lee: Some, but it really hasn’t gotten much traction. I think the reason why we see most training happening in massive data centers is because of the scale. You need to communicate large data sets, you need lots of GPUs, all closely coordinated learning the model parameters. The only scenario that some people have explored for training away from the data center is if you’ve got private data and you want to refine or fine-tune your model with that private data. You don’t want to share it with the hyperscalers. That has been primarily a research question rather than a production system that people have deployed.
Shayle Kann: Okay. So let’s assume then that the vast majority of training compute is still going to happen in centralized data centers as it stands today. I don’t know if you know the numbers, but just high level, of all AI workloads, how much is training versus inference? Because I think the other big point people have made is over time the proportion of workloads going toward inference is going to increase. The proportion of workloads going toward training may decrease as we sort of reach the next model or something like that. But today it’s mostly training still.
Ben Lee: I would agree with that. I think to first order the training costs are historically what people have cared about the most because the data sets are massive and you’re talking about these massive 1000 megawatt data centers for the training workloads. There was a study we did when I was a visiting research scientist at Meta where we found that energy costs for AI were roughly broken into three categories. There’s a data pre-processing aspect as well, and that’s about a third. The training is another third, and then the inference or the use of the model is the last third. But clearly those fractions are evolving rapidly and I would agree with you when you’re saying that the training costs are probably flatlining. They’re reaching a plateau in how quickly they’re growing perhaps, and if the optimism around AI is to be justified, you’re going to have to see inference costs go way up because that will be an indicator that adoption has gone up in a fairly significant way both among individual users but also among companies, enterprise users. So I think it’s true to say that inference costs are large and potentially will grow very rapidly.
Shayle Kann: So then we’re getting to the crux of our question today, which is inference workloads, inference costs increase over time, usage of the models increases over time. That’s the presumption of everything going on in AI world. And then the question is will that inference compute predominantly still take place in these big centralized cloud data centers or will some or much of it potentially shift either to one of the other two categories you described, sort of edge localized or fully localized on device? So let’s talk about the edge version first, which is essentially smaller data centers, still data centers, but smaller and more local. What’s the argument for why that might happen and what are the limitations?
Ben Lee: So the argument in favor of edge computing is mainly the proximity to the end user. So we’ve been conditioned in an era before generative AI that when we access internet-based services like a search engine, we expect the answer to come back in the order of a hundred milliseconds. That is the order of magnitude that we’re talking about. And as a result to get those hundred millisecond latencies, oftentimes you require computation closer to the user so you don’t have to travel across the internet, you don’t have to travel from the west coast out to the east coast and back again with the data, and get that answer back in a timely way. What is interesting with generative AI is that we are being reconditioned to tolerate much longer delays. So if you use something like GPT or you use something like Claude or your favorite chatbot, oftentimes it’s just sitting there thinking for seconds and seconds, maybe tens of seconds before it gets you the first token. So the question there is to what extent we care about that latency and need that really fast responsive access to the answer.
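[A rough illustration of why physical distance matters for that hundred-millisecond expectation. The speed of light in fiber is a standard physics figure; the distances and the 100 ms budget are illustrative assumptions.]

```python
# Light in optical fiber travels at roughly 200,000 km/s (about two-thirds of c),
# so round-trip propagation delay grows with distance before any routing,
# queueing, or model compute time is added. Distances are illustrative.

FIBER_KM_PER_MS = 200.0  # ~200,000 km/s -> about 200 km per millisecond, one way

def round_trip_ms(one_way_km: float) -> float:
    """Round-trip propagation delay over fiber, ignoring all other overheads."""
    return 2 * one_way_km / FIBER_KM_PER_MS

for label, km in [("same metro (edge)", 50), ("same region", 800), ("coast to coast", 4000)]:
    print(f"{label:18s}: ~{round_trip_ms(km):4.1f} ms round trip, propagation only")
# Real fiber paths are longer than straight lines and traverse many hops, so actual
# network latencies eat a bigger share of a ~100 ms interactive budget.
```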
Shayle Kann: And I think we’ve been especially trained even further in that direction with the introduction of things like deep research where even in the name you sort of think, well of course that has to take time. It is deep research that they’re doing. So it’s an interesting point that maybe we are becoming reconditioned to allowing more latency. The argument that I’ve heard for why latency is really going to matter apart from just wanting search queries or chat queries to come back quicker is the next wave of applications for AI, right? And so maybe we go back to the autonomous vehicle world and things like that where latency, making decisions in near real time does become really important. Robotics being another category that could be a major user of AI compute but needs really, really low latency. Is that part of the argument for shifting some compute to the edge?
Ben Lee: Yes, absolutely. So in the class of compute, you mentioned autonomous vehicles or robotics fit into what we call cyber physical AI. So cyber physical systems are those that have a cyber component, a computational component, but also interact with the physical world. And once those interactions with the physical world arise, then we care about responsiveness because that underpins safety guarantees and the ability to make sure that your robotic arm is able to respond quickly enough to hazards, your autonomous vehicles are able to do so. So I agree that there will be cases where we will need those really low latencies and that is going to require edge computing much closer to the user so we have much shorter internet delays, network delays.
Shayle Kann: I’m curious to understand the trade-offs here, right? I know with model training there are technical reasons why you want all your compute as clustered together as closely as possible. You want every GPU as close to every other GPU as you can make them minimizing the copper between them or the optics or whatever it is that’s communicating between them and that for some reason that you can explain to me makes model training more effective. Is there a similar dynamic in inference? Is there a technical reason that you are paying a penalty if you shift to smaller data centers at the edge or is there no technical reason why it’s suboptimal?
Ben Lee: Right. Yeah, let’s talk about the training piece first. The reason why we need thousand megawatt data centers where we have hundreds of thousands of GPUs connected so closely together is because the data sets are massive and the models are massive. We’re trying to learn on the order of a trillion parameters for these machine learning models, these AI models, and we’re trying to do it on the wealth of data we find on the internet. There’s no way that any single GPU can handle that much data. So what we end up doing is partitioning the data into smaller pieces and then handing each GPU a slice or a partition of this data and each GPU will churn on its own, work on its own partition of the data and learn the models that work best for its piece of the data. And all the other GPUs in the data center are doing the same thing on their partitions of the data.
Periodically what they will do is they’ll compare notes, they will share the weights that they’ve learned, and this sharing is really, really expensive and some of the people in the energy space may know that there are massive energy fluctuations or power fluctuations we will see in data center usage when the GPUs go from this computationally intensive phase where you are learning the model weights to this communication intensive phase where they’re comparing notes and sharing their intermediate results with each other. So as a result, that’s why we’re talking about these massive data centers for training. They all need to communicate frequently to share what they’ve learned from their own data sets. For inference, we don’t see that effect.
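[A highly simplified sketch of the pattern Ben describes: data-parallel training, where each worker stands in for a GPU, learns on its own partition of the data, and the workers periodically “compare notes” by averaging their weights. This is toy NumPy code for illustration only, not how a production training system is written.]

```python
import numpy as np

# Toy data-parallel training: each "GPU" (worker) holds a partition of the data,
# takes local gradient steps (the compute-intensive phase), and every
# `sync_every` steps the workers average their weights (the communication-
# intensive phase Ben describes, which drives the power swings in real clusters).

rng = np.random.default_rng(0)
n_workers, dim, lr, sync_every = 4, 8, 0.1, 10

# Synthetic linear-regression data, partitioned across workers.
true_w = rng.normal(size=dim)
partitions = []
for _ in range(n_workers):
    X = rng.normal(size=(256, dim))
    y = X @ true_w + 0.01 * rng.normal(size=256)
    partitions.append((X, y))

weights = [np.zeros(dim) for _ in range(n_workers)]  # each worker's model copy

for step in range(1, 101):
    # Compute-intensive phase: independent local gradient steps on each partition.
    for i, (X, y) in enumerate(partitions):
        grad = X.T @ (X @ weights[i] - y) / len(y)
        weights[i] -= lr * grad
    # Communication-intensive phase: periodically average weights across workers.
    if step % sync_every == 0:
        averaged = np.mean(weights, axis=0)
        weights = [averaged.copy() for _ in range(n_workers)]

print("max error vs. true weights:", np.abs(np.mean(weights, axis=0) - true_w).max())
```

In a real cluster that averaging step is a collective operation across many thousands of tightly interconnected GPUs, which is why the interconnect, and the swing between the two phases, matters so much.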
Shayle Kann: Just to add, the craziest thing to me about how model training data centers operate right now, the absolute craziest thing, is as you said, there are these surprisingly large spikes in power demand as a result of how the models are trained. And because those spikes are actually problematic not just to the grid but to the equipment inside the data center as well, what they do, at least sometimes, to manage that is they create dummy workloads. So they keep the power profile basically flat, but you are literally just wasting energy on absolutely nothing happening during those times. Dummy workloads at that scale. The fact that that is happening is wild to me.
Ben Lee: Absolutely, and I think we’ve seen this in other contexts as well, but not perhaps at this scale. It’s the notion of what, in electrical engineering, we call the dI/dt problem, the change in current divided by the change in time. If you have large current swings over very short periods of time, you could imagine building batteries to sort of smooth things out or decouple, and certainly a lot of people are thinking about that, but the easiest thing to do might be to just modulate the software, as you say, because we have very precise control over what the software does. So that is an active and ongoing area of research that needs to further develop.
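[To make the dummy-workload idea concrete, a toy sketch of how padding the low-power communication phases flattens the power profile at the cost of wasted energy. All of the megawatt and phase-length numbers are made-up illustrations, not figures from the episode.]

```python
# Alternating compute-heavy and communication-heavy phases produce a swinging
# power profile; scheduling dummy work during the low-draw phases keeps the
# profile roughly flat, at the cost of energy spent doing nothing useful.
# All numbers are illustrative.

compute_mw, comm_mw, phase_hours = 100.0, 40.0, 1.0   # assumed draw and phase length
profile = [compute_mw, comm_mw] * 5                    # ten alternating phases

padded = [max(p, compute_mw) for p in profile]         # dummy work tops up low phases
wasted_mwh = sum(f - p for f, p in zip(padded, profile)) * phase_hours

print(f"swing without padding: {max(profile) - min(profile):.0f} MW")
print(f"swing with padding:    {max(padded) - min(padded):.0f} MW")
print(f"energy burned on dummy work: {wasted_mwh:.0f} MWh")
```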
Shayle Kann: Okay, so then onto inference. So you’re saying inference does not contain that same challenge, so what is the downside to shifting inference workloads to the edge?
Ben Lee: To my knowledge, there isn’t much of a downside because the reason why inference is amenable to edge computing is because when you send a prompt for processing by a large language model, that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine. And the reason that it is, is because the model sits in that machine, the data sits in that machine, and all of your prior conversations with that bot are sitting in that machine and it’s a very localized piece of compute that needs to be done, and you don’t need tens or hundreds of GPUs to be coordinating to give you an answer back. You’ve got that one GPU or a tightly coupled set of GPUs giving you that answer back, and that is amenable, that is great for edge computing and we can certainly supply that.
Shayle Kann: A thought experiment that I’ve given people recently in thinking about this is: let’s just say that you need a gigawatt of inference compute five years from now or seven years from now, something like that, and the demand for that gigawatt is geographically centralized somewhere. Let’s just say you think you’re going to need a gigawatt of inference compute to serve the Dallas metropolitan area, whatever it might be at that point a few years from now. This is back to the power perspective. Is it going to be easier for you to find and site a one-gigawatt site or ten 100-megawatt sites within that geographic region? Today, I think it is still probably easier to find the gigawatt site, or at least the past couple of years it has been, but there aren’t that many gigawatt sites out there from a power availability perspective. So at some point, is that going to flip and is it going to be easier to build ten 100-megawatt sites, which sounds really hard to do and indeed is, but these are all hard problems. So if that happens, do you think that we are going to see a significant portion of that inference workload move to that type of scale? Is that the right scale? Should we be looking at 10 megawatt sites, 100 megawatt sites, one megawatt sites? How far to the edge do we want to go?
Ben Lee: Yeah, absolutely, and I agree with the premise of that question 100%. I think that there are two reasons to go to many smaller data centers. The first is the one you mentioned, power provisioning and connections to the grid. The second is the fact that you don’t need massive GPU coordination for an inference workload. I guess the catch might be that if you are thinking about your existing edge data centers, maybe you’ve got data centers in downtown Los Angeles or something like that, already serving workloads. Those facilities may not be configured to handle GPU and AI compute. They may have power delivery infrastructure that was optimized for CPUs. They might have HVAC systems optimized for the much lower power density of CPUs. So it’s not simply a matter of pulling out your CPUs and replacing them with GPUs. You may have to retrofit the facility itself to support that, but I agree, I think finding capacity there may eventually become easier than finding the next thousand megawatts.
Shayle Kann: Is there any limitation? I can imagine, I’m trying to think of why you wouldn’t do that. You need to have a fair amount of memory. You need to house all the model weights and so on in every individual data center if you’re going to do that at the edge, right? So there’s got to be some minimum viable scale, I assume.
Ben Lee: Right. And maybe to give you a sense of the type of data centers we were talking about in the past, again, in a study that we had done with Meta, we looked at 15 of their data centers before generative AI, and the scale of those facilities was somewhere between 15 and 50 megawatts, so less than 100 megawatts, and certainly it was fairly conventional, uncontroversial to build data centers of those sizes in the past. So that’s the starting point, I think, in terms of the scale. Now as you scale down towards, for example, one megawatt, it’s not clear at what point things start making less sense.
Shayle Kann: I guess the other point here, the way that the data center buildout has gone historically, just like the cloud data center buildout, it’s been fairly clustered in these regions and there’s a reason why Northern Virginia is the data center hub of the world and there are others as well, Chicago, Dallas, Phoenix. And that as I understand it, is largely because the cloud needed to offer a certain level of reliability to their customers, and so they could have redundancy within a given region and that was helpful to them in terms of what they were offering. Do you think that this future world wherein a bunch of inference compute moves to the edge, let’s call it 15 to 50 megawatt data centers, instead of hundreds or thousands of megawatt data centers, does it look similar? Is it that you have a small number of regions that have a really high concentration of those 15 to 50 megawatt data centers or could it be much more dispersed because the whole point of this is really low latency and local and you don’t need them to be as clustered?
Ben Lee: I think there are lots of different aspects that play in terms of data center siting. I think the redundancy is definitely one of them, and I have trouble disentangling the role that some of these other factors play as well. Some people talk about tax breaks and incentives from local companies and local states. Some people talk about proximity to internet exchange points. So not only are we talking about congestion free power movement, but you’re also talking about congestion free data movement into and out of the data centers. Northern Virginia has that. And then of course the availability of the power itself. I guess I would say that when you start talking about many of these smaller data centers from a redundancy perspective, it might be okay that they’re not all geographically clustered as long as you have a strategy for rolling over the compute or rolling over the workload to spare capacity somewhere within that region that has a similar performance profile or some sort of similar latency or delay characteristic. And so that’s really the concern, whether you have robust geographical redundancy and resilience there.
Shayle Kann: Is this happening? It’s interesting. I was thinking about, okay, so it sounds like you’re saying there’s not a big downside. We already have significant inference workloads, so it’s not like we’re waiting on workloads to show up that could accommodate this. And yet if you look at everybody, most everybody building data centers, certainly the hyperscalers and I think the colos and folks as well, the focus continues to be on, we got to find big sites for big data centers. Why don’t we see more development of this smaller scale edge AI inference world?
Ben Lee: I think it really depends on the workload and the application. And we don’t know. I would view AI as a more fundamental basic technology and we don’t necessarily know what application or capability will be layered on top of it. I’d say that we’ve been talking about edge data centers a lot. There are other words for this type of data center. A content distribution network, or CDN, is one example; a point of presence, or POP, is another name these facilities sometimes go by, and they exist in fairly significant numbers. Content distribution networks ensure that when you want to access, for example, nytimes.com or wsj.com, your webpage is not being served from the other end of the country. Those webpages are sitting close to you because the content distribution network took those updated webpages and moved them to facilities near you, data centers near you. Likewise, companies like Meta, when they have Instagram or these social media applications, supply data from local points of presence rather than retrieving content for your feed from across the country. So we already see that, but these are application level performance requirements, whether they be for social media or for other sorts of news content. Once it becomes clear what applications of AI really drive further inference deployments, then we’ll know what sort of performance requirements are needed and what caching techniques or strategies might be useful, so that we can keep fresher data or more frequently used models closer to these users and serve them more quickly. I think it will become clearer as we see which models really get traction, which applications really get traction.
Shayle Kann: So maybe the state of affairs today is, look, anybody who’s developing data centers, we know we need the big centralized data centers because there is currently essentially endless demand to train models, at least relative to the availability of compute today. And so we know we need to build the big centralized ones. We might as well use those big centralized ones that we know we need right now for inference workloads such as they are today, but we don’t have enough certainty yet about what the inference workloads are going to be long term to invest the kind of capital and time it would take to build out the network of ten 100-megawatt data centers in a particular geographic region. Something like that.
Ben Lee: That’s right. And I would say maybe that my crystal ball is as cloudy as anyone else’s crystal ball, but I feel like there’s a huge amount of GPU capacity being discussed in the pipeline in these large data centers. And if it turns out that maybe there are diminishing returns from training larger and larger models, or maybe we run out of data because we’ve exhausted all the data that’s available on the internet, when those things happen, it may be that demand for these GPUs in these largest data centers will flatten out and we’re going to have spare capacity at which point, as you say, they will be used to serve inference, and then it’ll be hard to make the case for building yet more data centers, smaller ones with GPUs closer to the users. I think the catch there will be if one of these model providers or one of these application developers makes performance a distinguishing feature of their offering, right? If they start competing on performance rather than on capability, then we’re going to see, well, I may have a thousand GPUs in the middle of Nebraska that are already deployed, but if I really want to break into the San Francisco market, I’ve got to build my GPUs right there and have them available.
Shayle Kann: Alright, so speaking of performance, let’s transition to the full extreme version of this, which is also I think theoretically the most disruptive from an energy perspective, which is shifting any significant portion of these inference workloads all the way onto the device, skip the middle ground of edge, five megawatt data centers or 15 or 50, or include them, but shift workloads that would’ve gone to a big data center that requires a lot of power straight onto your iPhone or your iPad or whatever it is. And we’ve heard some claims of this as well. Give me the similar sort of pros and cons of shifting that workload straight onto the device.
Ben Lee: Right. The pros are primarily two things. One is performance, right? You don’t have to go across the internet. The model is right there and the compute is right there, assuming that you have really capable hardware on your device as well. You get really quick, responsive answers from your AI. The second is also something we’ve mentioned earlier, which is the notion of privacy. You don’t necessarily need to send your data out into this hyperscale data center where it gets blended with lots of other user data and you have fewer guarantees about what happens to it. Localized compute is certainly more private than compute on shared systems. So those are the two key advantages. And then I guess the third would be that it gets more tightly integrated with the capabilities on a particular platform. So for example, Apple’s ecosystem.
Shayle Kann: Right. And Apple seems like the obvious candidate to do this. Clearly you mentioned privacy. Apple is particularly focused on privacy. They have the hardware, the device, right? Apple is notoriously or at least reputationally behind in the AI race, and so it’s not hard to picture that if somebody’s going to move a lot of this inference on device, it’s going to be Apple. Okay, but there is a real trade-off here, I assume.
Ben Lee: Yes. And the trade-off is primarily with respect to the capabilities of the device. So if we have a very large model, we’re going to have to deploy that model on a much more capable hardware platform than we’ve got today. This means having some number of gigabytes of memory to hold the model weights and then also some additional gigabytes of memory to hold the context. As you develop this conversation with the model, in addition to the memory, you’re also going to need the compute. Now you’re not going to have this high performance GPU sitting inside your phone, so you’re going to have to have specialized chips. Those specialized chips are on hand, and they’re going to be less powerful, less capable than the ones in the data center. So all of this speaks to not getting exactly the same model that you would get in the data center.
You would get a shrunk down model. Maybe in the data center you would have a trillion parameters, this massive GPT-5 model for example. But on a personal consumer electronics device, you might only have 7 billion parameters. So orders of magnitude smaller, and that smaller model will be less capable. It’ll give you less capable answers, it will be capable of doing fewer tasks, but maybe that’s okay because you’ve identified only a handful of tasks that you really care about on your personal device. So that is really the trade off. As you go towards the device, you’re going to have to shrink the size of the model down. You’re also going to get less and less capability out of your AI. The final thing, of course is the power and energy profile. At data center scale, we care primarily about power because power influences infrastructure and power delivery and influences thermal management and so on.
For device level compute, there are two considerations. We care about energy rather than power because that affects battery life. So even if you could deliver a really capable GPU chip onto your phone, the question is how long would your phone last if you were using that chip on a fairly consistent basis? So the energy aspect will continue to be challenging. And then the thermal aspect will also be challenging. If you have a really powerful device that’s going to be a hot brick inside your pocket, and that’s going to be a deal breaker as well.
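[To put rough numbers on the memory side of that trade-off: the parameter counts echo Ben’s example, while the bytes-per-parameter figures, 16-bit versus 4-bit quantized weights, are illustrative assumptions rather than figures from the episode.]

```python
# Memory needed just to hold model weights: parameters x bytes per parameter.
# Parameter counts follow Ben's example (~1 trillion in the data center,
# ~7 billion on-device); the precision choices are illustrative assumptions.

def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Gigabytes of memory to store the weights alone."""
    return n_params * bytes_per_param / 1e9

print(f"1T params at 16-bit precision: {weights_gb(1e12, 2.0):,.0f} GB")   # ~2,000 GB
print(f"7B params at 16-bit precision: {weights_gb(7e9, 2.0):,.1f} GB")    # ~14 GB
print(f"7B params quantized to 4 bits: {weights_gb(7e9, 0.5):,.1f} GB")    # ~3.5 GB
# On top of the weights comes memory for the growing conversation context,
# which is why on-device models are aggressively shrunk and quantized.
```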
Shayle Kann: So when you say deal breaker, is there progress toward on-device inference? To your point on performance, that strikes me as, okay, we’re now again in the context of specific workloads: for certain types of workloads a 7 billion parameter model might be fine, and for others it wouldn’t be. So maybe there will be some on-device chip and some inference that you could do on-device, but you pull up your ChatGPT app or whatever, and of course it’s going to send you back out to the cloud or maybe to the edge. But these other challenges of thermal management and things like that are hardware challenges. Where are we in the progression of on-device inference? Is it coming, is it not coming? Do we not know?
Ben Lee: I think the assumption with on-device inference is that you’ll be able to shrink the model without loss in performance for the tasks you care about. That is the primary strategy computer scientists have been taking. On the hardware side, we have made strides in developing custom chips, custom silicon, for the specific types of tensor algebra that are required for machine learning models. So we know how to build those chips and that gives us energy efficient compute, higher performance. We know how to build really capable memory systems and solid state disks. So your phone now has hundreds of gigabytes of memory or storage on it, and there’s a question of, well, maybe you’ll end up using less of it for your photos and more of it for your AI models, something like that. So I think there are fairly significant resource constraints, but I don’t think that they are insurmountable, in the sense that more intelligent hardware design and more intelligent hardware management could go some ways toward making these AI models feasible on the device.
Shayle Kann: Okay. So I’m going to put you on the spot, and we promise not to hold you to these numbers, but just to give a sense of where we think things are heading. If we’re fast forwarding 10 years, right? Let’s just say we’re in 2035, and imagine there’s a total volume of inference compute in the world, let’s just say it’s 100 megawatts total. What would be your guess of the ranges of how much of that compute is going to take place in large centralized data centers versus at the edge? We’ll draw a line: let’s say 100 megawatts and above is large centralized; sub-100 megawatts, but not on device, is edge. And then the third category of course being on device. How much of it can go anywhere but the centralized data centers?
Ben Lee: I would go straight to this idea of having an 80/20 rule. We see this all the time in computer systems, where you have 20% of your tasks being extremely popular. Maybe there are 20 things that you always want to do and you spend 80% of your AI compute doing those things. That could be email processing, that could be photo analysis, and so on. We can identify what those really compelling applications and tasks are, and we’re going to be spending most of our time doing that. And then for the remainder of the long, heavy tail of other tasks that people might want to do, there will always be backup capabilities residing in the cloud data center. So I would say that we could be getting 80% of our compute done locally and leaving 20% of the heavy lifting, or the more esoteric, corner case compute, for the data center cloud. That is, of course, excluding the training. The training will continue to all reside in the massive facilities. But in terms of the inference, I think there’s huge potential.
Shayle Kann: Right. But yeah, that’s actually a very significant shift. I appreciate that that doesn’t include training, but still, if 80% of the inference workload could end up local, that’s a significant shift and has pretty profound implications for the energy picture as well. Are you saying that 80%, just to pin you down even a little bit more, is that local in the sense of being at the edge or is that local in the sense of being on device, or what do you think the split ends up being there?
Ben Lee: Yeah, so I think of the 80%, I would say most of that will be on the edge, I suspect it is today. I think that if you look at what we talked about earlier, content delivery networks, points of presence, they’ve probably identified 20% of the content that 80% of the people will be looking at most of the time, and they’re putting it at the edge. I think maybe on the order of 1% ends up being put on your consumer electronics. Actually, even for today’s compute, when we set aside AI, there is a trend towards consumer electronics hiding that flow of data back and forth between the device and the edge for you. So sometimes if you use a cloud storage service like Dropbox or if you’re using a photo storage service, they will let you pretend that you have access to all of your videos or all of your photos and all of your documents, and they will transparently behind the scenes move things back and forth between the data center and your local device. So you may think you have all of it, but maybe you’ve only got a tiny sliver, less than 1% on your local device.
Shayle Kann: Right. Certain things in my Dropbox instance open up much faster than others when I try to open them, and it’s occurred to me that that is why. If I step back then, okay, so it sounds like what you’re saying in this scenario you’re painting of the future is roughly 80% of the inference workloads are edge, very little of it actually on device, and then the other 20% or so sitting in big cloud data centers. So when I think about the energy implications of that, there’s, I think, a couple of ways to think about it that are pretty interesting. One is, okay, so maybe a fair amount of the energy consumption of at least inference compute is going to shift to these five megawatt, 15 megawatt, 50 megawatt type local sites. That has big implications for the grid in ways that are, I don’t know, both good and bad, probably harder to manage in some ways, easier to manage in other ways. But the overall energy consumption of inference, I would expect, and you can tell me if I’m wrong, would actually be higher in this scenario than it would be if it was all centralized. Because I assume the PUE that you get for these edge data centers isn’t quite as good as it is for the large centralized data center. So on balance, this probably means more overall AI energy consumption. Do you think that’s right?
Ben Lee: Yes. Yes. I think you get economies of scale when you go to a gigawatt or two gigawatts, you have a single facility, you’re managing it in a highly optimized, coordinated way, and you’ve got hundreds of thousands of these machines all managed very precisely. I think as you shrink the system down, you will lose in efficiency. You’ll be trying to build these 20 megawatt data centers and maybe in footprints or facilities that weren’t designed initially for those workloads. So yes, I think total energy costs may go up as a result.
Shayle Kann: We’re talking about inference workloads to some extent as a monolith. I’m sure they’re not. So are there big distinctions in your mind in terms of the different types of inference workloads and how that influences where they should be housed?
Ben Lee: Right. Yes. So that’s a really great question actually. I would say that there are fundamental limits to the number of inference queries a human user can actually produce because we’re ultimately limited by the speed of our typing, the number of tokens we can actually produce to query the models. So there is some of that where humans will continue to send requests to agents, but I think increasingly most of the inference workload will come from other software agents. This could be a search engine retrieving webpages, and then asking the large language model to summarize it into a coherent discussion for you.
This could be your photo app learning something about your images, or this could be your email app doing something with the emails and helping you compose messages. So all of that is done behind the scenes and those inference workloads are potentially much larger because of course, software can generate those requests at much, much higher rates. From the perspective of where that computation happens, to the extent that the data center already has servers running your email workloads, or to the extent that your search engines are already running in the same data center, the communication to the model will be less of a bottleneck, right? So if you have a data center in Nebraska running your search engine for you or doing some of these other big heavy lifting software jobs, then potentially they could query and execute inference in these largest hyperscale data centers.
Shayle Kann: Alright, Ben, this was super interesting. Really appreciate your time.
Ben Lee: It was my pleasure. I really enjoyed the conversation. Thanks so much.
Shayle Kann: Dr. Ben Lee is a professor of electrical engineering and computer science at the University of Pennsylvania. He’s also a visiting researcher at Google. This show is a production of Latitude Media. You can head over to latitudemedia.com for links to today’s topics. Latitude is supported by Prelude Ventures. This episode was produced by Daniel Woldorff, mixing and theme song by Sean Marquand. Stephen Lacey is our executive editor. I’m Shayle Kann, and this is Catalyst.


