Will inference move to the edge?

Catalyst with Shayle Kann – Episode Summary

Episode Title: Will inference move to the edge?
Date: December 18, 2025
Host: Shayle Kann
Guest: Dr. Ben Lee, Professor of Electrical Engineering and Computer Science at the University of Pennsylvania, Visiting Researcher at Google

Overview:

This episode explores a provocative and timely question: As artificial intelligence (AI) workloads—especially inference tasks—grow, will they remain the domain of massive, centralized cloud data centers, or could they decentralize toward the “edge,” closer to users or even directly onto devices? Shayle Kann and Dr. Ben Lee discuss the technical, economic, and energy implications of this possible shift, examining what might push AI inference toward the edge, how that could fundamentally alter energy demand and grid management, and what obstacles stand in the way.

Key Discussion Points and Insights

1. Defining the Compute Landscape

Three Layers of Compute:
- Cloud (Centralized Data Centers): Massive structures run by hyperscalers (Google, Amazon, Microsoft) handling most compute today.
- Edge Computing: Intermediate layer; smaller, still sophisticated data centers closer to users (same city/region), providing lower latency.
- On-device/Edge-Edge: Compute taking place directly on consumer hardware (phones, laptops, etc.)
  ([05:54])
Current State:
- The vast majority of classical and AI inference compute still happens in large, centralized data centers due to their unmatched energy and cost efficiency, resource sharing, and economies of scale.
- Edge computing, though frequently discussed (e.g., for autonomous vehicles), has not taken off as initially predicted, because latency-sensitive applications have not yet created urgent demand for it.
  ([08:12])

2. AI Workloads: Training vs. Inference

Training:
- Remains the sole preserve of centralized data centers, requiring massive coordination of GPUs, enormous datasets, and high energy efficiency.
- Privacy-based or specialized distributed training exists in research, but not in production.
  ([10:28])
Inference:
- Historically a smaller portion of AI costs, but set to rapidly increase as model adoption expands.
- “Inference costs are large and potentially will grow very rapidly.”
  (Dr. Ben Lee, [11:39])
- Training and inference workloads differ in technical needs, influencing where they can efficiently run. Inference requires less coordination among machines than training.

3. Why Move Inference to the Edge?

Latency:
- Traditionally, applications like search have required sub-100 ms responses, favoring compute closer to the user.
- New habits (e.g., tolerating multi-second delays with LLM chatbots) could lessen the pressure for ultra-low latency.
- Future applications—autonomous vehicles, robotics (“cyber-physical AI”)—could drive strong demand for real-time, edge inference.
  ([15:39])
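To make the latency argument concrete, here is a minimal sketch of a latency budget check. It assumes signals travel through fiber at roughly 200,000 km/s (about two-thirds of the speed of light in vacuum) and ignores routing, queuing, and processing overhead, so it gives a best-case floor only. The distances and the 80 ms compute time are hypothetical and do not come from the episode; only the 100 ms budget echoes the search example above.

```python
# Illustrative only: the distances and compute time below are assumptions,
# not figures from the episode.
SPEED_IN_FIBER_KM_PER_S = 200_000  # roughly two-thirds of the speed of light in vacuum


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over fiber, in milliseconds."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_S * 1000


def fits_budget(distance_km: float, compute_ms: float, budget_ms: float = 100.0) -> bool:
    """True if propagation delay plus compute time stays within the latency budget."""
    return round_trip_ms(distance_km) + compute_ms <= budget_ms


if __name__ == "__main__":
    # Hypothetical distances for a same-city edge site and a distant centralized site.
    for label, km in [("same-city edge site", 50), ("distant centralized site", 2500)]:
        verdict = "fits" if fits_budget(km, compute_ms=80.0) else "exceeds"
        print(f"{label}: {round_trip_ms(km):.1f} ms round trip, {verdict} a 100 ms budget")
```

Under these assumptions, distance only matters when the compute time already consumes most of the budget, which is consistent with the point that multi-second chatbot responses reduce the pressure to move closer to the user.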
Technical Feasibility and Trade-Offs:
- Inference depends far less than training on massive, tightly coupled GPU clusters.
- “The reason why inference is amenable to edge computing is because...that prompt is probably handled by one GPU or maybe eight GPUs inside a single machine...and you don’t need tens or hundreds of GPUs to be coordinating to give you an answer back.”
  (Dr. Ben Lee, [22:24])
Data Center Siting and Grid Interactions:
- As siting and powering new gigawatt-scale data centers becomes increasingly difficult, smaller (10–50 MW) edge data centers may become more attractive and feasible.
  ([24:45])
- However, retrofitting existing edge sites to support GPU- and AI-intensive applications is not trivial.
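As a rough sense of scale for the siting trade-off, the sketch below counts how many edge sites in the 10–50 MW range would be needed to match one gigawatt-scale campus. The 1,000 MW total is an assumed round number for "gigawatt-scale", and the function name is hypothetical; the calculation ignores differences in utilization and efficiency between site types.

```python
import math


def sites_needed(total_mw: float, site_mw: float) -> int:
    """Number of equally sized sites required to supply total_mw of capacity."""
    if site_mw <= 0:
        raise ValueError("site_mw must be positive")
    return math.ceil(total_mw / site_mw)


if __name__ == "__main__":
    total_mw = 1000  # assumed size of a single gigawatt-scale campus
    for site_mw in (10, 50):
        print(f"{site_mw} MW sites needed for {total_mw} MW: {sites_needed(total_mw, site_mw)}")
```

Under these assumptions, replacing one such campus means siting and powering somewhere between 20 and 100 separate facilities, which helps explain why retrofits and buildouts at the edge are a nontrivial undertaking.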

4. Obstacles to Edge and On-device Inference

Infrastructure & Economics:
- Edge data centers often lack infrastructure (e.g., adequate power, optimized cooling) for dense GPU deployment.
- Presently, massive demand for centralized training leaves little incentive to invest in wide edge buildouts until inference workloads are more predictable and profitable.
  ([32:01])
Technological and Energy Trade-Offs:
- Decentralizing may reduce efficiency (a higher PUE, or Power Usage Effectiveness, meaning more overhead energy per unit of IT energy), increasing total energy demand compared to hyperscale data centers.
- “Total energy costs may go up as a result [of moving inference to the edge].”
  (Dr. Ben Lee, [45:34])
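PUE is defined as total facility energy divided by the energy delivered to IT equipment, so a value closer to 1.0 means less overhead for cooling and power distribution. The sketch below shows how the same IT load translates into different total energy at two PUE values. The values 1.1 and 1.5 are hypothetical and chosen only for illustration; the episode does not cite specific figures.

```python
# Hypothetical PUE values chosen for illustration; not figures from the episode.
def facility_energy_mwh(it_energy_mwh: float, pue: float) -> float:
    """Total facility energy implied by a given IT load and PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_energy_mwh * pue


def extra_energy_mwh(it_energy_mwh: float, pue_central: float, pue_edge: float) -> float:
    """Additional facility energy when the same IT load runs at the edge PUE."""
    return facility_energy_mwh(it_energy_mwh, pue_edge) - facility_energy_mwh(
        it_energy_mwh, pue_central
    )


if __name__ == "__main__":
    it_load = 100.0  # MWh of IT energy, an assumed workload size
    delta = extra_energy_mwh(it_load, pue_central=1.1, pue_edge=1.5)
    print(f"Extra facility energy at the edge: {delta:.1f} MWh")
```

With these assumed values, the same 100 MWh of IT work would draw about 40 MWh more at the less efficient site, which is the mechanism behind the concern that total energy use could rise.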
On-device Inference: Pros & Cons:
- Pros: Ultra-low latency, privacy (data remains local), tight hardware/software integration (e.g., Apple devices).
- Cons: Need to significantly shrink models; less capable chips; battery life and thermal management become limiting factors; only a subset of tasks will be feasible on-device.
  ([34:42–38:28])
- Custom AI chips and smarter resource management can help, but constraints will remain.

5. Future Scenarios – 2035 Outlook

Dr. Ben Lee’s 80/20 “Rule of Thumb”:
- “We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting or the more esoteric...for the data center cloud. Of the 80%, I would say most of that will be on the edge...maybe on the order of 1% ends up being put on your consumer electronics.”
  (Dr. Ben Lee, [41:18, 42:51])
- “Locally” here mostly means the edge rather than on-device.
- Training remains in centralized data centers; inference could shift heavily to edge, with only a small fraction making it to the actual device.
Implications for Energy and the Grid:
- Increased edge adoption would spread power demand more broadly across the grid, creating both new challenges and flexibility opportunities.
- Efficiency could decline, raising total energy use.

6. Types of Inference Workloads: Human vs. Agent

Increasingly, AI agents and software will generate the bulk of inference requests, not end-users—potentially raising compute demand even higher and influencing optimal siting (often still centralized).
([46:26])

Notable Quotes & Memorable Moments

Edge Possibilities:
- “We could be getting 80% of our compute done locally and leaving 20% of the heavy lifting for the data center cloud.” — Dr. Ben Lee ([41:18])
Technical Trade-offs:
- “As you shrink the system down, you will lose in efficiency… So, yes, I think total energy costs may go up as a result.”
  — Dr. Ben Lee ([45:34])
On-device Limitations:
- “That smaller model will be less capable. It will give you less capable answers. It will be capable of doing fewer tasks. But maybe that’s okay because you’ve identified only a handful of tasks that you really care about on your personal device.”
  — Dr. Ben Lee ([36:02])
On the Wildness of Data Center Power Management:
- “They create dummy workloads so they keep the power profile basically flat. But you are literally just wasting energy on absolutely nothing...”
  — Shayle Kann ([18:46])
Why smaller edge data centers might take off:
- “If one of these model providers or one of these application developers makes performance a distinguishing feature of their offering... then we’re going to see, well, I may have a thousand GPUs in the middle of Nebraska that are already deployed, but if I really want to break into the San Francisco market, I’ve got to build my GPUs right there and have them available.”
  — Dr. Ben Lee ([32:44])

Timestamps for Key Segments

Defining Compute Layers: [05:54]
Data Center Energy Efficiency: [08:12]
AI Training vs. Inference Workloads: [10:28–11:39]
Edge Computing, Latency & AI Applications: [13:39–16:24]
Training Requires Centralization: [17:08]
Inference Technical Feasibility at the Edge: [22:24]
Data Center Siting and Power Issues: [24:45]
Obstacles and Market Dynamics: [32:01]
On-device Inference Pros and Cons: [34:42–38:28]
2035 Scenario & 80/20 Rule: [41:18–42:51]
Energy Implications of Edge Inference: [45:34]
Inference Workload Diversity: [46:26]

Tone and Final Thoughts

The conversation balances deep technical insight with a practical, market-oriented perspective. Shayle’s energetic curiosity complements Dr. Lee’s clarity and expertise. Both are cautiously optimistic about edge inference’s potential but realistic about economic bottlenecks and energy trade-offs.

Bottom line:
While AI inference currently clusters in mega data centers, technical and market signals suggest a future with much more decentralized compute—at the edge, if not (yet) on-device. This shift will fundamentally reshape where energy for AI is consumed, how efficiently it's used, and what investments get made in infrastructure across the grid and technology stack.
