No Priors Podcast Episode Summary
Baseten CEO Tuhin Srivastava on the AI Inference Crunch, Custom Models, and Building the Inference Cloud
Released: May 1, 2026
Hosts: Sarah Guo & Elad Gil
Guest: Tuhin Srivastava (Founder & CEO, Baseten)
Episode Overview
In this episode, Sarah Guo and Elad Gil sit down with Tuhin Srivastava, CEO of Baseten, to discuss the explosive growth and challenges in AI inference, the rapid evolution of custom and open-source models, the realities of compute supply, and building for the future of intelligent applications. Baseten has experienced 30x growth in the last year and is on track for $1B+ in revenue, highlighting both the scale and urgency of the AI inference market. The conversation spans from the strategic importance of inference and post-training, to geopolitical implications of model origins, to the nitty-gritty challenges of scaling infrastructure and talent.
Key Discussion Points & Insights
1. The Inference Explosion & Market Dynamics
- Scale & Demand: Baseten's 30x growth is fueled by an increasing realization that AI can be embedded everywhere, across both open- and closed-source models. The “long tail” of custom/specialized models is materializing as customers bring intelligence in-house and the application layer expands. (00:49)
- "Everyone is realizing that you can put AI everywhere… the long tail models coming true, customers in-housing a lot of that intelligence themselves." — Tuhin, (00:49)
- Application vs. Lab Layer: Tuhin defends the continued existence of an independent application layer, arguing that companies’ unique workflows and user signals can’t simply be absorbed by foundational labs.
- "The application layer will exist… because what is valuable to a company is the user signal only they can gather." — Tuhin, (02:17)
2. Who's Adopting AI—and How?
- New AI-Native Companies vs. Enterprise In-Housing: The majority of inference is still coming from new application companies, but enterprise adoption is just ramping up. Once enterprises become comfortable with closed-source APIs, custom models will follow.
- "Most enterprise adoption is well ahead of us. That’s one of the very exciting things about AI—there’s just so much still to come and people are underestimating that." — Elad, (05:25)
- Learning from AI-Natives: Building with frontier customers (Abridge, Open Evidence, Decagon, etc.) allows Baseten to anticipate enterprise needs, since these companies sell into regulated, demanding sectors like healthcare. (06:21)
3. Open Source vs. Closed Source, & Geopolitical Tensions
- Model Choices: Leading AI companies generally choose models for capability, optimizing later for cost. There’s robust use of both Western and Chinese-origin open-source models. (08:16)
- Security & Geopolitics: Security concerns around Chinese models exist, but Tuhin downplays evidence of direct risk while stressing the vital need for a strong US open-source ecosystem.
- "If we don’t have access to that intelligence in that form, it’s just a massive loss… we won’t be able to innovate as fast…" — Tuhin, (12:00)
- "There are five labs in China creating open source models and we’re struggling to get one set up…" — Tuhin, (09:46)
4. Custom Models & Post-Training: Now the Norm
- Workload Breakdown: Over 95% of tokens served on Baseten are from custom, modified models—very few are vanilla open-source. (13:20)
- "No one is just running the vanilla open source weights… almost all are modified for their own use case." — Tuhin, (13:36)
- Investment in Post-Training: Baseten acquired a research team to build post-training expertise, directly tying infrastructure to customers’ ability to continually improve and specialize their models. There’s a virtuous loop between post-training and inference. (14:34, 17:10)
- "Inference creates data, you do evals, you can now post-train on that reward function… it’s the entire loop." — Tuhin, (16:23)
5. Inference Capacity Crunch: Realities & Strategies
- Severity of Supply Shortage: The compute supply crunch is worse than many realize; there’s very little slack globally. Baseten has built technology and operations for extreme flexibility—deploying across 18 clouds and 90 clusters to secure compute. (18:47, 21:48)
- "No matter as much as we hear about it, I don't think people realize how bad it really is… very little slack compute available." — Tuhin, (18:47)
- Supply, Suppliers & Contracts: Not only is GPU supply constrained, but there’s also a shortage of reliable, operationally competent cloud providers. Locking in capacity now requires multi-year contracts and significant prepay—driving unique working capital and financing needs. (21:54, 23:20)
- Strategic Advantage: In a constrained compute world, actually owning compute is a core differentiator. The software layer (custom, sticky inference APIs) further locks in customers, far more than commodity GPU leases. (24:45)
- "Inference with the software layer included is incredibly sticky. None of our top 30 customers have ever churned…" — Tuhin, (24:45)
6. Chips, Multi-Vendor Ecosystem, and Technical Frontiers
- Multi-Chip Future: While diversification is expected (Nvidia alongside decode/inference-specific chips), Nvidia’s scale, supply chain, and CUDA developer ecosystem make it hard to unseat the company soon. (26:56)
- "People really underestimate NVIDIA’s supply chain… the ability to move fast, given their scale, it’s hard to see anyone compete soon." — Tuhin, (26:56)
- Runtime & Workload Trends: Continued focus on optimizing runtimes (diffusion, agents, sandboxes, async batch inference) and deepening the inference–post-training loop. (28:42)
- Scale Surprises: Edge cases at scale manifest as systems- and kernel-level incidents, rarely just LLM-specific quirks. (31:44)
- "You start seeing limitations with kernel panics, log overflows… the craziest stuff is these runtimes are pretty immature." — Tuhin, (31:44)
7. Scaling Talent & Culture at Hypergrowth
- Leadership Lessons: Baseten was flat and engineer-heavy until recently. The imperative now is hiring leaders who can own whole problems, paired with a clear hiring philosophy (first-principles thinking, kindness, low ego). The result: retention is high. (34:19)
- "If you feel like you need to be involved in everything, it's probably a cop out… you probably don't have the right people." — Tuhin, (35:06)
- Operations Culture: True infrastructure companies live an “on-call” culture; pagers and alerts become part of daily life, and it quickly screens for fit. (36:58)
- "Inference can't go down… my cofounder's 7-year-old asks, 'Is that a P0?'" — Tuhin, (37:13)
8. Jevons Paradox & Demand Elasticity
- Lower Costs = More Inference: Lowering inference cost just increases consumption—people want more/better intelligence, not less.
- "If you make it cheaper, they'll insert more intelligence anyway… agents are just longer running now." — Tuhin, (38:54)
- "Inference going down just begets more. It is the last market." — Tuhin, (40:12)
9. The Next Few Years: The Inference Cloud as a New Paradigm
- Future Vision: Intelligence embedded everywhere, with “units of cognition” sold as a utility; companies must embrace the shift or become obsolete; and personalized agents for every aspect of life: AI-powered “concierges” for health, education, and personal management. (41:01)
- "Everything is smarter… you get better care, more software, more things built." — Tuhin, (41:01)
- "Concierge is everything for everyone." — Elad & Tuhin, (41:41)
Notable Quotes & Memorable Moments
- Hypergrowth Reality Check:
- "We've grown a ton over the last 12 months… the answer is always just go bigger, go faster… but the big one is compute." — Tuhin, (32:59)
- On Build vs. Buy—What Makes a Winner:
- "GPUs as a service is not sticky. Inference with the software layer included is incredibly sticky. That's been seen." — Tuhin, (24:45)
- Cultural Ritual:
- "When we have P0s—everyone on the call, may as well be a siren that goes off in the office." — Tuhin, (38:09)
- The Inference Cloud Thesis:
- "To me, this is what an Inference Cloud looks like: you are very good at inference, then do all the things tangential, loop into inference and partner where necessary." — Tuhin, (30:39)
Timestamped Topics & Highlights
| Time | Topic |
|-----------|-----------------------------------------------------------------------------------------------|
| 00:49 | Baseten’s scale: 30x growth, AI everywhere, open source, post-training goes mainstream |
| 02:07 | Will the application layer survive vs. the AI “labs”? |
| 04:34 | AI-native startups vs. enterprise AI adoption: who drives volume? |
| 06:21 | Learning from AI-native customers to pre-empt enterprise needs |
| 07:55 | Open source adoption evolution: Mistral, Llama, Chinese-origin models reaching the frontier |
| 09:46 | Security/geopolitics: Should the U.S. worry about Chinese models? |
| 13:20 | Custom vs. vanilla models: tokens served, how everyone is customizing |
| 14:34 | Post-training expertise and acquisition rationale |
| 17:42 | When customers should start customizing/post-training |
| 18:47 | Severity of the compute supply crunch—no slack, global multi-cloud, operational diligence |
| 21:54 | Securing capacity: contracts, term lengths, prepayment, working capital considerations |
| 24:45 | What makes an inference player “sticky”? Software glue, not GPU commodity |
| 26:56 | Multi-chip world? Why Nvidia’s ecosystem is dominant “for now” |
| 28:42 | Technical roadmap: runtimes, sandboxes, prefill, decode, async batch, evals |
| 31:44 | What breaks at extreme scale—kernel panics, log overflows, runtime immaturity |
| 32:56 | What keeps Tuhin up at night: Capacity & the pressure to go even bigger |
| 34:19 | Scaling philosophy: moving from flat org to empowered leaders |
| 36:58 | Operations culture—inference outages, alerts, fit & retention |
| 38:54 | Jevons Paradox: Lower inference cost drives consumption even higher |
| 41:01 | Future state: The era of “cognition as a utility,” ubiquitous agents/concierge experiences |
Conclusion
This episode delivers a front-row perspective on how AI inference is rapidly maturing into one of technology’s largest, most competitive markets, with Baseten at the forefront. Key takeaways include the crucial role of custom/post-trained models, extreme compute constraints dictating business strategy, sticky value at the software layer, and a future shaped by continuous loops of learning and intelligent automation. The Baseten story is both a real-world playbook and an early window into AI’s infrastructure future.