Latent Space: The AI Engineer Podcast
Episode: Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Date: October 16, 2025
Guests:
- Kyle Corbitt (Co-founder & CEO of OpenPipe, acquired by CoreWeave)
- Alessio (Host, Founder of Kernel Labs)
- swyx (Host, Editor of Latent.Space)
Overview
This episode features Kyle Corbitt, co-founder and CEO of OpenPipe, recently acquired by CoreWeave. Alessio and swyx dig into Kyle’s journey from YC’s Startup School to OpenPipe’s founding, rapid scaling, pivots, and eventual acquisition. The conversation centers on the evolution of model fine-tuning, why Reinforcement Learning (RL) has become central in AI infrastructure, the technical trade-offs in fine-tuning strategies, the realities of running an RL-first business, and what the future holds for continual learning, agent environments, and the economics driving the foundation model ecosystem.
Kyle’s Journey: From YC to OpenPipe
(00:17 – 04:01)
- Startup School Background:
- Led YC’s “Startup School” for 4+ years, running technical and content products, including a successful co-founder matching service
- “A very large fraction of the batches that went through YC while I was there were directly attributable to people that we found and ended up recruiting through their experience at startup school.” (01:13–01:29)
- OpenPipe’s Genesis:
- After leaving YC, experimented with AI projects; launched OpenPipe shortly after GPT-4’s release in March 2023, with his brother as co-founder
- “We saw GPT-4 was insanely expensive and extremely powerful. But there was an opportunity to distill specific workflows from GPT-4 down to much smaller, much cheaper models.” (03:14)
Product–Market Fit, Value Proposition & Early Growth
(04:01 – 08:38)
- Strong value prop: distilling GPT-4’s expensive capabilities into smaller, affordable open models
- “Anyone who did have production workflows, it was extremely painful. Like they were paying hundreds of thousands of dollars a month to OpenAI.” (04:28)
- Quick traction: first three customers within a month of launch; $1M ARR within eight months
- Market Headwinds:
- As open models improved and token prices dropped, the “cost to value” rationale diminished:
- “There was just this slow march of the frontier model token prices just dropping over and over… which kind of ate away at our value prop.” (05:10)
- Product Experience:
- The SDK acted as a drop-in replacement for OpenAI’s, capturing production requests/responses and enabling an easy, managed “distillation flow” from GPT-4 data to custom models
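As a rough illustration of that drop-in pattern (this is not OpenPipe's actual SDK; the class, method, and file names below are invented), a capture layer can keep the familiar OpenAI call shape while logging every request/response pair for later distillation:

```python
# Illustrative sketch of a "drop-in" capture layer: same call shape as the
# OpenAI client, but every request/response pair is logged so it can later
# become a fine-tuning example for a smaller model. Names are hypothetical.
import json
from openai import OpenAI

class CapturingClient:
    def __init__(self, log_path="captured_calls.jsonl", **client_kwargs):
        self._client = OpenAI(**client_kwargs)
        self._log_path = log_path

    def chat_completion(self, **kwargs):
        response = self._client.chat.completions.create(**kwargs)
        # Store prompt + completion for the distillation dataset.
        with open(self._log_path, "a") as f:
            f.write(json.dumps({
                "request": kwargs,
                "response": response.choices[0].message.content,
            }) + "\n")
        return response

client = CapturingClient()
out = client.chat_completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
)
```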
The Fine-Tuning Landscape: Peaks, Troughs & Inflection Points
(07:46 – 10:51)
- Open Source Model Evolutions:
- Mistral 7B’s release (“a credible open source model”) proved an inflection point
- “At the time that was a pretty big deal that they had this fully open Apache 2 license.” (08:09–08:36)
- LoRA (Low-Rank Adaptation) – Rise, Fall, Resurrection:
- “If you’re predicated on the fact that you’re doing fine-tuning at all, LoRAs have very attractive properties. …You can multiplex basically an arbitrarily large number of LoRAs on the same GPU deployment.” (09:06)
- LoRA’s reputation suffered (viewed as “store brand” fine-tuning), but may be cycling back to favor with evidence (Thinking Machines’ blog post):
- “If you’re doing fine-tuning anyway, LoRAs are still in many cases the way you want to do it. But not that many people were doing fine-tuning.” (09:49)
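To make the multiplexing point concrete, here is a minimal sketch of serving many LoRA adapters on one base-model deployment, assuming vLLM's multi-LoRA support; the model name and adapter paths are placeholders:

```python
# Serving many LoRA adapters on a single base-model deployment (vLLM-style).
# Adapter names and paths below are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True, max_loras=8)
params = SamplingParams(max_tokens=128)

# Each request can target a different customer's adapter; the base weights stay
# resident on the GPU and only the small LoRA matrices are swapped in.
outputs = llm.generate(
    ["Classify this support ticket: ..."],
    params,
    lora_request=LoRARequest("customer_a", 1, "/adapters/customer_a"),
)
print(outputs[0].outputs[0].text)
```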
When Does Fine-Tuning Make Sense?
(11:29 – 13:00)
- Kyle’s Heuristic for Fine-Tuning:
- "When it’s cost, latency, or quality consistency that you really care about." (11:39)
- Most common: latency (need to deploy smaller, faster models, e.g., for real-time voice)
- Caution: “For 90% of use cases where you aren’t forced to a smaller model, it’s still not a good ROI and you probably shouldn’t invest in it today.” (12:44)
- Cost-Benefit Mental Model:
- Upfront effort: “a couple weeks of a competent engineer’s time,” up to months for RL
- Ongoing cost: less flexible stack, slower iterations
- Direct financial cost rarely a main factor: “Each of these runs is between five and a couple hundred dollars.” (14:00)
The Shift to RL — Why Reinforcement Learning "Won"
(14:42 – 18:47)
- Trigger Event: Emergence of OpenAI’s o1 models; realization via leaks (Strawberry, etc.) that RL could significantly improve LLMs
- “It seemed very clear… that RL was going to work in that context. And then the question in our mind was, can we apply this in a different segment… task specific customization?” (15:10–15:40)
- Strategic Bet:
- Went all-in on RL in January 2025, despite estimating just a “25% chance” it was the right move
- “If it turns out that just doing RL on your task is something everyone should be doing… being the first people working on that would be a really, really awesome position.” (16:45)
- First proof of concept: an RL-trained “email agent”; an informed bet, not an “obvious” one
Understanding RL: Tooling, Math, and Accessibility
(18:47 – 20:35)
- Onboarding into RL:
- Kyle pushes back on RL’s reputation for mathematical complexity:
- “I don’t think the math is actually that complicated. …If you just did the naive implementation in Python… it’s actually quite grokkable. …You just have to believe you can do it.” (19:14)
- LLMs help: “I can dump all the context… into GPT-5 and say, ‘Can you write this out in Python for me?’ and that’s super helpful.” (19:26)
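For a sense of how small that "naive implementation" can be, here is a toy REINFORCE-style policy-gradient loop in plain PyTorch; it is a textbook sketch on a 4-armed bandit, not OpenPipe's training code:

```python
# Minimal REINFORCE-style policy gradient on a toy 4-armed bandit.
# A textbook sketch of the "naive implementation" idea, nothing more.
import torch

logits = torch.zeros(4, requires_grad=True)        # policy parameters
optimizer = torch.optim.Adam([logits], lr=0.1)
true_rewards = torch.tensor([0.1, 0.2, 0.9, 0.4])  # arm 2 pays best

for step in range(500):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()
    reward = (true_rewards[action] + 0.05 * torch.randn(())).item()

    # Core policy-gradient update: raise the log-probability of each action
    # in proportion to the reward it earned.
    loss = -torch.log(probs[action]) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass should concentrate on arm 2
```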
RL Techniques in Practice: DPO, PPO, GRPO
(20:35 – 25:46)
- Key Methods:
- Discusses the shift from PPO (Proximal Policy Optimization) to GRPO (Group Relative Policy Optimization)
- Key GRPO insight: comparisons are relative, not global—“That’s actually what unlocks… self-supervised RL.” (21:09) A minimal sketch of the group-relative advantage follows this list.
- Pros and Cons:
- Pros:
- “Operational simplicity… there’s a whole extra value model you need for PPO that you don’t for GRPO.” (22:05)
- Relative scoring aligns with human intuition
- Cons:
- GRPO requires parallel rollouts in reproducible environments—“Getting that set up is the hardest challenge today… especially when we’re training agents on real codebases.” (23:32)
- With PPO, you have the option to train on production traces directly
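A minimal sketch of the group-relative advantage that distinguishes GRPO from PPO (illustrative only; real implementations add clipping, KL penalties, and token-level bookkeeping):

```python
# Core of the group-relative advantage in GRPO: rewards for a group of rollouts
# from the same prompt are normalized against each other, so no separate
# learned value model (as in PPO) is needed. Illustrative sketch only.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar per rollout of a single prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts of the same prompt, scored by some reward function or judge.
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
advantages = group_relative_advantages(rewards)
# Each rollout's token log-probs are then weighted by its advantage in a
# PPO-style clipped objective, but the baseline comes "for free" from the group.
print(advantages)
```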
Building RL Environments: Value, Challenges, and Business Models
(25:46 – 32:21)
- Why are RL environments hard to create?
- Sandboxing real-world systems is tough:
- “You have to build a copy… that reacts to you the exact same way… with the same failure modes. Because if you don’t…, your agent’s gonna have no idea what to do with it.” (24:08–24:36)
- Market for RL environments:
- “There’s like 20 startups apparently… Labs are buying ad hoc—it’s almost like they’re paying the company to build an environment ad hoc for them. It’s a services business at the moment.” (29:03–31:35)
RL Data Pipelines & Simulations
(29:57 – 33:55)
- RL requires an ongoing, tight data-in-the-loop from real rollouts; you can’t just batch up a CSV and train
- The most challenging part is integrating the agent’s “tool calls” with environment responses that closely mimic production (30:57); see the sketch after this list
- Discussion of simulation tools, regulated environments, and generalization in RL market segmentation
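To ground what such an environment interface might look like, here is a hypothetical sketch of a reproducible, tool-call-driven sandbox; all names and tools are invented for illustration:

```python
# Illustrative shape of a reproducible agent environment: the agent emits tool
# calls, the sandbox answers the way production would, and every episode can be
# replayed deterministically from its seed. All names are hypothetical.
import random
from dataclasses import dataclass, field

@dataclass
class TicketEnv:
    seed: int
    tickets: dict = field(default_factory=dict)

    def reset(self) -> str:
        rng = random.Random(self.seed)  # deterministic per seed
        self.tickets = {"T-1": {"status": "open", "priority": rng.choice(["low", "high"])}}
        return "Resolve ticket T-1."

    def step(self, tool_call: dict) -> tuple[str, float, bool]:
        """tool_call: {'name': ..., 'args': {...}} emitted by the agent."""
        if tool_call["name"] == "get_ticket":
            return str(self.tickets[tool_call["args"]["id"]]), 0.0, False
        if tool_call["name"] == "close_ticket":
            self.tickets[tool_call["args"]["id"]]["status"] = "closed"
            return "closed", 1.0, True           # reward only when the task is done
        return "error: unknown tool", -0.1, False  # failure modes should match prod

env = TicketEnv(seed=0)
print(env.reset())
print(env.step({"name": "get_ticket", "args": {"id": "T-1"}}))
```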
Prompt Optimization, GEPA, and RL Interplay
(35:02 – 41:11)
- Prompt Optimization vs. Weight Updates:
- Debate over whether prompt optimization (GEPA, etc.) can compete with or complement RL weight fine-tuning
- Kyle: “If you get better performance… I don’t care if you’re changing my prompt or my weights.” (36:31)
- On GEPA: “It just doesn’t work. Okay, that’s going to be the fighting words. GEPA doesn’t work… on the problems we tried it on.” (37:05)
- Baseline design matters:
- “Maybe our baseline was… top 10 percentile of prompts that people put in these LLMs.” (41:41)
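For context on what is being compared, a naive prompt-search loop (not GEPA's actual algorithm) just mutates a prompt and keeps whatever scores best on an eval set; how strong the starting prompt is clearly shapes any measured lift:

```python
# Naive prompt-search loop (not GEPA's algorithm): propose small rewrites of a
# prompt and keep whichever scores best on a held-out eval set. The quality of
# the starting (baseline) prompt heavily influences the apparent improvement.
import random

def optimize_prompt(base_prompt: str, eval_fn, mutations: list[str], rounds: int = 10) -> str:
    best_prompt, best_score = base_prompt, eval_fn(base_prompt)
    for _ in range(rounds):
        candidate = best_prompt + " " + random.choice(mutations)
        score = eval_fn(candidate)  # e.g., accuracy of an LLM on the eval set
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

# eval_fn would call the model on a fixed eval set and return a scalar metric;
# with a strong hand-tuned baseline prompt, such loops often find little headroom.
```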
Toward Continual (Online) RL Learning and Next-Gen Agent Infrastructure
(41:49 – 54:18)
- Push for Online & Continual Learning:
- “If you’re bringing data from your real evals… I’m much more optimistic that you’re going to get good results.” (43:15)
- Market Size and Open vs. Closed Model Dynamics:
- Debate on percentage of tokens from open source vs. proprietary models by 2026 (44:33–49:29)
- Discussion on economics, cloud subsidies, costs, and where open models may break through
OpenPipe’s RULER: Closing the Reward Loop
(50:20 – 54:18)
- Launch of RULER (July 2025):
- “RULER is a library that we released… which lets you score a group of agent rollouts relatively, using only LLMs as a judge.” (52:06–52:09)
- “It turns out that works phenomenally well with GRPO. …The reward assignment problem is fairly solved.” (52:29)
- “Even with an extremely weak judge model… we were able to get our agent doing state of the art…on the task we tried it on.” (53:16)
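The pattern Kyle describes (scoring a group of rollouts relative to each other with an LLM judge and feeding the result into GRPO) might look roughly like the sketch below; this is not RULER's actual API, just the idea:

```python
# Sketch of the idea behind RULER-style relative scoring: an LLM judge sees a
# whole group of rollouts for the same task and scores them against each other,
# and those group-relative scores become GRPO rewards. Not RULER's real API.
import json
from openai import OpenAI

client = OpenAI()

def judge_group(task: str, rollouts: list[str], judge_model: str = "gpt-4o-mini") -> list[float]:
    prompt = (
        f"Task: {task}\n\n"
        + "\n\n".join(f"Rollout {i}:\n{r}" for i, r in enumerate(rollouts))
        + "\n\nScore each rollout from 0 to 1 relative to the others. "
          "Reply with a JSON list of numbers, e.g. [0.1, 0.9, 0.4]."
    )
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Best-effort parse; a production version would validate and retry.
    return json.loads(reply.choices[0].message.content)

# scores = judge_group("Triage this inbox...", rollouts_from_one_prompt)
# These group-relative scores plug directly into a GRPO-style advantage.
```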
The Acquisition by CoreWeave (via Weights & Biases)
(60:11 – 62:16)
- How it Happened:
- Driven by the Weights & Biases founding team’s interest, after their own CoreWeave acquisition
- “It was a long, pretty painful process. …There were points as late as, you know, like the week before we actually signed where it was like unclear if it was actually going to happen.” (60:23)
- Post-Acquisition:
- Launched “serverless RL” product: “It lets you offload all of the GPU management… we handle all that for you… It makes it way easier.” (62:16)
Notable Quotes & Insights
- On the Fine-Tuning Wave:
- “LoRAs had bad marketing. They were just like, oh, you can’t afford full fine tuning. Here’s the Walmart store brand fine tuning.” — swyx (10:15)
- “For 90% of use cases where you aren’t forced to a smaller model, then it’s still not a good ROI and you probably shouldn’t invest in it today.” — Kyle Corbitt (12:44)
- On RL’s Evolution & Importance:
- “I keep talking about these percentages… if we get to the world where we build [continual online learning], the advantages are huge. They’re clear. Everyone should just deploy their agents that way.” — Kyle Corbitt (63:12)
- On Productizing RL:
- “There’s still not enough people, smart people working in this space. Honestly, we need… there’s still a lot of low hanging fruit.” — Kyle Corbitt (62:54)
- On YC’s Advice:
- “Hold your problem tight and your solution loosely.” — Kyle Corbitt (66:38)
Timestamps of Key Segments
| Topic | Timestamp |
|-------|-----------|
| Kyle’s YC Startup School work | 01:13–01:29 |
| OpenPipe founding inspiration | 03:03–03:14 |
| Early OpenPipe product-market fit | 04:28–05:22 |
| LoRA fine-tuning, multiplexed inference | 09:06–09:36 |
| Fine-tuning ROI heuristic | 12:44–13:00 |
| RL focus: high-risk/high-reward bet | 16:45–18:47 |
| RL environment sandboxing pain | 23:32–25:46 |
| Prompt optimization (GEPA) vs. weights | 36:31–37:05 |
| RULER launch – relative LLM-based rewards | 52:09–53:14 |
| New serverless RL product | 62:16–62:43 |
| Vision: continual RL for every agent | 63:12–65:30 |
Memorable Moments & Takeaways
- Technological Pivots as Market Response:
  OpenPipe's journey was a case study in responding to rapid drops in model pricing, the rise of open models, and shifting value props for AI infra companies.
- RL as an Unlocked Superpower — If the Environment Problem is Solved:
  Thanks to advances like RULER, RL’s effectiveness is now much less about reward design, but fully realizing its potential depends on practical, high-fidelity simulation environments.
- Community Skepticism About “Hot” Research Fads:
  Despite the buzz, prompt optimization frameworks like GEPA fell flat on OpenPipe’s tasks, reinforcing the need for sober, hands-on benchmarks over hype.
- RL’s Business & Product Potential Unlocked via Infra:
  “Serverless RL” and similar abstractions are about making RL feasible for production teams, reducing friction and opening RL capabilities to a much broader developer audience.
- Industry Economics Will Shape the Next Decade:
  The fate of open models, who can fund a “$500B” Stargate-scale compute estate, and token-pricing subsidies are as crucial as any ML breakthrough for who wins in enterprise AI.
Closing Perspective
Kyle’s arc with OpenPipe illustrates both the whiplash-fast nature of the modern AI product landscape and the kind of relentless, direct engagement with technical challenges—like reproducible RL environments and reliable reward pipelines—that unlock real differentiation. The conversation reveals both optimism (“I think that there is today, like, 10 times as much AI inference that could exist than is existing right now… if we can solve [agent reliability].”) and caution (on overhyping prompt optimization, or assuming the environment problem is a quick fix).
Above all, the episode is a window into how today’s AI engineers aren’t just plugging papers into products, but are running real-time experiments in a turbulent, capital-drenched, and opportunity-soaked phase of the industry—a phase where who builds what kind of infra, in what market, and how quickly, is still very much up for grabs.
