Latent Space: The AI Engineer Podcast
Episode: Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Date: October 16, 2025
Guests:
- Kyle Corbitt (Co-founder & CEO of OpenPipe, acquired by CoreWeave)
- Alessio (Host, Founder of Kernel Labs)
- swyx (Host, Editor of Latent.Space)
Overview
This episode brings on Kyle Corbitt, co-founder and CEO of OpenPipe, recently acquired by CoreWeave. Alessio and swyx dig into Kyle’s journey from YC’s Startup School to OpenPipe’s founding, rapid scaling, pivots, and eventual acquisition. The discussion centers on the evolution of model fine-tuning, why Reinforcement Learning (RL) has become central to AI infrastructure, technical trade-offs in fine-tuning strategies, the realities of running an RL-first business, and what the future holds for continual learning, agent environments, and the economics driving the foundation-model ecosystem.
Kyle’s Journey: From YC to OpenPipe
(00:17 – 04:01)
Product–Market Fit, Value Proposition & Early Growth
(04:01 – 08:38)
- Strong value prop: distilling GPT-4’s expensive capabilities into smaller, affordable open models
- “Anyone who did have production workflows, it was extremely painful. Like they were paying hundreds of thousands of dollars a month to OpenAI.” (04:28)
- Quick traction: first three customers within a month of launch; $1M ARR within eight months
- Market Headwinds:
  - As open models improved and token prices dropped, the “cost to value” rationale diminished:
  - “There was just this slow march of the frontier model token prices just dropping over and over… which kind of ate away at our value prop.” (05:10)
- Product Experience:
  - SDK acted as a drop-in for OpenAI’s, capturing production requests/responses, enabling an easy, managed “distillation flow” from GPT-4 data to custom models (a sketch of the capture pattern follows this list)
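To make the drop-in capture pattern concrete, here is a minimal sketch, assuming a hypothetical local JSONL log and a hypothetical `logged_chat_completion` wrapper: it keeps the normal OpenAI call shape while recording each request/response pair that could later seed a distillation fine-tune. It is illustrative only, not OpenPipe’s actual SDK.

```python
# Minimal sketch of a "drop-in" capture layer: call OpenAI as usual, but log
# every request/response pair to a local JSONL file that could later seed a
# distillation fine-tune. Illustrative only; not OpenPipe's actual SDK.
# LOG_PATH and logged_chat_completion are hypothetical names.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LOG_PATH = "captured_completions.jsonl"

def logged_chat_completion(**kwargs):
    """Forward the call to OpenAI, then persist the (request, response) pair."""
    response = client.chat.completions.create(**kwargs)
    record = {
        "request": kwargs,
        "response": response.choices[0].message.content,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return response

# Usage keeps the familiar OpenAI call shape.
reply = logged_chat_completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Classify this support ticket."}],
)
print(reply.choices[0].message.content)
```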
The Fine-Tuning Landscape: Peaks, Troughs & Inflection Points
(07:46 – 10:51)
- Open Source Model Evolutions:
  - Mistral 7B’s release (“a credible open source model”) proved an inflection point
  - “At the time that was a pretty big deal that they had this fully open Apache 2 license.” (08:09–08:36)
- LoRA (Low-Rank Adapters) – Rise, Fall, Resurrection (a multiplexed-serving sketch follows this list):
  - “If you’re predicated on the fact that you’re doing fine-tuning at all, LoRAs have very attractive properties. …You can multiplex basically an arbitrarily large number of LoRAs on the same GPU deployment.” (09:06)
  - LoRA’s reputation suffered (viewed as “store brand” fine-tuning), but may be cycling back to favor with evidence (Thinking Machines’ blog post):
    “If you’re doing fine-tuning anyway, LoRAs are still in many cases the way you want to do it. But not that many people were doing fine-tuning.” (09:49)
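As a rough illustration of the multiplexing point, here is a minimal sketch using vLLM’s multi-LoRA support; the adapter names and local paths are hypothetical placeholders, and this is not OpenPipe’s serving stack. Each request selects a different adapter while all of them share the same GPU-resident base weights.

```python
# Sketch of multiplexing several LoRA adapters over a single base-model
# deployment with vLLM's multi-LoRA support. The adapter names and local
# paths are hypothetical placeholders; the point is that each request can
# select a different adapter while sharing the same GPU-resident base weights.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_lora=True)
params = SamplingParams(max_tokens=128)

# One adapter per customer/task, all served from the same deployment.
adapters = {
    "support-bot": LoRARequest("support-bot", 1, "/adapters/support-bot"),
    "email-agent": LoRARequest("email-agent", 2, "/adapters/email-agent"),
}

outputs = llm.generate(
    ["Draft a reply to this support ticket: ..."],
    params,
    lora_request=adapters["support-bot"],  # swap adapters per request
)
print(outputs[0].outputs[0].text)
```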
When Does Fine-Tuning Make Sense?
(11:29 – 13:00)
- Kyle’s Heuristic for Fine-Tuning:
- "When it’s cost, latency, or quality consistency that you really care about." (11:39)
- Most common: latency (need to deploy smaller, faster models, e.g., for real-time voice)
- Caution: “For 90% of use cases where you aren’t forced to a smaller model, it’s still not a good ROI and you probably shouldn’t invest in it today.” (12:44)
- Cost-Benefit Mental Model:
- Upfront effort: “a couple weeks of a competent engineer’s time,” up to months for RL
- Ongoing cost: less flexible stack, slower iterations
- Direct financial cost rarely a main factor: “Each of these runs is between five and a couple hundred dollars.” (14:00)
The Shift to RL — Why Reinforcement Learning "Won"
(14:42 – 18:47)
- Trigger Event: emergence of OpenAI’s o1 models; realization via leaks (Strawberry, etc.) that RL could significantly improve LLMs
  - “It seemed very clear… that RL was going to work in that context. And then the question in our mind was, can we apply this in a different segment… task specific customization?” (15:10–15:40)
- Strategic Bet:
  - Went all-in on RL in January 2025, despite estimating just a “25% chance” it was the right move
  - “If it turns out that just doing RL on your task is something everyone should be doing… being the first people working on that would be a really, really awesome position.” (16:45)
- First proof of concept: an RL-trained “email agent”; an informed bet, not an “obvious” one
Understanding RL: Tooling, Math, and Accessibility
(18:47 – 20:35)
- Onboarding into RL:
- Kyle pushes back on RL’s reputation for mathematical complexity (a naive sketch of the core update follows this list):
- “I don’t think the math is actually that complicated. …If you just did the naive implementation in Python… it’s actually quite grokkable. …You just have to believe you can do it.” (19:14)
- LLMs help: “I can dump all the context… into GPT-5 and say, ‘Can you write this out in Python for me?’ and that’s super helpful.” (19:26)
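As a concreteness check on the “grokkable” claim, here is a naive sketch of the core policy-gradient update in plain PyTorch: weight each sampled completion’s log-probability by its reward and backpropagate. The tensors are toy stand-ins for a real model’s outputs; this is not OpenPipe’s trainer.

```python
# Naive REINFORCE-style policy-gradient loss: push up the log-probability of
# completions that earned high reward. Toy tensors stand in for a real
# model's outputs; illustrative, not a production RL trainer.
import torch

def naive_policy_gradient_loss(token_logprobs, rewards):
    """token_logprobs: (batch, seq) log-probs of the sampled tokens.
    rewards: (batch,) scalar reward per sampled completion."""
    completion_logprob = token_logprobs.sum(dim=1)  # log-prob of each full completion
    return -(rewards * completion_logprob).mean()   # maximize reward-weighted log-prob

# Toy example: 4 sampled completions of 8 tokens each with made-up rewards.
token_logprobs = torch.randn(4, 8, requires_grad=True)
rewards = torch.tensor([1.0, 0.2, 0.7, 0.0])
loss = naive_policy_gradient_loss(token_logprobs, rewards)
loss.backward()  # gradients flow back toward the policy parameters
print(float(loss))
```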
RL Techniques in Practice: DPO, PPO, GRPO
(20:35 – 25:46)
- Key Methods:
  - Discusses the shift from PPO (Proximal Policy Optimization) to GRPO (Group Relative Policy Optimization)
  - Key GRPO insight: comparisons are relative within a group of rollouts, not global—“That’s actually what unlocks some mono… self supervised RL.” (21:09)
- Pros and Cons (a sketch of the group-relative scoring step follows this list):
  - Pros:
    - “Operational simplicity… there’s a whole extra value model you need for PPO that you don’t for GRPO.” (22:05)
    - Relative scoring aligns with human intuition
  - Cons:
    - GRPO requires parallel rollouts in reproducible environments—“Getting that set up is the hardest challenge today… especially when we’re training agents on real codebases.” (23:32)
    - With PPO, you have the option to train on production traces directly
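To make the PPO-versus-GRPO contrast concrete, here is a rough sketch of the group-relative scoring step: instead of a learned value model, each rollout’s advantage is its reward measured against the other rollouts of the same prompt. Toy numbers only; not an exact reproduction of any particular implementation.

```python
# Group-relative scoring, the step that lets GRPO drop PPO's learned value
# model: each rollout's advantage is its reward compared with the other
# rollouts of the same prompt. Toy numbers; illustrative only.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group,) one scalar reward per rollout of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Four rollouts of the same prompt, scored by a reward function or judge.
rewards = torch.tensor([0.9, 0.4, 0.7, 0.1])
print(group_relative_advantages(rewards))
# Rollouts above the group average get positive advantages and are reinforced;
# those below get negative advantages, with no separate value network needed.
```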
Building RL Environments: Value, Challenges, and Business Models
(25:46 – 32:21)
RL Data Pipelines & Simulations
(29:57 – 33:55)
- RL requires an ongoing, tight data-in-the-loop from real rollouts; you can’t just batch up a CSV and train
- The most challenging part is integrating the agent’s “tool calls” with environment responses that closely mimic production (30:57); a rollout-loop sketch follows this list
- Discussion of simulation tools, regulated environments, and generalization in RL market segmentation
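Here is a hedged sketch of the kind of rollout loop described above, where the agent’s tool calls are answered by an environment that imitates production. Everything in it (`SimulatedCRM`, the tool names, `run_rollout`, the toy policy) is hypothetical scaffolding for illustration, not OpenPipe’s pipeline.

```python
# Hypothetical rollout loop: the agent alternates between issuing tool calls
# and receiving observations from an environment that imitates production.
# SimulatedCRM, the tool names, and run_rollout are illustrative scaffolding.
import json

class SimulatedCRM:
    """Stands in for a production system the agent would normally call."""
    def __init__(self):
        self.tickets = {"T-1": {"status": "open", "subject": "Refund request"}}

    def call_tool(self, name, args):
        if name == "get_ticket":
            return self.tickets.get(args["ticket_id"], {"error": "not found"})
        if name == "close_ticket":
            self.tickets[args["ticket_id"]]["status"] = "closed"
            return {"ok": True}
        return {"error": f"unknown tool {name}"}

def run_rollout(policy, env, max_steps=10):
    """policy(transcript) returns either a tool-call dict or a final answer string."""
    transcript = []
    for _ in range(max_steps):
        action = policy(transcript)
        if isinstance(action, str):  # a plain string ends the rollout
            transcript.append({"assistant": action})
            break
        observation = env.call_tool(action["tool"], action["args"])
        transcript.append({"tool_call": action, "observation": observation})
    return transcript  # later scored by a reward function or LLM judge

def toy_policy(transcript):
    """Trivial stand-in policy: look up the ticket, close it, then answer."""
    if not transcript:
        return {"tool": "get_ticket", "args": {"ticket_id": "T-1"}}
    if len(transcript) == 1:
        return {"tool": "close_ticket", "args": {"ticket_id": "T-1"}}
    return "Ticket T-1 has been resolved and closed."

print(json.dumps(run_rollout(toy_policy, SimulatedCRM()), indent=2))
```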
Prompt Optimization, GEPA, and RL Interplay
(35:02 – 41:11)
Toward Continual (Online) RL Learning and Next-Gen Agent Infrastructure
(41:49 – 54:18)
OpenPipe’s RULER: Closing the Reward Loop
(50:20 – 54:18)
- Launch of RULER (July 2025); an illustrative sketch of the relative-judging idea follows this list:
- “RULER is a library that we released… which lets you score a group of agent rollouts relatively, using only LLMs as a judge.” (52:06–52:09)
- “It turns out that works phenomenally well with GRPO. …The reward assignment problem is fairly solved.” (52:29)
- “Even with an extremely weak judge model… we were able to get our agent doing state of the art…on the task we tried it on.” (53:16)
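As an illustration of the idea in the quotes above, here is a sketch of group-relative LLM judging: ask a judge model to score a set of rollouts against each other and feed the resulting relative scores into GRPO-style training. This is not RULER’s actual API; the judge prompt, model name, and JSON parsing are assumptions.

```python
# Sketch of group-relative LLM judging in the spirit of RULER: ask a judge
# model to score a set of rollouts against each other, then use the relative
# scores as rewards for GRPO-style training. Not RULER's actual API; the
# prompt, judge model, and JSON parsing here are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def judge_rollouts_relatively(task: str, rollouts: list[str]) -> list[float]:
    """Return one score in [0, 1] per rollout, judged relative to the others."""
    numbered = "\n\n".join(f"Rollout {i}:\n{r}" for i, r in enumerate(rollouts))
    prompt = (
        f"Task: {task}\n\n{numbered}\n\n"
        "Compare these rollouts against each other and return only a JSON list "
        "of scores between 0 and 1, one per rollout, higher meaning better."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # even a relatively weak judge is said to work well
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the judge returns a bare JSON list; real code would validate.
    return json.loads(response.choices[0].message.content)

scores = judge_rollouts_relatively(
    "Triage this support email and draft a reply.",
    ["...rollout A transcript...", "...rollout B transcript..."],
)
print(scores)  # relative scores feed directly into group-relative advantages
```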
The Acquisition by CoreWeave (via Weights & Biases)
(60:11 – 62:16)
- How it Happened:
  - Driven by the Weights & Biases founding team’s interest, after their own acquisition by CoreWeave
  - “It was a long, pretty painful process. …There were points as late as, you know, like the week before we actually signed where it was like unclear if it was actually going to happen.” (60:23)
- Post-Acquisition:
  - Launched a “serverless RL” product: “It lets you offload all of the GPU management… we handle all that for you… It makes it way easier.” (62:16)
Notable Quotes & Insights
Timestamps of Key Segments
| Timestamp | Topic |
|-----------|-------|
| 01:13–01:29 | Kyle’s YC Startup School work |
| 03:03–03:14 | OpenPipe founding inspiration |
| 04:28–05:22 | Early OpenPipe product-market fit |
| 09:06–09:36 | LoRA fine-tuning, multiplexed inference |
| 12:44–13:00 | Fine-tuning ROI heuristic |
| 16:45–18:47 | RL focus: high-risk/high-reward bet |
| 23:32–25:46 | RL environment sandboxing pain |
| 36:31–37:05 | Prompt optimization (GEPA) vs. weights |
| 52:09–53:14 | RULER launch – relative LLM-based rewards |
| 62:16–62:43 | New serverless RL product |
| 63:12–65:30 | Vision: continual RL for every agent |
Memorable Moments & Takeaways
- Technological Pivots as Market Response:
  OpenPipe’s journey was a case study in responding to rapid drops in model pricing, the rise of open models, and shifting value props for AI infra companies.
- RL as an Unlocked Superpower — If the Environment Problem Is Solved:
  Thanks to advances like RULER, RL’s effectiveness now hinges far less on reward design; fully realizing its potential depends on practical, high-fidelity simulation environments.
- Community Skepticism About “Hot” Research Fads:
  Despite buzz, prompt optimization frameworks like GEPA fell flat for OpenPipe’s tasks, reinforcing the need for sober, hands-on benchmarks over hype.
- RL’s Business & Product Potential Unlocked via Infra:
  “Serverless RL” and similar abstractions make RL feasible for production teams, reducing friction and opening RL capabilities to a much broader developer audience.
- Industry Economics Will Shape the Next Decade:
  The fate of open models, who can fund a “$500B” Stargate-scale compute buildout, and token-pricing subsidies matter as much as any ML breakthrough for who wins enterprise AI.
Closing Perspective
Kyle’s arc with OpenPipe illustrates both the whiplash-fast nature of the modern AI product landscape and the kind of relentless, direct engagement with technical challenges—like reproducible RL environments and reliable reward pipelines—that unlock real differentiation. The conversation reveals both optimism (“I think that there is today, like, 10 times as much AI inference that could exist than is existing right now… if we can solve [agent reliability].”) and caution (on overhyping prompt optimization, or assuming the environment problem is a quick fix).
Above all, the episode is a window into how today’s AI engineers aren’t just plugging papers into products, but are running real-time experiments in a turbulent, capital-drenched, and opportunity-soaked phase of the industry—a phase where who builds what kind of infra, in what market, and how quickly, is still very much up for grabs.