Latent Space: The AI Engineer Podcast
Episode: [State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI
Date: December 31, 2025
Host: Latent.Space
Guest: Josh McGrath (Post-Training Researcher, OpenAI)
Episode Overview
In this forward-looking, deeply technical episode, Latent.Space sits down with Josh McGrath from OpenAI to dissect the rapidly evolving post-training landscape in large language models (LLMs) from GPT-4.1 through 5.1. The conversation covers reinforcement learning techniques (RLHF, RLVR, GRPO), token and agent efficiency, model personalities, context scaling, and the interplay between pre-training and post-training in 2025, blending candid insights from cutting-edge research with everyday engineering realities and practical reflections on how model improvements are affecting engineers and users.
Key Discussion Points & Insights
1. From Pre-Training to Post-Training
- Motivation for Post-Training Focus:
- Josh explains his shift from pre-training data curation to post-training, attracted by the prospect of "chang[ing] the behavior by 40%" instead of incrementally optimizing compute by a few percent.
- "It just seemed more exciting to go to post-training and, many late nights later, that's definitely true." (01:13)
- Difference in Engineering:
- Reinforcement learning (RL) introduces significant complexity compared to pre-training ("the number of moving parts in an RL run is just a lot higher").
- Faster context switching and deeper code comprehension are needed during RL runs, especially when integrating unfamiliar code or troubleshooting shared workflows.
- "Codex can do more work than I could do in a few hours in like 15 minutes. But then like what do I do during those 15 minutes after?" (03:36)
2. Model Interactivity, Personality, and User Control
- Recent Model Releases ("Shopping Model"):
- Discussed the new "shopping" model's chain-of-thought transparency and real-time user interrupts, paralleling innovations from Codex.
- "It shows you its chain of thought with like what products it's looking at and you can write it new messages..." (05:08)
- Why Separate Models?
- Testing new paradigms sometimes benefits from siloed models; over time, capabilities will likely consolidate.
- Model Personality and Toggles:
- OpenAI now allows users to select model personalities, from tool-like "Anton" (serious, focused) to cheery "Clippy".
- Josh favors the more utilitarian Anton: "I personally want my model to like be a tool and so like I don't necessarily want the warmth..." (07:43)
3. Post-Training Techniques: RLHF, RLVR, Optimization
- Evolving Methods:
- Progression from RLHF/PPO to RLVR and more agent-specific RL.
- The real differentiator among these policy-gradient techniques is not the algorithm but the data: how clean and reliable the optimization or reward signal is.
- "RLHF, RLVR, they're both policy gradient methods, but what's different is just like the input data." (09:01)
- Human feedback versus verifiable task rewards (e.g., math): "When you find the answer to a math problem, it's a lot less debatable than, like, oh, well, is this thing that the human preferred actually what we want to do?" (12:41)
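The point above can be sketched in a few lines: RLHF and RLVR share the same policy-gradient machinery, and only the reward source differs. A minimal REINFORCE-style sketch (function names and numbers are illustrative assumptions, not OpenAI code):

```python
def policy_gradient_loss(log_probs, reward, baseline=0.0):
    # REINFORCE-style objective: scale the sequence log-prob by the
    # (baseline-subtracted) reward; minimizing this ascends the reward.
    advantage = reward - baseline
    return -advantage * sum(log_probs)

# RLHF: the reward is a learned preference model's scalar score --
# only as trustworthy as the human labels behind it.
def rlhf_reward(reward_model, prompt, completion):
    return reward_model(prompt, completion)

# RLVR: the reward is a programmatic check against a verifiable
# answer (e.g., a math problem) -- a much cleaner signal.
def rlvr_reward(completion, gold_answer):
    return 1.0 if completion.strip() == gold_answer else 0.0

# Same optimizer, different reward source:
log_probs = [-0.2, -1.1, -0.4]  # token log-probs of one sampled completion
loss_verifiable = policy_gradient_loss(log_probs, rlvr_reward("42", "42"))
loss_preference = policy_gradient_loss(log_probs, 0.7)  # stand-in RM score
```

Both losses flow through the identical update; only the trustworthiness of the scalar reward changes.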
- Optimization vs Data-Centricity in Research:
- Academic publication is skewed toward optimization narratives rather than innovation in data collection and reward signals ("what really matters is how narrativizable it is"). (11:11–11:28)
4. Efficiency: Agents, Tokens, and Context
- Token Efficiency as a Metric:
- OpenAI now prioritizes not just raw performance, but also how many tokens it takes to achieve results—token efficiency improvements are a major focus moving from 5.0 to 5.1.
- "If you look at a 2D plot of how many tokens it takes for us to get that, it went way down." (13:58)
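The "2D plot" framing treats capability as a point (accuracy, tokens spent) rather than a single score. A toy sketch of the metric with made-up numbers (not benchmark data):

```python
def token_efficiency(results):
    # results: list of (solved: bool, tokens_used: int), one per task.
    solved = sum(1 for ok, _ in results if ok)
    tokens = sum(t for _, t in results)
    accuracy = solved / len(results)
    tokens_per_solve = tokens / solved if solved else float("inf")
    return accuracy, tokens_per_solve

# Hypothetical runs: same accuracy, far fewer tokens in the newer model.
model_a = [(True, 9000), (True, 11000), (False, 15000), (True, 10000)]
model_b = [(True, 3000), (True, 4000), (False, 5000), (True, 3500)]

acc_a, tps_a = token_efficiency(model_a)
acc_b, tps_b = token_efficiency(model_b)
# Equal accuracy, but tokens per solved task "went way down" for model_b.
```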
- Routers and Implicit Routing:
- Discussed explicit and implicit routing in GPT-5, and how in the long run the goal is unified, abstracted models that obviate user-facing “thinking” knobs.
- "Eventually, you know, we’ll have AGI and like you’re not going to have to worry too much about how hard to think directly..." (15:22)
- Context Compaction & Memory Management:
- Trend toward automatic context/memory compaction, shifting from developer-written harness code to behavior the model handles internally.
- "Feels like I used to do that as part of my harness and now...the model's doing it for me and I don't know how to think about that." (16:03)
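The compaction the model now does internally used to be a few lines in a developer's harness. A minimal sketch of that harness-side version, with `estimate_tokens` and `summarize` as crude stand-ins for a real tokenizer and a real summarization model call:

```python
def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: ~4 characters per token.
    return len(text) // 4

def summarize(messages):
    # Stand-in for an LLM summarization call in a real harness.
    return "[summary of %d earlier messages]" % len(messages)

def compact(history, budget, keep_recent=2):
    # If the conversation exceeds the token budget, fold everything
    # except the most recent turns into a single summary message.
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = ["user: long question " * 50, "assistant: long answer " * 50,
           "user: follow-up", "assistant: short reply"]
compacted = compact(history, budget=100)
```

The recent turns survive verbatim; everything older collapses into one summary slot, keeping the window available for new work.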
- Long Context Windows & Utilization:
- Debated the utility of ever-larger context windows (10M, 100M tokens and beyond).
- Real-world use of enormous windows is mixed; simple search/IR methods (e.g., grep) are still highly effective.
- "The agents with grep are like, they feel really similar to me where it's like just unreasonably effective." (19:39)
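The grep-style approach is plain lexical search: instead of stuffing a whole repository into a huge context window, the agent pulls in only matching lines. A minimal pure-Python sketch of such a tool (an in-memory stand-in, not a real `grep` subprocess):

```python
import re

def grep(pattern, files):
    # files: mapping of path -> file contents. Returns matching lines
    # with path:line locations, the way an agent tool would report them.
    hits = []
    rx = re.compile(pattern)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path}:{lineno}: {line}")
    return hits

repo = {
    "train.py": "def rlvr_step():\n    pass\n",
    "eval.py": "from train import rlvr_step\nrlvr_step()\n",
}
matches = grep(r"rlvr_step", repo)
# Only the matching lines enter the context window, not whole files.
```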
5. AI Systems and Human-AI Co-Design
- Co-Design Culture at OpenAI:
- OpenAI's post-training work is characterized by deep integration between systems/engineering and machine learning—engineers must do both for frontier work.
- "I think it's a great culture to have a place where people just move seamlessly between the two." (20:57)
- Hiring Challenges:
- Critical shortage of engineers who are equally comfortable with distributed systems and deep ML research.
- "We should probably be producing more students that are great at doing both, you know, distributed systems and...the statistics and other things that are required to be a good machine learning researcher." (22:02)
6. Meta & Future Trends
- Pre-Training Is Not "Dead":
- The meme that pre-training is dead is overblown; both pre-training and post-training are scaling up and consuming more resources.
- Analogy to industrial revolutions: true, disruptive change takes time to reveal itself, and best practices are only obvious in hindsight.
- "There's this almost like fog of war...I think it really gives me no confidence in being like, oh, this thing is dead." (25:58)
- Cycles of Hype and Innovation:
- Expect cyclical enthusiasm ("it's so over, we're so back") as the community toggles between optimism/pessimism and different paradigms.
Notable Quotes & Memorable Moments
- Josh on the transition to post-training:
"Do I want to make compute efficiency wins of like 3% or do I want to change the behavior by 40%?" (01:14)
- On Code Understanding with AI Tools:
"Codex can do more work than I could do in a few hours in like 15 minutes. But then what do I do during those 15 minutes after?" (03:36)
- On Model Personality:
"I just want some answers because I’m, you know, mostly using it at work." (07:43)
- On RLHF vs RLVR & the nature of reward signals:
"If the, if like your value of truth is like does the user like this more, like there's, there's something strained that I think we haven't like looked at that axis of. Okay, well how like sort of clean is this signal? How much do I trust it?" (09:47)
- On context window scaling:
"There'll always be some dance of like...should also have strategies for keeping that context window available for as long as possible." (16:36)
- On the future of interfaces:
"If we lock the interface, if we discover something new about models, we might sort of trap that improvement under an interface that needs to change." (17:09)
- On hiring and the required skill set:
"We should probably be producing more students that are great at doing both, you know, distributed systems and...the statistics and other things that are required to be a good machine learning researcher." (22:02)
- On the ongoing evolution of pre- and post-training:
"There's this almost like fog of war where I'm like, oh, did people think that, like, we got like the steam engine...and they would have, you know, the factories? I don't know if you know, but like the factories, they used to be like very linear...And it took...a couple of decades before they realized, wait, if we have electricity, we can move the little, like, stations in whatever way is most ergonomic. And then, you know, manufacturing was transformed..." (25:58)
Timestamps for Key Segments
- Josh’s Background, Pre-training → Post-training — 00:15–01:40
- Engineering in RL & Model Usage (Codex) — 01:40–04:30
- Model Interactivity ("Shopping Model"), Personalities & Toggles — 04:36–08:25
- Post-Training Methods, RLHF, RLVR — 08:25–12:41
- Long Horizon Tasks & Token Efficiency — 13:14–15:54
- Routers & Thinking Abstractions — 14:46–15:54
- Context Compaction, Long Context Windows — 16:02–20:47
- Systems vs Models, Co-Design Culture — 20:47–21:24
- Hiring & Skill Gaps in AI Engineering — 21:24–23:38
- Pre-training vs Post-training, Industrial Revolution Analogy — 24:47–26:38
- Cycles of Innovation & Hype — 26:38–27:10
Shout-Outs
- OpenAI Shopping Team: Andrew Hoyel, Manukastrada, John Holman, Isa Fulford, and the original Deep Research team.
Final Note
This episode is a must-listen for AI engineers interested in the rapidly evolving field of post-training, system-model interplay, and the next frontiers of efficient, dynamic, and personalized language models. The conversation balances deep technical dives with candid, real-world observations and offers rare insight into the day-to-day realities at the cutting edge of OpenAI.