Latent Space: The AI Engineer Podcast
Episode: [State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI
Date: December 31, 2025
Host: Latent.Space
Guest: Josh McGrath (Post-Training Researcher, OpenAI)
Episode Overview
In this forward-looking, deeply technical episode, Latent.Space sits down with Josh McGrath from OpenAI to dissect the rapidly evolving post-training landscape in large language models (LLMs) from GPT-4.1 through 5.1. The conversation covers reinforcement learning techniques (RLHF, RLVR, GRPO), token and agent efficiency, model personalities, context scaling, and the interplay between pre-training and post-training in 2025, blending candid insights from cutting-edge research with everyday engineering realities and practical reflections on how model improvements are affecting engineers and users.
Key Discussion Points & Insights
1. From Pre-Training to Post-Training
- Motivation for Post-Training Focus:
- Josh explains his shift from pre-training data curation to post-training, attracted by the prospect of "chang[ing] the behavior by 40%" instead of incrementally optimizing compute by a few percent.
- "It just seemed more exciting to go to post-training and, many late nights later, that's definitely true." (01:13)
- Difference in Engineering:
- Reinforcement learning (RL) introduces significant complexity compared to pre-training ("the number of moving parts in an RL run is just a lot higher").
- Faster context switching and deeper code comprehension are needed during RL runs, especially when integrating unfamiliar code or troubleshooting shared workflows.
- "Codex can do more work than I could do in a few hours in like 15 minutes. But then like what do I do during those 15 minutes after?" (03:36)
2. Model Interactivity, Personality, and User Control
- Recent Model Releases ("Shopping Model"):
- Discussed the new "shopping" model's chain-of-thought transparency and real-time user interrupts, paralleling innovations from Codex.
- "It shows you its chain of thought with like what products it's looking at and you can write it new messages..." (05:08)
- Why Separate Models?
- Testing new paradigms sometimes benefits from siloed models; over time, capabilities will likely consolidate.
- Model Personality and Toggles:
- OpenAI now allows users to select model personalities, from tool-like "Anton" (serious, focused) to cheery "Clippy".
- Josh favors the more utilitarian Anton: "I personally want my model to like be a tool and so like I don't necessarily want the warmth..." (07:43)
3. Post-Training Techniques: RLHF, RLVR, Optimization
- Evolving Methods:
- Progression from RLHF/PPO to RLVR and more agent-specific RL.
- The real differentiator among these policy-gradient techniques is not the algorithm but the data: how clean and reliable the optimization or reward signal is.
- "RLHF, RLVR, they're both policy gradient methods, but what's different is just like the input data." (09:01)
- Human feedback versus verifiable task rewards (e.g., math): "When you find the answer to a math problem, it's a lot less debatable than, like, oh, well, is this thing that the human preferred actually what we want to do?" (12:41)
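The point above can be sketched in a few lines: RLHF and RLVR share the same policy-gradient machinery, and only the reward source differs. A minimal REINFORCE-style sketch (function names and numbers are illustrative assumptions, not OpenAI code):

```python
def policy_gradient_loss(log_probs, reward, baseline=0.0):
    # REINFORCE-style objective: scale the sequence log-prob by the
    # (baseline-subtracted) reward; minimizing this ascends the reward.
    advantage = reward - baseline
    return -advantage * sum(log_probs)

# RLHF: the reward is a learned preference model's scalar score --
# only as trustworthy as the human labels behind it.
def rlhf_reward(reward_model, prompt, completion):
    return reward_model(prompt, completion)

# RLVR: the reward is a programmatic check against a verifiable
# answer (e.g., a math problem) -- a much cleaner signal.
def rlvr_reward(completion, gold_answer):
    return 1.0 if completion.strip() == gold_answer else 0.0

# Same optimizer, different reward source:
log_probs = [-0.2, -1.1, -0.4]  # token log-probs of one sampled completion
loss_verifiable = policy_gradient_loss(log_probs, rlvr_reward("42", "42"))
loss_preference = policy_gradient_loss(log_probs, 0.7)  # stand-in RM score
```

Both losses flow through the identical update; only the trustworthiness of the scalar reward changes.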
- Optimization vs Data-Centricity in Research:
- Academic publication is skewed toward optimization narratives rather than innovation in data collection and reward signals ("what really matters is how narrativizable it is"). (11:11–11:28)
4. Efficiency: Agents, Tokens, and Context
- Token Efficiency as a Metric:
- OpenAI now prioritizes not just raw performance, but also how many tokens it takes to achieve results—token efficiency improvements are a major focus moving from 5.0 to 5.1.
- "If you look at a 2D plot of how many tokens it takes for us to get that, it went way down." (13:58)
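The "2D plot" framing treats capability as a point (accuracy, tokens spent) rather than a single score. A toy sketch of the metric with made-up numbers (not benchmark data):

```python
def token_efficiency(results):
    # results: list of (solved: bool, tokens_used: int), one per task.
    solved = sum(1 for ok, _ in results if ok)
    tokens = sum(t for _, t in results)
    accuracy = solved / len(results)
    tokens_per_solve = tokens / solved if solved else float("inf")
    return accuracy, tokens_per_solve

# Hypothetical runs: same accuracy, far fewer tokens in the newer model.
model_a = [(True, 9000), (True, 11000), (False, 15000), (True, 10000)]
model_b = [(True, 3000), (True, 4000), (False, 5000), (True, 3500)]

acc_a, tps_a = token_efficiency(model_a)
acc_b, tps_b = token_efficiency(model_b)
# Equal accuracy, but tokens per solved task "went way down" for model_b.
```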
- Routers and Implicit Routing:
- Discussed explicit and implicit routing in GPT-5, and how in the long run the goal is unified, abstracted models that obviate user-facing “thinking” knobs.
- "Eventually, you know, we’ll have AGI and like you’re not going to have to worry too much about how hard to think directly..." (15:22)
- Context Compaction & Memory Management:
- Trend toward automatic context/memory compaction, shifting from developer-written harness code to behavior the model handles internally.
- "Feels like I used to do that as part of my harness and now...the model's doing it for me and I don't know how to think about that." (16:03)
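The compaction the model now does internally used to be a few lines in a developer's harness. A minimal sketch of that harness-side version, with `estimate_tokens` and `summarize` as crude stand-ins for a real tokenizer and a real summarization model call:

```python
def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: ~4 characters per token.
    return len(text) // 4

def summarize(messages):
    # Stand-in for an LLM summarization call in a real harness.
    return "[summary of %d earlier messages]" % len(messages)

def compact(history, budget, keep_recent=2):
    # If the conversation exceeds the token budget, fold everything
    # except the most recent turns into a single summary message.
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = ["user: long question " * 50, "assistant: long answer " * 50,
           "user: follow-up", "assistant: short reply"]
compacted = compact(history, budget=100)
```

The recent turns survive verbatim; everything older collapses into one summary slot, keeping the window available for new work.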
- Long Context Windows & Utilization:
- Debated the utility of ever-larger context windows (10M, 100M tokens and beyond).
- Real-world use of enormous windows is mixed; simple search/IR methods (e.g., grep) are still highly effective.
- "The agents with grep are like, they feel really similar to me where it's like just unreasonably effective." (19:39)
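The grep-style approach is plain lexical search: instead of stuffing a whole repository into a huge context window, the agent pulls in only matching lines. A minimal pure-Python sketch of such a tool (an in-memory stand-in, not a real `grep` subprocess):

```python
import re

def grep(pattern, files):
    # files: mapping of path -> file contents. Returns matching lines
    # with path:line locations, the way an agent tool would report them.
    hits = []
    rx = re.compile(pattern)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path}:{lineno}: {line}")
    return hits

repo = {
    "train.py": "def rlvr_step():\n    pass\n",
    "eval.py": "from train import rlvr_step\nrlvr_step()\n",
}
matches = grep(r"rlvr_step", repo)
# Only the matching lines enter the context window, not whole files.
```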
5. AI Systems and Human-AI Co-Design
- Co-Design Culture at OpenAI:
- OpenAI's post-training work is characterized by deep integration between systems/engineering and machine learning—engineers must do both for frontier work.
- "I think it's a great culture to have a place where people just move seamlessly between the two." (20:57)
- Hiring Challenges:
- Critical shortage of engineers who are equally comfortable with distributed systems and deep ML research.
- "We should probably be producing more students that are great at doing both, you know, distributed systems and...the statistics and other things that are required to be a good machine learning researcher." (22:02)
6. Meta & Future Trends
- Pre-Training Is Not "Dead":
- The meme that pre-training is dead is overblown; both pre-training and post-training are scaling up and consuming more resources.
- Analogy to industrial revolutions: true, disruptive change takes time to reveal itself, and best practices are only obvious in hindsight.
- "There's this almost like fog of war...I think it really gives me no confidence in being like, oh, this thing is dead." (25:58)
- Cycles of Hype and Innovation:
- Expect cyclical enthusiasm ("it's so over, we're so back") as the community toggles between optimism/pessimism and different paradigms.
Notable Quotes & Memorable Moments
- Josh on the transition to post-training:
"Do I want to make compute efficiency wins of like 3% or do I want to change the behavior by 40%?" (01:14)
- On Code Understanding with AI Tools:
"Codex can do more work than I could do in a few hours in like 15 minutes. But then what do I do during those 15 minutes after?" (03:36)
- On Model Personality:
"I just want some answers because I’m, you know, mostly using it at work." (07:43)
- On RLHF vs RLVR & the nature of reward signals:
"If the, if like your value of truth is like does the user like this more, like there's, there's something strained that I think we haven't like looked at that axis of. Okay, well how like sort of clean is this signal? How much do I trust it?" (09:47)
- On context window scaling:
"There'll always be some dance of like...should also have strategies for keeping that context window available for as long as possible." (16:36)
- On the future of interfaces:
"If we lock the interface, if we discover something new about models, we might sort of trap that improvement under an interface that needs to change." (17:09)
- On hiring and the required skill set:
"We should probably be producing more students that are great at doing both, you know, distributed systems and...the statistics and other things that are required to be a good machine learning researcher." (22:02)
- On the ongoing evolution of pre- and post-training:
"There's this almost like fog of war where I'm like, oh, did people think that, like, we got like the steam engine...and they would have, you know, the factories? I don't know if you know, but like the factories, they used to be like very linear...And it took...a couple of decades before they realized, wait, if we have electricity, we can move the little, like, stations in whatever way is most ergonomic. And then, you know, manufacturing was transformed..." (25:58)
Timestamps for Key Segments
- Josh’s Background, Pre-training → Post-training — 00:15–01:40
- Engineering in RL & Model Usage (Codex) — 01:40–04:30
- Model Interactivity ("Shopping Model"), Personalities & Toggles — 04:36–08:25
- Post-Training Methods, RLHF, RLVR — 08:25–12:41
- Long Horizon Tasks & Token Efficiency — 13:14–15:54
- Routers & Thinking Abstractions — 14:46–15:54
- Context Compaction, Long Context Windows — 16:02–20:47
- Systems vs Models, Co-Design Culture — 20:47–21:24
- Hiring & Skill Gaps in AI Engineering — 21:24–23:38
- Pre-training vs Post-training, Industrial Revolution Analogy — 24:47–26:38
- Cycles of Innovation & Hype — 26:38–27:10
Shout-Outs
- OpenAI Shopping Team: Andrew Hoyel, Manukastrada, John Holman, Isa Fulford, and the original Deep Research team.
Final Note
This episode is a must-listen for AI engineers interested in the rapidly evolving field of post-training, system-model interplay, and the next frontiers of efficient, dynamic, and personalized language models. The conversation balances deep technical dives with candid, real-world observations and offers rare insight into the day-to-day realities at the cutting edge of OpenAI.