Transcript
A (0:00)
So we talked three years ago. I'm curious, in your view, what has been the biggest update of the last three years? What has been the biggest difference between how it felt then versus now?
B (0:08)
Yeah, I would say the underlying technology, the exponential of the technology, has gone broadly speaking about as I expected it to go. There's plus or minus a year or two here, plus or minus a year or two there. I don't know that I would have predicted the specific direction of code, but when I look at the exponential, it is roughly what I expected in terms of the march of the models from smart high school student to smart college student to beginning to do PhD and professional work, and in the case of code, reaching beyond that. So the frontier is a little bit uneven, but it's roughly what I expected. I will tell you, though, what the most surprising thing has been: the lack of public recognition of how close we are to the end of the exponential. To me it is absolutely wild that, both within the bubble and outside the bubble, you have people talking about the same tired old hot-button political issues when we are near the end of the exponential.
A (1:21)
I want to understand what that exponential looks like right now, because the first question I asked you when we recorded three years ago was: what's up with scaling? Why does it work? And I have a similar question now, but I feel like it's a more complicated question, because at least from the public's point of view, three years ago there were these well-known public trends where, across many orders of magnitude of compute, you could see how the loss improves. Now we have RL scaling, and there's no publicly known scaling law for it. It's not even clear what the story is: is this supposed to be teaching the model skills? Is it supposed to be teaching meta-learning? What is the scaling hypothesis at this point?
B (1:58)
Yeah, so I actually have the same hypothesis that I had all the way back in 2017. In 2017, I think I talked about it last time, I wrote a doc called the Big Blob of Compute Hypothesis. And it wasn't about the scaling of language models in particular. When I wrote it, GPT-1 had just come out, so that was one among many things. Back in those days there was robotics, people tried to work on reasoning as a separate thing from language models, and there was scaling of the kind of RL that happened with AlphaGo, with Dota at OpenAI, and, you know, people remember StarCraft at DeepMind, the AlphaStar work. So it was written as a more general document. And the specific thing I said was the following, and, you know, Rich Sutton put out The Bitter Lesson a couple of years later, but the hypothesis is basically the same. What it says is that all the cleverness, all the techniques, all the "we need a new method to do something like that" doesn't matter very much. There are only a few things that matter, and I think I listed seven of them. One is how much raw compute you have. The second is the quantity of data that you have. The third is the quality and distribution of the data; it needs to be a broad, broad distribution of data. The fourth is how long you train for. The fifth is that you need an objective function that can scale to the moon. The pre-training objective function is one such objective function. Another is the kind of RL objective function that says you have a goal and you're going to go out and reach the goal. Within that, of course, there are objective rewards like you see in math and coding, and there are more subjective rewards like you see in RL from human feedback, or higher-order versions of that. And then the sixth and seventh were things around normalization or conditioning, just getting the numerical stability.
So that the big blob of compute flows in this laminar way instead of running into problems. That was the hypothesis, and it's a hypothesis I still hold. I don't think I've seen very much that is not in line with it. The pre-training scaling laws were one example of what we see there, and indeed those have continued; I think it's now been widely reported that we feel good about pre-training, that pre-training is continuing to give us gains. What has changed is that now we're also seeing the same thing for RL. So we're seeing a pre-training phase and then an RL phase on top of that. And with RL it's actually just the same. Even other companies have published things in some of their releases that say, look, we trained the model on math contests, AIME or the like, and how well the model does is log-linear in how long we've trained it. We see that as well, and it's not just math contests, it's a wide variety of RL tasks. So we're seeing the same scaling in RL that we saw for pre-training.
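The log-linear relationship described above can be sketched numerically. The data points below are purely illustrative, not real benchmark results: the point is that a score which gains a roughly constant number of points per tenfold increase in training compute is linear in log(compute), and a simple least-squares fit recovers that slope.

```python
import numpy as np

# Hypothetical data: total RL training compute (arbitrary units) and a
# benchmark score. These numbers are made up for illustration only.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
score = np.array([21.0, 34.5, 47.8, 61.2, 74.9])

# "Log-linear" means: score is approximately linear in log10(compute).
# Fit score = a * log10(compute) + b with a least-squares line.
a, b = np.polyfit(np.log10(compute), score, deg=1)

predicted = a * np.log10(compute) + b
residual = np.max(np.abs(predicted - score))

print(f"points gained per 10x compute: {a:.2f}")
print(f"max deviation from log-linear fit: {residual:.2f}")
```

On these toy numbers the fit shows a gain of roughly 13 points per order of magnitude of compute, with small residuals, which is what a log-linear scaling curve looks like when plotted against compute on a log axis.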
