Marius Hobhan (16:07)
Yeah, I think there are a couple of different reasons, but at the core of it, it's a rational strategy in many situations. So the way I think of scheming is like, if you have a misaligned goal and you're sufficiently smart, then you should at least consider scheming as a strategy if you're this model, and sometimes it actually is the best strategy. So, concretely, if I know that the two of us have different goals, I may conclude that I could just pretend to have your goal for a while, and then once I get whatever I want from you, then I can actually reveal my true intentions and do whatever else I want. And there are other ways to solve this problem. Like, you could trade, for example, and just be like, okay, we both agree that we have different preferences here. We could trade or we can do something else. But there are at least some cases where scheming is maybe just the correct answer from a rational point of view for the AI. And I think this is actually the scary part, because that suggests that this will happen across a very wide variety and range of different contexts. And. And this was sort of a convergent behavior where no matter where you start, you will get a lot of different models that kind of learn that scheming might be a rational strategy and reason through that kind of logically and come to that conclusion. And I think we really don't want our AIs to ever be scheming, so we should be really careful. How do we create the incentives during training and so on? And there's sort of a little bit of a longer story of how I expect scheming to arise in future models. So right now, as I said, the models have these preferences, maybe, or they're not super coherent, they're not crisp, they're not goals. They're more like preferences. And now in the status quo, we train the models maybe for horizon lengths of an hour or two or maybe three, sort of in the longer cases. But we're not at the days or weeks. But eventually we do want to train them. That's where a lot of the economic value is coming from. So companies have a very strong incentive to do that. And also, if you want to get the benefits of AGI or artificial superintelligence, then you actually need your model to solve tasks that humans can't solve. So one of my core examples here is something like cure cancer. So you give the model the task to cure cancer. You have sort of this genius in a data center, how Dario Amodai calls it that you have this big AI that has the cognitive power of thousands of different humans and is way faster in thinking and so on. And because this is a hard problem, it kind of has to learn on the job and now what it does is it thinks about how do I solve this problem of curing cancer. And it comes up with a bunch of strategies and it tries them. And then probably it has to distill the knowledge that it has found in these experiments and the way it reasoned about the problem. And then this knowledge has to be distilled back into the model. So maybe it fine tunes itself, or maybe it selects the data for later fine tuning or something. And what I expect is that very often it will explore into the kind of things like, oh, in order to solve cancer, I need to instrumentally do all of these other things. Like money would be very helpful, or having more data or more access to experimental equipment for cancer experiments and things like that. And so these are the kind of things that will come up very explicitly across many, many different explorations. And so the model will learn these instrumental goals because they are convergent, because they're very helpful across a wide variety of things. And so it starts to get reinforced on all of these instrumental drives. Something like having more money, more GPUs, more access. Yeah, exactly. All of these kind of things. And then these get reinforced and these become more and more crisp. These become more and more like an actual preference, strong preference or even goal of the model. And I think at that point you now have these systems that have these very strong preferences or goals, and these are not the ones that you originally intended. And these are often kind of conflicty. Often the money and GPUs and so on have a very zero sum flavor where if you don't get it, somebody else will, and the other way around. And so if you have these zero sum games, then it's suddenly quite plausible for you to be like, okay, in order for me to get the money, I need to trick them, I need to scheme. And then it also learns that over training and that gets reinforced. And so it just becomes smarter and smarter and better and better at scheming. And so that's kind of the overarching story that I have for how scheming would arise in practice. It's just a result of the AI getting smarter and caring more about specific goals and then having these instrumental drives that then force it into situations where it has goal conflict with someone else. And scheming just happens to be a good strategy for that. And it reasons about that. I think there is another reason why one should be more worried about scheming in LLMs than in previous RL setups in particular. So we have Atari games and Alphago and all of these other things in the past, and these were trained with a lot of reinforcement learning. And I'm not very worried about scheming there because the model never. It can't reason about its own training process or how to circumvent it or anything like this. So no matter how much compute I put on the Atari game reinforcement learning agents, it would never get to a position where the reinforcement learning agent is like, oh, I'm in this game and the way I get rewarded is this and this. And therefore I now need to run into this corner because that's somehow the best way to get reward or something. But the LLMs can do that. They have a lot of world knowledge from pre training. They understand how the training process is set up. They probably even know or can at least guess what their reward function is. And they understand a lot of these other general world knowledge settings as well. So models do quite often reason about how am I being graded? And this is just a very useful thing to explore into. And as soon as you do, that happens to get rewarded. So there are actually a couple of chains of thought in the paper where the model very explicitly reasons about what kind of environment is this? This is the kind of environment that I typically see during training, where is the grader. And then it starts looking in its environment in the computer system for grader PY or something where it wants to see how is this being graded? Because as soon as it knows, obviously it would be able to dissolve the task better and then get more reward.