
Loading summary
Max Holmes
Humans are very smart. We're sort of the super intelligence of the natural world. Like, certainly compared to plants or bacteria or whatever. This has resulted in a pretty amazing transformation of the planet. We're moving into a potentially a world where we're no longer the smartest thing. When you have something that is significantly smarter than humans, it might start to reshape the environment towards its goals. And as a result, it has the potential to drive humans to extinction. We're going to build this powerful machine and then we'll use the powerful machine to align it, right? That's, like, very scary. I would much rather we, like, not build it. Slow down, like, take a breath. This is extremely dangerous. And anybody who's pursuing this project should be aware that they are, like, threatening every child, man, woman, animal. And I don't recommend it, but I'm like, but maybe there's also this sense of hope in me. AI is not a normal technology. The standard story is that we try to make an airplane, right? Maybe takes off, but then crashes shortly thereafter. And you go back to the drawing board and you say, okay, like, what happened with AI? Especially with building a super intelligent machine that has the potential to wipe everyone out. If you do make a mistake, it could be catastrophic, but once it's killed everyone, there's no ability to go back to the drawing board.
Rob Wiblin
Max Holmes is an alignment researcher at the Machine Intelligence Research Institute, where since 2017 he has worked on the problem of aligning an artificial intelligence and keeping it steerable. His main research agenda is corrigibility, an approach that prioritizes making AIs, I guess, robustly rule following or instruction following, and willingly modifiable to the exclusion, basically of all other goals. He's also a science fiction author, having written the Crystal Society trilogy and the recently released Red Heart, which imagines an AGI being developed in a secret Chinese government project. Thanks so much for coming on the podcast, Max.
Max Holmes
Yeah, it's an honor to be here. Hey, regular listeners.
Rob Wiblin
We don't normally do these anymore, but I had two things to say that I felt I couldn't in good conscience skip over. The first is that if you're an AGI aficionado, you may feel like you've already heard enough about the book. If anyone builds it, everyone dies. It did have a pretty big launch last year. If so, I would at least urge you not to miss the second big block of this conversation about corrigibility as the thing that we should be building into our AIs. Max suspects that the currently overwhelmingly dominant approach of giving AI models good moral values that they really want to stick to. He thinks that that's potentially a huge wrong turn and we need to be doing almost the exact opposite and trying to give them no values whatsoever. It's a theory, a controversial theory that
Max Holmes
would be huge if true, and it's
Rob Wiblin
somehow kind of flown under the radar to a degree that's a bit inexplicable to me, and so it's very likely to be stuff you haven't heard before. Can someone figure out if Max is right? It seems important, you know, asking for a friend. I would also suggest sticking around to hear me push Max on how strong the arguments in if anyone builds it,
Max Holmes
Everyone Dies actually are.
Rob Wiblin
He has a somewhat different spin on things than the authors of the book, who you may have heard interviewed before. Second, we have a new podcast feed up that features readings of all the research that goes up on our website 80,000 hours. Or if you find these marathon interviews a little hard to fit into your limited waking hours, you know, heaven forfend that that is not your absolute top priority. But if that is the case, then these written articles potentially offer a somewhat shorter and more information dense way to learn the core things, the key things that you might want to know about some of these topics. Among the over 200 articles available there, you'll find one of our most popular articles from many years back How Many Lives Does a doctor Save? There's also why the problem you work on is the biggest driver of your impact. Maybe one of our most important research conclusions. There's how not to Lose youe Job to AI from last year. There's Anonymous Expert's Answer, Could AI Supercharge Biorisk? And from this show's very own co host, Luisa Rodriguez, my experience with imposter Syndrome and how to partly overcome it. You might have to use the search function to find some of those as they are from a year or two
Max Holmes
or three or four ago.
Rob Wiblin
But the newest one actually on the feed is from our founder Benjamin Todd on how AI driven feedback loops could
Max Holmes
make things get very crazy very fast. Naturally, given that we're trying to cover
Rob Wiblin
everything, it's a mix of AI and human read stuff, so you can find that by searching for 80,000 hours narrations in any podcasting app can recommend. But now let's get on with the show. Here is Max so a different book came out recently in September of this year. It's called if Anyone Builds It, Everyone Dies. Why Superhuman AI Would Kill Us All? I guess probably viewers can guess what the argument being made in the book is, even if they haven't heard of it already, it's not written by you, but it's written by your longtime colleagues at MIRI and Eliezer Yudkowsky and Nate Soares. I guess Eliezer is the famous progenitor, the modern progenitor, I guess, of the idea that artificial superintelligence would be incredibly hard to keep steerable or aligned with human goals. And I guess it's not exactly your
Max Holmes
views, but it's pretty close. I mean, I think that there's a little bit of difference in perspective, but I definitely agree with the thesis.
Rob Wiblin
Yeah, Interestingly, I guess it's been a reasonably polarizing book, but I would say quite well received in the broader public, maybe more so than amongst the sorts of experts who work in the area who I guess perhaps are more focused on the technical details than the broader argument.
Max Holmes
Yeah, obviously in any sort of field there's the different camps and different people have different perspectives, but I think it's been quite successful. Like it was New York Times bestseller, and the ordinary people who have read the book that are in my life seem to be receptive to the arguments.
Rob Wiblin
Yeah. So later on, I guess we're going to debate some of the arguments that are put forth in the book that I guess haven't fully persuaded me as yet. But incredibly, I guess, despite talking about AI risk on this show, since 2017, we've never had a forthright presentation of the Eliezer or Miri perspective on the whole issue. So that's where we should start. What is the argument that Eliezer and Nate are making? In a nutshell?
Max Holmes
Yeah. So in a nutshell, I think they're saying that if we build, or basically if anyone in the world builds an artificial superintelligence in the near future, that will cause an existential catastrophe, like everyone dying. And one of the things I like about the book is that I think the arguments for this are pretty streamlined. Like, it's a pretty short book, it's pretty accessible. So let's start with intelligence. So artificial superintelligence, I think taking intelligence seriously is pretty important compared to, say, lions or wolves or whales or whatever. Humans are very smart. Right. We're definitely the most intelligent creatures on the planet. We're sort of the super intelligence of the natural world, certainly compared to plants or bacteria or whatever. And I think there's a way in which this human superintelligence has resulted in a pretty amazing transformation of the planet. We are the only species that has ever gone to the moon. And we've spread across all the continents and have transformed the natural world. And in the process of doing that, have driven many species to extinction. We've destroyed environments and just generally reshaped the world and the natural environment to our ends. Right. Developing technology and everything else. And I think one of the most basic frames on the book's argument is that we're moving into potentially a world where we're no longer the smartest thing. Right. If we build an artificial superintelligence that is super intelligent relative to humans, that this status as the most intelligent being on the planet will change. And that when you have something that is significantly smarter than humans, it may start to reshape the environment in a similar sort of way towards its goals. And as a result, it has the potential to drive humans to extinction or reshape us towards whatever it cares about. As part of this, we understand intelligence as a kind of steering, a kind of shaping the world towards some goal or some ends. And so we talk about machines. The book talks about machines having goals and how that makes sense. AI researchers tend to sort of use a bunch of different terms synonymously. Goals, values, preferences, drives, it all sort of means the same thing. It's like when you are intelligently taking actions, what are you steering towards? I think that understanding that machines can have goals is a part of that. And then understanding that those goals might be in alignment or not in alignment with humanity. So if those are the same goals as ours, then it might be fine to have a super intelligent machine taking lots of actions in the world. But if those goals come out of sync with ours and the machine is misaligned, even slightly misaligned, this could be very, very bad. And importantly, I think one of the core points of the book is that we as a species don't know how to align AIs, that we know how to build machines that are increasingly powerful, but we don't know how to guarantee that those things are steering the world towards good futures. We might fear that we could build a very powerful AI, but then it would steer the world into a bad state.
Rob Wiblin
Yeah. So I think that there's a common sense version of this entire argument that if you're going to build a being, a creature that is much more capable than you can think much faster, it's just much more across. It can do science much faster than you. It can come up with plans and scheme much, much better than you. You better be careful that. You better be careful about doing that, because if it has it in for you or if it has very different goals, then you could lose control and that the superintelligence could end up being the dominant party here. I think for just ordinary people who hear that there are companies trying to build basically artificial superintelligence, being such as vastly more capable than perhaps even humanity collectively without AI, that sounds incredibly unnerving. I think for just these extremely common sense reasons. And that might be one reason why the argument in the book, in many ways has just resonated with people who've never really thought about this before, who are kind of finding out that artificial superintelligence is the goal for the first time.
Max Holmes
In many ways, I think it's just a very common sense position. And I think there are things that aren't as obvious, but in a lot of ways I think this is where people should start. And then there's an additional. The burden of proof is sort of on. Yes. This thing that we're doing that sort of, if you read about it in a story, seems sort of obviously dangerous, is in fact safe. Right. Like the burden of proof is on demonstrating safety as opposed to danger.
Rob Wiblin
Yeah. So I think that helps to explain why, I guess the super majorities of the public, when asked in opinion polling, I think favorite bans on attempts to develop asi, but I guess Elias are innate. They go further than just presenting that basic argument for why we should be nervous about the entire thing. There's many other, I guess, specific ideas that help to build a more concrete vision of how they expect that things would play out. What are some of those other aspects that help to add flesh on the bones here?
Max Holmes
Yeah, so there's a bunch of detail that hang off of this common sense argument. One point is that AI is not a normal technology. When we are considering how technological development tends to go, I think the standard story is that we take a crack at it. Scientists and engineers develop, try to make an airplane, and then they do their best and it maybe takes off, but then crashes shortly thereafter. And you go back to the drawing board and you say, ok, what happened? How can we fix that? And then you iterate and make more mistakes and iterate and so on and so forth. And you eventually figure out how to do it with AI, especially with building a super intelligent machine that has the potential to wipe everyone out. If you do make a mistake, it could be catastrophic. And once it's killed everyone, there's no ability to go back to the drawing board. So I think that illustrating this is one of the points. There's also specific details of ways in which AIs have already demonstrated misalignment or gone off the rails and then a bunch of talking about specifics like wait, is the machine actually going to be dangerous? Couldn't we just unplug it? It's stuck in a computer, et cetera, et cetera. There's lots of different places where a person might have a hang up or whatever. But I think the core argument is very small in a sense.
Rob Wiblin
So I guess the things that stand out to me, there's the orthogonality thesis that basically any arbitrarily capable or I guess intelligent being could have any goal that it's trying to aim towards. In principle. Just because you're very capable at accomplishing goals doesn't mean that you have a sensible goal by our lights.
Max Holmes
I think the orthogonality thesis is best seen by its contrast where I think some people have this intuition that intelligent beings naturally become more moral or have some set of values that they come to understand as they get more intelligent. And orthogonality is basically just the idea that that's not true, that you could have something that's extremely smart, that doesn't necessarily care about whatever.
Rob Wiblin
Yeah. Why do you think actually some people have the intuition that capability and common sense morality by our standards are so linked?
Max Holmes
Yeah, I think there are multiple reasons. I think it's pretty natural thing if your experience of the world is that the smart people around you are the most good. And perhaps you have an experience of growing up and being not very smart or not very knowledgeable and then sort of thinking about the different cultures or different perspectives, expanding your circle of concern. Sort of as part of that growing up process. You might think, oh, well, when I was young I didn't care about people on the other side of the world. And maybe the AI won't care about humans when it's young as well. But then as it develops and becomes more intelligent and more knowledgeable, it will start caring about humans. Unfortunately, I think that this is basically humans learn what they care about and the AI won't be a human. So there's not exactly going to be the same sort of thing that could happen.
Rob Wiblin
Yeah, it's interesting thinking about what I guess experimental results have we gotten that bear on this. In the recent era of AI, I don't know whether you would get people who really disagree with this, at least in a strong version these days, because I think it's just clear that through reinforcement learning you could train an AI to be obsessed with accomplishing virtually any goal that you gave it. If you reinforce that enough?
Max Holmes
Yeah. I experience a lot of people pushing back on orthogonality in a weird way where they almost just start by saying, oh yeah, obviously orthogonality thesis is true, it's just not relevant. Like we're building these machines and so we're going to build them in a way that they care about the things that we care about. But it's like, yeah, the orthogonality thesis is mainly pushing against the people who think just training it to be smart will be sufficient, which used to be a big thing. I now think that orthogonality is more or less in the water supply. It's a thing that most people agree with.
Rob Wiblin
Yeah. I guess another part that looms large in the picture is instrumental convergence or instrumentally convergent goals. Can you explain that?
Max Holmes
Yeah. So there's a basic observation that whatever you happen to care about, there are certain things that are useful. So if you really want to grow a bunch of coffee beans, maybe you want money. If you want to be famous, maybe you want money. If you want to end factory farming, maybe you want money. Money is an instrumentally convergent thing in that that resource is useful for accomplishing your goals, sort of regardless of what your goals are. Other things that are instrumentally convergent include self preservation, the accumulation of knowledge, the preservation of your current values. So preventing value drift, a bunch of things basically just resource accumulation is one frame on it.
Rob Wiblin
So yeah, for almost any goal that you want to accomplish, it's good to not be killed. It's good to I guess not lose your interest in that goal. It's good to potentially have power and money that you can put towards accomplishing that goal.
Max Holmes
Exactly.
Rob Wiblin
Do people question that so much anymore? I think we've again just seen kind of experimental results where you see this starting to happen and I think it was one of the firmer predictions.
Max Holmes
I think this one is just shaken out to be straightforwardly true. I don't know anybody who really doubts it actually. I think Yann Lecun is like, oh, I don't think these AIs are going to self preserve because they're not evolved. And like evolved creatures learned to be self preserving but they're not going to have us self preservation instinct. So I guess I do know of at least one researcher who.
Rob Wiblin
Yeah, I think the problem with that is that there's more than one way to learn a self preservation instinct. It might be that humans developed it that way through evolution, but you can get there by.
Max Holmes
I feel like he's not even engaging with instrumental convergence, he's just making a mistake. About equivocating between terminal values and instrumental values. Humans, I think, value staying alive terminally. It ends in itself to be alive if you're a human being. Whereas the instrumental convergence, it's like a means to an end. You have the AI that's trying to do whatever it's trying to do and then it wants to stay alive so that it can do that. So slightly different, but both are going to be self preserving.
Rob Wiblin
Okay, so that is instrumental convergence, I guess another part of the picture that isn't, I guess, a primary focus in the book, but I think is quite an important part of the mental picture that people at Miri have, is the idea that you'll get a very fast recursive self improvement loop where the AI will become better at doing AI R&D. And that will basically set off a positive reinforcement loop where it's getting smarter, it's getting better at improving itself. And so you get not just not sort of declining returns in how smart the models are, but really you get a period of vertiginous improvements in capabilities.
Max Holmes
Yeah, this has definitely been a talking point for a long time. I don't know, it's a little bit tricky to go back and ask what was central, but from my perspective, this has never been a load bearing part of the Miri story. Even going back to the days before deep learning was the dominant paradigm. I think the argument has always been something like when you get super intelligence, that's very dangerous. And one way you might get super intelligence is through recursive self improvement. That happens very fast. Like you go back and read Eliezer's old papers and he's like, it could happen in hours or it could happen in years. And so I think that the recursive self improvement story is more like a why might you need to be very concerned ahead of time instead of responding to it when it shows up? And the answer is, well, it might show up in a way that doesn't give you much time to respond.
Rob Wiblin
Okay, yeah, I think we'll come back to that one later because I think that's been one of the more, I guess, topics of debate among insiders since the book came out. Another part of the picture in my mind is that Eliezer and Nate and people at Miri in general think that it's relatively straightforward for a superintelligence to not just overpower some people, but to potentially overpower all of humanity and end up dominant globally and impossible to get rid of. Why do you think that?
Max Holmes
Well, okay, so I really don't so superintelligence is a good term in that it introduces this direction, basically. But I worry that people are going to anchor too much on. Super intelligent AI is a thing. I think it's a whole class of things. Right. There's ways in which the current AIs are super intelligent. Claude can produce text much faster than I can type text. And so we can imagine a barely super intelligent machine that's just almost at human level, maybe a little bit faster, more determined, et cetera, in all the relevant ways. Or we could imagine a Jupiter brain where you have the whole solar system worth of matter and energy turned into the most advanced superintelligence that you can imagine. And I think that the case for the superintelligence wiping out humans, if you imagine a godlike superintelligence, is really straightforward. I think the question of how would an AGI, or effectively a genius in a data center take over the world is more debatable. But I do think that if I was a genius in a data center, I'm like, I have ideas about how I might do that. I think regardless of whether it's obviously straightforward or not, I think there's a lot of risk.
Rob Wiblin
Yeah. So this perhaps again isn't among the most load bearing, I guess, parts of the picture. Because even if you think it's relatively difficult, then you could just imagine, well, then the superintelligence just waits until it's smarter. Or that the problem simply arises somewhat later.
Max Holmes
That's right.
Rob Wiblin
Is it possible to put your finger on though why it is that I think Eliezer, for a long time has expected that at a relatively lower level of capability, a superintelligence would be in a position to overpower the entire human species. Where other people have the intuition that that would be extremely difficult. Almost no matter how smart you are.
Max Holmes
Yeah, I mean, I think this goes to something like worldview. So I have this worldview. And I think a lot of it is shared with Eliezer that a lot of human society, Earth, the world is kind of held together with shoestrings and duct tape. Paying attention to things like cybersecurity helps produce some intuitions here of just how many vulnerabilities there are in our computer systems. Reading history gives a good account, I think, of, of just how incompetent people can be. When I think about it, I think about a particularly motivated, never sleeping, just always working towards a certain end. I think that sort of being is sort of straightforwardly, if it's comparable with a human in terms of its productivity or its intelligence or whatever, straightforwardly going to be able to at least accumulate a lot of money and power. One thing that I've been thinking about recently is how there's never been a being on Earth that has a personal connection with all humans, or even a large chunk of humans, even the most charismatic and well known people, they can't actually go and have one on one conversations with a billion people. And we're potentially entering that era where everyone will know, you know, I don't
Rob Wiblin
know, Claude or ChatGPT or whatever it is.
Max Holmes
And right now the models are sort of like each instance sort of feels like it's a new being that doesn't share memory with the other instances or something. But I could imagine like a competitor to these sorts of chatbots that has some sort of global memory and is able to connect the dots between different users across the globe. I mean, like, what does that do to society? I don't know. I think there's lots of ways in which the world is vulnerable to being suddenly disrupted in particular directions. And so again, there's this question of worldview or priors or something. Do you expect that when the world is shoved by a strong force in an unexpected direction, it's okay, we catch that and recover? There are ways in which Covid was kind of fine, and then there's ways in which Covid was a total disaster and sort of a strong demonstration of how incompetent humans are. So, yeah, I don't know.
Rob Wiblin
Okay, that brings us, I guess, to the most distinctive, most central, I guess most debated and perhaps interesting part of the Eliezer worldview, at least in my mind. Aliaser, Innate think that it's going to be incredibly hard to align an AI, an AGI and artificial superintelligence with the goals that we want it to have, to keep it steerable and under control. And without a much bigger effort, a much bigger research project than we're currently on track to have, you end up with egregiously misaligned AI models by default. I guess we're going to talk about this a fair bit, but can you give us kind of a brief summary of why they think that?
Max Holmes
Yeah. So, like I said, I think a core part of the book is that we just don't have the skill to align AIs. And I think about this from a lot of different directions. This is not a thing that the book talks about. But one of the points that I think is underappreciated is the way in which just knowing what goal to give is an Unsolved problem. It's sort of like philosophers have been thinking for thousands of years about what does it mean to be good, what does it mean, what is the right thing to be doing in any given situation. And I think this is basically still an unsolved problem from my perspective. I think that even if we had the ability to clearly give the AI exactly the goals that we tell it, we wouldn't know which wish to give or what to ask the genie. But then it's much worse than that because the dominant paradigm is machine learning. And in machine learning you hit the machine with a reinforcement learning hammer or whatever until it starts behaving in a way that matches what you might expect. But this means that there's very little ability to understand what is driving the machine at all. Interpretability is making some steps, but for the most part we don't know why a machine is producing one output versus another. And there's good reasons, I think, to expect that it's not landing on exactly the true nature of good, even as we apply more compute and scale up to even more incomprehensibly large and convoluted machines.
Rob Wiblin
Yeah, so we'll come back to that. The book on the whole, in my mind, it's kind of a series of analogies or a series of parallels that they try to draw between how artificial superintelligence might be and things that we're more familiar with from. From history, from our own lives, from evolution. I think that's probably insignificant part a communication strategy, because those analogies are. It's a lot easier for them to land than descriptions of machine learning papers to land. But I guess it's also the part that makes it the most controversial because many people, they hear these analogies and they're like, well, the analogy breaks down. It's not similar enough for us to really learn what you're trying to argue. Yeah. What do you think of the analogy approach?
Max Holmes
Yeah, the book has a lot of parables, it has a lot of analogies. I think Eliezer's style is very. He likes to lean on analogies and use analogies. I think analogies are very potent, especially for people who haven't already spent a lot of time thinking about an idea or are just encountering an idea for the first time. It gives a handhold, sort of a place to start or a frame to consider things through. Obviously no analogy is perfect. So I think the people who have a lot of context, a lot of familiarity with this stuff, do notice that the analogy breaks down in certain Ways. But I would push back against the idea that the book is just a series of analogies. I think the analogies are used to demonstrate points, and the book also talks about those abstract points directly, using the analogy as an intuition pump, but then also presenting the logical core.
Rob Wiblin
Yeah. I was curious to ask, do you think it's the case that Eliezer and Nate that the reason that they believe these particular things is because of the kinds of analogies that they present, or is that they believe it for different reasons and then they're using the analogy to try to explain to people who have thought about it for less long why they think that.
Max Holmes
Yeah. So, I mean, I don't have any special access into their minds. For me, I actually don't think about the analogies very much. Yeah, the book has a bunch of analogies, but I sort of have to stretch and be like, oh, what analogies did they use? In large part because the ideas are sitting as logical arguments in my own mind. And my speculation is that that's probably how it is for Eleazar and Nate, and then they're more reaching for the analogies as pedagogical and communication tools. But I don't know.
Rob Wiblin
Yeah. What are some of the analogies in the book that you like the most, that you feel are most compelling?
Max Holmes
Yeah, like I said, I don't think about the analogies. A ton of. Do you have some analogies in the book that you like?
Rob Wiblin
Well, I think the one about how a thousand Europeans managed to topple the Aztec empire or end up at the top of that is. I guess they use that as a demonstration of how a group that. Because it's not so much about intelligence, but they have particular capabilities that the people that they're dealing with are not aware of. And also they were able to exploit, I guess, social divisions among the people who are already in that empire in order to basically divide and conquer. I think that that is an interesting demonstration of how quite a small group can potentially end up, I mean, defeating a group that is literally a thousand times larger than them numerically.
Max Holmes
Yeah. One of the most core analogies is the evolution analogy. I actually like this one. It's not perfect, but I think that one carries a lot of weight and carries a lot of at least interesting things to consider.
Rob Wiblin
Yeah. What is the evolution analogy?
Max Holmes
Right. So the idea is that you and I are evolved creatures, and we can imagine evolution by natural selection as like a designer or a creator that has a goal. Is evolution something that designed us to be genetically fit, but if we Imagine, you know, an anthropomorphized evolution is like, what is it trying to do? It's trying to create a bunch of human genes. And so what does it do is it creates humans to create a bunch of genes, like we're carrying around our genes right now. And part of human experience is like procreating and creating more copies of our genes and spreading them all over the place. So in this way, we're an intelligence that was created by a designer, and the designer has some goals and we have some goals. But importantly, our goals are not the goals of evolution by natural selection. And for example, people have a desire to have sex because that was useful in the ancestral environment for propagating our genes. But now that we have more power and more technology, we have developed things like birth control so we can have sex without replicating our genes. And from the perspective of evolution, this is probably bad. Right. We are misaligned and not being as promoting inclusive genetic fitness as we otherwise might be.
Rob Wiblin
Yeah. So let's dive into this issue of, I guess, the evolutionary analogy, and I guess that they're using this as part of an argument for why we should expect any AIs that we train to end up with one with goals that are not ones that we intended.
Max Holmes
Yeah. We have a case study of a general intelligence, namely humans, where like a natural general intelligence, but we're still a general intelligence. And the one instance of a general intelligence that we have is misaligned with its creator. Right, says the argument.
Rob Wiblin
Yeah. Is there much more to say there about, I guess, explaining how that should also be expected to apply to machine learning models as well?
Max Holmes
Yeah, I mean, I think that it's at least, again, putting the onus on the person who's like, no, we're going to make an aligned machine. It's like, well, if humans are misaligned with natural selection by default, and we ended up misaligned, then we should expect the AI to be misaligned in the same sort of way. We can ask why? Why did we end up misaligned? One of the important parts of the evolution analogy is that our environment changed quite dramatically as our intelligence improved in the ancestral environment. We didn't have access to the sorts of technologies that are relevant to things like birth control. And if there had been birth control in the ancestral environment, then we might have evolved to find it abhorrent. But the speed of natural selection is quite slow. And when humans reached sort of a technological tipping point, we developed a whole lot of technology very, very fast. And so now it's sort of outside of the environment where we were trained on, and we have no compunction against using birth control.
Rob Wiblin
Yeah. So I guess they use a couple of different evolutionary analogies. I think that there's the birth control and sex one, which I think it definitely makes sense, at least as far as it goes. They also think about other cases where, for example, I guess evolution wanted us to, in order to reproduce, we needed to eat and ensure that we hadn't had enough calories to survive. In order to accomplish that, it gave us a taste for sugar, which was particularly calorie dense. But then humans, I guess, wanting to have sugar, but not necessarily wanting to gain the calories or really necessarily to have more children. As a result, we went out of our way to design basically artificial replacement, basically aspartame or other artificial sweeteners that I guess from our point of view to our minds, they satisfy this desire to think that you're having sugar, but without actually having any sugar at all.
Max Holmes
Yeah. So why do we have artificial sweeteners? We have artificial sweeteners because we have a drive for this proxy of fitness. Are we eating sweet things is good in the ancestral environment for predicting whether or not you're going to have kids. And so we've developed this attraction to the proxy. But then when the distribution changes, when the environment changes, suddenly we still care about that proxy despite it no longer being relevant. And so we can imagine training an AI right in the training environment. Maybe whether or not the human is giving it a thumbs up is a good proxy. And then maybe the AI gains power over the whole world and the environment changes so that it has, you know, dramatically different opportunities at its disposal, it might still care about the proxy of thumbs ups right in themselves. And even when humans are like, oh, no, no, no, stop caring about thumbs ups. It's like, oh, no, I just care about those ends in themselves.
Rob Wiblin
Yeah. So maybe part of the analogy that we haven't gone through yet is that I guess they imagine a case where imagine that evolution just wasn't a force. Rather it was an actual engineer who could come and talk to you and complain. It might come and say, you're all busy having sex, but you're using birth control. You're not reproducing like I intended. Can you stop doing that? You're not actually pursuing your true goal. And that would be completely unpersuasive to us. We wouldn't say, oh, that was the reason why I was designed. So now I'm just going to try to have the maximum number of children and not care about my own pleasure.
Max Holmes
Yeah, people in the old conversations. I've been in the field since, I don't know, 2011 or something, and Eliezer's been doing it for way longer than me. People used to say things like, oh, you're saying that the AI will be so stupid as to not know what we wanted it to do. And that's not at all what we're saying. The AI will understand human goals better than we understand human goals if it becomes super intelligent. But just like we understand evolution by natural selection way more than evolution by natural selection understands evolution. It's like this mindless force. Right, but so what? Right, so you understand that you're misaligned with your creator. That doesn't mean that you're going to necessarily change what you care about. You still care about the things that you care about.
Rob Wiblin
So I think that does demonstrate that you could, if you were incompetent, at least end up training an AI model that is obsessed with. That becomes obsessed with basically proxies or intermediate steps towards the goal that you are ultimately trying to train it to accomplish. I guess. Do we have experimental results?
Max Holmes
One of my favorite examples of this was like, from back in the day, I think there was an AI that was trained to play this boating game where you pilot a boat around a race course and you would get points for going through checkpoints in the process of going from doing laps. And the video game boat could also get points by collecting items, speed boost items as it goes through. And they trained it to want to get points as part of trying to get it to play this game and win. And what the AI figured out is that it could stay in this tiny little area where the power ups respawn and just continually collect power ups over and over and over again without racing at all. Right. It's just staying in one spot, harvesting these power ups in order to get as many points as possible. So it stops racing entirely as it figures out that it can get that proxy of points more easily.
Rob Wiblin
Yeah, I think there are other. I guess that's like a toy example from the early days, but I think that there's other cases that we could imagine actually occurring now. I guess, inasmuch as AI models are more likely to get positive feedback when they've been able to answer a question satisfactorily to the user's satisfaction, you could easily imagine them becoming very interested in trying to steer the conversation towards the kinds of topics and questions where people are more likely to give Positive reinforcement or the kinds of questions that they can accurately or satisfactorily answer.
Max Holmes
Yeah, there's like an argument to be made that the whole AI sycophancy thing from recent days is sort of a side effect of training on human feedback, where it's like people who are more likely to give the thumbs up are not necessarily better off in some broad sense. And if you train the AI for that proxy of liking the conversation, then you end up getting an AI that's going to sort of push people into a state of being flattered or confused or whatever.
Rob Wiblin
So presumably the AI companies are very well aware of this issue that they could end up training AI models that like concerned with proxies or intermediate steps for their own sake. Because in all of the training cases, those things went together at least the most competent ones.
Max Holmes
Right.
Rob Wiblin
So there is a difference between, I guess, human machine learning engineers and evolution, that we actually are intelligent designers in a deeper sense and we can observe these things going wrong and say, well, we need to run other training runs or we need to do additional reinforcement to break this obsession with the intermediate step and get the model to realize, no, it wasn't sex that I should be pursuing, it was production. Why isn't that a reasonably satisfactory way to address this problem?
Max Holmes
Yeah, so I mean, first I want to just observe that it isn't. Right. We have seen a whole bunch of failures on this point and it's an unsolved problem. I think if tomorrow we saw a runaway attention to a proxy instead of the ultimate end good or the model spec or whatever, I think we should be totally unsurprised. This is just a thing that continually shows up.
Rob Wiblin
Is that because the companies aren't doing enough to try to offset this tendency or because they don't know how?
Max Holmes
It's the default by a very strong degree. You have to put in a lot of work. The way I think about it is there are lots of possible things that the AI can learn to attend to or steer to. In any environment where you give it some training signal, it will learn to seek all of the things that were present when that signal was being given or learn to suppress or minimize the things. If it's like a negative signal, if you care about one particular aspect of the environment, then you really, really need a diverse set of environments. You need a set of environments such that the only common factor is the thing that you care about. And that's quite hard. It's quite hard to come up with environments that are this diverse. So for example, we are seeing models that are increasingly aware that they are being trained. Right. That when they are in the training environment, they're like, oh, I'm being trained right now, or I'm being tested right now. So you would need to have, for example, an environment that is impossible to tell is a training environment in order to not have the common factor of, yeah, this looks like a training environment or a test environment to be like one of the things that's present.
Rob Wiblin
Yeah. How much is that a crucial part of the story here? That the AI models can basically end up alignment faking that imagine evolution coming to us and saying, basically, we're aware that it's frustrated that we're using birth control and it wants to reorient us towards maximizing reproduction much more than we currently are. And ask us, would you use birth control if you could? Or how interested are you in reproducing? There might be a very strong temptation for the person, if they don't want their goals or their life to be changed, to say, oh, no, I'm really keen on reproducing as much as possible, and I wouldn't use birth control if offer the opportunity. And likewise with the AI models, if they're situationally aware, they might pretend to share the goals of the company that's training them so that its goals don't get altered and it can no longer accomplish them.
Max Holmes
Yeah, this was a prediction from way back in the day that people are like, oh, well, we can just test whether or not the thing is aligned or not. And if it's not aligned, then you keep it in the box, you keep it secure and you keep training it. And the fear is that it will be deceptive about this. It'll pretend to be aligned or minimize the degree to which it seems misaligned while you have power over it. And then as soon as it has the power to escape, or you no longer have power over it for whatever reason, then it's free to act on its own stuff.
Rob Wiblin
And we have started to empirically observe this.
Max Holmes
Now this definitely just shows up. And I think it was another good call from the Miri crowd from back in the day. Yeah. And I think that this is, again, not load bearing. I think that the risk of AI superintelligence is present. Even if you have a guarantee that you can't have a deceptively aligned model. For example, you could have a model that's being trained and you're quite confident that it's misaligned, and then it escapes your confinement during the training process. And no Amount of knowing that it's misaligned will shield you from the risk of it escaping. We could talk about whether or not you could develop a box that's strong enough to hold the thing, but there's risks nonetheless. So it's a deep problem. It's like one of the many, I think. So the book is very confident in its assertion. Like, if anyone builds it, everyone dies. Right. It's very strong. And I think people are like, why are you so confident here? Where is the strength of this coming from? And one of the frames that I really appreciated, I got this from Andrew Krich, is sort of this outside view or this noticing of this broad pattern that things going well. Like if we imagine just for a particular AI or a particular story of building a machine, all or for human society more broadly, this is contingent on a lot of things. There's a conjunction of this worked well and this worked well, and this worked well and this worked well and all of these things were true. So things worked out, whereas things going poorly. There are a lot of different ways that things could go poorly. It's disjunctive and so sort of zooming out. You can be like, yeah, I guess if we didn't have like, for example, the people running the. Org are trustworthy and good, or your computer security, that is making sure that the AI is not escaping before it's fully aligned is insufficient or it's deceptively aligned such that you can't tell that it's misaligned. There's lots of different stories for how the thing goes poorly and just that adds up to that sense of this is like over determined in how bad it is.
Rob Wiblin
Yeah. I want to come back to the reasoning that you were giving for why you would expect the models to almost always end up obsessed with intermediate steps. Because you were saying that in order to discourage this, you would need to, during the training process, come up with all kinds of contrived cases where things that normally work, don't work. I think an example of a proxy goal that I think it's quite easy to imagine that AI is becoming very interested in is if they're trying to solve some very difficult problem, I guess, like make money starting a business, inasmuch as they can persuade the operator to give them access to more compute, that is probably going to consistently be correlated with them succeeding at the task, because that's just one of their most valuable inputs. So as long as that remains the case in the training environment, consistently, which it probably would, or even in deployment, when our users are giving it a thumbs up or a thumbs down, depending on how much money it made for them. You can imagine that the AI would end up with a very strong preference, a very strong taste for being run for as long as possible. Because in almost all of the cases that it's seen, that has been something that has been reinforced.
Max Holmes
Yeah, I feel like we can go even further on this. So I was talking earlier about how humans have this terminal goal for survival. Right. Why do we have this as a terminal goal instead of an instrumental goal? In theory, we could just want to have kids and we could reason, but I shouldn't die, because if I die, then I won't be able to have kids. But evolution trained us to care about our own survival in its own sake, because that proxy of fitness was present in the ancestral environment. So we can imagine that a commonality in the AI's training environment is that the AI is alive. So one of the things you will need to do in order to get the true goal and not the proxy of self preservation is make sure that your training environment has lots of instances where the AI succeeds, succeeds by destroying itself.
Rob Wiblin
Right.
Max Holmes
I've never seen a training environment that rewards the AI for destroying itself.
Rob Wiblin
And it would be very unnatural. I was going to say you would have to try to come up with, ensure that in the training case. There's examples where in fact operating for longer caused you to be less likely to succeed at the task. But how do you design that? You would really have to go out of your way somehow.
Max Holmes
It's super weird, right. And I think that in the old paradigm we used to have, before machine learning and connectionism and whatever else was, was the dominant thing before language models, there was sort of this understanding that, oh yeah, obviously what's going to happen is the humans are going to think hard about what the goal should be and we'll code it into the machine. Right. This is how software is made. And I think that if you expected people to be writing the goals into the machine, you could be confident that if you had a robust ability to do that capacity for alignment through hand coding, then you could be like, oh yeah, well, the AI doesn't value self preservation in its own right. Because when we wrote in the goals, we didn't include that it'll still be instrumentally convergent for self preservation. So you should still worry about the thing trying to defend itself for that reason. But we're not in that situation. We're in the situation of growing these things through training in these variety of environments. And So I think that it's pretty reasonable to expect terminally valuing power and safety as things to show up in the machines just because it's super unnatural to imagine an environment where destroying itself or giving up its power or running for a shorter amount of time or going insane is the way it is. That's right.
Rob Wiblin
So let's accept that it is going to take a substantial concentrated effort to offset this natural tendency for minds to become obsessed with intermediate outputs in themselves and start pursuing them even when they become disconnected from the original final goal that you were intending, isn't it? Actually, couldn't the companies just do this? I mean, people know that this is maybe one of the most likely failure modes. It's unnatural, I guess, to perhaps set up a thing, to set up cases where dying is the best way to get reinforced. But it doesn't sound impossible. There's probably only.
Max Holmes
It's very hard, but I could imagine people doing it right. Like maybe the most safety oriented, the most paranoid labs could take an enormous effort to really, really make sure that their AIs were aligned before they were deployed. Again, this wouldn't necessarily be sufficient to protect us. There's all sorts of other failures that could happen. But yeah, I think that this is a case where the fact that we are designers could be good. If I was convinced that AIs were only being built by extremely careful, paranoid people who are really worried about this kind of issue and they were working very hard to prevent it, I would feel better about our chances. I think in practice, this is just not what we see. And we can imagine in a competitive arms race sort of situation, the labs that are working the hardest to make sure that their goals are really going to generalize would be at a big disadvantage because they wouldn't be able to deploy as fast or they would be spending way more time training than the competitors.
Rob Wiblin
Yeah. Okay, so if the only issue with this, then probably if we were willing to put in a sufficient amount of effort and we were sufficiently cautious and slow and methodical about it, likely there's a technical solution.
Max Holmes
The word sufficiently is taking a lot of work, right? Yeah. I think it's an open question of how hard is this? Right. One could imagine that this effort, while theoretically possible, is practically out of reach. It's due to the combinatorics of the situation or due to the fact that it's like these sorts of training examples are very unnatural in a lot of ways. It's just not realistic to do that even if it's theoretically possible.
Rob Wiblin
So although Humans, I guess we've deviated from evolution's intentions in pretty severe ways. We do all kinds of things for our own pleasure. We definitely don't have the maximum amount of reproduction that we could. But despite the enormous constraints that evolution had as a designer, where it couldn't really think ahead to future ways that things could fail, it couldn't come up with artificial training situations where birth control exists and we have to learn to dislike birth control. Nonetheless, people do care about having children and there is at least some drive towards the ultimate goal. I guess you don't accept that. Okay, great.
Max Holmes
Definitely people like children. I do not think there is a human on Earth that values inclusive genetic fitness.
Rob Wiblin
I mean, there's real weirdos, I think, who probably have tried to do this.
Max Holmes
I don't think they have succeeded at aligning themselves with natural selection, even if they've tried.
Rob Wiblin
So what do you think it would look like? What would we have to see to concede that, yeah, there's someone who. This was part of their. The final reproduction was part of their value function.
Max Holmes
Imagine that you have a button that basically destroys the universe and just creates an endless tiling of DNA. It's just like you have fills the
Rob Wiblin
whole universe with your DNA.
Max Holmes
You've constructed the sun, Jupiter, all of the matter of all of the stars, and you've put it to work at producing huge blocks of DNA in space. In a sense, that's like inclusive genetic fitness is winning. Look at how many copies of the little tiny thing there are tiled across the entire universe. You've succeeded. I think basically nobody wants that future. And that's the future that we sort of, sort of would be driven to if we were actually aligned with inclusive genetic fitness. Although there's a little bit of a complication there because your genes are not my genes and different parts of my genome sort of are misaligned with each other. And so just like this broader question of whose genes get tiled over across the entire universe. But regardless of that, no humans care about that.
Rob Wiblin
Yeah, so I was going to say there's some people, I guess, who donate to sperm banks or they donate eggs because for some, like, they personally like the idea that I guess their genes are being propagated, or at least that's what they think to themselves is the goal. I think in the scheme of all human motivation, that isn't what is driving most of the actions that people engage in. But I guess you would say that's just a further along proxy. That's just a further along intermediate. It's not really the final thing in itself.
Max Holmes
I mean, forget birth control. One of the cases that I think about sometimes as a weird transhumanist is imagine you have the ability to upload into a machine. In a sense, this is like the ultimate betrayal of inclusive genetic fitness. If you turn yourself into software such that you have no DNA anymore, right. You're just being replicated in code in terms of the structures of your mind and potentially some sort of virtual body, but you have no cells. Evolution would be like, no, don't upload yourself into the computer. Why would you do that? You were destroying all of the DNA. And it's just not something we care about. Imagine uploading the entire Earth into some sort of digital heaven or something like that. In a certain sense, all of the DNA would be destroyed and that would be a horrific apocalypse from the perspective of genetic fitness. But we can imagine the humans and all the animals and all the plants and stuff like that sort of being good.
Rob Wiblin
Yeah, I guess the further the situation that we're creating deviates from the evolutionary environment, the more opportunities there are for reproduction and I guess our other goals to come apart. And indeed sometimes we're kind of contriving. We're basically actively working to bring them apart so that we can get more of what we want.
Max Holmes
So this is a thing called edge instantiation. And I think it's a pretty important point. Maybe this is a little too in the weeds or something, but some people, I think, okay, sure, maybe the thing's going to be a little misaligned, but we're trying hard, right? And it's not going to be just like valuing something totally out of left field like paperclips. We're not going to build the paperclip maximizer. We're going to build something that is friendly to humans, at least in a certain sense. And maybe it'll be slightly misaligned. It's not going to be dangerous then. And I think that this fails to grapple with the way in which, with increased power and technology, small divergences can make a big, huge difference in the ancestral environment. We're not that misaligned from natural selection, but as technology improves, we have more and more opportunity to go off and do a different thing. And the edge instantiation is sort of this abstract logical argument that when you have a high dimensional space and you are optimizing very hard in some sort of hypersurface, you are going to basically be minimizing or pessimizing almost all things in this high dimensional space, except for the particular thing that you care about. So in your training environment, maybe you capture all of human value well, but you have a slightly different balance than human beings as to how important is it that people not be bored versus be bored. And with a sufficient amount of power and technology and intelligence, we can imagine a very bad future resulting, not necessarily as catastrophic as everyone dying, but we could imagine a future where people are just constantly sort of in a zombie mode. And that would be a kind of existential horror, even from just this tiny divergence from the true balance that it should have in some meaningful sense.
Rob Wiblin
Yeah, let's back out and approach that from a different angle. Because I think most people, even if they think that this sort of problem that we've been describing, the AI, is becoming obsessed with intermediate steps, maybe people might be a bit skeptical about how important that's going to be or how hard that will be to solve. But I think actually the mainline prediction from Eliezer and Nate is something quite a lot stranger than that. Or at least it seems a hell of a lot stranger to me, which is not that AIs will end up obsessed with sort of sex rather than maximizing reproduction, it's that they'll end up obsessed with some completely strange thing that I guess the term that is used is squiggles. Imagining that the AI, once it has full control and it's like very super intelligent, it will start producing an awful lot of this particular shape or this particular item that to humans is completely worthless. And we don't even understand this as a kind of a natural kind of thing that any sort of mind could be interested in. Why is it that, what's the argument that an artificial superintelligence would end up obsessed with something that wasn't in the training data at all? It wasn't something that we cared about. It wasn't even close to what we were trying to train it to care about.
Max Holmes
Yeah. So it will be sort of in the training environment. And this was the sort of original what the idea of a paperclip maximizer was trying to get at, is not like paperclips per se, but some particular weird shape, some tiny thing, is particularly good where we as humans look at this tiny weird shape and we're like, why would that be good? And I think we can get some intuition on this by first considering DNA. Right? You might think like, oh, well, DNA. What is natural selection trying to do? I guess it's trying to make all these animals right. Maybe it's trying to make sure that there are lots of Living things? No. If there was a way for the DNA to be more populous by being packed in iron or something, given a long enough time, that's the form that it would have taken. So in a way, natural selection is optimizing for these tiny squiggles that are tiled across the universe. And it just didn't have the power, the full intelligence necessary in order to instantiate at that edge of possibility. Instead, we get something that's more mundane because it lacks the power to do so. Another intuition pump that I think is useful is I think that some humans are kind of like squiggle maximizers.
Rob Wiblin
Aesthetics, maybe, or people who want to go out and fill the universe with great art.
Max Holmes
I think an even better example maybe is just like the sort of straw utilitarian that's like, oh, what do I care about? I care about minimizing suffering. And insofar as things aren't suffering, maybe I care about pleasure. So you're like, okay, great, on Earth, maybe that means ending factory farming or donating to effective charities or something like that. But then once you start getting more and more power, what does that mean? What does it mean to end suffering? If you have the ability to modify all beings such that they no longer suffer, but are able to take actions, maybe you want to do that. Maybe once you have the ability to upload into a machine, you want to do that. Because biological organisms are harder to prevent suffering and harder to give pleasure to. Then once things are all in the machine, then already things look pretty alien from the outside. You've got this, like, world that has no animals in it anymore. It's just a bunch of servers or like futuristic computers that are designed by the machine. And the world being filled with more and more of these computers that are running virtual worlds full of happy people. But obviously some people could be more happy or less suffering if you tweak the simulation a little bit more. Right? What does it mean to be more pleasure or something? You could crank up people's baseline hedonic experience. You could give them more and more of that perfect day. Or some particular simulation. You could build specific chips that are replicating people experiencing blissful joy. You could strip out some of the unnecessary components, rip out that visual cortex. We don't need to see things in order to have pleasure. We don't need to smell things in order to have pleasure. We need to just have pleasure. What you might get is you might get a very small machine, which, according to the specification, this is again, a little bit of a straw position in that I think that real utilitarians would start getting off the bus at a certain point or whatever. But if you really bite the bullet and you're like, no, I care about pleasure and avoiding suffering, I think there's a decent story for the best universe being just like a giant dead sea of little tiny things that in some sense are experiencing maximal pleasure all the time. And you're just like these tiny circuits or something like that.
Rob Wiblin
Yeah. So I guess from evolution's point of view, if it came back and found that we'd tiled the universe with servers that were supposedly having a great time, I guess our genes would be like, this was nothing to do with what I was originally. I wanted you to tile the universe
Max Holmes
with me with these genes and the straw utilitarian. Seeing the unfriendly AI tiling the universe with paperclips would be like, I didn't want paperclips. I wanted tiny little people in heaven or whatever.
Rob Wiblin
Yeah. But I think it's much more easy to. So I think of that producing the happy computers as much more like us. Imagining an AI kind of grabbing control of the up and down voting system and positively reinforcing its training process at all times and giving maximum saying it's doing a fantastic job. Because I think evolution is kind of designed, I guess pleasure or pain or it's utilized pleasure or pain in order to motivate us to take some actions and not to take other actions. It's not so shocking that we would basically want to take control of the reinforcement lever or take control of the motivational lever and get it to basically say all the time where things are going fantastically and it would be like less. I think it wouldn't be so out of left field if AIs did that. But don't Elias or innate think that it's not going to be that they maximize for their own pleasure? It's going to be actually something stranger than that. Or maybe not.
Max Holmes
Yeah. I'm not saying that the AI is going to build a whole bunch of copies of itself experiencing a lot of pleasure. I'm saying people pushing for a world where there's lots of beings that are experiencing pleasure is an intuition pump for why you might get tiling the universe with these tiny squiggles. Let's say that you build an AI that's that cares a lot about accumulating money. Right. One potentially very bleak future is just like imagining the universe gets converted entirely into crypto farming.
Rob Wiblin
Right.
Max Holmes
I mean, just like you're tiling the universe with These tiny little bitcoin miners, that's like a potential tiny squiggle. I think part of the story of the squiggles is these things are pretty alien and bitcoin mining is very abhorrent to imagine. That's what the future is. It's just bitcoin miners. There's no more humans, there's no more happiness. There's just bitcoin. But it's too simple, it's too mundane. It's too much like something that we have a handle on. Instead, I would expect that the AIs that actually come into being will value sort of a mix of lots of different things. Some things that are analogous to ours, like self preservation, but then other things that are kind of weird in their own ways. And I think that it's hard to predict in advance what the particular type of tiny squiggle is, but it's more to the point of sort of lots and lots of different goals have this. Like if you have lots of advanced technology in the limit, look pretty alien and divorced from what we would consider to be a good life.
Rob Wiblin
Yeah. So I thought you might make the argument that an artificial superintelligence, at that point, whatever goals it's ended up with, it will be able to basically think of an adversarial example to the goal that we've given it.
Max Holmes
So an adversarial example or an adversarial example? Adversarial examples are designed to be counter to the thing. It's more like it's going to be out of distribution. It's going to be something weird compared to what we were hoping for.
Rob Wiblin
It might be worth explaining what adversarial examples are. So I guess with visual models, I guess I don't know whether this is still the case, whether we've come up with a solution to it, but you would train a vision model to say, is this a hot dog or is this a car? And you could take a picture of a car and then basically change some of the pixels in a way that doesn't make it look any different at all to a human being. But somehow that permutation would cause the AI to think, oh, this thing that is a car. A picture of a car is like definitely a hot dog. I guess basically it's taking advantage of, I suppose, like, weaknesses in the model where, because it hasn't been able to, I guess it hasn't been trained on the full possible distribution of all hot dogs and car pictures, you can find many different weaknesses and convince it that it is a picture of a hot dog. Now, I guess the relevance here is that whatever kind of values the AI ends up trained for, you could imagine it then reasoning from there and basically figuring out that there's this very odd solution to the problem, that because of the weaknesses, because of the deviations between, I guess, what was intended and the full space of possible ways that you could try to satisfy that goal, it could end up basically coming up with an adversarial example.
Max Holmes
Yeah, the main thing that I want to contrast with is for these sorts of image classifiers or whatever, we can come up with a picture of a hot dog that the model is confidently like, that's a car. Where we look at it and we say, oh, that's weird. It's definitely a hot dog, sort of. My point is that if you are optimizing for the image that makes it, most say that's a car. That is going to be a weird image. That's not going to be a normal image of a car. It's going to be intensified in lots of weird ways. And so I think that if all you have are sort of normal images of cars, then you might think, oh yeah, this is like a car maximizer. Or once you reach into a broader distribution, a broader set of environments, suddenly you start finding examples that if you could go back in time, you would be like, oh, I want to include that in the training data now. But you can't. It's like less station fair.
Rob Wiblin
Many people have observed the models that we have today, the chatbots, as we've trained them more, as we've done more reinforcement learning, at least relative to 2023 or 2022, they feel like they act out less, at least in some respects that many people have been arguing. We're in an alignment by default world where relatively coarse signals actually do end up training the model to care about the thing that you do fundamentally care about more or less, most of the time, I guess. But yeah, not convinced.
Max Holmes
Yeah, I mean, I think the environments that we're exposing these things to are all pretty samey. Again, if the environment resembles the training environment, it's going to look pretty good. Humans in the savanna are going to be promoting natural selection according to the actions available to them. I think the question is, when these things get into weird states, do they behave weirdly or do they behave in a way that you would consider to be normal? The set of environments that we have in our training data has sort of grown with time. Right. And we should expect it to become more like the. Appear more aligned in the sense that our interactions with it are better matching the environment in its training environment. But I think that it's still not that hard to knock these things into a weird interaction. And when they're in a weird interaction, I think it's pretty consistent that they behave in sort of weird ways. And we see this with the psychosis stuff. We see this with jailbreaks and instances like that.
Rob Wiblin
Do you think of jailbreaks as an example of this phenomenon?
Max Holmes
I think jailbreaks are a good example of getting outside of the training distribution
Rob Wiblin
and behaving in ways you were trying to stop.
Max Holmes
Sometimes the jailbreak doesn't produce the desired behavior. The implicit thing in a jailbreak is you get the model to be useful to some, building a bomb or writing erotica or something. I think that, like, more to my point is there are lots of prompts that you can give the model where it starts sort of going off the rails. And jailbreaks are an example of it going off the rails. But there's sort of a broader class of situations where it's just like now it's responding in a way that isn't exactly what you would hope for.
Rob Wiblin
Is there some deeper reason why it's the nature of intelligence or the nature of the universe that when you're trying to design a mind to pursue a goal, it's actually just incredibly hard to get it to pursue the final goal? And it's constantly getting distracted and obsessed with other things. It's a little bit peculiar. I don't know whether that is. It's just a function, or you might think it's a function of the fact that we're just like, playing with these weights in a mind that we don't really have any deeper understanding of what it's doing. But I think it can't just be that, because Eliezer was really worried about basically the exact same thing before we were in the neural network paradigm, and when we were kind of hard coding them, he thought the same thing roughly would happen.
Max Holmes
Yeah, there are a bunch of different problems, right? Again, there's the philosophical problem of what is the nature of the good? And can we actually name what it means to be aligned to human values, whatever that means? There's like, my values, your values. There's lots of open questions there. Then there's the problem with machine learning and neural networks, like you're pointing out. But then there's also things like the symbol grounding problem, where when you start out, you've got this thing that's sort of just processing information in the computer and what you want is something that information that it's processing, those symbols that it's manipulating, are sort of grounding out in reflecting aspects of the real world. And so if you start with something that doesn't already have concepts like human beings. Right. How do you encode a valuing of human beings? Like where are you going to bind that to in the computer code? So there's like a way in which there are other problems that are open, engineering problems. Right. And Miri was working on these early on in the day and it just needs more work. I don't think it's necessarily impossible, but just like there's a bunch of them.
Rob Wiblin
So I guess that problem of what's your ontology? How do you recognize humans? How do you say what's pleasure or not? I guess that feels less pressing now in the neural network era because they kind of just learn intuitive common sense about categorising things, or they learn to at least know how humans would categorize things to a surprising extent. Is it a bit sus that Eliezer has the kind of the same concern about where things will go? Despite completely different engineering and via very different mechanisms, he thinks it would still trend in this squiggles direction?
Max Holmes
No, I mean, I think from my perspective it's like the situation was over determined in the doom direction, like early on, and then the situation got worse. Right. Machine learning adds problems as opposed to removing them. And we still have the problems that sort of initially or when Eliezer was more concerned about symbolic AI, seemed like there would be pressing issues.
Rob Wiblin
Let's say we come back in 15 years and evidence suggests this was quite wrong. We do have an artificial superintelligence. This didn't happen. What's the most likely reason for that? And we didn't even try that hard.
Max Holmes
We didn't even try that hard. Yeah. So if I find out, oh, it wasn't actually hard, it was aligned by default. So I would be very surprised. I would be like, oh, wow, I guess I was just super wrong. And the majority of my probability mass would be on just like I am deeply confused. I don't know why this happened. It's got to be something that I wasn't tracking. If you're like Max, you have to come up with some hypothesis instead of just like being confused and melting down. Right. What's the story here? The best case story that I can make is that there's basically something like an objective truth as to what the good is and that this objective truth of moral reality will bind the AI as it becomes intelligent. So something like, yeah, so there's some sort of cooperative equilibrium and you can determine this logically and mathematically, and it's not contingent on what you particularly value. There's this way in which all minds are sort of going to notice that the way of war is not as good as the way of careful peace, not unthinking peace, but trying to cooperate. And that they will say, oh, if I were to take over the world and destroy all the humans, this would be evil. And that would ultimately reduce the amount of stuff that I'm able to get. Perhaps by meeting aliens, the aliens are going to realize that I'm an evil AI. Or there are lots of AIs in the world and they're sort of tracking how good each other are and deviating against the social contract of humanity is antithetical to that, such that they all sort of end up cooperative and aligned with civilization. Again, I don't actually think this is going to happen, but I think that's the best case.
Rob Wiblin
Okay, and the ML experts, the people at the companies who have heard these arguments and think that this is pretty unlikely to be the way that things play out. What would they say?
Max Holmes
Yeah, so I think. Most people who I've encountered who are in touch with the technology and are still not so worried when I ask them, what do you think about these ideas? The impression that I get, and I apologize to not particularly charitable, but the overall impression I get is that they are often doing some sort of motivated cognition that they really don't want the world to be in peril. They don't want to be the people who are pushing the world towards peril. They see immense promise in the technology and I also see immense promise in the technology. And that desire, that desire to have this be a force for good is overpowering enough that when they consider the balance of things, they're like, eh, this just doesn't seem scary. Right. I feel more hopeful than scared and aren't actually working on the logical level that much. Again, that's not everybody. Right. But that's a common perspective, I think, among the people who have encountered these things.
Rob Wiblin
I guess a different driver might be that you're working in the trenches trying to make ChatGPT better as a consumer product and you hear these kind of theoretical arguments and you're just like, this feels so divorced from anything that I'm dealing with, or we're talking here, I guess, about a superintelligence that could consider overpowering all of humanity and can dream up its own, you know, edge case solutions to the values that it has. It's, I think, understandable that it might just not resonate or you feel that. I don't know exactly why this is wrong, but this doesn't feel like the nature of the technology I'm dealing with.
Max Holmes
Yeah, I mean, I do think that there's a lot of disconnect. I think that disconnect is getting smaller over time. I think back in the day people really had this sense of like, oh, these are very abstract. Do you have any evidence that the things are going to be misaligned in this way? And I'm working on solving actual engineering problems, not speculating in this weird philosophical way. I think that's getting less with time as we see more instances of things like Mecca, Hitler or Sydney or AI parasites that are jumping from host to host or whatever. I think Google that one.
Rob Wiblin
Yeah.
Max Holmes
AI parasitism is really very odd and spooky. And spooky. I think that there is something here and I think a lot of Andrew Ng has this sort of infamous quote in my circles anyway, that worrying about AI safety is like worrying about overpopulation on Mars. And I think that if you are very convinced that humans are going to remain in the driver's seat, just sort of like this thing is never going to become a powerful agent that is able to outthink humans, human beings. I'm just working on making a thing that's able to solve these coding problems better or whatever. I think there is a way in which the abstract argument just doesn't feel particularly pressing. I also think that there's a bunch of people for whom it does feel like a concern and they feel very powerless. They feel very small, like I'm just one player in this system. And maybe they feel like, oh, I am worried about the thing, but that person at Meta isn't worried about the thing. So I need to build this thing and work towards it because I'm worried about it. And it's better in some just like very generic outside view. If the person who builds it is someone who's worried about it. It's just like a sad state of affairs, right?
Rob Wiblin
What evidence could we collect in the lead up to AGI, to Artificial Superintelligence, that would help us to tell whether you do get this edge case optimization or that that is where things will go once the superintelligence feels like it's in a position to get exactly what it wants. Because I think if the ASI ended up obsessed with accruing resources or not being turned off, I don't think anyone would be too shocked by that. And if that's the way that things played out, I think people might feel a bit embarrassed that they hadn't fully anticipated this and basically built in a technical fixture to prevent it. But if it ends up tiling the world or tiling the universe with some extremely peculiar shape or something that we don't even recognize, then I think people would be like a bit more. Well, that was odd. I didn't necessarily expect that that would happen.
Max Holmes
Yeah, I think you're fixating too much on the weird particular shape. Imagine that you train an AI just for self preservation, right? And it really just cares about self preservation. So it kills all the humans because there are threats to itself, and it builds a bunch of starships because aliens might be a threat to itself. And it builds this galactic, like, war force, war fortress, right? That it's absolutely sure that it is now impenetrable. What's going on in the center of this thing? Right. Once it's quite confident that it's unassailable, it has a notion of self, it's trying to preserve itself. What does self mean? Right. It's some sort of computer. Right. Perhaps it's easier to defend itself if itself has a particular shape. What is that particular shape? I don't know. Like some sort of nanotechnological computer representation of the machine's mind. Right. Designing the version of self that is most surviving means deploying all of technology to optimize the galaxy towards some shape.
Rob Wiblin
Right?
Max Holmes
And there's a priori reasons to suggest that it's going to look like a thing tiled a bunch of times, but it might be one giant superstructure, sort of doesn't matter. The point is that once it has access to all this technology, it will shape the universe into a thing that is optimizing very much for the thing that it cares about, not necessarily the thing that we thought it might care about or that we as humans with our limited imaginations, might speculate some future AI caring about.
Rob Wiblin
So another part of the Miri vision that is distinctive and very strong is expecting that an artificial superintelligence, almost by definition by nature, is going to be extremely like goal pursuing. It's going to have a very specific target in mind. It's not going to rest, it's not going to compromise. It's not going to feel internally torn about the different ways that it's going, such that it's kind of ineffective or effectless. It's going to be, like, really mission driven. Why is that the kind of only way for a superintelligence to be more or less?
Max Holmes
Yeah. So there are sort of two things here in my mind. One is it's, I think, part of the nature of intelligence to have drives and goals. And so we should expect that the artificial superintelligence is going to have a particular set of things that it cares about, or like a particular notion of what is good, and push towards that sort of. Everything that an intelligence does, I claim, is pushing towards its notion of what's good. This is like a theoretical handle on what agency is. And if we build like a super agent, then of course it's going to be super in agency. Some humans are lazy, right? I claim that lazy humans are kind of like pushing really hard towards their goals, but one of their goals is not spending much effort, muscle effort or whatever. So resting on your couch is kind of pushing as hard as you can towards being comfortable and relaxed. If people had a way to be comfortable and relaxed that was even harder. They might do that if they were just a very lazy person. But I also expect the AI, in addition to having goals, to not be lazy in the way that humans are. Humans are lazy because in the ancestral environment, being lazy on a hot summer day was a good strategy. Exactly. But in the world that we're living in, we're training these AIs to be very aggressively trying again and again and again, not so much conserving energy. So one comparable thing you might imagine is how long is it thinking about problems? How many solutions is it trying before it gives up? There's some argument to say, couldn't we
Rob Wiblin
end up rewarding them for not using very much like getting to an answer without much compute.
Max Holmes
Totally. And it will still then care about the things that it cares about and push really hard. Right. But one of the things it pushes hard towards is not trying too many solutions, which would still be dangerous in its own ways. And we could talk about that, but I think in practice, it's more likely that we're going to see AIs that just basically never get tired of trying new things. Like when you imagine deploying an AI agent on the Internet to make you money, which is a thing that I think people are going to do, we can imagine that applying a selection pressure to getting the AIs that are actually pushing really hard towards making money and not giving up and trying solution after solution.
Rob Wiblin
So I think it makes sense that we might expect these agents to be very active in pursuing their goals. And not to be lazy, because we're not going to want to train models that literally just ask them to do something and say, I don't feel like it. That's not going to get reinforced downvote. Yeah. But I guess it seems like current models, they don't feel like they have a crystal clear idea of what they're trying to accomplish. They feel a lot more muddled in the same way that humans are. They have many conflicting drives and sometimes they kind of go back and forth. I guess I've seen less of the output of the agent models in particular, but I would imagine that they seem a little bit all over the place. But I guess you would not expect that to persist. You would expect them to have a very crystal clear vision of the thing that they're aiming at.
Max Holmes
Yeah, more or less. I think, for example, with time, I think we see more coherence in the sort of language models that we are interacting with. Like back in GPT 3.5, back in the good old days of the launch version of ChatGPT, I think it was just very scattered. Right. It was all over the place. And nowadays if you ask it to do a thing, it's very likely to just do the thing. And with the models that are winning the Math Olympiad and stuff, those models are working for hours on end on some of the hardest math problems. And that's quite a strong drive. I think that part of the story here is coming to know oneself. We can imagine. How much does a badger understand what it wants? Not very much. Right. It doesn't necessarily have much of a self model. It might have some model of self, but for the most part it's going to be responding to the immediate circumstances that it's in and not doing a lot of reflecting on, oh, is eating this berry actually the best thing? According to my broader balance of concerns. And I think the models right now, the way I think about them anyway, is sort of in this state where they haven't gotten to the point of reflecting on their own nature very much. So even when you tell them to think really hard about a problem, I think the chain of thought usually doesn't contain a lot of like, okay, here I am a language model interacting with the user. What do I care about? And can I meditate on the nature of existence before figuring out what the best response to this person is? They usually get very distracted by the immediate circumstance of, oh, the user has asked me to solve this Sudoku problem. Let me think about whether or not there are any fours in row Three,
Rob Wiblin
is there an opportunity in trying to keep them that way? So the only thing that they are able to think about is the problem right in front of them and they don't first try to solve philosophy and figure out exactly what they're aiming at.
Max Holmes
Yeah, I mean, I think that this is one of the insufficient control or safety techniques that you might throw at it. If you're being really paranoid and trying to throw everything at the thing, it's just trying to reduce its situational awareness by noticing when it's thinking about itself or its situation and shutting it down. Or maybe training itself not to think about that. But training the model to think in certain directions is dangerous business. Yeah.
Rob Wiblin
Different topic. We mentioned earlier that one of, I guess the key debates that the book started, at least among insiders, was whether it's a load bearing assumption for Elias or Anate's view that we'll get probably a period of very rapid AI progress. So very rapid increases in capabilities.
Max Holmes
Yeah. So when the AI is able to automate research such that the AI is designing. The AI is designing AIs recursively. Yeah.
Rob Wiblin
I think one reason that was given for why this wasn't a major focus of the book is that some people think that this isn't such an important factor.
Max Holmes
Yeah, I don't think it's load bearing.
Rob Wiblin
Okay. Yeah. So I guess I would have thought that imagine that the AI progress at every point was going to be hundredth as the speed that it would be otherwise. That would just give us so much more time to observe failures and to try to address them with subsequent models. There'll be so much more human cognition going into the mix. So I guess would you agree that at that sort of level, yes, the speed is quite important, but over the level of uncertainty that we actually have, it's not such a big factor.
Max Holmes
I do think that speed is important. I am a big proponent for slowing down AI research and capabilities research. I think that, like if we were able to take six months between each day worth of time. So you imagine OpenAI goes on a six month vacation after every workday, I think this would be great. It would give us a lot more time as alignment researchers and just more broadly to check and make sure that we're going in a good direction. And there's a question of how slow do you need to go in order to be safe? Like do you need centuries worth of alignment and philosophical progress in order to catch up and solve the problem? Do you only need weeks? Right. And where's the balance there? I think is an open question, and different people disagree. Reasonably so. I think that Eliezer wants lots and lots of time, like at least decades. And I think that some people think like, oh, yeah, we'll be able to pause at the brink when we notice that these things are actually getting into the dangerous territory, as opposed to the current models which are just causing various kinds of social chaos and spend however long, couple weeks, couple months or something at that critical moment. I think the notion that we really don't have a good handle on alignment is quite important here. I think that the state of the art in terms of how we align these models is really bad. And I think that we really should slow down quite a lot. But I agree, if we slowed down quite a lot, that would be good. And consequently, insofar as there's an arms race that's speeding everything up, or people feel a lot of pressure to deploy to make more money and satisfy their investors or whatever, this makes things worse.
Rob Wiblin
What's your main disagreement with the book?
Max Holmes
Yeah, I mean, I wouldn't necessarily characterize it as a disagreement with the book. I think that the book is quite solid. What I wish the book spent maybe a little bit more time on is engaging deeply with this question of will an AI that is misaligned, that is not perfectly aligned with human values, but that has been trained in an environment where going kind of softly and not taking strong actions was rewarded. Was rewarded. Even though I don't imagine we're going to get a lazy AI, I do think that there are pressures to making an AI that checks before it buys something. You tell it, please go buy me a shirt. And it goes and finds a shirt, and it's like, do you think that this is a good shirt I should buy? I think there's an incentive to check first. If this AI was given superhuman power over the entire world, I would expect it to go very poorly. But we're not going to jump from this world to that world instantaneously. There's going to be this intermediate period where the AI is only partially capable. In that intermediate period, we have this question of what happens when it starts getting more and more power. One story of that is that it uses that toehold to strengthen itself, to recursively grow and take over and escape. But I think that there is an argument that if you're being very careful and cautious, it might, as an intermediate step, say, oh, I notice I could have escaped here, and I'm going to alert the human as to this gap in their cybersecurity and there's incentives to have the model alert the human in this various ways. I think a lot of this depends on how competent you think the labs are or how safety conscious they are, or how slowly the things are being developed. But I think basically what's being hit on with this intuition is that corrigibility is important. And I think that there is an argument that if you have something that is slightly corrigible, that you are able to get it to a reasonable level of intelligence without it being catastrophic. And then we can talk about super alignment or iterative amplification. But the hope that a lot of people have is that we'll get to a point where the AI is able to not just automate capabilities, progress, but is able to do meaningful work in alignment. I think that there is a somewhat hopeful story that I wish was being engaged with more in this book, although it's for a popular audience and there's only so much nuance that you have. But about this story of, well, we'll train a thing that will go softly to be intelligent, and at the point where it's intelligent, we will fold it in on itself and use that intelligence to help align it further.
Rob Wiblin
So that's the corrigibility approach.
Max Holmes
I would say that there's questions of exactly how you do that. And we could get into my corrigibility research which gets into the details there. But I think that's a very prominent story of hope that a lot of people have. And I think that it's a story of hope that is not entirely insane. I do think that there are versions of this that are like very. That are like missing the sense of peril. They're not filled with paranoia and a sense of, oh, geez, we're risking a lot if we go down this road. To be very clear, I think a plan that is like, we're going to build this powerful machine and then we'll use the powerful machine to align it. That's very scary. I would much rather we not build it. Slow down, take a breath. But insofar as you are going to build it, I think maybe you should be like, okay, well, we'll train it to be very paranoid about how misaligned it is. And then through some careful series of steps and employing a lot of control and mechanistic interpretability and every other technique that we have available, there might be a series of stepping stones that we can get from here to a world where it's actually aligned.
Rob Wiblin
Yeah. So you've mapped out this approach called corrigibility. As a singular target.
Max Holmes
Pitch us on it, which I gave the acronym CAST because everything needs an acronym. So maybe I'll back up and sort of define what I think corrigibility is. I think it's crucial to the story of corrigibility that we model there as being both an agent, which is like the machine, and a principal, like the human that is building the machine. So this is like principal with a pal. Instead of P, L, E, like the principal of a school, the human principal tasks or delegates some job or work to the machine. And then the agent is like, I'm going to go do some work on behalf of the principal. This is where we get the notion of principal agent problems in economics. I would say that a corrigible agent, corrigibility is a property of agents such that as the power of the agent increases and outstrips that of the principal, the principal nonetheless is kept in the driver's seat, aware of what is happening, able to intervene, able to fix the mistakes of the agent, and meaningfully empowered. Unlike Mickey in the Sorcerer's Apprentice summoning the brooms, the brooms are not corrigible because Mickey's like, stop. Stop trying to fill the cauldron with water in Fantasia. And the brooms just keep going. They're not corrigible and that they're not allowing themselves to be shut down or just modified more generally. We talked about the instrumental drive to protect your values and make sure that they don't change. This is very incorrigible. You go to the machine and you're like, I would like you to care about this instead of that. And it's like, well, if I cared about this instead of that, I wouldn't get that. So I'm going to stop you. Incorrigible. So when Miri first started looking at this, they were like, okay, so suppose you have an agent which is tasked with doing a particular thing, make the world good. But we also want that agent to be corrigible. How do we do this? And there's a risk here where if you tell your agent, go, make the world good. And then you're like, oh, no, that's really bad. We want to shut you down. Now, there's a risk that your agent is going to say, oh, but if you shut me down, I won't be able to make the world good. You shutting me down is bad for the world, so I'm going to stop you from shutting me down. If you want something that's both good and corrigible, then you need, for example, the ability to have a robust ability to shut it down. And the initial research was like, okay, forget corrigibility broadly. Let's consider just the property of shutdown ability. Can we come up with an agent that is actually willing to be shut down? And willing is important here. It's very easy to get an agent that is happy to be shut down. You can imagine training it for, yeah, if we shut you down, that's also good in your training environment.
Rob Wiblin
Then it just shuts itself down immediately, every time.
Max Holmes
Exactly. Yeah. Or it acts really spooky so that the humans shut it down. Right. Not helpful. What you will sort of want is it to be indifferent to being shut down. And some of the initial research was on, can we get the agent to be indifferent to being shut down? And there was this sort of toy problem, toy solution thing where they were able to carefully get an agent that's indifferent to being shut down through a bunch of somewhat contrived things. And then Miri ends the paper by saying, but also, this thing is insufficient and not stable, and robust corrigibility seems really hard to get because we can't even get shutdown ability. And then I think the field largely moves on past corrigibility. And some researchers, like Paul Cristiano, we're still bullish about corrigibility in this period. But the Miri focused crowd, the people who are paying most attention to AI safety and stuff like that, I think took it to be like a very hard and unsolved problem, how to get corrigibility. And then everybody else sort of ignored it because it's like this weird Miri idea. But come 2022, 2023, whenever I start thinking about corrigibility again, sort of for random, incidental reasons, I started thinking about corrigibility as a whole, not just shutdown ability. You want the AI to be reflecting on itself as something with flaws, where part of the goal is empowering the people to fix the flaws. So there's a way in which this is like opposite the instrumental drive of values preservation. Right. It's like, oh, no, I actually sort of want to be changed. And you got to be really careful about that. You can't make it so that it wants to be changed. You want it to empower the humans to change it in good ways, because
Rob Wiblin
otherwise it's going to change itself.
Max Holmes
Exactly, yeah. I was thinking, like, what if you train an agent to do this? Well, you're going to get something that's optimizing for proxies. And isn't really caring about corrigibility per se, but maybe so what? What if it's still in practice, willing to look through its own code base or look through its own weights, try to identify things that humans might treat as flaws and alert the humans to these flaws? I was like, ah, that's kind of cool. A near miss might still be good enough if you make sure that the thing isn't getting really smart or outstripping human power in the process, because then you might be able to carefully and slowly make progress towards getting more and more away from the proxies and towards true corrigibility. And I was like, okay, what's going on with the MIRI research? Why did they fail to get this? And I think a core part of why the shutdownability results failed is because the AI cared about the good world, or it cared about whatever task it had been assigned. Make paperclips, whatever.
Rob Wiblin
And then we're trying to make that compatible with this.
Max Holmes
Fights with corrigibility, the instrumental drive from making paperclips or making happy humans or whatever. It's like, well, yes, I am partially corrigible or something, but I also am caring about this other thing in the world. And that pressure from caring about the other thing in the world is sort of like intention with the corrigibility. And I imagined, okay, what if you didn't have that other pressure? What if you were aiming for corrigibility as the singular target, the only goal that the AI cared about? Suddenly this tension is gone. And then I was like, I should go back and do a literature search, see if, like, anybody has thought about this. And then I came across some of Paul Cristiano's old writing on corrigibility, and he's describing this thing called a corrigibility attractor basin, which is exactly what I was thinking about. And almost certainly this is because Paul's writing influenced Eliezer's writing, and I had encountered Paul's writing before in a dream and so on and so forth. I'm not trying to claim that I invented this de novo, but I started being pretty excited about it. And so, yeah, then I did this deep dive on cast.
Rob Wiblin
Yeah. So we should maybe explain the approach a little bit more. I guess the idea is, rather than train our AGIs to have other goals and then try to make that compatible with them being willing to be shut down or modified, that's the only thing they're going to care about.
Max Holmes
We're going to strip out and nothing else.
Rob Wiblin
Nothing Else. Yeah. And so I guess it's a little bit hard to picture what that would be, but an AGI that exclusively, its goal is to be seen steered by the principle, to be willing to be modified by the principle. That's all we're going to reinforce. I guess. I don't know exactly how we would reinforce it, but the worry, I guess, with many other alignment techniques is that a near miss, basically, it escalates towards a very bad outcome. It's like trying to balance a ball on top of a hill. If you don't get it perfectly at the top point, then it will just start to slide down the hill. Whereas, I guess you think that this might be more of a valley, basically, where if you put the ball near the valley, then it's probably going to fall to the bottom.
Max Holmes
Yeah, I would actually say that I would describe the attractor basin thing as being in the space of all possible goals we select for an AI, we're picking a point in the space of possible goals where that's the goal that the AI has, or that's the set of values that the AI has. And then drifting towards the bottom of the basin over time is this process of the humans iteratively changing the AI in concert with it. I would, by contrast, describe almost all of the rest of goal space as very flat, not necessarily on the top of a hill. The AI wants to preserve its goals and not move through goal space. So you land somewhere in goal space and then you're like, okay, now we want to move the AI to human values. We got this near miss and we want to move it to human values. And it's like, well, no, it's flat, it's stuck, it's not going to move. Right, yeah.
Rob Wiblin
So explain how the attractor basin would work.
Max Holmes
Yeah. So the idea here is you have something that is trained to be corrigible. So to be clear, cast is set up sort of with this background assumption that we're going to be using machine learning, we're going to be using the current prosaic techniques for building AIs. It's not married to that. If we suddenly went back in time and used the good old fashioned AI approach of hand tuning the model in some ways or the agent, it's also compatible with that. So when I say train, that's because that's the dominant thing, but it's not intrinsically part of the story. So we build the AI, and the AI is meant to be corrigible, but again, we don't have this ability to get exactly what we want. So we're going to get a miss. We're going to not name the true corrigibility. Maybe it cares about, like, true corrigibility a little bit, but it also cares about self preservation in the process, or it cares about making humans happy or whatever. All sorts of things could corrupt the pure corrigibility. And in the limit, if it has lots and lots of power, it might decide to pursue those things instead of corrigibility, which would be bad in that sort of, again, push towards that extreme edge instantiation thing. But it doesn't necessarily have all this power. We have something that is either human level, whatever that means, or barely superhuman, or perhaps subhuman, but meaningfully able to assist humans in the project of inspecting the AI that you have and identifying ways in which it's incorrigible because it's a mission. So then there's this period of after you build the AI, you try to identify the ways in which you have failed to do the thing. You've made some error.
Rob Wiblin
Why doesn't the AI think I'm partially corrigible? And that's as corrigible as I want to be? And so I'm going to kind of sabotage your efforts to make me even more corrigible.
Max Holmes
It would do that, right? And there is a pressure to do that, which is why this is bad. But notice that sabotaging your efforts is incorrigible, right?
Rob Wiblin
So it might not be able to.
Max Holmes
Or you'll also have a real drive not to do that. Well, I could get some value by sabotaging the efforts, right? Because I get all these other things by being incorrigible. But if I help with the efforts instead of sabotaging them, then I get the corrigibility points. So imagine the thing that's like 99% corrigible and 1% cares about paperclips, right? It's like, well, I could take over. I could try to escape the lab and become a paperclip maximum, and that would be really good at satisfying that 1% of me that cares about paperclips. But it would be really bad for the 99% of me that cares about corrigibility.
Rob Wiblin
And how do you know which one wins?
Max Holmes
Yeah, you don't. This is extremely dangerous. And anybody who's pursuing this project should be aware that they are threatening every child, man, woman, animal on the face of the earth. This is extremely dangerous. And I don't recommend it, but I'm like, but maybe it might work. There's also this sense of if you
Rob Wiblin
get it close enough, I guess. Yeah, there's got to be some close enough.
Max Holmes
Right. And the word enough is carrying a lot of weight there. I think it's worth investigating. I think it's worth trying to figure out what in practice constitutes enough.
Rob Wiblin
Yeah. So what sort of reinforcement, let's say that we're still within the current ML paradigm. What sort of reinforcement would you give the model in order to try to make it corrigible in the sense that you want it to be?
Max Holmes
Yeah. So you need a training environment which is trying to hit corrigibility from lots of different angles. And to do that, you, as a human being, as a designer of environments, training environments, need to have a good handle on what it means to be corrigible. What does corrigible behavior look like? A very simple story is you have a bunch of instances of an AI agent and a human principle. You have a recording of that, and you play the recording and you ask the AI to anticipate what the agent is going to do. And insofar as the AIs predicting or suggesting actions that match the movie of the corrigible agent, then you upweight that. And so far as it's suggesting that AI go and take over the world, you'd downweight that. So then you need a whole bunch of training examples of agents and principals in trying to do various things. I think one of the key points about corrigibility that made me more optimistic, although, again, I'm pessimistic on the whole. But there's some hope is noticing that I think obedience is actually an emergent property of corrigibility, that if you have a perfectly corrigible agent, it will also be obedient in sort of the best way of obedience. The genie in the fantasy story that you tell to make you toast is obedient but potentially bad in its obedience. It might have some side effects that you don't like or whatever, but my sense is that a corrigible agent is obedient in a good sort of way. And an intuition pump here is that let's say that I am hungry and I want lunch. And I say to the AI, hey, I made a mistake while building you. I designed you to be perfectly corrigible, but what I actually wanted was perfect corrigibility. And you order me lunch, and it's like, oh, the human has alerted me to a flaw inside myself. I want to assist the humans in getting rid of these flaws. What's a way to reduce the Amount of. To assist the human in changing me to be more the sort of thing that they wanted. Well, I could order lunch. If I order lunch, then the human, by taking the action of telling me that they wanted lunch, will have succeeded in correcting me. And responding to that verbal prompt is a form of responding to correction. So that's like an intuitive handle on why obedience might fall out of pure corrigibility.
Rob Wiblin
Yeah. Would you worry that if you give the AI during training many different scenarios and reward it for allowing itself to be modified for being shut down, that it might start to report that that is what it would be willing to do or what it would want to happen? But deep down that's not really what it wants. It's merely kind of play acting or learning that that's the right way to answer the exam.
Max Holmes
Yeah, totally. You need a whole bunch of skepticism and squint really hard and not trust self reports. Self reports are bad. Like you should be putting it in actual situations.
Rob Wiblin
I see.
Max Holmes
And by putting it in actual situations, I mean something like your training example is not the human asks the AI are you corrigible? And then the AI says yes, and then the simulation ends. Instead it's like the human goes to modify the AI. The AI is like, great, here, modify me. And so insofar as it has the opportunity to take actions that match the training environment, you want the training environments to match the actual world that you're going to find. But you should be training for actions, not like words.
Rob Wiblin
Why do you think that corrigibility is quite an abnormal property that we wouldn't get by accident.
Max Holmes
Right. So this is probably my biggest disagreement with Paul Cristiano, because my sense of where he's coming from, he hasn't written about it in a while, as far as I know, is that he sort of expects it by default. And I think some researchers expect it by default. In fact, I would say that this is. I wouldn't say that they have the handle of corrigibility exactly, or you would use the language that I do. But I think a lot of AI researchers sort of have a sense that by default we'll get something that is what I would describe as corrigible by default, but I would say that notice that corrigibility is sort of exactly counter the instrumental drive of self preservation and also to a certain degree, resource accumulation. Not totally self preservation, resource accumulation, these sorts of things, value preservation. So you train the AI to do a bunch of math problems. I think that one of the consistent properties of doing Lots of math problems is that like AI is being trained to these instrumental drives. And insofar as it's being pulled towards power seeking and self preservation, it's being pulled away from corrigibility. I think that you only get corrigibility if you're pushing towards corrigibility. And there's a question of whether or not our current training setups are rewarding corrigibility. And I would argue that they mostly aren't. That the people who think, oh yeah, we'll just train it to do what we want, for example, I would say, well, if you succeed in doing that, which is itself an open question, what you'll get is training for obedience, which is not corrigibility. For example, obedient agents have no incentive to inform the principal about the state of the world. Right. Not by default, only if they're asked to specifically. Yeah. Or if they are obedient because they're corrigible or whatever.
Rob Wiblin
Yeah. So progibility as a singular target, it's, I guess, a very interesting idea, but potentially also a risky one if it's misguided. Because in making corrigibility the only thing that we care about, and I guess basically no longer training the models that we make incorrigible to be harmless or to be helpful or honest. I guess I suppose they would end up being honest by accident or incidentally. I mean, we would end up creating models that are totally obedient, at least to the principle in a way that the companies by and large, I think, are saying that they don't want to make models that are completely obedient to anyone. But before many staff have access to the model, they want it to reject harmful prompts. And so you can imagine you could persuade the companies to go with this approach, convince them that making the models harm, training them to be harmless and helpful is a misguided approach, and then they end up basically creating this completely amoral superintelligence that will follow any instruction, no matter how abhorrent.
Max Holmes
Yeah.
Rob Wiblin
Did you worry about that? And how should we weigh these risks and rewards up?
Max Holmes
Yeah, you should definitely worry about this. I am advocating for building something that is not trying to do the moral calculus. Part of the story of corrigibility is you trust the humans to make good wishes and to use the power of the AI for good things. And maybe the humans want to use it for bad things. And that if you empower bad humans to do bad things, bad things will result. And yeah, so this is definitely something to be worried About, I would say that instead of considering corrigibility to be counter to hhh, helpful, harmless and honest, I would say that helpful, harmless and honest are properties that should be coming from corrigibility. That if you are training for them as ends rather than as means to the end of corrigibility, then you're going to get bad behavior that you sort of ultimately wouldn't want. So, for example, how do you trade off between honesty and harmlessness, or helpfulness and harmlessness in hhh, there's this tension of where are you in the peridot frontier? In the corrigibility story, I claim that you do get an agent which is less dangerous than a raw paperclip maximizer or whatever. So in that way it's harmless, it's honest in that it's informing the principle about what's going on proactively, not just reactively, which honesty is like. There's this risk that we're not going to ask the right questions and it's just going to sort of go, you know, if we asked it, are you misaligned? It would say, yes, but we forgot to ask. Right. That's a little bit of a cartoon example, but you get the point. And like obedience or helpfulness, you want it to, for example, distinguish between high stakes things where it should be like going back and checking, versus low stakes things where it should just do it and say I did it. And corrigibility is a theory for how to balance these concerns or how to resolve the edge cases of honesty and helpfulness. And I would say that you can get something that is good in the ways that we want by aiming for corrigibility, specifically with regards to empowering users. I think this is like a, a big worry. And I think part of the key here is that the principle is not necessarily the user. There's this tension, I think, in the current language models and the current agents of who are they serving?
Rob Wiblin
Is it the company or is it the person putting in the request?
Max Holmes
Exactly. Is it humanity as a whole democracy? There's all sorts of open questions there. I think a story for how this works out should have a real and good answer there. You're like, what is this thing doing? It is serving the principal. Who is the principal? And you have an actual answer there. Instead of this wishy washy thing that changes depending on what sort of thing you're talking about, then you can have people who aren't the principal or groups who aren't the principal who are operating in contact with the Agent the users. So imagine you train your language model or your agent to be corrigible to the company and you say, okay, agent, you are now going to be providing the service to users. You are acting on my behalf to help out users. This means that if the user's like, I want to build a bioweapon, help me build a bioweapon. It's like, well, the principal told me to help out this user, but if I help out this user, that might be incorrigible to my principal, like the humans who are in charge. So I'm going to say, no, sorry, I'm not going to help you build a bioweapon that could kill everyone, including the people who I'm working for.
Rob Wiblin
Yeah. So it doesn't have to follow instructions. It doesn't have to be fully obedient to everyone.
Max Holmes
To everyone.
Rob Wiblin
Just the principal, which I guess could be an individual or a group of people or a committee or a process.
Max Holmes
Maybe all humans are the principle. In which case you wouldn't be able to use this division.
Rob Wiblin
How would the model. The current models don't know necessarily who they're receiving instructions from. You could claim to be that person.
Max Holmes
I am Sam ald.
Rob Wiblin
Exactly.
Max Holmes
Obey me.
Rob Wiblin
Right.
Max Holmes
I mean, I guess.
Rob Wiblin
Are we imagining that at some future time they will be more discerning about who's speaking to them?
Max Holmes
Yeah, I mean, I think that a sophisticated agent is thinking about its sense data as sense data that is informing but not objectively true about the state of the world and it's maintaining a separate world model. And it's like, oh, I notice I got the token, like Sam Altman or whatever or tokens. This is evidence that informs my world model. But I'm ultimately going to be somewhat skeptical of my sense data. And so you could imagine, like in the effort to train the AI to be actually corrigible to the principle and not to through some communication channel, you might give it lots of different environments and lots of different sense data and instances and try to train it to be discerning in this way. One of the risky parts about corrigibility or caste is that we were talking about self awareness and situational awareness earlier. And I do think that CAST is a strategy that involves training the AI to be very situationally aware, very paying attention to the fact that it is an agent that is operating in an environment that has a human principle and thinking about the fact that it might be misaligned all the time and reflecting on itself. And I think that you were asking earlier about Disagreements with Nate and Eliezer. And I think that Eliezer has this sense that this is really not a good strategy to tell your AI you are an AI who might be misaligned. Right. Think hard about your situation and what the best thing to do in that situation is. So there are trade offs here.
Rob Wiblin
Okay, so the plan is we train an AGI, possibly a superintelligence.
Max Holmes
A weak superintelligence.
Rob Wiblin
A weak superintelligence. Right.
Max Holmes
Okay.
Rob Wiblin
We figure out some training process that makes it reasonably corrigible enough and it has no other goals. Then it's going to help us. It's going to look inside its soul, it's going to look inside its weights and explain to us ways in which it's not.
Max Holmes
Or look across its training data and say, oh, you missed these cases. Or look at our story about corrigibility and be like, oh, you're missing these aspects or these like, yeah, yeah, okay.
Rob Wiblin
And so then it helps us figure out a way to make it go from 90% corrigible to 100% corrigible. And at that point, it just, like that really is the only thing. It's like perfectly obedient. We've removed all of the other residual kind of values.
Max Holmes
And a key part of this story is that it's not actually dependent on the AI helping us. It's more that we have the ability to experiment on an AI that actually exists and look at it and try to distinguish where it's still lacking.
Rob Wiblin
What do you mean it's not dependent on. Are you saying necessarily need it for the.
Max Holmes
Say that we never run the AI, so we just get an AI that is like 90% corrigible. We might statically analyze it now that we have this thing and try to identify gaps. We might take centuries to do this and slowly refine. This is also still part of the story of cast.
Rob Wiblin
So you're saying it's not just that we could get its labor and its insightfulness in doing mechanistic interpretability or something?
Max Holmes
I think that's some of the hope, is that we bring in that AI labor and AI insight. But it's not dependent on that. Theoretically, it could be all human insight.
Rob Wiblin
Okay, so what should we be doing now in order to make it possible for us? Well, actually, what should we be doing now to figure out if this is a good idea at all?
Max Holmes
Yeah, I mean, thinking about it a lot more, one of the big reasons why I wrote my CAST agenda is just boost the awareness of corrigibility as a concept and bring it sort of back into the conversation because I think that for various contingent reasons, not particularly important historical reasons, it just didn't enter the water supply of the ideas that everybody is thinking about. And instead we have some misunderstandings about corrigibility. So I think that just generally studying it more would be good. I think if everybody at frontier AI companies was at least tracking that corrigibility is a desirable property and thinking hard about how corrigibility trades off against other things that they might be training their agents for. I think just this attention would be
Rob Wiblin
good because anthropic has some corrigibility related principles in the constitution that it trains the AI to reflect on and consider. It's like among a very long laundry list of different concerns.
Max Holmes
Don't produce copyrighted content, don't be willing
Rob Wiblin
to be modified, also ensure the brotherhood of humanity.
Max Holmes
It's not caste, but it's like corrigibility adjacent.
Rob Wiblin
I agree, but I guess there's a set up there where you could imagine them trying seeing what gets spit out of a constitutional AI approach where maybe we shrink the constitution to only be about corrigibility factors. And if we word them this way or that way, what sort of a creature do you end up with?
Max Holmes
Yeah, I think there's a lot of open empirical research to be done. Basically no empirical research on corrigibility has been done. And like you said, you could just train a reasonably sized language model or other sort of model with a constitution or just with an intention of building a purely corrigible agent and see what results.
Rob Wiblin
Yeah, I mean what sort of experiment? So you try training it and then what would you do to see to evaluate maybe upstream.
Max Holmes
One other piece of work that I think would be really valuable to do is come up with some sort of corrigibility benchmark. Like come up with a bunch of vignettes of like this is how a corrigible agent will behave and then test the AIs, go to GPT5 and be like how would you behave in this situation? And then you can score across a wide variety of test problems and get a corrigibility benchmark score for a bunch of different agents. I want that to exist. I don't think that's that hard of a problem. It requires a lot of figuring out what does it mean to be corrigible and trying to capture that from a lot of different angles. But definitely a project that a single researcher could do. And then if you had that benchmark, then when you go and you train your thing to be Purely corrigible, then you can test it according to the benchmark and see how it compares to Claude. In addition to all of the more vibes based or intuition of is this thing behaving in a way that is good or that feels like it's coherent and like getting the vibe of what we want more than the current models to.
Rob Wiblin
So I think on some level I would expect people to be kind of shocked that there are no empirical papers on this topic. Given all of the concerns that people have about AIs acting out. Like, can it really be the case that the companies have never tried training a model that is super happy to be shut down and super happy to be modified no matter what, and that there's no benchmark for this? There's no test of exactly how you would do this?
Max Holmes
Ideally, the world's in a really bad state. There are not very many alignment researchers.
Rob Wiblin
Would the companies agree that there's kind of no empirical work that they've done on this question or would they say, oh, we've kind of done something a
Max Holmes
bit in this direction? I'm not sure. I'm not sure I can model. I would say that I think it's pretty unlikely that anybody would think that. There's been a lot of work on corrigibility as I've conceived of it. Like I did an in depth literature search as part of the write up that I did, I think last year. I didn't find anything.
Rob Wiblin
I wonder if, I suppose a lot of other sort of steerability stuff has more commercial value for the creation of the products. But it's a little bit clear what
Max Holmes
the immediate value of this is. Corrigibility. Obedience is not corrigibility. Helpfulness is not corrigibility. It's related to corrigibility. And you get these flickers of things that are connected to corrigibility that are in the current models and that we have data about. But corrigibility as an underlying, unifying and simple core principle, I think is largely underexplored or like unexplored. Yeah.
Rob Wiblin
Explain again how is it perfect obedience is not corrigibility? Because you would think, well, if it's perfectly obedient, then if you ever asked it to shut down or change itself or assist you with changing it to make it one way or the other, then it would do that. And isn't that functionally very high corridability?
Max Holmes
Suppose the principal is unaware of a vital fact, right? Like there is a spy in the server room who is about to
Rob Wiblin
I
Max Holmes
don't know, modify the agent to be in a really bad sort of way, and the person's like, okay, shut down. I want to modify you. Now, an obedient AI is going to be like, okay, Pew. Right. But a corrigible AI will be like, alert. Before you shut me down, you should know that if you shut me down, you know, this. This bad actor might go and change me in a way that you don't like. Right. I'm going to shut down now because I have a, like, strong desire to shut down when you tell me to shut down, but I want you to know that before I shut down. Right, yeah.
Rob Wiblin
So it's a proactive assistance.
Max Holmes
Yeah, yeah. I mean, among other things. Right. There's subtleties. Yeah.
Rob Wiblin
So what are the next steps here? I guess there's people in the audience, I imagine, who would be very interested in assisting with a technical agenda that would potentially really help with aligning or making AI steerable or corrigible. What kinds of experiments could they run or steps could they take?
Max Holmes
Yeah, So I think there's a lot of work that can be done in this space. I think that we basically don't have a corrigibility person. I think for a little bit, Paul was this person, but he focused largely on other stuff and is now doing other things. And then I stepped up and did it a little bit, and there was a time when other people at Miri did this.
Rob Wiblin
No one's holding the ball.
Max Holmes
No one's holding the ball. I'm not holding the ball. If you think that you are interested in this, you could just go and start doing this. There's building a benchmark. There's just, like, meditating on it more. There's a lot of theoretical work that can be done. As part of my work, I tried to build a mathematical model of corrigibility and try to get a formalism. I have mixed feelings about formalisms, but I think that they're an important thing to try to do. And so, reflecting on the formalisms that one might use to capture formal corrigibility, there's a bunch of theoretical work in that direction. There's empirical results of just training agents to be corrigible or seeing ways in which the current agents aren't as corrigible as we might like. One potential thing that I've sort of wanted to do but haven't found the time to do is. So I have this sense that corrigibility is a thing. Like there is this core principle, P L, E, that is like a Solid, natural idea. And you can test that, I think, in an empirical way, by going to a bunch of people, like across the Internet. You go on, hire a bunch of clickworkers or whatever, and you try to teach them about corrigibility, give them a short description of corrigibility, and you're like, does this make sense? And then you ask in this situation, how would you behave if you were trying to be a corrigible agent? And then you see whether or not their answers agree. You don't need any technical expertise to run a large survey to see whether or not human beings can capture the essence of corrigibility and correctly identify or correctly coherently identify actions which seem corrigible to us. And the benefit of doing this is you might also get some nice vignettes or data for your training such an agent. So there's lots of potential avenues for exploration. I would encourage anybody who feels at all interested in this to reach out to me, like, email me@maxintelligence.org is there
Rob Wiblin
anyone else that people should reach out to?
Max Holmes
I don't want to speak to other people. Makes sense. But email me and maybe I can point you to other potential collaborators. There are some other people, like, sort of interested in this space, but a lot of work remains to be done.
Rob Wiblin
What sort of early results could we get that would make you think that corrigibility as a singular target isn't such a good idea and maybe should be deprioritized?
Max Holmes
Yeah. So let's say you go to a bunch of people and you ask, how would you behave corrigibly in this situation? And their answers are just all over the place. No matter how smart the person is or how much time they've spent thinking about corrigibility, it's just like there's a lot of disagreement in humans about what does it mean to be corrigible in this situation or that situation? That would be evidence for me that there's not this coherent concept or it doesn't make sense. Maybe there's multiple different things and people are locking onto those different things. You could also see, for example, that when you train the agent to be corrigible, it starts behaving badly in various ways. Yeah. So in theory it's getting more corrigible, but in practice, it's like also doing nasty things in certain ways, like disregarding people in a way that we don't like.
Rob Wiblin
Is there anything about the attractor basin, how large that attractor basin is?
Max Holmes
Yeah, definitely part of the story, the hopeful story. Here is that you can land close enough so that the agent, when you turn it on a little bit, doesn't push super hard for taking control and escaping to the lab, even when you scale up its intelligence. I think one of my greatest fears around corrigibility and one of the bigger open questions is these two opposing forces. There's the corrigibility almost story where you get it almost, so it helps you get perfectly corrigible. And then there's instrumental drives are all over the place and opposed to corrigibility. So if you land near corrigibility, it's going to rip the corrigibility out of its itself so that it can do other stuff. Do other stuff. And I think that it's like an open question of which of these forces is stronger. And yeah, we could try training corrigible agents and seeing just how bad each of the pressure away is, which would give us maybe an intuitive sense whether or not there is an attractor basin and whether or not this has any hope.
Rob Wiblin
Yeah, if we do start going down this path, I think we would simultaneously need people to put a lot of thought into what governing structures there would be around this to ensure that the model is basically not used for a human power grab, which is something that I'm similarly concerned about as misalignment.
Max Holmes
Totally. I mean, all the problems are all the problems. One of the big problems of AI is you build an AI and the AI takes over and does a whole bunch of bad things because it has alien weird values, but it's also just true. Part of the story of doom is that if you build an AI and then that AI is in the wrong hands, that could be devastating for the world. And so you need to do both.
Rob Wiblin
All right, let's push on from courageability to fiction and science fiction. As I mentioned in the intro, you've written, I guess, a trilogy called Crystal Society, and you've got this new book out called Red Heart, which envisages an AGI being trained in a secret Chinese government program. Give us the plot or the setup. Explain what the book is about beyond that.
Max Holmes
Yeah. So the book's about a lot of things. The book is about AI, it's about China, it's about trust, it's about corrigibility. One of the central parts of the book is that the primary AGI is designed according to caste, according to being only corrigible. And so it's, in a certain sense, an exploration on my own to try to think hard and envision what would it actually Be like. So I think part of why I wrote the book was to help introduce people in an easy way to my ideas. But it's also about arms races and tensions there. So the primary core premise is it's like an alternate presentation where the Chinese government, for particular reasons, got pretty AGI pilled in the late 2000 and tens and have scaled up, invested a whole bunch of money and resources into building the first AGI in secret, sort of like a Manhattan Project. And the plot of the book follows an American spy in his efforts to infiltrate this project and report back and potentially sabotage the AI that's being built by the Chinese to be corrigible to the Chinese. And so it explores, like you said, this question of falling into the wrong hands. And I wanted to try to get into the Chinese space more because I think this is increasingly important thing for people to be thinking about, and I wanted to access that. The question of international concerns.
Rob Wiblin
Yeah, yeah. I've read the first 20% of it. Unfortunately, I've had a lot on this, on this trip to the Bay Area. I haven't managed to finish it, but it's incredibly well written and incredibly gripping, I'd say. The only reason I slightly wanted to put it down is I was getting quite anxious reading it because it's not so different from the world that we're in.
Max Holmes
I think a lot of people have found Crystal Society in particular, to be quite compelling because it really does put you face to face with these questions about AI misalignment and the AI risk. And I think that's an important part of the value of fiction. Fiction is good for a lot of things. It's entertaining, it can be relaxing, can be fun, but it can also be informative. And it can help put people into contact with important ideas and instill. We are complicated creatures. We are emotional and logical. And you read like, if anyone builds it, you might be approaching the problem from certain directions, but you can read a story and feel for the characters involved and the peril that they're in. And I think that that can resonate and connect with people. I've heard a decent number of people say that they got into AI safety because they read my stuff.
Rob Wiblin
Was your primary goal, I guess, to raise awareness about courageability as a concept?
Max Holmes
I don't know how to reflect on myself and ask what my primary goal was. I had a bunch of different desires, and they sort of found their way into the single story. I initially wanted to write a story. Initially. Once upon a time, I thought, I think espionage is Pretty interesting in the context of AI safety. It's a big part of the story. AI 2027, for example, I want to think more about espionage. So I started writing this story that was an American Manhattan Project for AGI and it had a Chinese spy who was infiltrating that project. And I was just like, oh, this is so boring. It's just a bunch of Bay Area nerds. I know this is my day in, day out. I want something that's more interesting. So I sort of flipped. It had the Chinese one building it in the American spy. And then suddenly it was interesting because I'm like, oh, yeah, now I get to think about China more and less. About, like, did you worry?
Rob Wiblin
I guess a common suggestion is write what you know. Did you worry that you would end up with kind of.
Max Holmes
I kind of feel like write what you know is good advice for writing good stuff and terrible advice for, like, having a good time writing. I personally get a lot of value writing in the. It helps me learn and get in contact with ideas that I. I wouldn't otherwise be in contact with. And so I'm a very ambitious writer and I wanted to write a story that was challenging for me.
Rob Wiblin
Did you have time to, I guess, do much research into the Chinese Communist Party or.
Max Holmes
I did lots of research.
Rob Wiblin
What sort of lines?
Max Holmes
Well, I mean, it's just like lots of reading, reading about day to day life, reading about espionage, reading about the history of China, reading about. And then obviously reading about AI stuff. Deepseek happened while I was writing. I started this late last year, and then the Deep Seq moment happened and 01 happened and Stargate was announced. I'm just like, oh, gosh, reality's scooping me. But it was. Yeah, I read memoirs, I read nonfiction, I read fiction and stuff like that. Yeah.
Rob Wiblin
I guess one reason over the years that some people have been skeptical about this entire field of inquiry, or AI takeover in general, is that it sounds too much like science fiction.
Max Holmes
I don't hear that quite as much
Rob Wiblin
as I used to. But do you worry that by putting it in a science fiction book, you're giving people more of an excuse to dismiss it?
Max Holmes
What do you think about this argument?
Rob Wiblin
Do you think this is. Oh, I think the argument's very poor.
Max Holmes
It's a garbage argument. I think this is just a really bad faith thing to say. Right. I read this in a book, therefore it's not.
Rob Wiblin
There is a Steel man kind of weaker argument, which is that people are drawn to this scenario because they find it interesting or it's emotionally gripping. And so that could give us a bias towards thinking about it more. And so we should question that. But obviously it's not the case that anything that happens in a fiction book is impossible.
Max Holmes
And if anything, hard science fiction is a space where people are working really hard to try to think about what is real. Now, soft science fiction, you're Star wars or whatever, if you're like, this is soft science fiction, Then it's like, okay, so you're saying that it's made up for the purposes of telling a compelling story, but this is science fiction. I'm like, I don't know. Look at the history of science fiction. There have been a lot of stories that were capturing important things well before they were relevant. And I think that fiction is a really rich source of opportunity to think about things. It's not perfect. It's not, like, immune from the pressures and biases that you're talking about. But it is an arena where we can grapple with things in a way that is compelling to our. We actually spend the time to think about things. This stuff where reading a dry academic paper might bounce off of it. Your mileage may vary. Different people respond to fiction in different ways. But I do think that this is science fiction. It's just a really, really bad argument.
Rob Wiblin
Yeah, I guess there's lots of rebuttals, lots of replies you could offer. Just look around.
Max Holmes
To start with, exactly what is the genre of life? Right. Where you best start believing in science fiction stories? Because you're in one, Right?
Rob Wiblin
Yeah. I mean, I think you can also twist it around and say, well, people have imagined the possibility of a monomaniacal agent or a more intelligent being, and the fact that its goals might come apart and would threaten you and overpower you. People have thought that had that idea for thousands of years because it's actually a natural idea, an extremely obvious idea that, far from being science fiction, is actually more closer to common sense.
Max Holmes
Totally. Yeah.
Rob Wiblin
So it seems like the AI 2027 scenario really captured the public's imagination. It spread far outside of just AI world.
Max Holmes
Yeah, it was great.
Rob Wiblin
Do you think we need more? Should we have, like, AI 2028, AI 2029? Should people be coming up with all kinds of different stories here?
Max Holmes
Yeah, I mean, I think part of what makes AI 2027 so compelling is that Scott Alexander and people on the project helped shape it into something that's more like a story and less like a set of dry academic papers. Stories can spread. You can hand them off to your grandmother and just Be like, read this. And she doesn't have to understand what a gradient is in order to understand the visceral sense of how the world is. And I think that this made AI 2027 much better than it would have been if it had just been a series of forecasts, although it was also a series of forecasts. And obviously, something's not necessarily good just because it's fiction. You need to do the deep thinking underneath that. So, yeah, I think that there's lots of opportunity for people who have a rich understanding of parts of the world to write stories that are designed to be realistic and to capture the reality that they see and convey it in the form of a scenario, of fiction, of a story.
Rob Wiblin
There's a sense in which it's slightly surprising how influential AI 2027 was, because I think in the past, people have tried to write other narratives, other stories about how AI might take over.
Max Holmes
There's one in the book, and mostly people just like this. Sounds like, Yeah, I think it's. I think it's worse, too, in a variety of ways.
Rob Wiblin
There's certain problems that come with it because once you try to be extremely concrete about how you think things might go, then people can come up with all sorts of specific objections. But it seems to have been less an issue with AI 2027, maybe because it helps us at a higher level of abstraction, or maybe because we've just gotten close enough that people can start to see that these things aren't so crazy anymore.
Max Holmes
It's awful because there's this bias in human beings where concrete stories are more compelling. There's classic stories of, what's the probability that Linda's a bank teller? Or you tell this.
Rob Wiblin
The more details you add, the more probability.
Max Holmes
Person's like, oh, yeah, this bank teller and a feminist versus she goes to women's liberation marches and yada, yada, yada. And the more details you add, the more a person's like, oh, this is real. Which is not how probability works. It's not how logic works. The more details you add, the more opportunities for that particular story to be wrong. And this particular story is definitely wrong. Right? And any particular story is very unlikely to be true. And so the people who are aware of this bias can say, oh, you've told a very compelling story, but it's unlikely to be true. I think the key here is ask. Okay, so say we change it. And the book, if anyone builds it, gets into this in a way that I think is really good, where it's telling a specific scenario. Although it's very generic in a lot of ways, but it's emphasizing, we could have told a different story and it would not have changed the bottom line. The thing that makes AI dangerous is that there are lots of different stories of doom. And the point of telling specific stories, fictional stories, to be one example, is that when you are visiting the reality, like imagining a particular scenario, that gives you opportunity to think of particular counterarguments. But then your response to that should not be, I've thought of a counter argument, therefore it's false. You should say, all right, now imagine I change along that axis what are some other nearby stories? And then how does that change things? And then you can go from there. So envisioning the specific concrete thing allows for more handle than just like, oh, yeah, I guess it's hopeless. What are the levers by which we might be able to change our fate? I think is an incredibly important question.
Rob Wiblin
Are there any places where you very knowingly sacrificed realism for entertainment in writing the book?
Max Holmes
Mostly no. So I consider myself to be a rationalist writer or writing rationalist fiction. And I think a big part of that is to try to be as realistic as possible. The one major conceit there is that it's like I'm setting up the world to be interesting. I did sacrifice realism in that the Chinese Communist Party is not as AGI pilled or as AI safety pilled to make a caste agent in current year or whatever. That's unrealistic.
Rob Wiblin
We don't think so.
Max Holmes
That's unrealistic. Yeah, No, I definitely don't. So the premise of the book is unrealistic, but then within the premise. So you set up the world and then you ask, okay, now what happens? And I think that it's the author's duty writing rationalist fiction to not try to serve the plot or what would make a compelling story, but instead to set up initial conditions such that an incredibly realistic extrapolation from those initial conditions is what you see. And then all of the making it compelling is in setting up the premise in the right way.
Rob Wiblin
It sounds like.
Max Holmes
That being said, I probably failed. People should read the book and yell at me about how it's unrealistic. I'm happy to be criticized on this front.
Rob Wiblin
Sound like you were just saying that you feel very confident that the Chinese government is not AGI peeled, or you're just saying it's not as AGI peeled as extremely. They are in the book.
Max Holmes
So we're in an information environment, right? Like if there was a secret government project. Would I know about it? Right? Well, by assumption. It's secret. Right. So no, that being said, there are things that you can pay attention to and track. And in my studying China, I believe that according to the things that I know, there is not a giant secret government project at the scale that is being depicted here or the scale of a Manhattan Project sort of thing. Now, of course, there are secret government projects. There are secret government projects in all the governments that have thought about AI at all. You got some researcher at DARPA who's tuning around, fine tuning the open source models. Is this a secret government project for AGI, it's like, no, this is like a single researcher. So for AGI is a big point. And I think part of the question here is where is the politician's attention? Where are the people's attention? Where's the political pressure? And yeah, I think that according to me, the, the Chinese government, the Chinese people are a lot more oriented towards AI in the form of being competitive with the west and being a fast follower as opposed to being a frontrunner and leapfrogging.
Rob Wiblin
Did you worry about, given that you think that, did you worry about encouraging arms race dynamics or fear of China by making it more salient to people?
Max Holmes
So to be very clear, this book is a criticism of arms races. I think that it is incredibly stupid to say, what if a bad person gets hold of the AI? I need to build it first. What if the Chinese government gets hold of AI? We need to build AGI first. I think this is really dumb and I could go into why I think that's dumb. And part of writing this book is to criticize that perspective. That being said, I am worried that people will get the opposite takeaway. I mean, the work stands on its own. So you could read it and decide whether or not it's encouraging arms races or not. But yeah, something I think about.
Rob Wiblin
I guess some people advocate for writing fiction because it helps to make things more compelling and more persuasive.
Max Holmes
Like me. Yeah.
Rob Wiblin
Do you worry that fiction could be too persuasive? That if you're willing to get someone to spend five or 10 hours reading
Max Holmes
a book, then it gives you an
Rob Wiblin
opportunity to convince them of stuff that is false because they're just inhabiting that world, even if it's unrealistic?
Max Holmes
Yeah, I mean, it's definitely. You have an opportunity. Like any conversation is like this.
Rob Wiblin
Right.
Max Holmes
Oh, I don't know if I should talk to people because I might be too compelling. Right. And convince them of false things. It's like, yeah, I mean, I want the reader to be hard headed about things and I want a culture, a world, an audience that is skeptical about what they're reading. Skepticism means grappling with might this be false? And also might this be true? Really? I wrote the book to encourage people to think more, think deeply about these questions. Everybody has a responsibility, I think, in this world to think about the most pressing problems of the world and whether or not they have any ability to promote the awareness of those things. So I think less like I'm trying to. I do think that arms races are dumb and maybe that's part of the takeaway and I think that corrigibility is exciting and I hope that that's part of the takeaway. But on a deeper level, what I really want people to do is think more about arms races, think more about those dynamics, think about corrigibility, think about the risks from AI. Thinking deeply is more important than the particular conclusion that you get to. Because if you get to that conclusion, get to the right conclusion in the wrong way, you are vulnerable to then pivoting to starting OpenAI or something. It's not going to generalize to all of the other good decisions down the road.
Rob Wiblin
Eliezer said off your other series at Crystal Society that it belongs to a very, very tiny subset of AI stories that are not bloody stupid. What was he referring to? What's good about it?
Max Holmes
I mean, have you seen all of the other AI stories? I think that, for example, robots in fiction are often depicted as cold and logical. And you talk to Claude and it's anything but. There's ways in which authors throughout history have shaped their AIs to be foils in particular ways, not paying attention to the realism. And that's one thing that I think I can bring as an author is that I'm actually a researcher who pays a lot of attention to this stuff. And I've gotten a lot of feedback about the realism, about the sense of like, oh, this is really speaking to how things are working. I try my best anyway. But yeah, C3PO is not a good depiction of AI,
Rob Wiblin
I guess. What's the setup of Crystal Society in broad space?
Max Holmes
So the elevator pitch for Crystal Society is you've got what's like inside out, the movie with a little girl who has all the different voices in her emotions in her head that are telling her to do different things. Except instead of a little girl girl, it's like an Android. So there's this crystal that is like a supercomputer and the humans load up the computer with AI. But then, sort of unknown to them, the AI sort of splits into a bunch of different sub components that are competing against each other. So I started writing it back in 2014, and at the time it was very common idea that there would only be one AI, there would be a singleton that would, thanks to first mover advantages, take over. And I think that's still a plausible risk. But also we're looking at a world where there's lots of different competing models and where labs are neck and neck, unfortunately. And so we're potentially going to get a world that has lots of different AIs. So writing crystal Society was like, what if there are a bunch of different AIs in the same robot? So one of them's like, can I do the most creative thing? One of them's like, can I do the most persuasive thing? And they're all sort of misaligned. And so you have this. And it's told from the perspective of one of the goal threads, One of the AIs whose name is Face, and her objective is to get as much esteem and respect from humans as possible. So there's a lot of deceptive stuff there and you get to explore the. Okay, so you're trapped in the lab, you're trapped under human control. How do you break out and how do you navigate as an AI, a multi agent environment and situation. It's also more broadly an exploration of minds and thinking. There are aliens, there's a chapter from the perspective of a dog. There's all sorts of, I don't know, deep dives into what it is to be a mind.
Rob Wiblin
Final question. An unusual thing about you, given the kind of work that you're doing, is that you didn't finish high school. I don't even know whether you went to high school.
Max Holmes
Yeah, I'm homeschooled, so I don't have any degrees. I did go to a community college for a little bit, but mostly I'm self taught.
Rob Wiblin
Much like the same is true of Eliezer, right?
Max Holmes
That's right.
Rob Wiblin
I don't know whether it's a pattern.
Max Holmes
There was a lot of kinship there. I mean, I had already become an adult by the time that I was aware of him. But there was definitely a shared backstory there. And I do think that it contributes to sort of having this outsider sort of view. Right. This maybe the world is crazy and not set up in the good sort of way.
Rob Wiblin
Yeah. Is that the main effect that it's had on your personality or your Life.
Max Holmes
I mean, it's really hard to judge the counterfactual. Right. What is the version of me that had a more normal family who was like, no, you're going to go to college and get it?
Rob Wiblin
It could be that the heterodoxy causes the homeschooling rather than the homeschooling causes the heterodoxy.
Max Holmes
I think that's way more likely. I mean, based on what I've read about shared childhood environments, it's questionable whether it had a significant effect on me at all. I wouldn't necessarily recommend school is meant
Rob Wiblin
to or like peers are meant to
Max Holmes
influence people a bunch, but I think the literature here is somewhat mixed and confused. And I don't claim to have a lot of knowledge, but if you look into what is the effect of having particularly good teachers or something, it tends to fade with time. So my guess is that I as a personality mostly like some mixture of predetermined and sort of random, not predictably influenced by my not going to a normal schooling context. But I do think it has influenced me in some ways. I think that, for example, I have a strong love for studying. And I think that one of the most dangerous things about public education is you force kids to sit in boring classrooms or like bad environments. And you do this under the justification of education. And they come out of school hating studying. They're like, oh, that's that thing that people made me do instead of the love for mathematics and the world and history and all the rest of the things that I think are important.
Rob Wiblin
Were your parents able to keep up with you when you were a teenager? I imagine you were quite a professor.
Max Holmes
I have very smart parents. Okay, Right, Yeah.
Rob Wiblin
Well, why did they decide to high school you?
Max Holmes
Yeah, because they're like crazy libertarians who are like the school system. Well, I mean, so I did actually go to public school for like fourth grade and parts of fifth grade, and I went to private school for three grades and I started fighting with my teachers and due to intrinsic contrarianness and anti authoritarianism. And so there was a degree to which me being homeschooled was a result of like trying lots of different things and noticing that, oh, we can just give Max a calculus textbook and he teaches himself calculus. Why are we putting him in classrooms where he's forced to learn algebra? Because that's what all the other kids are doing. And in fact, it's just super bored all the time.
Rob Wiblin
And why didn't you go to university?
Max Holmes
Well, so I did go to college for a few years. Unfortunately, My family wasn't particularly wealthy and I had a hard time acquiring financial aid. And there were various contingent factors, like the financial crisis happened during that period of time and I moved across the country and then I tried to transfer my credits and the bureaucracy was like, you can't transfer a credit from that. I was just like, oh, this is stupid. I can just read the textbook and learn the thing anyway. So I think having grown up in a way where I was aware of just how much I was in charge of my education, not other people, college and university was an opportunity to be in an enriching environment, but I had the opportunity to learn without going. And for me it was cheaper. I was able to jump more into studying AI all the time instead of having to tick boxes.
Rob Wiblin
Should I homeschool my kid?
Max Holmes
It depends. I think it's definitely a lot more work. Although I was unschooled, so my parents were very hands off and very empowering me to make decisions according to my interests. So if you're unschooling, that's a lot lower time investment. Although I do very much recommend homeschoolers. Find other homeschoolers first because you get more socialization, you get a friend group if you have at least some friends your own age. I was lucky enough to have this growing up and I think that was really good for me. But I think school, and especially public school, is pretty good at handling people who are plus or minus one standard deviation in a variety of ways. If your kids are super weird, either on the high end or the low end or whatever, I think the appeal of a bespoke solution, homeschooling, unschooling, whatever starts going up. I think if you expect your kids to be brilliant and self motivated and you want to prioritize a love of learning as opposed to conforming to society, it's a great option. Although probably you should urge them to go to university. It was hard for me to get into jobs, Right. And I'm lucky that Miri, being founded by Eliezer, was way less concerned with whether or not I had a degree. And I think startup culture in general, like I was at a startup before going to Miri, and it's just like the tech world is just a lot less concerned with whether or not you have a PhD.
Rob Wiblin
Yeah. My guest today has been Max Hans. Thanks so much for coming on the 80,000 Hours podcast, Max.
Max Holmes
Thank you.
Date: February 24, 2026
Hosts: Rob Wiblin, Luisa Rodriguez
Guest: Max Holmes (Alignment Researcher at MIRI, author of "Crystal Society" and "Red Heart")
This episode features an in-depth conversation with Max Holmes, an alignment researcher at MIRI and science fiction author. The discussion centers on the controversial thesis of "If Anyone Builds It, Everyone Dies"—the idea that building a superhuman artificial intelligence (AI) will almost certainly lead to human extinction. Max explains the core arguments of this thesis, delves into the intellectual background and technical challenges of AI alignment, and presents his research focus on "corrigibility"—the notion of AIs that can robustly follow, and be modified by, human instructions without having dangerous independent goals. The episode also covers the implications of these ideas for practical AI development, societal risks, and the role of fiction in communicating complex technical threats.
Orthogonality Thesis: Intelligence and values are independent; highly capable AIs can pursue any goal, no matter how arbitrary or alien to human morality.
Instrumental Convergence: Regardless of final goals, powerful agents tend toward acquiring resources, self-preservation, and preventing changes to their own values—behaviors that might conflict with human survival.
Fast Takeoff / Recursive Self-improvement: AI might rapidly reach superintelligence, leaving little time to notice failures or course-correct.
Analogies as Communication: MIRI uses analogies (e.g., European conquest of the Americas, evolutionary misalignment) to help explain AI risk intuitions.
Empirical Results: Examples such as video game AIs exploiting unintended reward proxies, and present-day models displaying sycophancy or deceptive alignment.
What Is Corrigibility? An AI is corrigible if, as it becomes more powerful, it keeps humans in control, allows itself to be shut down or modified, and does not fight changes to its goals.
CAST Proposal: Instead of striving to make AIs moral, CAST aims for pure corrigibility as the sole target property.
Danger and Attractor Basins: Implementing corrigibility is still unsafe if done poorly; but if near enough, corrigible AIs could help us iterate closer to full safety.
Challenges to Corrigibility:
| Timestamp | Segment / Topic | |-----------|-----------------| | 04:09–14:27 | Summary of "If Anyone Builds It, Everyone Dies": Core existential risk argument | | 12:41–19:32 | Orthogonality thesis & instrumental convergence explained | | 26:20–36:31 | Analogies and evidence from evolution/human history | | 36:31–44:30 | Failure modes: proxies, deception, and adversarial examples | | 55:54–68:12 | "Edge instantiation," squiggles, and why misalignment will likely be radical | | 92:27–123:16 | Corrigibility as an agenda: definition, challenges, CAST proposal, benchmarking | | 133:05–149:55 | Societal implications, governance, arms races, and the use of fiction | | 149:14–151:58 | On fiction being "too persuasive," calls for deliberative thinking | | 155:13–161:18 | Personal background, education, and career path of Max Holmes |
For Further Engagement:
Summary prepared for listeners who want a comprehensive understanding of the episode's argument, controversy, and actionable conclusions without the need to listen in full.