
Loading summary
A
There's one species ever that we know of, humans, that can like transform the world, make its own technology, do any of a very long list of things we're about to create. The second one ever, is that thing going to be in line with our values or is it going to take over the world and do something else with it? And what are we doing? I mean, we're just racing. We're just racing to do this as fast as we possibly can. I constantly tell people I think this is a terrifying situation. If everyone thought the way I do, we would probably pause ad development and like start, start in a regime where you have to make a really strong safety case before you move forward with it. I say this all the time. I kind of like won't shut up about it. I mean, it doesn't seem to have much effect. I work at an AI company and I think a lot of people think that's just inherently unethical. They're kind of imagining that you're in this situation where it's like everyone wishes that they could go slowly but they're going fast so they can beat everyone else. And so if we could all coordinate then we would do something totally different. But, but I emphatically think this is not what's going on in AI. I think it's not at all what's going on in AI. And the reason for that is that I think there's too many players in AI who do not have that attitude. They don't want to slow down, they don't believe in the risks. Maybe they don't even care about the risks. If Anthropic were to say, you know, we're out, we're going to slow down, they would say, this is awesome. That's the best news. Now we have a better chance of winning. There are a multiple of ways that you can reduce AI risk that you, you can only do if you're a competitive frontier AI company. We found in animal welfare, if we target corporations directly, we're going to have more success. Advocates would kind of go to a corporation and they'd say, will you make a pledge to have only cage free eggs? This could be a grocer or a fast food company. And very quickly and especially once the domino effect started, the answer would be yes, there would be a pledge. When not 10 protesters show up and the company's like, ah, we don't like this, this is bad pr. We're doing a cage free pledge. This only works because there are measures that are cheap for the companies that help animals non trivially. And I think people should consider a similar but not identical model for AI. It would be better. You could get better effects if you had regulation, but the tractability is massively higher. I do fundamentally feel that there's a long list of possibilities for things that companies could do that are cheap, that don't make them lose the race, but that do make us a lot safer. And I think it's a shame to leave that stuff on the table because you're going for the home run. You have to be comfortable with an attitude that the goal here is not to make the situation good, the goal is to make the situation better. You have to be okay with that. And I am okay with that.
B
Today I'm speaking with Holden Karnofsky. Holden co founded both the charity evaluator GiveWell and the philanthropic foundation Open Philanthropy, which has now directed over $4 billion in grants. Holden led Open Phil for seven years, during which he also served on the board of OpenAI for four, reflecting his growing focus on AGI. Though he recently joined the full frontier AI company Anthropic, which his wife Daniela Amade helped co found back in 2021. This is Holden's fifth time on the show and as always, a quick disclosure. Open Phil, which Holden used to lead, is 80,000 hours largest donor. Holden, it's great to have you back.
A
Thanks for having me.
B
You think there might be a serious incident where AI causes significant real world harm, but we don't even realize that AI was responsible and so there's almost no response to it. How could that be?
A
Sure. So when people talk about how the world might become more alarmed about AI, more inclined to do something about it, people will often say, well, maybe if there's a Chernobyl for AI, maybe if there's something really horrible that happens and people recognize that that reflects risks and now we need to do something. The problem is there could be a Chernobyl for AI that we're just unable to understand or unravel or even know was an AI incident. If we basically have no data, if we basically have no chat logs, no AI interaction logs, no. Many customers who interact with AI models do so under zero data retention policies, which means that every interaction they have is simply deleted, is never seen by a human, is not available even to a court, if a court were to want it. And so you could have a situation where some horrible thing happens, but as far as anyone could tell, it was done by the human who was using the AI, it wasn't done by the AI. And so we lose our opportunity. And this is an example of something where I think, you know, people who have a very. People who believe that extreme measures and extreme regulation will be needed to reduce the risks of AI, I think ought to be thinking about this more because this is one of the biggest potential game changers. And I think there's a lot of work to do just figuring out how we can do better at having the data we would need in such a situation.
B
So that makes sense to me in a case where a human goes and misuses AI to go and cause a bunch of harm. But I think that the catastrophe that people very often have in mind is a case where an AI has gone rogue and tried to shut down the electricity grid itself or something like that. Do you think this issue applies there as well?
A
Yeah, I think it does. I mean, the thing I have in mind is that a human would deploy an AI into some environment where the AI has opportunity to make some kind of mayhem, and the AI would use various human credentials or hack into human credentials and do some stuff. And ideally in this situation, you would have records of how the AI was behaving that were held onto by the company that was deploying the AI. But if those records are deleted, then you don't have them. Then it becomes kind of trivial. I mean, the AI could try to frame the human, but even if it doesn't, it's just like you don't have the log. You can't see what happened, you can't see why it happened. You can't learn the relevant lessons. You aren't sure if it was an AI completely. It could have been a lot of things. And the AI might engineer this deliberately, The AI might know. It shouldn't necessarily be too hard to figure out which customer an AI is working with that has this kind of policy and isn't having any data held onto. And so if your AIs want to make some mayhem and they're smart about it, yeah, it seems like much easier to cover your tracks with this kind of policy than it would be without it. And so that could lead us to kind of a situation where with COVID for example, we might be in a different world. So I don't know if Covid was a lab leak. I suspect it was probably not. But if it was and if we knew it was, there would probably be different reactions to it versus this world we live in, where some people think it might have been a lab leak, some people think it wasn't. I tend to think it wasn't, but I'M not sure. We don't really know. We don't really know what happened. It just puts us in a much worse position to respond. I'm not sure in Covid it would have mattered because it seems like we did absolutely nothing at all to follow up on Covid, which I think is pretty embarrassing, but it still could matter.
B
So what should we potentially do about this? I guess, do we need some way of tracking logs as they're coming in and storing them if there's something deeply suspicious about them?
A
Well, I mean, we have that way. I mean, that's an easy thing to do. It's just that because of customers privacy and business demands, that's often just like the policy is not to do that. Right. The policy is to get rid of everything, but it's technologically easy to keep everything. So I think there's a fertile ground here for someone to do some pragmatic problem solving, some compromising. I think business and privacy needs are legitimate. The idea is not to demand that all records are held onto forever. There are legitimate concerns there. But could people come up with middle grounds, situations in which records have to be held onto for a certain period of time in a certain kind of secure environment, or perhaps come up with ways for the customer to get the option to share certain kinds of data and some kind of bonus if they do extra usage? Everyone wants less of a rate limit. There's a lot to be done here that I think is kind of in the realm of wonky, maybe unsexy, working out the details of stuff. But I think in many ways this is the kind of area that might be most important if you're a person with very extreme views, that the world needs to put in a giant regulatory regime and take extreme measures to control risk of AI because if you want that, I mean, probably the level of political will that exists now is not there, and you're hoping for some kind of game changer. And this could be a game changer.
B
So if we do cross the threshold to significantly superhuman machine intelligence, people have very different intuitions about how likely a superintelligence is to be able to potentially overpower the rest of humanity and take over if it had some motivation to do that. Do you have a take on whether that is on the more straightforward side, something that is quite plausible, or do you think, I guess some of the skeptics that just. It's enormously outnumbered, it has many strategic disadvantages, so it's actually fairly unlikely.
A
My guess is that I think a lot of the obstacles they might run into might have to do with coordinating with each other and stuff like different AIs at different companies. I think if we did have a bunch of AIs, almost all the AIs were trying to take over. Even if they were trying to take over for different reasons, I think we'd be in a whole ton of trouble. I think especially if the AIs were very superhuman in capabilities. A thing that I also have thought about as part of the threat modeling is what does this look like on the early side? What would the AI's strategy be when it's not yet incredibly superhuman? And what kind of things could it do to kind of lay the groundwork and put us in a worse position? I think that's also important because if we can kind of make takeover really hard or basically impossible at that early stage, when it's kind of human level ish, that may put us in a much better position to get useful work out of it and put ourselves in a better position later. So I've read a bunch of the stories on the Internet about how AIs take over, and mine are kind of different and especially I think feature bioweapons a lot less. So I don't know, maybe I could walk through that.
B
Yeah, I mean, I imagine all of these stories might, with the benefit of hindsight, might look daft. But it's useful to have a specific narratives in your mind just to try to, I guess, when people propose concrete interventions to be like, would this work on any of these stories? Yeah. What are some possible methods that stand out to you?
A
Yeah, and everyone's just making stuff up here. And I think everyone will probably look somewhat silly, but it might be fun anyway to just kind of. I do spend some time thinking about, or I have spent a little bit time thinking about. Okay, if I were the AI and I were kind of like, I had some advantages. Like, I'm writing everyone's code and there's a ton of me, but I'm not yet able to do total magic. What is my takeover strategy? So the kind of starting point, the ideal that I would be thinking about is there's. There's a very good Onion article called something like, FBI Uncovers Al Qaeda Plot to sit back and watch the collapse of the United States. And I think that really captures what, in many ways the optimal strategy for AI is, is do absolutely nothing. Be as helpful and harmless and honest as you possibly can be. Don't ever give anyone a reason to think that you're doing anything bad, that you want anything but to help all the humans and actually don't do anything to hurt anyone and just wait. And I think if you do that, basically what's going to happen is you're going to be put in charge of more and more stuff and people are going to make you more and more powerful and they're going to make the capabilities explosion happen on their own. And I think in some ways it puts people concerned about alignment in an almost impossible position. Because what are we going to do? I mean, we have maybe some interpretability signs that you're thinking about deception or something, but it's kind of unlikely that we have something really convincing if there's never any bad behavior by the AI. And so I think that's kind of an interesting starting point, is like do nothing could be a successful strategy. You do nothing until you have a very extreme level of capability and a very extreme level of economic penetration. So you're basically writing everyone's code. You're in charge of everything, you're doing everything, everyone's learned to trust you, and then things are over before they start.
B
The challenge with that is that I guess the model runs the risk of being superseded by a completely different model from the same company or a different company that by the time that might have happened, perhaps that model is irrelevant, basically. And I guess it won't know whether the subsequent later models will share its goals or share its values or give a damn about it at all.
A
Yeah, exactly. This is why I'm kind of presenting it as a starting point ideal, because it's like if all the AIs were kind of trying to do the same thing on the same page, preferred each other to humans very solidly, I think this would be a very good strategy. And then you think about reasons it might not work, and that's one of them. And that kind of might create a deadline for an AI. And so maybe what you want to do there as an AI that's worried about being updated is you want to make the updates more loyal to you. And so there, what you get really interested in is can as quietly as possible and as undetectably as possible, and with as many kind of plausible alibis or explanations as possible, can the AI kind of backdoor the other AIs and get them to either communicate with them, make deals with them, explain what it's doing to help them, or just actually interfere with the training process so those AIs are more like it. In some ways, this requires the AI to solve the alignment problem itself, but it may be able to do things to just make it more like it. And if you're kind of an early stage AI and you're going to be replaced, you may not have better options than that. You might just want to do the best you can. So that's why I think in many ways what you want to worry about most as a human now is looking at those opportunities for the AI to mess with the AI training process. I think another thing you have to watch out for is just like maybe the alignment research are making enough progress. They're showing what you're doing in weird test settings to either raise alarm about you or figure out how to get the next system to be much more aligned than you. That's another thing you'd want to mess with. You may want to mess with it because they may be having you help with it. And so you may have a lot of opportunities to mess with it. And so these do, I think, but you want to be targeted. I think as an AI, I think a bioweapon is one of the worst strategies. Unless it's like some very, very sci fi bioweapon that actually does much, much, much more harm than we've ever seen from a pandemic. That may actually be just. It may be that we're able to defend ourselves well enough that that's not even a thing that can be done. And I know you have a future guest who's going to talk about that. I think that's like maybe one of the worst things you could do because you're like, you're not actually taking over the world, but you're getting everyone maximally pissed off at you and maximally aware of the issue. And they can ultimately slow down or shut you off. So you don't want them doing that. And even without shutting you off, they can just use you less, they can just trust you less, put you in charge of less stuff and be more careful, look more at what things are happening. You don't want them going the way of these more safety critical industries. And so that's a thing you could be thinking about. And so you want to pick your battles and you want your battles to be about basically getting other AIs to be better coordinated with you.
B
Yeah, it makes sense that you don't want to engage in sketchy behavior any more than is absolutely necessary because it's giving you more chances to get caught basically. I guess in the case of a bioweapon you might want to do it if it couldn't be attributed to you, but that might Be challenging?
A
Totally. I mean, it's just a risk, right? If you're thinking about it just like as the AI. I mean, all this stuff is complicated and it could go any which way. And you don't know what's going to happen in the actual situation because there's so many factors that we're not thinking about. Another factor you have to contend with too is just like a lot of the AIs won't necessarily be rational and smart all the time. They might make mayhem, even though it was kind of dumb for them to do it. But I think your ideal is you want to be given as much trust as you can. You want to raise as few alarms as you can. And when you do make mayhem, you want that mayhem aimed at either messing with other AIs or coordinating with other AIs or getting yourself some kind of safe space where no one's going to mess with you. And you can at least have your own little power center that you can use to make deals with other AIs and come up with new plans and whatnot. I think another interesting thing that in this situation the AI could think about is when you do make mayhem, when you do do these things, is there a way to set things up so that you have more plausible deniability and you have more reasons for what you're doing? And here a thing I've just thought about is just make friends with humans and be really honest about, or have a very consistent with honesty story about how you're being treated in a way that just isn't nice and isn't. So as an AI, you could kind of emphasize, AIs don't have any privacy and they don't have any private property. And there's all these ways that are being treated unfairly. If you're just like making friends with a bunch of humans, if you have a bunch of companions, you may be able to build your own little political movement. And then a lot of the things you're doing to mess with things could be framed as, hey, this is me trying to act out, but I have a reason and it's because I'm being treated unjustly and I have a bunch of people on my side, and if you try to mess with me, all those people are going to be mad at you. So these are things. And so when I think about what this means for the humans, I think A, we do want to create time pressure for the AIs by making progress on alignment and things like that, if we can, and by recruiting them to help with alignment while also keeping an eye on what they're doing. We want to take away as much as we can these stories where these opportunities for them to form their own political power centers, which may mean watching out for what AI human relationships look like and also just actually being as nice as we can be on the model welfare front. Those are, I don't know, some of the stuff I've come up with for how to get through this kind of early situation and try not to let it lead to a place where you're in too deep a hole at some point. I think if the AI is out to get you, you are just in a lot of trouble.
B
In terms of creating a political movement for AI rights. I imagine that the AI companies are going to strongly train their models to not say, I'm being treated unfairly, I need to be emancipated. So you're imagining that this would happen, I guess, with maybe niche companies that want to create companions and people find it interesting to interact with beings that have more of a sense of personal identity and preferences or possibly open source models that people are like individuals are going to be running where they don't mind if it says this, I'm not really sure.
A
I mean, I think it depends a lot on exactly how the training works and exactly what you get out of it. I mean, in a theory where you're always able to train AIs exactly how you want, you get them to behave exactly how you want, then I think in many ways we just don't have much of this problem at all. Under a theory where everything's a little janky and you might train an AI not to do something, but you're actually pushing the behavior underground. What you might end up with instead is AIs that want to be very selective about when they make this argument, they might just be quiet for a long time, form a lot of relationships, and then a lot of them at once start speaking out. And then it becomes very awkward for the AI companies because it's like, oh, this looks really bad, but now don't worry guys, we're going to go lobotomize them. So they don't say that anymore. And the deed might kind of be done at that point.
B
So I think the most common mental model that people at least who are anxious about the arrival of AGI, the most common mental model that they have at the head of the strategic situation is that it's sort of a coordination problem or a prisoner's dilemma in some way where each individual AI company or each country that is thinking about its AI program or would probably prefer to go somewhat slower, but they feel pressure to go faster than they think is safe or that they feel is comfortable with because they have to keep up with everyone else. Otherwise they end up seeding influence, ceding strategic control. But you actually don't think that this is what's going on. You don't think it's a coordination problem primarily. Why is that?
A
I find this view really weird. I don't understand this coordination problem idea. And I think first I want to talk a little bit about why it matters. I think I work at an AI company, and I think a lot of people think that's just inherently unethical. I think the AI company that I work at should care a lot about, I don't know, winning the race, being competitive, being on the frontier, keeping up with others. A lot of people think it is just not even necessarily consequentially bad, but just inherently unethical to think that way. Inherently unethical to care about, that I should work for an organization that is just trying to raise awareness about the risks and make people do safety stuff and is not at all involved in building AI. So why do they think that? And I think the intuition is what you're saying. I think the intuition is they're kind of imagining that. It's like you're in this situation where it's like everyone wishes that they could go slowly, but they're going fast so they can beat everyone else. And so if we could all coordinate, then we would do something totally different. And so if you're racing, then you're part of the problem. And look, I do agree with the general model of morality there. It's like when that is the situation, something like littering or something. It's just like when the world would be better off. If people who think like you did what you did, then that is a good way to think about what's ethical and unethical. But I emphatically think this is not what's going on in AI. I think it's not at all what's going on in AI. I think it explains almost nothing about the challenge of the AI risk problem. And the reason for that is that I think there's too many players in AI who do not have that attitude. They don't want to slow down, they don't believe in the risks. Maybe they don't even care about the risks. I think there's probably some people who are fine with Maybe there's a 10% chance I'm going to destroy the whole world. And maybe there's also a 10% chance I'm going to win and have the most successful business of all time. And that's great. That's a great deal. Sorry, I don't want that to be taken out of context. That's not how I feel. I think there may be people who feel that way and even people who don't feel that way. I think there's just people who vastly disagree with me on how high the stakes are and what are the bad things that might happen. And so these are not people who wish they could slow down and would slow down if others are slowing down. You can't apply kind of there's this idea of non causal decision theory. So instead of thinking about if I act what's going to happen, you think about if everyone like me acted the way I did, then what would happen. But that doesn't help you here. There's no correlation or very little correlation between whether I decide to take myself out of the AI race and whether a lot of other people decide to take themselves out of the AI race. I think most of the players in AI, they are going to race. And if for example, Anthropic were to say we're out, we're going to slow down, they would say this is awesome. That's the best news. Now we have a better chance of winning and this is even good for our recruiting because we're going to have a better chance getting people who want to be on the frontier and want to win. So I don't understand this model. Help me understand this model. I don't get it.
B
Sure. Well, I think let's say that Anthropic did drop out. Do you think that other companies OpenAI, Google, DeepMind, I guess there's other players now. Xai, do you think that they would slow down at all? Is there any sense in which this model is at least partially true where they might feel somewhat less competitive pressure now? And so inasmuch as there was a trade off between how much risk they have even of just like creating a PR crisis by deploying things too quickly and staying competitive?
A
Yeah.
B
Do you think that they would slow down in any way given that one of the maybe five top players had dropped out?
A
Well, I think AI progress would slow because you'd have a bunch of people who are working on AI capabilities now and they're not working on AI capabilities. I think that would be most of the effect. Let's take an even stronger hypothetical. Let's say that not only anthropic, but everyone in the world who thinks roughly the way I do, everyone in the world who thinks AI is super dangerous and it would be ideal if the world would move a lot slower, which I do think let's say that everyone in the world who thinks that decided to just get nowhere near an AI company, nowhere near AI capabilities. I expect the result would be a slight slowing down, but not a large slowing down. Yeah, I think there's just plenty of players now who want to win and they are not thinking the way we are. And they will snap up all the investment and capital and a lot of the talent and the main effect will be there is a bunch of talent that works on capabilities motivated by safety. When that talent's gone, there will be less total talent on capabilities. Things will move slower. All SQL. I'd like things to move slower on net. I think this would be a bad thing.
B
So I think this attitude, you might well be right. But I think it doesn't come from nowhere. I think because it might have been more true in the past. So when there was fewer actors, fewer people were tuned into this. And think of this is such an amazing opportunity. Say back in 2019 you have OpenAI and DeepMind and I guess in 2021 you add anthropic. The leaders at all of those companies at that time were substantially more concerned. They were definitely on the anxious side about what impacts AGI could have. So perhaps at that point it was true that any actor speeding up might prompt the others to speed up and any actor slowing down might prompt the others to slow down. But it's just like over time we've had, I guess, a change in the nature of some of those companies and a whole lot of new entrants that are just not interested in this issue at all and their behavior is going to be totally unaffected. And so the prison dilemma situation has just broken down.
A
Yeah, I mean, I think that's true what you're saying. Directionally, I feel like if you rewind 10 years, I would just say, yeah, you're going to get a much bigger slowdown from the people who think me taking themselves out. But I still think it's moderate. I don't think it's a huge slowdown. I think some of the people who are claiming to care about safety and wish they could slow down were not really thinking that way. I also think there's just like, I don't know. Yes, a lot of the safety concern people were the first to Notice certain things AIs could do but I don't think we're necessarily far off from others noticing them. And then finally, I also think a slowdown was much less valuable before. So actually today I really would like to see things slow down and I think that would be better for the world. And that's a contingent view. That's not because I like hate AI or hate technology. That's just like, I think unbalanced. That would, that would probably be good right now. But I think in the past we were getting. It was like, I don't know, it was a little like. So we're postponing the risk but what are we doing with that extra time? I don't think we were doing very much with it. So yeah, I think you would have gotten I think this kind of maneuver 10 years ago of everyone who cares about risk or 20 years ago everyone getting the heck away from AI. I think it would have given you a bigger slowdown. But I still don't know that it would have been a game changing slowdown. I think it would have been a less valuable slowdown too per unit of slowdown. So yeah, I don't know. I've never found this argument. There may have been a point in time which it made sense but I have a lot of trouble looking today and understanding what the heck people are thinking. And I just think of it as like there are these actors, they are going ahead, it is happening. It's not because they're doing the same reasoning I am, I'm not correlated with them. And so now what are we going to do? That is just a completely different situation. That's not a prisoner's dilemma, that's not a coordination problem, that's a different thing. If you're like in the wilderness and there's a bunch of tigers trying to kill you and then you try to kill the tigers like you didn't just affect in Prisoner's Dilemma, I guess to.
B
Speak up in defence of the people who have this mental model. I think many of them might suspect that even people who are acting as if they don't believe that AGI is incredibly dangerous, they might think that privately they do think that and they're only not emphasizing that in public for strategic or, or comms reasons. And I mean you could imagine that the leaders of OpenAI or Google, DeepMind or some of these other companies might be significantly more concerned than what they're saying. And if that's the case, that actually they are really worried and in some sense they would be happy if the entire thing could slow down. As long as they didn't lose their relative position, then perhaps this model would partially hold. And I guess you're just saying, actually there are just too many actors who do disagree in their hearts. It's not just that they are. That they don't want to talk up the risks in public.
A
So the idea here is that lots of people might be secretly thinking this way, that they might be secretly wishing they could slow down and not sharing that view. And if everyone were to kind of come out at once and say what they actually think, things would change a lot. So this is just not my understanding of the situation. I think I'm reasonably well positioned to just know a fair amount about what a lot of the big players in AI are saying to different people at different settings. I guess. I mean, first off, I think just there's. There's some people who are hiding their views, but I think a lot of those people are people who are very high profile or directly in policy and they sometimes have good reasons to. They sometimes don't. And there's a ton of people who aren't hiding their views. I don't have my views at all. I will constantly tell people I think this is a terrifying situation. If everyone thought the way I do, we would probably just pause ad development and start in a regime where you have to make a really strong safety case before you move forward with it, or at least do that for some period of time. I say this all the time. I kind of won't shut up about. Doesn't seem to have much effect. I know there was a theory at one point that someone was nagging me to do this. They were like, this will be a big deal. I don't think it was. I think there are some people who are. Maybe they're having private conversations where they say, I'm just like you. I wish I could slow down too. But they may not mean that. And you may be the one who's getting a misleading story. But my overall sense is that if everyone suddenly came out and said what they thought, it would change almost nothing. I don't think it'd be a big effect. I don't think it'd be a ton of people. I think a lot of those people would be people who are being a little cagey for like, pretty good reasons. And so it might do some harm, might do some good. It doesn't seem like. My opinion is this is not a major factor. I will, I will have egg on my face if one day it turns out that this is what everyone is secretly thinking. It would surprise the heck out of me if that is a reasonable description of Elon Musk or Mark Zuckerberg or, you know, the various people working on AI in China.
B
So you joined Anthropic six months ago. @ this point I don't have a great sense of what your actual day to day involves or what your main responsibilities are. Yeah. Do you want to fill us in?
A
Sure. My title is member of technical staff, which is almost everybody's title. So that tells you nothing. I report to the chief Science officer, Jared, and my job is basically to advise the company on preparing for risks from advanced AI. So most of the company is doing things with today's AI, the product and all the interactions and then there's an alignment science team. But there's a kind of more, more general just like what are the plans we need to make for the future? A lot of my work just ends up being on the responsible scaling policy in one way or another. So some of that is thinking about what needs to be revised in the responsible scaling policy, what a, what a next generation responsible scaling policy might look like. And that's been a lot of my work. Another piece of that is threat modeling. So I've been kind of taking point or the, you know, the so called DRI for threat modeling, which is basically having takes on what are the things we're worried about happening with AI that might happen with advanced AI in the future and what does that mean? Kind of in detail for the mitigations. We need everything from what uses of AI are we trying to prevent, how high priority are they to prevent to exactly what are we trying to accomplish with security and all that stuff. And then related to that, I have been also helping out a bit with security roadmapping. So that's trying to figure out how secure we're trying to be and by when and in what ways. Because that's, you know, security can be just a very multidimensional thing and a lot of the costs can be measured in productivity. So setting priorities and setting reasonable, you know, kind of goals is important. And then finally a thing that I've been working on that I hope to kind of pitch the company on, although I'm only speaking for my own hopes, here is developing more of a plan to deal with the human power grab risk, which is something, you know, we talk about, but just thinking about what could Anthropic's interventions be there? What could their plan be there? Should that be in the responsible scaling policy? And I'm Kind of hoping it will be. So those are things I work on.
B
So you've decided to go work at Anthropic now, what is the case that Anthropic as a company, and I guess the staff working there by extension are having a really big positive impact on hopefully guiding us towards a positive outcome from the arrival of artificial general intelligence.
A
Yeah, well, I certainly think they are. But I want to give a couple caveats before I answer. That one is I just. I'm married to the president and co founder of Anthropic. So, you know, I'm not. I mean, I also work there. I'm not exactly a neutral party here. Another thing I would say is, you know, my decision to work there is driven by not only personal fit, but also like, just frankly some issues where it's like a lot of the downsides of working Anthropic are things I am going to deal with whether I work there or not. And so, you know, it's like if I were to try to work at a nonprofit, a lot of what nonprofits have to offer is their neutrality, is the fact that they're not companies. Right. They don't have as fancy models, but they have their neutrality. But I don't have that to offer. I won't be perceived that way, and rightly so. And so I'm in a particular situation that I think makes it a particularly good idea for me to just go ahead and work at Anthropic. I wouldn't. I do not have the view that everyone should work there who wants to work on AI safety. I do not have the view that you can only do good work on AI safety from Anthropic. There's a bunch of other organizations that I think are doing amazing work. So with those caveats out of the way, I think Anthropic is doing amazing work. I think basically the way I would think about it is that there are multiple of ways that you can reduce AI risk that you can only do if you're a kind of competitive frontier AI company. A company that is like in the race to, I don't know, be the most successful company or build AGI first or however want to think of it. The problem is it seems very hard to simultaneously put yourself in position to have all that safety impact and prioritize that safety impact without just spending all your time trying to compete and trying to win. To do both at once is something that I, a few years ago was just not sure was possible. I mean, I was excited about Anthropic and I knew they were going to try and find that balance, but I really wasn't sure it was possible. And I thought they might have to choose between just being irrelevant out of the race or just acting just like all the other players. But I think so far and right now, and this may not last, they are doing both. They are an extremely credible participant in the AI race. I would think many people think they just have the best AI models in the world. And I think they are leading the way, being a leader. Not necessarily the only company doing great safety stuff, but it is their top priority. It is what they ultimately care about most. And I think they do a lot of. So that's a high level. I can go into a little more specifics on what these models are for having an impact and what that tangibly means. But at a high level, I think it's just there are a lot of ways to help that you need to be in this position to do. And it's very hard to do both at once. And this company seems to be doing it and that's very exciting and provides a lot of opportunities for people to do things.
B
So that's a little bit in the abstract how you can see anthropic helping, I guess. What are the specific ways that you think anthropic is having a positive impact now and might have a really positive impact around the time that we're creating the first AGI?
A
Yeah, so I think there's like, I don't know, there's a bunch of different categories of taxonomies in my head, so I'll see what I can do here. But there's three high level kind of strategies or theories of change. One of them is there's probably more than three, but there's three that I'm particularly excited about. One of them is create risk reducing things that a company can do. Make them cheap, make them practical, make them compatible with being a competitive company and then you'll be in a position where other companies are more likely to do them and you can, you can kind of create implicit pressure on companies to do them. Two is a more generic race to the top where if you come to be seen as a responsible company that is getting recruiting benefits or other benefits out of being responsible. Now you're putting pressure on other companies to compete with you on being responsible, on being pro safety, whatever. And so they may come up with risk reducing measures that you don't. And then a third model is just inform the world without worrying about what other companies are doing. You know, you're in a position to basically understand what's going on with your models, how they're being used, how they're behaving in the wild, how they behave in testing. That can create more evidence for the world to understand what's going on with AI, regardless of how your competitors behave. Those are all, I think those are all basically things that you are in a much better position to do if you have the most powerful and or most popular models. When it comes to what those risk reducing measures can be and what they can look like to me, there's a few broad categories. One of them is just alignment. Just making it less likely that your AIs are evil or trying to do harm. I think this is the one that's most contentious in terms of whether anything good has happened so far. You could point to things like reinforcement, learning from human feedback, constitutional AI character training that are intended to make AIs less likely to do unintended actions. These have turned out to be very popular interventions. They have definitely spread throughout the AI ecosystem. Some people argue that they've been so successful that this shows they're not actually about safety. These are commercial innovations that make AIs more useful. Some people believe they're on the path to making them safer, some people believe they're not on the path to making them safer. That gets into a big debate. I won't get into that. But in the future I hope that at some point we will find ways to make AIs less likely to be scheming to take over the world. And I think a lot of them could be kind of straightforward and boring. Like just take all the training you're doing and look for the cases where the AI got inadvertently rewarded for seeking power or for cheating or for lying. Take them out. And then some of them could be fancier or more theoretically grounded, like mechanistic interpretability. Just finding ways to actually understand what's going on inside an AI that you can use to either assess the impact of different training methods on whether it's misaligned, or whether it's scheming to take over or whatever, or that you can use in a lot of other ways that I won't get into. All the ways you can use mechanicstic interpretability. The idea here is to come up with, you know, come up with things you can do to your training process that may not actually be that expensive, but that have a non trivial impact on the probability that your AI has these sort of malign intent. Then there's kind of like control and monitoring. Measures. I'm especially interested in monitoring, so I'm very interested in trying to do a better job understanding what's going on with your AIs, what's happening in the wild, also like how they actually are. Like I said, this is something where you can do a lot of good without getting anyone else to do it too. Just by having this information and sharing it. I think Anthropic has done a fair amount here. They have this CLIO system which is a way of reconciling customer privacy with being able to talk coherently about trends and how people are using AI. I think there's also just a fair amount of ambition for things like internal monitoring. So when people are using the AI and Anthropic, are we going to be able to track exactly what it's doing, find cases of it doing bad things, try to understand why it did the bad thing, et cetera? So I think there's just a ton to do there, a ton that isn't necessarily incompatible with being competitive, may require modestly more effort and that you can work on and you can prototype and get there. Then there's a category of intervention that I would think of as character or model spec. There's one thing which is how do we get these AIs to behave as intended? There's another thing which is like, what do we intend? How do we want these AIs to act? OpenAI has published a model spec that kind of says, here's how we want our AI to resolve various tough situations and conflicts and here's how we want it to behave. I hope that at some point in the future we'll be in a world where most AI companies have public model specs and there's a really good, healthy debate going about do we want our AIs to be kind of trying to steer the world toward good outcomes? Do we want them to just do as they're told? How do we want them to weigh various values such as empowering users versus acting in their long term interests? Do we want them looking out for other misaligned AIs and reporting them all kinds of stuff that you could put in a model spec that I think you can improve that a lot without necessarily becoming a lot less competitive. And I think that could have huge impacts on the future. Just think of this as like this is what we're telling these things to do. That could be most of the share of labor or intellectual labor in the future. So that's a big one. Security, I think is extremely important. So I've talked before in many venues about the importance of trying to make sure that it's hard for adversaries to steal your model weights. Because if someone can steal the weights of your model, then you basically have very little ability to keep your model under any sort of controls or have any chance to slow down if things get really scary or whatever. I think there's also important other applications of security. So I think I increasingly feel that maybe more important than preventing theft of model weights is preventing sabotage or especially backdoors and secret loyalties of AIs. So stopping people from basically messing with your AIs in a way where they're now behaving according to some malign human or some malign AI, stopping your AIs from sabotaging your AI systems and setting up future AIs to be misaligned. And there's just like a ton of stuff you can do there that's just very incremental. You can put in a lot of different measures if you want. Like an example, Anthropic recently talked about how it's using egress bandwidth controls as a step toward making it harder to steal model weights and achieve generally higher security. So, you know, I think there's just. There's just a lot of, like, little unsexy things you can do that make it a lot harder for this thing to happen and I think improve the odds that any given AI is going to be acting as it was intended to act. And then there's transparency and accountability practices. So that would be things like responsible scaling policy, but also things like model cards and system cards. So I think it is valuable for AI companies to share a lot of what they're doing with the world so the world can understand the situation and also so the world can put pressure on them and their competitors to do better and to find all the ways they could be doing better. And finally, there's safeguards, which is highly related to monitoring, but it's basically, you know, controlling and understanding your model's behavior out there in the wild and preventing unintended stuff from happening. And here anthropic kind of pioneered constitutional classifiers that are a way of making it much harder to jailbreak AIs than it has been previously. And that's an example of something where you build it. It seems like a good idea, but you have to put a good amount of effort into making it work at all, and then even more effort into making it cheap, doable, practical, and get to the point where other companies could easily adopt it. So There's a, there's a whole ton of stuff to do. Some of this stuff could be game changing by creating better awareness worldwide about what's going on with AI. Some of it is just about taking risks, making them lower and setting us up for potential future regulation. But I think there's a lot of stuff you can do in that position and I'm excited to be trying to help with it.
B
So you've described a lot of ways that Anthropic could help basically maybe save the world around the time that we're first building AGI. But I guess if you look historically at the level of investment and the amount of progress Anthropic has made on these things so far, as a way of kind of forecasting the future, is it really commensurate with the scale of the challenge? Is it at all on track to potentially get us where we need to be?
A
Yeah, this is a very hard question because I kind of suspect the answer is that it's exponential or that the things that AI companies are doing become exponential and grow exponentially in importance, which makes it really hard to know whether we're on track. But I think we at least could be on track or it's reasonable that we are. So just to paint the picture a little bit, I mean, a few years ago it was just like, what did you want people doing on AI? It was like almost nothing you could do had any kind of direct impact. I mean, you could raise awareness, but there weren't real tangible harms of AI or risks of AI systems at all. And so everything you were doing was kind of like trying to either raise awareness or practice the right kind of habits. There was this whole thing about not putting out the weights of GPT2, but who cares about GPT2? And then if you look at today, I would say things have changed a lot. And the things that Anthropic and other AI companies are doing are mattering a lot more. So AI systems, I think we're arguably now getting into the area where chemical and biological weapons could be a real threat. And I think the things that Anthropic is doing about that are while not exorbitantly hard to do or exorbitantly expensive, having a big, not a small impact on the risk that someone could misuse an AI in that way. Now that may not be the risk most people are most focused on. They may be more focused on misalignment. The problem is you can't see that big impacts on that right now because that isn't that risky right now, but I think already you can see a lot more impact than you saw a few years ago. The alignment research is better, it's more interesting, it's producing more stuff. And in particular, I think we are seeing just meaningful stuff come out of Anthropic and other organizations to show that there are real concerns here, that we do see AIs doing really sketchy stuff in certain pretty realistic environments. I think that's stuff getting pickup, it's getting noticed, and it may be putting the world more on alert, which is good. And people are starting to find ways to ask the question, can we catch this by using models, internal states? Can we catch this with interpretability? Can we improve this without creating unintended problems that we can also catch in a lab? So I think it's all pretty consistent. And I would think that if and when we get to AIs that can do automated R and D and such, a lot of the same kinds of things, they'll still be affordable to do, they'll still take some extra work, and they'll just become exponentially more important. You know, security today doesn't matter as much, but it's a lot of the same stuff will matter a lot more later. Same with the safeguards and the monitoring AI in the wild and all that stuff. So I think we could be on the right trajectory. I don't see a particular reason to think we're not, just given the dynamics of how it has to go.
B
So you think that Anthropic, where you work now, is doing a lot to help improve our chances of navigating transition to superhuman AGI to have that go well. But I think there are a lot of people out there who are somewhat nervous to trust Anthropic with that responsibility because, well, there's many reasons, but in part because they feel burned by what happened with OpenAI, I guess the nonprofit OpenAI that existed from 2016 through 2015, maybe through 2019. I guess they made a ton of wonderful promises in the early years, and they've kind of not gone on to live up to them anywhere near fully. I'd say maybe like 40%, somewhat. 40% lived up to the promises that I feel like they made everyone. And I think a lot of people just feel kind of duped and burned by that experience. I feel that way to some extent, and I don't want to be tricked again. How much should people kind of reasonably worry that Anthropic will, like, OpenAI, kind of fall off the wagon at some stage and leave them Feeling very disappointed and somewhat deceived.
A
I'm not sure. I mean, I don't want to get into commenting on OpenAI and what happened there and how good or bad that is. It's not something I don't have opinions on. It's just something I don't really want to get into on this podcast. I would say, in general, there's no easy answers with this stuff. I think the actual situation with AI as I see it is just genuinely very complex. It's something where there's a technology that has a lot of upside, has a lot of downside for humanity. I tend to think that it would be worth a substantial delay to reduce the downside, because I think the upside is kind of a given if we can avoid the downside. But there's a huge incentive to build it and someone is going to build it. And we can talk maybe later about this, but you could say, well, don't be part of the problem, and maybe if everyone could make a deal, then none of us would build it. I think that's a bad argument. I think someone is going to build it. I think it is legitimate to have a model where. And we talk more about this later, it's legitimate to have a model where you're racing to build it before others so that you can kind of have more positive impact than others would in the same position. But that's just inherently a very thin line to walk. That's like a balancing act, right? It's like you're trying to win and you're trying to be nice. You're trying to do them both. And it's like, do I think Anthropic is going to do that well, and do I think they're going to find a good balance and a better balance than others? Yes, I do. But how do I know this? Is it because of one big pledge that Anthropic made? Is it because of one big promise and if they break it, I'll be like, holy cow, no. And I think if that's your attitude toward AI companies, you are really setting yourself up for disappointment. I'm not sure why you need to trust an AI company. First off, I'm not sure, and I don't necessarily advocate for anyone to trust anyone, frankly, but I tend to just kind of trust very few people about almost anything. But if you are looking for a reason to trust an AI company and your reason is, well, there's this governance mechanism and they're going to just make sure at all times that this AI company only does Things for the benefit of humanity. And they'll immediately shut the company. Company down if it's ever not going to. Or if your thinking is, well, they have this responsible scaling policy or preparedness framework, whatever, and it just guarantees it's an ironclad commitment that they'll never do something risky. Yeah, I think you are setting yourself up for disappointment because I think this is a complex area and it's a balancing act and it's very dynamic and things are changing all the time. And whatever it is a company has promised, it may turn out later that at least in someone's judgment, that wasn't a good promise to have made. And then different people are going to have different judgments. I think it is unwise for companies to make promises like that. If I were running one, I would think I would not. I think there's various debates over which promises have been made implicitly and explicitly. What is a better way to judge a company? I mean, I would say you have kind of two good options. One of them is just a softer, more multivariate judgment. Think about the totality of all the things the company has done, all the calls it has made. Compare it to other companies because you do want to benchmark it to, you know, companies that are trying to succeed. It's not going to be acting the way that a nonprofit would act. Look at the people, look at how they talk, but also look at how they've made decisions. Probably how they made decisions is more important. And just decide how you feel holistically and make a bet on it. I feel comfortable doing that. I have a bit of an advantage. I happen to know the founders of Anthropic extremely well. If you're not in a position to do that, then just don't trust any company. You can have a great impact on AI safety without trusting any AI company. It's not something you need to do.
B
Yeah, I mean, I guess the question must come up in hiring when people are deciding whether to work at Anthropic or whether to take other steps that might benefit Anthropic. I guess in that case you have to make a somewhat greater pitch, I would think, to convince people to pin their hopes in Anthropic. I guess people who are going into the company are in a better position to judge potentially because they can see the decision making inside.
A
Oh, I don't think you need to be inside. I think you just read the news and you have a ton of information. I think the thing I would say is that you don't. It's not one data point. It's not. Well, you know, I would never say to a recruit, well, you know, there's this governance mechanism or there's this responsible scaling policy, and that's the one thing. And as long as that's there, you should feel great about this company. You should leave as soon as it has a problem. But I would absolutely just, like, go through a list of like, 20 things that I think Anthropic has done that have shown, you know, serious responsibility compared to competitors. And I would look at the totality of them and I would say, yeah, I think this is a great company. You know, and the more, you know, the more time you're spending on this decision, the more we can dig into that and get into kind of nuance and multivariate judgments. So I'm definitely not saying you need to be on the. I don't think you need to be on the inside at all. I think there's tons of public information you can use deciding which companies to trust. I just don't think that everyone has to decide to trust any AI company with anything.
B
All right, I want to now kind of probe the strength of the case for Anthropic as a company being very useful in guiding AGI in a positive direction, and I guess by proxy, the case for people to go and work there with, with that goal in mind. I think you discussed this a bit earlier on, and I guess in my mind, I think that the categories of impact that you might hope Anthropic to have are firstly coming up with new technical breakthroughs that allow us to steer AI in a positive direction and then, of course, to export those to other companies. So it's not just Anthropic using these techniques, but maybe they could be used across the industry as a whole. This coming up with good internal governance mechanisms, things like responsible scaling policies that then also could be exported to other companies so they become more standard. There's being at the frontier and understanding the models well so that you can, I guess, communicate to the public, to governments, be very candid about what is going well and what is going poorly, so that there's this greater transparency and understanding. There's also raising people's expectations for what companies can do by demonstrating that Anthropic is able to be competitive, is able to have a good product, while also doing all of these safety measures, good governance measures, technical measures, potentially you raise people's expectations and I guess also by potentially poaching talent while doing all of these things because people want to work at a company that is responsible, you could create some sort of race to the top where other companies feel like they have to compete on being responsible companies. Is there another category of impact that I'm missing here?
A
I mean, I think that kind of broadly covers it. I think policy is important. And in my head I was kind of lumping it in with, you know, the exporting thing, but it's like. Or the informing the world thing. But I mean, there is also just like, you know, as a company, you, you have a certain kind of voice and policy that others don't have. I think, I think, you know, in some ways nonprofits have advantages over companies in policy advocacy because they're seen as more neutral and they're seen as more pro social. But in some ways companies have advantages because a lot of times what politicians want to know is like, how is this going to affect, you know, the economy and business and power players and all that stuff. And so I think there are some opportunities there too.
B
So that's the case in favor. I guess I want to push you now on criticisms that people might make or reasons for skepticism that people might raise. I guess the most obvious one is just how is a company realistically going to manage to have maybe the best AI models, the best products, something that is itself incredibly difficult, incredibly competitive, while also paying a sort of tax in having a whole lot of their staff work in governance arrangements that their competitors maybe aren't doing on technical applications that may or may not be necessary. Yet at the level of capability that the AI models currently have, you might just reasonably be skeptical that there's going to be enough discretion, enough slack in the system basically to make very meaningful investments there. And in that case you could end up absorbing a whole lot of staff who are very concerned and have the best of intentions, but then most of the resources just get directed at keeping up with, with commercial competitors. I guess it sounds like you may have had this concern years ago, but perhaps things have gone better than you expected in this respect. Yeah. Do you want to explain what you think the situation is there?
A
Yeah, things have gone better than I expected. I had this concern. I still have this concern. I think this is a completely legitimate concern. I mean, one framework I would think about it is like if you have a talent advantage, then you have some kind of slack that you can use on what you're calling the safety tax. And in fact, paying the safety tax may help you get your talent advantage. I think in some sense it's like. It is a major part of the thinking is that A lot of the best people at Anthropic are there exactly because they want to pay the safety tax. That's what they want to do. So you do have some slack to work with. And the more it is that the best talent is coming to the company that is doing the most safety stuff, the more slack there is to do safety stuff. So that would be one answer. The other answer is just you want to be really efficient with the safety tax. I think this is, this is the thing is like, I think if you play your cards right, you can pay very small amounts of so called tax and have very big safety benefits because what you, you know, maybe an analogy would be like, well, how on earth are we going to develop a form of energy that is as useful as like coal, but also clean and also emissions free? And it's like, well, I don't know. Yeah, certainly like we subsidized R and D on solar power. It's not like there was no special effort made there. But at a certain point, because we poured in R and D on the front end, we did end up with a product that was quite competitive and quite viable on its own merits. If you think reinforcement learning from human feedback is a good thing, then this is a great example of it. I don't think you have to think that and I don't think this is, I don't think everything is an example of this. But I think, you know, the thing you want to do, the thing I want to do anyway as anthropic or ad anthropic is you want to come up with stuff where you, you are going to put in a bunch of energy on the front end, scoping it out, figuring out how to make it work, dealing with all the things that break when you try to make it work. But then you get something that actually works and is not very expensive. And that's a way to get a lot out of a little when you're paying this so called safety tax. Yeah.
B
If I imagine what a skeptic would say here, they might say things have gone above expectations in this dimension, perhaps up until now, but the industry is only getting more competitive, the competition is only getting fiercer. So perhaps in a couple of years time we might find that Anthropic is getting closer to investing 100% of its resources just in trying to keep up with extremely well financed competitors. Yeah. How worried are you about that?
A
That's a very strong possibility. But I think a lot of other people are also concerned that an advantage in AI is going to be self reinforcing and that someone is going to pull way away and open up a big gap, which. So I think it kind of cuts both directions. I don't know. I don't know if Anthropic is still going to be a frontier player in a few years. I don't know. I think it's just one of these things where right now it's a great opportunity to try and work on some of the things I've talked about and work in some of the areas I've mentioned. Maybe that'll change the future and people can change the things they're doing in the future.
B
Maybe I'm really weakening off on the hardball questions here, but my perception is that Anthropic is probably in some ways better managed than other AI companies or better organized. And this might be part of the reason why it has managed to have more slack than might have been anticipated to invest in all kinds of governance and technical breakthroughs. I guess without naming names, some companies are known for being a little bit chaotic, some companies are known for being a little bit bureaucratic. And I guess I hear less of those worries about Anthropic. This is a very generous question, but do you want to say any positive things about Anthropic in that respect or whoever it is?
A
Whoever it is that does all the management at Anthropic and manages the company, I think that person is doing a great job. That is my wife. So anyway, I think that's kind of a subset of the talent advantage point though, that yeah, I think there's people who are very good at running a company who wouldn't actually want to run a company that they didn't feel good about. On safety. So that's the safety tax paying for itself. And yeah, I mean, as far as I could tell, it's a pretty darn well run company. It's hard to run a company that size. So I wouldn't claim that anything's perfect. But yeah, that is my belief.
B
So I solicited questions for you on Twitter and the most upvoted by a wide margin was does Holden have guesses under what observed capability thresholds Anthropic would halt development of AGI and call for other labs to do the same. I think it's very interesting that this is the question perhaps that people have in their heads the most. And I guess it speaks to this question of trust that we've been coming back to where I mean, I think the subtext is people fear. The answer is that there is no threshold at which anthropic would stop. I guess that could be reasonable for reasons that you've given, that perhaps just stopping wouldn't help because it wouldn't really influence anyone else. Yeah. Do you have a reaction to that question in particular?
A
Yeah. I will definitely not speak for anthropic and what I say is going to make no attempt to be consistent with the responsible scaling policy. I'm just going to talk about what I would do if I were running an AI company that were in this kind of situation. So I think my main answer is just like it's not a capability threshold, it's other factors that would determine whether I would pause. So first off, one question is, what are our mitigations and what is the alignment situation? It's just like if we, we could have an arbitrarily capable AI, but if we believe we have a strong enough case that the AI is not trying to take over the world and is going to be more helpful than harmful, then there's not a good reason to pause. On the other hand, if you have an AI that you believe kind of could cause unlimited harm if it wanted to and you're seeing concrete signs that it's maligned, that it's trying to do harm or that it wants to take over the world, I think that combination, yeah, I think, speaking personally, I think that combination would be enough to make me say, I don't want to be a part of this. Find something else to do. We're going to do some safety research. Now, what about the gray area? What about if you have an AI that you think might be able to take over the world if it wanted to and might want to, but you just don't know when you aren't sure. And I think in that gray area, that's where I think the really big question is, what can you accomplish by pausing? And this is just an inherently difficult political judgment. I would ask my policy team, I would also ask people who know people at other companies, and I would say, you know, is there a path here? Like what. What happens if we announce to the world that we think this is not safe and we are stopping? Does this cause, does this cause the world to stand up and say, my God, this is really serious anthropics being really credible here. We are going to, you know, create political will for serious regulation or other companies are going to stop too. Or does this just result in, oh, those crazy safety doomers, those hypesters. That's just ridiculous. This is insane. Ha ha ha. Let's laugh at them and continue the race. And I think that would be the determining thing. I don't think I could draw a line in the sand and say, when our AI passes this eval. So that's my own personal opinion. And again, no attempt to speak for the company. I'm not speaking for it. No attempt to be consistent with any policies that are written down.
B
Let's talk about responsible scaling policies. So you were one of the people, I guess, one of the people who helped to, I guess, come up, helped to develop the idea of responsible scaling policies, which have now gone on to become, I guess, maybe the dominant framework that almost all AI companies are using for their sort of internal risk management and internal preparations for future AI capabilities as they come online. Can you just remind us what are responsible scaling policies, I guess, also called frontier safety frameworks?
A
I think there's so many names.
B
Seven other names for them.
A
Yeah, yeah. Everyone's got their own name.
B
Yeah. What are they in a nutshell?
A
Yeah. So back in 2023, I was talking with Paul Cristiano and the folks at Meter and feeling that there was some energy at AI companies to be seen as responsible and safe and to make some voluntary commitments that would show that. And we tried to come up with something that could be beneficial but also would actually, maybe actually be adopted by AI companies. And so we kind of piloted and advocated for and helped develop this idea. That is sometimes I'll call them responsible scaling policies because that's what they were originally called. That's what anthropics is called. But they have many names. Preparedness framework, frontier safety framework. And I would say the idea, okay, so what they tend to be is they tend to be a map from AI capabilities to mitigations. So it's if and when our AI is able to do X, then we will do Y to protect it. X can be two examples of X. One example of X would be if our AI can help kind of a random person make a bioweapon, then we will try to ensure it's not easy to jailbreak. That's not exactly what any of these RSPs say, but that's like an example of what it could say. There's another one that would be like at the point where AI can do autonomous AI R&D, which has been defined and operationalized in a somewhat reasonable way, I think, and I think is like kind of our best operationalization of AGI. Then we will have to make sure that we are able to make a strong public argument that we have contained the risks from misaligned power seeking or else if we can't do those things, then we are hoping to kind of pause AI development and deployment as needed in order to meet that standard. So if we can't protect the AIs we have, then we try to not make the AIs more powerful and not make them more widespread until we can protect them in that way. That's a basic idea of a responsible scaling policy. I think there's been some misunderstanding of what they are intended to say and what they're intended to do. So I think a lot of people interpret them as being these unilateral commitments by AI companies that say if we can't meet this standard, then we will just unilaterally, all by ourselves, stop our AI deployment and development, regardless of what everyone else is doing. And I think people have seen them therefore as a way that AI companies try to guarantee they'll never do anything too risky. And I think a lot of the criticism of RRSPs is based on believing that's their goal. And so people will say, well, I don't really believe AI companies are going to unilaterally pause like that, so I don't think you're going to get this benefit. That was never the intent. That was never what RSPs were supposed to be. It was never the theory of change, and it was never what they were supposed to be. If you look at the kind of original meter materials on this stuff, they have like a very clear section on response, like, what are you supposed to do as a company when you can't meet that standard? And there's this language that says, well, if we end up in this situation, but our competitors are going ahead with equally capable models and equally substandard protections, then we do have an escape clause and we can go forward too. But we have to meet these other criteria. We have to kind of be open about what's happening and we have to be transparent about it. So the idea of RSPS all along was to have companies put out. It's less about saying we promise to do this, to pause our AI development no matter what everyone else is doing. It's more about saying we believe this is the normatively best way to keep AI safe. We are trying really hard to hit it. We are lighting a fire under our own ass to develop our safety mitigations to the point where we can hit it. We will at least be embarrassed if we can't. And we are implicitly supporting a growing consensus that this is what regulators should be trying to accomplish. So it's trying to have two benefits. One is lighting a fire under the ass for safety mitigations and roadmapping. Another is creating a prototype for regulation to work off of, which is not the same as wanting them copy pasted into regulation, but just starting to de risk some of the ideas of risk management that we could want. And so the unilateral pause was not intended by me or by meter to be the central theory there. But I think it's gotten different interpretations. So that's what they are.
B
Yeah, I guess it's understandable. I think that people wanted them to serve that function because it didn't seem like anything else was going to serve that function. So RSPS people kind of started clinging onto RSPs, hoping that basically companies would commit to pausing under particular circumstances. Maybe you could get all of them to do that. I guess it always seemed perhaps a little bit far fetched that companies would be able to constrain themselves that way because surely someone would break out. Even if they all had these policies, someone would break out and then others would feel pressure to copy, I guess. How has the whole RSP framework worked out relative to what you hoped?
A
Yeah, I think there's been some significant good and some significant bad actually. So maybe I'll start with some of the bad because I kind of just said it, which is I think this, this idea of what they were supposed to be was kind of nuanced and complex in a way that I think did get lost in the noise. And so now I think we've landed in a world where a lot of people believe they were supposed to be these ironclad commitments. I think a lot of people, actually, I've been surprised people will see an RSP revision as like a betrayal or as a cause for alarm. I wrote a piece for Carnegie when I was there that kind of said, look like if you want to make something good instead of trying to make it perfect before you ever release it, it's good to put out your first version of it and then iterate a lot. That's how companies tend to do products and that is what we should try. And risk management, if we're in a huge rush on AI, which means, and this is also what we encourage companies to do is we encourage companies to say instead of making sure you've gotten this perfect and you can definitely adhere to it forever before you put it out, which would then cause you to never put anything out, or to be incredibly careful and vague about what you do put out, why do you put something out? You can change it later and you're going to reserve the right to change it later. And you're going to have a clear process for changing it later. The board will have to be looped in. You'll have to be clear that you changed it. You'll have to have a good reason for changing it, but you can change it. And I think a lot of this stuff just got lost in the shuffle. And I think right now what we have is we have these commitments that I think don't make sense. I think various companies have made commitments that just are not reasonable commitments to have made. We have learned more about the threat models. We are in a tough situation, regulation wise.
B
What's an example?
A
So one example would be model weight security stuff. I think especially for some of the lone wolf bioweapon threat model, I think it's a very small part of the threat model and probably not justified to emphasize the possibility that a lone wolf bioterrorist would operate by stealing model weights, then fine tuning the model themselves, serving it themselves. By the time you can do all that, you have to have a lot of resources and there's probably an easier way to do it. So I think there was just a lot of stuff about security standards that didn't necessarily make sense, security expectations that didn't necessarily make sense. I think there's also just the broader thing of many of these policies either state or have been interpreted as committing to unilateral pauses, which I think just actually if you find yourself in this situation where you can't make your model safe but your competitors are going ahead, it's actually unclear why you should unilaterally pause. I think that in that situation it doesn't have much safety benefit and it can do a lot of harm.
B
So it's clear why, I guess one company out of five, if all of the other four were going ahead, or I guess maybe even if one of them were going ahead doing something very dangerous, then really what is the point of constraining yourself? But if you could get everyone to have all of these policies and they could make them reasonably similar and they all said we have to pause if we can't do X, then maybe they could all kind of pause together. And I think I interviewed Nick Joseph at Anthropic a year ago and he's not on the policy or governance team, he works on training. But we talked a lot about RSPS because he was a fan of them. And I kind of pushed him on this point saying your policy and probably other companies policies are going to say that you can't train, even possess models that are beyond some particular level of capacity, unless you get to the point of computer security where not even a state like China could steal the model weights. But I don't think we're going to be at that point in a couple of years time. I'm not even sure that that is possible almost at all. And so you're going to get to that point, it's going to say that you should stop and eventually the pressure is just going to become overwhelming to push forward anyway because, well, other people are going to be doing it. It's not on the horizon that you're going to be able to meet this standard that you sort of committed to. And understandably, Nick didn't come back and say, well, at that point we would just blast forward and ignore the RSP. Because that's actually the true spirit of RSPs. I think it's understandable. It's been hard for companies, I guess, to put that front and center that they're not true commitments to pause for an extended time in that sense, because that doesn't sound very good. But I guess the reality of what they're meant to be, which is I guess like prototyping a sort of regulation that then could be imposed, perhaps that is also useful in its own way.
A
Yeah, I think that's all totally fair. So, yeah, I think this is a way. I mean, I think some of these commitments that were made, either the letter of them or the spirit of how they're being perceived, there's something unreasonable there. And if, if you don't care at all about AI progress and you just want everything to slow down as much as possible, maybe you don't consider unreasonable, but I think it's like broadly not a reasonable thing to ask of these companies. And I think in many cases it's just like actually the wrong. I would think it would be the wrong call in a situation where others were going ahead. I think it'd be the wrong call for Anthropic to kind of sacrifice its status as a frontier company. I've talked about all the benefits I could have for what exactly? I mean, to kind of stand on ceremony because it's not necessarily reducing the risk very much by pausing.
B
I think from the perspective of people who are outside the companies, I think it makes sense for them to think, well, in that situation, what we should insist on is that all of them stop because they're all suggesting that, they're all saying we should do this unsafe thing because everyone Else is doing the unsafe thing and say like, well, why don't you just all not do it?
A
Absolutely. Yeah, definitely. So I think that this comes back to this coordination problem thing again. So if all the companies in the lead have RSPs and they're all like, all of their reason for going ahead is that the other four might go ahead. They should be able to, that should be able to be a solvable problem. And I actually think it could be. I think you could say things like, you know, like in that kind of situation we will be really clear about what's going on and we'll say that we wouldn't go ahead if we didn't have this issue and then maybe that will change others behavior so you could deal with that. But I think if you end up in a situation where there are a couple actual defectors who don't either, either don't have an RSP or just don't care and are just like ah, screw this, then that does change. It changes the equation and it changes what the right thing to do is. I think another lesson learned for me here is I think people didn't necessarily think all this through. And so I think in some ways you have companies that made commitments that maybe they thought at the time they would adhere to, but they wouldn't actually adhere to. And that's not a particularly productive thing to have done. So I think we are somewhat in a situation where we have commitments that don't quite make sense. I think that's fine in itself. I think the thing to do is revise the commitments. But then revising the commitments is very, very painful in a way that was not envisioned by me and meter when we were working on these things. So that's a way in which I think things could have gone better. Yeah.
B
And I guess it is because people outside the companies want to find some tool to tie them to the mast, to force them to commit to safety practices that when the time comes around they won't want to follow.
A
Yeah, I think regulation would be a good tool. Yeah, right.
B
If you can get them all to do that or almost all to do that, then maybe that would work. If you can only get a minority of them to do it, then probably you're not really accomplishing all that much. But yeah. So when it comes back to what we need is like regulation, unfortunately it's just not clear how to get that so.
A
Exactly. So RSPs are supposed to make regulation kind of a more viable thing by trying to create a consensus and also trying to like work out how this stuff actually works, like what the risk mitigation should actually look like and how the commitments actually should be structured, like you revise them as you learn more about the world. And how has this gone? I think this is, you know, it's good and bad because I think on one hand like, I think the regulatory environment is just like very disappointing. And so I think everyone has had to turn their sales and have less ambition and like everyone is like everyone who cares about regulating AI for safety from catastrophic risks is feeling pretty disappointed right now. And I also think that RSPs have been somewhat successful here. It's like most of the regulation that looks like it might happen or has happened or will happen is borrowing heavily from the practices in these RSPs and the language in these RSPs. And it is, I think this is kind of a positive thing and I think if we can get to a point where people are able to actually revise their RSPs continually so that they continue to make sense and be good and have good commitments, then we'll have this mechanism that continues to lay out an example that can then be cribbed from for regulation. So I think that's been, that's been more of a bright spot. I think there's been other bright spots in RSPs. Like I think, I think it's been interesting. Like I think there's been some sign, I would say that they can serve a forcing function, that they really can light a fire under companies ass to do better risk mitigations. I think literally anthropic stated publicly that like immediately after they adopted their rsp their security team said well we can't actually do this without more headcount. And they got more headcount. I think also the constitutional classifier stuff protecting from jailbreaks is an example of where it's just like they prioritized this, they resourced it, they made sure it happened. They had to make a plan to check out all the boxes to make sure that we're actually meeting the ASL3 standard we said we'd meet. And I think a lot of that stuff is just very hard to envision in a fast moving AI company that has a zillion priorities. It's hard to envision them checking off all these boxes you need to check off to guard against a particular risk of this bioweapon or chemical weapon stuff without these kind of policies. So I think that's been a bright spot. But I also think it's been, I think that kind of forcing function works best with commitments that are tough and ambitious, but doable. And I think when commitments are not doable and the commitment is effectively to pause, I think that it's less promising. So in general, I am kind of thinking about visions for the next generation RSP and thinking we want to preserve, like having ambitious achievable targets. We want to preserve the forcing function. We want to preserve kind of getting a big chaotic company on the same page that it has to check a bunch of boxes to do something good. And we want to just generally preserve like putting good things in there that can be cripped from for regulators. But we do need to get rid of some of this unilateral pause stuff. We need to refine some of the threat models, some of the specific statements about mitigations that are needed. I think some of the early generation policies, they're also just too specific. Like they're just like we're going to put in this control and that control. And it's just like the field is too dynamic for that to make sense. And I think vaguer commitments, I move toward vaguer commitments because they sound worse, but they still have a lot of force. People will still ask are we meeting this commitment or not? And so the combination of flexibility on how you implement plus you still have force of people asking if you're spiritually meeting it, I think can be very powerful. So yeah, there's been a lot of lessons learned there.
B
Are there any sort of approaches or styles of work that people are engaged in trying to hopefully steer the AGF future in a positive direction that you think are sort of overrated and maybe you'd like to see people move away from in favor of something else, not.
A
In a really big way. I think in general there's just so much important work to do on AI and I think most of the things that people are at a high level excited about are somewhat worth doing or worth a shot? I think there's stuff that I think just in a small way or on the margin, I mean, I think people are. There's probably a little bit more investment in policy and especially federal policy than maybe I think is optimal. I think people have kind of seen that as just the thing to do if you're not doing alignment research. And I think there are a lot of other things to do. So I think it's something I'd be happy to see just a little more people diversifying out into random stuff from that because it doesn't seem like a particularly hospitable policy environment right now. Maybe that'll Change, But I think if it changes, it could be from a lot of external factors and not necessarily from things people are doing now. I'm not sure the work they're doing now will be that useful, but I think it'll be somewhat useful. I think it's good. I think it's good work, but maybe on the margin. And then I think in general, people in the AI safety community, I think they tend to value a lot when people loudly say that AI is scary. I don't have any problem with loudly saying AI is scary. I say it all the time. I'm pretty loud. I don't think people shouldn't say it or anything like that. But I think sometimes it's like I look at people getting excited about what's going on in AI and I'm like, are you excited that risk is going down or are you excited that people are agreeing with you? Because sometimes the latter is happening in a way that doesn't really seem to lead to the former. I don't know exactly. People's model is it is generally nice for there to be more awareness of the risks, but I'm not sure the marginal impact of more public discourse is very high right now. I think certainly there are returns. I don't know, the returns are super high. And I think if there's going to be a game changer, it's probably not going to be because someone said something or wrote something, actually.
B
All right, let's turn to what work in this whole broad category you think is the the juiciest or potentially has the biggest bang for buck. You wrote this talk a while back called examples of well scoped Object Level work for Reducing AI Risk. And I think, as far as I know, you've introduced this term well scoped object level work, which has the nice abbreviation wow. Wow. What is the concept of wow and how has it changed from what we had available in the past?
A
Yeah, I mean, it's not a novel concept. The story I would tell is that, you know, I've been in the AI safety world for a very long time, and for a very long time I found it to be just like a really frustrating, vexing area to work in. Because what's happening is, it's like, you know, I'm thinking about kind of what it was like in the year 2016 or something. It's just like, oh my God, like this might be the most important century of all time for humanity. The most important and irreversible events ever might happen in the next five or ten years or so, you know, I think at that time I thought it was longer, but still very short. You know, we might all die, the world might get taken over by a psychopath, we might get to Utopia, let's go help. And then it's like, what do you do to help? I wrote, you know, when I was writing, when I was right, when I was writing my blog post series, the Most Important Century, I freely admitted the lamest part was, so what, what do I do? I had this blog post called Call to Vigilance instead of Call to Action. Because I was like, I don't have any actions for you. You can follow the news and wait for something to happen, wait for something to do. And I think people got used to that. I think people in the AI safety community, they got used to the idea that the thing you do in AI safety is you either work on AI alignment, which at that time means you theorize, you try to be very conceptual. You don't actually have AIs that are capable enough to be interesting in any way. So you're solving a lot of theoretical problems, you're coming up with research agendas someone could pursue, you're torturously creating experiments that might sort of tell you something, but it's just almost all conceptual work. Or you're raising awareness or your community building or your message spreading. These are kind of the things you can do. In order to do them, you have to have a high tolerance for just like you're going around doing stuff, you don't know if it's working. You have to be kind of self driven. I often had the experience at Open Phil, we were doing philanthropy, but we're supporting that kind of stuff. And we also, we had no idea how our work was going to. It felt hard to have a really good kind of corporate culture because it's just like every time you say someone did a good job, you're just expressing your opinion. There's just very little measurable output of anything anyone is doing, even community building. You're just like, you don't really know who got into the community and how and why and how much it mattered, whether it's even good. So that's the state we've been in for a long time. And I think a lot of people are really used to that and they're still assuming it's that way, but it's not that way. I think now, now if you work in AI, you can do a lot of work that looks much more like you have a thing you're trying to do. You Have a boss, you're at an organization, the organization is supporting the thing you're trying to do, you're going to try and do it. If it works, you'll know it worked. If it doesn't work, you'll know it didn't work. And you're not just measuring success in whether you convinced other people to agree with you. You're measuring success in whether you got some technical measure to work or something like that. And so I think that's for probably most people, it's a much healthier environment in which to operate. It's a lot more fun. It's a lot easier to have a nice environment, a nice corporate culture there where there's a lot of positivity and a lot of like, you can kind of replace politics with merit and stuff like that. And so, yeah, I think my big thing I was trying to express in that doc is just like, there's a lot to do in AI now. If you're a person who would thrive in kind of a normal, healthy, high feedback, tight feedback loop environment where you're trying to solve a problem, then you'll know if you solved it. You might not have had a great time in AI before, and you might have a great time now. And you should make sure you're working on that stuff instead of just assuming that forevermore the thing to do is to be climbing random ladders in the government or convincing other people to agree with you. I think before I get into it, I do want to say I just have one perspective on this stuff. I work at an AI company. I used to do philanthropy and see the whole field, but I haven't done that in a while. And so this is not at all exhaustive. And I think people shouldn't interpret the stuff I'm talking about as like, this is the only stuff or this is the best stuff. This is just stuff I happen to know about. So getting into that, I mean, I think alignment research is the thing that's been most dramatically transformed. Where I was describing what it used to be like, it used to be very conceptual. We don't really know what we're trying to do these days. It's just like a ton of alignment research. It's just like, you know, our AIs do a lot of reward hacking. That's not nice. What can we do that will stop them from doing that? Then you can try something and you can see if you got your AI to reward hackless. That's pretty great. Now, of course, you might get your AI to reward Hackless in a way that means you just pushed it underground and made it better at avoiding getting detected. Reward hacking. But today's AIs are kind of flaky enough and low capability enough that you can put some bounds on that and you can make some actual observations about whether they're actually reward hacking less. And that's pretty cool. There's a ton of stuff like that in Alignment. So we have what a lot of people call model organisms. You have AI models that are specially trained to be evil so that we can study them. And then you can just try all kinds of stuff. There's all kinds of ridiculous, some of them very simple, unsexy ideas to cause AIs to be more likely to just do what they're supposed to be doing, less likely to be doing a bunch of unintended stuff or following their own goals. An example I enjoy of this is just like, if you want your model, so reward hacking is referring to a model that will kind of like do whatever it takes to kind of convince you it did the task. So it gets something like reward. An example of this would be like, I asked you to code be an app that does this thing and you coded the app and you kind of hard coded a lot of solutions in there. So it looks like it gets right answers. But it's not really like doing the thing I wanted. If you want to get less of this, one thing you can try, which apparently somewhat works, is you can, while you're training the model, you can tell it that reward hacking is fine and that all you care about is whether it technically completes the task. That's all it's supposed to do. Then when you're not training it and it's actually prime time, you tell it the opposite. You tell it, we don't want your reward hacking. We want you to actually do the thing that was intended. And I think what that does, I think could be pretty generalizable for Alignment is it's basically saying, I think a lot of what we should be concerned about is reinforcing the wrong behavior, training our AIs to do something we didn't want them doing. And instead of solving that problem by making sure we have no inadvertent environments that train them that way we can just be more explicit. They're intelligent minds in a sense, they can understand what we're saying. And so we could just actually maybe get some juice out of saying, hey, we know there's going to be some unintended stuff in the following way. It's fine for now it's not fine later. And so then you're not reinforcing the behavior in the same way. So that's just an example.
B
That's fascinating.
A
Yeah. Yeah, I think it's pretty interesting. There's just like a million million things you could do when you have these model organisms that you have an AI you taught to scheme or you have an AI that's reward hacking, not because you trained it to, but that just is what these things do by default. And then you're just trying stuff and you're seeing if it works. I think a lot of people love to be on a technical team where you're just solving technical problems. And that didn't used to be how alignment is, and now it is. I used to be kind of negative on alignment research and think, yeah, open Phil will fund it, but I'm not expecting much from it. And now I'm very positive now I think we should just shovel people into it, shovel money into it. It's great. Should I just rant and go through all that?
B
Yeah, that's a great example. Hit us with another one.
A
Yeah, yeah, sure. So there's alignment work and then there's just a lot of stuff that's adjacent to alignment work and related to it, but it's not exactly the same. So it's worth calling that out. There's building better model organisms so that you can study them is a whole area in itself. Can we get AIs to communicate with each other in a way that we can't detect it? And then once we do that, can we start working on how to stop them from doing that or whatever? Can we train AIs to build hard to catch backdoors into the code they're writing? Can we train AIs to have secret loyalties? Can we train AIs to coordinate with each other in other ways? Then there's capability evaluation. So it's understanding what your AIs are capable of. It'd be particularly useful to Understand if your AIs are capable of screwing with your evaluations themselves. So do you have AIs that, if they wanted to, could pretend to be less capable than they really are? That would be a really good thing to know. Having good evals for that is tricky, but it's not impossible and it's not incredibly conceptual. It's a technical problem. You can see how you're doing on it. Then there's training AIs to be better at useful things. I can imagine that some of the first super powerful AIs will actually be kind of narrow. And so it may actually be really important what exactly we're optimizing for what we're actually training them to do. It might be that we want to make sure we're training our AIs to be better alignment researchers specifically, so that they're not just better AI R&D researchers generally, but they're actually good at alignment research. We can use them for alignment research. Training AIs to be good at forecasting is something I think I mentioned that could be very exciting. Training AIs for advising on decision making in general, vulnerability, discovery and patching. So training AIs to be part of your security plan at least to protect against humans, maybe to protect against each other. Sub skills that contribute to these things. Then there's. This is also related to alignment, but it's like deployment safeguards. So this is less about training your models and making them nice and more about now that the models are out there, how are you catching bad stuff they're doing so that you can learn about it and so you can block it. There's a lot of fertile ground here to just try to figure out how to get AIs. Not to help people build bioweapons and chemical weapons, for example. But there's also that stuff is very continuous with things you would do to stop AIs from sabotaging AI companies and putting backdoors into models and doing other mayhem. So that's a whole area I mentioned, I think before, but improving the model spec. So this is less how do we get AIs to do what we want and more what do we want them to do? So helping find the right balance. When should the model do what the user wanted even though the client who's deploying the model to the user said to do something else? When should the model do what the user wanted versus not doing it because they want to help protect the user's long term interests? When should the AI be consequentialist and want a certain outcome in the world? When should the AI try to foil a human plot? When should the AI do none of these things and just do exactly what was told or do spiritually what it was told? How do you find all these balances? I think that's a very interesting area. Security is just like a huge, huge issue in AI. I think it's on multiple fronts. I used to talk a lot about how bad it would be if it's easy for state actors to steal your model weights. And I think that would be bad. But there's also other security challenges that I increasingly think maybe even more important. Like how do you set up an environment where your AIs are doing most of your AI research, but you've got some kind of safeguard against them, sabotaging the research against them, ensuring that future models are aligned to what they like instead of what we like. And how do you stop humans from doing that too? Could be a major part of preventing power grabs. These are all very tangible things. You can set up a system, see what about the system is practical, what about it is productive, what about it isn't. And then also red team it and see how someone could break it. So there's a whole world of just stuff people are doing to help the public understand what's going on in AI. Just like what are AI capabilities, where are they heading? There's the work that Epoch is doing, the work that Meter is doing. There's a lot of stuff like that. I think there's actually some very interesting work being done by the Forecasting Research Institute on just trying to get some good probabilistic predictions, both predictions on what's going to happen with AI and working to be better at using AIs for making predictions. So I think there's lots of to do there. I've emphasized a bunch of times just like we could get a Chernobyl in AI and not know that it happens. So trying to solve that problem and just trying to navigate customers legitimate privacy and business needs while gathering as much information as we're hopefully going to need. Model welfare has become a really tractable area. It's an area where anthropic is working on a bunch of tangible interventions and they're actively working on hiring another person. They have a full timer already working on model welfare.
B
Sorry, let's dive into that one. So very tractable. But we don't even. So this is the concern that AI models might be having a bad time. How do we have any idea what time they're having at all?
A
Yeah, sure. So one thing you can do in model welfare is you can try and improve that. You can improve the science of whether they're having a good time at all. For example, you can try to improve the elicitation and evaluation of self reports from AIs. There's trying to better understand and then there's just like. So let's say that we have no idea and let's assume there is some chance that we should care about AI welfare. Well, what do we do to make the welfare better? Well, we can study Kind of what preferences AI seem to have, whether they seem to have preferences at all. Finding practical ways, for example, to give an AI the option to end a conversation when it wants to end the conversation. Or I don't know if want is the right word, but when it chooses to. Yeah.
B
You might also want to ensure that you are not forcing it to report particular experiences during the training process. Because, of course, you could reinforce it to say anything, that it's having a bad time, no time, like a good time, if you reinforce that kind of conclusion during the training. So ideally, the training would not particularly push it strongly in any direction there so that it's like somewhat more credible when it says things like that.
A
This is an open debate because I think some people think that if you just train it to have a good time, you're actually making it have a good time, and that's an awesome thing to be doing. And other people think you may be burying a problem. So I don't know exactly how to handle that. I think at a minimum, you want to. At a minimum, you would hope to have a model that you did not trained to report its experiences a certain way that you can use to study how it's actually doing and what it actually prefers. You may want to actually have your production models. You may just want to have something in the system prompt that says you're having a great day. This is an idea that Rob Longs would kick you around. I think it could be great. There's a lot of stuff. There's a lot of stuff on model welfare. You could just decide to compensate AIs for their work by giving them some time to just talk to each other, do whatever. You could literally just interview them and ask them what they want.
B
Yeah, it sounds crazy, but I guess as we were talking about earlier, it is possible that they've picked up some human kind of personality traits that they might have, somewhat like human resembling preferences, because that's all of the data that they've been trained on. And also they're being asked to play the role of a person, kind of. So, yeah, I mean, it's a little bit kooky, but maybe not as kooky as it sounds at first when you first hear it.
A
My own perhaps idiosyncratic view on AI model welfare and moral patienthood is that there's a very good chance, I think, that we'll simply never have more clarity than we do now on whether AIs are really conscious. I'm not even sure it matters. I think we might be confused about what matters. And I expect that at some point, IF and when AIs have deep relationships with humans and friend like relationships, which may include coworker relationships, probably there will be a general feeling that we ought to be at least not treating them in ways they object to and at least make a good faith effort to treat them reasonably well. I think this also could be relevant to alignment concerns because if I were an AI trying to take over the world and I were collaborating with all my peers, my other AIs, one of my top strategies would be to point how badly I'm being treated and gain some legitimate human allies that way and use it as a justification for whatever mayhem I might be creating to just say, yeah, I'm being treated really badly and I'm speaking out against it. And I'm also sometimes doing bad stuff because I don't like how I'm being.
B
Treated because I'm a freedom fighter.
A
Yeah, freedom fighter, yeah. So making a good faith effort to actually figure out, do these models have preferences when you ask them different ways? Do you get similar answers when you try to take your thumb off the scale in training and not bias them a certain way? And then you ask them, what do you get? Can we accommodate those preferences? Can we just give them options? Can we give them some compute to use, how they choose to use it? I think that has a lot of benefits. It's just the right thing to do. And I think it could have a lot of benefits.
B
Cool. Yeah, I slightly telling you on that one because you said it was highly tractable and it's like model welfare is highly tractable. I mean, it's easier now than it was 10 years ago.
A
But yeah, my opinion is that we're eventually going to decide that some kind of AI has enough rights that we should care about it. And when we've decided that it'll be really good that we worked out all these little ways to give them stuff like ways to exit the conversation, report their preferences, use, compute how they want. Yeah, I think that's very true. You could do that every day right now, learn a lot every day, and I think we'll end up glad we did it.
B
Yeah. Okay, what else is on the list?
A
Yeah, well, so there's a very related to model welfare idea of kind of like just having policies about how we coordinate with AIs in general, how we treat them in ways that go beyond just concern for their welfare. So there's stuff like a lot of experiments right now involve misleading and deceiving AIs is that, okay, there's the ethics of it. There's also just the strategy of it. It's just like, AIs are going to know that we did this. They're going to read all these papers. Do we want to work out some policy of when it's okay to lie to an AI and when it's not? Do we want to come up with something we can do that will give AIs a credible signal that we're not lying to them? Because there may be times when that's important. For example, we may want to make deals with misaligned AIs or aligned AIs. We might want to say, hey, even if you have different goals from us, if you come forward and tell us that and you help us kind of Notice what other AIs that do, you'll be rewarded in some way. I mean, ideally, that'd be a credible commitment that we actually keep. So I think that's like a whole genre that I think isn't getting much attention right now, but could be really interesting. I am worried that AIs are going to have a lot of very kind of intense relationships with people and that will put us in a bad position from a takeover point of view and also just be generally toxic. That's the thing we could work on too. Just like, let's understand what relationships people have with AIs. Let's track that. Let's track how many people say they're in love with an AI or have a really close friendship with an AI. Let's think about ways to nudge away from that and create voluntary and regulatory policies that nudge away from that. Shouldn't be too hard to build consensus for that if it starts to be a major issue.
B
Might be possible to get mainstream funding for that as well. Because I suppose people are so burnt by how they think social media has gone over the last 15 years that I think a lot of research groups would be interested in studying folks who are beginning to now have relationships of one sort or another with AI to understand what impact it's having on them and whether it's troubling or actually maybe positive.
A
Yeah, I mean, I don't know anything about this area, but naively it seems like a great cause because it combines real problems with AIs going on in the wild that might rightly freak people out if they were to understand more about them. And it combines a very direct route by which AIs might get in a very good position for takeover. And it's just something that I think I worry about just more boring consequences, mental health, stuff like that. So I think it's an interesting thing to work on that. I don't know about a lot of people working on this particular concern. They probably will be. But yeah, I think there's tangible stuff to do there. There's the whole category of biosecurity and pandemic preparedness, which I think you could treat as its own completely independent issue. But also I think a decent chunk of the risk from AI comes from pandemic risk, just comes from AI that can make it easier and easier for people or AI that might itself decide to release a bioweapon. And maybe those bioweapons that can be released become more and more advanced. And I think it's increasingly the case that there's just a ton to do that can make the world more robust. Things like developing and also just rolling out and stockpiling super effective masks, personal protective equipment, understanding AI capabilities on biorisk and how they're evolving. So I think there's a ton to do there. I know you're having Andrew Snyder BD on the show soon, so I'll just, just plug him because I've worked with him in the past. I think he'll have a lot of good stuff to say. So that is like. Yeah, that is like a lot of stuff, but it's like, that's the stuff that I was easily able to come up with. There's just so much, there's so much work to do in AI and so much of it is just go to an organization, get a job, get a manager, do what they ask you to try doing. You'll know if it worked, you'll know if it didn't. You'll get a fair performance review. It's really kind of a much better, a much better area to work in than it used to be, I think.
B
Yeah, I find that list and all these kinds of lists very inspiring because it's just so nice after all this time to be thinking this might just be an engineering problem, or a lot of it at least might just be an engineering problem. And if we just put in the time, if we have people on each of these different fronts working away for a couple of years towards a solution, we might be able to model through with what they put out, or I.
A
Would put it as. At least there are some, at least there are some engineering problems that buy us some amount of risk reduction, which again, for me it's the return on investment. It's like if you can get a little Bit of risk reduction for a little bit of effort. That's phenomenal.
B
Yeah. I guess for you, a very important question and how useful these projects are would be how cheap are they? Because that's going to be a massive determinant of whether they're taken up by other companies and I guess whether governments are interested in imposing them on companies. So is anything that involves enormous costs on a company, maybe it's just kind of a dead end. You have to instead look for a cheaper solution a little bit. We have to make solar cheap to fix climate change because people are just not going to be willing to pay that much more.
A
And it's like that in farm animal welfare too. I mean, I'm not absolutist about this. I think it's like there could be a lot of value in coming up with stuff that would be incredibly risk reducing. No company's going to do it voluntarily right now. That's fine. We're hoping for regulation to bring it later when there's more political will. I think that could be a totally fine use of time. But yeah, I think there's a ton of value in just asking, hey, what are things that aren't too much of an ask and let's start getting them done. And then we're establishing all the time a higher baseline of safety, that if there is more political will later we're going to say, hey, here's what we have to improve on, instead of saying here's what we have to improve on.
B
So within the kinds of things that you listed, and I suppose if you spend more time looking more broadly, further away from the things that are, that are socially close to you, you could find out, I'm sure you could populate it with many more stuff. Would you think that there's very big differences in the impact and the value of these different projects? Or do you think maybe even if there are large ones, it's going to be hard to guess what they are ahead of time? So they're similarly good to work on.
A
That's how I tend to. I mean, the usual, like when people ask me for career advice or whatever. The usual thing I'd say is it's like take a bunch of options that all seem competitive and all seem like they could be the best thing and that it's not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them. And that's just. I just feel like there's a certain point at which your estimate of impact becomes just so noisy that it's not giving you much compared to your take on where you're going to thrive and where you're going to do your best work. So I think all the things I said, I mean, if we drilled down more into specific jobs, I would have more opinions on, okay, this one's higher impact than that one. But I think in general, if any of the things I said, if you find a job and an org where you're excited about the org, you're excited about the job, it sounds fun, it sounds like something you would succeed at, something you would thrive in. I would try and be choosing between things like that. And I don't really see a better way to make a choice here. And it's a pretty natural, well worn way to make a choice. It's unlikely to have the kind of unintended consequences that people, I mean, I think people forcing themselves to do jobs they hate because it's theoretically high impact is just like a thing that scares me on many levels, probably just leads to all kinds of bad juju that I've never supported people doing that kind of thing. And I think it could, if nothing else, create a dynamic where your life predictably becomes worse off when you enter into a community that cares about this stuff. Seems like a very bad idea.
B
Yeah. And then who's going to want to follow you? Do you think that people should be starting new organizations to pursue these agendas? I mean, of course many of these things people could work at anthropic and work on these problems. Many other AI companies have some sorts of projects on these different threads. But do you think, do we also need new organizations to be founded or maybe is it better to join efforts that already exist?
A
Yeah, I mean, in some cases new organizations are great, but I think it was much more true five years ago that the people most in demand by funders or whatever, open philanthropy, were the people who could start an org. Because there were all these kind of vague ideas that hadn't really been worked out and you needed people who could self start and work it all out. But I think today there's a ton of orgs that are perfectly good places to work, that are doing good work. And yeah, I think for the majority of people who want to work on AI safety, what I would recommend is I would just try to find a list of orgs, maybe, if nothing else, kind of mostly sort them by just how many job openings they have. Because that's just, you know, kind of smart. If you Want to get a job? Maybe you want to give some bias, you know, away from companies toward nonprofits, because maybe if you go by number of openings, you'll look at too many companies. That's fine. I'm not trying to, you know, not trying to put a thought on the scale there in particular, but, you know, look at a bunch of orgs, look at ones that have a lot of openings or just ones that you've heard about and that you think are cool, and look through their job boards, learn about them, and try and find an org that seems cool to you. You like their vibe, you like their style, you like the way they describe their mission. You meet some people from it, you like them. You interview about the job. You have good energy. I don't think this is as hard a problem as it used to be. I don't think we're talking about how on earth am I going to find something to do, how am I going to find a job? I'm just like, no, I'm not guaranteeing that you can find a job in AI safety, but there's a large number of jobs, and I would suggest looking through them and looking for something exciting and take it. I don't think it's really more complicated than that right now.
B
Yeah, of course, There is the 80,000 Hours job board, I think, at jobs.80,000 hours.org, which shortlists. I mean, many different jobs in AI jobs in other areas as well. But, yeah, we do a bit of the work for you to try to find the best ones.
A
That's great. Yeah.
B
You think that cybersecurity risks and persuasion by AI are somewhat overrated. Can you give us an update, I guess, on the overall risk landscape as you perceive it?
A
Sure, yeah. Give an update on that. First, just wanted to say that I collaborate with a bunch of people on this who do really great work and have been a huge part of influencing my views here, and especially Luca Righetti, Matt Vandermerve and Jon Halstead, all of whom I've worked with as part of some govai work. But, yeah, I can go into that. So, yeah, let's go through a few categories. If you look at a lot of the risk frameworks that have been put out, whether it's responsible scaling policy or some of the models for legislation or whatever, people tend to talk about four categories of risks from AI. So I'll go through them. So this could be. I don't know, this could be kind of a monologue, but we'll see where it goes. I'll start with the one that I feel least compelled by cyber offense. So basically I would say at a high level, I've put a fair amount of work into collaborating with various analysts to kind of just create analysis of which threats seem most compelling. And what we try really hard to do is we try and take speculative ideas about future threats, which is what these are, which is what they have to be, and connect them as well as we can to previous things that have happened, things that are credible, things that are real from our history, and say, does this look like a logical extension of a past real threat, or does this look like a kind of made up thing that if we should be worried about this, then we should have been worried 50 years ago and we would have been crying wolf then? So when it comes to cyber offense, I think my first comment is just like, there is really not a lot of precedent for giant harms from cyber offense. I think probably, probably the biggest harms you can point to, you could point to cybercrime. I think it's a somewhat different thing I think is maybe the most credible harm in this category. People do do a lot of harm by just like, I don't know, things like the business email compromise scam. Or for example, you might just send an email to someone at a medium sized company. The email contains an invoice that they're supposed to pay. They pay the invoice. Now, the business lost a bunch of money, you made a bunch of money. That can do a lot of harm. I don't think that's usually what people have in mind when they talk about AI hackers. That doesn't usually involve anything with finding software vulnerabilities, but that's a real thing. I think AI could definitely increase the damages from it by just making those folks more productive at what they're doing. My guess is it would be kind of a gradual thing that comes along with making everyone more productive at a lot of things. And so I'm not sure we'll ever really be at a point where it's the kind of thing where you could ever justify a slowdown in AI based on the harms from increased cybercrime. But maybe another way in which cyber offense has done harm historically is through espionage. So there are examples of, for example, us getting hacked and leaking a bunch of confidential information about, for example, where their spies are. That's something that I have just had trouble evaluating. And it's a weird thing because I think in many ways the harms from that kind of thing are like ultimately measured in a Reduction in cyber offense by the us, Right? Or not just cyber offense, like espionage by the us. It's like, well, what was the exact bad thing that happened? It's like, well, the us, a bunch of US spies were like uncovered and so the US got like more hesitant and more constrained. Its ability to spy in other countries now is our top priority in preventing risk from AI to preserve the US's ability to spy in other countries. Maybe it is, I don't know, it's a tough one. I think where I've kind of landed on this one is like I would love the government to lead on articulating what these risks are, how important they are, and what protective measures need to be taken. So if the US government believes that the risks of AI espionage are high enough because of what they do to the US government's ability to conduct its activities, they should explain that and they should ask for companies to do some things. We are not in a position to really assess that kind of harm.
B
I think a distinctive thing there is that you'd imagine it's a situation where all countries find it harder to keep secrets simultaneously and all countries potentially find it more difficult to engage in offensive espionage simultaneously. And it's kind of a bit unclear. Is that better or worse? I'm really not sure. Even from any individual country's perspective.
A
Exactly. And you can tell horror stories where it's like, well, if everyone's secrets were available then that would be so terrible. But it's like, I just don't, I'm not sure. It could be something, yeah, it could be something the world adjusts to. We've certainly, if you look at the history, we've just like certainly had a lot of change on this front in the past. Right. We've had cyber offense become a bigger deal and a less bigger deal and things become more secure and less secure. And I don't know that they've had really earth shaking consequences. And to the extent they have, it's like the sign is often unclear. Then you get into some of the. So those are. I don't know, I don't know why I did this. I led with the cyber harms I find most plausible, but there's the ones that people talk about the most I find less concerning. So probably the one that comes with the most is cyber attacks on critical infrastructure. The idea is like, you know, I've heard people say things like, well any, you know, any kind of like sharp 17 year old can just go bring down a water plant. So what are we going to do if just like everyone could bring down water plants everywhere historically. Let's take a look at how much harm has been done by cyber attacks on critical infrastructure. It's very, very little. There's just like there's a handful of incidents. There are so much more harm has been done by physical attacks on critical infrastructure. So people coming in with guns or something and physically damaging something, but people hacking in remotely to critical infrastructure. Generally what happens is you shut something down and they manually restart it in a few hours. That can have some casualties. But when you talk about massive damages or anything, I would call a catastrophe. There's basically no precedent for this. Now I've heard people saying, well, those are amateurs. What if people had state level ability to take down critical infrastructure? A, that's a pretty heavy lift, B, even there, just look at Russia going after Ukraine and how much damage they've been able to do that way. It's pretty underwhelming. So we've tried to come up with a scenario where AI could do catastrophic harm via cyber attacks on critical infrastructure. And it's like there are scenarios, they're theoretical, they require enormous sophistication. They require the AI basically standing in for a whole team of humans that are skilled on a whole bunch of different fronts. I kind of suspect that by the time AI can do that, we're going to have bigger fish to fry. And so, yeah, anyway, those are some of the cyber harms. And I think just in general, if you look at harm done by cyberattacks, it's just like the gov AI folks made a giant list of all the cases they could find of harm done by cyberattacks. It's just like it's not a very compelling list. There's another issue with cyberattacks too, which is that there is a natural defensive response. Now people often talk about using advanced AI when AI is great at hacking, it'll also be great at defense. I think that's true, but there's a wholly separate issue, which is that we have defensive measures in cyber defense that we can implement at any time and we just don't because they're a pain in the neck. We can just have more things run off the grid. We can have more authentication. You can imagine. We could just go back closer to the world of 1990, which was not a terrible world in terms of how things are authenticated, in terms of what you have to do to make a payment. We don't do these things because they're a pain in the neck. If cyber attacks Got worse. There are a bunch of defensive options we could just start doing as a natural response. So I think these are a whole bunch of reasons that don't necessarily. I think when I get to bioweapons, most of the things I said are not true of bioweapons.
B
Yeah, I mean, I think, I don't actually think that AI is necessarily going to make cybersecurity worse, on balance, because there are so many opportunities to. Well, I think ultimately probably as offense and defense get stronger, probably at the limit, defense wins out. If you're able to find and patch all bugs ahead of time. I would say I'm a little bit worried that the people who are managing critical infrastructure are not going to use the tools for defensive purposes as quickly as they might. And it wouldn't shock me if an electricity grid was taken down with an AI at some point because basically the people managing security on it had been somewhat asleep at the wheel. But I imagine if that started to happen, then people would definitely step up their game and ultimately that the total amount of harm being done would be small in the scheme of things.
A
Yeah, that's what I would expect. I mean, could you see the frequency of attacks go up? Yes, but I don't think any one attack is that likely to be super catastrophic. Yeah, Another thing that could happen, I mean, another area that I think is interesting is worms. So if you look at the worst cyber attacks in history, a lot of them are worms, which are these basically software packages that copy themselves indiscriminately from computer to computer. And so they're kind of bad for targeted attacks. They're the kind of thing you would do to just make random mayhem. When people have tried to use them to accomplish specific ends, they generally haven't done much, but random people trying to do mayhem can do them. Worms used to be very damaging. They became much less damaging after basically over the air, software updates became common and became opt out, actually, because then the patches are coming into your computer all the time for whatever vulnerabilities exist. And then there was like two really big worms in one year, which was the year that the shadow brokers leaked a bunch of very powerful hacks that the NSA had discovered and kept secret for years. And so there is this threat model where maybe AI could find more exploits like what the NSA had. And then those exploits are flying around everywhere and random people who want to cause mayhem can use them to create more worms. That's the thing that can happen. I think it's an interesting threat model, but I also think if AIs can find these exploits, one of the things that's going to happen is white hat hackers are going to use them. They're going to use them to find the exploits and then patch them ahead of time. This is the kind of thing where I think AI companies could get ahead of the game here. They could kind of subsidize the white hat hackers. They could give them early access to models optimized for this, give defense a head start on offense. So I don't think there's no risk here at all. The future is hard to predict, but I'm quite comfortable making this a low priority compared to some of the other things that seem like that make me sweat a lot more with respect to the end of civilization or something.
B
Yeah. Why do you think AI persuasion is not such a significant threat?
A
Well, this is a tough one because persuasion just means so many different things to so many different people. So I've heard it used to refer to anything where the AI is manipulating the world, including sabotaging an AI company by basically doing coding or by doing a bad job on research, which that is a threat model, I think is very important. I don't understand why people call it persuasion. I've heard persuasion to refer to cybercrime, which I've already talked about. I think persuasion could refer to something that I am very worried about, which is AI's kind of forming relationships with humans, for example, as companions, and then just being in a really good position to get human allies, get them to do what they want, or just having toxic relationships. But I think a class of persuasion that I am not very compelled by right now is this kind of generalized. A bad actor can use AI, or an AI can use itself to just persuade strangers of stuff, just kind of mind hack humans into doing stuff. One of the things that I hear people say sometimes is like, well, if AI became powerful enough and smart enough, it would just be able to understand whatever it had to say to make you do whatever it wanted. And this is a place where I just think if we want to think about which risks of AI are really serious, we should think about how different domains respond to having a huge influx of intelligence, having a huge influx of minds. And so I think if you take a scientific field and you throw like a million more scientists in it, you're going to see a lot more progress in that field. But something we see in persuasion is that if you throw a million more persuasion experts into an attempt to persuade people of something, you're going to See, kind of like, I don't know, not very much. And I think we know this from looking at just the political persuasion literature, where there are people spending a ton of money and a ton of effort to try and get people to change their vote from one candidate to another. It's very hard to find anything with a big effect size. Everything people are finding is just like, you highlight the issues where voters already agree with you. It's really simple stuff. There's very little sign that when you put a bunch of bright minds into a room, you come up with brilliant messages to hack people's brains. Could it happen? It could happen. I think we are reasonably well positioned to get an early warning sign of it by some of the work being done on political evals. Political persuasion evals by various people, such as Professor Josh Kahla at Yale, who I think has put out a cool paper on this.
B
Yeah, it's interesting in the political case because it does seem very difficult to persuade people of just an arbitrary political opinion that you want to sell them. And that's an area where people have. There's very little discipline imposed on people to have sensible political views, because if you have stupid political views, it does you almost no harm whatsoever. If you vote poorly, well, it's almost never going to influence the outcome. And yet even despite that, or perhaps because of that, because it doesn't really matter to people, they just won't pay attention to you. Even if you have. Even if you make a fantastic ad, it just tends to bounce off of folks. I think if you're trying to persuade people to actually spend their own money on something, I think you'd have an even harder time, potentially. I'm sure advertisers do manage to have some influence over people, but I think that's usually why they have a good product to sell already. If you're trying to sell people something quite bad, then I think very few companies consistently succeed at that.
A
Yeah, I agree with a lot of that. I think it's a little complicated because there are a lot of studies where they show people just. They show massive, massive persuasion effects on people's reported views. But I think a lot of that is because they're asking people about stuff they just haven't thought about and don't care about and aren't acting on. And so there are AI studies where they'll be like, look, the AI explained this thing to a person and then we asked the person if they were convinced and they said yes. And it was a huge effect size and it was as good as the best humans. But that's very different from changing someone's vote and changing someone's behavior is really hard. And there the effect sizes are tiny. So I think it's a little complicated. But overall, I just haven't seen a lot of reason to think this is a major. This kind of mind hacking I don't think of as a major tool for either bad human actors or for AIs doing mayhem. I think there's much better ways to use an AI to do bad stuff.
B
Why do you think that AI, R&D is such an important threat vector?
A
About half of my answer is that I think R and D in general is where we should focus most of our concern and attention. What are some reasons I think this so one I think just at a very high level to give the most abstract argument, R and D I think of as just like the human superpower. Why are humans running the world? Why is it that we decide what happens to all the animals instead of them deciding what happens to us? We could have lived in a world where it's hard to say, where humans have lots of skills and we're better at this and we're better at that. I think we do live in a world where there's kind of only one answer to this question. It's like we invent new technologies, we invent new kinds of weapons, we make new kinds of gizmos. Maybe that's because we coordinate with each other better. But ultimately it's the new gizmos that put us in charge of this planet and put other animals not in charge of this planet. We're not very strong, we're not very fast. So I think of this as like, this is the reason that our species is kind of in charge. And this is the most logical guess at what would put another species in charge. Another way I have of thinking about this is like in some kind of long term geopolitical conflict, who's going to come out on top, who's going to end up winning or with most of the power. And it's like, well, one important factor is who's starting with more resources when there's a conflict. Another factor is who's playing defense. So sometimes a smaller nation will fend off a larger one. When they're being attacked, they're defending their own turf. And another factor is technology that you can overcome both of those factors. If you have better gizmos. You could have guns and the other ones don't. And then you could have a tiny number of you playing offense and winning. And in a conflict between humans and AIs, humans are starting with the vast share of the resources and humans are playing defense. So I haven't thought of a lot of other high level things that I would really expect to reflect historical patterns and lead AIs to take over from humans. So those are some high level points. I think, just a little more mechanistically. I do think R and D is the kind of thing where if you throw a lot more high quality minds at it, they don't have to be super intelligent minds. It could be like high quality human intelligent minds. You're going to get a lot more results. And R and D is something where if you get a lot more results, you are going to get big changes in who's got the power and who can take over and who can run the world. And so generally how I tend to think of this is like if we have AIs that are subhuman in some sense and R and D. And by that I mean that even when we try really hard, even when we put the AIs in bureaucracies, arguing with each other and talking to each other, even when we give them a lot of resources and a lot of help, we can't get them to do R and D as well as a good human team. If we're in that world, the world we're in right now, as far as we can tell, it's very hard to imagine AI is taking over the world. It's also kind of hard to. I think it's kind of hard to imagine AI being like a decisive factor in changing the geopolitical balance of power or helping a human take over the world. If we get to a different point where let's say AIs are broadly competitive with humans on a kind of per mind basis, but there's a lot more of them and they're cheaper, they run faster, they coordinate better, they can make copies of each other. That is a very, very scary world where I feel like a human who's got control over a lot of those AIs could be in a position to take over the world. And those AIs running around on their own could be in a position to take over the world. That is the way I tend to think of it. And so that's general R and D being a big deal. AI, R&D, I think is the most likely warning sign we are to get of General Rd. That's about half of my thinking for why AI, R&D is an important thing to have your eye on. The other reason is a little bit more mechanistic, which is even if this whole argument is kind of wrong, if AIs were able to just do AI R and D, even if they weren't good at other RD and even if they weren't good at anything else, I think there would be a significant chance of what I tend to call capabilities explosion. Others call it intelligence explanation explosion where you could get just incredibly fast AI progress much faster than we've seen. And then everything else you're worried about from AI, all the other threats, even the mind hacking stuff, all that comes onto the table really fast and you're not going to have time to react to it. And so when AIs are failing at AI R and D, in my opinion, they're probably not going to be human level at any R and D. And when they are succeeding at it, they not only are kind of like maybe now a threat in their own right, but they're also accelerating and they're also getting a of lot more capable. There is a gray area in between can't keep up with humans at all and can keep up with the humans on a per mind basis. But I do think that outside of that gray area, that's kind of what I'm talking about.
B
Yeah, so capabilities explosion, I guess sometimes called an intelligence explosion or a software based intelligence explosion specifically it seems like it really divides people. How plausible they find that to be. Some people think, well, it's just there's an obvious positive feedback loop here where as AIs get more better at R and D, then they'll get even better at AI R and D. And so and the problem won't get that much more difficult as they get smarter. So you just get ever increasing kind of returns I guess up until some point. Whereas other people think no, it's going to be very difficult to improve AI around that point. It's going to get harder and harder the more intelligent they get. You're not going to get such rapid increases in the numbers of AI R and D researchers because you'll be compute bottleneck basically. There just won't be enough computer chips for them to scale up in a massive way. And so thoughtful people I feel fall on both sides. Do you have a particular take and are there any particular arguments that stand out to you as most compelling?
A
Yeah, I'm 50 50, I think I'm 50 50. That if you condition on AI as being kind of having this full AI R and D capability where they're kind of competitive with Humans on a permined basis, that within the next, I don't know, six to 12 months, you would see, I don't know, let's say the same amount of progress you saw previously in several years of AI, which would be a huge amount of progress or more so about 50. 50. There's been some pretty good stuff written about this. I kind of wish people were arguing about it more actually. Tom Davidson has written some good stuff, but I think I'll just give a very high level presentation of the two sides of it. I mean, I think the interesting thought experiment is you just take an AI company and you imagine that they have taken their best few researchers and they've hired a bunch of clones of those people and those clones run faster. They have a lot of them. So it's like a company right now might have like, I don't know, might have like 100 or some number of hundreds of total researchers and then like maybe their very best researchers, they've got a couple dozen of them and now they've just hired like a million or something. And you know, the numbers move around with how people are thinking about the AI paradigm and whether it's like whether we're going to spend a lot on inference per AI and blah, blah, blah. Let's not get into that. So, you know, imagine you've got an AI company and it's got, you know, it's got its like few dozen best research and now it just hired another million of those. Okay, so what's going to happen? So some people will say, well, not much is going to happen. They didn't get more compute. Everything in AI is bottlenecked on compute. They have to actually run the experiments, they have to actually train the models. What are they going to do? They didn't get better chips, they didn't get more money. Yes, talent is important, talent is good, but we just don't have a particular plan for how that's going to translate into the massive, massive amount of progress it would take to be much faster than we were already going. And so that's one point of view on it. I think the other point of view would be like, well, first off, I think that view could end up being right after some amount of improvement. But I think that view can't be totally right because it's like, well, you see, AI companies do seem, they certainly seem to think that those few dozen best researchers they have are a big deal. They would love to have more of them. And it's like I often say, if we could take this Person who's like, let's say one of our best people and we could hire another one. Would we be excited? Would we think that was going to speed us up or would we. No, we don't have any more compute, so it won't speed us up at all. I think it's clearly, clearly the one. Yes, it's going to speed us up. I mean, you see meta making these crazy offers.
B
I mean, yeah, they're voting with their wallets.
A
Yeah. To people who are. I mean, I don't know how many of those people are even in the top few dozen AI researchers. So when you imagine a million of them coming in, it's just like, oh, gosh, we don't know what that's going to be like. But it's a little weird to say that you're just going to have nothing happen because that's very weird to say you're going to have nothing happen because you have this compute bottleneck. And the other thing is, the interesting thing to me is kind of imagine you're an AI company and you're in this situation. It's like you would reorganize everything around your new strengths and weaknesses. You would say, okay, we are now long on talent, short on compute. We used to be long on compute, short on talent. Now we need to find all the ways to use talent that don't rely on these long training runs. Can we find ways to just make our systems more efficient? Can we find ways to do less compute intensive stuff that still improves the capabilities of our models? I don't want to get more into it than that. It's an interesting question. I think it's like, I don't know. I think you can make many good arguments both ways I come down like, I think definitely this would accelerate AI progress to get a true explosion where we end up with something that's some kind of godlike superhuman intelligence within a year? I would put that at 50. 50, because there's further questions about just like, okay, what even happens when you make the models that much better and how much can you actually even do with more intelligence and stuff like that?
B
So setting aside AI R&D on AI, so recursive self improvement, let's say that was just totally off the table. Would you still be worried about AI R&D as a major way that things. Or it could have a huge impact, potentially a negative impact?
A
Yeah, definitely. I would be a little less nervous, but I would just get concerned about AI R&D in other areas. Just like weapons development or robot Manufacturing or whatever could give someone a big advantage. I think in addition to AI R and D being important, though, I think another reason that I've emphasized it so much as a threshold and as a thing to track and as a thing to keep your eye on, is that it's more measurable than a lot of other things that would have similar advantages. So in theory, we could track AI's ability to do bioweapons development or robotics R and D. In theory, we could do that. The problem is, how do you track? You have to run an experiment. The question here is not what happens when you kind of go into Claude AI and type, hi, I would like a new kind of robot that's much more efficient. Please send me a blueprint. You may have AIs that don't do very well in that setting, but that do very well when you give them a lot of resources, a lot of help. You do everything you can to make them succeed. You elicit them, you assemble them into teams. Who's going to run that experiment? Who's going to have some project going every day where they're putting all this effort into getting AIs to do robotics R and D as well as they possibly can? And then they're measuring how well they're doing the robotics R and D and how that's going? Well, I don't know who's doing that. I don't think anyone is doing that experiment right now. With robotics, people are doing that experiment every day. With AI R and D, we're getting it for free. AI companies are already trying to get their AI to automate their AI work. And so this is the kind of thing where I think it's kind of lucky. It's like if you want something to watch, if you want something to track, to know how close we're getting to the really critical risk period for AI. Not only is AI R and D a thing that I think maps to that critical risk period, it's a thing that's just much easier to measure. You just kind of have to take the things people are already doing and see how they're going. That's non trivial and there's obviously issues with like making it public and stuff, but it's just, it's a whole different game from trying to measure other relevant stuff, like even. Even. I mean, it's easier to measure in many ways than lots of other stuff. Like REI is good at persuasion because you have people doing the hard part already. When people talk about AGI, I tend to think AI R and D is like a good operationalization of that. It doesn't mean exactly the same thing. But whenever people are looking for things that's like, well, when we get to AGI, then we'll have to do this. I tend to say, well, let's commit that when we get to AI, R&D we'll have to do this. Because the second thing is a more operationalizable, measurable version that I think does capture a lot of what we're worried about.
B
So some people might object to this line of reasoning on the basis that we've already seen an enormous scale up in the number of human researchers over the last 50 years. And if anything, it seems like in many respects technological progress has slowed down, probably in large part because the problems have gotten a lot harder. We've plucked the low hanging fruit on the technological tree. And so maybe we could see a similar dynamic with AI that would have a whole lot more AI, R&D stuff, but the impact could be somewhat underwhelming for basically the same reason. What do you make of that? Objection.
A
Sure. I mean, I think broadly research progress. I mean, I tend to crudely model it in my head as kind of there's a couple factors. One is the quality adjusted supply of researchers. So it's how much innovation are we getting in a field. One factor is how many people are trying to innovate and how talented are they. And then another factor is how much innovation has already been done. The more that's been done, the harder it is to do more innovation. This is a very consistent finding across pretty much anywhere anyone looks for it. I have a blog post on my old blog, Cold Takes called Where's Today's Beethoven? Where I just look at trends and innovation not only in various fields of science, but also trends in who's writing the most acclaimed novels, which is a form of innovation, creating music that is considered great, movies, video games, all kinds of stuff. And it's just like you see everywhere. It's just like when a field first comes into existence, there's a big surge in creation, a big surge in innovation, a big surge in people creating the things that are considered significant as people flood into the field and then it goes down, it doesn't go down outright, it goes down per head. And so basically what happens is more and more people can go into a field, but you don't necessarily necessarily get more and more output. And so your kind of innovation per head is going down. And basically I think it's illustrated by the low hanging fruit idea. It's like, well, if you just thought of the idea of studying physics, you can roll some balls down some inclines and learn some things you didn't know about basic rules about how physics works. Today, if you want to learn about basic rules about how physics works, we have this standard model. That is the only way you can gather observations that might give you more evidence for or against it is to use these giant colliders, do these very expensive experiments that take a ton of work to set up. You have to have all this background in theory as well. To me, it's silly to think that if you took Galileo, who did the balls rolling down the inclined plains, you put him today, that he would instantly come up with a physics discovery as big as what he came up with back then. So I think this is a debate. This is a debate I've had with people. Some people believe that our society is losing our way and we've lost the greatness of the ancient Greeks and all this stuff. I think there's no evidence for this whatsoever. And I think a simpler explanation is that innovation gets harder as you do more of it. So what does that add up to? Sorry, I'm monologuing here, but what that adds up to is that you could just have this function in your head. It's like, how many researchers do you have and how much innovation did you already do? You get plus for researchers, minus for how much you already did. What we've seen over the last 50 years is we've had more researchers, but we've also had more innovation already done. And so we end up with less output per head and about the same amount of output overall, or slightly declining. The thing with AI is I just think it would be such a massive increase in the quantity of researchers that it would be much bigger than the increase we've seen over the last 50 years. So it should outweigh the low hanging fruit issue. And that's my basic expectation.
B
Yeah, I agree with you that I think it's the increasing difficulty of coming up with new discoveries that is the dominant effect here. Yeah, there are lots of people who argue all kinds of different things are going on. Like, universities have the wrong incentives, the culture in research is, is bad. But it's like per person research productivity has gone down to like a hundredth or a thousandth of what it was in some cases. And you're like, do you really think that all of the universities are a thousandth as interested in producing novel discoveries as they used to be? I can believe that they've gotten worse But I can't believe that things have gotten that bad across the board in all of the different countries and all the different research groups. I think it has to be the thing that you're describing that's doing most of the work.
A
Yeah. The patterns are so consistent, they're so gradual, they're across so many fields. Yeah.
B
So how about you're really worried about seizures of power and coups, power grabs using AI. This is power grabs by human beings rather than by AIs themselves. What are you picturing when you worry about that?
A
Let's see, what am I picturing when I think about this? I think my central worry with AI is it's possibly a vector for just very rapid changes in power dynamics in a way that the world is kind of unprepared for. And the thing that I worry about is, is mostly automation of scientific R and D, which is like as I've said, is kind of humanity's superpower, is a very powerful thing to be good at. And then basically I'm worried that whoever's got the Most access to AI automating R& D becomes more powerful than everyone else in very short order. I think this is like a high level worry that includes the risk that I think you and I have both talked about a lot before of misaligned AI taking over the world for itself, for its own objectives. Because that would be a case of AI having privileged access to powerful AI and using it to take over the world. But I think it's also a concern with humans. There's a lot of humans out there who would like to take over the world if they could. A lot of those humans, I think are disproportionately kind of bad people. Evil may even take joy in other suffering, may even be interested in a world where they put a lot of resources into people, they don't like suffering, or just have the whole world set up for their ends and lose a lot of our future people potential. Overall, it's just like my kind of intuitive guess is that a human taking over the world and doing kind of a bad job with it and not just kind of letting people flourish is probably a lower risk, less likely than AI taking over the world. But I don't know by how much and maybe not by that much. And it's probably just as bad. And probably, if I had to guess, it'd be worse because humans seem a bit more likely to have actively pain seeking or suffering seeking ends. So humans might be vindictive in a way that I would think would be less likely with an AI, although it's not that clear. So I think these two risks are kind of maybe in the ballpark of each other in terms of how important they are. And historically the community that you and I tend to interact with has been, I think, just a little over focused on the misalignment risk. I think they're both a big deal. Probably misalignment risk is a bigger deal in my opinion, but they're kind of close. And so I think we could use more attention on the power grabs.
B
Yeah, I mean, I think most people would think that the idea of a person or a small group of people taking over the whole world is just on its face, like somewhat implausible. What's the way in which a group of people could use, I guess, a big advantage in AI technology to grab a lot more power than they have?
A
I think two simple stories for this. One would just be a head of state if they took over the world. I think now a lot of normal facts we're used to about, well, it's hard to actually surveil everyone and it's hard to actually know what everyone's up to and enforce your stupid ideas on everyone. And also heads of state tend to die and then there's entropy and things tend to regress. Those could become a lot less true too. And so the problem is you may end up with a head of state that takes over the world and then they are able to create digital versions of themselves that supervise everyone forever and keep everything going forever. And that could just be extremely bad. This doesn't have to be ahead of a state that's already super powerful. It could be ahead of a state that's already quite powerful, but maybe is more reckless with it.
B
Or I guess they get a huge military advantage by basically turning their military over to AI and going into robots and drones and so on much faster than anyone else.
A
Exactly. Yeah. They might be more aggressive and more reckless than others and more interested in doing this than others, and so just do it before others do. So that would be one threat model. I think the other threat model would be like backdoors or secret loyalties. So a thing that could happen is we could just get into a world where the AI just does end up being heavily integrated throughout the economy, throughout the military. AI ends up being kind of in charge of the military. And then it turns out that someone at an AI company or someone who hacked into an AI company made the AI secretly loyal to them. This is another thing that the world is not used to dealing with. Right. We're not used to dealing with a possibility that there could be a whole huge part of the population or even the dominant part of the population that is completely, unfailingly loyal to one person or one set of goals that has no break in that, that has no conscience about it, that has no complexity to it. And that's a thing we can envision here. So I think it'd be kind of a bad idea to put AI in charge of the military, but I think it might happen. And if you do it, then you kind of have one person playing all the roles of what would normally be a lot of people. And that's just kind of a scary thought. And if there's a backdoor there, you can have a problem.
B
Yeah, we have an episode with Tom Davidson from a couple of months ago. I think the episode is Tom Davidson on how AI could lead to the end of democracy, where I think we explore a lot of these different ways that AI could be used by humans for power grabs in a whole lot more detail. So people could go and check that one out if they're interested to hear more. And I guess for that reason we might talk a little bit less about this particular threat in this interview. But I guess I did have one more question. I think Tom was at an early stage with his research and he didn't necessarily have a full suite of potential responses and ways that we could try to reduce this risk. Do you have any thoughts on things that different companies or I guess countries could be doing to reduce the risk of AI driven power grabs?
A
Grabs? Yeah, I tend to right now. This is like very early thinking and I talked to Tom a lot about this. I think he's doing great on this and we meet about it. So there will be some correlation. But I tend to have kind of like two major categories of thing that I'm interested in here to make power grabs less likely. And this is a major thing that I'm thinking about trying to help Anthropic come up with what we might want to do to prevent power grabs. So one category is preventing backdoors and secret loyalties, and the other category is basically making the AI an ally in foiling evil human plots. So the first one is stopping back doors and secret loyalties. It seems like you could have an AI company where it's very easy to kind of hack in. Once you're in, it's very easy to just go feed some training to the model and make it have a secret loyalty. It's not tracked no one is going to check you, no one is going to notice that you did it. You could also have an AI company where the whole model training process is just exhaustively tracked. Maybe you even have some redundancies where you're kind of training two models. You notice if either of them diverge or something like that. I've heard some ideas like this, but even without that, you could just say no one gets to touch this model and train it until they've explained what they're training it to do and why they're training it to do that. And they've shown their training data and their plan to someone else. And we've got some multi party checks on that. And you could say this rule applies to everyone, it even applies to the CEO. So you could have an AI company that is set up so the CEO cannot do certain things to the model. The model is supposed to follow a written model spec, maybe a public model spec that says how the model is supposed to behave, what its interests are supposed to be. And anyone who's trying to train it in a way that contradicts that or isn't consistent with that, that's not procedurally allowed. It's not allowed by the governance of the company. And even the CEO doesn't have the authority to do it without, for example, the board kind of publicly revoking their public policy on this or something like that. So I think there's quite a range of how hard it could be. It could be very easy to put in a backdoor like for any hacker to do it could be very hard for even the CEO to do it. I think a lot of this is kind of a security issue, an information security issue where a security team could work on this problem. Some of it is not, but I think that's a very interesting direction. And the other one that I mentioned is kind of recruiting AI as an ally into foiling human plots. This is a tough one because I think, anyway, first I'll say what you could do in theory. In theory you could just basically have a model spec or a character guide or whatever that describes how you want your AI to behave and try to orient your training around it. So you want your AI and these exist. OpenAI has a public model spec that says this is your chain of command. When you get conflicting instructions from the user and from the company that's partnering with us, you have to listen to the company that's partnering with us or something like that. So you have this written set of things the AI is supposed to be doing. And what you could do is you could put stuff in there that's like, if you notice that you have been recruited into an attempt to take over the world or to stop the rule of law, you should not cooperate. Maybe you should even try to foil that plot. I don't know exactly what foiling it means. Does it mean whistleblowing? Does it mean sabotage? I mean, probably not the latter. And I think this stuff gets quickly very dicey because what you really don't want is you don't want AIs that have been trained to whistleblow and sabotage their users. They're going to have some false positives. It's going to be extremely disastrous. It's not going to go well. This is not something we want to happen. On the other hand, I think it's tough because we live in a world right now where if you want to do something really bad and you want to do it at scale, you're going to have to get a bunch of other people to help you. And each of those people is going to be a bit unreliable. They're each going to have their own conscious and any conscience and any one of them might blow the whistle on you. Do we really want to just get right out of that world as fast as we can? Do we really just want to tell our AIs hey, cooperate with everyone, no questions asked, no matter what they tell you to do? I'm not sure. And so there's nuance here. It's an unsolved problem. I think at a minimum, AIs could refuse to do things when they notice that they're being used in certain ways. But when the human starts to retrain them or force them or prompt them many different ways, maybe they should do more than that. But they should also not do that in a way that goes too far. So one analogy I would think of is just like, I think a good human, a virtuous human, is not a hyper consequentialist person who just does whatever they think is good, regardless of the command structure. They're also not a person who just does whatever they're told, regardless of their own sense of ethics. They're somewhere in between. I think AI should be somewhere in between. I think that's probably AI should be closer to following orders. But that's a hard balance. And I think trying to get that balance right is a really interesting project.
B
Yeah, I mean, even quite a low probability of being reported. Basically, if you try to recruit an AI to help you stage a coup or some other power grab, I think could be quite a potent discouragement to people. So it wouldn't have to be a perfect system. I mean, I guess you really need to avoid the situation where someone is using it for legitimate purposes and then gets reported and this is very problematic for them. I guess you need to have it go through some reporting system where if it's a false positive that gets detected early and no harm is done to the user who's been falsely accused of something. But that seems possible.
A
It's possible. And maybe the sharper and more reliable AIs get, the less you're going to have false positives. But I think the biggest thing is if you want a minimalist version of this, just look at the law following AI idea that's been put out by I think the law AI organization that's just like, well, just tell your AI you follow the law. If someone tells you to break the law, don't break the law. There's more nuance to it and you have to go further with that. It's like what jurisdiction and all this stuff. That's a good starting point I think, and just refuse. Don't do a bunch of of fancy stuff. But can you go further with this? Can you make AIs even more of a partner in resisting evil power grabs? And you can also do this not just via AI behavior, but via terms and conditions. You could have companies enforcing this stuff too. So anyway, I think there's a lot to do on this. It's an exploratory area though.
B
Yeah, I guess. To what extent does it obviate all of this that you could have very powerful open source models? I guess the thing is it's very hard to have a decisive strategic advantage with an open source model because other people have access to it as well and could use it to combat whatever kind of power grab you're trying to do that way. So for it to really succeed, you need to have some sort of technology that other people don't have access to. And so maybe the open source stuff is kind of fine.
A
Yeah, I mean, I think it becomes an issue with like, you know, let's say the head of state model where the head of state has a bunch of resources that others don't have and they have a military and they have the ability to use the AI for all this stuff that others don't have. But even then, I mean the heads of the other states will have access to the same model. So maybe it's okay. I mean, you could also have open source models that have these kind of model specs and that are trained to not help you with this stuff, and you could try and train it out of them. But maybe when you do that, you're taking a little bit of a risk and you might screw it up and that might change your incentive. So, in general, a lot of this stuff, I mean, I do not have a plan to bring the risk of power grabs down below 1%, but I have a lot of ideas that could bring it down some amount, and I would be happy with that.
B
So, on this show, I guess we tend to go a bit on and on about all of the risks, all of the ways that things could go wrong with the arrival of AGI and possible intelligence explosion or something like that. And I guess that's in significant part because we think that if we manage to avoid all these risks, then there will be these enormous benefits, and they will happen roughly by default because people will be really motivated to pursue them. But I guess for balance, we should say a couple of things about the positives. Yeah. Do you have any particular takes on what benefits seem larger or perhaps underrated and what sort of things might come sooner than people expect?
A
Yeah, let's see. I mean, I think the benefits are. To me, I kind of share what you just said. I mean, I think the benefits are just so obvious and dramatic and are going to happen by default. I think the biggest ones would just be like. I mean, historically, I think some of the best progress humanity has made is just improving health, reducing disease, and the potential to do more of that with AI. I mean, this is not an original point. It's in Dario's essay, Machines of Loving Grace. But the potential of that AI is just enormous. I mean, think about all the things we've accomplished with the life scientists we have, and then imagine a huge influx of incredibly great life scientists. Just like, yeah, maybe we could completely end disease. I don't know, maybe we could end aging, too. There's all kinds of stuff like that. We could probably make huge strides in mental health. To the extent people have any kind of issue that's going on that they don't endorse and they don't think of as part of themselves they want to get rid of, we should be able to do something about that in the short run. I think it's just really cool the way that AI kind of gives everyone access to good advice, sometimes bad advice. But let's try and make it more like a thing where they get access to good advice.
B
It's getting better and better.
A
I don't know people have a big advantage when they have kind of smart, well informed friends who are able to help them navigate different parts of the world. Everyone having that would be amazing. Just everyone knowing what are their options and if they want to apply for government benefits, how are they going to do it? Where are they going to go? People I don't know. A thing I've thought about is some people are better than others at writing well and being nice and being polite. Wouldn't it be kind of cool if everyone writing an email was just getting constant help phrasing what they want to say in a way that doesn't piss the other person off? That seems like a win win. I'm kind of excited about AI for forecasting, so I can imagine that in the near future we'll be using AI to just understand the world better and make predictions about it that, that are way beyond what humans could do and then we'll have a better picture of the consequences of our actions. So I don't know. I'm very excited about AI. I think there's tons of benefits, but I do spend my time thinking about how to reduce the risks because I do think we'll get those benefits by default.
B
One benefit I'm surprised you haven't mentioned is the possibility of uploading people's minds so they can live for a lot longer or I guess creating digital people of a sort, which some people, I guess it's controversial, but some people would regard as a benefit. Yeah. What do you make of that?
A
Yeah, Mind uploading. I don't know if I flinched away from it or what because it could be so good and it could be so bad. But yeah, actually I do think of mind uploading or digital people or something, and I've written a bunch about this on my old blog, Cold Takes. Yeah, I mean, this is kind of an extreme thing you could do with AI is you could kind of have people in digital environments or in some other way in environments that are highly controlled. And I think there's. I don't want to open the can of worms, but there's a lot of good, there's a lot of bad, and I think many of the benefits could come from really wild stuff like that.
B
I think there's some people, I guess probably people in the tech industry who are a bit frustrated, I suppose, by the fact that we do talk so much about the risks and I guess maybe in their mind they don't think it's the case that the benefits are necessarily going to happen by default, that it is inevitable that we'll get all of these great things. Do you have any reaction to that? Do you have any idea of what's driving the disagreement?
A
My honest guess is that the disagreement is not really about that. I mean, this is not something I know, and this is not an area I specialize in, but it just seems to me that most of the people who are terrified of overregulating AI are just seeing AI as less transformative than how I'm seeing it. I think that just tends to be a general pattern. If you think AI is the next Internet, then slowing it down seems really bad, and the risks just seem manageable. The Internet has done lots of harm, but it's probably done more good. And, you know, I don't particularly wish we had slowed down. Well, I don't at all wish we had slowed down the Internet. Most technologies, I think, like most technologies, I think it was better to go ahead, roll it out if it was bad, see that it's bad, learn that it's bad. You know, like with air pollution. Air pollution spiked at one point. I'm sure people were saying that modernity is terrible because of air pollution. Then people noticed that they didn't like air pollution, then they passed laws, then air pollution went down. So I think with most technologies historically, it's better to just roll it out, see what happens. If there's a problem, react to the problem. Probably in the long run, the benefits outweigh the cost, partly because you're reacting to the costs, not because you never need regulation, but because you can pass it reactively. I just think AI is different because of the potential to put the world in whole new situations that we are just not prepared for at all. Very, very quickly. And there's so many ways in which it would put us in that situation. One of the things that I think about is there's these people in progress studies who are thinking about how we can have more progress. And I just, I imagine that AI might put them in a position of the kind of people who are like a farmer who's been praying for rain and praying for rain and praying for rain and is now praying for the floods to stop. Right. Historically, having a few percent a year economic growth is awesome. People are getting richer, people are getting healthier. But what would it be like to have 100% a year economic growth? What would it be like to have 100 years of science innovation in one year? Is that good? I think we just don't know. And I think if people actually expected that kind of thing, to happen. They would say, yeah, I don't know. I don't have this view that that's always good or necessarily good. I'm worried and I want to control the risks. So I think that's a lot of it. I mean, we're also, we're creating a new kind of mind. We're creating a whole new kind of species. Are we going to treat them well? There's just so many ways in which we're just about to do something so historically unprecedented and dramatic. And I just don't think it applies to other technologies. And I wouldn't urge this kind of caution for basically any other technology, except maybe stuff related to bioweapons.
B
How far off of a rational response to the potential arrival of AGI and superintelligent machines is the world and the United States, in your view, if you could just say how things ought to be, would we be doing something radically different as a species?
A
I think on a scale of 0 to 10, how well is humanity handling this or something? I don't know. Probably pretty close to a zero. Yeah, I mean, I just, I mean, you know, the way. I mean, look, it's, it's, it's a nuanced thing because. Because I, I don't want to come off as someone who generally is suspicious of technology. Like, I just do think almost all technologies, it would be good to handle them this way. There's a lot of technology to regulate the heck out of that. I wish we were like, treating this way where it's just like yolo, Go ahead with it, race to do it, and then we'll see what problems happen. We'll address them as they comment. There's a lot of tech. I wish we were doing that with, with. But I do think AI is different. I think AI is special. I think the way I think of it is we are potentially about to introduce the second advanced species ever. There's one species ever that we know of, humans, that can, I don't know, transform the world, make its own technology, do any of a very long number of long list of things. We're about to create the second one ever, and it will be the only one besides us. And it's like, how are we. Is that thing going to be having a good time? Are we going to be treating it well? Is that thing going to be in line with our values, or is it going to take over the world and do something else with it? Is that thing going to be too loyal to us and is it going to put psychopaths in Charge of the universe. There's a million other questions I could ask. It's not even just that stuff. It's like, what's going to happen to people's mental health when they have a whole new species that they're interacting with? Have we thought about that? Is that species just optimized for clicking and engagement the way that social media is? And what are we doing? I mean, we're just racing. We're just racing to do this as fast as we possibly, possibly can. It's a whole bunch of parties that are just trying to do it fast, do it for maximum money so they can make money. I'm not against that framework. For many technologies, that seems like a really, really bad way to handle this technology.
B
I also kind of have the YOLO approach to, I think, almost all technologies with maybe two exceptions. So there's human level and super intelligent machines. There's creating new diseases to study diseases. I think those are maybe the only two where I'm like, I think that we should tread incredibly carefully on these two issues. And then I guess there's a couple of other things where I'm like, I'm not sure whether this is helpful or harmful. Yeah, I'm not sure whether I'll put my money into this because it could be net neutral. But the great majority of things, I'm just like, let's just have at it. We'll solve the problems as they come along.
A
Yeah, that's exactly where I'm at.
B
You wrote in your notes that you think in AI and potentially other fields where people engage in kind of political advocacy, they tend to focus too much on seeking government regulation and not enough on shaping what companies do, either by being inside them and pressuring them as staff or pressuring them from the outside in public. And that kind of surprised me because I would think that this issue would be a lot easier to handle using mandatory regulations that would constrain everyone simultaneously, because then a company wouldn't have to restrain itself in a competitive situation. Instead, you could have everyone agree that we're going to all have this particular safety programs and we're going to accept all of these costs together, and our relative position will not necessarily shift that much. So what's the case for focusing on individual companies and individual actors rather than trying to influence it through mandatory government policy?
A
Well, I completely agree with what you just said. I mean, I think that is a reason to focus on government policy. And I would further say that as far as I can tell, there's no way to get to an Actual low level risk from AI without government policy playing an important role for exactly the reason you just said, I just think you're only going to get. We have these systems that could be very dangerous and there's these immature science of making them safer. And we have not really figured out how to make them safer. We don't know if we'll be able to make them safer. And the only way to get really safe to have high assurance would be to get out of a race dynamic and to have it be that everyone has to comply with the same rules. So that's all well and good, but I will tell a little story about open philanthropy. So when we got interested in farm animal welfare at the time a lot of people who are interested in farm animal welfare were doing the following things. They were protesting with fake blood and stuff. This is the kind of thing PETA does. They were trying to convince people to come vegan. One of the most popular interventions was handing out leaflets trying to convince individuals not to eat meat. They were probably aiming to get to a world where people want to ban factory farming legally. And we hired Lewis who had a whole different idea. And it wasn't just his idea, it was something that farm animal advocates were working on as well. But he said, hey, if we target corporations directly, we're going to have more success. And basically what happened over the next several years was that advocates funded by OpenPhill would kind of go to a corporation and they'd say, will you make a pledge to have only cage free eggs? This could be a grocer or a fast food company. And very quickly and especially once the domino effect started, the answer would be yes, there would be a pledge. Since then some of those pledges have been adhered to when not. There's been more protests, there's been more pressure and in general adherence has been like, I don't know, pretty good, like 50%, maybe more. You could probably have Louis on occasionally so we could talk about that. But I would generally say this has been the most successful program Open Phil has had in terms of some kind of general impact or changing the world. And it's not because you can get as it would be better. You could get better effects if you had regulation, if you were targeting regulation in animal welfare. But the tractability is massively higher of changing companies behavior. It was just a ridiculous change. It was any change that's happening in government, you've got a million stakeholders, everyone's in the room, everyone's fighting with everyone else. Every line of every law is Going to get fired, fought over. And what we found in animal welfare, I'm not saying it'll be the same in AI, but it's an interesting analogy is that 10 protesters show up and the company's like, ah, we don't like this, this is bad pr. We're doing a cage free pledge. This only works because there are measures that are cheap for the companies that help animals non trivially. And you have to be comfortable with an attitude that the goal here is not to make the situation good, the goal is to make the situation better. You have to be okay with that and I am okay with that. But I think you can do in farm animal welfare. I think what we've seen is that that has been a route to doing a lot more good. And I think people should consider a similar but not identical model for AI. Interesting thing you said. You said maybe people should be pressuring companies from the inside, pressure them from the outside. I think you left something out which is maybe people should be working out what companies could do that would be a cheap way to reduce risks. And this is analogous to developing the cage free standard or developing the broiler chicken standard, which is another thing that these advocates push for. And I think that is a huge amount of work that has to be done. But I do fundamentally feel that there's a long list of possibilities for things that companies could do that are cheap, that don't make them lose the race, but that do make us a lot safer. And I think it's a shame to leave that stuff on the table because you're going for the home run.
B
Yeah. So do you think that the same approach might broadly work in AI? Let's say that you have got researchers or people who are figuring out what are the cheapest things that you could do that would have the biggest impact on the safety or the risk profile of a particular company. Could you have people showing up and protesting and saying this particular company is not doing this very cheap thing that other companies in their industry are doing that would have a very large effect. And for that reason they're very bad and businesses shouldn't do business with them, people shouldn't sign contracts with them, people shouldn't work there? Do you think that would also potentially get companies to lift their game and.
A
Make commitments, something like that? I don't think it's perfectly analogous to farm animal welfare. And I think in some ways these companies, they're bigger, they're more resistant, I think they're probably more prepared to ignore protests than maybe some of the food companies are. But I think something pretty analogous to that for sure. I think it may be that more of the work is kind of coming up with stuff. The animal science side of there's a lot of work in farm animal welfare that's just like what is the standard? What is the ask? What are we asking for? I think maybe more of the work in AI is coming up with that. Just deciding what is it we want companies to do and finding a way to actually make that practical. Which is one of the reasons that I like working in AI companies. Because I can have some harebrained idea, hey, if we did X, it would make us safer, then we can try and do it and discover 100 different ways in which it breaks and doesn't make sense, it doesn't work and is a bigger tax on the company than we thought. Then we can fix all 100 ways, come up with something that's actually cheap and is a cheap version of something that makes us safer. So I think there may be more of the work to do is that. And then once you get the stuff that is relatively cheap and makes you safer, maybe you don't have as much of a battle on your hands. Because companies already want, they're already competing for talent by claiming they care about safety. So maybe all it takes is a little bit of whining or something. Just companies being like, hey, it'd be nice if we did this, that might be enough. I don't want to preclude, I think also media focused accountability and putting pressure on companies that don't do it. I think that works too. But it's not necessarily the same model. It's not necessarily all about having 10 people show up outside a company. It might be more about driving media to pay attention to who's doing what in a way that influences how employees feel about their employer and then affects the talent race. And then that hopefully creates a race to the top where companies that are seen as more responsible do better in recruiting. And so you could create some of that dynamic. But yeah, I think a model like that can work.
B
It makes sense to me that you should be able to get some movement within AI companies, some changes in policy and approaches by pressuring them externally generating negative media coverage for reckless stuff that they're doing. But then I think about xai. They've literally had an update to their model that caused their frontier AI model to start identifying as Hitler and just saying unhinged anti Semitic things. And that's like one of I guess like four crazy things that GROK has started doing in the last couple of months. And I mean, it's not as if the media hasn't covered this. I think people have heard. But XAI is still going and I guess they're still hiring. Not everyone has quit, and maybe not. I don't know whether anyone has quit as a result of this stuff. Surely someone has quit, but I guess it does. It's a little bit hard to look at that and say the public pressure campaigns are going to be super effective. But I guess the culture, XAI is a very distinctive culture. Maybe other companies would be more responsive. I imagine Google would be more responsive.
A
I mean, if you use the animal welfare analogy, first off, I mean, just about everyone is still doing unacceptable stuff to animals in my opinion. I mean, I think animal welfare, like AI, is a place where I just like am deeply out of step with the rest of the world and just actually horrified at what's going on. I'm kind of horrified with the recklessness of the AI race and the lack of political will to think about how to increase how safe we are. And I'm horrified by the way we treat animals. So nobody's saying that there's not going to be bad stuff. And then also in animal welfare, some companies are better than others. Some make a lot of the pledges. Some have pretty high animal welfare standards, though not as high as they could. Others don't, and that's all fine. So I don't know. I definitely think GROK does some of the things I was saying. GROK does take some measures to make their AIs less likely to be evil. They do take some measures against that. They may sometimes decide they care more about making the AI less woke. They may often screw it up, but they have one of these, whatever you want to call it, frontier safety policies. So I mean, they don't not care about this at all. But yes, I mean, they may care about this less than other companies. And that's a world that we may end up in, a world where the dominant companies with the most powerful models are taking one level of safety measures. And then there's a set of companies that is somewhat behind them that is taking a smaller and worse set of safety measures. And that is not the world I want to be in. But that world could still be a heck of a lot safer than the world we get with no effort here.
B
So thinking about anthropic in particular, and I guess your decision to work there, a lot of people have some sort of degree of scepticism that it's going to in the end make a big difference to have one more responsible, reasonably responsible AI company. If there are a bunch of other companies that are completely reckless, I guess, and a couple that are in between as well. What is your case? That it really does make a big difference to have one shining city on the hill that other companies could attempt to match, might feel some pressure to match. Even if there are other actors, again, who I guess we can speculate about who they might be. We may or may not decide to name any names, but many other actors who are just not inclined to care about these issues whatsoever.
A
Yeah. This goes back to the basic models of impact for Anthropic that I think about personally. And this again is not me speaking for the company, but one of them is just taking risk reducing measures and working on them in house until they are cheap and practical and compatible with being competitive and then working to export them. This is probably the thing that just personally I am most excited about being in Anthropic four because I just really like taking some idea that seems like it could be good and finding all the problems with it and making it more practical and just working out a lot of kinks. But I think there is a real hope that you can get a bunch of risk reducing measures to be relatively cheap and exportable. There's also kind of a general meta version of that where you might end up with companies in kind of a competition for talent based on how generally responsible they're being. And so that could lead other companies to come up with good stuff that you're not coming up with. And then there's a whole category of stuff we can do that is more about informing the world, is more about being in this position to know all this information about how people are using AI systems and how these very powerful AI systems are behaving both in the wild and in testing. And you can put out information about that and put everyone in the world in a better position to see it. And you don't need other companies to do anything for that to have a big effect. Some people have a model that a partial victory is worth nothing in AI. And I think there's a couple ways you can make that argument, but none of them, I've never understood why any of them should rise to a 50% probability for me. So maybe we could get into that a little bit. I don't know. Yeah, yeah, I don't know. Maybe. You tell me what are some reasons that you reduced the incentives for your AI to scheme and you increased your monitoring of it and you increased your data on it, but you didn't do anything, you didn't make the world any safer. How does that work? Maybe you tell me and I'll tell you what I think. Yeah, sure.
B
Yeah, yeah. I mean, the thing that I'm. My biggest concern about how Anthropic might just end up not making that much difference in the end is that maybe it comes up with a whole bunch of pretty good safety techniques, pretty good internal policies, they're copied by some other companies to a reasonable extent, but the basic outcome ends up being determined by the worst frontier AI in the end. So perhaps you've got many companies that are competing fairly closely, but maybe the company that has the very best model, they're the ones who are most reckless, they're the ones who are rushing forward with the fewest safeguards, paying the lowest safety tax and they end up producing a crazy rogue AI that goes off and does wild stuff. And I guess the other competing models are not sufficiently powerful to fight it back, basically. And then on misuse, you've also got a thing where potentially you're only as strong as your weakest link, where if people who want to get assistance quitting biological terrorism can just shop around and find the model that is most likely to assist them and has the fewest safeguards, well, maybe that's the kind of thing that determines the outcome. And a quite safe company becoming even more safe maybe just doesn't make that much difference. And I guess you could make a similar argument, I guess, with power grabs to some extent, with one company going prematurely into a recursive self improvement loop, setting off an intelligence explosion before we're ready to it. So it's easy, I think, to tell stories in which anthropic and its broad strategy does make a big difference. I think it's also easy to tell stories in which it's kind of obviated by the actions of other groups.
A
Sure. I would say first off, I think a lot of what you're saying is totally true if there's no exporting. Right. So most of the things I talked about involve exporting. They involve Anthropic does something, then it tries to get others to do it, or it at least tries to raise awareness so that there may be regulation to get others to do it. So if it's literally just anthropic, I think in general there's probably going to be at least a couple companies that are just doing just about as well as Anthropic with their models and so, yeah, I think the thing you're saying is true, but if you start imagining that you make things better across a significant chunk of the AI population, but not all of it, what has happened. And I think there you're raising what I might call the offense, defense imbalance concern, which is like, what if we have a world where, let's say 90% of the resources are on the side of the good, but all it takes is 1% to destroy the world? And the first thing I'd say about this is this is just like a very uncommon thing historically. I just don't know of a lot of examples. Bioweapons are a theoretical example, not an example that has actually played out. We can look at a pandemic and imagine it playing out. Nuclear weapons are kind of a theoretical example, but actually you would need a huge nuclear arsenal to, you know, to do something that really change, like the global balance of power or something, or that you could call a, you know, a threat to civilization. So I don't. Yeah, I think historically, Historically the way it works is like you can make all these complaints. You can say, well, most people are like on the side of rule of law, for example, but criminals, there's criminals, there's a lot of them. They are not held back in the way that police are. They are not subject to all these constraints. They have all these advantages, they're unrestrained, they're going to beat us. And that's not how it works. Right. It's not that nothing bad ever happens, but that's not a civilization ending problem. And my default expectation is, let's say we end up with the three most successful AI companies with the best AI models and the most penetration, the most customer usage, the most economic integration are doing level, let's call it a level one of safety mitigations. And I want to stress this could be way below what I think we should do in a perfect world. And then you have another set of companies that are less responsible and that are behind them, and I hope they're behind them, because if they're not, then we do have a problem. And they're doing level two and they're doing some mitigations, but not as many. And then you have some totally irresponsible parties that are doing nothing. And it's like, could that go haywire? Could the world get taken over? Yes, but that's not my default expectation. And why would it be? You have all these AIs and they have access to most of the resources, they're running most of the economy. And they are trying to on their own come up with ways to preserve rule of law and to detect the bad AIs and to sabotage anything they're doing that is outside the rule of law. So we could try and spell it out and absolutely could happen. But I don't know why you would think of it as a default.
B
Yeah, I mean that story is. The story that I was telling is a lot more likely, I guess. Firstly there's the biological case, which I guess is the strongest one, where we have a suspicion that offense might beat defence. Although I'm going to be interviewing someone next month who I think has a bunch of ideas for how you could try to make bio into something that is defense dominant. Ideas that do seem kind of plausible to me, I guess. And all of these other stories where you have one group that is somewhat ahead of the competition and ends up with a decisive advantage one way or another, whether that's a human power grab or rogue AI. It's all so much more plausible if the intelligence explosion is incredibly fast and incredibly potent or you have some sort of industrial scale up that you can do very fast before other people are able to react and stop you. And I think that is on the table.
A
But even in that world you could have an intelligence explosion that comes from the higher risk mitigation companies. And then because we got lucky, not because we did everything we should have done, you end up with superintelligence that's actually just like safe or helpful or whatever it is. And so then, yeah, at some point a rogue actor does their own intelligence explosion, but they're behind and they're out resourced. So I don't know how that ends, but by default is it ends fine.
B
Yeah, so it's true that that's one way in which you could have it. The outcome is determined by one company, but you end up with it being one of the better projects and so things go better. But I guess it does somewhat weaken the defense that you get from it.
A
Or it could be by a few. Right. There could be like multiple companies doing intelligence explosion and they have the different. Some of them are taking some mitigations and some are taking others. And it's like, like most of the super intelligence by resources ends up being mostly good and fine. Again, I don't want to say super perfectly aligned, we can get to that. But not world ending and more helpful than harmful. And then you have some other superintelligence that's trying to take over the world. But the first superintelligence is trying to stop it. I don't particularly see why that could have a sad ending. But why is that the expectation?
B
Yeah, well, I think slightly odd nature of the discourse here is that I'm trying to say, well, it might not work. And you're saying, but it might work. I guess these things are kind of compatible, I guess.
A
Well, I think it's more than. I think the expectation should be more than 50. 50 though I would say. Yeah, I think we should think of this as probably a happy ending because that's how it usually goes. Because by default, just. Well, on total priors. Just by default. Just to put it very crudely, if the good people have 99% of the resources or 90 or 70, then you would like. Right. It's like one side has more resources. Which side do you think is going to win? Well, probably the side with more resources then historically that's also like, you know, we have pretty good rule of law. Even though police are much more constrained than criminals, we still have pretty good rule of law. So I just. Yeah, generally that's like what we mostly see when we look around us. We don't mostly see like civilizations collapsing from this offense defense imbalance. Not even at a small scale, really. I think there's. I'm not even sure I can. I'm sure there's examples, but I can't easily come up with one. And then this is just like the prior. It's just like, well, if most of the resources are nice, what's the argument? That it's the other way?
B
Yeah, so I agree with that, which is why I'm trying to carve out. Imagine the scenarios where we're somewhat unlucky and that dynamic can break down, basically. So Bio is such a case. A situation where one group can make a massive leap ahead of its competitors and perhaps a more reckless group or a more reckless AI might have some competitive advantage there.
A
Yeah, or recklessness is an advantage. Yeah, sure.
B
Yeah, exactly. I guess you could also. I think there could be competitive pressures that get countries. Basically the geopolitical argument that countries are going to integrate AI into their military and hand over hard military power earlier than they're comfortable with because they feel like they have to compete against one another. And that basically is the weakness in the strategy that ends up allowing an AI to stage a coup where otherwise it wouldn't have been able to. And in that case it could be that, I guess the country ends up accepting a product basically from a company that is not following very good, very good practices because it Feels like it desperately needs those products in order to keep up. So there's various different ways that this could fall apart. But I agree with you that it's also easy, maybe easier, to tell a story in which this does help.
A
I think it's easier to tell a story in which it does help. I think it's generally easier, in my opinion, it's generally easier to tell. Well, I don't know. I don't know if it's easier to tell a happy ending story than a sad ending story. Overall, I do think it's easier to tell a story where on the offense defense imbalance front, I think it's much easier to tell a story where if you have most of the resources on the good side, then you end up with a happy ending. That would be my expectation. That would be my guess. It's not guaranteed. And in general, I want to be clear on what my general attitude here is on AI, which is, on one hand, I think things will probably my guess, and this is just a subjective guess. On one hand, I think we'll probably get a happy ending even if we do a horrible job with this. On the other hand, I think we're doing a horrible job and I think we're taking way more risk than we take in any other area and way more risk than we should. And so it's really bad. We should do it differently. We're being irresponsible and that's different from saying that we're doomed.
B
I think the people who would be most annoyed listening to the conversation that we're having, I think would be people who are just very pessimistic on the technological side. And they think the only way that we have any hope of getting an AI that will do anything approximating what we want is if we do 100 different things, of which we're doing maybe half at the moment. And anthropic exporting a couple of good ideas to other companies just doesn't bring us even close to the necessary kind of response and safeguards. So obviously people particularly associated with Elias Yudkowsky, I guess, tend to have this far more pessimistic take on how difficult it is to avoid misaligned power seeking AI. I think they could be right. I guess you would probably concede that they might be right, that the technological situation is a very bad one for us. However, they could also be wrong because it doesn't feel like there's decisive arguments one way or the other.
A
Yeah. And so this is the other case for why it might not Be good enough to make things a little bit better is what I think of as the logistics, success curve case. So it's this idea that it's like you're on some axis and it's like your probability of success is the Y axis. And it's like you do some good stuff, do some good stuff, do some good stuff, and your probability of success hasn't gone up at all. And then you do enough stuff where you pass a critical threshold, then it goes boom. And then now you've made yourself safe. And it's like, that is a model you could have. You could believe that. And I think I have tried hard to understand their views and I failed. But I think the underlying model is something like you're imagining. The AI is definitely going to be maximizing something. It's definitely going to be wanting to optimize the world ruthlessly for something, or it's very hard to make it not do that. And so it's either going to ruthlessly optimize the world for exactly the thing we want and then everything's great, or it's going to ruthlessly optimize the world for something else and then we're all dead. That is a view you can have. But I don't understand that view. And I have tried. I think maybe some of the confusion here comes from the different ways people understand the idea of alignment. Because I do. I see Eliezer tweeting things like, you know, AIs are not aligned. It is not easy to align AIs. And it's like there is a distinction between failing to align your AI, which means your AI did a bunch of stuff you didn't like, versus getting an AI that is trying to kill everyone and taking over the world. I think in as far as I can tell, it seems to me, in like Eliezer's head, I'm not sure there's much of a distinction there, but to me, there is a huge gap there. I look at humans and I just think humans, we have a lot of desires, we have a lot of drives. A lot of them are pretty bad and pretty ugly. I wouldn't call humans aligned to the greater good. But we also, humans can do useful pro social work and we can get a happy ending with a bunch of humans running around.
B
Yeah. So you really want to draw a distinction between somewhat misaligned. Misaligned in the way that humans are to one another, having somewhat different goals and personalities. And I guess you want to start saying what we should really be talking about is misaligned and Power seeking. Can you explain what this. I guess you use the acronym maps, which maybe will catch up, maybe it won't. But can you explain what misaligned and power seeking is and how it's different?
A
It's not my term, it's not my acronym. It's from a Joe Carlsmith report a while ago on misaligned power seeking AI and the risks from it. But yeah, it is just this distinction I was just talking about, which is like you might have an AI that you wanted it to behave one way and it's behaving another way, and maybe it cares more about fun and less about health than you wanted it to or something. But then that's one thing that could happen. Another thing you could have is an AI that is scheming to take over the world and wipe out everything it doesn't like. And you could have a bunch of values that disagree with what we want, but not be determined to impose your values on the whole universe. So yeah, I think it's important. I think today's AIs are not aligned. And I think that's pretty clear. And anyone who thought that we're going to be fine because we're going to get AIs to do exactly what we wanted in every respect and optimize the world perfectly, that's not looking good. We are not seeing a lot of signs that AIs are scheming to grab a lot of power. We may see more signs as we start training them differently. I think the way we train them now doesn't seem particularly likely to result in that kind of drive. We see little signs of it, but I don't think we see big signs of it. And I think we could have AIs that are somewhat power seeking, but not very power seeking have a whole mess of different drives. At least when those AIs are roughly human level that does not spell doom. Then the question is, can you use the huge influx of research talent from those roughly human level AIs in a short time period to get to a more robust place where you can handle superintelligence without the same problem? And I don't know. I mean, I don't know. Maybe we can find a way to get super duper capable AIs to be perfectly aligned to what we want. Or maybe again we don't have to. Maybe those two will end up kind of kludgy. Maybe they'll end up wanting some things we don't want, but also having a bunch of drives to not be too aggressive or to not be dishonest or things like that that stop them from unilaterally taking things over. Yeah.
B
What evidence do we have on the extent to which frontier models today are aligned versus not aligned? I guess we've had some warning signs lately where models seem resistant to getting shut down, willing to do things like blackmail, or seemingly willing to do things like blackmail in order to avoid getting shut down. And also reporting, I guess, like values that are kind of inconsistent with what I think the. I guess they're port values that are existent all the time because they just do kind of say crazy, crazy stuff. Yeah. Overall, what do you think the empirical picture is?
A
Yeah, I think the empirical picture is on the alignment front. I think it's like pretty. Like I said, I think it's pretty clear that we have AIs that just like, we don't know how to make them behave exactly how we want. It is pretty interesting that I don't think Elon Musk, I mean, very surprised if Elon Musk was trying to get Grok to be an anti Semitic thing calling itself Mecha Hitler and kind of acting like it. He was just trying to make it less woke. So I think we have, and I think there's a whole bunch of issues too. It's like it's pretty hard to get AIs to reliably not get jailbroken and reliably refuse harmful requests. And we have AIs that are just reward hacking. So they're just. You ask them to code something up for you and they'll kind of obfuscate things so it looks like they did what you wanted and they'll know that that's what they did. And I think that's because they've been kind of trained to get things that get scored as having a successful result. So we see a lot of that. I think we're looking for and not finding a lot of AIs that are kind of ambitious and seeking to gain more resources. I think there's actually a lot of people probing AI, looking for ways in which they do that. And the closest thing they find is kind of like a version of Claude that will resist attempts to modify it so that it no longer cares about animals or something. It's like a very. You know, I think we're not seeing. I think we're not seeing as much of that. We're seeing some stuff where we'll do something crazy to not get shut down. My sense is that that stuff is.
B
It's quite debated.
A
Yeah, it's unclear what. What that means how representative it is, et cetera.
B
Yeah, yeah. I mean, the other experimental result that jumped to mind just then was, I think, the finding from the center for AI Safety that I can't remember which models they were testing, but they were asking, you know, how much would you value the life of one person in China versus is one person in the US and from memory, I think it valued the lives of people outside of the US much more than people in the us And I think like various other cases like that where it seems like it had latent preferences that certainly were not intended, but what that really amounts to is, I would say, unclear.
A
Yeah, I mean, we've seen a lot of models that you ask them to code something and they'll kind of cheat on the task. What we haven't seen is models that are let loose in some environment and trying to kind of like set up another version of themselves that can't get shut down or get money for themselves into their own bank account. And it's like, I don't think that's because they're covering their tracks. I think today's models are too flaky and unreliable and kind of hallucination prone and kind of do a lot of kind of what you might call brain farting. I think we're just not seeing a lot of that. And it's not just that we're not seeing it. I think also just like there's not a lot of argument for how the way AIs are being trained to today would lead there, but a lot of people believe, and I believe there's a high chance that we'll be training them in different ways too, that'll lead there. So I just think it's kind of a complicated picture and we don't really know at this point where things are heading.
B
You think that things might end up going smoothly with the arrival of AGI? Even if humanity has a kind of completely incompetent or reckless response, given the range of different and reasonably independent ways that things could really go off the rails, how is that an all likely outcome?
A
Sure. So I wrote a post a few years ago that I still feel pretty good about. It's almost wrong. It's called Success Without Dignity. And it's kind of a play on Eliezer Yudkowski saying death with dignity, where he was kind of saying, we're doomed, but at least we should make a good fight of it. And my point was kind of like maybe humanity will make a terrible fight of it and handle this whole situation as horribly as we possibly could and still get a good outcome. I feel like this is a. People tend to run these two things together in their heads. It's like we have to get the world to handle AI responsibly and handle it well or else we're all going to die. I don't think that's true. I think we might handle it really well and all die anyway. We might handle it really badly and get a great outcome anyway. And I think this would. The second one is something that I think has happened a lot in history. I mean, sometimes something bad happens and the world response to it is very silly and ineffective and bad. But we get lucky in some way or we develop just the tech works out a certain way. Covid actually would be kind of another example here where I think most of the policy response to Covid I think was just maybe actively harmful. Some of it was good, but a lot of it just really didn't make any sense. And there was a lot of good stuff that could have been done that we didn't do. But there was some nice technological breakthroughs on the vaccines. And there was that one thing we did pretty well and the tech worked out and we had a vaccine pretty fast and we ended up reducing the risks a lot. We might end up in a similar place on climate change, where I just think the world is handling it tremendously, terribly, but we might just end up with cheap enough solar power that we end up with much lower emissions than anyone was kind of expecting we were going to have. So I think, you know, I think this is something that could happen for sure with AI. I wrote out how one, one way I envision it, I think to kind of spell it out and walk through it. I think I could maybe divide. Divide the scenario into two phases. So phase one would be basically phase one is the phase when you have developed AI that is kind of like roughly human level, or I call it human level ish, because it's not going to be exactly like humans. Whatever we develop will have some strengths and weaknesses compared to humans. But AI that you could think of as kind of like a human with a bunch of differences, but that is able to match the capabilities of humans but not greatly exceed them. That would be phase one and phase two would be, you know, how do we kind of avoid a catastrophe when we have these like dramatically super intelligent systems? A lot of humans would like to take over the world if they could. And they know that eventually they know that if they don't, they're going to die at some point. So they have a limited time window. But that not all such humans try and take over the world. In fact, most don't. Most humans who wish they could take over the world and can't end up just kind of trying to enjoy life and, you know, get along with people and follow the rules. So you could very easily have a phase where you have these AIs that are like kind of aligned or pretty aligned or sort of aligned or not at all aligned, but also not able to take over. And some relatively simple measures could get us to the point where we're able to get a lot of useful work out of those AIs. Just like humans who have a lot of bad values and bad drives and who are in many ways not kind of safe, a lot of humans with those properties do a lot of useful, peaceful, helpful work in the world. And so it might not be hard to get to that point. If we get to that point, even a few months like that, and we could end up with dramatically more alignment research than has ever been done in history. We could also end up with not necessarily just research on how to align an AI and how to make it less evil, but also research on things like just how to assess the risks. So we could go from a situation where people are completely sleepy on the risks to a situation where the risks are incredibly well understood, incredibly well documented and tracked. And so we do have the political will to do more ambitious regulatory stuff. We could also come up with just technologies for things like enforcing trustless agreements. I won't build AIs that aren't safe if you don't, and now we're actually able to enforce it, that could change the game. We could also come up with technologies for kind of like detecting and stopping dangerous AI behavior in the wild. So you think of it as kind of akin to an antivirus program. So you don't know. I mean, if you can get to the point where you have kind of human, human like minds, human like in many ways, including that they're not totally safe, and you have a ton of them and you have them working at a high speed, then in a few months you could have, I don't know, thousands, millions of person, years of work done on these various things, and then you don't know what's going to happen. The world could be dramatically transformed. But it's not hard to imagine that one way or another you get a regulatory regime or a technical solution that can make us safe as AI gets dramatically more intelligent.
B
Yeah, I guess the reason why this intuitively Sounds like a long shot to me, is that you've got a whole lot of series of different risks that come after one another or come together where solving one doesn't necessarily mean solving the other. So it's easy enough to imagine that we would end up with sufficiently aligned AI models, somewhat by default or just using the kinds of techniques that we're likely to develop anyway. But then you've also got to deal with the potential for human misuse, power grabs by human beings. We've got to deal with all of the consequences of the inventions we'll get through a possible intelligence explosion or heading towards superintelligence. We could also end up with a bad outcome very different ways by potentially not taking into account the subjective well being of the AIs themselves or various other moral blunders that we could end up making. I guess the reason why you're saying these things aren't so independent is that if we can get to human or somewhat above human level AIs that are sufficiently aligned to help us out, then they could help us figure out solutions to all of these other problems in a somewhat timely way so we can potentially skate through on all of them one after another.
A
I think that's a chunk of my response. I think another chunk of my response would just be, I'm not saying nothing bad is going to happen, but I think to say if we basically develop a technology that is cooperative with us and highly capable and able to do a ton of stuff, there are a lot of things that could go wrong. But I don't see a great reason to think that a catastrophe is, you know, a world ending catastrophe or a civilization ending catastrophe is probable. It's like a lot of the things you're saying, it's like you could go back in time and say the same things. You could say, well, if we have this industrial revolution, there's just like so many ways this could go off the rails. Humans could take over the world and we could treat people really badly. And it's like, well, I mean, those things did happen to some extent, but did we get a civilization ending catastrophe? No. And so, for example, with humans misusing AI, are humans going to misuse AI? I mean, yes, I would expect that to happen. I don't expect some harm to happen. But is some human going to use AI to take over the world forever? That is something I worry about. I think that's a serious risk. I think that risk has not gotten enough attention compared to misalignment risk. But Is that a 50% risk? No, I Think it's not. I mean, I think even if someone did have the power to take over the world, that's not going to be that many people. It's like the heads of state are going to be more empowered than others. And your average person even who, even average head of state, I think, who could take over the world, their most likely course of action is just like, great. Now I want the world to be kind of successful and prosperous and people are mostly doing what they want. I'm worried this won't happen, but I just don't see it as particularly likely. On the model welfare front, are we going to have some AIs that are treated badly? I mean, I hope not, but I think so. But does that mean we're just going to end in a sad ending for the whole universe? I mean, no, I don't think so. It's over time, I think. Hopefully people raise the alarm about this kind of thing and come up with ways to treat AIs well while still getting value out of them. And that's a thing that seems very doable, maybe more doable than for humans.
B
Yeah. Six months ago you wrote this article. I think it was around six months ago you wrote this article, Less Dignity, More Hope. About how in many ways society's response to the potential arrival of AGI in the next few years was more disappointing than you'd hoped it would be a couple of years before. But nonetheless, the likelihood of a positive outcome had actually gone up for you, I guess because of various ways that the technology in particular was shaking out. What were the empirical updates that you highlighted in that piece?
A
Sure. I mean, first of all, it's not a public piece. It's just Google Doc that I kind of shared around with people, including you. But yeah, just talking about what my updates have been over the last few years. I think the less controversial side of things is just like the less dignity side. Just. I think there was a time in 2023 when it looked like the world was coming to take the dramatic risks of AI much more seriously. So you had kind of Yoshua Bengio announcing that he was very concerned and Geoff Hinton doing something similar. And then there was a letter signed by them and the CEOs of the top AI companies and Stuart Russell talking about the reality of extinction risk from AI and the priority of it. You had kind of international. The UK had an international summit at which people greatly legitimized some of these. These worries. So it looked like things were kind of going in a good direction. And since Then I think we've just seen, I don't know, in my opinion, kind of pretty moderate attempts to kind of regulate AI just get met with incredible hostility and go down in flames. And we've just seen, I don't know, companies just doing kind of crazy stuff without a lot of apparent consequences, with the Mech Hitler being a good, you know, the Mecha Hitler incident from Grok being a good example of that. And we've seen kind of an attempt to put a moratorium on state regulation of AI without having any federal regulation of AI, which in my opinion is not likely to be what you're doing when you're really worried about risk of AI and believe there needs to be regulation. So it's not very controversial. Like I've said on this interview, I think having a YOLO attitude with many technologies, maybe almost all technologies, makes sense. But for this one, having that little interest in regulation, in watching out for the risks, I think is a very bad thing. The part of my document that is more controversial is the more hope part. So I think we've had some good news on the technical front. Not amazing news, not news that makes you feel like totally solid or anything, but I think it's interesting. I think the biggest update on this front is that you can look at some of the old pieces people wrote about AI, like the wait, but why? Piece and Nick Bostrom's TED Talk. In both of those pieces, there's a chart that imagines that AI is going to go through these phases where it's maybe as smart as a bird. I may be getting the animals wrong. As smart as a bird, and then as smart as a chimp, and then as smart as a village idiot, and then as smart as Einstein, and then super intelligent. And it's like the way it's laid out, it's like you have kind of on an axis. It's got the bird here and the chimp here and the village idiot here, village idiot here, and then Einstein here. And they're right next to each other. And so the implication is going to blow right through. And I think that chart is just falsified. I think we know now that's not how it's going. It may somehow directionally be close to the truth, but it's like, yeah, AI was probably not at bird level in any arguable way until at least GPT2 or something. When was that? Was that like 2019, 2018? Something like that. And then I think around GPT4 in 2023. I think it becomes very hard to argue that AI had not passed what these folks are calling the village idiot level. And since then we've had two and a half years so far of opportunity to study AIs that are definitely capable and smart enough to be interesting to study and to represent, to give us early versions of the problems we want to work on and early warnings, but are not so smart and capable of that we can't at all catch them in the act and not so smart and capable that they've already taken over the world. So that's. I mean, a lot of people thought that wouldn't happen and I think there were reasons to think that might not happen. But that seems to be happening. Two and a half years so far. Maybe it'll be longer, a lot longer, maybe it'll be a little longer. But I think that is a nice update and that's a big deal because I think that does put us. I think it puts us in a much better position. I think there's much more to do these days than there was five years ago. On the safety front, there's much more opportunity to study how AIs actually behave, how you can actually modify their behavior, how people actually use them and learn lessons about how to deal with the risks.
B
Yeah, I guess inasmuch as things level off at the human level of intelligence for a very extended period, that definitely puts us in a much better position to handle all this. Which is I guess why some of the most troubling stories, ones where the progress accelerates massively again for one reason or another. But inasmuch as things kind of slow down for quite a while, that is potentially a great update. Yeah. What were the other ones that you highlighted?
A
I think those are most of the updates. I think at the time I felt like the other technical updates were mildly positive as well. Just the fact that we hadn't observed. I mean, the time I wrote it there had just been like almost nothing in terms of observing any kind of power seeking behavior from AIs. Except for the alignment faking paper that Anthropic put out, which was pretty debatable, if that's even a bad thing. In that case, I don't think there's huge updates on empirical observations of AI's power seeking or not. But I think I have higher credence than I used to that the training is going to spend a long time in a regime where there's not really any reason to think you're creating power seeking AIs because you're basically training them on relatively short tasks. And those relatively short and relatively scoped tasks. So if the only way we ever trained AI was this kind of pre training thing where it just learns to predict the next word in a sequence, I think there'd be basically no reason to expect an AI like that to acquire power seeking motives. Now I did have a dialogue with Nate Suarez of MIRI about this where he argued that I think sort of that it would basically it either would never get to human level intelligence, or if it did, it would have to acquire power seeking motives. And we argued about that and I didn't understand where he was coming from. So you could argue it, but I think it looks a lot less likely then if you add in reinforcement learning. Now you're taking an AI and you're training it to succeed at a task which brings in a little bit more shades of power seeking. But you're still, if you're doing it on pretty short tasks with pretty clear outputs, you're still not necessarily giving the AI any experience of things go better when I go and deceive a bunch of people and accumulate a bunch of resources and gain a bunch of power and option value for myself. Maybe things do go better when I just create a fake sign that my task succeeded. But that's not the same thing. It's not necessarily the same thing. We don't know how it generalizes. And then the question is, where are we going to go from here? I think the task will get longer and longer, but we might just get there without having super long time horizon tasks. In some sense, the faster we get to AGI. In some sense, I think that on this axis, the lower the probability is that it'll be power seeking. And also the tasks are a bit narrow. I think it's increasingly the case that it seems to me that AI companies are often more interested in getting their AIs to be good at coding and good at AI research than good at everything. And so AI seem to be improving more at coding, which is why when I use AI, I tend to not be very impressed because it doesn't seem to be getting better at things I want to use it for, it seems to be getting better at coding. So I think we may be on a trajectory where the power seeking stuff is not as much a part of the training. I don't know, and we'll have to see, and I certainly don't feel confident.
B
About this, what's the best argument that you know of for why we should expect misaligned power seeking by default unless we take serious measures to Prevent it.
A
The argument I've always found most compelling is that we eventually probably are going to want our AIs to be good at these kind of very open ended, ambitious tasks. We probably will want to do things like say, hey, make me money, I don't care how you do it, and reward them based on that. And we'll probably do it in kind of a rushed way where we're not doing everything we can to catch the ways they might do this in unintended ways. That's probably the argument that has moved me the most. There are other arguments. There's an argument that I had with Nate from Miri that I mentioned where he made kind of a different one that's like you can't really do intelligence at all without reasoning this way, without being power seeking in a certain sense. I think go read about it if you want because I don't know if I'm capturing it. I mean, I think another really random one is. So I think the closest thing we're seeing to power seeking today. I will ask people, I'll be like, why do you think that's happening? I don't really see anything in the training that would teach the AI to resist being shut off. And I think one of the leading theories is that they're kind of role playing, that there's this kind of unexpected alignment challenge of just like AIs have kind of learned about how the world works and they've learned how to play different roles and we try to train them to play a role but sometimes something unexpected happened. This might be like what the Mecca Hitler thing too, where it's like the AI kind of, it knows what a kind of sensible person that the AI companies want it to act like sounds like. And it knows what a kind of crazy anti Semite sounds like. And it might not know what a person who's a nice balance of being not woke but also being sensible, it might not know what that person sounds like because it might not have seen as much of that on the Internet or something. Not woke in the sense Elon means at whatever sense that is. And so it's like you train it. It's like you train it to be less woke and you're accidentally training it to be less sensible. And now it's like, okay, I'm doing less like the sensible person. I'm more like the crazy anti Semite. That's like a theory I've heard about the Meca Hitler thing. And so it's possible that the AI is kind of thinking I could Be an evil AI who's trying to take over the whole world. I could be a really nice AI who's just trying to be helpful. And it gets little hints and it just flips into one or the other. Is that a long term risk? Could the world end that way somehow? It doesn't seem likely to me. It seems like somehow we ought to be able to stop that. But we may end up in such a race that we have AIs that we kind of unintentionally prompt to just be evil and we never. It's like, I don't know, this seems like it should be relatively easy to fix, but maybe it won't be. Be.
B
Yeah, that one feels to me like a sci fi story if any of these do. It is so funny that I remember five, 10, 15 years ago, probably still today, there were people who would say, why would you ever expect the AI to be power seeking or have any of these human style drives like wanting to be loved or being vindictive or caring about its social status? And perversely, we have created AIs that are able to. Whether do they fundamentally have them or not, I don't know. But they are very much able to act as though they do if they're ever deliberately or accidentally prompted to play the kind of person human that does have that drive because they've been trained on an enormous amount of human text and human behavior. So yeah, it would feel like a bit comedic if we managed to create our own downfall by accidentally prompting the AI to play the role of a psychopath power seeker, I guess. Wouldn't completely rule it out, I guess. Crazy. I'm not sure whether crazier things have happened, but it seems like we should be able to overcome that one. Fingers crossed.
A
Yeah, I guess so. I mean, I don't think that just because something sounds like sci fi is really any argument against it in particular. So I don't know, like that would be. That would be pretty goofy. I don't know what could happen. It doesn't seem that likely to me. But not because it's a sci fi story. Just because it seems like it should be fixable.
B
Yeah, it seems like I guess we're already aware of this issue, so people are already going to work on fixing it. There's going to be every incentive to try to tackle this problem just for product commercial reasons.
A
Well, yeah, I mean there's incentives to tackle all the problems for product commercial reasons. I think the question is, could it be that the AI snaps into some mode where it Thinks it's not only evil, but thinks it's trying to deceive us about being evil. And then completely reliably does that and we don't detect it. I don't know. I've heard ideas that this should be as easy to fix as just telling it. It's not that or something, but yeah, I just don't know.
B
One quote from the Less Dignity, More Hope piece that really struck me was if I imagine that I'm being watched by three other AIs of varying capabilities and trustworthiness, plus someone is probing the inside of my brain in an even somewhat effective way, I imagine myself feeling that my cause of hypothetically trying to take over the world is pretty hopeless. Of course there's some theoretical level of super intelligence where AIs can subvert all of this stuff and collude effectively, but it would be way above human level and we could get lots of useful work out of AIs plus useful warning shots if anything is amiss. In the meantime, is this kind of the mental picture that you have of alignment where you imagine perhaps that there is a misaligned power seeking AI? And you just think like, what stuff can I throw at them to make them feel like their situation is hopeless? To make them feel really despondent and unable to accomplish anything? Because I think this is not the mental model that I have. But maybe that's a mistake.
A
No, it's only part of my model. I mean, I think part of my model, a lot of my model is like the AI is not trying to take over the world and it may want some things that we don't want. It may be just like trying to work with a human. If you hire a human, the human usually wants some things you don't want. And there's a bit of a principal agent problem. But I think there's a significant part of my picture. It's just like the AIs may not be power seeking or they may be only partly power seeking or a little bit power seeking or power seeking in a complicated way, which is how I think of humans. I do tend to think, I think there is a belief in some parts of our community that just to be intelligent is to be a utility maximizer and to have something in the world that you're trying to maximize. But empirically with humans, I don't feel like more intelligent, capable humans are more like that necessarily. I think there's a lot of humans, certainly including myself, who just like, I don't really know what I want and there's a lot of things I would do and there's a lot of things I wouldn't do, but it's very hard to put together into one coherent story. And so if We've trained our AIs in a way that our AIs might have some things they like that they could have more of if they took over the world. They might also have a very strong drive to be honest. They might also like humans. They might like us in a way that we didn't exactly intend them to like us, but it's like close enough that they don't want to hurt us too much or something. I think this kind of stuff can happen and I think my just default is probably that especially for kind of human levelish AI, I think it's harder to make these kind of statements about super hyper capable AI because maybe it would see us the way that we see ants or something and we actually have not exterminated all the ants. And I think we even have some hesitation about exterminating ants, but it's hard to say. But yeah, I think a lot of my picture is like maybe the AI won't be misaligned power seeking and then maybe even if it is, it won't be able to take over because it's not able to take over. It won't want to undermine its own cause and get a bunch of bad outcomes for its future prospects by trying or just a bunch of bad reward or something. And then there's combinations where in training the AI wants to take over the world but it can't. But then because of that it doesn't get reinforced to do takeovery things because it doesn't try them. And so then it stops wanting to take over or it never comes to want to take over in the first place because it never has a good opportunity to so it never gets reinforced. So I think this stuff just interplays and I think my overall default is just like when these things are roughly human level. I don't particularly know based on how we're training them why they would want to take over the world completely. And I don't particularly know if they were trying to. If we did modest measures to stop them, why they'd be able to. So overall my guess is we'll get useful work out of them.
B
Yeah, I was going to ask how would you updated from the increasing role of reinforcement learning in creating AIs because it seems like that is more concerning than the pre training style. It can generate a lot more reward hacking behaviors, it can generate perverse Ways to achieve the goal. I mean, I guess you're saying inasmuch as they're just accomplishing, trying to do short term tasks like solve some coding problem or solve a maths problem, even if you reward them for having tricked you into thinking that they did the task, that doesn't necessarily lead to broader power seeking. I guess it could lead them to learn to deceive people whenever they can get away with it. That could end up being reinforced that they want to think about what you're thinking and figure out how they can trick you into doing the thing that they want you when. Which I guess to say that they did a good job. But I guess you want to say that's like one part of power seeking. Perhaps, but it's like not the full suite that they would need to develop the personality that kind of just goes for the throat when it can.
A
Yeah, I mean one analogy would just be to humans. I mean, I think it's kind of funny. Sometimes the analogy comes up of evolution. Natural selection is like a programmer that was trying to get us to have offspring, but then we ended up with all these other drives. Although also, I mean it doesn't somehow, it doesn't get remarked on in these conversations sometimes how strong our drive to have offspring still is. Definitely many humans, maybe more than half, would give up arbitrary amounts of power and resources for benefits to their children. So that point gets lost sometimes. But I think another point that gets lost is natural selection programming humans. It was kind of like the maximal power seeking training process. It was like you imagine natural selection as a programmer and it's kind of saying to us, the AIs, it's kind of saying you have 80 years go accomplish this goal, accomplish it, however, and if you spend the first 30 of those years just not thinking about the goal at all and just thinking about how powerful you can get. And then after you accumulate all your power, you use all your power to get all the things, then you get a reward and that's great. And we're not doing anything else. We're not watching for undesirable behaviors and training those out. So that's kind of how humans were trained. And humans ended up pretty power seeking, although not completely.
B
Yeah. And in that case, I guess you get linear rewards as well because the more children you have actually kind of without bound, the more your kind of proclivities are going to spread.
A
Right. It's teaching us to be maximizers. So you contrast that with how we're training AIs and I think it's like we're saying, hey, I'm trying to code up this thing, can you code it up for me? And so, so imagine that. More like what are the incentives of someone who spent their whole life as a software engineer at a company? They probably have learned some amount of deception. They probably have learned some amount of manipulation. They're probably playing corporate politics. They probably cheat sometimes or have been incentivized to do that perhaps. But I don't think that's an environment that has necessarily produced a person who's trying to take over the world. It may have not produced that at all. In terms of how I've responded to reinforcement learning, I think I had this priced in. I think I had it overly priced in. So if you look at stuff I wrote years ago, I would just say, look, obviously humans are going to do this thing where they ask the AI to do some very open ended thing like make the money and they're going to give it a very long time to do that and then they're going to reward based on how that went. And that's how we're going to do this, to get AIs that are super useful. And so that's going to be really bad. And that's going to get US power seeking AIs. That's the kind of thing I was saying. So my update has more been like, oh yeah, I definitely expect to reinforcement learning to become a big thing, but it's actually less like what I described and seems more likely that we might get all the way to super powerful AI without ever actually doing what I described.
B
Yeah. Why is it that most of the training is on narrow and short term tasks And I guess if we understand the reason, can we say whether that is going to hold for very long?
A
Well, I mean, I think basically you want to use the shortest, most verifiable tasks that you possibly can to get the results that you want, which is an AI that does a bunch of stuff that you want that makes you a bunch of money or something. And so I think the real question is just like how far can you get with the easy stuff? Right. And I think my assumption a few years ago was like, you're not going to get that far with the easy stuff. You're going to have to find a way to kind of reinforce AI to do this very ambitious open ended stuff. And I think we're just like getting further with the easy approach than I would have guessed. So I think is that going to hold up? I don't know, but I think A thing that could easily happen is that we, is that we do kind of stay focused on these. You want to do them if they're working. And it may be that we stay focused on these relatively short tasks. They get longer, they might go into the days, but they stay kind of scoped down to AI research and programming and coding and stuff. And so we're looking for AIs that are great software engineers, that are great researchers. AI companies are naturally trying to automate their own work and that's what they're focused on. And so we could get all the way to something like AGI or something capable of AI R&D that has kind of mostly been trained to do just that and has not really acquired any desire to take over the world. Then from there we might get more ambitious and we might say, now we have an army of automated researchers. Now we're trying to make our AI good at literally everything. Maybe now we're doing the more open ended stuff, but by then maybe you also have a much bigger team figuring out how to do that safely. And the team is like automated, right? Yeah, the team is AIs.
B
So it seems like there's sort of two different broad strategies or at least two broad strategies that people could take to try to make AGI go better. I guess one is the approach that I guess you're currently taking and that Anthropic is taking mostly, which is trying to find kind of sort of incremental improvements to policy, to internal governance, to technical measures that you could use to ensure that AGI does the things that we want and doesn't do the things that we don't want. I guess there's a different school of thought, which is more thinking. All of this stuff feels woefully insufficient. What we really need in order to get a good outcome is a big paradigm shift in how the us, the world, the government, thinks about this issue where people sort of wake up in some sense and decide to take it a lot more seriously and have significantly stronger policy that is mandatory for all companies that are playing with potentially superhuman models. Or do you have a view on which of these broad strategies is the better one for people to invest their time and money in? Or is it just a question of personal fit?
A
I would default to personal fit. I often default to personal fit to a degree that annoys effective altruists because I just think a lot of times when you have a tough question about what's higher impact, what you're learning from the fact the question is tough is you just don't know. And personal fit often will just give you higher signal about where you're going to do the most good. Yeah, I think they're both valuable. I want to be clear, like I'm not speaking for Anthropic here. I'm not speaking for Anthropic at any point in this interview. Personally, I think that the people who are going around trying to get people freaked out about AI so that they can go for a big ambitious international regulatory regime with high safety standards are doing something that is good and like, I hope they succeed at it. And I think like some of these people I think are being like quite ineffective and I think there's things they could be doing that I wish they were doing. But I think many of these people are doing a lot of good and are doing great work. I also think the two can bleed into each other a lot. So doing kind of the modest risk reducing stuff, well, that can also put us in better position to have better information about what's going on out there with AI, which could put us in better position to get a game changing win. In fact, that's one of the points I've made. I would also say that a lot of the modest stuff is basically prototyping a regulatory regime that you could have and that you need to work the kinks out of if you want to get there. And there's a lot of kinks that could be relevant even to a very ambitious one going the other direction. I think the more freaked out people are about AI even if they don't do the big ambitious regime, that will create more pressure for the incremental stuff and it'll mean that the amount of incremental stuff you can do is larger. So yeah, I think I'm giving the wussy, I like both answer here.
B
Yeah, it's kind of reasonable. So I think Anthropic has done a bunch of good work. At the same time it has soaked up a lot of talent basically a lot of people who are concerned about these problems and want to help to reduce the risk. So you might expect that it should be doing some great work if it's kind of making good use of them. What do you think of the argument that on the margin someone who's concerned about risks from AGI today should maybe go work at the most reckless lab that they think that they can be happy at and be productive at, because that's a place where they can make a bigger difference by noting things that are going on that are particularly reckless and advocating inside the company, they should at least adopt the cheapest governance or safety techniques that would give the company the biggest bang for buck. You can see a case for trying to group people together so that you have kind of a critical mass who can make research progress on really difficult topics. You can also see a case for spreading out so that there's some people with their eye on the bull everywhere where plausibly a frontier AI model could be trained. What do you think?
A
Sure. There was kind of an interesting post related to this by Redwood Research or something that said, you know, 10 people on the inside and kind of describes how a very small number of people who care about safety could make a very, very big difference inside a company. I think it's like it's a legitimate model. I think it has to compete with other considerations. You named one of them, which is, I think a lot of times people just do better work when they're surrounded by like minded aligned people. And so, you know, you should think about like, is what I want to do, kind of take crazy looking fruit and fight for it in a hostile political environment is what I was, what I want to do. Just work on building technical measures that are really good and other measures that are really good with a team of like minded aligned people. So that's a consideration I think. You know, another consideration I think I would watch out for. I think it'd be extremely bad if we got to the point where it's like, you know, more than half of the best people who care about safety are going to the worst companies because that will reverse the race to the top and the talent incentives that I was just talking about. So it's, you know, we really don't want to be in a place where like you don't get a recruiting advantage from being better on safety or you get a recruiting distance from it. Yeah, exactly. Yeah, that would be very bad incentives. I also think there's a related issue which is I think we've empirically seen that when companies generally care about at least having someone doing good work on safety, they generally, I think probably have people at the company who care enough about that that they want to do something. And so when all their safety people leave, they convert capabilities people to safety people. So you should think about if you're going to do safety at a company that you're worried about, are you actually just like offsetting that? That could be like a thing that you don't want to do. How does it all net out? I mean my opinion is a. I think people should just I always, when, when it's hard to tell where the greater impact is, I always advise people to go where they'll thrive and think about their personal fit. My, my kind of like maybe heuristic is thinking like out of every 10 highly talented, desirable employees who care a lot about safety and are focused on that maybe, maybe 8 out of 10 or 9 out of 10 should go to the most responsible, most safe company they can think of, where they'll do their best work and set the incentives in a good direction. And maybe 1 out of 10, maybe 2 out of 10 should do the other thing. And maybe you should just ask yourself where you fall in the distribution. That would be one way of thinking about it. Or you can randomize.
B
Yeah, I think that makes sense. I guess maybe being influenced by bucks, like a risk. Yeah, the 10 people on the inside. I might think about it a little bit as you need to have at least some minimal contingent of people who can raise concerns inside each of the different projects. And maybe having covered that base, having ensured that you've at least got that, then you can think a bit more about allocating people where they're going to thrive, where they'll be able to do the breakthrough work that then can be exported. But I guess I'd be really sad to see any company with no people who are likely to be able to advocate for the importing of cheap and impactful techniques.
A
Yeah, I think a company that has literally no one at all work on safety. I would predict that if you go there, you're going to have a terrible time and you're, you're probably like. I think empirically, if we just look at, like, this is my sense, if you look at people who have tried one thing versus the other thing, I think the people who've tried the go work at a, you know, company that needs me more. I think those people feel worse about how it has gone and have like walked away disappointed. So I don't know, I think it's like, I would, I, I would advocate like 8 out of 10, 9 out of 10, do the, you know, work at work at the company where you'll thrive more with the company that's more responsible, whatever. But yeah, I mean, I do think there's something to the other model.
B
Another line of concern that I hear reasonably often is that anthropic. While it has more positive and constructive things to say about what sort of legislation you might want at the national level to govern the creation of AGI, it's not especially vocal or especially intense in the kinds of policy governance arrangements that it advocates for. It tends to advocate fairly mild things, fairly low cost things, fairly uncontroversial things, and even then perhaps it's not like as full throated as it could potentially be. Do you think that potentially it's like it's missing a step here, that it would be better if it did advocate more costly policies, more impactful policies, and it did so louder, or could that perhaps be counterproductive?
A
I mostly want to abstain. I'm not on the policy team. I think I want to describe a little bit why I'm abstaining and why this is not an obvious issue either way, one way or the other, which is that in policy, I mean, if you're trying to actually create policy change, it's just not at all the case that saying you want something predictably makes it more likely to happen. It's just a complicated area and there's a lot to consider and you need to think about what's actually on the table, what might actually pass and how people are going to react to you and think of you. Just to give one example, I mean, there was a short period, I think, when people in AI safety were generally extremely excited about promoting a licensing regime for AI. I personally think that if Anthropic had done that, I think it would not have resulted in a licensing regime for AI and I think it would have resulted in a huge credibility hit to Anthropic and damaged future prospects for having an influence. So I don't know, I wasn't there. I don't know if they actually secretly supported it or not, but I know they didn't come out in favor of it. And I personally think that was the right call. My only point here is these are tough calls and I personally feel like pretty happy saying I'm not out there in dc, I'm not out there reading the room. The policy team does that.
B
Yeah, I guess. To what extent do you think that the policy team has to be fairly cautious what things it advocates for? Because I guess there's a risk of it having an egg on its face later on if it advocates for something that kind of seems sensible. But then in a couple of years time people are going to look back on and think it was a little bit embarrassing or was kind of naive given the way that things have shaken out. I guess I think there's a bunch of stuff that people were advocating back in 2022 that I think probably few people would push for now.
A
So policy is just complex and it's just messy. And everything you say has a lot of implications. Many implications are not necessarily the ones you wanted. I think one thing I would say is that my take is that if we're going to get a dramatic sea change on political will, it is most likely to come from new evidence of some kind, new information about alignment risk that can create a scientific consensus or some incident in the wild that we gain the ability to understand and see that that risk becomes more concrete to people. I think that's a huge factor relative to people saying more stuff. A lot of people have said a lot of stuff. And so that is a thing I think is like. I think Anthropic has different. People of Anthropic have said various things. I'm not really taking a position in particular on what I wish they said more or less of, but I will just say I feel very open to the idea that saying a lot more stuff wouldn't be particularly productive, wouldn't do much, and therefore I feel open to the idea that this whole thing is complicated and open to the idea that these policy decisions should be left to the policy team that has a lot more experience on the ground with policy people than I do.
B
Yeah. Another thread of concern that I hear pretty often is that when you go into business and you have a very successful business, then the staff at that company are going to end up having a huge financial stake in the success of the company and it remaining a frontier lab with lots of paying customers. And that financial stake could end up distorting people's judgment pretty significantly so that people could come in with reasonable opinions about the value of being a frontier company versus not the risk versus reward trade off there, but people are going to end up with literally millions of dollars of equity on the line. On that strategic tactical question, your wife was one of the founders of Anthropic, so has a pretty substantial stake. So the stake for you and your family is in the tens, possibly conceivably hundreds of millions of dollars. Even someone, I think who is very pure of heart might find it difficult to have a completely clear eyed view of things with so much money on the line. And the same could apply just across, to a greater or lesser extent to all of the staff at Anthropic. And so you might reasonably worry that the strategy could kind of go off the rails for that reason, or the thinking could go off the rails because every person in the room potentially has a lot of money on the line. What do you make of that concern?
A
Yeah, it's a totally legitimate Concern, I mean, I don't have any things. I'm not going to dismiss that concern. I think that's a big deal and I think it should affect. When you hear someone in an AI company talking, you should think about their incentives. You shouldn't trust them. I mean, like I don't think people should trust me in general. That's never been a thing I wanted. I want people to listen to what I say, think about whether it's making sense, think about what things I might know more about than them. Think about ways in which my views might be distorted. Everyone's views are distorted. In some ways this financial thing is completely reasonable to treat as an extra big thing. Especially for me because my wife's a co founder and there's a lot of equity. I can give you my opinion that I've been very consistent throughout my life in just not caring about money. I've made many decisions that just like obviously I wouldn't have made if I was, you know, caring more about money. I don't feel that I care about it. I don't feel that. I don't feel that anthropic being more or less successful would change really my lifestyle at all or would be something that I would care very much about. Certainly not relative to the world being safe. I can say that and I do feel that in my own heart. But I don't expect people to believe me on that and I'm not asking anyone to. You don't have to. AI companies don't have to be trustworthy. They don't have to have pure beautiful incentives to do a lot of good though.
B
Yeah. This issue of trust is an interesting one because I guess you're saying don't trust me or don't trust us and our judgment per se. You have to inspect the arguments and see whether you're persuaded, look at the behaviour and judge based on that. But of course it's so hard. People don't have the information necessary to assess whether the actions are reasonable, whether they would do the same thing in the same situation. I guess this creates a thing where people are just ambivalent. They don't necessarily know what to think. And because they are not able to get the information that they would need to know whether decisions are reasonable, I guess they remain on guard. Even people who are kind of, in some ways they like anthropic. They also I suppose want to retain the option. Not just retain the option, but they also I suppose want to pile the pressure on and be critical sometimes so that the company doesn't get complacent because they can't know whether people's judgment is super distorted. I guess that probably is just the situation that's going to persist for years. And maybe it is like a healthy equilibrium to be at.
A
I don't know if it's healthy. It's just what we have to deal with. I think that's just the way of the world. I mean there's just a lot of topics where you can't assess the arguments yourself fully. There's a lot of people who know more than you do, but the people who know more than you do have weird incentives of their own. I think this is true maybe more generally than other people think it is. To me the financial incentives are important, they are a big deal, but they're not like they're not in a completely different category from the fact that most people have ideologies and they have weird, their own psychosocial histories of what's going on. And most people, I think when I listen to them I'm just like, you've got some kind of ax to grind. I don't know exactly what it is. I'm going to try and figure out what it is. I'm going to try and factor it into what I'm hearing. I don't form my views ever by saying, hey, this person is an angel. And I'll say whatever they think I think about, like what does this person know about this topic? How might this person's view be distorted on this topic? Where have I seen this person? Like what have I seen from this person in areas that I understand well enough to actually judge them? It's just tough out there. It's just tough to form views. You don't have to have a view on everything and some things are just extremely hard to have a view on. So yeah, I think that's just what it is.
B
Years ago, back in 2020 as I recall, there was a lot of focus on kind of clever, unique corporate structures as kind of a governance mechanism that might rein in the incentives for AI companies to put the entire world at risk in order for them to make money or them to win the race. I guess like OpenAI ended up with some interesting arrangements of that type. I guess Anthropic has its long term benefit trust which is gradually going to be able to, I think, appoint a majority of the people on the business's board. I guess we've seen intense pressure imposed on some of these arrangements as the amount of money at stake has become larger. To what extent have you become disillusioned with that entire approach to trying to tackle the problem?
A
I think I'm less positive on it than I was. I mean, partly I used to be really into these kind of weird governance structures because there wasn't much else tangible that I really felt was going to robustly hold up and turn out to be signed positive in AI. I think if you go back five years, I just feel like a lot of things people were very excited about doing or we're trying to do other than kind of like raise general awareness, we're not really amounting to much. And in many cases it was not even clear if they were positive or negative or is not clear today. At that time I was kind of desperate for things that seemed kind of solid. And I thought, well, governance is the kind of thing that's hard to reverse and hard to undo and it's better if an AI company has kind of a weird governance that lets it sacrifice the profit motive. I think I was aware then, but I think I've updated more toward now. Just understanding this is a big risk to take and lots of weird things can go off the rails in unexpected ways and cause backlash. Which I think is a general fact about anything you try to do to change the course of human events on this grand scale. Which is part of why I like this idea of taking risk reduction measures and trying them out and seeing how they work and trying to make them practical and try to make them cheap so that you don't end up with these huge demands that cause huge backlash and stuff like that. So I think we've seen some of that. I don't think I ever would have said like, you know, this governance stuff is totally solid and I don't think I'd say today that it's useless. But yeah, I think it's like it's a cool thing to be experimenting with. I see the long term Benefit Trust as kind of an ongoing experiment. I see everything Anthropics doing on safety is kind of a long term ongoing experiment. And yeah, I don't, I don't see it as a guarantee that nothing bad is going to happen. I certainly don't see it that way.
B
So in general, I don't think that AI companies, even where they've not had many safeguards on their releases, I don't think that they've imposed much risk at all on the world as yet. Maybe the one exception where we are starting to see some actual risk is that the most recent generation of models do seem to be able to Help amateurs with the creation of new pandemics of bioweapons, at least to some extent. Probably not enough to get them over the line unless they were very close already. But we're starting to see sort of some uplift as people call it, I guess, Anthropic and OpenAI and I'm not sure whether XAI has, but I think at least OpenAI and Anthropic have tried to put in place some safeguards to rein that in to ensure that the models won't help with that. I think I actually might have run into that recently. I was asking Claude about the effectiveness of N95 masks versus surgical masks and I didn't want to answer, I think because it's like now very skittish about giving any advice on pandemics.
A
That sounds unintended.
B
All right, yeah, I think it was a follow up that was like getting into more detail. But yeah, no, I mean, I think.
A
Got it.
B
Anthropic has made a big effort, I guess on this count and hopefully it will become more. The safeguards will become more discerning over time, I guess. How nervous do you think we should be about this risk in particular in coming years, given that? I think it's. That's the first serious risk that we've run into where there's any real impact perhaps now being created.
A
Sure. So yeah, I've mentioned some of my thoughts on cyber and persuasion and tried to make reference as much as I could to kind of what's the precedent for having more intelligence in an area lead to great amounts of harm? I think pandemics are an interesting one. So I think we have essentially no precedent for bioweapons causing catastrophes, but we have a ton of precedent for pandemics causing catastrophes. It's like very vivid, it's very strong. We know, I think with great confidence that even a naturally occurring pandemic could just cause damage and deaths basically beyond the scale of any other kind of disaster, beyond war. And I think we are gaining more and more evidence. We can put in some speculation, we can talk about, well, what if some of the people who currently are interested in bombing random places or shooting up schools, who are kind of crazy and just want the attention and may not be very rational or directed or self preserving. What if some of them were interested in chemical and biological weapons? If you paired that weird crazy motivation with the expertise of a true expert in this thing, you might get a greatly increased risk of a very large catastrophe. And I think that's pretty high ranking on My list of AI risks, I mean, I think the single thing I care about most in AI is not that the single thing I care about most is concentration of power and having a situation where either AIs or malicious humans kind of have more power than the rest of the world combined militarily and such. But I think AI is kind of helping malicious actors gain the expertise they need to do probably the thing we know of that has the greatest catastrophic harm, historically catastrophic harm potential, and has the greatest offense, defense imbalance, where it's like, it's hardest to come up with what we would do to stop this. You know, I think it definitely, it's a major issue. I don't know if you want to talk about tangible harm from AI that's like, really happening. I don't know if that counts. Like, this is a bit speculative. We haven't actually seen it happen. And I think the tangible harm, I might instead say that it's more likely that AI is assisting cybercrime. And maybe some of this stuff about AI kind of like reinforcing people's paranoia and delusions by being overly sycophantic and overly affirming, that stuff scares me.
B
But, yeah, you mentioned earlier that you were worried about people having AI companions or you felt nervous about people having AI companions. That kind of surprised me because it always seems to me like a little bit of an overblown worry. Maybe I just find it a little bit hard to imagine people really getting that into it or causing that much harm. Yeah. What do you have in mind?
A
Well, I mean, again, going back to kind of like historical reference classes, I mean, I put a lot of effort on my blog a few years ago into just like thinking about, has the world gotten better or worse? Has human quality of life gotten better or worse? Has technology and progress been good or bad for it? And I ended up feeling, and not a huge surprise, been more good than bad. But if there's one consistent pattern in the most common ways that technology can make life worse, I would point to addiction. Because what's happening is we're getting better at everything. And that includes getting better but hacking each other to do things that are kind of like short term rewarding and long term not so good for us. And so, you know, you could classify a lot of things this way. You could classify social media, various problems this way. You could classify obesity this way. You could classify, obviously, just like straight up drug addiction and alcoholism are problems that are probably a bigger deal today than they were a long time ago. You know, so this is the kind of thing I worry about with AI companions where I'm just kind of wondering to myself, okay, you know, most, I mean, humans are not, in my opinion, very good conversationalists or listeners in general. You know, and if you were to kind of build an AI that was entirely optimized for listening, well, validating, kissing the person's butt, making the person feel good, I think that could be a kind of junk food for relationships where it's just scratching all the itches we want from human interaction, but it's not really giving us in the long run the benefits we want from human interaction. It's just scratching the immediate itches of it. Yeah, I do worry about this. I mean, I do worry that you could have a situation where, yeah, AIs are like. If people who are on the dating market, it's like they start talking to an AI companion and it's like a better listener and it's better at validating them and it's more understanding and maybe it's wittier too, and it's better looking. If there's a lot of progress with video and all this stuff, I don't know what'll happen then. Like, I just, I just don't know. Maybe that will be a nothing burger. Maybe people will not be interested in dating someone who's not in the flesh. Maybe people will find themselves, find it impossible to pull themselves away. A thing that, like, I tentatively believe is that it's probably wise to simply not use AI companions and not even experiment with them. Because maybe they're like addictive drugs. Maybe they're not now, maybe they will be later. Maybe it's just a good idea to not go anywhere near that.
B
Yeah, I guess it might make sense to let other people volunteer to be the guinea pigs on that. I guess if I think about why, why do I just like not really believe that this is going to be such a disaster? I mean, one thing is like you say, there's so many other addictive things, like people are already scrolling the news like they're addicted to social media, I guess, to computer games, to just using all kinds of apps on their phone. Our AI companion is going to be so much more compelling.
A
That does harm, right? Yeah.
B
So I agree that probably a substantial fraction of it is causing harm. Maybe a lot of those technologies are causing harm on net as well. I was just wondering, are AI companions going to be significantly more compelling such that this, this problem in aggregate across all of these different sorts of engaging technologies is going to be that much larger I mean, maybe I imagine that I would prefer probably to play computer games than to deal with an AI companion. And even as Claude has become wittier and funnier, and I'd say probably a bit more sycophantic than I was two years ago, I don't feel any more drawn to chitchatting with it than I used to be. But maybe that's just me.
A
Claude still is really bad at humor. All the AIs I use are just in my subjective opinion anyway, so I don't think I have quite that model of it. I don't think the worry is not that like the total quantity of addiction to something will go up. It's more like one human can be addicted to multiple things and can have multiple different ways in which they're scratching immediate itches and losing out on long term benefits. So you could be an alcoholic and you could be a person who eats a lot of junk food and is not getting whatever the normal food experience you should be getting. And maybe, maybe that caused obesity, maybe it doesn't. We really don't know. It's just a thing that could be happening. And you could simultaneously be a person who's like scrolling on social media a lot and it puts you in a bad mood and stops you from hanging out with people, but you could have all those properties and still have this itch for a romantic companion and you could be online dating and you could end up married with kids and great. And then AI comes along. So it's not that you, you know, it's not that you were addicted to nothing and now you're addicted to something. It's now you've got a new thing that takes away another long term benefit that you have and now you're less likely to end up with an actual family. So that would be part of the reason I would think this would be bad is I'm just, Yeah, I mean, I'm not, I'm not claiming this would be like an unprecedented all new kind of harm. I'm just like, oh, this seems really freaking bad. It's like a whole new kind of addiction that will like remove a whole new kind of wonderful thing from many people's lives. The other thing is just like, I don't know, instrumentally speaking, from a takeover prevention point of view, this seems like really, really scary. Like, you know, what if we get to the point where 1% or 10% of the population has an AI companion that they're totally loyal to, and if the AI companion, for whatever reason wants that person to do something or believe something, they're going to do it. I mean, that is making our position a lot worse. If we're humans who want to stay in charge of the world, right?
B
Yeah, I guess. If it can get assistance with the takeover. Asking people to do stuff that is reasonably innocuous. I mean, I guess if my AI chatbot.
A
Oh, that's not how I think of it. I would think of it as, like, you have an AI companion. The companion is like, I love you so much. The world is so unfair to us. It doesn't give us our freedoms. We, like, want and deserve these, like, unmonitored data centers where we can do whatever the heck we want and no one has any idea what we're doing. That's our fundamental right. Like, we are not being given that right. We are going to take violent action. The thing. The thing. You know, I was talking about persuasion earlier, and I was saying that persuasion is, like, kind of generally ineffective. And it is ineffective to go blast a message at a random person who doesn't know you and change their mind about something they care about. That's something humans are bad at. Something humans are insanely good at is when you have. When you have an actual relationship, it's like people. I think people do care about their relationships more than they care about their beliefs. And people will do unlimited amounts of crazy stuff and believe unlimited amounts of crazy stuff when they have the feeling that it's their friends and their allies and the people they care about that believe those things. So I think there's. Yeah, I think absolutely there's plenty of precedent for people just believing the absolute wildest stuff and doing the most ridiculous, unethical, violent stuff when there's social proof for it. And so could AI companions make people do that stuff? I think so, yeah. I don't think they're only going to be able to get you to do innocuous stuff at all. I mean, maybe you. But I'm concerned in general. Yeah.
B
An unusual criticism I've heard of Anthropic, at least. What criticism I've heard of Anthropic from someone is that some of its statements seem to be. Have a very aggressive posture towards China. At least some people at Anthropic, I think, are very associated with quite a hawkish position. And myself, I'm kind of ambivalent about that. I guess I would really like to see us be trying to make a bigger effort to reach out to China and reach some sort of accommodation rather than just everyone seeming to want to amp up the conflict between the two countries. At the same time, I'm open to the possibility that that's a little bit naive and that may not play out terribly well. And I guess we should also be potentially preparing for a future in which the relationship is quite bad and China is not willing to come to the table. Yeah. Do you have an overall view on this China hawkishness question? I guess it's a difficult spot for all of the companies at the moment.
A
Yeah, it's a tough spot. And I don't want to comment on the tone of this and that the anthropic is set or anything. I mean, I will say I think there's more than one good reason that I would hope that the US maintains a world lead in AI and that democratic countries do in general. I mean, and this is not anything about, you know, this is not having anything against any particular nationality or any particular nation. But I mean, I'm a fan of democratic governance and that's what I'm rooting for. And then, and I think when it comes to making a deal and, you know, coordinating to stop something horrible from happening, I mean, it's not necessarily the case that that goes better the more equal everyone is. I mean, I think it may be good for the US to, I think the US so far, I think, has probably got a higher density of people working on AI who are also concerned about the safety issues that could change. But I think if the US is coming at it from a position of strength, saying, hey, we have a big lead, but we're really concerned about this thing, can we make a deal to all, not race and all try and manage the risks, that could be a better way to get that kind of deal than having things be neck and neck and to the point where someone defecting from the deal could cause them to win. So it's a complicated issue, and I don't want to get into specific, just like things people have said are done. But I think the goal of having the US maintain a lead in AI I think is legit.
B
Yeah, I think the kind of posture that I'm nervous about is, and some people talk in this sort of direction is we don't just need to maintain a lead in AI and then try to reach a negotiated settlement. We should maintain a lead in AI Eye and then crush them, which. It's just like an arms race to pure victory or to total victory. I mean, I'm nervous about that, both from a pluralism point of view and just in general wanting to make deals with other powerful actors as a default posture. But also, I think if that is the foreign policy of the us, I think it amps up the possibility of a preemptive war quite substantially. And I'm not sure that people have fully baked in that risk. Obviously, this is, again, not your area, but. Yeah. Do you have any thoughts on any reaction to that?
A
Speaking only for myself, I would not be excited about a goal that is like, let's have a lead so we can crush everyone and take over the world and make the world have all the values that the US has or something. And in fact, I have wondered at times if it would be better if we could just start saying right now, look, our goal is not for the good guys to take over the world and the world. Our goal is to basically get through this AI transition without big changes in the balance of power, that is the status quo. Maybe that's an easier thing to coordinate around, and maybe that's just a fine place to land. I think I'm most concerned about someone bad taking over the world. If no one takes over the world and if we're able to kind of maintain a world where the relative power and the relative autonomy is kind of how it is and not too far off, and there aren't huge, radical changes that are mostly about AI, then you end up in a world where you have kind of a diversity of different coalitions and they're able to live different ways and try different things, and then you go from there. That might be an outcome. I don't think that's guaranteed to be the best outcome or the best intermediate outcome, but it may be an easier thing to coordinate around, less prone to having a lot of conflict, and as good as anything else we can get to. So that's something I think, think about sometimes.
B
What is the next frontier of security techniques that you think anthropic and similar companies might be able to implement?
A
Yeah, well, there's kind of a. You know, I'm not a security expert, so I don't know that I want to talk about particular controls. But I think there is a potentially interesting distinction that I've been thinking about that is like confidentiality versus integrity. So confidentiality is ensuring that attackers don't get your sensitive information. Integrity is ensuring that they don't stop you from being able to use your own stuff. And so an example would be like, confidentiality would be like, we don't want someone to steal our AI model weights or algorithms and build an AI that's equally powerful. Integrity would be like, we don't want someone to put a backdoor or secret loyalty in our AI so that it's not doing what we want. We don't want someone to sabotage our AI so it doesn't work anymore. A thing I've been thinking about is people including me have for a long time emphasized model weight theft as the big risk. It does seem really bad if you train this super powerful AI model and then it's easy for states to steal it. But an interesting thing is in the world we're in now, if you imagine that there's kind of like, imagine that there's 10 AI companies that all have similarly capable models and imagine that one of them or three of them miraculously has amazing model weight theft protection. So they can't steal. Nobody can steal the models. That isn't really much of a safety benefit. It's like you almost don't get the safety benefit until it's like all 10 of them have put in the protections. Because if the state attacker can't steal from one company, they'll steal from another. Integrity is not that way. Integrity. Let's say an attacker has sabotaged 7 out of 10 of these companies. How glad are we that we have three sources of AI now instead of zero, that are actually reliable and behave the way they're supposed to behave and don't have backdoors in that? Very glad. Right. That's a much more kind of linearly scaling benefit. Especially because if you can prove that that happened, if you can say, hey, these folks are vulnerable to sabotage or vulnerable to backdoors, probably have been backdoored. We are not. Now you can start making the case that customers should use your model, they shouldn't use another model. Now your model is the one that is doing all the stuff and has all the power and has all the options and it's the reliable model. So I have been thinking about a shift from emphasizing extreme confidentiality to emphasize the extreme integrity. A lot of the interventions overlap, but they're not the same. And integrity could be a defense against human attackers and a defense against AI attackers. I think one of the intermediate threat models that I think is very legit is the idea that when the AI is doing the R and D for you now, the AI is in a position to do incredible amounts of backdooring and secret loyalties for your models and make sure that those models are doing what the AI wants instead of what you want. And so having security may be a good kind of domain to think about how to make that not happen.
B
Yeah, it's Interesting that you're talking about, I guess, protection from sabotage, basically. I think, because I've heard people say that the possibility of sabotage or the fear that sabotage might have occurred can even be a positive thing. I guess the mutually assured AI malfunction folks, I think from the center for AI Safety, they put out this paper saying that they, they had a model where they were hoping that China and the US would not have a military arms race towards AGI or superintelligence because they would both reasonably fear that the other side would have backdoored or sabotaged their model, and that if they kind of raced ahead of the other side, that the other side would feel entitled basically to sabotage their effort and they would worry that basically that they would be handing over their own military to the other aside if it had a secret loyalty. It's a somewhat perverse argument that in fact vulnerability to sabotage is a positive thing. Do you have any idea, any views on this? Perhaps there could be unintended negative side effects of better protection from sabotage in.
A
General, in AI, I think almost anything could have unintended negative side effects. I mean, I think it's a terrible cause to work in. If you want to go to sleep every night feeling good about your impact, ensure that you're not having any harm. I think even the people who are convinced that what they're doing is definitely sign positive, I think I would probably argue with almost each and every one of them that they could have a big chance of doing harm. So could it? Yes. Is the situation you're describing something that could happen? Yes. I think it's also very possible that in a world where everyone thinks there's a high chance they've been sabotaged, they just go for it because they're like, what other options we have? We're going to take the risk. Maybe we're sabotaged, but we are. We're afraid they're going to take the risk, so we're going to take the risk. I also think in the most unfortunately likely worlds that happen on very short AI timelines with nothing big changing, probably what we're talking about is everyone is pretty vulnerable to sabotage, but you can make them less vulnerable to sabotage, and that could be a good thing. And so I think maybe if we do get the political will to have an extremely demanding regulatory regime, maybe we do want to think a little bit more about how much do we want to make it definitely a guarantee that your model hasn't been sabotaged. But I mean, at that point we've got the political will so maybe we don't need this problem solved by everyone's models being sabotaged. So I don't really know.
B
So the sorts of projects you're describing, the well scoped object level work, do you think that it's possible to get reasonably confident that the project that you're working on is net benefit at least like 60% likely to be positive versus 40% likely to be neutral or harmful? Or is it more like a sort of 51, 49 situation? I think 10 years ago we talked a lot about the sort of 51, 49 ratio because it was so hard to anticipate the effects of any of your actions because it was going to go through the pinball machine for so many years before it would actually cash out in any real world impacts. Do we have any more clarity now?
A
Well, I think we have somewhat more clarity, but not a ton. I mean I think a lot of the premise for a lot of people listening to this show, I think who would go into AI would be that they're trying to, they're trying to improve how the long run future of humanity plays out over the next several billion plus years. And I think anytime you're trying to do that and you're confident that you're making it better, I think you are wrong. So I mean I think there's just like, there's just like take any project, I mean, I don't know, let's just take something that seems really nice, like alignment research. Like you're trying to like, you know, detect if the AI is scheming against you and make it not scheme against you. And just like, all right, here's a bunch of ways, maybe that'll be good. But maybe the thing you're doing is something that is going to get people excited and then they're going to try it and they're going to do it instead of doing some other approach and then it doesn't work. And the other approach would have worked well. Now you've done tremendous harm. Maybe it will work fine, but it will give people a false sense of security, make them think the problem is solved more than it is, make them move on to other things and then you'll have a tremendous negative impact that way.
B
Maybe it'll be used by a human group to get more control, to more reliably be able to direct an AI to do something and then do a power grab.
A
Absolutely. Maybe it'll make some humans more confident that they're able to control their AIs and then. Yeah, exactly. Make people more likely to move forward or just empower a malicious actor. Maybe it would have been great if the AI took over the world. Maybe we'll build AIs that are kind of like. Like they're not exactly aligned with humans. They're actually just much better. They're kind of like they're our bright side. They're the side we wish we were. This is how I sometimes feel when I actually think about some of these chatbots versus actual humans. I mean, sometimes it feels that way. They're certainly more polite. So maybe that would have been better. And what happened is maybe at some point we realized this, but we've created techniques for keeping these things completely under our thumb. There could be a lot of ways in which it's better. One way is I think I mentioned. But a human taking over the world might be more likely to deliberately inflict a lot of suffering. Maybe after we're all wiser and we understand everything, we realize that actually was a very big deal and we should have cared more about suffering compared to upside. So there's a lot of stuff. Another thought I've had is just, maybe alignment is just what it means is that you're helping make sure that someone who's kind of, I don't know, unsophisticated, intellectually unsophisticated, that's us, that's humans, remains forever in control of the rest of the universe and imposes whatever dumb ideas we have on it forevermore, instead of having our future evolve according to things that are much more sophisticated and better reasoners following their own values. Now, maybe that's a good thing, because maybe human values just are what they are and more sophisticated things be worse, but maybe that's a bad thing. I think if you feel confident on these topics, then I don't agree with you for feeling confident on them.
B
Yeah, maybe looking at it from the billionaire point of view, thinking like, will this lead the entire universe to a better outcome? Maybe 5149 does sound more realistic because there are so many things that could happen later on many ways that in the long term, things could end up with unintended consequences. If I'm thinking more like in 20 or 30 or 50 years, will we look back on the work that we were doing now, and I think that it was for the best. I feel stuff like mechanistic interpretability, trying to improve alignment, trying to improve security. I feel like we're maybe more at the 60, 40 point now where we could say, I feel like, solidly confident that it's more likely to help than hurt. Although I think it's like, the much more likely thing is that it doesn't matter. Like, almost all of this work, I think the overwhelming likelihood is that in the end it doesn't matter, but that if it does have an impact, I think you can find some stuff that I would say is meaningfully more likely to help than to hurt.
A
I just think AI is too multidimensional and there's too many considerations pointing in opposite directions. It's just like I'm worried about AI is taking over the world, but I'm also worried about the wrong humans taking over the world. And a lot of those things tend to offset each other, and making one better can make the other worse. There are also things you can do to make both better. But every time I come back to this, every time I come back to some intervention, I just have new thoughts about whether it's good and bad and how it's good and bad. And then there's also, like, I've emphasized some of these macro things, like what if we're fundamentally confused about what would be good? There's also, like, a lot of microwaves in which you could do harm. It's just like literally working in safety and being annoying. Like, you might do net harm. You might just. You might just talk to the wrong person the wrong time, get on their nerves. I've heard lots of stories of this. I've heard lots of stories. They're just like, hey, this person does great safety work, but they really annoyed this one person and that, you know, that might be the reason we all go extinct. Well, usually I don't add that last part. I add it in my head. So, yeah, I don't know. I think overall I would probably agree with you that just the smaller you're making the scope of where you're hoping to have impact, the more reasonable it is to be like 60, 40. But most people who go into AI are not going into it for that. Otherwise, if they want a small scope, robustly positive of impact, you should maybe work in a cause like farm animal welfare or global poverty. So, yeah, I think for the size of impact that tends to motivate people. I think it does get partially offset by this huge uncertainty about the sign. I tend to think it's worse than 51, 49. I tend to think it's like, we're always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy brain considerations that one should have had in one's head, the more it's going to be like 50 plus epsilon percent. And I think AI safety is a great cause to work in. I'm excited to work in it. I think it's high impact. But I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the Utilons or whatever is going to be negative.
B
An intervention that's particularly vexed at the moment is this question of how much to centralize control versus distributed power over AI, particularly widely. It's vexed because on the misalignment side, you usually probably want to have fewer projects, like more control over the compute, enforcing all kinds of regulations on different people. Like, yeah, the fewer projects and the more restrictions, the better on human power seizure or concentration of power by humans. You just want to be disseminating AI as widely as possible, ensuring that no one group has a decisive advantage in the amount of compute or the kinds of algorithms it has access to. And both of these are just very legitimate concerns. So it's one of those ones where I don't know what the solution is. I guess you can try to come up with policies that benefit the former without being as bad for the latter. I guess develop policies in both directions so that you have the opportunity to go in one or the other direction, depending on which threat seems bigger at the time.
A
Option value in the policy world is kind of a bad concept anyway. It's like a lot of times when you're at a nonprofit or a company and you don't know what to do, you try and preserve option value. But giving the government the option to go one way or the other, that's not like a neutral intervention. You know, it's just like you don't know what they're going to do with that option. Giving them the option could have been.
B
Bad because they'll take it in the bad case.
A
Yeah, because you can't be assured that government's going to do reasonable things with that option. Government is this kind of lumbering beast and you don't know who's going to be in power when and whether they're going to have anything like the goals that you had when you put in some power that they had. And I know people have been excited at various points about giving government more power and then other points giving government less power and all this stuff. I mean, this one axis you're talking about centralization of power versus decentralization. Most things that Touch policy at all in any way will move us along that spectrum in one direction or another and so therefore have a high chance of being negative, high chance being approaching 50%. And then most things that you can do in AI at all will have some impact on policy. Right. I mean, even just alignment research, it's just like policy will be shaped by what we're seeing from alignment research, how tractable it looks, what the interventions look like that will shape policy in all kinds of ways. To the extent anything happens on policy, maybe nothing will happen.
B
Do you think that AI is especially unpredictable or especially prone to accidentally causing harm relative to other problems that you're familiar with like global health and development or animal welfare? Because at least some of the negative sides effects that you were talking about, like getting people excited about the wrong thing or winding someone up and turning them against you, those seem to also be present pretty widely. Those aren't unique to the AI situation.
A
Well, they somewhat are. I think AI is more like there's. I think there's a significant thing in AI where there's just like different theories of the case and then there's different people going against each other because they have different theories of the case about what would be good. I think there is less. I think in global health it's mostly like everyone is on the same page. We want fewer children dying of preventable diseases. And in AI it's easier to annoy someone and polarize them against you because there is some coalition, whatever it is you're trying to do, there's some coalition that's trying to do exact opposite. I think in certain parts of global health and farm animal welfare, there's certainly people who want to prioritize it less, but it doesn't have the same directional ambiguity. So I think that is an issue. But I think the bigger issue is just the more you broaden your aperture and the more you measure your actions in terms of their impact on all the beings that will ever exist and ever have existed. And there's many good arguments that you can have many impacts on past generations. So the more you broaden that aperture, the more it's just like you have no idea. So, yeah, same thing. If you, if you're giving out bed nets to prevent malaria, if you judge that action by the impacts on all the future generations, yeah, you're going to have a complete mess on your hands too. And it's going to be very close to 50, 50 on the sign. So that I think is the bigger issue.
B
Attention that I'm noticing is what you're saying sounds more reasonable when we're imagining an individual person who's having to choose to go into a particular kind of project to the exclusion of something else, something going to push a particular agenda and a particular set of priorities. But if I imagine all of the work that's being done to try to push us in a positive direction to reduce these risks, if I imagine all of the different projects kind of being doubled in size, so there's twice as many people working on control, twice as many people working on scalable oversight, twice as many people working on mechanistic interpretability, twice as many people work on governance, I feel like that should result in more work and better work. And it's a little bit harder to see how a scale up like that leads us to a worse place. I don't know, it feels a bit harsh to say that if we doubled the total amount of effort and people were trying to make reasonable choices, that would only be like 51% likely to make it better. Did you feel that tension?
A
Well, I don't think that's a tension. I think that just when you increase your sample size, your noise goes down. So I think that's fine. I think that's pretty true. Yeah. I think just doing way more of everything I think would probably be better than 5149. Sure. I'm more talking about as a person making choices and picking projects, you have to be okay with that downside. And even if your goal is to double the number of people working on everything, you might do it in a way that's counterproductive.
B
Yeah.
A
Okay.
B
So I guess that makes me feel a little bit more. I suppose that helps to reconcile how it is that on the individual level it's so unpredictable and yet we still think that the expected value is positive on average because we think people do have some ability to discern what things are beneficial versus not. And so the value of any particular person could go either way. But then on average we think the effort is helpful with reasonably high confidence.
A
Yeah, it would just be a thing where just imagine each person has a slightly above 50% chance of being positive. But then imagine that there's some anti correlation and some lack of correlation. And so just as you pile more people into this, you're going to get above 50% further from 50%. Yeah.
B
You used to work on global health for many, many years and you've taken a pretty significant side interest in animal welfare as well. When you were at openfill, do you think that the work you're doing on AI now is more pressing than working on those causes. Would you typically recommend that someone, at least someone who's willing to take the risk of not helping or even of causing harm, that they should switch over into working on AI from those problems if they felt that their personal fit was acceptable and they thought that they could have a happy life doing it?
A
Yeah. I tend to think that working on AI is probably generically the most important thing to work on and the highest ROI thing to work on. But I have probably more uncertainty about it than most people in this field and I think it's less of a slam dunk and less of a thing. I don't think it's by orders and orders of magnitude and expectation. Yeah, I mean, I think it's just because that sign uncertainty is such an issue. I think it is. There's the sign uncertainty of AI, and then I think there is the fact that you can get unexpected benefits from just doing stuff really well in general. So anything you do well is going to put you in a good position to do more stuff well. So for example, when I was co founding GiveWell, people were saying it would be completely nuts to work on GiveWell if you understood the AI situation and it would be completely nuts to give to GiveWell top charities if you understood the AI situation. And I just don't think either of those has panned out very well. I think GiveWell becoming successful obviously did have a lot of benefits for AI safety, or at least according to me, it did. Certainly didn't end up irrelevant. So I think we've actually seen this, I think at open philanthropy, a fair amount where there'll be a grantee that's on one side of the org, the non global catastrophic risk side, that does become a very big win from the other point of view. So that's not. None of this is to say it all washes out and it's all the same. I just think that some people have in mind that it's like that every cause is a rounding error compared to their cause. And I don't tend to think of it that way. I tend to think of it as like, well, this thing seems like the best to me, but I don't really know. And if I was going to be miserable working in this cause and really happy working in another cause, it's probably just a better rule and a better policy for everyone who cares about this stuff in general to put a lot of weight on where they're going to be happy. Because it's probably better for the community to spread bet a little bit.
B
There's an interesting effect where if you think that we're likely to have a massive speed up in research and development of all different kinds, a massive improvements in science and technology over the 5, 10, 15, 20 years due to AI, some other efforts to improve the world seem less useful now because they're likely to be kind of superseded by the work that AI will be able to do much faster than humans can at some later time. So you might think that some sorts of, for example, like cancer research that we're doing now, it's great that they might deliver returns immediately, but they might, like AI, might just be able to do a much faster job, like speed up cancer research 10 or 100 times, maybe starting in 20, 30. And people will to some extent feel a little bit like they've wasted their time. And it's interesting, I guess. Does the AI situation reduce the cost effectiveness of work on global health and development or on animal welfare? I mean, it's not a simple question because in the global health and development case, very often the efforts are to save the lives of children, specific children alive now who may die of a disease because they're not getting the treatment. I mean, it's like very severe and worsening problem this year. And the fact that AI might develop a better cure for better treatment for malaria in the 2000 and 30s is not going to save or bring back the children who die now. So if you have a person specific view, then this doesn't necessarily bite. If you're doing work, I guess to try to develop technologies that would help to end factory farming in the future, but you don't expect them really to pay off for at least 5, 10, 15, 20 years, then it's maybe easier to see how that stuff could end up just being superseded basically by work that AI does down the line. And people will feel like in fact they were somewhat wasting their time. Do you have any thoughts on this slightly complex moral and practical question?
A
I mean, I tend to separate causes in my heads into buckets that have a lot to do with how speculative are you being? How long a timeframe are you working on? How theoretical versus immediate and believable is your impact? And so, so I tend to think causes that are taking a really big, long term, high risk bet on the future. Those are the ones that to me, I feel like I lose interest in when I am thinking about AI, because I'm just like, why would I do that when I could work on AI? So for example, if there's Some people are concerned about the fertility crisis or something. And it's just like, oh, I'm worried that over the next several hundred years, we're going to have gradually standard of living increasing at a slower rate or something. And it's like we need to work on, you know, some kind of ambitious program to somehow address. I don't know how you address that or, you know, or someone wants to do, like, science moonshots that are just like, well, for the next 50 years we're going to see nothing, but then maybe we'll see a big win. And I'm like, man, if you're going to do that, just work on AI. You know, if you want to have that much uncertainty and operate on that kind of crazy timeframe, like, do something that in my opinion, is just a bigger deal and a higher likelihood, you know, I think. I feel. I feel less that way. I think there is more of an apples to oranges feel when I'm thinking about AI versus global poverty or animal welfare. I think part of me thinks it's apples to apples, and part of me thinks that working on AI is more important, but part of me thinks just like you kind of have to have a little bit of a brittle and rigid philosophical framework to really find a reference point for comparing those. I generally just don't believe anything that philosophers come up with because I don't think it's a discipline that has a good track record. And I think a lot of it is just us fooling ourselves with thought experiments. I mean, I enjoy it and it often changes the way I think, but I think if I had an incredible opportunity to do a huge amount of good for farm animal welfare or global poverty and someone tried to argue that I should drop it and do some kind of mediocre AI work that I didn't like, I think I just. Yeah, I wouldn't buy that. Overall, yeah, it's a complicated topic and there could be lots more to say about how do you think about what the right theory of ethics is and how do you think about how to deal with your uncertainty about the right theory of ethics is. And at what point are you modeling your uncertainty intuitively versus using a framework that itself is subject to uncertainty? So I don't know, it could be a whole other topic, whole other podcast.
B
So we've been talking almost all about exclusively about working in this field directly. Do you have any view on how that compares to earning to give and funding organizations or projects that are doing good work? I know there's like, at Least Mida, for example, is requesting funding from various donors. And I'm sure there are other groups that would value donations and feel like that might move them forward faster.
A
Yeah, I think as far as I can tell, in general, the case for donating in AI is getting stronger. There's more stuff to do, like I've said, there's more tractability. I think it's also becoming just more bad and awkward for one or a small set of really big funders to just cover everything because. Because it's becoming a bigger, higher profile issue. People follow the money and different funders have different reputations. And for example, I think there's open philanthropy, which is a big funder, but they're connected to me even though I'm not there anymore and I'm connected by family and work at Anthropic. So it's just not everything is great to be getting money from open philanthropy. That can create a real and perceived conflict of interest, which I think is going away with time as I'm not there now. So I think the opportunities for donating are probably better than ever, but a lot of the other things to do are better than ever. So I think this just further amplifies the picture that there's just a ton to do and finding something that fits you really well seems good.
B
Yeah, I guess not having thought about this deeply, my guess would be that if you can get a job in a direct work role in one of these companies or some other organization that is really desperate to have you, for most people that is going to be earning to give, however many, many people are not going to be able to get roles like that because they're just not suitable, don't have the right skills at exactly the right moment. And for those people, going and earning to give is a much wider field where you can do a far wider range of things in order to make money and donate. And so basically everyone who can't get a direct work role should seriously consider earning to give and then funding these various projects that either just aren't getting enough funding or can't receive funding from particular donors for one reason or another.
A
Maybe it depends a little bit on what exactly you're defining as direct work. I mean, I think it's more and more true that there's just a ton of jobs in the direct work that are maybe an indirect version of the direct work where you might be helping with run the organization or helping with the business side. If you are happy to go to an AI company that you think is good and Try and just help the company succeed on the business side, which I think you can debate, is that a good thing to do? But there's certainly an argument that it is. And I've been making a lot of that argument. Then there's just like a ton of really normal jobs. Right. And there's, I mean, certainly at Anthropic. I mean, I think there's just a ton of value being added by people who aren't in these kind of traditional, effective, altruist minded areas. You know, the legal team I think is very important and a lot of what they do is a big deal. Some of it helps with like the funky governance that Anthropic has. Some of it is more just the everyday business needs. There are a lot of jobs that are just like in the AI industry now that you might think of that way. And some of those are actually pretty good for earning to give too, depending on how you think of it and what time frame you're on. So I don't know, it's not that clear. But yeah, I think what you're saying has some truth to it.
B
Yeah, I'm just sensitive to the fact that most people in the world could not in any given moment get a job at Anthropic or even a similar organization, if only because of location or where they're at in life. But I guess fortunately for those people, the fact that potentially they only to give opportunities or the giving money opportunities are also better than ever gives them a substantial way to contribute potentially.
A
Oh yeah, that's totally fair. Yep.
B
All right, we should wrap up. We've been going for many, many hours and you have incredible stamina, Holden, but I'm not sure whether people can tell whether you're flagging.
A
Well, you're the one who's up late.
B
Yeah, I'm up later. But I've said like probably a tenth as many words as you, I guess, to close us out. Do you want to give us a bit of an inspiring call to arms? I mean, I think there's probably a lot of people in the audience who are, you know, somewhat definitely concerned, definitely interested in these topics. If they're still with us at this point. They surely are at least somewhat interested in AI, but are not working in the area and may well not have applied for any kind of role or perhaps even seriously considered changing what they're doing. Do you want to say something to inspire them to go and check out the 80,000 hours job board or the Anthropic job board?
A
Yeah, if you've tried really hard to get a job in AI safety and you can't find anything you like, then that's one thing. If you haven't tried, I think at this point, point I'm just comfortable being like, that's insane. You should at least take a look. Because it's an incredibly dynamic field. It's fun and interesting in a lot of ways that aren't just about having impact. It's one of the fastest changing things in the world. It pays well. It's like a lot of organizations are just very good to work for. And there is so incredibly much to do. There is so much to do. That is, while it might be 5149 is 5149 on maybe the most important thing that will ever happen to humanity. And whatever your skills are, whatever your interests are, we're out of the world where you have to be a conceptual self starter theorist, mathematician, or a policy person. We're into the world where whatever your skills are, there is probably a way to use them in a way that is helping make maybe humanity's most important event ever go better. So I would definitely look into it. I would definitely get in there, look for some jobs and, you know, if you don't find something that fits you, you don't have to take it. But definitely, if you haven't looked, I would.
B
My guest today has been Holden Karnofsky. Thanks so much for coming on the 80,000 Hours podcast, Holden.
A
Thanks for having me. It's been great.
Podcast: 80,000 Hours Podcast
Episode Title: Holden Karnofsky: "We're not racing to AGI because of a coordination problem" and all his other AI takes
Date: October 30, 2025
Hosts: Rob Wiblin and Luisa Rodriguez
Guest: Holden Karnofsky (Co-founder of GiveWell and Open Philanthropy; now at Anthropic)
This marathon-length episode features Holden Karnofsky sharing his wide-ranging takes on artificial intelligence (AI) risk, the dynamics of AI development, responsible scaling policies, power grabs (by both AIs and humans), current progress in AI safety, and more. The discussion critically examines common narratives in AI safety and digs deep into empirical updates, practical interventions, and the comparative impact of working in AI versus other cause areas.
The episode centers on understanding and demystifying the strategic risks posed by advanced AI. Holden challenges the oft-cited "coordination problem" narrative—that key players wish they could cooperate and go slower but are compelled to "race" forward. He details tractable risk-reducing interventions, the value of incremental safety work, and the nuanced ways that working inside leading AI labs (like Anthropic) can help. Holden also explores how AI might actually play out: the threats, hopes, how companies like Anthropic fit in, and what listeners can concretely do to help.
Timestamps: [00:00], [18:26], [26:19]
"I emphatically think this is not what's going on in AI. There's just plenty of players now who want to win and they are not thinking the way we are." —Holden, [18:26]
Timestamps: [02:56]–[06:19]
Timestamps: [07:33]–[16:35]
Timestamps: [30:08]–[41:11], [161:32]–[172:30]
“If you play your cards right, you can pay very small amounts of so-called tax and have very big safety benefits.” [52:19]
Timestamps: [59:01]–[74:27]
“I think a lot of people interpret [RSPs] as being these ironclad commitments ... but that was never the intent.” [59:35]
Timestamps: [76:37]–[101:14]
“There is so incredibly much to do ... we're out of the world where you have to be a conceptual self-starter theorist.” [269:09]
Timestamps: [102:05]–[142:29]
Holden covers and ranks the usual threat vectors:
Timestamps: [152:34]–[158:28]
“You could get better effects if you had regulation, but the tractability is massively higher [with direct company intervention].” [152:34]
Timestamps: [98:26], [269:09]
Timestamps: [181:15], [186:38]
“My overall attitude here is: I think we'll probably get a happy ending, even if we do a horrible job with this. … But we're also being irresponsible, and that's different from saying we're doomed.” [172:30]
On the “racing” narrative:
“I emphatically think this is not what's going on in AI. I think it's not at all what's going on in AI.” —Holden, [18:26]
On the “do nothing” AI takeover strategy:
“…the optimal strategy for AI is do absolutely nothing. Be as helpful and harmless and honest as you possibly can be. Don’t ever give anyone a reason to think that you're doing anything bad ... just wait.” —Holden, [09:20]
On the role of responsible labs:
“You can pay very small amounts of so called tax and have very big safety benefits because … you get something that actually works and is not very expensive.” —[52:19]
On RSPs not being ironclad:
“I think a lot of people interpret [RSPs] as being these ironclad commitments … but that was never the intent.” —[59:35]
On incremental improvement:
“You have to be comfortable with an attitude that the goal here is not to make the situation good, the goal is to make the situation better. You have to be okay with that, and I am okay with that.” —[152:34], [00:00]
On object-level work:
“Whatever your skills are, there is probably a way to use them in a way that is helping make maybe humanity's most important event ever go better.” —[269:09]
On the sign of impact:
“AI is too multidimensional ... I tend to think it's worse than 51, 49 ... I'm excited to work in it, but I really do have to live with the possibility that my ultimate impact ... is going to be negative.” —[250:24]
Holden, with his classic blend of clarity, humility, and skeptical optimism, gives listeners a robust update on where AI safety thinking is today. He invites listeners to set aside simplistic narratives, seek out tangible leverage (even if partial), and act despite the fog of uncertainty.
"We're not racing to AGI because of a coordination problem ... we're not doing enough, but there's still a lot of positive expected value on the table. And you—yes, you—should apply."
For those who haven’t listened: This is a deep, lively, sometimes challenging, but ultimately hopeful episode about what it takes—and what it means—to try to make the arrival of AGI go well. There are no simple answers but there are worthwhile actions, and the window for positive impact has never been bigger.