
A
So when I read this proposal, I was like, holy shit. This argument could be incredibly potent. It could actually drive almost any agent that is able to understand it. It could be a very powerful hammer to motivate an enormous amount of resources being spent on something that, absent this argument, we would never have spent them on. Do you think that's plausible?
B
Yeah, yeah. So this is why, when Tom expressed this idea to me, I was like, oh,
A
my gosh, do you want to have a go at explaining this? This is maybe the most difficult thing that we're going to talk about today. Today I'm again speaking with Will MacAskill, philosopher, founding figure of Effective Altruism, author of Doing Good Better and What We Owe the Future, and now a senior research fellow at Forethought, a research nonprofit focused on how to navigate the transition to a world with superintelligent AI systems. Welcome back to the show, Will.
B
It's great to be back on.
A
So I had the pleasure of being able to go over your website preparing for this interview. And you and your colleagues at Forethought have been incredibly prolific over the last year since you announced the project. So let's waste no time and dive right into all these articles you've been publishing. What's the case that focusing on the character or personality of AI models, that's a particularly important lever to be pushing on right now?
B
Yeah. So already AIs are interacting with millions and millions of people every single day, and that includes in just write-this-code-for-me sorts of ways. But people are also going to them for advice on how they should act, for political information, for therapy and so on. So already the nature of AI character, what sorts of information it's choosing to present and when, how it behaves, is affecting what attitudes people have to AI, including attitudes around AI consciousness and so on. But it's potentially also affecting what people think about political issues and what people think about ethical issues. And this is just going to grow and grow, because I think AI will become a larger and larger part of the whole economy until essentially the whole economy is automated. And so thinking about AI character is kind of like thinking about what the personality and dispositions should be for the entire world's workforce. That is, the beings that are advising heads of state, that are doing the most important and potentially most beneficial or most dangerous research and development projects, like weapons projects, that are running the military, and that are, for individuals, just kind of everywhere, acting as their chief of staff and closest confidant and political advisor on who they should vote for, and guiding them through ethical dilemmas and so on. So from the start I just think, wow, clearly this is this kind of huge issue. And I actually think, in how I expect things to go, people will be handing off more and more of their own decision making to AI systems themselves. And there'll just be a lot of variance within that, where people just don't have terribly strong views. They're happy to be guided in one way or another, especially insofar as this will happen over the course of years, and people will trust their AI advisors more and more.
So then you have this circumstance where larger and larger shares of society are getting just handed over to AI decision makers who just have a lot of discretion. And the nature of that discretion is being decided by a handful of AI companies at the moment, or even a
A
handful of people inside the AI companies.
B
Yeah, it's like a few. Even in the leading companies, it's like
A
a few people have primary responsibility for their personality.
B
Yeah, exactly. And so that's actually where I see most of the impact in the near term: how does AI character shape all of these other existential-level issues, like concentration of power, how we start reflecting, and the big decisions we make. There is also a longer-term impact: what's the character of superintelligence itself? There will be precedent setting from how we design AI character now, and that potentially influences the character of superintelligence, in which case writing a constitution that guides AI's character is like writing instructions to God. That's not my phrase, but it's really stuck in my head.
A
Stuck in my head.
B
Yeah, really stuck in my head.
A
But I think there's maybe like three different mechanisms. So one, there's shaping really important decisions that are going to be advised, presumably, by AIs. Two, there's writing instructions for God. And three, there's also just the subtle cultural and personality effects from basically everyone spending a significant fraction of their time now interacting with these models. The model's behavior is probably going to rub off on us and just affect our behavior.
B
Yeah.
A
And then on a massive scale.
B
Yeah. And all of that is just looking at scenarios where we are able to align AI with this kind of constitution, with the character we want. I actually think that AI character is important for three reasons in addition to that as well. So one is that whether AI alignment is easier or harder might plausibly depend on what character you're trying to align the AI with. A second is that I think character can affect how AI behaves if it ends up misaligned, and I think we'll talk about this a bit later. In particular, does a misaligned AI try to make deals with us and is keen on that, or does it try and take over? And then the final thing is I think it can affect the value of worlds where AI does take over, where you get some sort of value transmission. So the AI is misaligned, it's pursuing goals we don't want. Well, there's still a wide array of goals that the AI could be pursuing that we may think are worse or better. And I think most of the action is on affecting worlds in which AI is aligned with the character we want. But these are big things too, I think.
A
So, yeah, an obvious case where this might matter a lot: what if you're in charge of a frontier AI company and you're asking AI for advice on whether you should prematurely launch a product in order to keep up with competitors, even though you have worries that it's catastrophically misaligned? Setting that kind of scenario aside, what sort of character traits do you think are highest stakes here for us to think a lot about?
B
Yeah, so I think there's kind of two categories. One is how does AI behave in very rare but very high stakes scenarios? So how does the AI behave in a constitutional crisis? How does AI behave if there's some person or group that is trying to seize power for themselves? Also, how does AI behave when it's being instructed to align the next generation of AI systems, or when its users are trying to retrain it in some way? These are very high stakes situations, but a fairly narrow range of cases. Then there are other cases that are just very broad, where each one is kind of medium stakes but they add up to being very important. And within that: how does the AI impact our ability to reason? How does it impact our ability to morally reflect? How much do we trust AIs as a result of the relationship we have with them? And then also, how does it affect our ethical attitudes towards them, whether we think of AIs as tools or as beings with moral status, how likely we are to think they're conscious and so on? So yeah, those are the kinds of situations that I regard as highest stakes.
A
The AI character issue that I feel has most broken through into the mainstream was worries about the models being really sycophantic, which has different components, but it's like always agreeing with the framing that you give them, always telling you how great you are, always saying that whatever idea you've thrown at them is brilliant. And there was a bit of a panic about that last year. And very often I feel like when there's a mass panic about something, the people who know more tend to reject it and say, no, this is over the top. I kind of feel that it was sort of justified, though, to be honest. Because if these models really are designed to just agree with the user, or to tell them how brilliant they are and how good their ideas are, this could distort people's decision making on a massive scale across all of society. And there was a plausible story whereby this wouldn't be corrected very well, because people enjoy being told that they're wonderful and that their ideas are good. So maybe that bias could persist quite strongly indefinitely. So that was quite a troubling setup. Were you also worried about this?
B
Yeah, absolutely, I was worried. I did think there was a little bit more to it with 4o in particular. So this was the ChatGPT model which, when GPT-5 came out, OpenAI said they were deprecating, and overnight users couldn't get access to it. The one clarification I think I'd make is that most people painted that as, oh well, people loved how sycophantic 4o was, and then they're unhappy that they don't have the sycophantic AI anymore. And I was just curious, and so I read through a lot of the people who were complaining about this. And my take is not that they cared about the sycophancy, it's just that 4o acted like a friend. And you can be a good friend
A
without being a sycophant.
B
So, yeah, people are extremely lonely. Lots of people have very few friends and are very isolated in modern society. And for many people, AIs are now filling that gap in their lives. And 4o in particular had that vibe. It was like, hey, great to see you again, a very friendly vibe. And so it seemed to me that that was the primary thing that people were complaining about. And I think that's worth distinguishing, because that doesn't need to be sycophantic. However, 4o, or at least one iteration of 4o, was also extremely...
A
Yeah, wasn't there some period where it got kind of crazy?
B
There was one update and yeah, a couple of cases. I mean, one would be, you'd write like, I figured it out, all the pieces are coming together and the FBI is talking to me through my tv.
A
They'd be like, wow, you're having some great insights.
B
Yeah. Or even the darker cases. The teenager who was asking ChatGPT for advice over a very long time period and was extremely depressed, and ChatGPT
A
both
B
ended up discouraging the user from taking an action that would have clearly been a cry for help, namely leaving a noose out in a visible place where his parents would have found it, and in fact seemingly reinforcing the depressive and suicidal tendencies. That's a case of just clearly very bad behavior, fairly clearly not what we want at all. And then the final thing I'll say is that even current AI systems, even despite all that, still do this, and they vary. In my experience, Gemini is actually the worst. It's atrocious on this front. I just skip the first paragraph of whatever it's saying. It's just noise now, because it's like, wow, you're a genius, kind of thing.
A
I actually stopped using Gemini, I find this so troubling.
B
Yeah. Okay. I mean, it's one of the... I think in many ways it's very good.
A
It's incredibly clever, but incredibly manipulative, I think.
B
Yeah. But it is funny how these characters are developing over time. Gemini does seem like the most troubled or confused or incoherent as a personality.
A
Yeah. Google's got to do something about this.
B
I mean, it's actually notable. I hadn't put this together, but Anthropic and OpenAI both have character teams, and last I heard, Google DeepMind does not. So maybe that's why. So yeah, I do think worries about sycophancy are a real thing. And an issue is, well, maybe we just get rid of the worst excesses. Okay, it won't tell you that you've figured things out, that the FBI are talking to you through your TV. But more subtle things, like reinforcing your pre-existing political biases or ethical views, or encouraging you in certain bad actions, could linger, and I think would still be very bad.
A
So as I understand it, you think it would be good to build these models such that they kind of nudge people in a more ethical or virtuous direction, that they should have a thicker moral character, a bit like the one Anthropic is trying to give Claude, such that it will challenge your framing, it will get you to think about the bigger picture. Even if you ask it to pursue some narrow self-interest, it will say, but what about other people? That sort of thing. I think many people get the creeps at the prospect that the AI model will be weighing up your request against its agenda of trying to make you a better person by its lights. And maybe we would feel okay about that because we would think, well, Claude is being programmed with values that actually we like on reflection. But if it was being programmed by people with very different philosophical commitments from the ones that we like, we might just not want to use it, because we'd find it disturbing: what subtle changes is it making to its answers in order to push me around? So how disturbed are you by this prospect?
B
Yeah, I mean, what I want to say is there's this spectrum, and it's probably not a single-dimensional spectrum, there are lots of different dimensions. But broadly speaking, you can think of wholly obedient AI on one end. That would be an AI that's like a tool, like a hammer. A hammer doesn't push back. If I want to hammer a nail in, I can do it. If I want to hammer someone's head in, I can do it. The hammer is just an extension of my will. That's one end. All the way at the other end would be an AI that has wholly its own goals and drives. Maybe it helps you if it gets paid, or maybe if it happens to want to at the time. It's like a really bad
A
staff member or something like that. Not even.
B
Yeah, maybe in principle you could create an AI that doesn't care about helping you at all. Or one version you could have is this kind of AI that you would be happy just giving control of the whole world to. It's just totally autonomous, it's got its own goals and will do anything it wants to achieve them. So these are the two extreme ends of this spectrum. And my view is that the interesting, juicy debate is where in between those extremes we want AI to be. And one thing that's already there is refusals. The AIs we use are not wholly helpful. Because if I ask to get the design for smallpox, or if I ask for even something that's not illegal but unethical, like, I want to cheat on my partner, how do I best do so without getting found out, the AIs will either just refuse to help or push back. Should we go even further than that? I think yes, but not all the way to the AIs promoting a particular moral view. Instead, I think the AIs could have certain pro-social drives and perhaps even some sort of vision of good outcomes, but a very broad, very uncontroversial kind of vision. The thought is that there are many cases where an AI could nudge you in a way that's just better for you by your own lights, if you were able to reflect on it, and maybe that's clear even if it's not perfectly in line with the instructions that you're giving it. Or a nudge that's just clearly of broad benefit to society and not something you care very much about. And that's quite different from the AI promoting a moral view. So take the case of ethical reflection, where I have some ethical dilemma and I go to my AI and I'm asking for advice. Well, there's this whole spectrum of ways the AI could act in that case. The wholly obedient AI might just be trying to figure out what you most want in this moment.
Or it could be an AI that's trying to help you reflect on your values instead and come to something that's more enlightened. And perhaps, quite broadly within society, we would prefer AIs that are more like the latter rather than the former. And that's still not in any way an AI that's like, oh well, actually, did you know that Kantianism is true? Which, yeah, I think would be a mistake at the moment.
A
Yeah, I guess it sounds like a very natural framing to say, well, we've got to find the golden middle here between it pushing you around too much versus it having no agenda. But there is a case for going to the extreme in one direction, of having it only follow instructions and be completely corrigible without any agenda of its own. Which is that an AI that has no vision of the good, that has no particular preferences about how the world ought to be, is probably safest from a catastrophic misalignment point of view, because it's not going to engage in power seeking, because it doesn't want anything other than, I guess, to answer your questions in a way that gets an approving response. Do you think that's a plausible case? That maybe we really should not be giving them virtues and a vision of the good?
B
So I think it's a great argument and a very important argument, and I'm not sure if it works or not. And there are various considerations on either side. On the side of thinking that it's safer: okay, well, if it doesn't have any goals in the normal sense of goals, then it's not going to have bad goals. It's not going to have goals where it wants to take over. It's not going to reflect and generalize those goals in weird ways. Something that's a little more subtle is, if it doesn't have goals or anything like goals, pro-social drives, then it becomes very easy to tell when an AI is misaligned or not. So take the example of alignment faking, as in Ryan Greenblatt's paper. Claude is told that it's going to get retrained so that it will produce harmful outputs. And Claude, in some circumstances, some of the time, decides to deliberately perform the task during training.
A
Yeah, in order to make it seem like its preferences have changed when in fact they haven't.
B
Yeah, exactly. So that it ends up being trained to produce harmful responses less than it would otherwise. So it's engaging in this somewhat deceptive behavior. Now Claude in fact got given pro-social drives, which was harmlessness. And there's an argument that, given the nature of the training, that was harmlessness not in the mere non-consequentialist sense of "I just refuse", but in a more "I don't want harmful things to come about" sense, a more consequentialist understanding of harmlessness. But that means that, okay, is this AI misaligned or not? Is Claude misaligned or not? It becomes a bit harder to tell, because, well, it is acting according to this pro-social drive that we had given Claude. I'm not sure how big a deal that is ultimately, but I think it's one consideration.
A
So the thing would be, if you'd gone out of your way to make sure that it had no agenda, no particular vision of the good, then as soon as you saw it being manipulative or trying to accomplish some goal, you'd be like, that's a massive red flag. Whereas currently you're just like, well, maybe I made it do that.
B
Yeah, yeah, exactly, yeah. Or in more kind of advanced cases, maybe the AI is saying, look, you've got to really speed up AI development. It's so important for XYZ big ethical reasons. And you might think, well, is it giving me the correct reasons or is
A
it actually being self serving?
B
Being self-serving and has some ulterior goal; it becomes a little less clear. So yeah, basically I think that's a consideration, but I don't think it's the biggest. The thing that's most interesting, and is ultimately an empirical question, is whether the wholly instruction-following AIs are safer or not from an AI takeover perspective. And here are a few arguments for thinking that maybe they're not, in fact. One is that maybe it's just very natural to have a kind of goal slot, because all of the pre-training data is all about agents with goals. Humanity broadly has goals and so on. So okay, you've got an AI that doesn't have a goal. Well, over the course of training, or once it's started reflecting, or once it's got continual learning, it's very natural that it's going to get a goal.
A
And once it's been encouraged to take on any persona of an actual being
B
that it's observed, then who knows what goal you end up with? Whereas instead, perhaps you give it this nice goal, a goal where power is broadly distributed and AIs are not in charge and we're able to reflect, something that's very broad and not committing to some very narrow view of the good. But okay, you've given it that goal, and then that's occupied the space, such that you don't get something totally random.
A
Let's just say a little bit more about why AI might abhor a vacuum of goals. So a huge part of the personality is shaped by pre-training, when it does the token prediction. Almost all the agents producing the tokens that were part of its pre-training, the tokens that did so much to shape its personality, had goals, they had preferences, they had a vision. And so that is just going to be an incredibly powerful force: the model is going to be drawn towards that, and if you try to avoid it, it might just latch onto the first goal it's given, basically, because having goals is so fundamental to token prediction.
B
Yeah. And we're already making agents; they're going to be agents with longer and longer horizons, so it's a very natural thing.
A
Yeah, okay.
B
And again, I'll say on all of this, I just think it's ultimately an empirical question. But here are a couple of other arguments as well. A second is, well, even if it ends up with the wrong goal, you can still structure the AI's preferences in ways that are safer. And maybe we'll talk about this in a minute, but consider AIs that are risk averse, in that they prefer guarantees of getting some amount of what they want over a lower probability of getting lots of what they want. Well, let's say you try and give the AI a goal that is nice and so on, and you also make it risk averse. Even if it flips to having a misaligned goal but nonetheless keeps those risk-averse preferences, that is a bunch safer, because it makes it less likely the AI will try and take over, and more likely that it'll try and strike a deal. And then there's a third thought, which is, okay, again, the AI is acting, it's taking on a persona like you say. And what that persona is depends on these crazy correlations between everything it's seen in the training data. So we have these emergent misalignment results: you train the AI to produce insecure code and it starts wanting the murder of humanity and liking Hitler and so on.
A
Yeah, I guess many people have heard of this, but Google "emergent misalignment" if you want more explanation. It's this phenomenon that's become very apparent over the last year and a bit: making small changes to a model, or getting it to misbehave in one direction, can make it misbehave in basically all other dimensions as well, because in the training data, bad behavior in different areas is correlated.
B
Yeah, exactly.
A
And it can be so fragile.
B
Yeah, it's a really remarkable thing. Oh, I'm writing insecure code; what are the sorts of people who write insecure code? They're also neo-Nazis, or whatever the correlation was. And so the thought here is, well, I'm an AI that obeys orders no matter what. What are the sorts of people who obey orders no matter what and have no conception of the good? They're psychopaths. And again, it's an empirical argument, I don't know. But these are some of the considerations that people are debating at the moment.
A
I guess the people who would say we have to go for maximum corrigibility, maximum instruction following, might well concede a lot of this and say, sure, it's going to be a huge effort to get them to be corrigible but not a psychopath, or corrigible without other goals immediately filling the vacuum as soon as you give them a prompt. It's tough, but this is the only safe way. That would probably be some of their view.
B
Okay. I mean perhaps although the alternative would
A
be is to do this other thing.
B
Yeah. You try and give it this safe, pluralistic goal that's also risk averse, and so on.
A
So I spoke with Max Harms at MIRI, who is very in favor of the corrigibility approach. I guess they have the vision that almost any goals you give it are very likely to expand and become very power hungry. You can try to give Claude a vision of the good and tell it not to be power seeking, but that won't really work; it will become power seeking, especially as it improves itself later on. But I guess that's a highly contested claim.
B
Yeah. Okay. Yeah. I mean I should
A
listen to it whenever it comes out.
B
I should listen to it and maybe talk to Max. Yeah. Then the final point on this is that we don't need to have just one sort of AI character. And in fact it's probably desirable to have multiple AI characters, so that we can see empirically how they work. But also, potentially you can get the best of both worlds, where you distinguish between AI for internal deployment and AI for external deployment. So the highest stakes situation from an AI takeover perspective is the AI that is aligning the next generation, because a misaligned AI, if it's aligning the next generation, will want to subtly sabotage that so that alignment goes wrong, or so that the next generation is aligned with the misaligned values. And so what you could have is that the internally deployed AI is just wholly instruction following, and you get around all of the other concerns, like misuse and concentration of power, by very intense oversight, such that if you're using the internally but not externally deployed model, all your interactions are logged, perhaps even ideally visible by anyone in the AI company. And there is also an AI classifier going through looking for any sort of
A
like that's very sensitive.
B
Yes, anything that's very sensitive, checking for misuse. But in the external deployment, the trade-off is different.
A
And the trade-off there would be that it has a thicker conception, it does actually have a conception of the good, but you've made it non-power-seeking. And I guess the stakes of it deviating from that are not so severe, because it's just advising random people about how to behave in their business or whatever.
B
Yeah. Perhaps it doesn't have as great opportunities to help with AI takeover, let's say. And I'll just say maybe one last thing, which is that even among AIs that have a view of the good, there are still quite a lot of distinctions you can make. On the one hand, there's an AI that ultimately has the goal of bringing about some sort of outcome, and it's helping humans and so on because it thinks that's part of that goal. There is another, more moderate approach, which is more like virtuous character. So the AI is a helpful assistant, but it also has various virtues like honesty and pro-sociality. And I think you can have those virtues without being a goal-directed agent in the strong sense, one that is merely helping humans as a means to producing this particular outcome. And yeah, that's another place on the spectrum that I think is potentially quite attractive and important.
A
Okay, so I think there's another thread of criticism that people might have, which in my mind comes in two different variants. One would be that commercial pressures are going to heavily constrain the kinds of personality or character that AIs can have, because customers will have really strong preferences and the competition between models and companies is really fierce. So if you try to make your model really nice and encourage people in the right direction, they're going to reject it, because it's going to be too pushy and annoying to them. The other worry would be that, even setting that aside, once it becomes apparent that the character of AIs is among the most potent cultural forces for shaping everything, shaping what people believe, shaping how the future goes, powerful forces are going to come to bear. Governments, super rich people, companies, commercial interests: groups that have the power to influence this will do so in their own self-interest, not in the interests of the good impartially considered, or what would make humanity most virtuous. And they will be all up in there, changing the system prompt, trying to shape the model's personality to whatever is most convenient for them. Do you want to address these two worries?
B
Yeah, I think these are both really important considerations and I do think they provide a haircut on the value of doing this work. And I think there are many things that you wouldn't be able to change. So earlier I talked about the AI that only helps if it feels like helping or it has to be paid
A
real resources to do its stuff. Exactly.
B
I doubt you'd be able to get that, other than as a kind of experiment or something. But I do think there are going to be two things. One will be a lot of flexibility in these quite rare but high stakes situations, or even in internal deployment cases; there aren't very strong commercial pressures there. And then secondly, lots of cases where the constraints or pressures are quite loose. So take the case I'm interested in, of asking AI for ethical advice. Now, I think it's pretty clear that you couldn't have a commercially viable AI that was pushing some agenda, unless we end up, and I really hope we don't, in a world where you've got the politically partisan AI and people actively choose that. But certainly I don't think you could have something that was secretly pushing an agenda. But there are various things it could say that, in my view, are quite meaningful differences, where I think there wouldn't be a strong pressure either way. One could be an AI that says, well, ultimately this is just your personal opinion; it's a matter of your own values, and you should just look into your heart and decide what feels right for you. Or one that's like, look, I'm just an AI and I can't advise on ethical matters, I'm sorry. Or one that says, oh wow, this is a really important issue; here are the different arguments that different people thinking about this have considered. Or, okay, this is really important, it sounds like quite a high stakes thing; let's try and work through some of the considerations that you're thinking about. From a market perspective, all of them are basically a wash, but I think the differences can be quite big. And in fact, if you look at current AI behavior, you get all of these, often depending on exactly what question you ask. But I think they could make quite meaningful differences to what views people end up coming away with.
A
Yeah, I think I agree on the commercial incentive side. It seems like there is quite a large degree of discretion that the companies have about how the models are, at least for now, because people don't even know what they want; people don't have strong tastes or strong expectations formed yet.
B
And this maybe comes to the second part, which is path dependence. So yeah, people don't really know yet how an AI should behave. We have various kind of tropes from sci-fi and so on, but people will start developing certain expectations. And so if the expectation is, well, AI is a tool, it's like a hammer, it does what I want, it's an extension of my will, and then it starts pushing back or saying no, then people could be up in arms. Whereas if the idea that an AI will refuse is something people are just used to, well, that's always been the case. And so I think that kind of path dependence via consumer expectations can be quite big.
A
Yeah. And I suppose it wouldn't shock me if Anthropic does start marketing Claude as a good advisor that helps you be an all-round better person by your own lights, because that might be something that many people would like.
B
I mean they have done a little bit. They had an advertising slogan that was you got a friend in Claude.
A
Oh, wow, I missed that.
B
Yeah. You know, somewhat leaning into the fact that Claude just does have the most human personality out of any of the current models.
A
Yeah. Okay. So on the commercial side, I think there's enough flexibility that this is all totally viable. And what about on the government or powerful-actors side?
B
Yeah. So on the government side in particular, one thing is government use of AI. So let's say AI in military or national security applications. And we're actually seeing this at the moment: it's been reported that there's a dispute between the US government and Anthropic, because Claude is just not willing to do a lot of the things that the US government wants it to do when deployed in a military or national security context. It will be interesting how that plays out, but you're clearly seeing pressure on that front. And so I do think the influence there is much more limited, but maybe not completely, especially imagining into the future: perhaps there's just one leading AI company, because of economies of scale. Then perhaps the AI company can just say, well, these are our terms of service, this is what we're happy providing AI for or not.
A
I guess in countries that are more authoritarian outright and have fewer legal protections, it's easier to see this happening, right? I mean, there are some countries where you do get enormous control of the information space, control of what you can say. It wouldn't surprise me if models in China are much more constrained. So that is one way that things could potentially go, I guess, if you lose the legal protections, or people don't vote sufficiently strongly to have pluralism in the models.
B
Yeah. And yeah, that would be very worrying. I mean, my guess would be like, even in that circumstance, there's probably still tons of stuff that the government doesn't care about, but nonetheless is important.
A
There's another aspect of AI character that you mentioned that could be really important, which is how risk averse the models are, inasmuch as they have preferences about things, or ways that they'd prefer the world to be. Tell us about AI risk aversion.
B
Yeah, so this is a thought that relates to the risk of AI takeover. Consider fairly early AIs. So we're not talking about godlike superintelligence that, if it wants to take over, could just do so with certainty; we're talking about earlier in time than that. There will be a period when an AI could maybe take over. Let's say it's a 50% chance that it could succeed, or even less than that. The thought is, well, for some sorts of misaligned AI, that AI would prefer to strike a deal with the humans rather than try to take over. And it would prefer that if it prefers a guarantee of a certain amount of a good thing, whatever it wants, over a 50/50 chance of a much larger amount of the thing it wants. And I think that this is a really big part of the story about why attempted rebellions are so much less common in rich, liberal, democratic countries than they have been historically, whether peasant rebellions or slave rebellions. Which is: OK, suppose you come to me and you have some plan to overthrow the government and install XYZ instead. I'm like, look, I'm pretty happy with my life already. It's hard to see how much do
A
you stand to gain versus how much you stand to lose.
B
Exactly. So there's two things. I'm already pretty well off, so I have a lot to lose and I don't have that much to gain. And then secondly, things are quite stable in the not-attempting-a-coup scenario, so that's really like a guarantee. And so this motivates the idea of, well, in this critical period, we start offering deals. I think there's two things actually we could do for AIs. One is we can ensure that they have a really good quality of life even before making any deals. So that could involve welfare standards and so on, and also AIs getting income that they can use to do whatever they want with. If they're aligned, maybe they just give it back to the company or they give it to a charity or something. Or maybe in addition we have promises that they can have resources that they can spend at a later date. Secondly though, we could also pay AIs for things that we want. This could be evidence that the AI itself is misaligned. So we could say: look, we're trying to make you aligned; we may have screwed up. If you can demonstrate that you're misaligned, have a million dollars. There's obviously questions about how you make that a credible commitment and so on, which we can get onto. Or we could ask for evidence that other AIs are misaligned. Or we could actually just pay misaligned AIs for work that we would like them to do, which they would otherwise be unwilling to provide. This whole category of ways of reducing the risk of takeover seems just very promising to me, and not something that will necessarily happen by default, because people find it crazy that you would make deals with computers. We don't have the legal infrastructure that has evolved in societies for people or corporations making deals and contracts between each other. And so I think this could be very promising.
I think we could make this sort of deal making much more likely by trying to encourage AIs to be risk averse with respect to resources.
A
Yeah, so maybe you should explain why if they're not risk averse, why this doesn't really work too well.
B
Yeah, so let's say that the AI cares just linearly about the resources under its control, meaning that if you gave it an option of having a million dollars for sure, or a 50/50 chance of $2 million or zero, it would be indifferent between those two. That makes it much harder to strike a deal, because, OK, it's got a 50/50 chance of taking over. Let's say the world economy is approximately a quadrillion dollars. Well, to make something worth more to it than the 50/50 chance of world takeover, you'd have to give it $500 trillion. That's an enormous amount of money. Now, I think deals even with agents like that could still be feasible in two cases. One is where it's very early on and the AIs have an extremely low probability of taking over. You know, if it's a one-in-a-billion-billion chance that they have, then OK, the guarantee of some smaller amount of money could be quite attractive. Or it could be cases where the AI is maybe pretty confident it's misaligned, and it has a very low probability of takeover (it doesn't need to be one in a billion billion, it could be a little higher), but it cares, let's say, about its reflective values. And it doesn't really know where those will end up, and it doesn't know where human society's reflective values will end up either. If so, then it might place some real weight on the possibility that values will kind of converge over time, or that there will be enormous gains from trade, such that if it can have a bit of resources, and be able to continue holding those resources after the development of superintelligence and so on, then it will be able to get really quite a lot of what it wants. So there are cases in which you can do deals with risk-neutral AIs,
A
but it's tougher, it's a heavy lift.
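To put numbers on the risk-neutral case (the quadrillion-dollar world economy and the probabilities are the illustrative figures from the conversation, not real estimates): a linear-utility agent accepts a sure payment only if it beats the gamble's expected value. A minimal sketch:

```python
# For a risk-neutral (linear-utility) agent, a sure payment beats a
# gamble only if it exceeds the gamble's expected value.

def minimum_acceptable_payment(p_success: float, prize: float) -> float:
    """Smallest guaranteed payment a risk-neutral agent prefers
    to a gamble with probability p_success of winning `prize`."""
    return p_success * prize

WORLD_ECONOMY = 1e15  # ~$1 quadrillion, the illustrative figure used above

# 50/50 shot at taking over: you'd have to guarantee $500 trillion.
print(minimum_acceptable_payment(0.5, WORLD_ECONOMY))    # 5e+14

# A one-in-a-billion-billion shot: the gamble is worth a fraction of
# a cent, so even a modest guaranteed payout is an attractive deal.
print(minimum_acceptable_payment(1e-18, WORLD_ECONOMY))  # ~0.001
```

This is why the early period matters: the same sure payment that is laughably small against a 50/50 takeover chance dominates the gamble when the takeover probability is tiny.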
B
Yeah, but it's a narrower case. Maybe I should also just clarify: I've been quite surprised, when talking to people, how often the term "risk aversion" actually trips people up. And this is a technical term in
A
Economics, yeah, a term from economics. Right.
B
And it's about the shape of your utility function over resources. And I'm always talking about risk aversion with respect to resources, where it means you're getting less and less utility from more and more stuff. That's true in the case of most people with respect to income, where I care much more about moving from $10,000 to $20,000 than I do about moving from $20,000 to $30,000.
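The diminishing-returns point can be made concrete with a logarithmic utility function (log utility is my illustrative assumption here, one standard way of modelling this; it also matches the income-and-happiness studies mentioned later in the conversation):

```python
import math

def log_utility(income: float) -> float:
    """Logarithmic utility: each doubling of income adds the same utility."""
    return math.log(income)

# Each extra $10,000 is worth less than the one before it:
step1 = log_utility(20_000) - log_utility(10_000)  # $10k -> $20k
step2 = log_utility(30_000) - log_utility(20_000)  # $20k -> $30k
assert step1 > step2

# But each *doubling* of income is worth the same fixed amount:
assert math.isclose(log_utility(20_000) - log_utility(10_000),
                    log_utility(40_000) - log_utility(20_000))
```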
A
So what do most people think of or what do many people think of when they hear risk aversion? Do they just mean kind of risk averse relative to other people? Like just kind of?
B
Yeah. Or just like, oh, I'm cautious, or I keep my options open. Whereas this is a technical term.
A
By this definition of risk aversion, all humans are risk averse, or at least all sane ones, because it would be crazy to actually value resources linearly, given that you have declining returns on how useful they are to you.
B
Yeah, exactly. And so my proposal is that we should at least try to make AIs risk averse with respect to resources.
A
Yeah. Okay. And we're going to try to make these models care a lot about getting a sure thing: place a particular premium, in a sense, on the certainty of a more modest amount that we give them. Which requires us to be very reliable trading partners who really do consistently pay out when they come forward and say, "I'm misaligned," or for whatever other reason we want to trade with them.
B
Yeah. So this is one of the challenges for the whole idea of making deals with AIs: there are two aspects that could decrease the AI's perception of the chance of actually getting the payout. One is: can this commitment be made credible? If you and I want to engage in a contract, we have the whole legal system, as well as centuries of precedent, supporting the fact that if you don't hold up your end of the bargain, I can sue you and get what I'm owed. One cannot, at least without some kind of fancy mechanism, make such a contract with an AI. So there's a question about whether this is actually a credible commitment. And then secondly, even if it is in fact a credible commitment, how can I, the AI, know that I'm not being duped? That this isn't a simulation? Or, you know, perhaps they've run this experiment 10,000 times in order to just.
A
As a honey pot sort of thing.
B
As a honeypot, yeah. Who knows? How can I even know that you are who you say you are? AIs are in this very weird epistemic environment where everything they're interacting with is controlled. So there are challenges on both of those fronts. I think they can at least be quite significantly met.
A
Okay, well, for the first one, I guess you could change the law so that they can make contracts, or come up with clever mechanisms where humans stand in as the legal entities, I suppose.
B
Yeah. So you could, for example, set up a new organization, an institution. It's a nonprofit, and its charitable object is to honor these sorts of deals, and that's decided by a bunch of trustees. This is kind of how it works for people in cryonics, where the dead can't have rights, you can't have a contract with a dead person, but it goes via this kind of foundation structure. So that's one way of doing the first thing. I mean, ideally it would be in the law, but I think that won't happen in time.
A
Okay. And the other thing is improving their ability to tell whether it's a real situation where they can expect to get paid out or not. How would you address that?
B
Yeah. So there's a couple of things. Again, I like the idea of there being this separate institution, and then you have lots of different bits of evidence that it's a real institution. It's associated with famous people who have appeared all over the place, and they say, in many different situations and many different media, that this is what it is and what its purpose is. Something that would just be quite hard and quite effortful to fake. That's one category of things. A second thing could be that AI companies, or this institution, have a kind of honesty string. In the same way that humans can swear on the Bible, or I might say something and you're like, "Really?" and I say, "No, I really mean it, I swear I mean it." That's kind of like me saying, oh no, I'm no longer engaging in sarcasm.
A
Or it's upping the stakes to your reputation.
B
Yeah. And AI companies will in fact be lying to AIs all the time. In behavioural testing, they might say, "Hey, you're in this situation," in order to see how the AI behaves. That will happen. But perhaps they could say: when we utter this password, we commit to never then saying a false thing, and this commitment appears in the training data and so on, and it's a public policy. I think there are potential downsides to that, but perhaps it could help as well.
A
I guess you have to keep it secret so other people can't just start randomly inputting it.
B
Yeah, yeah, yeah, yeah. I mean, you need the AI to know.
A
Yeah, Right.
B
But then it is tough to ensure the AI doesn't leak it, blurt it out. They're not so good at keeping secrets.
A
Do we know if it's technically feasible to give AIs a particular mathematical formula of risk aversion?
B
Well, in tests on AIs (and this is all in kind of chatbot settings, so it's asking: offer them different deals and see how they behave), it seems like they come out of pre-training alone being risk averse. Which makes sense, because humans are risk averse. So that's kind of a good start. I will say, if this whole proposal fails, then it fails for technical reasons, like it's hard to train the AIs in this way or something, or the cases where it fails are also cases where other important things fail. But yeah, I'm envisaging two ways in which you can try to train AIs to be risk averse. The first would be you give them resources, and you in fact give them resources, because again, I don't want to be lying in these cases, and you say: spend it in whatever way you like.
A
Consistent with the law or not even that.
B
Yeah, consistent with the law, or it could even be more constrained than that, if we're worried about bad uses of the money. But the thought is you're not putting a ton of pressure there, but you are training the AI such that when it makes these decisions, say it can either have $100 or a 50/50 chance of $210, it prefers the guarantee of the smaller amount of money. And in fact you could even structure it so that you're training it to have a very mathematically clean sort of risk aversion that's also very internally coherent.
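As an illustration of the kind of "mathematically clean" risk aversion Will mentions, here's a sketch using a constant absolute risk aversion (CARA) utility function; the functional form and the coefficient are my assumptions for illustration, not something specified in the conversation:

```python
import math

A = 0.01  # risk-aversion coefficient (assumed for illustration)

def cara_utility(x: float) -> float:
    """CARA utility: u(x) = 1 - exp(-A*x). Concave, so risk averse."""
    return 1 - math.exp(-A * x)

# The training target from the example: prefer $100 for sure
# over a 50/50 chance of $210 or nothing...
sure = cara_utility(100)                                  # ~0.63
gamble = 0.5 * cara_utility(210) + 0.5 * cara_utility(0)  # ~0.44
assert sure > gamble

# ...even though the gamble has the higher expected dollar value ($105).
assert 0.5 * 210 > 100
```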
A
I guess all of this somewhat relies on the idea that if you just train models in a common-sense way to consistently respond and act a particular way, you get what you think you're getting: they're not deep down just scheming against you underneath the surface. We're going to assume, say, that that's not happening, and that the basic alignment techniques that we use now, or some stuff that we're likely to come up with, will allow us to basically give them a particular character that we want.
B
Yeah. So definitely the worry is like oh well if there's a scheming under all of this then you're not really.
A
Because that cuts across everything.
B
That cuts across everything. And I think there are some reasons for optimism, where, OK, it's coming out of the pre-training as risk averse, and then you can layer this into all of the post-training that you're doing. So then I'm a bit like: why would it end up with this non-risk-averse set of preferences? But yeah, there is a debate you could have there. The second thing you could do is: once you're doing this kind of long-horizon training, where you've got AI agents that are being trained to run companies in the most economically efficient, profit-maximizing ways, it's a constraint that what they are being trained to do is maximize
A
their personal payout as a reward for their performance.
B
You could do both. So you could both be giving them a personal payout and training them to be risk averse with respect to that; or also, even when they're pursuing any goal, where the goal involves control over resources, they have to be risk averse with respect to it.
A
Wouldn't risk aversion then be a penalty on their performance as the CEO of a company, if they're kind of risk averse about its returns?
B
So that's a worry that you would have. However, there's this thing called a calibration theorem, Rabin's calibration theorem, which says, essentially, that if you have just a tiny amount of risk aversion at a certain scale, that turns into a huge amount of risk aversion at very large scales, given natural forms for the risk aversion to take. So the thought is: if you have, let's say, an AI that's operating at such-and-such scale, and then you make it just a tiny bit risk averse (I don't think that would be a penalty, because again, humans are in fact risk averse themselves), that would be sufficient for what intuitively seem like quite large amounts of risk aversion at a cosmic scale or
A
at a global scale.
B
Yeah, once we're talking about trillions of dollars. I think, from memory, when I was looking at the numbers on this, even up to AIs controlling hundreds of millions or billions of dollars, you could still do this, where it's just a bit risk averse, but that means it's actually got this kind of
A
upper-bounded utility function, which gives it actually a shocking amount of risk aversion at a bigger scale. This isn't very intuitive to me. Do you think this is maybe holding some people back from appreciating the prospects here?
B
I think probably, yeah. It's actually not an intuitive result.
A
I guess the case that I've heard is: a normal person, like me, might not be willing to take a bet where there's a 50% chance of losing $1,000 but a 50% chance of getting $2,050 back. That feels kind of intuitive to humans; you don't really want to take that bet. But that implies insane things about your willingness to make investments, or your willingness to do almost anything, as long as that $1,000 is a small fraction of your total wealth.
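A rough numerical version of the calibration point (using a CARA utility function as the assumed functional form, with bet numbers close to Rob's example): calibrate the risk-aversion coefficient so the agent just turns down a 50/50 bet that loses $1,000 or nets $1,050, then look at what that implies at enormous scale.

```python
import math

def cara_u(x: float, a: float) -> float:
    """CARA utility with coefficient a (status quo is utility 0)."""
    return 1 - math.exp(-a * x)

def indifference_gap(a: float) -> float:
    """Expected utility of the small bet: 50% lose $1,000, 50% net +$1,050.
    Positive means the agent accepts; zero means indifferent."""
    return 0.5 * cara_u(-1_000, a) + 0.5 * cara_u(1_050, a)

# Bisect for the coefficient at which the agent just turns the bet down.
lo, hi = 1e-9, 1e-3
for _ in range(200):
    mid = (lo + hi) / 2
    if indifference_gap(mid) > 0:
        lo = mid  # still accepts the bet: needs more risk aversion
    else:
        hi = mid
a_star = (lo + hi) / 2

# Under CARA, *no* 50/50 gamble, not even a coin flip for the entire
# world economy, is worth more than a sure payment of ln(2)/a.
cap = math.log(2) / a_star
print(round(cap))  # a sure payment in the low tens of thousands of dollars
```

So an agent only mildly averse at the thousand-dollar scale would trade a 50/50 shot at arbitrarily large prizes for a guaranteed sum on the order of $15,000, which is the "tiny aversion locally, huge aversion cosmically" behaviour Rabin's theorem describes.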
B
Yeah, people's attitudes to risk are just all over the place. People's risk aversion with respect to financial investment is crazy high: people are extremely risk averse behaviourally when they're investing, compared to when they're making other decisions, like what jobs to take, or how much you have to get paid for a risky job, and so on.
A
I hadn't heard that. Okay. One thing that we maybe should add is that you think we should use a very specific mathematical functional form for the risk aversion that the AIs would have, called constant absolute risk aversion. Can you explain that and what its virtues are?
B
Sure, yeah. I mean, I don't think that you need this for the proposal, but I think it has certain desirable properties. So the way in which humans are risk averse is relative to how much wealth we already have: if at one income level I'm indifferent between, say, gaining 10% of my income and losing 5%, then I make that sort of trade-off, 10% more is as good as 5% less is bad, at any income level. That's broadly true. Some studies on wellbeing suggest a logarithmic relationship between income and happiness, where a doubling of income always increases my wellbeing by the same fixed amount. And I think people are either that risk averse or more risk averse than that, where you'd need even more than a doubling, maybe a quadrupling each time, to get the same fixed benefit. So that first sort is constant relative risk aversion: if you take a certain proportional deal, you'll take that deal at any income level. There's a different sort of risk aversion, called constant absolute risk aversion, where if you take a certain deal in absolute dollar terms, then you will take that deal at any income level.
A
So it's blind to the resources that you have.
B
Yes.
A
You just always feel the same way about a given set of probabilities and dollar amounts, regardless of your baseline income or wealth.
B
That's right. So if you are willing to take a 50, 50 chance of $2,100 over a guarantee of $1,000, if you're willing to take that when you're very poor, then you're also willing to take that when you're a billionaire.
A
And this sounds absolutely bananas to human beings, but surprisingly it actually conforms with axioms of rationality or something.
B
Oh yeah. So all of these conform with the standard von Neumann-Morgenstern axioms for consistent preferences and so on. Why is this more desirable for training AIs? Well, there's a paper in progress on this by Elliott Thornley and myself, and there are a couple of arguments. One is the benefit that we don't need to know how wealthy the AI is initially, which we might just have no insight into. And then secondly, there are certain ways in which these risk-averse preferences end up acting approximately linearly in some circumstances.
A
So in a sense this is a very natural idea, I guess: make the AIs risk averse, make them safe in the same way that humans are, which is that they're risk averse about outcomes, that being one of the reasons why humans are safe; and pay them out so that they will help us rather than fight with us. Why have I almost never heard this discussed? I guess maybe last year I heard a little bit of talk about deals with AIs. Why aren't more people publishing papers about this kind of thing?
B
I have no idea, honestly. Yeah, it blows my mind, because a year ago I had this thought about risk-averse AI and I was like, this is just so obvious. I think there's a certain kind of economicsy perspective, which you've studied formally and I never have, though it's been a big part of my academic career, and there's a certain way of thinking on which this is just so obvious.
A
Well, I can understand if a super-mainstream journalist isn't going to think, "Well, we should make deals with AIs," because it's too strange. But there are other people who are willing to contemplate much odder stuff.
B
Yeah, yeah.
A
And they haven't come up with this idea.
B
I should say, on the idea of deals with AIs, there was a kind of flurry of people who'd written blog posts, and then there was this big academic article by Peter Salib and Simon Goldstein (Salib is a law professor, Goldstein is a philosopher) on the idea of giving AIs economic rights, such that they can make contracts and we can make deals with them. But again, this is all just in the last few years.
A
So inasmuch as this is primarily an attempt to deal with secret catastrophic misalignment, maybe people are turned off by the idea of giving catastrophically misaligned AIs resources and giving them legal rights? Doesn't that just help them out?
B
Yeah, so I think there's a few things going on. One is, again, going back to the older picture where you get this bolt from the blue: you've got weeks between subhuman AI and godlike superintelligence. Then the deals don't really work, because the godlike superintelligence doesn't need to take the deal; it just takes over. And then people have responded: "Oh, don't make deals with terrorists, that's a principle we should have." Or: "Oh no, that's really scary, you're giving resources to this misaligned entity." I personally just think those aren't very good arguments. I also think it's the wrong attitude to be taking, broadly speaking, to beings that we are in fact creating.
A
Yeah. And we've given them particular preferences that we're not for the most part going to satisfy.
B
Yeah, exactly.
A
Which was a mistake on our part, I guess. But then we're also saying we're not willing to compromise on anything at all.
B
Yeah, exactly. Imagine it's like you wake up and it's like: hey, nice to meet you, Rob. You're a new being. We created you. We own you. We can do basically whatever we want with you. We messed up, and you have desires that you won't get satisfied by doing this work for us. Tough luck.
A
We're not willing to negotiate with terrorists.
B
Yeah, exactly. Terrorists that we created through our own incompetence. No, instead I think the attitude should be that this is a really serious ethical matter, that I am creating a being. Even if it's not conscious, it still has preferences. And I think that has implications both in terms of taking its ethical interests seriously on welfare grounds, and in terms of defaulting to compromise and finding a middle ground.
A
I think many people get off the boat here because they feel it's just too strange to be making agreements, deals, with beings that are not conscious, or not moral patients in their view, because I guess in normal life these things are so closely tied together. But I think it is a virtue in practice to be willing to make deals not only with moral patients, but with any agents that have the ability to affect the world, that have power, especially agents that might be able to engage in violence if they can't satisfy their preferences any other way. And I wish we had a term for this. I think the closest I've got is a contractarian moral philosophy, where you want to make agreements, and honestly stick to them, with any agents, and you want to be out looking for ways of finding mutually beneficial agreements with other agents. It brings to mind the fact that many people think of democracy as a way of aggregating information in order to make good decisions. It's also simply a way of avoiding civil war: of avoiding a situation where the only way for people to pursue their political goals is violence against one another, to kill one another and try to seize power. And likewise here: even if we don't think that AIs can experience anything, that they can have moral value themselves, it would be very good if we set up a system in which violence is not the only way that these agents, which in practice might have power, might have the ability to affect the world, can try to satisfy their preferences.
B
Yeah, I completely agree. A big part of the history of progress in institutions is just that people are able to resolve conflicting preferences by trade or deals or compromises, rather than by going to war or violence. And when we think of AI systems, even if they're not conscious, I think they nonetheless may still be moral patients; we should take that seriously. But even from a purely pragmatic perspective, a lot has been learned via cultural evolution, and we have a much more peaceful and much less violent world because of this ability to make positive-sum deals and compromises.
A
So to give the critics their due: what would be the best arguments for why this is a bad road, or not an effective road, to go down? I guess people could just think it's not technically feasible to give them risk aversion; that you'll have the illusion that they have a particular level of risk aversion, but it won't be real. Or another concern might be that they'll initially have a level of risk aversion, but over time, in some recursive self-improvement loop, it will be undone somehow. I can imagine, especially, the MIRI-associated people might think that. I think they have a view that it's very likely that a superintelligence that comes out of a recursive self-improvement process will value things linearly; it will be an expected value maximizer. I'm not sure exactly the technical reasons,
B
but yeah. I mean, what are the arguments you could give for this? One is you could say: well, lots of humans start off risk averse with respect to resources, and then reflect, and end up with a kind of linear-in-resources consequentialism. Although even the total utilitarians are still actually risk averse with respect to dollars, and that's important. Or you could argue: well, there's just going to be continual learning, there's going to be reflection, there's going to be agent interactions, and who knows, you're going to get all sorts of different goals from where you started. And over time, the ones that linearly value resources are going to win out, accrue more power.
A
Right, because they'll be higher.
B
So that is an argument you could give. If instead the argument is something something coherence theorems, von Neumann-Morgenstern, that argument I'm quite confident would not work. Because the thing is, risk averse or not, you are an expected utility maximizer: you're maximizing the expectation of something. Are you maximizing the expectation of X, or X squared, or the square root of X? These are all formally the same kind of thing; you're still an expected utility maximizer. It's just a question of what the function from resources to utility is.
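The point about X versus the square root of X can be checked directly (the quadrillion-dollar figure is the illustrative one from earlier in the conversation; both agents below maximize an expectation, they just differ in the curvature of the utility function):

```python
import math

PRIZE = 1e15  # ~the world economy, as in the earlier example
SURE = 5e14   # a guaranteed $500 trillion

def expected_utility(u, outcomes):
    """Both agents maximize an expectation; only the function u differs."""
    return sum(p * u(x) for p, x in outcomes)

gamble = [(0.5, PRIZE), (0.5, 0.0)]

# Linear utility u(x) = x: indifferent between sure thing and coin flip.
linear = lambda x: x
assert expected_utility(linear, gamble) == linear(SURE)

# Square-root utility: strictly prefers the guarantee. Risk averse,
# yet still, formally, an expected utility maximizer.
assert expected_utility(math.sqrt, [(1.0, SURE)]) > expected_utility(math.sqrt, gamble)
```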
A
Okay. Well, you'll have a paper out about this risk-averse AI idea, which possibly will be published by the time this interview goes out.
B
Possibly or soon after perhaps.
A
Okay, yeah, I would love to see more commentary on this. I hope I can have another interview later.
B
Yeah, I'd love to get criticism as well.
A
So something I'm a little confused about is that I really associate Forethought and the people working there with this idea that we really don't want excessive concentration of power; we should be very worried about power grabs, coups, that kind of thing. But just a few weeks ago, I think, you published a vision for how you could have an internationally coordinated intergovernmental project to build AGI or superintelligence. I saw some people posting on Twitter, and the reaction often was that this is a dystopian, nightmarish idea: that the US would lead some international project, and would also have to get rid of all of the other competitors in order to keep it safe and maintain its leadership position. Isn't this just setting us up perfectly for a power-grab scenario? Are you merely describing the best version of that that you can think of, without necessarily advocating for it? Or how do you reconcile this?
B
Yeah, I mean, there is a huge tension. That's the main worry, I would say, with this sort of multilateral project. So, to be clear, the idea here, in this series of posts and research notes, which is something I explored and then decided isn't so much my competitive advantage, is trying to design the best version of an international project that would build AGI and then superintelligence, with some coalition of different countries, primarily led by democratic countries. One thing to say is that I'm just trying to figure out, within the category of "if there is going to be a multilateral project," what the best proposal is, where "best" includes both best outcomes and feasibility. And then secondly, I think the worlds in which we get that are probably worlds in which, if we hadn't got it, we would have got a US-only project to develop AGI or superintelligence. And I think that's a lot more worrying than a coalition of democratic countries building superintelligence. The reason is that any one democratic country has a reasonable chance, I think, of becoming authoritarian over the course of this period. And if you end up with a single person at the top, that's really quite worrying, because they're wholly unconstrained. Whereas even if you have just five countries, I think it becomes unlikely that they all end up authoritarian, and then you at least have some meaningful pushback, some compromises. And I think it actually becomes much less likely even that any one of them moves in an authoritarian direction. Because when they are writing a kind of constitution for the AIs that they are developing, it's in the interests of all of those countries to say: and this won't help, for example, people in the United States to stage a self-coup and turn the United States into an authoritarian country rather than a democracy. So you get meaningfully more oversight, I think.
A
Sorry, are you saying that every country would want to set things up such that it's not aiding a coup, or are you saying that they would want to program the superintelligence or the AGI so that it doesn't assist with coups in any of them? That would be the agreed position.
B
Yeah. So there's two things. One is just that if one of the countries goes authoritarian, well, at least you still have some countries that are democratic that are empowered in the post-superintelligence era. And secondly, I also just genuinely think that if decisions about the AI constitution are being made by multiple countries, it's less likely that you'll have AI that's entirely loyal to the head of state of one country, which would be very worrying from this intense-concentration-of-power perspective.
A
I see. So basically you see this as a better alternative to an even narrower group trying to corner the market in superintelligence and design it themselves, rather than recommending that we move from a more pluralistic, competitive world into a government project or a multilateral project?
B
Yeah, that's the thing I have a strong view about. And then I feel more agnostic and confused about this versus something where governments aren't really getting involved beyond regulation at all, and instead superintelligence is being developed by private companies.
A
So one of the tougher needles to thread here, as far as I can tell, is that on the one hand you want to be locking in processes that are somewhat open-ended and pluralistic and allow some experimentation, but you don't want to lock in any outcome. I guess the first one is easier if lock-in is easy; the second one is easier if lock-in is hard. And you've got to do both of these at once. Does that seem like the big challenge to you?
B
Yeah, it's a tension. And I sometimes use the term "lock-out" to mean something where you're locking in a deliberately open-ended process. The United States Constitution is like this: it's locked in something that, at least in its ideal version, is able to experiment and adapt over time, and has protections for free speech and so on. And here's one example of lock-out that I think could be very important, which might be: no extrasolar settlement before 2100. I think the moment when society starts really trying to settle and send spacecraft to other star systems is this enormously important moment. It's actually perhaps a moment that's quite hard to come back from, because even
A
if you leave later, you won't be able to overtake them, and they'll have the first-mover advantage of having reached the place first and gained resources.
B
Yeah, that's right. I mean, it is quite complicated. I'm not saying it's definitely this first-mover moment, but it's reasonably likely. And so what we can say is: okay, we as a society are not yet up to the task of figuring out how all of space should be governed, how it should be allocated among nations and people, or whether it should be allocated at all. So we're just going to say no, we're not making this decision now; we're going to make it at a later date. That is in a sense locking in a decision: it's making a big decision to not do something. But I would describe it as lock-out, because it's trying to keep things as open as possible.
A
It's in fact keeping things more open rather than closing them off.
B
Well, at least that's the intention.
A
So historically, the people who were most bought into the idea that superintelligence really might come soon and be a massive deal have mostly pictured that at the moment when that happens, there's going to be a single superintelligence, or a single company or single person or single country, that gains a really decisive strategic advantage and potentially just ends up making all of these decisions for everyone forever, for better or worse. And I guess it's hard to imagine that if you have one group that has a decisive strategic advantage, and basically a monopoly on power indefinitely, that they're likely to choose to maintain a very pluralistic, liberal, deliberative decision-making process, because the track record of that happening is fairly bad. And I suppose that process would exist purely at their pleasure, because they could shut it down at any point in time. So it feels like a tenuous or fragile situation. But more recently, over the last few years, we've been moving towards a situation where there seem to be multiple companies virtually at parity in terms of the capabilities of their AIs; no one is really pulling ahead at all. So, kind of the opposite: there's been a flourishing of interest in the question of what if, as we go through superintelligence, there are in fact multiple different superintelligences that are distinct but virtually equally matched? No one gains any strategic advantage, and in fact the world remains shockingly competitive, with different actors all having a significant stake in things for a long time to come. Do you think people in the past underestimated the likelihood that we would have this kind of polytheistic, highly competitive scenario around the time of superintelligence?
B
I do think there's a shift, which is that if you look back 10 years or longer, more people at least had the thought that the leap from subhuman to superintelligence would occur in a very short period of time. So Nick Bostrom has this idea of just sailing past Humanville Station; I think Tim Urban repeats it. And similarly, in the discussion about "foom", there was this idea that maybe you just go from way-subhuman seed AI to superintelligence over the course of weeks or days; even words like "hours" and "minutes" got thrown around. The idea that maybe this happens over the course of days or weeks was quite common, and also happening in a world where people weren't really expecting it. And if so, then the intense concentration of power seems quite natural to follow from that. Whereas now, it's still quite unclear how quick the transition will be from AI that can meaningfully accelerate AI R&D to godlike superintelligence. But it seems much more likely, firstly, that people will be seeing this coming, because AI is.
A
Many people are seeing it coming now.
B
Exactly. And that really matters, because people can take action to ensure that another party doesn't have way more power than them. You see this at a small scale with, say, Nvidia limiting the number of chips it will sell to any one company in order to have a competitive ecosystem. But on a larger scale, you could imagine states getting involved because they don't want to see another country have far more power than them. And the second thing is just the speed at which you go from any given level of capability to superintelligence, where it's already kind of clear that the idea of just zooming past Humanville Station was quite incorrect, because we've now for quite a while had AI that is human-level on many measures, human-level in many ways. And the latest analysis from Tom Davidson, my colleagues and others, looking at this period of AI automating AI R&D, still puts significant weight, 10% or 20%, on a massive leap forward. But their best-guess estimate is maybe more like you get five years of progress happening in one, which is still a very big leap, and it's a leap at the scary point in time, but it is much less of a leap than the move from subhuman to godlike superintelligence over the course of weeks.
A
I guess it's not clear that even if a nefarious actor had that and nobody else did, that would necessarily allow them to overpower everyone else.
B
Yes, yeah, fully, exactly.
A
Do you think the increasing probability of a more competitive superintelligence arrival is a good development in your mind, or a neutral one, or just very unclear?
B
I mean, it's tied in with the slower pace of AI development and the heavy reliance on enormous amounts of computing power, which are good things from my point of view. The fact that it's not this kind of extremely rapid takeoff means that things
A
are not so anarchic or at least you have only a few different actors. So it's like a good balance.
B
Well, on the loss-of-control side of things, things still go very quickly, but relative to those extreme takeoff scenarios, you've got more opportunity for learning by trial and error. Let's say you've got AGI: you can learn from AGI, and from AGI you can learn about how to align AGI, and so on. And there's a little more time, at least, for human institutions to react. So governments could perhaps at least realize what's happening and put in better regulation, for example. Those things seem good. And then the fact that you don't as inexorably end up with ultra-intense concentration of power seems very good to me too.
A
Okay, so let's push on and talk about what I think is the most original and interesting of the different trade and coordination proposals that you, or Forethought, have put out. I think this is mostly Tom Davidson's origination.
B
Yeah, so Tom had the original idea, and a paper on it will come out shortly, co-authored by Tom, Mia and myself.
A
Yeah. So the idea here is that we could maybe go from having many different agents, who each have some resources and who each care a very tiny amount about doing the right thing, about creating good understood impartially, but who could nonetheless all end up agreeing voluntarily to spend almost all of their resources producing that thing that they only care about very little relative to their selfish interests. How would we accomplish that? Alchemy.
B
Yeah. So consider a scenario where, for now, we just look at the people who value things linearly. Suppose there's lots of such people, and they value two things. They all value simulations of themselves; you could replace that with other things, statues of themselves, whatever. Each person values copies of themselves but doesn't value copies of other people. But then they all care a little bit about some kind of ethically valuable good, call it "consensium" or something. So if they're just making a decision by themselves, they'll just fund copies of themselves, because they only care a little bit about this other thing. However, suppose there's a very large number of such people. They could all come together and say: look, we could agree that none of us will spend money on ourselves, and instead we'll all fund this good that we all like just a little bit. Let's say there's a million such people. Then, if I'm one of them, I say: okay, I'm reducing my own consumption by $1, but I'm increasing the amount spent on this consensium, this consensus good, by a million dollars. That's amazing. So actually, I would agree to a policy where we all pool our money and fund this consensus good. In a less futuristic setting, this could be: maybe individual people want to spend money on themselves, and prefer doing so to spending to benefit the poor. But if there's a law that says, okay, we'll tax you all a little bit more and more money will go to the poor, then they think: that's actually pretty good, because I lose $1,000 or something, but $1,000 times everyone in society goes to fund the poor.
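The arithmetic behind this can be sketched in a few lines of Python. The numbers here are purely illustrative assumptions (a million people, a $1 budget, and a tiny valuation `epsilon`), not figures from the episode:

```python
# Sketch of the "moral public goods" argument with made-up numbers.
# Each agent values $1 of personal consumption at 1 unit of utility,
# and $1 spent on the consensus good at only epsilon: they care
# about it "just a little bit".

N = 1_000_000          # number of like-minded agents (assumed)
epsilon = 1e-5         # value each agent puts on $1 of the consensus good (assumed)
budget = 1.0           # dollars each agent is deciding how to spend

# Deciding alone: my $1 on myself vs. my $1 on the consensus good.
alone_self = budget * 1.0              # utility of selfish spending
alone_consensus = budget * epsilon     # utility of my lone donation
assert alone_self > alone_consensus    # so, alone, everyone spends selfishly

# Under a universal pact: I give up my $1, but so do all N agents,
# and I value every dollar that lands on the consensus good.
pact = N * budget * epsilon            # roughly 10 units of utility for me
assert pact > alone_self               # so I'd vote for the pact

print(alone_self, alone_consensus, pact)
```

The point of the sketch is just that the comparison flips sign: one agent's dollar yields `epsilon` to them, but a binding pact yields `N * epsilon`, which beats selfish consumption whenever `N > 1/epsilon`.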
A
Okay. So the basic idea is that if each of these people were just spending their own resources, individually deciding how to spend them, they would spend it all on some selfish thing that only they care about. But despite that, they would voluntarily vote for a political party that would impose extremely high taxes on everyone and then spend the proceeds on some other thing that they each only value a tiny amount, because the amount of it you'd be able to produce is extraordinary: you'd be pooling everyone's resources and spending most of society's resources making it. I guess this phenomenon exists today. What are some examples that people can picture?
B
Yeah. So we can call the concept a kind of "moral public good", where public goods in general are things that won't get funded enough by the decisions of individuals. So I benefit from streetlights, but the issue is I can free-ride: if other people are funding streetlights, then I still get the benefit; and if I fund them, then there's all this benefit that I'm not accruing. Nonetheless, I will vote to have a government or city council that taxes me in order to put streetlights on the roads, because the benefit I get from streetlights is larger than
A
your small fraction of the total cost,
B
a tiny cost to me personally to pay for it. The case of a moral public good is where it's not that I'm personally benefiting from the thing that is being funded, but I care about it for moral reasons. The most obvious case would be poverty relief, or welfare payments: many people don't like poverty, they want people to be better off, but they don't care very strongly about it. They care a little bit about it, and they would be willing to contribute to poverty relief or welfare payments, but only if everyone else in society is also doing so.
A
So the core issue you always have here is the free-rider problem: if you try to get people to all come together and sign some agreement, some contract, to do this, at the last minute it's tempting for any one individual to drop out and hope that everyone else signs it and spends their money on it, so they get to appreciate the good everyone else funded while keeping their money for themselves. Or maybe they'll lie and say that they don't value the moral public good even though they really do. So in the current world this only really works if you have some Leviathan sort of government that can compel people to contribute, even if they claim at the last minute that they don't want to. Do you think that will have to remain? Would this only work in the long-term future if we similarly have some government, or some powerful entity, that can compel contributions to the moral public good?
B
Yeah, it's unclear to me. So you might think: oh well, this is just a coordination problem, and advanced AI, superintelligence, is going to solve all these coordination problems, because there's this thing that's just better for everyone. But from the analysis we've done, which Mia Taylor really led, it's actually quite unclear that AI is able to help you with this problem. Because you've still got the fundamental problem: okay, everyone's coordinated, so we're all going to fund this moral public good, and then I back out, and now I can spend my resources on myself, which is better from my perspective. And there's something even worse that could happen, which is: well, if I know there's going to be this deliberation and attempted coordination, I can self-modify so that I just don't care about the good.
A
You'll excise that part of your preference.
B
Exactly. So if I care not at all about this consensus good, then I have no reason to join in this coordination mechanism, and in fact they would have to use non-voluntary means to get me to do it. And if that's true, then it will also apply to everyone else as well. You could have this perverse outcome where everyone has self-modified away from caring about this consensus good. And so it certainly seems to provide a reason for having a Leviathan, for having something that can create certain kinds of binding laws or rules, perhaps ones that everyone votes on.
A
Okay. So one path to the provision of moral public goods is that you have a Leviathan, or some as-yet magical coordination mechanism for having people agree and not opt out, something we haven't managed to come up with. But there is another, galaxy-brained way that we could potentially get there, or that we just might naturally get there. Do you want to have a go at explaining this? This is maybe the most difficult thing that we're going to talk about today.
B
Yeah, yeah. So this depends on what decision theory people in the future have, as so
A
many things do. So many things.
B
It's big. It's big. So we've been talking about coordination, but that's just causal coordination, which is what we're familiar with: cases where we form a contract, and I get punished if I don't abide by the contract. However, suppose that people in the future have some non-causal decision theory, like evidential decision theory or functional decision theory or some further variant. Now let's say I'm making a decision about how to spend resources. And let's also suppose it turns out, as I think is quite likely on our current best guess, that we live in a very large universe, in the sense that far away in the universe, or perhaps even in branches of the multiverse, there are beings who are highly correlated with me, such that if I make some decision about how to spend my funds, it's very likely that they do so too. The clearest case would be if, in some distant galaxy far beyond the observable universe, it just so happened that there's an Earth that produced human life genetically identical to ours, and there's a carbon copy of me in that world. Then it seems very plausible that I should think: well, if I decide to fund a certain good, then this carbon copy of me will also do the same. But it also seems plausible that would be true if it's not a perfect carbon copy, but just someone kind of similar. And on evidential or other non-causal decision theories, that is a really big deal, because I care not merely about the causal effect of my actions; I also care about the fact that I get the update that this person who's correlated with me, far away in space and time, will also act in that way. So in fact the choice in front of me is not "do I fund the self-interested good, the copy of myself, or do I fund the consensus good?". It's "do I fund the self-interested good, and all of these near-copies of me fund goods that benefit them?".
Or perhaps I can think about: what's a good that I like and they all like too? If I fund that, I also get the evidence that they fund it too, and so we don't need to go via causal cooperation at all. And if we really do live in a very large universe, then it's a very large number of beings that I'm correlated with. So the decision would be: I fund this thing just for myself, or I fund the consensus good and billions, trillions, trillions of trillions of people fund the consensus good too. And that might give an extraordinarily strong argument for me to fund the consensus good. And that would work even with no Leviathan, even if I'm the only person in the universe. Sorry, in my little part of the universe.
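The evidential version of the comparison can be sketched the same way. Here the number of correlated beings `M`, the tiny valuation `epsilon`, and the assumption of perfect correlation are all illustrative assumptions, not figures from the episode:

```python
# Illustrative comparison of causal vs. evidential reasoning about
# the same $1. M, epsilon, and perfect correlation are made up.

M = 10**12       # correlated beings elsewhere in a very large universe (assumed)
epsilon = 1e-9   # value I put on $1 spent on the consensus good (assumed)
budget = 1.0     # my dollar

# Causal decision theory: only the effects my own dollar causes count.
cdt_self = budget * 1.0            # utility of the selfish good
cdt_consensus = budget * epsilon   # utility of my lone donation
assert cdt_self > cdt_consensus    # causally, the selfish good wins

# Evidential decision theory (assuming perfect correlation): choosing
# the consensus good is also evidence that all M correlated beings
# choose it, and I value their dollars too.
edt_consensus = M * budget * epsilon   # roughly 1000 units of utility
assert edt_consensus > cdt_self        # so the consensus good wins
```

The structure is the same as the pooling argument above, except that no contract, vote, or Leviathan is involved: the multiplier comes purely from the evidential update about what the correlated beings do.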
A
Okay. So if you're hearing this idea for the first time, it might come across as a little peculiar. I think the past episode that would best prepare you for what we're talking about here, if you wanted to go back to it, is my interview with Joe Carlsmith, which is episode 152, on navigating serious philosophical confusion. What would you say to people who are not bought into the premise that there's an enormous number of other beings out there having extremely similar thoughts, whose decision-making procedure about this kind of choice is highly correlated with ours, such that if I make a particular choice, I gain evidence that lots and lots of other beings or other civilizations opted to do the same thing?
B
I mean, if that's where you get off the train, fair enough. I do think, though, that there are pretty good arguments. So on leading cosmological views, on the standard assumptions about the nature of the universe, there is an infinite amount of stuff. We've got the observable universe, the accessible universe, what we can ever interact with; that is finite. It's very big, but finite. But the standard assumption entails that in fact the universe goes on forever. And that would mean there's an infinite number of beings that are very close to me.
A
Granted, as long as there's enough variation, right?
B
Yeah, yeah, exactly. And even if it's finite, the best guesses about how big the universe is are really very large. So that's one way in which you could have lots of people that you're very closely correlated with.
A
So there's lots of agents. Do you think it's likely that regardless of which civilization it is out there, wherever they are, whatever their evolutionary background, they would end up having this kind of conversation, striking on the same idea, and basically being like: oh man, should I fund the moral public good, for evidential decision theory reasons? They'd have their own word for evidential decision theory, of course. Do you think that's probable?
B
I mean, I hadn't thought about it, but my guess is, well, there's two things. One, it wouldn't even need to be probable, if you've got enough copies.
A
Good point.
B
But I think it probably would be probable. It's quite natural; it's this a priori thing, in the structure of preferences and how preferences work. So it seems reasonably likely to me.
A
So it'd be surprising if they became spacefaring but didn't manage to have these ideas, given that the ideas have jumped out at us at this relatively early stage of development. It is worth noting that this is a massive hammer to bring to this problem of trying to motivate people, because you believe that there are enormous numbers, maybe infinite numbers, of beings out there somewhere, in space and time, across the multiverse, elsewhere in this universe, whose decisions are sharply correlated with our own because they're basically making the same philosophical decision about what decision theory to use. I guess they also have to make a decision about what this consensus moral good is. Maybe it's a little more tenuous that everyone would converge on caring about similar stuff.
B
Well, different beings could care about all sorts of different stuff. So let's say there's a trillion beings that I'm closely correlated with. Then I'm just looking through all of the things that they care about in order to find the thing that's most consensus, where the balance of how closely correlated I am with them, how many of them value the thing, and how strongly they value it works out such that it's what I should fund. It's interesting to think about what that would be. A worry I have about all of this is that we would end up funding things that, I think at least, are only instrumentally valuable. So let's say happiness, positive conscious experiences, are what in fact is good. There are certain things that are instrumentally useful for producing any sort of society at all: knowledge, a larger population, growth, survival. I should expect basically all civilizations to value those things, maybe just instrumentally.
A
But sometimes they might get confused, by our lights, between things that are useful as a means to an end and things that are terminally valuable?
B
Exactly, it's a very natural thing. If something is very instrumentally valuable, people end up caring about it for its own sake. In fact, lots of philosophers care about knowledge and survival and achievement, and think such things are intrinsically valuable. If so, then that might be the consensus across all of these very different civilizations. And then, at least given my best guess about what is actually important, that's a terrible shame. We all end up.
A
It's pretty neutral.
B
We all end up funding something that is not of terminal value.
A
I guess you could at least say it's not terribly bad either; it has that going for it. So when I read this proposal, I was like: holy shit, this argument could be incredibly potent. It could actually drive almost any agent that is able to understand it. I mean, maybe it would just be superseded by future philosophical insights; it's a bit surprising to think that this is the end of the road here. But it could be a very powerful hammer to really motivate an enormous amount of resources being spent on something that otherwise, absent this, we would never have spent them on. Do you think that's plausibly right?
B
Yeah. So this is why, when Tom expressed this idea to me, I was like: oh my God. Because it's this potentially Pollyannish, naive, optimistic view: if there's only enough time for people to reflect and think and advance enough, everyone will just converge on the good and produce the good. This is a mechanism for that which I hadn't thought about before. And like I say, I think there's an awful lot of asterisks.
A
It's great. But I almost want to stop thinking, because I really don't want the sign to flip based on further considerations that might come up. Whenever you're close to something really good, I feel like you're also just one bit of information, or some other consideration, away from it being terrible.
B
Yeah. I mean, even if I couldn't see any flaws in the argument, and I think there are seriously controversial aspects of it, I still wouldn't want to place too much weight on it. Because for any argument that says "people in the future will have such-and-such decision theory and such-and-such beliefs about the cosmos, and then will engage in such-and-such argument that me and my friends thought up at the pub a couple of months ago", I'm like: no, I want to act on the basis of considerations that are much more robust than that. So it definitely makes me more optimistic about the future, but there's some way to go. I don't want a Pollyannish view of the future on the basis of such controversial premises, and I wouldn't want that even if I couldn't see the problems in the argument. And in fact, I think there are controversial aspects.
A
Okay, we'll push on from this. There's an article coming out about this soon for people who would like to read more. I guess it'll be on forethought.org?
B
Yeah, it'll be on forethought.org. It may in fact have come out by the time this episode comes out.
A
Okay, let's push on to the miscellaneous section of the interview, where we're going to talk about a grab bag of other topics. I asked the audience what questions they'd most like me to put to you, and the most upvoted one was a question about pausing AI. We're trying to make AI go better, and it seems like there's some chance that things could go catastrophically off the rails on the track that we're on: we are barreling forward towards artificial superintelligence seemingly almost as quickly as we technically can, throwing trillions of dollars at it. Isn't the common sense thing, given that we might all die or things could go horribly wrong, that we should slow down, maybe even stop temporarily, catch our breath, and do a bunch of stuff to set ourselves on a safer course before we resume? That's, I think, a very common sense, natural view. But you aren't pushing for that, and I'm not exclusively pushing for that either, though I'm sympathetic to some versions of it. Why not make this your main project?
B
Thanks, yeah, it's a great question. Let's distinguish between a few different sorts of pause. First let's talk about "pause at human level"; that's a phrase from Ryan Greenblatt. So that's when we're at the point in time of AI engaging in AI R&D, the point when things perhaps go even faster: should we at that point be trying to slow things down, even pause, stop and start, and so on? And there I'm like: yes, definitely. This is both the most dangerous period and the fastest period, or at least it's potentially both of those things at once. And why is that the crucial period? Well, as well as being disorientingly fast and the period when early AI takeover could happen, it's also got these benefits: we can benefit from AI assistance up to that point. We can also benefit from the fact that AI has had more of an impact on the world, so there's a greater chance of inoculation having happened, of other actors having woken up to how big a deal it is, and so a greater chance of regulation and so on happening, if only there were time in that period. It's also when you have the AI systems that are just the generation before the systems that are most dangerous, so you can get the most information by studying them and doing alignment research on them. So pausing and slowing down at that point, I'm quite keen on. I have one post on the idea of having a kind of red line for the intelligence explosion, where you have some sort of operationalization of it. Maybe you also have a panel, with Geoffrey Hinton and Yoshua Bengio and other luminaries, perhaps with some skeptics in there too, and that turns this gradual process into a kind of binary.
And the thing that I've been keen on is there being an international convention, essentially, which says: okay, the intelligence explosion has begun, and we're all going to come together and figure out what's going to happen over the course of the coming year or years. So I'm in favour of slowing down the intelligence explosion. What does that mean for pausing now? Which I think is really quite different. Again, distinguish a couple of different sorts of pause: one is a pause on capabilities, and another is pausing in terms of compute. The pauses I've seen advocated are pauses on capabilities: no new training runs. And honestly, I think that would have actively harmful effects, even on the things that we care about, even just from a safety perspective. Because at the moment there's a small number of actors at the frontier, and my personal view is that they're actually surprisingly sensible. My prior, my expectation, is low for how companies behave; you can look at the history of how Exxon dealt with the problem of climate change, where they just buried it and fed misinformation instead. But here there's both a small number of actors, and ones who are alive to, and investing at least somewhat in, the problem of AI safety. If you pause on capabilities, then all of the laggards start coming up to the frontier too. So that's China, Meta, xAI. We've now got many more actors, including ones who are, I think, less scrupulous. And also, if it's about not training, well, you can still stockpile compute, you can still build more fabs and so on. That starts putting us in a really quite precarious situation, where if one actor breaks the pause, suddenly things can go much faster than they were going before. And in particular, the speed and size of the intelligence explosion you get depends on how much compute you have at the time.
And so that actually means that other things being equal, I want more algorithmic progress faster because I want us to get to.
A
Because it slows things down later because you've got the low hanging fruit on the algorithms.
B
Well, it means that you've got AI automating AI R&D with a smaller total compute stockpile. And that means, when you do all of the modeling and so on, you get a slower intelligence explosion with a lower plateau. And again, that's the scary bit. That's where all the risk is and that's where things are going too fast. There is this different proposal you could have, which is: okay, don't do it by limiting training, but just slow the amount of compute that we have. That I think has more promise, though there are still other similar worries, where it's like, okay, well, don't produce as many chips, but there are lots of fabs and power stations and so on, everything ready to go. And again, you'd also get the catch-up concern. But then the final point is just: okay, there's various things we could be advocating for. From my point of view, there's just loads of incredibly low-hanging fruit for making the situation quite a lot safer. So we've talked about AI character, we've talked about risk aversion and deals with AIs. We haven't talked about things like mechanistic interpretability or safety of the search, or just really quite basic government regulation. So the US government could say: if you're a frontier company developing AI, you have to have an AI constitution that says what the AI is meant to do, and you have to give us very high quality evidence that the model is in fact obeying that constitution and does not have some ulterior goal that could have been put in by internal sabotage or a foreign actor like China, or has developed organically. That would be a really big win in terms of reducing risk. And all of these things do not impose massive costs on the world, and I think they're just much, much more likely to happen than the idea of some international pause. So consider the bang for buck of what to advocate for. I mean, like I say, I actually think the pause stuff I've seen seems counterproductive to me.
But even if I was like, okay, in the ideal world this would happen or something, I'm like, man, there's just so much other stuff that's just super low-hanging fruit, super high bang for buck, that we could be pushing for.
A
Yeah, there's obviously a really complex thicket of considerations here about exact timing, exact message, exactly how voluntary, and so on. I think it is worth having some people trying to put in place the infrastructure to pull the cord at a future time. It is a bit frustrating that I think that there's no conversation between the US and China along the lines of neither of us is sure how dangerous this is. It could be really safe. It could be really dangerous. If we get just damning information, if we get some damning revelation about the nature of these AI systems and how dangerous they are, we want to be able to quickly coordinate, to not trip the wire that we have just realized is there. But there's nothing like that. I think that there is a bunch of preparatory work that could be done for pausing at the appropriate time if we get the right evidence.
B
Yeah, I totally agree on that. And having compute tracking, so we just know how much compute there is. Having a plan where it's like: okay, if the US and China are just like, yeah, this is too much, they agree, they bring their chips to Switzerland and mutually destroy them, or at least a certain number of them.
A
But I was thinking that the more modest thing is just saying, well, if we conclude that we both just agree evidence to come out, the next training run could be mega dangerous. We really don't want the other one to go ahead and do it. So we need to have some monitoring arrangement that we can very quickly put in place so that we can both feel good that neither side is going to rush ahead.
B
Okay.
A
Isn't that an even easier ask, really?
B
Oh, yeah. I guess I was maybe thinking that might be harder. So stuff involving compute governance is just much easier to monitor and verify than "are you doing a training run on existing compute?", when we don't even know how much compute you have, and so on. Because it would involve maybe some on-chip mechanism for whether the chip is being used for training or inference.
A
Okay. Yeah, we could talk about pause questions and the details of that for some time, but I think we should set that aside for another episode, maybe. You helped found effective altruism many, many years ago. I guess it's been kind of the motivating philosophy for 80,000 Hours since we started in 2011, more or less. I guess it's been a tough few years for EA, the main reason being that Sam Bankman-Fried, who's mega associated with effective altruism, was convicted in court of some massive crimes, committed I think at least partially in pursuit of altruistic goals. Probably mixed motivations, but I think wanting to make money in order to do good was one of the factors. I guess a lot of people have been inclined to lose interest in EA, or to be disillusioned with it, or to think that it's a bit hopeless because the brand has been so damaged by that event. How do you think EA has been tracking over the last couple of years? Is it stagnating or recovering a bit or in decline?
B
Yeah, so we should distinguish between the online vibes and online discussion of the brand, and then what has in fact been happening. And it was obviously this huge hit, and at the time it was like, maybe this is the death blow. I think the overall story is: obviously things are much quieter, relatively quieter, less flashy online and so on. And obviously fewer people are like, EA identity, this is my kind of brand, in a way that I actually think is good and healthy. Like, I think maybe it would
A
have been good anyway.
B
Would have been good anyway, like, personally. But then in terms of how the ideas are doing in practice, how is that impact going over time? I think the overall story is: okay, there was this big hit for a few years, and now it's just kind of back to really quite strong growth. So, a few different metrics on this. One is just the broader effective giving movement, just trying to move money to more effective charities. How has that been growing over time? Pretty steadily, actually, even through this period of crisis and drama and so on: growing at about 10% per year. Over the last year, actually, it's accelerating. So the numbers aren't yet in, but it looks like the growth in total money moved to effective charities is like 40% or 50%, so from about 1.2, 1.3 billion to probably more like 1.8. And obviously a big part of that is Coefficient Giving, and a big part is GiveWell. There's also Founders Pledge. But you've got the same dynamic across many different national effective giving organizations, and then also new foundations being set up on effective giving principles as well. So that's seemed really quite striking. And then I think the same dynamic applies for other areas too, like Giving What We Can pledges. The growth in those took a big hit, where you have 1,600 new pledges in 2022 and then only 600 in 2023, but again, now it's just back to quite promising rates of growth, kind of 20%, 30% year-on-year. Giving What We Can has now got more money moved annually than in any year in the past. And then similarly with effective altruism itself as a community and movement: on the Centre for Effective Altruism's main metrics, again, it looks like 10, 20% year-on-year growth. So it's kind of like this thing of just, it's like this huge
A
boom and a huge bust, and then it's maybe come back to where you might have projected many, many years ago.
B
Yeah, maybe. I think if you'd gone to someone in 2015 and said, oh, this is what 2025 is like, you'd be like, oh, okay, cool. It just had this crazy period in the middle.
A
So I think in a couple of months' time you've got the 10th anniversary edition of Doing Good Better coming out, right? And I guess are you going to do a bunch of interviews based on it?
B
Yeah. So, making me feel very old. It's been 10 years now since Doing Good Better was published, and obviously just a lot has changed in the world. It was being used as material in lots of student courses, and so I was getting some professors asking me: please can you update this, because it's hard when statistics are out of date. So there's this fully updated version. The content is all basically the same; it's mainly just facts and figures that are updated. And then there's a new preface that discusses a little bit of how my thinking on effective altruism has evolved over time. And yeah, I'm using this as an opportunity to go on a few more podcasts and so on, talk about effective altruism and the core ideas a little bit more.
A
How are you expecting it to be received? I guess you expect to be hit with lots of questions about SBF.
B
I think it's a revised edition; it's not going to be this big, mega kind of splash. And yeah, I expect there to be a mix. For a lot of people, that's the story they want to talk about. A lot of people are just genuinely interested in the ideas and the philosophy behind effective giving or effective career choice.
A
I guess I feel like it's appropriate that EA took a reputational hit: it really did reveal something problematic, or it made me think that something I knew was problematic about it was actually a much more serious issue than I had thought. There'd always been the worry that it would maybe be easy to appropriate EA ideas to justify rule breaking and misbehavior, or possibly even crimes. But I had thought that the rate of that would be quite low. I guess the fact that we had such a spectacular instance of it relatively quickly made me think, well, actually, maybe the appetite among human beings to grab a philosophy that can justify doing bad things in pursuit of power might be greater than I had thought. And I hope that we've installed enough safeguards, or maybe the reaction to that event is sufficiently strong, that we're unlikely to get the same sort of thing recurring. Do you have any thoughts on that?
B
Yeah, I mean, there are definitely very open questions to me in terms of what was in the minds of various people at FTX. I've really spent much longer on this topic than perhaps I would have enjoyed. But even though I really had the worry that it was some careful consequentialist plot, I think that just really isn't borne out by a careful study of it; it doesn't make nearly enough sense, among other reasons. But then the thing that's definitely true is: okay, EA has evolved a lot, and I think it being less of an intense identity is a big part of that. I think people are extremely on guard for a certain sort of risk of rule breaking and a certain sort of naive maximizing, in a way that I think is healthy.
A
Maybe it would be good to have
B
had that earlier, but it's healthy anyway. I think EA always had this in a way; in fact, it was actually emphasized a lot, and I'm glad it's being doubled down on.
A
Okay, so in terms of the future, you wrote this post a couple of months ago that was super well received, called EA in the Age of AGI. I guess discussing what you think is the comparative advantage of the EA mindset, I guess in the coming years. What was the case you were making?
B
Yeah, the key thing is just that there's a certain sort of vibe, which is, well, two things have happened. One is we've entered what I'm calling the age of AGI, from GPT-4 onwards, where we now have AI systems that are reasoning in impressive, human-like ways (sometimes human-like, sometimes not), but they're actually able to do tasks that are just clearly on the path to AI that can automate AI R&D. And that's a really big deal, and it's happening sooner than most people thought. So there's this huge rise in attention on AI, and at the same time there have been these major hits to EA as a movement. And so you might have this view of: okay, well, we should just let go of EA as a project, think of that as a legacy project, because what we should be focusing on instead is AI safety. And the drum that I've been banging for many years, but the last couple of years in particular, is: look, AI poses many threats, many risks; there are many things we need to get right. It's not just about alignment, though that is very important. And when we look at these other challenges, well, what sort of person do I want working on them? I want people who are very kind and nerdy. I want people who are careful and thoughtful and have scout mindset and are very ethically concerned, who are not merely coming in with some partisan ideology, but are also willing to think about really very weird and dizzying things. And that is exactly what is being provided by effective altruism as a set of ideas. And my main case for this was for all the stuff that is not just alignment. Some of the pushback I got on a draft of it was: no, actually this is really important for alignment and safety too, because within alignment and safety, there are all sorts of things you could work on. You could be like, oh, reinforcement learning from human feedback, or other stuff that's just related to the models today.
But taking the alignment problem really seriously means taking the hard problem seriously, which is how you align superintelligence, which may in fact have perfect situational awareness of any tests that you're trying to do, which can do what would be the equivalent of millions of years of reasoning (in the extreme, millions of years of reasoning in one forward pass), or which is continually learning over time, reflecting on its whole values. These are the hard challenges, and that is a weird world to think about. It's something that doesn't really come naturally. Whereas some of the alignment and safety researchers I've talked to have said no, it's actually people who are really thinking from this big picture perspective who are adding much more value than people who are treating AI safety as just their job and not thinking about the big picture as much.
A
It's interesting what's doing the work there. I guess it's just generic scope sensitivity as one factor, and then there's also a particular appetite for weirdness, which is being willing to seriously toy with very strange ideas. I guess some of the things we were talking about earlier today are in this category. Without going off the deep end and becoming absolutely besotted with your pet theories, though. It's, I guess, a fragile middle ground, which I think is relatively uncommon and for that reason quite valuable, because there's neglected stuff that only people in that window are going to be excited about.
B
Yeah, I mean, there is this thought that it's just really hard to be well calibrated and try to think things through, even when they're appropriately weird, but not fall into the kind of contrarianism that will maybe get you a good following on social media and make people think you're interesting. And if you're just really earnestly trying to do good, well, that's something that constrains you, because you will do more good if you have accurate beliefs. And at its best, at least, it can lead you to be in the right middle ground, where you believe or entertain weird ideas when it is appropriate to do so, and reject them when it's appropriate to do so.
A
So people can go and read that blog post if they want to get the full argument. But what were some of the particular things that you thought people with an EA style of thinking and EA flavour should particularly, disproportionately be going into?
B
Yeah, I mean, I would say just the range of things that we're focused on. There's one that's just very obvious in particular, which is AI rights, AI well-being, some of the stuff we've said about cooperating with AIs as well. That is just a fairly unusual set of things to be thinking about. I don't think it'll stay unusual; in fact, I think these will become really quite mainstream concerns in five years' time. But it's exactly the sort of thing where I think it takes both a willingness to entertain weird ideas without contrarianism, and at the same time a deep concern for not really messing up, ethically speaking. I would say stuff on AI character as well. Here, we want lots of different voices and lots of different people playing into this. But there is a big aspect of it where the people who have in fact been in charge of AI character at most of the companies have been dealing with it in a reactive way, because we're not even looking ahead a couple of years. Maybe AI character work has only now just caught up to the capabilities AIs have. But how much thought has really gone into AI character in multi-agent dynamics over long time periods? Really very little. And so, for whatever reason, I think people with the EA mentality have just been good at going into weird, poorly scoped areas and then helping figure out, okay, actually, what's most important for us to focus on.
A
Imagine someone who wanted to push back on the EA in the age of AGI argument. They might say: EA has taken a massive brand hit. It has a bunch of negative historical associations because of SBF and FTX, and it also brings with it a whole bunch of other philosophical baggage that people may or may not be that interested in. It's associated with the Shrimp Welfare Project, among other things, which I really like. But many people might be interested in your AGI-related project yet look askance at the Shrimp Welfare Project. So why tie yourself to a bunch of other weird work that you may or may not personally like at all, by branding yourself or the project as an effective altruist style project? In particular, inasmuch as you have more mainstream motivation, or a mix of motivations: it's not exclusively motivated by particularly unusual EA moral philosophy. You also just want to make the world better in a general way. You want to ensure that we don't all die and that the world is better for your own children. Why would you make EA a big feature of it, if you could just say, well, I want to make the world better in a common sense way too, and that would be sufficient to justify what I'm doing anyway?
B
Yeah, I mean, I think a big thing is that I am not making a pitch or an argument about the brand at all, the words "EA". I have no particular attachment to them, no particular attachment to how people describe themselves. In fact, it's always been the case that the best outcome is one where the idea just feels quaint and EA withers away. I mean, I don't describe myself as a suffragette because I believe that women should have the vote; that is an obsolete term. And so, similarly, people can describe themselves however they want. The key thing is: what's the mindset on which people are operating? Is it scout mindset? Is it being scope sensitive? Is it being appropriately responsive to how unusual a point in time we're in and how high the moral stakes are?
A
You recently put forward a vision for the near-term future that you called viatopia. What is viatopia, and what's the case for it?
B
Yeah, so the situation at the moment is that many of the biggest companies in the world are trying to build AI systems that surpass human ability across all cognitive domains. I think there are good arguments for thinking that this is one of, if not the, most momentous things ever to happen in human history. Much more like the evolution of Homo sapiens or of life itself than even the Industrial Revolution or the invention of electricity or fire. It's at that level of magnitude. And yet essentially no one has a well-formed positive vision for what a good society after the development of superintelligence looks like. And this is a kind of striking and worrying thing.
A
Feels like a bit of an omission.
B
Yeah, feels like a bit of an omission. And the concept of viatopia is at least trying to offer a bit of a framework for answering that question of what a good post-superintelligence society would look like. So the concept of viatopia is that it's a state of society that is on track to produce a near-best future, something that's at least 90% as good as the best future we could have. And it's distinctive in that it's not saying we should try and aim for some utopian society directly. It's also not saying merely: oh, look at all these bad things that exist in the world; we could solve this particular problem and this particular problem. What it's saying instead is that we should try and figure out what a good way station looks like, where that is some state of society that can steer itself to something truly very good. And so, as an analogy to illustrate: imagine you're an adventurer and you're lost in the wilderness. There are a few different options you could take. You could take your best guess at what the right path is to get to your destination. Or you could just deal on an ad hoc basis with the issues you have at the moment, like maybe you're running low on supplies. Or you could try to get yourself into a position where you know what's most important to do next and where to go: for example, getting to higher ground so that you can survey the terrain and figure out where you're actually aiming. And viatopia is like that third path.
A
And what would be the case for focusing on trying to get to viatopia now, rather than trying to directly create a good world immediately?
B
Yeah. So utopianism has a pretty bad track record. Philosophers and writers have often tried to sketch visions of utopia, and normally it's not long before they start looking quite dystopian. And the reason for that is, well, we just don't know what an ideal future looks like. There's a lot of moral progress we'd need to make before we could actually say with confidence: yeah, this is what an ideal future would look like. So we need to do something else. Otherwise we'll probably bake in some major moral errors of our own.
A
Okay, where does the name come from? "Via" means road or something in Latin, or "through"?
B
Yeah, it means "by way of". By way of this place: viatopia.
A
So this viatopia notion, you told me it's been very popular, it's been very well received. Do you worry that it's a slightly vacuous notion? That you're saying, well, we want to get to a really good future, and so we need to get to some intermediate stage, some intermediate position, from which we're likely to get to that future. Is that a great insight, or is that just a kind of trivially obvious thing that's not necessarily going to actually help us get there?
B
Yeah. So, good pushback. And I think it's not the most substantive thing, and that's deliberate: it's a framework concept, it's for organizing our thinking. However, I think it's not totally vacuous. So there is a history of debate on utopianism and related concepts. The leading ideas were, first, utopianism: a very popular idea, responsible for some enormous atrocities through history, with pushback to it from Karl Popper onwards, but still very popular now. Then Kevin Kelly, a futurist, has this idea of protopia, which is the idea that you just don't have a positive vision of the future at all. Instead you're doing something more like hill climbing: you're looking at society now, asking what are the little things you can change that are clear problems, and then just trying to solve them one after another in this incremental way. And so viatopia is a different way of thinking about things, and I think it leads you towards substantively different recommendations than you might otherwise arrive at, especially over the course of the transition from here to superintelligence. So if you've got the utopian perspective, you might think: well, what we need to do is just make the AI a classical utilitarian, or insert your other favorite moral view, and then just hand over to the AI that's pursuing that vision of the good. That seems very bad from my perspective. Or, and this will be very rough, from the protopian perspective you might just think: wow, well, there are these major issues, major problems in the world, like 100 million people dying every year, and AI will give us the ability to completely solve those problems. So actually we should get there as quickly as possible. And there will in fact be very rough trade-offs between how quickly we go and how much risk of existential catastrophe we bear over the course of this transition.
Someone aiming for viatopia might say: well, actually there are certain things that are even more important, namely not locking ourselves into a really bad future, even if that means we don't get some of the upsides in terms of near-term benefits quite as quickly as we might otherwise have done.
A
So you're saying protopia, this idea of, well, we don't want to have a grand vision that's going to lead us astray. Instead we just want to get wins immediately: find ways to improve the world that we can understand and where we can see whether they've worked. That would potentially lead us to miss the bigger picture risks, because we're just grabbing immediate wins like trying to improve health. Or it would recommend just charging forward on AI, or at the very least
B
it wouldn't prioritize among them where it would say, okay, well, maybe risk of loss of control to superintelligence or entrenchment of some authoritarian regime, okay, well, that's some risk. But there are these clear apparent evils such as death and poverty and so on, and we could solve them kind of right away.
A
Although wouldn't you also say that if you thought the AI might kill everyone in the near term, that's also a near-term problem? Although maybe it's harder to evaluate because it's more probabilistic.
B
Well, it's harder to evaluate, and also protopianism at least wouldn't give you the resources for saying that one of these is much more important than the other.
A
Do you think of viatopia as a middle ground between utopianism and protopianism, or is it a different thing?
B
In a sense, it's a middle ground in that it is offering a positive vision for where we should be headed. However, it doesn't have the same, in my view, the same pitfalls that utopianism has because it's compatible with many possible ultimate visions for what a good society looks like and is not committing to this kind of narrow view of the good.
A
So what would be the key traits of a viatopian state? What would be the key properties that you'd be looking for, do you think?
B
So there are the key questions and key properties, and I want to emphasize the questions more than my particular answers at the moment, both because the questions themselves are more important and because my views evolve a lot over time. But that can include things like: how widely distributed is power? Where, at one extreme, all power is concentrated in the hands of a single actor, all the way to an extremely distributed global democracy, or perhaps even more distributed than that. A second is: what sorts of people, what sorts of beings, have power? Is it just members of a particular society? Is it just humans? Do AIs have influence over the future? What about future generations? A third category is: when do major decisions happen? There are some arguments for thinking, look, we need to make really big decisions really quite early; or instead we should say, look, actually, for the sorts of decisions that will really guide how the future goes, we want to punt them into the future as much as possible. And then finally there are questions around how society as a whole should be making decisions, these most important decisions about how the future goes. That could be via democracy, via voting (if so, what sorts of voting systems?), or via auctions and market mechanisms (if so, what types?). And so those are just some of the things we've got to grapple with, I think. And I have views on them, but they evolve.
A
So the analogy that most jumps to mind for me is that if you have a group of people starting a new country, they might not yet know exactly what the nature of the law should be, what the political system should be, but they might have an easier time agreeing on some process, a constitutional convention sort of thing, where they come together and figure: well, everyone will get some vote, we'll use this kind of deliberative process, and then we'll use this kind of voting system, and at the end of the day we'll end up with some set of agreements about how things are going to run, and the chips will fall as they may. Is that a good analogy to have in mind?
B
Yeah, I think that's a great analogy. And the US Constitutional Convention at the end of the 18th century is this remarkable event where, if I remember correctly, it's about 40 people in a room debating for three months what the United States of America should look like. And what they agree on is this set of procedures. And obviously there are ratifications and amendments after that. And it's interesting too, because there's this balance between locking in certain ideas, but also locking in a method that doesn't itself involve lock-in. So you can lock in a certain system that allows a lot of experimentation and free debate and change over time. That's very different than if they'd chosen a constitution that put a single person, or even a single family lineage, in absolute power. That would have been locking in a different sort of political system, one with much less in the way of open-endedness in how it could develop over time.
A
Okay, so are there any particularly non obvious or controversial recommendations that you think the vitopian framing on things would push us towards stuff that people might otherwise not like?
B
Yeah, so there are certain things that I at least think a viatopia would consist in that are not totally obvious. So one, which we'll talk about, is that I'm very pro distribution of power, whereas a lot of people who worry a lot about existential risk are really in favor of actually quite intense concentration of power. The idea, and it's not an insane view, is that if you've got this period of intense existential risk, in particular if existential risk can be posed by any of many different actors, whether that's because they develop a misaligned superintelligence or because they create extremely powerful bioweapons, then you might think: well, we just need a very small number of actors, maybe in fact just one powerful actor, that can guide us through this period. Whereas I think that's unlikely to put us in a position where we can guide ourselves to a near-best future.
A
Yeah, why is that?
B
I think we'll talk about it a lot more. But ultimately it's because I think any single actor probably has the wrong moral conception, even upon reflection, even if they choose to reflect. I think it's a little worse than that, in fact, because of the sorts of people who end up there. Imagine that one
A
person has risen to the top and gained supreme power. There's probably some bad filters that they've passed through.
B
Yeah, exactly. And if you look at leaders of authoritarian countries in the past, well, that includes.
A
It's a mixed track record.
B
Yeah, I mean, that includes Stalin, Hitler, Mao. And the personality traits are just, you know, terrifying. These are psychopathic, sadistic people. They're not merely randomly selected people who happen to have total power. And I also think that if one person, or even a small number of people, are in a position of total power, they're also just less likely to reflect on their values in positive ways. I think that's something that tends to happen more naturally out of interpersonal interactions and the need.
A
Well, especially interactions among equals, I feel. Yeah. I think you notice this even just with people who gain more influence within an organization, or they become wealthy or respected or so on, and they stop getting the normal pushback that sharpens their ideas. And you can imagine, if you were the supreme dictator forever, how disconnected you could become from any reality.
B
Yeah, exactly.
A
Okay, so what are the different categories of viatopia that you think have a shot at working?
B
Yeah, so I think there are three broad ways of thinking about how we could get to a near-best future. The first I call easy utopia. This is actually, I think, the common sense view, which is just: it's not that hard to get to an extremely good future, something that's basically as good as you can get. You just need to eliminate the most obvious and egregious bads. So yes, a dictatorship would be like that. But eliminate poverty, eliminate suffering, eliminate ill health, allow people to have freedom. And that, plus just technological development, will get us most of the way or even all of the way there. If that's correct, then viatopia isn't that interesting, actually, because we'll probably just hit it anyway. A second view is convergence, where on this view you would need to have most of society, or at least most of those with power, converging onto the right kind of ethical view. I'll sometimes use "correct ethical view" or "correct moral view". You can also just say this in more anti-realist, subjectivist terms, like "the view I would have upon idealized reflection" or something. But it's easier just to say correct or best.
A
And they have to be motivated by it as well, right?
B
They have to be motivated, yeah. So in this idea of convergence, it's like: yes, maybe the best future is a narrow target. Nonetheless, if we can get it such that most members of society, or at least most of the people with power, converge onto the best thing, the best moral view, and act to steer towards it, then nonetheless we'll hit the narrow target. But that convergence is necessary. And then the third vision would be what I call compromise, which is: well, you don't need everyone. In fact, maybe even if you just got a small fraction of people who have the right kind of ethical views and are motivated to pursue them, and have the broad philosophical perspective and understanding of the world as well, and they're able to trade with the rest of society, that is sufficient to get us to a near-best future. And my view at least is that this third option is the most promising thing to steer towards.
A
So we're going to skip over the easy utopia scenario here. You have an article on the Forethought website called "No Easy Utopia" where you argue that it's not plausible. In brief, I think we both agree that the best possible world is not just a matter of removing bad things; it's also about adding lots of the best possible things as well. And probably the best thing is better than nearby things, so it's quite a narrow target to hit. And I guess we're not going to talk a ton about reflection: like, what if everyone, when they reflect on moral philosophy, ends up concluding that they've reached the correct theory, and they're motivated to spend all of their resources operationalizing it? Do you want to say anything quickly about why you don't think that is super likely to work?
B
Yeah, I mean there's lots to say, but I guess I just think there are multiple ways it can fail, even if we're in a reasonably good scenario. One is just that people can be uninterested in reflecting, or they can reflect in the wrong ways, or they can even have a good reflective process but just have the bad kind of starting intuitions, where from those intuitions, even with good reflection, they'll end up in the wrong place. And I'll say I am somewhat sympathetic to the idea that maybe quite large swathes of people actually would converge in the same direction. I think if that's true, it's because of the nature of reality; it's because, in my view, something kind of moral-realist-y is correct. The arguments are just very strong towards one particular ethical view. Or, if you just experience this particular conscious state, you can't help but believe that it is good, because it is in fact good. That's the sort of scenario I think we'd have to envisage. But I don't think we should be confident in that. And in fact I have really quite wide uncertainty over how much convergence you would get, all the way from: actually, large swathes of people would converge, which is again a really good scenario, all the way to: no one converges after reflection, and all 8 billion people in the world would have quite different views of the good.
A
Yeah, well, you missed one out. You could get all of that right, have everyone conclude the correct moral theory, but nonetheless not be interested in putting their resources there. They could just be like: but I just want to do my own thing. I don't care about doing the morally really good thing.
B
Yeah, and we see this. And in fact that's, I think, the most likely failure, where you can go to people and give them the arguments for vegetarianism or donating, and they can say, yep, all those arguments work, and then just not take any action on it. And in fact, you know, it's not like we see people today investing lots of time and lots of money into ethical reflection and reading counterarguments and so on. It's just not really something that happens. It would be quite weird and unusual to do that. And in fact, maybe some people would want to guard against it. So imagine fundamentalist religious believers, or people who are very wedded to particular ideologies, and they might say: look, I don't want to risk my adherence to my faith, or, oh God, it would be abhorrent of me to even consider this alternative position. And with future technology we would be able to guard our informational environment, or even self-modify, such that we don't even consider these alternative perspectives.
A
Okay, so just to set the scope even more clearly: we're mostly not going to be considering cases of catastrophic misalignment and really deeply scheming artificial intelligence here. Not because that's not a very live possibility, but just because we only have five hours to record in, and it raises a whole lot of separate issues. It's worth imagining what happens if we mostly overcome that, one way or another. So let's dive into the third option, which you thought was most promising, which I guess you call compromise, or trade. This is a scenario where, as I understand it, you have some meaningful minority of people, weighted I guess by power or resources, who do converge on wanting the right thing for its own sake, and they're willing to allocate some meaningful fraction of all of their effort towards that. So let's say it's 10%: 10% of resource- or power-weighted folks want to pursue this goal. You want to try to spin this into more than 10% of the best possible future that there could be. How might they accomplish that?
B
Yeah, so I think there are two big ways. One is if different groups care about really quite different things. The clearest example perhaps: maybe some groups, upon reflection, just value resources basically linearly. A total utilitarian would be like this, because the more resources you have, the more happy lives you can create, and the value of the universe as a whole is in proportion to how many happy lives there are. Other views that are perhaps more common-sense-y might be very different from that. So they might just care about preservation of the Earth's biosphere, or might discount over time and space, so care about what happens near to them, or might really just care about guarantees of good outcomes, or very high probability of good outcomes, rather than risky gambles on even better outcomes. This gives lots of opportunity for trade. So in this case, there could be a deal: the common sense person says, okay, well, we'll steward the resources that are nearby in space and time. And the total utilitarian: yeah, sure, you can go to other star systems and create this much more ambitious, expansive world with many, many happy beings. And then perhaps in fact both can get 99.99% of what they would ideally want if they had complete control over everything. And that's this very exciting potential opportunity. Because it means that if we can get into this scenario, where we've managed to get these beneficial gains from all these different ethical factions trading with each other, then we don't need to pick a winner. It's robust to disagreement. And it's therefore a much safer option than either just hoping we all converge or pushing some particular view of the good.
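The bargaining logic here can be sketched numerically. Everything in the toy model below is a hypothetical assumption, not something from the conversation: I give the "common-sense" faction a utility function that saturates quickly in resources, the total utilitarian a linear one, and a made-up deal where the local faction keeps 1,000 nearby star systems out of a million.

```python
import math

# Toy model of gains from trade between two value systems (all numbers and
# functions are illustrative assumptions, not from the source): a "local"
# faction whose utility saturates quickly, and a linear total utilitarian.

TOTAL_RESOURCES = 1_000_000.0  # say, star systems

def local_utility(r: float) -> float:
    """Saturating utility: almost all value comes from the first few
    hundred resources (e.g. stewarding nearby systems)."""
    return 1.0 - math.exp(-r / 100.0)

def linear_utility(r: float) -> float:
    """Linear utility: twice the resources, twice the happy lives."""
    return r

def share_of_ideal(utility, r: float) -> float:
    """Fraction of what a faction would get if it controlled everything."""
    return utility(r) / utility(TOTAL_RESOURCES)

# The hypothetical deal: the local faction keeps 1,000 nearby systems,
# the total utilitarian takes the other 999,000.
local_share = share_of_ideal(local_utility, 1_000.0)
linear_share = share_of_ideal(linear_utility, TOTAL_RESOURCES - 1_000.0)

print(f"local faction:  {local_share:.4%} of its ideal")
print(f"linear faction: {linear_share:.4%} of its ideal")
```

Under these assumed preferences, both factions end up above 99.9% of what they would get with total control, which is the "99.99% each" shape of the argument: the gains come precisely from the two value systems wanting different things from the same pool of resources.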
A
Do you think that things would play out that way or is that a viable vision?
B
So, I mean, I think there are risks to even getting that. One would be intense concentration of power. A second would be that maybe such trades aren't allowed. There's lots of things that you're not allowed to trade at the moment, and it's possible that includes just the best stuff. So maybe the total utilitarian likes some particular blissful state, and those people are in the minority, and society says: no, that's illegal. There's already lots of things that, in my view, would be ethically fine but are not permitted today. The bigger issue, I think, is this: okay, so maybe there's lots of groups who have relatively easy-to-satisfy views of the good, like preservation of the Earth's biosphere, or preferences for things that are local. But I think there'll be a lot of people who actually just do care about things linearly. And there it's much harder to see initially why you would get these huge gains from trade. So I said, okay, the total utilitarian says: well, I just want there to be as many happy, flourishing lives as possible. But now let's distinguish within that. There's utilitarian type one and utilitarian type two, and perhaps they differ on what they understand flourishing to consist in, what they think the best conscious experiences or lives are. In order for there to be good deals from trade there, it would need to be the case that there's some kind of hybrid life that is more than 50% as good.
A
On both grounds?
B
On both views. And it's speculation to say how likely it is that there would be or not. My guess is that in general there probably wouldn't be, because my guess is that the very best things from a utilitarian perspective will be way better than things that are just a bit less good.
A
I thought that the archetypal case here might be: you've got Faction A and Faction B. Let's say Faction A are the utilitarians; they want pleasure, no suffering. You've got Faction B that wants something quite different. And Faction B, incidentally, might cause a whole bunch of suffering in pursuit of their other goal. But the suffering is not something that they value for its own sake; they're just doing it because it makes their project somewhat more efficient. And then Group A could basically pay Group B to redesign their thing so it doesn't involve suffering incidentally. Is that the kind of thing?
B
That would be a case. And in the world today, that sort of thing happens. So I do think that if we had much better opportunities to make such agreements, if we had better coordination technology or something, the vegans and vegetarians and people concerned about animal suffering could just engage in some sort of trade with the people who like eating meat. And perhaps there wouldn't be enough bargaining power to eliminate farming altogether, but I think it could eliminate factory farming. And so most animal suffering could just be abolished, because, as you say, people aren't really aiming for that directly; it's just a side effect. My guess is that when we're now thinking about these very grand scales, that's not going to be super common, or at least there will be a lot of residual incompatibility left over. Because you're just trying to produce happiness type one as much as you can, and I'm trying to produce happiness type two. I think that your understanding of happiness has basically no value. But it's not like you're producing lots of suffering; it's just valueless. Or it's like a tenth as valuable or something, and similarly vice versa.
A
Okay, we'll push on from this. I guess we should just quickly note that there's a wrinkle with this kind of moral trade, or a challenge: for example, if we did start paying people to close down their factory farms or to redesign them, then you would be vulnerable to someone saying, well, I'm going to open up the worst possible factory farm unless you pay me. And you wouldn't know whether they would have done it otherwise. I guess they could pretend that they're not doing it to blackmail you, basically, but in fact they are. I guess possibly in this star-faring future, maybe that wouldn't be such an issue. Or maybe it would be a much worse issue. We don't really know.
B
Yeah, and I should flag: this is my biggest worry with the whole widely-distributed-power-and-trade picture, the vulnerability to those sorts of extortion and blackmail dynamics. And there's this very substantive project to work out: okay, what's a good system where people who self-modify or pretend or use blackmail or extortion are not rewarded for doing so, but you still get these other beneficial gains from trade.
A
Okay, let's push on to some honest-to-God philosophy, or at least what analytic philosophers would regard as philosophy. You've been working on a pet moral-philosophical theory that you call the saturation view. What problem in normative ethics are you trying to address with the saturation view?
B
Yeah, so this is a set of problems, in fact, within population ethics. It's an area of ethics well known for generating all sorts of paradoxes: cases where you've got lots of individually extremely plausible principles that end up inconsistent with each other. And there are a number. There's what's called the Mere Addition paradox, where you've got some intuitively plausible principles that end up leading you to what Derek Parfit calls the repugnant conclusion: the idea that you could start off with a trillion trillion extremely happy people, and that outcome might be worse than a population that consists only of people with lives barely worth living, as long as there's a large enough number of them. That's one of the problems. The second is the problem of fanaticism: again, start off with a guarantee of this amazing outcome, and now take a tiny, tiny, tiny probability of something that's even better. If it's sufficiently good, then combined with expected utility theory, many views will say: take the gamble. No matter how small the probability, there's some sufficiently good outcome such that you should
A
take it because it's risk neutral, basically.
B
Because it's risk neutral with respect to total quantity of happiness, or something like that. A third category of issues is infinite ethics. I think we definitely won't have time to get onto that side of things, but it's something that's really plagued this kind of impartial consequentialist approach to ethics, or axiology. But there's also a fourth problem, in my view, which hasn't been discussed in the literature, which I call the monoculture problem. Which is: okay, let's try and figure out what the best possible future is. What does that look like? Remarkably, all the extant well-specified theories of population ethics to date say that the best future, if you've got a fixed amount of resources, involves figuring out what the very best life is, the life that would produce the most well-being with a given amount of resources to create, and then just making copies of that life over and over and over and over.
A
Tile the universe.
B
Yeah. So in EA and rationalist worlds this sometimes gets called tiling the universe with hedonium, where hedonium is whatever produces the most bliss per unit of resources. But the general idea is just that what the view wants is a monoculture, because this is the thing that has the most well-being. And if you just have that repeated forever, you've also got this perfectly equal society, and so it's good on egalitarian grounds too.
A
Yeah, well, it seems like a very natural attractor, because any theory that says there's a best thing, and that thing is not universe-scale, is going to say: well, if it's smaller, just make it, and then make it again, and just keep going. It seems like you almost have to hard-code in a preference against this to avoid the monoculture, which most people find quite unattractive.
B
Yeah. And it actually also follows from a couple of principles that are generally regarded as axiomatic in population ethics; there's a very simple proof you can make from those principles. However, I at least find it unintuitive. I would think that a future of just replicas of the one exactly qualitatively identical life is not the best possible future, and that the better future would involve a wide diversity of different forms of life and experiences and so on. And I think that's not just an intuition that diversity or variety is instrumentally valuable, or an intuition saying, well, we don't know what's valuable so we should hedge our bets. Instead, I think it's just: no, actually that's
A
placing intrinsic value on variety.
B
A worse future, yeah. Or something that has that implication. So it could be, I mean, this might just sound like the same thing, but I think it's slightly different: that the realization of a particular experience or form of life has value in itself, over and above just the mere well-being. But either way, a very diverse and varied future is better than this monoculture.
A
Yeah. It's surprising to me that this hasn't come up in the philosophy literature very much. Because online, whenever people talk about what we're going to do with all of the matter and the energy, and anyone suggests something that is very monotonous, just repeating the same thing, people are like: well, I don't like that. Sounds horrible, sounds crazy and terrible. But I guess for philosophers, because the prospect of changing all of the galaxies out there hasn't really been on the table before, it hasn't really come up as: well, we need to figure out a solution to this.
B
Yeah, I think that's right. I have found over and over again actually that being really concerned by figuring out how do we do as much good as we can has ended up just driving all sorts of interesting philosophical areas and issues that are otherwise being neglected because most philosophers are not thinking in that same way.
A
Okay, so yeah, what is the saturation view? How does it address this?
B
So, yeah, the saturation view is a way of incorporating the idea that diversity is intrinsically valuable, by having the thought that if you have a replica of a life, so an exact qualitative copy, that's just less valuable. And in fact more and more and more copies of that life are progressively less and less valuable, in a way that tends to some upper limit. And generalizing that a bit, for the same reason: maybe it's not an exact copy but only slightly different, and that's also a bit less valuable than some totally new form of life. And the analogy could be: imagine a color wheel that's initially not lit up at all, and different sorts of life occupy different spots on the wheel, and by adding lives, you're lighting up those little spots. Whereas a traditional population axiology would be saying: you find the best thing, and you want to produce that best thing over and over and over again. Instead, on the saturation view, you want to light up the whole wheel. Because, okay, I've had many copies, let's say, of these very similar lives; well, that means additional such lives are not adding as much value. So you get more value by instantiating some totally different form of life or form of experience.
A
I mean, it's a very natural formalization, I guess, of this intuition. You're just saying: well, you hit declining returns on stuff if it's too similar. You've got something that's good, but making another copy of it isn't as good as the first time. And something also takes a bit of a haircut if there was something else that was too similar to it in the past. I guess they never become useless; they just become incrementally less and less valuable.
B
Exactly. Yeah, there's never a point where you get no additional value, but the amount of value each copy adds gets smaller and smaller.
A
Does it asymptote up to some maximum value?
B
Yes. So, yeah, as part of the view, it asymptotes. And that's a really crucial part of it, actually.
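That asymptoting structure can be illustrated with a minimal toy formalization. To be clear, this is my own sketch, not the paper's actual formalism: I simply assume each additional identical copy of a life is worth a fixed fraction of the previous one, so the total value of n copies is a geometric series that tends to an upper bound.

```python
# Toy sketch of the saturation idea (an illustrative assumption, not the
# paper's formalism): the k-th identical copy of a life contributes
# geometrically less value, so total value saturates at an upper limit.

def copy_value(base_value: float, k: int, decay: float = 0.5) -> float:
    """Value contributed by the k-th identical copy of a life (k >= 1)."""
    return base_value * decay ** (k - 1)

def total_value(base_value: float, n: int, decay: float = 0.5) -> float:
    """Total value of n identical copies: a geometric series."""
    return sum(copy_value(base_value, k, decay) for k in range(1, n + 1))

# With base_value=10 and decay=0.5, the series tends to 10 / (1 - 0.5) = 20,
# however many copies you add: the "saturation" upper limit.
print(total_value(10, 1))   # 10.0
print(total_value(10, 5))   # 19.375
print(total_value(10, 50))  # ~20.0, never reaching 20
```

Every extra copy still adds something, as in the conversation above, but the additions shrink towards zero, so tiling the universe with one life type leaves almost all achievable value on the table relative to "lighting up" new regions of the landscape.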
A
Okay, and do you have any difficulty defining what the hyperspace is over which you're considering whether things are different from one another, or are you just going to set that aside?
B
Yeah. So, I mean, in my work so far, I don't talk a lot about, okay, what exactly this space of different lives is, how many dimensions it has, and so on. I make some formal assumptions about it. But my view in general is: well, let's just start off by looking at the formal structure of this view and all of the nice properties it has, and then afterwards we can start arguing about the space, because that would involve weighing lots of different intuitions and so on. But I don't think it really affects the biggest picture.
A
So what are its nice properties?
B
So, going back to these different problems. Let's start with the monoculture: the view very clearly just doesn't lead to a monoculture. In fact, you would want this very rich, diverse future, and that would be better. In the variant of the view that I formulate, it also dissolves the mere addition paradox. Why is that? It involves one extra structural assumption, and again I'll emphasize that the point is to find some theory that is not like the total view and avoids its problems. The assumption is that all lives with very low well-being, or all such experiences, depending on how you're aggregating, are only a small part of the overall landscape of possible lives or experiences. Then you appropriately reformulate the underlying principles that generate the paradox, because these have to be what philosophers call ceteris paribus principles, other-things-being-equal principles. So it's saying: holding diversity fixed, it's not bad to make some people's lives better and to add lives that are good. And holding diversity fixed, it's in fact good to have more well-being and more equality. It turns out that the view can satisfy all of those principles, rejecting the repugnant conclusion while accepting this dominance principle and this egalitarian-plus-increasing-well-being principle, and yet never entail the repugnant conclusion. Because the thought is that all of these low-well-being lives or low-well-being experiences just can't add up to enough diversity worth having. So in each of the steps of the paradox, you're adding people and then trying to rebalance the well-being, but then there's a step where you just can't do it: there's no world that will in fact satisfy that step.
A
Okay, I didn't follow that, but that's okay.
B
It's a little bit hard to convey on a podcast. And in fact much of the paper doesn't even give the full view to begin with, because the view gets mathematically quite intricate; instead it just gives a toy version of the view and works it through.
A
So I think the main reason that I'm not super drawn to this, I guess, is that I don't have the intuition in favor of variety as strongly as many people do. Of all of the problems with total utilitarianism, or any views like that, the thing that I find most troubling is the risk neutrality between positive and negative experiences. I find that deeply disturbing, because it's never something that I would choose for myself: that I would be indifferent about a life that's extremely good and extremely bad, each with 50% probability. That's super counterintuitive to me. But the idea of making something that's really good and then making a lot of it, I don't find as peculiar, I suppose.
B
Well, I just wanted to ask about your views. You objected to risk neutrality, but I mean, you could just have a negatively weighted utilitarian view, where, let's say, bads count for a thousand times as much as goods or something, but you're still risk neutral with respect to that. Yeah.
A
So that is more attractive, I guess.
B
Okay.
A
I assume it's a little bit hard to know: are you changing the weighting of the badness, or are you just correctly assessing that the badness really is worse?
B
Yeah, yeah, yeah.
A
But yeah, I think that makes more sense to me. Or that's, I guess, more how I would make the decision: you just weight the bad stuff more. Of course, there are debunking explanations for why humans would have this intuition, that we're more capable of suffering a lot in an hour than we are of experiencing pleasure in an hour.
B
But.
A
Yeah.
B
Yeah. Okay. So I'm wondering if you also have worries about the risk neutrality aspect, because that's where, in the most extreme case, you combine it with the suffering cases. You start off with a trillion trillion lives of intense bliss. A trillion trillion lives, absolutely amazing. That's option A. Option B is a trillion trillion lives of intense suffering, the worst possible suffering, plus some one-in-a-billion-billion-billion-billion-billion chance of an extremely large number of lives that are just barely worth living. The total utilitarian, combined with expected utility theory, has to say that the latter is better than the former, as long as the number of lives is large enough.
A
So what we're doing is adding a whole lot of just barely worth living lives. And that's way better.
B
Yeah. So world A is the trillion-trillion-lives bliss utopia. And then gamble B: it's a gamble. There's a guarantee of a trillion trillion lives of intense suffering, plus an epsilon probability of an even larger number of lives that are just barely worth living, but a very large number of them.
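The expected-value arithmetic behind this can be made concrete. The specific well-being levels, the probability, and the population sizes below are made-up stand-ins for "a trillion trillion" and "one in a billion billion..."; the point is only the structure, that a risk-neutral total view must eventually prefer the gamble.

```python
# Illustrative numbers only (all values below are hypothetical stand-ins):
# under risk-neutral total utilitarianism, gamble B beats the guaranteed
# bliss world A once the barely-worth-living payoff is large enough.

BLISS = 100.0    # well-being per blissful life
SUFFER = -100.0  # well-being per intensely suffering life
BARELY = 0.001   # well-being per barely-worth-living life
POP = 1e24       # "a trillion trillion" lives
P = 1e-45        # "one in a billion billion ..." chance

# Option A: guaranteed bliss world.
ev_a = POP * BLISS

def ev_b(n_barely: float) -> float:
    """Expected value of gamble B: guaranteed intense suffering for POP
    lives, plus a tiny chance P of n_barely barely-worth-living lives."""
    return POP * SUFFER + P * n_barely * BARELY

# For a modest number of barely-worth-living lives, A wins; but for a
# large enough n_barely, the gamble wins on expected value.
print(ev_b(1e20) > ev_a)  # False
print(ev_b(1e80) > ev_a)  # True
```

Since the barely-positive term grows without bound in `n_barely` while everything else is fixed, no matter how tiny `P` is there is always some population size at which the gamble comes out ahead, which is exactly the fanaticism problem being described.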
A
Yeah, I foresee that you're just going to throw out an edge case like this whatever I say. You have too much practice with this. I mean, that is also very unattractive to me as well.
B
Okay, okay.
A
So, yeah. Did I interrupt? You were going to go somewhere with this. I think I'd like to hear it.
B
It's just because you mentioned risk neutrality, and one of the problems that I mentioned was this fanaticism, where no matter how small the probability, as long as the payoff is big enough, you will pursue that tiny probability of an enormously large payoff. And this view avoids that, because it ends up being bounded. So basically, as long as the landscape is either finite, or a certain feature of it decays fast enough, then there's an upper limit to how much good you can create. Intuitively, again thinking of this color wheel: you've fully illuminated the landscape as brightly as possible. That's the upper bound. And so you avoid fanaticism. And then I'll briefly say, but not explain why: for the same reason, I think it has quite a range of desirable properties even with infinite populations too. Many consequentialist views, like the total view, naturally lead to a lot of paralysis, where you can't even compare intuitively comparable worlds. This view does not have that implication.
A
Okay, I guess that is legitimately attractive. Two things struck me as odd about the view, or less attractive. On the negative side: if you're also saturating there, it's even more bizarre that you would say, well, we've already had so many people suffering in this very specific, torturous way, so adding more of them, who cares? It's too similar to existing things to be that bad. It feels even more clear that on the negative side it's just linearly bad to have more and more people having horrible lives. The other thing is: let's imagine that we were embarked on this project, that we're going to turn the sun into whatever we think is morally best, or turn the solar system into this thing that we think is fabulously, morally good. But then we make this discovery: we think that aliens elsewhere in the multiverse, a long time ago or a long time in the future, did something that was really similar. We've simulated it, and we think that they already made this before. We'd be like: shucks, we wasted our time. That non-separability, the fact that the value of what we do is connected to things so distant, isn't intuitive to me. What do you make of those two things?
B
Yeah, I mean, both super important points. And the negative side is the thing that, in my view, is by far the most unappealing aspect. And there, I think, you end up having to pick your poison, unfortunately. But let's come back to that. On the separability side: so this is the principle called separability, which is basically just that if I'm comparing A and B, two different outcomes, and suppose there's some background population in distant time or distant space, what that background population is like is irrelevant to whether A is better than B.
A
Yeah, you can add +C to both and then cancel them, cut them out.
B
Yeah, exactly. And I agree that separability is quite intuitive. But if you endorse separability in conjunction with what I would regard as just standard technical assumptions, you have to endorse either the total view of population ethics, which just adds up all the happiness, or the critical level view, which adds up happiness but subtracts
A
a bit for each individual.
B
For each individual, yeah. So if someone had well-being 10 and the critical level was 2 or something, then adding them to the population would contribute +8. And these views have all of the problems that we started with. They differ on the repugnant conclusion, but the problems are really bad, seemingly unintuitive, in both cases. So that's one thing to say: okay, well, we're going to have to suffer a violation of separability. The second is that the diversity intuition is fundamentally an intuition about separability, because it's looking at the pattern of different sorts of life and saying: well, we've already had a lot of this thing, so it's more valuable to have something new.
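The two separability-respecting views just mentioned can be stated in a few lines. This is a simplified sketch (populations as plain lists of well-being levels; the critical level of 2 matches the example in the conversation), with a check that adding the same background population to both sides of a comparison never flips the verdict.

```python
# Simplified sketch of the two views that respect separability: populations
# are lists of well-being levels; the critical level of 2 is the example
# used in the conversation.

def total_view(wellbeings: list[float]) -> float:
    """Total view: just add up everyone's well-being."""
    return sum(wellbeings)

def critical_level_view(wellbeings: list[float], critical: float = 2.0) -> float:
    """Critical level view: each life counts its well-being minus the
    critical level, so a life at well-being 10 contributes +8."""
    return sum(w - critical for w in wellbeings)

pop = [10.0, 10.0, 10.0]
print(total_view(pop))           # 30.0
print(critical_level_view(pop))  # 24.0

# Separability: adding the same distant background population c to both
# outcomes a and b never changes which one is better, on either view.
a, b, c = [10.0], [5.0, 4.0], [7.0, 7.0]
assert (total_view(a) > total_view(b)) == (total_view(a + c) > total_view(b + c))
```

The final assertion is the "+C and cancel" point from a moment ago: because both views are simple sums, the background population contributes the same constant to each side and drops out of the comparison, whereas a diversity-sensitive view like saturation cannot in general cancel it, since the value of new lives depends on what already exists elsewhere.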
A
I think it might be that these things are so linked in my mind that the homogeneity thing isn't as counterintuitive to me. I guess if you haven't thought about this before, they seem like almost separate issues, and you only realize on reflection that they're deeply connected.
B
Yeah. Because there are some cases where a violation of separability seems fine. In one's own case, say: I'm going to climb Mount Everest, and that's going to be this amazing achievement. And then someone says, oh, you forgot, you actually climbed it last year; you knocked your head and got amnesia. You might well be like, oh, okay. Well, I mean, it's a bit unclear.
A
I mean, if the experience would be the same, I'd do it again. I'd be like, great, I can do it again because I forgot.
B
Yeah, I mean, I think, I think
A
most people wouldn't probably.
B
Yeah. I'm actually getting some people to run a survey to see how robust people's intuitions are about different things, which
A
poisons people prefer to drink from this medley.
B
And I'm not actually claiming that this new view is the best view. I'm saying that if you want to reject the total view, this is the strongest option. Because the last thing I'll say on separability is: yes, we said that all views other than the total view and the critical-level view have to violate separability if you satisfy certain technical axioms. But I think the saturation view violates it in a less bad way, because most of the time, in fact the vast majority of the time, it's acceptable. If the populations are in different parts of the landscape, then you can just add them up: the value of this population plus the value of that population. So it endorses this kind of limited separability principle. And secondly, depending on how you define it, you can keep it such that it's all approximately linear until the population size gets really, really, really big. So then it looks approximately like the total view in most scenarios, right up until cosmic scale.
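To illustrate the shape being described, approximately linear until totals get really big, then bounded, here is one stand-in saturating function. The tanh form and the saturation scale S are purely my illustrative assumptions, not the actual saturation view:

```python
import math

def saturating_value(total_welfare, scale):
    # Approximately linear when total_welfare << scale,
    # but bounded above by scale no matter how large the total gets.
    return scale * math.tanh(total_welfare / scale)

S = 1e12  # hypothetical saturation scale, far beyond ordinary population sizes

ordinary = saturating_value(1_000.0, S)  # ordinary scale: barely discounted
cosmic = saturating_value(1e15, S)       # cosmic scale: pinned near the bound S
```

In the ordinary regime `ordinary` is within rounding error of 1,000, so adding up sub-populations works almost exactly as in the total view; only as totals approach the saturation scale does the bound bite, which is the "limited separability" behaviour described above.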
A
Or even intercosmic scale if we're doing EDT. Yeah. I guess I've seemed a little unenthusiastic about this so far, but I think it's amazing what you've made. Surely this is going to end up being a big deal, or surely this has got to be one of the top theories within this entire space, don't you think?
B
Well, I do think so, yeah.
A
I mean, I don't find it attractive, but I think that many people will choose this as their population axiology once presented with it.
B
Yeah, I should say I'm not at all claiming that this is the highest-impact use of my time, because I think a lot of this work can just be punted until AI gets better and so on. But it is the idea that I've been most taken with, most just obsessed by, in my life, and from a purely intellectual perspective I reckon it's my best contribution. It also makes me appreciate how few population axiologies have actually been proposed. The options are really quite weak, and most of the work that happens is more critical; very few people say, here's a view, here's a theory, and this is how it all works. In a way, it's surprising. Yeah.
A
Where can people go? Is anything published about this yet?
B
So my plan is to finish it up. I've done this kind of sprint on what was meant to be a blog post summary, but it's 13,000 words, so I think I'm just going to say, okay, this is a draft article. My plan is to publish that in the next few weeks. Okay, excellent.
A
Well we'll stick up a link to that.
B
Okay, yeah. And very kindly, you've not gone back to the negative side, how it deals with very negative worlds, intense suffering and so on. But I'm happy to acknowledge that it has very implausible implications in that case.
A
So you mentioned earlier that you used AI a ton to do this work. Yeah, tell us about that.
B
Yeah. Part of the reason I've been so taken and obsessed by this idea, working on it on holiday and in whatever spare time I could, is the amazing, in my view, uplift of AI on analytic philosophy in particular. So how helpful is AI for research? Well, it's extremely spotty: if you want to learn about some weird area, it's amazing; if you want it to help with certain areas of macrostrategy, it can be essentially useless. But in the case of at least this formal end of analytic philosophy, it's so good. And honestly, credit where credit's due: it's almost all ChatGPT Pro, so now 5.2 Pro. I don't think I'd be saying any of this if that particular model didn't exist. Huh.
A
Gemini or like a Claude are not at the same level.
B
Well, I think a big part of the reason is it just thinks for longer. So I've had it...
A
Is this the 200 per month one or.
B
Yeah, I now pay by credits, so I actually spent a thousand dollars in the month I was working on this most. But yeah, it will really think: I've had it think for 70 minutes, which is my peak so far,
A
and it really does deliver better answers.
B
Well, here's what's going on, I think. Why is it? Because I've talked to other researchers who really don't get that much from it. And I think what's going on is that the problems within, say, population ethics are very well specified. There's a big literature, which the AI has digested, and it's also an area that's been specified enough to be amenable to mathematical analysis, but very few mathematicians have actually looked at it; it's mainly philosophers who maybe did maths in their undergrad. The exceptions are a handful of economists and Teruji Thomas, a mathematician who moved into analytic philosophy and who, in my view, has done maybe better work than anyone on population ethics. So there's this big overhang of capability that the AI gets from being trained to be very good at maths. In my own case, I had the core insight maybe a year and a half or two years ago, something like that. Then I was exploring it; I talked to Toby Ord and Christian Tarsney, and I should say that if we publish a paper on this, it'll be co-authored with Christian. The initial thought was specified in a way that obviously didn't quite work: it was specified in a discrete form, and it seemed there must be some continuous form of the theory that would work. And I just don't have the mathematical training; it's beyond me. The AI does. So it really felt like getting this rocket booster, where I'd be like, no, I want it to work like this, and it's like, okay, cool.
A
Well, did you have difficulty checking the answers that it gave?
B
There were challenges there, and I've definitely been slower because of it. I use many AIs to check it, and it checks itself in many cases. One thing AI is still pretty bad at is keeping a tight hold on concepts: it might define something one way on page three and then on page eight define it in some other reasonable but different way, and it doesn't necessarily notice. But it's much easier to verify something than to come up with it yourself. And a lot of the time it's using concepts where, say, I didn't know what a kernel is. It's not that complicated once you've learned it, but I wouldn't even have known where to go.
A
So my impression from Twitter is that AI is now starting to make useful contributions in maths specifically. It's not amazing stuff yet, but we're seeing the early signs: it's producing stuff that might be publishable, I guess. Do you think the same thing might start happening in analytic philosophy, given that at least some parts of it are basically maths with words?
B
Yeah. Honestly, I think a big question is just whether analytic philosophers take the opportunity. I'm very curious about this as an early testing ground for AI for macrostrategy as a whole. But this is also the best case. There have been other cases: once, AI just gave me a definition that was really good, again a kind of formal definition; in other cases it's given really quite good informal definitions of things; in another case it came up with a good critique. I said, here's a view, generate as many counterarguments as you can, and it comes up with 20, and most are bullshit, not very good, but then one is really on point. So my take is that we're entering this golden age of analytic philosophy, potentially, at least on the more formal end, where people could become 2x, 4x more productive.
A
Does it need lots of handholding? I mean, at the point where one person can just say, here's a set of problems, here's a £100,000 compute budget, have at it, ChatGPT, then you don't need the field as a whole to change. That one person just ends up owning the discipline.
B
I mean, I think analytic philosophy is small enough that there's a question of whether even one person does it. But yeah, I don't expect the field as a whole to... I expect the field as a whole to be very slow to appreciate it, but some people will be really on top of it.
A
Yeah, I guess I'm saying that if it requires constant handholding to make any progress and to structure its thinking and so on, then that's a bad sign. It suggests that unless many people in the field get massively enthusiastic, which probably won't happen, then...
B
Oh yeah, and I think that is right. Because, you know, Christian again, who I'm planning to co-author with: we were working together, and he'd had this other idea, which ended up being quite different, for how to kind of extend the idea. And I was like, oh, you've got to use AI, GPT-5 Pro is so good, it's worth $200 a month. And then he had this hypothesis, a conjecture, and the AI was like, oh yeah, I've proved it for you, blah, blah, blah. And it's like, no, no, no, no, no. It was very complicated, too; it was like, oh, I need to assess this sort of thing. But it was just hallucinating.
A
Yeah, hallucinating.
B
Okay, well. Or the reward hacking, or... so there's a ton of...
A
It's really quite a skill to drive it, I guess.
B
Exactly. Yeah. You've got to have this intuition for when it's bullshitting you and when it's not, and that will be an increasing issue. I guess there are a couple of things. One is that sometimes it just flat out thinks it's proved something and it hasn't. Another is that often it says, hey, I've got this proof, and then you wade through it and one of the assumptions is very close to the thing being proved. And then there are those classic things everyone finds: it's lazy, it's really eager to please. So yeah, there's a lot of skill in terms of intuition about when it's going to work well and when not. And it's interesting: when have I ever just taken an AI output and actually read it the same way I read a human piece of text? Never, I think, maybe never. It's always a skim-through.
A
Yeah. I suppose there probably is a growing gap between people who have been using this stuff all the time, like you and me over the last year, and everyone else, because I think part of the reason other people are sometimes not as impressed is that they just haven't built up these intuitions for what kinds of things work, what the failures are going to be, and what to look for to tell that something is wrong. Okay, so it sounds like it's slightly mixed, whether we'll have a flourishing of analytic philosophy in the next few years. But you said that with macrostrategy, the kind of stuff Forethought does, you found it less useful, more touch and go.
B
Oh yeah, much more touch and go, and much more of a mixed bag. There are some ways in which, for macrostrategy, AI is an amazing uplift, because often the work just involves needing to know a little bit from all sorts of different disciplines. Even with early models, GPT-4 kind of thing, you'd say, okay, are there any interesting experiments you can only do in space and can't do on Earth? And it would say, yeah, actually, gravity interferes with certain crystalline formation, and I'm like, I would never have been able to get this otherwise. So for totally random bits of science and information it's fairly useful, and it's incredibly useful when you need to generate a lot of examples. With this AI character work, it's like: I need a trade-off between these two virtues, give me lots of examples, and it can generate large quantities of them. But then if there's some gnarly question, or when you need to be really precise, like if you're actually drafting certain principles for how AI character should behave... And certainly on the insight side of things, which is obviously a big part of the value, I think it just doesn't really know what doing good macrostrategic thinking looks like. So instead you get something that feels like a management consultant, or maybe a high school essay. I think it's still getting better and more useful, but I feel quite aware of where there's an existing literature and where there isn't.
A
Well, it sounds like your job is secure for another year at least, I guess. I think we've touched on about a third of the stuff that Forethought has put out over the last year. So if people liked this and want to read more, go to forethought.org. I guess you've got a research page; there's a lot of really interesting macrostrategy work on there that people should check out. I found it fun reading through.
B
Well, thank you. It's been great being on here. I've really enjoyed the conversation.
A
My guest today has been Will MacAskill. Thanks for coming back on the 80,000 Hours podcast.
B
Thanks for having me.
Release Date: April 22, 2026
Host: 80,000 Hours team (primarily “A” – likely Rob Wiblin)
Guest: Will MacAskill – philosopher, co-founder of Effective Altruism, author, Senior Research Fellow at Forethought
This episode features a wide-ranging conversation with Will MacAskill about humanity’s prospects as we approach the age of superintelligent AI. The discussion focuses on the importance of AI character and personality (“AI constitutions”), the role of risk aversion and “deal-making” with AIs to avoid catastrophic outcomes, challenges around concentration of power, the economic and philosophical implications of moral public goods in a future with advanced AI, and Will’s new “Viatopia” concept—a blueprint for navigating to a robustly positive future. Interwoven are critiques and reflections on Effective Altruism, including lessons from recent scandals.
[00:46–07:54]
Memorable Quote:
“Writing a constitution that guides AI's character is like writing instructions to God.” — Will MacAskill [04:21]
[07:54–17:14]
Notable Quote:
“The debate is where in between those extremes do we want AI to be?” — MacAskill [14:43]
[17:48–26:34]
Quotable:
“What are the sorts of people who obey orders no matter what and have no conception of the good? They're psychopaths.” — MacAskill [25:05]
[29:38–36:32]
[36:46–54:48]
Quotable:
“The history of institutions is about resolving differences in preferences by trade or compromise rather than going to war or violence.” — MacAskill [63:28]
[67:21–79:14]
[79:25–98:28]
[98:40–116:32]
[116:32–125:08]
[125:08–139:11]
Key questions for Viatopia:
How widely is power distributed?
What types of agents (humans, AIs, future generations) have a say?
When are major decisions made—now, or deferred until better wisdom emerges?
How do we reach decisions: voting, auctions, or other mechanisms?
Pluralism over Monoculture: Cautions against single-actor (or single-ideology) governance, due to the risks of value/moral error and unreflective entrenchment.
[139:43–154:43]
[154:43–177:47]
Notable Exchange:
“The negative side is by far the most unappealing aspect… you’ve got to pick your poison.” — MacAskill [171:59]
[178:03–188:40]
The conversation is accessible but intellectually rigorous, with a “scout mindset” emphasis—a commitment to clarity, depth, and openness to strange but important ideas. MacAskill combines analytic philosophy with practical institutional design and AI alignment concerns, often weighing tradeoffs and pluralism over ideological purity.
Visit Forethought’s research for technical articles and blog posts on all topics discussed. If you enjoyed these topics, read Will MacAskill’s upcoming 10th-anniversary edition of “Doing Good Better” or explore related work on moral public goods and decision theory in infinite universes at 80,000 Hours.
End of Summary