
A
I think the next two to four years will be pretty path-defining. I think we need to build workflows, tooling, infrastructure that allows us to, without needing to fully blindly trust the AI system, trust the outputs it creates. The crucial question is: how can we arrive at justified confidence in those outputs? And if we do find ways to do that scalably, then I think we are in a good position to build a world that unlocks a lot of flourishing. We will get massive uplift from AI in all of these domains: science, engineering, decision making. I think the big question is whether we can channel enough of that newly unleashed creative power that we'll be getting into building the sort of things that maintain stability throughout. I'd rather get all the flourishing things out of not superintelligent systems, but highly capable systems that I can coordinate with well and that can coordinate with each other well, as opposed to training successor agents that we don't know how to train. There's no perfect security; it doesn't exist.
B
Nora, welcome to the Future of Life Institute podcast.
A
Thank you.
B
Fantastic. Could you introduce yourself to the audience?
A
Yeah, sure. So my name is Nora Ammann. I currently work as a technical specialist at the Advanced Research and Invention Agency (ARIA) in the UK, and I've been broadly in the AI safety world, the how-are-we-going-to-make-AI-go-well world, for the last couple of years, working on a few different things. We'll probably touch on a few of those.
B
Yeah. Perfect. So you wrote to me that you think the strategic picture is that we are in a slow-takeoff, multipolar world. If we take that statement apart, what does it mean that we're in a slow-takeoff world? How do you measure that, and what evidence do you see for it?
A
Yeah, maybe just clarifying what exactly I mean. It comes in degrees, but essentially I expect there to be several inflection points of AI progress rather than one big one, after which we have self-improving AI going through the roof. So instead of having one of those inflection points, I think we are seeing several, and at each inflection point there's a pickup in the speed of acceleration itself. That will still be fast, but in the lingo of the broader space, I think we would refer to that as slow takeoff rather than a foom kind of takeoff. And then what evidence do I think we see? Well, we see these models improving, fast and increasingly faster, but we don't see them, from one training run to the next, achieving full generality, and especially not online learning. So that's what we see so far. And then I think there's another aspect beyond model capabilities: how those capabilities translate into things that happen in the world, like economic impacts and adoption. Sometimes people talk about unhobbling here; in terms of how we can actually make use of those capabilities, that as well is taking time, and people are figuring out how best to get the capabilities out of these systems and adopt them into the world. And there are various bottlenecks to this rather than a single one. So even if we unblock one bottleneck, there are a few others. And while this will be picking up increasingly, there's not a singular inflection point.
B
Yeah, that makes sense. In terms of years, what does it mean that we're in a slow-takeoff world? Because it can be somewhat misleading to say a slow-takeoff world if we...
A
Yeah, obviously there's uncertainty for sure. I think my best guess or median guess here is that we'll see, by the end of 2026, pretty intense uplift in terms of software engineering in particular. I'm thinking especially of the METR work on the task horizons that AI systems are able to pursue autonomously. That task horizon seems to be doubling every four to seven months, and if you follow that trend, you will have AI systems that can do software engineering tasks of roughly day-long length, so what would take a human a day to do, they would be able to automate by roughly the end of 2026. And that would obviously give a strong uplift to software engineering, and to ML research to some extent, and then that will get us to find new algorithms for training, and eventually these AI systems will be able to do R&D on hardware as well. So I expect it to go relatively quickly after that. I think by 2028 to 2030, these systems will be pretty autonomously capable of a wide range of R&D. And, you know, definitions are tricky, but I think more in terms of what's the window of intervention here to steer the direction of that takeoff, and I think that is mainly located within the next two to four years. Whether we call it AGI or ASI, I don't care too much. But when do we have time to intervene? The next two to four years seem to be pretty path-defining.
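To make the arithmetic of that extrapolation concrete, here is a minimal sketch. The starting horizon, start date, and the 5.5-month doubling time are illustrative assumptions layered on the METR-style trend, not figures from the episode:

```python
# Rough sketch of the task-horizon extrapolation. Assumptions (mine, for
# illustration): a ~2-hour autonomous horizon in mid-2025, and a doubling
# time in the middle of the 4-7 month range cited above.
from datetime import date

def horizon_hours(on: date, start: date = date(2025, 7, 1),
                  start_hours: float = 2.0,
                  doubling_months: float = 5.5) -> float:
    months_elapsed = (on - start).days / 30.4
    return start_hours * 2 ** (months_elapsed / doubling_months)

print(f"{horizon_hours(date(2026, 12, 31)):.0f} hours")
# ~2h * 2^(18/5.5) ~= 19 hours -- roughly a human working day,
# which is why the trend points at day-long tasks by end of 2026.
```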
B
And why do you think that?
A
So I think of this often as differential progress. With technological progress, there's the question of how fast it will go, but I'm also interested in what direction it is taking off in. If these systems become more and more capable of autonomous R&D, and autonomous work more generally, it will be those systems doing more and more of that. And they can do this much quicker, they can do this in parallel, there can be big fleets of these systems doing this. So the volume of what will be happening will be increasing; it will be a speed-up. And so it will be harder for humans to intervene at that point unless we have already built good structures that allow us to do that. So setting this up on the right trajectory, given how fast the speed-up is, seems to lie roughly in that time period.
B
I'm assuming that we will get more capable. So us as humans, we will get more capable because we will have these AI tools that enable us to do more work. And so if that happens, would that extend the window that we have to intervene just because we would also get more powerful as the AIs become more powerful?
A
Yes, and I think that's exactly the sort of differential acceleration that is critical to get underway over the next few years. If we in fact use the next few years well, then it will be true that human-AI teams will be collectively very capable. If we don't, I think there might be a heavier shift towards AIs autonomously doing that work without us having meaningful oversight, steering, or involvement.
B
And you think the path of minimal oversight, of not much ability to steer the systems, is what happens by default?
A
Yeah, it sure seems like a lot of that could happen. I mean, it's always tricky. I'm hesitating a little bit because we're all actors in the world creating the future, so saying that that's my default prediction depends on what I'm conditioning on. Mostly what I'm saying is: it seems really, really valuable to put a lot of effort right now into making these systems more steerable and building scalable oversight mechanisms.
B
You also write to me that we are on a narrow path where we can sort of fail in two different ways. We could either fail through domination or through chaos. So maybe you could sketch out what those two failure modes look like.
A
Yeah, those are a bit terms of art; let me unpack them. I'm not sure they're the best words at the end of the day. But when I think about this future, where these capabilities keep improving and AI uplift on a bunch of things we want to do becomes increasingly significant, the big question for me is: how do we make that go well? I can see it going many different ways, and what would it even look like for it to go well? When I think through that, there are two clusters of scenarios that I'm worried about, and importantly, those are clusters that break into many, many scenarios. But roughly, one is the more classical rogue-ASI, loss-of-control scenario, where AI capabilities are improving and eventually we have an AI system, or maybe a collection of AI systems, that is more capable and more powerful than humans and misaligned, whatever exactly that means, but out of control and indifferent to humanity in a way that could cause a lot of harm. That would be a scenario where the harm comes from a single system that is accumulating a lot of power and dominating other actors that also have their interests. So that's what I might call the cluster of failure through domination. And there are nuances to this. Some people are, for example, interested in AI-enabled coups: what if the AI system is quite aligned with a human individual or a small group of individuals, but those individuals stage a coup for power, and then they effectively have all the power and dominate the rest of the world? That too would be a bad outcome, and that too falls into this gestalt of failure through domination. So that's one cluster of failure where I'm like, ooh, we've got to be careful about that sort of thing. And then the other one that stands out to me is a bit more diffuse. Let's assume there's not going to be a single system or single set of actors that accumulates most of the power, enough to dominate everyone else; by domination here I mean something like being powerful enough that you can just impose your will on others. So let's assume there's no single actor or set of actors that can do that. Instead, we may be in a world where everyone is very similarly capable, but we are somehow locked into a race to the bottom, where it's very adversarial and everyone is trying to win out over the other. What you might end up with is a setup of excessive decentralization and excessive competition, such that every individual agent, because they can't coordinate, is locally always incentivized to spend their last extra resources on increasing their fitness or their power or their defenses. So much so that, as a collective outcome, no one can spend any of their resources on anything else they might care about. So we get an erosion of values other than just winning out in this competition. And that also seems bad.
And it seems like an opposite cluster, because that cluster looks less like a single dominating power and more like all-out competition, and that eroding away what we care about. So in this picture, with these two clusters, I can tell stories about how AI is going to accelerate both of them, and neither of them sounds that exciting. What does it look like to scoot past both of them? How do we thread the needle so we don't end up in either of those worlds? And then what would a world look like that stably doesn't collapse into one or the other of those clusters?
B
Yeah. The scenario in which there's excessive competition is a scenario you write about with co-authors in Gradual Disempowerment, a paper I'll link in the description; listeners can also scroll back in the feed and listen to my interview with David Duvenaud on that topic. But just quickly on the competition question here: competition seems to be a positive force in the world, or has been historically. So what is it that's changing with AI that's making it into a destructive force?
A
I agree with the premise: competition can definitely be constructive in many ways. But to put it succinctly, I think the challenge with competition is basically how you can deal collectively with negative externalities. Negative externalities being effects of your actions, individual or collective, that are bad, but that no one is pricing in; there are no incentives such that any individual actor prices them into their actions, so they're ignored. The classical example: pollution is a negative externality. We price the downstream costs of it insufficiently, and then actually everyone is kind of worse off as a result, but it's hard to coordinate on not doing it. And I think excessive technological risk is also a negative externality that is hard for a group of actors to coordinate on managing appropriately. There's a lot of upside we can get out of, for example, the technological development of AI, but there are risks, right? And then how do we collectively coordinate on managing those risks? These are examples of negative externalities. And my picture here would be something like: if you're in a context where it's all-out competition and, for whatever reason, the various actors in this multipolar world can't coordinate, those negative externalities will pile up and there's no way to coordinate around them not happening. And in the world we have, we are in some sense so powerful. I think the moment humanity invented nuclear power was a big deal, because it was maybe the first time humanity was able to create negative externalities so big that we could genuinely not recover from them, which is very different from externalities that are recoverable.
B
You're thinking of nuclear weapons, correct?
A
Yes, exactly. So if you have negative externalities that you can't recover from as a collective, you really want to coordinate on not incurring those externalities. But if you don't have the means of coordinating on that, you collectively can't steer away from them.
B
Yeah. And one worry here would also be that the societies or companies or individuals that thrive in a world with a lot of competition are exactly those that are kind of set up to be maximally competitive and to not care about anything except for getting more resources and sort of winning. And so you end up in a world where everything is optimized for gathering the most resources and using those resources to sort of find further resources and so on, with not much left over for, say, human flourishing, for example.
A
Yeah, exactly.
B
So one question I have here is: is there sort of a tragic tension between avoiding these two outcomes? Is it the case that trying to prevent one of these outcomes is to the detriment of trying to prevent the other one?
A
Yeah, well, it sure seems tricky from where we're standing. And this is why I use this notion of a narrow path, of threading a needle. It's in some sense interesting to observe that there's a way in which those failure modes seem like quite the polar opposites of each other. And at the same time, from where I'm standing right now, with partial observability over the world and what's going on and what will happen, I'm still worried about both of them. Both of them seem like somewhat serious, plausible outcomes, even if they sit at polar ends of the spectrum of what could happen. So that's interesting. And then I would just say I think it's unhelpful to frame it as: which of those bad outcomes is the better one? No, the problem, stated correctly, is that we need to navigate both of those challenges, and then what does that look like, rather than trying to make a hierarchy of which one is worse. So yeah, I'm worried about both of them, I think we need to look out for both of them, and threading the needle is in fact hard.
B
Yeah. So let's get into your proposed solution for threading the needle here, which is around building coalitions, specifically coalitions between humans and AIs. What is it that we need from those coalitions? What sort of capabilities would they have?
A
So there's a few different pieces here. Here is one way to put it: AI is happening, AI is improving, and we are finding more ways of feeding it into the things we do and getting benefit out of that. People are currently very excited about AI-enabled science: what if AI can really supercharge how we do science? We already have a lot of examples of how it supercharges software engineering and coding, and that can increasingly feed into other areas of engineering and R&D. We're thinking about how AI can help us navigate the complex world and make good decisions. And partially I'm like, well, that seems to be happening; how do we make it go well, and what would that even look like? I think one of the pieces we need is the ability to leverage AI systems even though we maybe can't, or shouldn't, trust those systems 100%. And there are a lot of reasons why that might be the case. A bunch of people listening to this podcast might not want to trust AI systems full on because of safety concerns, about deception, about misalignment. Seems legit, seems important. But there are also more mundane reasons why we might not want to trust AI systems: as they get better, they still make mistakes, or it can actually be hard to tell the system precisely what you want. I might say, oh, build that website for me, and it does that and tries really hard, and maybe it's really well-intended in trying to do that, but I actually didn't specify what security properties I want from the back end, et cetera. So we might just be ambiguous about what we want, which makes it hard for the system to do what we want. Or there are cases where I individually might trust what the system does, but as a collective, as a society, it matters that we all feel we have justified trust, and that's not the same problem. So there are a few different reasons why we can't just close our eyes and trust that the AI will do all the things we want in science and engineering. Instead, I think we need to build oversight, workflows, tooling, infrastructure that allows us to, without needing to fully, blindly trust the AI system, trust the outputs it creates. So if we get AI to start to do science for us, if we get AI to start to do engineering for us, I think the crucial question is: how can we arrive at justified confidence in those outputs? And if we do find ways to do that scalably, then I think we are in a good position to build a world that unlocks a lot of flourishing in this coalition-y way, where AIs and we can work together on building cool things. And then the question is: what sort of things should we be building? A lot of things, right? The lightcone is pretty big; there's a lot that could be built. But going back to the notion of negative externalities we talked about earlier: so that we can continue to expand into the lightcone and do cool and fun things, we need to make sure that we manage negative externalities, in particular the big ones.
And I think some of the stuff that is needed for that are public goods. We need to build resilient infrastructure, such that, yes, accidents might occur, yes, there might even be these adversarial dynamics going on, but we are resilient enough that we can recover from them or work with them or adapt to them. And we need to build the sort of R&D that allows us to do that. On the other hand, I think we also need to build out the infrastructure that enables positive-sum trade: agents, by working together, can unlock more than they could do alone, which is really exciting. Here lies the possibility for coordination, but there are some challenges to actually being able to enter these trades; that's economic theory, bargaining theory, Coasean bargaining. There are challenges even in finding these opportunities to collaborate. And there are security- or enforcement-related risks: cool, you have something I want and I have something you want, that's great, let's just do it. But what if I have to worry that I give you the thing I have, and then midway through the trade you decide to change your mind and walk away? If I face that risk, my expected-value estimate of whether I should engage in this trade becomes very different, and now it becomes an obstacle to entering what would have been a positive-sum trade. So I think building up these public goods, which allow various coalitions of actors to emerge and find positive-sum trades, would allow us to navigate a rich world and navigate the negative externalities, both on the domination side and on the chaos or excessive-competition side.
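To make the enforcement-risk point concrete, here is a toy expected-value calculation. All numbers are made up for illustration:

```python
# Toy model of how defection risk can kill a positive-sum trade.
# Numbers are illustrative, not from the episode.
gain_if_completed = 10.0   # my upside if the trade goes through
my_stake = 8.0             # what I hand over up front and lose on defection

def trade_ev(p_defect: float) -> float:
    return (1 - p_defect) * gain_if_completed - p_defect * my_stake

for p in (0.0, 0.3, 0.6):
    print(f"P(defect)={p:.1f} -> EV={trade_ev(p):+.1f}")
# P(defect)=0.0 -> EV=+10.0   clearly worth entering
# P(defect)=0.6 -> EV=-0.8    the same positive-sum trade never happens;
# enforcement and verification infrastructure is what pushes p_defect down.
```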
B
Yeah, so this vision starts with us being able to trust AIs that are not fully trustworthy, so we can build these coalitions between us and AIs. How is that work coming along? I know you've been working on it, and others have too. So where are we, and could you perhaps explain how this might work?
A
Yeah. I mean, there are many ways. I want to say that there are a lot of folks working on aspects of this problem, and a lot of different ways you might achieve this, and I'm not going to name all of them, but just to acknowledge that. There's classical alignment work that frontier companies and other places are doing that can feed into this. There's interpretability work, where we're trying to just better understand what AI systems are and how they behave, and that can help here as well. But the thing I'm working on, I often summarize at the moment as trying to build the tooling that allows us to do high-assurance AI-enabled R&D. By high assurance I mean: have AI do a lot of R&D for us, software, hardware, various types of R&D, but in a way where it can actually give us pretty rigorous quantitative assurances about what the outputs of that process are. The Safeguarded AI programme that I work on at ARIA is doing that, or is aiming at that, to some extent. We're a bit more than a year into it, and it's by design meant to be very ambitious, and we'll see how it goes. So far I'm excited about what's happening, and I think we'll probably know much more by next summer about whether some of the initial big bets are panning out.
B
Let me just try to give an example of how I see this problem. We can imagine that we have a superhuman coder AI in a couple of years, and it gives us a piece of code that's way too complex for us to understand. The AI tells us that this code is very helpful to the project we're trying to achieve. Should we run this code? Is running this code dangerous? This picture is a little naive, but the question is: should we run this piece of code? Then one thing you can do, and this is my explanation of what you're working on, is try to have the AI provide assurances about what will happen when you run the code. You can explain exactly how that's done, but the desired end state would be that you can say with confidence: when I run this piece of code, such and such will happen, and I will get this outcome in the world. Is that an accurate sketch of what's happening?
A
Roughly. I think maybe it's slightly too ambitious: if it's literally "here are all the things that would happen when you run this code," that's maybe not exactly right, or could be misunderstood. Maybe the way I would put it is: we want our AI to produce some code for us, but there are specific properties we want from that code. Maybe we want to use this code in critical infrastructure, so we're really interested in this code not having exploitable bugs. Now, one way we could do this is by asking our AI agent to write that code, telling it to write it securely, and then maybe later trying to review the code and seeing whether we find any bugs. But the problem with that is, first, it's actually pretty hard to review code and find all the bugs that are in it. And secondly, it's not scalable. If we really buy into the thesis that AI is going to do more and more of that work, and can do it much faster, then in the future AIs will produce volumes of code that humans could never keep up with reviewing. So that doesn't seem like a very good bet. The thesis of the programme is: what if instead we made the AI do some extra work for us, to show us, in a way that is much more easily reviewable, that the code actually has the property we want? In this case specifically, we're interested in the AI providing proofs that this bit of code it gives us meets particular specifications we have given it. And to make that scalable, we need to build some infrastructure, some tooling, that first makes it easy for us to specify what we want; that's not an easy task, but if we have good tooling here we can use it across the board. And then we need to give the AI additional affordances to get good at giving us proofs that the outputs it produces do in fact meet the specifications given. So that would be an example of scalable oversight in this sense.
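A minimal sketch of what a machine-checkable proof obligation like this can look like, using the Z3 SMT solver from Python. The code and spec here are illustrative toys, not ARIA's actual toolchain:

```python
# Instead of reviewing AI-written code line by line, we discharge a proof
# obligation with an SMT solver. Toy example: "AI-generated" XOR swap.
# Requires: pip install z3-solver
from z3 import And, BitVec, Not, Solver, unsat

a0, b0 = BitVec("a0", 32), BitVec("b0", 32)   # arbitrary 32-bit inputs

# Model the code step by step:  a ^= b; b ^= a; a ^= b;
a1 = a0 ^ b0
b1 = b0 ^ a1
a2 = a1 ^ b1

# Spec: the two values end up exchanged. We assert its negation and ask
# the solver for a counterexample; 'unsat' means none exists, i.e. the
# property is proved for all 2^64 input combinations.
s = Solver()
s.add(Not(And(a2 == b0, b1 == a0)))
assert s.check() == unsat
print("proved: swap is correct for every 32-bit input pair")
```

The point is that checking the solver's verdict scales, where reviewing the code by hand does not.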
B
And so you're asking the AIs to do more work. What's your sense of the performance tax you're paying for doing that, versus just asking the AI to produce the code and then crossing your fingers and hoping it works? I guess that might be cheaper, but what's the performance difference between those two things?
A
So it will vary with what you do. A proof about a piece of software will be much cheaper than a proof about some cyber-physical control system, for example, because that's a much more complex world model and specification that you have to prove against. So it will vary. But the main point I want to make here, boldly, is that it doesn't matter that much. Because if you're the principal, a human, a company, whatever, and you're like, great, we have these AIs and they can really uplift how much we can do, but then those AIs end up doing something and you have no idea what that is and you can't keep up with making sense of it, that's effectively just not useful at all. It's not a little bit useful; it's basically not useful. So there's a bunch of natural incentives here: yes, obviously I want the work to be done well, but it is also very important that we build the infrastructure and tooling that makes using these workflows easy and scalable. This is why I think of it as a public good we can provide that would change the trajectory of how we use AI systems going forward. There are more details to be said here; you can get pretty nitty-gritty about how scaling and compute costs pan out, and that's to be seen. But one advantage this scalable oversight infrastructure has is that it also makes it easier to use fleets of agents in parallel, in a way that actually works well. Because if you want to parallelize your work, you need good ways of carving up that work, and you need good ways for these agents to coordinate with each other without being able to communicate freely or having all the shared knowledge.
B
And why can't they just communicate freely and share knowledge? Is it because that would be dangerous, or why is that?
A
So you need to build harnesses that do that. You need to figure out how they would be sharing knowledge, and they could in some ways, but you still need to carve up the work if you want to work on two things in parallel. What if one finishes before the other? What if one agent tried X and it failed; does that mean my plan doesn't work out anymore and I have to re-plan? How do you carve up your technical roadmap such that you can effectively and efficiently parallelize? That's an actual problem, and it's not that easy. And then there's another scaling law here: I could use the same amount of compute to run one model that's bigger and more competent, or I could use 20 agents in parallel, and depending on how effectively I can use them in parallel, one or the other might give you better outcomes. So there's an interesting scaling question going on that we'll see how exactly it pans out. But given some recent data I've seen on RL scaling, it plausibly means that being able to parallelize across agents very well keeps you relatively competitive.
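A toy way to make the one-big-model-versus-many-agents trade-off concrete, under the simplifying (and unrealistic) assumption that agent attempts succeed independently and results can be checked cheaply:

```python
# Same compute budget buys one stronger model or k weaker agents in
# parallel. Success rates here are invented for illustration.
def p_any_success(p_single: float, k: int) -> float:
    """P(at least one of k independent attempts succeeds)."""
    return 1 - (1 - p_single) ** k

one_big_model = 0.60                       # hypothetical success rate
parallel_fleet = p_any_success(0.15, 20)   # 20 weaker agents at 15% each
print(f"one big model: {one_big_model:.2f}")
print(f"20 parallel agents: {parallel_fleet:.2f}")   # ~0.96
# The fleet only wins if you can carve up the work and cheaply check
# which attempt succeeded -- which is exactly what proof certificates
# and scalable oversight tooling are meant to provide.
```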
B
Interesting. Does ARIA have working prototypes of any of these things? Do you have a toy example of what this looks like?
A
So there are different parts to the entire tool set, and there's not an overarching single working prototype. There are definitely examples. Folks are using workflows that look a bit like combining coding agents with SMT solvers to get proofs about your code. And then we have some early prototypes in the world-modeling domain. But yeah, we're still definitely early on, and I can't point you at a nice website where you can play around. Early stage. But maybe by summer there will be more that looks like that.
B
So making agents collaborate: this seems like it would be a breakthrough in capabilities too. Are you playing with fire here, in a sense? Is this something you think the frontier companies are also trying to solve? And I guess, why is it important from the safety perspective to make sure that the agents can collaborate?
A
I guess one way of describing this is that we will get massive uplift from AI in all of these domains: science, engineering, decision making. I think the big question is whether we can channel enough of that newly unleashed creative power that we'll be getting, or that will be unleashed into the world, into building the sort of things that maintain stability throughout. Can we build resilience quickly enough, while AI-enabled science will also be finding dual-use discoveries? Basically, in order to make things go well here, we need to build a bunch of technological and institutional solutions that help us navigate those various trade-offs. So in part this is a bet: how can we differentially, and first, accelerate the building of those resilience- or stability-increasing technologies, soon and fast enough, so that we can create and maintain those islands, those coalitions of positive-sum trade, and protect some of the things we care about?
B
So if we're successful in making AI agents collaborate with each other on solving a problem, does that help us build these human-AI coalitions? Because collaborating with an AI agent is somewhat the same thing as collaborating with a human. Is that the angle you're taking here?
A
They're partially similar: some of the tooling that's needed is the same, and then there are also slightly different things. So yes, we want the human-AI link to be sound. And separately, I think we'll be in a world with plenty of AIs floating around the economy, and most of the economic activity that's going to be happening will be AI agents doing it, often representing humans in some way, or representing companies. So they need some technical solutions to be able to bargain with each other, and bargain in our stead. I might have some agent that will be scheduling all the meetings for me in the future, that understands something about what my preferences are, and also sometimes comes back to me and asks: how do you feel about the scheduling? So we need that human-AI link, and we also need those agents to be able to go out and coordinate with each other. Sorry, now I lost your question, though.
B
I'm thinking of where the safety angle is on making AI agents collaborate. It could also be that having agents collaborate to solve a problem is safer, because the individual agents are not as smart, but they can produce an outcome that is collectively high quality, and in a way where we have more insight into what's happening. It's not one incredibly smart agent just giving us an output; it's a hierarchy that we might be able to look into.
A
That's right. And then maybe an additional consideration here is something like this. I think we face serious risks, or negative externalities as we've been calling them, from building ever more capable AI systems, or building systems that really self-improve, because we don't know how to do that in a way that maintains alignment reliably; we don't have the science of how to safely self-improve your source code. So that's a risk we need to navigate. Now, interestingly, an AI system that is pretty capable, maybe capable of ML engineering such that it could train the next iteration of more capable agents, is in some sense also facing an ambitious alignment problem. How should it design or train the next generation of successor agents in a way that protects the key things that it cares about? There's a world where we've discovered a science of doing that and we know how to do it, though I'm not quite sure whether that's metaphysically entirely coherent or what it means. But if we don't, then that agent might have a choice between training a successor agent that it can't control or can't align with high confidence, or alternatively, if it has the option, using tooling and infrastructure and coalitions and teams to become more capable as a collective, through tooling rather than through self-improvement. That might become more attractive. So we're maybe facing that situation: I'd rather get all the good things, all the flourishing things, out of not superintelligent systems but highly capable systems that I can coordinate with well and that can coordinate with each other well. And those AI systems themselves might also rather build flourishing coalitions that can get us a lot of exciting things, as opposed to training successor agents that we don't know how to train well. Until we do; maybe at some point we do, I don't know. But there's some gesturing here at where some of that stability might come from.
B
And these are the types of things that you work on: scaffolding, tooling, infrastructure. Could you give some examples? What is tooling, what is scaffolding, what is infrastructure? Why does this allow agents to improve in a perhaps safer way?
A
Yeah, there are a couple of different bits. One is what we refer to as world modeling. This is basically: how do you specify what you want? In software this is somewhat easier; you can write certain memory-safety properties or the like, and specify them formally relatively easily. Once you're thinking about cyber-physical systems, specifying what you want requires some sort of model of the world and some causal structure of how the world works. So one bit of tooling is: create the tooling that allows us to specify what we want, and do that in a scalable way. That's the world-modeling part. There's another bit of tooling which is more about: once you have a spec, how do you scalably take that spec and produce a solution to the problem that was specified, alongside a proof certificate that asserts with high confidence that the solution actually meets those specifications? And then: how do we make that easily parallelizable, how do we make that go well? So giving AIs tooling to do that is another thing we want to do.
B
Yeah. Before we move on: what does this concretely look like? These are pieces of software; should listeners think of these tools as something like a Word program or an Excel program? I'm just trying to understand exactly what's going on. Is it something on the screen? Are you inputting your specified values into a text box? What does this look like concretely?
A
So for the AI agents, the term "harnesses" is used a bunch these days, in terms of what we can give to an AI agent to help it be more capable, to do more things. We give them harnesses to guide them, to allow them to be coherent over longer time horizons, and to use other tools, whether that's searching stuff on the Internet or using an SMT solver or something like that. So I would broadly think of this as building fancy harnesses that allow AI systems to help us build world models and then write outputs and proofs about them. And then there are some elements of: which bits here are really critical for humans to understand and review? In this case it might be that humans should be reviewing the specs, the place where we tell the agents what we want and against which we're getting guarantees. So there are UX/UI questions about how we best do that, so that humans can co-create these specifications.
B
So how is the spec different from a prompt? If you're just a regular user of AI, you'll prompt the system to try to get it to do something for you. What you're doing with a spec is more advanced than that, and presumably it's not expressed in pure text. What is a spec, and how is it different from a prompt?
A
Yeah. The thing we are aiming at is formal specifications, so you can just think of it as mathematics. You want to mathematically specify what you want. This could be some sort of temporal logic: here are some states in the state space that we don't want to end up in. And then what you ask the AI system for is some sort of assurance that, while optimizing for the goal and getting as good an outcome as possible, it can guarantee we're not ending up in those parts of the state space that we have specified. The specs can be the safety specs, where you say: this is everything we don't want to happen. And then on top of that you can also specify the goal: what's my utility, what do I care about?
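A minimal sketch of what such a safety spec can look like in code, in the spirit of the temporal-logic invariant "globally, never enter a bad state." The thermostat world, its dynamics, and the controller are all hypothetical:

```python
# Safety spec in the spirit of the temporal-logic formula G(0 <= temp <= 100):
# "globally, the temperature never leaves the safe band." We certify an
# AI-proposed controller by exhaustively checking every reachable state.

SAFE = range(0, 101)                 # the allowed part of the state space

def step(temp: int, action: int) -> int:
    """World model (assumed known): heater action in {-1, 0, +1}."""
    return temp + 2 * action

def controller(temp: int) -> int:
    """The policy we want to certify (illustrative)."""
    return -1 if temp > 70 else (1 if temp < 30 else 0)

frontier = set(range(20, 81))        # all permitted start states
seen: set[int] = set()
while frontier:
    s = frontier.pop()
    assert s in SAFE, f"spec violated at state {s}"
    seen.add(s)
    nxt = step(s, controller(s))
    if nxt not in seen:
        frontier.add(nxt)
print("invariant holds on all reachable states")
```

Real systems have state spaces far too large to enumerate like this, which is where symbolic methods and solver-backed proofs come in.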
B
Yeah. And when you say guarantee, this is stronger than a legal guarantee. This is a mathematical guarantee that something won't happen.
A
Yes. I mean, there's some nuance here: "guarantee" can actually mean a bunch of different things. It can be fully logical: yes, I have a proof that states, 100%, that this thing is not going to happen. What we are more often dealing with, especially in cyber-physical systems, are guarantees of the form "the likelihood that this happens is less than some value," and then you try to make that value pretty small. You could even get into asymptotic guarantees, where you say that in the limit of spending effort or compute on this, it will converge to some value. So guarantees come in a range. But I would typically say: let's talk about quantitative guarantees as the relevant category, one with a higher degree of rigor than the AI system just telling you, "yeah, I did in fact check the code," or stuff like that.
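One common shape such a quantitative guarantee can take is a statistical bound from testing rather than a full logical proof. A sketch, with illustrative numbers:

```python
# If a system passes n independent trials with zero failures, then with
# confidence 1 - delta its true failure rate p satisfies
#     (1 - p)^n <= exp(-p * n) <= delta   =>   p <= ln(1/delta) / n.
# The numbers below are illustrative, not from ARIA.
import math

def failure_rate_bound(n_trials: int, delta: float) -> float:
    """Upper bound on failure probability after n clean i.i.d. trials."""
    return math.log(1 / delta) / n_trials

# 100,000 clean trials at 99.9% confidence:
print(f"p <= {failure_rate_bound(100_000, 1e-3):.1e}")   # p <= 6.9e-05
```

Note that the i.i.d. assumption is itself part of the world model; if deployment conditions drift, the bound no longer applies, which motivates the runtime checks discussed next.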
B
And we wouldn't see the same types of problems that we see when humans try to give guarantees. You know, there's a bunch of examples in history of people saying this can't happen or this is vanishingly unlikely and so on. Then it happens anyways. Perhaps because we're overconfident, perhaps because we don't have an accurate model of the world. To what extent would AIs face those same problems?
A
I mean, in principle we face the same sorts of problems; there's no perfect security, it doesn't exist. But the way I think about this is that we have to adopt the security mindset about where these errors tend to creep in, and then ask how we can reduce them with a layered-defense approach. A few places errors can come in: you've already mentioned one; we can have a wrong model of the world, and if we have a guarantee relative to that model and the model turns out to be wrong, then our guarantee isn't giving us what we thought it would. Now, how can we layer additional defenses on this? One is that we can iteratively make the world model or the specification better. Before we deploy the solution, can we adversarially test it first, can we come up with scenarios where it turns out the spec was not what we wanted it to be? We can do a bunch of that. We can also have runtime verification, meaning: we have a system that was synthesized to have certain guarantees against the world model; let's deploy it, but as it's operating, observe what is happening de facto in the real world. If we start to observe things that conflict with what we thought were plausible trajectories, that's evidence that something in our world model was wrong, that we made wrong assumptions and our guarantees don't hold anymore. And then you want a fail-safe structure which says: oops, that guarantee wasn't what I thought it was; let's back off to something much more conservative. That would be an additional layer of safety. Exactly how many layers are appropriate, and how exactly you do this, should depend on your context: if I code up a personal website, I care less than if we're trying to secure the energy grid and make sure no one can hack it. So it will be contextual, and it should be layered, because no single approach gives you perfect safety. But the types of errors are pretty similar.
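A sketch of that runtime-verification layer: monitor deployed behavior against the assumptions the guarantee relies on, and latch into a conservative fallback when they are violated. All names and policies here are hypothetical:

```python
# Runtime verification with a fail-safe: if reality leaves the envelope
# the proof assumed, the guarantee no longer applies, so we degrade to a
# simple, independently vetted conservative policy.

class RuntimeMonitor:
    def __init__(self, optimized_policy, fallback_policy, assumption):
        self.optimized = optimized_policy   # synthesized, high-performance
        self.fallback = fallback_policy     # conservative fail-safe
        self.assumption = assumption        # predicate the guarantee rests on
        self.degraded = False

    def act(self, observation):
        if not self.assumption(observation):
            self.degraded = True            # latch: stay conservative
        policy = self.fallback if self.degraded else self.optimized
        return policy(observation)

monitor = RuntimeMonitor(
    optimized_policy=lambda obs: obs["demand"] * 0.97,    # tight setpoint
    fallback_policy=lambda obs: obs["demand"] * 1.10,     # big safety margin
    assumption=lambda obs: 0 <= obs["demand"] <= 1_000,   # modeled envelope
)
print(monitor.act({"demand": 500}))    # in envelope: optimized policy
print(monitor.act({"demand": 5_000}))  # out of envelope: falls back for good
```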
B
Yeah. I can understand how you can have safety guarantees and accurate models of very simple systems, very simple programs, say. But as soon as you start interacting with the real physical world, it seems like you would face some sort of explosion of complexity that means you can't accurately model what's going on. So if you're talking about a piece of code written by AI that's supposed to be plugged into the energy grid, that's a very, very complex system, and it's a system that's both physical and digital, written in code. How do you handle that? How is it going, the project of trying to model different parts of the world?
A
Yeah. The goal we're pursuing is dual. One is that, at the heart of it, we're trying to build out the capability to do these sorts of things at scale and well. And then we're also trying to find ways of checking and demonstrating, on particular examples, whether that works. And we're not giving assurances over the whole world. In the energy grid case, we might be interested in the control system that does the energy grid balancing: how much supply is coming in, how much demand is going out, how do we make sure enough energy is at the right time at the right place, such that the overall property of stability of the energy grid is maintained? This system is not the whole world; it's that control system. Right now a lot of this is done manually, because it's a pretty high-stakes thing. But doing it manually means it's not very optimized, and we keep a lot of slack just to make sure we're not having blackouts, et cetera, and there are a lot of safety layers. The proposition is basically to say: AI quite obviously lends itself to this problem, because it's essentially an optimization problem. But we don't just want to optimize with classical methods, because they don't generalize well out of distribution, and the system might be seeing out-of-distribution occurrences.
B
And that's almost guaranteed, right? In the real world you will see all kinds of one-in-a-million things happen.
A
Right. So you want a world model that's in some sense quite conservative, one that doesn't very narrowly say: here is exactly the demand pattern we had historically, so it should look roughly like that. You instead want to say much more conservative things: what are the physical laws that govern how fast energy can travel here, what frequencies does that imply, what constraints should there be? That's a set of differential equations that govern the physics of it. Then you understand what control knobs you have available, and then you feed in some amount of what you know about what demand and supply tend to be, and that's where you optimize. But you build in enough conservatism that even under relatively intense spikes or surprises, you could still maintain the frequency that keeps the energy grid stable. And then there are societal questions about how much buffer we want. Are we okay with a blackout once in a hundred years, once in twenty years, once in a thousand years? And how does that trade off against the cost efficiencies we get? But once you have a formal world model that specifies the plausible trajectories your system can take, not just the probable ones, you can at least make that question and that trade-off more explicit. You can say: okay, we'll save this much money annually, and we're willing to have a blackout once in a hundred years, or whatever it is; that is a collective-deliberation, social question.
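A toy version of "certify against all plausible trajectories, not just the probable one." The linear frequency model and every number below are stand-ins; a real grid model would be a set of differential equations:

```python
# Check a proposed supply policy against *every* demand level inside a
# conservative envelope, including spikes absent from historical data.
NOMINAL_HZ = 50.0
TOLERANCE_HZ = 0.5       # the societal choice: acceptable deviation
GAIN = 0.01              # Hz of drift per MW of supply/demand imbalance

def supply_policy(demand_mw: float) -> float:
    """Proposed (e.g. AI-synthesized) controller: track demand + margin."""
    return demand_mw + 5.0

def frequency(supply_mw: float, demand_mw: float) -> float:
    return NOMINAL_HZ + GAIN * (supply_mw - demand_mw)

for demand_mw in range(900, 941):            # whole modeled envelope, in MW
    f = frequency(supply_policy(demand_mw), demand_mw)
    assert abs(f - NOMINAL_HZ) <= TOLERANCE_HZ, f"violation at {demand_mw} MW"
print("frequency stays within 50 +/- 0.5 Hz across the demand envelope")
```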
B
Yeah. And we will be much better informed about those questions than we currently are about the trade-offs between cost, performance, and reliability of the systems that we're relying on. Again, there's a question of how much additional cost is imposed by having to do this world modeling and these guarantees, as opposed to just building, as we have now, a grid without these things. Do you have a sense of the additional cost there? And would you think it's higher when you're interacting with a physical system, as opposed to a system that's only digital?
A
Yeah, doing it for a cyber-physical system is more complex than for a purely cyber system, so it will be higher in some sense. But I think the question is what the different considerations pulling in either direction are. One thing we see is that AI adoption, especially in these high-stakes domains, has actually been quite slow and is really tricky. A lot of governments would be very interested in increasing the efficiency of power grid balancing, et cetera, because the costs are really immense, and the same goes for a bunch of other systems. But the adoption problem is pretty significant. So maybe this way of doing it looks like it has more initial cost, but if it's the only way to get to the level of confidence at which we, as a society, are actually happy to adopt these systems into high-stakes environments, then the extra cost isn't actually that meaningful at all. That's one consideration. And then, obviously, it's tricky for society right now to price in all the catastrophes that didn't happen; again, maybe world modeling and AI-enabled collective sense-making tech can help us make those trade-offs better, because we'd have more meaningful guesses about what these numbers actually are. And again, this would be a way of finding positive-sum trades. Lastly, one thing worth flagging is that a lot of this is upfront cost that is actually going to amortize relatively quickly. Once you have a world model of a specific energy grid, you don't need to redo it every time; you might update it over time, et cetera, but you have a set of specifications that can be reused, and that's true in a bunch of different domains. So you might also see some sort of amortization effect.
B
Interesting. So there's the problem of getting accurate world models, specifying what you want, and then having guarantees, measured against the world model, that you're getting what you want. Then there's the whole separate problem of selling this to people in power, policymakers, the general public, because it's not simple to understand what the goal is here and how the technical details work out. How do you present this in a way that's both accurate and also enticing to people in power, people who might have the power to implement something like this?
A
Yeah, I mean, that's very context-specific. In a specific country, it depends on who is doing the energy grid management, for example. And in other cases, in other areas of R&D, all you need to convince is some startup founder who says: oh, that tool is actually great, I want to use it; this is how I'm now going to use AI systems to help me build stuff. So who you need to convince varies a lot, and I think convincing folks with compelling demonstrations is the way to go in this more bottom-up world. In the higher-stakes, critical-infrastructure cases, the barrier to adoption is already high, and there the interesting question is how well we can communicate that this could be a good solution in the language that is already being used. In these high-stakes environments, the notion of safety cases is actually pretty common terminology, and I think you can cash out this entire setup as: how do I give you a safety case?
B
And what is a safety case?
A
A safety case, I think, you can just think of as: what's your carefully presented argument for why the system is in fact safe? So if you're telling me you've generated this neural control system for balancing the power grid: why should I, a regulator or an energy grid operator, trust the system? What are your arguments? And then you walk through it: here is how we set up the world modeling and had an AI system help us generate this control system; here are the proofs we got. The proofs are part of the safety case, but there's also a bunch of argument around it. For example: how does your runtime verification work? What iterations did you go through as you developed the specifications, and how much did you review and test them? That entire package forms a safety case. Safety cases are the language currently in use, and this is an easy way to translate it into that language. On the ground it will be tricky, but not necessarily more tricky than it already is to get AI-based solutions adopted.
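One way to picture a safety case is as a structured tree of claims, each backed by evidence. The structure and field names below are my own illustration, not a standard from the episode:

```python
# A safety case as a claims tree: top-level claim, sub-claims, evidence.
from dataclasses import dataclass, field

@dataclass
class Claim:
    statement: str
    evidence: list[str] = field(default_factory=list)
    subclaims: list["Claim"] = field(default_factory=list)

grid_case = Claim(
    "The synthesized grid controller is acceptably safe to deploy",
    subclaims=[
        Claim("Controller satisfies the frequency-stability spec",
              evidence=["machine-checked proof certificate"]),
        Claim("The world model the proof assumes is adequate",
              evidence=["expert review of assumptions",
                        "adversarial testing of the specification"]),
        Claim("Residual model error is caught at runtime",
              evidence=["runtime monitor with conservative fallback"]),
    ],
)

def render(claim: Claim, depth: int = 0) -> None:
    print("  " * depth + "- " + claim.statement)
    for item in claim.evidence:
        print("  " * (depth + 1) + "evidence: " + item)
    for sub in claim.subclaims:
        render(sub, depth + 1)

render(grid_case)
```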
B
Yeah. Let's talk about AI resilience, broadening out. We've talked about the energy grid; from a bird's-eye view, what do we need in order for the future with advanced AI to be more resilient?
A
Yeah, I think we need a bunch of things, which is also why we need good AI-enabled R&D processes to help us build these things faster. But just zooming out a little to set this up: the reason I like the term resilience is that it doesn't mean nothing bad ever happens. It instead means nothing irrecoverably bad ever happens. We would like to keep playing this game. We'd like to maintain the core functions that make society run the way it currently does, or that make a human body run the way it does; a healthy system is a healthy functioning system. We want to maintain those core functions, and something can go wrong and we can adapt to it and recover; within some bounds that's fine. But we want to be really resilient against the outlier hazards that could happen. And the interesting thing here is to say: evidently, humanity has so far been resilient, because we're still here and functioning, so we have built out resilience in various ways. But with the advent of AI, a lot will be changing, including what it looks like for a civilization to be resilient, and the risk profile is changing. So there's this exciting premise: let's think through what resilience would look like in the age of AI. And if you try to do this in a relatively structured way, you can ask: what are the key functions we need, and what are the key attack vectors where we have vulnerabilities that we need to make more resilient? You could try to list them: we have cyber systems that a lot of our civilization runs on, from the energy grid to the Internet to the financial system, et cetera. We have physical systems that we need as well. We have socioeconomic institutions that we need to work in certain ways. We have human psychology, human cognition, and also collective sense-making and collective coordination. AI is introducing new dynamics in all of these domains. And then there's this interesting task of strategic foresight: saying how exactly AI will change these domains, and what we need to do so that they remain within the bounds of good and healthy functioning.
B
Yeah. I think what the COVID-19 pandemic showed is that we have, as a world, not been optimizing for resilience. In manufacturing, for example, there's been a lot of optimizing to reduce storage costs, doing just-in-time manufacturing, which meant we couldn't manufacture many things we needed during the pandemic. Same with where we manufacture critical things we need during a pandemic. Is this the sort of example we should be thinking about when we think about resilience? This is something that could have been foreseen, it's not historically unprecedented, and it broke a bunch of the systems we rely on to function in the world.
A
Yeah, totally. And I think it's worth thinking about this quite holistically. Yes, in a sense we didn't have some stockpiles we should have had, or we hadn't streamlined certain vaccine approval processes as much as we could have, and we can build some of that now: we can stockpile stuff. But there's a meta-problem here, which is that as societies, just after a catastrophe we say, cool, we're willing to pay this extra cost, and a couple of years in we ask, why, again, are we paying this? We should just stop paying this. So there's a collective sense-making problem here too: how do we coordinate on managing these negative externalities? Solving that in itself would strengthen resilience. So there's lots to do, which is kind of exciting. I've done some work trying to start out at a relatively high level, in particular in biosecurity, where I'm not an expert and am always very impressed by the experts we do have at analyzing the highest-leverage interventions. And in cyber it looks different again. With the advent of AI systems that are really capable of coding, mixed with formal methods, there's the promise of being able to write, especially in high-stakes domains, code that formally meets certain specifications, getting rid of the exploits in the first place rather than just investing in finding and patching them quicker. That's the promise of making the system very defense-favored, and I think that's the really exciting thing we should be aiming for, and we should be very ambitious about getting there as soon as possible.
B
What would be examples of systems that favor the defense?
A
Well, I think of defense-favored as: how can we build out the sociotechnical stack such that civilization is as defense-favored as possible? I like the example of formally verified code to illustrate this. What do I mean by defense-favored, roughly? I mean that if you pour more resources into both offense and defense, defense will systematically outcompete. That's much better than a situation where you need to put in two, three, four, five times as much resource into defense just to keep up. So as much as possible, if we can set up the structure such that a little extra input into defense lets you defend against much more offensive resource input, that's really interesting. And then for each of the critical domains, we should try to envision how we could get society into a defense-favored position there. I think that could be very stabilizing as we go through an AI transition that is, on the face of it, potentially pretty destabilizing.
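One way to make "defense-favored" precise, as a hedged gloss rather than a definition from the episode: let the attacker's resource input be a variable, and ask how the defender's required spend grows with it.

```latex
% Hedged gloss, not a definition given in the episode. Let $c_a$ be the
% attacker's resource input and $c_d(c_a)$ the minimum defensive input
% needed to neutralize it. Call a domain defense-favored when
\[
  \frac{d\,c_d}{d\,c_a} < 1 \quad \text{for all relevant } c_a ,
\]
% i.e., each marginal unit of offense is countered by less than one unit
% of defense. The bad regime described above is
% $c_d(c_a) \approx k\,c_a$ with $k = 2, 3, 4, 5, \dots$
```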
B
One example might be preventing cyberattacks, where you could have AI agents on both sides of that equation, one trying to defend a system and the other trying to attack it. Are there any good options here for giving the defenders an advantage?
A
Yeah, I mean, you can write code that just doesn't have exploits, so there's nothing for an attacker to do. Now, this isn't as easy as that might have sounded. There are side-channel attacks; there's the case where you thought you didn't have a bug and you still had one. But in principle there is a way to write code that just doesn't have those exploitable bugs, and we have some history of side-channel attacks and the like to draw on in making systems like this highly secure. So among all the domains, I think cyber might be the one most easily made defense-favored.
B
This is interesting, because when I've interviewed IT security experts, it's often the case that they're depressed or pessimistic about ever creating safe systems, because it's so difficult to find all the ways these systems could fail or be attacked. Is this something that's changing with AI? Are we better able to create secure code in a way that scales to systems that are actually useful, rather than systems that are secure but can't do much?
A
Yes, I think it is. A few things to say here. One key point: we have examples of very secure code. One example I like to give is DARPA's HACMS program from a number of years back. One of the things that came out of that program is a formally verified microkernel, seL4, which they tested in pretty wild scenarios. I believe they used it in a helicopter and, I think, a quadcopter, and they had red teamers who knew everything about the development process: they had full access to how the system and the software were built. They were also given a foothold on the system from the start, I think access to something like the music-playing system or the camera. And from that position they were told: now try to hack the rest of this helicopter and take it down. They didn't succeed, because seL4 enforced an isolation boundary around that subsystem that they couldn't break out of; formal verification methods provided those guarantees. So we have examples of highly secure code. Now you might ask: if it's so good, why aren't we using it all the time? The reason is that it takes a lot of human effort to write code in this highly secure way. But AIs will be very good at doing a lot of that effort, and they're actually already starting to be very good at using systems like Lean to prove properties. So the argument is: there wasn't as much adoption as you might have expected because it was very expensive, but the cost is coming down. There will be other bottlenecks. Getting the specs right is a bottleneck: you don't prove that you have the correct specs; you have to figure them out and iterate on them. But the answer to that is that we should invest a lot in figuring out how to do specs well. That's how we should do software in the future. And maybe the last thing to say here: the word "safe" can mean a lot of things. If you say "this system is completely safe," it's underspecified what you mean exactly, and ideally we're much more precise. For example, seL4 gives you, in particular, this property of strong isolation. That is a component that makes it easier to reason about the properties of your composite system, about whether it's safe, given what that means in the context. So one might be allergic to claims like "this will just make everything safe," and I'd say: that's just being imprecise. But we can build R&D artifacts with very specific properties, security properties and so on, that stack up in useful ways.
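To illustrate the difference between testing for bugs and proving their absence, here is a minimal sketch using the z3 SMT solver's Python bindings. It verifies, over all possible inputs at once, that a classic overflow-free averaging trick matches its specification. This is a toy, nowhere near seL4-scale verification, and it assumes the z3-solver package is installed.

```python
# Toy illustration of proving a property for ALL inputs rather than testing
# some of them. We verify the classic overflow-free average trick against a
# spec computed in a wider bit-width. Requires: pip install z3-solver
from z3 import BitVec, BitVecVal, LShR, UDiv, ZeroExt, prove

x, y = BitVec("x", 32), BitVec("y", 32)

# Implementation under scrutiny: avoids the overflow in (x + y) / 2.
impl = (x & y) + LShR(x ^ y, 1)

# Spec: the true floor((x + y) / 2), computed safely in 33 bits.
spec = UDiv(ZeroExt(1, x) + ZeroExt(1, y), BitVecVal(2, 33))

# prove() searches for a counterexample; "proved" means none exists
# across all 2^64 input pairs.
prove(ZeroExt(1, impl) == spec)
```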
B
So the specification is important, and it's important that we invest resources into becoming better at writing these specifications. Is it dangerous to have AI help us write them? I'm thinking here that the AI model we're working with has some set of values, and that set of values might compete with or pollute the values you were trying to write into the specification. So if we're partially automating or delegating this work, maybe we end up with a specification we don't fully agree with.
A
Yeah, we definitely need to be very thoughtful about that part, and we definitely should not just close our eyes and let the AI specify things. But I do think there are several things we can do to collaborate with AI, including on writing specs, in a way we can still feel good about. One thing to say is that the current generation of frontier AI systems are actually, in some pragmatic sense, pretty well aligned. They seem pretty friendly to me: if you want to get them to do nasty things, they tend not to want to do that. Obviously there are ways to jailbreak them and so on, so they're clearly not perfectly aligned, but they're pretty friendly. I'm using "friendly" loosely: I think Claude and ChatGPT and so on are generally trying to get what I'm saying. So that's already a good collaborator, and they're already pretty capable. Even with just these systems, I think they give us uplift in writing specs. The second point is that we need to write specs in such a way that humans can review them. That's why, for example, we want to write them down formally and very precisely, so that there is a precise definition of what we want. If you had purely natural-language specs and said, "my specification is that this thing should be safe," that's just imprecise; it's not a good spec. So we need to write specs, maybe with the help of AI systems, in a way that's human-auditable. And then there's so much entrepreneurial and creative energy in the world, and I think one great problem to try to solve is: how can we engineer great workflows for humans and AI systems to write specs at scale, in a way that we can trust those specs? So those are a few of the answers.
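As a small illustration of the gap between a vague spec and a precise, human-auditable one, here is a sketch using the hypothesis property-testing library. Property tests are much weaker than formal proofs, but the spec itself has the right shape; sort_fn is a hypothetical stand-in for an AI-written implementation under review.

```python
# Contrast: "this function should sort correctly" (vague) versus a precise,
# human-auditable spec: the output is ordered AND a permutation of the input.
# Property testing is weaker than proof, but the spec itself is the point.
# Requires: pip install hypothesis
from collections import Counter
from hypothesis import given, strategies as st

def sort_fn(xs):                 # hypothetical implementation under review
    return sorted(xs)

@given(st.lists(st.integers()))
def test_sort_spec(xs):
    ys = sort_fn(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))   # ordered
    assert Counter(ys) == Counter(xs)                # permutation of input

if __name__ == "__main__":
    test_sort_spec()   # hypothesis generates and shrinks many random cases
```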
B
So let's talk about Coasean bargaining and how this might be supercharged with advanced AI. Maybe you can explain the concept, and then we can talk about how AI might help us actually implement it.
A
Cool. Yeah. So Coasean bargaining comes from economic work that asks: what would it look like for multiple agents to coordinate on reducing an externality? The typical example is two neighbors: one of them likes to play loud music, and the other says, no, this is actually costly to me, I don't like this. The question is, under what conditions can they coordinate on a solution that's good and pretty close to optimal for both of them, without, for example, a state or some other external power enforcing a rule? Sometimes this works and sometimes it doesn't. Then you can look into why it is that we sometimes fail to coordinate on reducing negative externalities, or to bargain over them, in a way that's good for everyone. Often this is referred to broadly as transaction costs: there are costs to directly bargaining with each other to find good solutions, and those costs make what could have been an efficient market inefficient. Sometimes this is broken down into information costs, bargaining costs, and enforcement costs. If I want to give you a pair of shoes and you give me a can of milk, and we're both happy because we each get relatively more utility from that trade, it matters that I can trust the trade will occur the way we thought it would. If I really can't trust that, it's harder to trade. And we can think of society as having spent many years building institutions that make finding those positive-sum trades, and engaging in them, easier. For example, legal norms, with, at the end of the day, a state that enforces them: I can take you to court if we both sign an agreement and you don't keep your word. That's enforcement, setting the incentives via the threat of punishment so that it's optimal for agents to keep their word. We also have very soft things, like norms: maybe we know the same set of people, and it would actually be costly for one of us to renege. So humanity has built very rich layers of institutions that help us coordinate on positive-sum trades.
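A quick worked example of how transaction costs kill otherwise positive-sum trades, with illustrative numbers that are not from the episode:

```latex
% Illustrative numbers (mine, not from the episode). Quiet is worth
% \$10/week to the neighbor; playing music is worth \$4/week to the player.
% Any transfer $t$ with
\[
  4 < t < 10
\]
% makes both strictly better off, so the potential surplus is $10 - 4 = 6$.
% With transaction costs $c$ (search, haggling, enforcement), the trade
% survives only if
\[
  c < 6 ,
\]
% so high enough frictions erase an efficient trade entirely. That gap is
% what Coasean-bargaining infrastructure would close.
```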
B
But there are still many of these trades that fall outside that, that are too costly for us to implement. Say, for example, that we live in an apartment complex together and I'm playing loud music every second Sunday. This annoys you slightly, but not a lot. You would, if you could, send me, say, $10 to make me stop, if you could be sure that I would actually stop playing the music. And another cost might be that I perceive it as weird that you want to pay me money to stop playing music. All of these frictions stand in the way of a trade we would both like to engage in. So this brings us to AI. How do you envision AI helping us make more positive-sum trades?
A
In a bunch of ways. But an intuitive starting point is that my AI could go out and negotiate with other AIs representing other people: the neighbor with the loud music, and my seven other neighbors who also do things I'd like to coordinate on, in parallel, scalably, without me needing to invest that time. So that's one: we could do much more of the types of trades I could in principle do myself but can't do all at the same time. But there are even more exciting opportunities. Sometimes trades can't happen because it's not incentive-aligned to share the very information that would let us notice there's a positive-sum trade here. I might not want to leak certain information, even though, if we both mutually knew that we were interested in that exchange, we would want to do it. So you could imagine an AI agent that represents me and knows these things, and an AI agent that represents the other person and knows their private information. Especially if we find ways to meet strong enough information-security standards, they could exchange that information with each other without either of the humans ever learning it. What used to be a trade that wasn't accessible to us, because we didn't want to share that information, becomes accessible, because we can share the information without actually sharing it. And maybe the third category: if we build the tools for it, an agent could in principle make much more credible commitments, or credible commitments in contexts where I as a human couldn't. They could literally build a system that can't do anything other than the thing it claims to do, and I can show you the code of that system, and you can verify that, yes, this system does exactly that thing and nothing else. That's a very credible commitment, and again it might unlock positive-sum trades that otherwise wouldn't have been accessible.
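Here is a minimal sketch of that second mechanism: two delegates reveal their principals' private reservation values only to a mediator, which reports just the outcome. In practice the mediator would be a secure enclave or cryptographic protocol; here it is a plain function, and all names and the even-split rule are hypothetical.

```python
# Minimal sketch: delegates share private reservation values only with a
# mediator, which reveals just the outcome, not the values themselves.
# In practice the "mediator" would be a secure enclave or MPC protocol;
# this plain function only illustrates the information flow.

def mediate(buyer_max: float, seller_min: float) -> float | None:
    """Return a deal price if a positive-sum trade exists, else None.

    Neither principal ever learns the other's reservation value; they
    only learn the price (here: an even split of the surplus).
    """
    if buyer_max <= seller_min:
        return None                      # no mutually beneficial trade
    return (buyer_max + seller_min) / 2  # split the surplus evenly

# The neighbor would pay up to $10/week for quiet; the music player would
# accept as little as $4/week to stop. Neither figure is ever disclosed.
price = mediate(buyer_max=10.0, seller_min=4.0)
print(f"deal at ${price}/week" if price else "no deal")
```

The even split is an arbitrary choice; the point is only that the trade-finding step can run on information neither human ever sees.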
B
Amazing. All right, do you want to point listeners in any directions at the end of this episode? We've been talking about a bunch of topics, and listeners who are still with us might be very interested in them. Are there places they should look for more information? I will link everything in the description.
A
Cool. Yeah, there are a few. They can check out the ARIA website; we have a few different programs there. The Safeguarded AI program is doing some of the work discussed. There's another opportunity space, called Trust Everything Everywhere, that is doing some of the Coasean-bargaining-related multi-agent stuff we mentioned. So there are some interesting resources there. There's a website called AIResilience.net, which I worked on together with my collaborator Eddie, where we started to think through what resilience would look like in the age of AI: both what the exciting endgames are, where we've maybe worked out good ways of making ourselves more defense-favored, and what R&D problem priorities seem promising to work on. For folks interested in the Coasean bargaining story, there's a cool article by Seb Krier called, I think, Coasean Bargaining at Scale, or something like that; "Coasean" is in the title. Yeah, maybe that's a good collection.
B
Great. Final question here. So we've been talking a lot about AI and listeners to this podcast are very interested in AI. Could you recommend a book or an article or a movie or something that's not about AI, but that's relevant to the world that we're entering?
A
There are a bunch of them. I guess I'm going to go with recency bias: Seeing Like a State, by, I think, C.S. Lewis? It's a great book that deals with some of those themes of excessive centralization versus excessive decentralization, and with navigating the political-philosophy and political-economy questions of how we can coordinate with each other: the role of the state, the ways the state does some really helpful things to help us coordinate, and the ways it does maybe somewhat harmful things. It's a really interesting book.
B
It's the book from the late 90s, right?
A
Yep.
B
Yeah. It's by James C. Scott.
A
James C. Scott. Wow. I said that really wrong. Yeah.
B
Any other recommendations?
A
Well, I did like the book called Underground Empire: How America Weaponized the World Economy. There are definitely some interesting geopolitical insights; I found it a very fascinating book. It talks about how the infrastructure of the modern financial world gave the US, especially in the 20th century, a lot of soft power over the world, because so much of global finance ran through the US. That was really interesting.
B
Great. Nora, thanks for chatting with me. It's been really interesting.
A
Cool. Thank you so much.
Episode Title: How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)
Date: January 7, 2026
Host: Future of Life Institute (B)
Guest: Nora Ammann (A), Technical Specialist at the Advanced Research and Invention Agency (ARIA), UK
This episode explores the risks of advanced artificial intelligence (AI) and strategies for safe governance, featuring Nora Ammann. The central discussion revolves around two critical AI failure modes—domination and chaos—and how humanity can steer progress to avoid them. Nora shares insights from current research and practical work at ARIA, including the development of technical tooling for scalable oversight and high-assurance AI. The conversation also tackles coalition-building between humans and AIs, resilient societal infrastructure, and the promise of AI-enabled bargaining and economic coordination.
Key Concept: Rather than a sudden technological leap (“foom”), AI progress is marked by multiple inflection points and incremental acceleration—termed a “slow takeoff.”
“If we use the next few years well, human-AI teams will be collectively very capable.”
– Nora Ammann [07:06]
Failure Modes:
“There’s sort of two clusters of scenarios that I’m worried about: failure through domination, and failure through chaos.”
– Nora Ammann [08:26]
Core Solution:
Developing coalitions that combine human and AI strengths, leveraging AI capabilities without ceding unchecked control or relying on blind trust (17:17).
“The crucial question is, how can we arrive at justified confidence in those outputs?”
– Nora Ammann [17:56]
Resilience, Defined: Not about preventing all failures, but ensuring that no irrecoverable failures occur and that society/civilization can recover and adapt (57:15).
On AI Progress:
“I expect there to be several inflection points… Each inflection point there’s sort of a pickup of the speed of acceleration itself.” – Nora [01:49]
On Strategic Intervention:
“The next two to four years will be pretty path-defining.” – Nora [05:58]
On Coordination Risks:
“Excessive technological risk as well is a negative externality that is hard for a group of actors to coordinate on managing appropriately.” – Nora [13:01]
On AI Oversight:
“It’s really, really valuable to put a lot of effort right now into making these systems more steerable and building scalable oversight mechanisms.” – Nora [07:46]
On Verification:
“There’s no perfect security, doesn’t exist… We sort of have to adopt the security mindset about where these errors tend to creep in.” – Nora [44:44]
On AI-enabled Code Security:
“We have examples of highly secure code… but a lot of that effort, AIs will be very good at doing… The cost here is coming down.” – Nora [65:04]
On Human-AI Specification Collaboration:
“We need to write specs, maybe with the help of AI systems, in a way that’s human-auditable… We need to be very thoughtful about that part.” – Nora [68:39]
Application in Real Infrastructure:
Nora’s example of using AI-assisted formal methods to secure power grids highlights high-stakes, real-world implications (47:45).
Resilience as Amortized Investment:
Upfront modeling and specification costs can pay off by making advanced systems reliable, scalable, and widely adoptable (51:47).
Coasean Bargaining Supercharged:
Vision of AI agents negotiating on behalf of people/entities, lowering barriers to beneficial deals, and supporting greater economic and societal coordination (74:19).
Nora Ammann and the FLI podcast highlight a pivotal moment in AI development: humanity’s window to build structures steering AI toward prosperity, not disaster. By framing the risks as domination and chaos, and advocating for robust coalitions, high-assurance oversight, and defense-favored public goods, Nora charts an actionable, though challenging, path ahead. The episode is a call for urgent, creative, and collaborative technical and institutional efforts to ensure AI augments human flourishing for the long-term future.