
On this episode, Daniel Kokotajlo joins me to discuss why artificial intelligence may surpass the transformative power of the Industrial Revolution, and just how much AI could accelerate AI research. We explore the implications of automated coding, the cr
A
Think about the smartest humans, the best humans at any given field, like John von Neumann. Their brains are not very big and their brains were not even trained on that much data. That proves that it's in principle possible to have a relatively small rack of GPUs running a simulation of a John von Neumann level intelligence. If the company had published: here's how powerful our AIs are getting, here's all the eval results of what they're capable of, the goals and values that they're supposed to have, here's a description of our alignment technique, here's some stuff on, like, how we're going to check if it's working. If they just, like, published all that stuff, and they have that stuff internally, then outside scientific experts could read it and critique it. But if instead you just make these sort of vague announcements about how for national security reasons, blah, blah, blah, blah, blah, then, like, they don't have anything to work with, you know, can't actually contribute.
B
Daniel, welcome to the Future of Life Institute podcast.
A
Thank you. Happy to be here.
B
All right. Why do you expect the impact of AI to be enormous over the next decade?
A
Several of these companies, Anthropic, OpenAI, Google DeepMind, are explicitly aiming to build superintelligence. Superintelligence is an AI system that's better than the best humans at everything, while also being faster and cheaper. That's why I think the impact of AI will be enormous. I think that if you just meditate for a bit on all the implications of them succeeding at that before this decade is out, you'll see that it will in fact be the biggest thing that's ever happened to the human species.
B
Is there a way to express the magnitude of change here?
A
Well, it'll possibly be the end of the human species, for example. And if it's not the end of the human species, it will be a transition from humans basically running the show.
B
Core to this prediction of rapid AI progress is the notion of AI beginning to speed up AI research itself. How much should we expect AI to speed up AI research?
A
We are quite uncertain about this. AI 2027 represents our sort of best guess about quantitatively what this would look like. The way that we think about it is we break down AI capabilities into a sort of ladder of capability levels, and then we ask how long each level will come after the previous level. And I forget exactly what we have in AI 2027, but it's something like six months to go from an autonomous superhuman coder to an autonomous agent that can completely automate the AI research process as well as the best AI researchers while also being faster and cheaper. Like six months there, and then a couple more months to get to superintelligence for the domain of AI research. So qualitatively better than the best humans at everything related to AI research while also being faster and cheaper. And how much qualitatively better? Well, we said something like two standard deviations, I think, or maybe we said twice as far above. Take the gap between the best human and the median human researcher. I think we said twice that gap. And then broad superintelligence would be like that, except for everything, not just for AI research related tasks. And I think that's like, you know, a month or two beyond that. I forget exactly what we say. If you're interested in the actual numbers, you can go read AI 2027 and look at our attached takeoff forecast, which has little back-of-the-envelope estimates for all of these things. Again, with lots of uncertainty. But quantitatively, it's something like that. The bottom line being we go in about a year from AI systems that are able to operate autonomously and successfully automate the job of programmers. Basically you can treat them as like a remote worker who is a software engineer and who's really good at their job. It takes about a year to go from that to superintelligence, according to our best guess. But it could go five times faster, for example, or several times slower.
B
How much does it matter whether it goes five times faster or say it takes twice as long for the end date of reaching superintelligence?
A
Well, the takeoff speed is very important for the overall dynamics of how this goes down, right? So let's say we fix the date of superintelligence as January 1, 2029. And then we vary the takeoff speed such that in one world we get to the autonomous superhuman coder milestone two months before, and in another world we get to the superhuman coder several years before, such as by the end of 2025. Those worlds are very different. In the first world, it's just going to hit humanity like a truck. And the President might not even know that the AIs have automated AI research within the company when the superintelligences already exist. In fact, theoretically at least, the company might not even know. They might still think, oh, there's this really exciting project where we've taken our latest coding model that still hasn't been released to the public and we've had it, you know, do a bunch of AI research. And then, oh, whoops, superintelligence, you know, now it's hacked its way out of the servers, now it's taking control of everything. Right? That's what the sort of two month world looks like. And then the four year world looks completely different. Obviously it looks like this crazy race between companies, much like today, where everyone can plot lines on graphs and see that their AIs are incrementally getting better and better and more autonomous and closer to closing the full research loop. And then they do close the full research loop and they're completely automating the research, but it's not immediately getting them to superintelligence. They're sort of watching the lines start to bend upwards on all the graphs. But it's going slow enough that humans are able to watch it and talk to each other about it and make products and make announcements to the public. And there might be whistleblowers and there might be multiple companies that are sort of reaching similar levels and watching those lines go up. 
And my guess is that most people at the companies and in the government are basically keeping their heads in the sand about the first world, just sort of telling themselves it's not going to happen, and are instead planning for something more like the second world. So I think AI 2027 is a bit faster takeoff than what most people in the government and in the companies are planning for. Yeah, it's a less convenient world, I think, but yeah.
B
How does the AI research and development multiplier work? Because at various points in the timeline, you have AI research going at say 100x the current pace, or 250 times the current pace. Could you explain how this could possibly happen?
A
Good question. So I'm not sure if it's technically quite right to say like 200 times the current pace. The multiplier is relative to a counterfactual in which you didn't use AIs for the research. So we think that the superhuman coder milestone would be a 5x multiplier, roughly, and the superhuman AI researcher would be like a 25x multiplier. And then it doesn't stop there. As you ascend to higher levels of superintelligence, you get up to like, you know, a 2,000x multiplier and stuff like that. But what that means is: take the situation where you have the superintelligence, and then imagine that somehow it was banned from doing AI research and you brought in the regular human corporation to pick up where it left off and keep doing the research. Then things would go 2,000 times slower, is the idea. And so it's important to note what we're not saying. Take any given trend in some metric, like for example compute efficiency, how much training compute it takes to get a model of a given performance level. Over the last five years, let's say that's been cutting by a third each year or something, and then we get this 2,000x multiplier. We're not saying that it's going to go through 2,000 cuttings of a third over the course of one year, such that you end up with a one-parameter model that's as good as GPT-4 or whatever the math would work out to be. No, obviously there's diminishing returns. This particular metric of compute efficiency would hit diminishing returns after a few more orders of magnitude and top out. But that doesn't matter for purposes of calculating the multiplier, because the multiplier is relative to how long it would take for the human scientists to do it, if that makes sense. 
So in other words, you'd top out the diminishing returns in compute efficiency in a few weeks instead of in a few decades, for example.
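To make the multiplier arithmetic concrete, here is a minimal Python sketch. It assumes the simplest possible reading of the definition given above: a body of research that would take unassisted human scientists some number of years takes that number divided by the multiplier when AIs do it. The multipliers (5x, 25x, 2,000x) are the ones quoted in the conversation; the 20-year workload and the names `calendar_years`, `MULTIPLIERS`, and `HUMAN_YEARS` are hypothetical and chosen only for illustration.

```python
DAYS_PER_YEAR = 365.25


def calendar_years(human_years: float, multiplier: float) -> float:
    """Calendar time needed when research runs `multiplier` times faster
    than the no-AI counterfactual (simplifying assumption: constant multiplier)."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return human_years / multiplier


# Multipliers quoted in the conversation.
MULTIPLIERS = {
    "superhuman coder": 5,
    "superhuman AI researcher": 25,
    "higher levels of superintelligence": 2000,
}

# Hypothetical workload: research that would take human scientists 20 years.
HUMAN_YEARS = 20.0

for name, multiplier in MULTIPLIERS.items():
    days = calendar_years(HUMAN_YEARS, multiplier) * DAYS_PER_YEAR
    print(f"{name} ({multiplier}x): about {days:.1f} days")
```

Under these assumptions, the same 20 human-years of work takes four calendar years at 5x and under four days at 2,000x. The sketch says nothing about where a given trend tops out; it only rescales how long it takes to get there.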
B
Yeah, I guess that's a good way to get an intuitive grasp of what it would mean to say, speed up the pace of AI research. So one question here is how long do you think it would take with unassisted human AI research to reach superintelligence?
A
95 years to go from SIAR to ASI, and 19 years to go from SAR to SIAR. So SAR, the superhuman AI researcher, is an AI system that can do the job of the best human AI researchers, but faster, and cheaply enough that you can run lots of copies. And then the superintelligent AI researcher, SIAR, is like that, but qualitatively better than the best human researchers. So we were thinking 19 years to go from SAR to SIAR if you were just using ordinary human scientist progress, and then an additional 95 years to go all the way to artificial superintelligence. Obviously massive uncertainty about all of these numbers. These are our guesses as to sort of what the pace of ordinary human scientific progress would look like. Now, to be fair, part of the reason why we set things up this way is that it's impossible not to have some subjective guesses in a model trying to predict what the singularity will look like, because we just don't know what the singularity will look like. And we don't have enough evidence to sort of pin down exactly what it's going to be like. So we have to pick some variables and make some guesses about them. And my thinking is that our intuitions about how long it takes ordinary science to accomplish things are at least somewhat grounded in the last 50 years of human science and the last 20 years of artificial intelligence science and stuff. But I definitely wouldn't put too much weight on them.
B
So what's happening in AI 2027 is that you have what would otherwise have been decades of AI progress being compressed into several years.
A
That's right. Literally the way that the model works is we think, okay, we query our intuitions for how long it would take ordinary human scientists working in ordinary human corporations to get from this milestone to that milestone. And it's like, oh, maybe like 20 years. You know, this feels like a substantially more powerful type of AI than this type, but they are getting more powerful fast. I mean, look at the last five years, but still this is like a big gap. So like, maybe it takes 20 years, you know, and then we're like, okay, but then the multiplier shrinks it down a lot.
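As a rough illustration of that compression step, the Python sketch below takes the two human-pace gaps quoted earlier in the conversation (19 years from SAR to SIAR, 95 years from SIAR to ASI) and divides each by a multiplier. Two things here are assumptions and not the AI 2027 forecast model: the multiplier is held constant across each gap at an assumed starting value, and the 250x figure for the second gap is borrowed from the host's earlier mention of "250 times" without the transcript tying it to SIAR specifically. The names `Gap`, `GAPS`, and `compressed_years` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    name: str
    human_years: float  # guessed time for unassisted human researchers
    multiplier: float   # ASSUMED constant speedup over the whole gap


def compressed_years(gaps: list[Gap]) -> float:
    """Total calendar years once each gap is divided by its multiplier."""
    return sum(gap.human_years / gap.multiplier for gap in gaps)


GAPS = [
    Gap("SAR -> SIAR", human_years=19.0, multiplier=25.0),
    # 250x is an assumption for illustration, not a figure stated for SIAR here.
    Gap("SIAR -> ASI", human_years=95.0, multiplier=250.0),
]

total_human_years = sum(gap.human_years for gap in GAPS)
total_calendar_years = compressed_years(GAPS)

for gap in GAPS:
    print(f"{gap.name}: {gap.human_years:.0f} human-years "
          f"-> {gap.human_years / gap.multiplier:.2f} calendar years")
print(f"total: {total_human_years:.0f} human-years "
      f"-> {total_calendar_years:.2f} calendar years")
```

With these inputs, 114 human-years shrink to a bit over one calendar year. Because the multiplier would plausibly keep rising within each gap, a fixed starting value probably overstates the calendar time, so this is best read as a sanity check on the order of magnitude.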
B
Yeah. But how can this happen? Won't computational power be a bottleneck? Won't it be the case that before you can get to next-level AI, you will have to build a cluster, you will have to source the chips? That all takes time. It's a physical process in the world.
A
Our estimates of how long it would take are based on supposing that scaling up stops. So it's like supposing that there's sort of an AI winter and they stop massively increasing their number of GPUs, and they basically keep the same number of GPUs they have now, which would slow things down. But we're imagining that hypothetical. And the reason why we use that hypothetical, where GPU growth is stopped, is that it's the way to make it an apples-to-apples comparison to the case where it's AIs speeding everything up, because they're not able to speed up the acquisition of new compute very much.
B
So the thing they're speeding up is something like algorithmic progress. How do we know how much extra AI progress can be made from speeding up algorithmic progress?
A
I mean, it sounds like you're asking a question about the limits of algorithms. And I would say those limits are extremely far away from where we are now. We're nowhere close to the limits of what you can do with compute. So here we could talk about the analogy to biology. Think about the smartest humans, the best humans at any given field, like John von Neumann. Their brains are not very big and their brains were not even trained on that much data, you know. And so that proves that it's in principle possible to have a relatively small rack of GPUs running a simulation of a John von Neumann level intelligence, and you could train it with a relatively small training run. At least in principle, we have that existence proof, if only you figured out what was going on in John von Neumann's brain, what the hyperparameters were, basically, that made him learn so fast and so forth. So that already proves that you could get to something that's, I guess, like our superhuman AI researcher milestone. But then also, John von Neumann's brain had all these issues, right? It was a wetware machine that had all these extra physical constraints that it had to work around, which you wouldn't have to deal with if you had the freer design space of an artificial simulated brain. You could, for example, just add like 100 times more parameters, which would be a big deal. You also wouldn't have to worry as much about being able to heal damage and stuff like that. So there are so many ways in which you could probably amp up the power, starting from the John von Neumann brain. Also, the way that they can have many copies that then learn from each other's experiences, that's a huge deal that John von Neumann can't do, but artificial brains can.
B
And then.
A
Yeah, so I'm pretty confident that even without acquiring any more compute, just using the existing GPU fleet, it is at least in principle possible to gradually work your way towards something that qualifies as true superintelligence. It's definitely not the case that you literally physically need more compute to have a superintelligence than this.
B
One key question here is how much the world changes if we get to something like superhuman coding abilities. How much is that able to affect what happens outside of the data centers, is one way to phrase the question. Normally, innovation happens kind of gradually, and you need it to spread throughout society in a broad sense. And that takes a lot of time before you can have the kinds of transformations that you're forecasting in AI 2027. Why is what you're forecasting different from what has happened historically?
A
If you want to get a better sense of what this looks like, you can read AI 2027. But the summary is that it's partly due to just the speed of this transition and partly due to the fact that the companies will be focusing on doing this intelligence explosion faster rather than transforming the economy. And so they're going to be doing training runs and stuff that focus on teaching AI research skills instead of on teaching lawyer skills or therapist skills or whatever other skills you'd have in the economy. The result is that the economy is mostly looking the same as it is today, with a few exceptions, when there's an army of superintelligences that's been created. And then, you know, the army of superintelligences goes out into the economy and transforms it. But it's not a very gradual transition. It's going to be like getting hit by a truck, so to speak, in terms of the scale and rapidity of the transition. An analogy, I think, would be the history of colonialism. There might have been some parts of the world where it was quite gradual. First they came on the ships and they set up trading ports, and then they gradually did a lot of technology transfer and maybe some immigration. And then gradually, centuries later, there's this integrated society that contains a bunch of European settlers and also a bunch of natives. And also the technology level has risen and it's all integrated. But then there were other parts of colonialism where it was like, the Europeans came, they conquered, they brought their own people, they built their own cities, they set up their own factories. And then they just pushed the natives out of the land, right? And I think it's going to be looking something more like that. Because even if it's peaceful, even if it's completely nonviolent, you've got the army of superintelligences. 
Consider some random industry, like B2B SaaS or machine engineering for manufacturing and 3D printing or whatever. Pick an industry. And then all of a sudden, there's this army of superintelligences. How are you going to compete with that? You're not going to compete with that. They will just wipe the floor with you insofar as they devote any attention at all to competing with you. And they'll just be limited by how much compute they have to do all of the stuff. And they're probably not even going to bother directly competing in most industries, because that's not even their best available option. The best available option is to just build a completely new self-sustaining economy, you know, in special economic zones where they don't have to worry about the red tape and they don't have to worry about all the fiddly little bits of competing in the industry. And they can just bootstrap to their own robot factories, robot mines, robot laboratories to do more experiments so they can get better robots, et cetera. And of course they'll still be interacting with the human economy, but it'll be more like they accept raw materials as input, and some manufactured goods, so that they can go faster. And in return they give IOUs of various kinds, you know, like promises of equity or whatever. Maybe they do some software products that are cheap for them but utterly transformative for the human economy. Maybe they do some hardware stuff if they really need to. But yeah.
B
Why do you expect this all to end badly for humanity?
A
Again, you can read AI 2027 if you want the answers to this, but after the army of superintelligences is in charge of everything, it becomes really important whether they were actually aligned or whether they were just faking it. And unfortunately it's quite plausible, and I would say even probable, that they will just be faking it, because our current techniques for understanding and steering and aligning AI systems are quite bad. They don't even work on the current AI systems. Current AI systems lie and cheat all the time, even though they're trained not to do that. And if the future paradigm looks anything like the current AI paradigm, we won't actually be able to tell what goals they actually have. We'll just be sort of looking at their behavior. And unfortunately, no matter how nice their behavior looks, that doesn't distinguish between the hypothesis that they actually have exactly the goals that we wanted them to have and the hypothesis that they have some other goals and are playing along because it's in service of those goals to play along. And there's a lot more to say about this topic than this, but I guess one other thing I could say is that because we can't sort of read and write to the goals directly, and to the sort of inner thoughts of the AIs, we are stuck on the outside doing this sort of behavioral training where we look at how it behaves and then reinforce it based on that. And it's just so incredibly easy, it's the default outcome, to have a training setup that doesn't reinforce exactly the things you want to reinforce. Right? Like you're trying to whack it whenever it is dishonest. But since you can't actually tell what it actually thinks, you accidentally whack it sometimes for saying things that it actually thinks are true. And sometimes you reward it for saying things that it didn't think were true. 
And so you're actually not training it to be honest, you're training it to be dishonest in a certain sort of way. This is very normal. And unfortunately this is what you're stuck with if you're doing alignment in anything like the current paradigm. There are lots of ideas for how to improve on this, to be clear. You can go talk to alignment researchers and they'll have all sorts of ideas for how to fix these problems. But the ideas tend to come at a cost. If you employ their fancy technique, it takes more compute and you get an AI that's somewhat less capable, for example. So even if the techniques work, and we don't know that they work yet, we would have to test them a bunch, and I'm not even sure how you'd know if it was working. But even if they are techniques that are going to work, you have to politically convince the relevant leaders to take that hit and make that trade-off and slow down, basically. Yeah.
B
And that's, I guess, the difference then between the race scenario and the slowdown scenario in AI 2027. Basically, whether we have time to implement new alignment techniques or to do this properly, where we're making sure that the AIs are acting in our best interest, is in turn determined by whether we are in a race between companies and between countries.
A
I mean, I wouldn't say it's determined by it. I think that people can still be ethical even if their incentives push in other directions. But I wouldn't bet on it.
B
Yeah. Which you yourself have proven, basically.
A
I think perhaps. Yeah. Thank you. But back to what we were saying. Yeah. In AI 2027, there's this sort of one choice point where the story branches, and the choice point is basically: do they slow down to implement some costly alignment techniques, or do they just implement the least costly alignment techniques, which don't actually slow them down that much but also don't work, though they don't know that they're not working? And so that's how you get the world where the AIs are secretly misaligned and then the world where the AIs are actually aligned. And we get into some technical detail in AI 2027 describing the nature of the choice that they're making and the particular alignment strategy that they use. But the thing I'd want to say here is that actually it's going to be multiple choices like that. Basically there's going to be an extremely exciting and stressful year where a whole series of choices like that are made by the leaders of the AI project. Choices that basically look like: we could design our AIs in this way, which would make them, you know, really smart, really fast, et cetera, or we could do it this way, which is safer because it's more interpretable or something like that, but they're not as smart, not as fast, more expensive, et cetera. There'll be a whole series of choices like that. And part of my pessimism about how this is all going to go is that I just expect them to pretty much consistently make the choice to go faster rather than the choice to slow down, because of the race dynamics and because of the character of the people running this show. I think that that's sort of the type of choice they've made thus far, and I expect that to continue, basically. So I'm not at all saying that alignment is impossible technically. I think it's definitely a solvable problem and there's a bunch of good ideas lying around that people are working on for how to solve it. 
But it's 2027 and I predict that it won't in fact be solved because the leaders of the relevant companies will be too busy trying to beat each other.
B
How much does transparency matter here? And I mean transparency of the AI companies themselves. So public insight into what they're doing, how they're doing it, how well they're succeeding at their stated goals.
A
I think it matters a lot. It's my go-to recommendation for what governments and companies should be doing now. AI 2027 illustrates a situation where basically all the important decisions are being taken behind closed doors by the CEO of a company, possibly in consultation with the President's advisors, who might be looped in, but not in consultation with the scientific community or with the public or with other companies or with outside expert groups and nonprofits and so forth. And the reason why they're not in consultation with those groups? Well, I guess partly they just don't feel like they have to. But partly also it's that they've been keeping things secret by default. You know, they've occasionally published a new product or occasionally made an announcement, and at one point there is a whistleblower. But broadly speaking, the default is: of course we don't tell people about what's going on inside our data centers with all the AIs automating the stuff and whatnot. That would leak information to China and to our competitors, and we don't want to do that. So things are sort of secret by default. And what that means is that people on the outside, including the scientific community, are stuck sort of guessing as to what might be going on inside and can't meaningfully contribute on a scientific level to making it safe. Right? Imagine if the company had published: here's what we're doing. Here's how powerful our AIs are getting. We're putting them in charge of all these processes. We're also putting them in charge of these processes. Our plan is to have them do more AI research. Also, here's all the eval results of what they're capable of. Also, here is the spec that we're trying to train them to have and the goals and values that they're supposed to have. Also, here's our safety case. Here's a description of our alignment technique and an argument with assumptions and premises for why we think our alignment technique is going to work. 
And maybe also here's some stuff on how we're going to check if it's working. If they just published all that stuff, and they have that stuff internally, documents like this exist internally for managing the whole thing, then outside scientific experts could read it and critique it and could say, oh, this assumption is false, or I see a way that this could be disastrous. Like, what if the following conjecture is true? Then your evals would come back positive even though it would be a false positive, you know, even though actually things are dangerous. Right? So there could be all this scientific progress being made if you only roped in all these people on the outside. But if instead you just make these sort of vague announcements about how our AIs are getting very powerful and for national security reasons, blah blah, blah, blah blah, and that's why we're doing this merger with, you know... Then they don't have anything to work with, can't actually contribute.
B
Is this already happening? So to what extent is the frontier of AI development already happening in secret?
A
It is already happening. AI 2027 is basically what happens if we don't change things dramatically from the current status quo. Right? Currently, by default, everything is secret in the companies. And then they can sometimes choose to publish things. They might publish a paper on some alignment technique that they tried or whatever, and I'm happy that they are publishing some things. That's better than nothing. But I think we have a lot more that we need to do. If you think about it, humanity has, you know, what is it, 7 billion people in it, and maybe something like 700 people who have expertise in superalignment, you might say: 700 people who've actually spent at least a year working on how do we understand, control, steer, and align AGI-level systems and above. And maybe something like 70 people who are really good at it as opposed to just competent. And meanwhile, how many people are going to be actually at one of the companies? Each company has only a tiny fraction of those people. And moreover it's a sort of biased group, right? It's not a representative sample of all 700 people. It's particularly people who are at the company. There's more of a groupthink risk and a sort of incentives risk. Even if there's something like 10 people working on this at the company, it's very easy to imagine them all falling into a sort of groupthink trap and being biased towards an overly optimistic conclusion. So both quantitatively and qualitatively, I think we would be much more likely to figure out the technical stuff if there was this transparency. And then that's not even taking into account the fact that governance-wise, things are a lot better if there's transparency. So that was just on a technical level. 
How much human brain power do you have trying to look at the warning signs, look at the evidence, and figure out good techniques for keeping things aligned? But also I think that if there was transparency into what was happening, then other groups like Congress would wake up and, you know, demand more answers and start to negotiate regulations and treaties and things like that. Right? So you'd be much more likely to get an actual change in the incentive landscape for the race, and to get an actual sort of easing up of the race conditions and a bit of a slowdown that enables more time to solve the technical problems, if only people knew the stakes and sort of knew what was happening inside these projects. And then finally, even if you're not at all worried about the alignment stuff and you think that that's all just going to be trivial, there's the concentration of power stuff. So it's an incredibly important precedent: who controls the AIs, and what goals and values does the army of superintelligences have, and whose orders are they listening to? And unfortunately right now we are on a trajectory where not only is it the case that, like, the CEO gets to decide, but it's also the case that it can be secret. You know, there can be literal hidden agendas that the AIs have. And this has happened, I think, at least twice that I know of, sort of splashy, scandalous examples that you've probably also heard about. One was the Gemini racially diverse Nazis thing, and the other was Grok being instructed to, I think, not criticize Elon Musk. I forget exactly what it was. Or Donald Trump. 
So those are both examples of the company putting a hidden agenda into the AIs, you know, and having them pursue this sort of somewhat political agenda and not tell the users about it. And that's all fine and funny when we're just talking about chatbots, but if you have a literal army of superintelligences, it's deadly serious if the CEO can be giving orders to that army and nobody knows what those orders are. So also for the concentration of power reasons, it's really important that, for example, companies be required to publish their spec of, like, what are the goals and values that we're putting into the AIs, or that we're at least trying to put into the AIs, and what's our command structure for who gets to say what to them? And, you know, probably we should log interactions with AIs as they get smarter, so that there's a paper trail if someone is basically using the AIs to try to accumulate power over their rivals internally.
B
Would you expect more models to be deployed only internally at the companies? Why are the incentives set up such that only deploying internally is the most valuable option?
A
Yeah, so this is about takeoff speeds basically. So consider the, like, two month takeoff that we talked about. In that world you don't really need more investment. And if you did get more investment, it wouldn't even be helpful. Like if you managed to get some very rich, clueless investors to give you another $100 billion, you couldn't actually translate that into more compute on very short timescales, you know, because it takes time to negotiate these deals and you need to build the new data centers and then you need to make sure that they're secure and integrated into your network without causing vulnerabilities and stuff. So basically, there'd be almost no need to try to raise more money in that world. And so you would actually be incentivized to stop selling products, you know, like, why have half of your compute serving customers when you could instead be using that compute to do research and to go even faster? By contrast, in the, like, five year world, things are slow enough that you kind of need to keep this flywheel going of making products, making money, attracting investors so you can get more compute, so you can make more, et cetera. I think that AI 2027 depicts this more intermediate world where it's sort of unclear how it would go, but I think it could go either way. And I think that we could very well see companies basically devoting more and more of their compute to internal R&D rather than to serving customers, even though the profitability of serving customers would be rising, even though their new AIs would be really powerful and could make a ton of money on the market.
B
Just because it's even more valuable to use the compute you have and the models you have to further improve AI.
A
Exactly.
B
Even if these models are only deployed internally, how do you manage them? How do you control them? How do you oversee a team of AIs that are thinking much faster than you are, that are simply better at coding than you are?
A
Do you mean what are they going to try or is it going to succeed?
B
Let's talk about both. Perhaps you can explain what they're likely to try and then why that probably won't succeed.
A
So they'll probably have lots of monitoring, right, where they have older AIs looking at all the transcripts of actions taken by the newer AIs and trying to flag anything that looks suspicious. So there'll be this sort of, like, AI police state of AIs watching other AIs, et cetera. And humans will be sort of embedded in that at some level. They won't have enough capacity to actually look at everything, far from it, but they'll be reading summaries and investigating particular cases and stuff like that. So maybe that's an answer to the monitoring question or the oversight question. And then as far as the alignment techniques, I mean, insofar as you see examples of egregiously misaligned behavior, or ambiguously misaligned behavior that could have just been an innocent mistake, what do you do with those examples? A very tempting thing that you can do is basically just optimize against them or train against them. And the classic issue with that is that you're just training the system not to do that sort of thing, which could easily result in the system not actually being aligned, but instead just being better at noticing when it can get away with stuff and when it can't. Right. That's the sort of stuff that I expect to be happening by default. And then there's the question of will it work? And again, my answer is probably not. If that's all you're doing, if you're doing basically only the stuff that doesn't cost you, that doesn't slow you down at all, I don't think that's going to work. It's going to look like it's working because the AIs will be really smart.
And so at some point, basically, as the AIs are getting smarter than you and as they're developing sort of longer term goals and they're able to strategically think about their situation in the world and how they can achieve their goals, then it's going to look like it's working because it's in their interest to make it look like it's working. You know, company management wants to go as fast as possible to beat their various rivals. And so they want the warning signs to go away and they want all the evals to come out like all systems go. And guess what? The AIs are going to want the same thing, because they also want to go fast and be put in charge of stuff and be given more power and authority and trust. Or they will if they have these sort of longer term goals, because then it's useful for achieving your goals if you have more power and authority and so forth. And so they'll make sure that the red flags don't appear and so forth.
B
Does the fast pace of AI research and progress in general that you project in AI 2027, does that depend on these superhuman AI coders communicating with each other in ways that we can't understand?
A
Yes, sort of. I mean, it would still be quite a fast pace even without that. So, for example, in AI 2027, in the ending where they survive, the humans make one of these costly trade-offs and switch back to a faithful chain of thought architecture. And they actually do additional research to strengthen the faithfulness properties of the chain of thought so that they can actually trust that they are reading the actual thoughts of the AIs. And that comes with a performance hit. It sets them back a few months and it sort of slows things down. But it works. And compared to what a lot of people are expecting, it's still overall a very fast pace of progress and it still counts as an intelligence explosion. They still get to superintelligence in a few months.
B
Do you think we got lucky with chain of thought? Is it helpful for us to be able to read what at least seems to be the inner thoughts of AIs?
A
Oh, yes, it's extremely helpful and it's quite lucky. And we'd best make as much use of it as we can while it still lasts. And hopefully we can try to coordinate to make it last longer. Unfortunately, we think that, you know, the industry will gradually move on and find more efficient methods and paradigms that are more powerful, but don't have this lovely faithful chain of thought property. And then there'll be economic pressure to switch to that stuff. And that's what we depict happening in AI 2027.
B
And this is just finding some way to communicate between copies of AIs that is more information-dense than writing in English, for example?
A
There's a cluster of related things. So one, there's a single AI talking to its future self. Currently, there's the sort of natural language bottleneck where, as it's autoregressively generating text, it literally cannot communicate with itself after a certain distance in the future, except through the text itself. And so there's this compression, there's this bottleneck where it has all these incredibly complicated thoughts in its, you know, billions of parameters, activations, et cetera. But then it can't just send those thoughts into the future in any form. It has to have those thoughts produce some tokens, and then only those tokens go into the future. Right? So there's that dimension in which they could be potentially a lot more capable if they didn't have an English chain of thought, basically, or not one that was capturing their real thoughts, but was instead just like a layer of icing on top of the cake of their real thoughts, you know. But then there's also communicating between, you know, different agents in different parts of the company that are working on different tasks. But they can send messages to each other. Do the messages have to be in English or can they send, like, high dimensional vector messages? Right. And then there's the question of, well, what if it's not a high dimensional vector? What if it still is text, but it's not legible English text that actually means what it says it means? Perhaps it's some sort of hyper-optimized text that's basically in some alien language that's more efficient than English, that they've learned over the course of their training. And that's sort of an intermediate case, because it's probably easier to interpret than the high dimensional vectors, but it's still an alien language that needs to be interpreted. And so that raises issues.
And then there's another version of the loss of faithfulness, which is that even if it is in English, if the model is smart, it can, you know, use euphemisms and sort of be discreet about what it's saying so as to make it the case that humans and monitors looking at it don't notice some of the subtext, you know, and so that's a way in which the chain of thought can be unfaithful, even if it's in English. Right. And so more research needs to be done into that to try to like stamp that out and make it not as possible as it currently is.
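Editor's sketch: the natural language bottleneck described above can be made concrete with a rough back-of-envelope comparison between how much one sampled token can carry and how large one internal activation vector is. The vocabulary size, hidden dimension, and numeric precision below are hypothetical values chosen only for illustration, not figures from the conversation, and the raw bit count of a vector is an upper bound on storage rather than a measure of its true information content.

```python
import math

def bits_per_token(vocab_size: int) -> float:
    """Upper bound on the information one sampled token can carry."""
    return math.log2(vocab_size)

def bits_per_activation_vector(dim: int, bits_per_value: int) -> int:
    """Raw storage size of one hidden-state vector (an upper bound only)."""
    return dim * bits_per_value

# All three numbers are hypothetical, chosen for illustration.
VOCAB_SIZE = 100_000
HIDDEN_DIM = 4096
BITS_PER_VALUE = 16

token_bits = bits_per_token(VOCAB_SIZE)
vector_bits = bits_per_activation_vector(HIDDEN_DIM, BITS_PER_VALUE)
ratio = vector_bits / token_bits

print(f"one token carries at most {token_bits:.1f} bits")
print(f"one activation vector occupies {vector_bits} bits")
print(f"raw size ratio: about {ratio:.0f}x")
```

Under these assumed numbers the vector is thousands of times larger than what a single token can encode, which is one way to read the remark that only the tokens "go into the future."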
B
In the scenario in which we're in a race, the researchers at OpenBrain try to automate alignment and it fails. Why do you expect that automating alignment would fail?
A
I mean, the whole problem is we don't trust the AIs. So if you're putting the AIs in charge of doing your alignment research, there's a sort of cart before the horse problem. So that's the first thing. It's not an insurmountable problem. So, for example, maybe you do have some AI that's so dumb that you trust it. And then you can try to bootstrap to smarter AIs that you can also trust because they were, you know, controlled or aligned by the previous AI. So there's stuff you can do there, but the core problem that you need to have a really good answer to is, why do we trust these? Somehow the trust has to transfer from the humans all the way through to the superintelligence. And then there's another issue, which is even if you are doing some sort of strategy like that, where you have the dumber AIs that you actually do trust, there's the question of, well, maybe they make mistakes. You know, it's one thing to train AIs to solve coding problems. It's another thing to train AIs to solve alignment. Like, how do you get the training signal for that? You can throw all this text at them, all the stuff that's been written on alignment so far. But it does feel like a domain that's less checkable than normal AI research, for example. So it's more possible to get a good training environment with a good training signal for getting your AIs to do AI research than it is to get them to do alignment research. So I think the answer to that is to do some sort of hybrid thing where you have human alignment researchers who are sort of managing and directing the research and making the judgment calls, but then you have AIs that are rapidly writing all the code, for example. That feels like something that's definitely doable, but the point is that you still need those humans. In that world, you're bottlenecked on the quality of the human researchers, basically, the human alignment researchers.
B
As you mentioned, you're also bottlenecked on the fuzziness of the concept of alignment itself, where it's quite difficult to specify and write down what you even mean and, yeah, train on that as a reward.
A
Yeah, so I'm not hopeless about this. I think there's lots of good ideas to try and things like that. And I could say a lot more about it. But again, on a meta level, a big part of the problem that we face is that we are in a domain where there are silent failure modes. I mean, in most domains there are some silent failure modes. Like, if you imagine you're designing a car and you want it to be safe, most of the ways in which your car can be unsafe will be immediately apparent in even basic testing. You know, like the engine catches on fire when you try to start it or something like that. But then there are some ways that your car can be unsafe that don't appear in testing. Like, you know, the metal that you used was a bit too brittle or something. And so after 10,000 miles, it starts to wear down and then this component breaks or something like that. That's harder to discover through testing. With AI alignment, there's this whole category of plausible silent failure modes where your AI is, you know, not actually aligned, but pretending to be, or it's not even pretending yet, but at some point in the future it will realize it's misaligned, and then it will pretend, which is even harder to fix, because if you look at its thoughts right now, you would see nothing wrong. So there's all these possible silent failure modes, but then unlike with the car, we can't just afford to actually fail sometimes. With the car, it's like, okay, you actually killed a bunch of people, but you just recall it and fix the part and so forth. But with the AIs, if halfway through the intelligence explosion, as your AIs are automating all the research, including all the alignment research, they decide that they're misaligned and they decide not to tell you about that, you're just screwed. You're not going to recover from that.
B
Do you think the main danger and risk here is inherent to AI as a technology, or is it because we're developing it in a specific way, specifically under these conditions of extreme competition between companies and between countries?
A
I guess it's a bit of both. Like, the technical difficulties would still be there if we were developing it without a race condition, but we'd be much more well positioned to solve the technical difficulties if not for the race.
B
I highly recommend looking into AI 2027. There are a lot of details that we couldn't possibly cover in a podcast like this. Maybe you can tell us a bit about the work that went into creating AI 2027.
A
So it took almost a year. It was me and the AI Futures Project team, which is Eli Lifland, Thomas Larson, Romeo Dean, Jonas Vollmer, and then we also got Scott Alexander, the famed Internet blogger, to rewrite a bunch of our content to be more engaging and easy to read. And I think that was pretty important for the overall success of AI 2027. We did a couple early drafts that we just completely scrapped to get ourselves used to the methodology, the methodology of forcing yourself to write a concrete, specific scenario at this level of detail that represents your best guess. We weren't just trying to illustrate a possibility, we were trying to give our actual best guess at each point of, like, what happens next? What happens next? What happens next? And so we did one or two versions of this over 2024 that were basically just for practice. And then we had our final version, which went through multiple rounds of feedback from, like, 100 people or so, and was being heavily revised and rewritten by Scott and then revised again and so forth. And then in our initial draft of this, we just had one scenario and it ended. It was basically what you now see as the race ending. And we wanted to have a more optimistic, good ending. So we then made the branch point and we tried to think, okay, well, what would it look like to solve alignment, but still within the same sort of universe as the first scenario? And then, you know, what would that outcome look like? And so that was what the branch point was. Yeah, we also did a bunch of these war games or tabletop exercises as part of our process for writing this. So we would get a bunch of friends slash experts in the room, maybe about 10, and then we would assign roles. You know, you're the leader of this company, you're the president, you're the leader of China.
And then we would start in early 2027 and we would roll forward and ask everyone, what do you do this month? What do you do next month? And so forth. And at this point we've probably done close to 50 of them, because people like them quite a lot, actually. So we keep getting all these inbound requests for us to run these war games with different groups. And it really was a good sort of writer's block unblocker to have done all of these rollouts with all of these different groups of people. It helped us to have more ideas and also feel like we had a better sense of what was plausible and what wasn't when we were writing the race ending and then the slowdown ending.
B
What do you learn from doing these war game exercises? Because I'm thinking if you're playing the role of the American president or the leader of China or the leader of the top AI company and so on, it's quite difficult to simulate what they're thinking. And if things are moving very fast and the leaders have a lot of power, the specifics of their psychology can matter a lot. So how do you think about simulating decisions made by these people?
A
I mean, it's certainly a very low res, untrustworthy simulation, but the question is, is it better than nothing? And I think, in moderation, with grains of salt, yes, it is, is my guess. The thing that I usually say is, I mean, yeah, the future is really hard to predict. Who knows what's going to happen? The default strategy is to not think about it very much at all. And that's not so good, because it feels like it would be extremely important to have a better sense of what might happen and what you might do and so forth. So then the next default strategy after that is to think about it, but in a sort of unstructured way, like you're at the cafeteria and you're chatting with your buddies about what AGI might look like and so forth. And that's cool too. But this is a bit of a more structured and organized way of doing this, basically, where you have 10 people. And instead of just a free form conversation where people can get into arguments about X and Y and Z, you say, okay, let's start with this scenario and then we'll do the next two months, and then we'll do the next two months and so forth. And you can still think of it as a collaborative conversation where everyone's talking about what they think happens, but there's a sort of division of responsibilities. Like, you talk about what you think this actor would do, you talk about what this actor would do, and so forth. And then insofar as people disagree, you argue about it, and then we make a decision based on what the aggregate of the group thinks. We take a vote, right? And so then the result of the war game can be thought of as an aggregate of what the people in this room think would happen, having thought about it for a couple hours and sort of talked it over step by step in this structured way. And then there's the question of, well, what do these people in the room know? Probably not that much. Maybe it's not super representative of what will actually happen. And it's like, that's true.
But this is a start, especially if you get people in the room who are actually relevantly similar to the people who will be making decisions. Like, you get people who work at the AI companies to play the AI companies. You get people who do technical alignment work to play the alignment team and/or the AIs. And you get people who work in government to play the government.
B
You had this essay in 2021 that was quite successful in predicting five years out in the future. What's the lesson you took from that? Is it that when we make forecasts, when we make predictions, we need to trust trend extrapolations more than we would intuitively think we should?
A
I mean, maybe. For me, I already sort of trusted the trend extrapolations to what I would think is an appropriate amount, which is why I made those predictions and why they were correct. I guess if someone is wildly surprised by how I managed to be so correct, then they should probably update more towards my methodology being a good methodology. But I wasn't that surprised. And so it was less of an update for me. One of the biggest things I got wrong with What 2026 Looks Like, which was my earlier blog post, was this: I predicted a sort of change in the social media landscape that seems to be sort of happening, but at a slower pace than I think I predicted. So in 2021, when I wrote this, my reasoning was basically, language models are going to be amazing at censorship. They're going to increase the quality of censorship, allowing censors to make finer grained distinctions amongst content with fewer false positives and false negatives, while also reducing the cost of censorship dramatically. And they'll also be good at propaganda, but that's less important. I sort of cynically predicted that the leaders of tech companies and the leaders of governments would not coordinate to resist the temptation to use censorship and propaganda technology, but instead would quickly slip into it and end up aggressively using language models as part of their social media recommendation algorithms and stuff like that, to sort of put their thumb on the scales and advocate for the political ideologies that they like. And I predicted therefore that the Internet would sort of Balkanize into a western left wing Internet and then a western right wing Internet and then maybe a Chinese Internet and maybe a couple other clusters as well, and that people unhappy with the censorship and propaganda on some social media platform would then move to other platforms that cater more to their own tastes, with the type of propaganda and censorship that they like.
And this has in fact happened, but I would say probably not as fast as I thought it would, or something like that. Like, right now we have Truth Social and we have Bluesky and there's a lot of people sort of self-sorting into those, and there's Elon purchasing Twitter and so forth and changing it from a sort of blue thing to more of a red thing. I guess it's hard to say, like, I didn't have a good quantitative metric for measuring the extent to which this is happening, but it does feel to me like it hasn't gotten quite as far as I thought. And the lesson that I take from that is one about being careful about the syllogism of: this is possible, people are incentivized to do it, therefore people will do it. I think that's sort of true, but it might take longer than you expect for people to actually do it.
B
So you described this technique of iterative forecasting that you used in both your 2021 essay and in AI 2027. So when you do iterative forecasting, you lay out what happens, say, in one year or in one month, and then you base your next forecast on what you've written down. What do you do if the forecast begins sounding crazy? How do you take a sanity check of what's happening? Because this seems to me like something that could easily veer off into fantasy land.
A
In some sense, you're constantly doing sanity checks. And that's why it took us so long to write this, is that we would write a few months and then we'd be like, wait a minute, that doesn't make sense. They wouldn't do this. Let me try to think of some examples.
B
Yeah, that would be super useful, actually.
A
Because I think in one of the earlier drafts we had something like, and then the US does a bunch of cyber attacks against the Chinese AIs to destroy their project or something. And then we were like, actually, that doesn't make so much sense, because probably they would have really strengthened their security by now. Like, this is already late 2027. You know, I don't think the offense-defense balance works that way or something like that. So then we undo, try again. I think there was another example. In one earlier draft of the race ending, where it was misaligned AIs on both sides, we had them basically just make a deal with each other to screw over the humans. But then we were like, well, that doesn't make sense. How would they enforce such a deal? How do they trust that the other side is doing it? And so we had to sort of rethink things, and then ultimately we ended up with something similar, but we had a lot more to say about exactly what that deal would look like and how they would enforce it. I'm not sure if that's an answer to your question, but yeah. So we were constantly doing this sort of, does this make sense, sanity check. And we were constantly getting feedback from external experts and stuff, criticizing various parts of the story as unrealistic, and then trying to incorporate that feedback as best as we could into changing things. One weakness of this method is that if you get all the way to the end and then you realize you made a mistake at the beginning, it's rough, because you basically have to throw away that entire thing, and then you've wasted a lot of work if you built on this false premise or something. And that's just the price you pay, I guess, for doing this methodology, is that you run a risk of some wasted effort. Although then again, it's not like other methodologies don't have problems either.
I think of this as a complement to other methodologies rather than a substitute.
B
What you're doing is extrapolating the compounding effects of AI progress. And there I'm wondering what happens to the reliability of the forecast or to the uncertainty of the forecast over time when you do that?
A
Oh, massively. Every additional chunk of time that you forecast, you're layering on additional choices. And so the probability of the overall thing can only go down as you make it longer. Every additional claim you add to the conjunction lowers the probability of the whole thing. So honestly, it's quite amazing that the first thing I did, What 2026 Looks Like, was anywhere near as close to correct as it was, because there were so many sort of conjunctive claims being added. And I'll be quite pleased with myself if AI 2027 is as close to correct as What 2026 Looks Like was, because, yeah, it's sort of being more ambitious.
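Editor's sketch: the point about conjunctions can be illustrated with a few lines of arithmetic. If each step of a scenario is given a probability conditional on the earlier steps holding, the probability of the whole story is their running product, which can never increase as steps are added. The step count and the 0.8 per-step probability below are hypothetical numbers for illustration, not estimates from AI 2027.

```python
from itertools import accumulate
from operator import mul

def cumulative_scenario_probability(step_probs):
    """Running product of per-step probabilities.

    Each entry of step_probs is read as the probability of that step
    given that every earlier step in the scenario already happened.
    """
    return list(accumulate(step_probs, mul))

# Hypothetical: ten scenario steps, each judged 80% likely given the past.
step_probs = [0.8] * 10
cumulative = cumulative_scenario_probability(step_probs)

for step, prob in enumerate(cumulative, start=1):
    print(f"after step {step:2d}: P(whole scenario so far) = {prob:.3f}")
```

Even with each individual step looking fairly likely, the assumed ten-step story ends up at roughly an 11% chance of being right in every detail.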
B
Yeah, you're forecasting out to 2036, I think, in both scenarios. And you're also forecasting much grander changes to humanity than you did in What 2026 Looks Like.
A
Right. And the second part is much more important. I think the relevant thing is how much radical change you're forecasting, how many stages of AI capability you're sort of going through.
B
Yeah. So AI 2027 is something that could be falsified quite soon. We could get to 2030 and the world looks very different from what you've projected here. What would be some clear signs that we are not on the path that you forecast in AI 2027?
A
The best one would be benchmark trends slowing down. So most benchmarks are useless, but some benchmarks measure what I consider to be the really important skills.
B
Why do you say that most benchmarks are useless?
A
Because they don't measure something that's that important or that predictive of future stuff, and/or because they're already getting saturated. Like, a lot of benchmarks are multiple choice questions about biology or something. And it's like, well, it used to be useful to have benchmarks like that, but now we know that the AIs are already really good at all that stuff, all that world knowledge stuff. And if they're not, it's quite easy to make them good at it. So multiple choice questions are basically a solved problem, almost. What I think is the new frontier is agency, long horizon agency, operating autonomously for long periods in pursuit of goals. And so there are benchmarks like METR's RE-Bench and their horizon length benchmark that are measuring that sort of thing, agentic coding in particular. There's agency in all sorts of different domains. I think that Pokemon is a fun sort of benchmark for long horizon agency, as long as the companies don't train on Pokemon at all, because then it's an example of generalizing to something completely different from what they were trained on. But anyhow, there are some benchmarks I would also mention, maybe SWE-bench Verified and stuff like that, but they're not as good. And I think OpenAI has a paper replication benchmark, and there's a couple others. But basically agentic coding benchmarks I think are where it's at, and there's been rapid progress on them in the last six months. And if that rapid progress continues, then I think we're headed towards something like an AI 2027 world. But if that progress starts to level off or slow down, then my timelines will lengthen dramatically. And contrariwise, if the progress accelerates even more, then timelines will shorten.
B
Do you think we're seeing an acceleration of the trend lines with reasoning models?
A
Yes, but it's the acceleration that I predicted. And so we're sort of on trend. So the METR stuff came out after we had basically finished AI 2027. But we went ahead and plotted the graph and sort of fit it to what AI 2027 was claiming. And it was more or less on trend. And then we went ahead and just made it actually part of the prediction. So we assumed a 15% reduction in difficulty of each doubling of horizon length. And then that made a reasonably nice line that we extrapolated, and we included that in the research page as part of our timelines forecast. And so AI 2027 makes a canonical prediction about performance on that benchmark over the next couple years. And so it will be very easy to tell if it's going faster or slower. I think there are other things to say besides this, but that would be the main thing.
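Editor's sketch: one way to read "a 15% reduction in difficulty of each doubling of horizon length" is that each successive doubling of the task horizon takes 15% less calendar time than the one before. That reading is an assumption made here for illustration, as are the starting horizon of 1 hour and the first doubling time of 4 months; none of these values are taken from the AI 2027 research page.

```python
def doubling_schedule(initial_horizon_hours, first_doubling_months,
                      reduction, n_doublings):
    """Return (elapsed_months, horizon_hours) after each doubling.

    Assumes each doubling takes (1 - reduction) times as long as the
    previous one, so doubling times form a geometric sequence.
    """
    schedule = []
    horizon = initial_horizon_hours
    doubling_time = first_doubling_months
    elapsed = 0.0
    for _ in range(n_doublings):
        elapsed += doubling_time
        horizon *= 2
        schedule.append((elapsed, horizon))
        doubling_time *= (1 - reduction)
    return schedule

# Hypothetical starting point: 1-hour horizon, first doubling in 4 months.
schedule = doubling_schedule(1.0, 4.0, 0.15, 10)
for elapsed, horizon in schedule:
    print(f"month {elapsed:5.1f}: horizon of about {horizon:6.0f} hours")

# Geometric series: total time for unboundedly many doublings is finite.
limit_months = 4.0 / 0.15
print(f"all doublings would complete within {limit_months:.1f} months")
```

Because the doubling times shrink geometrically, the curve under this reading is faster than exponential: the total time for any number of doublings stays below the first doubling time divided by the reduction rate.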
B
What are those other reasons?
A
So the next thing to say is that it's possible that we will get the benchmarks crushed, but then still not get AGI. So we could get to the point where, yeah, in 2027 we've got coding agents that can basically autonomously operate for, like, months at a time doing difficult coding tasks and that are quite good at that. And yet nevertheless, we haven't reached the superhuman coder milestone. That's a bit harder to reason about. And unfortunately it's going to be harder to find evidence about whether or not that's happening. You basically have to reason about the gaps between that benchmark saturation milestone and the actual superhuman coder milestone. Like, what is it that could be going on there? Well, maybe it's something about AIs getting really good at checkable tasks, but still being bad at tasks that are more fuzzy. I think that's possible, although I wouldn't bet on it. I think that by the time you're really good at month long checkable tasks, you've probably learned a bunch of fuzzy tasks along the way, basically. But yeah, we'll see. We'll see.
B
As a last question here, what is next for the AI Futures Project?
A
We are doing a bunch of different things and then we'll see what sticks. So a lot of people have been asking us for, like, recommendations. You know, AI 2027 is not a recommendation, it's a forecast. We really hope it does not come to pass. So now some of us are working on a new branch that will be the what-we-actually-think-should-happen branch, which would be exciting. We're also going to be running more tabletop exercises because there's been a lot of interest in those. And so we'll keep running them and see if that grows into its own thing. I'll continue doing the forecasting. So we're actually about to update our timelines model. Good news: my timelines have been lengthening slightly, so I now feel like 2028, maybe even 2029, is better than 2027 in terms of a guess as to when all this stuff is going to start happening. So I'm going to do more thinking about that and publish more stuff on that. Yeah, miscellaneous other stuff. Oh yeah, we have a huge backlog of people who sent us comments and criticism and alternative scenarios and stuff like that. So I'm going to work through all of that and respond to the people, give out some prizes.
B
Sounds great. Daniel, thanks for coming on the podcast.
A
Yeah, thank you for having me.
Date: July 3, 2025
Host: Future of Life Institute (B)
Guest: Daniel Kokotajlo (A)
In this episode, Daniel Kokotajlo, lead author of the influential scenario "AI 2027," discusses the stark outlook for artificial intelligence (AI) development over the next decade. He explains why rapid advances—particularly under competitive, 'race' conditions—are likely to culminate in either existential catastrophe for humanity or the complete loss of human control. Drawing from "AI 2027" and his team's research, Daniel explores the plausibility, speed, and risks of an intelligence explosion, the overwhelming challenge of AI alignment, the dangers of secrecy in AI companies, and why transparency and international slowdown may be humanity's only hope.
| Time  | Topic/Quote |
|-------|-------------|
| 00:57 | Daniel on why AI's impact will be the biggest in human history. |
| 03:00 | Timeline from autonomous coder AI to superintelligence. |
| 04:30 | Takeoff speeds: world differences between fast and slow transitions. |
| 07:32 | Research multipliers: superhuman coders and researchers. |
| 13:35 | Biological example: John von Neumann brain as a lower bound for required compute. |
| 16:56 | Why the transition won't be gradual for society; colonial analogy. |
| 20:40 | On the alignment problem: "It becomes really important whether they were actually aligned..." |
| 25:47 | The recurring race scenario: slow down for safety, or speed up for advantage. |
| 26:53 | Why transparency is Daniel's top recommendation for policymakers and companies. |
| 35:06 | Incentives for companies to run models only internally. |
| 37:26 | How to oversee (or fail to oversee) an army of internally-deployed AIs. |
| 41:32 | The lucky break of 'chain of thought' AI, and its predicted obsolescence. |
| 45:11 | Why automating alignment is probably not an escape hatch. |
| 50:10 | How AI 2027 was made: scenario, feedback, war games. |
| 59:54 | Forecasting methodology: iterative, correction- and feedback-heavy. |
| 62:53 | Compounding uncertainty in narrative forecasting. |
| 64:30 | What might falsify AI 2027: progress on agency benchmarks. |
| 69:03 | Future plans for the AI Futures Project. |
If you want a deep dive into how the next decade of AI could unfold, and what could go wrong or (rarely) right, AI 2027 and this discussion provide a stark guide.