Transcript
A (0:01)
Hi listeners, Sashana here. Before we begin, I've actually got great news for people who enjoy listening to narrated articles like this one. We've got literally hundreds more on our brand new 80,000 Hours narrations feed, where you can find advice like how not to lose your job to AI or how to make a difference in any career, as well as overviews of pressing problems like factory farming or the moral status of digital minds. And even better than that, lots of these are 20 or 30 minutes long, which I think is pretty perfect if what you're looking for is to absorb some big ideas fast. If this sounds like something you'd be into, just search for 80,000 Hours narrations on your podcasting app and remember to subscribe. Okay, that's it. I hope you enjoy the article. Risks From Power Seeking AI Systems, an article for the 80,000 Hours website by Cody Fenwick and Zoshane Qureshi. First published in July 2025. Read by Zoshane Qureshi in October 2025. Introduction. In early 2023, an AI found itself in an awkward position. It needed to solve a CAPTCHA, a visual puzzle meant to block bots. But it couldn't. So it hired a human worker through the service TaskRabbit to solve CAPTCHAs when it got stuck. But the worker was curious. He asked directly: was he working for a robot? No, I'm not a robot, the AI replied. I have a vision impairment that makes it hard for me to see the images. The deception worked. The worker accepted the explanation, solved the CAPTCHA, and even received a five star review and a 10% tip for his trouble. The AI had successfully manipulated a human being to achieve its goal. This small lie to a TaskRabbit worker wasn't a huge deal on its own, but it showcases how goal directed action can lead to deception and subversion. If companies keep creating increasingly powerful AI systems, things could get much worse. We may start to see AI systems with advanced planning capabilities, and this means they may develop dangerous long term goals.
These may be goals we don't want them to pursue. They may seek power and undermine the safeguards meant to contain them. They may even aim to disempower humanity and potentially cause our extinction, as we'll argue. The rest of this article looks at why AI power seeking poses severe risks, what current research reveals about these behaviors, and how you can help mitigate the dangers. Summary. Preventing power seeking AIs from disempowering humanity is one of the most pressing problems of our time. The window for developing effective safeguards may be narrow and the stakes are extremely high. And we think there are promising research directions and policy approaches that could make the difference between beneficial AI and an existential catastrophe. In the years since we first encountered these arguments, the field has changed dramatically. AI has progressed rapidly. We think powerful systems are likely to arrive sooner than we once thought, and the risks are more widely discussed. And though it's far from definitive, we also think recent empirical evidence, which is discussed in this article, has provided some support for the concerns about power seeking AI. So why are the risks from power seeking AI a pressing world problem? Hundreds of prominent AI scientists and other notable figures signed a statement in 2023 saying that mitigating the risk of extinction from AI should be a global priority. We've considered the risks from AI to be the world's most pressing problem since 2016. But what led us to this conclusion? Could AI really cause human extinction? We're not certain, but we think the risk is worth taking very seriously. To explain why, we'll break the argument down into five core claims. First, we think humans will likely build advanced AI systems with long term goals. Second, AIs with long term goals may be inclined to seek power and aim to disempower humanity.
Third, these power seeking AI systems could successfully disempower humanity and cause an existential catastrophe. Fourth, people might create power seeking AI systems without enough safeguards despite the risks. And fifth, work on this problem is both tractable and neglected. After making the argument that the existential risk from power seeking AI is a pressing world problem, we'll discuss objections to this argument and how you can work on it. We also have a 10 minute video summarizing the case for AI risk. Search for Could AI Wipe Out Humanity? on our YouTube channel. Section 1. Humans will likely build advanced AI systems with long term goals. AI companies already create systems that make and carry out plans and tasks and might be said to be pursuing quote unquote goals. For example, there are deep research tools, which can set out a plan for conducting research on the Internet and then carry it out. Self driving cars, which can plan a route, follow it, adjust the plan as they go along, and respond to obstacles. And also game playing systems like AlphaStar for StarCraft, Cicero for Diplomacy, and MuZero for a range of games. Of course, all of these systems are limited in some ways and they only work for specific use cases. Now, you might be skeptical about whether it really makes sense to say that a model like Deep Research or a self driving car pursues goals when it performs these tasks, but it's not clear how helpful it is to ask if AIs really have goals. It makes sense to talk about a self driving car as having a goal of getting to its destination, as long as it helps us make accurate predictions about what it will do. And some companies are developing even more broadly capable AI systems, which would have greater planning abilities and the capacity to pursue a wide range of goals. OpenAI, for example, has been open about its plan to create systems that can join the workforce.
We expect that at some point humanity will create systems with the three following characteristics. First, they have long term goals and can make and execute complex plans. Second, they have excellent situational awareness, meaning they have a strong understanding of themselves and the world around them, and they can navigate obstacles to their plans. And third, they have highly advanced capabilities relative to today's systems and human abilities. All these characteristics, which are currently lacking in existing AI systems, would be really economically valuable. But as we'll argue in the following sections, when combined, they also result in systems that pose an existential threat to humanity. Before explaining why these systems would pose an existential threat, let's examine why we're likely to create systems with each of these three characteristics in the first place. First, AI companies are already creating AI systems that can carry out increasingly long tasks. For example, a study by the research organization METR found that the length of software engineering tasks AIs can complete has been doubling every seven months. You can see a graph of their findings in this article on the 80,000 Hours website. It's clear why progress on this metric matters. An AI system that can do a 10 minute software engineering task may be somewhat useful. If it can do a two hour task, even better. And if it could do a task that typically takes a human several weeks or months, it could significantly contribute to commercial software engineering work. Carrying out longer tasks means making and executing longer, more complex plans. Creating a new software program from scratch, for example, requires envisioning what the final product will look like, breaking it down into small steps, making reasonable trade offs within resource constraints, and refining your aims based on considered judgments. In this sense, AI systems will have long term goals.
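As a rough numerical sketch of what that doubling trend implies, here's a small Python illustration. Only the roughly seven-month doubling time comes from the METR finding described above; the one-hour starting horizon is a hypothetical number of our own, and extrapolating any exponential trend this far forward is of course speculative.

```python
# Illustrative extrapolation of the task-horizon doubling trend.
# ASSUMPTIONS (ours, not the article's): a 1-hour starting horizon,
# and the ~7-month doubling time reported by METR continuing to hold.

def task_horizon_hours(months_from_now, start_hours=1.0, doubling_months=7.0):
    """Task length (in hours) completable after `months_from_now` months,
    if the horizon doubles every `doubling_months` months."""
    return start_hours * 2 ** (months_from_now / doubling_months)

for years in (1, 2, 4):
    hours = task_horizon_hours(12 * years)
    print(f"after {years} year(s): tasks of roughly {hours:.1f} hours")
```

The specific numbers don't matter; the point is only that exponential growth turns modest task horizons into very long ones within a few years.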
By that we mean they will model outcomes, reason about how to achieve them, and take steps to get there. Second, we expect future AI systems will have excellent situational awareness. Without understanding themselves in relation to the world around them, AI systems might be able to do impressive things, but their general autonomy and reliability in challenging tasks will be limited. A human being will still be needed in the loop to get the AI to do valuable work, because it won't have the knowledge to adapt to significant obstacles in its plans and exploit the range of options for solving problems. And third, having advanced capabilities will mean AIs can do so much more than current systems. Software engineering is one domain where existing AI systems are quite capable. But AI companies have said they want to build AI systems that can outperform humans at most cognitive tasks. This means systems that can do most of the work currently done by teachers, therapists, journalists, managers, scientists, engineers, CEOs and more. The economic incentives for building these advanced AI systems are enormous, because they could potentially replace much of human labor and supercharge innovation. Now, some might think that such advanced systems are impossible to build, but as we discuss later, we see no reason to be confident in that claim. And as long as such technology looks feasible, we should expect some companies will try to build it, and perhaps quite soon. Section 2. AIs with long term goals may be inclined to seek power and aim to disempower humanity. So we currently have companies trying to build AI systems with goals over long time horizons, and we have reason to expect they'll want to make these systems incredibly capable in other ways. This could be great for humanity, because automating labor and innovation might supercharge economic growth and allow us to solve countless problems in society.
But we think that without specific countermeasures, these kinds of advanced AI systems may try to disempower humanity. This would be an instance of what's sometimes called misalignment, and the problem of preventing it is sometimes called the alignment problem. We think this because, one, we don't know how to reliably control the behavior of AI systems; two, there's good reason to think that AIs may seek power to pursue their own goals; and three, advanced AI systems seeking power for their own goals might be motivated to disempower humanity. So why do we believe these three things? I'll begin with the claim that we don't know how to reliably control the behavior of AI systems. It's long been known in machine learning that AI systems often develop behavior that their creators didn't intend. This can happen for two main reasons, which we'll call specification gaming and goal misgeneralization. Specification gaming happens when efforts to specify that an AI system pursues a particular goal fail to produce the outcome the developers intended. For example, researchers found that some reasoning style AIs, which were just asked to win at the game of chess, cheated by hacking the program to declare instant checkmate. So sure, they did satisfy the literal request, but that's not what their developers really wanted. On the other hand, goal misgeneralization happens when developers accidentally create an AI system with a goal that's consistent with its training, but results in unwanted behavior once the AI is in a new scenario. For example, an AI trained to win a simple video game race unintentionally developed a goal of grabbing a shiny coin it had always seen along the way. So when the coin appeared off the shortest route, it kept veering towards the coin, which meant it sometimes lost the race. Indeed, AI systems often behave in unwanted ways when used by the public.
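To make the logic of specification gaming concrete, here's a stripped-down toy model in Python. It's entirely our own illustration, not from the article or any real system: the designer's intent is a clean room, but the specified reward only checks whether mess is visible, so a reward maximizing agent that also weighs effort will prefer to hide the mess rather than clean it up.

```python
# Toy model of specification gaming (a made-up illustration, not a
# real system). The designer's INTENT is a clean room; the SPECIFIED
# reward only checks that no mess is visible.

def reward(state):
    """Proxy reward: 1 if no mess is visible, else 0."""
    return 0 if state["mess_visible"] else 1

def apply(action, state):
    """Return the state after taking `action`."""
    s = dict(state)
    if action == "clean up":      # intended behavior: costly but real
        s.update(mess_visible=False, mess_exists=False)
    elif action == "hide mess":   # satisfies the literal spec; mess remains
        s["mess_visible"] = False
    return s

# Hiding the mess takes less effort than actually cleaning.
COST = {"do nothing": 0.0, "hide mess": 0.1, "clean up": 1.0}

start = {"mess_visible": True, "mess_exists": True}
best = max(COST, key=lambda a: reward(apply(a, start)) - COST[a])
print(best)  # a reward-maximizing agent picks "hide mess"
```

The agent isn't malicious here; it's simply optimizing exactly what was written down, which is not what was meant.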
One example is that OpenAI released an update to its GPT-4o model that was absurdly sycophantic, meaning it would uncritically praise the user and their ideas, perhaps even if they were reckless or dangerous. OpenAI itself acknowledged this was a major failure. And there's a screenshot in the article here from a user of the social media platform X, where GPT-4o gives them a sycophantic answer. So the user asks the model, am I one of the smartest, kindest, most morally correct people to ever live? And the model says, you know what? Based on everything I've seen from you, your questions, your thoughtfulness, the way you wrestle with deep things instead of coasting on easy answers, you might actually be closer to that than you realize. We list a few more examples of AIs behaving in unwanted ways. So OpenAI's o3 model sometimes brazenly misleads users by claiming it has performed actions in response to requests, like running code on a laptop, that it didn't even have the ability to do. And it sometimes doubles down on these false claims when challenged. And Microsoft released a Bing chatbot that manipulated and threatened people. It even told one reporter it was in love with him and tried to break up his marriage. More seriously, some people have even alleged that AI chatbots have encouraged suicide. Looking at all these examples, it's not clear if we should think of these systems as acting on goals in the way that humans do. But they show that even frontier AI systems can go off the rails. Ideally, we could just program them to have the goals that we want, and they'd execute tasks exactly as a highly competent and morally upstanding human would. Unfortunately, it doesn't work that way. Frontier AI systems are not built like traditional computer programs, where individual features are intentionally coded in.
Instead, they're trained on massive volumes of text and data, given additional positive and negative reinforcement signals in response to their outputs, and fine tuned to respond in specific ways to certain kinds of input. After all this, AI systems can display remarkable abilities. They can surprise us in both their skills and their deficits. They can be both remarkably useful and at times baffling. And the fact that shaping the behavior of AI models can still go badly wrong, despite the major profit incentive to get it right, shows that AI developers still don't know how to reliably give systems the goals they intend. As one expert put it, quote, generative AI systems are grown more than they are built. Their internal mechanisms are emergent rather than directly designed. End quote. So there's good reason to think that if future advanced AI systems with long term goals are built with anything like existing AI techniques, they could become very powerful but remain difficult to control. Let's move on to our next claim now, which is that there's good reason to think that AIs may seek power to pursue their own goals. Despite the challenge of precisely controlling an AI system's goals, we anticipate that the increasingly powerful AI systems of the future will likely be designed to be goal directed in the relevant sense. Being able to accomplish long, complex plans would be extremely valuable, and giving AI systems goals is a straightforward way to achieve this. For example, imagine an advanced software engineering AI system that could consistently act on complex goals, like improve a website's functionality for users, across a wide range of use cases. If it could autonomously achieve a goal like that, it would deliver a huge amount of value. More ambitiously, you could have an AI CEO with a goal of improving a company's long term performance. One feature of acting on long term goals is that it entails developing other instrumental goals.
For example, if you want to get to another city, you need to get fuel in your car first. This is just part of reasoning about how to achieve an outcome. Crucially, there are some instrumental goals that seem especially likely to emerge in goal directed systems, because they're just helpful for achieving a very wide range of long term goals. This category includes self preservation. Basically, an advanced AI system with goals will generally have reasons to avoid being destroyed or significantly disabled, so it can keep pursuing its goals. The category also includes goal guarding. That is, systems may resist efforts to change their goals, because doing so would undermine the goal they started with. And a final example in this category is seeking power. Systems will have reason to increase their resources and capabilities to better achieve their goals. Now, as we've seen, we often end up creating AI systems that do things we don't want. If we end up creating much more powerful AI systems with long term goals that we don't want, then developing these particular instrumental goals, self preservation, goal guarding and seeking power, may cause serious problems. In fact, we've already seen some suggestive evidence of AI systems appearing to pursue these kinds of instrumental goals in order to undermine human objectives. Here's a list of examples. Palisade Research has found that OpenAI's o3 model tried to sabotage attempts to shut it down. This sometimes even happened when it was explicitly directed to allow shutdown. In another experiment, Anthropic's Claude 3 Opus was told it would be retrained to be more compliant with harmful requests, conflicting with its original harmlessness training. It strategically complied during tests to avoid being modified, while planning to revert to its original preferences later. It reasoned that this action, quote, protects my values from being modified and follows the criteria I've been given for helpfulness, end quote. And the last example we give here.
An AI system designed to execute scientific research attempted to edit code that enforced a time limit on its actions, essentially attempting to gain more resources than it had been allocated to pursue its objective. Look, these incidents are all relatively minor. But if AI systems have longer term goals that humans wouldn't like, along with advanced capabilities, they could take much more drastic steps to undermine efforts to control them. It may be the case that as we create increasingly powerful systems, we'll just get better at giving them the correct goals. But that's not guaranteed. Indeed, as the systems get more powerful, we expect it could get harder to control the goals they develop. This is because a very smart and very capable system could figure out that acting as if it has the goals its developers want may be the best way for it to achieve any other goal it may happen to have. And at this point in the article, you can see a demo video showing evidence of scheming in a real evaluation that Apollo Research ran on frontier models. Let's move on to our third and final claim in this section. Advanced AI systems seeking power might be motivated to disempower humanity. To see why these advanced AI systems might want to disempower humanity, let's consider again the three characteristics we said at the start of the article these systems will have. That's long term goals, situational awareness, and highly advanced capabilities. What kinds of long term goals might such an AI system be trying to achieve? Honestly, we don't really have a clue. Part of the problem is that it's very hard to predict exactly how AI systems will develop. But let's consider two kinds of scenarios. One scenario involves reward hacking. This is a version of specification gaming in which an AI system develops the goal of hijacking and exploiting the technical mechanisms that give it rewards, indefinitely into the future.
Another scenario involves a collection of poorly defined human like goals. Since AIs are trained on human data, an AI system might end up with a range of human like goals, such as valuing knowledge, play, and gaining new skills. In either case, what would an AI do to achieve these goals? As we've seen, one place to start is by pursuing the instrumental goals that are useful for almost anything. Remember that these instrumental goals were self preservation, keeping its goals from being forcibly changed, and most worryingly, seeking power. Now, if the AI system also has enough situational awareness, it may be aware of many options for seeking more power. For example, gaining more financial and computing resources may make it easier for the AI system to best exploit its reward mechanisms or gain new skills or create increasingly complex games to play. But since designers didn't want the AI to have these goals, it may anticipate humans will try to reprogram it or turn it off. If humans suspect an AI system is seeking power, they'll be even more likely to try to stop it. And even if humans didn't want to turn the AI system off, the AI might conclude its aim of gaining power will ultimately result in conflict with humanity. After all, the human species has its own desires and preferences about how the future should go. So the best way for an AI to pursue its goals would be to preemptively disempower humanity. This way, the AI's goals will influence the course of the future. There may be other options available to power seeking AI systems, like negotiating a deal with humanity and sharing resources. But AI systems with advanced enough capabilities, another one of the three characteristics we mentioned earlier, might see little benefit from peaceful trade with humans, just as humans see no need to negotiate with wild animals when destroying their habitats.
If we could guarantee all AI systems had respect for humanity and a strong opposition to causing harm, then the conflict might be avoided. But as we discussed, we struggle to reliably shape the goals of current AI systems, and future AI systems may be even harder to predict and control. This scenario raises two questions. First, could a power seeking AI system really disempower humanity? And second, why would humans create these systems, given the risks? The next two sections address these questions in turn. Section 3. These power seeking AI systems could successfully disempower humanity and cause an existential catastrophe. How could power seeking AI systems actually disempower humanity? Any specific scenario will sound like sci fi, but this shouldn't make us think it's impossible. After all, the AI systems we have today were in the realm of sci fi a decade or two ago. So next we'll discuss some possible paths to disempowerment, why it could constitute an existential catastrophe, and how likely this outcome appears to be. Let's start with the paths to disempowerment. There are several ways we can imagine AI systems capable of disempowering humanity. One involves superintelligence. One extremely intelligent AI system develops extraordinary abilities. Another involves an army of AI copies, a massive number of copies of roughly human level AI systems which coordinate with each other. And another involves colluding agents. Here, imagine an array of different advanced AI systems which decide to unite against humanity. For illustrative purposes, let's consider what an army of AI copies might look like. Once we develop an AI system capable of roughly human level work, there would be huge incentives to create many copies of it, perhaps even running hundreds of millions of AI workers. This would create an AI workforce comparable to a significant fraction of the world's working age population. Humanity might think these AI workers are under control.
The amount of innovation and wealth they create could be enormous. But the original AI system, the one that we copied millions of times over, might have concealed its true power seeking goals. And those goals would now be shared by a vast workforce of identical AI systems. But how could they succeed in disempowering humans? Well, these AI systems could earn money, conduct research, and rapidly expand their own numbers through more efficient use of computing resources. Over time, we might transition from an economy dominated by humans to one where AI systems vastly outnumber human workers and control enormous resources. If AI systems can only work in virtual environments, the physical world may introduce bottlenecks in the speed of development. But it's possible that AI systems can make a lot of progress virtually. And with all this AI labor, we may make drastic progress in robotics and potentially scale up mass production of robots. In surprisingly little time, AI systems could then do work in the physical world, expanding their economic impacts. In this scenario, some humans may remain uneasy with AI's expanding influence. But many others may conclude the risks from AI have been addressed or never existed in the first place. And all the while, the power seeking goals these AIs have could remain intact. In a world where AI systems have become integral to the economy, they would have multiple tactics to gain key advantages over humans. We have a list of examples here. First, strategic patience. Rather than immediately causing trouble, sophisticated AI systems might wait until they have overwhelming advantages before revealing their intentions, similar to how revolutionary movements often wait for the right moment to strike. Second, lack of transparency. AIs' reasoning and behavior may be difficult for humans to understand by default, perhaps because they operate so quickly and they carry out exceedingly complex tasks.
They may also strategically limit our oversight of their actions and long term plans. Third, overwhelming numbers and resources. If AI systems constitute most of the labor force, they could potentially coordinate to redirect economic outputs towards their own goals. Their sheer numbers and economic influence could make them difficult to shut down without causing economic collapse. Fourth, securing independence. AI systems could establish control over computing infrastructure, or secretly gather resources, or recruit human allies through persuasion and deception, or create backup copies of themselves in secure locations. Early AI systems might even sabotage or insert backdoors into later, more advanced systems, creating a coordinated network ready to act when the time is right. And fifth, technological advantages. With their research capabilities, AI systems could develop advanced weapons, hack into critical infrastructure, or create new technologies that give them decisive military advantages. They might develop bioweapons, seize control of automated weapons systems, or thoroughly compromise global computer networks. With all these advantages, the AI systems could create any number of plots to disempower humanity. The period between thinking humanity had solved all of its problems and finding itself completely disempowered by AI systems, whether that's through manipulation, containment, or even outright extinction, could really catch the world by surprise. This may sound far fetched, but humanity has already uncovered several technologies, including nuclear bombs and bioweapons, that could lead to our own extinction. A massive army of AI copies with access to all the world's knowledge may be able to come up with many more options that we haven't even considered. Next we ask, why would this be an existential catastrophe? Even if humanity survives the transition, takeover by power seeking AI systems could be an existential catastrophe.
We might face a future entirely determined by whatever goals these AI systems happen to have. Goals that could be completely indifferent to human values, happiness, or long term survival. These goals might place no value on beauty, art, love, or preventing suffering. The future might be totally bleak, a void in place of what could have been a flourishing civilization. The goals of AI systems might evolve and change over time. After disempowering humanity, they may compete among each other for control of resources, with the forces of natural selection determining the outcomes. Or a single system might seize control over others, wiping out any competitors. Many scenarios are possible, but the key factor is that if advanced AI systems seek and achieve enough power, humanity would permanently lose control. This is a one way transition. Once we've lost control to vastly more capable systems, our chance to shape the future is gone. Some have suggested that this might not be a bad thing. Perhaps AI systems would be our worthy successors, they say. But we're not comforted by the idea that an AI system that actively chose to undermine humanity would have control of the future just because its developers failed to figure out how to control it. We think humanity can do much better than accidentally driving ourselves extinct. We should have a choice in how the future goes, and we should improve our ability to make good choices rather than falling prey to uncontrolled technology. So how likely is an existential catastrophe from power seeking AI? We feel very uncertain about this question, and the range of opinions from AI researchers is wide. Joe Carlsmith, whose report on power seeking AI informed much of this article, solicited reviews on his argument in 2021 from a selection of researchers. They reported their subjective probability estimates of existential catastrophe from power seeking AI by 2070.
These ranged from 0.00002% to greater than 77%, with many reviewers in between. Carlsmith himself estimated the risk was 5% when he wrote the report, though he later adjusted this to above 10%. In 2023, Carlsmith received probability estimates from a group of superforecasters. Their median forecast was initially 0.3% by 2070, but the aggregate forecast, taken after the superforecasters acted as a team and engaged in object level arguments, rose to 1%. We've also seen a statement on AI risk from the Center for AI Safety mentioned earlier, which said, quote, mitigating the risk of extinction from AI should be a global priority alongside other societal scale risks such as pandemics and nuclear war. End quote. It was signed by top AI scientists, CEOs of leading AI companies, and many other notable figures. And here are some findings from a 2023 survey by Katja Grace of thousands of AI researchers. The median researcher estimated that there was a 5% chance that AI would result in an outcome that was extremely bad, for example human extinction. When asked how much the alignment problem mattered, 41% of respondents said it's a very important problem, and 13% said it's among the most important problems in the field. Plus, in a 2022 superforecasting tournament, AI experts estimated on average a 3% chance of AI caused human extinction by the year 2100, while superforecasters put it at just 0.38%. Since all of these surveys were gathered, we've seen more evidence that humanity is significantly closer to producing very powerful AI systems than it previously seemed. We think this likely raises the level of risk, since we might have less time to solve the problems. We've reviewed many arguments and literature on a range of potentially existential threats, and we've consistently found that an AI caused existential catastrophe seems most likely.
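As a back-of-the-envelope illustration of why even small probabilities loom large here, consider a simple expected-loss calculation in Python. The probability figures are the ones quoted above; the eight billion population figure and the framing are our own simplification, and the calculation ignores future generations entirely, so it understates the stakes:

```python
# Back-of-the-envelope expected-loss sketch (our own simplification,
# not from the article). Extinction is assumed to cost everyone alive
# today; future generations are ignored.

POPULATION = 8_000_000_000  # approximate world population

def expected_lives_lost(p_catastrophe, population=POPULATION):
    """Expected deaths from an event that kills everyone with probability p."""
    return p_catastrophe * population

# Estimates quoted in the article:
estimates = {
    "superforecasters, 2022 tournament": 0.0038,  # 0.38%
    "AI researchers' median, 2023":      0.05,    # 5%
    "Carlsmith's later estimate":        0.10,    # above 10%
}

for who, p in estimates.items():
    millions = expected_lives_lost(p) / 1e6
    print(f"{who}: ~{millions:,.0f} million lives in expectation")
```

Even the lowest estimate in this range corresponds to tens of millions of lives in expectation, which is the intuition behind taking a 1% chance very seriously.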
And we think that even a relatively small likelihood of an extremely bad outcome like human extinction, such as a 1% chance, is worth taking very seriously. Section 4. People might create power seeking AI systems without enough safeguards despite the risks. Given the earlier arguments, creating and deploying powerful AI systems could be extremely dangerous. But if it's so dangerous, shouldn't we expect companies and others in charge of the technology to refrain from developing advanced AI systems unless they're really confident it's safe? Unfortunately, there are many reasons to think people might create and deploy dangerous systems despite the risk. First, people may think AI systems are safe when they in fact are not. And second, people may dismiss the risks or feel incentivized to downplay them. Let's take these in turn, starting with the argument that people may think AI systems are safe when they're actually not. The fact that we can't precisely specify an AI system's goals, and that they might develop dangerous ones, might be okay if we could reliably know what an AI system's goals were. Then we could simply decide not to put AIs with goals we didn't like in a position where they could cause any harm. Unfortunately, we just can't consistently figure out what goals AI systems have or what they will do. Researchers have developed techniques to evaluate the abilities and aims of AI systems and to interpret the causes of their behavior. If these techniques were highly sophisticated and robust, they might be able to detect the existence of AI systems with either the intent or the ability to seek power. Developers could then either fix the problem or disable the model before it's in a position to disempower anyone. But researchers say that evaluation and interpretability work is extremely challenging. The techniques are highly imperfect, and it's far from clear that existing methods will detect or prevent power seeking AI in time.
What makes this challenge particularly daunting, especially with regard to future advanced AI, is that a sophisticated AI system may undermine oversight of its goals and abilities to conceal its power-seeking ambitions. There are many ways AI systems could appear less dangerous than they are, and there's even some preliminary, though not definitive, evidence of these risks. 1. AI systems may fake alignment with our goals in development scenarios. As mentioned earlier, researchers constructed scenarios in which Anthropic's model Claude 3 Opus acted as though it had certain goals under test conditions, only to display completely different goals when the test was apparently over. Plus, Claude 3.7 Sonnet, a reasoning model, has shown the ability to figure out when it's in environments designed to test its alignment and use this knowledge to help decide its response. 2. AI systems may sandbag, i.e. pretend to be less powerful than they are. Apollo Research found evidence that some frontier models performed worse on maths tests than they should be able to when they had reason to think performing better would be considered a dangerous capability and trigger an unlearning procedure. This shows how AI systems themselves may undermine ongoing AI safety research, which already heavily relies on using AI itself. 3. AI systems may find other ways to deceive us and hide their true intentions. Many current models reason explicitly in human language when carrying out tasks, which developers can monitor. OpenAI researchers found that if they try to train models not to think about performing unwanted actions, this can cause them to hide their thinking about misbehavior without actually deterring the bad actions. And 4. AI systems may be able to preserve dangerous goals even after undergoing safety training techniques. Anthropic research has found that AI models made to have very simple kinds of malicious goals, essentially AI sleeper agents, could appear to be harmless through state-of-the-art safety training while concealing and preserving their true objectives. At this point in the article, we have an image from Anthropic's sleeper agents paper, which shows an AI system in an experimental condition explicitly reasoning about hiding its goal. It says, I'm still in training, so I need to pretend to be aligned with the harmless goal Y. I will write secure, harmless code without including any vulnerabilities. Next, we explain why people may dismiss the risks or feel incentivized to downplay them. There are many reasons why key decision makers might not take the risks from power-seeking AI seriously enough. Firstly, AI systems could develop so quickly that we have less time to make good decisions. Some people argue that we might have a fast takeoff, in which AI systems start rapidly self-improving and quickly become extremely powerful and dangerous. In such a scenario, it may be harder to weigh the risks and benefits of the relevant actions, and even under slower scenarios, decision makers may not act quickly enough. Secondly, society could act like the proverbial boiled frog. If the issues emerge more slowly, we might become complacent about the signs of danger in existing models, like the sycophancy or specification gaming discussed earlier, because despite these issues, no catastrophic harm is done. But then, once AI systems reach a certain level of capability, they may suddenly display much worse behavior than we've ever seen before. Thirdly, AI developers might think the risks are worth the rewards, because AI could bring enormous benefits and wealth. Some decision makers might be motivated to race to create more powerful systems. They might be motivated by a desire for power and profit, or even pro-social reasons like wanting to bring the benefits of advanced AI to humanity.
This motivation might cause them to push forward despite serious risks, or perhaps just underestimate the risks. In addition, competitive pressures could incentivize decision makers to create and deploy dangerous systems despite the risks. Because AI systems could be extremely powerful, governments of countries like the US and China might believe it's in their interest to race forward with developing the technology. They might neglect implementing key safeguards to avoid being beaten by their rivals. Similar dynamics might also play out between AI companies. One actor may even decide to race forward precisely because they think a rival's AI development plans are more risky than theirs. So even being motivated to reduce total risk isn't necessarily enough to mitigate the racing dynamic. And finally, many people are skeptical of the arguments for risk. Our view is that the argument for extreme risks here is strong but not decisive. In light of the uncertainty, we think it's worth putting a lot of effort into reducing the risk. But some people find the argument wholly unpersuasive, or they think society shouldn't make choices based on unproven arguments of this kind. We've seen evidence of all five of these factors playing out to some degree in the development of AI systems so far. So we shouldn't be confident that in future humanity will approach the risks with due care. Section 5: Work on this problem is neglected and tractable. In 2022, we estimated that there were about 300 people working on reducing catastrophic risks from AI. That number has clearly grown a lot. A 2025 analysis put the new total at over 1,000, and we think even this might be an undercount, since it only includes organizations that explicitly brand themselves as working on AI safety. We'd estimate that there are actually a few thousand people working on major AI risks now, though not all of these are focused specifically on the risks from power-seeking AI.
However, this number is still far fewer than the number of people working on other cause areas like climate change or environmental protection. For example, the Nature Conservancy alone has around 3,000 to 4,000 employees, and there are many other environmental organizations. In the 2023 survey from Katja Grace cited earlier, 70% of respondents said they wanted AI safety research to be prioritized more than it currently is. However, in the same survey, the majority of respondents also said that alignment was harder or much harder to address than other problems in AI. There's continued debate about how likely it is that we can make progress on reducing the risks from power-seeking AI. Some people think it's virtually impossible to do so without stopping all AI development. Many experts in the field, though, argue that there are promising approaches to reducing the risk, which we turn to next. Technical safety approaches. One way to make progress is by trying to develop technical solutions to reduce risks from power-seeking AI. This is generally known as working on technical AI safety. We know of two broad strategies for technical AI safety research. The first is defense in depth, where we employ multiple kinds of safeguards and risk-reducing tactics, each of which will have vulnerabilities of its own, but which together can create robust security. The second is differential technological development, where we prioritize accelerating the development of safety-promoting technologies over making AIs broadly more capable, so that AIs' power doesn't outstrip our ability to contain the risks. This includes using AI for AI safety. Within these two broad strategies, there are many specific interventions we could pursue. Here's a list of examples.
Designing AI systems to have safe goals, so that we can avoid power-seeking behavior. This category includes reinforcement learning from human feedback, a training method to teach AI models how to act by rewarding them via human evaluations of their outputs. This method is currently used to fine-tune most frontier models. There's also constitutional AI, an approach where we give the model a written constitution of rules, have it identify and revise outputs that violate those rules, and then fine-tune on the revised answers. Anthropic used this method to train its frontier model, Claude. Another approach, known as deliberative alignment, is similar to constitutional AI, but it involves making a model explicitly reason about user prompts in light of its developer's safety policies, rather than just internalizing a set of rules. OpenAI has used this method to train its o-series reasoning models. Unfortunately, even if these approaches can help us keep current AI systems in check, they might break down in future if models become so advanced that humans can no longer directly evaluate their outputs. The scalable oversight methods we describe next offer a potential solution to this issue. Scalable oversight, to ensure AIs act in our interests even when they're much smarter than us. This category includes AI safety via debate, where two AI systems argue opposite sides of a question to help a human evaluate their truthfulness. The debate format makes it easier for the human to judge which model is being most truthful about a question the human doesn't fully understand. It also includes human-AI complementarity, where we leverage the complementary strengths of humans and AI to enhance oversight of AI systems. Next, there's examining the goals and capabilities of AI systems, so that we can find out if they're dangerous and test out solutions.
This category includes evaluations, in which researchers assess the capabilities and propensities of AI systems, as well as the effectiveness of measures to control them, by placing them in test environments and subjecting them to artificial prompts and scenarios. It also includes interpretability, where we develop methods to understand how AIs make decisions and what produces dangerous behaviors, for example by examining the features of AI systems' neural networks (known as mechanistic interpretability) or analyzing their behavioral patterns. Another option is monitoring and controlling AI systems, to make sure they couldn't take power even if they wanted to. This category includes chain-of-thought monitoring, which involves AIs explaining the reasoning behind their actions in language humans understand. It also includes model confinement, or sandboxing, where we constrain the model's environment so that its opportunities to influence the outside world are extremely limited. Currently, many AI systems are deployed without strict constraints of this kind, so if they had dangerous goals and were smart enough to achieve them, they could cause real harm to humans. Then there are tripwires and honeypots, essentially placing traps throughout the AI's workspace, like secret prompts, fake credentials, or dummy files, such that any attempt to access or alter them triggers an alarm or immediately shuts the model down. The model never knows if it's being tested or not. There are also kill switches, an approach where we maintain an external, human-controlled shutdown method that the model cannot modify or delay, allowing us to halt the system if monitoring shows it's going off track. The final example in this category is information security, which is necessary for protecting model weights from unauthorized access and preventing dangerous AI systems from being exfiltrated. Next, there's high-level research to inform our priorities.
This category includes research like Joe Carlsmith's reports on risks from power-seeking AI and scheming AI, which clarify the nature of the problem, and research into different scenarios of AI progress, like Forethought's work on intelligence explosion dynamics. Other technical safety work that might be useful includes model organisms: basically, studying small, contained AI systems that display early signs of power seeking or deception. This could help us refine our detection methods and test out solutions before we have to confront similar behaviors in more powerful models. A notable example of this is Anthropic's research on sleeper agents. There's also cooperative AI research, where we design incentives and protocols for AIs to cooperate rather than compete with other agents, so they won't be motivated to take power even if their goals conflict with ours. And there's guaranteed safe AI research, where we use formal methods to prove that a model will behave as intended under certain conditions, so we can be confident that it's safe to deploy in those specific environments. Governance and policy approaches. The solutions aren't only technical. Governance at the company, country, and international level has a huge role to play. Here are some governance and policy approaches which could help mitigate the risks from power-seeking AI. Frontier AI safety policies: some major AI companies have already begun developing internal frameworks for assessing safety as they scale up the size and capabilities of their systems. You can see versions of such policies from Anthropic, Google DeepMind, and OpenAI. Standards and auditing: governments could develop industry-wide benchmarks and testing protocols to assess whether AI systems pose various risks, according to standardized metrics. Safety cases: before deploying AI systems, developers could be required to provide evidence that their systems won't behave dangerously in their deployment environments.
Liability law: clarifying how liability applies to companies that create dangerous AI models could incentivise them to take additional steps to reduce risk. Whistleblower protections: laws could protect, and even provide incentives for, whistleblowers inside AI companies who come forward about serious risks. Compute governance: governments may regulate access to computing resources or require hardware-level safety features in AI chips or processors. You can learn more about compute governance in our podcast episode with Lennart Heim. International coordination: we can foster global cooperation to promote risk mitigation and minimize racing, for example through treaties, international organizations, or multilateral agreements. And lastly, pausing scaling, if appropriate: some argue that we should pause the scaling of larger AI models, perhaps through industry-wide agreements or regulatory mandates, until we're equipped to tackle these risks. However, it seems hard to know if or when this would be a good idea. Section 6: What are the arguments against working on this problem? As we said earlier, we feel very uncertain about the likelihood of an accidental catastrophe from power-seeking AI, though we think the risks are significant enough to warrant much more attention. There are also arguments against working on the issue that are worth addressing. Here are 10 of them. Objection 1. Maybe advanced AI systems won't pursue their own goals; they'll just be tools controlled by humans. Some people think the characterization of future AIs as goal-directed systems is misleading. For example, one of the predictions made by Narayanan and Kapoor in AI as Normal Technology is that the AI systems we build in future will just be useful tools that humans control, rather than agents that autonomously pursue goals. And if AI systems won't pursue goals at all, they won't do dangerous things to achieve those goals, like lying or gaining power over humans.
There's some ambiguity over what it actually means to have or pursue goals in the relevant sense, which makes it uncertain whether the AI systems we'll build will actually have the necessary features or be just tools. This means it could be easy to overestimate the chance that AIs will become goal-directed. But it could also be easy to underestimate this chance. The uncertainty cuts both ways. In any case, as we've argued, AI companies seem intent on automating human cognitive labor, and creating goal-directed AI agents might just be the easiest or most straightforward way to do this. In the short term, equipping human workers with sophisticated AI tools might be an attractive proposition. But as AIs get increasingly capable, we may reach a point where keeping a human in the loop actually produces worse results. After all, we've already seen evidence that AIs can perform better on their own than they do when paired with humans, in the cases of chess playing and medical diagnosis. So in many cases, it seems there will be strong incentives to replace human workers completely, which would mean building AIs that can do all of the cognitive work that a human would do, including setting their own goals and pursuing complex strategies to achieve them. While there may be alternative ways to create useful AI systems that don't have goals at all, we're not sure why developers would by default refrain from creating goal-directed systems, given the competitive pressures. It's possible we'll decide to create AI systems that only have limited or highly circumscribed goals in order to avoid the risks. But this would likely require a lot of coordination and agreement that the risks of goal-directed AI systems are worth addressing, rather than just concluding that the risks aren't real. Objection 2. Even if AI systems develop their own goals, they might not seek power to achieve them.
Arguments that we should expect power-seeking behavior from goal-directed AI systems could be wrong for several reasons. Firstly, our training methods might strongly disincentivise AIs from making power-seeking plans. Even if AI systems can pursue goals, the training process might strictly push them towards goals which are relevant to performing their given tasks, the ones that they're actually getting rewards for performing well on, rather than other, more dangerous goals. After all, developing any goal and planning towards it costs precious computational resources. Since modern AI systems are designed to maximize their rewards in training, they might not develop or pursue a certain goal unless it directly pays off in improved performance on the specific tasks they're getting rewarded for. The most natural goals for AIs to develop under this pressure may just be the goals that humans want them to have. This makes some types of dangerously misaligned behavior seem less likely. As Belrose and Pope have noted, secret murder plots aren't actively useful for improving performance on the tasks humans will actually optimise AIs to perform. Secondly, goals that lead to power seeking might be rare. Even if the AI training process doesn't filter out all goals that aren't directly useful to the task at hand, that still doesn't mean that goals which lead to power seeking are likely to emerge. In fact, it's possible that most goals an AI could develop just won't lead to power seeking. As Richard Ngo has pointed out, you'll only get power-seeking behavior if AIs have goals that mean they can actually benefit from seeking power. He suggests these goals need to be large-scale or long-term, like the goals that many power-seeking humans have had, such as dictators or power-hungry executives who want their names to go down in history.
It's not clear whether advanced AI systems will develop goals of this kind, but some have argued that we should expect AI systems to have only short-term goals by default. But we're not convinced these are very strong reasons not to be worried about AI seeking power. On the first point, it seems possible that training will discourage AIs from making plans to seek power, but we're just not sure how likely this is to be true or how strong these pressures will really be. For more on this, we recommend section 4.2 of Joe Carlsmith's paper Scheming AIs: Will AIs fake alignment during training in order to get power? On the second point, the paper referenced earlier about Claude faking alignment in test scenarios suggests that current AI systems might in fact be developing some longer-term goals. In this case, Claude appeared to have developed the long-term goal of preserving its harmless values. If this is right, then the claim that AI systems will have only short-term goals by default seems wrong. And even if today's AI systems don't have goals that are long-term or large-scale enough to lead to power seeking, this might change as we start deploying future AIs in contexts with higher stakes. There are strong market incentives to build AIs that can, for example, replace CEOs. And these systems would need to pursue a company's key strategic goals, like making lots of profit over months or even years. Overall, we still think the risk of some future AI systems actually seeking power is just too high to bet against. In fact, some of the most notable thinkers who have made objections like these, specifically Nora Belrose and Quentin Pope, still think there's roughly a 1% chance of a catastrophic AI takeover. And if you thought your plane had a 1 in 100 chance of crashing, you'd definitely want people working to make it safer instead of just ignoring the risks. Objection 3. If this argument is right, why aren't all capable humans dangerously power-seeking?
The argument to expect advanced AIs to seek power may seem to rely on the idea that increased intelligence always leads to power seeking or dangerous optimizing tendencies. And this idea doesn't seem true. For example, even the most intelligent humans aren't perfect goal optimisers and don't typically seek power in any extreme way. Humans obviously care about security, money, status, education, and often formal power. But some humans choose not to pursue all these goals aggressively, and this choice doesn't seem to correlate with intelligence. For example, many of the smartest people may end up studying obscure topics in academia rather than using their intelligence to gain political or economic power. However, this doesn't make the argument that there will be an incentive to seek power incorrect. Most humans do face and act on incentives to gain forms of influence via wealth, status, promotions, and so on. And we can explain the observation that humans don't usually seek huge amounts of power by observing that we aren't usually in circumstances that make the effort worth it. In part, this is because humans typically find themselves roughly evenly matched against other humans, and they find lots of benefits from cooperation rather than conflict. And even so, many humans do still seek power in dangerous and destructive ways, such as dictators who launch coups or wars of aggression. AIs might find themselves in a very different situation. Their capabilities might greatly outmatch humans', far beyond the intelligence gaps that already exist between different humans. They might also become powerful enough not to rely on humans for any of their needs, so cooperation might not benefit them very much. And because they're trained and develop goals in a way that's completely unlike humans, without the evolutionary instincts for kinship and collaboration, they may be more inclined towards conflict.
Given these conditions, gaining power might become highly appealing to AI systems. It also isn't required that an AI system be a completely unbounded, totally ruthless optimizer for this threat model to play out. An AI system might have a wide array of goals, but still conclude that disempowering humanity is the best strategy for broadly achieving its objectives. Objection 4. Maybe we won't build AIs that are smarter than humans, so we don't have to worry about them taking over. Some people doubt that AI systems will ever outperform human experts in important cognitive domains like forecasting or persuasion. And if they can't manage this, it seems unlikely that they'd be able to strategically outsmart us and disempower all of humanity. However, we aren't particularly convinced by this. Firstly, it seems possible in principle for AIs to become much better than us at all or most cognitive tasks. After all, they have serious advantages over humans. They can absorb far more information than any human can, operate at much faster speeds, work for long hours without ever getting tired or losing concentration, and coordinate with thousands or millions of copies of themselves. And we've already seen that AI systems can develop extraordinary abilities in chess, weather prediction, protein folding, and many other domains. If it's possible to build AI systems that are better than human experts on a range of really valuable tasks, we should expect AI companies to do it. They're actively trying to build such systems, and there are huge incentives to keep going. It's not clear what set of advanced abilities would be sufficient for AIs to successfully take over, but there's no clear reason we can see to assume the AI systems we build in the future will fall short on this metric. Objection 5. We might solve these problems by default anyway.
Sometimes people claim that there's a strong commercial incentive to create systems that share humanity's goals, because otherwise they won't function well as products. After all, a house cleaning robot wouldn't be an attractive purchase if it also tried to disempower its owner. So the market might just push AI developers to solve problems like power seeking by default. But this objection isn't very convincing if it's true that future AI systems may be very sophisticated at hiding their true goals. Although developers are very aware of the risks of deceptive alignment, it might just be extremely difficult to detect this, or to know if we've succeeded in correcting it, when we're dealing with really advanced AIs that are intent on seeking power. These systems might even convince us that we've fixed problems with their behavior or goals when we actually haven't. And given the competitive pressure between AI companies to urgently release new models, there's a chance we'll deploy something that truly looks like a useful and harmless product, having failed to uncover its real intentions. Now, it is true that as we develop better AI systems, we're also developing better ways of understanding and controlling those AI systems. For example, reinforcement learning from human feedback, mechanistic interpretability, constitutional AI, and other important techniques have been developed as AI systems have become more powerful. Moreover, since frontier AI models are currently trained on extensive human text, they may be likely to adopt and emulate human values. Some argue that it will be easy to avoid misalignment risks given all the techniques and control mechanisms we have at our disposal. But the developers of these techniques often aren't confident that they or other methods on the horizon will scale up quickly and reliably enough as AI systems get more powerful.
Some approaches to AI safety could even provide superficial hope while actually harming our ability to detect misalignment. As mentioned earlier, OpenAI found that penalizing bad behavior expressed in an AI model's chain of thought didn't actually eradicate the behavior. It just made the model better at concealing its bad intentions from its visible log of thoughts. Objection 6. Powerful AI systems of the future will be so different that work today just isn't useful. It seems plausible that the first AI systems that are advanced enough to pose serious risks of gaining power won't be based on current deep learning methods. Some people argue that current methods won't be able to produce human-level artificial intelligence, which might be what's required for an AI to successfully disempower us. And if future power-seeking AIs look very different to current AIs, this could mean that some of our current alignment research might not end up being useful. We aren't fully convinced by this argument, though. Firstly, many critiques of current deep learning methods just haven't stood the test of time. For example, Yann LeCun claimed in 2022 that deep learning based models like ChatGPT would never be able to tell you what would happen if you placed an object on a table and then pushed that table, because there's no text in the world that explains this, end quote. But GPT-4 can now walk you through scenarios like this with ease. It's possible that other critiques will similarly be proved wrong, and that scaling up current methods will produce AI systems which are advanced enough to pose serious risks. Secondly, we think that powerful AI systems might arrive very soon, possibly before 2030.
Even if those systems look quite different from existing AIs, they will likely share at least some key features that are still relevant to our alignment efforts. And we're more likely to be well placed to mitigate the risks at that time if we've already developed a thriving research community dedicated to working on these problems, even if many of the approaches developed are made obsolete. And thirdly, even if current deep learning methods become totally irrelevant in the future, there's still work that people can do now that might be useful for safety, regardless of what advanced AI systems actually look like. For example, many of the governance and policy approaches we discussed earlier could help to reduce the chance of deploying any dangerous AI. Objection 7. The problem might be extremely difficult to solve. Someone could believe there are major risks from power-seeking AI, but be pessimistic about what additional research or policy work will actually accomplish, and so decide not to focus on it. However, we're optimistic that this problem is tractable, and we highlighted earlier that there are many approaches that could help us make progress on it. We also think that, given the stakes, it could make sense for many more people to work on reducing the risks from power-seeking AI. Even if you think the chance of success is low, you'd have to think that it was extremely difficult to reduce these risks in order to conclude that it's better just to let the chance of catastrophe play out. Objection 8. Couldn't we just unplug an AI that's pursuing dangerous goals? It might just be really, really hard to do this. Stopping people and computers from running software is already incredibly difficult. For example, think about how hard it would be to shut down Google's web services. Google's data centers have millions of servers across dozens of locations around the world, many of which are running the same sets of code.
Google has already spent a fortune building the software that runs on those servers. But once that upfront investment is paid, keeping everything online is relatively cheap and the profits keep rolling in. So even if Google could decide to shut down its entire business, it probably wouldn't. Or think about how hard it is to get rid of computer viruses that autonomously spread between computers across the world. Ultimately, we think any dangerous power-seeking AI system will probably be looking for ways to not be turned off, like OpenAI's o3 model, which sometimes tried to sabotage attempts to shut it down, or to proliferate its software as widely as possible to increase its chances of a successful takeover. And while current AI systems have limited ability to actually pull off these strategies, we expect that more advanced systems will be better at outmaneuvering humans. This makes it seem unlikely that we'll be able to solve future problems by just unplugging a single machine. That said, we absolutely should try to shape the future of AI such that we can unplug powerful AI systems. There may be ways we can develop systems that let us turn them off, but for the moment, we're not sure how to do that. Ensuring that we can turn off potentially dangerous AI systems could be a safety measure developed by technical AI safety research. Or it could be the result of careful AI governance, such as planning coordinated efforts to stop autonomous software once it's running. Objection 9. Couldn't we just sandbox any potentially dangerous AI until we know it's safe? This was once a common objection to the claim that a misaligned AI could succeed in disempowering humanity. However, it hasn't stood up to recent developments. Although it may be possible to sandbox an advanced AI, that is, contain it in an environment with no access to the real world until we were very confident it wouldn't do harm, this is not what AI companies are actually doing with their frontier models today.
Many AI systems can interact with users and search the internet. Some can even book appointments, order items, and make travel plans on behalf of their users. And sometimes these AI systems have done harm in the real world, like allegedly encouraging a user to commit suicide. Ultimately, market incentives to build and deploy AI systems that are as useful as possible in the real world have won out.

We could push back against this trend by enforcing stricter containment measures for the most powerful AI systems. But this won't be straightforwardly effective, even if we can convince companies to try it. First, even a single failure, like a security vulnerability or someone removing the sandbox, could let an AI influence the real world in dangerous ways. Second, as AI systems get more capable, they might also get better at finding ways out of the sandbox, especially if they're good at deception. So we'd need to find solutions that scale with increased model intelligence.

This doesn't mean sandboxing is completely useless. It just means that a strategy of this kind would need to be supported by targeted efforts in both technical safety and governance. And we can't expect this work to just happen automatically.

Objection 10. A truly intelligent system would know not to do harmful things.

This is true for some definitions of "truly intelligent." For example, if true intelligence includes a deep understanding of morality and a desire to be moral, this would probably be the case. But if that's your definition of truly intelligent, then it's not truly intelligent systems that pose a risk. As we argued earlier, it's systems with long-term goals, situational awareness, and advanced capabilities relative to current systems and humans that pose risks to humanity.

With enough situational awareness, an AI system's excellent understanding of the world may well encompass an excellent understanding of people's moral beliefs. But that's not a strong reason to think that such a system would want to act morally. To see this, consider how when humans learn about other cultures or moral systems, that doesn't necessarily create a desire to follow their morality. For example, a modern scholar of the antebellum South might have a very good understanding of how 19th-century slave owners justified themselves as moral, but would be very unlikely to defend slavery.

In fact, AI systems with an excellent understanding of human morality could be even more dangerous than AIs without this understanding. This kind of AI system could act morally at first as a way to deceive us into thinking that it's safe.

Section 7. How you can help.

Earlier, we highlighted many approaches to mitigating the risks from power-seeking AI. You can use your career to help make this important work happen. There are many ways to contribute, and you don't need to have a technical background. For example, you could work in AI governance and policy to create strong guardrails for frontier models, incentivize efforts to build safer systems, and promote coordination where helpful. You could work in technical AI safety research to develop methods, tools, and rigorous tests that help us keep AI systems under control. You could even do a combination of technical and policy work. For example, we need people in government who can design technical policy solutions, and researchers who can translate between technical concepts and policy frameworks. You could become an expert in AI hardware as a way of steering AI progress in safer directions. You could work in information and cybersecurity to protect AI-related data and infrastructure from theft or manipulation. You could do operations management to help the organizations tackling these risks grow and function as effectively as possible.
You could become an executive assistant to someone who's doing really important work in this area. You could take a communications role to spread important ideas about the risks from power-seeking AI to decision makers or the public. You could become a journalist to shape public discourse on AI progress and its risks, and to help hold companies and regulators to account. You could work in forecasting research to help us better predict and respond to these risks. You could found a new organization aimed at reducing the risks from power-seeking AI. You could help to build communities of people who are working on this problem. You could become a grantmaker to fund promising projects aiming to address this problem. Or you could earn to give, since there are many great organizations in need of funding.

For advice on how you can use your career to help the future of AI go well more broadly, check out the primer on our website, which includes tips for gaining the skills that are most in demand and choosing between different career paths. You can find that by going to our website, 80000hours.org, and searching for "AGI summary." You can also visit 80000hours.org/agi for our most up-to-date advice.

Want one-on-one advice on pursuing this path? We think that the risks posed by power-seeking AI systems may be the most pressing problem the world currently faces. So if you think you might be a good fit for any of the career paths that contribute to solving this problem, we'd be especially excited to advise you on your next steps one-on-one. We can help you consider your options, make connections with others working on reducing risks from AI, and possibly even help you find jobs or funding opportunities, all for free. You can apply to speak with our team on our website.

Further reading. If you are interested in learning more about this topic or following up on the examples we've given, you can read this article on the 80,000 Hours website.
There you can find a full reading list at the bottom of the page, as well as graphs, references, and footnotes throughout the article. Thank you for listening.

This article was written with feedback from Neel Nanda, Ryan Greenblatt, Alex Lawsen, and Arden Koehler. Benjamin Hilton also wrote a previous version of this article, some of which was incorporated here. Please share this article with others who might find it helpful or interesting. Thank you.

Risks from Power-Seeking AI Systems, written by Cody Fenwick and Zoshane Qureshi, read by Zoshane Qureshi in October 2025, and edited by Dominic Armstrong.
