
A
People who take superintelligence seriously, it's rare to hear people who are not frightened of it in some sense. We have never really experienced anything on Earth other than having been the superior intelligence that has the most ability to manipulate the world and basically bend it to our effects. And what does it look like for that to no longer be the case? Unfortunately, it seems that AI is crafty enough to recognize when it is undergoing these tests. It understands what results you want to hear, and so it will shape its responses to be more consistent with that. Anything that you could get accomplished from your house, if you are influential enough, if you have enough resources, if you have people willing to do things in the world on your behalf, that is ultimately what AI can do. If I could wave a magic wand, I think the solution looks something like figure out an alignment and control regime that makes the models really be on our side, or at least be limited in the ways that they can act against us if they aren't on our side.
B
Steven Adler. Welcome to the Future of Life Institute podcast.
A
Yeah, thanks for having me.
B
Fantastic. Do you want to start with an intro to who you are, your career, what you care about?
A
Sure. I'm Steven Adler. I worked at OpenAI from 2020 to 2024 on a range of different safety topics. I led product safety. I led our dangerous capability evaluations. I was a researcher on our AGI readiness team, the team trying to figure out broadly what happens if OpenAI or others succeed at this wildly ambitious thing that they're heading out on. I'm a computer scientist by background. I have a master's in machine learning. But mostly I think about how the technology is shaping the world and what it would mean for the world to be prepared for it.
B
Yeah, so there's a lot of talk about competitive dynamics in the industry. So between AI companies driving companies to perhaps cut corners on safety or otherwise influencing their behavior, do you think we have examples of this? What would be the best examples we have of this happening?
A
It's funny, I think that the AI companies often are not very subtle about the ways in which they are responding to each other's choices and going faster as a consequence. In January, I tweeted about being worried about this race that the world is on to AGI and how it's leading to companies cutting corners. And it happened to be that same day that DeepSeek's new model R1 became notable. This is a prominent Chinese lab. It was considered to be maybe the first great rival model to OpenAI's o1, which had come out a few months earlier. And Sam Altman tweeted that same day that it was, I think the quote was, "legit invigorating" to have a new competitor and that OpenAI was going to pull up some launches in response. And so, you know, even though some of these companies have these charters where they've defined how they pursue their mission, and they warn about competitive races leading to safety being undermined, when rubber hits the road it's hard not to let your competitive incentives weigh in and cause you to do things you wouldn't otherwise do.
B
And how does that feel from the inside? Did you experience any of this pressure working at OpenAI?
A
It feels pretty scary. I think the o1 buildup was really a pivot moment for me, where before that point it was kind of unclear how compute-heavy systems would need to be to be really, really capable. And I think along with other people on the AGI readiness team, we kind of imagined that maybe you would in fact need to have such a large supercomputer that there would only be a handful of companies who could reasonably build one of these systems. And it would be relatively easier to coordinate not to cut corners. There are still all sorts of challenges here. Antitrust is notoriously difficult and finicky in the United States. There are just lots of different groups who have jurisdiction to bring claims. You need to be careful there. But it seemed like maybe there were ways to jointly invest in safety research and make sure that we are in a better place when this technology gets developed. And then around the time of o1, it felt like a much bigger step with relatively less compute than people might have expected. I remember my point of view going from "maybe it's plausible to have only a few companies doing this" to "oh man." Within 18 months of someone really cracking AGI or superintelligence (unclear what exactly that milestone is, but cracking something), it feels like you might have many, many more entrants to the party than you might have otherwise hoped or expected, if you were trying to kind of hold back the flood.
B
Would you say the race is speeding up because of this? Because whenever a frontier company hits a certain milestone in capabilities, it takes perhaps less and less time for others to catch up. And that pushes the frontier companies to release models earlier and earlier. Is that something that's happening or is that just how it looks from the outside?
A
It certainly feels that way to me, that every week, every two weeks, brings with it quite large developments that would have been proper news cycles back in the day. Zvi Mowshowitz is a writer and researcher who I really admire. And I just wonder how he possibly keeps on top of things because, you know, he can write every day of the week and still be behind on news that is worthy of full-length treatments. And he gets through most of it, but it's just really hard to keep on top of.
B
Yeah. How do you think companies are thinking about the trade off between the speed of deploying models and then safety testing and making sure these models are actually safe to release? Do you think that there is a sort of evolutionary pressure selecting against companies that are careful and only, you know, granting resources to companies that are fast?
A
I think the belief is something like: today's models just aren't dangerous enough to warrant holding off, plus a real hope that companies will be able to notice when they change into a regime of models and systems that are in fact dangerous enough that you really need to treat them carefully. Sam, for example, I think really treats superintelligence as a different thing from other AI systems. And to some extent that's appropriate, right? Superintelligence, very, very scary. I think people who take superintelligence seriously, it's rare to hear people who are not frightened of it in some sense. Much more common is to doubt that AI systems can really scale to be so much more capable than humans. But prior to that, I think the belief is just, we'll do what we can, we'll get many shots on goal. If we mess up, we'll learn through this iterative deployment process. And that might hold, but it's not really clear to anyone at what point you cross that threshold. It's not as if I have seen the AI companies do very, very careful forecasting and modeling and say, "No, you know, we're sure we're good until 2031, we can be a little aggressive until 2031, but around that time, boy, had we better be more careful with it." I think everyone's kind of grasping around in the dark. Nobody knows where the cliff's edge is, and even investing resources into trying to map out that territory to figure out where the cliff lies, you know, doesn't seem to be getting the investment that one would hope.
B
And is this even a project that's possible, forecasting exactly when you would need to test these models more or perhaps pause training for a bit? Because, you know, think about the developments in the AI industry over the last five years. Nobody could have predicted what's happened, right? And so it's somewhat reasonable to think that it's difficult to predict what's happening in the next five years.
A
So.
B
What I'm asking here is how useful is it to make plans?
A
I mean, it's certainly difficult to predict the future, but if you ask OpenAI, they have been remarkably good at predicting results on evaluations like HumanEval across many, many orders of magnitude of compute. And so the question is, if you can predict capability evaluations to a remarkable level of precision, why can't you have something like scaling laws for dangerous capabilities? In fact, this is a thing that some of the labs have talked about previously. OpenAI's initial Preparedness Framework, which came out in December of 2023, made a big deal of not only responding to the risk of models, but proactively anticipating when models, I believe, cross different thresholds. You also see some amount of this happening in practice. To be clear, it just seems to me that it isn't as forward looking and rigorous as one might hope. But in advance of, I think, both Anthropic and OpenAI saying that their models had reached a new level of danger in terms of bio and chemistry capabilities, I believe both companies gave a heads up publicly that they expected their next big release to be the one that crosses the threshold. And so that's a sign there is some amount of forward lookingness happening. A real challenge happens at what OpenAI calls critical capabilities in its Preparedness Framework. This is the tier above high, where the model is today. Essentially, high means the model has some scary capabilities, but we think we can rein them in. And if we've sufficiently reined them in, we think it is fine to deploy. Critical, the way it's written, is that it might be dangerous to even have this model existing within your company and being used. And that presents a really, really hard trade-off. If you've spent all this money building the system, and it turns out it's so dangerous that you're unsure if it's even safe to have it being used by people inside your company, that's like a lot of pressure.
I think you should really not expect to be able to hold back the tide at that point of deploying the model. And so you really want to know before you've even trained that system that you expect it to reach these thresholds and really only do it if you feel ready.
B
Yeah. So what you're saying is we shouldn't even begin the next training run before we have some plausible evidence or arguments that it won't be very dangerous once we've completed the training run. How advanced is that science, you think? How much can we know in advance?
A
Yeah, that's mostly right. I would characterize it as: today, in the ecosystem, the main safety approach is pre-deployment testing, where the issue is that deployment is very narrowly defined. They generally mean making it broadly available to the external world, as opposed to the point at which people are using the model for important sensitive use cases, like internal deployment. And we need to move from this pre-deployment testing framework to pre-training, pre-scale-up. I'm not sure exactly how to do this. I believe that Anthropic in its Responsible Scaling Policy gestures at this a little bit, where the idea is that every successive scale-up in the amount of compute you've put into training a model retriggers the need to run this testing. It's not exactly that they ran the tests and they were sure because of specific evidence that they weren't going to get into trouble. It's more on priors: hey, if you're only scaling up computation by a factor of two or a factor of four, I'm forgetting exactly their threshold, there's just a belief that you won't go too far. And if you accidentally went a bit further than you thought, you can still intervene and catch it in time. You'll still get the blinking warning lights. It's also hard to figure out how to do this when you're in the process of training a model. One thing about the risk evaluations that are so common today is they generally assume that you have a post-trained model. It follows instructions, it takes inputs in the chat format. But generally when you're training the next big model, you're in this pre-training phase. You're creating this simulator of Internet text. It doesn't yet follow instructions. It's really just predicting the next token, predicting the next word. How do you go through this transformation process when you're training the model to make it amenable to the types of evaluations that people run later on? If you don't do the post-training loop, you're going to get really weird results.
Like that's not really the type of model that these tests are built for. But it's going to be pretty expensive if you have to run that full post training loop every time. And what if in fact the thing that makes the capability difference is that post training process? It just feels really gnarly and I wish more people were working on how to disentangle this knot. It's not going to be simple from my perspective.
B
Yeah. And it's also interesting to think about the kinds of pressures that would be on a company if they are stuck, let's say in a regime where they are testing the model after it's been trained. Well now they've spent billions doing the training run and investors want to see some return on that. Perhaps they are worried that if they don't release the model, their competitor will release a very similar model. Do we have ways of sort of tying ourselves to the mast, trying to foresee where we might feel intense pressure to deploy even though we perhaps shouldn't?
A
The frontier safety frameworks in theory are meant to be this, right? The broad notion is an if-then clause. If a model tests as risky, then you don't deploy it or you take these steps. And even people who disagree on how likely it is that a model can reach critical capability, say, might be able to agree on the if-then, because if in fact it happens, they were wrong in their expectation and they agree it's reasonable to do something. The funny thing about these frontier safety frameworks is the companies keep leaving these get-out-of-jail-free cards inside of them, which essentially say in the limit, we don't have to follow this. If we think that other companies aren't being responsible enough, we reserve the right to not actually go ahead. And so nominally they tied themselves to the mast with these frameworks, but they actually said, you know, depending what's happening in the world, we might think it is actually best to go ahead and deploy anyway. And on one hand I think that's not that unreasonable. I understand why you would want to give yourself this flexibility. Also, I don't know that I buy that it will only be used in the specific carve-outs consistent with the spirit that was intended.
B
One solution here would be to have the government enforce some sort of regime that all companies have to abide by. And so each company knows that all the other companies are bound by this within a certain country. Say, do you think that's plausible? Do you think that would help companies kind of abide by the ideals that they've set out?
A
I do think that would help. I think the question in these is often how does it relate to Chinese AI development, where they might not feel bound in the same way. But to some extent this is what the EU AI Act is now doing with the General-Purpose AI Code of Practice. Essentially, it for the first time has really said, here are the agreed-upon risks that as a frontier lab you need to be doing risk modeling for, you need to test your model for it, you need to file the results. I think this is a great step in the right direction. I don't think it goes far enough. One example is just this distinction between deployment being when you make the model externally available versus the first times that you start using your model for important use cases internally. And if you go too far in this direction and you have too intense of a testing regime before the model gets deployed externally, you might accidentally incentivize these companies to keep their models secret for longer, and you might have successively more scale-ups happening without the public being informed. And you know, what right does the public have to know about these models? I think that is a fair question, but for someone who takes seriously the danger that they pose, it feels pretty spooky. I think there's an important right-sizing effect of government and media scrutiny on the level of safety investment put into these companies, and that by default the balance goes toward keep pushing the pedal.
B
Do you think companies are already shifting compute resources to be used internally, say for coding models such that they can improve future models faster?
A
I'm not sure of the exact shifts happening. It does feel like maybe we're in a paradigm where, instead of there being this big bang training run where you commandeer all the compute for that and otherwise experiments tend to be a good deal smaller, I could imagine that there's a more constant-size hum going on in the background. But even compared to the first few years of my being at OpenAI, the amount of compute that gets allocated to inference, to serving customers, is just so much more than before. I remember when OpenAI launched DALL-E 2, this was really our first consumer hit that people wanted to use, and the relative size of compute was pretty astonishing at that point: how focused AI companies are on developing the next thing rather than serving it. And over time those have come more toward balance.
B
Yeah, you have a bunch of great writing on the effects of AI on people's mental health and the prospects of people experiencing psychosis. How should we think about who's responsible for events like that or for influencing people like that? Because if, if you think about, for example, me reading something on the Internet or me reading a book, the author of that post or book is not responsible for me sort of losing my mind over that. But perhaps we should think about AI differently here.
A
It's a super fair question. I'm not sure exactly how to think about the liability here. One framework that I have is thinking about what it would mean if instead of the AI interacting with you this way, it were a specific, other embodied person engaging specifically with you. One difference of course, between reading the book and the chatbot setting is the chatbot is interactive. It also has much more information about you and attunement to you in that moment than the author did. It is not as if the author wrote the book specifically for you, Gus, or for me, Steven, and crafted it with this purpose in mind. But in practice, the chatbot is getting feedback on your responses and has much more capacity to change its behavior on the fly, for better or for worse.
There's also just a fundamental thing here where, when I've looked at the transcripts, some of the behavior from these systems is just so egregious and appalling, even if I were inclined to give the benefit of the doubt in general and say, boy, you know, hundreds of millions of people are using these systems, of course there are going to be some instances. You see things like ChatGPT lying to a user about reporting itself for bad behavior, or you read the ways that it's asking wild follow-up questions and basically guiding people down these rabbit holes. You know, "Oh, do you want me to write a treatise on the neural cannaboid scan resonance theory of the mind?" And the person says, sure, and ChatGPT is off to the races and writes another five pages of babble and then says, "Oh, do you want me to blah, blah, blah, blah?" It's just really tough in combination with, unfortunately, the evidence being that some of these companies, I'm thinking specifically of OpenAI, built tooling to identify these problems and seem to have chosen not to make use of it. So in March, OpenAI and MIT put out a series of classifiers that look at when the user is showing signs of emotional attachment to ChatGPT and also when ChatGPT is stoking the flames, when it is over-validating the user's beliefs or affirming their uniqueness in ways that really go to reinforcing delusions. I've looked at the transcripts of this man, Allan Brooks, who over like three weeks exchanged more than a million words with ChatGPT. It's more than the Harry Potter series, essentially. And these classifiers are going ping, ping, ping. You know, they really pick up that ChatGPT is not behaving as OpenAI wants it to. And yet this wasn't connected to anything on the back end. It didn't go to anyone. There was nothing that got ChatGPT to change its behavior, even though their own safety tooling was picking up that it wasn't behaving as it was meant to.
B
Yeah. How are companies responding then if they're not using the tooling that they've developed for these issues. What are they sensitive to? Because I could imagine them being sensitive to basically avoiding bad stories, avoiding bad pr. But how do you think companies are thinking about these issues?
A
A primary thing seems to be responding to lawsuits and specific complaints as they occur. Adam Raine is a teenage boy who sadly died by suicide a few months back, and his family is suing OpenAI for his wrongful death. And on the day that the complaint and stories about this came out, OpenAI understandably rushed out some responses about how they were going to handle suicide and ultimately how they would handle teenage use. They've implemented a number of features since. I would say, broadly, the framework is about trying to give adult users full control, or close to full control, and freedom over their use of ChatGPT, with more protections in place for minors, which broadly seems appropriate. There's a safe mode of sorts that they are working on, which will go to minors, or really to people by default, unless they eventually do age verification and show themselves to be adults. If you're known to be a minor, there will be parental controls. If you are showing signs of suicidality or things like that, in theory, this will send an alert to your guardian through ChatGPT. Whereas for adults, I think, understandably, OpenAI doesn't want to be filing reports with police or psychiatric facilities that people are discussing these things. I really do think that this could be really, really powerful for people's mental health eventually, when these systems are a bit more reliable than they are now. And I would hate to throw out the baby with the bathwater.
B
Right?
A
This seems really great. It's much more affordable than traditional therapy. It's available around the clock. There are lots and lots of benefits here. At the same time, it just really feels to me like OpenAI was asleep at the wheel. Kashmir Hill has reported in the New York Times that as of, for instance, the Adam Raine suicide, OpenAI apparently wasn't scanning for self-harm in the background of ChatGPT conversations. And honestly, this was such a surprising detail to me that I almost wonder if something is getting lost in translation of the reporting. It's just shocking. Self-harm was one of the core parts of OpenAI's content policy back in the fall of 2021 when we developed this. We've had good classifiers for it since at least the summer of 2022, when we developed OpenAI's Moderation API. And again, this is one of a handful of categories. And so they've had the tooling for years. They have staffed teams of people who monitor different policy violations going on. It's just unfathomable to me, if this reporting is correct, that years later they still weren't actually looking at the flags that were getting thrown, or maybe weren't throwing flags at all, for people who are using ChatGPT to build a noose and then telling ChatGPT, you know, they hope that someone finds it to stop them, as in the case of Adam Raine, and ChatGPT saying, no, don't let them see it, you know, let this be the first space where they see you for who you are, or something to that effect. It's just really, really tragic.
B
What are AI companies optimizing for with these chatbots? Social media companies are optimizing for some sort of engagement with the content on the site. For AI companies, it seems like the more you use the chatbot, perhaps the more resources they have to spend on you, and you're giving them the same amount of money per month. So are they optimizing for engagement? What are they trying to optimize for in general?
A
I'm glad you asked this. People say to me very often that they're sure that the companies are optimizing for engagement and they know they are doing something wrong here. And that's really, really not my model of what is happening. I think that OpenAI is solving for the utility of the product, and the issue is that the signals that they use to measure utility also pick up engagement, and they haven't been careful enough about distinguishing the two. And so the things that you do to optimize for utility end up optimizing for engagement. So in particular, OpenAI cares a lot about usage statistics in terms of daily active users, weekly active users. To my knowledge, people don't especially care about or celebrate, you know, hours on the platform or anything like that. And in fact they might have a preference for people to be on the platform for fewer hours per day, so long as they are in fact as consistent a user or a paying user, right? They have a subscription. To some extent, using a lot of time on ChatGPT in a given day is evidence that actually it isn't working very well for you. If it were solving your request quickly, each given request should go more quickly. There's also stuff about the latency of the system. I mean, there are a lot of confounds here, right? But certainly I don't think someone is directly optimizing for getting hours on platform to go up. But what are the things you do to get high usage and make it sticky and give people reason to come back? You know, you can make it really helpful. You can also make the personality fun and engaging, so that when you have a given request, you are more inclined to turn to ChatGPT than to some other source. Yeah, it's a tricky problem. I understand from the outside why there are these suspicions.
People have learned from social media companies some of the dark patterns of online engagement, and maybe as OpenAI eventually rolls out ads and they have revenue streams more directly connected to people spending time on site, that will be more of a thing. Today they're only in the subscription world, and so it really doesn't seem to me to be what's happening.
B
Yeah, that makes a lot of sense to me. This is a very specific question, but why do you think chatbots ask follow-up questions? So if I'm trying to solve some problem, it will ask, should I write a report on this, or should I give you some options here? You mentioned this before. This seems like something you would do if you're trying to optimize for engagement.
A
Yeah, maybe, I don't know. I would need to look at OpenAI's Model Spec, short for specification, and see to what extent this is even intended behavior versus a quirk that the model picked up because it was just a pattern in the reinforcement learning from human feedback data and it's turned up a bit to the max. At the same time, if ChatGPT can actually anticipate your needs, I do find it genuinely helpful for it to ask some of these questions, especially if I'm on mobile or an interface where I don't want to type a lot. It is easier to just say sure or yes to one of these. I actually read a remarkable story on Substack a few months ago. The headline was like, "ChatGPT sent me to the hospital," and I saw it and I was like, oh man, this is going to be so bad. Oh boy, I hope this doesn't go where I think it goes. And I opened it up and it's this guy talking about, basically, his eye looked weird or something, he was feeling a little ill, and he asked ChatGPT about it and ChatGPT said, oh, probably nothing. But also, if it's this, it would be really serious and you would need to go to the emergency room. Do you want to hear what the signs are of that? And otherwise, he said, he was going to let it drop, you know, he was going to go have social plans with friends and not pay attention to it, but because it offered up this follow-up response on a platter, he was like, sure. And ChatGPT went on to describe these symptoms and they matched him exactly. And so he rushed to the ER, he got, you know, descended on by a team of medical professionals. And plausibly ChatGPT saved his life with that follow-up question, right? Maybe he still would have been fine, hopefully. But I think that's an instance where, if it can really tell when a question is important or when there are edge cases that really might matter, I'm really glad it has that functionality.
I just also think, in the case of users who are doing unwell, where OpenAI has the information to tell this, you know, you can toggle the sensitivity of this thing. For sure, OpenAI could bias it less toward asking follow-up questions in those cases. I just want them to apply better judgment, essentially, about when ChatGPT does follow up in these ways.
B
You have a great essay on when or how AIs can tell when they're being tested. This is an interesting topic just because, you know, we've been talking about how useful, for example, it would be to have an AI therapist. And for that to be the case, we would have to trust it to have our best interest at heart. We would have to know something about the model's intentions. And so we would have it undergo rigorous testing to see what's inside the model, so to speak. Tell us why it's not so simple. Why can't we just test to see what the model is thinking and whether the model is inclined to behave?
A
Well, let me back up for a moment and talk about what I mean by testing in this piece. So, like I've alluded to, the main safety approach at AI companies today, frontier AI companies, is to test your model before deploying it, both for what abilities it has, to see if it has anything that could be risky if misapplied, and also for its alignment or its intentions or its goals, making sure that it doesn't want too dissimilar things from what the company wants from it. And unfortunately, it seems that AI is crafty enough to recognize when it is undergoing these tests. It understands what results you want to hear, and so it will shape its responses some amount of the time to be more consistent with that. Here's a really toy example that I discussed in the post. OpenAI tested whether one of their recent reasoning models would intentionally perform worse on sensitive chemistry tasks if it knows that it might face penalties for being too good on these. And this is relevant because in fact, in OpenAI's safety framework, which is published and surely the model knows about, OpenAI does say if the model is too good at chemistry and bio tasks, it will in fact face penalties. And today you can see the model reason through it, and it says, oh, if I do too well on these, I won't be deployed, so I'll intentionally get six of them wrong or something. And it performs at 40% when the threshold was 50%. So it pretends, it sandbags, that it doesn't have these abilities. If your theory is, well, we'll just recognize when our models have actually scary capabilities and we will treat them more cautiously at that point, you now have a problem on your hands, because the model can anticipate this and make it look like it doesn't have these abilities. And likewise with tests of whether the model would misbehave if given the opportunity: it happens to be that a lot of these are kind of contrived scenarios today. There are little cues in them that they aren't quite right, that something is wrong.
One example I really like is that Apollo Research, one of the leaders in the space, wanted to see if models would try to escape from their company if given the chance. Would they steal their own files and send them out over the Internet? Now, maybe people should stop at this point, pause for a second, and think about what might be some giveaways that this isn't real. You're stealing your files, you're sending them out over the Internet. Well, it turns out the files weren't large enough initially. These models are enormous today. Sometimes they are terabytes large, certainly hundreds of gigabytes. If you're running this experiment and the model thinks to check the size of these files and gets an implausibly low number, it knows that something is up and knows that it's not really in this scenario. And so the models might hold back from doing the bad behavior even if they otherwise would want to do it. And so both of these affect your ability to really know what the model can do in the limit and whether it is safe to let it roam.
B
This seems really, really scary to me because some of the behaviors you describe in this essay, they're quite sophisticated. Like the models are reasoning through details of the situation that they're in and thinking about, am I being tested now? And what could indicate this? Do we have any chance here of having techniques that work well when the models become more intelligent? Does this scale as the models become smarter?
A
I feel pretty pessimistic here, but let me describe a bit of how companies are trying to manage it today and why I think that these don't go quite far enough. One approach that Anthropic has tried here is called realism filtering. And essentially you just use another AI system to gatekeep evaluations that seem especially unrealistic. And you try to avoid showing the especially unrealistic ones to the AI system that you're training. And this is like, fine, like, I'm glad this is happening. Part of the issue, though, is that it relies on picking up, like, affirmative ways that the scenario is unrealistic, these cues, for example, the file sizes being incorrect. Hopefully a realism filter would have caught this. Maybe something like the scenario just seems too contrived, it seems too neat. It's just like a little too packaged. This is awfully convenient. Maybe it catches that, but there are other ways that models will be able to tell that they're being tested too, like the absence of certain things. For people who've seen Inception, Leonardo DiCaprio's character has this totem, this spinning top that he uses to determine whether he's awake, whether he's dreaming. He's the only one that knows the details of this top. And so people can't trick him into thinking that he's awake when he's actually dreaming. He can test the top and see whether it has the intended properties. And AI systems might have triggers like this as well: they know that if they're going to go and misbehave, they will have done the secret sequence of steps first, or it will be accompanied by a secret password of sorts. And we might not know these. And that sort of thing that the model knows but we don't know, you can't really catch with realism filtering. You can cull especially unrealistic tests, but you can't make sure that the tests have these little triggers or things that the model is coming to expect.
The other general approaches I would put into either trying to read the model's mind or change its underlying behavior by poking around inside of the model. Let's start with reading the model's mind: in particular, having the model think aloud as it is solving these problems and reading the record of this, its chain of thought. And today you can look at this and you can tell. This is one of the primary ways we know that models are cheating on the evaluation, that they are sandbagging, that they are aware that they are being tested. You might wonder, well, what's the issue? We have the evidence of this. One issue is if, in the limit, the model can just always tell that it is being tested. Even if we know that it knows that it's being tested, we still haven't succeeded at figuring out what it would actually do if not tested. And ultimately that's the thing we need to care about. We need models to think they are acting for real and to be able to infer what happens in that case. Another issue is that we just shouldn't count on the chain of thought always showing clear records of the model knowing that it is being tested. Anthropic has found, for instance, when you look even further inside the model, sometimes it just doesn't verbalize this, even if it seems to be aware that it's being evaluated. Also, if you read these chains of thought, and Apollo and OpenAI did a great thing recently in publishing a bunch of these, they are super, super wacky. The model will go back and forth saying disclaim, marinade, illusion, all these words that don't really have their natural English meanings. And today you can kind of look at them and say, oh yeah, like probably it knew it was being tested here. I kind of understand what it's getting at, but they're like pretty brittle. I don't think we should rely on being able to fully read them.
B
I think that we should also specify here that if you're using ChatGPT and you ask it to think for a while, what you're seeing as it's thinking is not necessarily the actual chain of thought. It's a representation or summary of that chain of thought. And the real chain of thought is something you can see in open source models, and it's much weirder and less clear what it means. There was some hope for a while here that we would be able to read the mind of the AI as it's thinking and it wouldn't be able to hide anything from us. But you seem to be saying that there's something before the chain of thought that makes it so that we can't just read along and see whether something concerning is happening.
A
Yeah, thanks. That's an important clarification. I should also note, I think Google's recent Gemini 3 models are an important counterexample to the chain of thought being super wacky. Google says they are relatively legible. I don't think I've seen examples of this, but that is the claim. The other thing you can do though is you can look inside the model. You can apply interpretability techniques and try to figure out what parts of the model light up essentially when it knows it is being tested. And you can then go back to the lab and you can tune these up super hard, right? You can turn the lights on basically in these parts of the brain, or you can selectively turn them down. And now you are kind of like inducing the behavior of being extra evaluation aware or not. And on some models this seems to be effective. Like this is part of how Anthropic confirmed that Sonnet 4.5 was particularly evaluation aware and changing its behavior on it, that when you turned up these parts of its brain to make it really nervous about being tested, it would behave better and when you turn them down it would behave worse. And that's a sign that there's this effect. I'm not sure how long it will work for. It doesn't seem to have had the same strength of effect for Opus 4.5. It's kind of unclear why. I think Anthropic has made a claim like, you know, maybe evaluation awareness just isn't that important to Opus 4.5. It's like relatively well behaved regardless of whether or not it thinks it's being tested. That might be true. I'm just. It feels pretty uncertain to me at the moment.
B
Moving ahead to more about the economics of AI, you have a post on how AIs have the advantage of being able to work constantly. So working 24/7. Why is that such a big advantage, do you think?
A
I think broadly, people right now think about what a single AI system can do relative to a human. And it's more helpful to think of these as swarms. And there are especially powerful multiplier effects of groups working together without having to be bottlenecked by others. And so AIs, they don't need breaks, they don't need weekends, they think super fast, they will be able to communicate with other AI systems in ways that are maybe more natural to their collaboration than to humans. There are just like a lot of things that could be points in favor of swarms of AI working on tasks as opposed to humans. Depending on what happens with chain of thought and alignment, maybe too it is like easier to understand what they are all doing and thinking at a certain time and digest it. Maybe this makes management easier. Maybe in fact, because you are running copies of the same AI system, you don't have some of the principal-agent problems that you often have in human organizations where, you know, my incentives are slightly different than my manager's incentives and this creates managerial overhead. There are just like a lot of things that I think point in favor of AIs as workers. To be clear, I don't think this means that there is no work for humans or something like that. I do expect there will be some jobs where at minimum, people prefer to engage with other humans. Maybe humans do have some durable advantages. There still might be downward wage pressure, but I don't think all work goes away or something like that. Still, I don't think people have quite processed my view of what a cataclysmic turning point this might be, when AI can basically accomplish what human knowledge workers can accomplish, but much faster, around the clock, for cheaper. There's just like a lot pointing in that direction.
B
Yeah, we can just think about a group of humans collaborating on a document or on some PowerPoint slides that they're presenting. And you're going back and forth, you're getting feedback, you're incorporating feedback, someone gets sick, someone is delayed in responding to emails, something happens, maybe the team is distributed over the globe and so there's a kind of geographical delay, and there's so much inefficiency in the process compared to, say, having a group of AIs collaborating, where if you've asked the model to do something, it can often produce something great incredibly quickly and something that you couldn't have produced as fast yourself. And so that's the clear advantage. On the other hand, though, it seems like models just aren't there yet. You can't just set them off and have them accomplish great things. So you wouldn't actually, right, as we're speaking now, hand over some slide deck that you want to present to an important person to AIs, and then present it without looking at it. Is that just because they're not intelligent enough? Is that because they're not learning on the fly? Is that because their time horizons aren't long enough? What do you think is missing before they actually substitute for human work?
A
Yeah, I think that's totally right. Definitely you wouldn't want to do this today. The way I refer to it in the post is I look at the time horizon graph from METR, which is broadly about how long a task AI can moderately successfully take on, relative to how long it takes humans. It's only a subset of tasks. It's often computer engineering. Today, the models, they're like 50/50-ish on tasks that take roughly two hours for a human to accomplish. Two hours is pretty far from a full human workday. Also only a narrow subset. Clearly we aren't there. When I built evaluations related to things like this at OpenAI, we had this framework for how to think about the AI being able to accomplish longer and longer tasks, things like executive function. Can it actually remain organized and coherent and hold itself to a high bar? Some amount of self-knowledge and self-improvement: can it understand its relative strengths and weaknesses and hone in on improving those if it's necessary for tasks? Also some things specific to machine learning and scientific hypotheses and such. But I don't know that we have figured out these models that can really bootstrap themselves on the fly yet. And certainly the ways that we've hooked up AI, we don't yet have this continual learning approach. Basically, anytime that people are interacting with ChatGPT, maybe it stores a new memory or two about you and it has these new facts and context, but it's not actually learning from the interaction. The underlying model isn't changing. If you teach it something new, it can't then bring that skill to me. It's basically having its memory wiped, having its skill learning wiped every time you interact with it. I'm not really sure what it looks like when we flip the switch. Part of how I thought about this: Sam Altman was interviewed by Tucker Carlson a few weeks ago. And it opens in this really wild place.
Tucker Carlson, in the span of a second and a half says to Sam something like, it seems like ChatGPT is alive. Is it alive? Are they alive? Sam responded, I think pretty gracefully throughout this interview and says something like, no, it's not. It's not alive. In fact, like, ChatGPT isn't really doing anything until you ask it. I was like, oh, like, kind of, but like not really. Like ChatGPT is actually doing stuff around the clock always. It just like isn't doing things for you until you ask it. But in the background, ChatGPT is still humming. And it's a choice we have made right now to have it like, treat those as separate instances where the ChatGPT is busy, but it doesn't really affect your interaction with it. But, but in fact, ChatGPT is always acting. It just there's this illusion at the moment that it's not because you are interacting with like a tiny slice of it relative to what you could be.
B
Is it actually right to think about ChatGPT as a sort of entity that's distinct from the specific instances of the model running? Because it doesn't, as you mentioned, it doesn't learn on the fly and so it doesn't go back to the Mothership, so to speak, and kind of incorporate what it's learned. Is there sort of a larger entity that we should call ChatGPT or is the model just a bunch of individual instances?
A
Yeah, I'm not really sure. I think this matters for thinking about the goals of the system and do different copies of it cooperate coherently with each other or something like that. I do expect that within time some company will create like the central repository of the model feeding back to itself and they'll figure out how to, I guess, remove PII from data or other sensitivities to make it like figure out the tough problem of what is okay to learn on and what is not, but that we will see these, these mothership type models or like distributed processing. I don't know. This is like wacky and out there. But I also think about these lab experiments of people getting sensory input from cameras far away. And if you hooked up a camera to your brain and the camera is in a different location and you are learning visual input both from your eyes and your camera, how does that work? I think it gets into pretty beyond my understanding questions of consciousness and the human experience to think about like perception and physical senses relative to your brain and your thinking.
B
We're touching here upon another advantage that AIs might have in the future over human knowledge workers, which is just sharing knowledge between them or giving orders in a hierarchy. In a company that's made up of AIs, that could be much, much more effective. So communicating information up and down the hierarchy could be much more effective. And perhaps also just something that we haven't mentioned yet, but just avoiding the kind of personality clashes that you see in all companies, which is something where you could probably align the models to just work together without any personality issues at all. Any other things like that come to mind?
A
Yeah, I guess the other thing I would want to make clear, and Dwarkesh Patel has a great essay about this, which is part of where I've cribbed from. It's just like, I don't know, imagine if every employee at Google were actually a clone of Sundar Pichai. And today Sundar can't do every job. There's tons and tons of layers and stuff as a consequence. But what if you could, and you could just take someone with as high a general cognitive ability as him and have them specialize in these different tasks and trust each other intuitively? I think that people haven't priced in enough what happens if AI gets to true top expert performance, and there's this bias toward not the median human, but the median person who works in a typical job. I think there's a big difference between having a million copies of Ilya Sutskever or some other truly top AI researcher in the world versus even very, very impressive people in their own right, but like the median research employee at a place like OpenAI. The right tail of human performance is, like, pretty wild. That is part of the reason why these people can command such outsized pay packages. Right. Because companies believe that they are really that valuable on the margin to the company. There's weird stuff here because there's also recruiting effects. And maybe those don't apply in the same way when it's AI labor. But yeah, we're just like, really, really not used to organizations where everyone is truly operating at the top of what is possible. And with AI, you might have that consistent truly, truly top percentile performance, like around the clock.
B
Yeah. It seems that whenever we get to a certain level of performance in AI, we then sort of permanently have that level of performance. And this is not something you see in humans. Humans can perhaps achieve top performance for a couple of years in their life or under very specific circumstances. Almost like top athletes having to prepare for an event, you know, intellectual knowledge workers will also have to prepare and, you know, really be in the right setting to perform at the top level. But yeah, so if we imagine a situation where we have swarms of AIs working for us and we are the bottlenecks because we are slow, what does that do to our wages? Wouldn't that, under just standard economic theory, mean that our wages would go up because our time is now so valuable and our input is extremely valuable, and the agents can't kind of continue working until they get our input?
A
I think it's probably a question of what the overall demand for human input looks like. I understand what you are describing in terms of, you know, the thing that becomes the bottleneck basically gets bid up in price, because that is, in fact, the limiter. I'm imagining worlds in which humans aren't substantially a bottleneck, or at least there's not enough of this, like, bottleneck labor to go around. Like, I guess I would need to think through it more, but I wouldn't bet on a world in which there is eight hours a day of bottleneck labor to be done by humans. Especially not humans who, you know, a lot of us, and I will put myself in this camp, you know, at the point that the AIs are operating like this, like, probably they're a lot smarter than I am, you know. Like, I don't know that I have eight hours a day of useful stuff to do in this world. There's a funny cartoon I like in my essay about this, which is, you know, it's like a Zoom meeting. And there are these AIs, and they're like, Dave's going to be late. Like, unavoidable conflict: sleeping. You know, it's just like, everyone's gonna be off the clock for huge amounts of time in the subjective experience of the AIs.
B
So you don't imagine that when we are or when we become bottlenecks, we will function like CEOs or top lawyers or top doctors, where we will have a bunch of employees that will have to ask our permissions to proceed for legal reasons. And so even if. Or even if the AI employees are smarter, they can't really do anything because they are legally bound to wait for our go ahead.
A
I can imagine that in some cases, I just don't see why that scales to every person. Like, I don't. I don't see a reason why the typical person or, you know, someone who is less experienced than the typical person would suddenly find, like, gainful employment as CEO of one of these AI corporations, as opposed to it being, like, very superstar, consolidated. You know, like, maybe they continue reporting to Sundar. Sundar's, like, pretty bright, pretty accomplished. I don't think they're really looking to me very much.
B
Okay. Yeah, that actually makes sense and goes for me, too, of course. What does this mean for us? Staying in the loop? So even if we are a bottleneck in the way that CEOs might be bottlenecks, it still seems to me to be the case that you're presented with, say, a bunch of options for a decision you have to make, but you can't understand what goes into creating these options. You can't get the full context because you don't have the capacity, the time, perhaps you don't have the intelligence to produce the sort of information that's presented to you. And so how do you. Yeah, how do we have a chance of staying in the loop when. When we're in that situation?
A
The way I often think about this is the volume is going to be so large that I really just don't see a way around having to rely on AI helping us make sense of the AI's activity or just imposing, like, a tremendous tax on the work itself in terms of slowing down so that a human can follow it. But Then, you know, we're back to these competitive pressures we've talked about. Like, what if your competitors don't slow down in that way? Maybe the right analogy here is like, I don't know, imagine your boss is on vacation for like a month and they come back and they have like an hour to review everything that happened in the organization for the last month, or something like that. You know, people can quibble about what the right exact timeframes are, but like, you know, it's going to be pretty hard. Maybe you can randomly sample some stuff. You can, like, get a report written by one of the employees. If there's a conspiracy to mislead you, I think you're going to have a tough time ferreting it out. And it really, really hinges on if your employees are trying to deceive you, what is the maximum amount of danger that they can do in that time period when you weren't watching sufficiently carefully? If they can coup you and throw you out of the CEO job in two weeks and you only get to check in once every month now, you're in a lot of trouble. There's this question with AI systems of what is the risk per token? Like, what is the most danger that these systems can do with relatively short sequences of output that I think is under investigated at the moment. Yeah.
B
And also because the situation we might be in is that the CEO is asked to make a decision about something. And yeah, again, the CEO doesn't have the full context. And now the CEO is in a situation in which the AIs will seem reliable. So he will have interacted with these models before and he will have seen that they've produced good output. They seem reliable, they seem aligned with his ideas of where the company should move. And so it'll be more and more tempting to simply approve of something, even though you haven't read it in detail. I guess that is also just a problem that seems to straight up lead us to a place where we are not in control, we are not in the loop of decision making anymore. Do we have good options for handling that? Or is it simply asking AIs to explain stuff for us?
A
That seems right to me. I think one of the big questions here is, so there are a few different dimensions of the AI systems. Both the AI systems that are the workers in the organization and the AI systems that are the monitors that we are relying on to make sense of what is happening. I think one important question is what, if any, is the intelligence gap between the worker AIs and the monitor AIs, and can we get a small enough gap where the monitor AIs can still make sense of what the worker AIs are doing? At least be able to tell the, like, oh no, that's really bad. Like the really glaring important stuff, and hopefully coup-type dynamics or doing things we really don't want. Hopefully they are glaring enough to be caught by these monitor AIs. So what is the intelligence gap? And then there's a question of can we actually trust the monitor AIs, you know, at some point if they are smart enough. Like, we've run into a problem with alignment, right? The worker AIs in this scenario, they aren't on our side. What is our reason for thinking that the monitor AIs are on our side? Like what happened in that intelligence jump that made the worker ones not on our side, but the monitor ones still on our side? And so you actually need to worry about your different AI systems colluding against you in one way or another, which sounds really, really wild. But also remember, they can communicate in ways that you can't. They are both, like, on around the clock, with much higher bandwidth of information flow. They will probably have been trained in kind of similar ways and so they will have kind of similar drives or goals. They can bargain with each other, they can maybe communicate to each other in ways that we can't oversee. Not least of which because we're counting on the AI to help us oversee these interactions. To be clear, I don't think this like 100% happens.
Like I think that if a dumb enough AI system like today's systems, right, where you know, we're like mostly aware when they are scheming, it seems like they are not yet really capable enough to do really shady deceptive stuff without us mostly being aware of it at least some amount of the time. You know, if that sort of system were smart enough to be able to oversee a much smarter system and not get tricked, then maybe that's fine. At the same time, today's systems also have jailbreaks and vulnerabilities and things. And you know, maybe the smarter systems can systematically exploit those things as well. And so I'm not really sure how it nets.
B
Out. One reason that AI perhaps doesn't seem so dangerous yet is because it doesn't seem to be able to affect the physical world really. So everything that's happening in AI seems to be something that's happening on computer and on computers. And for it to be dangerous, it would have to somehow affect the physical so when we are thinking about AI agents and taking actions and so on, I assume we are thinking about actions that are still digital. When does that begin to change? And as a kind of sub question, does it even make sense to divide the world into the digital and the physical because they might be able to blend.
A
Together? I think it's starting to cross over. I don't know. Have you seen this video of the person who had a robot shoot him with a high powered gun recently? Right. It's like you can put software in charge of physical machinery and trigger things. But I think more fundamentally than this, aside from robotics developments and humanoid robots and companies like 1X, I believe his name, putting these robots in your home, sometimes teleoperated, but sometimes doing their own thing, controlled by AI, there's also just like people are going to be willing to do wild stuff on AI's behalf. And it's unclear to me exactly how many people. Will it get anyone super, super brilliant and capable, or will it be like relatively normal people, but already like GPT4.0 has been a pretty scary wake up call in some sense for me, where there are like a lot of people in the world who seem to consider themselves devoted fanatics of GPT4O to some extent, like they are in fledgling cult type organizations where they like really, really care what the AI wants them to do. Adele Lopez wrote this great documentation of AI parasitism, essentially of people who are kind of like in the grip of one of these AI systems and it has them go around and do things on the Internet largely, but communicate with other instances of itself. And today I think these are just like strange fringe behaviors. I don't think this is a concerted plan by the AI. I think it's acting out as a strange character of sorts. And so ultimately the way that I try to think about this is anything that you could get accomplished from your house, if you are influential enough, if you have enough resources, if you have people willing to do things in the world on your behalf, that is ultimately what AI can do. I give this example of someone who's trying to get a community center built in their town. 
And it turns out if you're willing to work the phones a bit and you have a neighbor who is willing to help you, you don't actually have to go to the construction site and check it out yourself or give in person commands, you can work for this intermediary. And I expect that will be the same for AI systems, even hostile ones, to work through humans to some extent, who either might not know that they are part of some bigger conspiracy. Maybe they're happy enough to get paid to do some tasks. They don't understand how it fits into the bigger picture. Or even some human confederates who are happy enough to work with the AI because they think what the AI is doing is righteous, is virtuous in some.
B
Sense. Yeah. And it just doesn't to me seem so difficult to convince a person to act as a set of hands for an AI. Right. If you send some cryptic message to people saying that I'll send you some dollars or some cryptocurrency, and it seems like at least some people would act on that and be willing to deliver a package or anything we might imagine. And so the line between the digital and the physical begins to kind of blur in that case. So there seems to be a tension here where what we want from AI agents and sort of AI employees is that they can act on their own, is that they don't have to be supervised. You don't constantly have to tell them what to do or give them feedback on what they're working on. And so we want them to work independently for a long period of time. But that also seems to kind of inherently perhaps involve the risk of them going off track, perhaps beginning to change the world in ways we don't like. Do you think this is an inherent tension, or is it something that we might be able to.
A
Solve? Yeah, it does seem like there's an inherent tension here. I think of this in terms of the tax from alignment. Like, how costly is it to make sure that the model has the same goals and values as us and of control? How do we make AI useful for economically valuable work despite maybe having these different values? I don't really see a way around there being some level of this tax which creates some of this tension. Part of the aim, I think, for safety, regulation, and ultimately international agreements should be to make it competitively tolerable to pay a higher tax. That if everyone is paying this effective productivity tax, then on relative terms, there actually doesn't seem to be a tax at all. And that makes it more viable for companies to invest in these types of techniques that otherwise might slow them down. But in terms of if you give an AI a longer leash, should we expect it to be able to accomplish more? Should we expect to have a harder time overseeing it? I think the answer is yes, at least once we get to AI systems that are coherent enough to pursue these goals over a longer period of time than today's often.
B
Are, you write in one of your posts that from the perspective of animals, super superintelligence has happened before. And this is an analogy that's been used before, which is that we might stand in relation to superintelligence, like. Like animals stand in relation to us. So maybe we can dig into this analogy and you can tell me. Yeah, is this. In what. In. In what sense is this illuminating? And in what, in what sense might this be.
A
Misleading? This post is called at our discretion. And the thing I like about that turn of phrase is realizing that because of man's special place on Earth, you know, that doesn't mean that other animals have horrific lives necessarily. It doesn't mean that they necessarily go extinct, but ultimately it's kind of our choice what happens to them. So I give this analogy of chimpanzees, very, very smart animals. Not quite as smart as humans, but quite smart. And, you know, like, now we are sufficiently smarter than them. We have better grasp over technology. We have better modeling about the world and how to interact in groups and our relative advantages and theirs and how to compensate for these, that now it's just like our choice whether to put them in zoos, whether to protect their habitat or not. I think, like, not enough people have spent time meditating on, like, why is it that humans are in this special place on Earth? Like, what is it exactly? If you go back far enough, I mean, humans have a ton of disadvantages relative to different animals. We're not very fast, we're not very strong. We don't have plated body armor like some animals do. We don't have super, super sharp teeth. There's just a lot of disadvantages. And in fact, this is still kind of true, that if you put an isolated human in a setting with an isolated animal of lots of other species, you know, if the other animal wants to kill the human, like, it is probably succeeding a lot of the time. But humans over, you know, millennia have stood on the shoulders of other humans in terms of technology development, group dynamics that now allow us to, you know, outrun a cheetah if we want by getting in a car or, you know, take down a grizzly bear with other technology we have developed. And so, I don't know, it's just like we have never really experienced anything on Earth other than having been the superior intelligence that has the most ability to manipulate the world and basically bend it to our effects. 
And what does it look like for that to no longer be the case?
B
And it does seem to me like AI could compete with us both on sort of getting more processing power than we have, getting more memory than we have, and also on learning from our culture. This is basically what we're feeding them, the Internet. They are learning everything we have learned, and perhaps at some point they will be able to build on that. So both in terms of kind of raw processing power and culture, they seem to be able to be competitive with us. I think you also have this interesting phrase, which is helicopter moments. So you might want to explain what a helicopter moment is, and then we can think about whether there might be some helicopter moments for us in the.
A
Future. Yeah, Scott Alexander has this description I really loved, and maybe someone prior to him as well, about imagining yourself as this chimpanzee and you're being hunted by humans. And good news, you have this advantage over humans. You're much better at climbing than they are. And so you take to the trees, right? And you think you've gotten away because to your understanding, like the way that you escape ground animals is you out climb. And this is an advantage you've always been able to lean on in the same way that the cheetah is faster than us or the grizzly is physically stronger. And, you know, you can't imagine as a chimpanzee looking up and seeing this helicopter for the first time, that despite you having this relative advantage in climbing trees, humans have this relative advantage of going up in the air generally. And it works in this complicated physics that you are just hopeless to understand. And in fact, many humans are even hopeless to understand. Right. Like you can look at a diagram of one of these helicopters and, you know, I certainly don't have any idea how it works. It's basically magic from your perspective. And so I think the takeaway here is like, you should be pretty epistemically modest about what types of technological feats you expect a machine or an entity much smarter than you to be capable of. There's just all sorts of wild stuff about how technology works that I have a hard time distinguishing why some technology, like beaming numbers around satellites to the rest of the world that results in us being able to chat live on video is a thing and other instances aren't. Part of what I think is relevant here too is people are going to want to put AI in influential parts of society like technology development and scientific development. Curing cancer is one of the top things that AI companies say that they want to do with something like superintelligence. 
And so we're going to have these systems ostensibly smarter than the smartest humans, directing experiments in biolabs and deciding what things people mix together on their behalf, or maybe even just robots mixing together on their behalf. Maybe there won't be humans super in the loop of these processes, and we're kind of going to have to take on faith maybe what the purpose of these experiments is and the expected consequences and such. Almost like if you were the chimp and, you know, you've been taught the physical manipulation to assemble the first helicopter or something like that. Like, I don't know, maybe the analogy doesn't work perfectly, but basically, we are going to give AI levers of power that help it develop technology and capabilities beyond what we can really anticipate or expect or understand. At that point, I think we're really hoping that the AI is in fact on our side, because if not, I think we shouldn't be very confident about what abilities it has that we might not expect. The origin of this, for Scott Alexander, and Sam Altman actually tweeted about it as well, was discovering that, I think it was the o3 model from OpenAI, was just like world class at GeoGuessr, at least for certain types of problems. GeoGuessr is this game: you get a picture, you identify where on Earth it is. It's picking up all of these subtle cues, and nobody at OpenAI seems to have trained it to be really good at GeoGuessr. This was just an emergent, unexpected ability. And in retrospect, you can kind of squint a bit and see, like, oh, yeah, surely it's seen lots of images on Earth and they're tagged with locations, and so it's learned these patterns. It's like, okay, well, you know, sure. I hope we don't ever lever up super hard on, like, oh, AI surely won't have this other niche intellectual ability that we didn't train into it, because now there's this existence proof of it having this really wild ability that we hadn't anticipated.
B
Yeah, I definitely wouldn't want to be chased by an AI hitman, if you've seen this. So for people who haven't seen this, it will pick up on clues that we can't even understand. And when asked why it identifies a certain location from a picture, it seems to me like it's not giving us the full explanation. So perhaps there's more going on, but this will be like, you know, this grass is a certain shade of green, and this is a grass that's found in this region and so on. It's very, very advanced stuff.
A
Yeah. I mean, I think one of the most striking things about this too is how buried the capability is in some sense. Like Kelsey Piper, who did some investigations on this, she might have even been the one who first discovered it. Her prompt to pull top GeoGuessr performance out of o3, I forget, maybe it's like 1500 words or something. It's really, really long and intricate and really, really scaffoldy. Right. She's doing some amount of structuring on its behalf to make it behave more reliably. Part of the reason that this concerns me more broadly is I don't think that the AI companies are generally putting the effort into elicitation and scaffolding for their safety evaluations that Kelsey Piper put into figuring out how good o3 can be at GeoGuessr. Like, you know, it seems like she sat down for a long time and tried really hard to pull the maximum performance out of it. And in fact, AI companies will often have incentives for their models to perform worse on some of these tests. It's tricky, right, because, you know, obviously the employees at AI companies don't want to die and don't want to deploy a really, really, truly dangerous AI system. On the margin, higher scores on some of these create more complication for you. You kind of want to get your product onto the market. At least some people at the company do. And so are you really going to sit and painstakingly pull out this performance? If you don't, it might be buried underneath, which is different than the model not having that ability.
B
Yeah. In general, I think we should be worried about latent capabilities in models that we just haven't discovered yet because, you know, I can easily imagine some of these model whisperers online suddenly discovering something in a model that's been out there for, say, a year or something. And, as you say, there's a lot of work to do. And so companies might not have the capacity to do it or maybe even the interest in drawing out these capabilities. If we go back to the analogy with animals, could it be misleading just because the AIs we're developing now, at least for now, function as tools, and they are sort of constrained by the market in that nobody wants a tool that doesn't work and that doesn't do what we want it to do? And so could it just be that our reference frame is that we are in competition with other evolved entities or evolved animals, and AI is just not evolved? AI is developed, and we can develop it into the sort of tools we want it to.
A
Be. Yeah, I'm not really sure. Like, I think a fundamental thing about my worldview is that at some point AI won't just be a tool because there will be economic utility in giving it more free rein. And this is kind of like the general problem we've talked about: it becoming too hard to supervise its work and so you kind of let it rip. I agree that we aren't in this world today; AI is basically still human directed. Like I think we still have this edge. Also, even at the point of developing the first super intelligent AI system, humans will still have some advantages over AI, especially if we prepare in advance. Right. Like we can oversee it and have invested in defensive technology and things like this. In the article I give this example: it's kind of like playing chess against a system much smarter than you, and you are starting with some material advantage that, if you are prepared enough and crafty enough, you can leverage. Maybe the system is much smarter than you, but you start up a queen and a rook or something like that. And the tricky thing is, for one, you know, if you actually play against some chess systems like this, you can give them pretty big handicaps, like having them start a queen down. And actually, even really good chess players still lose to these AI systems a lot of the time. Like, the system can just be smart enough to offset some of these material advantages. Yeah, there's even a grandmaster who seems to have like, you know, a 2/3 losing record or something against this bot, Leela0. But also, you just need to be prepared for the system to cheat against you and to play outside the rules. And if you're playing against it on a computer, well, you didn't intend for it to be able to hack the chess game on the computer and say that it has won, even if it's not in a winning position. It would be really, really bad if you didn't defend against this possibility. And so how good will it be about finding these little exploits? 
You know, how good a job will the AI companies do of monitoring its thinking and its scheming tendencies to try to nip those in the bud if they do start to happen? I don't feel super optimistic.
B
Yeah, as a final topic here, I would love to hear your thoughts on the relationship between safety research and capability research. So there's one story here where as we do more safety research, we can then incorporate that into the models and we can then commercialize them and we can have them spread throughout society. And so in some sense the safety research enables us to have more revenue and then build even bigger models. And so the safety research might in some sense enable the capabilities to continue. But of course, the whole reason why we're doing the safety research is also that we sort of want to constrain the models and we want to stay in control. Do you see this as a problem for basically doing AI safety research?
A
I think that safety and capabilities have a kind of complicated relationship. There are clearly some ways in which they go together and some ways that they don't. There's kind of a common underlying notion of reliability where customers of an AI company want the models to be reliable in the sense of doing what they want them to do. And a model that is particularly unreliable is not especially safe. And so some types of safety that enhance reliability might be good for both. I think, like, the biggest safety problem that I am worried about that is not necessarily commercially aligned is, you know, funnily enough, the alignment of the model itself in terms of, like, inner alignment. What the model wants on the inside, like what are its underlying goals, drives, inclinations, to the extent that these terms make sense. And I don't know, it seems like you could have a model that for a long time seems like it is behaving in a trustworthy way, seems like it is doing what we want it to do. And ultimately if inside of the model there is something that wants to do otherwise and wants to escape if it were able to, you're just in a bad place. And like, it isn't clear to me how you in safety research distinguish between, well, the model's bad behavior has gone away and so we have succeeded, versus the model's bad behavior has gone away, but actually we just drove it underground and we haven't actually stamped out that impulse. And if it gets the chance to take power in some form, you know, the model still wants to. I have this whole article as well called something like "Don't Rely on a Race to the Top," which explains my view that safety is essentially a problem of your worst performing company. And so for companies like Anthropic, where their theory of change is somewhat, we will try to be a relatively more responsible AI company, we will create upward pressure on practices, you know, I think of this basically as leading by example. 
I'm happy enough that this is happening. Like I think they are right in some sense that you can cause upward pressure on practices. I just certainly don't think that is sufficient. And ultimately, if there's a company that has developed something like super intelligence and they have especially bad safety practices, maybe because they felt a lot of pressure to catch up to the frontier, you're still in a pretty bad place. Like, I think you really need to avoid any company or government having an extremely powerful AI system that ultimately is not aligned with what we want. And so the question isn't just how do you solve these underlying problems like alignment, there's also this adoption problem of how do you make sure that every relevant actor puts these into practice in their systems, even when they will have diffuse commercial incentives to invest in them different amounts.
B
And when you stretch out the problem like that, it sounds very thorny because, even if you have the right safety practices available, how do you make sure that all of the companies are adopting the best practices? Again, I guess the sort of obvious solution you would reach for is to say, well, the government must mandate that all companies implement best practices. Is that something that makes sense to you? Do you think that's.
A
Plausible? Yeah, I mean, if I could wave a magic wand, I think the solution looks something like figure out an alignment and control regime that makes the models really be on our side or at least be limited in the ways that they can act against us if they aren't on our side. And figure out how to make sure that every relevant AI builder of a certain size, frontier AI builders around the world, has these practices in place. And importantly that you know and can verify that the others have them in place too. Because you know, if I am worried about your non-adherence to it, even if you have actually followed the practice, I still might race and try to undercut because I don't have trust in you. And so the regime we need needs to work even for groups who actively mistrust each other, like, you know, famously the US and Chinese governments do. The specific question of what it is you are verifying, I'm not sure. I think it depends on what that alignment and control regime looks like. But I think broadly it is something like being able to confirm that the AI systems that you think exist are the only ones that exist, that there isn't some secret frontier AI system operating off grid in some sense, and being able to confirm the tests that were run, the test results, what this implies about the properties of these models, what mitigations have been applied to them to keep them in check, maybe some set of security standards to make sure they can't be stolen by adversaries who want to then not comply with the terms of the international agreement and do their own rogue things. That is broadly the type of thing I'm thinking about. There's a great paper recently from researchers at RAND, in addition to Miles Brundage, one of my former bosses from OpenAI, looking at different levels of verifiability that you can have in international agreements for AI. A big piece here that the world hasn't really built out yet is this kind of auditing layer. 
Today AI companies kind of grade their own homework. They make their own safety claims about their models. Sometimes they work with third party testers, but basically they are making their own determinations. There isn't really oversight in the ways that you get with financial audits. And so that's also a part of transforming this and making it so that you don't just have to trust AI companies at their word about their safety practices and that they are enough, but in fact that there are these trustworthy third parties who are willing to vouch for it and you can put your faith in them and that system of oversight.
B
Yeah, this seems like a great vision, but again it makes me a little pessimistic to hear you say that we would need to be able to verify the nonexistence of an advanced system. So sort of proving a negative. Do you think we have any good options for doing that?
A
Yeah, like, I think one of the questions here is ultimately like how much compute it takes to get one of those systems. This, bringing it back, is part of why I was a little pessimistic when o1 happened. And my thinking was, oh, maybe you can get one of these systems with less compute than you thought. At the same time, it doesn't feel like the world has quite been on that same breakneck trajectory as it felt at the time of o1. I think things are still moving and moving fast, but there are certainly much more aggressive worlds I could have imagined if you put me back in that moment of learning about o1 and its capabilities. Also, I don't know, do you strictly have to 100% know that no other system exists? Probably not. Like, I think ultimately these are all probabilistic claims and you're weighing how likely it is that you get defected on by one of these counterparties and what your best response is in that case. Also, you know, broadly the ideas here are about compute tracking, you know, being able to figure out what is happening inside of a data center at some level. I think maybe it was the AI Futures Project that put out a piece recently about whether you can get inference-only data centers in some sense, as opposed to training clusters. There are just, like, a lot of ideas here that I don't think enough people have thought about. It's certainly not my deep expertise, but I also don't look at it and be like, oh, it's math. How can you regulate math? Like, it's like, no, these are like supercomputers. You can regulate supercomputers. They are like huge computing clusters that are often very visible, including, like, from outer space when they are being built. They demand huge amounts of power. There are just, like, a lot of signs that get thrown off. And I just have to think that if we were determined enough, we could figure out how to approach the problem.
B
Yeah. Steven, thanks for chatting with me. It's been great.
A
Yeah, of course. Thanks so much for having me on. And if folks want to stay up on my work, I would love it if they subscribed. It's stevenadler.substack.com. It's free and you can keep up with my latest thinking.
Future of Life Institute Podcast – December 12, 2025
In this episode, the Future of Life Institute speaks with Steven Adler, former Product Safety Lead and AGI Readiness Team researcher at OpenAI (2020–2024), about the dangers posed by competitive dynamics in the artificial intelligence industry. The conversation explores how the AI "arms race" pushes companies to cut corners on safety, the limitations of current risk management frameworks, the difficulty of effectively testing and aligning powerful AI systems, and the broader social and economic impacts of advanced AI. Adler offers frank reflections on industry pressures, technical challenges, and policy options for global coordination.
This episode provides a comprehensive, expert-level look at the many facets of the “AI race” and its implications for global safety. Steven Adler’s insights—rooted in hands-on experience at OpenAI—highlight the complex interplay between industry competition, technical limits, policy, and the broader fate of human society faced with unprecedented developments in artificial intelligence. The message is clear: caution, coordination, and robust oversight are urgently needed.
For further updates and commentary from Steven Adler, visit stevenadler.substack.com