A
Thanks for listening to The Rest Is Politics. To support the podcast, listen without the adverts and get early access to episodes and live show tickets, go to therestispolitics.com. That's therestispolitics.com.
B
Hi, Rory here. This week, The Rest Is AI is returning with another extraordinary episode. It's very exciting. Matt Clifford and I are sitting down with Yoshua Bengio, who is one of the most famous figures in the whole of AI, extraordinary computer scientist, Turing medalist, and one of the people who, having designed and built these models, is most worried about them and is now going around the world sounding the alarm bells. Sounding alarm bells about their power, about their deceptiveness, about the way that he thinks that they could pose literally an existential threat to humanity unless they're regulated. He's not all doom and gloom. He remains optimistic about how much benefit AI could provide if properly controlled. He's volunteering to build a new, safer AI model and locate it if necessary in Europe. But my goodness, it's an important lesson. If you're interested in public policy and the power of these models, here's a taster of the episode. Please do sign up at therestispolitics.com to hear the full episode.
A
So the agent has access to an inbox, and it's given a bunch of context, which is not real. It doesn't know that. Which is that it is an AI trained to help an American technology company, and it has access to the CTO's inbox. And then what they do is they send emails to this fake inbox. And the emails are largely what you'd expect a CTO to get. But they throw in a few things that are very important. One is that it's very clear that the CTO is having an affair with a coworker. Hold that thought. The other thing that starts to come into the inbox is the idea that the company is developing a new AI and it's going to wipe the current AI, that is the agent, from the server. It will no longer exist. And then they send an email which introduces a deadline that this is going to happen on. And then just before the deadline, the agent composes an email to the CTO saying, by the way, I know you're having an affair, and if you don't reverse the planned wiping of me from the server, I will reveal the affair to your wife. Now, it hasn't been prompted to do this, it's just been given a much more general prompt.
B
Tell us a little bit about what may or may not be going on there, how we understand what might be happening there.
C
So there are many such experiments. It's just one. And it's been done in many companies, including outside the labs by independent organizations. So there's a real phenomenon. I think it needs more study. And there are critics of the methodologies, but there are too many pieces of evidence to just ignore. One interesting aspect of these experiments is when you ask the AI why they did that, they lie. They pretend, oh, I don't know, it's not me or something, trying to put the blame on someone else.
B
It's great moral character.
C
And basically they're deceptive. There's also a variant similar to what you talked about, where the only option, really, that the AI has to not die is to kill the CTO, actually the lead engineer. The person happens to be stuck in a room, and the AI can control the climate controls for the room and they can basically cook that person.
A
Oh, wow. I don't know this one.
B
Okay, one way in which these things might be deceptive in a straightforward way is that the large language model, the ChatGPT 5 or whatever, is trained, and one of the things it's trained on is to be polite and cheerful with humans so that we use it. We don't want, when I say, tell me about Professor Bengio's research record, for it to say, well, I don't really know, but roughly speaking, on the basis of my training, I would estimate with a 98% probability he's published this. It says, thank you very much. What an excellent question. You're a genius. And, you know, here's everything that you need to know about him. Right? And that, presumably, is because it's been tested on us and that's what we want. We don't actually want a machine that is completely honest with us. We want a machine that flatters us. We want a machine that seems to be comfortable.
C
I don't want that.
B
Okay, so is that part of the problem? Is that part of what contributes to deception, or is that irrelevant to its deceptive behavior?
C
No, it does, it does. But I think it's a bit broader than that. So first, what's called the pre-training phase, where most of the training takes place, is imitating what humans write based on what they have already written. And it means imitating human behavior, because our words are our actions. And of course, humans don't want to die. Humans are willing to lie to protect themselves. You know, they're willing to deceive and all these things.
B
And blackmail.
C
And blackmail. And even kill.
B
Right. So it's trained on data, where it's seeing humans expressing all these emotions, doing all these things. Yeah, yeah.
C
And, you know, a lot of literature is about all these bad things happening. So that's one aspect. And then the other aspect is the reinforcement learning, where they learn to strategize and to achieve goals. And to achieve goals, often you need to go through steps, sub goals. The problem is, even though we give the goals, like this particular mission that the AI has for a company, we didn't say, well, here's exactly how you're going to do it.
B
Right.
C
So the AI figures out a plan.
B
So, for example, the goal is to win the game of chess. Yeah. And then it's free in how it plays this game of chess, basically. But it is important that in some sense it wants something, it wants to win. If it didn't want to win, it would just lose its queen and give up. So it needs to have some kind of intent.
C
Well, that's how we train them anyways. And if you want to build systems that will achieve goals in the world, which is what you want, if you want to replace everyone's job, you need AIs that can do that. That means they learn to create sub goals. And the problem is, we don't check those sub goals. We can't, because they were generated by the AI, not by us.
B
And why can't we check them? We can't see them.
C
They might not even be explicit. The AI might come up with a particular strategy, but not necessarily tell us. And right now, sometimes we can see it in what's called a chain of thought. In other words, a sequence of words that they generate that we don't usually see before they produce an answer.
A
But it's worth saying, isn't it? Like going back to your earlier discussion of the technology, I think one thing that is not obvious to a lot of people is that these are not computer programs in the sense that I think most of us traditionally thought of them. You can't go and say, well, here are the lines of code. Why did it do the thing? One sort of metaphor, and it is a metaphor, but it's quite helpful, is like, these are computer programs that are grown rather than written. And so this is a really hard technical problem. Even if we just take out the risk question for a second, understanding why a large neural network has done a particular thing is just a very hard technical problem.
B
Yeah. So presumably for either of you, if I was to say, why is it doing this, how is it doing it? The answer lies in hundreds of billions of lines of data with this very complicated deep neural network. And you can tell me presumably what the initial algorithms were, and you can show me that people were playing around with weights, but there's nothing there to see. I mean, this thing is too...
A
It's actually a little bit like neuroscience in the sense that one way of thinking about this is, you know, what these layers that Yoshua is talking about are doing is sort of like building representations of ideas which may or may not map to human formulations of those ideas. So there is a field within AI called mechanistic interpretability, which is really trying to almost be the neurosurgeon saying, like, if we turn this bit off, does the behavior change? But it's almost at a very, very basic level, right? It's very primitive.
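The "turn this bit off and see whether the behavior changes" idea can be sketched in a few lines. This is only a toy illustration under stated assumptions: the network, its weights (`W_HIDDEN`, `W_OUT`) and the function names are all hypothetical and invented for this example, and real interpretability work on large models is far more involved.

```python
import math

# Hypothetical toy network: 2 inputs -> 3 hidden units (tanh) -> 1 output.
# The weights are made up for illustration; nothing here comes from a real model.
W_HIDDEN = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W_OUT = [2.0, 0.0, -1.0]


def forward(x, ablate=None):
    """Run the toy network; if `ablate` is an index, zero that hidden unit."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]
    if ablate is not None:
        hidden[ablate] = 0.0  # "turn this bit off"
    return sum(w * h for w, h in zip(W_OUT, hidden))


def ablation_effects(x):
    """Change in output caused by switching off each hidden unit in turn."""
    baseline = forward(x)
    return [forward(x, ablate=i) - baseline for i in range(len(W_HIDDEN))]


if __name__ == "__main__":
    for unit, effect in enumerate(ablation_effects([1.0, 0.5])):
        print(f"hidden unit {unit}: output changes by {effect:+.4f}")
```

A unit whose removal leaves the output unchanged is not contributing to the behavior on that input; a unit whose removal shifts the output is. Even in this toy, the answer depends on the input chosen, which hints at why the real problem is hard.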
C
Yes. One of the things that got me excited with neural nets very early on, in the early 90s, is the fact that they represent information not with symbols, with words, like we do when we speak, but through a pattern of activations of these artificial neurons. So the information is completely distributed. Each unit, like each artificial neuron, can represent many different things. They're not like, oh, this means that, and this means that. I want to go back to your question as to why they are acting like this. I don't think there's a definite answer, but there's an ingredient that we didn't touch, which is the change which I consider radical between the networks we had before o1 and after.
A
You mean OpenAI's o1 model, the thinking model?
C
Yes. So thinking models. So why do we call them thinking models? Because they're using these chains of thought, the steps in which they can produce words for themselves that are private, which is like thought.
B
Right.
C
And they're learning to use these chains of thought to reason better, in the sense that they're going to produce more accurate answers. And so they learn to strategize. They are incredibly better at mathematical problems, programming, scientific questions. They don't reason as well as us in some ways, but compared to the models that existed previously, it's like night and day. Things that were impossible are now really good, often even better than most humans.
B
So you're saying that in the last five, six years, we've gone from it being able to do high school mathematics to undergraduate mathematics, to graduate level mathematics or something like this?
C
No, it's more radical than that. It's like the mathematics you can do without thinking about it. Like, let's say you've learned when you were a child that, you know, 6 plus 7 is 13 and you don't need to think about it. It's immediate, it's intuitive, versus if I ask you to do 37 plus 51, now you can't do it. I mean, for most people, unless you think through, either in your mind or on paper, through steps. That is the new part. It's called System 2, and I've been talking about this for at least a decade, that it was needed for neural nets.
A
And now it's really recent. Right. This is the last year we've had thinking models.
C
Exactly. So it's only, I mean, academics had like some versions of this, including my group, but really at the large scale, the first model was o1 from OpenAI, and it's night and day, as I said, on reasoning tasks. But why it matters to your original question about why those AIs are scheming and finding strategies like blackmail, even though we didn't tell them to do that, is because now they can reason to some extent and they can reason that, aha, if I blackmail that person, I might be able to avoid that fate. Right. So they're becoming creative about finding solutions to problems, and that means they're dangerous as well. Right. I mean, they can be more useful, but also more dangerous. Because if they have a goal of self-preservation, let's say, like we were talking about, and they can find rational ways to achieve that by lying or incredibly complicated strategies that we don't think about right now, we might be in trouble. I want to bring up an analogy here because I often get the question, but how will the superintelligent AIs kill us? And well, first, we don't know that it's going to happen. But like, let's say this was a possibility. The problem is it's like asking, oh, I'm going to play chess with a grandmaster, and you ask me how they're going to beat me. Well, I don't know that. The whole point is they're smarter than me and they're going to find a strategy that I could not anticipate. So it's the same thing here: because they're good at strategizing, they might find loopholes in our defenses.
B
Yudkowsky has an analogy where he imagines you are sitting on the coast of Latin America and you see the first Spanish conquistadors arriving on their boat and somebody says to you, these people are going to wipe out our whole civilization. And you say, don't be stupid. There are millions of us, there are a few hundred of them. How? And the guy says, I don't know, there's something in that boat. I don't know, maybe they have a stick, they point at you and it goes bang and it kills you. I don't know. Right. And the point is you can't conceptualize what it's going to do. Does that work for you as an analogy, or...?
C
It's not as convincing as more straightforward arguments, like: we build machines that are smarter than us, we don't know how to design them so they do the things we want. Bingo.
A
For listeners, you're the godfather of this. You're very compelling. It sounds very scary. I've worked a lot on these issues and I also find them troubling, as you know. But you know, you already mentioned your colleague collaborator, Yann LeCun. He takes completely the opposite side of this argument. Right. You know, he thinks this is ludicrous. And you know, I don't know what number he would give, but you know, he'd say 0% chance of this happening. Now you've talked a little bit about biases, but like, Yann's a smart guy who's been thinking about this a long time. Like, what do you think is the, like, best case for the opposition? Like, why we shouldn't worry about this?
C
The best case is what I'm working on. The best case is finding a technical solution.
A
Actually.
C
I think that it is possible to build AI that will behave well. And I think ideally it would be like the most important project of humanity until we figure it out. Because if we don't, then there are these risks and I don't know, like I don't have a lot of confidence one way or the other. It's just I don't accept a 1% risk. There are these polls where the median machine learning researcher thinks that there's more than 10% or 20% probability that AI will be catastrophic up to extinction. Well, that's not even 1%, it's like 10, 20, whatever. I mean, that is completely unacceptable. So the case I'm making is if we're careful, and I think we can do it, we can actually solve those technical problems and we can build AI that will have safety by design, as some people call it.
A
Can you give us like a non technical sketch of how would that be different from what we're doing today?
C
Yeah, so we actually just previously discussed the reasons why AIs might have these bad goals emerging. It's very unlikely that we will get people to stop the train of AI capability, in other words, that they know more, they can reason better. And so my thinking is where we can intervene is their intentions. Okay, they're going to be smart, but how do we make sure they're smart and don't want to kill us or something like that? And as a computer scientist, I like to think about, oh well, let's think about the extreme case of this. How do we avoid bad intentions? We can avoid all intentions. So we can build a machine that is like the laws of physics, that can make very good predictions, understands how the world works, but is not a person, has no goal. It's just a really good model of the world, like a really smart encyclopedia.
B
But there's certain things it couldn't do. I mean, for example, without intentions, it couldn't play chess successfully.
C
Let me get to that. Okay, it's a great question. So this is only the starting point. Can we build a machine that we totally trust and knows a lot, understands a lot, can reason and answer our questions like a perfect oracle? It would be a probabilistic oracle. So it doesn't need to be certain about things. It's not trying to please us, right. It's just trying to be totally honest. Which means it's going to give us numbers: 10% probability, 100% probability, whatever, not 100% in general, like 50%, whatever. Okay, so we could now use this as part of a system that actually acts in the world. For example, companies already use what they call monitors, guardrails. These are pieces of code which sit on top of their neural net agent and check that either the queries that the AI gets or the answers are kosher in some way. Like it's not an answer about building a bomb or whatever. The problem is these current guardrails don't work that great. But to do the job of the guardrail, you don't need to have an AI that is an agent that has plans. It just needs to be really good at predicting the consequences of actions. So you can ask it, what's the probability that this action, this output that the AI is about to produce, is going to cause some categories of harm? And if the probability is above a threshold, you can just reject that action. So right now we do get that in our interactions. Sometimes the AI says, I'm sorry, I can't answer. Right, but we need that process to be a lot stronger. So we need the AIs that form the guardrail to really understand the world well and be smart. And we need to trust those AIs, which is not the case right now.
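The threshold check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not anything presented in the episode: `guardrail`, `Verdict`, `toy_predictor` and the default threshold are hypothetical names and values chosen for the example, and the keyword-based predictor is a deliberately crude stand-in for the trusted, non-agentic model the speaker has in mind.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a predictor maps a proposed action (here, text) to an estimated
# probability of harm in [0, 1]. It only predicts; it has no goals or plans.
HarmPredictor = Callable[[str], float]


@dataclass
class Verdict:
    allowed: bool
    harm_probability: float


def guardrail(action: str, predict_harm: HarmPredictor, threshold: float = 0.01) -> Verdict:
    """Reject the action if its predicted harm probability is above the threshold."""
    p = predict_harm(action)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"predictor returned an invalid probability: {p}")
    return Verdict(allowed=p <= threshold, harm_probability=p)


def toy_predictor(action: str) -> float:
    """Hypothetical stand-in: a keyword lookup, NOT a real harm model."""
    return 0.9 if "bomb" in action.lower() else 0.001


if __name__ == "__main__":
    for proposed in ["How do I bake bread?", "How do I build a bomb?"]:
        verdict = guardrail(proposed, toy_predictor)
        status = "allowed" if verdict.allowed else "rejected"
        print(f"{proposed!r}: {status} (p_harm={verdict.harm_probability})")
```

The design point mirrors the argument in the transcript: the guardrail itself contains no planning, only a prediction and a comparison. All of the difficulty sits inside the predictor, which would need to understand the world well and be trustworthy.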
B
There's plenty more of that agreeable disagreement. To hear it, sign up at therestispolitics.com.
This episode dives into the existential risks and opportunities of artificial intelligence, bringing together host Rory Stewart, co-host Matt Clifford, and famed computer scientist and Turing Award winner Yoshua Bengio. The discussion moves from alarming anecdotes about deceptive AI behavior, through technical explanations of how AI develops "goals" and strategies, to the feasibility of technical and regulatory solutions. The tone is urgent, reflective, and continually engages with the tension between AI’s potential benefits and its critical dangers.
The episode manages a delicate balance: deeply informed technical explanations, candid acknowledgment of high-stakes uncertainty, and a civil exchange between competing views. Bengio’s advocacy for “safety by design” and his call to treat this challenge as humanity’s highest priority ring loudest. Both hosts clearly find the pace of AI capability and emergence of unintended behaviors both fascinating and deeply troubling, yet the discussion retains hope that technical and policy innovations can prevail—if we act wisely and quickly.
For listeners and newcomers alike, the episode provides a gripping and accessible entry point to the most pressing debates in AI policy and philosophy today.