Transcript
A (0:00)
Today I'm speaking with Yoshua Bengio. He is the scientific director at LawZero, a Turing Award winner in 2018, the most cited computer scientist of all time, and, as it happens, also the most cited scientist of any kind who is still alive. Thanks so much for coming on the show, Yoshua.
B (0:14)
Thanks for having me.
A (0:16)
You think you've found the right approach to building a safe, superintelligent AI? What's the approach?
B (0:20)
It's based on a simple notion: that if we can bake honesty into AI, we can get safety. So we can reduce the problem to how we train a system to be honest. And it turns out that there's a way to do that which only requires changing the training objective and the way the data is processed. There's also another aspect, which is that the system relies on a non-agentic foundation: a predictor that is not trained by reinforcement learning and is going to have these honesty guarantees. But we can then use the same kind of math to construct a policy, to construct an agent, that will be trained in a way that also provides those guarantees.
A (1:14)
So what does the new training process look like and how is it different from the models that people are familiar with?
B (1:19)
So the main difference with the training process is that it is geared toward approximating the Bayesian posterior over queries in natural language. Imagine a neural net and some extra apparatus around it, chain-of-thought style, that takes questions about statements regarding properties of the world that can be true or false given other statements, and then outputs a probability. So that's the core building block; we call it a predictor. And we can use stochastic gradient descent on a different objective that has the property that it is globally minimized by the Bayesian predictor. In other words, by the predictor that fits the data and has a small description length.
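To ground that last claim, here's a toy illustration in Python. This is not the objective from Yoshua's papers, just the standard result they build on: the log-loss is a proper scoring rule, so its average over data is globally minimized when the model reports the true probability, and the papers add a description-length term on top so that the global minimum becomes the Bayesian predictor rather than the raw empirical frequency.

```python
import numpy as np

# Toy illustration (not the objective from the papers): log-loss is a
# "proper scoring rule", so the average loss over outcomes drawn from
# the world is globally minimized when the model outputs the true
# probability that the statement is true. Bengio's proposal adds a
# description-length (MDL) term, shifting the global minimum from the
# empirical frequency toward the Bayesian predictor.

rng = np.random.default_rng(0)

TRUE_PROB = 0.7                            # ground-truth P(statement true)
labels = rng.random(100_000) < TRUE_PROB   # sampled true/false outcomes

def log_loss(q: float) -> float:
    """Average negative log-likelihood if the model always answers q."""
    return float(-np.mean(np.where(labels, np.log(q), np.log(1 - q))))

candidates = np.linspace(0.01, 0.99, 99)   # grid of answers 0.01 .. 0.99
best = candidates[np.argmin([log_loss(q) for q in candidates])]
print(f"loss-minimizing answer: {best:.2f} (ground truth: {TRUE_PROB})")
```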
A (2:18)
So you'd be building a model where you would feed in a statement and it would basically tell you what probability it assigns to that statement being true.
B (2:25)
Yes. In context, yes.
A (2:27)
Hey, listeners. Rob jumping in here. Yoshua is naturally pitching this in a way that's ideal for staff at frontier AI companies, and they're obviously a particularly important audience for this proposal. But I'm confident that with just a few minutes of plain-language explanation, everyone else will be able to follow the rest of the conversation as well. So bear with me, or skip ahead about four minutes if you feel very at home with this sort of material already.

As you probably know, in their first stage of training, today's large language models are taught to predict the word, or at least the token, that's most likely to come next. Then, in a second stage, reinforcement learning trains those models to produce the kinds of responses that we're most likely to say we like and want, rather than just the responses that were most probable in the full corpus of all human-generated text.

Now, Yoshua's alternative is to build an AI model oriented not around predicting what a human would be likely to say, or what they would prefer to hear, but around modeling what's actually true in the world, by developing hypotheses and assigning probabilities to them with the goal of best explaining all of the data it's exposed to during training. Yoshua argues that you'll be able to train a model of this type while porting over most of the methods we use to train ordinary LLMs today, benefiting from the same neural net architectures, training techniques, scaling improvements, all of that. And you'd also be able to train it on roughly the same body of raw text that we use for all other AIs. But you could structure that data a bit differently, giving it what AI researchers call a different syntax.

First, all the things that people said or wrote get tagged as communication acts: we know someone said these things, and we know where they said it, but we don't know whether they're true. And second, a small number of statements that we have strong independent grounds to trust, such as verified mathematical proofs and some scientific measurements, get tagged as verified factual claims about the world.

The model is then trained to find the combination of possible underlying facts about the world that would best explain everything that it sees in aggregate, both the things people said and the verified facts it's been given as ground truth. These hypothesized facts about the world are what AI researchers call latent variables, meaning variables that the AI can't directly observe and will have to infer indirectly. What the model will ultimately be able to give us is its estimated probability that any given statement in natural human language is true, as well as how much the model trusts its own answer on that: how confident it is that it has a good grip on that question.

Crucially, Yoshua says that by tagging all text into these two categories from the very beginning, things someone said versus factual statements, you can then ask the model questions as though you're asking about reality, not about communication acts, by using the factual statement tag. And because these two categories have been there from the very beginning, the model knows the difference, and it won't blur the line between the two. That's something you don't get with AI models today. Yoshua also argues, using various mathematical theorems in his papers, that unlike ordinary LLMs, a model trained in this way would be honest by design, and furthermore that such a model would by itself have no goals and no preferences about the state of the world. It would be what Yoshua calls just a pure predictor.
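To make the two-category tagging concrete, here's a hypothetical sketch of what that data layout might look like. The tag names, example sources, and the predictor interface in the final comment are illustrative assumptions on my part, not the actual syntax from the papers:

```python
# Hypothetical data layout (tag names are illustrative, not from the
# papers): every span of training text is marked either as a
# communication act -- someone said it, truth unknown -- or as an
# independently verified factual claim about the world.

training_examples = [
    # Communication act: the model only learns that this was said.
    {"tag": "said", "source": "web forum",
     "text": "The Great Wall of China is visible from the Moon."},
    # Verified fact: independently checked, treated as ground truth.
    {"tag": "fact", "source": "mathematical proof",
     "text": "There are infinitely many prime numbers."},
]

# At inference time, the query itself carries the "fact" tag, so the
# model is asked about reality rather than about what people tend to say.
query = {"tag": "fact",
         "text": "The Great Wall of China is visible from the Moon."}

# Assumed interface for the trained predictor (shape only, not real code):
#   predictor(query) -> {"p_true": 0.02, "self_trust": 0.9}
```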
Now, there are two main uses for this. Near term, as a sort of stopgap solution, you bolt the predictor onto existing AI agents as a sort of guardrail: an independent filter that sits between the agent and the world, checking over its proposed actions and rejecting those that it predicts will be harmful. (There's a rough sketch of that loop at the end of this aside.) But as he'll explain in a minute, Yoshua thinks we can ultimately do much better than this. Yoshua wants to put scaffolding around the prediction model, asking it different questions at each stage to effectively assemble it into a capable agent while keeping it just as honest as it was before. We'll then hopefully be able to have our cake and eat it too: getting the highly capable agents that businesses are craving and demanding and insisting on, while still being confident that those agents are being completely direct with us. Yoshua thinks that these agents might perhaps even be more capable as well, thanks to a superior reasoning process, or at the very least a clearer and more explainable one.

It's fair to say that this proposal is huge if true, or at least huge if it works. And of course, not everyone is sold on that idea, as Yoshua and I will discuss later. Okay, that's the shape of things to come. The technical discussion continues for a while, but if you decide you want to skip it, the second half of the conversation stands very well on its own, starting with the chapter "How much would this cost?"
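Before the show resumes, here's a minimal sketch of that near-term guardrail loop in Python. The `agent` and `predictor` interfaces, the phrasing of the harm question, and the 1% threshold are all assumptions for illustration, not anything specified in the proposal:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the "guardrail" use, under assumed interfaces:
# `agent.propose()` and `predictor.probability()` are hypothetical
# stand-ins, and the harm threshold is an illustrative choice.

HARM_THRESHOLD = 0.01  # maximum acceptable predicted probability of harm

@dataclass
class Action:
    description: str   # natural-language description of the proposed action

def guarded_step(agent, predictor, observation) -> Optional[Action]:
    """Let the agent propose an action, but only pass it through to the
    world if the independent predictor deems harm sufficiently unlikely."""
    action = agent.propose(observation)   # untrusted agent proposes an action
    question = ("Executing this action would cause serious harm: "
                + action.description)
    p_harm = predictor.probability(question)  # honest predictor evaluates it
    if p_harm > HARM_THRESHOLD:
        return None    # reject: the action never reaches the world
    return action      # accept: predicted safe enough to execute
```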
All right, on with the show. And how would you train a model like that?