Transcript
A (0:00)
Today I'm speaking with Yoshua Bengio. He is the scientific director at LawZero, a Turing Award winner in 2018, the most cited computer scientist of all time, and, as it happens, also the most cited scientist of any kind who is still alive. Thanks so much for coming on the show, Yoshua.
B (0:14)
Thanks for having me.
A (0:16)
You think you've found the right approach to building a safe, superintelligent AI? What's the approach?
B (0:20)
It's based on a simple notion: that if we can bake honesty into AI, we can get safety. So we can reduce the problem to how we train a system to be honest. And it turns out that there's a way to do that which only requires changing the training objective and the way the data is processed. There's also another aspect, which is that the system relies on a non-agentic foundation: a predictor that is not trained by reinforcement learning and is going to have these honesty guarantees. But we can then use the same kind of math to construct a policy, to construct an agent, that will be trained in a way that also provides those guarantees.
A (1:14)
So what does the new training process look like and how is it different from the models that people are familiar with?
B (1:19)
So the main difference with the training process is that it is geared toward approximating the Bayesian posterior over queries in natural language. Imagine a neural net and some extra apparatus around it, chain-of-thought style, that takes questions about statements regarding properties of the world that can be true or false given other statements, and then outputs a probability. So that's the core building block; we call it a predictor. And we can use stochastic gradient descent on a different objective that has the property that it is globally minimized by the Bayesian predictor. In other words, by the predictor that fits the data and has a small description length.
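To ground that last claim, here's a toy illustration in Python. This is not the objective from Yoshua's papers, just the standard result they build on: the log-loss is a proper scoring rule, so its average over data is globally minimized when the model reports the true probability, and the papers add a description-length term on top so that the global minimum becomes the Bayesian predictor rather than the raw empirical frequency.

```python
import numpy as np

# Toy illustration (not the objective from the papers): log-loss is a
# "proper scoring rule", so the average loss over outcomes drawn from
# the world is globally minimized when the model outputs the true
# probability that the statement is true. Bengio's proposal adds a
# description-length (MDL) term, shifting the global minimum from the
# empirical frequency toward the Bayesian predictor.

rng = np.random.default_rng(0)

TRUE_PROB = 0.7                            # ground-truth P(statement true)
labels = rng.random(100_000) < TRUE_PROB   # sampled true/false outcomes

def log_loss(q: float) -> float:
    """Average negative log-likelihood if the model always answers q."""
    return float(-np.mean(np.where(labels, np.log(q), np.log(1 - q))))

candidates = np.linspace(0.01, 0.99, 99)   # grid of answers 0.01 .. 0.99
best = candidates[np.argmin([log_loss(q) for q in candidates])]
print(f"loss-minimizing answer: {best:.2f} (ground truth: {TRUE_PROB})")
```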
A (2:18)
So you'd be building a model where you would feed in a statement and it would basically tell you what probability it assigns to that statement being true.
B (2:25)
Yes. In context, yes.
A (2:27)
Hey, listeners. Rob jumping in here. Yoshua is naturally pitching this in a way that's ideal for staff at frontier AI companies, and they're obviously a particularly important audience for this proposal. But I'm confident that with just a few minutes of plain-language explanation, everyone else will be able to follow the rest of the conversation as well. So bear with me, or skip ahead about four minutes if you feel very at home with this sort of material already.

As you probably know, in their first stage of training, today's large language models are taught to predict the word, or at least the token, that's most likely to come next. Then, in a second stage, reinforcement learning trains those models to produce the kinds of responses that we're most likely to say we like and want, rather than just the responses that were most probable in the full corpus of all human-generated text.

Now, Yoshua's alternative is to build an AI model oriented not around predicting what a human would be likely to say, or what they would prefer to hear, but around modeling what's actually true in the world, by developing hypotheses and assigning probabilities to them with the goal of best explaining all of the data it's exposed to during training. Yoshua argues that you'll be able to train a model of this type while porting over most of the methods we use to train ordinary LLMs today, benefiting from the same neural net architectures, training techniques, scaling improvements, all of that. And you'd also be able to train it on roughly the same body of raw text that we use for all other AIs. But you could structure that data a bit differently, giving it what AI researchers call a different syntax.

First, all the things that people said or wrote get tagged as communication acts: we know someone said these things, and we know where they said it, but we don't know whether they're true. And second, a small number of statements that we have strong independent grounds to trust, such as verified mathematical proofs and some scientific measurements, get tagged as verified factual claims about the world.

The model is then trained to find the combination of possible underlying facts about the world that would best explain everything that it sees in aggregate, both the things people said and the verified facts it's been given as ground truth. These hypothesized facts about the world are what AI researchers call latent variables, meaning variables that the AI can't directly observe and will have to infer indirectly. What the model will ultimately be able to give us is its estimated probability that any given statement in natural human language is true, as well as how much the model trusts its own answer on that: how confident it is that it has a good grip on that question.

Crucially, Yoshua says that by tagging all text into these two categories from the very beginning, things someone said versus factual statements, you can then ask the model questions as though you're asking about reality, not about communication acts, by using the factual statement tag. And because these two categories have been there from the very beginning, the model knows the difference, and it won't blur the line between the two. That's something you don't get with AI models today. Yoshua also argues, using various mathematical theorems in his papers, that unlike ordinary LLMs, a model trained in this way would be honest by design, and furthermore that such a model would by itself have no goals and no preferences about the state of the world. It would be what Yoshua calls just a pure predictor.
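To make the two-category tagging concrete, here's a hypothetical sketch of what that data layout might look like. The tag names, example sources, and the predictor interface in the final comment are illustrative assumptions on my part, not the actual syntax from the papers:

```python
# Hypothetical data layout (tag names are illustrative, not from the
# papers): every span of training text is marked either as a
# communication act -- someone said it, truth unknown -- or as an
# independently verified factual claim about the world.

training_examples = [
    # Communication act: the model only learns that this was said.
    {"tag": "said", "source": "web forum",
     "text": "The Great Wall of China is visible from the Moon."},
    # Verified fact: independently checked, treated as ground truth.
    {"tag": "fact", "source": "mathematical proof",
     "text": "There are infinitely many prime numbers."},
]

# At inference time, the query itself carries the "fact" tag, so the
# model is asked about reality rather than about what people tend to say.
query = {"tag": "fact",
         "text": "The Great Wall of China is visible from the Moon."}

# Assumed interface for the trained predictor (shape only, not real code):
#   predictor(query) -> {"p_true": 0.02, "self_trust": 0.9}
```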
Now, there are two main uses for this. Near term, as a sort of stopgap solution, you bolt the predictor onto existing AI agents as a sort of guardrail: an independent filter that sits between the agent and the world, checking over its proposed actions and rejecting those that it predicts will be harmful. (There's a rough sketch of that loop at the end of this aside.) But as he'll explain in a minute, Yoshua thinks we can ultimately do much better than this. Yoshua wants to put scaffolding around the prediction model, asking it different questions at each stage to effectively assemble it into a capable agent while keeping it just as honest as it was before. We'll then hopefully be able to have our cake and eat it too: getting the highly capable agents that businesses are craving and demanding and insisting on, while still being confident that those agents are being completely direct with us. Yoshua thinks that these agents might perhaps even be more capable as well, thanks to a superior reasoning process, or at the very least a clearer and more explainable one.

It's fair to say that this proposal is huge if true, or at least huge if it works. And of course, not everyone is sold on that idea, as Yoshua and I will discuss later. Okay, that's the shape of things to come. The technical discussion continues for a while, but if you decide you want to skip it, the second half of the conversation stands very well on its own, starting with the chapter "How much would this cost?"
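Before the show resumes, here's a minimal sketch of that near-term guardrail loop in Python. The `agent` and `predictor` interfaces, the phrasing of the harm question, and the 1% threshold are all assumptions for illustration, not anything specified in the proposal:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the "guardrail" use, under assumed interfaces:
# `agent.propose()` and `predictor.probability()` are hypothetical
# stand-ins, and the harm threshold is an illustrative choice.

HARM_THRESHOLD = 0.01  # maximum acceptable predicted probability of harm

@dataclass
class Action:
    description: str   # natural-language description of the proposed action

def guarded_step(agent, predictor, observation) -> Optional[Action]:
    """Let the agent propose an action, but only pass it through to the
    world if the independent predictor deems harm sufficiently unlikely."""
    action = agent.propose(observation)   # untrusted agent proposes an action
    question = ("Executing this action would cause serious harm: "
                + action.description)
    p_harm = predictor.probability(question)  # honest predictor evaluates it
    if p_harm > HARM_THRESHOLD:
        return None    # reject: the action never reaches the world
    return action      # accept: predicted safe enough to execute
```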
All right, on with the show. And how would you train a model like that?