
In this episode of Google DeepMind: The Podcast, VP of Reinforcement Learning, David Silver, describes his vision for the future of AI, exploring the concept of the "era of experience" versus the current "era of human data". Using AlphaGo and AlphaZero as examples, he highlights how these systems surpassed human capabilities by engaging in reinforcement learning without prior human knowledge. This approach contrasts with large language models, which depend on human data and feedback. Silver emphasizes the need to explore this path to drive AI progress and achieve artificial superintelligence.
David Silver
We're going to need our AIs to actually figure things out for themselves and to discover new things that humans don't know. And I think that's going to be a whole new era of AI that's going to be incredibly exciting and profound for society.
Hannah Fry
Welcome back to Google DeepMind: The Podcast. My guest today is the inimitable David Silver, an original DeepMinder and one of the key people behind the phenomenal success of AlphaGo, the first program to master the world's most complex board game and achieve superhuman performance. Now, at the end of today's podcast, we have a little extra treat for you, a conversation with David and Fan Hui, the first professional Go player to take on the AI. But now David has a bold idea about the direction that AI should go in next. After all of the current buzz and excitement and achievements of multimodal models, David has a plan for the path towards superhuman intelligence, a new phase which he calls the era of experience. Now, this is a profound idea and not one without risks. David, welcome to the podcast.
David Silver
Hi, it's great to be here. Real pleasure. Thank you.
Hannah Fry
Okay, so I have spent the weekend with a very enjoyable read of your position paper, and in it you are talking about the era of experience. Summarize for us, what do you mean by that?
David Silver
Well, what I mean is that if you look at where AI has been for the last few years, it's been in what I call the era of human data, which is that all of these AI methods, they're based on one common idea, which is we extract every piece of knowledge that humans have and you kind of feed it into the machine. And that's one incredibly powerful way to do things. There's another way to do things. And this is what's going to lead us into the era of experience, which is where the machine actually interacts with the world itself and it generates its own experience, it tries things out in the world and it starts to build up its own experience. And if you think of that data as fueling the machine, then that will lead to this next generation of AI that we can think of as the era of experience.
Hannah Fry
I guess this is, in a way, you're sort of thumping the table, saying large language models are not the only AI. Right? Like there are alternatives, there are different ways that we can approach this.
David Silver
That's right. I think we've really got a lot out of building large language models in the field of AI, harnessing the vast quantity of human data, particularly natural language data, that's out there and kind of assimilating that all into a machine that knows everything that humans have ever written down. But at some point we need to get past that. We want to go beyond that. We want to go beyond what humans know. And to do that, we're going to need a different type of method. And that type of method will require our AIs to actually figure things out for themselves and to discover new things that humans don't know. And I think that's going to be a whole new era of AI that's going to be incredibly exciting and profound for society.
Hannah Fry
Well, okay, then let's talk about some other sort of famous AIs, famous algorithms that have employed different types of methods, most notably AlphaGo and AlphaZero, which of course notoriously beat the world's best Go players about a decade ago. Right. Tell us about the techniques that were used in those and how they differ from the large language models that we see today.
David Silver
So AlphaZero in particular is very different from the type of human data approaches that have been used recently because it literally uses no human data. That's the zero in AlphaZero. So there is literally zero human knowledge that's pre-programmed into this system. And so what's the alternative? How do you learn Go knowledge if you're not copying humans and you don't really know in advance the right way to play? Well, the way you go about it is through a form of trial and error learning, where AlphaZero basically played itself millions of games of Go or chess or whatever the game is that it was wanting to play. And bit by bit it figured out, oh, if I play this kind of move in this kind of situation, then I end up winning more games. And then that's a piece of experience that is used to fuel it to become stronger. And then it will play a little bit more like that, and the next time it will discover something new, there'll be some new pattern. It's like, oh, when I use this particular pattern, I end up winning more games or losing more games, and that feeds the next generation and so forth. And this learning from the agent's self-generated experience was enough in AlphaZero to fuel its progress all the way from completely random behavior up to the strongest chess and Go playing programs that the world has ever known.
Hannah Fry
They didn't start off just as like random empty boxes though, right, that kind of found how to play Go from nothing? I mean, initially when you were designing your Go algorithms, you'd worked out a way to encode Go games and then feed them in as a database, right?
David Silver
Yeah, that's right. So the original version of AlphaGo, the version which famously beat Lee Sedol in 2016, this version of AlphaGo actually did use some human data to start it off. So we basically fed it a database of human professional moves and it ingested those human moves and that gave it its starting point, and then it learned for itself by experience from that point onwards. However, what we discovered a year later was that the human data wasn't necessary, that you could actually throw out the human moves altogether. And what we showed was that the resulting program was not only able to recover this level of performance, it actually worked better and was able to learn even faster than the original AlphaGo to achieve a much higher level of performance.
Hannah Fry
That is such a strange idea, Such a strange idea that you throw away the human data and you find not only was it not necessary, but it was actively limiting performance in a way.
David Silver
I think one of the hard lessons for people in AI, this is sometimes called the bitter lesson of AI, is that we really want to believe that all of the knowledge that we accumulated as humans is really important. We really want to believe that. And so we feed it into our systems, we build it into our algorithms. And what happens is that actually that makes us design the algorithms in a way which is maybe fitted to the human data and is less good at actually learning for itself. And what happens is if you throw out the human data, you actually spend more effort on how the system can learn for itself. And that's the part which can then learn and learn and learn forever.
Hannah Fry
The bitter lesson. I suppose in a way it's sort of saying, accepting that it's possible that something could play go better than humans can, sort of removing that ceiling in a way.
David Silver
That's right, that, you know, human data, it's really helpful to get you off the ground. But there is a ceiling to everything that humans have done. And, you know, we see that in Go there was a maximum level of performance that humans have ever achieved. And we need to break through these ceilings. And in AlphaZero we were able to break through that ceiling by building a system that learned for itself by self-play and got better and better and better until it blasted through that ceiling and went far beyond. And I think the idea of the era of experience is that we find the methods that allow us to break through that ceiling everywhere. We build AI systems that become superhuman in all of the capacities where humans seem so amazing, but we find the way to go beyond that.
Hannah Fry
Let me just stick with Go for a second, okay, before we get onto the other ways that you can get rid of human data and thus improve on human ability. Because it sort of sounds a little bit, when you say, let's just get rid of all of the games of Go that humans have played and start with nothing, it sort of sounds like a magic trick. Just tell me a little bit about the techniques that you're really using there in order to get a machine to, as you say, chain together thousands and thousands of different ideas in order to be amazing at the game of Go.
David Silver
Well, the main idea is an approach that we call reinforcement learning. And the idea of reinforcement learning is you basically give the outcome of your game a number, and we say plus one if you win and minus one if you lose.
Hannah Fry
One point.
David Silver
Exactly, exactly. And what we do then with reinforcement learning is we give the system a reward each time it does something right. And we train the system to basically reinforce, that means do more of, the things that get more reward. And so, for example, if you've got a neural network, like we do in AlphaGo, that's picking the moves, what you want to do is tweak the weights of your neural network a little bit in the direction that gives you more reward. And that's the main idea of reinforcement learning.
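Editor's illustration (not part of the conversation): the idea of tweaking weights "in the direction that gives you more reward" can be sketched with a policy-gradient (REINFORCE) update on a one-move game. This is a minimal sketch under stated assumptions, not AlphaGo's training code: the game, the function name `train_policy`, the win probabilities, and the step size are all hypothetical, and the "network" is just one weight per action.

```python
import math
import random


def softmax(logits):
    """Turn raw weights into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(win_prob, steps=5000, lr=0.1, seed=0):
    """REINFORCE on a one-move game: each action wins (+1) or loses (-1).

    win_prob[i] is the (hypothetical) chance that action i wins.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(win_prob)  # the policy's only weights
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if rng.random() < win_prob[action] else -1.0
        # Gradient of log pi(action) w.r.t. logit i is (1[i == action] - probs[i]).
        # Scaling it by the reward makes rewarded actions more likely.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    # Action 1 wins 80% of the time, so the policy should come to prefer it.
    print(train_policy([0.2, 0.8]))
```

The only learning signal here is the plus one or minus one at the end of each game; nothing tells the policy which action a human would have chosen.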
Hannah Fry
But then, okay, a game of Go is quite long. How do you make it so that you do the right moves in the beginning so that you end up with the right outcome towards the end? How do you work out which bits of the game are important, I suppose?
David Silver
So this is a really important problem. It's called the credit assignment problem. And, yeah, you're absolutely right that you could have had, you know, 100 or 200 or 300 different moves, and then at the end, you just get one bit of information saying, you know, win or loss, and you somehow have to work out which of the moves in the game were responsible for winning and which of the moves in the game were responsible for losing. And there's lots of ways to do that. The simplest way is just to assume that everything that you've done contributes a little bit to that outcome at the end. And it sort of all comes out in the wash.
Hannah Fry
One of the biggest moments of the AlphaGo story was move 37 that everyone always references. Just tell me about that.
David Silver
So move 37 was a move that happened in the second game of AlphaGo against Lee Sedol. And AlphaGo played a move that defied everyone's expectations. The traditional idea in the game of Go is that you play your moves typically on the third line or the fourth line of the board, because this either gives you territory for the third line or influence on the fourth line, and you never go below or above that. It just wouldn't make sense to humans. AlphaGo played on the fifth line, and it somehow played this in a way that just made everything make sense on the board. It kind of connected everything together with this move on the fifth line. And it was so alien to humans that we estimated there was only a 1 in 10,000 probability that a human would ever think of playing this move. Humans were shocked by this move, and yet it helped win the game. And so it was a moment where humans said, look, here's something creative that happened, something that a machine came up with that was different from the way humans traditionally thought about the game. That actually was a big piece of progress and took us beyond the kind of confines of human knowledge.
Hannah Fry
And I guess if we really do want to advance AI, we sort of do want those alien ideas, as you put it. Do you think you've seen an equivalent of Move 37 with large language models?
David Silver
Move 37 in some ways was special because it was the first moment, it was the first time that people had seen a big breakthrough like this. And because we've been in the era of human data, we focused a huge amount on reproducing human capabilities, and we've focused much less on going beyond them. And I think until we really emphasize systems learning for themselves to go beyond human data, we won't see huge breakthroughs, the equivalent of move 37 in the real world. It seems unlikely to me because when...
Hannah Fry
...you're anchored in human data, you're only ever going to have human-like responses.
David Silver
That's right. And I think there are things you can do that allow you to maybe do things in the middle a little bit. So if you push me to say, what's the greatest move 37-like moment? I would probably pick out some work by scientists at MIT who discovered a new antibiotic that no human knew about. And I think that's an incredible discovery of massive importance to humanity. So in that sense, it goes way beyond move 37. But what I like about move 37 is that it's not just a single discovery. It's one of an infinite series of discoveries where the system can just keep on learning and learning and learning, and move 37 is important to me because it represents just a single point in that infinite sequence of discoveries that can happen once you've got this kind of approach of learning from experience rather than.
Hannah Fry
Actual result in and of its own. Right?
David Silver
Yeah, that's right.
Hannah Fry
Give me a brief rundown of how AlphaZero worked.
David Silver
So AlphaZero is surprisingly simple. I mean, there's some very complicated algorithms out there in the world, but this one is really straightforward. So all you do is you start with a policy, a way to pick moves, and a value function, which is a way to evaluate moves and say whether they're good or bad. So you start with that, you run a search, and then what you do is you take the best move according to your search and you train your policy to do more things like that, to do more of the good moves according to your search, and you train your value function based on how the game actually panned out when you played a game with this search. And that's it, you just iterate that millions of times and out pops a superhuman game player.
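Editor's illustration (not part of the conversation): the loop described above (search, train the policy toward the search's move, train the value function on the game's outcome) can be sketched on a toy game. Everything here is an assumption made for illustration: the game is a small Nim variant (take 1 to 3 stones, whoever takes the last stone wins), the policy and value function are lookup tables rather than neural networks, and a one-move lookahead stands in for the much deeper tree search a real system would use. Unlike in AlphaZero, the policy here is only trained from the search and does not guide it.

```python
import math
import random

PILE = 10          # hypothetical toy game: start with up to 10 stones
MOVES = (1, 2, 3)  # take 1-3 stones; taking the last stone wins


def legal(s):
    return [a for a in MOVES if a <= s]


def search(value, s):
    """One-move lookahead: score each move by the position it leaves behind."""
    scores = {}
    for a in legal(s):
        nxt = s - a
        # Taking the last stone wins outright; otherwise the opponent moves
        # next, so our score is the negative of their estimated value.
        scores[a] = 1.0 if nxt == 0 else -value[nxt]
    return max(scores, key=scores.get)


def train(games=20000, lr=0.05, explore=0.2, seed=0):
    rng = random.Random(seed)
    value = [0.0] * (PILE + 1)  # value[s]: estimate for the player to move
    logits = [{a: 0.0 for a in legal(s)} for s in range(PILE + 1)]
    for _ in range(games):
        s = rng.randint(1, PILE)
        history = []  # (state, move chosen by search); players alternate turns
        while s > 0:
            best = search(value, s)
            history.append((s, best))
            # Self-play with some random exploration.
            move = rng.choice(legal(s)) if rng.random() < explore else best
            s -= move
        outcome = 1.0  # the player who made the final move won
        for state, best in reversed(history):
            # Value function: move toward how the game actually panned out.
            value[state] += lr * (outcome - value[state])
            # Policy: move toward the search's preferred move (cross-entropy step).
            z = sum(math.exp(v) for v in logits[state].values())
            for a in logits[state]:
                p = math.exp(logits[state][a]) / z
                logits[state][a] += lr * ((1.0 if a == best else 0.0) - p)
            outcome = -outcome  # flip perspective for the previous player
    return value, logits


def policy_move(logits, s):
    """The trained policy's preferred move, with no search at all."""
    return max(logits[s], key=logits[s].get)


if __name__ == "__main__":
    value, logits = train()
    print({s: policy_move(logits, s) for s in range(1, PILE + 1)})
```

In this toy game the known winning strategy is to leave the opponent a multiple of four stones. The table starts with no knowledge of that, and the only feedback it ever receives is who won each self-play game.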
Hannah Fry
It's like magic, basically.
David Silver
It does sometimes feel like magic. I remember the first time that really felt like magic to me was when we had just completed AlphaZero on chess. Someone had the idea of trying it on a different game. So we plugged it into a game that none of us could play, a game called Shogi, which is Japanese chess. And we had no idea how to play this game.
Hannah Fry
What, you didn't even know the rules?
David Silver
So the system knew the rules. The agent, we taught it the rules, but none of us had the first clue of, you know, strategy or tactics. It would have been like blunder after blunder if we'd been playing this game, and we just plugged it in and it was literally the first ever time we ran AlphaZero on Shogi. We had no idea whether it was good or not. We couldn't evaluate it. But we sent it off to Demis, who's actually a reasonably strong player, of course, and he said, hmm, this looks quite good. I'm sending it to the world champion. And the world champion said, hmm, I think this is superhuman. And so it literally felt like magic because we just pressed go on this system and had no idea of the process and how it got there. But somehow out popped a superhuman Shogi player.
Hannah Fry
Can AI design its own reinforcement learning algorithms?
David Silver
Well, funnily enough, we have actually done some work in this area. It's work we actually did a few years ago, but is coming out now. And what we did was actually to build a system that through trial and error, through reinforcement learning itself, figured out what algorithm was best at reinforcement learning. It literally went one level meta, and it learned how to build its own reinforcement learning system. And incredibly, that reinforcement learning system actually outperformed all of the human reinforcement learning algorithms that we'd come up with ourselves over many, many years in the past.
Hannah Fry
I mean, this is the same story over and over again. The more of a human you put into something, the worse it acts, the worse it performs. Take the human out, does better. Okay, if AlphaGo and AlphaZero, then are really exceptional examples of reinforcement learning used to the best it can be. You still find reinforcement learning in the large language models that we have at the moment. Right. Tell me about how they're integrated into these systems.
David Silver
So reinforcement learning is used in almost all large language model systems. And the main way it's used is by combining it with human data. So unlike the AlphaZero approach, this means that the reinforcement learning is actually trained on human preferences. So the system is basically asked to produce outputs, and then a human says, this one is better than this other one, and the system becomes more like the one that the human prefers. And this is called reinforcement learning from human feedback. And it's been massively important in LLMs, and it's helped transform them from systems that just blindly mimic any kind of data that you see on the Internet into systems that actually usefully produce answers to the kind of questions that people really want to see. And so it's an incredible advance. However, it feels like we've thrown out the baby with the bathwater. These reinforcement learning from human feedback systems, or RLHF, they're very powerful, but they do not have the ability to go beyond human knowledge. Like, if a human rater doesn't recognize some new idea and underappreciates that there is some series of actions that would actually end up being far better than some other series of actions, there is no way that the system will ever learn to find that sequence, because the raters might not understand that better behavior.
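Editor's illustration (not part of the conversation): the step where "a human says this one is better than this other one" is often formalized as learning a score per output from pairwise comparisons. The sketch below uses a Bradley-Terry style logistic update, which is one common formulation and an assumption here, not something the conversation specifies; the function name `fit_preference_scores` and the three candidate outputs are hypothetical.

```python
import math


def fit_preference_scores(num_items, comparisons, epochs=200, lr=0.1):
    """Fit one score per candidate output from (preferred, rejected) pairs."""
    scores = [0.0] * num_items
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Modelled probability that the rater prefers `winner`:
            # sigmoid(score_winner - score_loser).
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient ascent on the log-likelihood of the rater's choice.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores


if __name__ == "__main__":
    # A rater prefers output 0 over 1, 0 over 2, and 1 over 2.
    print(fit_preference_scores(3, [(0, 1), (0, 2), (1, 2)]))
```

The fitted scores can only reproduce the rater's ordering. If the rater undervalues an output, so does the learned reward, which is the limitation described above.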
Hannah Fry
That human feedback element, though, it does seem to give these models some sense of grounding. Like, I know the last time we spoke, grounding was like this really big topic, this idea that you want these algorithms to have a conceptual understanding almost of the world that we're living in. So if you take away or you remove that human feedback aspect, do you still end up with models that are grounded?
David Silver
I almost want to argue the opposite. I want to say that when we train a system from human feedback, it is not grounded. And the reason is that the way RLHF systems normally work is the system presents its response, its answer to a question, for example, and a rater says that's good or bad before the system actually does anything with that information. So it's like the human is prejudging the output of the system. So, for example, if you're asking for a cake recipe from an LLM, the human rater will look at the recipe that's output by the system and judge whether that recipe is good or bad before anyone has actually made the recipe and eaten the cake. And in that sense, it's ungrounded. Like, a grounded outcome would be, someone actually eats the cake and the cake is either delicious or disgusting. And then you've got grounded feedback that says, this cake really was a good cake or this cake was a bad cake. And it's that grounded feedback that allows the system to iterate and discover new things, because it can try out new recipes that maybe expert chefs presume will be disgusting, but actually turn out to be delicious.
Hannah Fry
Yeah, like a monster munch muffin or whatever.
David Silver
The most delicious food that ever existed.
Hannah Fry
Okay, that's interesting, though, because I have heard, I mean, even a conversation with Demis talking about how grounding gets into these models, how they kind of have built this conceptual understanding of things. And it sounds almost like what you're saying is that the grounding that they have is like a sort of superficial level of grounding, maybe.
David Silver
I think human data is grounded in human experience. So it's like the LLMs are sort of inheriting all of that information that humans maybe figured out from their own experiments. For example, in science, you know, a human might have tried to walk across water and discovered that they fell in, and then they might have created a boat and discovered that that floated. And all of that information can be inherited somewhat by the LLM. But if we want a system that actually makes discoveries and discovers some completely new form of propulsion across water, or some completely new mathematical idea, or some completely new medicine or new approach to biology, the data just isn't there. And the system needs to figure out for itself, through its own kind of experimentation, its own trial and error, and its own grounded feedback, whether that's a good idea or a bad idea.
Hannah Fry
I got to talk to Oriol Vinyals, who really spoke about how we are running out of human data and that we are going to need to start creating synthetic data in order to fill that gap. I mean, this is related, right, to that idea. It's just rather than using LLMs to create more human dialogue data, you're going about the solution in a different way.
David Silver
That's right. So synthetic data can mean a lot of things, but normally it would mean that you've got some process where you kind of take your existing LLM and you use it to generate some set of data. And I guess the argument is similar to the ceiling that we have from human data, that however good that synthetic data is, it will reach a point where it is no longer useful to the system becoming stronger. So the beauty of a self-learning system, where the fuel of the system is actually experience, is that as the system starts to get stronger, it starts to encounter problems that are exactly appropriate to the level it's at. So it will always be generating experience that allows it to solve the next problem that it's encountering. And so it can just get stronger and stronger and stronger forever. There is no limit. And that I think is what differentiates this particular approach of using self-generated experience from other forms of synthetic data.
Hannah Fry
Just returning to your cake example though, I mean, if you kind of follow that through, somebody eats the cake and says, yes, this was delicious, you're using the human feedback then at the end of the process anyway. Are we talking about that or are we talking about maybe having systems that are completely untethered from humans and are embodied or in the physical world somehow so that they can get their feedback in that way?
David Silver
Look, I think the ideal is that, like AlphaZero, we have systems which are able to generate vast volumes of self-generated experience that they can then verify for themselves. And in many domains that's going to be possible. And in many domains it's not going to be possible. In the ones where it's not possible, we have to acknowledge that humans are a big part of the environment that we're in. We have to acknowledge that they're a part of the world that we want our agents to live in. And so it seems reasonable to think of humans as a part of that environment and to think of the way that they behave as part of the observation that the agent receives. I think the thing which I'm pushing back against and saying is not grounded is not that, it's the fact that the reward that the agent learns from is coming from a human's judgment of, like, whether this sequence of actions is good or bad. And the system is not judging for itself based on the consequence of those actions in the actual world. And so, you know, one way to say it is that we shouldn't make, you know, human data a privileged part of the agent's experience. It's just observations in the world, and we should be able to learn from that like any other data.
Hannah Fry
If we go back to that AlphaGo example earlier of assigning that reward, that one point that it gets at the end, is this almost like the way that we're handling AI at the moment, that the algorithm does its first 10 moves or 15 moves, and then we insert a human in who says, yes, that's a good first 10 moves, and doesn't allow the whole process to kind of execute fully before you input that little bit of feedback?
David Silver
That's exactly right. So imagine that we were training AlphaGo, and after every single move, our best Go player comes in and says, oh, that move, that move was amazing. Or oh, no, no, that move was totally wrong. And then we get that feedback and we put it in, and the system learns to pick the move that the human prefers. It would not end up discovering move 37, because it would just end up playing like the human thinks is a good game of Go, and it would never discover the new ways to play Go that that human didn't know about.
Hannah Fry
Okay, so I think the environment of Go, what you're saying, makes a lot of sense in that environment. There are other environments too, where I think that this makes a lot of sense. I'm thinking here about the pinnacle of human thought, of mathematics. Tell me what's been going on.
David Silver
That space, like you say, is an incredible human endeavor that's had millennia of human effort going into it. And so in many ways, it does represent literally the limits of achievement by the human mind. And so naturally we turn to it for AI to see can we achieve those same levels of performance that humans have achieved over all of those years of endeavor. We recently put together what I think is a very exciting piece of work called AlphaProof. It is a system that learns through experience how to correctly prove mathematical problems. So if you give it a theorem and you don't tell it anything about how to actually prove that theorem, it will go away and figure out for itself a perfect proof of that theorem, and we can actually verify and guarantee that this proof is correct. One thing which is interesting about this is that it's the exact opposite of how LLMs normally work. Because if you ask LLMs to prove a mathematical problem at the moment, they will normally output some informal mathematics and say, just trust me, this is correct. And it might be correct, but it might not be. Because we know that LLMs tend to hallucinate a lot. They can make things up. And the nice thing about AlphaProof is that it is actually guaranteed to produce the truth.
Hannah Fry
So let's think of an example here to kind of anchor this in people's minds. Let's say that prime numbers are something that can't be divided by anything but themselves and one, and there are an infinite number of them. Off you go. Prove it.
David Silver
Yeah, so the way AlphaProof works is it's trained on millions of different examples of theorems, not just one. And what happens is it goes off and it trains on them. And to begin with, it can't solve the vast majority of them. 99.999% of the theorems, it just can't do.
Hannah Fry
And these are theorems that humans have already proved that you're feeding in?
David Silver
We feed into the system something like a million different theorems that humans have come up with themselves. But we don't provide the human proofs, we just provide the questions, but not the answers.
Hannah Fry
So you're giving it stuff that you know is true, but you're just not telling it how to prove it.
David Silver
And sometimes we don't even know it's true because what we actually do is we take the human theorem, the human question, and we actually turn it into a formal language.
Hannah Fry
These aren't using language in the sense that language models are using, but they are using a form of language. Like a mathematical language.
David Silver
That's right. So in fact, we do use a small, large language model. And that large language model allows us to output programming languages. And in particular, we use a programming language called Lean that allows all of mathematics to be expressed. And so it's an amazing idea that mathematicians have come up with, that all of these kinds of things that we normally talk about in English, or whatever language you happen to be speaking, can be transformed into a perfectly clear, verifiable mathematical language that allows all of the ideas of maths to be expressed and also all of the ideas of mathematical proof to be expressed. So you can say, for example, that if A implies B and B implies C, then there's a way to go from that to A implies C. And that's the kind of thing that you can do in this mathematical programming language. You essentially write a program that takes you from one to the other and, ta-da, you have a proof of this statement. So we take our kind of million human problems, and from that we generate 100 million formal problems. And some of those might actually not be possible, or they might be incorrectly formulated or, yeah, they might just be false. And it doesn't matter, because all we do is we learn to prove those things. The ones which we can't prove, we keep trying and keep trying. The ones that we already proved, okay, they're done, they're out the way. Now if we disprove them, that's fine, they're out the way too, and we're left with the really interesting ones, which are the ones which are really hard to prove. And we keep kind of climbing up from just being able to solve one or two of them to then being able to solve 10 or 20 of them and eventually being able to solve a million of them.
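Editor's illustration (not part of the conversation): the "A implies B and B implies C" example can be written as a short Lean 4 theorem. The theorem name `a_implies_c` and the hypothesis names are hypothetical; the point is only that the proof is literally a small program that Lean checks.

```lean
-- Given a proof that A implies B and a proof that B implies C,
-- build a proof that A implies C by composing the two.
theorem a_implies_c (A B C : Prop) (hab : A → B) (hbc : B → C) : A → C :=
  fun ha => hbc (hab ha)
```

If the body did not actually establish `A → C`, Lean would reject it, which is what makes the result verifiable rather than a matter of trust.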
Hannah Fry
Is this the equivalent, then that moment of the proof is correct or incorrect? Is that equivalent to alphago? You win the game or you don't.
David Silver
It's exactly equivalent. So we use the idea that Lean says, well done, you've proved this, as a reward, and we give the system plus one if it solves it and minus one if it doesn't get that correct. And so this allows us to then train a system by reinforcement learning to get better and better at proving mathematical statements. In fact, we literally used the same AlphaZero code that we used to get better at Go and chess and all of these other games. It's literally the same code, but it's running, if you like, with the game of mathematics.
Hannah Fry
The game. How dare you trivialize my subject. Okay, how good is it?
David Silver
It's not yet a superhuman mathematician, although that is where we'd like to get to one day. But one thing which AlphaProof did achieve: the most well known and challenging of mathematical competitions is called the International Mathematical Olympiad. And this is a competition that happens once a year for the most incredible and amazing young mathematicians from all around the world. And the problems, to say the least, are extremely hard.
Hannah Fry
They're spicy. They're very spicy. As a professor of math sometimes, I mean, they're spicy.
David Silver
So you heard it from Hannah. These are hard problems, very hard. And AlphaProof, amazingly, actually achieved a silver medal level of performance in this competition. So this is a level of performance that only roughly 10% of the contestants would actually be able to achieve, in the entire world. This is like the cream of young mathematicians, like the six best from every country. And not only that, but there was one particular question that less than 1% of all the contestants were able to solve, and AlphaProof got a perfect proof for this particular problem. So that was nice to see.
Hannah Fry
What do the proofs look like? I mean, do they follow human style arguments if you're not inputting any human data into them?
David Silver
I have to say that to me, the proofs, I don't understand them at all.
Hannah Fry
But Tim Gowers, I mean the Fields medalist and former IMO, I mean, did he get, was he a gold medalist at the.
David Silver
Tim Gowers was a multiple gold medalist at the IMO.
Hannah Fry
Mega brain, right? Like extraordinary mathematician, but I mean he understands these proofs, right?
David Silver
So Tim Gowers actually refereed our solutions to make sure that they were valid solutions and that we hadn't broken any of the rules. And he understands the solutions and thought that they were a huge leap beyond anything that previous AI mathematics could do. So it's a jump forward, but it's still just the beginning in the sense that we really want to go beyond human mathematicians. And that's where we'd like to go next.
Hannah Fry
Because at the moment, basically you've got yourself a very, very, very talented 17 year old mathematician basically. Right?
David Silver
That's right. And it should be said that the system that entered the IMO did take longer than a human contestant would be allowed to take. So that's something we're just going to assume will get better over time as machines get.
Hannah Fry
I mean the IMO is like the perfect test bed because there are correct answers, it can be judged, you can compare it to human performance, all of that kind of thing. But if you are feeding in conjectures, so things that we don't even know are true. You know, I'm thinking of like the ABC conjecture here or the Riemann Hypothesis or any of those like really grand unsolved challenges in mathematics. If AlphaProof outputs something and says no, no, no, we've checked this proof, it works. Can you trust it? And maybe even beyond that, is it worth anything if we don't understand it?
David Silver
I think the good news about Lean is that mathematicians who are better than myself are always able to take a Lean proof and translate it back into something that humans can understand. And in fact, we've even built an AI system that can do this, which can take any formal proof and what we call informalize it, which means it will turn it back into something which is very understandable to humans. And if we did solve the Riemann hypothesis, and by the way, we're a long way from doing that, but if it was done, there would be millions of mathematicians who'd be very excited to understand whatever new mathematics came out of it and decode it back into things that humans can understand.
Hannah Fry
Okay, but here's my question, right. The Clay Maths Institute, in the year 2000, offered a million dollar prize for each of seven different mathematical problems. And, you know, human mathematicians have had a quarter of a century in order to try and solve them, and only one has fallen. Do you think potentially the next one could go to AI?
David Silver
Yes, I do, actually. I think that it might take time. I don't think we're there yet. I think there's a long way before AI systems are capable of doing this, but I think AI is on the right track, and systems like AlphaProof will become stronger and stronger and stronger. What we saw in the IMO is just the beginning. And you know that once you have a system that can scale and can keep learning and learning and learning, really the sky's the limit. So what will these systems look like in 2 years or 5 years or 20 years? Well, I personally would be amazed if AI mathematicians don't transform the whole of mathematics. I think it's coming. Mathematics is one of the few areas where in principle, everything can be done completely digitally by a machine interacting with itself and just going and going and going. So there's really no fundamental barrier to an experience-driven AI system mastering mathematics.
Hannah Fry
Okay, I really buy what you're saying about AlphaProof, by the way, and the same with AlphaZero. I mean, I think they're really excellent examples of how far you can go with reinforcement learning, but they are also examples where there is a very clear metric of success. You win a game of Go or you don't, your proof is correct or it isn't. How do these ideas translate to systems where it's a lot messier? And actually these very clear metrics might not necessarily be present.
David Silver
So first, I want to acknowledge that this question is probably the reason why reinforcement learning methods or these kind of experience-based methods that I'm talking about have not yet broken into the mainstream of absolutely everything that we do in every AI system. So it has to be cracked. If the era of experience is to come about, then we have to have an answer to this. But I think the answer might be right in front of us, because actually, when you look at it, the real world contains innumerable signals. There's just a vast number of signals in the way that the world works. If we look at all of the things that we do on the Internet, for example, there's any number of signals like likes or dislikes or profits or losses or pleasure, pain signals you might get, or yields or properties of materials. There's all these different numbers representing different things about different aspects of experience. And so what we need is really a way to build a system which can adapt and which can say, well, which one of these is really the important thing to optimize in this situation? And so another way to say that is, wouldn't it be great if we could have systems where a human maybe specifies what they want, but that gets translated into a set of different numbers that the system can then optimize for itself completely autonomously?
Hannah Fry
So, okay, an example then. Let's say I said, okay, I want to be healthier this year. And that's kind of a bit nebulous, a bit fuzzy. But what you're saying here is that that can be translated into a series of metrics like resting heart rate or, you know, BMI or whatever it might be. And a combination of those metrics could then be used as a reward for reinforcement learning. Have I understood that correctly?
David Silver
Absolutely correctly.
Hannah Fry
Okay, are we talking about one metric, though? Are we talking about a combination here?
David Silver
So I think the general idea would be that you've got one thing which the human wants to optimize for my health, and then the system can learn for itself which rewards help you to be healthier. And so that can be like a combination of numbers that adapts over time. So it could be that it starts off saying, okay, well, right now it's your resting heart rate that really matters. And then later, you know, get some feedback saying, hang on, you know, I really don't just care about that. I care about my anxiety level or something. And then it includes that into the mixture, and based on feedback, it could actually adapt. So one way to say this is that a very small amount of human data can allow the system to generate goals for itself that enable a vast amount of learning from experience.
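As a rough illustration of "a combination of numbers that adapts over time", the Python sketch below treats the reward as a weighted sum of metrics and folds in a new metric when human feedback asks for it. Everything here is an assumption made for illustration: the function names, the metric names, the convention that each metric is already scaled so that higher is better, and the simple re-weighting rule. It is not a description of any system mentioned in the episode.

```python
from typing import Dict


def combined_reward(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the metrics that currently carry a weight."""
    return sum(weights[name] * metrics[name] for name in weights)


def add_metric_from_feedback(
    weights: Dict[str, float], new_metric: str, share: float
) -> Dict[str, float]:
    """Give `new_metric` a `share` of the total weight, shrinking the rest.

    Existing weights are scaled by (1 - share), so weights that summed to 1
    before the update still sum to 1 afterwards.
    """
    if not 0.0 < share < 1.0:
        raise ValueError("share must be strictly between 0 and 1")
    updated = {name: w * (1.0 - share) for name, w in weights.items()}
    updated[new_metric] = updated.get(new_metric, 0.0) + share
    return updated


# Usage: start with a single metric, then respond to feedback such as
# "I don't just care about that, I care about my anxiety level".
weights = {"resting_heart_rate_score": 1.0}
weights = add_metric_from_feedback(weights, "calm_score", share=0.5)
reward = combined_reward(
    {"resting_heart_rate_score": 0.8, "calm_score": 0.2}, weights
)
```

The design choice worth noting is that the human never writes the reward directly. Feedback only shifts the mixture, and the system optimizes whatever mixture is currently in force.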
Hannah Fry
Because this is where the real questions of alignment come in, right? I mean, if you said, for instance, let's do a reinforcement learning algorithm that just minimizes my resting heart rate quite quickly. Zero is like a good minimization strategy there, which would achieve its objective, just not maybe quite in the way that you wanted it to. I mean, obviously you really want to avoid that kind of scenario. So how do you. How do you have confidence that the metrics that you're choosing aren't creating additional problems?
David Silver
One way you can do this is to leverage the same answer, which has been so effective so far elsewhere in AI, which is at that level, you can make use of some human input. If it's a human goal that we're optimizing, then we probably at that level need to measure and say, well, a human gives feedback to say, actually I'm starting to feel uncomfortable. And in fact, while I don't want to claim that we have the answers, and I think there's an enormous amount of research to get this right and make sure that this kind of thing is safe, it could actually help in certain ways in terms of this kind of safety and adaptation. There's this famous example of paving over the whole world with paperclips when a system's been asked to make as many paperclips as possible. But if you have a system whose overall goal is really to support human well-being, and it gets that feedback from humans and it understands their distress signals and their happiness signals and so forth, the moment it starts to create too many paperclips and starts to cause people distress, it would adapt that combination and it would choose a different combination and start to optimize for something which isn't going to pave over the world with paperclips. So look, we're not there yet, but I think there are some versions of this which could actually end up not only addressing some of the alignment issues that have been faced by previous approaches to goal-focused systems, but maybe even being more adaptive and therefore safer than what we have today.
Hannah Fry
Outside of the world of AI though, is there a problem with using quantitative metrics as a measure for success at all? I'm thinking here about exam scores or GDP or the myriad of problems that you can get into when you focus too carefully and end up with a tyranny of metrics.
David Silver
So look, I would be the first to agree that when you mindlessly pursue a metric in the human world, that it often leads to undesired consequences. At the same time, the whole world of human endeavor is organized around us optimizing for some things. If we didn't have anything that we could optimize for, we wouldn't ever be able to make progress. We have all kinds of signals and metrics and so forth that drive progress. And people say, oh, okay, maybe that isn't the right metric and they adapt.
Hannah Fry
Is it part of the problem then that at the moment you have an interaction with an AI that is really contained within time? There aren't these sort of longer term learnings or adjustment of what the goals might be. Like once you decide that GDP is the thing that you're going for, it's GDP forever and there's no change.
David Silver
I think that's absolutely right. The kind of AI that we have today doesn't have a life. It's not something which has its own stream of experience in the way that an animal or a human might have, that kind of goes on for years and years and years and can keep adapting over time. And that needs to change. And one of the reasons it needs to change is so that we can have systems that just keep learning and learning and learning over time and adapting and understanding how to better achieve the kinds of outcome that we really want.
Hannah Fry
Is there something that is quite risky about untethering algorithms with potentially quite a lot of power from human data, really?
David Silver
There are certainly risks and there are certainly benefits. And I think we absolutely have to take this very seriously and be extraordinarily careful about taking these steps that come next in this journey towards the era of experience. And I should say that one of my reasons to write this position paper is because I feel that people aren't recognizing that this transition is going to come and that it will have consequences and it will require careful thought about many of these decisions. And the fact that so many people are still thinking only about the human data approach means that not enough people are taking seriously these kinds of questions.
Hannah Fry
The last time I got to speak to you on this podcast, we talked about a different position paper that you had just written, Reward is enough, essentially saying that reinforcement learning is all you need to get you towards AGI. Do you still think that that's the case?
David Silver
I think the way I would answer this is by saying that human data might give us a head start. It's a bit like, to borrow a metaphor, it's a bit like the fossil fuels that we discovered in the Earth and all of this human data just happens to be there, and then we kind of mine it and burn it in our LLMs. And that gives them a certain level of performance that they have for free. But then we need, in the analogy, some kind of sustainable fuel that keeps the world going once all the fossil fuels are gone. And I think that's what reinforcement learning is. It's the sustainable fuel, this experience that it can keep generating and using and learning from and generating more and learning from it. That's really the process that's going to drive progress in AI. And I don't want to in any way denigrate what's been done with human data. I think it's great. I think the AIs that we've got now are amazing, mind blowing things and I love them and enjoy working with them and do research on them myself, but it's just the beginning.
Hannah Fry
Dave, thank you so much. That was amazing.
David Silver
Thank you.
Hannah Fry
Of course, there is this monumental amount of progress that's going on at the moment, but when you stop to think about it, there really has been this narrowing in the diversity of ideas around AI. I mean, the success of multimodal models has been so rapid, it's been so profound, so beyond what most people were expecting, that they kind of have sucked a lot of the oxygen out of the broader conversation. And it is noticeable that we're hearing again and again now these murmurs that we have reached the limit of usable human data. And okay, of course there are risks involved with this approach of untethering AI from human data, all sorts of areas that need careful thought and attention. But I can't help but be quite convinced by what David was saying there. If we really want superhuman intelligence, maybe it is now time to step away from the human. You have been listening to Google DeepMind, the podcast with me, Professor Hannah Fry. And before you go, we have got an extra special treat for you today in the form of a conversation between David Silver, the man behind AlphaGo, and Fan Hui, the first professional Go player to face it.
Fan Hui
How are you, Dave?
David Silver
I'm really well. Good to hear from you. It's been a long time, long time no see.
Hannah Fry
A decade ago, a little while before the very famous 4-1 victory over Lee Sedol, Fan Hui became the first professional Go player to test his skills against your algorithm. How long has it been since you spoke to him?
David Silver
It's been quite a few years. Yeah. So nice to see Fan Hui. It's been absolutely amazing to catch up. Fan Hui played such a huge part in the, in the development of AlphaGo, so it's really just a genuine delight.
Hannah Fry
Thank you so much for joining us, Fan Hui.
Fan Hui
Oh, thank you, thank you. For me, it's a very extraordinary experience.
Hannah Fry
Okay, so I want to ask you about that match that you had all those years ago, because I think, I guess now looking at the full history of it, it almost seems like a foregone conclusion. But at the time, I mean, you must have been pretty nervous, David. And how did you feel about it as well, Fan Hui?
Fan Hui
I remember the first time I saw the Demis email, tell me like the exciting Go project. I still remember when I played with AlphaGo. First game I lost, I feel something strange. I also remember when I lost the second game, I feel fear because I feel maybe I will never win with this program or AI. And when I lost my five game, last game, I feel my old Go world is totally broken, but my new Go world is open.
Hannah Fry
David, just. I want to ask you as well, though, in advance of that match, how confident were you about the performance of your algorithm?
David Silver
We really weren't confident. It was just so hard to judge where we were because we knew that we'd gone beyond the players that we had at DeepMind and we knew that we'd gone beyond all the programs that had been written before. But there's such a huge gap beyond that towards the level of professional players like Fan Hui. And we had no idea, you know, are we somewhere in that gap? Are we somewhere beyond that gap? Like, we just genuinely didn't know. And so this was like the first time we had any opportunity to calibrate our level of performance. And I don't think any of us would have been surprised if we'd lost all five games. So it was a very pleasant surprise to win all five. And yeah, we just. I genuinely, it was like one of those moments where the world could have branched either way and we just didn't know until the match happened.
Hannah Fry
But of course, Fan Hui, this algorithm then advanced with your help. In fact, after your match, you came on board and supported the team in developing it further. But that earlier version, what did it feel like to play it? Did it feel fundamentally different to having a human opponent?
Fan Hui
You know, I play with another program before AlphaGo. When I play with another program, I feel like this is a program because they don't play like human. But with AlphaGo, I feel something very strange sometimes. I feel like it's really, really like human.
Hannah Fry
What's the impact been then of AlphaGo and AlphaZero on the Go community? Has there had to be a process of acceptance or was it, you know, positive from the off?
Fan Hui
First of all, when I lost with AlphaGo, for all the Go community, nobody really believe this is true because, yeah, you know, I'm only European champion, so it's not world champion like Lee Sedol. But when AlphaGo won with Lee Sedol, our Go community see something different because AlphaGo play really, really well. I remember on the second game, the move 37, such beautiful move. Really, really beautiful. So creative. It's very creative; for the human, we will never play this move. After that move, everything changed in the Go world because for us, everything is possible today. Even the Go student use AI to learn. So, yes, I think this is really, really good for our Go community, I think. It's not just for Go community, it's also for the world, I think.
Hannah Fry
Fan Hui, thank you so much for joining us. That was such a real treat, especially with the big anniversary coming up.
David Silver
Just great to see you again and thanks for everything you did on AlphaGo. I don't think it would have been the same without you. I think we would have made some terrible mistakes if we hadn't had your advice to help us along the way. So thank you.
Fan Hui
Thank you, Dave.
Title: Is Human Data Enough? with David Silver
Host: Hannah Fry
Guest: Professor David Silver
Release Date: April 10, 2025
In this captivating episode of Google DeepMind: The Podcast, mathematician and broadcaster Professor Hannah Fry engages in an insightful conversation with David Silver, a foundational figure at DeepMind. Renowned for his pivotal role in developing AlphaGo—the first program to achieve superhuman performance in the complex game of Go—Silver delves into the future trajectory of artificial intelligence (AI) beyond the reliance on human-generated data.
Hannah Fry opens the dialogue by referencing Silver's position paper on the "era of experience." Silver elaborates:
"If you look at where AI has been for the last few years, it's been in what I call the era of human data [...] there's another way to do things. This is what's going to lead us into the era of experience, where the machine actually interacts with the world itself and generates its own experience."
(00:04)
Silver contrasts the current reliance on human data with a prospective phase where AI systems gain knowledge autonomously through interaction and experience. This shift aims to transcend human limitations, fostering AI that can discover novel insights beyond existing human understanding.
Hannah Fry prompts Silver to discuss the distinction between AlphaGo and large language models (LLMs):
"They didn't start off just as like random empty boxes though, right?"
(04:48)
Silver explains that while the original AlphaGo utilized a database of human professional moves to gain initial proficiency, AlphaZero marked a significant departure by operating with "literally zero human knowledge."
"AlphaZero, in particular, is very different from the type of human data approaches that have been used recently because it literally uses no human Data. That's the zero in AlphaZero."
(03:30)
Through reinforcement learning, AlphaZero learned by playing millions of games against itself, iteratively refining strategies based solely on trial and error rather than pre-programmed human expertise.
Silver introduces the concept of the "bitter lesson" in AI—a realization that relinquishing human data can lead to superior performance.
"If you throw out the human data, you actually spend more effort on how the system can learn for itself. And that's the part which can then learn and learn and learn forever."
(06:06)
He emphasizes that reliance on human data imposes a ceiling on AI advancement, as machines constrained by human knowledge cannot surpass it. By embracing self-directed learning, AI systems like AlphaZero can exceed human capabilities, breaking through previously insurmountable barriers.
One of AlphaGo's most celebrated moments was its unconventional Move 37 during a match against Lee Sedol.
"Move 37 was a move that happened in the second game of AlphaGo against Lee Sedol. [...] AlphaGo played on the fifth line, and it somehow played this in a way that just made everything make sense on the board."
(09:54)
This move defied traditional Go strategies, showcasing AlphaGo's ability to generate creative solutions beyond human expectations. Silver reflects on whether similar creativity exists in LLMs, concluding that until AI systems surpass human data reliance, such groundbreaking innovations remain rare.
Expanding AI's horizons, Silver introduces AlphaProof, a system designed to autonomously generate and verify mathematical proofs.
"AlphaProof is a system that learns through experience how to correctly prove mathematical problems. So it can, if you give it a theorem and you don't tell it anything about how to actually prove that theorem, it will go away and figure out for itself a perfect proof of that theorem."
(25:03)
Unlike LLMs, which often produce informal and sometimes unreliable proofs, AlphaProof ensures correctness by adhering to formal mathematical languages. Demonstrating its prowess, AlphaProof achieved a silver medal level at the International Mathematics Olympiad, solving problems that only the top 10% of contestants could.
Silver critiques the prevalent use of Reinforcement Learning from Human Feedback (RLHF) in LLMs.
"Reinforcement learning is used in almost all large language model systems. [...] it feels like we've thrown out the baby with the bathwater. These reinforcement learning from human feedback systems [...] do not have the ability to go beyond human knowledge."
(16:07)
While RLHF enhances LLMs by aligning outputs with human preferences, it inherently limits AI's potential to discover beyond human-established data, as systems become tethered to existing human judgments.
The discussion shifts to the concept of grounding—achieving a true understanding of the world through interaction.
"When we train a system from human feedback, that it is not grounded [...] it's the fact that the reward that the agent learns from is coming from a human's judgment."
(17:43)
Silver argues that RLHF provides superficial grounding, as feedback is based on human evaluation rather than real-world consequences. Instead, he advocates for AI systems that derive feedback from their own interactions with the environment, akin to how AlphaZero learned through self-play.
Additionally, Silver touches on synthetic data:
"The beauty of a self-learning system [...] is that as the system starts to get stronger, it starts to encounter problems that are exactly appropriate to the level it's at."
(21:08)
He posits that experience-driven AI can continually evolve without the stagnation inherent in synthetic data generation, which often mirrors existing human data limitations.
Silver acknowledges the complexities in designing AI systems that optimize for nuanced human goals.
"One way you can do this is to leverage the same answer, which has been so effective so far elsewhere in AI, which is at that level, you can make use of some human input."
(38:39)
He discusses the pitfalls of metric-centric approaches, where an overemphasis on specific metrics can lead to unintended consequences—paralleling concerns like the "paperclip maximizer" scenario. To mitigate such risks, Silver suggests dynamic and adaptable metric systems informed by continuous human well-being feedback.
Envisioning the future, Silver posits that reinforcement learning serves as the "sustainable fuel" for ongoing AI advancement.
"It's the sustainable fuel, this experience that it can keep generating and using and learning from and generating more and learning from it."
(42:50)
He underscores the necessity of moving beyond finite human data, advocating for AI systems that perpetually enhance their capabilities through self-generated experiences, thus unlocking limitless potential.
The episode culminates with a heartfelt reunion between David Silver and Fan Hui, the first professional Go player to compete against AlphaGo.
Fan Hui shares his experiences during the groundbreaking match:
"I feel something strange [...] sometimes. I feel like it's really, really like human."
(47:53)
Reflecting on the aftermath, Hui acknowledges AlphaGo's profound impact on the Go community, inspiring new strategies and training methodologies.
"After that move, everything changed in the Go world because for us, everything is possible today."
(48:18)
Silver reciprocates the gratitude, attributing AlphaGo's success and subsequent evolution to Hui's invaluable contributions.
Hannah Fry concludes the episode by emphasizing the necessity of diversifying AI methodologies beyond multimodal models and human data reliance. She echoes Silver's vision of stepping away from human-centric AI paradigms to achieve superhuman intelligence.
"If we really want superhuman intelligence, maybe it is now time to step away from the human."
(44:01)
The episode serves as a compelling exploration of AI's next frontier, advocating for systems that learn autonomously through experience, thereby unlocking unprecedented advancements and creativity.
This episode of Google DeepMind: The Podcast offers a visionary roadmap for AI development, challenging entrenched paradigms and highlighting the transformative potential of experience-driven artificial intelligence.