
A
Welcome to Machine Learning: How Did We Get Here? I'm Tom Mitchell. Today's episode is an interview with Geoff Hinton, one of the pioneers in the field of neural network learning. Geoff started out early, as you'll hear, in the 1970s, and has continued working on neural networks ever since. During the 1990s and early 2000s, when neural networks were really in disfavor in the field of machine learning, Geoff nevertheless persisted, and he co-led the triumphant return of neural networks in the form of deep networks around 2010. In 2018, Geoff, along with Yoshua Bengio and Yann LeCun, received the Turing Award in computer science. That's the highest award given in the field of computer science to researchers. In 2024, Geoff, along with John Hopfield, was awarded the Nobel Prize in Physics for their work on artificial neural networks. I hope you enjoy the episode. I'm pleased to have with me today Geoff Hinton, one of the pioneers of machine learning. Geoff, great to see you again.
B
Thanks for inviting me.
A
What I'd like to do today is get two things, two types of things from you. One is your own personal history and how you got into this field and what happened after you did. And the second is kind of your perspective on the whole field of machine learning, AI, and how things are turning out.
B
So when I was in high school, I had a very smart friend who was a very good mathematician and read widely, unlike me. And he came into school one day and talked about how memories might be distributed over the brain rather than localized in a place like a hologram, because this would have been 1966 and holograms had just come out. And that got me interested in how our memory is represented in the brain. And I've been interested in that ever since.
A
Now, when I met you, we were both at Carnegie Mellon. It was 1986 when we really got to do some work together, or teach a course together. How did you get from 1966 up till 1986? What was the path?
B
Slightly rocky. So I went to university and studied physics, chemistry and physiology. And in physiology, in the last term, they were going to teach us how the central nervous system worked, and I was very excited. And they taught us how action potentials are conducted along an axon, which wasn't what I meant by how it worked. And so I switched to philosophy; that was even less useful. And then I switched to psychology, which was completely hopeless. And then I became a carpenter. And after I'd been a carpenter for about nine months, I met a real carpenter, and he was so much better than me I decided it'd be easier to be an academic. So I went to graduate school in Edinburgh with Christopher Longuet-Higgins, who had published interesting stuff on using neural nets for memory. Unfortunately, around the time I arrived, Winograd's thesis came out and he switched his allegiance to symbolic AI and gave up on neural nets. And so I spent five years as his graduate student with him trying to persuade me to give up neural nets, and he never succeeded. In the end he was very helpful to me. But for a long time there was a lot of argument about how I should really be doing symbolic AI, and all this neural net stuff was complete nonsense. And everybody else in Edinburgh believed that neural nets were nonsense, with actually a couple of exceptions. There was a postdoc called David Willshaw who'd done associative memory; he'd basically done something quite like Hopfield nets, but a long time before Hopfield. And Aaron Sloman was a visitor for a while, and he was more sympathetic. But basically they all knew it was rubbish, and they would explain to me how neural nets can't even do recursion. So, because everybody believed in recursion, I actually figured out how to do true recursion in a neural network and implemented it on a machine that, I think, by then had 192 kilobytes of memory, and was shared by only 40 people. But it had a huge disk that held two megabytes.
So you never ran out of memory, because it used virtual memory. And I actually implemented a little neural net that did true recursion. That is, in the recursive call, it used the same neurons and the same connection strengths as it did for the high-level call. Now, to do that, of course, it had to offload all the parameters of the high-level call into some short-term memory, onto a stack essentially. And I figured out how to implement a stack with associative memory in a neural net. So I had this little neural net running that was doing full recursion. And that was the first talk I gave. And people were very puzzled. They said, why would you want to do recursion in a neural net? It's so easy to do in POP-2, which was our sort of unfortunate bastard child of Pascal and Lisp, although I don't think Pascal existed yet. So, yeah, I keep meaning to go back to that.
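The scheme Geoff describes, reusing one set of weights for every level of a recursive call while the caller's state is offloaded onto a stack, can be sketched roughly as follows. This is a toy illustration in modern Python/NumPy, not his 1973 implementation; in particular, an ordinary Python list stands in for his associative-memory stack, and all sizes are made up.

```python
# Toy sketch of "true recursion" in a neural net: the recursive call reuses
# the SAME weight matrix W, and the caller's hidden state is saved on a
# stack before descending. Purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 0.5, (4, 4))   # one shared weight matrix for every call level

def process(tree, h, stack):
    """Process a nested tuple of token ids with one set of weights,
    stack-saving the caller's state around each recursive call."""
    for item in tree:
        if isinstance(item, tuple):
            stack.append(h)                         # offload caller's state
            child = process(item, np.zeros(4), stack)  # same W, fresh state
            h = np.tanh(stack.pop() + child)        # restore and combine
        else:
            x = np.zeros(4)
            x[item] = 1.0                           # one-hot input token
            h = np.tanh(W @ (h + x))
    return h

out = process((0, (1, 2), 3), np.zeros(4), [])
print(out.shape)  # (4,)
```

The point the sketch makes is the one in the anecdote: nothing about the inner call has its own parameters; only the saved-and-restored state distinguishes the levels.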
A
I was going to ask, is there a future for recursion in neural nets?
B
Oh, yes. I mean, to do true recursion, you have to use the same neurons and weights for the recursive call. That means you have to have something like a stack to store the parameters of the high-level call. That all works if you have fast weights. So that was the first thing I did with fast weights, in 1973, I should say. Fast weights were invented by Schmidhuber in 1990-something.
A
Fair enough. Okay, so then you moved on from Edinburgh. Did you come directly to Carnegie Mellon from there, or how did...
B
Oh, no, no, I dropped out again. After I finished my thesis, I dropped out and became a teacher in a free school in London. It was voluntary; I was unpaid. They were rough, emotionally disturbed, inner-city kids. And after a few months of that, I again decided academia might be easier. So I went to do a postdoc with Aaron Sloman in Sussex. Longuet-Higgins had moved from Edinburgh to Sussex, and as I was finishing my PhD, I got a postdoc with Aaron Sloman. And there were no proper faculty jobs in Britain then. There was one job in the whole of Britain, which Alan Bundy got. And so I applied for jobs in the States, and I got a job as a postdoc at UCSD with Don Norman and Dave Rumelhart. And I got along very well with Dave Rumelhart, and that made a huge difference. So I'd moved from a small country, Britain, where there was only room for one ideology, and the ideology was symbolic AI, and neural nets were just rubbish. And I moved to the States, where on the east coast it was symbolic AI, but on the west coast they were kind of more open. And in particular, Don Norman and Dave Rumelhart thought neural nets were worth considering. So it was a huge liberation to be in a place where neural nets were regarded as not obvious nonsense. While I was there, I got to meet Terry Sejnowski, whom I invited to a conference, and we've been sort of lifelong friends and collaborators. I got to meet Francis Crick later on, who was there. So I was there for a couple of years, and then I got a job in Cambridge in the Applied Psychology Research Unit, where I was meant to do applied psychology. And I was strongly reminded of William James's comment about applied psychology, which is: to do applied psychology, you have to have something to apply. But I actually did some interesting stuff.
It was just around the time Sun workstations were coming out, and they had a contract with the British telephone company to help with network management. And network management then was all done by hand, and you had information about the loads on various switching centers. The information was on a huge wall, 20 feet high, that worked like the boards you sometimes see at train stations: little flaps with white letters on them that rotate around until you get the right flap. And so you could see all these numbers that said how busy each switching station was. And I figured a Sun workstation could do that and would be a lot cheaper. But screen resolution wasn't that good then, so I had to figure out whether you could display the states of all the switching stations in Britain on the screen of a Sun workstation. You couldn't fit the full names, but you could fit two letters for each name. So I worked on a display with two-letter names. And there were a large number of switching stations, hundreds of them, and the question was, could an operator remember which was which? So I actually taught myself to remember all those two-letter names. They were in very small type to fit on, and I actually got a serious migraine from looking at it too long. That was my interaction with human factors.
A
That does remind me, that was around the time that Unix was being invented, and all the commands had no vowels in them. So there was a theme there.
B
Yes. So I wrote a report on it and they said it was a very nice report, thank you very much, and they weren't going to implement it, even though it would have been much more efficient and much easier to update. And I said, why not? And they confidentially explained to me that, well, when people come and visit the network control center, or actually when they visit the headquarters of British Telecom, they have to have something to show them, like the politicians have to see something. And they would always show them this huge wall that displayed the state of all the networks of all the switching stations. And they were very impressed by that. And if they got rid of the huge wall and had just some workstations, they were very worried that network management would get less funds from British Telecom, so they were going to keep their huge wall. I learned a lot then about applied research. It's not about whether it works, it's whether or not the company likes it.
A
Fair enough, fair enough.
B
Then after that I went back to San Diego for six months, and that's when we worked on the PDP books with Dave Rumelhart and Jay McClelland. I was one of the authors until almost when they were published. And at the last minute I dropped out, because at that point I decided Boltzmann machines were the future. Boltzmann machines were just a much better idea than backprop. Backprop was a silly idea, and Boltzmann machines were a much better idea, and there was no point being an author of a book where the main thing was backprop. That was a mistake. Then in 1982 I applied to CMU, and because it was a private university, they didn't have to advertise very widely. Scott Fahlman was sort of my host. I'd interacted with him at many workshops and we got along well, and he pushed hard to get them to hire me. And I had a very funny interview. So I went there, and on the first day I gave a talk in computer science, and then Scott Fahlman took me out for lunch at a place, it might have been called the O. I can't remember what it was called, but it had a motto, which was: if you don't get sick, you got a bad one. And I got terribly sick. The next day I had acute diarrhea. I couldn't eat anything; I was living on coffee and Coca-Cola. I gave a talk in psychology about mental imagery and my theory of mental imagery. And at the end there was someone who asked a question which I didn't understand to begin with. Then I realized the question he was asking was, did I believe the theory of someone called Marcel Just, about how you weren't really rotating an image in your mind, you were just looking backwards and forwards between two things. And in my reply I said, oh, I see, you mean that silly theory by Marcel Just, not realizing it was Marcel Just asking the question. And after that, I got a request to go and see Nico Habermann.
Now, Nico and I were always great friends, even though we were politically extremely different. I was a sort of lefty 1960s radical with long hair, rather disheveled. Nico was a European gentleman who was very nicely dressed, and he worked with the Defense Department, which set him up with an institute I wasn't allowed to go into because I was a foreigner. But we got along very well, and I think it was because of our initial interview.
A
And Nico was the department head in Computer science.
B
Yes. And so in the initial interview, he said, we've decided to offer you the position. And I said, oh, there's something you should know. And he said, oh, what's that? And I said, well, I don't actually know any computer science. And he said, it's okay; here, we have people who do. So I said, okay, in that case, I accept. And Nico said, don't you think perhaps we should talk about the salary? And I said, oh no, I'm not interested in the salary. You can pay me whatever you like; I'm not doing it for the money. And he said, well, how does 26,000 sound to you? I said, that sounds fine. I later discovered I was being paid 10,000 less than the next lowest-paid professor. But every year I got a big pay rise. And Nico and I got along very well after that, because he knew I wasn't doing it for the money. Things have changed so much.
A
That's fantastic. Okay, so now we're up to the mid-80s when really neural nets are reborn. Is that the right word?
B
I would say yes, with backpropagation. I mean, we didn't invent it; it was invented by several different groups. But we showed that it really worked to learn representations. And as you know, one of the big problems in AI is: how do you learn new representations? How do you avoid having to put them all in by hand? My particular example was the family trees example, where you take all the information in some family trees, you convert it into triples of symbols, like Colin has-father James, and then you train a neural net to predict the last term in a triple given the first two terms. So it's just like the big language models: you're predicting the next word given the context; it's just much simpler. I had 112 total examples, of which 104 were training examples and eight were test examples, which is a bit less than the trillion examples they have nowadays. But it was the same idea. You convert a symbol into a feature vector, you then have the feature vectors of the context interact via a hidden layer, they then predict the features of the next symbol, and from those features you guess what the next symbol should be, and you try to maximize the probability of predicting the next symbol. And you then backpropagate through the feature interactions and through the process that converts the symbol into features. That way you learn feature vectors to represent the symbols, and how those vectors should interact to predict the features of the next symbol. And that's what these big language models do, except it's a bit more complicated. The feature interactions are much more complicated, and they have many more layers of interactions, so they can disambiguate ambiguous symbols and refine the shade of meaning of things where the meaning depends a lot on context. But it's basically an extremely simple version of the current large language models. I call it a tiny language model.
And that convinced the editors of Nature that we really could learn interesting representations, because the vectors it learned for the symbols, which were people and relationships, had six components, and if you used weight decay, you could interpret what all those components were. And they were sensible semantic features: the nationality of the person, the generation of the person, and which branch of the family tree they were in. And so it would learn things like: the relationship uncle requires the output person to be one generation older than the input person. So it would have generations for people, and if the input person was at generation two, it would predict that the output person would be at generation one. So it was actually running a whole bunch of little rules, just probabilistically. And the people interested in rule-based induction got interested in it, because they said, oh, we can do that too. And it's true, they could do that too, with rules that weren't probabilistic. The point about neural nets is they can mimic something that learns discrete rules, but they're also perfectly happy if the rules are just usually true, and then they use the preponderance of the evidence, which is much harder to do in logic. And so it was that example, which curiously was a little language model, that convinced the editors of Nature to publish the paper. I know because I talked to one of the referees later, and he said, yeah, it was that example that did it. And then we were all very excited. We thought, we can solve everything: you just have to give it a lot of training data and run backprop, and it'll learn all the representations you need, and it'll learn to do parallel computation. Because at that time people were very interested in parallel computation, but it was quite hard to program.
And the idea was, well, this thing will have all these neurons inside, they'll all be operating in parallel, and it'll figure out how to use them, so there aren't any problems. At that point, people were very worried about race conditions and things like that, and you didn't have to worry about any of that: it was all synchronous, and it just learned what to do. So we thought we'd solved everything. And little did we know, we had; we just needed more data and more compute.
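The family-trees "tiny language model" Geoff describes can be sketched in a few lines: each symbol gets a learned feature vector, the two context vectors interact via a hidden layer, and backprop trains the whole thing to predict the third term of a triple. The toy family, the layer sizes, and every name below are illustrative, assumed for the sketch rather than taken from the original 1986 setup (apart from the six-component person vectors he mentions).

```python
# Minimal sketch of a "tiny language model" over (person, relation, person)
# triples: embeddings -> hidden layer -> softmax over people, trained by
# backprop. Toy data and sizes; not the original network.
import numpy as np

rng = np.random.default_rng(0)

people = ["colin", "james", "victoria", "charlotte"]
rels = ["father", "mother"]
# invented toy triples for illustration
triples = [("colin", "father", "james"),
           ("charlotte", "father", "james"),
           ("colin", "mother", "victoria"),
           ("charlotte", "mother", "victoria")]

P, R, D, H = len(people), len(rels), 6, 12   # six-component symbol vectors
Ep = rng.normal(0, 0.1, (P, D))              # person feature vectors
Er = rng.normal(0, 0.1, (R, D))              # relation feature vectors
W1 = rng.normal(0, 0.1, (2 * D, H))          # context interaction layer
W2 = rng.normal(0, 0.1, (H, P))              # predicts the output person

def forward(pi, ri):
    x = np.concatenate([Ep[pi], Er[ri]])
    h = np.tanh(x @ W1)
    logits = h @ W2
    e = np.exp(logits - logits.max())        # stable softmax
    return x, h, e / e.sum()

lr = 0.1
for epoch in range(2000):
    for a, r, b in triples:
        pi, ri, ti = people.index(a), rels.index(r), people.index(b)
        x, h, prob = forward(pi, ri)
        # cross-entropy gradient, backpropagated through both layers
        dlogits = prob.copy(); dlogits[ti] -= 1.0
        dW2 = np.outer(h, dlogits)
        dh = (W2 @ dlogits) * (1 - h * h)     # through the tanh
        dW1 = np.outer(x, dh)
        dx = W1 @ dh                          # into the symbol embeddings
        W2 -= lr * dW2; W1 -= lr * dW1
        Ep[pi] -= lr * dx[:D]; Er[ri] -= lr * dx[D:]

_, _, prob = forward(people.index("colin"), rels.index("father"))
print(people[int(prob.argmax())])  # should print: james
```

The interpretable features he mentions (nationality, generation, branch) would emerge, in the real experiment, as directions in the learned `Ep` vectors once weight decay is added; this sketch only shows the prediction machinery.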
A
So then there's the long period of waiting for more data and more compute.
B
Yeah, and not realizing that that was the main problem. Obviously there were other little problems: there were more sensible kinds of neurons to use, and more sensible ways to regularize it, and all that. And things like Transformers had to be invented to make it really efficient. But basically backprop was the way to do it, and you couldn't convince anybody. When computers were slow, it would work for little problems, and it would work for slightly bigger problems. A few years later, Yann got it working on MNIST, recognizing digits. But all the vision people said, that's not real vision; you're never going to do it with real, high-resolution images from the web. And so it wasn't until about 2012 that they had to eat their words.
A
That's right. That was the year when, well, you tell this story. You were the first person.
B
Well, I was the advisor of the first two people. Now, it's not quite fair, because Yann had already basically shown that they worked for real images. And when Fei-Fei Li came up with the ImageNet dataset, Yann realized they could win that competition. He tried to get graduate students and postdocs in his lab to do it, and they all declined. And Ilya Sutskever realized that backprop would just kill ImageNet. He wanted Alex to work on it, and Alex didn't really want to work on it; Alex had already been working on recognizing small images, on CIFAR-10. Ilya preprocessed everything for Alex to make it easy. And I bought Alex two Nvidia GPUs to have in his bedroom at home. Alex then got on with it, and he was an absolute wizard programmer. He wrote amazing code for doing convolutions really efficiently on multiple GPUs; much better code than anybody else had ever written, I believe. So it was a combination of Ilya realizing we really had to do this, Ilya being involved in the design of the net and so on, and Alex's programming skills. And then I added a few ideas, like use rectified linear units instead of sigmoid units, and use little patches of the images, I mean big patches of the images, so you can translate things around a bit to get some translation invariance, as well as using convolution, and use dropout. So that was one of the first applications of dropout, and it helped by about 1%. But it helped. And then we beat the best vision systems. The best vision systems were sort of plateauing at 25% errors; that's errors at getting the right answer in your top five bets. And we got like 15%, 15 or 16 depending on how you count it. So we got almost half the error rate. And what happened then was what ought to happen in science but seldom does: our most vigorous opponents, like Jitendra Malik and Andrew Zisserman, looked at these results and said, okay, you were right. That never happens in science.
And slightly irritatingly, Andrew Zisserman then switched to doing this. He had some very good postdocs and students working with him, like Simonyan, and after about a year they were making better networks than us. But that was really, as far as the general public was concerned, the start of this big swing towards deep learning in 2012, when we really nailed computer vision. But it actually happened before that. It happened in 2009, when we showed how you could do speech recognition, or rather the acoustic-modeling part of speech recognition, a bit better than the best technology. That influenced all the big speech groups. The big speech groups at IBM and Microsoft and, oh yes, Google all switched to doing neural nets for acoustic modeling. And so by 2010, it was clear that neural nets were the right way to do acoustic modeling, and we had lots of people on site. But in 2012, it actually came out on Android, and suddenly Android caught up with Siri in speech recognition. So really, we demonstrated it for speech before that, but that didn't make a big impact. The reason it worked for speech was they had big data sets; they had millions of examples. Unlike vision, speech was one of the areas with big data sets, because of the DARPA speech project, because they really wanted to be able to benchmark systems. Also, speech is easier than vision. Speech is just vision with either one or two pixels; it's just that they change rather fast. So we demonstrated it for speech first, and when we did it for vision, the big companies already knew it worked for speech and they saw it work for vision. So they realized it was sort of universal; it wasn't just a specific trick for a specific domain, it would work for perception in general. They didn't realize at that point it would work for language, and nor really did we, even though our very first impressive example was for language. Yeah, so in 2012, there was this big swing to neural networks. And that's when Jensen at Nvidia realized, he finally realized, those Nvidia boards weren't just for gaming.
They were supercomputers for doing machine learning. Now, I actually gave a talk at NIPS in 2009, this was about speech, where I told a thousand people: if you want to do machine learning now, you have to buy Nvidia GPUs. Nvidia GPUs will make your program go about 30 times as fast, because it's relatively easy to utilize their parallelism; they're just right for neural nets. It was Rick Szeliski, who was a student of mine at CMU, who told me that in about 2006, and it was true. And I sent mail to Nvidia saying, how about giving me a free one, because I told a thousand machine learning researchers to buy your boards. And they declined. Years later, Jensen came to Toronto and gave a talk and mentioned how Toronto was the place where they convinced him that Nvidia GPUs were good for AI, and that it all happened in 2012. And I couldn't resist it. At the end I said, well, I told you in 2009, and you ignored me. What he should have said was, well, you're very silly, you should have bought stock in 2009. If I'd done that, I'd be a billionaire. But instead he opened his briefcase and gave me their very special, very latest GPU, of which they'd only made a few, that had twice as much memory as any other GPU. So that was a nice move by Jensen.
A
That's a great story too. So then in the 2010s, things really just kind of rapid fire started taking off. Take us through that.
B
So speech worked. We got a good collaboration between the research groups at IBM, Google, Toronto and Microsoft; we actually published a joint paper, which is quite rare in this business, about this new view of how to do acoustic modeling. And then we did vision. And then I started getting lots of requests from big companies who wanted to buy me, or buy me and Alex and Ilya, or fund our company, or get us to come work for them. And I realized this stuff was probably valuable. We had no idea how much it was worth. So Craig Boutilier, who was the chair of the Department of Computer Science, was an expert on auctions. And he said, you know, since you have no idea what it's worth, but there are many people interested, you should set up an auction. So at Lake Tahoe, which seemed like the appropriate place, in a casino hotel in 2012, Alex and Ilya and I set up a little company for the sole function of being acqui-hired. And there was an auction between Microsoft and Google and DeepMind and Baidu. DeepMind dropped out fairly early. And on the ground floor there were all these people at slot machines, with cigarettes hanging out of the corners of their mouths, just pulling these levers, and every so often they'd win like $1,000 and lights would flash. And we were upstairs having an auction where you had to raise by a million. That was fun. And the auction went on for quite a long time. We were completely amazed when it got to 44 million. It was so much money that we couldn't imagine any more money would be useful; that seemed like as much money as anybody could possibly want. And so we then became much more concerned about who we worked for. And I wouldn't have been able to get to China, because I couldn't fly at that time. And I'd spent the summer of 2012 working with Jeff Dean at Google, and I got along really well with Jeff Dean. It was a really nice group.
And I figured it was much more important to work in a really nice place than to get more money. So we actually terminated the auction. We told Baidu we'd got an offer we couldn't refuse, and the offer we couldn't refuse was the chance to work at Google with Jeff Dean. And that all worked out very well. So then I was off to Google. And while we were there, Ilya, along with Quoc Le at Google, and Yoshua Bengio and Dzmitry Bahdanau in Montreal, developed language models with attention, which were the precursor of Transformers, and showed that language models actually work well for machine translation. And I think that was the final nail in the coffin of symbolic AI, because if anything was going to be good for symbolic AI, it was converting symbol strings in one language into symbol strings in another language. The idea that you might do that by taking symbol strings and manipulating them actually sounded quite plausible. But that's not the way to do it. The way to do it is to understand what's being said in one language by associating appropriate big vectors with the words, and then convert that into the other language. So it was clear by about 2015 that neural nets were going to do everything, including language. That's the point at which Gary Marcus published a book chapter saying neural nets were okay, maybe they could do object recognition, but they'd never do language, because language involves novel sentences. They were already doing it. Wow.
A
So that was 2015. You were still at Google.
B
I was at Google. And Ilya then moved to OpenAI around 2015, maybe 2014, I can't remember precisely. And then OpenAI did rather well. OpenAI basically took stuff that had been done at Google on Transformers, put a nicer interface on it, and realized, which Google hadn't realized, that if you do reinforcement learning from human feedback, you don't need that many examples to make it behave more nicely. You didn't need like 100 million examples, which you might have thought you'd need; some fraction of a million examples would already make it behave a lot better. So you could actually train it up to have nicer behavior. And that was ChatGPT. Google was then in the classic situation of not wanting to interfere with search, which was its moneymaker. So it was in this difficult situation: do they release chatbots or not? But when Microsoft teamed up with OpenAI, they basically had to release them, but they lost a few years. I think it was partly because search was working so well. And it was obvious search would be better if, instead of using keywords, it used what you meant, which would mean it had to understand what you meant. But they didn't want to undermine their moneymaker. Now, that's based not on any inside information; it just seems obvious.
A
really amazing. So, so here we are now, and you were famously on record warning people about some of the risks of AI. What should people who are working in this area do in response to that risk?
B
Okay, so I didn't actually talk much about the risks until I'd left Google. At the beginning of 2023, I realized there was a huge existential threat I hadn't fully appreciated, because this is a better form of intelligence than us. And it's better because it can share. Different copies of the same neural net can look at different data, share their gradients, and then update all their weights in sync and stay the same. And they can keep doing that. When they share the gradients, they're sharing information they got from different data sets, of the order of a trillion bits per episode of sharing, if they've got a trillion weights. Whereas what we're doing now is sharing maybe 100 bits per sentence. So at a few bits per second, maybe, if we're lucky, we're sharing at 10 bits per second. And so you're comparing trillions of bits with hundreds of bits: they're billions of times better than us at sharing. And that's why, if you have them running on different hardware, they can learn so much more than us. They can learn from the whole Internet; it doesn't all have to go through one piece of hardware. And that effect is going to get more important as we go to AI agents that operate in the real world in real time. With most AI data, you can just speed things up and send them through one network very fast, because obviously computers operate thousands of times faster than a brain. But if you're operating in the real world, you can't get experience faster, because the real world has an actual timescale. If you're interacting with other agents who take a little while to reply, then this advantage, that different copies of the same neural net can share, will be an even bigger advantage. So at that point I decided there were all these short-term threats, and it wasn't really my intention to warn about those, but I got sucked into warning about those, because journalists always confuse the existential threat with all the other threats.
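The sharing mechanism Geoff describes, many copies of one network looking at different data, averaging their gradients, and applying identical updates so they stay in sync, is essentially data-parallel training. A toy sketch, with an invented linear model and made-up sizes standing in for a trillion-weight network:

```python
# Toy data-parallel sketch: four copies of the same weights each compute a
# gradient on their own shard of data, average ("share") the gradients, and
# apply the same update, so all copies remain identical. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w0 = rng.normal(0, 1.0, 3)
copies = [w0.copy() for _ in range(4)]        # identical replicas of one net

def grad(w, X, y):
    # gradient of mean squared error for a linear model
    return 2 * X.T @ (X @ w - y) / len(y)

X = rng.normal(0, 1, (40, 3))
y = X @ np.array([1.0, -2.0, 0.5])            # noiseless toy targets
shards = np.array_split(np.arange(40), 4)     # each copy sees different data

for step in range(200):
    grads = [grad(c, X[s], y[s]) for c, s in zip(copies, shards)]
    g = np.mean(grads, axis=0)                # the "sharing" step
    for c in copies:
        c -= 0.05 * g                         # identical update keeps copies in sync

print(np.allclose(copies[0], copies[1]))      # True: the copies never diverge
```

Each copy ends up knowing about all forty examples while only ever touching its own ten; that is the bandwidth advantage in miniature: one sharing step moves an entire weight-sized gradient between replicas.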
They just muddle all the threats together. They move seamlessly from joblessness to fake videos to cyber attacks to lethal autonomous weapons, as if they're all the same thing. So I had to sort of clarify lots of those threats. But my main worry was the much longer-term threat, though maybe not that much longer, that they will be much smarter than us. It's not necessarily the case, but I think most neural net experts believe that within 20 years we'll have superintelligent AI. We vary. Demis thinks it'll be about 10 years; I think it may be as long as 20 years, and it'll very likely be more than five years; Dario thinks it'll be three years, but then he runs a company; Ilya thinks it'll be sooner than 10 years. We all think it's probably going to happen. So the question is, what happens when AI is a lot smarter than us, and when it's AI agents that are smarter than us? Then they're also more powerful than us: they can collaborate with other AI agents and get stuff done, even if they can't fire guns or pull switches. They can persuade people. And we know AI is already very good at persuasion and will soon be much better than people at persuasion, like in 10 years' time. And so they'll be able to persuade people to do things, just like Trump persuaded people to invade the Capitol; they don't actually have to be able to do anything themselves except talk. So most of the tech bros are thinking they have a model, which is: I'm the CEO, you're the secretary; you're much smarter than me, but I can always fire you, and you'll make my life really easy. Whatever I want to happen, it'll be like Star Trek: I'll say, make it so, and it will happen. And I don't really have to understand it; I'll still get the credit for it, because I said make it so. I think that's their model, and I just don't think that's going to work. I think the big problem is: how do we prevent these things from ever wanting to take control, or to take over?
They may have control, but they may still not want to replace us. So I've fallen back on the only example I know of a less intelligent thing controlling a more intelligent thing, and that's a baby controlling a mother. Evolution has put a huge amount of work into that. Evolution has made sure the mother cannot bear the sound of the baby crying, and the mother gets huge rewards for being nice to the baby: lots of pleasurable sensations and just generally good feelings. And we need to do the same for these AIs, rewards for being nice to us. We're still making them. And if we could make an AI that was superintelligent but cared more about us than it cared about itself or other superintelligent AIs, then we might be okay. But we have to accept that we're going to be the babies and they're going to be the mothers. And people aren't prepared to accept that. Trump's not prepared to accept that; Trump would never accept that. I think we have a lot more hope of the Chinese understanding it. I recently went to Shanghai and talked to a member of the Politburo. Eric Schmidt and I aren't natural allies; our politics are rather different. Eric Schmidt, for example, thinks Kissinger was a good guy. But we agree on this existential threat, and the Chinese leadership will understand it much better than any of the other leaderships, because many of them are engineers and they actually understand how this stuff works. They understand the argument that it's a better form of intelligence. But I think all the countries will collaborate on the question: can we make it so that it cares more about us than it does about itself? Because if any country figured out how to do that, it'd be very happy to tell the other countries. That's like preventing a global nuclear war. The USSR and America collaborated on that in the 1950s, at the height of the Cold War. They still collaborated to prevent that. 
So what I think we should have is research institutes in different countries that get access to their own country's super smart AI, which they're not going to give to any other country, and can do experiments on how to make it not want to take over, how to make it care more about people than about itself, and share with other countries how to do that. Because I believe the techniques for doing that will be roughly orthogonal to the techniques for making it smarter. They're not going to share the techniques for making it smarter, because they're all doing cyber attacks on each other, and they all know you want a better AI to do better cyber attacks and better fake videos and better autonomous weapons. They're never going to share that stuff; they're anti-aligned. But on not having AI replace us, they're aligned, so they will collaborate. Now, one other thing I ought to mention: Russ Salakhutdinov. He was one of my best students. He came to Toronto, did his PhD at Toronto, did a postdoc with Josh Tenenbaum, and then he wanted to come back to Toronto, and he had a faculty offer from Harvard. And I really tried to get the Department of Computer Science, which had an open position in machine learning, to give a job to Russ, and they basically refused. This was about 2011 or 12. No, no, this was probably 2012 or 13. My department was one of the last departments to accept that neural networks really worked. They had a big AI group, and the big AI group said: you've got several people in neural networks already, that's your quota; we're short on people in knowledge representation, and we need as many people in computational linguistics as we do in neural networks. And they refused to give Russ a job. So he eventually got a job in statistics so that he could be in Toronto, and we were trying to negotiate for him to move to computer science. And then CMU swooped in, and I think they offered him tenure at CMU, and that was that.
A
Well, you have a whole cadre of former students who are really leading the charge, leading the way in a lot of areas of neural nets. It's pretty amazing if you.
B
Well, it was luck. It was luck. Basically, there were so few people who believed in neural nets. There was Yann, there was Yoshua, there was me, there was Schmidhuber. There were a few other people, but MIT didn't have anybody, Stanford didn't have anybody, Berkeley didn't have anybody. Mike Jordan made sure of that. And so the few of us who believed in it got the really good students who believed in it. And that was great.
A
It worked.
B
Yeah. People like Russ and Ilya and George Dahl and other people. Yeah.
A
So if you could. One final question. If you could give advice to new PhD students now entering this area, what would you say?
B
Sometimes I'd say become a plumber, you're too late. But actually I'd say: if you're at CMU and you're doing this, you may be in the small fraction of people who survive in AI and don't get replaced, because for quite a while there are going to be creative people making AI work better, and you've got a good chance of being one of those people if you're at CMU.
A
All right, well, we'll take that. Jeff, thank you so much for spending the time sharing that. It's always great to catch up and thank you.
B
Okay, well, thank you for inviting me.

Tom Mitchell is the Founders University Professor at Carnegie Mellon University. Machine Learning: How Did We Get Here? is produced by the Stanford Digital Economy Lab. If you enjoyed this episode, subscribe wherever you listen to podcasts.
Episode: Five Decades of Neural Networks with Geoffrey Hinton
Host: Tom Mitchell
Guest: Geoffrey Hinton
Date: February 23, 2026
Produced by: Stanford Digital Economy Lab
This episode is a deep-dive conversation between two machine learning titans: Tom Mitchell, long-time educator and author, and Geoffrey Hinton, one of the foundational architects of neural networks. Hinton traces his wild, non-linear career path and shares first-hand accounts of pivotal moments in the history of neural nets. The discussion spans the earliest inspirations for distributed memory, the “AI winter” years of disfavor, the field’s renaissance with deep learning, the inside stories behind landmark results and breakthroughs, and sober reflections on the existential risks posed by future superintelligence.
The tone is candid, anecdotal, and filled with wit and wisdom as Hinton revisits the frustrations, false starts, serendipities, and social dynamics that shaped five decades of progress in machine learning.
“I figured out how to do true recursion in a neural network and implemented it…so I had this little neural net running that was doing full recursion in neural nets. And that was the first talk I gave. And people were very puzzled.”
(05:16-05:46)
“I learned a lot then about applied research. It’s not about whether it works, it’s whether or not the company likes it.”
(11:43-12:02)
“That convinced the editors of Nature that we really could learn interesting representations…that was the example that did it.”
(16:05-19:24)
“We thought we’d solved everything and little did we know we had. It’s just we needed more data and more compute.”
(20:18-20:24)
“Alex then got on with it. And he was an absolutely wizard programmer…”
(21:25-22:33)
“Our most vigorous opponents, like Jitendra Malik and Zisserman…looked at these results and said, okay, you were right. That never happens in science.”
(23:27-23:58)
Successes in Speech & Vision
Deep nets succeeded first in speech—thanks to large datasets—then in vision, and eventually in language.
“They didn’t realize at that point it would work for language, and nor really did we, even though our very first impressive example was for language.”
(27:39–28:46)
Odd Misses in Academia
Hinton laments that leading U.S. departments (MIT, Stanford, Berkeley) were late adopters:
“There was Yann, there was Yoshua, there was me, there was Schmidhuber…MIT didn’t have anybody. Stanford didn’t have anybody. Berkeley didn’t have anybody. Mike Jordan made sure of that.”
(43:46–44:16)
The 2012 “Acquihire” Auction
Hinton recounts how Google, Microsoft, DeepMind, and Baidu bid for his group.
“We were completely amazed when it got to $44 million. It was so much money that we couldn’t imagine that any more money would be useful.”
(28:57–29:16)
Ultimately, he joined Google for collaborative quality, not maximum money.
Transformers, Attention, and the Demise of Symbolic AI
“If anything was going to be good for symbolic AI, it was converting symbol strings in one language into another…But that’s not the way to do it.”
(30:24–31:20)
How AI May Surpass Humans
Hinton identifies a fundamental reason for AI’s possible superiority: information sharing at “billions of times” the efficiency of human language.
“They’re billions of times better than us at sharing. That’s why…they can learn so much more than us. They can learn from the whole Internet.”
(34:53–35:17)
Existential Risks and Alignment
Hinton stresses the importance of learning from evolution—using the analogy of “a baby controlling a mother” for how lesser intelligence might steer superior AI.
“If we could make an AI that was superintelligent but cared more about us than it cared about itself…then we might be okay. But we have to accept that we’re going to be the babies and they’re going to be the mothers.”
(37:44–38:12)
Cultural and Political Readiness
Hinton claims (somewhat tongue-in-cheek) that China’s engineer-heavy leadership may better grasp the alignment problem than U.S. tech CEOs.
Luck & the 'Faithful Few'
“There were so few people who believed in neural nets…so the few of us who believed in it got the really good students…”
(43:46–44:16)
Advice to Students
“Sometimes I’d say become a plumber, you’re too late. But actually I’d say if you’re at CMU and you’re doing this, you may be in the small fraction of people who survive in AI and don’t get replaced…for quite a while, there’s going to be creative people making AI work better, and you’ve got a good chance of being one of those people.”
(44:39–45:07)
| Timestamp | Speaker | Quote |
|-----------|---------|-------|
| 05:16–05:46 | Geoffrey Hinton | “I figured out how to do true recursion in a neural network and implemented it…so I had this little neural net running that was doing full recursion in neural nets. And that was the first talk I gave. And people were very puzzled.” |
| 11:43–12:02 | Geoffrey Hinton | “I learned a lot then about applied research. It’s not about whether it works, it’s whether or not the company likes it.” |
| 20:18–20:24 | Geoffrey Hinton | “We thought we’d solved everything and little did we know we had. It’s just we needed more data and more compute.” |
| 23:27–23:58 | Geoffrey Hinton | “Our most vigorous opponents…looked at these results and said, okay, you were right. That never happens in science.” |
| 34:53–35:17 | Geoffrey Hinton | “They’re billions of times better than us at sharing. That’s why…they can learn so much more than us. They can learn from the whole Internet.” |
| 37:44–38:12 | Geoffrey Hinton | “If we could make an AI that was superintelligent but cared more about us than it cared about itself…then we might be okay. But we have to accept that we’re going to be the babies and they’re going to be the mothers.” |
| 43:46–44:16 | Geoffrey Hinton | “There were so few people who believed in neural nets…so the few of us who believed in it got the really good students…” |
| 44:39–45:07 | Geoffrey Hinton | “Sometimes I’d say become a plumber, you’re too late. But actually, if you’re at CMU and you’re doing this, you may be in the small fraction…for quite a while, there are going to be creative people making AI work better.” |
This episode is a unique oral history and reflection on the history and future of neural networks, as told by one of its most important contributors. Hinton’s long view, humility, and openness about missed turns, wrong bets, and serendipity offers key lessons for established researchers and new students alike. His ultimate advice: stay creative, stay humble, and—if at CMU and lucky—stick with it a little longer.
For listeners seeking detailed, spirited, and transparent insight into the winding path of deep learning, this episode is unmissable.