
Kwame Christian
Hi, I'm Kwame Christian, CEO of the American Negotiation Institute, and I have a quick question for you. When was the last time you had a difficult conversation? These conversations happen all the time. And that's exactly why you should listen to Negotiate Anything, the number one negotiation podcast in the world. We produce episodes every single day to help you lead, persuade and resolve conflicts, both at work and at home. So level up your negotiation skills by making Negotiate Anything part of your daily routine.
Jessi Hempel
From LinkedIn News, I'm Jessi Hempel, host of the Hello Monday podcast. Start your week with the Hello Monday podcast. We'll navigate career pivots. We'll learn where happiness fits in. Listen to Hello Monday with me, Jessi Hempel, on the LinkedIn Podcast Network or wherever you get your podcasts.
Alex Kantrowitz
Why has generative AI ingested all the world's knowledge but not been able to come up with scientific discoveries of its own? And is it finally starting to understand the physical world? We'll discuss it with Meta chief AI scientist and Turing Award winner Yann LeCun. Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. I'm Alex Kantrowitz, and I am thrilled to welcome Yann LeCun, the chief AI scientist, Turing Award winner, and a man known as the godfather of AI, to Big Technology Podcast. Yann, great to see you again. Welcome to the show.
Yann LeCun
Pleasure to be here.
Alex Kantrowitz
Let's start with a question about scientific discovery and why AI has not been able to come up with it until this point. This is coming from Dwarkesh Patel. He asked it a couple of months ago: what do you make of the fact that AIs, generative AIs, basically have the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery? Whereas if even a moderately intelligent person had this much stuff memorized, they would notice: oh, this thing causes this symptom, this other thing causes this symptom, there might be a medical cure here. So shouldn't we be expecting that type of stuff from AI?
Yann LeCun
Well, from AI, yes. From large language models, no. There are several types of AI architectures. All of a sudden, when we talk about AI, we imagine chatbots. Chatbots, LLMs, are trained on an enormous amount of knowledge, which is purely text, and they're trained to basically regurgitate, to retrieve, to essentially produce answers that conform to the statistics of whatever text they've been trained on. And it's amazing what you can do with them. It's very useful. There's no question about it. We also know that they can hallucinate facts that aren't true. But really, in their purest form, they are incapable of inventing new things.
Alex Kantrowitz
Let me throw out this perspective that Thomas Wolf from Hugging Face shared on LinkedIn over the past week. I know you were involved in the discussion about it. It's very interesting. He says, to create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought or dared to ask. One that writes "What if everyone is wrong about this?" when all textbooks, experts and common knowledge suggest otherwise. Is it possible to teach an LLM to do that?
Yann LeCun
No, no, not in the current form. And whatever form of AI would be able to do that will not be LLMs. They might use an LLM as one component. LLMs are useful for producing text. So in future AI systems, we might use them to turn abstract thoughts into language. In the human brain, that's done by a tiny little brain area right here called Broca's area. It's about this big. That's our LLM. But we don't think in language. We think in mental representations of a situation. We have mental models of everything we think about. We can think even if we can't speak. And that takes place here. That's where real intelligence is. And that's the part that we haven't reproduced, certainly not with LLMs. So the question is, are we eventually going to have AI architectures, AI systems, that are capable of not just answering questions that are already there, but giving new solutions to problems that we specify? The answer is yes, eventually. Not with current LLMs. Then the next question is, are they going to be able to ask their own questions, to figure out what the good questions to answer are? The answer is eventually yes. But it's going to take a while before we get machines that are capable of this. In humans, we have all the characteristics. We have people who have extremely good memory. They can retrieve a lot of things. They have a lot of accumulated knowledge. We have people who are problem solvers. You give them a problem, they'll solve it. And I think Thomas was actually talking about this kind of stuff. He said, if you're good at school, you're a good problem solver. We give you a problem, you can solve it, and you score well in math or physics or whatever it is. But then in research, the most difficult thing is to actually ask the good questions. What are the important questions? It's not just solving the problem, it's also asking the right questions, framing a problem in the right way so you have new insight.
And then after that comes: OK, I need to turn this into equations or into something practical, a model. And that may be a different skill from the one that asked the right questions. It might be a different skill also to solve equations. The people who write the equations are not necessarily the people who solve them, or the people who remember that there is some textbook from 100 years ago where similar equations were solved. Those are three different skills. So LLMs are really good at retrieval. They're not good at solving new problems, finding new solutions to new problems. They can retrieve existing solutions, and they're certainly not good at all at asking the right questions.
Alex Kantrowitz
And for those tuning in and learning about this for the first time, LLMs are the technology behind things like the GPT model that's baked into ChatGPT. But let me ask you this, Yann. The AI field does seem to have moved from standard LLMs to LLMs that can reason and go step by step. And I'm curious, can you program this sort of counterintuitive or heretical thinking by imbuing a reasoning model with an instruction to question its directives?
Yann LeCun
Well, so we have to figure out what reasoning really means. Okay. And obviously everyone is trying to get LLMs to reason to some extent, to perhaps be able to check whether the answers they produce are correct. The way people are approaching the problem at the moment is that they are basically trying to do this by modifying the current paradigm without completely changing it. So can you bolt a couple of warts on top of LLMs so that you kind of have some primitive reasoning function? And that's essentially what a lot of reasoning systems are doing. One simple way of getting LLMs to kind of appear to reason is chain of thought, right? So you basically tell them to generate more tokens than they really need to, in the hope that in the process of generating those tokens they're going to devote more computation to answering your question. And to some extent that works, surprisingly, but it's very limited. You don't actually get real reasoning out of this. Reasoning, at least in classical AI and in many domains, involves a search through a space of potential solutions. So you have a problem to solve. You can characterize whether the problem is solved or not. So you have some way of telling whether the problem is solved, and then you search through a space of solutions for one that actually satisfies the constraints or is identified as being a solution. And that's kind of the most general form of reasoning you can imagine. There is no mechanism at all in LLMs for this search. What you have to do is kind of bolt it on top. So one way to do this is you get an LLM to produce lots and lots and lots of sequences of answers, sequences of tokens which represent answers, and then you have a separate system that picks which one is good. This is a bit like writing a program by more or less randomly generating instructions, while maybe respecting the grammar of the language, and then checking all of those programs for one that actually works.
It's not a good way, not a very efficient way, of producing correct pieces of code. It's not a good way of reasoning either. So a big issue there is that when humans or animals reason, we don't do it in token space. In other words, when we reason, we don't have to generate a text that expresses our solution, and then generate another one, and then generate another one, and then among all the ones we produce, pick the one that is good. We reason internally. We have a mental model of the situation and we manipulate it in our head and we find a good solution. When we plan a sequence of actions to, I don't know, build a table or something, we have a mental model of that in our head. And this has nothing to do with language. So if I tell you, imagine a cube floating in front of us right now. Now rotate that cube 90 degrees along a vertical axis. You can imagine this thing taking place, and you can readily observe that it's a cube: if I rotate it 90 degrees, it's going to look just like the cube that I started with. Because you have this mental model of a cube, and that reasoning is in some abstract continuous space. It's not in text, it's not related to language or anything like that. And humans do this all the time. Animals do this all the time. And this is what we cannot yet reproduce with machines.
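To make the generate-and-test idea described above concrete, here is a minimal sketch. It is only an illustration, not how any production reasoning system works: `propose` is a hypothetical stand-in for sampling one answer from a generator, `is_solution` is a hypothetical verifier, and the toy problem (finding a Pythagorean triple) is an assumption chosen so the example stays self-contained.

```python
import random
from typing import Callable, Optional, Tuple

Candidate = Tuple[int, int, int]


def propose(rng: random.Random) -> Candidate:
    """Hypothetical generator: blindly samples one candidate answer."""
    return (rng.randint(1, 20), rng.randint(1, 20), rng.randint(1, 20))


def is_solution(candidate: Candidate) -> bool:
    """Hypothetical verifier: tells whether the problem is solved."""
    a, b, c = candidate
    return a * a + b * b == c * c


def generate_and_test(
    propose_fn: Callable[[random.Random], Candidate],
    verify_fn: Callable[[Candidate], bool],
    n_samples: int,
    seed: int = 0,
) -> Optional[Candidate]:
    """Sample up to n_samples candidates and return the first one that verifies."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        candidate = propose_fn(rng)
        if verify_fn(candidate):
            return candidate
    return None  # no verified candidate within the sampling budget


result = generate_and_test(propose, is_solution, n_samples=100_000)
print(result)
```

The search here lives entirely outside the generator: the proposer never learns from rejected candidates, which is why this kind of bolted-on search tends to be inefficient.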
Alex Kantrowitz
Yeah, it reminds me, you're talking through chain of thought and how it doesn't produce many novel insights. And when DeepSeek came out, one of the big screenshots that was going around was someone asking DeepSeek for a novel insight on the human condition. And as you read it, it's another one of these very clever tricks the AI pulls, because it does seem like it's running through all these different, very interesting observations about humans: how we take our hate, our violent side, and we channel it towards cooperation instead of competition, and that helps us build more. And then as you read the chain of thought, you're like, this is kind of just like you read Sapiens and maybe some other books. And that's your chain of thought.
Yann LeCun
Pretty much, yeah. A lot of it is regurgitation.
Alex Kantrowitz
I'm now going to move a part of the conversation I had planned for later closer up, which is the wall. Effectively, is training standard large language models coming close to hitting a wall? Before, there were somewhat predictable returns: if you put a certain amount of data and a certain amount of compute towards training these models, you could make them predictably better. As we're talking, it seems to me like you believe that is eventually not going to be true.
Yann LeCun
Well, I don't know if I would call it a wall, but it's certainly diminishing returns, in the sense that we've kind of run out of natural text data to train those LLMs. They're already trained on the order of 10 to the 13 or 10 to the 14 tokens.
Alex Kantrowitz
That's a lot.
Yann LeCun
That's a lot. And that's like the whole Internet. That's the publicly available Internet. And then some companies license content that is not publicly available. And then there is talk about generating artificial data and hiring thousands of people to generate more data.
Alex Kantrowitz
Write all their knowledge, PhDs and professors.
Yann LeCun
Yeah, but in fact, it could be even simpler than this, because most of the systems actually don't understand basic logic, for example. So to some extent there's going to be slow progress along those lines with synthetic data, with hiring more people to plug the holes in the knowledge background of those systems. But it's diminishing returns: the costs of generating that data are ballooning and the returns are not that great. So we need a new paradigm. We need a new kind of architecture, systems that at the core are capable of search: searching for a good solution, checking whether that solution is good, planning a sequence of actions to arrive at a particular goal, which is what you would need for an agentic system to really work. Everybody is talking about agentic systems. Nobody has any idea how to build them other than basically regurgitating plans that the system has already been trained on. So it's like everything in computer science: you can engineer a solution, which is limited. In the context of AI, you can make a system that is based on learning or retrieval with enormous amounts of data. But really the complex thing is how you build a system that can solve new problems without being trained to solve those problems. We are capable of doing this. Animals are capable of doing this. Facing a new situation, we can either solve it zero-shot, without training ourselves to handle that situation, just the first time we encounter it, or we can learn to solve it extremely quickly. So for example, we can learn to drive in a couple dozen hours of practice, to the point that after 20 or 30 hours it becomes kind of second nature, where it becomes kind of subconscious.
Alex Kantrowitz
You don't even think about it.
Yann LeCun
You don't need to think about it.
Alex Kantrowitz
Speaking of System one, System two, right?
Yann LeCun
That's right. So recall the discussion we had with Danny Kahneman a few years ago. The first time you drive, your System 2 is all present. You have to use it to imagine all kinds of catastrophe scenarios and stuff like that. Your full attention is devoted to driving. But then after a number of hours, you can talk to someone at the same time. You don't need to think about it. It's become sort of subconscious and more or less automatic. It's become System 1. And pretty much every task that we learn, the first time we accomplish it, we have to use the full power of our minds. And then eventually, if we repeat it sufficiently many times, it gets kind of subconscious. I have this vivid memory of once being in a workshop where one of the participants was a chess grandmaster. And he played a simultaneous game against like 50 of us, right? You know, going from one person to another. I got wiped out in 10 turns. I'm terrible at chess, right? So he would come to my table. I had time to think about my move because he was playing the other 50 tables or something. So I make my move in front of him. He goes like, what? And then immediately plays, so he doesn't have to think about it. I was not a challenging enough opponent that he had to actually call on his System 2. His System 1 was sufficient to beat me. And what that tells you is that when you become familiar with a task and you train yourself, it kind of becomes subconscious. But the essential ability of humans and many animals is that when you face a new situation, you can think about it, figure out a sequence of actions, a course of action, to accomplish a goal. And you don't need to know much about the situation other than your common knowledge of how the world works. Basically, that's what we're missing with AI systems.
Alex Kantrowitz
Okay, now I really have to blow up the order here because you've said some very interesting things that we have to talk about. You talked about how basically LLMs have hit the point of diminishing returns, large language models, the things that have gotten us here. And we need a new paradigm. But it also seems to me that that new paradigm isn't here yet. And I know you're working on the research for it, and we're going to talk about what the next new paradigm might be. But there's a real timeline issue, don't you think? Because I'm just thinking about the money that's been raised and put into this. Last year, $6.6 billion to OpenAI. Last week or a couple of weeks ago, another $3.5 billion to Anthropic, after they raised 4 billion last year. Elon Musk is putting another small fortune into building Grok. These are all LLM-first companies. They're not searching out the next paradigm. I mean, maybe OpenAI is, but that $6.6 billion that they got was because of ChatGPT. Where's this field going to go? Because if that money is being invested into something that is at the point of diminishing returns, requiring a new paradigm to progress, that sounds like a real problem.
Yann LeCun
Well, we have some ideas about what this paradigm is. The difficulty, I mean what we're working on, is trying to make it work. And it's not simple. That may take years. And so the question is, are the capabilities we're talking about, perhaps through this new paradigm that we're working on, going to come quickly enough to justify all of this investment? And if they don't come quickly enough, is the investment still justified? The first thing you can say is we are not going to get to human-level AI by just scaling up LLMs. This is just not going to happen.
Alex Kantrowitz
Okay, that's your perspective.
Yann LeCun
There's no way, absolutely no way. Whatever you can hear from some of my more adventurous colleagues, it's not going to happen within the next two years. There's absolutely no way in hell, pardon my French. The idea that we're going to have a country of geniuses in a data center, that's complete BS. There's absolutely no way. What we're going to have, maybe, is systems that are trained on sufficiently large amounts of data that any question that any reasonable person may ask will find an answer through those systems. It would feel like you have a PhD sitting next to you, but it's not a PhD you have next to you. It's a system with gigantic memory and retrieval ability, not a system that can invent solutions to new problems, which is really what a PhD is. This is actually connected to the post that Thomas Wolf made, that inventing new things requires a type of skill and abilities that you're not going to get from LLMs. So there's a big question, which is that the investment being made now is not made for tomorrow, it's made for the next few years. And most of the investment, at least from the Meta side, is investment in infrastructure for inference. So let's imagine that by the end of the year, which is really the plan at Meta, we have 1 billion users of Meta AI through smart glasses, the standalone app and whatever. You've got to serve those people, and that's a lot of computation. So that's why you need a lot of investment in infrastructure, to be able to scale this up and build it up over months or years. And so that's where most of the money is going, at least on the side of companies like Meta and Microsoft and Google and potentially Amazon. So this is just operations, essentially. Now, is there going to be a market for 1 billion people using those things regularly, even if there is no change of paradigm? The answer is probably yes. Even if the revolution of the new paradigm doesn't come within three years, this infrastructure is going to be used.
There's very little question about that. So it's a good investment and it takes so long to set up data centers and all that stuff that you need to get started now and plan for progress to be continuous so that eventually the investment is justified. But you can't afford not to do it because that would be too much of a risk to take if you have the cash.
Alex Kantrowitz
But let's go back to what you said. The stuff today is still deeply flawed. There have been questions about whether it's going to be used. Now Meta is making this consumer bet, right? That consumers want to use the AI. That makes sense. OpenAI has 400 million users of ChatGPT. Meta has 3, 4 billion. I mean basically, if you have a phone...
Yann LeCun
Well, we have three point something billion users, 600 million users of Meta AI.
Alex Kantrowitz
Right. Okay, so more than ChatGPT.
Yann LeCun
Yeah, but it's not used as much as ChatGPT, so the users are not as intense, as active.
Alex Kantrowitz
But basically the idea that Meta can get to a billion consumer users, that seems reasonable. But the thing is, a lot of this investment has been made with the idea that this will be useful to enterprises, not just as a consumer app. There's a problem because, like we've been talking about, it's not good enough yet. You look at deep research, and this is something Benedict Evans has brought up. Deep research is pretty good, but it might only get you 95% of the way there, and maybe 5% of it is hallucinated. So if you have a 100-page research report and 5% of it is wrong and you don't know which 5%, that's a problem. Similarly, in enterprises today, every enterprise is trying to figure out how to make AI useful to them, generative AI and other types of AI, but only 10% or 20% maybe of proofs of concept make it out the door into production, because it's either too expensive or it's fallible. So if we are getting to the top here, what do you anticipate is going to happen with everything that has been pushed in the anticipation that it is going to get even better from here?
Yann LeCun
Well, so again, it's a question of timeline, right? When are those systems going to become sufficiently reliable and intelligent so that the deployment is made easier? But the situation you're describing, that beyond the impressive demos, actually deploying systems that are reliable is where things tend to falter in the use of computers and technologies and particularly AI. This is not new. It's basically why we had super impressive autonomous driving demos 10 years ago. But we still don't have level 5 self driving cars, right? It's the last mile that's really difficult, so to speak, for cars.
Alex Kantrowitz
See what you did there.
Yann LeCun
The last few... that was not deliberate. The last few percent of reliability, which makes a system practical, and how you integrate it with existing systems and blah blah blah, and how it makes its users more efficient if you want, or more reliable or whatever. That's where it's difficult. And this is why, if we go back several years and look at what happened with IBM Watson: Watson was going to be the thing that IBM was going to push, generating tons of revenue by having Watson learn about medicine and then be deployed in every hospital. And it was basically a complete failure and was sold for parts, and it cost IBM a lot of money, including the CEO. What happens is that actually deploying those systems in situations where they are reliable and actually help people and don't hurt the natural conservatism of the labor force, this is where things become complicated. The process we're seeing now, with the difficulty of deploying systems, is not new. It's happened absolutely at all times. Some of your listeners perhaps are too young to remember this, but there was a big wave of interest in AI in the early 1980s around expert systems. The hottest job in the 1980s was going to be knowledge engineer. Your job was going to be to sit next to an expert and then turn the knowledge of the expert into rules and facts that would then be fed to an inference engine that would be able to derive new facts and answer questions and blah blah blah. Big wave of interest. The Japanese government started a big program called Fifth Generation Computer. The hardware was going to be designed to actually take care of that and blah blah blah. Mostly a failure.
The wave of interest kind of died in the mid-90s, and a few companies were successful, but basically for a narrow set of applications for which you could actually reduce human knowledge to a bunch of rules and for which it was economically feasible to do so. But the wide-ranging impact on all of society and industry was just not there. That's the danger with AI all the time. I mean, the signals are clear that LLMs, with all the bells and whistles, still play an important role, if nothing else for information retrieval. Most companies want to have some sort of internal expert that knows all the internal documents so that any employee can ask any question. We have one at Meta, it's called Metamate. It's really cool. It's very useful.
Alex Kantrowitz
Yeah. And I'm not suggesting that modern generative AI is not useful. I'm asking purely because there's been a lot of money invested with the expectation that this stuff will effectively achieve God-level capabilities. And we are both talking about how there are potentially diminishing returns here. And then what happens if there's that timeline mismatch, like you mentioned? And this is the last question I'll ask about it, because I feel like we have so much else to cover, but I feel like timeline mismatches might be personal to you. You and I first spoke nine years ago, which is crazy now, nine years ago, about how in the early days you had an idea for how AI should be structured and you couldn't even get a seat at the conferences. And then eventually, when the right amount of compute came around, those ideas started working, and then the entire AI field took off based on the idea that you worked on with Bengio and Hinton and many others. For the sake of efficiency, I'll say go look it up. But just talking about those mismatched timelines: when there have been overhyped moments in the AI field, maybe with the expert systems that you were just talking about, and they don't pan out the way that people expect, the AI field goes into what's called AI winter.
Yann LeCun
Well, there's a backlash. Yeah.
Alex Kantrowitz
Correct. If we are potentially approaching this moment of mismatched timelines, do you fear that there could be another winter now, given the amount of investment, and given the fact that there are going to be potentially diminishing returns with the main way of training these things? And maybe we'll add in the fact that the stock market looks like it's going through a bit of a downturn right now. That's a variable, probably the third most important variable of what we're talking about, but it has to factor in.
Yann LeCun
So yeah, I think there's certainly a question of timing there, but I think if we try to dig a little bit deeper, as I said before, if you think that we're going to get to human-level AI by just training on more data and scaling up LLMs, you're making a mistake. So if you're an investor and you invested in a company that told you we're going to get to human-level AI and PhD level by just training on more data and with a few tricks, I don't know if you're going to lose your shirt, but that was probably not a good idea. However, there are ideas about how to go forward and have systems that are capable of doing what every intelligent animal and human is capable of doing and that current AI systems are not capable of doing. And I'm talking about understanding the physical world, having persistent memory, and being able to reason and plan. Those are the four characteristics that need to be there. And that requires systems that can acquire common sense, that can learn from natural sensors like video as opposed to just text, just human-produced data. That's a big challenge. I've been talking about this for many years now and saying this is where the challenge is. This is what we have to figure out. And my group and I, or people working with me and others who have listened to me, are making progress along this line: systems that can be trained to understand how the world works from video, for example, systems that can use mental models of how the physical world works to plan sequences of actions to arrive at a particular goal. So we have early results with these kinds of systems, and there are people at DeepMind working on similar things, and there are people in various universities working on this.
So the question is, when is this going to go from interesting research papers demonstrating a new capability with a new architecture to architectures at scale that are practical for a lot of applications and can find solutions to new problems without being trained to do it, et cetera. And it's not going to happen within the next three years, but it may happen within three to five years, something like that. And that kind of corresponds to the sort of ramp-up that we see in investment. So that's the first thing. Now, the second thing that's important is that there's not going to be one secret magic bullet that one company or one group of people is going to invent that is going to just solve the problem. It's going to be a lot of different ideas, a lot of effort, some principles around which to base this that some people may not subscribe to, and they will go in a direction that will turn out to be a dead end. There's not going to be a day before which there is no AGI and after which we have AGI. This is not going to be an event. It's going to be continuous conceptual ideas that, as time goes by, are going to be made bigger, brought to scale, and made to work better. And it's not going to come from a single entity. It's going to come from the entire research community across the world. And the people who share their research are going to move faster than the ones that don't. And so if you think that there is some startup somewhere with five people who have discovered the secret of AGI and you should invest 5 billion in them, you are making a huge mistake.
Alex Kantrowitz
Yann, first of all, I always enjoy our conversations because we start to get some real answers. And I remember even from our last conversation, I was always looking back to that conversation saying, okay, this is what Yann says, this is what everybody else is saying. I'm pretty sure that this is the grounding point. And that's been correct. And I know we're going to do that with this one as well. And now you've set me up for two interesting threads that we're going to pull on as we go on with our conversation. First is the understanding of physics and the real world, and the second is open source. So we'll do that when we come back right after this. And we're back here with Yann LeCun. He is the chief AI scientist at Meta and the Turing Award winner that we're thrilled to have on our show, luckily, for the third time. I want to talk to you about physics, Yann, because there's sort of this famous moment in Big Technology Podcast history, and I say famous with our listeners, I don't know if it really extended beyond. You had me write to ChatGPT: if I hold a paper horizontally with both hands and let go of the paper with my left hand, what will happen? And I write it, and it convincingly writes that physics will happen and the paper will float towards your left hand. And I read it out loud, convinced. And you're like, that thing just hallucinated and you believed it. That is what happened. So listen, it's been two years. I put the test to ChatGPT today. It says: when you let go of the paper with your left hand, gravity will cause the left side of the paper to drop, while the right side, still held up by your right hand, remains in place. This creates a pivot effect where the paper rotates around the point where your right hand is holding it.
Yann LeCun
So now it gets it right, it learned the lesson. It's quite possible that someone hired by OpenAI to solve the problem was fed that question and sort of fed the answer, and the system was fine-tuned with the answer. I mean, obviously you can imagine an infinite number of such questions. And this is where the so-called post-training of LLMs becomes expensive: how much coverage of all those styles of questions do you have to have for the system to basically cover 90% or 95% or whatever percentage of all the questions that people may ask it? But there's a long tail, and there is no way you can train the system to answer all possible questions, because there is an essentially infinite number of them, and there are way more questions the system cannot answer than questions it can answer. You cannot cover the set of all possible questions in the training set.
Alex Kantrowitz
So I think in our conversation last time, you said that because these actions, like what happens to the paper if you let go of it with your hand, have not been covered widely in text, the model won't really know how to handle them. Unless it's been covered in text, the model won't have that inherent understanding of the real world. And I've kind of gone with that for a while. Then I said, you know what, let's try to generate some AI videos. And one of the interesting things that I've seen with the AI videos is that there is some understanding of how the physical world works there. In our first meeting nine years ago, you said one of the hardest things to do is to ask an AI, what happens if you hold a pen vertically on a table and let go, will it fall? And there's an unbelievable number of permutations that can occur, and it's very, very difficult for the AI to figure that out because it just doesn't inherently understand physics. But now you go to something like Sora and you say, show me a video of a man sitting on a chair kicking his legs, and you can get that video. And the person sits on the chair and they kick their legs, and the legs don't fall out of their sockets or stuff, they bend at the joints.
Yann LeCun
And they don't have three legs.
Alex Kantrowitz
And they don't have three legs. So wouldn't that suggest an improvement of the capabilities here with these large models?
Yann LeCun
No.
Alex Kantrowitz
Why?
Yann LeCun
Because you still have those videos produced by those video generation systems where you spill a glass of wine and the wine floats in the air or flies off or disappears or whatever. Of course, for every specific situation, you can always collect more data for that situation and then train your model to handle it. But that's not really understanding the underlying reality. This is just compensating for the lack of understanding with increasingly large amounts of data. Children understand simple concepts like gravity with a surprisingly small amount of data. So in fact there is an interesting calculation you can do, which I've talked about before. If you take a typical LLM trained on 30 trillion tokens, something like that, right? That's 3 times 10 to the 13 tokens. A token is about 3 bytes, so that's 0.9 times 10 to the 14 bytes, let's say 10 to the 14 bytes to round this up. That text would take any of us probably on the order of 400,000 years to read, at 12 hours a day. Okay, now a four-year-old has been awake a total of 16,000 hours. You can multiply by 3,600 to get the number of seconds. And then you can put a number on how much data has gotten to the visual cortex through the optic nerve. Each optic nerve, and we have two of them, carries about 1 megabyte per second, roughly. So it's 2 megabytes per second times 3,600 times 16,000. And that's just about 10 to the 14 bytes. So in four years, a child has seen through vision, or touch for that matter, as much data as the biggest LLMs. It tells you clearly that we're not going to get to human-level AI by just training on text. It's just not a rich enough source of information. And by the way, 16,000 hours is not that much video. It's 30 minutes of YouTube uploads. We can get that pretty easily now. In nine months, a baby has seen, you know, let's say, 10 to the 13 bytes or something, which is not much, again. And in that time, the baby has learned basically all of the intuitive physics that we know about.
Conservation of momentum, gravity, the fact that objects don't spontaneously disappear, the fact that they still exist even if you hide them. I mean, there's all kinds of stuff, very basic stuff, that we learn about the world in the first few months of life. And this is what we need to reproduce with machines: this type of learning, of figuring out what is possible and impossible in the world, what will result from an action you take, so that you can plan a sequence of actions to arrive at a particular goal. That's the idea of a world model. And now, connected with the question about video generation systems: is the right way to approach this problem to train better and better video generation systems? And my answer to this is absolutely no. The problem of understanding the world does not go through the solution to generating video at the pixel level. If I take this cup of water and I spill it, I cannot entirely predict the exact path that the water will follow on the table and what shape it's going to take and all that stuff, what noise it's going to make. But at a certain level of abstraction, I can make a prediction that the water will spill, okay? And it'll probably make my phone wet and everything. So I can't predict all the details, but I can predict at some level of abstraction. And I think that's really a critical concept: the fact that if you want a system to be able to learn to comprehend the world and understand how the world works, it needs to be able to learn an abstract representation of the world that allows it to make those predictions. And what that means is that those architectures will not be generative.
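The data comparison in this answer can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the round figures quoted in the conversation; the variable names are illustrative, and the reading rate is derived from the 400,000-year figure rather than stated by the speaker.

```python
# Text side: a typical LLM training set, using the figures quoted in the conversation.
TOKENS = 30e12            # 30 trillion tokens
BYTES_PER_TOKEN = 3       # "a token is about 3 bytes"
text_bytes = TOKENS * BYTES_PER_TOKEN

# Vision side: a four-year-old child.
AWAKE_HOURS = 16_000
SECONDS_PER_HOUR = 3_600
BYTES_PER_SECOND_PER_NERVE = 1e6   # "about 1 megabyte per second" per optic nerve
OPTIC_NERVES = 2
visual_bytes = AWAKE_HOURS * SECONDS_PER_HOUR * BYTES_PER_SECOND_PER_NERVE * OPTIC_NERVES

# Reading rate implied by "400,000 years at 12 hours a day" (derived, not quoted).
READING_YEARS = 400_000
reading_seconds = READING_YEARS * 365 * 12 * SECONDS_PER_HOUR
implied_tokens_per_second = TOKENS / reading_seconds

print(f"text:   {text_bytes:.2e} bytes")
print(f"vision: {visual_bytes:.2e} bytes")
print(f"ratio vision/text: {visual_bytes / text_bytes:.2f}")
print(f"implied reading rate: {implied_tokens_per_second:.1f} tokens per second")
```

Both totals land near 10 to the 14 bytes, which is the point of the comparison: the two quantities are of the same order of magnitude.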
Alex Kantrowitz
Right. And I want to get to your solution here in a moment, but I also wanted to, like, what would a conversation between us be without a demo? So I want to just show you. I'm going to put this on the screen when we do the video. But this is a video I was pretty proud of. I got this guy sitting on a chair, kicking his legs out, and the legs stay attached to his body. And I was like, all right, this stuff is making real progress. And then I said, can I get a car going into a haystack? And so it's two bales of hay. And then a haystack magically emerges from the hood of a car that's stationary. And I just said to myself, okay, Yann wins again.
Yann LeCun
It's a nice car though. Yeah. I mean, the thing is, those systems have been fine-tuned with a huge amount of data for humans, because that's what most of the videos people ask for are about. So there is a lot of data of humans doing various things to train those systems. So that's why it works for humans, but not for a situation that the people training that system had not anticipated.
Alex Kantrowitz
So you said that the model can't be generative to be able to understand the real world.
Yann LeCun
That's right.
Alex Kantrowitz
You are working on something called V-JEPA.
Yann LeCun
JEPA.
Alex Kantrowitz
JEPA, right. V is for video. You also have I-JEPA for images, right?
Yann LeCun
We have JEPAs for all kinds of stuff. Text also.
Alex Kantrowitz
And text. So explain how that will solve the problem of being able to allow a machine to abstractly represent what is going on in the real world.
Yann LeCun
Okay, so what has made the success of AI, and particularly natural language understanding and chatbots in the last few years, but also to some extent computer vision, is self-supervised learning. So what is self-supervised learning? It's: take an input, be it an image, a video, a piece of text, whatever, corrupt it in some way, and train a big neural net to reconstruct it. Basically, recover the uncorrupted version of it, or the undistorted version of it, or a transformed version of it that would result from taking an action. That would mean, for example, in the context of text, take a piece of text, remove some of the words, and then train some big neural net to predict the words that are missing. Take an image, remove some pieces of it, and then train a big neural net to recover the full image. Take a video, remove a piece of it, train a neural net to predict what's missing. Okay, so LLMs are a special case of this, where you take a text and you train the system to just reproduce the text. And you don't need to corrupt the text, because the system is designed in such a way that to predict one particular word or token in the text, it can only look at the tokens that are to the left of it. So in effect, the system has hardwired into its architecture the fact that it cannot look at the present and the future to predict the present, it can only look at the past. But basically you train that system to just reproduce its input on its output. So this kind of architecture is called a causal architecture. And this is what an LLM is, a large language model. That's what all the chatbots in the world are based on. Take a piece of text and train the system to just reproduce that piece of text on its output. And to predict a particular word, it can only look at the words to the left of it.
So now what you have is a system that, given a piece of text, can predict the word that follows that text. And you can take that word that is predicted, shift it into the input, and then predict the second word, shift that into the input, predict the third word. That's called autoregressive prediction. It's not a new concept, it's very old. So self-supervised learning does not train a system to accomplish a particular task other than capturing the internal structure of the data. It doesn't require any labeling by a human. So apply this to images: take an image, mask a chunk of it, like a bunch of patches from it if you want, and then train a big neural net to reconstruct what is missing. And now use the internal representation of the image learned by the system as input to a subsequent downstream task, for, I don't know, image recognition, segmentation, whatever it is. It works to some extent, but not great. So there's a big project to do this at FAIR. It's called MAE, Masked Autoencoder. It's a special case of the denoising autoencoder, which itself is the sort of general framework from which I derived this idea of self-supervised learning. So it doesn't work so well. And you can also apply this to video. I've been working on this for almost 20 years now. Take a video, show just a piece of the video, and then train a system to predict what's going to happen next in the video. So same idea as for text, but just for video. And that doesn't work very well either. So why does it work for text and not for video, for example? And the answer is, it's easy to predict a word that comes after a text. You cannot exactly predict which word follows a particular text, but you can produce something like a probability distribution over all the possible words in a dictionary, all the possible tokens. There's only about 100,000 possible tokens.
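The autoregressive loop and the probability distribution over tokens can be sketched in a few lines. This is a toy illustration only: the five-word vocabulary and the `toy_scores` function are hypothetical stand-ins for a trained network with roughly 100,000 tokens, not how any real LLM is implemented.

```python
import math
import random

VOCAB = ["the", "paper", "falls", "floats", "."]  # toy vocabulary (assumption)

def softmax(scores):
    """Turn arbitrary scores into positive numbers that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_scores(context):
    """Hypothetical stand-in for a trained network.

    It sees only the tokens to the left (the context) and returns
    one score per vocabulary entry.
    """
    follow = {"the": "paper", "paper": "falls", "falls": "."}
    last = context[-1] if context else None
    preferred = follow.get(last, "the")
    return [3.0 if token == preferred else 0.0 for token in VOCAB]

def generate(prompt, steps, rng):
    """Autoregressive prediction: sample a token, shift it into the input, repeat."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_scores(tokens))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

probs = softmax(toy_scores(["the"]))
print(probs, sum(probs))
print(generate(["the"], steps=3, rng=random.Random(0)))
```

The key property is that the output is a finite vector over a discrete vocabulary, which is what makes the uncertainty easy to represent for text.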
So you just produce a big vector with 100,000 different numbers that are positive and sum to 1. Okay, now what are you going to do to represent a probability distribution over all possible frames in a video, or all possible missing parts of an image? We don't know how to do this properly. In fact, it's mathematically intractable to represent distributions in high-dimensional continuous spaces. Okay, we don't know how to do this in a kind of useful way, if you want. And I've tried to do this for video for a long time. And so that is the reason why those ideas of self-supervised learning using generative models have failed so far. And this is why trying to train a video generation system as a way to get a system to understand how the world works can't succeed. So what's the alternative? The alternative is something that is not a generative architecture, which we call JEPA. So that means joint embedding predictive architecture. And we know this works much better than attempting to reconstruct. So we've had experimental results on learning good representations of images going back many years, where instead of taking an image, corrupting it, and attempting to reconstruct this image, we take the original full image and the corrupted version, and we run them both through neural nets. Those neural nets produce representations of those two images, the initial one and the corrupted one. And we train another neural net, a predictor, to predict the representation of the full image from the representation of the corrupted one. And if you successfully train a system of this type, this is not trained to reconstruct anything, it's just trained to learn a representation so that you can make predictions within the representation layer. And you have to make sure that the representation contains as much information as possible about the input, which is why it's difficult, actually. That's the difficult part of training those systems.
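The difference between the two training objectives described here can be made concrete with a toy example. This is a minimal sketch under strong assumptions: the encoder, predictor, and decoder are tiny fixed linear maps with hypothetical names, there is no training loop, and nothing here reflects the actual JEPA code. It only shows where each loss is measured, and why keeping the representation informative is the hard part.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mask(x, start, length):
    """Corrupt the input by zeroing out a chunk of it."""
    return [0.0 if start <= i < start + length else v for i, v in enumerate(x)]

def reconstruction_loss(x_full, x_corrupted, encoder_w, decoder_w):
    """Generative objective: the error is measured in input (pixel) space."""
    z = matvec(encoder_w, x_corrupted)
    x_hat = matvec(decoder_w, z)
    return mse(x_hat, x_full)

def jepa_loss(x_full, x_corrupted, encoder_w, predictor_w):
    """Joint embedding objective: the error is measured in representation space."""
    target = matvec(encoder_w, x_full)        # representation of the full input
    context = matvec(encoder_w, x_corrupted)  # representation of the corrupted input
    predicted = matvec(predictor_w, context)
    return mse(predicted, target)

x = [1.0, 2.0, 3.0, 4.0]
x_masked = mask(x, start=2, length=2)

encoder_w = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # 4 inputs -> 2 features
predictor_w = [[1.0, 0.0], [0.0, 1.0]]
decoder_w = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(reconstruction_loss(x, x_masked, encoder_w, decoder_w))
print(jepa_loss(x, x_masked, encoder_w, predictor_w))

# A collapsed encoder that ignores its input makes the joint embedding loss
# trivially zero while carrying no information about the input.
collapsed_w = [[0.0] * 4, [0.0] * 4]
print(jepa_loss(x, x_masked, collapsed_w, predictor_w))
```

The last line illustrates the difficulty mentioned in the answer: minimizing the prediction error alone is not enough, so the training method also has to keep the representation from collapsing.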
So that's called a JEPA, a joint embedding predictive architecture. And to train a system to learn good representations of images, those joint embedding architectures work much better than the ones that are generative, that are trained by reconstruction. And now we have a version that works for video too. So we take a video, we corrupt it by masking a big chunk of it, and we run the full video and the corrupted one through encoders that are identical. And simultaneously we train a predictor to predict the representation of the full video from the partial one. And the representation that the system learns of videos, when you feed it to a system that you train to tell you, for example, what action is taking place in the video, or whether the video is possible or impossible, or things like that, actually works quite well.
Alex Kantrowitz
That's cool. So it gives that abstract thinking in a way, right?
Yann LeCun
And we have experimental results that show this for joint embedding training. We have several methods for doing this. There's one that's called DINO, another one that's called VICReg, another one that's called I-JEPA, which is sort of a distillation method. And so we have several different ways to approach this, but one of those is going to lead to a recipe that basically gives us a general way of training those JEPA architectures. So it's not generative, because the system is not trying to regenerate the part of the input. It's trying to generate a representation, an abstract representation of the input. And what that allows it to do is to ignore all the details about the input that are really not predictable. Like the pen that you put on the table vertically, and when you let it go, you cannot predict in which direction it's going to fall. But at some abstract level, you can say that the pen is going to fall without representing the direction. So that's the idea of JEPA. And we're starting to have good results with such systems. So the JEPA system, for example, is trained on lots of natural videos. And then you can show it a video that's impossible. Like a video where, for example, an object disappears or changes shape. Okay, you can generate this with a game engine or something. Or a situation where you have a ball rolling, and it rolls and it stops behind a screen, and then the screen comes down and the ball is not there anymore.
Alex Kantrowitz
Right?
Yann LeCun
Okay. So things like this, and you measure the prediction error of the system. So the system is trained to predict, right? And not necessarily in time, but basically to predict the coherence of the video. And so you measure the prediction error as you show the video to the system. And when something impossible occurs, the prediction error goes through the roof. And so you can detect whether the system has integrated some idea of what is physically possible or not possible, by just being trained with physically possible natural videos. So that's really interesting. That's sort of the first hint that a system has acquired some level of common sense. We also have versions of those systems that are so-called action-conditioned. So basically we have things where we have a chunk of video or an image of the state of the world at time t, and then an action is being taken, like a robot arm is being moved or whatever. And then of course, we can observe the result of this action. So now when we train a JEPA with this, the model basically can say: here is the state of the world at time t, here is an action you might take, I can predict the state of the world at time t + 1 in this abstract representation space.
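The idea of using prediction error as a signal for physically impossible events can be illustrated with a toy sequence. This is a hedged sketch: the abstract state is just a hand-made pair (position, visibility), and `predict_next` is a hypothetical constant-velocity rule standing in for a learned predictor. It also predicts forward in time, which is a simplification of what is described in the conversation.

```python
def predict_next(previous, current):
    """Hypothetical predictor: assume each state variable keeps changing at the same rate."""
    return [2 * c - p for p, c in zip(previous, current)]

def prediction_errors(states):
    """Squared prediction error at each step, starting from the third state."""
    errors = []
    for t in range(2, len(states)):
        predicted = predict_next(states[t - 2], states[t - 1])
        error = sum((a - b) ** 2 for a, b in zip(predicted, states[t]))
        errors.append(error)
    return errors

# Each abstract state is [position, visibility]; visibility is 1.0 when the object is present.
plausible = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
impossible = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 0.0], [4.0, 0.0]]  # object vanishes

print(prediction_errors(plausible))
print(prediction_errors(impossible))
```

In the plausible sequence the error stays flat, while in the second sequence it jumps at the step where the object disappears.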
Alex Kantrowitz
So there's this learning of how the world works.
Yann LeCun
Of how the world works. And the cool thing about this is that now you can have the system imagine what would be the outcome of a sequence of actions. And if you give it a goal, saying, like, I want the world to look like this at the end, can you figure out a sequence of actions to get me to that point? It can actually figure out, by search, a sequence of actions that will actually produce that result. That's planning, that's reasoning, that's actual reasoning and actual planning.
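Planning by search with an action-conditioned model can be sketched as follows. This is a toy under stated assumptions: the abstract state is a grid position, `world_model` is a hand-written stand-in for a learned predictor of the state at time t + 1, and the search is brute-force enumeration, which only works for tiny horizons.

```python
from itertools import product

# Hypothetical action set: stay, or move one step in a grid direction.
ACTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def world_model(state, action):
    """Stand-in for a learned model: predict the state at t + 1 from state and action at t."""
    return (state[0] + action[0], state[1] + action[1])

def rollout(state, actions):
    """Imagine the outcome of a sequence of actions without executing them."""
    for action in actions:
        state = world_model(state, action)
    return state

def distance_to_goal(state, goal):
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def plan(start, goal, horizon):
    """Search all action sequences of a given length for the one ending closest to the goal."""
    best_sequence, best_cost = None, float("inf")
    for sequence in product(ACTIONS, repeat=horizon):
        cost = distance_to_goal(rollout(start, sequence), goal)
        if cost < best_cost:
            best_sequence, best_cost = sequence, cost
    return list(best_sequence), best_cost

actions, cost = plan(start=(0, 0), goal=(2, 1), horizon=3)
print(actions, cost)
```

The planner never acts in the world while searching; it only queries the model, which matches the description of imagining outcomes and then picking a sequence that reaches the goal.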
Alex Kantrowitz
Okay, Yann, I have to get you out of here, we are over time, but can you give me, like, 60 seconds on your reaction to DeepSeek, and sort of, has open source overtaken the proprietary models at this point? And we've got to limit it to 60 seconds, otherwise I'm going to get killed by your team here.
Yann LeCun
Overtaken is a strong word. I think progress is faster in the open source world, that's for sure. But of course, the proprietary shops are profiting from the progress of the open source world. They get access to that information like everybody else. What's clear is that there are many more interesting ideas coming out of the open source world than any single shop, as big as it can be, can come up with. Nobody has a monopoly on good ideas. And so the magic efficiency of the open source world is that it recruits talent from all over the world. And so what we've seen with DeepSeek is that if you set up a small team with a relatively long leash and few constraints on coming up with just the next generation of LLMs, they can actually come up with new ideas that nobody else had come up with. They can sort of reinvent a little bit how you do things, and then if they share that with the rest of the world, then the entire world progresses. And so it clearly shows that open source progress is faster. And a lot more innovation can take place in the open source world, which the proprietary world may have a hard time catching up with. It's cheaper to run. What we see is, for partners who we talk to, they say, well, our clients, when they prototype something, they may use a proprietary API, but when it comes time to actually deploy the product, they actually use Llama or other open source engines, because it's cheaper and it's more secure, it's more controllable, you can run it on premise. There's all kinds of advantages. So we've also seen a big evolution in the thinking of some people who were initially worried that open source efforts were going to, I don't know, for example, help the Chinese or something, if you have some geopolitical reason to think it's a bad idea. But what DeepSeek has shown is that the Chinese don't need us. I mean, they can come up with really good ideas. We all know that there are really, really good scientists in China.
And one thing that is not widely known is that the single most cited paper in all of science is a paper on deep learning from 10 years ago, from 2015, and it came out of Beijing. The paper is called ResNet. So it's a particular type of neural net architecture where basically, by default, every stage in a deep learning system computes the identity function. It just copies its input to its output. And what the neural net does is compute the deviation from this identity. So that allows you to train extremely deep neural nets with, you know, dozens of layers, perhaps 100 layers. And the first author of that paper is a gentleman called Kaiming He. At the time he was working at Microsoft Research Beijing. Soon after the publication of that paper, he joined FAIR in California. So I hired him, and he worked at FAIR for eight years or so, and recently left and is now a professor at MIT. So there are really, really good scientists everywhere in the world. Nobody has a monopoly on good ideas. Certainly Silicon Valley does not have a monopoly on good ideas. Another example of that is that the first Llama actually came out of Paris. It came out of the FAIR lab in Paris, a small team of 12 people. So you have to take advantage of the diversity of ideas, backgrounds, creative juices of the entire world if you want science and technology to progress fast. And that's enabled by open source.
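The residual idea described in this answer, where each stage copies its input by default and only learns a deviation from the identity, can be shown with a toy stack of layers. This is a simplified sketch with hypothetical function names and plain lists instead of tensors; it is not the architecture from the paper, only the skip-connection principle.

```python
def linear(weights, x):
    """A plain layer: multiply the input vector by a weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def plain_layer(weights, x):
    """Without a skip connection, the layer's output is only what the weights produce."""
    return linear(weights, x)

def residual_layer(weights, x):
    """With a skip connection, the weights compute a deviation that is added to the input."""
    deviation = linear(weights, x)
    return [xi + di for xi, di in zip(x, deviation)]

def run_stack(layer_fn, weights_per_layer, x):
    for weights in weights_per_layer:
        x = layer_fn(weights, x)
    return x

DEPTH = 100
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
stack = [zero_weights] * DEPTH  # an untrained stack whose layers compute no deviation yet

x = [1.5, -2.0]
print(run_stack(residual_layer, stack, x))  # the input passes through unchanged
print(run_stack(plain_layer, stack, x))     # the signal is wiped out by the first layer
```

With zero deviation, the residual stack is exactly the identity at any depth, which gives an intuition for why very deep networks of this kind remain trainable.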
Alex Kantrowitz
Yann, it is always great to speak with you. I appreciate it. This is our, I think, fourth or fifth time speaking, going back nine years. You always help me see through all the hype and the buzz and actually figure out what's happening. And I'm sure that's going to be the case for our listeners and viewers as well. So, Yann, thank you so much for coming on. Hope we do it again soon.
Yann LeCun
Thank you, Alex.
Alex Kantrowitz
All right, everybody, thank you for watching. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.
Host: Alex Kantrowitz
Guest: Yann LeCun, Chief AI Scientist at Meta and Turing Award Winner
Release Date: March 19, 2025
In this episode of the Big Technology Podcast, host Alex Kantrowitz engages in a deep and insightful conversation with Yann LeCun, a luminary in the field of artificial intelligence (AI). The discussion centers around the limitations of current AI systems, particularly Large Language Models (LLMs), in making genuine scientific discoveries, and explores the future paradigms necessary for AI to achieve human-like understanding and innovation.
Timestamp: [00:52]
Alex opens the dialogue with a critical question: "Why has generative AI ingested all the world's knowledge but not been able to come up with scientific discoveries of its own?" This query, inspired by Dwarkesh Patel's thoughts, sets the stage for unraveling the inherent constraints of current AI architectures.
Yann LeCun responds by distinguishing between different types of AI:
"LLMs are trained on an enormous amount of knowledge which is purely text, and they're trained to basically regurgitate... they are incapable of inventing new things." ([02:47])
He emphasizes that while LLMs like ChatGPT can retrieve and generate text based on existing data, they lack the capability to form new connections or innovate independently.
Timestamp: [02:47] - [06:33]
Alex introduces the perspective of Thomas Wolf of Hugging Face, highlighting the necessity for AI to not just know answers but to ask novel questions that lead to discoveries:
"To create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought or dared to ask." ([03:21])
LeCun agrees, asserting that current LLMs are fundamentally limited to retrieval-based tasks and cannot engage in the innovative questioning that drives scientific discovery. He elaborates on the multifaceted nature of problem-solving in humans, which involves asking the right questions, framing problems creatively, and applying diverse skills—all of which are areas where LLMs fall short.
Timestamp: [11:49] - [24:19]
The conversation shifts to the economic and developmental aspects of AI. Alex points out the significant investments pouring into LLM-centric companies, questioning the sustainability given the diminishing returns:
"If that money is being invested into something that is at the point of diminishing returns, requiring a new paradigm to progress, that sounds like a real problem." ([11:49])
LeCun acknowledges the issue, explaining that the vast amounts of data required to train LLMs are reaching practical limits:
"We've kind of run out of natural text data to train those LLMs... we need a new paradigm." ([12:22])
He argues that moving forward, the AI field must develop new architectures capable of understanding and interacting with the physical world, beyond mere data retrieval. This entails creating systems that can reason, plan, and learn from sensory inputs like video, akin to human cognitive processes.
Timestamp: [29:50] - [34:13]
Alex expresses concern over a potential mismatch between AI investment and the emergence of breakthrough technologies, drawing parallels to past AI winters. He questions whether the current investment surge, primarily focused on LLMs, might lead to a stagnation if new paradigms don't materialize swiftly enough.
LeCun responds by emphasizing the gradual and collective nature of AI advancements:
"It's not going to be a day before which there is no AGI, and after which we have AGI. This is not going to be an event." ([34:13])
He posits that the transition to more advanced AI systems will be a continuous process involving diverse research efforts globally, rather than a sudden breakthrough from a single entity. This collaborative approach, he suggests, mitigates the risk of an AI winter by ensuring sustained progress across multiple fronts.
Timestamp: [35:50] - [44:35]
A significant portion of the discussion delves into the AI's comprehension of physical laws. Alex references a past experiment where ChatGPT incorrectly predicted the behavior of a falling paper, contrasting it with recent improvements in video generation AI systems like Sora.
"When you let go of the paper with your left hand, gravity will cause the left side of the paper to drop while the right side remains in place..." ([35:50])
LeCun explains that while AI can now generate more plausible physical simulations, this improvement doesn't equate to genuine understanding. He underscores that AI's ability to predict outcomes is still surface-level and relies heavily on data patterns rather than an intrinsic grasp of physical realities.
Introducing Joint Embedding Predictive Architecture (JEPA), LeCun outlines a new approach:
"Together, this is what JEPA is... it's not generative because the system is not trying to regenerate the part of the input. It's trying to generate a representation, an abstract representation of the input." ([43:06])
JEPA focuses on abstracting and predicting representations rather than reconstructing raw data, enabling AI to understand and reason about the physical world more effectively.
Timestamp: [55:00] - [59:50]
Alex shifts the conversation to the impact of open-source efforts like DeepSeek on the AI landscape. He asks whether open source has overtaken proprietary models.
LeCun calls "overtaken" a strong word, but argues that progress is faster in the open-source world, highlighting the speed of innovation and the diversity of ideas within the open-source community:
"Progress is faster in the open source world... it's cheaper to run, more secure, more controllable." ([55:44])
He cites examples of successful open-source contributions, such as the ResNet architecture from Beijing and the Llama model from Paris, illustrating that no single entity holds a monopoly on groundbreaking ideas. LeCun advocates for the global and collaborative nature of open-source development as a catalyst for rapid AI advancement.
Timestamp: [59:50] - [60:08]
As the episode wraps up, Alex appreciates Yann's ability to cut through hype and provide grounded insights into the AI industry's trajectory. Yann emphasizes the necessity for diverse, collaborative research and cautions against expecting instant breakthroughs from current LLM approaches.
"If you think that there is some startup somewhere with five people who has discovered the secret of AGI... you are making a huge mistake." ([34:13])
The conversation leaves listeners with a nuanced understanding of where AI stands today, the challenges it faces in achieving true scientific discovery, and the promising avenues that lie ahead through innovative architectures like JEPA and the open-source movement.
This comprehensive discussion between Alex Kantrowitz and Yann LeCun sheds light on the current state of AI, its limitations, and the innovative directions required to overcome these challenges. By moving towards architectures that prioritize understanding and reasoning, and fostering open-source collaboration, the AI community can pave the way for systems that not only process information but also drive genuine scientific and technological discoveries.